Topic cluster · 3 items
multimodal
model
VioletVision-3B
An open vision-language model for captioning and VQA.
paper
Vision-language pretraining at scale
Joint training recipes that align images and text in one embedding space.
repo
vlm-starter
A starter kit for training vision-language models.