Topic cluster · 3 items

multimodal

model

VioletVision-3B

An open vision-language model for image captioning and visual question answering (VQA).

paper

Vision-language pretraining at scale

Joint training recipes that align images and text in one embedding space.

repo

vlm-starter

A starter kit for training vision-language models.
