paper · arXiv

Vision-language pretraining at scale

Joint training recipes that align images and text in one embedding space.
