paper · arXiv

See & Sniff: Learning Visuo-Olfactory Representations

While modern multimodal models integrate vision with language, audio, or touch, olfaction remains largely unexplored due to the lack of paired visuo-olfactory data. We introduce SmellNet-V, a scalable visuo-olfactory dataset built on the insight that odor identity is largely invariant to visual transformations within a semantic category. This allows us to synthetically pair smell-only samples with semantically aligned in-the-wild web images, converting a unimodal olfactory dataset into a cross-modal benchmark without costly co-collection. Building on this dataset, we propose See & Sniff, a sel

Want the primary source?View original →

newsIntroducing Gemma 4 12B: a unified, encoder-free multimodal model

cs.CV