paperarXivTrust 82 · PrimaryPublished 3d agoLive · 2d ago

Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?

Vision-language models can produce confident answers on visually ambiguous inputs, resulting in biased predictions. Common entropy-based methods, such as Semantic Entropy (SE), rely on output diversity. Yet our analysis shows that overconfident visual embeddings suppress output diversity under stochastic decoding, causing SE to underestimate uncertainty in such cases. Recent methods instead probe output diversity through input perturbations, including textual paraphrasing or joint text-image perturbations, and show improved performance. We study these approaches and reveals that the resulting

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Has model

modelVioletVision-3B

Related across the graph

modelVioletVision-3B

Topics

cs.CL