paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

Latent Noise Mask for Reducing Visual Redundancy in Multimodal Large Language Models

Multimodal large language models (MLLMs) often fail in fine-grained visual reasoning, as question-relevant visual cues are diluted by dense and redundant image tokens. Recent multimodal reasoning methods usually extend chain-of-thought from language models into visual or latent spaces, seeking to add intermediate reasoning states while overlooking the negative impact of redundant visual tokens. We propose LatEnt Noise maSk (Lens), a question-conditioned visual evidence purification framework that empowers MLLMs to reason with cleaner visual cues in latent space. Lens introduces a lightweight L

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Has model

model: VioletVision-3B

Related across the graph

model: VioletVision-3B

Topics

cs.CV