paperarXivTrust 82 · PrimaryPublished 4d agoLive · 3d ago
Latent Noise Mask for Reducing Visual Redundancy in Multimodal Large Language Models
Multimodal large language models (MLLMs) often fail in fine-grained visual reasoning, as question-relevant visual cues are diluted by dense and redundant image tokens. Recent multimodal reasoning methods usually extend chain-of-thought from language models into visual or latent spaces, seeking to add intermediate reasoning states while overlooking the negative impact of redundant visual tokens. We propose LatEnt Noise maSk (Lens), a question-conditioned visual evidence purification framework that empowers MLLMs to reason with cleaner visual cues in latent space. Lens introduces a lightweight L
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
