paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

Large Vision Language Models (LVLMs) have achieved remarkable success on vision-language tasks, yet fine-grained perception over high-resolution images and long-context videos remains challenging. As the number of visual tokens increases, the visual attention sink phenomenon becomes increasingly severe, causing irrelevant tokens to absorb a disproportionate amount of attention mass. Recent approaches attempt to mitigate this issue by explicitly predicting bounding boxes or temporal spans and re-encoding the cropped visual regions. Such methods depend on unreliable numeric localization in the d
