Paper · arXiv · Published 4d ago
VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context
Large Vision Language Models (LVLMs) have achieved remarkable success on vision-language tasks, yet fine-grained perception over high-resolution images and long-context videos remains challenging. As the number of visual tokens increases, the visual attention sink phenomenon becomes increasingly severe, causing irrelevant tokens to absorb a disproportionate amount of attention mass. Recent approaches attempt to mitigate this issue by explicitly predicting bounding boxes or temporal spans and re-encoding the cropped visual regions. Such methods depend on unreliable numeric localization in the d
