Paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 12h ago

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Large vision-language models (LVLMs) can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose VRRL, a novel reinforcement learning training framework with two components explicitly designed to elicit visually grounded se

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author Liyan Tang → Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning
Linked via arXiv author Fangcong Yin → Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning
Linked via arXiv author Greg Durrett → Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Has model

Model: VioletVision-3B

authored (incoming)

Person: Liyan Tang · Person: Fangcong Yin · Person: Greg Durrett

Related across the graph

Person: Greg Durrett · Person: Liyan Tang · Person: Fangcong Yin · Model: VioletVision-3B

Topics

cs.CL