paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 12h ago

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Large vision-language models (LVLMs) can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework, VRRL, with two components explicitly designed to elicit visually grounded se

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arXiv author: Liyan Tang

  • Linked via arXiv author: Fangcong Yin

  • Linked via arXiv author: Greg Durrett
