paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA

High benchmark accuracy does not guarantee genuine use of visual evidence. We study this problem in traffic accident Video Question Answering (VideoQA), where correct answers should depend on scene-specific visual evidence but may instead be inferred from textual shortcuts. Through an audit of four public benchmarks, we find that several recent open-weight Vision-Language Models (VLMs) perform competitively, and sometimes better, without video input. On the MM-AU benchmark, removing video consistently improves accuracy, and adding more frames further degrades performance. To quantify visual de
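The audit described above compares accuracy on the same questions with and without video input. A minimal sketch of that comparison follows, assuming model predictions have already been collected as answer strings. The names `accuracy` and `blind_gap` are hypothetical, the toy lists are made-up illustration data rather than results from the paper, and this gap is a generic diagnostic, not necessarily the metric the authors define.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def blind_gap(
    with_video: Sequence[str],
    without_video: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Accuracy with video minus accuracy without video (hypothetical helper).

    A value near zero or below zero suggests the questions can be
    answered from text alone, i.e. the visual input is not being used.
    """
    return accuracy(with_video, answers) - accuracy(without_video, answers)


# Toy illustration data (assumed, not taken from any benchmark).
answers = ["A", "C", "B", "D"]
with_video = ["A", "C", "A", "D"]      # 3 of 4 correct
without_video = ["A", "C", "B", "D"]   # 4 of 4 correct

gap = blind_gap(with_video, without_video, answers)
print(f"video accuracy:      {accuracy(with_video, answers):.2f}")
print(f"text-only accuracy:  {accuracy(without_video, answers):.2f}")
print(f"gap (video - blind): {gap:+.2f}")
```

In this toy case the gap is negative, which mirrors the pattern the abstract reports for MM-AU, where removing video improves accuracy.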

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Has model

model: VioletVision-3B

Implements

repo: vlm-starter

Related across the graph

model: VioletVision-3B, repo: vlm-starter

Topics

cs.CV