Paper · arXiv · Trust 82 · Primary · Published 5d ago · Live 3d ago

Enhancing Part-Level Point Grounding for Any Open-Source MLLMs

Visual grounding aims to associate free-form textual queries with specific regions in an image. While recent Multimodal Large Language Models (MLLMs) have demonstrated promising capabilities in this domain, they primarily excel at object-level grounding and often struggle with part-level grounding, an essential requirement for fine-grained tasks such as robotic manipulation. In this work, we introduce a general approach that equips any open-source MLLM with accurate 2D part-level point grounding, offering a more direct alternative to conventional grounding representations. Our method leverages
