paper · arXiv · Trust 82 · Primary · Published 5d ago · Live · 3d ago

Enhancing Part-Level Point Grounding for Any Open-Source MLLMs

Visual grounding aims to associate free-form textual queries with specific regions in an image. While recent Multimodal Large Language Models (MLLMs) have demonstrated promising capabilities in this domain, they primarily excel at object-level grounding and often struggle with part-level grounding, an essential requirement for fine-grained tasks such as robotic manipulation. In this work, we introduce a general approach that equips any open-source MLLM with accurate 2D part-level point grounding, offering a more direct alternative to conventional grounding representations. Our method leverages

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Has model

model: VioletVision-3B

Covers

news: Embed the world: Multimodal AI for searchable aerial imagery at scale

Implements

repo: vlm-starter

Related across the graph

model: VioletVision-3B · news: Embed the world: Multimodal AI for searchable aerial imagery at scale · repo: vlm-starter

Topics

cs.CV