paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap,

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author Junhao Shi → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
Linked via arXiv author Siyin Wang → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
Linked via arXiv author Xiaopeng Yu → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
Linked via arXiv author Li Jin → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
Linked via arXiv author Jingjing Gong → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
Linked via arXiv author Xipeng Qiu → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

Implements

repo: vlm-starter

authored (incoming)

person: Junhao Shi, Siyin Wang, Xiaopeng Yu, Li Jin, Jingjing Gong, Xipeng Qiu

Related across the graph

person: Li Jin, Xiaopeng Yu, Xipeng Qiu, Jingjing Gong, Siyin Wang, Junhao Shi
repo: vlm-starter

Topics

cs.AI