Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap,
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
Why these links exist
- Linked via arXiv author Junhao Shi → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
- Linked via arXiv author Siyin Wang → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
- Linked via arXiv author Xiaopeng Yu → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
- Linked via arXiv author Li Jin → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
- Linked via arXiv author Jingjing Gong → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
- Linked via arXiv author Xipeng Qiu → Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
