paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap,

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arXiv author: Junhao Shi
  • Linked via arXiv author: Siyin Wang
  • Linked via arXiv author: Xiaopeng Yu
  • Linked via arXiv author: Li Jin
  • Linked via arXiv author: Jingjing Gong
  • Linked via arXiv author: Xipeng Qiu
