Paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision

Cross-embodiment transfer in vision-language-action (VLA) models remains challenging because low-level state and action spaces differ fundamentally across robot platforms. We observe that the high-level cognitive process underlying manipulation, including scene perception, object identification, task planning, and sub-task decomposition, is largely shared across embodiments. Based on this observation, we present ZR-0, a 2.6-billion-parameter end-to-end VLA model that uses dense Embodied Chain-of-Thought (ECoT) supervision to align cross-embodiment representations within the vision-language mod
