paperarXivTrust 82 · PrimaryPublished 5d agoLive · 3d ago

Zero-Gated Language-conditioned Human Motion Prediction

Pose histories provide the core kinematic evidence for 3D human motion prediction, but they lack explicit high-level semantic guidance. This paper introduces ZGL, a lightweight language-conditioned predictor that uses captions of the observed motion as a semantic prior while preserving a strong motion backbone as the main source of dynamics. We render only the observed poses, generate a one-sentence description with a vision-language model, encode the caption with a frozen CLIP-L text tower, and project it into a small set of conditioning tokens. These tokens are injected into a DCT-based spat

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Has model

modelVioletVision-3B

Related across the graph

modelVioletVision-3B

Topics

cs.CV