Paper · arXiv · Trust 82 · Primary · Published 5d ago · Live · 3d ago

PHF: Privileged Hidden Flow for On-Policy Self-Distillation

On-policy self-distillation (OPSD) trains a reasoning model on rollouts sampled from its own policy by matching a privileged teacher that also sees verified reference solutions. Existing OPSD objectives supervise only the output distribution, so privileged context affects training through a token-level divergence without directly supervising the internal computation that produced that distribution. We propose Privileged Hidden Flow (PHF), which additionally distills how a privileged teacher's hidden states move along the same rollout. Rather than forcing each student hidden vector to match the
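
As a rough illustration of the output-distribution supervision described above, the following minimal sketch computes a token-level divergence between a privileged teacher and the student along one rollout. It covers only that output-side term, not the hidden-state term. The function names (`log_softmax`, `token_kl`, `opsd_output_loss`), the use of forward KL (teacher to student), and the averaging over tokens are assumptions made for illustration; the source does not specify them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position.

    The KL direction is an assumption; other divergences could be used.
    """
    lp = log_softmax(teacher_logits)  # teacher sees the privileged context
    lq = log_softmax(student_logits)  # student sees only the prompt
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def opsd_output_loss(teacher_logits_seq, student_logits_seq):
    """Mean token-level KL over one rollout sampled from the student policy.

    Both sequences hold per-token logits for the same rollout tokens.
    """
    if len(teacher_logits_seq) != len(student_logits_seq):
        raise ValueError("teacher and student must score the same rollout")
    kls = [token_kl(t, s) for t, s in zip(teacher_logits_seq, student_logits_seq)]
    return sum(kls) / len(kls)


# Hypothetical toy rollout: 2 tokens, vocabulary of size 3.
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
student = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
loss = opsd_output_loss(teacher, student)
```

In this toy rollout the second token contributes zero divergence because the two sets of logits agree, so the loss comes entirely from the first token. A loss of this form constrains only the predicted distributions, which is the limitation the abstract attributes to existing OPSD objectives.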
