Paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 9h ago

DemoPSD: Disagreement-Modulated Policy Self-Distillation

On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization, while also introducing a more fundamental issue: *privileged information leakage*, where the student encodes answer-dependent shortcuts that

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arXiv author Yunhe Li

    DemoPSD: Disagreement-Modulated Policy Self-Distillation

  • Linked via arXiv author Hao Shi

    DemoPSD: Disagreement-Modulated Policy Self-Distillation

  • Linked via arXiv author Wenhao Liu

    DemoPSD: Disagreement-Modulated Policy Self-Distillation

  • Linked via arXiv author Mengzhe Ruan

    DemoPSD: Disagreement-Modulated Policy Self-Distillation

  • Linked via arXiv author Hanxu Hou

    DemoPSD: Disagreement-Modulated Policy Self-Distillation

  • Linked via arXiv author Zhongxiang Dai

    DemoPSD: Disagreement-Modulated Policy Self-Distillation

  • Linked via arXiv author Shuang Qiu

    DemoPSD: Disagreement-Modulated Policy Self-Distillation

  • Linked via arXiv author Linqi Song

    DemoPSD: Disagreement-Modulated Policy Self-Distillation

Related across the graph

Topics