DemoPSD: Disagreement-Modulated Policy Self-Distillation
On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization, while also introducing a more fundamental issue: *privileged information leakage*, where the student encodes answer-dependent shortcuts that
Authors (as listed on arXiv): Yunhe Li, Hao Shi, Wenhao Liu, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Shuang Qiu, Linqi Song
