paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

Purified OPSD: On-Policy Self-Distillation Without Losing How to Think

On-policy self-distillation (OPSD) has emerged as a promising paradigm for improving LLM reasoning, where a privileged teacher with access to reference solutions provides token-level supervision on the student's own generated trajectories. However, we find that OPSD consistently fails on long chain-of-thought (long-CoT) reasoning models, yielding at best marginal gains while destabilizing the reflective reasoning capability these models depend on. Through a novel decomposition of the teacher's supervision signal, we identify the root cause: the teacher's supervision is dominated by a reference
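As a rough illustration of the token-level supervision the abstract describes, the Python sketch below scores a trajectory sampled by the student against a teacher's next-token distributions at the same positions. It is not the paper's objective: the divergence direction (KL from student to teacher), the uniform averaging over positions, and the names `log_softmax`, `token_level_kl`, and `opsd_loss` are all assumptions made for this example. The teacher logits are assumed to come from a forward pass that has the reference solution in context, which is not modeled here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_level_kl(student_logits, teacher_logits):
    """Per-position KL(student || teacher) along one student-sampled trajectory.

    Both arguments are lists of per-position logit vectors over the same
    vocabulary. The teacher logits are assumed (hypothetically) to have been
    computed with the reference solution visible to the teacher.
    """
    per_token = []
    for s, t in zip(student_logits, teacher_logits):
        log_p = log_softmax(s)  # student distribution at this position
        log_q = log_softmax(t)  # privileged teacher distribution
        kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
        per_token.append(kl)
    return per_token


def opsd_loss(student_logits, teacher_logits):
    """Mean per-token divergence; an assumed stand-in for a distillation loss."""
    per_token = token_level_kl(student_logits, teacher_logits)
    return sum(per_token) / len(per_token)
```

When the student and teacher logits agree at every position, `opsd_loss` is zero; any disagreement makes it positive, and `token_level_kl` shows which positions contribute the signal.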

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author Zhanming Shen → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Jintao Tong → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Shaotian Yan → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Chen Shen → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Hao Chen → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Wentao Ye → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Xiaomeng Hu → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Rui Miao → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Haobo Wang → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Junbo Zhao → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Gang Chen → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Linked via arXiv author Jieping Ye → Purified OPSD: On-Policy Self-Distillation Without Losing How to Think

Implements

repo chrisliu298/awesome-on-policy-distillation · repo benjaminzwhite/reasoning-models

authored (incoming)

person Zhanming Shen · person Jintao Tong · person Shaotian Yan · person Chen Shen · person Hao Chen · person Wentao Ye · person Xiaomeng Hu · person Rui Miao · person Haobo Wang · person Junbo Zhao · person Gang Chen · person Jieping Ye

Implements (incoming)

repo nick7nlp/Awesome-LLM-On-Policy-Distillation

Related across the graph

person Jintao Tong · repo chrisliu298/awesome-on-policy-distillation · person Zhanming Shen · person Hao Chen · person Gang Chen · repo nick7nlp/Awesome-LLM-On-Policy-Distillation · person Chen Shen · person Junbo Zhao · repo benjaminzwhite/reasoning-models · person Jieping Ye · person Rui Miao · person Xiaomeng Hu · person Shaotian Yan · person Haobo Wang · person Wentao Ye

Topics

cs.AI