paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

Modern large language models (LLMs) rely on reinforcement learning during post-training to push specific capabilities, yet integrating multiple capabilities into one model remains hard. Existing methods, such as Off-Policy Finetune and Mix-RL, are either inefficient or lose performance. In this work, we propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for combining the capabilities of multiple domain RL teachers: we first run per-domain specialised RL to obtain a set of domain teachers, then distill these teachers into the student on its own rollouts. This eliminat
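The two-stage recipe in the abstract (train per-domain RL teachers, then distill them into the student on the student's own rollouts) can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the paper's implementation: the per-token reverse KL objective, the routing of each prompt to a single teacher by a domain label, and the names `mopd_style_loss`, `sample_rollout`, and the logits callables are all hypothetical. Models are stood in for by plain functions that map a prompt and a token prefix to next-token logits.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at one token position (assumed objective)."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def sample_rollout(student_logits_fn, prompt, length, rng):
    """Sample tokens from the student itself: this is the on-policy part."""
    tokens = []
    for _ in range(length):
        probs = softmax(student_logits_fn(prompt, tokens))
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens


def mopd_style_loss(student_logits_fn, teachers, batch, length, rng):
    """Average per-token divergence from the matching domain teacher.

    `teachers` maps a domain label to that domain's teacher (hypothetical
    routing rule); `batch` is a list of (domain, prompt) pairs.
    """
    total, count = 0.0, 0
    for domain, prompt in batch:
        teacher_logits_fn = teachers[domain]
        rollout = sample_rollout(student_logits_fn, prompt, length, rng)
        for t in range(length):
            prefix = rollout[:t]
            s = softmax(student_logits_fn(prompt, prefix))
            q = softmax(teacher_logits_fn(prompt, prefix))
            total += reverse_kl(s, q)
            count += 1
    return total / count


if __name__ == "__main__":
    # Toy stand-ins: each "model" returns fixed logits over a 3-token vocabulary.
    student = lambda prompt, prefix: [0.0, 0.0, 0.0]
    teachers = {
        "math": lambda prompt, prefix: [2.0, 0.0, 0.0],
        "code": lambda prompt, prefix: [0.0, 0.0, 2.0],
    }
    batch = [("math", "prompt A"), ("code", "prompt B")]
    print(mopd_style_loss(student, teachers, batch, length=4, rng=random.Random(0)))
```

In a real training loop the loss would be differentiated with respect to the student's parameters and the teachers would be frozen; the sketch only shows where the student's own samples and the per-domain teacher enter the computation.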

