Paper · arXiv · Trust 82 · Primary · Published 7d ago · Live · 4d ago
Tandem Reinforcement Learning with Verifiable Rewards
Reinforcement learning with verifiable rewards (RLVR) has significantly improved the reasoning capability of large language models, reaching expert or even superhuman performance in domains such as competition math. However, whether weaker agents and humans can actually harness this capability is far less certain, because RLVR has been documented to drift reasoning toward idiosyncratic patterns, such as poor readability and language mixing. Tandem training is a recently introduced paradigm that targets this compatibility problem: a trained, stronger senior co-generates each rollout with a frozen, weaker ju
