Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations
RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only what can be objectively scored, often neglecting subjective, non-verifiable aspects of human-like outputs, such as style and structure. This limitation leads to well-documented failure modes such as diversity collapse, unnatural-sounding responses, and reward hacking. We propose an adversarial generator-discriminator framework that augments verifiable rewards with a learn
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
Why these links exist
- Linked via arXiv author Mehul Damani → Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations
- Linked via arXiv author Isha Puri → Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations
- Linked via arXiv author Idan Shenfeld → Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations
- Linked via arXiv author Jacob Andreas → Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations
