paper · arXiv · Trust 82 · Primary · Published 2d ago · Live · 21h ago

Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only what can be objectively scored, often neglecting subjective, non-verifiable aspects of human-like outputs, such as style and structure. This limitation leads to well-documented failure modes such as diversity collapse, unnatural-sounding responses, and reward hacking. We propose an adversarial generator-discriminator framework that augments verifiable rewards with a learn

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author Mehul Damani → Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations
Linked via arXiv author Isha Puri → Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations
Linked via arXiv author Idan Shenfeld → Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations
Linked via arXiv author Jacob Andreas → Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

authored (incoming)

Persons: Mehul Damani, Isha Puri, Idan Shenfeld, Jacob Andreas

Topics

cs.CL