Paper · arXiv · Trust 82 · Primary · Published 2d ago · Live · 21h ago

Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only what can be objectively scored, often neglecting subjective, non-verifiable aspects of human-like outputs, such as style and structure. This limitation leads to well-documented failure modes such as diversity collapse, unnatural-sounding responses, and reward hacking. We propose an adversarial generator-discriminator framework that augments verifiable rewards with a learn

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arXiv author Mehul Damani

    Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

  • Linked via arXiv author Isha Puri

    Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

  • Linked via arXiv author Idan Shenfeld

    Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

  • Linked via arXiv author Jacob Andreas

    Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations
