person profile

Jacob Andreas

Jacob Andreas is a researcher or builder tracked in the Angestrom contributor network.

4 Connections

1 Paper

0 Models

0 Repos

0 News

Papers · 1

Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only what can be objectively scored, often neglecting subjective, non-verifiable aspects of human-like outputs, such as style and structure. This limitation leads to well-documented failure modes such as diversity collapse, unnatural-sounding responses, and reward hacking. We propose an adversarial generator-discriminator framework that augments verifiable rewards with a learn