paper · arXiv

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically relies on ground-truth answers to assign rewards, limiting its applicability to tasks where the ground-truth solution is unknown. We introduce a \textbf{R}anking-\textbf{i}nduced \textbf{VER}ifiable framework (RiVER) that trains LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as continuous-valued supervision. When applying group-relative RL to such continuous rewards, we identify two key challenges: \emph{scale dominance}, where uncalibrated score magn
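
The abstract is cut off before it defines its terms, so the following is only an illustrative sketch and not the RiVER method itself. It assumes a GRPO-style group-relative advantage (standardizing scores within one group of rollouts) and contrasts it with a simple rank transform, to show one way raw score magnitudes can distort the learning signal. The function names `group_relative_advantages` and `rank_rewards` are hypothetical, as are the example scores.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores, eps=1e-8):
    """Standardize raw scores within one group of rollouts.

    Assumption: this mirrors the common group-relative recipe
    (subtract the group mean, divide by the group std).
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def rank_rewards(scores):
    """Map raw scores to ranks scaled into [0, 1].

    Higher score -> higher reward; tied scores share their average rank.
    Hypothetical helper, not taken from the paper.
    """
    n = len(scores)
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of scores tied with position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg / (n - 1)
        i = j + 1
    return ranks


if __name__ == "__main__":
    # Made-up execution scores: three close values and one very large one.
    scores = [1.0, 2.0, 3.0, 1000.0]
    print(group_relative_advantages(scores))
    print(rank_rewards(scores))
```

In this made-up group, the single large score takes almost all of the standardized advantage while the three smaller scores become nearly indistinguishable. The rank transform depends only on ordering, so rescaling the scores leaves it unchanged.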



cs.LG