Topic cluster · 4 items
rl
paper
Self-rewarding agents that retrace failures
Agents that attribute their own errors and retrace to repair multi-step reasoning.
paper
Long-horizon credit assignment in RL
A method for propagating reward across very long action sequences.
repo
nano-rlhf
A from-scratch RLHF training loop in one file.
glossary_term
RLHF
Reinforcement learning from human feedback — tuning a model toward preferred answers.