Topic cluster · 4 items

rl

paper

Self-rewarding agents that retrace failures

Agents that attribute their own errors and retrace their steps to repair multi-step reasoning.

paper

Long-horizon credit assignment in RL

A method for propagating reward across very long action sequences.

repo

nano-rlhf

A from-scratch RLHF training loop in one file.

glossary_term

RLHF

Reinforcement learning from human feedback: fine-tuning a model toward the answers human raters prefer.

Related topics