Staleness-Learning Rate Scaling Laws for Asynchronous RLHF
High-throughput RLHF systems often decouple rollout generation from policy optimization, leading to the use of stale rollouts during learner updates. In this work, we study the effect of such staleness in asynchronous GRPO. We make the behavior policy explicit in the GRPO surrogate objective and distinguish between the surrogate-gradient mapping used by the learner and the true total derivative of a distribution-dependent population objective. Under assumptions of local boundedness, distributional smoothness, and behavior-policy smoothness, we show that stale rollouts introduce a per-step surr
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
Why these links exist
- Linked via arXiv author Jingwei Song
- Linked via arXiv author Haofeng Xu
- Linked via arXiv author Jie Xiao
- Linked via arXiv author Chengke Bao
- Linked via arXiv author Jingwei Shi
- Linked via arXiv author Pengbin Feng
- Linked via arXiv author Weixun Wang
- Linked via arXiv author Yuhang Han
- Linked via arXiv author Chuan Wu
- Linked via arXiv author Linfeng Zhang
- Linked via arXiv author Bill Shi
