paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

Modern large-scale LLM pretraining benefits from utilizing Pipeline Parallelism; however, synchronous implementations leave GPUs idle during pipeline bubbles, wasting computational resources. Asynchronous Pipeline Parallelism eliminates these bubbles, maximizing throughput at the cost of gradient staleness. Among asynchronous schedules, PipeDream-2BW is particularly appealing: unlike the original PipeDream schedule, it ensures a constant one-step gradient delay regardless of pipeline depth. However, its adoption remains limited due to the common belief that optimizing under staleness is fundam
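To make the "one-step gradient delay" in the abstract concrete, here is a minimal sketch of gradient descent in which each update uses the gradient computed at the previous step's weights, i.e. w(t+1) = w(t) - lr * grad(w(t-1)). This is a toy illustration on a scalar quadratic, not the paper's method or the PipeDream-2BW schedule itself; the names `delayed_gradient_descent` and `quadratic_grad`, the learning rate, and the step count are all assumptions chosen for the example.

```python
def delayed_gradient_descent(grad_fn, w0, lr, steps):
    """Gradient descent where every update uses a one-step-stale gradient.

    Hypothetical helper for illustration only: the gradient applied at step t
    is evaluated at the weights from step t-1.
    """
    w_prev = w0  # weights one step behind the current ones
    w = w0       # current weights
    history = [w]
    for _ in range(steps):
        stale_grad = grad_fn(w_prev)       # gradient at the older weights
        w_prev, w = w, w - lr * stale_grad # apply stale gradient to current weights
        history.append(w)
    return history


def quadratic_grad(w):
    """Gradient of the toy objective f(w) = 0.5 * w**2 (an assumed example)."""
    return w


history = delayed_gradient_descent(quadratic_grad, w0=1.0, lr=0.1, steps=200)
final_w = history[-1]
print(f"final weight after 200 delayed steps: {final_w:.3e}")
```

On this toy problem with a small learning rate, the delayed iteration still contracts toward the minimum at zero; it says nothing about behavior at LLM scale, which is the question the paper addresses.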

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Covers

news: Hardware startup unveils inference accelerator

Implements (incoming)

repo: mosecorg/mosec · repo: LMCache/LMCache · repo: FastFlowLM/FastFlowLM · repo: beam-cloud/beta9

Related across the graph

repo: beam-cloud/beta9 · repo: LMCache/LMCache · news: Hardware startup unveils inference accelerator · repo: FastFlowLM/FastFlowLM · repo: mosecorg/mosec

Topics

cs.LG