paper · arXiv · Trust 82 · Primary · Published 2d ago · Live · 21h ago

The State-Prediction Separation Hypothesis

Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiencies, improving validation loss and outperforming standard Trans
