The State-Prediction Separation Hypothesis
Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiencies, improving validation loss and outperforming standard Trans
Authors (per arXiv): Giovanni Monea, Nathan Godey, Kianté Brantley, Yoav Artzi
