paper · arXiv · Trust 82 · Primary · Published 2d ago · Live · 21h ago

The State-Prediction Separation Hypothesis

Transformers use the same forward computation stream both to predict the next token and to store state that is useful for predicting future tokens. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiency, improving validation loss and outperforming standard Trans

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author Giovanni Monea → The State-Prediction Separation Hypothesis
Linked via arXiv author Nathan Godey → The State-Prediction Separation Hypothesis
Linked via arXiv author Kianté Brantley → The State-Prediction Separation Hypothesis
Linked via arXiv author Yoav Artzi → The State-Prediction Separation Hypothesis

Related to

glossary_term: Transformer

Implements

repo: engineering87/llm-atlas · repo: quant-kit

Explains

tutorial: Build your first transformer from scratch

authored (incoming)

person: Giovanni Monea · person: Nathan Godey · person: Kianté Brantley · person: Yoav Artzi

Related across the graph

glossary_term: Transformer · repo: engineering87/llm-atlas · person: Yoav Artzi · repo: quant-kit · person: Giovanni Monea · tutorial: Build your first transformer from scratch · person: Kianté Brantley · person: Nathan Godey

Topics

cs.CL