person profile

Nathan Godey

Nathan Godey is a researcher or builder tracked in the Angestrom contributor network.

4 Connections

1 Paper

0 Models

0 Repos

0 News

Papers · 1

The State-Prediction Separation Hypothesis

Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiencies, improving validation loss and outperforming standard Trans