paper · arXiv

Grokking in small transformers

When and why tiny models suddenly generalize long after overfitting.
