paper · arXiv

How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation

Large language models (LLMs) are increasingly critical to digital library workflows, yet their ability to process historical language remains poorly understood. Historical difficulty is typically treated as a monolithic barrier, conflating orthographic variation, linguistic distance, and pretraining exposure. In this paper, we propose a diagnostic framework that decomposes this difficulty into four distinct dimensions: tokenization cost, predictive uncertainty (surprisal), semantic robustness, and context sensitivity. We evaluate this framework on three datasets spanning three centuries: (1)

Want the primary source?View original →

newsWhat exactly does word2vec learn?

repominimal-diffusion-lm

glossary_termTransformer

newsWhat exactly does word2vec learn?glossary_termTransformer repominimal-diffusion-lm

cs.CL