paper · arXiv · Trust 82 · Primary · Published 3d ago · Live · 2d ago

CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield

We study three complementary techniques for training compute-efficient language models. (1) Selective supervision and per-token efficiency. Selective Ground Truth Token Training (SGT) concentrates supervision on the ~15% of output tokens that carry semantic payload. Through positive gradient coupling in position-shared transformer weights (a token-level instance of auxiliary-task transfer), the remaining ~85% of tokens, which receive no direct supervision, still improve substantially, giving a 4.5x per-supervised-token efficiency (at the step-100 eval optimum, ~67% of the full-sequence loss reduction is recovered
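As a minimal sketch of the selective-supervision idea, the loss can be written as a cross-entropy averaged over only the supervised positions. The abstract does not say how the ~15% of tokens are chosen, so the mask here is simply supplied by the caller. The function names `selective_cross_entropy` and `per_supervised_token_efficiency` are hypothetical, introduced only for illustration. The efficiency helper encodes one plausible reading of the 4.5x figure (fraction of loss reduction recovered divided by fraction of tokens supervised); this is an assumption, since the sentence that states it is truncated.

```python
import numpy as np


def selective_cross_entropy(logits, targets, supervise_mask):
    """Mean cross-entropy over positions where supervise_mask is True.

    logits:         (seq_len, vocab) float array
    targets:        (seq_len,) int array of target token ids
    supervise_mask: (seq_len,) bool array; True marks a supervised token
                    (how the mask is chosen is NOT specified in the abstract)
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token.
    nll = -log_probs[np.arange(targets.shape[0]), targets]
    mask = supervise_mask.astype(nll.dtype)
    # Average over supervised positions only; guard against an empty mask.
    return float((nll * mask).sum() / max(mask.sum(), 1.0))


def per_supervised_token_efficiency(recovered_fraction, supervised_fraction):
    """Assumed reading: loss reduction recovered per unit of supervised tokens."""
    return recovered_fraction / supervised_fraction
```

Under that reading, `per_supervised_token_efficiency(0.67, 0.15)` is roughly 4.5, in line with the figure quoted above. Unsupervised positions contribute nothing to this loss directly; any improvement on them would have to come from the shared weights.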
