Paper · arXiv · Trust 82 · Primary · Published 4d ago · Live 3d ago

Does Verbose Chain-of-Thought Really Help? In-Distribution Evidence that Content, Not Length, Matters

Chain-of-thought (CoT) prompting improves LLM reasoning, but the source of the gain is contested: do the intermediate steps help because they carry useful semantic content, or because conditioning on more tokens buys extra computation before the model commits to an answer? We bring two lines of evidence to bear. First, in distribution: we repeatedly sample each model on the same question and pair a shorter generation with a longer one, both drawn from the model's own natural samples and both following the same reasoning plan, so nothing is rewritten and both traces are genuinely in-distribution. Across 25 models the extra tokens leave accuracy
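The pairing step described in the abstract can be illustrated with a small sketch. This is not the authors' code: the `Sample` record, the `plan` label, and the choice to pair the shortest with the longest trace in each group are all assumptions made for illustration, since the listing does not say how reasoning plans are identified or which two generations are selected.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    question_id: str
    plan: str       # hypothetical label for the reasoning plan a trace follows
    n_tokens: int   # length of the generated trace
    correct: bool   # whether the final answer was right


def pair_short_long(samples):
    """Group samples by (question, plan); pair the shortest with the longest.

    Assumption: the shortest/longest choice is illustrative only.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[(s.question_id, s.plan)].append(s)
    pairs = []
    for group in groups.values():
        short = min(group, key=lambda s: s.n_tokens)
        long_ = max(group, key=lambda s: s.n_tokens)
        if long_.n_tokens > short.n_tokens:  # skip groups with no length contrast
            pairs.append((short, long_))
    return pairs


def accuracy_gap(pairs):
    """Mean accuracy of the longer traces minus that of the shorter ones."""
    if not pairs:
        raise ValueError("no pairs to compare")
    short_acc = sum(short.correct for short, _ in pairs) / len(pairs)
    long_acc = sum(long_.correct for _, long_ in pairs) / len(pairs)
    return long_acc - short_acc
```

Because both traces in a pair are unedited samples from the same model on the same question, any accuracy gap is attributable to length within a fixed plan rather than to rewriting. A gap near zero would be consistent with the content-over-length reading in the title.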

Covers

News: New benchmark exposes reasoning gaps in top models

Topics

cs.CL