paper · arXiv · Trust 82 · Primary · Published 7d ago · Live · 4d ago

Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA

LLM-as-a-Judge and self-evaluation pipelines implicitly assume that evaluation is easier than generation. We test this in a controlled in-context QA setting where a context passage is the sole information source and each model judges the answer it generated, removing the parametric-knowledge confound of open-domain comparisons. Across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models, evaluation is not uniformly easier: generation accuracy exceeds self-evaluation on three of four, with multi-hop MuSiQue the exception. Attention analysis reveals why: evaluation attends to cont

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Explains

tutorial · Evaluate a model properly

Covers

news · New benchmark exposes reasoning gaps in top models

Related across the graph

tutorial · Evaluate a model properly
news · New benchmark exposes reasoning gaps in top models

Topics

cs.CL