paper · arXiv · Trust 82 · Primary · Published 3d ago · Live · 2d ago

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations can appear clinically convincing even when the final diagnosis is incorrect. We introduce CLExEval, a human-in-the-loop framework for evaluating LLM clinical reasoning under progressive information masking. CLExEval combines 5,600 expert-physician annotations with 200 clinical reasoning traces derived from 40 rare diagnostic cases. Our analysis identifies three recur

Covers

news: Using AI to help physicians diagnose rare genetic diseases affecting children
news: Towards AI-augmented decision making in psychiatry

Topics

cs.CL