A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks
Multiple-choice medical benchmarks are increasingly saturated, and recent rubric-based evaluations such as HealthBench have shown that open-ended clinical performance is far from solved: the top score on its "Hard" subset remains 32%. We present a small, deliberately difficult evaluation dataset of five clinician-authored clinical scenarios spanning four specialties (anaesthesia, internal/family medicine, emergency medicine, and obstetrics), each accompanied by an atomic, weighted, mutually exclusive and collectively exhaustive (MECE) rubric (25-62 criteria per task; 184 criteria total) authored from a clinician-drafted golden answer. We evaluate
arXiv authors linked to this paper: Samiha A. Ismail, Fan X. Chen, Ali Merali
