paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks

Multiple-choice medical benchmarks are increasingly saturated, and recent rubric-based evaluations such as HealthBench have shown that open-ended clinical performance is far from solved: the top score on its "Hard" subset remains 32%. We present a small, deliberately difficult evaluation dataset of five clinician-authored clinical scenarios spanning four specialties (anaesthesia, internal/family medicine, emergency medicine, and obstetrics), each accompanied by an atomic, weighted, MECE rubric (25-62 criteria per task; 184 criteria total) authored from a clinician-drafted golden answer. We evaluate

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author: Samiha A. Ismail →
A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks
Linked via arXiv author: Fan X. Chen →
A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks
Linked via arXiv author: Ali Merali →
A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks

Covers

news: New benchmark exposes reasoning gaps in top models

authored (incoming)

person: Samiha A. Ismail · person: Fan X. Chen · person: Ali Merali

Related across the graph

person: Fan X. Chen · person: Samiha A. Ismail · news: New benchmark exposes reasoning gaps in top models · person: Ali Merali

Topics

cs.AI