Paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks

Multiple-choice medical benchmarks are increasingly saturated, and recent rubric-based evaluations such as HealthBench have shown that open-ended clinical performance is far from solved: the top score on its "Hard" subset remains 32%. We present a small, deliberately difficult evaluation dataset of five clinician-authored clinical scenarios spanning four specialties (anaesthesia, internal/family medicine, emergency medicine, and obstetrics), each accompanied by an atomic, weighted, MECE (mutually exclusive, collectively exhaustive) rubric (25-62 criteria per task; 184 criteria total) authored from a clinician-drafted golden answer. We evaluate

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arXiv author Samiha A. Ismail

  • Linked via arXiv author Fan X. Chen

  • Linked via arXiv author Ali Merali
