Topic cluster · 4 items
benchmarks
tutorial
Evaluate a model properly
Avoid common pitfalls when benchmarking LLMs.
tool
EvalBoard
A dashboard for tracking model evaluations over time.
repo
eval-harness-plus
An extensible evaluation harness for LLMs.
news
New benchmark exposes reasoning gaps in top models
A harder evaluation suite shows even leading models struggle on multi-hop tasks.