Topic cluster · 4 items

benchmarks

tutorial

Evaluate a model properly

Avoid common pitfalls when benchmarking LLMs.

tool

EvalBoard

A dashboard for tracking model evaluations over time.

repo

eval-harness-plus

An extensible evaluation harness for LLMs.

news

New benchmark exposes reasoning gaps in top models

A harder evaluation suite shows even leading models struggle on multi-hop tasks.
