Topic cluster · 4 items

benchmarks

Evaluate a model properly

Avoid common pitfalls when benchmarking LLMs.

EvalBoard

A dashboard for tracking model evaluations over time.

eval-harness-plus

An extensible evaluation harness for LLMs.

New benchmark exposes reasoning gaps in top models

A harder evaluation suite shows even leading models struggle on multi-hop tasks.
