paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

PACE: A Proxy for Agentic Capability Evaluation

Evaluating LLM agents on benchmarks like SWE-Bench and GAIA can be expensive and time-consuming, and it requires complex infrastructure. A single evaluation can cost thousands of dollars and take days to complete. In contrast, non-agentic LLM benchmarks that test individual capabilities (e.g., reasoning, code generation) are fast and cheap to run. In this paper, we investigate whether performance on expensive agentic benchmarks can be accurately predicted from performance on a small, carefully selected subset of atomic evaluation instances. We introduce PACE, a framework that constructs proxy benc
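
The abstract is cut off before it describes how PACE selects instances, so the following is only an illustrative sketch of the general idea (predicting an agentic score from a small subset of atomic instances), not the paper's method. It assumes a simple greedy selection that maximizes Pearson correlation between subset accuracy and agentic scores; the function names (`pearson`, `proxy_score`, `greedy_select`) and all numbers are hypothetical, made-up values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def proxy_score(outcomes, subset):
    """Mean accuracy of each model on the chosen atomic instances."""
    return [sum(row[i] for i in subset) / len(subset) for row in outcomes]


def greedy_select(outcomes, agentic, k):
    """Greedy forward selection (an assumed stand-in, not PACE itself):
    repeatedly add the instance that gives the highest correlation
    between the proxy score and the agentic benchmark score."""
    chosen = []
    remaining = list(range(len(outcomes[0])))
    for _ in range(k):
        best = max(
            remaining,
            key=lambda i: pearson(proxy_score(outcomes, chosen + [i]), agentic),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Made-up toy data: 4 hypothetical models x 5 atomic instances (1 = solved).
outcomes = [
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
]
# Made-up agentic benchmark scores for the same 4 models.
agentic = [0.10, 0.25, 0.40, 0.55]

subset = greedy_select(outcomes, agentic, k=3)
corr = pearson(proxy_score(outcomes, subset), agentic)
print(sorted(subset), round(corr, 3))
```

In practice a selection like this would be validated on held-out models, since a subset chosen to correlate well on a handful of models can overfit to them.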

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author Yueqi Song → PACE: A Proxy for Agentic Capability Evaluation
Linked via arXiv author Lintang Sutawika → PACE: A Proxy for Agentic Capability Evaluation
Linked via arXiv author Jiarui Liu → PACE: A Proxy for Agentic Capability Evaluation
Linked via arXiv author Lindia Tjuatja → PACE: A Proxy for Agentic Capability Evaluation
Linked via arXiv author Jiayi Geng → PACE: A Proxy for Agentic Capability Evaluation
Linked via arXiv author Yunze Xiao → PACE: A Proxy for Agentic Capability Evaluation
Linked via arXiv author Daniel Lee → PACE: A Proxy for Agentic Capability Evaluation
Linked via arXiv author Aditya Bharat Soni → PACE: A Proxy for Agentic Capability Evaluation
Linked via arXiv author Vincent Lo → PACE: A Proxy for Agentic Capability Evaluation
Linked via arXiv author Xiang Yue → PACE: A Proxy for Agentic Capability Evaluation
Linked via arXiv author Graham Neubig → PACE: A Proxy for Agentic Capability Evaluation