PACE: A Proxy for Agentic Capability Evaluation
Evaluating LLM agents on benchmarks like SWE-Bench and GAIA can be expensive and time-consuming, and it requires complex infrastructure. A single evaluation can cost thousands of dollars and take days to complete. In contrast, non-agentic LLM benchmarks that test individual capabilities (e.g., reasoning, code generation) are fast and cheap to run. In this paper, we investigate whether performance on expensive agentic benchmarks can be accurately predicted from performance on a small, carefully selected subset of atomic evaluation instances. We introduce PACE, a framework that constructs proxy benc
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
Why these links exist
- Linked via arXiv author Yueqi Song → PACE: A Proxy for Agentic Capability Evaluation
- Linked via arXiv author Lintang Sutawika → PACE: A Proxy for Agentic Capability Evaluation
- Linked via arXiv author Jiarui Liu → PACE: A Proxy for Agentic Capability Evaluation
- Linked via arXiv author Lindia Tjuatja → PACE: A Proxy for Agentic Capability Evaluation
- Linked via arXiv author Jiayi Geng → PACE: A Proxy for Agentic Capability Evaluation
- Linked via arXiv author Yunze Xiao → PACE: A Proxy for Agentic Capability Evaluation
- Linked via arXiv author Daniel Lee → PACE: A Proxy for Agentic Capability Evaluation
- Linked via arXiv author Aditya Bharat Soni → PACE: A Proxy for Agentic Capability Evaluation
- Linked via arXiv author Vincent Lo → PACE: A Proxy for Agentic Capability Evaluation
- Linked via arXiv author Xiang Yue → PACE: A Proxy for Agentic Capability Evaluation
- Linked via arXiv author Graham Neubig → PACE: A Proxy for Agentic Capability Evaluation
