paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

PACE: A Proxy for Agentic Capability Evaluation

Evaluating LLM agents on benchmarks like SWE-Bench and GAIA can be expensive and time-consuming, and it requires complex infrastructure. A single evaluation can cost thousands of dollars and take days to complete. In contrast, non-agentic LLM benchmarks that test individual capabilities (e.g., reasoning, code generation) are fast and cheap to run. In this paper, we investigate whether performance on expensive agentic benchmarks can be accurately predicted from performance on a small, carefully selected subset of atomic evaluation instances. We introduce PACE, a framework that constructs proxy benc

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arxiv author Yueqi Song

    PACE: A Proxy for Agentic Capability Evaluation

  • Linked via arxiv author Lintang Sutawika

    PACE: A Proxy for Agentic Capability Evaluation

  • Linked via arxiv author Jiarui Liu

    PACE: A Proxy for Agentic Capability Evaluation

  • Linked via arxiv author Lindia Tjuatja

    PACE: A Proxy for Agentic Capability Evaluation

  • Linked via arxiv author Jiayi Geng

    PACE: A Proxy for Agentic Capability Evaluation

  • Linked via arxiv author Yunze Xiao

    PACE: A Proxy for Agentic Capability Evaluation

  • Linked via arxiv author Daniel Lee

    PACE: A Proxy for Agentic Capability Evaluation

  • Linked via arxiv author Aditya Bharat Soni

    PACE: A Proxy for Agentic Capability Evaluation

  • Linked via arxiv author Vincent Lo

    PACE: A Proxy for Agentic Capability Evaluation

  • Linked via arxiv author Xiang Yue

    PACE: A Proxy for Agentic Capability Evaluation

  • Linked via arxiv author Graham Neubig

    PACE: A Proxy for Agentic Capability Evaluation
