Paper · arXiv · Trust 82 · Primary · Published 2d ago · Live · 21h ago

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence of coding-agent progress, but those scores can conflate runtime instability, benchmark-specific scoring rules, and how many tasks are already solved by at least one public submission. We audit these issues across the three benchmarks. First, we replay the official reference patches for 740 code opti

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arXiv author: Zhi Chen

  • Linked via arXiv author: Zhensu Sun

  • Linked via arXiv author: Yuling Shi

  • Linked via arXiv author: David Lo

  • Linked via arXiv author: Lingxiao Jiang
