paper · arXiv · Trust 82 · Primary · Published 2d ago · Live · 21h ago

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence of coding-agent progress, but those scores can conflate runtime instability, benchmark-specific scoring rules, and how many tasks are already solved by at least one public submission. We audit these issues across the three benchmarks. First, we replay the official reference patches for 740 code opti
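The comparison described above, a patched runtime measured against an unoptimized baseline, and the way runtime instability can blur it, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the use of medians, and the 5% noise threshold are hypothetical and are not the scoring rules of GSO, SWE-Perf, SWE-fficiency, or the paper.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (needs >= 2 samples)."""
    return stdev(samples) / mean(samples)


def summarize_speedup(baseline_runs, patched_runs, cv_threshold=0.05):
    """Compare repeated runtime measurements (seconds) of one task.

    baseline_runs: runtimes of the unoptimized code.
    patched_runs:  runtimes after applying a candidate patch.
    cv_threshold:  hypothetical cutoff above which a measurement set is
                   treated as too unstable to trust (an assumption).
    """
    # Medians are less sensitive to a single slow outlier run than means.
    speedup = median(baseline_runs) / median(patched_runs)
    # Flag the task if either side of the comparison is noisy.
    worst_cv = max(
        coefficient_of_variation(baseline_runs),
        coefficient_of_variation(patched_runs),
    )
    return {"speedup": speedup, "noisy": worst_cv > cv_threshold}


if __name__ == "__main__":
    stable = summarize_speedup([10.0, 10.1, 9.9], [5.0, 5.1, 4.9])
    unstable = summarize_speedup([10.0, 14.0, 7.0], [5.0, 5.1, 4.9])
    print(stable)
    print(unstable)
```

A task whose measurements are flagged as noisy could report a speedup that changes from run to run, which is one way a leaderboard score can reflect measurement instability rather than the quality of the patch.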

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author Zhi Chen → Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
Linked via arXiv author Zhensu Sun → Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
Linked via arXiv author Yuling Shi → Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
Linked via arXiv author David Lo → Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
Linked via arXiv author Lingxiao Jiang → Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Covers

news · REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R]
news · ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
news · DeepSWE: new benchmark looking at how well today's frontier models can actually write code [R]
news · Ornith-1.0: self-improving open-source models for agentic coding
news · Google's Agentic Peer-Reviewer Handled ~10K Papers at ICML/STOC — Formal Research Paper Now Out [R]

Covers (incoming)

news · Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

Authored (incoming)

person · Zhi Chen
person · Zhensu Sun
person · Yuling Shi
person · David Lo
person · Lingxiao Jiang

Implements (incoming)

repo · golobokov.misha/llm-review-agents

Related across the graph

person · Yuling Shi
person · Zhensu Sun
news · Ornith-1.0: self-improving open-source models for agentic coding
person · Lingxiao Jiang
person · Zhi Chen
news · Google's Agentic Peer-Reviewer Handled ~10K Papers at ICML/STOC — Formal Research Paper Now Out [R]
news · Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers
news · ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
person · David Lo
news · DeepSWE: new benchmark looking at how well today's frontier models can actually write code [R]
repo · golobokov.misha/llm-review-agents
news · REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R]

Topics

cs.AI