Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
Repository-level performance-optimization benchmarks such as GSO, SWE-Perf, and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence of coding-agent progress, but those scores can conflate runtime instability, benchmark-specific scoring rules, and the number of tasks already solved by at least one public submission. We audit these issues across the three benchmarks. First, we replay the official reference patches for 740 code opti
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
Why these links exist
- Linked via arXiv author: Zhi Chen
- Linked via arXiv author: Zhensu Sun
- Linked via arXiv author: Yuling Shi
- Linked via arXiv author: David Lo
- Linked via arXiv author: Lingxiao Jiang
