newsReddit r/MachineLearningTrust 72 · CommunityPublished 2d agoLive · 2d ago
REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R]
submitted by /u/julian88888888 [link] [comments]
Covers
Covers (incoming)
Related across the graph
paperAre Performance-Optimization Benchmarks Reliably Measuring Coding Agents?paperTraceLab: Characterizing Coding Agent Workloads for LLM ServingpaperTestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolutionrepotomasz-tomczyk/crit-webpaperSWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding SessionspaperCoding-agents can replicate scientific machine learning papers
