news · Reddit r/MachineLearning · Trust 72 · Community · Published 2d ago · Live · 2d ago

REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R]

submitted by /u/julian88888888

Research Reddit r/MachineLearning verified

Covers

paper · SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions
paper · TraceLab: Characterizing Coding Agent Workloads for LLM Serving

Covers (incoming)

paper · Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
paper · Coding-agents can replicate scientific machine learning papers
paper · TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution
repo · tomasz-tomczyk/crit-web

Related across the graph

paper · Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
paper · TraceLab: Characterizing Coding Agent Workloads for LLM Serving
paper · TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution
repo · tomasz-tomczyk/crit-web
paper · SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions
paper · Coding-agents can replicate scientific machine learning papers