paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago
SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions
We introduce SWE-Interact, a new testbed for evaluating coding agents on multi-turn, interactive, user-driven software engineering tasks. Existing frontier SWE benchmarks typically provide complete requirements upfront and evaluate agents on autonomous implementation. In contrast, SWE-Interact places agents in a realistic developer workflow: a carefully designed user simulator starts with vague or incomplete instructions, progressively reveals requirements, inspects the agent's workspace, and provides targeted feedback, revisions, and new constraints until the full task goal has been handed of
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
Related across the graph
repo: langchain-ai/open-swe
repo: marimo-team/marimo
repo: generalaction/emdash
repo: kubeshop/testkube
repo: xlang-ai/CUA-Gym-Hub
repo: langgenius/dify
repo: potpie-ai/potpie
news: Open-source agent framework crosses 50k stars
repo: mozilla/bugbug
repo: langwatch/langwatch
repo: SDSLeon/lightcode
repo: Pipelex/pipelex
news: ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
model: AgentCore-8B
news: Cursor now has a mobile app for guiding your coding agent on the go
repo: agent-tools
news: REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R]
tool: AgentTrace
