paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

We introduce SWE-Interact, a new testbed for evaluating coding agents on multi-turn, interactive, user-driven software engineering tasks. Existing frontier SWE benchmarks typically provide complete requirements upfront and evaluate agents on autonomous implementation. In contrast, SWE-Interact places agents in a realistic developer workflow: a carefully designed user simulator starts with vague or incomplete instructions, progressively reveals requirements, inspects the agent's workspace, and provides targeted feedback, revisions, and new constraints until the full task goal has been handed of
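The interaction pattern the abstract describes (vague start, progressive reveal, workspace inspection, targeted feedback) can be pictured as a simple turn loop. The sketch below is only an illustration under stated assumptions, not the SWE-Interact implementation: `Requirement`, `SimulatedUser`, `run_session`, and `toy_agent` are hypothetical names, the workspace is modeled as an in-memory dict of file contents, and requirement checks are toy substring tests.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Workspace = dict[str, str]  # hypothetical model: file path -> file contents


@dataclass
class Requirement:
    text: str                            # what the simulated user asks for
    check: Callable[[Workspace], bool]   # toy test of whether the workspace satisfies it


class SimulatedUser:
    """Hypothetical simulator that reveals hidden requirements one turn at a time."""

    def __init__(self, requirements: list[Requirement]):
        self.requirements = requirements
        self.revealed = 0  # how many requirements the agent has been told about

    def next_message(self, workspace: Workspace) -> Optional[str]:
        # Inspect the workspace against everything revealed so far.
        for req in self.requirements[: self.revealed]:
            if not req.check(workspace):
                return f"Feedback: this is still not done: {req.text}"
        # Everything revealed is satisfied, so reveal the next requirement.
        if self.revealed < len(self.requirements):
            req = self.requirements[self.revealed]
            self.revealed += 1
            return f"New requirement: {req.text}"
        return None  # nothing left to reveal or fix: the session ends


def run_session(user: SimulatedUser,
                agent: Callable[[str, Workspace], None],
                max_turns: int = 10) -> tuple[Workspace, int]:
    """Alternate user messages and agent edits; return the workspace and turns used."""
    workspace: Workspace = {}
    for turn in range(max_turns):
        message = user.next_message(workspace)
        if message is None:
            return workspace, turn
        agent(message, workspace)
    return workspace, max_turns


def toy_agent(message: str, workspace: Workspace) -> None:
    # Stand-in for a real coding agent: appends a function when asked for it.
    source = workspace.get("calc.py", "")
    if "implement add" in message and "def add" not in source:
        source += "def add(a, b):\n    return a + b\n"
    if "implement sub" in message and "def sub" not in source:
        source += "def sub(a, b):\n    return a - b\n"
    workspace["calc.py"] = source


REQUIREMENTS = [
    Requirement("implement add", lambda ws: "def add" in ws.get("calc.py", "")),
    Requirement("implement sub", lambda ws: "def sub" in ws.get("calc.py", "")),
]

if __name__ == "__main__":
    final_workspace, turns = run_session(SimulatedUser(REQUIREMENTS), toy_agent)
    print(turns)
    print(final_workspace["calc.py"])
```

In this toy setup the agent never sees the second requirement until the first has been checked against the workspace, which is the property that separates an interactive session from a benchmark that states all requirements upfront.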

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Related to

Has model

model: AgentCore-8B

Covers

news: Open-source agent framework crosses 50k stars
news: Cursor now has a mobile app for guiding your coding agent on the go

Implements

repo: agent-tools

Covers (incoming)

news: ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
news: REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R]

Implements (incoming)

repo: kubeshop/testkube
repo: mozilla/bugbug
repo: potpie-ai/potpie
repo: marimo-team/marimo
repo: generalaction/emdash
repo: langgenius/dify
repo: SDSLeon/lightcode
repo: langchain-ai/open-swe
repo: Pipelex/pipelex
repo: langwatch/langwatch
repo: xlang-ai/CUA-Gym-Hub

Related across the graph

repo: langchain-ai/open-swe
repo: marimo-team/marimo
repo: generalaction/emdash
repo: kubeshop/testkube
repo: xlang-ai/CUA-Gym-Hub
repo: langgenius/dify
repo: potpie-ai/potpie
news: Open-source agent framework crosses 50k stars
repo: mozilla/bugbug
repo: langwatch/langwatch
repo: SDSLeon/lightcode
repo: Pipelex/pipelex
news: ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
model: AgentCore-8B
news: Cursor now has a mobile app for guiding your coding agent on the go
repo: agent-tools
news: REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R]
tool: AgentTrace

Topics