News · Reddit r/LocalLLaMA · Trust 58 · Community · Published 2d ago

Senior SWE Bench: a new benchmark focussed on realistically underspecified feature tasks
