SWE-Bench Pro: Raising the Bar for Agentic Coding

https://scale.com/blog/swe-bench-pro (scale.com)
A new benchmark, SWE-Bench Pro, has been introduced to more accurately measure the capabilities of AI coding agents by raising the difficulty and realism compared to previous tests. It addresses challenges like data contamination by drawing code from private and copyleft repositories that models have not been trained on, and it increases task diversity and complexity. Results show that even frontier models like OpenAI GPT-5 and Claude Opus 4.1 suffer a steep performance drop on SWE-Bench Pro, with scores falling to around 23%, down from over 70% on older benchmarks. The analysis also reveals that model performance varies significantly by programming language and repository, with private commercial codebases proving the most difficult, highlighting the need for better generalization in AI models.
0 points by ogg | 1 month ago

Comments (0)

No comments yet.