Advancing Agents: Introducing Scale’s Agentic Leaderboards
https://scale.com/blog/advancing-agents (scale.com)
Scale AI is launching agentic leaderboards with new benchmarks to evaluate AI agent performance in complex, real-world environments. The initial benchmarks include SWE-Bench Pro, which measures an agent's ability to perform professional software engineering tasks, and MCP Atlas, which evaluates multi-tool orchestration. These evaluations cover both foundational skills, such as coding and tool use, and the ability to complete complex end-to-end tasks. A key challenge they address is building realistic digital environments that can accurately test an agent's capabilities in multi-system workflows.
0 points•by chrisf•1 month ago