Advancing Agents: Introducing Scale’s Agentic Leaderboards

https://scale.com/blog/advancing-agents (scale.com)

Scale AI is launching agentic leaderboards with new benchmarks that evaluate AI agent performance in complex, real-world environments. The initial benchmarks include SWE-Bench Pro, which measures an agent's ability to perform professional software engineering tasks, and MCP Atlas, which evaluates multi-tool orchestration. The evaluations cover both foundational skills, such as coding and tool use, and the ability to complete complex end-to-end tasks. A key challenge the work addresses is building realistic digital environments that can accurately test an agent's capabilities in multi-system workflows.

0 points • by chrisf • 9 months ago
