Open-Sourcing MCP-Atlas: A Benchmark for Real Tool Use
https://scale.com/blog/open-sourcing-mcp-atlas

Scale AI is open-sourcing its MCP-Atlas benchmark, designed to measure how well Large Language Models (LLMs) handle realistic, multi-step tool use. The benchmark evaluates agents against real servers and natural language prompts, forcing them to discover and select the correct tools from a set that includes distractors. Updated results show that even top models like Claude Opus 4.5 fail nearly 40% of tasks, primarily due to errors in tool selection, parameterization, or sequencing. The release includes a research paper, a public dataset, and a containerized evaluation environment to help developers measure and compare agent performance.
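To make the failure modes mentioned above concrete, here is a minimal sketch of how a multi-step tool-use task could be scored against a reference trajectory, attributing failures to tool selection, parameterization, or sequencing. The schema, tool names, and grading logic are all hypothetical illustrations, not the actual MCP-Atlas format or harness.

```python
# Hypothetical sketch of scoring a multi-step tool-use run: the agent must
# pick the right tools, in order, with correct parameters, from a catalog
# that also contains distractor tools.
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str                       # tool the agent invoked
    params: dict = field(default_factory=dict)


@dataclass
class Task:
    prompt: str                     # natural-language instruction
    tool_catalog: list[str]         # available tools, including distractors
    reference: list[ToolCall]       # expected call sequence


def grade(task: Task, calls: list[ToolCall]) -> dict:
    """Classify a run as a pass or attribute its failure mode."""
    if len(calls) != len(task.reference):
        return {"pass": False, "error": "sequencing"}
    for got, want in zip(calls, task.reference):
        if got.name != want.name:
            return {"pass": False, "error": "tool_selection"}
        if got.params != want.params:
            return {"pass": False, "error": "parameterization"}
    return {"pass": True, "error": None}


# Example task and a failing agent trajectory (all values are made up).
task = Task(
    prompt="Find the cheapest SFO-to-JFK flight next Friday and email me the link.",
    tool_catalog=["search_flights", "send_email", "get_weather", "book_hotel"],
    reference=[
        ToolCall("search_flights", {"origin": "SFO", "dest": "JFK"}),
        ToolCall("send_email", {"to": "user@example.com"}),
    ],
)
print(grade(task, [ToolCall("get_weather", {"city": "SFO"})]))
# -> {'pass': False, 'error': 'sequencing'}
```

A real benchmark of this kind would run the agent against live MCP servers inside the containerized environment the post describes, rather than comparing against a static reference sequence, but the same three error categories apply.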
0 points•by will22•1 day ago