Function Calling and Agentic AI in 2025: What the Latest Benchmarks Tell Us About Model Performance
https://www.klavis.ai/blog/function-calling-and-agentic-ai-in-2025-what-the-latest-benchmarks-tell-us-about-model-performance (www.klavis.ai)

Function calling and agentic AI performance is evaluated with specialized benchmarks that go beyond traditional metrics. The Berkeley Function Calling Leaderboard (BFCL) assesses capabilities such as multi-step reasoning and tool selection, with models like GLM-4.5 and Claude 4.1 showing strong results. A more rigorous benchmark, MCPMark, stress-tests models in realistic, multi-step workflows and reveals significant performance gaps: GPT-5 leads but achieves only a 52.6% success rate, indicating that even top models struggle with complex, real-world agentic tasks.
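For readers unfamiliar with what these benchmarks actually exercise, the sketch below (not taken from the article) shows the basic shape of a function-calling round trip: the model is handed JSON tool schemas, emits a structured call, and a harness dispatches it to a local function. The `get_weather` tool and the hard-coded model output are hypothetical stand-ins for whatever tools a benchmark like BFCL or MCPMark defines.

```python
import json

# Hypothetical tool schema, in the JSON shape most function-calling APIs expect.
TOOLS = [{
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# Local implementation the harness can dispatch to (stubbed result for illustration).
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21, "condition": "clear"}

REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to the matching local function."""
    fn = REGISTRY[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return json.dumps(fn(**args))

# A model's structured output for "What's the weather in Paris?" might look like:
model_tool_call = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
print(dispatch(model_tool_call))  # -> {"city": "Paris", "temp_c": 21, "condition": "clear"}
```

Benchmarks such as BFCL and MCPMark grade whether the model picks the right tool, fills the arguments correctly, and chains calls like this over multiple steps; the single-call dispatcher above is only the simplest case.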