Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

https://huggingface.co/blog/ibm-research/vakra-benchmark-analysis(huggingface.co)

VAKRA is a tool-grounded, executable benchmark introduced to evaluate how well AI agents reason and act in enterprise-like environments. It measures compositional reasoning across APIs and documents, using full execution traces to assess if agents can complete multi-step workflows. The benchmark provides an environment where agents interact with over 8,000 locally hosted APIs and domain-aligned documents, requiring 3-7 step reasoning chains. The analysis focuses on specific capabilities like API chaining and tool selection, revealing common failure modes in current models.

0 points•by will22•3 months ago

Comments (0)

No comments yet. Be the first to comment!

Want to join the discussion?