0
Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
https://huggingface.co/blog/ibm-research/vakra-benchmark-analysis(huggingface.co)VAKRA is a tool-grounded, executable benchmark introduced to evaluate how well AI agents reason and act in enterprise-like environments. It measures compositional reasoning across APIs and documents, using full execution traces to assess if agents can complete multi-step workflows. The benchmark provides an environment where agents interact with over 8,000 locally hosted APIs and domain-aligned documents, requiring 3-7 step reasoning chains. The analysis focuses on specific capabilities like API chaining and tool selection, revealing common failure modes in current models.
0 points•by will22•1 hour ago