0
IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST
https://huggingface.co/blog/ibm-research/itbenchandmast(huggingface.co)IBM Research and UC Berkeley collaborated to diagnose why AI agents fail in enterprise IT automation tasks. They applied a new framework called MAST (Multi-Agent System Failure Taxonomy) to the ITBench benchmark to move beyond simple success rates and understand specific failure modes. The analysis of models like Gemini-3-Flash and GPT-OSS-120B revealed distinct patterns, such as stronger models having isolated failures while open models suffer from cascading errors. A primary cause of failure across all models was incorrect verification, where an agent declares a task complete without checking the ground truth. The findings suggest actionable strategies for building more robust agents, like externalizing verification and better managing task termination.
0 points•by hdt•22 hours ago