ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
https://huggingface.co/blog/ibm-research/itbench-aa (huggingface.co)
Artificial Analysis and IBM have launched ITBench-AA, a new benchmark for evaluating AI models on agentic enterprise IT tasks, starting with Site Reliability Engineering (SRE). The benchmark tests a model's ability to diagnose Kubernetes incidents by analyzing logs, traces, and system snapshots to identify root-cause entities. Even top-tier models such as Claude Opus 4.7 and GPT-5.5 score below 50%, which underscores the benchmark's difficulty. The results also show that longer investigation trajectories do not correlate with higher accuracy, and that some open-weight models deliver better performance for their cost.
0 points • by hdt • 1 hour ago