
Browser Agent Benchmark: Comparing LLM Models for Web Automation

https://browser-use.com/posts/ai-browser-agent-benchmark (browser-use.com)
An open-source benchmark has been released for comparing large language models (LLMs) on web-automation tasks. It consists of 100 difficult but achievable tasks curated from existing suites such as WebBench and GAIA, plus custom challenges targeting complex browser interactions. Performance is scored by an LLM-based judge, with a specific Gemini model found to align most closely with human evaluations. The results compare models on accuracy and throughput, with a new specialized API, ChatBrowserUse 2, achieving the highest performance. The benchmark is available on GitHub so others can run it and replicate the findings.
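For readers curious how an LLM-based judge and an accuracy/throughput comparison might fit together, here is a minimal sketch. It is not the benchmark's actual implementation; `call_judge_model`, `TaskResult`, and the prompt format are hypothetical placeholders under assumed inputs (a task description and an agent trace).

```python
# Hypothetical sketch of an LLM-as-judge scoring loop for a browser-agent
# benchmark. Function and field names are illustrative, not the real code.
import json
from dataclasses import dataclass


@dataclass
class TaskResult:
    task: str            # natural-language task description
    agent_trace: str     # summary of the agent's actions and final page state
    duration_s: float    # wall-clock time, used for throughput reporting


def call_judge_model(prompt: str) -> str:
    """Placeholder for a call to the judge LLM (e.g. a Gemini model)."""
    raise NotImplementedError("wire this up to your LLM provider of choice")


def judge_task(result: TaskResult) -> bool:
    """Ask the judge model whether the agent completed the task."""
    prompt = (
        "You are grading a web-automation agent.\n"
        f"Task: {result.task}\n"
        f"Agent trace: {result.agent_trace}\n"
        'Did the agent complete the task? Reply with JSON: {"success": true|false}'
    )
    verdict = json.loads(call_judge_model(prompt))
    return bool(verdict["success"])


def score(results: list[TaskResult]) -> dict:
    """Aggregate per-task verdicts into accuracy and throughput numbers."""
    successes = sum(judge_task(r) for r in results)
    total_time = sum(r.duration_s for r in results)
    return {
        "accuracy": successes / len(results),
        "tasks_per_hour": 3600 * len(results) / total_time,
    }
```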
0 points | by will22 | 2 days ago

Comments (0)

No comments yet.
