How we built the best browser agent with Auto-Research

https://browser-use.com/posts/online-mind2web-benchmark (browser-use.com)

The post describes a browser agent that achieved a 97% success rate on the Online-Mind2Web benchmark, the highest score reported. The team used an "Auto-Research" technique: Claude Code, an LLM-based coding agent, was given a command-line interface to an evaluation platform and used it to iteratively improve the browser agent's code. One significant improvement was upgrading the browser agent into a coding agent that writes Python to parse HTML, a way of working that is closer to what the LLM was trained on. The process also required building an agentic judge for more accurate evaluation and carefully managing the research loop to avoid overfitting to the benchmark. The results are compared against other major AI agents on a public leaderboard, and the creators call for more difficult benchmarks.
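To make "writing Python to parse HTML" concrete, here is a minimal sketch of the kind of snippet such a coding agent might produce to pull links out of a page. It is an illustration only, not the code from the post: the `LinkExtractor` class, the `extract_links` helper, and the sample markup are all hypothetical, and it uses only Python's standard-library `html.parser`.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (text, href) pairs found in `html`."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<ul>
  <li><a href="/flights">Book a flight</a></li>
  <li><a href="/hotels">Find <b>hotels</b></a></li>
</ul>
"""

print(extract_links(SAMPLE_HTML))
```

The point of the sketch is that a few lines of ordinary Python can turn raw markup into structured data the agent can reason over, rather than relying on the model to read the page element by element.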

0 points • by will22 • 3 months ago
