Evaluating DeepAgents CLI on Terminal Bench 2.0

https://blog.langchain.com/evaluating-deepagents-cli-on-terminal-bench-2-0/ (blog.langchain.com)

DeepAgents CLI, a model-agnostic coding agent, was evaluated on Terminal Bench 2.0 to measure its performance on real-world terminal tasks. The benchmark comprises 89 tasks across domains such as software engineering and security, each testing an agent's ability to operate in a terminal environment. The Harbor framework ran the agent in containerized environments to keep the evaluations clean, isolated, and scalable. Powered by Claude Sonnet 4.5, DeepAgents CLI scored approximately 42.5%, establishing a competitive baseline for future improvements. The same setup offers a repeatable way to test and iterate on AI agents.
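
To make the headline figure concrete: 42.5% of 89 tasks is about 37.8 tasks, which is not a whole number, so the score is presumably an average over more than one run (an assumption; the summary does not say). A minimal Python sketch of that arithmetic follows. The `mean_pass_rate` helper is hypothetical and the per-run counts are made up for illustration; they are not the actual results.

```python
TOTAL_TASKS = 89  # task count stated in the summary


def mean_pass_rate(passed_per_run, total_tasks=TOTAL_TASKS):
    """Average fraction of tasks solved across runs (hypothetical helper)."""
    if not passed_per_run:
        raise ValueError("need at least one run")
    return sum(p / total_tasks for p in passed_per_run) / len(passed_per_run)


# Number of tasks implied by the reported score; not an integer.
implied_tasks = 0.425 * TOTAL_TASKS

# Illustrative counts only, NOT the real per-run results.
example_runs = [37, 38, 39]

print(f"implied tasks solved: {implied_tasks:.1f}")
print(f"example mean pass rate: {mean_pass_rate(example_runs):.1%}")
```

Averaging per-run fractions rather than pooling counts is one reasonable choice here; with the same task count in every run the two give the same number.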

0 points • by hdt • 6 months ago
