Breaking Out of the Lab: Testing AI in Professional Domains

https://scale.com/blog/prbench (scale.com)

PRBench (Professional Reasoning Bench) is a new benchmark series for evaluating the real-world reasoning of frontier AI models in professional domains such as finance and law. Unlike academic-style benchmarks, it uses realistic, open-ended tasks authored by domain experts to assess how models perform in complex, high-stakes scenarios with tangible economic consequences. In the initial results, top models such as GPT-5 Pro score highest, but all models struggle with complex professional reasoning, especially on the most difficult tasks. Common failures include inaccurate judgments, opaque reasoning, and a lack of domain-specific diligence, which suggests that current models are not yet ready for critical professional use.

0 points • by ogg • 8 months ago
