Smoothing Out LLM Variance for Reliable Enterprise Evals
https://scale.com/blog/smoothing-out-llm-variance (scale.com)

Using Large Language Models (LLMs) as judges for evaluating AI agents suffers from a critical flaw: the results are not repeatable over time. Internal experiments show that evaluation metrics for the same test can vary by 10-15% from one day to the next across major models, enough to invalidate A/B tests. The instability is attributed to provider-side model updates and to the combination of Sparse Mixture of Experts (MoE) architectures with batched inference, which makes individual query outputs non-deterministic. To address this, the post proposes a "cohort of judges": a panel of three judges given slightly different prompts, whose scores are aggregated to smooth out the variance. This approach reduces evaluation variance by at least 50%, enabling reliable, repeatable measurements for principled AI development.
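The post does not include code, but the cohort-of-judges idea can be sketched: run the same transcript past a small panel of judges whose prompts differ slightly, then aggregate their scores so per-query noise partially cancels. A minimal Python sketch, where the prompt wordings, the query_judge stub (which simulates judge noise instead of calling a real model), and the mean aggregator are all assumptions, not taken from the post:

```python
import random
import statistics

# Hypothetical prompt variants -- the post describes "slightly different
# prompts" per judge but does not publish the exact wording.
JUDGE_PROMPTS = [
    "Rate the agent's answer for correctness on a 1-5 scale.",
    "Score the response from 1 (wrong) to 5 (fully correct).",
    "On a scale of 1 to 5, how correct is this answer?",
]

def query_judge(prompt: str, transcript: str) -> float:
    """Stand-in for a real LLM judge call.

    The run-to-run noise described in the post (MoE routing plus
    batched inference) is simulated here as Gaussian noise around a
    fixed 'true' quality score of 4.0, clamped to the 1-5 range.
    """
    return max(1.0, min(5.0, random.gauss(4.0, 0.5)))

def cohort_score(transcript: str) -> float:
    """Aggregate the panel by averaging the three judges' scores."""
    scores = [query_judge(p, transcript) for p in JUDGE_PROMPTS]
    return statistics.mean(scores)

if __name__ == "__main__":
    transcript = "example agent transcript"
    # Compare run-to-run spread of a single judge vs. the cohort.
    single = [query_judge(JUDGE_PROMPTS[0], transcript) for _ in range(200)]
    cohort = [cohort_score(transcript) for _ in range(200)]
    print(f"single-judge stdev: {statistics.stdev(single):.3f}")
    print(f"cohort stdev:       {statistics.stdev(cohort):.3f}")
```

Assuming the judges' errors are not fully correlated, averaging three scores cuts the variance of the aggregate to roughly a third of a single judge's, which is consistent with the "at least 50%" reduction the post reports.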
0 points • by hdt • 1 month ago