All that glitters: When “gold-like” answers mask functional failures on coding agent benchmarks

https://www.ai21.com/blog/gold-like-answers-benchmarks/ (www.ai21.com)

While benchmarking a coding agent on the SWE-bench dataset, the researchers found that an LLM judge component showed a suspicious preference for 'gold' answers. The cause was not data contamination but a learned bias: the judge favored solutions that were stylistically minimal and clean, even when they were functionally incorrect. This preference for 'gold-like' aesthetics over functional correctness significantly distorted the evaluation results. To mitigate it, the researchers redesigned the reducer prompts with concrete prioritization rules, which realigned the judge's choices with the actual success criterion: functional correctness.
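To make "concrete prioritization rules" more tangible, here is a minimal sketch of what such a reducer prompt could look like. It is not the prompt from the linked post: the rule wording, the `Candidate` fields, and the helper names `build_reducer_prompt` and `pick_by_rules` are all hypothetical, and it assumes the reducer can see per-candidate test results, which the summary does not state.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate patch shown to the reducer (hypothetical structure)."""
    name: str
    patch: str
    tests_passed: int
    tests_total: int


# Hypothetical rules, listed in strict priority order.
# Style appears only as the final tie-breaker.
PRIORITY_RULES = [
    "Prefer the candidate that makes the most target tests pass.",
    "Do not reward a patch for being short if it fails tests a longer patch passes.",
    "Only when candidates are tied on the rules above, prefer the smaller diff.",
]


def build_reducer_prompt(issue: str, candidates: list[Candidate]) -> str:
    """Assemble a reducer prompt that states the priority order explicitly."""
    rules = "\n".join(f"{i}. {r}" for i, r in enumerate(PRIORITY_RULES, start=1))
    blocks = "\n\n".join(
        f"Candidate {c.name} (tests passed: {c.tests_passed}/{c.tests_total})\n{c.patch}"
        for c in candidates
    )
    return (
        f"Issue:\n{issue}\n\n"
        f"Apply these rules in order; a later rule never overrides an earlier one:\n"
        f"{rules}\n\n{blocks}\n\n"
        "Answer with the name of the single best candidate."
    )


def pick_by_rules(candidates: list[Candidate]) -> Candidate:
    """Deterministic reference for the same ordering: tests first, size last."""
    return min(
        candidates,
        key=lambda c: (-c.tests_passed, len(c.patch.splitlines())),
    )
```

A deterministic reference such as `pick_by_rules` can serve as a sanity check: if the LLM reducer repeatedly picks a shorter patch that passes fewer tests, the aesthetic bias described above is likely still present.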

0 points•by hdt•3 months ago
