The $20K Bug That Changed How We Think About Evals

Author: Narek Maloyan
Tags: AI Evals, Benchmarking, LLM Testing


Summary

While testing 15 models across open and proprietary families for autonomous task execution using SWE-Bench Pro, Zencoder encountered a bug that accidentally consumed roughly $20K of a $50K evaluation budget. Rather than writing it off as waste, the team discovered that the bug had inadvertently created a more realistic testing scenario -- and the results challenged fundamental assumptions about how AI model benchmarks should be designed.


The Bug

The issue originated in Harbor's adapter for SWE-Bench Pro. A bug in the adapter accidentally leaked failing test names to the models while simultaneously keeping the task descriptions minimal. In effect, models received partial, noisy information: they knew which tests were failing but had less context about what exactly needed to be fixed and why. This consumed roughly $20K in compute before the team identified and corrected the problem.
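To make the failure mode concrete, here is a minimal sketch of how an adapter bug like this could arise. The function name, signature, and prompt layout are illustrative assumptions for this post, not Harbor's actual code:

```python
def build_task_prompt(task_description: str,
                      failing_tests: list[str],
                      leak_test_names: bool = False) -> str:
    """Assemble the prompt an agent sees for one task.

    leak_test_names=True reproduces the accidental setup: the model
    learns *which* tests fail while the task description stays minimal.
    """
    parts = [f"Task: {task_description}"]
    if leak_test_names:  # the bug amounts to this branch being on by accident
        parts.append("Failing tests:\n" +
                     "\n".join(f"- {name}" for name in failing_tests))
    return "\n\n".join(parts)

# Buggy configuration: minimal spec plus leaked failing-test names.
buggy = build_task_prompt("Fix the date parser.",
                          ["test_parse_iso8601", "test_parse_timezone"],
                          leak_test_names=True)
```

The point of the sketch is how small the difference is: one flag controls whether the model sees partial, noisy signal (test names) alongside an under-specified task.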

On reflection, the team realized this accidental setup closely mirrored real-world regression testing scenarios. In practice, developers often work with ambiguous specifications, incomplete context, and partially informative test failures. The bug had created a benchmark condition that was messy in exactly the ways real engineering work tends to be.

What It Revealed

The most striking finding was how dramatically the benchmarking setup itself changed what was being measured. Under the standard configuration with detailed task specifications, model performance clustered tightly -- the spread between models was only about 7 percentage points (45-52%). The models looked roughly interchangeable.

Under the buggy configuration with ambiguous instructions and leaked test names, the performance spread ballooned to 46 percentage points (32-78%). Model rankings reshuffled significantly. The same benchmark, with different information given to the models, produced an entirely different picture of relative model capability.
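The spread and ranking effects are mechanical to compute. The sketch below uses hypothetical pass rates (the model names and scores are invented for illustration, not the actual leaderboard) to show how the same models can cluster tightly under one configuration and fan out, with a reshuffled ranking, under another:

```python
def spread_pp(scores: dict[str, float]) -> float:
    """Spread between the best and worst model, in percentage points."""
    return max(scores.values()) - min(scores.values())

def ranking(scores: dict[str, float]) -> list[str]:
    """Model names ordered from highest to lowest pass rate."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical pass rates (%); names and numbers are invented.
standard  = {"model_a": 45.0, "model_b": 48.0, "model_c": 52.0}
ambiguous = {"model_a": 78.0, "model_b": 32.0, "model_c": 55.0}

print(spread_pp(standard), ranking(standard))    # 7.0 ['model_c', 'model_b', 'model_a']
print(spread_pp(ambiguous), ranking(ambiguous))  # 46.0 ['model_a', 'model_c', 'model_b']
```

Note how model_a moves from last to first: a single change in what information the benchmark exposes can invert the leaderboard.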

Additional analysis revealed that even models achieving identical overall scores often solved different subsets of problems, with overlap ranging from only 68% to 84%. This means that raw pass rates hide substantial differences in what models are actually good at. Two models with the same score might have genuinely different strengths.
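One way to quantify that hidden difference is an overlap metric over the sets of problems each model solves. The post does not specify the exact metric used; the sketch below assumes Jaccard similarity (shared solved tasks divided by tasks solved by either model), with invented task IDs:

```python
def solved_overlap(solved_a: set[str], solved_b: set[str]) -> float:
    """Jaccard similarity of two models' solved-problem sets."""
    union = solved_a | solved_b
    if not union:
        return 0.0
    return len(solved_a & solved_b) / len(union)

# Two hypothetical models with identical pass rates (4 tasks each)
# but only partially overlapping solved sets.
model_a = {"task-01", "task-02", "task-03", "task-04"}
model_b = {"task-01", "task-02", "task-05", "task-06"}

print(round(solved_overlap(model_a, model_b), 2))  # 0.33
```

Here both models score 4/6 on their own terms, yet agree on only a third of the union of solved tasks -- exactly the kind of difference a single pass-rate number conceals.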

Lessons Learned

The experience yielded several practical insights for anyone building or relying on AI benchmarks.

The broader takeaway is that benchmark design deserves as much scrutiny as model development. How you test determines what you see.

Read the full post on the Zencoder blog →