The $20K Bug That Changed How We Think About Evals
Summary
While testing 15 models across open and proprietary families for autonomous task execution using SWE-Bench Pro, Zencoder encountered a bug that accidentally consumed roughly $20K of a $50K evaluation budget. Rather than writing it off as waste, the team discovered that the bug had inadvertently created a more realistic testing scenario -- and the results challenged fundamental assumptions about how AI model benchmarks should be designed.
The Bug
The issue originated in Harbor's adapter for SWE-Bench Pro. A bug in the adapter accidentally leaked failing test names to the models while simultaneously keeping the task descriptions minimal. In effect, models received partial, noisy information: they knew which tests were failing but had less context about what exactly needed to be fixed and why. This consumed roughly $20K in compute before the team identified and corrected the problem.
On reflection, the team realized this accidental setup closely mirrored real-world regression testing scenarios. In practice, developers often work with ambiguous specifications, incomplete context, and partially informative test failures. The bug had created a benchmark condition that was messy in exactly the ways real engineering work tends to be.
What It Revealed
The most striking finding was how dramatically the benchmarking setup itself changed what was being measured. Under the standard configuration with detailed task specifications, model performance clustered tightly -- the spread between models was only about 7 percentage points (45-52%). The models looked roughly interchangeable.
Under the buggy configuration with ambiguous instructions and leaked test names, the performance spread ballooned to 46 percentage points (32-78%). Model rankings reshuffled significantly. The same benchmark, with different information given to the models, produced an entirely different picture of relative model capability.
Additional analysis revealed that even models achieving identical overall scores often solved different subsets of problems, with overlap ranging from only 68% to 84%. This means that raw pass rates hide substantial differences in what models are actually good at. Two models with the same score might have genuinely different strengths.
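The overlap analysis above can be sketched in a few lines. This is a minimal illustration, not Zencoder's actual methodology: the task IDs and model results are hypothetical, and overlap is computed here as a Jaccard index over solved-task sets (the article does not specify the exact metric used).

```python
# Sketch: two models with identical pass rates can solve different problems.
# Task IDs and solve sets below are hypothetical, for illustration only.

def solved_overlap(solved_a: set[str], solved_b: set[str]) -> float:
    """Jaccard overlap between the sets of tasks each model solved."""
    union = solved_a | solved_b
    if not union:
        return 0.0
    return len(solved_a & solved_b) / len(union)

# Both models solve 4 of the same 6-task suite: identical pass rate (67%)...
model_a = {"task-01", "task-02", "task-03", "task-05"}
model_b = {"task-01", "task-03", "task-04", "task-06"}

# ...yet they agree on only 2 tasks, so the overlap is far below 100%.
print(solved_overlap(model_a, model_b))
```

Comparing which problems each model solves, rather than just counting them, is what surfaces the complementary strengths the article describes.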
Lessons Learned
The experience yielded several practical insights for anyone building or relying on AI benchmarks:
- Over-specified evaluation frameworks compress the score distribution and mask real differences between models. When you give every model a detailed roadmap, they all look similar.
- Ambiguity is a feature, not a bug -- it reveals which models can reason under uncertainty, a skill that matters enormously in production use.
- Identical scores do not mean identical capabilities. Complementary model selection based on problem-type analysis can outperform picking a single "best" model.
- Anthropic models showed particular strength in handling ambiguous task specifications, while OpenAI models demonstrated more consistent incremental improvement across configurations.
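The complementary-selection idea in the list above can be sketched as a simple per-category router: given historical pass rates broken down by problem type, route each task to the model with the best track record in that category. The model names, categories, and pass rates below are hypothetical stand-ins, not figures from the evaluation.

```python
# Hypothetical per-category pass rates; in practice these would come from a
# per-problem benchmark breakdown like the one described above.
PASS_RATES = {
    "model_x": {"ambiguous_spec": 0.78, "regression_fix": 0.55},
    "model_y": {"ambiguous_spec": 0.52, "regression_fix": 0.70},
}

def pick_model(category: str) -> str:
    """Route a task to whichever model historically does best on its category."""
    return max(PASS_RATES, key=lambda m: PASS_RATES[m].get(category, 0.0))

print(pick_model("ambiguous_spec"))   # model_x
print(pick_model("regression_fix"))   # model_y
```

A router like this can beat always using the single highest-scoring model, because overall pass rates average away exactly the per-category differences that matter.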
The broader takeaway is that benchmark design deserves as much scrutiny as model development. How you test determines what you see.