The $20K Bug That Changed How We Think About Evals
Summary
While testing 15 models across open and proprietary families for autonomous task execution using SWE-Bench Pro, Zencoder encountered a bug that accidentally consumed roughly $20K of a $50K evaluation budget. Rather than writing it off as waste, the team discovered that the bug had inadvertently created a more realistic testing scenario -- and the results challenged fundamental assumptions about how AI model benchmarks should be designed.
Key Findings
- Benchmark setup changes everything: Under standard SWE-Bench Pro configuration, model performance clustered within a 6 percentage point spread (45-52%). Under the buggy configuration with ambiguous instructions, the spread ballooned to 26 percentage points (32-78%).
- Identical scores hide different capabilities: Models achieving the same overall pass rate solved different subsets of problems, with pairwise overlap ranging from only 68% to 84%.
- 15 models, 2 configurations, $20K of accidental insight: Testing across open and proprietary model families consumed $20K of a $50K budget before the bug was caught, but produced data that challenges how the industry benchmarks AI coding agents.
- Ambiguity reveals real capability: Anthropic models showed particular strength under ambiguous specifications, while OpenAI models demonstrated more consistent incremental improvement across configurations.
The Bug
The issue originated in Harbor's adapter for SWE-Bench Pro. A bug in the adapter accidentally leaked failing test names to the models while simultaneously keeping the task descriptions minimal. In effect, models received partial, noisy information: they knew which tests were failing but had less context about what exactly needed to be fixed and why. This consumed roughly $20K in compute before the team identified and corrected the problem.
On reflection, the team realized this accidental setup closely mirrored real-world regression testing scenarios. In practice, developers often work with ambiguous specifications, incomplete context, and partially informative test failures. The bug had created a benchmark condition that was messy in exactly the ways real engineering work tends to be.
What It Revealed
| Metric | Standard Config | Buggy Config |
|---|---|---|
| Performance spread | 6 pts (45-52%) | 26 pts (32-78%) |
| Task specification | Detailed | Ambiguous + leaked tests |
| Pairwise overlap (same score) | -- | 68-84% |
| Models tested | 15 (open + proprietary) | 15 (open + proprietary) |
The most striking finding was how dramatically the benchmarking setup itself changed what was being measured. Under the standard configuration with detailed task specifications, model performance clustered tightly -- the spread between models was only about 6 percentage points (45-52%). The models looked roughly interchangeable.
Under the buggy configuration with ambiguous instructions and leaked test names, the performance spread ballooned to 26 percentage points (32-78%). Model rankings reshuffled significantly. The same benchmark, with different information given to the models, produced an entirely different picture of relative model capability.
Additional analysis revealed that even models achieving identical overall scores often solved different subsets of problems, with overlap ranging from only 68% to 84%. This means that raw pass rates hide substantial differences in what models are actually good at. Two models with the same score might have genuinely different strengths.
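To make the overlap idea concrete, here is a minimal sketch of how pairwise overlap could be computed from per-task results. The article does not state the exact definition used, so this assumes overlap means the share of the smaller solved set that the other model also solved. The model names, task IDs, and the `overlap` and `overlap_matrix` helpers are hypothetical toy values, not data from the experiment.

```python
from itertools import combinations


def overlap(solved_a: set[str], solved_b: set[str]) -> float:
    """Share of the smaller solved set that the other model also solved.

    Assumed definition; the original analysis may have used a different one.
    """
    if not solved_a or not solved_b:
        return 0.0
    shared = len(solved_a & solved_b)
    return shared / min(len(solved_a), len(solved_b))


def overlap_matrix(results: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Overlap for every unordered pair of models."""
    return {
        (a, b): overlap(results[a], results[b])
        for a, b in combinations(sorted(results), 2)
    }


# Toy data: three models that each solve 4 tasks (same pass rate),
# but not the same 4 tasks.
results = {
    "model_a": {"t1", "t2", "t3", "t4"},
    "model_b": {"t1", "t2", "t5", "t6"},
    "model_c": {"t1", "t2", "t3", "t6"},
}

matrix = overlap_matrix(results)
for pair, value in matrix.items():
    print(pair, f"{value:.0%}")
```

All three toy models have the same pass rate, yet no pair solves the same set of tasks, which is the pattern described above.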
Lessons Learned
The experience yielded several practical insights for anyone building or relying on AI benchmarks:
- Over-specified evaluation frameworks compress the score distribution and mask real differences between models. When you give every model a detailed roadmap, they all look similar.
- Ambiguity is a feature, not a bug -- it reveals which models can reason under uncertainty, a skill that matters enormously in production use.
- Identical scores do not mean identical capabilities. Complementary model selection based on problem-type analysis can outperform picking a single "best" model.
- Model families respond differently to ambiguity. In this run, Anthropic models showed particular strength in handling ambiguous task specifications, while OpenAI models demonstrated more consistent incremental improvement across configurations.
The broader takeaway is that benchmark design deserves as much scrutiny as model development. How you test determines what you see.
A deeper lesson concerns the relationship between evaluation methodology and the conclusions we draw from it. In traditional software testing, the test environment is tightly controlled to isolate the variable under test. But when evaluating AI agents on complex tasks, the information environment -- what context, hints, and constraints the agent receives -- is itself a critical variable. Two evaluation setups that appear to test "the same thing" can produce fundamentally different rankings if they differ in how much information they provide to the model. This means that anyone comparing benchmark scores across different evaluation frameworks is likely comparing apples to oranges, even when the underlying task set is identical.
The $20K cost of this lesson also highlighted a practical reality of large-scale AI evaluation: the expense creates a strong incentive to run benchmarks only once, under a single configuration. But our accidental experiment showed that the most valuable insights came from running the same benchmark under different conditions. Organizations that invest in multi-configuration evaluation -- deliberately varying the information regime, the level of task specification, and the tools available to the model -- will develop a far more accurate picture of model capabilities than those that optimize for a single benchmark configuration.
Why does this matter for AI development?
The implications extend beyond SWE-Bench Pro. The AI industry relies heavily on benchmarks to make high-stakes decisions: which model to deploy in production, where to allocate research resources, and how to communicate progress to stakeholders. If a benchmark's information regime -- the specific details provided to the model -- can widen the spread between models from 6 to 26 percentage points and reshuffle their rankings, then leaderboard positions are as much a function of benchmark design as of model capability.
This finding is especially relevant for coding agents, where real-world tasks rarely come with the level of specification that benchmarks typically provide. A developer filing a bug report might include a stack trace and a vague description of expected behavior. A project manager might describe a feature request in business terms without specifying implementation details. The models that excel in these ambiguous conditions may be quite different from those that top a fully-specified benchmark.
The overlap analysis adds another dimension. When two models with a 70% pass rate share only 68% of their solved problems, roughly a third of each model's successes are unique. This suggests that ensemble or routing approaches -- directing tasks to the model most likely to succeed based on problem characteristics -- could substantially outperform any single model selection. It also means that benchmark averages dramatically understate the diversity of model capabilities.
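The arithmetic behind that claim can be sketched as follows. This is an illustrative upper bound, not a measured result: it assumes two models with the same pass rate, reads "overlap" as the fraction of each model's solved tasks that the other also solved, and assumes a perfect (oracle) router that always picks a model that succeeds. The function name `oracle_union_rate` is hypothetical.

```python
def oracle_union_rate(pass_rate: float, overlap: float) -> float:
    """Fraction of tasks solved by at least one of two equal-scoring models.

    pass_rate: each model's individual pass rate (0..1).
    overlap:   assumed to be the share of each model's solved tasks
               that the other model also solved (0..1).
    """
    if not (0.0 <= pass_rate <= 1.0 and 0.0 <= overlap <= 1.0):
        raise ValueError("pass_rate and overlap must be in [0, 1]")
    shared = pass_rate * overlap          # solved by both models
    unique_each = pass_rate - shared      # solved by only one of them
    union = shared + 2 * unique_each      # solved by at least one
    if union > 1.0 + 1e-9:
        raise ValueError("inputs are inconsistent: union exceeds 100%")
    return union


# The 70% pass rate is the article's illustrative figure; 68% and 84%
# are the ends of the reported overlap range.
for o in (0.68, 0.84):
    print(f"overlap {o:.0%}: oracle union {oracle_union_rate(0.70, o):.1%}")
```

Under these assumptions, two 70% models with 68% overlap cover about 92% of tasks between them, and about 81% at 84% overlap. A real router would not pick perfectly, so actual gains would be smaller.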
For the broader LLM evaluation community, this experience raises uncomfortable questions about the reproducibility and comparability of published benchmark results. If a 6-point spread can balloon to 26 points by changing the information provided to the model, then leaderboard rankings are far more fragile than they appear. Organizations making deployment decisions based on benchmark scores should ask not just "which model scored highest?" but "under what information regime was this score achieved, and how closely does that regime match our production use case?" Without this context, a benchmark number is at best incomplete and at worst misleading.
What should change about how we benchmark AI?
Based on this experience, we propose three concrete changes to benchmark methodology:
- Test under multiple information regimes. Run the same benchmark with varying levels of task specification -- fully detailed, partially specified, and minimal. Report the performance spread, not just the peak score. A model that performs consistently across regimes is more robust, and a safer bet for loosely specified production work, than one that requires detailed hand-holding.
- Report pairwise overlap, not just pass rates. Two models with identical scores may solve different problems. Publishing the overlap matrix helps practitioners choose complementary models and reveals where diversity is greatest.
- Budget for accidental insights. Our most valuable data came from a bug. Evaluation pipelines should include exploratory runs with deliberately varied configurations, not just optimized benchmark execution. The cost of these experiments is small relative to the information they provide.
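As a rough sketch of what multi-regime reporting could look like, the snippet below computes the spread between models within each regime and each model's range across regimes. The regime names follow the list above. The model names, pass rates, and the helpers `spread_by_regime` and `range_by_model` are hypothetical placeholders, not results from the experiment.

```python
# Hypothetical pass rates per model and information regime (toy values).
pass_rates = {
    "model_a": {"detailed": 0.50, "partial": 0.55, "minimal": 0.48},
    "model_b": {"detailed": 0.51, "partial": 0.40, "minimal": 0.30},
    "model_c": {"detailed": 0.47, "partial": 0.60, "minimal": 0.65},
}


def spread_by_regime(rates: dict[str, dict[str, float]]) -> dict[str, float]:
    """Best minus worst pass rate across models, per regime."""
    regimes = sorted({r for per_model in rates.values() for r in per_model})
    spread = {}
    for regime in regimes:
        scores = [pm[regime] for pm in rates.values() if regime in pm]
        spread[regime] = max(scores) - min(scores)
    return spread


def range_by_model(rates: dict[str, dict[str, float]]) -> dict[str, float]:
    """Best minus worst pass rate across regimes, per model.

    A small range suggests the model is less sensitive to how much
    information the task specification provides.
    """
    return {m: max(pm.values()) - min(pm.values()) for m, pm in rates.items()}


spread = spread_by_regime(pass_rates)
ranges = range_by_model(pass_rates)

for regime, value in spread.items():
    print(f"{regime:>8}: spread {value * 100:.0f} pts")
for model, value in ranges.items():
    print(f"{model}: range across regimes {value * 100:.0f} pts")
```

Reporting both numbers side by side makes it visible when a tight leaderboard under one regime hides large differences under another.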