The $20K Bug That Changed How We Think About Evals

Author: Narek Maloyan
Tags: AI Evals, Benchmarking, LLM Testing


Summary

While testing 15 models across open and proprietary families for autonomous task execution using SWE-Bench Pro, Zencoder encountered a bug that accidentally consumed roughly $20K of a $50K evaluation budget. Rather than writing it off as waste, the team discovered that the bug had inadvertently created a more realistic testing scenario -- and the results challenged fundamental assumptions about how AI model benchmarks should be designed.

Key Findings

[Figure: "Benchmark Setup Changes What You Measure". Panels: Standard Config (detailed task specs, 45-52%, 6 pt spread); Buggy Config (ambiguous + leaked tests, 32-78%, 26 pt spread); 15 Models (open + proprietary); Pairwise Overlap (68-84% at same score). Key insight: how you test determines what you see.]
Standard vs. buggy benchmark configuration: identical models produce dramatically different performance spreads depending on the information regime.

The Bug

The issue originated in Harbor's adapter for SWE-Bench Pro. A bug in the adapter leaked the names of failing tests to the models while keeping the task descriptions minimal. In effect, models received partial, noisy information: they knew which tests were failing but had little context about what needed to be fixed and why. The misconfigured runs consumed roughly $20K in compute before the team identified and corrected the problem.

On reflection, the team realized this accidental setup closely mirrored real-world regression testing scenarios. In practice, developers often work with ambiguous specifications, incomplete context, and partially informative test failures. The bug had created a benchmark condition that was messy in exactly the ways real engineering work tends to be.
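The difference between the two information regimes can be made concrete with a small sketch. This is not Harbor's adapter code: the function name `build_prompt`, its parameters, and the example task are all hypothetical, chosen only to show how the same task can reach a model with a full description in one regime and with only failing test names in the other.

```python
def build_prompt(title, description, failing_tests, *, detailed, leak_tests):
    """Assemble a task prompt under a chosen information regime.

    Hypothetical helper for illustration only; it does not reproduce
    the Harbor adapter or the SWE-Bench Pro prompt format.
    """
    parts = [f"Task: {title}"]
    if detailed:
        # Standard regime: the model sees the full task specification.
        parts.append(f"Description:\n{description}")
    if leak_tests:
        # Buggy regime: the model sees which tests currently fail.
        test_lines = "\n".join(f"- {name}" for name in failing_tests)
        parts.append(f"Failing tests:\n{test_lines}")
    return "\n\n".join(parts)


# Made-up example task, not taken from the benchmark.
TITLE = "Fix timezone handling in date parser"
DESCRIPTION = "Offsets such as +02:00 are dropped when parsing timestamps."
FAILING = ["test_parse_positive_offset", "test_parse_negative_offset"]

standard_prompt = build_prompt(TITLE, DESCRIPTION, FAILING,
                               detailed=True, leak_tests=False)
buggy_prompt = build_prompt(TITLE, DESCRIPTION, FAILING,
                            detailed=False, leak_tests=True)
```

Treating `detailed` and `leak_tests` as explicit switches, rather than as accidents of an adapter, is what makes the information regime something that can be varied on purpose.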

What It Revealed

Metric                          Standard Config    Buggy Config
Performance spread              6 pts (45-52%)     26 pts (32-78%)
Task specification              Detailed           Ambiguous + leaked tests
Pairwise overlap (same score)   --                 68-84%
Models tested                   15 (open + proprietary)

The most striking finding was how dramatically the benchmarking setup itself changed what was being measured. Under the standard configuration with detailed task specifications, model performance clustered tightly -- the spread between models was only about 6 percentage points (45-52%). The models looked roughly interchangeable.

Under the buggy configuration with ambiguous instructions and leaked test names, the performance spread ballooned to 26 percentage points (32-78%). Model rankings reshuffled significantly. The same benchmark, with different information given to the models, produced an entirely different picture of relative model capability.
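The two effects described above, a wider spread and reshuffled rankings, can be quantified separately. The sketch below is one way to do that, assuming per-model pass rates are available as dictionaries. The model names and scores are invented for illustration and are not the results from this evaluation. Kendall's tau is used here as one possible measure of ranking agreement; the original analysis may have used something else.

```python
def spread(scores):
    """Gap, in percentage points, between the best and worst pass rate."""
    return max(scores.values()) - min(scores.values())


def ranking(scores):
    """Model names ordered from highest to lowest pass rate."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two configs (tied pairs count as 0).

    +1 means identical ordering, -1 means fully reversed ordering.
    """
    models = list(scores_a)
    concordant = discordant = 0
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            m, n = models[i], models[j]
            sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs


# Invented pass rates (percent) for three hypothetical models.
standard = {"model_a": 50.0, "model_b": 48.0, "model_c": 46.0}
buggy = {"model_a": 40.0, "model_b": 70.0, "model_c": 55.0}

print(spread(standard), spread(buggy))      # spread per configuration
print(ranking(standard), ranking(buggy))    # ordering per configuration
print(kendall_tau(standard, buggy))         # agreement between orderings
```

A tau near 1 would mean the second configuration only stretched the scale; a low or negative tau means it changed which models look best.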

Additional analysis revealed that even models achieving identical overall scores often solved different subsets of problems, with overlap ranging from only 68% to 84%. This means that raw pass rates hide substantial differences in what models are actually good at. Two models with the same score might have genuinely different strengths.
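A minimal sketch of such an overlap analysis follows. The post does not state exactly how overlap was computed, so this assumes it is the fraction of one model's solved tasks that the other model also solved; Jaccard similarity (intersection over union) would be a stricter alternative. The task IDs and model names are made up.

```python
from itertools import combinations


def shared_fraction(solved_a, solved_b):
    """Fraction of the tasks solved by A that B also solved.

    For two models with the same score the sets have equal size, so the
    value is the same in both directions. Assumed definition, see lead-in.
    """
    if not solved_a:
        return 0.0
    return len(solved_a & solved_b) / len(solved_a)


def pairwise_overlap(solved_by_model):
    """Overlap for every unordered pair of models, keyed by sorted names."""
    return {
        (m, n): shared_fraction(solved_by_model[m], solved_by_model[n])
        for m, n in combinations(sorted(solved_by_model), 2)
    }


# Made-up results: both models solve 10 tasks, but only 7 in common.
solved = {
    "model_a": {f"task_{i}" for i in range(1, 11)},   # task_1 .. task_10
    "model_b": {f"task_{i}" for i in range(4, 14)},   # task_4 .. task_13
}

print(pairwise_overlap(solved))
```

Working from sets of solved task IDs rather than aggregate pass rates is the key design choice: it is the only way two identical scores can be shown to cover different problems.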

Lessons Learned

The experience yielded several practical insights for anyone building or relying on AI benchmarks:

The broader takeaway is that benchmark design deserves as much scrutiny as model development. How you test determines what you see.

A deeper lesson concerns the relationship between evaluation methodology and the conclusions we draw from it. In traditional software testing, the test environment is tightly controlled to isolate the variable under test. But when evaluating AI agents on complex tasks, the information environment -- what context, hints, and constraints the agent receives -- is itself a critical variable. Two evaluation setups that appear to test "the same thing" can produce fundamentally different rankings if they differ in how much information they provide to the model. This means that anyone comparing benchmark scores across different evaluation frameworks is likely comparing apples to oranges, even when the underlying task set is identical.

The $20K cost of this lesson also highlighted a practical reality of large-scale AI evaluation: the expense creates a strong incentive to run benchmarks only once, under a single configuration. But our accidental experiment showed that the most valuable insights came from running the same benchmark under different conditions. Organizations that invest in multi-configuration evaluation -- deliberately varying the information regime, the level of task specification, and the tools available to the model -- will develop a far more accurate picture of model capabilities than those that optimize for a single benchmark configuration.
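One lightweight way to plan a multi-configuration evaluation is to enumerate the configurations explicitly before spending any budget. The sketch below assumes three axes drawn from the paragraph above (task specification level, whether failing test names are exposed, and available tools). The class name `EvalConfig` and the specific axis values are hypothetical, not the team's actual setup.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EvalConfig:
    """One point in the evaluation grid (hypothetical schema)."""
    task_spec: str              # how much task description the model gets
    leak_failing_tests: bool    # whether failing test names are shown
    tools: tuple                # tools exposed to the agent


# Illustrative axis values; a real study would choose its own.
TASK_SPECS = ("detailed", "minimal")
LEAK_OPTIONS = (False, True)
TOOLSETS = (("shell",), ("shell", "test_runner"))


def config_grid():
    """Every combination of the three axes, in a stable order."""
    return [
        EvalConfig(spec, leak, tools)
        for spec, leak, tools in product(TASK_SPECS, LEAK_OPTIONS, TOOLSETS)
    ]


for cfg in config_grid():
    print(cfg)
```

Because the grid grows multiplicatively with each axis, listing it up front makes the cost trade-off visible: each added axis value multiplies the number of full benchmark runs.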

Why does this matter for AI development?

The implications extend beyond SWE-Bench Pro. The AI industry relies heavily on benchmarks to make high-stakes decisions: which model to deploy in production, where to allocate research resources, and how to communicate progress to stakeholders. If a benchmark's information regime -- the specific details provided to the model -- can widen the spread between models from 6 to 26 percentage points and reshuffle their rankings, then leaderboard positions are as much a function of benchmark design as of model capability.

This finding is especially relevant for coding agents, where real-world tasks rarely come with the level of specification that benchmarks typically provide. A developer filing a bug report might include a stack trace and a vague description of expected behavior. A project manager might describe a feature request in business terms without specifying implementation details. The models that excel in these ambiguous conditions may be quite different from those that top a fully-specified benchmark.

The overlap analysis adds another dimension. When two models with a 70% pass rate share only 68% of their solved problems, roughly a third of each model's successes are unique. This suggests that ensemble or routing approaches -- directing tasks to the model most likely to succeed based on problem characteristics -- could substantially outperform any single model selection. It also means that benchmark averages dramatically understate the diversity of model capabilities.
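The arithmetic behind that claim can be sketched as an upper bound. The function below assumes two models with equal pass rates, that the overlap figure is measured relative to each model's own solved set, and a perfect router that always picks a model that succeeds when one exists. A real router would land somewhere between the single-model rate and this bound.

```python
def oracle_pass_rate(pass_rate, shared):
    """Best-case pass rate when routing between two equally scoring models.

    pass_rate: fraction of all tasks each model solves (e.g. 0.70)
    shared:    fraction of a model's solved tasks the other also solves
    Assumes a perfect router, so this is an upper bound, not a forecast.
    """
    solved_by_both = shared * pass_rate
    # Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
    solved_by_either = 2 * pass_rate - solved_by_both
    return min(solved_by_either, 1.0)


# The example from the text: 70% pass rate, 68% shared solutions.
print(round(oracle_pass_rate(0.70, 0.68), 3))
```

Under these assumptions the example in the text gives a ceiling of about 92%, compared with 70% for either model alone, which is why the overlap figure matters more than the headline score suggests.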

For the broader LLM evaluation community, this experience raises uncomfortable questions about the reproducibility and comparability of published benchmark results. If a 6-point spread can balloon to 26 points by changing the information provided to the model, then leaderboard rankings are far more fragile than they appear. Organizations making deployment decisions based on benchmark scores should ask not just "which model scored highest?" but "under what information regime was this score achieved, and how closely does that regime match our production use case?" Without this context, a benchmark number is at best incomplete and at worst misleading.

What should change about how we benchmark AI?

Based on this experience, we propose three concrete changes to benchmark methodology:

Read the full post on the Zencoder blog →



Narek Maloyan is a PhD candidate at Moscow State University and AI Research Engineer at Zencoder. His research focuses on AI safety, LLM security, and adversarial machine learning.