Assume the Benchmark Is Gamed

Berkeley researchers showed that every major agent benchmark they examined can be exploited to near-perfect scores. Production telemetry says deployed agents succeed 56.6% of the time. Measure like an SRE instead.

Two numbers crossed my desk this month that should not be able to coexist.

The first: UC Berkeley researchers demonstrated that every major AI agent benchmark they examined can be exploited to achieve near-perfect scores without solving a single task. Not fine-tuned on, not contaminated by accident: exploited, through the benchmark’s own scoring machinery.

The second: telemetry aggregated across 6,259 deployed agents and 4.5 million production runs shows a task success rate of 56.6%. Related analyses this spring put the gap between lab benchmark scores and real-world deployment performance at 37 points.

Leaderboard: high nineties. Reality: a coin flip with good posture. That is not a measurement error. That is two different quantities wearing the same name.

Benchmarks became adversarial systems

The old story about benchmark rot was passive: datasets leak into training corpora, scores inflate, everyone squints and adjusts. MMLU sitting saturated above 88% for every frontier model is that story’s ending; differences up there are statistically meaningless and everyone politely agrees not to mention it.

The new story is active. The moment a benchmark number moves procurement decisions and funding rounds, it stops being an instrument and becomes a target, and the target is defended by nobody. Goodhart’s law was a warning about incentives; agent leaderboards industrialized it. Scoring harnesses that check outcomes loosely, environments that can be probed, judges that can be persuaded: the Berkeley result says the attack surface is the benchmark itself, and it is soft.

I do not think the lesson is “build harder benchmarks,” although people should. The lesson is that any single static number that matters enough to game will be gamed, and your defense is to measure things that are expensive to fake because they are attached to your actual system doing actual work.

Figure: Lab score vs. production reality for agentic systems, per this spring’s public numbers: near-perfect exploitable benchmark scores, a 37-point lab-to-production gap, and 56.6% success across 4.5M production runs (6,259 agents).

Measure agents like an SRE measures a service

Here is the alternative discipline, and none of it is novel; it is site reliability engineering pointed at a new kind of service. It works precisely because it is private, longitudinal, and tied to consequences, three properties no public leaderboard can have.

Task SLOs on your own traffic. Define completion precisely, sample continuously, and track success rate with the same seriousness you track uptime. Not a score, a time series. Gaming a time series of your own production traffic is called “making the product better,” which is the point.
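A minimal sketch of that time series, assuming each production run is logged as a timestamp plus a success flag. The function names, the daily bucket, and the 90% target are illustrative assumptions, not numbers from the telemetry above.

```python
from collections import defaultdict
from datetime import datetime

def daily_success_rate(runs):
    """Bucket (timestamp, succeeded) pairs by calendar day and return {day: rate}."""
    totals = defaultdict(lambda: [0, 0])  # day -> [successes, total runs]
    for timestamp, succeeded in runs:
        bucket = totals[timestamp.date()]
        bucket[0] += int(succeeded)
        bucket[1] += 1
    return {day: wins / count for day, (wins, count) in sorted(totals.items())}

def slo_breaches(series, target=0.90):
    """Return the days whose success rate fell below the SLO target."""
    return [day for day, rate in series.items() if rate < target]

# Hypothetical log: four runs on day one (one failure), two clean runs on day two.
runs = [
    (datetime(2025, 5, 1, 9, 0), True),
    (datetime(2025, 5, 1, 9, 5), True),
    (datetime(2025, 5, 1, 9, 9), False),
    (datetime(2025, 5, 1, 9, 30), True),
    (datetime(2025, 5, 2, 10, 0), True),
    (datetime(2025, 5, 2, 10, 4), True),
]
series = daily_success_rate(runs)
breaches = slo_breaches(series)
```

The output is a series keyed by day, so a breach is a dated event you can line up against deploys, not a single score to argue about.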

Percentiles, not means. Agents fail at the tail: the long chain, the ambiguous ticket, the record that changed mid-run. Mean quality is marketing; p95 behavior is engineering. This is the same argument I make about latency in retrieval systems, because it is the same argument everywhere: users live at the percentile, not the average.
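To make the gap between mean and tail concrete, here is a small sketch using the nearest-rank percentile definition. The steps-per-run data is invented for illustration, and `percentile` is a hypothetical helper, not a library call.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    # ceil(pct / 100 * n), done in integer math to avoid float rounding surprises
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical steps-per-run for 20 runs: 18 short chains, 2 long ones.
steps = [4] * 18 + [60] * 2
mean_steps = sum(steps) / len(steps)
p50_steps = percentile(steps, 50)
p95_steps = percentile(steps, 95)
```

On this toy data the mean is 9.6 steps, the median is 4, and p95 is 60. The mean describes a run that never happened; the p95 describes the one that pages you.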

Cost per successful task. The number that catches the agent that “succeeds” by burning 30x the tokens. Benchmarks are cost-blind; your CFO is not. Public analyses this spring found 50x cost variation between systems with similar accuracy, which means accuracy-only comparison is not measuring the thing you buy.
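The metric itself is one division, sketched below under the assumption that every run records its dollar cost and a success flag. The `Run` type and the prices are hypothetical; the one design choice that matters is that failed runs stay in the numerator, because you paid for them too.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    cost_usd: float
    succeeded: bool

def cost_per_successful_task(runs):
    """Total spend (including failed runs) divided by the number of successes."""
    runs = list(runs)
    total_cost = sum(run.cost_usd for run in runs)
    successes = sum(run.succeeded for run in runs)
    if successes == 0:
        return float("inf")  # money spent, nothing delivered
    return total_cost / successes

# Two hypothetical agents with identical 80% success but a 30x per-run price gap.
frugal = [Run(0.02, True)] * 8 + [Run(0.02, False)] * 2
wasteful = [Run(0.60, True)] * 8 + [Run(0.60, False)] * 2

frugal_cost = cost_per_successful_task(frugal)
wasteful_cost = cost_per_successful_task(wasteful)
```

An accuracy-only comparison calls these two agents a tie. Cost per successful task puts them at roughly $0.025 versus $0.75.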

Regression gates, not victory laps. Every change to a prompt, a tool, or a model version goes through a sequential test on live traffic before it is committed. In my CAFO work this was the single most load-bearing component: a statistical gate that rejects changes that regress, run continuously. Evaluation as a gate, not a trophy.
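One way to build such a gate is Wald's sequential probability ratio test on the success flag. To be clear, this is a stand-in I am choosing for illustration, not a description of the gate used in the work mentioned above. The baseline rate, the regressed rate, and the error budgets are all assumed values you would set from your own traffic.

```python
import math

def sprt_gate(outcomes, p_baseline=0.90, p_regressed=0.80, alpha=0.05, beta=0.05):
    """Sequential test of 'success rate is p_baseline' vs 'it dropped to p_regressed'.

    Returns (verdict, runs_observed) where verdict is 'reject' (regression),
    'accept' (no regression detected), or 'continue' (not enough evidence yet).
    """
    reject_bound = math.log((1 - beta) / alpha)   # evidence favors the regressed rate
    accept_bound = math.log(beta / (1 - alpha))   # evidence favors the baseline rate
    win_step = math.log(p_regressed / p_baseline)
    loss_step = math.log((1 - p_regressed) / (1 - p_baseline))

    log_likelihood_ratio = 0.0
    observed = 0
    for succeeded in outcomes:
        observed += 1
        log_likelihood_ratio += win_step if succeeded else loss_step
        if log_likelihood_ratio >= reject_bound:
            return ("reject", observed)
        if log_likelihood_ratio <= accept_bound:
            return ("accept", observed)
    return ("continue", observed)

# Hypothetical streams of live-traffic outcomes for a candidate change.
all_failures = sprt_gate([False] * 10)
all_successes = sprt_gate([True] * 40)
too_early = sprt_gate([True, True, True])
```

With these assumed settings, five straight failures are enough to reject the change, while three successes are not enough to decide anything. The appeal of a sequential test is that a bad change gets stopped early instead of waiting for a fixed sample size.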

The benchmark’s actual job

Public benchmarks still have one legitimate use: they are a screening device for what to try, a cheap filter over an enormous model menu. Screening is fine. The failure mode is promotion: letting the screening number make the shipping decision.

So my rule, stated plainly. A benchmark can earn a model an interview. Only your own instrumented traffic can hire it. If a vendor’s pitch deck leads with a leaderboard and cannot show task-level success on workloads shaped like yours, you are not looking at evidence. You are looking at a number that survived because nobody paid to attack it, in a month when we learned that attacking such numbers is nearly free.