Pilots Don't Die in Demos. They Die in Month Three.

The number that should be keeping AI vendors up at night is not a benchmark. It is this: by recent industry counts, roughly 88% of enterprise AI agent pilots never make it to production. Adoption surveys this spring put agents in production at about 31% of enterprises, with banking near 47% and healthcare and government down at 18% and 14%. Everyone has a pilot. Almost nobody has a system.

The convenient explanation is that models are not good enough yet, which has the advantage of solving itself: wait two quarters, re-run the pilot. I have watched this cycle from inside an enterprise deployment and from the research side, and I do not believe the convenient explanation. The pilots I have seen die did not die because the model was too dumb. They died because a pilot is a static system: it was demoed against a snapshot and then left running in a world that moves.

The three deaths, in order

Month one is the demo. It goes well, because demos are built where the light is good: curated inputs, a knowledge base snapshotted last week, the team’s most enthusiastic users, and an engineer watching the logs with one hand on the wheel.

Month two is quiet drift. The knowledge base is now five weeks stale, and the agent answers from a world that no longer exists. Input distribution shifts as normal users replace enthusiasts and ask questions nobody curated for. Success rate slides a few points. Nobody notices, because nobody instrumented success rate; they instrumented usage.

Month three is the incident. Some visible, embarrassing failure lands in front of the sponsoring executive: a confidently stale answer, an action taken on a record that changed, a cost report with a number that ends the meeting. The pilot is “paused for evaluation.” Pauses of this kind have the half-life of a proton.

[Figure: the pilot survival curve as I keep seeing it. The cliff is not at the demo; it is in the operational weeks after. Schematic drawn from the ~88% failure figure; exact shape varies by org.]

The missing discipline has a name

Software engineering met this exact problem decades ago and named the answer operations. Nobody expects a service to stay up because the code was good on launch day; we expect it to stay up because monitoring, alerting, deploys, and rollbacks exist. The agent world is still mostly shipping launch-day code and acting surprised that the world has a derivative.

My M.S. thesis ended up being, more or less, an inventory of what the operations layer for LLM systems has to contain, and each piece maps onto one of the three deaths:

Freshness is an SLO, not a batch job. If your agent’s knowledge has a “last updated” measured in weeks, month-two drift is already scheduled. Time-to-answerable, the delay between a fact changing and your system answering correctly from it, is measurable and should have a target. Continual updates that do not erase old competence are a solved-enough problem (STAR+FAR is my entry in that genre) that “we re-index quarterly” no longer deserves a pass.
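To make "measurable and should have a target" concrete, here is a minimal sketch of a time-to-answerable report. It assumes you already log, per changed fact, when the source of truth changed and when a probe question about it was first answered correctly. The names FactChange and freshness_report, and the idea of a probe question, are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FactChange:
    fact_id: str
    changed_at: datetime                  # when the source of truth changed
    first_correct_at: Optional[datetime]  # first correct answer; None if still stale


def freshness_report(changes, target, now):
    """Summarize time-to-answerable against a target delay (hypothetical SLO)."""
    met = 0
    worst = timedelta(0)
    for c in changes:
        # A fact that is still unanswerable keeps accruing delay up to `now`.
        end = c.first_correct_at if c.first_correct_at is not None else now
        delay = end - c.changed_at
        worst = max(worst, delay)
        if c.first_correct_at is not None and delay <= target:
            met += 1
    compliance = met / len(changes) if changes else 1.0
    return {"compliance": compliance, "worst_delay": worst}
```

A report like this turns "we re-index quarterly" into a number someone can be paged on: the share of changes that became answerable within the target, and the worst delay still outstanding.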

Success rate is the metric, and it needs a denominator. Usage tells you people tried. Only task-level success and cost per completed task tell you whether the thing works, and both require defining “done” precisely enough to check. Most pilots skipped that definition, which is why month two is silent.
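The two metrics reduce to a few lines once "done" is a checkable boolean per attempt. The sketch below assumes each attempt is logged with a pass/fail result and its total spend; TaskRun and pilot_metrics are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    done: bool       # outcome of an explicit, checkable definition of "done"
    cost_usd: float  # total spend for this attempt, successful or not


def pilot_metrics(runs):
    """Task-level success rate and cost per completed task."""
    attempts = len(runs)
    completed = sum(1 for r in runs if r.done)
    # Failed attempts still cost money, so they count toward the numerator.
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "attempts": attempts,
        "success_rate": completed / attempts if attempts else None,
        "cost_per_completed_task": total_cost / completed if completed else None,
    }
```

Note that the cost figure divides all spend by completed tasks only. That is the design choice that keeps a falling success rate from hiding: every failed attempt makes each success more expensive.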

Correction must be structural, not conversational. When failures cluster, the fix is usually in the pipeline: a retrieval that needs re-scoping, a prompt that needs a patch, a guardrail that needs adding. In my CAFO experiments, repairing the system beat revising individual answers by such a margin that answer-level self-correction barely registered on the same axis. A pilot with no mechanism for structural fixes does not learn from month two, so month three arrives on time.
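One simple way to notice that failures cluster is to tag each failed task with the pipeline stage judged responsible and count. This is only a sketch of that triage step, not the CAFO method itself; the stage labels, thresholds, and the function name structural_fix_candidates are all assumptions for illustration.

```python
from collections import Counter


def structural_fix_candidates(failures, min_share=0.3, min_count=5):
    """Return (stage, count) pairs where failures cluster enough to justify a pipeline fix.

    `failures` is an iterable of dicts with a "stage" key, e.g. "retrieval",
    "prompt", or "guardrail" (hypothetical labels assigned during triage).
    """
    counts = Counter(f["stage"] for f in failures)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (stage, n)
        for stage, n in counts.most_common()
        # Require both an absolute count and a share, so a stage is flagged
        # neither by a tiny sample nor by a long tail of unrelated failures.
        if n >= min_count and n / total >= min_share
    ]
```

A stage that shows up here is a candidate for re-scoping retrieval, patching a prompt, or adding a guardrail, rather than for fixing answers one at a time.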

The reframe for whoever owns the budget

The uncomfortable arithmetic: a pilot proves the cheapest 20% of the work and defers the expensive 80%. The deferred 80% then shows up as a surprise, and the surprise kills the project. If the plan for “who notices drift, who owns the eval suite, what triggers a rollback, and how fresh is fresh enough” does not exist on demo day, the project is not early. It is unstarted.

None of this is an argument against agents; the ~12% that cross the gap are quietly compounding their advantages while the 88% keep restarting from zero. It is an argument about where the difficulty lives. The model is the easy part now. Month three is the exam, and it is open-book, because the questions are known in advance: drift, staleness, silent regression, no correction loop. Study before the demo, not after the incident.