A Hundred Agents Is Not a Plan

Swarm architectures multiply an unreliable unit and call it scale. The math of chained success rates says the opposite: fewer agents, tighter loops, structural correction.

agents multi-agent reliability architecture

The fashionable number in agent architecture right now is one hundred. Moonshot’s Kimi K2.5 markets an agent-swarm mode coordinating up to 100 parallel sub-agents; every orchestration framework demo this spring seems to open with a boxes-and-arrows slide that looks like an org chart for a mid-size company staffed entirely by interns. The pitch is intuitive: if one agent is useful, a hundred must be a workforce.

I build multi-agent systems, so this is not a genre I am against. It is a scaling claim I am against, because the arithmetic underneath it is merciless and almost never quoted on the slide.

The exponent does not care about your org chart

A step that succeeds with probability $r$, chained $n$ times with each step depending on the last, succeeds end-to-end with probability $r^n$. That innocent exponent is the whole story of multi-agent reliability:

$$0.99^{10} \approx 0.90 \qquad 0.95^{10} \approx 0.60 \qquad 0.90^{20} \approx 0.12$$

A 95%-reliable step is excellent by current agent standards; public production telemetry this spring puts average deployed-agent task success near 57%. Chain ten excellent steps and you are at 60%. Twenty merely-good steps: 12%. Every handoff between agents is one more factor in the product, and unlike human organizations, sub-agents do not push back when handed nonsense. They elaborate it, confidently, at tokens per second.

Figure: End-to-end success probability for a dependent chain of $n$ steps, at per-step reliabilities of 99%, 95%, and 90%, computed directly from $r^n$.
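The numbers above are easy to reproduce. Here is a minimal Python sketch, assuming step failures are independent (the same assumption the exponent itself makes). The helper names `chain_success` and `max_steps` are mine, introduced for illustration only.

```python
import math


def chain_success(r: float, n: int) -> float:
    """End-to-end success of n dependent steps, each succeeding with probability r.

    Assumes step failures are independent of each other.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return r ** n


def max_steps(r: float, target: float) -> int:
    """Largest chain length n for which r**n still meets the target success rate.

    Hypothetical budgeting helper: solves r**n >= target for n.
    Results that land exactly on an integer boundary can be off by one
    because of floating-point rounding.
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return math.floor(math.log(target) / math.log(r))


if __name__ == "__main__":
    for r, n in [(0.99, 10), (0.95, 10), (0.90, 20)]:
        print(f"r={r:.2f}, n={n:>2}: end-to-end success ~ {chain_success(r, n):.2f}")
    # How deep can a chain go before it drops below a coin flip?
    for r in (0.99, 0.95, 0.90):
        print(f"r={r:.2f}: at most {max_steps(r, 0.5)} dependent steps to stay above 50%")
```

Turning the question around with `max_steps` is often the more useful form in a design review: fix the end-to-end success rate you can tolerate, then see how little depth a given per-step reliability buys.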

“But the hundred agents run in parallel, not in a chain.” Sometimes, for genuinely separable work, and then fan-out is legitimate: my own tornado response system runs four agents because radar analysis, tweet triage, and document retrieval truly are independent until the synthesis step. But read the transcripts of a big swarm on a real task and you find the dependency chain hiding inside the parallelism: planner to sub-planner to worker to verifier to re-planner. The width is marketing. The depth is where the exponent lives, and the synthesis step at the end inherits every upstream error that nobody caught.
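To make the width-versus-depth point concrete, here is a rough sketch in Python. It assumes every hop has the same illustrative 95% reliability, that branches fail independently, and that no faulty result is caught before synthesis. The function names and numbers are hypothetical, not measurements from any real swarm.

```python
from math import prod


def critical_path_success(stage_reliabilities):
    """Success probability along one dependent path: the product of its stages."""
    return prod(stage_reliabilities)


def expected_bad_branches(width, r, depth):
    """Expected number of parallel branches that hand a faulty result to synthesis.

    Each branch is a dependent chain of `depth` steps at per-step reliability r.
    Branches are assumed to fail independently, with no upstream checking.
    """
    return width * (1.0 - r ** depth)


# Illustrative numbers only: five hops per branch, each 95% reliable.
hops = ["planner", "sub-planner", "worker", "verifier", "re-planner"]
branch = critical_path_success([0.95] * len(hops))

if __name__ == "__main__":
    print(f"one branch, {len(hops)} hops: success ~ {branch:.2f}")
    for width in (4, 100):
        bad = expected_bad_branches(width, 0.95, len(hops))
        print(f"width {width:>3}: ~{bad:.1f} faulty branch results reach synthesis")
```

Under these assumptions, widening the swarm does nothing to the per-branch success rate. It only increases the number of faulty results the synthesis step has to absorb.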

There is also a bill. Coordination is paid for in tokens: plans, handoff summaries, arbitration, retries. Agentic workloads already burn 5 to 30 times the tokens of a chat exchange, and tail latency stacks with every serial hop. A swarm is a machine for converting your budget into intermediate artifacts.

Retry loops with better marketing

Here is the uncomfortable diagnosis of many “self-healing swarms”: strip the terminology and you find a retry loop. Worker fails, critic notices something is off, planner reassigns, new worker tries again with a slightly different prompt. If each attempt is an independent draw from the same distribution, you are buying success probability at full token price per draw, and learning nothing between draws. That can be rational for cheap steps. As an architecture, it is a slot machine with a dashboard.
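The price of that slot machine can be written down. This is a minimal sketch, assuming each attempt is an independent draw with the same success probability `p`, that the critic detects every failure perfectly, and that every attempt costs the same number of tokens. All names here are hypothetical.

```python
def retry_success(p: float, max_attempts: int) -> float:
    """Probability that at least one of max_attempts independent tries succeeds."""
    return 1.0 - (1.0 - p) ** max_attempts


def expected_attempts(p: float, max_attempts: int) -> float:
    """Expected number of attempts actually run, stopping at the first success.

    Attempt i is run only if the previous i-1 attempts all failed, so the
    expectation is the sum of (1-p)**i for i = 0 .. max_attempts-1.
    """
    return sum((1.0 - p) ** i for i in range(max_attempts))


def expected_tokens(p: float, max_attempts: int, tokens_per_attempt: float) -> float:
    """Expected token spend for one step wrapped in a retry loop."""
    return expected_attempts(p, max_attempts) * tokens_per_attempt


if __name__ == "__main__":
    p, tokens = 0.60, 10_000  # illustrative values, not measurements
    for k in (1, 2, 3, 5):
        print(
            f"up to {k} attempts: success ~ {retry_success(p, k):.3f}, "
            f"expected spend ~ {expected_tokens(p, k, tokens):,.0f} tokens"
        )
```

The success probability does climb with each extra draw, but `p` itself never moves: the next task starts from the same distribution, and the spend recurs on every one of them.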

The alternative I keep landing on, in research and in production, is not “one giant agent” either. It is: the smallest number of agents the task’s true structure demands, wrapped in a control loop that fixes the system instead of re-rolling the attempt. In my CAFO experiments, a controller that diagnosed failures and applied structural fixes (patch the retrieval, patch the guardrail, patch the prompt) beat a per-query oracle, the theoretical best possible single-shot selector, by 2.6x on cumulative correction. More attempts lose to better systems, and it is not close.
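The difference between re-rolling and fixing can be shown with a toy model. To be clear, this is not the CAFO controller and it reproduces none of its results. It is a deliberately extreme sketch in which every failure comes from an unpatched structural defect, so retries never help. Real systems sit somewhere between this and the independent-draw picture. Every name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToySystem:
    """Stand-in for an agent pipeline whose failures come from unpatched defects."""
    patched: set = field(default_factory=set)

    def run(self, query_defect):
        """Return None on success, or the name of the defect that caused a failure.

        query_defect is the defect this query exercises, or None if it hits none.
        """
        if query_defect is None or query_defect in self.patched:
            return None
        return query_defect


def reroll_policy(system, queries, max_retries=3):
    """Retry each failed query; the system itself never changes."""
    successes = attempts = 0
    for q in queries:
        for _ in range(1 + max_retries):
            attempts += 1
            if system.run(q) is None:
                successes += 1
                break
    return successes, attempts


def structural_policy(system, queries):
    """On failure, diagnose the defect, patch the system, and retry once."""
    successes = attempts = 0
    for q in queries:
        attempts += 1
        defect = system.run(q)
        if defect is not None:
            system.patched.add(defect)  # the fix persists for all later queries
            attempts += 1
            defect = system.run(q)
        if defect is None:
            successes += 1
    return successes, attempts


# Illustrative query stream: each entry names the defect that query exercises.
queries = ["retrieval", None, "prompt", "retrieval",
           "guardrail", "prompt", None, "retrieval"]

if __name__ == "__main__":
    print("re-roll   :", reroll_policy(ToySystem(), queries))
    print("structural:", structural_policy(ToySystem(), queries))
```

In this toy, the structural policy pays for each defect once and then stops paying, while the re-roll policy pays the full retry budget every time the same defect reappears.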

Raise r, shorten n, then scale

So my rule for agent architecture, in priority order: raise $r$ (per-step reliability: tighter tools, narrower scopes, verified outputs), shorten $n$ (collapse steps; every handoff you delete multiplies end-to-end success by $1/r$), and only then widen (fan out across genuinely independent work, never as a substitute for either). That is exactly the reverse of the order the marketing runs in.
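As a worked check of the first two levers, take the 20-step, 90% chain from earlier as the baseline. The intermediate values are computed here for illustration and assume independent step failures:

```latex
\begin{align*}
\text{baseline:} \quad & 0.90^{20} \approx 0.12 \\
\text{raise } r \text{ to } 0.95\text{:} \quad & 0.95^{20} \approx 0.36 \\
\text{shorten } n \text{ to } 10\text{:} \quad & 0.90^{10} \approx 0.35 \\
\text{both:} \quad & 0.95^{10} \approx 0.60 \\
\text{delete one handoff:} \quad & \frac{r^{\,n-1}}{r^{\,n}} = \frac{1}{r} \approx 1.11 \ \text{at } r = 0.90
\end{align*}
```

Either lever alone roughly triples the baseline, and together they take it to about five times, all before a single extra agent is added.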

The hundred-agent slide will keep winning meetings this year; head-count is a metaphor executives already trust. Bring the exponent to the meeting. An org chart is not an architecture, and a workforce whose every member is 90% reliable is not a workforce. It is a lottery with very good production values.