Long Context Didn't Kill Retrieval

“RAG is dead” is having another moment. The argument, compressed: frontier models now take a million tokens of context, so skip the retrieval pipeline, load the corpus into the window, and let attention sort it out. Chunking, embeddings, rerankers: 2023 artifacts, like prompt-injection-by-haiku.

I run retrieval research for a living, so discount me accordingly. But my objection is not sentimental, it is operational, and it fits in one sentence: context is a budget, not a backpack. Every token you put in the window is a purchase, and “buy everything” has never been a serious answer to any allocation question.

What a million tokens actually costs

Three costs scale with what you stuff in, and none of them show up in a demo.

Latency. Prefill compute scales with input length; time-to-first-token over a maximally packed window is measured in tens of seconds on current serving stacks. That is fine for a research assistant grinding a long document. It is disqualifying for anything with a user waiting, which is to say, most things. Under a 5-second P95 SLO, the million-token option does not exist, whatever the model card says.

Money. Input tokens are the cheap kind until you multiply them by every request. A packed window on every query is a standing tax of several dollars per call at frontier prices; agentic systems already burn 5-30x a chatbot’s tokens before anyone gets generous with context. Caching helps exactly when your context is static, and production knowledge is not static; that is the whole problem retrieval exists to manage.
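The standing-tax arithmetic is easy to run for your own traffic. Here is a minimal sketch; the price per million input tokens, the call volume, and the 8,000-token "selected" window are all placeholder assumptions, not quotes from any provider or measurements from any system.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-token cost of one call at a flat per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


# Assumed, illustrative numbers: substitute your provider's price and your traffic.
PRICE_PER_MILLION = 3.00   # USD per million input tokens (hypothetical)
CALLS_PER_DAY = 10_000     # hypothetical request volume

packed = input_cost_usd(1_000_000, PRICE_PER_MILLION)  # full window on every call
selected = input_cost_usd(8_000, PRICE_PER_MILLION)    # a handful of retrieved passages

print(f"packed window:   ${packed:.2f}/call, ${packed * CALLS_PER_DAY:,.0f}/day")
print(f"selected window: ${selected:.3f}/call, ${selected * CALLS_PER_DAY:,.0f}/day")
```

The absolute figures move with whatever price you plug in. The ratio between the two lines is just the ratio of window sizes, and that ratio is the part that lands on the invoice.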

Attention. “Effective context” still lags advertised context. Needle-in-haystack benchmarks are the easy case (the needle is verbatim); position effects and distractor sensitivity persist on realistic tasks. Irrelevant-but-plausible text is not neutral filler; it is interference. More context is not monotonically more signal. Past the relevant set, it is noise you paid for.

Figure: The operating question. Latency and cost rise with context size while marginal answer quality flattens once the relevant evidence is in. The right operating point is a budget decision under your SLO, not a maximum. (Illustrative shape.)
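To make the operating point concrete, here is a toy sketch of the decision. Everything in it is an assumption for illustration: the linear prefill model, the throughput and overhead constants, and the saturating quality curve are stand-ins for numbers you would have to measure on your own stack and task.

```python
import math


def ttft_seconds(tokens, prefill_tok_per_s=40_000, overhead_s=0.3):
    """Toy time-to-first-token: fixed overhead plus linear prefill (assumed model)."""
    return overhead_s + tokens / prefill_tok_per_s


def quality(tokens, relevant_tokens=6_000):
    """Toy saturating curve: quality flattens once the relevant evidence is in."""
    return 1 - math.exp(-tokens / relevant_tokens)


def pick_context_size(candidates, slo_s, tolerance=0.01):
    """Smallest context within `tolerance` of the best quality reachable under the SLO."""
    feasible = [n for n in candidates if ttft_seconds(n) <= slo_s]
    if not feasible:
        return None  # no candidate meets the latency budget
    best = max(quality(n) for n in feasible)
    return min(n for n in feasible if quality(n) >= best - tolerance)


sizes = [2_000, 8_000, 32_000, 128_000, 1_000_000]
print(pick_context_size(sizes, slo_s=5.0))
```

Under these made-up curves the million-token candidate is ruled out by the 5-second budget, and the 128K candidate is ruled out by being no better than 32K. The shape of that argument is the point; the constants are not.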

What long context actually killed

Here is the concession, because the “RAG is dead” crowd is half right. Long context killed bad retrieval. The 2023 pipeline (fixed-size chunks, one embedding model, top-5 cosine similarity, pray) existed to squeeze evidence through a 4K window. That constraint is gone, and with it every argument for retrieval that was really just window management. If your retrieval’s job was rationing, it is unemployed.

What survives is retrieval’s real job: selection under a budget. Given this query, this SLO, this price, which evidence earns its place in the window? Sometimes the answer is five passages. Sometimes it is two hundred pages, because the task is genuinely a whole-document synthesis and the user will wait. The point is that it is a decision, per query, and making it well is worth real money: in SAGE we found nearly half of production-shaped queries were fully served by five passages or fewer, and deciding when to go deep held a 95% latency SLO at 49% of the fixed pipeline’s cost. The million-token window makes that decision space wider, and a wider decision space raises the value of deciding well. That is the opposite of dead.
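As a rough sketch of what "selection under a budget" can look like in code, here is a greedy version: take passages in relevance order until the token budget runs out. This is not the SAGE method, which is not described here. The `Passage` type, the scores, and the `min_score` cutoff are hypothetical, and the relevance scores are assumed to come from whatever retriever or reranker you already run.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    pid: str
    score: float  # relevance score from an upstream retriever/reranker (assumed given)
    tokens: int   # token count of the passage


def select_under_budget(passages, token_budget, min_score=0.0):
    """Greedy selection: best-scoring passages first, skipping any that overflow the budget."""
    chosen, used = [], 0
    for p in sorted(passages, key=lambda p: p.score, reverse=True):
        if p.score < min_score:
            break  # everything after this is below the relevance floor
        if used + p.tokens <= token_budget:
            chosen.append(p)
            used += p.tokens
    return chosen, used


candidates = [
    Passage("a", 0.9, 400),
    Passage("b", 0.8, 700),
    Passage("c", 0.5, 300),
    Passage("d", 0.1, 200),
]
picked, used = select_under_budget(candidates, token_budget=1_000, min_score=0.3)
print([p.pid for p in picked], used)
```

The per-query part of the decision lives in `token_budget` and `min_score`: a quick lookup gets a small budget, a whole-document synthesis gets a large one. Greedy is the simplest possible policy; the argument only requires that some policy is making the call.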

The synthesis position, which will sound obvious in two years: long context and retrieval are not rivals; they are a memory hierarchy. Window as RAM, corpus as disk, retrieval as the paging policy. Nobody responded to cheap RAM by declaring the file system dead; they built better caches and kept the hierarchy. The teams treating context that way (budgeted, tiered, measured at the percentile) are quietly beating both purist camps on every metric that reaches an invoice.

Retrieval is dead. Long live allocation.