Designing SLO-Aware RAG
Why production retrieval should treat latency and cost as constraints to control, not numbers to hope for, and how difficulty-adaptive routing gets there.
Most RAG systems are tuned once and applied everywhere. That’s the bug. A pipeline configured for the hardest query wastes effort on the easy ones, and a pipeline tuned for the average query misses its latency budget exactly when it matters: under load, on the tail.
The fix is to stop treating the service-level objective as something you measure after the fact and start treating it as something the system controls in real time.
The tradeoff you’re actually managing
Every retrieval decision trades quality for cost and latency. If $Q$ is answer quality, $C$ is cost, and $L$ is latency, a fixed pipeline picks one point on this surface for all traffic. What we want instead is to pick, per query $q$, the cheapest configuration that still satisfies the latency SLO:
$$c^*(q) = \arg\min_{c} \; C(q, c) \quad \text{subject to} \quad L(q, c, \text{load}) \le L_{\text{SLO}}$$
The trick is that $L$ depends on both the query and current system load, so the controller has to react, not just precompute.
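To make the selection rule concrete, here is a minimal sketch of "cheapest configuration that still fits the latency SLO." Everything in it is an assumption for illustration: the `Candidate` type, the two configuration names, their cost and latency numbers, and the additive queueing model are made up, not measurements from any real system. The candidate list stands in for the configurations already judged adequate for one particular query.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float            # hypothetical relative cost per query
    base_latency_s: float  # hypothetical latency with an empty queue


def cheapest_feasible(
    candidates: Sequence[Candidate], queue_delay_s: float, slo_s: float
) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose predicted latency fits the SLO."""
    feasible = [
        c for c in candidates
        # Assumed latency model: queueing delay adds to the unloaded latency.
        if c.base_latency_s + queue_delay_s <= slo_s
    ]
    # None signals that no candidate fits; the caller needs a fallback policy.
    return min(feasible, key=lambda c: c.cost) if feasible else None


# Made-up options for one query: cost and latency are not perfectly correlated.
options = [
    Candidate("deep-small", cost=2.0, base_latency_s=3.5),     # many passages, small model
    Candidate("shallow-large", cost=5.0, base_latency_s=2.0),  # few passages, large model
]

light = cheapest_feasible(options, queue_delay_s=0.5, slo_s=5.0)
heavy = cheapest_feasible(options, queue_delay_s=2.0, slo_s=5.0)
print(light.name, heavy.name)
```

With these numbers the same query gets a different configuration once queueing delay eats into the budget, which is the reason the choice cannot be precomputed offline.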
Difficulty is predictable enough
You don’t need a perfect difficulty oracle. A cheap predictor that sorts queries into a few bands is enough to route most of the traffic correctly:
def choose_config(query, load):
    difficulty = estimate_difficulty(query)   # cheap, runs before retrieval
    budget = latency_slo - queue_delay(load)  # shrink budget as load rises
    if difficulty < EASY and budget > TIGHT:
        return Config(k=3, rerank=False, model="small")
    if difficulty < HARD:
        return Config(k=8, rerank=True, model="mid")
    return Config(k=20, rerank=True, model="large")
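For readers who want to run the routing rule, here is one possible self-contained version. It is a sketch under stated assumptions: the `Config` dataclass, the threshold values, and the 0-to-1 difficulty scale are hypothetical placeholders, and the function takes the difficulty score and queue delay directly instead of computing them, so that it can be tested in isolation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    k: int        # number of passages to retrieve
    rerank: bool  # whether to run the reranker
    model: str    # generator size label


# Hypothetical values; in practice these would be fit to logged traffic.
EASY, HARD = 0.3, 0.7  # difficulty band edges on an assumed 0..1 scale
TIGHT = 1.0            # seconds of remaining budget considered tight
LATENCY_SLO = 5.0      # seconds, illustrative


def choose_config(difficulty: float, queue_delay_s: float) -> Config:
    # Remaining budget shrinks as queueing delay grows.
    budget = LATENCY_SLO - queue_delay_s
    if difficulty < EASY and budget > TIGHT:
        return Config(k=3, rerank=False, model="small")
    if difficulty < HARD:
        return Config(k=8, rerank=True, model="mid")
    return Config(k=20, rerank=True, model="large")


for difficulty in (0.1, 0.5, 0.9):
    print(difficulty, choose_config(difficulty, queue_delay_s=0.5))
```

One property of the rule as written is worth checking against your own intent: an easy query that arrives when the remaining budget is at or below `TIGHT` falls through to the mid configuration. A production policy might prefer to degrade toward the cheapest configuration in that case.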
The difficulty signal does not require an extra model call. Score distributions, rank gaps, and sparse-dense agreement from a shallow probe retrieval carry most of the information, and they cost milliseconds.
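As a rough illustration of how those three signals might be computed from a shallow probe, consider the sketch below. The function names, the use of population standard deviation as the "score distribution" summary, and Jaccard overlap as the agreement measure are my assumptions for the example; other summaries would work equally well, and the resulting features would still need a thresholded or learned predictor on top.

```python
from statistics import pstdev
from typing import Dict, Sequence


def rank_gap(scores: Sequence[float]) -> float:
    """Gap between the two highest scores; a large gap suggests a clear winner."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[1] if len(ordered) > 1 else 0.0


def score_spread(scores: Sequence[float]) -> float:
    """Population standard deviation of the probe scores."""
    return pstdev(scores) if len(scores) > 1 else 0.0


def agreement(sparse_ids: Sequence[str], dense_ids: Sequence[str]) -> float:
    """Jaccard overlap between the sparse and dense top-k document IDs."""
    a, b = set(sparse_ids), set(dense_ids)
    return len(a & b) / len(a | b) if a | b else 0.0


def probe_signals(
    dense_scores: Sequence[float],
    sparse_ids: Sequence[str],
    dense_ids: Sequence[str],
) -> Dict[str, float]:
    """Bundle the cheap signals into a feature dict for a difficulty predictor."""
    return {
        "rank_gap": rank_gap(dense_scores),
        "score_spread": score_spread(dense_scores),
        "agreement": agreement(sparse_ids, dense_ids),
    }


# Made-up probe results for a single query.
features = probe_signals(
    dense_scores=[0.9, 0.4, 0.3],
    sparse_ids=["d1", "d2", "d3"],
    dense_ids=["d2", "d3", "d4"],
)
print(features)
```

All three functions are plain arithmetic over results the probe has already returned, so the features add no model call on top of the probe itself.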
What it buys you
In SAGE, the system this line of thinking became, a learned version of this policy held a 5-second P95 SLO on 95% of Natural Questions traffic where the best static configuration managed 30%, while cutting retrieval cost by 51% and P95 latency by 36%, at a cost of two points of exact match. Not by being smarter about any single query, but by refusing to pay uniform cost for non-uniform work. Nearly half the queries were served with five passages or fewer; the fifth that genuinely needed twenty still got twenty.
One result from that work surprised me enough to change how I design these systems: the policy transfers. Train the router on one dataset with one model, move it to multi-hop questions, temporal questions, and three other model families, and it keeps working with no retraining. Difficulty, it turns out, lives mostly in the retrieval signals, not in the generator. That is an argument for keeping control logic out of the model and in the system, where it survives model swaps.
The general principle
RAG is the special case. The general claim is that every stage of an LLM pipeline has a knob that trades effort for quality, and production systems should set those knobs per request, under an explicit budget, with load in the loop. Fixed configurations are a bet that your traffic is uniform. Your traffic is not uniform.
If you’re building retrieval this year, instrument difficulty first. The distribution of your own traffic is the strongest argument you will find for or against adaptivity, and it costs a day of logging to get.
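A first pass at that instrumentation can be very small. The sketch below assumes you already log one difficulty score per query on a 0-to-1 scale; the band edges and the sample scores are hypothetical and only show the shape of the summary.

```python
from collections import Counter
from typing import Dict, Iterable

BANDS = ("easy", "medium", "hard")


def band(difficulty: float, easy: float = 0.3, hard: float = 0.7) -> str:
    """Map a difficulty score to a band; the edges are illustrative defaults."""
    if difficulty < easy:
        return "easy"
    if difficulty < hard:
        return "medium"
    return "hard"


def band_shares(
    difficulties: Iterable[float], easy: float = 0.3, hard: float = 0.7
) -> Dict[str, float]:
    """Fraction of logged queries that fall into each difficulty band."""
    counts = Counter(band(d, easy, hard) for d in difficulties)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts[name] / total for name in BANDS}


# Made-up scores standing in for a day of logged traffic.
logged = [0.1, 0.2, 0.5, 0.9]
print(band_shares(logged))
```

If nearly all of your traffic lands in one band, a static configuration tuned for that band may be good enough; a wide spread is the case where per-query routing has room to pay off.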