Designing SLO-Aware RAG

Most RAG systems are tuned once and applied everywhere. That’s the bug. A pipeline configured for the hardest query wastes effort on the easy ones, and a pipeline tuned for the average query misses its latency budget exactly when it matters: under load, on the tail.

The fix is to stop treating the service-level objective as something you measure after the fact and start treating it as something the system controls in real time.

The tradeoff you’re actually managing

Every retrieval decision trades quality for cost and latency. If $q$ is answer quality, $c$ is cost, and $\ell$ is latency, a fixed pipeline picks one point on this surface for all traffic. What we want instead is to pick, per query $x$, the cheapest configuration $\theta$ that still satisfies the latency SLO:

$$\theta^\star(x) \;=\; \arg\min_{\theta}\; c(x, \theta) \quad \text{subject to} \quad \ell(x, \theta) \le \ell_{\text{SLO}}.$$

The trick is that $\ell(x,\theta)$ depends on both the query and current system load, so the controller has to react, not just precompute.
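
To make the objective concrete, here is a minimal sketch of the selection step over a small, discrete set of configurations. Everything named here is an assumption for illustration: the `Candidate` fields, the idea that predicted latency is service time plus queueing delay, and the premise that the candidate list has already been narrowed to configurations expected to answer this query adequately. Without that last step, the argmin would always pick the cheapest option.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One configuration theta with its predicted cost and service time (hypothetical)."""
    name: str
    cost: float          # c(x, theta), in arbitrary cost units
    base_latency: float  # predicted service time in seconds, excluding queueing


def cheapest_feasible(
    candidates: Sequence[Candidate],
    latency_slo: float,
    queue_delay: float,
) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose predicted latency fits the SLO.

    Predicted latency is modeled as base_latency + queue_delay. That additive
    model is an assumption of this sketch. Returns None when nothing fits.
    """
    feasible = [
        c for c in candidates
        if c.base_latency + queue_delay <= latency_slo
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.cost)
```

Because `queue_delay` is an input, the same query can resolve to different configurations at different moments, which is the sense in which the controller reacts rather than precomputes. What to do when the function returns `None` (degrade, shed, or accept the miss) is a policy decision this sketch leaves open.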

Difficulty is predictable enough

You don’t need a perfect difficulty oracle. A cheap predictor that sorts queries into a few bands is enough to route most of the traffic correctly:

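# Sketch: estimate_difficulty, queue_delay, Config, latency_slo, and the
# EASY / HARD / TIGHT thresholds are assumed to be defined elsewhere.
def choose_config(query, load):
    difficulty = estimate_difficulty(query)      # cheap, runs before retrieval
    budget = latency_slo - queue_delay(load)     # shrink budget as load rises

    # Easy queries, and every query once the budget is tight, take the cheapest path.
    if difficulty < EASY or budget <= TIGHT:
        return Config(k=3, rerank=False, model="small")
    if difficulty < HARD:
        return Config(k=8, rerank=True, model="mid")
    return Config(k=20, rerank=True, model="large")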

The difficulty signal does not require an extra model call. Score distributions, rank gaps, and sparse-dense agreement from a shallow probe retrieval carry most of the information, and they cost milliseconds.
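
As a rough illustration of how cheap those signals are to compute, the sketch below derives three features from a shallow probe. The function name, the input shapes, and the specific choices (top-1 minus top-2 for the rank gap, population standard deviation for the score spread, Jaccard overlap for sparse-dense agreement) are assumptions made for this example, not a description of any particular system.

```python
from statistics import pstdev
from typing import Dict, Sequence


def probe_signals(
    dense_scores: Sequence[float],
    dense_ids: Sequence[str],
    sparse_ids: Sequence[str],
) -> Dict[str, float]:
    """Compute three cheap features from a shallow probe retrieval.

    dense_scores must be sorted in descending order and aligned with dense_ids.
    sparse_ids holds the document IDs returned by the sparse retriever.
    """
    if len(dense_scores) < 2:
        raise ValueError("need at least two probe results")

    # Gap between the best and second-best hit: a large gap suggests a clear winner.
    rank_gap = dense_scores[0] - dense_scores[1]

    # Spread of the probe scores: a flat distribution suggests no passage stands out.
    score_spread = pstdev(dense_scores)

    # Jaccard overlap of the two retrievers' result sets.
    dense_set, sparse_set = set(dense_ids), set(sparse_ids)
    union = dense_set | sparse_set
    agreement = len(dense_set & sparse_set) / len(union) if union else 0.0

    return {
        "rank_gap": rank_gap,
        "score_spread": score_spread,
        "agreement": agreement,
    }
```

How these features map to a difficulty score is deliberately left out. That mapping could be a hand-set threshold or a small learned model, and either way it should be fit on your own traffic.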

What it buys you

Figure: Fixed high-effort retrieval vs. difficulty-adaptive routing as load rises. The fixed pipeline crosses its latency SLO under load; the adaptive one sheds effort and holds. Illustrative shape, not measured data.

In SAGE, the system this line of thinking became, a learned version of this policy held a 5-second P95 SLO on 95% of Natural Questions traffic where the best static configuration managed 30%, while cutting retrieval cost by 51% and P95 latency by 36%, at a cost of two points of exact match. Not by being smarter about any single query, but by refusing to pay uniform cost for non-uniform work. Nearly half the queries were served with five passages or fewer; the fifth that genuinely needed twenty still got twenty.

One result from that work surprised me enough to change how I design these systems: the policy transfers. Train the router on one dataset with one model, move it to multi-hop questions, temporal questions, and three other model families, and it keeps working with no retraining. Difficulty, it turns out, lives mostly in the retrieval signals, not in the generator. That is an argument for keeping control logic out of the model and in the system, where it survives model swaps.

The general principle

RAG is the special case. The general claim is that every stage of an LLM pipeline has a knob that trades effort for quality, and production systems should set those knobs per request, under an explicit budget, with load in the loop. Fixed configurations are a bet that your traffic is uniform. Your traffic is not uniform.

If you’re building retrieval this year, instrument difficulty first. The distribution of your own traffic is the strongest argument you will find for or against adaptivity, and it costs a day of logging to get.
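
Once a difficulty score is being logged per request, summarizing it takes only a few lines. The sketch below assumes scores in the range 0 to 1 and two cut points; the thresholds, the band labels, and the function name are all placeholders to adapt, not recommended values.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, Iterable, Sequence


def band_shares(
    scores: Iterable[float],
    thresholds: Sequence[float] = (0.3, 0.7),
    labels: Sequence[str] = ("easy", "medium", "hard"),
) -> Dict[str, float]:
    """Fraction of logged queries falling in each difficulty band.

    thresholds must be ascending. A score equal to a threshold is placed in
    the higher band, so "easy" means strictly below the first threshold.
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")

    # bisect_right returns the index of the band each score belongs to.
    counts = Counter(labels[bisect_right(thresholds, s)] for s in scores)
    total = sum(counts.values())
    return {
        label: (counts[label] / total if total else 0.0)
        for label in labels
    }
```

If almost everything lands in one band, a single well-chosen fixed configuration may be enough. If the shares are spread out, that spread is the room adaptive routing has to work with.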