SAGE: SLO-Aware Adaptive Retrieval for Production RAG Systems

Muhammad Faizan Raza, Shuo (Luna) Yang, Satish Mahadevan Srinivasan

IEEE International Conference on Control, Decision and Information Technologies (CoDIT), Bari, Italy · 2026 · Accepted
95% SLO compliance (vs 30% static) · −51% retrieval cost · −36% P95 latency

Abstract

Retrieval-augmented generation (RAG) systems in production operate under strict service level objectives (SLOs) on tail latency and infrastructure cost. However, standard retrieval pipelines rely on fixed retrieval budgets that ignore query difficulty, over-retrieving for easy queries and under-serving hard ones, forcing operators to trade answer quality against SLO compliance. SAGE is a learned SLO-aware adaptive retrieval policy that dynamically selects the number of passages k per query. It uses lightweight features derived from initial retrieval (score distributions, rank gaps, lexical signals) and is trained offline via imitation learning from an oracle that approximates optimal latency–quality trade-offs. At inference, it adds no LLM calls and minimal overhead. On Natural Questions, under a 5 s P95 latency SLO, SAGE achieves 95% SLO compliance versus 30% for the best static baseline (k=20), reduces P95 latency by 36% and retrieval cost by 51% with only 2 percentage points Exact Match loss. A single policy trained on Natural Questions generalizes across HotpotQA, UnSeenTimeQA, and four LLM families (Llama, Qwen, Mistral, Gemma), consistently yielding +45–52 point SLO improvements without quality degradation.

Problem

Most deployed RAG pipelines retrieve a globally tuned, fixed number of passages k for every query. That single choice is misaligned with real traffic: easy factoid questions can be answered from a handful of passages, while hard multi-hop or temporal queries genuinely need deeper retrieval. Set k low and hard queries fail. Set it high enough to protect answer quality and the extra retrieval and re-ranking work drives up tail latency and cost. In our measurements, a static k=20 baseline met a 5 s P95 SLO on only 30% of queries.

SAGE reframes retrieval as a per-query resource allocation decision under explicit latency and cost constraints: maximize answer quality subject to a target P95 SLO and a retrieval budget.
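One way to write this objective down is sketched below. The notation is ours and is not taken from the paper: π is the policy that maps a query q to a budget k, Q is answer quality, T is end-to-end latency, τ is the latency target (5 s in the experiments reported here), and B is a cap on the average budget. Reading "retrieval budget" as a cap on the mean k is an assumption.

```latex
\max_{\pi} \; \mathbb{E}_{q}\big[\, Q(q, \pi(q)) \,\big]
\quad \text{s.t.} \quad
\mathrm{P95}_{q}\big[\, T(q, \pi(q)) \,\big] \le \tau,
\qquad
\mathbb{E}_{q}\big[\, \pi(q) \,\big] \le B
```

A static pipeline is the special case where π(q) is the same constant for every query, which is why it cannot trade budget between easy and hard queries.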

Approach

SAGE is a lightweight policy that predicts an appropriate retrieval budget k for each query before generation:

  • Hybrid retrieval substrate. Queries hit sparse (BM25) and dense (BGE-M3) backends, and the two result lists are fused with Reciprocal Rank Fusion (RRF). This is a strong, standard production stack.
  • Lightweight features only. From a cheap k=2 probe (~300 ms, amortized into the retrieval path), SAGE extracts score statistics, rank gaps, and sparse–dense agreement signals. No LLM calls, no chain-of-thought, no multi-step protocols.
  • Imitation learning from an offline oracle. For each training query, we sweep all budgets k ∈ {2 … 30}, record latency and quality, and label the smallest k that satisfies the target P95 SLO at the highest quality. A RandomForest policy (inference < 1 ms) is trained to imitate these oracle decisions.
  • Calibration. A temperature parameter on the policy logits is tuned on validation data to maximize quality subject to the desired SLO compliance, so the deployed policy actually meets its target.
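The oracle labeling step in the list above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: `SweepPoint` and `oracle_label` are hypothetical names, the P95 target is applied as a per-query latency threshold, and "smallest k at the highest quality" is read as the cheapest budget that ties for the best quality among budgets meeting that threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepPoint:
    """One measurement from the offline sweep for a single training query."""
    k: int            # retrieval budget that was tried
    latency_s: float  # measured end-to-end latency at this budget
    quality: float    # answer quality at this budget (e.g. 1.0 for an exact match)


def oracle_label(sweep: list[SweepPoint], slo_s: float) -> int:
    """Return the smallest k that reaches the best quality among SLO-feasible budgets."""
    feasible = [p for p in sweep if p.latency_s <= slo_s]
    if not feasible:
        # Assumption: when no budget meets the target, fall back to the cheapest one.
        return min(p.k for p in sweep)
    best_quality = max(p.quality for p in feasible)
    return min(p.k for p in feasible if p.quality == best_quality)


if __name__ == "__main__":
    # Made-up sweep for one query; these numbers are illustrative only.
    sweep = [
        SweepPoint(k=2, latency_s=1.9, quality=0.0),
        SweepPoint(k=5, latency_s=2.8, quality=1.0),
        SweepPoint(k=10, latency_s=4.1, quality=1.0),
        SweepPoint(k=20, latency_s=5.6, quality=1.0),
    ]
    print(oracle_label(sweep, slo_s=5.0))
```

The resulting (features, label) pairs are what a classifier such as the RandomForest policy would then be fit on.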

Because the policy touches only retrieval-side signals, it deploys as a stateless service next to the retriever and can be updated or rolled back independently of model weights or prompts.
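To make "retrieval-side signals" concrete, the sketch below fuses a sparse and a dense ranking with Reciprocal Rank Fusion and derives a few probe features from the result. It is illustrative only: `rrf_fuse` and `probe_features` are hypothetical names, the smoothing constant 60 is the value commonly used for RRF rather than one stated here, and the specific features (mean, spread, top-rank gap, Jaccard agreement) are our guess at what "score statistics, rank gaps, and sparse–dense agreement" could look like.

```python
from statistics import mean, pstdev


def rrf_fuse(rankings: list[list[str]], c: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: each list contributes 1 / (c + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def probe_features(sparse_ids: list[str], dense_ids: list[str]) -> dict[str, float]:
    """Cheap features computed from a small probe retrieval (assumed feature set)."""
    fused = rrf_fuse([sparse_ids, dense_ids])
    fused_scores = [score for _, score in fused]
    union = set(sparse_ids) | set(dense_ids)
    overlap = set(sparse_ids) & set(dense_ids)
    return {
        "score_mean": mean(fused_scores) if fused_scores else 0.0,
        "score_std": pstdev(fused_scores) if fused_scores else 0.0,
        # Gap between the top two fused scores: a large gap suggests a clear winner.
        "top_gap": fused_scores[0] - fused_scores[1] if len(fused_scores) > 1 else 0.0,
        # Jaccard overlap of the two backends' results.
        "agreement": len(overlap) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical document ids returned by a k=2 probe of each backend.
    print(probe_features(sparse_ids=["a", "b"], dense_ids=["b", "c"]))
```

Everything here is a pure function of the retriever's output, which is what lets such a policy run as a stateless service with no dependence on the generator.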

Results

On Natural Questions under a 5 s P95 SLO (334 test queries, Llama-3.1-8B via vLLM):

Configuration       SLO comp.   P95 lat.   EM     Avg k   Cost
Static k=2          95%         2.1 s      11%    2.0     10%
Static k=10         45%         4.8 s      22%    10.0    50%
Static k=20         30%         5.6 s      24%    20.0    100%
SAGE (dynamic k)    95%         3.6 s      22%    9.8     49%

SAGE moves the operating point into the high-SLO, non-trivial-EM regime that no static choice can reach: 95% SLO compliance (vs 30%), −36% P95 latency (5.6 s → 3.6 s), and −51% retrieval cost, giving up only 2 EM points. It does this by matching effort to difficulty: 45% of queries are served with k ≤ 5, while the 20% that genuinely need k=20 still get it.
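The headline figures follow directly from the reported rows, and the short check below recomputes them. The values are copied from the results for Natural Questions; the helper names `percent_change` and `matches_or_beats` are ours, and treating static k=20 as the reference for the relative changes is our reading of the comparison.

```python
# Reported rows: (SLO compliance %, P95 latency in s, EM %, cost %).
RESULTS = {
    "Static k=2": (95, 2.1, 11, 10),
    "Static k=10": (45, 4.8, 22, 50),
    "Static k=20": (30, 5.6, 24, 100),
    "SAGE (dynamic k)": (95, 3.6, 22, 49),
}


def percent_change(baseline: float, value: float) -> float:
    """Relative change of value with respect to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


def matches_or_beats(candidate: tuple, reference: tuple) -> bool:
    """True if candidate is at least as good as reference on both SLO compliance and EM."""
    return candidate[0] >= reference[0] and candidate[2] >= reference[2]


sage = RESULTS["SAGE (dynamic k)"]
static_k20 = RESULTS["Static k=20"]

latency_change = percent_change(static_k20[1], sage[1])
cost_change = percent_change(static_k20[3], sage[3])
em_points_lost = static_k20[2] - sage[2]

# Static configurations that match SAGE on SLO compliance and EM at the same time.
static_rivals = [
    name
    for name, row in RESULTS.items()
    if name.startswith("Static") and matches_or_beats(row, sage)
]

if __name__ == "__main__":
    print(f"P95 latency change vs static k=20: {latency_change:.0f}%")
    print(f"Retrieval cost change vs static k=20: {cost_change:.0f}%")
    print(f"EM points given up: {em_points_lost}")
    print(f"Static configs matching SAGE on SLO and EM: {static_rivals}")
```

The empty `static_rivals` list restates the main point: static k=2 matches the SLO compliance but not the EM, and static k=10 matches the EM but not the SLO compliance.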

Generalization. The single policy trained on NQ transfers unchanged to HotpotQA (multi-hop) and UnSeenTimeQA (temporal), improving SLO compliance by +46–52 points, and to Qwen2.5-7B, Mistral-7B, and Gemma-2-9B with +49–51 point gains and no EM loss. The transfer is zero-shot because the policy operates purely on retrieval-side signals.

Production impact. For a 10M-queries/day deployment, SAGE’s halved average budget translates to roughly $132K/year in retrieval-serving savings, on top of serving-layer optimizations like PagedAttention or batching, not instead of them.

RAG · SLO-aware serving · adaptive retrieval · tail latency · imitation learning · LLM systems
BibTeX
@inproceedings{raza2026sage,
  title     = {SAGE: SLO-Aware Adaptive Retrieval for Production RAG Systems},
  author    = {Raza, Muhammad Faizan and Yang, Shuo (Luna) and Srinivasan, Satish Mahadevan},
  booktitle = {IEEE International Conference on Control, Decision and Information Technologies (CoDIT)},
  address   = {Bari, Italy},
  month     = jul,
  year      = {2026}
}