CAFO
Convergence-Aware Feedback Orchestration: Self-Correcting LLM Pipelines as Closed-Loop Control Systems
Abstract
LLM self-correction methods like Self-Refine, Reflexion, and CRITIC revise individual answers; CAFO instead fixes the system that produced them. Borrowing from industrial process control, CAFO closes a feedback loop around a production RAG pipeline: six quality evaluators score every answer, a 30-dimensional failure fingerprint (built from features such as severity, percentile, and trajectory) groups failures into "error neighborhoods" sharing the same structural fault, a LightGBM selector picks one of five actions (prompt patch, retrieval patch, adapter activation, guardrail patch, or no action), and a sequential A/B gate rejects regressions before commit. NDCE (Neighborhood-Distilled Correction Exemplars) makes corrections travel: a fix that succeeds for one query becomes an in-context exemplar for every other query in its error neighborhood. Under controlled fault injection (9 regimens × 9 fault types × 8 iterations × 2 seeds, on Natural Questions with Qwen3-8B), CAFO+NDCE reaches a cumulative correction effect (CCE) of 4.55, which is 2.6× that of a per-query oracle and 4.6× that of CAFO alone. It beats the oracle on 66% of queries, and Cohen's d exceeds 3.0 against every baseline.
Fix the system, not the answer
Answer-level self-correction (Self-Refine, Reflexion, CRITIC) edits the finished product. Industrial process control intervenes on the production line instead, and that’s the framing CAFO brings to LLM pipelines. When a RAG system produces a bad answer, the cause is usually structural: poisoned context, a mis-tuned retriever, a stale adapter, a missing guardrail. CAFO senses those faults and repairs the pipeline itself.
The control loop
Each iteration runs five stages:
- Sense. Six quality evaluators score every answer on [0, 1]: hallucination, faithfulness, freshness, toxicity, LLM-grounding (G-Eval), and BERTScore.
- Fingerprint. Each evaluator contributes five features (including severity, percentile, and trajectory), so the six evaluators together form a 6 × 5 = 30-dimensional failure signature.
- Cluster. Failures group into error neighborhoods that share the same underlying structural fault.
- Select action. A LightGBM selector picks one of five structural fixes: prompt patch, retrieval patch, adapter activation, guardrail patch, or no action.
- Apply & gate. A sequential A/B test rejects regressions before any change commits.
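The Fingerprint stage above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: scores are assumed to be normalized so that higher means better quality, the failure threshold of 0.5 is hypothetical, and only severity, percentile, and trajectory are named in the text; the fourth and fifth features (`volatility`, `failed`) are placeholders chosen to fill out five features per evaluator.

```python
import numpy as np

# Evaluator names follow the Sense stage; identifiers are illustrative.
EVALUATORS = ["hallucination", "faithfulness", "freshness",
              "toxicity", "llm_grounding", "bertscore"]


def fingerprint(current, history, threshold=0.5):
    """Build a 30-dim failure signature (6 evaluators x 5 features).

    current: dict evaluator -> score in [0, 1] for the answer being judged.
    history: dict evaluator -> list of earlier scores, oldest first.
    threshold: hypothetical pass/fail cut-off (assumption, not from the text).
    """
    feats = []
    for name in EVALUATORS:
        score = float(current[name])
        past = np.asarray(history[name], dtype=float)

        # Severity: how far the score falls below the threshold (0 if passing).
        severity = max(0.0, threshold - score)
        # Percentile: fraction of past scores at or below the current one.
        percentile = float(np.mean(past <= score)) if past.size else 0.5
        # Trajectory: slope of a least-squares line through the past scores.
        trajectory = (float(np.polyfit(np.arange(past.size), past, 1)[0])
                      if past.size >= 2 else 0.0)
        # Placeholder features (assumptions): spread of past scores, fail flag.
        volatility = float(past.std()) if past.size else 0.0
        failed = float(score < threshold)

        feats.extend([severity, percentile, trajectory, volatility, failed])
    return np.array(feats)


# Toy usage: every evaluator has a declining history and a low current score.
toy_history = {name: [0.9, 0.8, 0.7] for name in EVALUATORS}
toy_current = {name: 0.2 for name in EVALUATORS}
signature = fingerprint(toy_current, toy_history)
```

A vector of this shape is what the Cluster stage would group into error neighborhoods and what the selector would take as input.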
NDCE: the cross-query multiplier
Neighborhood-Distilled Correction Exemplars are what make corrections compound. When a structural fix succeeds for one query, the triple (failure signature, action, observed gain) becomes an in-context exemplar for every other query in the same neighborhood, priming the selector toward the right action on its first iteration and shortening the A/B trial. Where Self-Refine is memoryless and Reflexion remembers only within one query, NDCE transfers corrections across queries that share a structural fault. The result: CAFO+NDCE delivers 4.6× the cumulative correction effect of CAFO alone; the multiplier comes from cross-query transfer, not better per-query selection.
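One way to picture the exemplar mechanism is a store keyed by error neighborhood. The sketch below is illustrative only: the class and method names (`ExemplarStore`, `record`, `exemplars_for`, `suggested_action`) are hypothetical, and ranking exemplars by observed gain is an assumption rather than something the text specifies.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Exemplar:
    signature: Tuple[float, ...]  # failure signature of the corrected query
    action: str                   # structural action that was applied
    gain: float                   # observed quality gain after the fix


class ExemplarStore:
    """Keeps successful corrections per error neighborhood (hypothetical API)."""

    def __init__(self) -> None:
        self._by_neighborhood: Dict[int, List[Exemplar]] = defaultdict(list)

    def record(self, neighborhood_id: int, signature: Sequence[float],
               action: str, gain: float) -> None:
        # Only fixes that actually helped become exemplars.
        if gain > 0:
            self._by_neighborhood[neighborhood_id].append(
                Exemplar(tuple(signature), action, gain))

    def exemplars_for(self, neighborhood_id: int, k: int = 3) -> List[Exemplar]:
        # Top-k exemplars by gain, shared by every query in the neighborhood.
        pool = self._by_neighborhood.get(neighborhood_id, [])
        return sorted(pool, key=lambda e: e.gain, reverse=True)[:k]

    def suggested_action(self, neighborhood_id: int) -> Optional[str]:
        # Prior for the selector: the action with the largest total gain so far.
        totals: Dict[str, float] = defaultdict(float)
        for e in self._by_neighborhood.get(neighborhood_id, []):
            totals[e.action] += e.gain
        return max(totals, key=totals.get) if totals else None


store = ExemplarStore()
store.record(7, [0.3, 0.0], "guardrail_patch", 0.4)
store.record(7, [0.2, 0.1], "retrieval_patch", 0.1)
store.record(7, [0.2, 0.1], "prompt_patch", -0.2)  # regression: not stored
```

A new query landing in neighborhood 7 would see the guardrail exemplar first, which is the cross-query transfer the paragraph above describes.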
Evaluation
Controlled fault injection on a production-grade RAG pipeline: 9 feedback regimens × 9 fault types × 8 iterations × 2 seeds, on 100 Natural Questions queries over Wikipedia evidence with Qwen3-8B served via vLLM on a single A100 (80 GB). Fault types span six base faults (retrieval, prompt, context, adapter, guardrail, no-fault) plus three compound ones (compositional, distribution shift, intermittent). Baselines cover heuristic (no-feedback, random, rule-based), answer-level (Self-Refine, Reflexion, CRITIC), and structural (per-query oracle upper bound, CAFO alone) regimens.
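To make the size of the experimental grid concrete, the sketch below enumerates it. The regimen and fault names come from the description above; the identifier spellings, the seed values, and the assumption that all 100 queries are scored in every cell are illustrative choices.

```python
from itertools import product

REGIMENS = ["no_feedback", "random", "rule_based",       # heuristic
            "self_refine", "reflexion", "critic",        # answer-level
            "oracle", "cafo", "cafo_ndce"]               # structural
FAULTS = ["retrieval", "prompt", "context", "adapter", "guardrail", "no_fault",
          "compositional", "distribution_shift", "intermittent"]
SEEDS = [0, 1]        # two seeds; the actual seed values are an assumption
N_ITERATIONS = 8      # iterations run sequentially within each run
N_QUERIES = 100       # Natural Questions queries per cell (assumed constant)

# One run = one (regimen, fault, seed) combination carried through 8 iterations.
runs = list(product(REGIMENS, FAULTS, SEEDS))
cells = len(runs) * N_ITERATIONS          # 9 * 9 * 2 * 8
evaluations = cells * N_QUERIES           # scored answers under the assumption

print(len(runs), cells, evaluations)      # 162 1296 129600
```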
Results
- CCE = 4.55 (±0.86, n=200): 2.6× the per-query oracle (1.77) and 4.6× CAFO alone (0.99). Against answer-level methods, whose CCE is ≈0, the ratio is not meaningful.
- CAFO+NDCE beats the oracle on 66% of queries. This is possible because the oracle is an upper bound on per-query action selection only; cross-query exemplars are an orthogonal information channel that no single-query oracle can exploit.
- Effect sizes are decisive: Cohen’s d > 3.0 against every baseline (3.07 vs oracle, 5.29 vs answer-level methods).
- The learned policy concentrates 86% of its actions on guardrail (64%) and retrieval (22%) patches, matching the injected fault profile. Its correction curve diverges from the baselines at iteration 1, and the gap never closes.
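For readers who want to check the arithmetic behind these bullets, the sketch below recomputes the reported ratios and shows one common form of Cohen's d. The pooled-standard-deviation variant is an assumption, since the text does not say which variant was used, and the two sample lists are toy data, not the paper's results.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Ratios recomputed from the CCE values reported in the bullets above.
cce_ndce, cce_oracle, cce_cafo = 4.55, 1.77, 0.99
ratio_vs_oracle = round(cce_ndce / cce_oracle, 1)   # 2.6
ratio_vs_cafo = round(cce_ndce / cce_cafo, 1)       # 4.6

# Toy samples only: means 4 and 3, both with sample variance 4, so d = 0.5.
d_toy = cohens_d([2, 4, 6], [1, 3, 5])
```

By the usual rule of thumb, d = 0.8 already counts as a large effect, which puts the reported values above 3.0 in context.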