Staying Fresh Without Forgetting: A Pilot Study of Sparse Temporal Adapter Routing with Freshness-Aware Replay for Budgeted Continual Adaptation

Abstract

Real-world LLM deployments face day-wise knowledge drift under tight compute and memory budgets. We study the budget-constrained continual adaptation problem: acquire fresh knowledge each day while preserving legacy knowledge and keeping costs practical. STAR (Sparse Temporal Adapter Routing) maintains a bank of per-day LoRA adapters and routes a top-k convex mix of them per query; FAR (Freshness-Aware Replay) adds a bounded, weighted reservoir that prioritizes hardness, novelty, and recency. We evaluate on reproducible day-streams (Wikipedia, News, StackExchange) with a Factual Freshness Index on same-day evaluation splits and legacy retention on a fixed pre-stream holdout. Routing-based adapter selection generalizes across a 6.4× capacity increase, from 1.1B parameters (CPU-only) to 7B (single GPU), with consistent 3.2–4.1 percentage-point gains in freshness and legacy retention at negligible added cost. STAR+FAR traces a strong freshness–legacy–cost Pareto frontier and remains CPU-practical (minutes per day) and GPU-scalable (~1.7 min/day for a 7B model on an A100-40GB), outperforming LoRA-only, replay-only, and hybrid baselines on stability while matching or exceeding their freshness at comparable cost envelopes.

The stale-vs-forgetting tension

Facts change daily; organizations want their models updated today without forgetting yesterday. Simple approaches each fail one side of the tension: LoRA-only updates are reactive but unstable on legacy knowledge; replay-only training stabilizes legacy knowledge but dulls freshness and inflates cost; fixed hybrid mixes offer limited control over the conflict. STAR+FAR treats the trade-off as a three-way frontier (freshness, legacy retention, and cost) and attacks it with routing and cost-bounded replay.

Method

STAR (Sparse Temporal Adapter Routing). Each day’s update trains a LoRA adapter atop a frozen base model, building a bank of per-day adapters. At inference, a lightweight router feeds cheap text features (MiniLM / TF–IDF) into a compact 2-layer MLP and produces a k-sparse, non-negative weight vector; the model then applies a convex mix of at most k adapter deltas (typically k ∈ {1, 2, 3}). Routing is trained as a success-weighted preference problem against same-day and legacy probes, so the router learns which temporal slices actually help each query. Adapters are mixed on the fly with no re-materialization of merged weights, keeping inference overhead negligible on CPU/MPS.
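The routing and mixing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the router MLP itself is omitted and its per-adapter scores are taken as given, the softmax renormalization over the top-k scores is one possible way to obtain the top-k convex mix the abstract describes, and the names `topk_convex_weights`, `matvec`, and `mixed_delta` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]

def topk_convex_weights(scores: Sequence[float], k: int) -> List[float]:
    """Keep the k highest router scores and softmax-normalize them.

    The result has at most k non-zero entries, all non-negative and
    summing to 1. Assumes at least one adapter score is supplied.
    """
    k = max(1, min(k, len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = scores[top[0]]  # subtract the max score for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    weights = [0.0] * len(scores)
    for i, e in exps.items():
        weights[i] = e / total
    return weights

def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mixed_delta(
    x: Sequence[float],
    weights: Sequence[float],
    adapters: Sequence[Tuple[Matrix, Matrix]],
) -> List[float]:
    """Weighted sum of LoRA deltas applied to x.

    Each adapter is (A, B) with A of shape (r, d_in) and B of shape
    (d_out, r). The low-rank factors are applied in sequence, so the
    dense product B @ A is never materialized, and adapters with zero
    weight are skipped entirely.
    """
    out = [0.0] * len(adapters[0][1])
    for w, (A, B) in zip(weights, adapters):
        if w == 0.0:
            continue
        delta = matvec(B, matvec(A, x))
        out = [o + w * d for o, d in zip(out, delta)]
    return out
```

In a full model, the returned vector would be added to the frozen base layer's output for the same input; only the selected adapters contribute any compute.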

FAR (Freshness-Aware Replay). FAR maintains a bounded replay buffer as a weighted reservoir. Each candidate is scored on three signals: hardness (model error/margin), novelty (embedding distance to the buffer), and recency (age). Sampling by these scores keeps the buffer small yet informative, instead of replaying past data uniformly.
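One way to realize such a buffer is Efraimidis-Spirakis weighted reservoir sampling, sketched below. The source does not state which reservoir algorithm FAR uses or how the three signals are combined, so everything here is an assumption for illustration: the linear combination in `far_priority`, its coefficients `a`, `b`, `c`, the exponential recency decay with time constant `tau`, and the class name `WeightedReservoir` are all hypothetical.

```python
import heapq
import math
import random

def far_priority(hardness: float, novelty: float, age_days: float,
                 a: float = 1.0, b: float = 1.0, c: float = 1.0,
                 tau: float = 3.0) -> float:
    """Hypothetical priority: linear mix of hardness, novelty, and recency.

    Recency decays exponentially with age; the coefficients and tau are
    illustrative defaults, not values from the paper.
    """
    recency = math.exp(-age_days / tau)
    return a * hardness + b * novelty + c * recency

class WeightedReservoir:
    """Bounded buffer using Efraimidis-Spirakis weighted sampling (A-Res)."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self._heap = []      # min-heap of (key, insertion counter, item)
        self._count = 0      # tie-breaker so items are never compared
        self._rng = random.Random(seed)

    def offer(self, item, weight: float) -> None:
        """Consider one candidate; higher weight means more likely kept."""
        if weight <= 0.0:
            return
        u = 1.0 - self._rng.random()      # uniform in (0, 1]
        key = math.log(u) / weight        # log of u ** (1 / weight), same ordering
        self._count += 1
        entry = (key, self._count, item)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif key > self._heap[0][0]:
            # Evict the current lowest-key entry in favor of the new one.
            heapq.heapreplace(self._heap, entry)

    def items(self) -> list:
        return [item for _, _, item in self._heap]
```

Each day's candidates would be passed through `offer(example, far_priority(...))`; memory stays bounded by `capacity` no matter how long the stream runs.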

Evaluation

Day-wise streams are built with deterministic day-builders over three domains (encyclopedic, news, and StackExchange-style). Each stream has same-day evaluation splits for freshness (Factual Freshness Index: EM, normalized EM, and token-F1, measured after each day’s update) and a fixed pre-stream holdout for legacy retention. We also report cost proxies (training seconds per day, peak RAM/VRAM, buffer size). Comparisons use paired procedures with confidence intervals.
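The three answer-level metrics named above can be sketched as follows. The normalization rule (lowercasing, stripping punctuation and English articles, collapsing whitespace) follows the common SQuAD-style convention and is an assumption here; the source does not say how the Factual Freshness Index normalizes answers or aggregates the three scores, and the function names are hypothetical.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    """Strict string equality after trimming outer whitespace."""
    return float(pred.strip() == gold.strip())

def normalized_em(pred: str, gold: str) -> float:
    """Equality after normalization."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall on normalized text."""
    p = normalize(pred).split()
    g = normalize(gold).split()
    if not p or not g:
        # Both empty counts as a match; one empty counts as a miss.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Per-day freshness would then be the mean of each metric over that day's evaluation split, computed after the day's update.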

Results

Routing-based adapter selection generalizes across a 6.4× capacity increase, from 1.1B-parameter models on CPU-only budgets to 7B models on a single GPU, with consistent +3.2–4.1 pp gains in freshness and legacy retention at negligible added cost.
STAR+FAR traces a strong freshness–legacy–cost Pareto frontier, outperforming LoRA-only, replay-only, and hybrid baselines on stability while matching or exceeding their freshness at comparable cost envelopes.
Updates stay CPU-practical (minutes per day) and GPU-scalable (~1.7 min/day for a 7B model on an A100-40GB), compatible with nightly batch cycles on commodity hardware.
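
For readers unfamiliar with the term, a three-objective Pareto frontier over freshness (higher is better), legacy retention (higher is better), and cost (lower is better) can be extracted as in the sketch below. The `Point` type and the helper names are hypothetical, and any values passed in are the caller's own measurements; nothing here reproduces numbers from this study.

```python
from typing import List, NamedTuple

class Point(NamedTuple):
    name: str
    freshness: float   # higher is better
    legacy: float      # higher is better
    cost: float        # lower is better

def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every axis and better on at least one."""
    no_worse = (a.freshness >= b.freshness
                and a.legacy >= b.legacy
                and a.cost <= b.cost)
    strictly_better = (a.freshness > b.freshness
                       or a.legacy > b.legacy
                       or a.cost < b.cost)
    return no_worse and strictly_better

def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

A configuration sits on the frontier when no alternative improves one of the three axes without giving ground on another.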

This is a pilot study on 10-day streams; extended (~100-day) horizons and standalone STAR and FAR ablations are planned as future work.