Why your RAG system gets worse over time — and how to fix retrieval drift before users complain

The first 90 days, your RAG system feels accurate. By month five it’s firefighting. Four drift drivers, four detection signals, three embedding refresh strategies, and the operational practices that catch the regression in dashboards instead of customer complaints.

Admin · Founder & Engineering Lead · May 19, 2026 · 7 min read

The honeymoon period for an enterprise RAG system is real. The first 90 days, retrieval feels accurate, citations land in the right documents, users trust the answers. Then quietly, around month four or five, something shifts. Answer quality drops 5%. Citations point to slightly-wrong sections. The on-call channel gets its first "the AI is wrong about X" message. Six months in, the team is firefighting. The phenomenon has a name — retrieval drift — and it's predictable enough to engineer against.

What retrieval drift actually is

RAG systems get worse over time for four compounding reasons. Each one is invisible from the user's seat until it adds up.

Content drift: your knowledge base keeps changing — new policies, deprecated procedures, updated SOPs. The vector store snapshot was right on day one and 70% right by month six.
Embedding-corpus mismatch: not the model itself changing, but the corpus shifting such that the original embedding choices stop being optimal — new acronyms, new product names, new domain jargon that the embedding can't disambiguate.
Query distribution shift: users get more sophisticated. Month-one questions are basic. Month-five questions are "we already know X, why doesn't your AI handle the corollary Y." The system was tuned for the easy queries.
Evaluation decay: the gold-standard eval set built at launch reflects month-one usage. New question patterns aren't in it. Your eval scores stay high while real quality drops.

How to detect drift before users do

The point is to catch the regression in dashboards, not in user complaints. Four signals matter. They lag the real quality decline by a week or two, which is still early enough to act before complaints arrive and late enough to be reliable:

Continuous evaluation: keep the gold set growing. Add 5–10 questions every week sampled from real production traffic, with the correct answers tagged by a subject-matter reviewer.
Retrieval recall@k against the gold set: alert when recall@5 drops 3% or more. After citation hit rate, this is the earliest indicator.
Citation hit rate: percentage of answers where the cited chunk actually contains the asserted fact. Trends down before any other signal — measure on a sample of 100 answers per week.
Confidence score distributions: the model’s stated confidence shifts measurably when retrieval quality drops. Track the mean and the 10th percentile.
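
The recall and confidence signals above can be sketched in a few lines of Python. This is a minimal illustration, not a production monitor: `GoldQuestion`, the `retrieve` callable, and the function names are hypothetical, and the 3% threshold is read here as an absolute drop in recall (3 percentage points), which is an assumption about how the alert is meant.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldQuestion:
    question: str
    relevant_ids: set[str]  # chunk IDs a reviewer tagged as correct


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    gold: list[GoldQuestion],
    retrieve: Callable[[str], list[str]],  # hypothetical: query -> ranked chunk IDs
    k: int = 5,
) -> float:
    """Average recall@k over the whole gold set."""
    scores = [recall_at_k(retrieve(q.question), q.relevant_ids, k) for q in gold]
    return sum(scores) / len(scores) if scores else 0.0


def should_alert(baseline: float, current: float, threshold: float = 0.03) -> bool:
    """True when recall fell by `threshold` or more (assumed absolute, not relative)."""
    return (baseline - current) >= threshold


def confidence_summary(scores: list[float]) -> tuple[float, float]:
    """Return (mean, 10th percentile) of confidence scores, nearest-rank method."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(0.10 * len(ordered)))
    return sum(ordered) / len(ordered), ordered[rank - 1]
```

In use, `mean_recall_at_k` would run on a schedule against the current index, with the result compared to a stored baseline through `should_alert`. Tracking the 10th percentile alongside the mean surfaces a growing tail of low-confidence answers that the mean alone can hide.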

Embedding refresh strategies

Three strategies, in order of cost. Pick the cheapest that addresses your drift profile.

Incremental refresh: re-embed only documents that changed. Cheap; the right default for most systems. A nightly job that watches the source-of-truth content and updates embeddings for changed documents.
Full reindex: re-embed the whole corpus. Triggered by an embedding-model upgrade or by a structural reorganization of the source. Usually quarterly or semi-annual.
Model upgrade: when a new embedding model materially outperforms yours on your domain (test on your data, not on public benchmarks), plan a full reindex tied to the upgrade. Versioned indexes let you roll back if the new model regresses on something the benchmark missed.
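
One common way to implement the incremental refresh described above is to store a content hash per document at embed time and compare it on each run. The sketch below assumes documents are available as an ID-to-text mapping and that hashes from the last run are stored somewhere; `plan_incremental_refresh` and both mappings are hypothetical names, and the actual embedding and vector-store calls are left out.

```python
from __future__ import annotations

import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_refresh(
    documents: dict[str, str],      # current source of truth: doc ID -> text
    stored_hashes: dict[str, str],  # doc ID -> hash recorded at last embed
) -> tuple[list[str], list[str]]:
    """Decide what the nightly job has to touch.

    Returns (to_embed, to_delete): IDs that are new or changed, and IDs
    whose source document no longer exists.
    """
    to_embed = sorted(
        doc_id
        for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(stored_hashes) - set(documents))
    return to_embed, to_delete
```

Handling deletions matters as much as handling changes: a deprecated procedure that stays in the index keeps getting retrieved, which is content drift in its most direct form.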

Operational practices that keep drift bounded

Drift is unavoidable. Drift becoming a customer-facing problem is avoidable, if four practices are in place:

Versioned indexes: every embedding generation gets a version. Roll back instantly when an upgrade regresses.
Shadow evaluation: run new models against the existing eval set before promoting them. Compare answers on the same gold set side by side.
Quarterly review with an owner: someone owns the eval set. Expanding it is part of their job, not a "we should do this" afterthought.
Reading the queries: log and sample 50 user queries a week. Tag them. Patterns emerge in the queries before they emerge in the metrics.
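
Versioned indexes and shadow evaluation fit together naturally, and a rough sketch may make the mechanics concrete. Everything here is illustrative: `IndexRegistry`, `shadow_evaluate`, and the `score_fn` callable (which would return a gold-set score such as recall@5 for a given index version) are hypothetical, and a real registry would persist its state rather than keep it in memory.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IndexRegistry:
    """Tracks which index version serves traffic (in-memory sketch)."""
    versions: list[str] = field(default_factory=list)
    active: str | None = None
    previous: str | None = None

    def register(self, version: str) -> None:
        if version not in self.versions:
            self.versions.append(version)

    def promote(self, version: str) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown index version: {version}")
        self.previous, self.active = self.active, version

    def rollback(self) -> None:
        """Switch back to the version that served traffic before the last promote."""
        if self.previous is None:
            raise RuntimeError("no previous version to roll back to")
        self.active, self.previous = self.previous, None


def shadow_evaluate(
    registry: IndexRegistry,
    candidate: str,
    score_fn: Callable[[str], float],  # hypothetical: version -> gold-set score
    max_regression: float = 0.0,
) -> bool:
    """Promote the candidate only if it scores no worse than the active index."""
    registry.register(candidate)
    if registry.active is None:
        registry.promote(candidate)
        return True
    if score_fn(candidate) >= score_fn(registry.active) - max_regression:
        registry.promote(candidate)
        return True
    return False
```

Because promotion only swaps a pointer, rollback is equally cheap, provided the previous index version is still kept around.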

IDS AI Solutions builds the evaluation and drift-monitoring layer into every Enterprise RAG & Knowledge Systems engagement — gold-set design, recall and citation-hit dashboards, refresh strategy tuned to your content velocity. Talk to our team for a drift-readiness review of your current RAG deployment.

Frequently asked questions

How often should we re-embed our corpus?

Incremental re-embedding of changed documents runs as a daily or weekly job. Full reindex of the whole corpus is usually quarterly or semi-annual, triggered by either an embedding-model upgrade or a structural reorganization of the source. Don’t reindex on a calendar without reason — the cost compounds, and a stable embedding is fine as long as the eval set says so.

What is the earliest leading indicator of retrieval drift?

Citation hit rate — the percentage of answers where the cited chunk actually contains the asserted fact. Measured on a weekly sample of 100 answers, it trends down before any other metric and before user complaints arrive. Recall@k against the gold set is the second-earliest signal.
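
As a rough sketch of how the weekly measurement could work, the snippet below samples up to 100 answers and counts how many cited chunks support the asserted fact. The `supports` judge is the hard part and is left pluggable: in practice it would be a human reviewer or a model-based check, and the `naive_supports` substring match shown here is only a placeholder. `AnswerRecord` and all function names are hypothetical.

```python
from __future__ import annotations

import random
from typing import Callable, NamedTuple


class AnswerRecord(NamedTuple):
    claim: str        # the fact asserted in the answer
    cited_chunk: str  # text of the chunk the answer cites


def citation_hit_rate(
    records: list[AnswerRecord],
    supports: Callable[[str, str], bool],  # judge: (claim, chunk) -> supported?
    sample_size: int = 100,
    seed: int | None = None,
) -> float:
    """Share of sampled answers whose cited chunk supports the claim."""
    rng = random.Random(seed)
    if len(records) <= sample_size:
        sample = records
    else:
        sample = rng.sample(records, sample_size)
    if not sample:
        return 0.0
    hits = sum(1 for r in sample if supports(r.claim, r.cited_chunk))
    return hits / len(sample)


def naive_supports(claim: str, chunk: str) -> bool:
    """Placeholder judge: case-insensitive substring match only."""
    return claim.lower() in chunk.lower()
```

Fixing the `seed` makes a given week's sample reproducible, which helps when two reviewers need to judge the same answers.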

How big should the gold-standard eval set be?

Start with 100 carefully tagged Q&A pairs that cover your top 5 query patterns. Grow it by 5–10 questions per week sampled from real production traffic, tagged by a subject-matter reviewer. After six months you’ll have ~250–350 questions covering the real distribution. Bigger isn’t automatically better — coverage of real query types matters more than raw size.
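
The weekly growth step could look something like the following sketch, which picks candidate questions from production logs while skipping anything already in the gold set. The function names and the whitespace/case normalization are assumptions for illustration; the sampled questions still need a subject-matter reviewer to tag correct answers before they join the set.

```python
from __future__ import annotations

import random


def normalize(query: str) -> str:
    """Collapse case and whitespace so near-identical queries match."""
    return " ".join(query.lower().split())


def sample_for_review(
    production_queries: list[str],
    gold_questions: list[str],
    n: int = 10,
    seed: int | None = None,
) -> list[str]:
    """Pick up to n production queries that are not yet in the gold set."""
    known = {normalize(q) for q in gold_questions}
    seen: set[str] = set()
    candidates: list[str] = []
    for query in production_queries:
        key = normalize(query)
        if key and key not in known and key not in seen:
            seen.add(key)
            candidates.append(query)
    if len(candidates) <= n:
        return candidates
    return random.Random(seed).sample(candidates, n)
```

For scale: starting from 100 questions and adding 5 to 10 per week for 26 weeks gives roughly 230 to 360, in line with the estimate above.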