The honeymoon period for an enterprise RAG system is real. For the first 90 days, retrieval feels accurate, citations land in the right documents, and users trust the answers. Then quietly, around month four or five, something shifts. Answer quality drops 5%. Citations point to slightly wrong sections. The on-call channel gets its first "the AI is wrong about X" message. Six months in, the team is firefighting. The phenomenon has a name, retrieval drift, and it's predictable enough to engineer against.
What retrieval drift actually is
RAG systems get worse over time for four compounding reasons. Each one is invisible from the user's seat until it adds up.
- Content drift: your knowledge base keeps changing: new policies, deprecated procedures, updated SOPs. The vector store snapshot was right on day one and can be closer to 70% right by month six.
- Embedding-corpus mismatch: the embedding model has not changed, but the corpus has. New acronyms, product names, and domain jargon appear that the embedding can't disambiguate, so the original embedding choices stop being optimal.
- Query distribution shift: users get more sophisticated. Month-one questions are basic. Month-five questions are "we already know X, why doesn't your AI handle the corollary Y." The system was tuned for the easy queries.
- Evaluation decay: the gold-standard eval set built at launch reflects month-one usage. New question patterns aren't in it. Your eval scores stay high while real quality drops.
How to detect drift before users do
The point is to catch the regression in dashboards, not in user complaints. Four signals matter, and they trail real quality decline by a week or two — early enough to act, late enough to be reliable:
- Continuous evaluation: keep the gold set growing. Add 5–10 questions every week sampled from real production traffic, with the correct answers tagged by a subject-matter reviewer.
- Retrieval recall@k against the gold set: when recall@5 drops 3% or more, alert. This is usually one of the earliest indicators.
- Citation hit rate: percentage of answers where the cited chunk actually contains the asserted fact. It often trends down early, so measure it on a sample of 100 answers per week.
- Confidence score distributions: the model’s stated confidence shifts measurably when retrieval quality drops. Track the mean and the 10th percentile.
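The signals above reduce to a few small calculations. The following is a minimal sketch, not a reference implementation: the function names, the shape of the inputs (lists of chunk IDs, reviewer labels, confidence scores), and the reading of "drops 3%" as three absolute percentage points are all assumptions made for illustration.

```python
from statistics import mean, quantiles


def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the gold-relevant chunk IDs found in the top-k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("gold question has no relevant chunks")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(results, k=5):
    """results: list of (retrieved_ids, relevant_ids) pairs, one per gold question."""
    return mean(recall_at_k(retrieved, relevant, k) for retrieved, relevant in results)


def recall_alert(baseline, current, threshold=0.03):
    """True when recall fell by `threshold` or more.

    Assumption: the threshold is in absolute points (0.03 = 3 points),
    not a relative change.
    """
    return (baseline - current) >= threshold


def citation_hit_rate(labels):
    """labels: reviewer judgments, True when the cited chunk contains the asserted fact."""
    if not labels:
        raise ValueError("no labeled answers in the sample")
    return sum(labels) / len(labels)


def confidence_summary(scores):
    """Return (mean, 10th percentile) of the confidence scores (needs 2+ scores)."""
    p10 = quantiles(scores, n=10)[0]  # first of nine decile cut points
    return mean(scores), p10
```

In a weekly job, `mean_recall_at_k` would run over the current gold set, `recall_alert` would compare the result to a stored baseline, and the other two summaries would be logged next to it so the trends can be read together.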
Embedding refresh strategies
Three strategies, in order of cost. Pick the cheapest that addresses your drift profile.
- Incremental refresh: re-embed only documents that changed. Cheap; the right default for most systems. A nightly job watches the source-of-truth content and updates embeddings for changed documents.
- Full reindex: re-embed the whole corpus. Triggered by an embedding-model upgrade or by a structural reorganization of the source. Usually quarterly or semi-annually.
- Model upgrade: when a new embedding model materially outperforms yours on your domain (test on your data, not on public benchmarks), plan a full reindex tied to the upgrade. Versioned indexes let you roll back if the new model regresses on something the benchmark missed.
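One common way to implement the incremental refresh described above is to store a content hash per document and re-embed only when the hash changes. This is a sketch under assumptions: `plan_incremental_refresh`, the `current_docs` mapping (document ID to text), and the `indexed_hashes` mapping (document ID to the hash recorded at last embedding) are hypothetical names, and the actual embedding and vector-store calls are left out.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_refresh(current_docs, indexed_hashes):
    """Decide which documents need work in tonight's refresh.

    current_docs:   {doc_id: text} from the source of truth (hypothetical shape)
    indexed_hashes: {doc_id: hash} recorded when each doc was last embedded

    Returns (to_embed, to_delete) as sorted lists of document IDs.
    """
    # New documents have no stored hash; changed documents have a different one.
    to_embed = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    # Documents removed from the source should also leave the index.
    to_delete = sorted(
        doc_id for doc_id in indexed_hashes if doc_id not in current_docs
    )
    return to_embed, to_delete
```

Tracking deletions alongside changes matters for the deprecated-procedure case: a retired document that stays in the index keeps getting retrieved.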
Operational practices that keep drift bounded
Drift is unavoidable. Drift becoming a customer-facing problem is avoidable, if four practices are in place:
- Versioned indexes: every embedding generation gets a version. Roll back instantly when an upgrade regresses.
- Shadow evaluation: run new models against the existing eval before promoting. Compare answers on the same gold set side by side.
- Quarterly review with an owner: someone owns the eval set. Expanding it is part of their job, not a "we should do this" afterthought.
- Reading the queries: log and sample 50 user queries a week. Tag them. Patterns emerge in the queries before they emerge in the metrics.
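The first two practices can be combined into a simple promotion gate. The sketch below is illustrative only: `IndexRegistry`, `shadow_gate`, and `promote_if_better` are hypothetical names, the registry lives in memory, and the scores are assumed to be a single gold-set metric such as mean recall@5 computed elsewhere.

```python
class IndexRegistry:
    """Tracks index versions and which one is live (in-memory sketch)."""

    def __init__(self):
        self._versions = []  # every registered version
        self._history = []   # versions that have been live, oldest first

    def register(self, version):
        if version in self._versions:
            raise ValueError(f"version already registered: {version}")
        self._versions.append(version)

    def promote(self, version):
        if version not in self._versions:
            raise KeyError(version)
        self._history.append(version)

    @property
    def live(self):
        return self._history[-1] if self._history else None

    def rollback(self):
        """Return to the previously live version."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self.live


def shadow_gate(live_score, candidate_score, max_regression=0.0):
    """Pass when the candidate is no worse than live, within an allowed margin."""
    return candidate_score >= live_score - max_regression


def promote_if_better(registry, candidate, live_score, candidate_score):
    """Promote the candidate only if it passes the shadow evaluation gate."""
    if shadow_gate(live_score, candidate_score):
        registry.promote(candidate)
        return True
    return False
```

A single aggregate score can hide regressions on specific question types, so a real gate would likely also compare answers side by side on the same gold set, as the shadow evaluation practice describes.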
IDS AI Solutions builds the evaluation and drift-monitoring layer into every Enterprise RAG & Knowledge Systems engagement — gold-set design, recall and citation-hit dashboards, refresh strategy tuned to your content velocity. Talk to our team for a drift-readiness review of your current RAG deployment.
