AI & LLMs | November 30, 2024 | 8 min read

RAG in production: what the tutorials don't tell you

Retrieval-augmented generation looks straightforward in a Jupyter notebook. Production is different. Chunking strategies, embedding freshness, context window management, hallucination rates under real load — here's what we've encountered building RAG pipelines for actual clients.

RAG works in demos. Production is harder.

Retrieval-Augmented Generation has become the default architecture for LLM applications that need to answer questions about private or recent data. The concept is compelling: retrieve relevant documents, include them in the context, get a grounded answer. The tutorials make it look straightforward. Production deployments reveal a different story.

We've built RAG systems for client knowledge bases, legal document repositories, technical documentation, and product catalogues. The gap between tutorial performance and production performance is substantial, and the causes are predictable once you know where to look.

Chunking strategy determines retrieval quality

Every RAG tutorial splits documents into fixed-size chunks — 512 tokens, 1000 tokens, pick a number. Fixed-size chunking is the worst chunking strategy that still produces working results. It works well enough in tutorials because tutorials use clean, well-structured documents. It fails in production because real documents are not clean and not consistently structured.

A paragraph split across two chunks loses semantic coherence. A table split across chunks loses the relationship between headers and values. A legal clause split in the middle of a condition changes its meaning. These splits don't produce retrieval failures — they produce subtly wrong answers that are harder to catch than explicit failures.

We use semantic chunking by default: splitting at natural document boundaries (paragraphs, sections, list items) rather than at fixed token counts. For structured documents, we use structure-aware chunking that preserves tables, code blocks, and lists as atomic units. The implementation is more complex than fixed-size chunking. The retrieval quality improvement is significant.
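To make the idea of splitting at natural boundaries concrete, here is a minimal sketch of paragraph-level chunking. It is an illustration under stated assumptions, not the pipeline described above: paragraphs are taken to be separated by blank lines, whitespace-separated words stand in for tokens, and the function name `chunk_by_paragraph` and the `max_tokens` parameter are hypothetical.

```python
import re


def chunk_by_paragraph(text: str, max_tokens: int = 200) -> list[str]:
    """Group whole paragraphs into chunks; never split inside a paragraph."""
    # Assumption: a blank line marks a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        # Assumption: word count is a rough proxy for token count.
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph longer than `max_tokens` is kept whole as its own chunk rather than cut mid-sentence. A structure-aware version would apply the same rule to tables, code blocks, and lists, which requires a parser for the source format.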

Embeddings are not interchangeable

The choice of embedding model affects retrieval performance more than most other system parameters. General-purpose embeddings trained on web text perform well on general knowledge queries. They perform poorly on domain-specific queries, where technical terms have precise meanings that differ from their common usage.

For technical documentation, legal documents, or medical records, domain-specific or fine-tuned embedding models consistently outperform general-purpose models. The performance gap is often invisible in initial testing, which tends to use queries that general-purpose models handle well. It emerges in production, where users ask precisely the domain-specific questions that general-purpose models struggle with.

Evaluate embedding models on a representative sample of the actual queries your users will ask, not on generic benchmarks. Build a small evaluation set before committing to an embedding strategy.
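A small evaluation set can be as simple as a list of (query, relevant document id) pairs. The sketch below scores any embedding function by how often the known relevant document lands in the top k results. The names `hit_rate_at_k`, `embed`, `docs`, and `eval_set` are assumptions made for illustration; `embed` stands in for whichever model is being compared.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hit_rate_at_k(
    embed: Callable[[str], Vector],
    docs: dict[str, str],
    eval_set: list[tuple[str, str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose known relevant document is in the top k."""
    # Embed every document once, keyed by its id.
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in eval_set:
        q = embed(query)
        # Rank document ids by similarity to the query, best first.
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(eval_set)
```

Running the same `docs` and `eval_set` through two candidate `embed` functions gives a like-for-like comparison on your own queries rather than on a generic benchmark.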

Retrieval is a recall-precision tradeoff

Tuning retrieval involves a fundamental tradeoff between recall (retrieving all relevant documents) and precision (retrieving only relevant documents). Retrieving too few chunks risks missing the answer. Retrieving too many chunks degrades generation quality by filling the context with irrelevant information.

The optimal retrieval configuration depends on the document corpus, the query distribution, and the context window available. What works for a 10,000-document knowledge base may not work for a 1,000,000-document corpus. What works for short, specific questions may not work for broad, exploratory queries.

We build evaluation pipelines that measure retrieval recall and precision separately from generation quality. This separation makes it possible to diagnose whether a poor answer results from retrieval failure (relevant chunk not retrieved) or generation failure (relevant chunk retrieved but answer wrong). The diagnosis determines the fix.
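The separation between retrieval and generation failures can be sketched in a few lines. This is an illustrative example rather than the evaluation pipeline itself: it assumes each evaluation query has a labelled set of relevant chunk ids and a separate judgement of whether the final answer was correct. The function names and the `"retrieval_failure"` / `"generation_failure"` labels are hypothetical.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision and recall over the top k retrieved chunk ids."""
    # Assumption: retrieved ids are unique and ordered best first.
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def diagnose(
    retrieved: list[str], relevant: set[str], answer_correct: bool, k: int
) -> str:
    """Classify a query outcome so the fix targets the right stage."""
    if answer_correct:
        return "ok"
    _, recall = precision_recall_at_k(retrieved, relevant, k)
    # No relevant chunk reached the model: the retriever needs work.
    # A relevant chunk was present but the answer was still wrong:
    # look at prompting, context ordering, or the generator instead.
    return "generation_failure" if recall > 0 else "retrieval_failure"
```

Counting these labels across the evaluation set shows which stage accounts for most wrong answers before any tuning starts.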

Hybrid search outperforms dense retrieval alone

Pure vector similarity search — the approach every RAG tutorial uses — misses an important class of queries: exact keyword matches. A user searching for a specific product code, a legal citation, or a technical error message wants exact matching, not semantic similarity.

Hybrid search combines dense vector retrieval with sparse keyword search (BM25 or similar). The results are merged using reciprocal rank fusion or learned ranking. In production, hybrid search consistently outperforms pure vector search on diverse query sets, particularly for domain-specific queries with precise terminology.
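Reciprocal rank fusion is simple enough to sketch directly. The example below merges any number of ranked id lists, for instance one from a vector index and one from BM25, by summing 1 / (k + rank) for each document. The function name is an assumption for illustration, and k = 60 is the constant commonly used for RRF, not a value taken from the systems described here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; documents missing from a list contribute nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because RRF uses only rank positions, it avoids having to normalise cosine similarities against BM25 scores, which sit on different scales.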

The additional implementation complexity is modest. Most vector databases now support hybrid search natively. The performance improvement is significant enough that we treat hybrid search as the default for production RAG systems.

Answer grounding requires explicit verification

RAG is supposed to reduce hallucination by grounding answers in retrieved documents. In practice, it reduces hallucination without eliminating it. Models will occasionally generate answers that are not supported by the retrieved context, particularly when the context is ambiguous or incomplete, or when the question falls outside the document corpus.

For applications where hallucination carries significant risk — legal advice, medical information, financial guidance — we implement explicit grounding verification. After generating the answer, a second model call verifies that each claim in the answer is supported by a specific passage in the retrieved context. Claims that cannot be grounded are flagged or removed.
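The verification step might look roughly like the following. This is a sketch under loud assumptions: claims are approximated by splitting the answer into sentences, and `is_supported` is a hypothetical callable that stands in for the second model call, since the actual prompt and model are not specified here.

```python
import re
from typing import Callable, Optional


def verify_grounding(
    answer: str,
    passages: list[str],
    is_supported: Callable[[str, str], bool],
) -> list[tuple[str, Optional[int]]]:
    """Pair each claim with the index of a supporting passage, or None."""
    # Assumption: one sentence is one claim; real claim extraction is harder.
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    results = []
    for claim in claims:
        # is_supported is a placeholder for the verifier model call.
        source = next(
            (i for i, passage in enumerate(passages) if is_supported(claim, passage)),
            None,
        )
        results.append((claim, source))
    return results


def keep_grounded(results: list[tuple[str, Optional[int]]]) -> str:
    """Rebuild the answer from claims that have a supporting passage."""
    return " ".join(claim for claim, source in results if source is not None)
```

Returning the passage index alongside each claim makes it possible to show citations to the user, or to flag ungrounded claims for review instead of silently dropping them.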

This adds latency and cost. In high-stakes domains, it's not optional.

Evaluation before deployment, not after

The most important investment in a production RAG system is an evaluation dataset built before deployment. This dataset should contain representative queries with known correct answers, covering the range of question types users will actually ask.

Without an evaluation dataset, you are deploying blind — optimising for subjective impressions of quality rather than measured performance. With one, you can make objective comparisons between chunking strategies, embedding models, retrieval configurations, and prompting approaches.
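As a rough illustration of what measured comparison means, the sketch below stores evaluation cases with a question type and reports accuracy per type for any pipeline. Everything here is an assumption for the example: the `EvalCase` fields, the `accuracy_by_type` name, and especially the substring check, which stands in for whatever correctness judgement your domain needs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    query: str
    expected: str        # known correct answer, written by a domain expert
    question_type: str   # e.g. "lookup", "reasoning"; labels are illustrative


def accuracy_by_type(
    pipeline: Callable[[str], str], cases: list[EvalCase]
) -> dict[str, float]:
    """Run every case through the pipeline and report accuracy per type."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.question_type] += 1
        # Assumption: an answer counts as correct if it contains the expected
        # text. Real systems usually need a stricter or model-based check.
        if case.expected.lower() in pipeline(case.query).lower():
            correct[case.question_type] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```

Passing two pipeline variants through the same cases, for example one per chunking strategy, yields numbers that can be compared directly, and the per-type breakdown shows which kinds of question a change helps or hurts.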

Building the evaluation dataset requires domain expertise and takes time. It also makes every subsequent decision about the system faster, cheaper, and more reliable. It's the infrastructure investment that makes everything else possible.
