Retrieval-augmented generation has become the standard pattern for building LLM applications over domain-specific knowledge. The tutorials make it look straightforward: chunk documents, embed them, store in a vector database, retrieve relevant chunks at query time, pass to the LLM with context. Two hundred lines of Python and you have a demo that impresses stakeholders.
The gap between the demo and production reliability is where most RAG implementations fail. The specific failure modes aren’t covered in the tutorials because they only appear under real usage patterns. Here’s what actually breaks and how to address it.
The Chunking Problem Is Not Solved by Defaults
Chunking — splitting documents into pieces that get embedded individually — is where most RAG implementations have their largest unaddressed quality gap. The default “chunk at 512 tokens with 50-token overlap” that appears in most tutorials is a reasonable starting point but a poor production strategy.
The problem with fixed-size chunking: it splits semantic units at arbitrary boundaries. A legal clause split mid-sentence. A code block split at line 200. A paragraph that starts with “The exception to the above rule” separated from the rule it’s excepting. The chunk contains locally coherent text but lacks the context needed for a retrieval system to understand what it’s about.
What works better in production:
Semantic chunking uses embedding similarity to split documents at natural semantic boundaries rather than character/token counts. The implementation is more complex (you’re computing embeddings at chunk time), but the retrieval quality improvement is substantial for document types with clear sections and subsections.
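A minimal sketch of the idea, assuming an `embed` function (any embedding model fits this shape) that maps a list of sentences to unit-normalized vectors, and a similarity threshold you would tune per corpus:

```python
import re
import numpy as np

def semantic_chunks(text, embed, threshold=0.75):
    """Split text where adjacent-sentence embedding similarity drops.

    `embed` is assumed to map a list of strings to an (n, d) array of
    unit-normalized vectors; the 0.75 threshold is a placeholder.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) < 2:
        return sentences
    vecs = embed(sentences)
    # Cosine similarity between each sentence and the next one.
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim < threshold:  # similarity drop = semantic boundary
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Production variants typically smooth the similarity series or require a minimum chunk size, but the boundary-detection core is the same.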
Document structure awareness uses document metadata to chunk along structural boundaries — split legal documents at section headings, code files at function definitions, PDFs at page breaks. This requires knowing the document type and extracting structure, but the results are significantly better than naive splitting.
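For markdown-like documents, structure-aware chunking can be as simple as splitting at heading lines and keeping each heading with its body; the same pattern works for legal section numbers or function definitions with a different regex:

```python
import re

def chunk_by_headings(markdown_text):
    """Split a markdown document at heading boundaries, keeping each
    heading attached to its body as retrieval context."""
    parts = re.split(r"(?m)^(#{1,6} .*)$", markdown_text)
    # re.split with a capture group interleaves the results:
    # [preamble, heading1, body1, heading2, body2, ...]
    chunks = []
    if parts[0].strip():
        chunks.append(parts[0].strip())
    for heading, body in zip(parts[1::2], parts[2::2]):
        chunks.append(f"{heading}\n{body.strip()}")
    return chunks
```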
Hierarchical chunking stores both fine-grained chunks and coarser parent chunks. Retrieval fetches fine-grained chunks for precision; the parent chunk is passed to the LLM for context. This is the pattern that handles the “context around the specific fact” problem — the retrieved chunk contains the precise answer, but the LLM gets the surrounding context to reason about it correctly.
Retrieval Quality Is the Core Problem
Every RAG system has two distinct quality problems: retrieval quality (did we find the right chunks?) and generation quality (given the right chunks, did the LLM produce a good answer?). Most teams focus on generation quality — the answer is wrong, let’s improve the prompt — when retrieval is the actual bottleneck.
Measuring retrieval quality requires building a labeled evaluation set: questions paired with the document chunks that should be retrieved. This is time-consuming and is consistently skipped. Teams instead test by asking the system questions and evaluating whether the answers “seem right.” This doesn’t distinguish between retrieval failures and generation failures.
Build the evaluation set. For most domains, 100-200 question/ground-truth pairs is enough to get meaningful signal. Run retrieval-only evaluation: given a question, what’s in the top-5 retrieved chunks? Is the ground-truth chunk present? At what rank? This metric — retrieval recall at k — tells you whether the information is findable before you worry about whether it’s usable.
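Once the labeled set exists, recall at k is a few lines; `retrieve` here stands in for whatever your pipeline's retrieval call is, returning ranked chunk ids:

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose ground-truth chunk id appears in the
    top-k retrieved results.

    `eval_set` is a list of (question, ground_truth_chunk_id) pairs;
    `retrieve` maps a question to a ranked list of chunk ids.
    """
    hits = 0
    for question, truth_id in eval_set:
        if truth_id in retrieve(question)[:k]:
            hits += 1
    return hits / len(eval_set)
```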
Common retrieval failures that evaluation will surface:
Embedding model mismatch. The embedding model’s semantic representation of queries doesn’t match its representation of relevant documents. This often appears with domain-specific terminology. A query about “counterparty risk in derivative contracts” may not retrieve a document about “credit exposure for swaps” if the embedding model doesn’t understand these terms as semantically related. Fine-tuning or choosing a domain-specific embedding model improves this.
Lexical vs. semantic mismatch. Dense retrieval (pure vector similarity) misses exact phrase matches that BM25 (keyword-based) retrieval finds easily. Hybrid retrieval — combining dense and sparse retrieval scores — consistently outperforms either alone for real-world document collections.
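Reciprocal rank fusion is a common way to merge the two result lists because it needs only ranks, not scores, which sidesteps the problem of normalizing BM25 scores against cosine similarities. A sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists (e.g. one from dense retrieval, one from
    BM25) by summing 1/(k + rank) contributions; k=60 is the value
    commonly used in the RRF literature."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists float to the top; documents found by only one retriever still survive further down the fused ranking.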
Chunk boundaries cutting relevant information. The relevant text starts at the end of one chunk and continues in the next. Increasing overlap helps at the cost of redundancy and storage. Better: retrieve adjacent chunks automatically when the retrieved chunk is at a boundary.
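A simple version of neighbor expansion, assuming `chunks` holds one document's chunks in order (a real implementation would expand only when the hit sits at a chunk boundary):

```python
def expand_with_neighbors(hit_index, chunks, window=1):
    """Return the retrieved chunk plus its document-order neighbors,
    recovering text that a chunk boundary cut in half."""
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return chunks[lo:hi]
```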
The Evaluation Problem
RAG without systematic evaluation is untestable. You can’t know if a change improved or degraded performance without measuring. Yet most production RAG implementations have no formal evaluation pipeline.
The evaluation stack that works in production:
RAGAS provides automated metrics: faithfulness (does the answer follow from the retrieved context?), answer relevancy (does the answer address the question?), context precision and recall. These are computed automatically using an LLM as judge, which introduces its own biases but is substantially better than no evaluation.
LLM-as-judge with rubrics works well for domain-specific evaluation where automated metrics don’t capture what matters. Write explicit rubrics for what “correct” means in your domain and evaluate generated answers against them. This is more expensive per evaluation but necessary for domains where the nuance matters.
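The rubric itself is the valuable artifact; the code around it is trivial. A template sketch — the three criteria below are hypothetical examples for a contracts domain, not a recommended rubric:

```python
RUBRIC = """\
Score the ANSWER against the QUESTION using this rubric:
1. Cites the governing clause when one exists.
2. States the effective date, or says it is not specified.
3. Asserts no facts absent from the CONTEXT.
Return JSON: {{"score": 0-3, "failed_criteria": [...]}}.

QUESTION: {question}
CONTEXT: {context}
ANSWER: {answer}
"""

def judge_prompt(question, context, answer):
    """Fill the rubric template; send the result to your judge model."""
    return RUBRIC.format(question=question, context=context, answer=answer)
```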
Regression testing catches quality drops when you make changes. The labeled evaluation set you built for retrieval quality serves double duty: if your retrieval changes cause previously-answered questions to fail, you know before deploying.
Latency and Cost Are Architectural Concerns
A RAG pipeline that works at 1 query per minute may not work at 100 queries per minute. The per-request latencies stack up:
- Embedding the query: 10-100ms depending on model and hosting
- Vector retrieval: 10-50ms for most managed vector databases at moderate scale
- LLM inference: 500ms-5s depending on model size and response length
At modest scale, this is fine. At scale, the LLM inference step dominates latency and cost. Optimizations:
Semantic query caching stores results for frequent or near-duplicate queries. A cache with fuzzy matching (hits for queries within a similarity threshold of a cached query) can dramatically reduce LLM calls for repetitive use cases.
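A linear-scan sketch of such a cache — fine for thousands of entries, after which you would back it with a vector index. Query embeddings are assumed unit-normalized, and the 0.95 threshold is a placeholder to tune:

```python
import numpy as np

class SemanticCache:
    """Fuzzy query cache: return a stored answer when a new query's
    embedding is within `threshold` cosine similarity of a cached one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.vecs = []      # cached query embeddings (unit-norm assumed)
        self.answers = []   # corresponding answers

    def get(self, query_vec):
        for vec, answer in zip(self.vecs, self.answers):
            if float(np.dot(vec, query_vec)) >= self.threshold:
                return answer  # cache hit: skip the LLM call entirely
        return None

    def put(self, query_vec, answer):
        self.vecs.append(query_vec)
        self.answers.append(answer)
```

The threshold is the tricky part: too loose and paraphrases with different intents share an answer; too tight and the cache never hits.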
Query routing sends simple queries to faster, cheaper models and only uses the large model for complex questions. A classifier that routes “what is X?” questions to a small model and “analyze the implications of X given Y” questions to a large model meaningfully reduces cost without proportional quality loss.
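A toy router to make the shape concrete — the model names are placeholders, and a production router is usually a small trained classifier rather than keyword rules:

```python
def route(query):
    """Route simple lookups to a cheap model, analytical questions to a
    large one. Keyword heuristics stand in for a real classifier here."""
    analytical = ("analyze", "compare", "implications", "evaluate", "why")
    if len(query.split()) > 25 or any(w in query.lower() for w in analytical):
        return "large-model"
    return "small-model"
```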
Streaming responses don’t reduce total latency but dramatically improve perceived latency. Users who see tokens appearing within 200ms don’t experience the same “loading” frustration as users waiting 3 seconds for the full response.
What to Get Right Before Going to Production
The production readiness questions for a RAG system:
- Do you have a labeled evaluation set and automated evaluation pipeline?
- Is retrieval quality measured separately from answer quality?
- Do you have monitoring on retrieval latency, embedding failures, and LLM API errors?
- Is there a feedback mechanism for users to flag incorrect answers?
- Do you have rate limiting and cost controls on LLM API usage?
- Is the system’s behavior on out-of-scope questions defined? (It should say it doesn’t know, not hallucinate.)
The last point matters more than most teams expect. A RAG system that confidently answers questions outside its retrieval corpus is worse than one that says “I don’t have information about that.” The guardrails for out-of-scope queries require explicit design — the LLM will answer if not instructed otherwise.
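Both layers of that guardrail can be sketched together: refuse when retrieval finds nothing relevant, and instruct the model to stay inside the context otherwise. `hits` is assumed to be (chunk_text, similarity) pairs, `generate` is your LLM call, and the 0.35 floor is a placeholder to tune against your evaluation set:

```python
REFUSAL = "I don't have information about that in my knowledge base."

def guarded_answer(question, hits, generate, min_score=0.35):
    """Refuse when no retrieved chunk clears the relevance floor,
    instead of letting the LLM improvise an answer."""
    relevant = [text for text, score in hits if score >= min_score]
    if not relevant:
        return REFUSAL  # out-of-scope: say so, don't hallucinate
    prompt = (
        "Answer ONLY from the context below. If the context does not "
        "contain the answer, say you don't know.\n\n"
        "Context:\n" + "\n---\n".join(relevant) +
        f"\n\nQuestion: {question}"
    )
    return generate(prompt)
```

The score floor catches queries that retrieve nothing useful; the prompt instruction catches the subtler case where chunks are retrieved but don't actually contain the answer.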
Our AI consulting and implementation practice works through exactly this gap between working prototype and production-reliable system. If you’re evaluating whether a RAG system fits your use case, the data quality question comes before the model question: what are the documents, how are they structured, and how consistent is the quality? Related: the infrastructure for running RAG pipelines at scale intersects with data engineering — the document processing pipeline, embedding job scheduling, and monitoring stack are all data infrastructure problems.