Building Production RAG: Architecture, Chunking, Evaluation & Monitoring (2026 Guide)
Build production RAG that actually works at scale. Covers chunking strategies with benchmarks, embedding selection, hybrid retrieval, reranking, RAGAS evaluation, latency budgets, and monitoring.
80% of RAG failures trace back to the ingestion and chunking layer, not the LLM. Most teams discover this after spending weeks tuning prompts and swapping models while their retrieval quietly returns the wrong context every third query.
This guide is about production RAG: the architecture decisions that don't appear in tutorials, the benchmarks that should drive your chunking and embedding choices, and the evaluation and monitoring setup that tells you when things break. It assumes you have a working prototype and want to make it reliable at scale. For a broader look at where RAG fits among advanced retrieval methods, that post covers the taxonomy before you commit to an architecture.
The Gap Between POC and Production
A notebook RAG demo typically works like this: load PDFs, chunk by 512 tokens, embed with OpenAI, retrieve top-5, generate with GPT-4. It produces decent answers on your test questions. It feels ready.
Then you deploy it. Users ask questions you didn't anticipate. Documents get updated. Load increases. Latency climbs. Answers start slipping, but there's no metric telling you that, so you find out from user complaints.
The gap isn't the LLM. It's usually five things. If you're still deciding whether RAG or a different approach fits your use case, see the RAG vs long-context LLMs comparison first. That decision affects every architectural choice below.
- Chunking that splits semantic units across boundaries, so the retrieved context is always slightly wrong
- Retrieval relying on dense-only search, which misses keyword-specific queries
- No reranking, so the top-5 by cosine similarity aren't the top-5 by actual relevance
- No evaluation framework, so you can't tell if a code change made things better or worse
- No observability, so production failures are invisible until users report them
Each section below addresses one of these in depth.
Architecture Overview
Before diving into components, here's the full pipeline with the decision points that matter:
[Documents] → [Ingestion & Parsing] → [Chunking] → [Embedding] → [Vector Index]
↓
[User Query] → [Query Processing] → [Hybrid Retrieval] → [Reranking] → [Context Assembly] → [LLM] → [Response]
↓
[Evaluation & Monitoring]
Each arrow is a failure point. The rest of this guide treats them in sequence.
Stage 1: Document Ingestion and Parsing
Most guides skip this stage. It's where production RAG actually starts failing.
The parsing problem
Raw documents are not clean text. PDFs have tables, headers, footers, multi-column layouts, and scanned pages. HTML has navigation menus mixed into the body. Word documents have tracked changes and comments. If your parser returns garbage, your chunks are garbage, your embeddings are garbage, and your retrieval returns garbage, regardless of how well everything downstream is configured.
For PDFs: Use Unstructured.io or LlamaParse rather than PyPDF2 or pdfminer directly. Both do layout-aware parsing that distinguishes body text from headers, tables, and figures. For scanned PDFs, you need an OCR stage. Tesseract works for most use cases, but production accuracy on dense documents requires a commercial OCR service.
For tables: Treat them separately. Embedding a markdown table as prose produces poor retrieval. Extract tables into structured chunks with a consistent format:
Table: Q3 Revenue by Region (Source: Q3_report.pdf, Page 12)
Region | Revenue ($M) | YoY Change
North America | 142.3 | +18%
EMEA | 89.1 | +7%
APAC | 67.4 | +31%
This format retrieves correctly because the header row and table context are part of the chunk.
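A minimal sketch of that formatting step; the `header`/`rows` shapes are assumptions about what your table parser returns, not any specific library's output:

```python
def table_to_chunk(caption: str, source: str, page: int,
                   header: list, rows: list) -> str:
    """Render an extracted table as a self-contained text chunk.

    Keeping the caption, source, and header row inside the chunk is what
    makes the table retrievable alongside prose.
    """
    lines = [f"Table: {caption} (Source: {source}, Page {page})"]
    lines.append(" | ".join(header))
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)

chunk = table_to_chunk(
    "Q3 Revenue by Region", "Q3_report.pdf", 12,
    ["Region", "Revenue ($M)", "YoY Change"],
    [["North America", "142.3", "+18%"], ["EMEA", "89.1", "+7%"]],
)
```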
For HTML/web content: Strip navigation, ads, and boilerplate before chunking. Trafilatura does this better than BeautifulSoup for most content types.
Metadata extraction
Every chunk needs metadata attached at ingestion, not after. Metadata you'll use later:
- source_id: Unique document identifier (for access control and citation)
- doc_type: PDF, web, database record, etc. (for routing decisions)
- section_title: The heading this chunk falls under
- created_at / updated_at: For temporal filtering
- access_level: For multi-tenant permission filtering at retrieval time

Adding metadata after embedding is possible but requires re-indexing. Build it into ingestion from the start.
Document freshness
Stale context is a silent failure mode. A document updated last month that your index hasn't re-processed will return outdated answers with full confidence. You need:
- Change detection: Hash document content at ingestion. On re-ingestion, compare hashes to detect updates.
- Incremental re-indexing: Update only the chunks from changed documents, not the full corpus.
- Deletion handling: When source documents are deleted or access is revoked, those chunks must be removed from the index.
For most pipelines, a daily re-ingestion job that checks document hashes is sufficient. For real-time data sources, use event-driven ingestion triggered by source system webhooks. Teams dealing with large-scale corpus automation should also look at enterprise dataset automation patterns. The same ingestion pipeline that feeds RAG indexes can feed fine-tuning datasets with minimal additional work.
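The change-detection piece is small enough to sketch directly; `previous` would come from wherever you persist ingestion state (the storage layer is an assumption here):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changes(previous: dict, current_docs: dict):
    """Compare stored hashes against a fresh crawl.

    previous:     source_id -> hash from the last ingestion run
    current_docs: source_id -> raw text from this run
    Returns (changed_or_new_ids, deleted_ids).
    """
    current_hashes = {sid: content_hash(t) for sid, t in current_docs.items()}
    changed = [sid for sid, h in current_hashes.items() if previous.get(sid) != h]
    deleted = [sid for sid in previous if sid not in current_hashes]
    return changed, deleted

previous = {"a": content_hash("alpha v1"), "b": content_hash("beta v1")}
current = {"a": "alpha v1", "b": "beta v2", "c": "gamma v1"}
changed, deleted = detect_changes(previous, current)  # "b" updated, "c" new
```

Only the IDs in `changed` need re-chunking and re-embedding; `deleted` drives index cleanup.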
Stage 2: Chunking Strategy
Chunking quality constrains retrieval accuracy more than embedding model choice. A 2025 clinical decision support study found adaptive chunking hit 87% accuracy versus 13% for fixed-size baselines on the same corpus. That's not a marginal gap. It's the difference between a system that works and one that doesn't. For a higher-level overview of how chunking fits into different RAG strategies including simple, hybrid, and agentic patterns, that post is a useful companion.
But the right strategy depends entirely on your document types and query patterns. There is no universal best chunking strategy.
Fixed-size chunking
Split on token count, typically 400-512 tokens with 10-20% overlap.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ".", " ", ""]
)
chunks = splitter.split_text(document_text)
```
When it works: Homogeneous content like news articles, support tickets, FAQ entries, or short product descriptions where each item is already a complete semantic unit.
When it fails: Technical documentation, legal contracts, research papers, or any document where meaning spans multiple paragraphs. A 512-token window that cuts mid-explanation sends half-context to the LLM.
The overlap question: A January 2026 systematic analysis found that overlap provided no measurable benefit on recall when using SPLADE retrieval; it only increased storage and embedding costs. Overlap matters most with dense retrieval on long-context queries. Test whether it helps your specific corpus before defaulting to 20% overlap.
Recursive chunking
Split on structural boundaries (paragraph, sentence, word) before falling back to character count. This is the better default over fixed-size for most document types.
```python
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]  # priority order
)
```
A February 2026 benchmark across 50 academic papers placed recursive 512-token splitting at 69% accuracy, 15 percentage points above semantic chunking on the same corpus, because semantic chunking's small average fragment size (43 tokens) destroyed context.
Recursive chunking is faster than semantic methods, avoids the semantic chunking fragmentation problem, and performs on par or better across most content types. It's the right production default.
Semantic chunking
Split based on embedding similarity between adjacent sentences. Split when semantic distance exceeds a threshold.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = splitter.split_text(document_text)
```
The cost problem: Semantic chunking requires embedding every sentence to compute distances. On a 10,000-word document, that's 200-300 embedding API calls just for chunking, before any retrieval happens. For large corpus ingestion, this adds cost and latency to the indexing pipeline.
Use it for: High-value documents where retrieval quality is critical and budget is less constrained: legal contracts, medical protocols, compliance documentation, research papers. The clinical study mentioned above found semantic chunking hit 79-82% faithfulness scores versus 47-51% for fixed-size on the same corpus.
When to skip it: homogeneous corpora where fixed-size performs equally well, high-volume ingestion pipelines where per-document latency matters, and rapid prototyping.
Proposition chunking
Break content into atomic, self-contained factual statements. Each chunk answers a single question and can be understood without surrounding context.
Input paragraph:
The model was trained on 3,000 H200 GPUs. Training took 6 weeks.
Final perplexity was 2.3 on the holdout set.
Proposition chunks:
"The model was trained on 3,000 NVIDIA H200 GPUs."
"Model training duration was 6 weeks."
"The model achieved 2.3 perplexity on the holdout evaluation set."
This produces the highest retrieval precision for factoid queries. It's expensive to generate (requires an LLM to produce propositions) and increases chunk count significantly, but for knowledge bases where users ask specific factual questions, it outperforms every other method.
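Because proposition generation runs through an LLM, the code side is mostly a prompt plus output parsing. A sketch, with illustrative prompt wording and a hand-written sample of the output shape (not a real model call):

```python
PROPOSITION_PROMPT = """\
Decompose the passage into atomic, self-contained factual statements.
Each statement must be understandable without the surrounding text:
resolve pronouns and name the subject explicitly.
Return one statement per line, prefixed with "- ".

Passage:
{passage}
"""

def parse_propositions(llm_output: str) -> list:
    """Parse the '- ' bulleted statements returned for the prompt above."""
    return [
        line[2:].strip()
        for line in llm_output.splitlines()
        if line.startswith("- ") and len(line) > 2
    ]

# Shape of a typical response (hand-written example):
sample_output = (
    "- The model was trained on 3,000 NVIDIA H200 GPUs.\n"
    "- Model training duration was 6 weeks.\n"
)
props = parse_propositions(sample_output)
```

Each parsed proposition becomes its own chunk, embedded and indexed individually.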
Document-level chunking
For short, focused documents like support tickets, product descriptions, and FAQ entries: don't chunk at all. Embed the entire document. Chunking introduces fragmentation where none is needed.
Hierarchical chunking (parent-child)
Store both large parent chunks (full sections) and small child chunks (sentences/paragraphs). Retrieve by child chunk similarity, but return the parent chunk as context to the LLM.
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma

child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(embedding_function=embeddings),
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```
This approach gives you precise retrieval (small chunks match queries more exactly) with full context for generation (parent chunks have the surrounding information the LLM needs). It's the most balanced approach for complex documents and is worth the extra storage cost in most production systems.
Chunking decision framework
| Document Type | Query Pattern | Recommended Strategy |
|---|---|---|
| Short FAQ / tickets / product descriptions | Factoid lookup | No chunking (embed whole document) |
| News articles, blog posts | General questions | Fixed-size 512 tokens |
| Technical docs, wikis, manuals | Mixed: factoid and exploratory | Recursive or hierarchical |
| Legal contracts, compliance docs | Clause-specific queries | Semantic or proposition |
| Research papers | Specific fact retrieval | Proposition chunking |
| Multi-topic long docs | Mixed queries | Hierarchical (parent-child) |
Build a small labeled test set for your corpus before committing. 50-100 representative query-answer pairs is enough to rank chunking strategies empirically. Changing strategy later requires re-embedding your entire corpus. The decision is semi-permanent.
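Scoring strategies against that labeled set reduces to a hit-rate computation once you have each strategy's retrieval results. A sketch with toy data (the `results`/`gold` shapes are assumptions):

```python
def hit_rate_at_k(results: dict, gold: dict, k: int = 5) -> float:
    """Fraction of queries whose top-k retrieved chunk IDs include at least
    one gold-labeled relevant chunk.

    results: query -> ranked chunk IDs returned under one chunking strategy
    gold:    query -> set of chunk IDs labeled relevant for that query
    """
    hits = sum(
        1 for q, ranked in results.items()
        if gold.get(q) and set(ranked[:k]) & gold[q]
    )
    return hits / max(len(results), 1)

# Same labeled queries, two candidate strategies
gold = {"q1": {"c3"}, "q2": {"c9"}}
recursive_results = {"q1": ["c3", "c1"], "q2": ["c2", "c9"]}
semantic_results = {"q1": ["c7", "c1"], "q2": ["c9", "c4"]}
```

Run every candidate strategy over the same frozen query set and pick the winner on numbers, not intuition.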
Stage 3: Embedding Model Selection
The embedding model converts your text chunks into vectors. Model choice affects retrieval quality, latency, storage cost, and whether you're sending data to a third party.
Current options
As of early 2026, Voyage AI's voyage-3-large leads the MTEB leaderboard for retrieval tasks. It outperforms OpenAI's text-embedding-3-large by 9.74% and Cohere's embed-v3-english by 20.71% on evaluated domains. It supports 32K-token context versus 8K for OpenAI, which matters for long document chunks. At $0.06 per million tokens it's 2.2x cheaper than OpenAI and requires 3x less vector storage due to smaller 1024-dimensional embeddings.
| Model | MTEB Score | Context | Cost/1M tokens | Dimensions |
|---|---|---|---|---|
| voyage-3-large | Best | 32K | $0.06 | 1,024 |
| text-embedding-3-large | Strong | 8K | $0.13 | 3,072 |
| cohere embed-v3 | Good | 512 | $0.10 | 1,024 |
| BGE-M3 (self-hosted) | Strong | 8K | Infra cost | 1,024 |
| E5-large-v2 (quantized) | Good | 512 | Infra cost | 1,024 |
On self-hosted models: If data sovereignty is a concern, BGE-M3 and E5-large-v2 are the strongest open-source options. Quantized E5-large-v2 runs in 10ms on CPU, faster than any embedding API. For enterprises with private data that shouldn't leave their infrastructure, this is the right path. PremAI's platform handles self-hosted fine-tuned model serving including embedding models with zero data retention. The private AI platform overview covers the full architecture for teams that need on-premise embedding.
Domain-specific embedding
General-purpose models perform well across domains but may miss domain-specific terminology. A biomedical embedding model will retrieve "myocardial infarction" when the query says "heart attack", which a general model may not. If your corpus is domain-specific (legal, medical, financial, code), evaluate whether a domain-tuned model improves retrieval before committing to a general one.
Fine-tuning an embedding model on your own corpus is the strongest option for specialized domains. See the enterprise fine-tuning guide for the full workflow. For most teams, start with a general model and evaluate whether domain-specific matters for your actual queries.
Embedding consistency
Your query must use the same embedding model as your indexed chunks. This sounds obvious but becomes an operational problem:
- If you upgrade your embedding model, you must re-embed and re-index your entire corpus
- If you run multiple embedding models for different document types, query routing must match
- Model deprecations require planned migration windows
Pin your embedding model version explicitly and treat upgrades as migrations, not patches.
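One way to enforce that pin is to store the embedding config alongside the index and verify it at query time. A sketch (config values are illustrative):

```python
EMBEDDING_CONFIG = {
    # Pinned at index build time; persist this with the index itself.
    "model": "text-embedding-3-large",
    "dimensions": 3072,
}

def check_embedding_compat(index_meta: dict, query_config: dict) -> None:
    """Fail fast if the query-side embedder doesn't match the index."""
    for key in ("model", "dimensions"):
        if index_meta.get(key) != query_config.get(key):
            raise ValueError(
                f"Embedding mismatch on '{key}': index has {index_meta.get(key)!r}, "
                f"query side has {query_config.get(key)!r}. Re-index before querying."
            )
```

Calling this check at service startup turns a silent relevance collapse into a loud deploy-time failure.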
Stage 4: Vector Indexing and Hybrid Retrieval
Pure vector (dense) retrieval misses exact-match queries. BM25 (sparse) retrieval misses semantic queries. Production RAG uses both.
Why dense-only fails
If a user asks for "RFC 7231 section 4.3.1", dense retrieval returns documents semantically similar to "HTTP specification GET method", which may be correct but won't surface the specific section with that exact identifier. BM25 matches the exact string. Similarly, product codes, error codes, legal citations, and named entities all retrieve better with keyword search.
Hybrid retrieval with Reciprocal Rank Fusion
The standard production approach combines BM25 and dense retrieval with Reciprocal Rank Fusion (RRF), which merges ranked lists without needing score normalization:
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma

# Dense retriever
dense_retriever = Chroma(
    embedding_function=embeddings
).as_retriever(search_kwargs={"k": 10})

# Sparse retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Hybrid with RRF weighting
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6]  # tune based on your query mix
)
```
The RRF formula is 1 / (rank + k) summed across retrievers, where k is typically 60. It's robust to score scale differences between BM25 and cosine similarity, so you don't need to normalize.
Weight tuning: dense 0.6 / sparse 0.4 is a reasonable starting point for general question-answering. Increase sparse weight if your users frequently query specific terms, codes, or identifiers. Increase dense weight if queries are predominantly conceptual or exploratory.
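If you need RRF outside a framework, it's a few lines. A sketch using the standard k=60 constant (document IDs here are toy values):

```python
def rrf_merge(ranked_lists, weights=None, k=60):
    """Reciprocal Rank Fusion: each retriever contributes
    weight * 1 / (rank + k) per document; scores are summed and re-ranked."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge(
    [["d1", "d2", "d3"],   # sparse (BM25) ranking
     ["d3", "d1", "d4"]],  # dense ranking
    weights=[0.4, 0.6],
)
```

Documents ranked highly by both retrievers rise to the top even when neither ranking alone puts them first.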
Vector database selection
| Database | Strengths | Watch out for |
|---|---|---|
| Pinecone | Managed, simple, production-tested | Cost at scale, limited filtering |
| Qdrant | Fast, strong filtering, self-hostable | Operational overhead vs. managed |
| Weaviate | GraphQL API, multi-vector, hybrid built-in | More complex setup |
| Milvus | Scales to billions of vectors | Significant operational complexity |
| pgvector | Already in your Postgres stack | Slower ANN at large scale |
| ChromaDB | Easy local dev | Not production-grade at scale |
For most enterprise deployments: Qdrant self-hosted or Weaviate for control, Pinecone for simplicity. pgvector is surprisingly capable if your corpus is under a few million chunks and you already run Postgres, which removes a database dependency.
Metadata filtering
Filter at retrieval time to narrow the search space before similarity ranking. This reduces irrelevant results and speeds up retrieval:
```python
results = vectorstore.similarity_search(
    query=user_query,
    k=20,
    filter={
        "doc_type": "policy",
        "access_level": {"$in": user_permission_levels},
        "updated_at": {"$gte": "2024-01-01"}
    }
)
```
Multi-tenant systems must filter by access_level at retrieval, not after. Retrieving documents and then filtering means your vector search scans the entire corpus. In a security context, returning unauthorized documents is a data leak waiting to happen. Build access control into the metadata filter.
Query expansion and transformation
Raw user queries are often too short for good retrieval. "how do I do X" retrieves less well than "step-by-step instructions for doing X including prerequisites and common errors."
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer to the query using a small LLM, then embed that answer instead of the raw query. The synthetic answer has the vocabulary and style of a real document, which retrieves better than the original question:
```python
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_core.prompts import PromptTemplate

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=base_embeddings,
    custom_prompt=PromptTemplate.from_template(
        "Write a detailed technical document that answers this question: {question}"
    )
)
```
HyDE adds one LLM call to your retrieval path, typically 200-400ms. Worth it for complex queries, probably not for simple factoid lookups. Consider routing: use HyDE for queries classified as "complex" or "exploratory" and skip it for straightforward factoid queries.
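One way to implement that routing cheaply is a lexical heuristic before reaching for an LLM classifier. The patterns and word-count threshold below are illustrative assumptions, not tuned values:

```python
import re

# Factoid signals: question-word openers, identifier-like tokens, years
FACTOID_PATTERNS = re.compile(
    r"^(who|when|where|what is the)\b|[A-Z]{2,}-?\d+|\d{4}",
    re.IGNORECASE,
)

def should_use_hyde(query: str) -> bool:
    """Skip HyDE for short factoid/identifier lookups; use it for longer,
    exploratory queries where a synthetic answer improves retrieval."""
    if FACTOID_PATTERNS.search(query):
        return False
    return len(query.split()) >= 6
```

Queries that fail the heuristic go straight to hybrid retrieval, saving the 200-400ms HyDE call where it adds the least.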
Query rewriting: Use an LLM to expand ambiguous queries before retrieval:
```python
rewrite_prompt = """\
Rewrite the following user query to be more specific and retrievable.
Include relevant technical terms that might appear in documentation.
Keep it concise (under 100 words).

Query: {query}
Rewritten:"""
```
Stage 5: Reranking
Reranking is the single highest-ROI addition to a basic RAG pipeline. Adding a reranker after retrieval typically improves precision by 10-30% at a cost of 50-100ms added latency.
Why retrieval order isn't relevance order
Cosine similarity ranks by vector distance. That correlates with relevance but doesn't equal it. A document that's semantically similar to your query might be answering a different question entirely. A cross-encoder reranker reads both the query and each retrieved document together, scoring them as a pair, which is much closer to actual relevance.
The tradeoff: bi-encoders (what vector search uses) encode query and document independently and compare embeddings, fast enough for large-scale search. Cross-encoders read both together, more accurate but too slow to run on millions of documents. You use retrieval (bi-encoder) to get a candidate set of 20-50 documents, then reranking (cross-encoder) to re-order them. Return the top 3-5 to the LLM.
Implementation
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Load a cross-encoder
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")
compressor = CrossEncoderReranker(model=model, top_n=5)

# Wrap your existing retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever  # from Stage 4
)
```
Or use Cohere's hosted reranker, which avoids the GPU requirement:
```python
from langchain_cohere import CohereRerank

compressor = CohereRerank(
    model="rerank-v3.5",
    top_n=5
)
```
Reranker model selection
| Model | Type | Latency | Notes |
|---|---|---|---|
| BAAI/bge-reranker-v2-m3 | Cross-encoder (self-hosted) | ~30ms/query on GPU | Strong multilingual support |
| ms-marco-MiniLM-L-6-v2 | Cross-encoder (self-hosted) | ~10ms/query on CPU | Fastest, English-only |
| Cohere rerank-v3.5 | API | ~100ms | No GPU needed, easy setup |
| Jina reranker-v2 | Cross-encoder (self-hosted) | ~40ms on GPU | Good for long documents |
For most teams: Cohere rerank-v3.5 if you're already on managed APIs, bge-reranker-v2-m3 if you need multilingual and can spare a GPU, ms-marco-MiniLM-L-6-v2 if you need fast CPU-only reranking.
The retrieve-rerank-generate pattern
Full pipeline:
```python
def rag_query(query: str, k_retrieve: int = 20, k_rerank: int = 5) -> str:
    # 1. Retrieve a broader candidate set
    candidates = hybrid_retriever.get_relevant_documents(query)[:k_retrieve]
    # 2. Rerank to get top-k by cross-encoder relevance
    reranked = reranker.compress_documents(candidates, query)[:k_rerank]
    # 3. Assemble context
    context = "\n\n---\n\n".join(doc.page_content for doc in reranked)
    # 4. Generate
    response = llm.invoke(
        prompt_template.format(context=context, query=query)
    )
    return response
```
Retrieve 20, return 5. Retrieving a larger candidate set gives the reranker more to work with; returning fewer chunks to the LLM reduces context noise and token cost. Tune k_retrieve and k_rerank based on your latency budget and corpus characteristics.
Stage 6: Context Assembly and Prompt Design
What you put in the prompt window matters as much as what you retrieve.
Context ordering
LLMs show recency bias: content at the end of the context window gets higher attention than the middle. Put the most relevant chunks last, not first. With a reranker, your chunks are already ranked; reverse the order before assembling the prompt:
```python
# Most relevant chunk goes last
ordered_chunks = list(reversed(reranked_docs))
context = "\n\n---\n\n".join(doc.page_content for doc in ordered_chunks)
```
Source attribution in chunks
Include source metadata in each context block:
```python
def format_chunk(doc) -> str:
    source = doc.metadata.get("source_id", "unknown")
    section = doc.metadata.get("section_title", "")
    return f"[Source: {source} | Section: {section}]\n{doc.page_content}"
```
This lets the LLM cite sources in its response and makes debugging much easier. You can trace exactly which chunk produced a particular response.
Prompt structure
You are an assistant for [SYSTEM CONTEXT]. Answer questions based only on the provided context.
If the context doesn't contain the answer, say so. Do not infer or extrapolate.
Context:
---
[CHUNK 3 - least relevant]
---
[CHUNK 2]
---
[CHUNK 1 - most relevant]
---
Question: {query}
Answer:
The "answer only from context" instruction cuts hallucination rates. Combine it with a faithfulness evaluator in production to catch when the model strays.
Context window management
At long context lengths, retrieval quality degrades even for models with large context windows. Chroma's July 2025 research tested 18 models including GPT-4.1, Claude 4, and Gemini 2.5 and found consistent performance degradation as context length increased. Counterintuitively, shorter, more precise context often produces better answers than dumping 50K tokens of retrieved text.
Keep your assembled context under 8K tokens for most queries. If you're consistently hitting that limit, your reranking threshold is too loose. Reduce k_rerank or be more aggressive with the relevance score cutoff.
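A minimal sketch of a token-budget guard during context assembly. The 4-characters-per-token estimate is a rough assumption; use your model's actual tokenizer in practice:

```python
def assemble_context(chunks, budget_tokens=8000):
    """Greedily keep the highest-ranked chunks that fit the token budget.

    chunks: chunk texts, already sorted by reranker relevance (best first).
    """
    kept, used = [], 0
    for chunk in chunks:
        est_tokens = max(1, len(chunk) // 4)  # crude chars-to-tokens estimate
        if used + est_tokens > budget_tokens:
            break
        kept.append(chunk)
        used += est_tokens
    return "\n\n---\n\n".join(kept)
```

Because the chunks arrive relevance-sorted, the guard drops the weakest context first when the budget is tight.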
Stage 7: Evaluation Framework
Teams that don't measure can't improve. And without evaluation, you can't tell if a change to chunking, embedding, or retrieval actually made things better. LLM reliability and why evaluation matters covers the broader principles. The section below applies them specifically to RAG systems.
The RAG evaluation stack
You need to evaluate two separate systems: the retriever (did it find the right documents?) and the generator (did it use them correctly?).
Retrieval metrics:
- Context Precision: Of the retrieved chunks, what fraction were actually relevant to answering the query? Low precision means you're sending noise to the LLM.
- Context Recall: Did the retrieved set contain all the information needed to answer correctly? Low recall means the LLM is answering from incomplete context.
- MRR (Mean Reciprocal Rank): How high does the first relevant document appear in the ranking? Good for evaluating reranker quality.
- NDCG@k: Normalized Discounted Cumulative Gain, which measures both presence and ranking position of relevant documents in top-k results. For a broader treatment of LLM evaluation benchmarks, challenges, and trends beyond RAG-specific metrics, that post is worth reading before you design your evaluation protocol.
Generation metrics:
- Faithfulness: Does the response contain only claims that are supported by the retrieved context? Measured by checking each statement in the response against the context chunks. This is your primary hallucination metric.
- Answer Relevance: Does the response actually address the question? High faithfulness with low relevance means the model answered a different question using real context.
- Answer Correctness: Is the response factually correct? Requires ground-truth labels. It's the most expensive metric to compute but the most directly meaningful.
RAGAS implementation
RAGAS provides reference-free evaluation using LLM-as-a-judge, so you don't need ground-truth labels for most metrics:
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Collect a sample of production queries with retrieved context and responses
eval_data = {
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,  # list of lists
    "ground_truth": ground_truth_answers  # only needed for context_recall, answer_correctness
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings
)

print(result)
# {'faithfulness': 0.89, 'answer_relevancy': 0.91,
#  'context_precision': 0.76, 'context_recall': 0.82}
```
A few caveats on RAGAS: it uses recursive LLM calls internally, so evaluation costs are non-trivial on large samples. It also occasionally fails to extract statements from long responses, producing incorrect scores on edge cases. For production systems in high-stakes domains, Lynx outperforms RAGAS on hallucination detection for long context cases. Run RAGAS for iteration speed, supplement with Lynx for critical accuracy validation. For enterprise-grade evaluation patterns including side-by-side model comparisons and custom rubrics, see enterprise AI evaluation for production-ready performance.
Building your evaluation set
You need at minimum 50-100 labeled query-answer pairs that cover your real use cases. Build this set from:
- Manual curation: 25-30 queries covering core use cases, written by domain experts who know what correct answers look like
- Synthetic generation: Use RAGAS or your LLM to generate questions from your corpus. Cheap and fast, but validate a sample manually. Synthetic questions can be too literal or miss the style of real user queries
- Production queries: Harvest real queries from logs. These are the most representative but require labeling. Start tagging production failures as they happen. After a month you'll have a useful dataset
Freeze your evaluation set when you start iterating. If you keep adding queries, you can't tell whether metric changes are from system improvements or dataset changes.
Automated evaluation in CI/CD
Run evaluation on every change to chunking strategy, embedding model, retrieval parameters, or prompt:
```python
# evaluation/test_rag_quality.py
import pytest
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision

FAITHFULNESS_THRESHOLD = 0.85
CONTEXT_PRECISION_THRESHOLD = 0.75

def test_rag_quality():
    result = evaluate(eval_dataset, metrics=[faithfulness, context_precision])
    assert result["faithfulness"] >= FAITHFULNESS_THRESHOLD, \
        f"Faithfulness {result['faithfulness']:.2f} below threshold {FAITHFULNESS_THRESHOLD}"
    assert result["context_precision"] >= CONTEXT_PRECISION_THRESHOLD, \
        f"Context precision {result['context_precision']:.2f} below threshold"
```
This is a quality gate. If your evaluation set is representative, a regression that breaks retrieval will show up here before it reaches users. 60% of RAG deployments in 2026 include systematic evaluation from day one, up from under 30% in early 2025. PremAI's evaluations overview covers how to set up LLM-as-a-judge scoring and side-by-side model comparisons within a managed platform, including custom evaluation rubrics for domain-specific RAG.
Stage 8: Latency Optimization
Production RAG has a latency budget. Most real-time applications target under 2 seconds total end-to-end. Here's where time actually goes and how to recover it.
Latency accounting
Typical breakdown for a production query:
| Stage | Typical Latency | Notes |
|---|---|---|
| Query embedding | 20-50ms | API or local model |
| Vector search | 5-30ms | Depends on index size and ANN algo |
| BM25 search | 5-20ms | Typically fast, in-memory |
| Reranking | 30-100ms | Cross-encoder or API |
| LLM generation | 500ms-3s+ | Biggest variable |
| Total | 560ms-3.2s+ | |
LLM generation dominates. But retrieval overhead compounds quickly if you're doing query expansion, HyDE, or multiple retrieval passes.
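To see your own breakdown rather than the typical one, wrap each stage in a timer. A minimal sketch (the `time.sleep` stands in for a real pipeline call):

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Collects per-stage wall-clock latencies for one query."""
    def __init__(self):
        self.timings_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[name] = (time.perf_counter() - start) * 1000

timer = StageTimer()
with timer.stage("vector_search"):
    time.sleep(0.01)  # stand-in for the real retrieval call
```

Log `timings_ms` per query and you can attribute latency regressions to a specific stage instead of guessing.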
Semantic caching
Cache results for repeated or similar queries. The same or semantically similar query hitting production dozens of times per hour doesn't need to go through the full pipeline each time.
```python
from gptcache import cache
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Semantic cache: matches similar queries, not just exact
cache.init(
    embedding_func=embedding_function,
    similarity_evaluation=SearchDistanceEvaluation(max_distance=0.15)
)
```
Cache hit rates vary heavily by use case. Internal knowledge bases with repeated queries (support bots, HR assistants) see 30-60% cache hit rates. General-purpose assistants see closer to 10-20%. Even 15% cache hits on a high-traffic system meaningfully reduces both latency and cost. For more on cutting LLM API costs without losing performance, semantic caching is one of several techniques covered there.
Parallel retrieval
Run BM25 and dense retrieval in parallel rather than sequentially:
```python
import asyncio

async def parallel_retrieve(query: str):
    dense_task = asyncio.create_task(dense_retriever.aget_relevant_documents(query))
    sparse_task = asyncio.create_task(bm25_retriever.aget_relevant_documents(query))
    dense_results, sparse_results = await asyncio.gather(dense_task, sparse_task)
    return rrf_merge(dense_results, sparse_results)
```
On async-capable infrastructure, this recovers 10-20ms off retrieval latency. Small gain, free to implement.
Streaming generation
Don't wait for the full LLM response before returning output to the user. Stream tokens as they generate:
```python
async def stream_rag_response(query: str):
    context = await retrieve_and_rerank(query)
    async for chunk in llm.astream(
        prompt_template.format(context=context, query=query)
    ):
        yield chunk.content
```
Streaming makes a 2-second response feel like a 200ms response. The user sees text appearing immediately. For any user-facing application, streaming is mandatory.
Pre-computed embeddings for known queries
For systems with predictable query patterns (dashboards, scheduled reports, fixed FAQ sets), pre-compute and cache embeddings at load time rather than generating them per request.
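The pattern is just a cache built at startup instead of per request. A sketch, where `embed` is a hypothetical stand-in for your real embedding model call:

```python
# Hypothetical embed() stands in for your embedding model call.
def embed(text: str) -> list[float]:
    return [float(len(text))]  # placeholder vector for illustration

KNOWN_QUERIES = ["monthly revenue summary", "open support tickets"]

# Built once at startup, not per request
embedding_cache = {q: embed(q) for q in KNOWN_QUERIES}

def get_query_embedding(query: str) -> list[float]:
    # Fall back to live embedding for unseen queries
    return embedding_cache.get(query) or embed(query)
```

For fixed dashboards and scheduled reports this removes the embedding API from the hot path entirely.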
Chunking and context length
Every extra chunk you pass to the LLM adds both token cost and generation latency. If your reranker is doing its job, k_rerank=3 probably produces better latency and similar quality to k_rerank=8. Run ablations on your evaluation set.
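That ablation can be a plain loop over your evaluation set. A sketch, where `answer_with_k` and `score_answer` are hypothetical stand-ins for your pipeline entry point and your quality metric (e.g. RAGAS faithfulness):

```python
# Hypothetical stand-ins for your pipeline and metric.
def answer_with_k(query: str, k_rerank: int) -> str:
    return f"answer({query}, k={k_rerank})"

def score_answer(query: str, answer: str) -> float:
    return 1.0  # in practice: an LLM-judged or RAGAS metric

eval_queries = ["refund policy?", "sso setup steps?"]

results = {}
for k in (3, 5, 8):
    scores = [score_answer(q, answer_with_k(q, k)) for q in eval_queries]
    results[k] = sum(scores) / len(scores)

# Pick the smallest k whose quality is within tolerance of the best:
# smaller k means fewer tokens and lower latency at similar quality.
best = max(results.values())
chosen_k = min(k for k, s in results.items() if s >= best - 0.02)
```

With the placeholder metric every k ties, so the loop picks the cheapest one; with a real metric the tolerance (here 0.02) encodes how much quality you'll trade for latency.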
Stage 9: Production Monitoring
Without observability, you're flying blind. RAG systems fail in ways that aren't immediately visible. Retrieval quality degrades slowly, a document update breaks a specific query path, embedding drift causes subtle recall drops.
What to instrument
Retrieval layer:
- Retrieval latency (p50, p95, p99) per query type
- Number of documents retrieved vs. filtered by reranker
- Average reranker score of top-1 result (sharp drops indicate retrieval degradation)
- Queries with zero retrieval results (corpus gaps)
Generation layer:
- LLM latency (p50, p95, p99)
- Token counts per query (monitors cost and context length trends)
- Faithfulness score distribution (run on a sampled subset of production queries)
- Citation rate (what fraction of responses include source references)
System layer:
- Vector index size and ingestion lag (time from document update to index availability)
- Embedding model API error rates and latency
- Cache hit rate
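Most of these metrics reduce to percentiles over recorded samples. A minimal nearest-rank percentile, so you can compute p50/p95/p99 from raw latency logs before wiring up a full observability stack:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=0.95 for p95."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(p * len(ordered)) - 1)
    return ordered[idx]

# Example: mostly-fast retrieval with a long tail
latencies_ms = [12, 15, 14, 90, 13, 16, 11, 250, 14, 15]
p50 = percentile(latencies_ms, 0.50)
p95 = percentile(latencies_ms, 0.95)
p99 = percentile(latencies_ms, 0.99)
```

Note how the tail percentiles surface the 250ms outlier that the median hides: that's why the alerting thresholds below are set on p99, not averages.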
LLM observability tools
| Tool | Strengths | Best for |
|---|---|---|
| LangSmith | Deep LangChain integration, trace visualization | LangChain pipelines |
| Arize Phoenix | Framework-agnostic, strong open-source tooling | Multi-framework systems |
| RAGAS + custom dashboards | Focused on RAG metrics | Teams already using RAGAS |
| Langfuse | Open-source, self-hostable, good UX | Privacy-conscious teams |
For PremAI users, the LLM observability patterns guide covers the full monitoring setup including tracing for fine-tuned model deployments. Prem Studio includes built-in evaluation and dataset tooling that wires traces directly into improvement workflows.
Detecting retrieval drift
Embedding drift happens when your query distribution shifts away from your indexed content. A document corpus that was indexed in Q1 may not retrieve well for Q3 queries about new products or changed procedures.
Monitor for it by tracking average similarity scores of top-1 retrieved documents over time. A sustained downward trend in top-1 similarity without a corresponding drop in query volume signals that queries are shifting away from indexed content. Review whether your corpus needs updating.
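One way to mechanize that check, assuming you log top-1 similarity per query: compare the mean of the most recent window against the mean of the window before it, and alert on a sustained drop. A sketch:

```python
from statistics import mean

def drift_alert(similarities: list[float], window: int = 100,
                drop: float = 0.05) -> bool:
    """Flag when the recent window's mean top-1 similarity falls
    more than `drop` below the preceding baseline window."""
    if len(similarities) < 2 * window:
        return False  # not enough history to compare
    baseline = mean(similarities[-2 * window:-window])
    recent = mean(similarities[-window:])
    return baseline - recent > drop

# Example: top-1 similarity slid from ~0.82 down to ~0.70
history = [0.82] * 100 + [0.70] * 100
```

The window size and drop threshold are assumptions to tune against your own query volume; the point is comparing windows rather than alerting on single low-similarity queries.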
Alerting thresholds
| Metric | Warning | Critical |
|---|---|---|
| Retrieval latency p99 | > 200ms | > 500ms |
| LLM generation latency p99 | > 3s | > 6s |
| Zero-result retrieval rate | > 5% | > 15% |
| Faithfulness score (sampled) | < 0.80 | < 0.70 |
| Embedding API error rate | > 1% | > 5% |
Zero-result rate is particularly telling. If 10% of queries return no relevant documents, your corpus has coverage gaps. Those users get no useful answer.
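The thresholds above translate directly into an alert check. A sketch with the table's warning/critical values hard-coded (the metric names are illustrative):

```python
THRESHOLDS = {
    # metric: (warning, critical, direction); "above" = higher is worse
    "retrieval_p99_ms": (200, 500, "above"),
    "llm_p99_s": (3, 6, "above"),
    "zero_result_rate": (0.05, 0.15, "above"),
    "faithfulness": (0.80, 0.70, "below"),
    "embed_error_rate": (0.01, 0.05, "above"),
}

def alert_level(metric: str, value: float) -> str:
    warn, crit, direction = THRESHOLDS[metric]
    breached = (lambda t: value > t) if direction == "above" else (lambda t: value < t)
    if breached(crit):
        return "critical"
    if breached(warn):
        return "warning"
    return "ok"
```

Note the direction flag: faithfulness alerts when it falls *below* threshold, unlike the latency and error-rate metrics.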
Advanced Patterns
GraphRAG for interconnected knowledge
For knowledge bases where entities relate to each other (org charts, product hierarchies, code dependencies, regulatory cross-references), standard vector retrieval misses relationship traversal. GraphRAG builds a knowledge graph alongside the vector index and uses graph traversal to expand context.
LinkedIn's GraphRAG deployment reduced support resolution time by 28.6% compared to standard RAG for queries requiring reasoning across multiple connected entities. Worth evaluating when your data is highly structured and relational. Implementation cost is high. Skip it unless vector RAG is already working well and you have a specific class of queries that require multi-hop reasoning.
Agentic RAG
Some queries require multiple sequential retrievals. For example, "compare our Q3 policy against the regulatory change from last month and identify gaps" can't be answered in a single retrieval pass. Agentic RAG gives the LLM control over the retrieval process, letting it issue multiple queries, synthesize partial results, and decide when it has enough context.
from langchain.agents import Tool, create_react_agent

tools = [
    Tool(name="retrieve", func=retriever.get_relevant_documents,
         description="Retrieve relevant documents for a query"),
    Tool(name="retrieve_by_date", func=date_filtered_retriever,
         description="Retrieve documents updated after a specific date"),
]
agent = create_react_agent(llm=llm, tools=tools, prompt=react_prompt)
Agentic RAG multiplies LLM calls and adds latency. For most queries, it's overkill. Use it specifically for complex multi-step queries where single-pass retrieval demonstrably fails. Before investing in custom agentic RAG, read chatbots vs AI agents: which is right for your business to clarify whether an agentic retrieval pattern is actually what your use case requires. Then see agentic framework patterns for implementation options.
RAG with fine-tuned models
Off-the-shelf LLMs handle general RAG reasonably well but struggle with domain-specific formatting, citation styles, and terminology. Fine-tuning the generation model on examples of your desired input-output format, with retrieved context as part of the training input, produces better adherence to domain conventions and output formatting.
This is the strongest production option for high-stakes enterprise use cases: healthcare, legal, finance. The enterprise fine-tuning workflow covers the full dataset and training setup. For RAG specifically, include retrieved context in your training examples so the model learns to use it rather than ignoring it and falling back to parametric knowledge.
Privacy concerns in RAG
RAG systems that index sensitive documents have specific privacy risks. Retrieved context can leak across users if access control is not enforced at retrieval time. Embedding models send text to third-party APIs during ingestion and query time. The LLM generation step sends retrieved context to an inference endpoint.
For regulated industries, see the detailed breakdown of RAG privacy risks and mitigation patterns. The short version: enforce access control as a metadata filter at vector search time, use self-hosted embedding models for sensitive data, and ensure your inference endpoint has appropriate data retention policies.
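A sketch of access control as a retrieval-time metadata filter, using a hypothetical in-memory store (real vector DBs express this as a filter clause on the similarity query itself):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    allowed_groups: set

CORPUS = [
    Doc("Public holiday calendar", {"all"}),
    Doc("Executive comp bands", {"hr", "exec"}),
]

def retrieve_for_user(query: str, user_groups: set) -> list:
    # Filter BEFORE (or as part of) similarity search,
    # never after generation -- context must not leak into the prompt.
    visible = [d for d in CORPUS if d.allowed_groups & (user_groups | {"all"})]
    return visible  # real system: rank `visible` by similarity to query

docs = retrieve_for_user("comp bands", {"engineering"})
```

The key design point is *where* the filter sits: filtering after generation means the restricted text already reached the LLM, and possibly the response.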
Common Failure Modes and Fixes
Retrieval returns correct documents but wrong sections. Your chunks are too large. The right answer is in the chunk but buried under unrelated content, and the LLM either misses it or hallucinates. Fix: reduce chunk size or switch to hierarchical chunking.
Retrieval misses obvious relevant documents. Usually a keyword mismatch: the query uses different terminology than the documents. Fix: add BM25 to your dense-only retrieval, or use query expansion to add synonyms before retrieval.
Faithfulness score is high but answers are still wrong. The LLM is accurately reporting what's in the context, but the context itself is wrong or outdated. Fix: check document freshness; your re-indexing pipeline may be stale.
Latency spikes at unpredictable times. Usually the reranker or embedding API under load. Fix: add p99 latency monitoring and set timeouts on external API calls with a fallback that skips reranking if it exceeds budget.
Different users get different answers to the same question. Access control at the metadata filter level means different users retrieve different documents. That's correct behavior if intended, and a bug if not. Verify that the access control logic is consistent and document it explicitly.
Good answers in testing, poor answers in production. Your test query set doesn't match production query distribution. Harvest production queries and add them to your evaluation set. This is the most common reason production RAG underperforms expectations.
FAQ
RAG vs. fine-tuning: which should I use?
They solve different problems. RAG gives the model access to information it wasn't trained on and keeps that information current without retraining. Fine-tuning changes how the model reasons, formats outputs, and applies domain expertise, but doesn't give it new factual knowledge. Use RAG for knowledge access, fine-tuning for behavior and output quality. Combining both produces the best results for enterprise use cases. See the comparison in detail.
How many chunks should I retrieve (k)?
Retrieve more than you return. A common starting point: retrieve 20 candidates, rerank, return top 5 to the LLM. The right k_retrieve depends on how many documents in your corpus are potentially relevant. Higher corpus overlap means you need a larger candidate set. Tune by monitoring context precision: if it's low, either increase k_retrieve to give the reranker more candidates, or fix your embedding/chunking.
How do I handle contradictory information in retrieved context?
This is an underaddressed problem in most RAG guides. When two retrieved chunks contradict each other, the LLM often picks one arbitrarily. Mitigations: add document freshness metadata and instruct the LLM to prefer more recent sources; add a source trust hierarchy (internal policy docs over blog posts); or surface the contradiction explicitly in the response rather than hiding it. For compliance use cases, surfacing contradictions is often the correct behavior.
How often should I re-index my corpus?
It depends on how frequently your source documents change. For static corpora (historical records, archived policies), re-index on-demand when documents change. For dynamic corpora (support tickets, product updates, regulatory changes), run daily or event-driven re-indexing. Monitor index staleness as a metric: time from document update to index availability should have an SLA.
Can I use RAG with structured data?
Yes, but differently. For tabular data, embedding rows as text works for small tables. For large structured datasets, use text-to-SQL instead: let the LLM generate a SQL query against your database rather than embedding and retrieving from a vector index. PremSQL handles this pattern specifically for local-first pipelines. Hybrid approaches using vector retrieval for unstructured text and SQL for structured data work well for enterprise knowledge bases that mix both.
What's the minimum evaluation set size?
50 manually curated queries covering your core use cases gives a meaningful signal. 100 is better. Below 50, metric variance from random effects makes it hard to detect real improvements. Supplement with synthetic queries for breadth, but anchor evaluation on your manually curated set.
For teams building production RAG on top of fine-tuned or custom models, PremAI's platform handles the full lifecycle: dataset management, fine-tuning, evaluation, and self-hosted inference with zero data retention. Book a technical call to talk through your architecture.