RAG Evaluation: Metrics, Frameworks & Testing (2026)

The complete guide to RAG evaluation metrics: faithfulness, context precision, recall, and LLM-as-judge, with working code for Ragas and DeepEval and CI/CD integration.

Most RAG pipelines pass demos and fail production. The reasons are predictable: hallucinated answers that are technically grounded in retrieved context, retrieval that returns the right documents in the wrong order, chunks that contain the answer but get cut at the wrong boundary.

You can't catch any of this by eyeballing outputs. And with roughly 70% of engineers either running RAG in production or planning to ship it within a year, evaluation infrastructure is a prerequisite, not an afterthought.

This guide covers the metrics that actually matter, how to interpret them diagnostically, working code for the main frameworks, and how to wire it all into a CI/CD pipeline that catches regressions before they hit users.


Why Standard LLM Evaluation Misses Half the Problem

When you evaluate a standard LLM, you care about whether the output is correct, coherent, and on-topic. RAG adds a second failure mode: the retrieval step.

A study published in the Journal of Machine Learning Research found that retrieval accuracy alone explains only 60% of variance in end-to-end RAG quality. The other 40% comes from how well the model uses the retrieved context once it has it.

That means your RAG system can fail in two independent ways:

Retrieval failures:

  • Wrong documents retrieved entirely
  • Right documents retrieved but ranked too low to make the context window
  • Documents that contain the answer but get chunked at the wrong boundary

Generation failures:

  • Model ignores retrieved context and answers from training data
  • Model fabricates details not present in retrieved chunks
  • Model answers a related but different question than what was asked

Standard BLEU or ROUGE scores miss both. Measuring only end-to-end answer quality misses which layer is broken. Good RAG evaluation measures retrieval and generation separately so you know where to actually fix things.


The Core Metrics: What They Measure and What They Tell You

There are five metrics that cover the full RAG evaluation surface. Understand what each one is diagnosing before you run it.

Faithfulness

What it measures: Whether every claim in the generated answer is supported by the retrieved context.

Faithfulness is calculated by decomposing the answer into individual statements, then checking each one against the retrieved chunks. A score of 0.6 means roughly 40% of statements in the answer have no basis in what was retrieved.

What a low score tells you: The model is using training knowledge to fill gaps in retrieval. This is hallucination in the strictest technical sense. The fix is usually better retrieval so the model has less reason to improvise, or a more constrained system prompt that explicitly tells the model to answer only from context.

Threshold to target: 0.8 or higher for most production use cases. Regulated industries (finance, healthcare, legal) should target 0.9+.
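The decomposition step can be sketched in a few lines. The claims and verdicts below are hard-coded stand-ins for what an LLM judge would produce:

```python
# Toy sketch of the faithfulness calculation: decompose the answer into
# claims, check each against the retrieved context, then take the ratio
# of supported claims to total claims.
claims = [
    "Enterprise contracts allow refunds.",     # supported by context
    "The refund window is 30 days.",           # supported by context
    "Refunds are processed within 48 hours.",  # not present in context
]
supported = [True, True, False]  # per-claim judge verdicts

faithfulness = sum(supported) / len(claims)
print(f"faithfulness = {faithfulness:.2f}")  # 2 of 3 claims grounded -> 0.67
```

In real frameworks both the decomposition and the per-claim verdicts come from LLM calls, which is where the cost discussed later comes from.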

Answer Relevance

What it measures: Whether the generated answer actually addresses the question asked.

This is different from faithfulness. An answer can be completely grounded in retrieved context and still be irrelevant if the retrieval pulled related-but-not-matching documents.

What a low score tells you: Two possible causes. Either the retrieval is bringing back tangentially related chunks that don't contain the actual answer, or the generation prompt is too verbose and adds background context that dilutes the relevance score. Check whether high faithfulness and low relevance appear together: if they do, your retrieval is the problem.

Threshold to target: 0.75 or higher.

Context Precision

What it measures: Whether the top-ranked retrieved chunks are the ones actually used in the answer. Specifically: are the most relevant chunks appearing early in the retrieved list?

A context precision of 0.4 means your retriever is returning the right documents somewhere in the results, but the relevant ones are ranked low. The LLM is getting overwhelmed by irrelevant context before it reaches the useful chunks.

What a low score tells you: Your re-ranking step needs work. The embedding model is retrieving the right documents but sorting them poorly. Try adding a cross-encoder re-ranker on top of your vector retrieval.

Threshold to target: 0.7 or higher.
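One common formulation (roughly what Ragas uses) averages precision@k over the positions where relevant chunks appear, so the same chunks score higher when ranked earlier. A minimal sketch with a hand-labeled relevance list:

```python
# Relevance of each retrieved chunk, in rank order (1 = relevant).
relevant = [0, 1, 1, 0, 1]

precisions = []
hits = 0
for k, rel in enumerate(relevant, start=1):
    hits += rel
    if rel:
        precisions.append(hits / k)  # precision@k at each relevant position

context_precision = sum(precisions) / max(1, sum(relevant))
print(f"context precision = {context_precision:.2f}")  # mid-list ranking -> 0.59
```

The same three relevant chunks ranked first would score 1.0, which is why this metric isolates ranking quality rather than retrieval coverage.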

Context Recall

What it measures: Whether all the information needed to answer the question correctly was present in the retrieved context.

This requires a reference answer to compare against. For each piece of information in the reference answer, it checks whether a supporting chunk was retrieved.

What a low score tells you: Your retriever is missing relevant documents entirely. Common causes: chunk size too small (answer spans multiple chunks), embedding model doesn't capture domain-specific terminology well, top-K is set too low. Try increasing chunk overlap, increasing K, or switching embedding models.

Threshold to target: 0.75 or higher.

Hallucination Rate

What it measures: The proportion of responses that contain unsupported claims. Closely related to faithfulness but often expressed as a production monitoring metric rather than a per-response score.

In production, you typically track this as: percentage of queries where faithfulness falls below your threshold. A 5% hallucination rate means 1 in 20 responses contains fabricated information, which is too high for any customer-facing application.

What a high rate tells you: See faithfulness above, but at production scale. If hallucination spikes after a document ingestion update, your new chunks are probably worse quality than what they replaced.
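Expressed as code, the monitoring metric is just the share of sampled responses whose faithfulness falls below your threshold (scores here are illustrative):

```python
# Hallucination rate over a production sample: fraction of responses
# whose per-response faithfulness score is below the threshold.
faithfulness_scores = [0.95, 0.88, 0.62, 0.91, 0.74, 0.97, 0.83, 0.79]
THRESHOLD = 0.8

hallucination_rate = (
    sum(s < THRESHOLD for s in faithfulness_scores) / len(faithfulness_scores)
)
print(f"hallucination rate = {hallucination_rate:.1%}")  # 3 of 8 -> 37.5%
```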


Retrieval-Specific Metrics: The Ones Most Articles Skip

The five metrics above cover the combined retrieval + generation pipeline. For pure retrieval evaluation, you need separate metrics.

Precision@K and Recall@K

Precision@K measures: of the top K documents retrieved, what fraction were actually relevant? Recall@K measures: of all relevant documents in your corpus, what fraction appeared in the top K results?

These require labeled relevance judgments, which is why many teams skip them. But they are the most diagnostic metrics for figuring out retrieval problems specifically.

A good starting target: Precision@5 at 0.7 or higher for narrow-domain knowledge bases. Recall@20 at 0.8 or higher for broad corpus search.
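Both metrics are cheap to compute once you have relevance labels. A minimal sketch with made-up document IDs:

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int):
    """Precision@K and Recall@K from ranked doc IDs and labeled relevant IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k, hits / len(relevant)

# 5 docs retrieved; the corpus has 3 labeled-relevant docs, 2 were found.
p, r = precision_recall_at_k(
    retrieved=["d3", "d1", "d7", "d2", "d9"],
    relevant={"d1", "d2", "d4"},
    k=5,
)
print(f"Precision@5 = {p:.2f}, Recall@5 = {r:.2f}")  # 0.40 and 0.67
```

Note the tension: raising K improves recall but usually lowers precision, which is exactly the trade-off your re-ranker has to resolve.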

Mean Reciprocal Rank (MRR)

MRR measures where the first relevant document appears in your ranked results. If the correct document is always ranked 3rd or 4th instead of 1st, your context precision suffers and the model struggles to prioritize what matters.

MRR is especially useful for FAQ systems and help-desk bots where users need the right answer fast, not a collection of loosely relevant chunks.
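MRR averages the reciprocal of the first relevant rank across queries, so a doc ranked 1st contributes 1.0 and a doc ranked 3rd contributes 1/3:

```python
def mean_reciprocal_rank(ranked_results: list[list[str]],
                         relevant: list[set[str]]) -> float:
    """MRR over queries: 1/rank of the first relevant doc, 0 if none retrieved."""
    total = 0.0
    for ranked, rel in zip(ranked_results, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1 / rank
                break
    return total / len(ranked_results)

# Query 1 ranks the right doc 1st (1.0); query 2 ranks it 3rd (1/3).
score = mean_reciprocal_rank(
    [["d1", "d2", "d3"], ["d5", "d6", "d4"]],
    [{"d1"}, {"d4"}],
)
print(f"MRR = {score:.2f}")  # (1.0 + 0.333) / 2 -> 0.67
```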

NDCG (Normalized Discounted Cumulative Gain)

NDCG accounts for both relevance and position, giving higher weight to relevant documents that appear earlier. Research indicates it correlates more strongly with end-to-end RAG quality than binary precision/recall because it rewards the right ordering, not just the right documents. That makes it worth tracking even though it requires graded relevance labels.
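The standard formulation divides the discounted cumulative gain of your actual ranking by that of the ideal ranking:

```python
import math

def ndcg(relevances: list[float]) -> float:
    """NDCG from graded relevance labels of retrieved docs, in rank order."""
    def dcg(rels):
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Same documents, different orderings: putting the most relevant doc
# first scores 1.0; burying it drops the score.
print(f"{ndcg([3, 2, 0]):.2f}")  # 1.00
print(f"{ndcg([0, 2, 3]):.2f}")  # 0.65
```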


Metrics Summary Table

| Metric | What it evaluates | Requires reference? | Key question answered |
|---|---|---|---|
| Faithfulness | Generation | No | Is the answer grounded in context? |
| Answer Relevance | Generation | No | Does the answer address the question? |
| Context Precision | Retrieval | No | Are the best chunks ranked first? |
| Context Recall | Retrieval | Yes | Did retrieval find all needed info? |
| Hallucination Rate | Generation | No | What % of answers are fabricated? |
| Precision@K | Retrieval | Yes | How many retrieved docs are relevant? |
| Recall@K | Retrieval | Yes | Did retrieval miss important docs? |
| MRR | Retrieval | Yes | How early does the right doc appear? |
| NDCG | Retrieval | Yes | Is the ranking quality good? |

LLM-as-Judge: How It Works and the Cost Problem

Most of the generation metrics above (faithfulness, answer relevance) are scored by an LLM judge. The judge reads the question, the retrieved context, and the answer, then scores it.

LLM-as-judge is the best method available for evaluating nuanced text quality. Tools using GPT-4o as judge achieve over 80% accuracy at distinguishing genuinely relevant context from hard negatives designed to look relevant.

The problem is cost. Evaluation with an LLM judge means one or more additional LLM calls per test case. At scale this gets expensive fast. A dataset of 500 questions with five metrics each runs 2,500+ LLM calls per eval run.
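A back-of-envelope estimator makes the scaling concrete. The token count and per-token price below are illustrative assumptions; plug in your judge model's actual pricing:

```python
# Rough judge-cost estimator. tokens_per_call and usd_per_1k_tokens are
# assumptions, not real prices - check your provider's current rates.
def eval_cost_usd(n_cases: int, n_metrics: int,
                  tokens_per_call: int = 1500,
                  usd_per_1k_tokens: float = 0.0002) -> tuple[int, float]:
    calls = n_cases * n_metrics  # one judge call per metric per case
    return calls, calls * tokens_per_call / 1000 * usd_per_1k_tokens

calls, cost = eval_cost_usd(n_cases=500, n_metrics=5)
print(f"{calls} judge calls, ~${cost:.2f} per eval run")  # 2500 calls, ~$0.75
```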

Practical strategies to manage this:

Use a smaller model for routine evals. GPT-4o-mini or a capable open-source model handles most evaluation tasks accurately enough. Reserve GPT-4o for evaluating edge cases flagged by the smaller model.

Run full evals on a representative sample. You do not need to eval every production query. Sample 5-10% of production traffic for continuous monitoring, run full evals on your golden test set in CI.

Self-hosted models for evaluation. If you run a fine-tuned model for inference, you can use the same infrastructure for evaluation at zero marginal API cost. This is particularly relevant if you are running private AI infrastructure where data cannot go to cloud APIs anyway.


Building Your Evaluation Dataset

Bad eval dataset = misleading scores. This is where most teams cut corners and pay for it later.

Types of datasets

Golden dataset: Manually curated question-answer pairs where the answers are verifiably correct. 50-200 questions is enough to start. Cover edge cases, multi-hop questions, and questions where the answer spans multiple documents.

Synthetic dataset: LLM-generated questions from your actual documents. Fast to create, broad coverage, but lower quality than human-curated. Use for broad regression testing, not for precise metric calibration.

Production-sourced dataset: Real queries from users, filtered and labeled. The most representative test set you can have. Build this over time by sampling and labeling production traffic.

A practical starting point: 50 golden questions for core metric tracking, 500 synthetic questions for regression testing, production queries feeding in as your system matures.

Generating a synthetic dataset with Ragas

from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from llama_index.core import SimpleDirectoryReader

# Load your documents
documents = SimpleDirectoryReader("./docs").load_data()

# Set up generator with your LLM
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings
)

# Generate test set: mix of simple and multi-hop questions
dataset = generator.generate_with_llama_index_docs(
    documents,
    testset_size=100,
    distributions={
        "simple": 0.5,         # Single-hop factual questions
        "multi_context": 0.35, # Require multiple chunks
        "reasoning": 0.15      # Require inference
    }
)

# Export for review before using
df = dataset.to_pandas()
df.to_csv("eval_dataset.csv", index=False)
print(f"Generated {len(df)} test cases")
print(df[["question", "ground_truth"]].head())

Always manually review synthetic datasets before using them to track metrics. LLMs generate plausible-sounding questions that reference content not actually in your documents. A 10-minute review of 100 synthetic questions catches most of these.


Ragas: Code and Usage

Ragas is the most widely adopted open-source RAG evaluation framework. It grew from a 2023 research paper on reference-free RAG evaluation and has become the de facto standard for the core five metrics.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness
)
from datasets import Dataset

# Your RAG pipeline output
# Each entry needs: question, answer, contexts, ground_truth
data = {
    "question": [
        "What is the refund policy for enterprise contracts?",
        "How do I upgrade my storage quota?",
        "What SLA uptime does the platform guarantee?"
    ],
    "answer": [
        "Enterprise contracts have a 30-day refund window...",
        "Storage upgrades can be requested from the account settings...",
        "The platform guarantees 99.9% uptime under the standard SLA..."
    ],
    "contexts": [
        ["Enterprise customers are eligible for a full refund within 30 days...",
         "Refund requests must be submitted via the enterprise portal..."],
        ["To upgrade storage, navigate to Settings > Storage > Upgrade...",
         "Storage quotas reset on the first of each billing month..."],
        ["Platform uptime SLA is 99.9% for standard plans, 99.95% for enterprise...",
         "SLA credits apply when monthly uptime falls below guaranteed levels..."]
    ],
    "ground_truth": [
        "Enterprise contracts allow full refunds within 30 days of purchase.",
        "Storage upgrades are available through account settings.",
        "The platform SLA guarantees 99.9% uptime for standard plans."
    ]
}

dataset = Dataset.from_dict(data)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness
    ]
)

print(results)
# Output: {'faithfulness': 0.87, 'answer_relevancy': 0.82, 
#          'context_precision': 0.79, 'context_recall': 0.74,
#          'answer_correctness': 0.81}

# Get per-question breakdown
df = results.to_pandas()
print(df[["question", "faithfulness", "context_precision"]].to_string())

An honest assessment of Ragas:

Ragas is good for quick experimental evaluation and works well when you want the standard RAG metrics with minimal setup. The main pain point reported consistently by developers: NaN scores appear when the LLM judge returns invalid JSON during metric calculation. There is no graceful fallback, so a single bad API response can fail an entire eval run. Pin your Ragas version and use try/except around eval calls in CI.


DeepEval: Code and Usage

DeepEval takes a test-driven development approach. Evaluations are unit tests with pass/fail thresholds, run through pytest, and designed to integrate directly into CI pipelines.

import pytest
from deepeval import assert_test, evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric
)
from deepeval.test_case import LLMTestCase

# Your RAG pipeline function
def run_rag_pipeline(question: str) -> tuple[str, list[str]]:
    # Returns (answer, retrieved_chunks)
    answer = rag_chain.invoke(question)
    chunks = retriever.invoke(question)
    return answer, [c.page_content for c in chunks]

# Define thresholds - set these based on your acceptable quality floor
contextual_precision = ContextualPrecisionMetric(threshold=0.7, model="gpt-4o-mini")
contextual_recall = ContextualRecallMetric(threshold=0.75, model="gpt-4o-mini")
contextual_relevancy = ContextualRelevancyMetric(threshold=0.7, model="gpt-4o-mini")
answer_relevancy = AnswerRelevancyMetric(threshold=0.75, model="gpt-4o-mini")
faithfulness = FaithfulnessMetric(threshold=0.8, model="gpt-4o-mini")

# Test cases with expected outputs as ground truth
test_questions = [
    {
        "input": "What is the refund policy for enterprise contracts?",
        "expected_output": "Enterprise contracts allow full refunds within 30 days."
    },
    {
        "input": "What SLA uptime does the platform guarantee?",
        "expected_output": "99.9% uptime for standard plans, 99.95% for enterprise."
    }
]

@pytest.mark.parametrize("test_data", test_questions)
def test_rag_quality(test_data):
    question = test_data["input"]
    expected = test_data["expected_output"]
    
    # Run your actual RAG pipeline
    actual_output, retrieval_context = run_rag_pipeline(question)
    
    test_case = LLMTestCase(
        input=question,
        actual_output=actual_output,
        expected_output=expected,
        retrieval_context=retrieval_context
    )
    
    assert_test(test_case, [
        contextual_precision,
        contextual_recall,
        contextual_relevancy,
        answer_relevancy,
        faithfulness
    ])

Run it locally:

deepeval test run test_rag.py

DeepEval's key advantage over Ragas: every failing metric explains why it failed. Instead of a score of 0.4 with no context, you get the LLM judge's reasoning for why the answer was considered unfaithful. This makes debugging significantly faster when a metric regresses.

Custom metric with DeepEval's GEval

The five standard metrics do not cover domain-specific quality requirements. A legal RAG system needs to evaluate whether citations are properly attributed. A medical RAG system needs to check whether recommendations are appropriately caveated. GEval lets you define these in plain language.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Custom metric for enterprise document Q&A
citation_quality = GEval(
    name="CitationQuality",
    criteria="""Evaluate whether the answer:
    1. References specific sections or clauses from the retrieved documents
    2. Does not make claims beyond what's explicitly stated in the context
    3. Acknowledges when the retrieved context does not fully answer the question
    
    Score 1 if all three criteria are met, 0 if any are violated.""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT
    ],
    threshold=0.7
)

TruLens: A Different Approach

TruLens takes an instrumentation approach rather than batch evaluation. You wrap your RAG pipeline in a recorder and it evaluates each call as it happens, which makes it well suited for monitoring development experiments rather than CI testing.

from trulens.apps.langchain import TruChain
from trulens.core import Feedback, Select, TruSession
from trulens.providers.openai import OpenAI
import numpy as np

session = TruSession()
session.reset_database()

provider = OpenAI(model_engine="gpt-4o-mini")

# Define feedback functions
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())
    .on_output()
)

f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()
    .on_output()
)

f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(Select.RecordCalls.retrieve.rets[:])
    .aggregate(np.mean)
)

# Wrap your LangChain RAG chain
tru_recorder = TruChain(
    rag_chain,
    app_name="ProductionRAG",
    app_version="v2.1",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance]
)

# Use as context manager - all calls are automatically evaluated
with tru_recorder as recording:
    response = rag_chain.invoke({"question": "What are the contract renewal terms?"})

# View results
session.get_leaderboard()

TruLens is most useful during the experimentation phase when you are comparing different chunking strategies, embedding models, or retrieval configurations and want to see metric scores for every run in a dashboard. It is less suited for CI/CD quality gates than DeepEval because it was not designed for batch testing with pass/fail thresholds.


Framework Comparison

| | Ragas | DeepEval | TruLens |
|---|---|---|---|
| Best for | Quick experiments, standard metrics | CI/CD testing, production gates | Dev-time monitoring, A/B experiments |
| CI/CD integration | Manual setup | Native pytest integration | Not designed for CI |
| Metric explainability | Score only | Score + reasoning | Score + CoT reasoning |
| Custom metrics | Limited | GEval (natural language) | Custom feedback functions |
| Synthetic data gen | Built-in | Built-in (more flexible) | No |
| Self-hosted judge | Any OpenAI-compat model | Any model | OpenAI, Bedrock, local |
| Stability | NaN score issues reported | More robust error handling | Stable for monitoring use |
| Licensing | Apache 2.0 | MIT | MIT |

For most production teams: use DeepEval for CI/CD quality gates, Ragas for initial metric exploration and synthetic dataset generation, and TruLens or Langfuse for production monitoring.


CI/CD Integration: Full GitHub Actions Setup

The goal of CI/CD eval integration is simple: fail the build when RAG quality drops below your thresholds, before a PR gets merged.

# test_rag_regression.py
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric
)
from deepeval.test_case import LLMTestCase

# Load your golden dataset - keep this in version control
GOLDEN_DATASET = [
    Golden(
        input="What is the maximum file size for document uploads?",
        expected_output="The maximum file size is 50MB per document."
    ),
    Golden(
        input="How many concurrent fine-tuning experiments can I run?",
        expected_output="Up to 6 concurrent fine-tuning experiments on the standard plan."
    ),
    Golden(
        input="Does the platform support HIPAA compliance?",
        expected_output="Yes, the platform is HIPAA compliant with full audit logging."
    ),
    # Add 50-200 more golden questions covering your domain
]

# Metrics with thresholds
METRICS = [
    ContextualPrecisionMetric(threshold=0.7, model="gpt-4o-mini"),
    ContextualRecallMetric(threshold=0.70, model="gpt-4o-mini"),
    AnswerRelevancyMetric(threshold=0.75, model="gpt-4o-mini"),
    FaithfulnessMetric(threshold=0.80, model="gpt-4o-mini"),
]

def get_rag_response(question: str) -> tuple[str, list[str]]:
    """Your actual RAG pipeline - import from your app code."""
    from your_app.rag import rag_chain, retriever
    
    answer = rag_chain.invoke(question)
    chunks = retriever.invoke(question)
    return answer, [c.page_content for c in chunks]

@pytest.mark.parametrize("golden", GOLDEN_DATASET)
def test_rag_regression(golden: Golden):
    actual_output, retrieval_context = get_rag_response(golden.input)
    
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=actual_output,
        expected_output=golden.expected_output,
        retrieval_context=retrieval_context
    )
    
    assert_test(test_case, METRICS)

# .github/workflows/rag-eval.yml
name: RAG Quality Gate

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  rag-evaluation:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      
      - name: Install dependencies
        run: |
          pip install deepeval ragas your-app-package
      
      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
          # If using self-hosted inference, point to your endpoint:
          # OPENAI_API_BASE: ${{ secrets.INFERENCE_ENDPOINT }}
        run: |
          deepeval test run tests/test_rag_regression.py \
            --exit-on-first-failure \
            --verbose
      
      - name: Upload evaluation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: rag-eval-results
          path: deepeval-results/

A few important implementation notes:

Set thresholds conservatively at first. If you start with faithfulness at 0.9 before your pipeline is mature, you will fail every build and developers will delete the workflow. Start at 0.7, prove the system works, then raise thresholds as quality improves.

Keep your golden dataset in version control alongside your code. When you change chunking strategy or swap embedding models, the regression test tells you immediately whether quality improved or dropped.

The --exit-on-first-failure flag speeds up CI runs when there are obvious regressions. For a full quality report, remove it and let all tests run.


Evaluating Fine-Tuned Models

Standard RAG evaluation assumes a general-purpose model. When you fine-tune a model on domain data, the evaluation picture changes.

A fine-tuned model will score differently on the same test cases because it has domain knowledge baked in. This creates a measurement challenge: some of that extra knowledge is good (it fills gaps when retrieval misses), and some of it is bad (the model overrides retrieved context with memorized but potentially outdated information).

For production fine-tuning workflows, you want to track:

Faithfulness before and after fine-tuning. If faithfulness drops after fine-tuning, the model is using its new knowledge to override retrieved context rather than citing it. This is a problem in regulated applications where auditability matters.

Answer correctness against a pre-fine-tuning baseline. Fine-tuning should improve answer quality on in-domain questions. Track the delta.

Hallucination rate on out-of-domain questions. Fine-tuned models can hallucinate more confidently on questions outside their training distribution. Test with questions from adjacent but different domains.

# Compare base model vs fine-tuned model on the same eval set
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Base model results
base_results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_correctness],
    llm=LangchainLLMWrapper(
        ChatOpenAI(
            model="gpt-4o-mini",
            base_url="http://base-model-endpoint:8000/v1"
        )
    )
)

# Fine-tuned model results - same eval set, same metrics
finetuned_results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_correctness],
    llm=LangchainLLMWrapper(
        ChatOpenAI(
            model="your-finetuned-model",
            base_url="http://finetuned-model-endpoint:8000/v1"
        )
    )
)

print(f"Faithfulness delta: {finetuned_results['faithfulness'] - base_results['faithfulness']:+.3f}")
print(f"Correctness delta: {finetuned_results['answer_correctness'] - base_results['answer_correctness']:+.3f}")

Both endpoints take the same format because they follow the OpenAI-compatible API standard. The eval framework does not care whether it is hitting a cloud model or a self-hosted inference server.


Production Monitoring: Beyond Offline Evaluation

Offline evaluation catches regressions before deployment. Production monitoring catches issues that only appear at scale.

Metrics to track in production

| Metric | Alert threshold | What to check when it spikes |
|---|---|---|
| Faithfulness (sampled) | < 0.75 | Recent document ingestion quality |
| Answer relevance (sampled) | < 0.70 | Query distribution shift |
| Hallucination rate | > 5% | Retrieval coverage for new query types |
| P95 retrieval latency | > 500ms | Index size, embedding model load |
| Context utilization | < 40% | Chunk size, overlap settings |
| User negative feedback rate | > 10% | All of the above |

Context utilization is worth highlighting because it is rarely mentioned. If your retrieved context contains 5 chunks but the answer only references 1, you are paying for retrieval that is not contributing. Either your top-K is too high or your re-ranker is not filtering well enough.
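There is no single standard definition of context utilization; a crude word-overlap heuristic gives a usable first approximation. The function below is a rough sketch, not how any particular framework computes it (real systems use an LLM judge or citation tracking):

```python
# Crude context-utilization heuristic: a chunk counts as "used" if enough
# of its words also appear in the answer. Word overlap is a weak proxy,
# but it is free to compute and tracks gross changes over time.
def context_utilization(answer: str, chunks: list[str],
                        min_shared_words: int = 5) -> float:
    answer_words = set(answer.lower().split())
    used = sum(
        1 for chunk in chunks
        if len(set(chunk.lower().split()) & answer_words) >= min_shared_words
    )
    return used / len(chunks) if chunks else 0.0
```

A persistently low value with a high top-K is the signal described above: you are retrieving chunks the generator never touches.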

Setting up sampling-based production eval

# production_eval.py - run this as a scheduled job, not on every request
import random
from datetime import datetime, timedelta
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate

def sample_production_queries(db_connection, hours: int = 24, sample_rate: float = 0.05):
    """Pull a sample of recent production queries for evaluation."""
    since = datetime.now() - timedelta(hours=hours)
    recent_queries = db_connection.query(
        "SELECT question, answer, retrieved_chunks FROM rag_logs WHERE created_at > %s",
        (since,)
    )
    return random.sample(recent_queries, max(1, int(len(recent_queries) * sample_rate)))

def run_production_eval(sample):
    test_cases = [
        LLMTestCase(
            input=row["question"],
            actual_output=row["answer"],
            retrieval_context=row["retrieved_chunks"]
        )
        for row in sample
    ]
    
    results = evaluate(
        test_cases=test_cases,
        metrics=[
            FaithfulnessMetric(threshold=0.75, model="gpt-4o-mini"),
            AnswerRelevancyMetric(threshold=0.70, model="gpt-4o-mini"),
        ],
        run_async=True  # Parallel evaluation reduces cost and time
    )
    
    # Alert if average faithfulness across the sample falls below threshold.
    # evaluate() returns per-test-case results, so aggregate the metric scores.
    faithfulness_scores = [
        metric.score
        for test_result in results.test_results
        for metric in test_result.metrics_data
        if metric.name == "Faithfulness"
    ]
    avg_faithfulness = sum(faithfulness_scores) / max(1, len(faithfulness_scores))
    if avg_faithfulness < 0.75:
        send_alert(f"Faithfulness dropped to {avg_faithfulness:.2f} in production")
    
    return results

Run this on a schedule: daily for stable systems, hourly for high-traffic or recently-updated ones.


Common Mistakes in RAG Evaluation

Evaluating with the same model you used to generate. If GPT-4o generates your answers and scores them, you get inflated scores. Use a different model or model size for the judge.

Skipping per-component evaluation. End-to-end correctness tells you something is wrong. Separate retrieval and generation metrics tell you where.

Using evaluation metrics to optimize prompts directly. Metrics should track quality, not be the optimization target. Overfitting your prompt to maximize Ragas scores produces systems that score well on evals and fail on production queries that look slightly different.

Setting thresholds too high too early. A pipeline that fails every build gets disabled. Start lower, establish baselines, then tighten thresholds iteratively.

No human review of synthetic datasets. LLMs generate plausible-looking test cases that contain factual errors or reference non-existent content. Always review a sample before using synthetic data for metric tracking.

Not versioning your eval dataset. If your golden dataset changes between runs, you cannot compare scores meaningfully. Tag eval dataset versions alongside model and pipeline versions.


FAQ

Can I run RAG evaluation without sending data to OpenAI?

Yes. Both Ragas and DeepEval support any OpenAI-compatible endpoint as the judge model. Point them at a locally hosted model (via Ollama or vLLM) or a self-hosted inference endpoint and evaluation runs entirely on your infrastructure. For teams with data residency requirements, this is the standard approach.

How many test cases do I need for reliable metrics?

50 well-curated golden questions give you stable enough metrics to track meaningful changes. Fewer than 20 and your scores will swing significantly between runs. For synthetic datasets used as regression tests, 200-500 gives reasonable coverage without excessive eval cost.

What is a good faithfulness score for production?

It depends on the application. General-purpose Q&A can tolerate 0.8. Customer-facing product documentation should target 0.85+. Regulated industries (finance, healthcare, legal) should be at 0.9 or above before going live.

How do I evaluate RAG on a fine-tuned model?

Use the same eval dataset and metrics as your base model. The key comparison is the faithfulness delta: if it drops post-fine-tuning, the model is relying on memorized knowledge rather than retrieved context. For enterprise fine-tuning workflows, track both answer correctness (should improve) and faithfulness (should not drop significantly).

Ragas vs DeepEval: which should I use?

Ragas for quick experimentation and its synthetic dataset generator. DeepEval for CI/CD integration and production quality gates. The pytest-native workflow in DeepEval makes it significantly easier to build evaluation into existing engineering workflows. If you want both, start with Ragas to generate your golden dataset and explore metrics, then move evaluation into DeepEval for systematic testing.

How much does LLM-as-judge evaluation cost?

With GPT-4o-mini as judge, expect roughly $0.001-0.003 per test case with five metrics. A 200-question golden dataset costs under $1 per eval run. Scale up to 1,000 production samples daily and you are looking at $3-5/day. This is manageable for most teams. If cost is a constraint, use a self-hosted judge model.

What happens to my eval baseline when I update the knowledge base?

Your retrieval metrics will shift as the knowledge base changes. Context recall in particular can drop if new documents have different chunking characteristics. Re-run your full eval suite after any significant knowledge base update and treat a recall drop as a signal to review your chunking configuration, not as a regression to ignore.
