Hybrid Search for RAG: BM25, SPLADE, and Vector Search Combined

Quick answer: Hybrid search runs a sparse retriever (BM25 or SPLADE) and a dense vector retriever in parallel, then merges results with a fusion algorithm before passing chunks to the LLM. It consistently outperforms either method alone because dense retrieval misses exact identifiers and sparse retrieval misses semantic matches. Start with RRF at k=60 as the zero-config default. If you have 50+ labeled query pairs, switch to convex combination and tune alpha. Add a cross-encoder reranker after fusion for the biggest single precision gain.

This article covers how hybrid search works, when BM25 vs SPLADE is the right sparse retriever, the math behind three fusion strategies (RRF, convex combination, DBSF), four Python code examples including a full reranking pipeline, and a decision framework for every common retrieval scenario. Prerequisites: basic familiarity with embeddings and RAG concepts. Python experience helps for the code sections.

Why Single-Method Retrieval Breaks on Real Queries

Dense retrieval (embedding-based search) encodes queries and documents into continuous vector space, typically 384 to 1536 dimensions. Similarity is measured via cosine similarity or dot product, and retrieval uses Approximate Nearest Neighbor (ANN) search, typically HNSW. It captures semantic meaning that goes beyond lexical overlap: "slow database queries" matches "PostgreSQL optimization techniques," and "car" matches "automobile."

What it misses: exact identifiers. Search for a Python traceback, an API endpoint name, an error code like ECONNREFUSED, or a product SKU, and the embedding model often returns semantically related but wrong results. Embeddings average meaning across dimensions. Rare or unique tokens get diluted.

BM25, the standard sparse retrieval method, is the complement. It scores documents using term frequency with saturation, inverse document frequency, and document length normalization.

It returns millisecond responses at millions of documents with no GPU. But it has zero semantic understanding. If a user asks "how to fix slow queries" and your document says "optimization techniques for database performance," BM25 finds no match because there is no overlapping vocabulary.
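To make those three scoring components concrete, here is a simplified, stdlib-only BM25 sketch. The `bm25_score` helper and toy corpus are illustrative, not from any library; the defaults k1=1.5 and b=0.75 are common conventions rather than values from this article.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Simplified BM25: TF saturation (k1) + IDF + length normalization (b)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed, always positive
        tf = doc.count(term)                             # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "optimization techniques for database performance".split(),
    "how to fix slow car engines".split(),
]
# Vocabulary mismatch: neither "slow" nor "queries" appears in the first doc
print(bm25_score("slow queries".split(), corpus[0], corpus))      # 0.0
print(bm25_score("slow queries".split(), corpus[1], corpus) > 0)  # True ("slow" overlaps)
```

The zero score on the first document is exactly the failure mode described above: no overlapping vocabulary, no match, regardless of how related the content is.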

Hybrid search runs both retrievers in parallel, merges the result sets using a fusion strategy, and optionally applies a reranker before passing chunks to the LLM. One critical point on ordering: recall must come before precision.

A reranker can only reorder what has already been retrieved. If your dense retriever missed a relevant document because it lacked the exact keyword, no amount of reranking brings it back. Hybrid retrieval is what gives the reranker something worth working with.

For teams building domain-specific RAG systems with specialized terminology, acronyms, and identifiers, improving retrieval quality at the search layer is usually more impactful than upgrading the generation model.

BM25 vs SPLADE: Choosing Your Sparse Retriever

Most hybrid search guides treat BM25 as the only sparse option. It is not. SPLADE (Sparse Lexical and Expansion model) is a learned sparse retriever that closes most of BM25's weaknesses while keeping its speed and interpretability advantages.

BM25 is a purely statistical ranking function. It matches exact query terms against indexed documents using TF-IDF weighting with document length normalization. It needs no training, runs with any inverted index, and handles rare terms and identifiers extremely well. Its weaknesses: zero vocabulary expansion (if your query says "car" and the document says "vehicle," no match), and sensitivity to term frequency on short documents.

SPLADE uses a transformer to produce sparse vectors where each dimension corresponds to a vocabulary token. Unlike BM25, SPLADE expands both query and document representations with semantically related terms at index time. A document about "database optimization" gets expanded with "query," "indexing," "performance," and related terms. This means SPLADE handles vocabulary mismatch while still producing sparse, interpretable vectors that work with standard inverted indexes.

| Capability | BM25 | SPLADE |
|---|---|---|
| Vocabulary expansion | None | Yes (learned) |
| Exact keyword matching | Excellent | Good |
| Semantic matching | None | Partial |
| Index type | Inverted index | Inverted index |
| Inference cost | None | One encoder pass per doc |
| GPU required | No | For indexing (not retrieval) |
| Best for | SKUs, error codes, legal identifiers | Mixed vocabulary, general enterprise docs |

Use BM25 when your queries are exact-match dominated: product catalogs, legal case numbers, financial instrument identifiers, error codes. Use SPLADE when your corpus has vocabulary mismatch between how users ask questions and how documents are written, which describes most enterprise knowledge bases.

In practice, SPLADE consistently outperforms BM25 on BEIR benchmarks across most dataset types, and several vector databases now support it natively alongside dense vectors. The code section below shows both options.


What the Benchmarks Show

Before the numbers, context on what each study measured:

BEIR Benchmark (Thakur et al., NeurIPS 2021)
  • Corpus: 15+ datasets including MS MARCO, TREC-COVID, Natural Questions, FiQA, SciFact
  • Measured: NDCG@10 and MRR across zero-shot retrieval tasks
  • Key design: standard evaluation suite used across most retrieval papers; aggregated results via EmergentMind

Weaviate Search Mode Benchmarking
  • Corpus: BEIR SciFact, BRIGHT Biology, and other domain-specific datasets
  • Embedding: Snowflake Arctic 2.0 + BM25 with RRF fusion
  • Measured: retrieval recall across different domains
  • Key design: shows domain variance in hybrid search gains, from +5% on BEIR SciFact to +24% on BRIGHT Biology

softwaredoug Elasticsearch Benchmark
  • Corpus: WANDS furniture e-commerce dataset
  • Embedding: MiniLM
  • Measured: mean NDCG comparing KNN, RRF, and dismax fusion

OpenSearch Real-World Evaluation
  • Corpus: production search queries
  • Measured: MAP and NDCG vs. keyword-only baseline

Consolidated results

| Dataset | Metric | Dense/Keyword Only | Hybrid | Gain |
|---|---|---|---|---|
| BEIR aggregate (13 datasets) | NDCG | Baseline | +26 to 31% | +26 to 31% |
| BEIR aggregate | MRR | 0.410 | 0.486 | +18.5% |
| WANDS furniture (Elasticsearch) | Mean NDCG | 0.695 (KNN) | 0.708 (RRF) | +1.7% |
| WANDS furniture (Elasticsearch) | Mean NDCG | 0.695 (KNN) | 0.708 (dismax) | +1.9% |
| Weaviate BEIR SciFact | Recall | Baseline | +5% | +5% |
| Weaviate BRIGHT Biology | Recall | Baseline | +24% | +24% |
| OpenSearch real-world | MAP | 0.55 (keyword) | 0.60 (hybrid) | +9% |
| OpenSearch real-world | NDCG | 0.69 (keyword) | 0.82 (hybrid) | +19% |

What the variance means

The +1.7% on WANDS furniture vs. +24% on BRIGHT Biology is not a contradiction. It is the central lesson of hybrid search benchmarking. On the WANDS dataset, product names and attributes already create strong lexical overlap between queries and documents, so dense retrieval already does well and BM25 adds little. On BRIGHT Biology, researchers phrase queries differently from how papers are written, and semantic bridging fails without the lexical anchor that BM25 provides.

The key finding: Hybrid search helps most where vocabulary mismatch between queries and documents is highest. Before assuming benchmark gains apply to your system, measure on your own queries.

One more finding worth knowing: a poorly tuned hybrid configuration can perform worse than your dense baseline. AIMultiple's benchmark found that their initial hybrid setup scored MRR 0.390, below their dense-only baseline of 0.410. The issue was an untuned fusion weight. Their optimized configuration reached 0.486. Hybrid search is not automatic improvement. The fusion parameter matters.


Fusion Strategies: RRF, Convex Combination, and DBSF

Once you have ranked result lists from both retrievers, you need a fusion strategy to produce a single ranked list.

Reciprocal Rank Fusion (RRF)

Introduced by Cormack, Clarke, and Buettcher at SIGIR 2009:

RRF_score(d) = SUM over all rankers r: 1 / (k + rank_r(d))

The constant k defaults to 60 from the original paper. A document ranked #1 in one list scores 1/(60+1) = 0.0164. A document ranked #5 in both lists scores 1/(60+5) + 1/(60+5) = 0.0308. Consistent presence across lists matters more than a single high rank.

RRF is score-agnostic: it uses only rank positions, not raw scores. You do not need to normalize BM25 scores (unbounded) against cosine similarity scores (-1 to 1). It works with zero labeled data.
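The arithmetic above is easy to sanity-check in a couple of lines:

```python
k = 60
rank1_single_list = 1 / (k + 1)       # ranked #1 in one list only
rank5_both_lists = 2 * (1 / (k + 5))  # ranked #5 in both lists
print(round(rank1_single_list, 4))    # 0.0164
print(round(rank5_both_lists, 4))     # 0.0308
```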

The limitation: Bruch et al. (ACM TOIS 2023) found that convex combination outperforms RRF in both in-domain and out-of-domain settings when the alpha parameter is tuned, even on small evaluation sets. RRF is the right starting point, not the ceiling.

Convex Combination (Weighted Linear Scoring)

score(d) = alpha * normalized_dense(d) + (1 - alpha) * normalized_sparse(d)

Where alpha = 1.0 is pure vector search and alpha = 0.0 is pure keyword search. This requires score normalization because BM25 and cosine similarity operate on different scales. Min-max normalization is standard. Bruch et al. found the specific normalization method matters less than having normalization at all.

Alpha starting points per LlamaIndex's alpha tuning guide:

  • alpha = 0.5 for balanced starting point
  • alpha ~0.3 for technical docs with exact API names, error codes, SKUs (favor keyword)
  • alpha ~0.7 for customer support chatbots where users describe problems in their own words (favor semantic)

A small evaluation set of 50 to 100 query-relevance pairs is enough to tune alpha. Convex combination is sample-efficient, and the gains over RRF are consistent when tuned.
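To make the tuning step concrete, here is a stdlib-only grid-search sketch. The helper names (`tune_alpha`, `combine`, `mrr`) and the toy eval set are illustrative, and it assumes per-query score dicts that have already been min-max normalized to [0, 1].

```python
def mrr(relevant: set, ranked_ids: list) -> float:
    """Reciprocal rank of the first relevant doc (0.0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def combine(dense: dict, sparse: dict, alpha: float) -> list:
    """Convex combination over normalized {doc_id: score} dicts; returns ranked doc_ids."""
    docs = set(dense) | set(sparse)
    scored = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0) for d in docs}
    return sorted(scored, key=scored.get, reverse=True)

def tune_alpha(eval_set: list, step: float = 0.1) -> tuple:
    """Grid-search alpha by mean MRR over (dense_scores, sparse_scores, relevant_ids) triples."""
    best_alpha, best_mrr = 0.5, -1.0
    alpha = 0.0
    while alpha <= 1.0001:
        score = sum(mrr(rel, combine(d, s, alpha)) for d, s, rel in eval_set) / len(eval_set)
        if score > best_mrr:
            best_alpha, best_mrr = round(alpha, 1), score
        alpha += step
    return best_alpha, best_mrr

# Toy single-query eval set: dense ranks the relevant doc "a" first, sparse ranks "b" first
eval_set = [({"a": 1.0, "b": 0.2}, {"a": 0.1, "b": 1.0}, {"a"})]
best_alpha, best_mrr = tune_alpha(eval_set)
print(best_alpha)  # 0.6: the smallest alpha on the grid that puts "a" on top
```

With real data the grid is evaluated over all 50 to 100 labeled queries, and the resulting alpha plugs directly into the convex combination formula above.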

Distribution-Based Score Fusion (DBSF)

Qdrant's DBSF approach calculates mean and standard deviation of scores from each retriever, sets normalization bounds at mean +/- 3 standard deviations, and normalizes into 0 to 1 before summing. Unlike static min-max normalization, DBSF adapts to each query's score distribution.

Use DBSF when score magnitudes vary significantly between retrievers across different queries. Use RRF when you want zero-config fusion.
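A minimal sketch of that normalization, assuming the fused score is simply the sum of the two normalized scores (the function names here are illustrative, not Qdrant's API):

```python
import statistics

def dbsf_normalize(scores: dict) -> dict:
    """Normalize {doc_id: score} with bounds at mean +/- 3 standard deviations, clipped to [0, 1]."""
    vals = list(scores.values())
    mu, sigma = statistics.mean(vals), statistics.pstdev(vals)
    lo, hi = mu - 3 * sigma, mu + 3 * sigma
    if hi == lo:  # all scores identical
        return {d: 0.5 for d in scores}
    return {d: min(1.0, max(0.0, (s - lo) / (hi - lo))) for d, s in scores.items()}

def dbsf_fuse(dense: dict, sparse: dict) -> list:
    """Sum per-query normalized dense and sparse scores, highest first."""
    d_norm, s_norm = dbsf_normalize(dense), dbsf_normalize(sparse)
    docs = set(dense) | set(sparse)
    fused = {doc: d_norm.get(doc, 0.0) + s_norm.get(doc, 0.0) for doc in docs}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Cosine-scale dense scores vs. unbounded BM25-scale sparse scores
ranked = dbsf_fuse({"a": 0.9, "b": 0.1}, {"a": 12.0, "c": 3.0})
print(ranked[0][0])  # "a": strong in both retrievers despite wildly different raw scales
```

Because the bounds come from each query's own score distribution, no static normalization constants need to be maintained.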

RAG-Fusion (Multi-Query + RRF)

Rackauckas (arXiv:2402.03367) extends RRF by generating multiple reformulated queries from the original using an LLM, running each through retrievers, and applying RRF across all result lists. This improves recall by approaching the query from multiple angles. The trade-off is LLM latency per query and occasional off-topic drift when generated queries stray from original intent.
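A sketch of that loop, with the LLM reformulation step replaced by an arbitrary callable (all names here are illustrative, and the stand-in lambdas only simulate a retriever and an LLM):

```python
def rag_fusion(query: str, retrieve_fn, reformulate_fn, k: int = 60, top_k: int = 5) -> list:
    """Run the original query plus generated reformulations, then fuse all lists with RRF."""
    queries = [query] + reformulate_fn(query)
    scores: dict = {}
    for q in queries:
        for rank, doc_id in enumerate(retrieve_fn(q), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Stand-ins for the retriever and the LLM reformulation call
retrieval_index = {
    "original question": ["a", "b"],
    "rephrased question": ["b", "c"],
}
result = rag_fusion(
    "original question",
    retrieve_fn=lambda q: retrieval_index[q],
    reformulate_fn=lambda q: ["rephrased question"],
)
print(result)  # ['b', 'a', 'c']: "b" appears in both lists, so it wins
```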

Building a Hybrid Search Pipeline in Python

Four approaches, from quick prototype to full production-grade pipeline with reranking.

Example 1: LangChain EnsembleRetriever (Fastest Prototype)

The LangChain EnsembleRetriever combines any two retrievers with weighted RRF. Shortest path to hybrid search if you are already in the LangChain ecosystem.

# pip install langchain langchain-community faiss-cpu rank-bm25 langchain-openai

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

texts = [
    "Hybrid search combines sparse and dense retrievers for better recall.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Dense retrieval uses cosine similarity for semantic matching.",
]
docs = [Document(page_content=text) for text in texts]

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 10

embeddings = OpenAIEmbeddings()
faiss_store = FAISS.from_documents(docs, embeddings)
dense_retriever = faiss_store.as_retriever(search_kwargs={"k": 10})

# Weighted RRF: 40% BM25, 60% dense
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6]
)

results = hybrid_retriever.invoke("What is hybrid search?")
for doc in results[:5]:
    print(doc.page_content[:150])

The weights parameter controls each retriever's contribution to the RRF score. Start at 0.5/0.5, then adjust based on your query mix.

Example 2: Qdrant Native Hybrid Search with SPLADE (Production-Grade)

For production, native hybrid search in a vector database avoids running two separate retrieval systems. Qdrant supports dense and sparse vectors in the same collection with built-in RRF and DBSF fusion. This example uses SPLADE for the sparse side.

# pip install qdrant-client transformers torch

from qdrant_client import QdrantClient, models
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

client = QdrantClient(url="http://localhost:6333")

# Create collection: dense + sparse (SPLADE) vectors
client.create_collection(
    collection_name="hybrid_docs",
    vectors_config={
        "dense": models.VectorParams(
            size=384,
            distance=models.Distance.COSINE
        ),
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        ),
    },
)

def encode_splade(text: str, tokenizer, model) -> dict:
    """Generate SPLADE sparse vector as {token_id: weight} dict."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # SPLADE activation: log(1 + ReLU(logits)), then max-pool over the sequence
    logits = outputs.logits
    activations = torch.log(1 + torch.relu(logits))
    sparse = activations.max(dim=1).values.squeeze()

    indices = sparse.nonzero(as_tuple=False).squeeze().tolist()
    values = sparse[sparse != 0].tolist()
    return {"indices": indices if isinstance(indices, list) else [indices],
            "values": values if isinstance(values, list) else [values]}

# Load SPLADE model (run once)
splade_model_name = "naver/splade-cocondenser-ensemble-distil"
tokenizer = AutoTokenizer.from_pretrained(splade_model_name)
splade_model = AutoModelForMaskedLM.from_pretrained(splade_model_name)

# Hybrid search: dense + SPLADE, fused with RRF
def hybrid_search(query: str, dense_vector: list, top_k: int = 10):
    sparse_vector = encode_splade(query, tokenizer, splade_model)

    results = client.query_points(
        collection_name="hybrid_docs",
        prefetch=[
            models.Prefetch(
                query=dense_vector,
                using="dense",
                limit=20,
            ),
            models.Prefetch(
                query=models.SparseVector(**sparse_vector),
                using="sparse",
                limit=20,
            ),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=top_k,
    )
    return results.points

Switch models.Fusion.RRF to models.Fusion.DBSF if your dense and sparse score distributions vary greatly across queries.


Example 3: RRF and Convex Combination from Scratch

When your vector database does not support native hybrid search, or when you need full control over fusion logic:

def reciprocal_rank_fusion(result_lists: list[list], k: int = 60) -> list:
    """
    Fuse multiple ranked result lists using RRF.
    Each result_list: list of (doc_id, score) sorted by relevance descending.
    Returns: list of (doc_id, rrf_score) sorted descending.
    """
    rrf_scores = {}
    for result_list in result_lists:
        for rank, (doc_id, _score) in enumerate(result_list, start=1):
            rrf_scores.setdefault(doc_id, 0.0)
            rrf_scores[doc_id] += 1.0 / (k + rank)

    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)


def min_max_normalize(scores: list[float]) -> list[float]:
    min_s, max_s = min(scores), max(scores)
    if max_s == min_s:
        return [0.5] * len(scores)
    return [(s - min_s) / (max_s - min_s) for s in scores]


def hybrid_convex_combination(
    bm25_results: dict,    # {doc_id: raw_bm25_score}
    dense_results: dict,   # {doc_id: cosine_score}
    alpha: float = 0.5     # 1.0 = pure dense, 0.0 = pure BM25
) -> list:
    """
    Combine BM25 and dense scores via convex combination.
    Per Bruch et al. (ACM TOIS 2023): outperforms RRF when alpha is tuned
    on even a small evaluation set (50-100 labeled pairs).
    """
    bm25_ids = list(bm25_results.keys())
    bm25_norm = dict(zip(bm25_ids, min_max_normalize(list(bm25_results.values()))))

    dense_ids = list(dense_results.keys())
    dense_norm = dict(zip(dense_ids, min_max_normalize(list(dense_results.values()))))

    all_docs = set(bm25_ids) | set(dense_ids)
    combined = {}
    for doc_id in all_docs:
        d = dense_norm.get(doc_id, 0.0)
        b = bm25_norm.get(doc_id, 0.0)
        combined[doc_id] = alpha * d + (1 - alpha) * b

    return sorted(combined.items(), key=lambda x: x[1], reverse=True)


# Usage: start with RRF, switch to convex combination once you have eval data
bm25_hits = [("doc1", 12.4), ("doc2", 9.1), ("doc3", 7.8)]
dense_hits = [("doc2", 0.91), ("doc1", 0.88), ("doc4", 0.76)]

rrf_result = reciprocal_rank_fusion([bm25_hits, dense_hits])
cc_result = hybrid_convex_combination(
    dict(bm25_hits), dict(dense_hits), alpha=0.5
)

Example 4: Complete Pipeline with Cross-Encoder Reranking

Fusion expands recall. Reranking buys precision. A cross-encoder scores each query-document pair jointly, which is more accurate than embedding similarity but too slow for first-stage retrieval. The two-stage pattern: hybrid retrieval gets top-20 candidates, reranker selects the best 5 to pass to the LLM.

# pip install sentence-transformers rank-bm25 faiss-cpu numpy

from sentence_transformers import SentenceTransformer, CrossEncoder
import faiss
import numpy as np
from rank_bm25 import BM25Okapi

# ---- Setup ----
corpus = [
    "PostgreSQL query optimization reduces latency using indexes.",
    "ECONNREFUSED error occurs when the server is not accepting connections.",
    "Dense retrieval uses cosine similarity for semantic matching.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Hybrid search combines sparse and dense retrievers for better recall.",
]

tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(corpus, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings)

# Cross-encoder for reranking (MS MARCO fine-tuned)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def hybrid_search_with_reranking(
    query: str,
    top_k_retrieve: int = 20,
    top_k_return: int = 5,
    alpha: float = 0.5,
    rrf_k: int = 60,
) -> list[dict]:
    """
    Full pipeline: hybrid retrieval (BM25 + dense + RRF) -> cross-encoder reranking.
    """
    # Stage 1: BM25 retrieval
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_hits = sorted(enumerate(bm25_scores), key=lambda x: x[1], reverse=True)[:top_k_retrieve]

    # Stage 2: Dense retrieval
    q_emb = embedder.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q_emb, top_k_retrieve)
    dense_hits = list(zip(ids[0].tolist(), scores[0].tolist()))

    # Stage 3: RRF fusion
    bm25_ranked = [(str(idx), score) for idx, score in bm25_hits]
    dense_ranked = [(str(idx), score) for idx, score in dense_hits]

    rrf_scores: dict[str, float] = {}
    for ranked_list in [bm25_ranked, dense_ranked]:
        for rank, (doc_id, _) in enumerate(ranked_list, start=1):
            rrf_scores.setdefault(doc_id, 0.0)
            rrf_scores[doc_id] += 1.0 / (rrf_k + rank)

    fused = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:top_k_retrieve]

    # Stage 4: Cross-encoder reranking
    candidate_docs = [(int(doc_id), corpus[int(doc_id)]) for doc_id, _ in fused]
    pairs = [[query, doc_text] for _, doc_text in candidate_docs]
    rerank_scores = reranker.predict(pairs)

    reranked = sorted(
        zip(candidate_docs, rerank_scores),
        key=lambda x: x[1],
        reverse=True
    )[:top_k_return]

    return [
        {"doc_id": idx, "text": text, "rerank_score": float(score)}
        for (idx, text), score in reranked
    ]


results = hybrid_search_with_reranking(
    "ECONNREFUSED connection error fix",
    top_k_retrieve=20,
    top_k_return=5
)

for r in results:
    print(f"[{r['rerank_score']:.3f}] {r['text'][:100]}")

The top_k_retrieve=20 / top_k_return=5 ratio is a practical default for production. Retrieve wide to maximize recall in the hybrid stage, then let the reranker cut to what the LLM actually needs.

If building and maintaining this retrieval stack is not where your team adds value, Prem Platform deploys hybrid RAG pipelines with configurable retrieval inside your own cloud account. Dense and sparse retrieval, built-in reranking, data stays in your VPC. See the private RAG deployment guide for architecture details.


Vector Database Support for Hybrid Search

Every production vector database now supports hybrid search, but implementations differ in fusion methods, sparse vector support, and API design.

| Vector DB | Sparse Retriever | Fusion Methods | Weight Control | API Pattern |
|---|---|---|---|---|
| Qdrant | SPLADE, custom sparse | RRF, DBSF | Prefetch weights | query_points + prefetch + FusionQuery |
| Weaviate | Built-in BM25F | rankedFusion, relativeScoreFusion | alpha 0.0 to 1.0 | collection.query.hybrid() |
| Elasticsearch | BM25, ELSER (sparse neural) | RRF, linear retriever | rank_constant, weights | retriever.rrf JSON (since v8.9) |
| OpenSearch | BM25 | Arithmetic, harmonic, geometric mean | Per-query weights | Search pipeline + normalization-processor |
| Pinecone | BM25 encoder (client-side) | Convex combination | alpha param | hybrid_convex_scale + index.query |
| Milvus | Built-in BM25 function | WeightedRanker, RRFRanker | Per-ranker weights | AnnSearchRequest + hybrid_search |

Verified from official documentation, March 2026.

Choosing between them depends on your stack. If you need SPLADE with score-aware fusion, Qdrant's DBSF offers the most control. For the simplest API, Weaviate's single-call hybrid() with an alpha parameter gets you running in one line. If you are already on Elasticsearch, RRF works out of the box since v8.9. Redis 8.4 added an FT.HYBRID command combining BM25, vector search, and filtering in a single atomic operation.

When Hybrid Search Is Overkill

Hybrid search adds real costs: dual indexing (both inverted and vector index), two retrieval paths, and fusion tuning that requires an evaluation set. On the WANDS e-commerce benchmark, RRF added only +1.7% Mean NDCG over dense-only. That marginal gain may not justify the infrastructure for all use cases.

  1. Dense-only is enough when queries are highly semantic with minimal vocabulary mismatch. It also suffices when your embedding model has been fine-tuned on the target domain and already captures domain-specific terminology. If the model understands your vocabulary, BM25 adds little.
  2. BM25-only is enough when queries are exact-match dominated: product catalogs with SKUs, legal case numbers, financial instrument identifiers. BM25 delivers millisecond responses at millions of documents without GPU infrastructure.
  3. Consider reranking before hybrid. If your dense retriever already has high recall but poor precision at the top, a cross-encoder reranker often helps more than adding BM25. Reranking reorders what you have already retrieved. Hybrid search expands what you retrieve. Know which problem you are solving before adding complexity.

Also worth knowing: a hybrid search configuration with a badly tuned alpha or k parameter can underperform your dense baseline, as documented in AIMultiple's benchmark. Measure before and after. Do not assume hybrid is automatic improvement.


Decision Framework

| Your Situation | Strategy | Fusion | Alpha / k |
|---|---|---|---|
| Mixed queries (exact terms + semantic intent) | Hybrid | RRF, k=60 | Default |
| You have 50+ labeled query pairs | Hybrid | Convex combination | Tune alpha on eval set |
| Technical docs with error codes, API names, SKUs | Hybrid, sparse-weighted | Convex combination | alpha ~0.3 |
| Customer support KB, conversational queries | Hybrid, semantic-weighted | Convex combination | alpha ~0.7 |
| Pure semantic similarity tasks | Dense-only | N/A | N/A |
| Exact-match dominated (SKUs, case numbers) | BM25-only | N/A | N/A |
| Dense recall is fine, top precision is poor | Dense + reranker | N/A | N/A |
| High recall needed, no labeled data | Hybrid + RRF | RRF | k=60 |
| Score distributions vary across retrievers | Hybrid + DBSF | DBSF | Auto-normalized |
| Domain with heavy vocabulary mismatch | Hybrid with SPLADE sparse | RRF or CC | Tune from 0.5 |

Sources: Cormack et al. SIGIR 2009, Bruch et al. ACM TOIS 2023, LlamaIndex alpha tuning, Weaviate benchmarking.

Two defaults cover most situations. No evaluation data: RRF with k=60. If you can build even a small eval set: convex combination with tuned alpha. Measure Hit Rate and MRR before committing.


Measuring Retrieval Quality

Switching from dense-only to hybrid is a hypothesis until you measure it on your own data. Three metrics cover most production evaluation needs:

Hit Rate @K: What fraction of queries have at least one relevant document in the top-K results. Best for checking whether relevant context is reaching the LLM at all.

MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant result. Measures whether relevant documents appear at the top of the list, not just somewhere in top-K.

NDCG@K: Normalized Discounted Cumulative Gain. Rewards relevant documents appearing earlier in the ranked list, with diminishing credit for lower positions. Used in most academic benchmarks.

def hit_rate(relevant_docs: list[str], retrieved_docs: list[str], k: int) -> float:
    """1.0 if at least one relevant doc appears in the top-K for this query, else 0.0."""
    return float(any(doc in retrieved_docs[:k] for doc in relevant_docs))

def mean_reciprocal_rank(relevant_docs: list[str], retrieved_docs: list[str]) -> float:
    """MRR for a single query."""
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc in relevant_docs:
            return 1.0 / rank
    return 0.0

# Evaluate across a query set
def evaluate_retriever(retriever_fn, eval_set: list[dict], k: int = 10):
    """
    eval_set: list of {"query": str, "relevant_docs": list[str]}
    retriever_fn: function(query) -> list[str] of doc_ids
    """
    hit_rates, mrrs = [], []
    for item in eval_set:
        retrieved = retriever_fn(item["query"])
        hit_rates.append(hit_rate(item["relevant_docs"], retrieved, k))
        mrrs.append(mean_reciprocal_rank(item["relevant_docs"], retrieved))

    return {
        f"hit_rate@{k}": sum(hit_rates) / len(hit_rates),
        "mrr": sum(mrrs) / len(mrrs),
    }

# Compare dense-only vs hybrid on the same eval set
dense_metrics = evaluate_retriever(dense_retriever_fn, eval_set, k=10)
hybrid_metrics = evaluate_retriever(hybrid_retriever_fn, eval_set, k=10)

print(f"Dense-only:  Hit@10={dense_metrics['hit_rate@10']:.3f}, MRR={dense_metrics['mrr']:.3f}")
print(f"Hybrid:      Hit@10={hybrid_metrics['hit_rate@10']:.3f}, MRR={hybrid_metrics['mrr']:.3f}")
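NDCG@K, the third metric described above, is not implemented in the snippet. A binary-relevance sketch (the helper name is illustrative):

```python
import math

def ndcg_at_k(relevant_docs: list[str], retrieved_docs: list[str], k: int) -> float:
    """NDCG@K with binary relevance: DCG of this ranking over DCG of the ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved_docs[:k], start=1)
        if doc in relevant_docs
    )
    ideal_hits = min(len(relevant_docs), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k(["a"], ["a", "b"], k=2))  # 1.0: the relevant doc is already at the top
```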

Build your eval set from real user queries, not synthetic ones. 50 to 100 labeled pairs is enough to detect meaningful differences and tune alpha.


Start with RRF, Measure, Then Tune

Hybrid search with RRF at k=60 is where to start. It requires no labeled data, handles both dense and sparse failure modes, and consistently outperforms single-method retrieval on mixed query types.

Your action plan:

  1. Implement hybrid retrieval with RRF (k=60). Use the LangChain EnsembleRetriever to prototype quickly, or native hybrid search in your vector database.
  2. Measure Hit Rate @10 and MRR on 50 to 100 representative queries from your actual users.
  3. If metrics show room to improve, tune alpha on your eval set using convex combination.
  4. Add a cross-encoder reranker after fusion to improve top-K precision before the LLM step.

If you want managed hybrid retrieval without building the stack yourself, Prem Platform runs production RAG pipelines with hybrid search and reranking inside your cloud account. Data stays in your VPC. For open-source models suited to RAG workloads, Prem-1B (Apache 2.0, 8,192-token context) pairs well with any of the retrieval strategies above.

FAQ

What is hybrid search in RAG?

Hybrid search runs a sparse retriever (BM25 or SPLADE) and a dense vector retriever on the same query in parallel, then merges their result lists using a fusion algorithm like RRF or convex combination. The combined candidate set is passed to the LLM, often after an optional reranking step. It improves recall over either method alone because dense retrieval misses exact keyword matches and sparse retrieval misses semantic synonyms.

When should I use hybrid search vs. pure vector search?

Use hybrid search when your queries include exact identifiers (error codes, product names, API endpoints, legal terms) alongside natural language intent. Use pure vector search when queries are entirely semantic and your embedding model already understands your domain vocabulary. If you have fine-tuned embeddings on your specific corpus, the marginal gain from BM25 shrinks considerably.

What is the difference between BM25 and SPLADE?

BM25 is a statistical ranking function that matches exact query terms without any vocabulary expansion. SPLADE is a learned sparse model that expands both query and document representations with semantically related terms at index time. SPLADE handles vocabulary mismatch better, outperforms BM25 on most BEIR benchmark datasets, but requires a transformer inference pass during indexing. For exact-match dominated queries like SKUs or error codes, BM25 is sufficient and faster. For enterprise knowledge bases with vocabulary mismatch, SPLADE is worth the indexing cost.

What is RRF and what does k=60 mean?

Reciprocal Rank Fusion scores each document by summing 1/(k + rank) across all result lists it appears in. The constant k=60 is the value from the original Cormack et al. (SIGIR 2009) paper and has become the industry standard. A higher k flattens the score differences between ranks. k=60 gives a good balance between rewarding top-ranked documents and giving credit to consistently appearing documents across lists. It is score-agnostic, so you do not need to normalize BM25 and cosine similarity scores before fusion.

Does hybrid search always improve retrieval?

No. On e-commerce datasets where product names create strong lexical overlap with queries, RRF added only +1.7% Mean NDCG over dense-only (WANDS benchmark, Elasticsearch). More importantly, a badly configured hybrid setup with an untuned fusion weight can perform worse than your dense baseline. Measure on your own query set before deploying. The +26 to 31% NDCG improvement cited in BEIR aggregate results reflects domains with high vocabulary mismatch, not all corpora.

Should I add reranking on top of hybrid search?

For most production RAG systems, yes. Hybrid retrieval and reranking solve different problems. Hybrid retrieval maximizes recall so the relevant document is somewhere in your candidate set. A cross-encoder reranker then maximizes precision by scoring each query-document pair jointly, which is more accurate than embedding similarity but too slow for first-stage retrieval. The standard pattern is hybrid retrieval for top-20 candidates, reranker to select top-5 for the LLM.

Which vector database has the best hybrid search support?

They all support hybrid search now. Qdrant has the most flexibility: SPLADE sparse vectors, RRF, and DBSF fusion with per-query weights. Weaviate has the simplest API: one call with an alpha parameter. Elasticsearch's RRF retriever works out of the box since v8.9. Pinecone is straightforward for teams already in that ecosystem. Choose based on your existing stack, not on hybrid search support alone.
