Best Embedding Models for RAG (2026): Ranked by MTEB Score, Cost, and Self-Hosting
10 best embedding models for RAG in 2026 with MTEB benchmarks, cost per million tokens, max context length, dimensions, and a decision guide for your use case.
Your embedding model choice affects every query your RAG system handles. Pick the wrong one and you're fighting bad retrieval for the lifetime of your index. Re-embedding a large corpus costs real money and time. If you're still at the architecture stage, the RAG strategies overview covers how embedding fits into the full pipeline before you get into model selection.
MTEB (Massive Text Embedding Benchmark) is the standard benchmark for comparing models, covering 56+ tasks including retrieval, classification, clustering, and semantic similarity. But the leaderboard score is an average. A model that dominates classification may underperform on retrieval, which is the task that actually matters for RAG. Throughout this guide, retrieval-specific NDCG@10 scores are used where available, not just overall MTEB averages.
Two other things worth knowing before you read rankings:
First, the leaderboard shifts constantly. New models submit results monthly. The scores here reflect the state of things in early 2026. Check the MTEB leaderboard before making a final call.
Second, benchmark scores on public datasets don't always translate to your corpus. A model that tops the leaderboard on Wikipedia and legal documents might perform differently on your internal ticketing system or product catalog. Run your own retrieval eval on a sample of your data before committing.
With that said, here are the 10 models worth knowing.
Quick Comparison Table
| Model | MTEB Score | Context | Dimensions | Cost/1M tokens | Self-host | License |
|---|---|---|---|---|---|---|
| Gemini embedding-001 | 68.32 | 2,048 | 3072 (flex) | $0.15 | No | Proprietary |
| Qwen3-Embedding-8B | 70.58 (multilingual) | 32,000 | 7168 (flex to 32) | Free (self-host) | Yes | Apache 2.0 |
| voyage-3-large | ~67+ | 32,000 | 2048 (flex) | $0.06 | No | Proprietary |
| text-embedding-3-large | 64.6 | 8,192 | 3072 (flex) | $0.13 | No | Proprietary |
| Cohere embed-v4 | 65.2 | 128,000 | 1024 | $0.10 | VPC/On-prem | Proprietary |
| BGE-M3 | 63.0 | 8,192 | 1024 | Free (self-host) | Yes | MIT |
| NV-Embed-v2 | 69.32 | 32,768 | 4096 | Free (self-host) | Yes | CC-BY-NC-4.0 |
| Jina embeddings-v3 | ~62+ | 8,192 | 1024 (flex to 32) | $0.018 | Yes | CC-BY-NC-4.0 |
| Nomic embed-text-v1.5 | ~62+ | 8,192 | 768 (flex) | Free (self-host) | Yes | Apache 2.0 |
| all-MiniLM-L6-v2 | 56.3 | 512 | 384 | Free | Yes | Apache 2.0 |
1. Gemini Embedding 001: Best Overall MTEB Score
Use when: You're on Google Cloud, want the highest benchmark number without managing infrastructure, and your queries don't exceed 2,048 tokens.
Google's gemini-embedding-001 hit general availability in mid-2025 and currently holds the top spot on the MTEB Multilingual leaderboard. It scored 68.32 overall, with 67.71 on retrieval tasks and 85.13 on pair classification.
The model supports 100+ languages, outputs up to 3,072 dimensions (adjustable down to 768 via Matryoshka learning), and costs $0.15 per million tokens after the free tier.
The context window is the main limitation: 2,048 tokens per input. That's roughly 1,500-1,800 words. If your documents are longer, you're chunking regardless. For most RAG use cases, that's fine. For long-form legal contracts or research papers where you want to embed full sections, look elsewhere.
Pros
- Top MTEB score among proprietary models as of early 2026
- Strong multilingual retrieval across science, legal, finance, and code
- Flexible dimensions via MRL: reduce storage without losing much recall
- Free tier available through AI Studio
Cons
- Hard 2,048 token context limit
- API-only, no self-hosting option
- Tight Google ecosystem coupling for production deployments
Pricing: $0.15/1M tokens (Vertex AI). Free tier via Gemini API.
2. Qwen3-Embedding-8B: Best Open-Source Option
Use when: You need strong multilingual performance, want full infrastructure control, and can run an 8B parameter model.
Alibaba's Qwen3-Embedding-8B scored 70.58 on the MTEB Multilingual leaderboard as of mid-2025, ranking first. For retrieval specifically, it performs on par with or better than most proprietary alternatives while being fully self-hostable under Apache 2.0.
The model handles 32,000 token context windows, supports 100+ natural languages plus programming languages, and uses Matryoshka learning to support flexible output dimensions from 32 to 7,168. Smaller variants (0.6B and 4B) are available when you need to trade accuracy for speed.
An important implementation note: Qwen3-Embedding responds to task-specific instructions at inference time. Including a task prefix like `Instruct: Represent this document for retrieval\nQuery: ` before your query typically improves performance by 1-5% compared to no instruction. It's a small gain but free.
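The prefix format is worth scripting so it stays consistent across your pipeline. A minimal sketch following the pattern above (the exact instruction wording here is an assumption; check the Qwen3-Embedding model card for the recommended templates):

```python
def instructed_query(task: str, query: str) -> str:
    """Prepend a task instruction to a query, following the
    'Instruct: {task}, newline, Query: {query}' pattern quoted above.
    Documents are typically embedded without an instruction."""
    return f"Instruct: {task}\nQuery: {query}"

# Example: a retrieval instruction wrapped around a user query
text = instructed_query(
    "Represent this document for retrieval",
    "how do I rotate API keys?",
)
```

Embed documents raw and queries through this helper so the asymmetry stays in one place.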
For teams running self-hosted fine-tuned model serving, Qwen3-Embedding is currently the strongest open-source choice available under a commercially permissive license. If you want to take it further, PremAI's autonomous fine-tuning system covers how to fine-tune embedding models on domain-specific corpora without deep ML expertise.
Pros
- #1 on MTEB Multilingual leaderboard
- 32K context: handles long documents without chunking
- Apache 2.0: fully commercial-friendly
- Instruction-aware: tune behavior per task without retraining
- 0.6B and 4B variants for latency-constrained deployments
Cons
- 8B parameters needs GPU for production throughput
- Matryoshka dimensions available, but the default 7,168-dim vectors inflate storage costs
- Managed API not widely available; self-hosting required for most teams
Pricing: Free to self-host. Alibaba Cloud API pricing varies.
3. voyage-3-large: Best Accuracy-to-Cost Ratio Among Proprietary APIs
Use when: Retrieval quality is the priority, you're using a managed API, and you want the best NDCG@10 across general and domain-specific corpora.
Voyage AI's voyage-3-large outperforms text-embedding-3-large by 10.58% at matched dimensions and Cohere v3 by 20.71% on average across 100 datasets spanning law, finance, code, and multilingual content. Its 32,000 token context window handles long documents at 4x the capacity of OpenAI.
At $0.06/1M tokens, it costs less than half as much as OpenAI's large model, and its 1,024-dimensional default vectors need a third of the storage of OpenAI's 3,072. The quality tradeoff is minimal: at int8 quantization and 512 dimensions, voyage-3-large still beats OpenAI's full 3,072-dim float vectors by 1.16%, at 200x less storage cost.
Voyage also publishes domain-specific variants: voyage-law-2, voyage-finance-2, voyage-code-3, and voyage-multilingual-2. If your corpus is predominantly legal, financial, or code content, these domain-specific models will outperform the general-purpose one. That granularity is rare among proprietary providers.
Anthropic recommends Voyage AI for embedding in Claude-based pipelines, which makes it a natural fit for teams building on the broader Anthropic ecosystem.
Pros
- Best retrieval NDCG@10 among proprietary managed models
- 32K context at lower cost than OpenAI
- Domain-specific variants (law, finance, code)
- Matryoshka support for dimension-storage tradeoffs
- $0.06/1M tokens is the best price among top-tier managed APIs
Cons
- API-only, no self-hosting path
- Slightly less brand recognition than OpenAI/Google for enterprise procurement
Pricing: $0.06/1M tokens (voyage-3-large). Domain-specific models priced separately.
4. text-embedding-3-large: Best for OpenAI Ecosystem Teams
Use when: Your stack already runs on OpenAI APIs and switching costs outweigh a marginal retrieval gain.
OpenAI's text-embedding-3-large scores 64.6 on MTEB, costs $0.13/1M tokens, supports 8,192 token context, and outputs up to 3,072 dimensions (reducible via Matryoshka). It's the most battle-tested managed embedding model in production by usage volume.
What justifies staying with it despite being outperformed by Voyage, Gemini, and Cohere on benchmarks: ecosystem integration. If your LLM calls, fine-tuning jobs, and assistant APIs are all in one place, the operational simplicity is worth something. It works, it's reliable, it scales, and the documentation is thorough.
The Matryoshka dimension reduction is particularly useful here. Dropping from 3,072 to 1,024 dimensions reduces vector storage by 3x with only ~2% retrieval quality drop. If you're paying for a large vector database, that's meaningful.
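Under MRL, truncation can also be done client-side: keep the leading dimensions and re-normalize. A minimal numpy sketch (the helper name is ours; for text-embedding-3 models, OpenAI's `dimensions` request parameter achieves the same thing server-side):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components of an MRL-trained embedding and
    re-normalize to unit length so cosine similarity stays valid."""
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a full 3,072-dim embedding
full = np.random.default_rng(0).normal(size=3072)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 1024)  # 3x less vector storage
```

This only preserves quality for models trained with MRL; truncating a non-MRL embedding discards information arbitrarily.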
text-embedding-3-small ($0.02/1M tokens) is worth benchmarking for less critical retrieval tasks. It scores around 62 on MTEB and runs faster. Many teams use small for initial retrieval pass and large (or a reranker) for the final set. For cutting LLM API costs without sacrificing performance, that two-stage approach often makes more sense than upgrading your embedding model.
Pros
- Battle-tested at production scale
- Tight OpenAI platform integration
- Matryoshka support for storage optimization
- Reliable SLAs and documentation
Cons
- 8,192 token context (a quarter of Voyage/Qwen3/NV-Embed's 32K)
- Outperformed by Voyage, Gemini, and Cohere on MTEB retrieval
- $0.13/1M tokens is expensive relative to comparable performance
Pricing: $0.13/1M tokens (large). $0.02/1M tokens (small).
5. Cohere Embed v4: Best for Noisy Enterprise Data
Use when: Your documents are messy (OCR'd text, scanned forms, inconsistent formatting, or handwriting) and you need enterprise deployment options including VPC or on-premise.
Cohere's embed-v4 scores 65.2 on MTEB, slightly above OpenAI's large model. What sets it apart isn't the benchmark number. Cohere specifically trains for resilience against real-world document noise: spelling errors, formatting inconsistencies, mixed content types, and scanned handwriting.
The context window is the standout feature: at 128,000 tokens, embed-v4 is the only managed embedding model that can embed an entire lengthy contract or research paper as a single chunk. For teams that want to avoid chunking long documents altogether, no other proprietary API removes that constraint.
Enterprise deployment options are a real differentiator. Cohere offers virtual private cloud deployment and on-premises options, putting it in the same conversation as self-hosted models for regulated industries. For finance and healthcare teams that need an API with managed infrastructure but can't send data to a shared endpoint, this fills the gap.
One limitation: on its own, without a reranker, Cohere's contrastive training approach can struggle with the "query syntax vs. document syntax" mismatch common in retrieval. Cohere's own reranker is designed to complement it, and using both together typically outperforms either alone.
For a broader look at RAG data privacy considerations, especially for regulated industries, see the RAG privacy concerns guide.
Pros
- Best handling of noisy, real-world enterprise documents
- 128K token context window is unique among proprietary models
- VPC and on-premise deployment for regulated industries
- 65.2 MTEB, slightly ahead of OpenAI text-embedding-3-large
Cons
- Single-pass retrieval can be weak without the paired reranker
- $0.10/1M tokens is mid-range but not best value
- 1,024 dimensions at full size (no Matryoshka flexibility for smaller storage)
Pricing: $0.10/1M tokens.
6. BGE-M3: Best Open-Source Workhorse
Use when: You need a free, self-hostable, multilingual model that handles dense, sparse, and hybrid retrieval from a single model.
BAAI's BGE-M3 is the most versatile open-source embedding model available under an MIT license. It handles three things most models can't: dense embedding (standard semantic search), sparse retrieval (lexical matching like BM25), and multi-vector retrieval (ColBERT-style), all from a single model.
That matters for hybrid RAG pipelines. Instead of running a separate BM25 index alongside your dense vector store, BGE-M3 generates both representations in one pass. You get hybrid retrieval with less infrastructure complexity.
MTEB score is 63.0. It supports 100+ languages, 8,192 token context, and 1,024-dimensional outputs. At 568M parameters, it runs efficiently on a single GPU and can be quantized for CPU deployment.
The main operational consideration: dense-only BGE-M3 is straightforward to deploy. Enabling the sparse and multi-vector outputs adds complexity. You need a vector store that supports multiple vector types per document (Qdrant and Weaviate do; Pinecone does not natively). Decide upfront which retrieval modes you need.
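Conceptually, BGE-M3's sparse output is a token-to-weight map, and hybrid scoring reduces to fusing a lexical dot product with a dense cosine score. A hedged sketch of that fusion step (the 0.6/0.4 weights are illustrative assumptions, not a BGE-M3 recommendation):

```python
import numpy as np

def sparse_score(q_weights: dict, d_weights: dict) -> float:
    """Lexical relevance: sum of weight products over shared tokens."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha: float = 0.6) -> float:
    """Weighted fusion of a dense cosine score and a sparse lexical score.
    alpha balances semantic vs. exact-match relevance and should be tuned."""
    dense = float(np.dot(dense_q, dense_d))  # assumes unit-normalized vectors
    return alpha * dense + (1 - alpha) * sparse_score(sparse_q, sparse_d)

# Toy sparse outputs: the exact-match token "E1042" drives the lexical score
q_sparse = {"error": 0.8, "E1042": 1.2}
d_sparse = {"E1042": 0.9, "troubleshooting": 0.4}
```

In practice the dense and sparse vectors both come from one BGE-M3 forward pass; only the fusion logic lives in your retrieval layer.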
Self-hosting BGE-M3 with vLLM is covered in PremAI's self-hosted model serving docs.
Pros
- Dense + sparse + multi-vector from one model
- MIT license: genuinely free for commercial use
- 100+ languages, 8K context
- 568M params: runs on modest GPU hardware
- No API dependency, zero data leaves your infrastructure
Cons
- MTEB 63.0: below Voyage, Gemini, and Qwen3-8B
- Multi-vector and sparse modes add deployment complexity
- Generalization to out-of-distribution domains needs validation on your corpus
Pricing: Free to self-host. ~$0.016/1M tokens via Fireworks.ai if you prefer a managed option.
7. NV-Embed-v2: Best for Research and Non-Commercial Use
Use when: You're in a research or non-commercial context and want the highest absolute retrieval performance available in an open-weight model.
NVIDIA's NV-Embed-v2 held the top MTEB spot at 69.32 before Gemini and Qwen3 appeared. It's based on Mistral-7B, outputs 4,096-dimensional vectors, supports 32,768 token context, and uses a two-stage training process (contrastive learning first, then non-retrieval task integration) that produces notably strong dense retrieval scores.
The license is CC-BY-NC-4.0: free for research and non-commercial use, not for commercial deployment. That rules it out for most production RAG systems.
For teams in academia, healthcare research, or internal tooling that doesn't have a commercial component, it's the strongest open-weight English retrieval model available.
Operationally: 7.85B parameters need a GPU with at least 16GB VRAM. It doesn't support sparse or multi-vector outputs, so you'd pair it with BM25 separately for hybrid retrieval. The 4,096-dimensional output increases vector storage costs compared to BGE-M3 or Qwen3 at smaller dimensions.
Pros
- MTEB 69.32: top retrieval scores among open-weight models
- 32K token context
- Strong on HotpotQA and NQ, good for multi-hop reasoning queries
Cons
- CC-BY-NC-4.0: not for commercial use
- 7.85B parameters: needs 16GB+ VRAM
- English-only, no multilingual support
- Dense only: separate BM25 needed for hybrid retrieval
Pricing: Free (non-commercial only).
8. Jina Embeddings v3: Best for Task-Specific Adapters
Use when: You need multilingual embeddings across varied tasks (retrieval, clustering, classification) and want a mid-size model with flexible dimensions.
Jina's jina-embeddings-v3 (570M parameters) uses task-specific LoRA adapters to switch behavior per task at inference time. The same model produces retrieval-optimized embeddings for document indexing and clustering-optimized embeddings for grouping tasks, without retraining or loading different weights.
It scores around 62 on MTEB retrieval, supports 89+ languages, handles 8,192 token context, and outputs 1,024 dimensions adjustable down to 32 via Matryoshka.
The LoRA adapter approach has a practical implication: you need to specify the task type at query time. For a RAG pipeline where the task is always retrieval, this is a one-line config. For systems that use the same embedding pipeline for both retrieval and document clustering, the adapter switching happens automatically and removes the need for two separate models.
Jina v4 is available as of 2025 with 2,048 dimensions (adjustable to 128) and improved multilingual coverage. Check current benchmarks before deciding between v3 and v4 for your corpus.
A note on deployment: jina-embeddings-v3 is available for self-hosting, but the CC-BY-NC-4.0 license applies to the weights. Commercial use requires Jina's managed API. For teams with multilingual requirements that also need self-hosted deployment, BGE-M3 or Qwen3 under Apache 2.0 are cleaner options.
Pros
- Task-specific LoRA adapters without multiple models
- 89+ language support with strong multilingual MTEB scores
- Flexible dimensions (32 to 1,024)
- Managed API pricing is low at $0.018/1M tokens
Cons
- CC-BY-NC-4.0 on weights: commercial self-hosting requires their API
- MTEB retrieval score (~62) below top-tier models
- Task switching adds a small configuration overhead for multi-task pipelines
Pricing: $0.018/1M tokens (API). Self-hosting requires licensing for commercial use.
9. Nomic Embed Text v1.5: Best for Transparency and Auditability
Use when: You need a fully open model (weights, code, and training data all public), care about auditability, or need 8K context in a lightweight self-hostable package.
nomic-embed-text-v1.5 is the only major embedding model with complete openness: weights, training code, and training data are all publicly available. Some regulated industries need to document the provenance of every model in their AI stack; for those compliance requirements, no other embedding model on this list satisfies full auditability.
It scores around 62 on MTEB, supports 8,192 token context, outputs 768 dimensions (flexible via Matryoshka to 64), and runs on CPU with reasonable latency. At 137M parameters, it's lighter than BGE-M3 and Qwen3 by a significant margin.
Performance is solid for the parameter count. On the MTEB retrieval tasks, it competes with models 2-4x larger. The tradeoff is a lower ceiling: it won't match BGE-M3 or Qwen3 on multilingual or domain-specific benchmarks. For English-heavy corpora where auditability matters more than top-1 retrieval accuracy, it's the right choice.
Available as nomic-embed-text on Ollama for local inference, making it easy to run fully offline without any external API dependency.
Pros
- Fully open: weights, code, and training data public
- 8K context, 137M params: runs on CPU
- Apache 2.0 for commercial use
- Complete auditability for compliance requirements
Cons
- MTEB ~62: below BGE-M3, Qwen3, Voyage
- English focus; multilingual performance lags larger models
- 768 dimensions (lower ceiling on semantic richness)
Pricing: Free to self-host.
10. all-MiniLM-L6-v2: Best for Prototyping and Low-Latency Edge
Use when: You're prototyping, running on CPU-only hardware, or need sub-10ms embedding latency.
all-MiniLM-L6-v2 is the default starting point in the sentence-transformers library. MTEB score of 56.3, 512-token context, 384 dimensions, runs in under 10ms on CPU. It's not competitive with anything else on this list for production retrieval quality.
It earns a spot here because it serves a real purpose. For early-stage testing where you need to validate your chunking strategy or retrieval pipeline before committing to a production model, spinning up all-MiniLM-L6-v2 locally means zero API cost and no latency overhead. Once your evaluation framework shows what retrieval quality your use case needs, you swap in a better model.
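That swap-later workflow is easier if the embedder is a pluggable callable. A self-contained sketch with a toy bag-of-words stand-in so it runs with no model download (in a real pipeline you'd pass `SentenceTransformer("all-MiniLM-L6-v2").encode` as `embed` instead):

```python
import numpy as np

def retrieve(query: str, docs: list, embed, k: int = 3) -> list:
    """Rank docs by cosine similarity to the query using any embedder:
    `embed` maps a list of strings to a 2D array of vectors."""
    vecs = np.asarray(embed([query] + docs), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs[1:] @ vecs[0]
    return [docs[i] for i in np.argsort(-sims)[:k]]

def toy_embed(texts):
    """Stand-in bag-of-words embedder for pipeline plumbing tests only.
    Replace with a real encoder for meaningful retrieval quality."""
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    out = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            out[i, vocab.index(tok)] += 1.0
    return out

results = retrieve(
    "reset password",
    ["how to reset your password", "billing and invoices FAQ"],
    toy_embed,
    k=1,
)
```

The chunking, indexing, and eval code around `retrieve` stays identical when you upgrade the model.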
For edge deployments on devices with limited compute, embedded systems, or real-time applications where latency is the hard constraint, it's often the right tradeoff.
Don't use it for production RAG where retrieval quality matters.
Pros
- Sub-10ms on CPU
- Zero cost, no API dependency
- Dead simple to get running via sentence-transformers
- Good for validating pipeline architecture before choosing a production model
Cons
- MTEB 56.3: lowest on this list
- 512-token context limits chunk size options
- 384 dimensions: low semantic richness
Pricing: Free.
How to Choose: Decision Framework
The right model depends on your constraints, not the leaderboard ranking. Work through these questions in order.
1. Can your data leave your infrastructure?
If no: BGE-M3 (multilingual, MIT), Qwen3-Embedding (multilingual, Apache 2.0), or Nomic (auditable, Apache 2.0). NV-Embed-v2 if non-commercial. Cohere VPC/on-prem if you need managed infrastructure.
If yes: continue.
2. What's your primary language requirement?
English-only: voyage-3-large or Gemini embedding-001 for managed API; NV-Embed-v2 for open-weight non-commercial; BGE-M3 or Qwen3 for open-weight commercial.
Multilingual: Qwen3-Embedding-8B (best multilingual MTEB), Gemini embedding-001, Cohere embed-v4 (noisy data), Jina v3 (task adapters), or BGE-M3.
For a deeper look at multilingual LLM deployment considerations, see multilingual LLMs: progress, challenges, and future directions.
3. What's your document length distribution?
Short to medium (under 2,048 tokens): Any model works. Optimize for retrieval quality and cost.
Long documents (2,048-8,192 tokens): text-embedding-3-large, BGE-M3, Jina v3, Nomic.
Very long documents (8,192-32,768 tokens): Qwen3-Embedding-8B, voyage-3-large, NV-Embed-v2, Cohere embed-v4.
Full documents without chunking (128K tokens): Cohere embed-v4 is the only option in this range.
4. Do you need hybrid retrieval?
If your queries frequently include product codes, error messages, legal citation numbers, or other exact-match terms, dense-only retrieval misses them. See advanced RAG methods for how hybrid retrieval fits into production pipelines.
BGE-M3 handles dense, sparse, and multi-vector in one model. Everything else requires a separate BM25 index alongside your dense vectors.
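If you do run a separate BM25 index, the two ranked lists need merging. Reciprocal rank fusion is a common choice because it ignores the incomparable raw scores of the two retrievers; a minimal sketch (k=60 is the constant conventionally used in the RRF literature):

```python
def rrf(rankings: list, k: int = 60) -> list:
    """Reciprocal rank fusion: merge ranked lists of doc IDs without
    comparing raw scores. A doc's fused score is the sum of
    1 / (k + rank) over every list it appears in."""
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d3", "d1", "d7"]   # lexical hits (e.g. exact error codes)
dense = ["d1", "d2", "d3"]  # semantic hits
fused = rrf([bm25, dense])
```

Documents that appear in both lists float to the top, which is exactly the behavior you want for exact-match-plus-semantic queries.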
5. Do you need domain-specific tuning?
General-purpose models perform well across domains, but fine-tuning on your corpus produces 10-30% retrieval gains for specialized content. Legal, medical, financial, and code corpora all benefit from domain-specific embedding fine-tuning. Voyage publishes domain-specific variants for code, law, and finance, making it the easiest managed option. For self-hosted fine-tuning, the enterprise fine-tuning workflow covers the full dataset and training setup.
PremAI's evaluations framework lets you run side-by-side retrieval comparisons between embedding models on your actual corpus before committing to re-indexing.
MTEB Retrieval vs. Overall Score: Why It Matters
The overall MTEB score blends performance across classification, clustering, pair classification, reranking, retrieval, STS, and summarization. For RAG, only retrieval matters. A model optimized for document classification will have a high overall score but mediocre retrieval NDCG@10.
When evaluating models for your RAG pipeline:
- Filter MTEB results to the Retrieval task category, not the overall leaderboard
- Use NDCG@10 as your primary metric: it measures both whether the right document was found and where it ranked
- Cross-reference with Recall@5 if your pipeline returns fewer than 10 results to the LLM
- Run evaluations on your corpus: even a 50-query labeled set will reveal model-specific failure modes that the public benchmark can't
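The bullets above can be made concrete with a minimal binary-relevance implementation of both metrics (production evals often use graded relevance labels, which this sketch omits):

```python
import math

def ndcg_at_k(ranked_ids: list, relevant_ids: set, k: int = 10) -> float:
    """Binary-relevance NDCG@k: rewards both finding relevant docs
    and ranking them high. Gains are discounted by log2 of position."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0

def recall_at_k(ranked_ids: list, relevant_ids: set, k: int = 5) -> float:
    """Fraction of relevant docs that appear in the top k results."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)
```

Run both over your labeled query set for each candidate model; the deltas matter more than the absolute numbers.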
For teams setting up a systematic evaluation process, LLM evaluation benchmarks, challenges, and trends covers how to build a reliable eval framework beyond benchmark scores. For a practical end-to-end setup, LLM reliability and why evaluation matters covers the full evaluation workflow including retrieval-specific metrics.
Embedding Model Migration: What It Actually Costs
Switching embedding models requires re-embedding your entire corpus. That cost is real.
For a corpus of 100M tokens:
| Model switch | Re-embedding cost |
|---|---|
| To voyage-3-large | $6 |
| To text-embedding-3-large | $13 |
| To Cohere embed-v4 | $10 |
| To Gemini embedding-001 | $15 |
| To BGE-M3 (self-hosted) | Compute cost only |
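The table's numbers are simple arithmetic, worth scripting so you can plug in your own corpus size and candidate prices:

```python
def re_embed_cost(corpus_tokens: int, price_per_million: float) -> float:
    """API cost to re-embed a corpus, excluding validation and downtime."""
    return corpus_tokens / 1_000_000 * price_per_million

# voyage-3-large at $0.06/1M tokens on a 100M-token corpus
cost = re_embed_cost(100_000_000, 0.06)
```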
The token cost is usually manageable. The real cost is downtime (if you're replacing a live index), validation time (re-running your evaluation set to confirm the new model improves retrieval), and the risk that the new model underperforms on a subset of queries you didn't anticipate.
Mitigation: before committing to a full corpus re-embed, run the new model on a representative 1-5% sample of your corpus and compare NDCG@10 on your labeled evaluation set. If it improves by less than 3-5%, the migration probably isn't worth it.
Dimension Reduction: Matryoshka Embeddings
Several models on this list support Matryoshka Representation Learning (MRL): voyage-3-large, text-embedding-3-large, Qwen3-Embedding, Gemini embedding-001, Jina v3, and Nomic embed-v1.5.
MRL trains the model to front-load the most important semantic information into the first N dimensions. This means you can truncate vectors to smaller sizes with predictable, bounded quality loss.
Voyage-3-large at int8 512 dimensions outperforms OpenAI's full 3,072-dim float vectors by 1.16%, at 200x lower storage cost. At production scale with millions of vectors, that difference in storage cost is significant.
If you're using a model with MRL support:
- Determine your recall target (e.g., 95% of the quality at full dimensions)
- Run retrieval quality at decreasing dimensions (2048 → 1024 → 512 → 256)
- Set the dimension at the smallest size that meets your recall target
- Monitor vector storage costs. The savings compound at scale
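Steps 2 and 3 above can be automated as a sweep: retrieve at full dimensions, then measure how much of the top-k survives each truncation. A self-contained sketch using random stand-in vectors (real runs would load your corpus embeddings and labeled queries):

```python
import numpy as np

def topk(q, docs, k=10):
    """Indices of the k nearest docs by cosine (vectors pre-normalized)."""
    return set(np.argsort(-(docs @ q))[:k])

def overlap_at_dims(q, docs, dims, k=10):
    """Fraction of the full-dimension top-k retained after truncating
    both query and docs to `dims` and re-normalizing (MRL-style)."""
    def cut(v):
        v = v[..., :dims]
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    return len(topk(q, docs, k) & topk(cut(q), cut(docs), k)) / k

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 2048))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
q = docs[0] + 0.1 * rng.normal(size=2048)
q /= np.linalg.norm(q)

for d in (1024, 512, 256):
    print(d, overlap_at_dims(q, docs, d))
```

Random vectors degrade faster under truncation than MRL-trained embeddings do, so treat this as harness code, not a quality prediction.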
For teams self-hosting models and optimizing for cost, data distillation and smaller model options covers how model compression applies to the full inference stack.
The Self-Hosting Decision
Proprietary managed APIs are simpler to start with. Self-hosted models offer data sovereignty, lower ongoing costs at scale, and no vendor dependency. The decision comes down to your data sensitivity requirements and your team's operational capacity.
Choose managed API if:
- Your corpus doesn't contain sensitive data that would trigger data residency or compliance concerns
- You don't have the GPU infrastructure or ML engineering capacity to operate model serving
- You're in early stages and want to iterate on your RAG pipeline before locking in infrastructure
Choose self-hosted if:
- Data cannot leave your infrastructure (healthcare, legal, financial services)
- Corpus is large enough that API costs at scale exceed infrastructure costs
- You need to fine-tune the embedding model on your domain-specific data
The self-hosted LLM guide covers hardware requirements and serving options for production model deployment. For teams considering a private AI platform that handles embedding, fine-tuning, and inference under one roof, that post covers what the architecture looks like.
FAQ
Can I use different embedding models for different document types in the same RAG system?
Yes, but with care. You can route different document types to different embedding models and maintain separate vector indexes. The catch: query routing must be consistent. The query embedding must come from the same model as the document embeddings it searches against. Mixing models within a single retrieval call produces meaningless similarity scores.
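One way to enforce that invariant is to store the producing model's name on each index and check it at query time. A hedged sketch (the names and structure are ours, not any particular vector store's API):

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Each index records which embedding model produced its vectors."""
    name: str
    embedding_model: str
    vectors: dict = field(default_factory=dict)

def embed_and_search(index: VectorIndex, query: str, query_model: str) -> list:
    """Refuse to search when the query model differs from the index's
    model: cross-model similarity scores are meaningless."""
    if query_model != index.embedding_model:
        raise ValueError(
            f"query embedded with {query_model!r}, but index "
            f"{index.name!r} was built with {index.embedding_model!r}"
        )
    return []  # a real implementation would embed `query` and search here

idx = VectorIndex("contracts", "voyage-3-large")
```

The check costs nothing and turns a silent relevance bug into a loud error.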
A simpler approach: use a strong multilingual general-purpose model (BGE-M3 or Qwen3-Embedding) for everything, then fine-tune on your domain-specific subset if retrieval quality on that subset is below threshold.
How often should I re-evaluate my embedding model choice?
The MTEB leaderboard shifts meaningfully every 3-4 months. New models routinely produce 3-5% retrieval improvements over their predecessors. A practical cadence: re-run your retrieval evaluation on the current top 3 models every 6 months. If a new model beats your current one by more than 5% on your corpus, plan a migration. Under 5%, the re-embedding cost rarely pays off.
Does embedding model size correlate with retrieval quality?
Roughly, but with diminishing returns. Moving from 22M (MiniLM) to 568M (BGE-M3) produces a large quality jump. Moving from 568M to 7B (NV-Embed-v2) produces a smaller proportional gain. Jina v3 at 570M parameters outperforms E5-Mistral at 7B on most multilingual tasks while using 12x fewer compute resources. Parameter count is a rough proxy, not a guarantee.
Should I fine-tune my embedding model?
For general-purpose corpora, probably not worth it. For specialized domains, yes. Fine-tuning typically improves retrieval by 10-30% for in-domain queries. The prerequisite is a labeled dataset of query-document pairs from your actual corpus. If you don't have 500-1,000 labeled examples, start with a strong general model and build your eval set first. The enterprise dataset automation guide covers how to build labeled datasets at scale for fine-tuning. For smaller embedding models specifically, fine-tuning small language models covers the training setup and tradeoffs.
What about embedding models for code retrieval?
Voyage's voyage-code-3 is the top managed option for code-specific RAG. Among open-source, Qwen3-Embedding-8B leads the MTEB-Code benchmark and handles both natural language queries and code natively. For SQL-specific retrieval, PremAI's PremSQL and text-to-SQL evaluation covers a different retrieval paradigm that bypasses embedding entirely for structured data queries.
Building production RAG with self-hosted or fine-tuned embedding models? PremAI's platform handles dataset management, embedding fine-tuning, evaluation, and inference with full data sovereignty. Book a call to talk through your setup.