12 Best Open-Source LLMs for Production in 2026: Real Benchmarks, Real Problems
Which open-source LLMs actually work in production? Real benchmarks, deployment problems, user complaints, and what to watch for.
Open-source LLMs have caught up on benchmarks. But benchmarks lie.
The real story is what happens when you deploy. DeepSeek V3 scores 90+ on HumanEval but inserts random text into outputs. Llama 4 Maverick claims 1M context but performance degrades past 200K. Gemma 3 27B somehow runs slower than the 70B Llama model on identical hardware.
This guide ranks 12 production-ready open-source LLMs based on real deployment experiences, actual hardware requirements, and the problems you'll hit. We've pulled from GitHub issues, Reddit threads, and HuggingFace discussions to give you the full picture.
The shift toward self-hosted deployment makes sense financially. But choosing wrong costs months of engineering time.
Quick Comparison: What You Actually Get
| Model | Active Params | Real VRAM (FP16) | Tokens/Sec* | License | Reality Check |
|---|---|---|---|---|---|
| DeepSeek V3.2 | 37B | 8xH100 minimum | 60 t/s | MIT | Best reasoning, but random text insertions |
| Qwen 3-235B | 22B | 8xH100 (470GB quantized) | 34 t/s | Apache 2.0 | Thinking mode adds 2-3x latency |
| Llama 4 Maverick | 17B | 800GB+ full weights | 50 t/s | Llama License | Context degrades past 200K |
| Mistral Large 3 | 41B | 8xH200 or 8xH100 | Varies | Apache 2.0 | Not optimized for vision despite having it |
| Llama 4 Scout | 17B | 216GB + 16GB KV | 148 t/s | Llama License | AWS caps at 328K, not 10M |
| Gemma 3 27B | 27B | 62GB base + KV | 50 t/s | Gemma License | Slower than Llama-70B on same hardware |
| Qwen 3-32B | 32B | 1xA100 80GB | 64 t/s | Apache 2.0 | Most stable mid-size option |
| DeepSeek-Coder V2 | 21B | 4xH100 | 65 t/s | DeepSeek License | Code specialist, limited general use |
| Mistral Small 3.2 | 24B | 2xH100 (doubled from 3.1) | 93 t/s | Apache 2.0 | VRAM usage doubled from previous version |
| Gemma 3 12B | 12B | 24GB minimum | 130 t/s | Gemma License | Context limited to 4K in practice |
| Qwen 3-8B | 8B | 16GB | 150+ t/s | Apache 2.0 | Best consumer option |
| Phi-4 | 14B | 28GB | 80 t/s | MIT | Limited multi-turn capability |
*Tokens/second vary by hardware, quantization, and batch size. Numbers from user reports on comparable setups.
1. DeepSeek V3.2
Parameters: 685B total, 37B active (MoE)
License: MIT
Context: 128K tokens
Release: December 2025
DeepSeek V3.2 delivers the best open-source reasoning available. MMLU 88.5, MATH-500 90.2, competitive with GPT-4.5 on most benchmarks. The MIT license means no restrictions on commercial use.
But it has problems.
What the Benchmarks Don't Show
- Random text insertions. Users report the model inserting unrelated text mid-response, particularly in longer outputs. The DeepSeek team acknowledged this in V3.1 release notes, calling it a known issue with "hybrid inference modes."
- Instruction following degrades. Reddit user Dr_Karminski tested V3.1 extensively: "I asked for only the changed code. It output the entire file. Three times. With different prompts." This pattern appeared across multiple users testing coding tasks.
- TypeScript performance is inconsistent. On 16x Eval's coding benchmark, V3.1 scored 1/10 on TypeScript narrowing tasks. The model couldn't identify invalid Tailwind CSS classes like z-60 or z-70. For comparison, Claude Sonnet 4 scored 9/10 on the same tests.
- Censorship affects certain topics. Questions involving Taiwan, Tibet, or Tiananmen return answers aligned with Chinese government positions or get refused entirely. For enterprise use cases requiring balanced geopolitical content, this matters.
Deployment Reality
Minimum setup: 8x H100 GPUs for full precision inference.
Full model weights: 700GB. You'll need substantial storage and bandwidth for initial deployment.
Generation speed: 60 tokens/second once running, roughly 3x faster than V2. The Multi-head Latent Attention architecture delivers real efficiency gains.
Server stability: Direct API users report frequent "server busy" errors during peak times. Self-hosting eliminates this but requires the hardware.
Quantization: FP8 weights available, reducing to 4xH100 for inference. Quality loss is minimal for most use cases.
Terms of Service Warning
DeepSeek's terms hold users liable for all inputs and outputs. The language is broader than most: you must ensure legal rights to all submitted data and are responsible if outputs breach any laws. For regulated industries, review these terms with legal counsel.
Real Benchmarks
| Benchmark | DeepSeek V3.2 | GPT-4.5 | Claude Opus 4 |
|---|---|---|---|
| MMLU | 88.5 | 89.2 | 86.8 |
| MATH-500 | 90.2 | 91.0 | 78.3 |
| HumanEval | 90+ | 92 | 93 |
| LiveCodeBench | 64.3 | 68.1 | 65.2 |
| SWE-Bench | 50.8 | 52.4 | 49.1 |
Best For
Complex reasoning and mathematical tasks where you can work around the random insertion issue. Research applications. Code generation with human review. Teams that need MIT licensing flexibility.
Skip It If
You need reliable TypeScript or frontend development. You're building products where random text insertion would cause failures. You need balanced geopolitical content.
For teams building fine-tuned models, DeepSeek provides the strongest reasoning foundation despite its quirks.
2. Qwen 3-235B-A22B
Parameters: 235B total, 22B active
License: Apache 2.0
Context: 128K tokens (256K with recent updates)
Release: May 2025, updated July 2025
Alibaba's flagship offers something unique: unified thinking and non-thinking modes in one model. Switch between deep reasoning and fast responses without deploying separate models.
The Apache 2.0 license makes it the most permissively licensed frontier-class model available.
Thinking Mode Tradeoffs
The "thinking mode" produces reasoning traces in <think> blocks before final answers. This improves accuracy on math and logic tasks. But it has costs:
Latency increases 2-3x. A response that takes 2 seconds in non-thinking mode takes 5-6 seconds in thinking mode. The model generates extensive reasoning before output.
Token usage explodes. Thinking traces consume tokens. A 500-token answer might require 2,000+ tokens total with thinking enabled. This affects both cost and context limits.
Not all tasks benefit. Simple queries get slower without accuracy gains. The July 2025 update (Instruct-2507) added better budget control, but you still need to tune this per use case.
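In practice you usually want to log the reasoning trace but return only the final answer. A minimal sketch, assuming the model emits a single well-formed `<think>...</think>` block as Qwen 3 does in thinking mode:

```python
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Split a thinking-mode completion into (reasoning_trace, final_answer).

    Assumes at most one well-formed <think>...</think> block at the start
    of the completion; non-thinking output passes through unchanged.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not match:
        return "", raw.strip()          # non-thinking mode: no trace
    trace = match.group(1).strip()
    answer = raw[match.end():].strip()  # everything after the trace
    return trace, answer

trace, answer = split_thinking("<think>2+2 is basic arithmetic.</think>The answer is 4.")
```

Logging the trace separately also lets you measure how much of your token budget thinking actually consumes per request.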
Hardware Reality
Full precision: 8xH100 GPUs with tensor parallelism.
Model size: ~470GB for GGUF BF16 quantized weights. Storage planning matters.
OOM errors are common. HuggingFace discussions show users hitting out-of-memory even on H100 nodes. The fix: reduce context length to 32K from the claimed 128K. The model works at full context but requires careful memory management.
Consumer hardware: The smaller Qwen3-30B-A3B runs on a single high-end GPU. Users report ~34 tokens/second on RX 7900 XTX with Q4 quantization.
What Works Well
119 language support. Not just tokens trained on multilingual data, but actual quality across languages. Chinese, Japanese, Arabic, and European languages all perform well.
Tool calling. The model's function calling is more reliable than most open alternatives. The Qwen-Agent framework handles tool parsing well.
MCP integration. Recent updates added Model Context Protocol support, making agent workflows simpler to build.
Integration Complexity
Tool calling works, but setup isn't plug-and-play. The recommended approach uses Qwen-Agent which handles tool templates and parsers. Raw API integration requires careful prompt engineering.
From the technical report: "We recommend using Qwen-Agent to make the best use of agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity."
Translation: the raw model needs wrapper infrastructure.
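If you skip the framework, you parse tool calls yourself. A minimal sketch, assuming Hermes-style `<tool_call>` JSON blocks (the kind of template Qwen-Agent's parsers encapsulate; verify the exact format against your chat template):

```python
import json
import re

def parse_tool_calls(completion: str) -> list[dict]:
    """Extract tool calls from a raw completion.

    Assumes <tool_call>{...}</tool_call> blocks containing JSON; adjust
    the delimiters to whatever your chat template actually emits.
    """
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", completion, re.DOTALL):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # models occasionally emit malformed JSON; skip or retry
    return calls

calls = parse_tool_calls(
    'Let me check.<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
```

The retry-on-malformed-JSON branch is exactly the kind of edge case the wrapper framework handles for you.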
Deployment
```bash
# SGLang deployment (recommended)
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --tp 8 \
  --context-length 262144

# If you hit OOM, reduce context:
#   --context-length 32768
```
The transformers>=4.51.0 requirement catches people. Older versions throw errors when loading the MoE architecture.
Real Benchmarks
| Benchmark | Qwen 3-235B (Thinking) | DeepSeek R1 | Gemini 2.5 Pro |
|---|---|---|---|
| AIME 2024 | 85.7 | 79.8 | 83.2 |
| AIME 2025 | 81.5 | 72.6 | 78.9 |
| LiveCodeBench v5 | 70.7 | 65.9 | 68.4 |
| BFCL v3 (Tools) | 70.8 | 62.1 | 71.2 |
Qwen 3 outperforms DeepSeek R1 on 17 of 23 benchmarks while using fewer active parameters.
Best For
Multilingual applications. Agentic workflows with tool calling. Teams that need Apache 2.0 licensing. Mathematical reasoning with controllable thinking budgets.
Skip It If
You need consistent sub-second latency. You're deploying on consumer hardware. You want plug-and-play tool integration without framework overhead.
3. Llama 4 Maverick
Parameters: 400B total, 17B active (128 experts)
License: Llama 4 Community License
Context: 1M tokens
Release: April 2025
Meta's flagship generated massive hype. Native multimodal, 1M context, trained with Behemoth distillation. The benchmarks looked impressive.
Then people deployed it.
The Benchmark Controversy
The AI community immediately questioned Meta's numbers. Independent testing showed significant gaps:
HumanEval discrepancy. Meta claimed scores competitive with GPT-4o. Independent tests from LM Arena showed 62% accuracy vs. Gemma 3 27B's 74%. Reddit user Dr_Karminski summarized it: "They completely surpassed my expectations... in a negative direction."
Coding performance. Users reported 18% more Python syntax errors compared to DeepSeek R1 in controlled tests. The model struggles with complex multi-file code generation.
Real document analysis. Reddit user Holvagyok tested legal document processing: the model missed key clauses, produced incorrect summaries, and performed worse than smaller models on domain-specific tasks.
Context Window Reality
Maverick claims 1M token context. Reality is more complicated.
Performance degrades past 200K tokens. The NIHAS 1M-token benchmark showed 92% factual recall at full context. But users report synthesis tasks (like analyzing contracts or comparing documents) degrade significantly at long contexts.
Providers cap lower. AWS Bedrock limits to 328K tokens. Still 2.5x higher than Gemini 2.5 Pro's 128K, but well below the marketed 1M.
Memory explodes at scale. Full weights require 800GB+ storage. KV cache for long contexts adds substantial VRAM.
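The KV cache growth is easy to estimate with back-of-envelope math. A sketch — the layer count, KV-head count, and head dimension below are illustrative placeholders, not Maverick's published configuration:

```python
def kv_cache_gib(seq_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV cache size in GiB for one sequence.

    The factor of 2 covers the K and V tensors; bytes_per_elem=2
    assumes an FP16/BF16 cache (use 1 for an FP8 cache).
    """
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
    return total / 1024**3

# Illustrative hyperparameters only (48 layers, 8 GQA KV heads, head_dim 128):
print(round(kv_cache_gib(200_000, 48, 8, 128), 1))  # → 36.6
```

At these assumed dimensions, a single 200K-token sequence alone consumes tens of gigabytes of cache on top of the weights, which is why long-context serving needs multi-GPU headroom.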
Security Assessment
ProtectAI ran vulnerability scans on both Llama 4 models:
- Risk score: 52-58 (medium risk)
- Successful attacks: ~490 across both models
- Llama Guard 4 bypass rate: 33.8%
One-third of harmful prompts bypassed Meta's safety guardrails. For enterprise deployment, additional safety infrastructure is required.
Hardware Requirements
| Setup | Memory | Notes |
|---|---|---|
| Full weights (FP16) | 800GB+ | Multi-node required |
| 8-bit quantized | 400GB | 8xH100 |
| 4-bit quantized | 200GB | 4xH100 |
| INT4 with KV compression | 100GB | 2xH100 possible |
Inference speed: ~50 tokens/second on RTX 5090 with Q4 quantization. For comparison, Gemma 2 27B hits 76 tokens/second on the same hardware.
Licensing Terms
The Llama 4 Community License allows commercial use with restrictions:
- Products exceeding 700M monthly active users need separate licensing
- Acceptable use policy prohibits certain applications
- Redistribution requires maintaining attribution
For most enterprises, these terms work. But review the acceptable use policy before deployment in sensitive domains.
What Actually Works
Multimodal processing. Image understanding is solid. The early fusion architecture handles mixed text/image inputs well.
Document QA. For shorter documents (under 50K tokens), retrieval and summarization work reliably.
API availability. Maverick is available through Fireworks, Together AI, and major cloud providers. If you don't want to self-host, options exist.
Real Benchmarks
| Benchmark | Llama 4 Maverick | GPT-4o | Gemini 2.0 Flash |
|---|---|---|---|
| MMLU-Pro | 80.5 | 80.0 | 79.8 |
| DocVQA | 94.4 | 92.3 | 94.1 |
| MMMU | 73.4 | 69.1 | 70.2 |
| HumanEval* | 62-82 | 91 | 85 |
*HumanEval scores vary significantly between Meta's claims (82.4) and independent testing (~62).
Best For
Document understanding tasks. Multimodal applications. RAG systems where you need extended but not extreme context. Teams using managed API services rather than self-hosting.
Skip It If
You need reliable coding generation. You're processing legal or financial documents requiring high accuracy. You need full 1M context synthesis. You're security-conscious about prompt injection.
For RAG implementations, test carefully before committing to Maverick's context window claims.
4. Mistral Large 3
Parameters: 675B total, ~41B active
License: Apache 2.0
Context: 256K tokens
Release: December 2025
Mistral's European flagship offers Apache 2.0 licensing at frontier scale. For GDPR-conscious enterprises, the French development and data sovereignty options matter.
But "general-purpose multimodal" has caveats.
Vision Isn't the Strength
The model has vision capabilities. It processes images. But it's not optimized for vision tasks:
From Mistral's own documentation: "It is not a dedicated reasoning model and is not optimized for vision tasks, so it may not be the best option for reasoning use cases or multimodal tasks that require a lot of vision capability."
If your use case is primarily vision, look elsewhere. Gemma 3 or dedicated vision models will outperform.
Deployment Complexity
Minimum viable setup: 8xH200 or 8xH100 nodes.
FP8 recommended: Mistral suggests FP8 precision deployment, which enables single-node inference. NVFP4 (4-bit) reduces further but with quality tradeoffs.
vLLM configuration matters. The recommended deployment uses specific Mistral tokenizer and config modes:
```bash
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 8
```
Getting these flags wrong causes tokenization errors or performance degradation.
The Ministral 3 Problem
If you're considering the smaller Mistral models, be aware of version issues:
Mistral Small 3.1 vs 3.2: The jump from 3.1 to 3.2 nearly doubled VRAM requirements. Users running 3.1 on a single H100 found 3.2 needs 2xH100. Same parameter count, double the memory.
Architecture change: Ollama's 3.1 model uses the "mistral3" architecture which is significantly slower than the "llama" architecture used in earlier versions. Users report the same model running at different speeds depending on source:
| Source | Context | Memory | Speed |
|---|---|---|---|
| Ollama mistral-small3.1 | 32k | 36GB | Slower |
| HuggingFace bartowski GGUF | 32k | 21GB | Faster |
Same model, different packaging, different performance.
What Works
Multilingual: 40+ languages with strong performance on non-English. European languages particularly well-tuned.
Function calling: Native tool use support with JSON structured output.
Document processing: The 256K context handles long documents well for enterprise knowledge bases.
Real Benchmarks
| Benchmark | Mistral Large 3 | GPT-4o | Claude Sonnet 4 |
|---|---|---|---|
| MMLU (8-lang) | 85.5 | 87.2 | 86.8 |
| HumanEval | 92 | 91 | 93 |
| MMLU-Pro | 73.1 | 74.8 | 73.6 |
Best For
European enterprises needing Apache 2.0 licensing and data sovereignty. Multilingual document processing. General enterprise AI where vision isn't primary.
Skip It If
Vision tasks are core to your application. You need reasoning-specialist capability. You're running on consumer or professional GPU hardware (requires H100-class minimum).
5. Llama 4 Scout
Parameters: 109B total, 17B active (16 experts)
License: Llama 4 Community License
Context: 10M tokens (claimed)
Release: April 2025
Scout is Llama 4's efficiency-focused option. Single H100 deployment with the longest context window in open-source.
The 10M context claim needs asterisks.
Context Window Reality
Provider limits: AWS caps at 328K tokens. Still long, but 3% of claimed capacity.
Memory requirements: Full 10M context would require extraordinary KV cache. Blockwise sparse attention reduces this, but practical limits exist.
Performance at scale: Users report 92% factual recall on the NIHAS 1M-token benchmark. But synthesis tasks requiring cross-document reasoning degrade significantly at long contexts.
For most use cases, treat 200-300K as the practical ceiling with good performance.
Hardware Efficiency
This is where Scout shines:
| Quantization | Memory | Configuration |
|---|---|---|
| Full weights | 216GB + 16GB KV | 4xH100 |
| 8-bit | 109GB + 8GB KV | 2xH100 |
| 4-bit | 54.5GB + 8GB KV | 1xH100 |
| 2-bit | 27.3GB + 8GB KV | 1xA100 |
Inference speed: 148 tokens/second at 4-bit on single H100, roughly 1.7x faster than Llama 3 at similar sizes.
The Download Problem
Scout's HuggingFace launch showed 18,000 downloads in 48 hours. For comparison, Llama 3 hit this threshold faster. The slower adoption suggests community hesitation after benchmark controversies.
Fine-tuning Viable
With LoRA adapters under 20GB VRAM, Scout becomes accessible for domain-specific fine-tuning without massive infrastructure.
Real Benchmarks
| Benchmark | Llama 4 Scout | Gemma 3 27B | Mistral 3.1 24B |
|---|---|---|---|
| MMLU-Pro | 74.3 | 67.5 | 73+ |
| HumanEval | 74.1 | 89 | 92.9 |
| MATH | 50.3 | 58+ | 55+ |
Scout underperforms smaller models on coding benchmarks. Gemma 3 27B beats it on HumanEval despite being significantly smaller.
Best For
Long-document RAG systems. Code repository analysis (reading, not writing). Environments where single-GPU deployment matters. Fine-tuning projects needing efficiency.
Skip It If
Coding quality is critical. You actually need 10M context (current infrastructure doesn't support it). You need frontier-class capability in a smaller package.
6. Gemma 3 27B
Parameters: 27B (dense)
License: Gemma Terms of Use
Context: 128K tokens
Release: March 2025
Google's dense model offers multimodal processing in a relatively compact package. 140+ language support. Runs on professional GPUs.
But something's wrong with the performance.
The Speed Problem
Multiple users report Gemma 3 27B running slower than larger models on identical hardware:
From HuggingFace discussions: "I can echo I have the issue: with the same 2-A100-80G GPUs, Gemma3-27B is slower than the Llama-70B in my tests, which is very strange."
Ollama benchmarks on RTX 5090:
- Gemma 3 27B: 50 tokens/second
- Gemma 2 27B: 76 tokens/second
- Qwen 2.5 32B: 64 tokens/second
Same model family, newer version, 34% slower. The architectural changes for multimodal support appear to have performance costs.
VRAM Issues
The KV cache behavior is unusual:
From GitHub issue #9678: "When using Gemma 3 27B with a context length of 20,000 (20k), I run out of VRAM on a 4090. However, when using Qwen2.5 32B IQ4XS, which is basically the same size as Gemma 3 27B Q4KM, with a full 32K context, I still have 2 GB of VRAM left."
Gemma 3 uses significantly more memory per token of context than comparable models.
Context Limitations in Practice
OOM crashes reported. Earlier Ollama versions (0.6.0) could run Gemma 3 12B at 8K context and 20-25 tokens/second. Newer versions crash systems at 8K, limiting users to 4K context.
Training vs. inference: Full training requires 500GB+ VRAM. One H100 is not enough. Single-GPU deployment works for inference only.
What Works
Multimodal quality. Image understanding is good. The native vision encoder avoids the bolted-on feel of some competitors.
Multilingual breadth. 140+ languages with reasonable quality across them.
Integration options. Supported by Hugging Face, vLLM, TGI, Ollama, and most major frameworks.
Hardware Requirements
| Setup | VRAM | Notes |
|---|---|---|
| Full weights | 62GB | Single H100 |
| QAT INT4 | ~14GB | RTX 3090/4090 |
| With KV cache (32K) | 20GB+ | Limited context on consumer GPUs |
Expect 20-30 tokens/second on an RTX 3090 serving around 300 requests per day. Acceptable for many use cases.
Real Benchmarks
| Benchmark | Gemma 3 27B | Qwen 3-32B | Llama 4 Scout |
|---|---|---|---|
| HumanEval | 89 | 85+ | 74.1 |
| MMLU | ~78 | 83.3 | 79.6 |
| Arena Elo | 1339 | 1340+ | ~1300 |
Strong HumanEval performance, but the speed penalty means it generates code slower than alternatives.
Best For
Multimodal tasks where image understanding matters. Consumer GPU deployment with quantization. Teams already in Google's ecosystem.
Skip It If
You need maximum inference speed. Long context (32K+) is required. You're comparing against larger models and expect size-based speed advantages.
7. Qwen 3-32B
Parameters: 32B (dense)
License: Apache 2.0
Context: 128K tokens
Release: May 2025
The most stable mid-size option. Dense architecture avoids MoE routing complexity. Apache 2.0 licensing. Reasonable hardware requirements.
Why It Works
Predictable behavior. Dense models have simpler failure modes than MoE. Debugging is easier. Performance is more consistent across inputs.
Hardware fit. Single A100 80GB handles full precision. Single RTX 4090 handles 4-bit quantization. The sweet spot for professional GPU deployments.
Community adoption. Extensive quantization options available. Most inference frameworks support it without special configuration.
Performance
On LM Arena, Qwen 3-32B scores comparably to models 2-3x its size when properly tuned. The thinking mode (if enabled) pushes it higher on reasoning tasks.
| Benchmark | Qwen 3-32B | Gemma 3 27B | Llama 3.1 70B |
|---|---|---|---|
| MMLU | 83.3 | ~78 | 86.0 |
| HumanEval | 85+ | 89 | 80.5 |
Deployment
```bash
# Ollama (simplest)
ollama run qwen3:32b

# vLLM for production
vllm serve Qwen/Qwen3-32B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768
```
Best For
Teams wanting a reliable mid-size model. Apache 2.0 license requirements. Single-GPU deployment targets. Predictable, debuggable behavior.
8. DeepSeek-Coder V2
Parameters: 236B total, 21B active (MoE)
License: DeepSeek License
Context: 128K tokens
Release: June 2024, updated July 2025
DeepSeek-Coder V2 is the specialist. While general models handle code adequately, this one was built from the ground up for programming. Trained on 6 trillion tokens with 338 programming languages supported.
The tradeoff is clear: exceptional at code, mediocre at everything else.
What Makes It Different
The model was pre-trained from an intermediate DeepSeek-V2 checkpoint, then continued training on code-heavy data. This creates a model that thinks in code patterns rather than adapting general language understanding to programming tasks.
Language coverage: 338 programming languages, up from 86 in DeepSeek-Coder V1. This includes obscure languages that general models struggle with.
Fill-in-the-Middle (FIM): Native support for code completion in the middle of files, not just at the end. Critical for IDE integration.
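FIM prompting arranges the file's prefix and suffix around a hole sentinel and asks the model to generate the middle. A sketch with placeholder sentinel strings — the real special tokens come from the model's tokenizer config, not the names used here:

```python
# Sentinel strings are illustrative placeholders, NOT DeepSeek's actual
# special tokens; look them up in the model's tokenizer config.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix/suffix around the hole the model should fill."""
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"

prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))",
)
```

An IDE integration sends this prompt and splices the completion between the cursor's prefix and suffix.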
The Limitations
General tasks suffer. Ask DeepSeek-Coder to write marketing copy or analyze a business document, and quality drops noticeably. The model was optimized for code at the expense of general capability.
High-performance computing gaps. Research from Nader et al. found that on HPC tasks like matrix multiplication and DGEMM benchmarks, DeepSeek-generated code lagged behind GPT-4 in scalability and execution efficiency. Manual optimization was often required.
Inference speed penalty. Despite the MoE architecture activating only 21B parameters, the 236B total still requires significant infrastructure. Users report slower inference than equivalently-sized dense models in some configurations.
Hardware Requirements
| Setup | Memory | Notes |
|---|---|---|
| Full BF16 | 8x 80GB GPUs | Production deployment |
| Lite variant (16B) | 2x 80GB GPUs | Reduced capability |
| Quantized | 4x H100 | Acceptable for most use cases |
The Lite variant (16B total, 2.4B active) runs on smaller setups but with reduced capability. For serious code generation, the full model is worth the infrastructure.
Real Benchmarks
| Benchmark | DeepSeek-Coder V2 | GPT-4 Turbo | Claude 3 Opus |
|---|---|---|---|
| HumanEval | 90+ | 87 | 85 |
| MBPP | 89+ | 86 | 84 |
| LiveCodeBench | 43.4 | 45.2 | 42.8 |
Strong on standard benchmarks, but LiveCodeBench (newer problems) shows the gap narrows against proprietary models.
Deployment
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Instruct",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Instruct",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()
```
Note the trust_remote_code=True requirement. DeepSeek models use custom architecture code that must be executed.
For fine-tuning on code-specific tasks, DeepSeek-Coder V2 provides a strong foundation despite the infrastructure requirements.
Best For
Pure code generation workflows. IDE integration for code completion. Teams that need a dedicated coding model separate from their general assistant. Legacy code migration projects where language coverage matters.
Skip It If
You need a general-purpose assistant that also codes well. Your use case mixes code with documentation, communication, or analysis. You're constrained to single-GPU deployment.
9. Mistral Small 3.2
Parameters: 24B (dense)
License: Apache 2.0
Context: 128K tokens
Release: June 2025
Mistral Small 3.2 brings vision capabilities to the small model tier. Apache 2.0 licensing makes it attractive for commercial deployment. But the upgrade from 3.1 came with hidden costs.
The Version Upgrade Problem
Mistral Small 3.1 fit comfortably on a single H100 with room to spare. Then 3.2 arrived.
From HuggingFace discussion (June 2025): "Trying to run this model essentially entails doubling our infrastructure. Small 3.1 fit easily on a single H100 with plenty of headroom. With 3.2 we need to use 2xH100 because VRAM itself is >55GB and then the KV cache and map puts it past the 80GB of a single H100."
Same parameter count. Same model name. Double the memory.
Architecture Change Fallout
The 3.1 to 3.2 transition also changed the underlying architecture. Ollama's implementation reveals the issue:
From GitHub issue #10553: Users report the "mistral3" architecture used by Ollama's build is significantly slower than the "llama" architecture used by community GGUF conversions. The same model downloaded from different sources (Ollama vs. HuggingFace bartowski GGUF) shows different performance:
| Source | Context | Memory | Relative Speed |
|---|---|---|---|
| Ollama mistral-small3.1 Q4_K_M | 32K | 36GB | Slower |
| HuggingFace bartowski GGUF Q4_K_M | 32K | 21GB | Faster |
The HuggingFace version using "llama" architecture performs closer to 3.1 expectations. Same weights, different packaging, different results.
Vision Performance Reality
Mistral Small 3.2 includes vision capabilities, but don't expect flagship performance.
From GitHub issue #10393: "Experiments with the Q4_K_M quant currently available are only ~3 tokens/second, putting the laptop under full load for more than a minute."
On laptop RTX 4090 (16GB VRAM variant), vision inference at Q4 quantization delivers 3 tokens/second. That's a minute of GPU saturation for a single image description. The IQ3_XXS quantization improves to 20+ tokens/second but with quality tradeoffs.
VRAM Anomalies
Users report the 3.1 model using double the expected VRAM with default settings:
From GitHub issue #10177: "Loading mistral-small3.1 24b in Q4 takes double the amount of VRAM it should use with default 4096 context."
This appears related to the architecture change and affects Ollama deployments specifically.
Hardware Requirements
| Setup | Memory | Notes |
|---|---|---|
| Full precision | 55GB+ | Single H100 insufficient |
| Recommended | 2x H100 | With KV cache headroom |
| Q4 quantized | 24-36GB | Varies by framework |
What Works
Multilingual. Strong performance across 40+ languages, typical of Mistral models.
Apache 2.0. Maximum licensing flexibility for commercial use.
Function calling. Native tool use support for agent applications.
Real Benchmarks
| Benchmark | Mistral Small 3.2 | Gemma 3 27B | Qwen 3-32B |
|---|---|---|---|
| HumanEval | 92.9 | 89 | 85+ |
| MMLU | 73+ | ~78 | 83.3 |
Strong coding performance, but general knowledge (MMLU) lags behind similarly-sized competitors.
Best For
Teams already committed to Mistral infrastructure. Apache 2.0 license requirements. Multilingual European deployments. Organizations needing models that can be updated over time.
Skip It If
You're upgrading from 3.1 and expect the same infrastructure to work. Vision is a primary use case. You need single-GPU deployment.
10. Gemma 3 12B
Parameters: 12B (dense)
License: Gemma Terms of Use
Context: 128K tokens (claimed)
Release: March 2025
Google positioned Gemma 3 12B for edge deployment. Mobile devices, laptops, resource-constrained servers. The model fits where larger options can't.
But edge deployment revealed edge cases.
Context Window Problems
The 128K context claim runs into practical limits quickly.
From community reports: Earlier Ollama versions (0.6.0) ran Gemma 3 12B at 8K context with 20-25 tokens/second. Newer versions crash systems attempting the same configuration, forcing users down to 4K context.
This appears to be a KV cache management issue. The model's attention mechanism consumes more memory per token than comparable models, and framework updates haven't kept pace with the architecture's demands.
Speed vs. Size Expectations
You'd expect a 12B model to run faster than a 27B model. With Gemma 3, that assumption breaks:
From user reports: The Gemma 3 family shows unexpected speed regressions compared to Gemma 2. On identical hardware, Gemma 3 runs ~30% slower than its predecessor despite architectural improvements claimed to increase efficiency.
The culprit appears to be the multimodal architecture. Even when processing text-only, the vision encoder overhead affects performance.
Consumer Hardware Reality
| GPU | Quantization | VRAM | Context | Speed |
|---|---|---|---|---|
| RTX 3090 24GB | Q4_K_M | ~14GB | 4K | 25-30 t/s |
| RTX 4090 24GB | Q4_K_M | ~14GB | 8K | 35-40 t/s |
| RTX 4060 8GB | Q4_K_M | ~8GB | 2K | 15-20 t/s |
The Q4_K_M quantization makes consumer deployment viable, but context limitations hurt usability for document processing or long conversations.
Framework Sensitivity
Gemma 3 12B performance varies significantly between deployment frameworks:
vLLM: Generally stable but requires specific version matching.
Ollama: Context issues reported across versions. Check release notes before deploying.
llama.cpp: Works but lacks some optimizations available for Llama-architecture models.
Test on your exact production framework and version before committing.
What Works
Multimodal in a small package. Image understanding without requiring datacenter hardware.
Multilingual. 140+ languages with reasonable quality.
Integration. Supported by Hugging Face, JAX, PyTorch, TensorFlow via Keras 3.0, vLLM, and Ollama.
Real Benchmarks
| Benchmark | Gemma 3 12B | Qwen 3-8B | Phi-4 14B |
|---|---|---|---|
| MMLU | ~70 | 74+ | 84.8 |
| HumanEval | 75+ | 80+ | 82 |
Trails both Qwen 3-8B (smaller) and Phi-4 (similar size) on most benchmarks. The multimodal capability is the differentiator.
Best For
Multimodal applications on consumer hardware. Google ecosystem integration. Quick prototyping where vision understanding matters.
Skip It If
Text-only workloads where Qwen 3-8B or Phi-4 would suffice. You need 8K+ context reliably. Maximum inference speed matters.
11. Qwen 3-8B
Parameters: 8.2B (dense)
License: Apache 2.0
Context: 128K tokens (40K native, YaRN extended)
Release: May 2025
Qwen 3-8B hits the sweet spot for consumer deployment. Runs on a 16GB GPU. Apache 2.0 licensed. Thinking mode available when you need it.
This is the model most developers should start with.
Consumer Hardware Performance
| GPU | Quantization | VRAM Used | Speed | Notes |
|---|---|---|---|---|
| RTX 4090 24GB | Q4_K_M | ~6GB | 40+ t/s | Plenty of headroom |
| RTX 3080 10GB | Q4_K_M | ~6GB | 30+ t/s | Comfortable fit |
| RTX 4060 8GB | Q4_K_M | ~5.5GB | 25+ t/s | Sweet spot for 8GB cards |
| M2 Mac 16GB | Q4_K_M | ~6GB | 20+ t/s | Unified memory works well |
The ~6GB footprint at Q4_K_M leaves headroom for KV cache even on 8GB GPUs. Most users report smooth operation at 2K-4K context without memory pressure.
Thinking Mode Tradeoffs
Like the larger Qwen 3 models, the 8B variant supports thinking/non-thinking modes. The behavior differs:
Thinking mode: Adds <think>...</think> blocks before responses. Improves accuracy on math, logic, and coding. Increases latency 2-3x. Consumes more tokens.
Non-thinking mode: Direct responses without reasoning traces. Faster. Suitable for conversational use.
Critical setting: Don't use greedy decoding with thinking mode. Qwen documentation explicitly warns this causes "performance degradation and endless repetitions." Use temperature 0.7, top_p 0.8, top_k 20 as recommended.
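If you serve thinking mode behind an interface that expects clean answers, the reasoning trace usually needs to be stripped before display. A minimal sketch (the `<think>...</think>` tag format is what Qwen documents; the helper name is ours):

```python
import re

# Qwen 3 emits its reasoning inside <think>...</think> before the answer.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thinking(raw: str) -> str:
    """Remove <think>...</think> reasoning traces from a model response."""
    return THINK_RE.sub("", raw).strip()

reply = "<think>2+2 is 4, trivial.</think>The answer is 4."
print(strip_thinking(reply))  # -> The answer is 4.
```

Keep the trace in your logs even when you hide it from users; it is often the fastest way to debug a wrong answer.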
Looping Issues
Some users report the model getting stuck in output loops, particularly at low context lengths:
From Unsloth documentation: "If you're experiencing any looping, Ollama might have set your context length window to 2,048 or so. If this is the case, bump it up to 32,000 and see if the issue still persists."
The fix is usually increasing context length, even if you don't need the full window.
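In the Ollama HTTP API the context window is the `num_ctx` option on `POST /api/generate`. A sketch of building that request body (the endpoint and option name come from Ollama's API docs; the model and values are illustrative, and we only construct the payload here rather than send it):

```python
import json

def build_generate_request(model: str, prompt: str, num_ctx: int = 32000) -> str:
    """JSON body for Ollama's POST /api/generate with an enlarged context window."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        # Ollama's default can be as low as 2048, which triggers the looping.
        "options": {"num_ctx": num_ctx},
    })

body = build_generate_request("qwen3:8b", "Summarize the release notes.")
print(body)
```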
100+ Language Support
Qwen 3-8B inherits the family's multilingual strength. Over 100 languages and dialects with reasonable quality. Strong on CJK languages (Chinese, Japanese, Korean) given Alibaba's development focus.
Deployment
```bash
# Ollama (simplest)
ollama run qwen3:8b
```

`ollama run` accepts no sampling flags. Apply the recommended parameters inside the session with `/set parameter temperature 0.7` (likewise `top_p 0.8` and `top_k 20`), or bake them into a custom Modelfile via `PARAMETER` lines.
For production, vLLM or SGLang provide better throughput:
```bash
vllm serve Qwen/Qwen3-8B \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9
```
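Once serving, vLLM exposes an OpenAI-compatible endpoint at `/v1/chat/completions`. A sketch of the request body (the model name must match what `vllm serve` loaded; we only build the payload here rather than send it):

```python
import json

def chat_request(model: str, user_msg: str) -> str:
    """JSON body for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        # Qwen's recommended sampling settings for thinking mode:
        "temperature": 0.7,
        "top_p": 0.8,
    })

body = chat_request("Qwen/Qwen3-8B", "Explain KV caching in one sentence.")
print(body)
```

POST this to `http://localhost:8000/v1/chat/completions` with any HTTP client, or point the standard OpenAI SDK at that base URL.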
Real Benchmarks
| Benchmark | Qwen 3-8B | Llama 3.1 8B | Gemma 3 12B |
|---|---|---|---|
| MMLU | 74+ | 73.0 | ~70 |
| HumanEval | 80+ | 72.6 | 75+ |
| MATH | 65+ | 51.9 | 55+ |
Outperforms Llama 3.1 8B across the board. Beats the larger Gemma 3 12B on most metrics. The thinking mode pushes scores higher on reasoning tasks.
Best For
First local LLM deployment. Consumer hardware with 8-16GB VRAM. Multilingual applications. Development and prototyping before scaling to larger models.
Skip It If
You need maximum capability and have the hardware for larger models. Vision/multimodal is required. You're deploying at massive scale where the 32B or 235B variants' efficiency gains matter.
For teams building small model strategies, Qwen 3-8B provides the best starting point.
12. Phi-4
Parameters: 14B (dense)
License: MIT
Context: 16K tokens
Release: December 2024, reasoning variants April 2025
Microsoft's Phi-4 demonstrates what's possible with careful data curation over raw scale. 14 billion parameters matching models 5x larger on reasoning benchmarks.
The MIT license removes all commercial restrictions.
Training Philosophy
Phi-4 diverges from the "more data, bigger model" approach. Microsoft used 400B+ tokens of synthetic, curriculum-style data designed to teach structured reasoning. The result: a model that punches above its weight class on math and logic.
Training infrastructure: 1,920 H100 GPUs over 21 days processing 9.8 trillion tokens. Significant investment for a "small" model.
Known Limitations
Microsoft's own documentation is unusually direct about Phi-4's constraints:
English only. "Phi-4 is not intended to support multilingual use." Languages other than English, especially non-standard American English varieties, perform worse.
Code scope limited. "Majority of Phi-4 training data is based in Python and uses common packages such as typing, math, random, collections, datetime, itertools." Other languages and packages require manual verification.
Multi-turn degradation. Users report quality drops in extended conversations. Phi-4 works better for single-query reasoning than ongoing dialogue.
Bias concerns. Like all models trained on public data, Phi-4 can perpetuate stereotypes or produce inappropriate content. Microsoft recommends additional safety measures for deployment.
The Phi-4 Family
Microsoft expanded Phi-4 into multiple variants:
| Variant | Parameters | Focus | Release |
|---|---|---|---|
| Phi-4 | 14B | General reasoning | Dec 2024 |
| Phi-4-mini | 3.8B | Speed, 128K context | Feb 2025 |
| Phi-4-multimodal | 5.6B | Vision + speech + text | Feb 2025 |
| Phi-4-reasoning | 14B | Enhanced reasoning | Apr 2025 |
| Phi-4-reasoning-plus | 14B | Maximum reasoning | Apr 2025 |
The reasoning variants were trained using distillation from DeepSeek R1, reportedly on ~1 million synthetic math problems.
Hardware Requirements
Phi-4's dense architecture makes hardware planning straightforward:
| Setup | VRAM | Notes |
|---|---|---|
| Full FP16 | ~28GB | Single A100 or H100 |
| BF16 | ~28GB | Preferred format |
| Q4_K_M | ~8-10GB | Consumer GPU viable |
| Q8_0 | ~16GB | High quality, RTX 4090 |
The 16K context limitation keeps memory requirements predictable. No surprise OOM errors from context expansion.
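That predictability follows from the KV cache formula: 2 tensors (keys and values) × layers × KV heads × head dim × context length × bytes per element. A sketch with illustrative GQA dimensions (the layer and head counts below are assumptions for a 14B-class model, not Phi-4's exact config):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size for one sequence: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

# Illustrative config: 40 layers, 10 KV heads, head dim 128, FP16, 16K window.
gib = kv_cache_bytes(40, 10, 128, 16_384) / 2**30
print(f"{gib:.2f} GiB per sequence at the full 16K window")
```

Because the window is capped at 16K, this term stays in the low single-digit GiB range per sequence instead of ballooning the way a 128K-context model's cache can.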
Inference Speed
Phi-4 runs 2-4x faster than comparable larger models according to Microsoft. On consumer hardware:
| GPU | Quantization | Speed |
|---|---|---|
| RTX 4090 24GB | Q8_0 | 60-80 t/s |
| RTX 3090 24GB | Q4_K_M | 50-70 t/s |
| M2 Mac 16GB | Q4_K_M | 30-40 t/s |
The speed advantage makes Phi-4 viable for real-time applications where latency matters.
Real Benchmarks
| Benchmark | Phi-4 | Llama 3.3 70B | Qwen 2.5 72B |
|---|---|---|---|
| MMLU | 84.8 | 86.0 | 85.3 |
| MATH | 80.4 | 68.0 | 80.0 |
| HumanEval | 82 | 88.4 | 86.6 |
| GPQA | 56.1 | 49.0 | 49.0 |
Phi-4 matches or exceeds 70B models on reasoning benchmarks (MATH, GPQA) while requiring a fraction of the infrastructure. The tradeoff appears on code (HumanEval) where larger models maintain an edge.
For teams evaluating small models, systematic benchmark testing helps identify where Phi-4 excels versus where alternatives perform better.
Deployment
Available through multiple channels:
```bash
# Ollama
ollama run phi4
```

Also available via Azure AI Foundry (managed), HuggingFace (open weights), NVIDIA NGC (containerized), and ONNX Runtime (edge deployment).
The MIT license means no restrictions on any deployment path.
Best For
Reasoning-heavy applications on limited infrastructure. Single-query tasks where multi-turn isn't required. Teams needing MIT licensing. Math tutoring, code review, technical analysis.
Skip It If
Multilingual support is required. Extended conversations are the use case. You need maximum coding capability. Context beyond 16K matters.
Licensing Deep Dive
| License | Commercial | Modifications | Key Restriction |
|---|---|---|---|
| MIT | Yes | Yes | None |
| Apache 2.0 | Yes | Yes | Patent protection, attribution |
| Llama 4 | Yes | Yes | 700M MAU threshold |
| Gemma | Yes | Yes | Acceptable use policy |
| DeepSeek | Yes | Yes | ToS liability clauses |
For maximum freedom: DeepSeek V3.2 (MIT) or Phi-4 (MIT).
For enterprise standard: Qwen 3 or Mistral (Apache 2.0).
For Meta ecosystem: Llama 4 (review MAU thresholds).
Teams with data security requirements should prioritize self-hosting over API access regardless of license.
Hardware Reality Check
What vendors claim vs. what users report:
| Model | Claimed | Actual Minimum | Notes |
|---|---|---|---|
| DeepSeek V3.2 | 8xH100 | 8xH100 | Accurate, quantized reduces to 4x |
| Qwen 3-235B | 4xH100 | 8xH100 | OOM common at claimed minimum |
| Llama 4 Maverick | Single H100 host | 4-8xH100 | "Single host" = 8 GPUs |
| Gemma 3 27B | 1xH100 | 1xH100 | But slower than expected |
| Mistral Small 3.2 | 1xH100 | 2xH100 | Doubled from 3.1 version |
For self-hosted deployment, add 20-30% VRAM headroom to published requirements.
Selection Framework
By Actual Capability (not benchmarks)
Reliable coding: Qwen 3-32B or DeepSeek-Coder V2
Reasoning: DeepSeek V3.2 (accept quirks) or Qwen 3-235B
Multimodal: Gemma 3 27B or Llama 4 Maverick
Long context: Llama 4 Scout (up to 300K practical)
Multilingual: Qwen 3 (119 languages) or Gemma 3 (140+)
By Hardware Budget
Consumer (16-24GB): Qwen 3-8B, Phi-4, Gemma 3 12B
Professional (48-80GB): Qwen 3-32B, Gemma 3 27B
Multi-GPU (4-8xH100): Llama 4 Scout, Qwen 3-235B
Enterprise cluster: DeepSeek V3.2, Mistral Large 3
By Risk Tolerance
Battle-tested: Qwen 3-32B, Mistral Small
High upside, known issues: DeepSeek V3.2, Llama 4 Maverick
Emerging, monitor closely: Mistral Small 3.2, Gemma 3
Deployment Patterns That Work
vLLM (Production Standard)
```bash
# Qwen 3-32B with tensor parallelism
pip install vllm
vllm serve Qwen/Qwen3-32B \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enable-prefix-caching
```
SGLang (High Throughput)
# Better for Qwen 3 thinking mode
python -m sglang.launch_server \
--model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
--tp 8 \
--context-length 65536 \
--reasoning-parser deepseek-r1
Ollama (Development)
```bash
# Quick start - watch for version-specific issues
ollama run qwen3:32b
ollama run gemma3:27b
```
For production observability, add monitoring regardless of deployment method.
What the Benchmarks Actually Mean
MMLU: General knowledge. Scores above 80 indicate broad capability. But doesn't predict task-specific performance.
HumanEval: Python coding. Widely gamed through training data contamination. Take high scores with skepticism.
LiveCodeBench: Recent competition problems. Better signal for actual coding ability since problems are newer.
SWE-Bench: Real GitHub issues. Practical software engineering signal. Most models score under 60%.
MATH/GSM8K: Mathematical reasoning. GSM8K is easier (high school level), MATH is harder (competition level).
Benchmark scores predict capability ranges but not task-specific performance. Evaluation against your specific use case matters more than aggregate scores.
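Evaluating against your own use case can start as simply as exact-match accuracy over a held-out set of production queries. A minimal harness sketch (`generate` is any callable wrapping your deployed model; the scoring here is deliberately naive, and real evals usually need fuzzier matching or a judge model):

```python
from typing import Callable, Iterable, Tuple

def exact_match_accuracy(generate: Callable[[str], str],
                         cases: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) pairs the model answers verbatim."""
    cases = list(cases)
    hits = sum(1 for prompt, expected in cases
               if generate(prompt).strip() == expected.strip())
    return hits / len(cases) if cases else 0.0

# Usage with a stub model; swap in a real client for vLLM/Ollama.
stub = lambda p: "4" if "2+2" in p else "unknown"
print(exact_match_accuracy(stub, [("2+2?", "4"),
                                  ("Capital of France?", "Paris")]))  # -> 0.5
```

Run every candidate model through the same set and compare per-category scores, not just the aggregate.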
FAQ
Which open-source LLM is actually best for coding?
For pure code generation, DeepSeek-Coder V2 remains the specialist choice. For coding plus general capability, Qwen 3-32B offers the best stability. Avoid Llama 4 models for production code unless you've tested extensively on your specific stack.
Can I really run these on consumer hardware?
Qwen 3-8B and Phi-4 run well on 16GB GPUs. Quantized Qwen 3-32B works on 24GB. Anything larger needs professional or datacenter hardware despite marketing claims.
What's the actual cost to self-host?
Entry point: $15-20K for a single RTX 4090 workstation running 8B-32B models. Mid-tier: $150-200K for multi-H100 setups running 100B+ models. Enterprise: $500K+ for frontier model deployment at scale.
Should I wait for better models?
Current generation is production-ready. DeepSeek V3.2 and Qwen 3 match proprietary models on most tasks. The issues are known and workable. Ship now, upgrade later.
How do I evaluate for my use case?
Don't trust benchmarks. Create a test set from your actual production queries. Run 100+ examples through candidate models. Measure what matters for your application. Platforms like Prem Studio provide evaluation tooling for systematic comparison.
MIT vs Apache 2.0 - does it matter?
For most enterprises, no. Both allow commercial use without restrictions. Apache 2.0 adds patent protection, which matters if you're in a patent-heavy industry. MIT is simpler.