12 Best Open-Source LLMs for Production in 2026: Real Benchmarks, Real Problems

Which open-source LLMs actually work in production? Real benchmarks, deployment problems, user complaints, and what to watch for.

Open-source LLMs have caught up on benchmarks. But benchmarks lie.

The real story is what happens when you deploy. DeepSeek V3 scores 90+ on HumanEval but inserts random text into outputs. Llama 4 Maverick claims 1M context but performance degrades past 200K. Gemma 3 27B somehow runs slower than the 70B Llama model on identical hardware.

This guide ranks 12 production-ready open-source LLMs based on real deployment experiences, actual hardware requirements, and the problems you'll hit. We've pulled from GitHub issues, Reddit threads, and HuggingFace discussions to give you the full picture.

The shift toward self-hosted deployment makes sense financially. But choosing wrong costs months of engineering time.

Quick Comparison: What You Actually Get

Model Active Params Real VRAM (FP16) Tokens/Sec* License Reality Check
DeepSeek V3.2 37B 8xH100 minimum 60 t/s MIT Best reasoning, but random text insertions
Qwen 3-235B 22B 8xH100 (470GB quantized) 34 t/s Apache 2.0 Thinking mode adds 2-3x latency
Llama 4 Maverick 17B 800GB+ full weights 50 t/s Llama License Context degrades past 200K
Mistral Large 3 41B 8xH200 or 8xH100 Varies Apache 2.0 Not optimized for vision despite having it
Llama 4 Scout 17B 216GB + 16GB KV 148 t/s Llama License AWS caps at 328K, not 10M
Gemma 3 27B 27B 62GB base + KV 50 t/s Gemma License Slower than Llama-70B on same hardware
Qwen 3-32B 32B 1xA100 80GB 64 t/s Apache 2.0 Most stable mid-size option
DeepSeek-Coder V2 21B 4xH100 65 t/s DeepSeek License Code specialist, limited general use
Mistral Small 3.2 24B 2xH100 (doubled from 3.1) 93 t/s Apache 2.0 VRAM usage doubled from previous version
Gemma 3 12B 12B 24GB minimum 130 t/s Gemma License Context limited to 4K in practice
Qwen 3-8B 8B 16GB 150+ t/s Apache 2.0 Best consumer option
Phi-4 14B 28GB 80 t/s MIT Limited multi-turn capability

*Tokens/second vary by hardware, quantization, and batch size. Numbers from user reports on comparable setups.


1. DeepSeek V3.2

Parameters: 685B total, 37B active (MoE)
License: MIT
Context: 128K tokens
Release: December 2025

DeepSeek V3.2 delivers the best open-source reasoning available. MMLU 88.5, MATH-500 90.2, competitive with GPT-4.5 on most benchmarks. The MIT license means no restrictions on commercial use.

But it has problems.

What the Benchmarks Don't Show

  1. Random text insertions. Users report the model inserting unrelated text mid-response, particularly in longer outputs. The DeepSeek team acknowledged this in V3.1 release notes, calling it a known issue with "hybrid inference modes."
  2. Instruction following degrades. Reddit user Dr_Karminski tested V3.1 extensively: "I asked for only the changed code. It output the entire file. Three times. With different prompts." This pattern appeared across multiple users testing coding tasks.
  3. TypeScript performance is inconsistent. On 16x Eval's coding benchmark, V3.1 scored 1/10 on TypeScript narrowing tasks. The model couldn't identify invalid Tailwind CSS classes like z-60 or z-70. For comparison, Claude Sonnet 4 scored 9/10 on the same tests.
  4. Censorship affects certain topics. Questions involving Taiwan, Tibet, or Tiananmen return answers aligned with Chinese government positions or get refused entirely. For enterprise use cases requiring balanced geopolitical content, this matters.

Deployment Reality

Minimum setup: 8x H100 GPUs for full precision inference.

Full model weights: 700GB. You'll need substantial storage and bandwidth for initial deployment.

Generation speed: 60 tokens/second once running, roughly 3x faster than V2. The Multi-head Latent Attention architecture delivers real efficiency gains.

Server stability: Direct API users report frequent "server busy" errors during peak times. Self-hosting eliminates this but requires the hardware.

Quantization: FP8 weights available, reducing to 4xH100 for inference. Quality loss is minimal for most use cases.

Terms of Service Warning

DeepSeek's terms hold users liable for all inputs and outputs. The language is broader than most: you must ensure legal rights to all submitted data and are responsible if outputs breach any laws. For regulated industries, review these terms with legal counsel.

Real Benchmarks

Benchmark DeepSeek V3.2 GPT-4.5 Claude Opus 4
MMLU 88.5 89.2 86.8
MATH-500 90.2 91.0 78.3
HumanEval 90+ 92 93
LiveCodeBench 64.3 68.1 65.2
SWE-Bench 50.8 52.4 49.1

Best For

Complex reasoning and mathematical tasks where you can work around the random insertion issue. Research applications. Code generation with human review. Teams that need MIT licensing flexibility.

Skip It If

You need reliable TypeScript or frontend development. You're building products where random text insertion would cause failures. You need balanced geopolitical content.

For teams building fine-tuned models, DeepSeek provides the strongest reasoning foundation despite its quirks.


2. Qwen 3-235B-A22B

Parameters: 235B total, 22B active
License: Apache 2.0
Context: 128K tokens (256K with recent updates)
Release: May 2025, updated July 2025

Alibaba's flagship offers something unique: unified thinking and non-thinking modes in one model. Switch between deep reasoning and fast responses without deploying separate models.

The Apache 2.0 license makes it the most permissively licensed frontier-class model available.

Thinking Mode Tradeoffs

The "thinking mode" produces reasoning traces in <think> blocks before final answers. This improves accuracy on math and logic tasks. But it has costs:

Latency increases 2-3x. A response that takes 2 seconds in non-thinking mode takes 5-6 seconds in thinking mode. The model generates extensive reasoning before output.

Token usage explodes. Thinking traces consume tokens. A 500-token answer might require 2,000+ tokens total with thinking enabled. This affects both cost and context limits.

Not all tasks benefit. Simple queries get slower without accuracy gains. The July 2025 update (Instruct-2507) added better budget control, but you still need to tune this per use case.
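
The reasoning trace is easy to separate in post-processing, which also lets you measure the thinking overhead per request. A minimal sketch using plain regex, assuming the standard <think>...</think> delimiters (no Qwen-specific tooling):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(response: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from a thinking-mode response."""
    m = THINK_RE.search(response)
    if not m:
        return "", response.strip()
    return m.group(1).strip(), response[m.end():].strip()

raw = "<think>500 * 4 = 2000, so the total is 2000.</think>The total is 2000."
trace, answer = split_thinking(raw)
print(answer)          # The total is 2000.
print(len(trace) > 0)  # True: the trace is what inflates token usage
```

Comparing len(trace) against len(answer) across your real traffic tells you whether thinking mode's accuracy gain is worth its token cost for your workload.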

Hardware Reality

Full precision: 8xH100 GPUs with tensor parallelism.

Model size: ~470GB for BF16 GGUF weights. Storage planning matters.

OOM errors are common. HuggingFace discussions show users hitting out-of-memory even on H100 nodes. The fix: reduce context length to 32K from the claimed 128K. The model works at full context but requires careful memory management.

Consumer hardware: The smaller Qwen3-30B-A3B runs on a single high-end GPU. Users report ~34 tokens/second on RX 7900 XTX with Q4 quantization.

What Works Well

Support for 119 languages. Not just token coverage from multilingual training data, but real quality across languages. Chinese, Japanese, Arabic, and European languages all perform well.

Tool calling. The model's function calling is more reliable than most open alternatives. The Qwen-Agent framework handles tool parsing well.

MCP integration. Recent updates added Model Context Protocol support, making agent workflows simpler to build.

Integration Complexity

Tool calling works, but setup isn't plug-and-play. The recommended approach uses Qwen-Agent which handles tool templates and parsers. Raw API integration requires careful prompt engineering.

From the technical report: "We recommend using Qwen-Agent to make the best use of agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity."

Translation: the raw model needs wrapper infrastructure.
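
If you do integrate against the raw OpenAI-compatible endpoint that vLLM and SGLang expose, the request uses standard tool-calling JSON. A sketch of the payload shape; the get_weather tool and its fields are illustrative examples, not from Qwen's documentation:

```python
import json

# Illustrative tool schema; the name and fields are examples only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}
body = json.dumps(payload)  # POST this to the server's /v1/chat/completions
print(len(body) > 0)  # True
```

You are then responsible for parsing tool_calls out of the response and feeding results back into messages, which is exactly the plumbing Qwen-Agent hides.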

Deployment

# SGLang deployment (recommended)
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
    --tp 8 \
    --context-length 262144

# If you hit OOM, reduce context:
--context-length 32768

The transformers>=4.51.0 requirement catches people. Older versions throw errors when loading the MoE architecture.

Real Benchmarks

Benchmark Qwen 3-235B (Thinking) DeepSeek R1 Gemini 2.5 Pro
AIME 2024 85.7 79.8 83.2
AIME 2025 81.5 72.6 78.9
LiveCodeBench v5 70.7 65.9 68.4
BFCL v3 (Tools) 70.8 62.1 71.2

Qwen 3 outperforms DeepSeek R1 on 17 of 23 benchmarks while using fewer active parameters.

Best For

Multilingual applications. Agentic workflows with tool calling. Teams that need Apache 2.0 licensing. Mathematical reasoning with controllable thinking budgets.

Skip It If

You need consistent sub-second latency. You're deploying on consumer hardware. You want plug-and-play tool integration without framework overhead.


3. Llama 4 Maverick

Parameters: 400B total, 17B active (128 experts)
License: Llama 4 Community License
Context: 1M tokens
Release: April 2025

Meta's flagship generated massive hype. Native multimodal, 1M context, trained with Behemoth distillation. The benchmarks looked impressive.

Then people deployed it.

The Benchmark Controversy

The AI community immediately questioned Meta's numbers. Independent testing showed significant gaps:

HumanEval discrepancy. Meta claimed scores competitive with GPT-4o. Independent tests from LM Arena showed 62% accuracy vs. Gemma 3 27B's 74%. Reddit user Dr_Karminski summarized it: "They completely surpassed my expectations... in a negative direction."

Coding performance. Users reported 18% more Python syntax errors compared to DeepSeek R1 in controlled tests. The model struggles with complex multi-file code generation.

Real document analysis. Reddit user Holvagyok tested legal document processing: the model missed key clauses, produced incorrect summaries, and performed worse than smaller models on domain-specific tasks.

Context Window Reality

Maverick claims 1M token context. Reality is more complicated.

Performance degrades past 200K tokens. Needle-in-a-haystack testing at the full 1M tokens shows 92% factual recall. But users report synthesis tasks (like analyzing contracts or comparing documents) degrade significantly at long contexts.

Providers cap lower. AWS Bedrock limits to 328K tokens. Still 2.5x higher than Gemini 2.5 Pro's 128K, but well below the marketed 1M.

Memory explodes at scale. Full weights require 800GB+ storage. KV cache for long contexts adds substantial VRAM.

Security Assessment

ProtectAI ran vulnerability scans on both Llama 4 models:

Risk score: 52-58 (medium risk)
Successful attacks: ~490 across both models
Llama Guard 4 bypass rate: 33.8%

One-third of harmful prompts bypassed Meta's safety guardrails. For enterprise deployment, additional safety infrastructure is required.

Hardware Requirements

Setup Memory Notes
Full weights (FP16) 800GB+ Multi-node required
8-bit quantized 400GB 8xH100
4-bit quantized 200GB 4xH100
INT4 with KV compression 100GB 2xH100 possible

Inference speed: ~50 tokens/second on RTX 5090 with Q4 quantization. For comparison, Gemma 2 27B hits 76 tokens/second on the same hardware.
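
Back-of-envelope KV-cache arithmetic shows why memory explodes at long context. The formula is standard (two tensors per layer, per KV head, per token); the layer and head counts below are illustrative placeholders, not Maverick's published configuration:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV cache size: K and V tensors per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

# Illustrative config: 48 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
for ctx in (8_000, 200_000, 1_000_000):
    print(f"{ctx:>9} tokens -> {kv_cache_gb(ctx, 48, 8, 128):7.1f} GB")
```

Under these illustrative numbers, 1M tokens of KV cache alone approaches 200GB before counting model weights, which is why providers cap context well below the marketed figure.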

Licensing Terms

The Llama 4 Community License allows commercial use with restrictions:

  • Products exceeding 700M monthly active users need separate licensing
  • Acceptable use policy prohibits certain applications
  • Redistribution requires maintaining attribution

For most enterprises, these terms work. But review the acceptable use policy before deployment in sensitive domains.

What Actually Works

Multimodal processing. Image understanding is solid. The early fusion architecture handles mixed text/image inputs well.

Document QA. For shorter documents (under 50K tokens), retrieval and summarization work reliably.

API availability. Maverick is available through Fireworks, Together AI, and major cloud providers. If you don't want to self-host, options exist.

Real Benchmarks

Benchmark Llama 4 Maverick GPT-4o Gemini 2.0 Flash
MMLU-Pro 80.5 80.0 79.8
DocVQA 94.4 92.3 94.1
MMMU 73.4 69.1 70.2
HumanEval* 62-82 91 85

*HumanEval scores vary significantly between Meta's claims (82.4) and independent testing (~62).

Best For

Document understanding tasks. Multimodal applications. RAG systems where you need extended but not extreme context. Teams using managed API services rather than self-hosting.

Skip It If

You need reliable coding generation. You're processing legal or financial documents requiring high accuracy. You need full 1M context synthesis. You're security-conscious about prompt injection.

For RAG implementations, test carefully before committing to Maverick's context window claims.


4. Mistral Large 3

Parameters: 675B total, ~41B active
License: Apache 2.0
Context: 256K tokens
Release: December 2025

Mistral's European flagship offers Apache 2.0 licensing at frontier scale. For GDPR-conscious enterprises, the French development and data sovereignty options matter.

But "general-purpose multimodal" has caveats.

Vision Isn't the Strength

The model has vision capabilities. It processes images. But it's not optimized for vision tasks:

From Mistral's own documentation: "It is not a dedicated reasoning model and is not optimized for vision tasks, so it may not be the best option for reasoning use cases or multimodal tasks that require a lot of vision capability."

If your use case is primarily vision, look elsewhere. Gemma 3 or dedicated vision models will outperform.

Deployment Complexity

Minimum viable setup: 8xH200 or 8xH100 nodes.

FP8 recommended: Mistral suggests FP8 precision deployment, which enables single-node inference. NVFP4 (4-bit) reduces further but with quality tradeoffs.

vLLM configuration matters. The recommended deployment uses specific Mistral tokenizer and config modes:

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8

Getting these flags wrong causes tokenization errors or performance degradation.

The Mistral Small Version Problem

If you're considering the smaller Mistral models, be aware of version issues:

Mistral Small 3.1 vs 3.2: The jump from 3.1 to 3.2 nearly doubled VRAM requirements. Users running 3.1 on a single H100 found 3.2 needs 2xH100. Same parameter count, double the memory.

Architecture change: Ollama's 3.1 model uses the "mistral3" architecture which is significantly slower than the "llama" architecture used in earlier versions. Users report the same model running at different speeds depending on source:

Source Context Memory Speed
Ollama mistral-small3.1 32k 36GB Slower
HuggingFace bartowski GGUF 32k 21GB Faster

Same model, different packaging, different performance.

What Works

Multilingual: 40+ languages with strong performance on non-English. European languages particularly well-tuned.

Function calling: Native tool use support with JSON structured output.

Document processing: The 256K context handles long documents well for enterprise knowledge bases.

Real Benchmarks

Benchmark Mistral Large 3 GPT-4o Claude Sonnet 4
MMLU (8-lang) 85.5 87.2 86.8
HumanEval 92 91 93
MMLU-Pro 73.1 74.8 73.6

Best For

European enterprises needing Apache 2.0 licensing and data sovereignty. Multilingual document processing. General enterprise AI where vision isn't primary.

Skip It If

Vision tasks are core to your application. You need reasoning-specialist capability. You're running on consumer or professional GPU hardware (requires H100-class minimum).


5. Llama 4 Scout

Parameters: 109B total, 17B active (16 experts)
License: Llama 4 Community License
Context: 10M tokens (claimed)
Release: April 2025

Scout is Llama 4's efficiency-focused option. Single H100 deployment with the longest context window in open-source.

The 10M context claim needs asterisks.

Context Window Reality

Provider limits: AWS caps at 328K tokens. Still long, but 3% of claimed capacity.

Memory requirements: Full 10M context would require extraordinary KV cache. Blockwise sparse attention reduces this, but practical limits exist.

Performance at scale: Users report 92% factual recall on needle-in-a-haystack testing at 1M tokens. But synthesis tasks requiring cross-document reasoning degrade significantly at long contexts.

For most use cases, treat 200-300K as the practical ceiling with good performance.
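
Before committing to any claimed ceiling, run your own needle test on your own document mix. A minimal harness sketch that buries a known fact at a chosen depth; the filler text and fact are placeholders for your own data:

```python
def build_needle_prompt(needle: str, filler: str, total_chars: int,
                        depth: float) -> str:
    """Place `needle` at `depth` (0.0 = start, 1.0 = end) inside filler text."""
    pad = (filler * (total_chars // len(filler) + 1))[:total_chars]
    cut = int(total_chars * depth)
    return pad[:cut] + "\n" + needle + "\n" + pad[cut:]

needle = "The access code is 7421."
prompt = build_needle_prompt(needle, "Lorem ipsum dolor sit amet. ",
                             total_chars=50_000, depth=0.75)
print(needle in prompt)  # True
# Ask "What is the access code?" at several depths and context lengths,
# and score recall, before trusting any marketed ceiling.
```

Sweep depth across 0.0-1.0 and context length up to your target; recall that holds at retrieval but fails on cross-document synthesis matches the degradation users report.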

Hardware Efficiency

This is where Scout shines:

Quantization Memory Configuration
Full weights 216GB + 16GB KV 4xH100
8-bit 109GB + 8GB KV 2xH100
4-bit 54.5GB + 8GB KV 1xH100
2-bit 27.3GB + 8GB KV 1xA100

Inference speed: 148 tokens/second at 4-bit on single H100, roughly 1.7x faster than Llama 3 at similar sizes.

The Download Problem

Scout logged 18,000 HuggingFace downloads in its first 48 hours, a slower start than Llama 3 managed at launch. The muted adoption suggests community hesitation after the benchmark controversies.

Fine-tuning Viable

With LoRA adapters under 20GB VRAM, Scout becomes accessible for domain-specific fine-tuning without massive infrastructure.
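
The arithmetic behind that figure: a LoRA adapter adds two low-rank factors per adapted matrix, so trainable parameters come to rank × (d_in + d_out) per matrix. The dimensions below are illustrative, not Scout's published config:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable params added by one adapted matrix: A (r x d_in) + B (d_out x r)."""
    return rank * (d_in + d_out)

# Illustrative: rank-16 adapters on 4 projection matrices per layer,
# hidden size 5120, 48 layers. Not Scout's actual configuration.
hidden, rank, layers, mats = 5120, 16, 48, 4
total = layers * mats * lora_params(hidden, hidden, rank)
print(f"{total:,} trainable params, ~{total * 2 / 1e6:.0f} MB in FP16")
```

The adapter weights themselves are tiny (tens of MB); the ~20GB budget is dominated by the frozen base weights, optimizer state, and activations.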

Real Benchmarks

Benchmark Llama 4 Scout Gemma 3 27B Mistral 3.1 24B
MMLU-Pro 74.3 67.5 73+
HumanEval 74.1 89 92.9
MATH 50.3 58+ 55+

Scout underperforms smaller models on coding benchmarks. Gemma 3 27B beats it on HumanEval despite being significantly smaller.

Best For

Long-document RAG systems. Code repository analysis (reading, not writing). Environments where single-GPU deployment matters. Fine-tuning projects needing efficiency.

Skip It If

Coding quality is critical. You actually need 10M context (current infrastructure doesn't support it). You need frontier-class capability in a smaller package.


6. Gemma 3 27B

Parameters: 27B (dense)
License: Gemma Terms of Use
Context: 128K tokens
Release: March 2025

Google's dense model offers multimodal processing in a relatively compact package. 140+ language support. Runs on professional GPUs.

But something's wrong with the performance.

The Speed Problem

Multiple users report Gemma 3 27B running slower than larger models on identical hardware:

From HuggingFace discussions: "I can echo I have the issue: with the same 2-A100-80G GPUs, Gemma3-27B is slower than the Llama-70B in my tests, which is very strange."

Ollama benchmarks on RTX 5090:

  • Gemma 3 27B: 50 tokens/second
  • Gemma 2 27B: 76 tokens/second
  • Qwen 2.5 32B: 64 tokens/second

Same model family, newer version, 34% slower. The architectural changes for multimodal support appear to have performance costs.

VRAM Issues

The KV cache behavior is unusual:

From GitHub issue #9678: "When using Gemma 3 27B with a context length of 20,000 (20k), I run out of VRAM on a 4090. However, when using Qwen2.5 32B IQ4XS, which is basically the same size as Gemma 3 27B Q4KM, with a full 32K context, I still have 2 GB of VRAM left."

Gemma 3 uses significantly more memory per token of context than comparable models.

Context Limitations in Practice

OOM crashes reported. Earlier Ollama versions (0.6.0) could run Gemma 3 12B at 8K context and 20-25 tokens/second. Newer versions crash systems at 8K, limiting users to 4K context.

Training vs. inference: Full training requires 500GB+ VRAM. One H100 is not enough. Single-GPU deployment works for inference only.

What Works

Multimodal quality. Image understanding is good. The native vision encoder avoids the bolted-on feel of some competitors.

Multilingual breadth. 140+ languages with reasonable quality across them.

Integration options. Supported by Hugging Face, vLLM, TGI, Ollama, and most major frameworks.

Hardware Requirements

Setup VRAM Notes
Full weights 62GB Single H100
QAT INT4 ~14GB RTX 3090/4090
With KV cache (32K) 20GB+ Limited context on consumer GPUs

Expect 20-30 tokens/second on an RTX 3090 handling around 300 requests per day. Acceptable for many use cases.
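
Quick capacity math for that workload; the request count and token sizes are assumptions for illustration:

```python
def gpu_seconds_per_day(requests: int, avg_output_tokens: int,
                        tokens_per_sec: float) -> float:
    """Total single-stream generation time per day."""
    return requests * avg_output_tokens / tokens_per_sec

# Assumed workload: 300 requests/day, 400 output tokens each, 25 t/s.
busy = gpu_seconds_per_day(requests=300, avg_output_tokens=400,
                           tokens_per_sec=25)
print(round(busy / 3600, 2), "GPU-hours/day")  # 1.33 GPU-hours/day
```

At that volume the GPU sits mostly idle; per-request latency (about 16 seconds for 400 tokens at 25 t/s) is the real constraint, not throughput.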

Real Benchmarks

Benchmark Gemma 3 27B Qwen 3-32B Llama 4 Scout
HumanEval 89 85+ 74.1
MMLU ~78 83.3 79.6
Arena Elo 1339 1340+ ~1300

Strong HumanEval performance, but the speed penalty means it generates code slower than alternatives.

Best For

Multimodal tasks where image understanding matters. Consumer GPU deployment with quantization. Teams already in Google's ecosystem.

Skip It If

You need maximum inference speed. Long context (32K+) is required. You're comparing against larger models and expect size-based speed advantages.


7. Qwen 3-32B

Parameters: 32B (dense)
License: Apache 2.0
Context: 128K tokens
Release: May 2025

The most stable mid-size option. Dense architecture avoids MoE routing complexity. Apache 2.0 licensing. Reasonable hardware requirements.

Why It Works

Predictable behavior. Dense models have simpler failure modes than MoE. Debugging is easier. Performance is more consistent across inputs.

Hardware fit. Single A100 80GB handles full precision. Single RTX 4090 handles 4-bit quantization. The sweet spot for professional GPU deployments.

Community adoption. Extensive quantization options available. Most inference frameworks support it without special configuration.

Performance

On LM Arena, Qwen 3-32B scores comparably to models 2-3x its size when properly tuned. The thinking mode (if enabled) pushes it higher on reasoning tasks.

Benchmark Qwen 3-32B Gemma 3 27B Llama 3.1 70B
MMLU 83.3 ~78 86.0
HumanEval 85+ 89 80.5

Deployment

# Ollama (simplest)
ollama run qwen3:32b

# vLLM for production
vllm serve Qwen/Qwen3-32B \
    --tensor-parallel-size 2 \
    --max-model-len 32768

Best For

Teams wanting a reliable mid-size model. Apache 2.0 license requirements. Single-GPU deployment targets. Predictable, debuggable behavior.


8. DeepSeek-Coder V2

Parameters: 236B total, 21B active (MoE)
License: DeepSeek License
Context: 128K tokens
Release: June 2024, updated July 2025

DeepSeek-Coder V2 is the specialist. While general models handle code adequately, this one was built from the ground up for programming. Trained on 6 trillion tokens with 338 programming languages supported.

The tradeoff is clear: exceptional at code, mediocre at everything else.

What Makes It Different

The model was pre-trained from an intermediate DeepSeek-V2 checkpoint, then continued training on code-heavy data. This creates a model that thinks in code patterns rather than adapting general language understanding to programming tasks.

Language coverage: 338 programming languages, up from 86 in DeepSeek-Coder V1. This includes obscure languages that general models struggle with.

Fill-in-the-Middle (FIM): Native support for code completion in the middle of files, not just at the end. Critical for IDE integration.
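
A FIM request is just a specially delimited prompt. The sentinel tokens below are the ones documented for the DeepSeek-Coder family; verify them against the tokenizer of your exact checkpoint before relying on this sketch:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Build a fill-in-the-middle prompt; the model generates the hole."""
    return f"<｜fim▁begin｜>{prefix}<｜fim▁hole｜>{suffix}<｜fim▁end｜>"

prompt = fim_prompt(
    prefix="def area(r):\n    return ",
    suffix="  # circle area\n",
)
print("<｜fim▁hole｜>" in prompt)  # True
```

An IDE plugin sends the text before and after the cursor as prefix/suffix; the model's completion slots into the hole.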

The Limitations

General tasks suffer. Ask DeepSeek-Coder to write marketing copy or analyze a business document, and quality drops noticeably. The model was optimized for code at the expense of general capability.

High-performance computing gaps. Research from Nader et al. found that on HPC tasks like matrix multiplication and DGEMM benchmarks, DeepSeek-generated code lagged behind GPT-4 in scalability and execution efficiency. Manual optimization was often required.

Inference speed penalty. Despite the MoE architecture activating only 21B parameters, the 236B total still requires significant infrastructure. Users report slower inference than equivalently-sized dense models in some configurations.

Hardware Requirements

Setup Memory Notes
Full BF16 8x 80GB GPUs Production deployment
Lite variant (16B) 2x 80GB GPUs Reduced capability
Quantized 4x H100 Acceptable for most use cases

The Lite variant (16B total, 2.4B active) runs on smaller setups but with reduced capability. For serious code generation, the full model is worth the infrastructure.

Real Benchmarks

Benchmark DeepSeek-Coder V2 GPT-4 Turbo Claude 3 Opus
HumanEval 90+ 87 85
MBPP 89+ 86 84
LiveCodeBench 43.4 45.2 42.8

Strong on standard benchmarks, but LiveCodeBench (newer problems) shows the gap narrows against proprietary models.

Deployment

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# trust_remote_code is required: DeepSeek ships custom modeling code
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Instruct",
    trust_remote_code=True
)
# bfloat16 halves memory vs. float32; the full model still spans multiple GPUs
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Instruct",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).cuda()

Note the trust_remote_code=True requirement. DeepSeek models use custom architecture code that must be executed.

For fine-tuning on code-specific tasks, DeepSeek-Coder V2 provides a strong foundation despite the infrastructure requirements.

Best For

Pure code generation workflows. IDE integration for code completion. Teams that need a dedicated coding model separate from their general assistant. Legacy code migration projects where language coverage matters.

Skip It If

You need a general-purpose assistant that also codes well. Your use case mixes code with documentation, communication, or analysis. You're constrained to single-GPU deployment.


9. Mistral Small 3.2

Parameters: 24B (dense)
License: Apache 2.0
Context: 128K tokens
Release: June 2025

Mistral Small 3.2 brings vision capabilities to the small model tier. Apache 2.0 licensing makes it attractive for commercial deployment. But the upgrade from 3.1 came with hidden costs.

The Version Upgrade Problem

Mistral Small 3.1 fit comfortably on a single H100 with room to spare. Then 3.2 arrived.

From HuggingFace discussion (June 2025): "Trying to run this model essentially entails doubling our infrastructure. Small 3.1 fit easily on a single H100 with plenty of headroom. With 3.2 we need to use 2xH100 because VRAM itself is >55GB and then the KV cache and map puts it past the 80GB of a single H100."

Same parameter count. Same model name. Double the memory.

Architecture Change Fallout

The 3.1 to 3.2 transition also changed the underlying architecture. Ollama's implementation reveals the issue:

From GitHub issue #10553: Users report Ollama's "mistral3" architecture is significantly slower than the "llama" architecture used by other packagings. The same model downloaded from different sources (Ollama vs. HuggingFace bartowski GGUF) shows different performance:

Source Context Memory Relative Speed
Ollama mistral-small3.1 Q4_K_M 32K 36GB Slower
HuggingFace bartowski GGUF Q4_K_M 32K 21GB Faster

The HuggingFace version using "llama" architecture performs closer to 3.1 expectations. Same weights, different packaging, different results.

Vision Performance Reality

Mistral Small 3.2 includes vision capabilities, but don't expect flagship performance.

From GitHub issue #10393: "Experiments with the Q4_K_M quant currently available are only ~3 tokens/second, putting the laptop under full load for more than a minute."

On laptop RTX 4090 (16GB VRAM variant), vision inference at Q4 quantization delivers 3 tokens/second. That's a minute of GPU saturation for a single image description. The IQ3_XXS quantization improves to 20+ tokens/second but with quality tradeoffs.

VRAM Anomalies

Users report the 3.1 model using double the expected VRAM with default settings:

From GitHub issue #10177: "Loading mistral-small3.1 24b in Q4 takes double the amount of VRAM it should use with default 4096 context."

This appears related to the architecture change and affects Ollama deployments specifically.

Hardware Requirements

Setup Memory Notes
Full precision 55GB+ Single H100 insufficient
Recommended 2x H100 With KV cache headroom
Q4 quantized 24-36GB Varies by framework

What Works

Multilingual. Strong performance across 40+ languages, typical of Mistral models.

Apache 2.0. Maximum licensing flexibility for commercial use.

Function calling. Native tool use support for agent applications.

Real Benchmarks

Benchmark Mistral Small 3.2 Gemma 3 27B Qwen 3-32B
HumanEval 92.9 89 85+
MMLU 73+ ~78 83.3

Strong coding performance, but general knowledge (MMLU) lags behind similarly-sized competitors.

Best For

Teams already committed to Mistral infrastructure. Apache 2.0 license requirements. Multilingual European deployments. Organizations needing models that can be updated over time.

Skip It If

You're upgrading from 3.1 and expect the same infrastructure to work. Vision is a primary use case. You need single-GPU deployment.


10. Gemma 3 12B

Parameters: 12B (dense)
License: Gemma Terms of Use
Context: 128K tokens (claimed)
Release: March 2025

Google positioned Gemma 3 12B for edge deployment. Mobile devices, laptops, resource-constrained servers. The model fits where larger options can't.

But edge deployment revealed edge cases.

Context Window Problems

The 128K context claim runs into practical limits quickly.

From community reports: Earlier Ollama versions (0.6.0) ran Gemma 3 12B at 8K context with 20-25 tokens/second. Newer versions crash systems attempting the same configuration, forcing users down to 4K context.

This appears to be a KV cache management issue. The model's attention mechanism consumes more memory per token than comparable models, and framework updates haven't kept pace with the architecture's demands.

Speed vs. Size Expectations

You'd expect a 12B model to run faster than a 27B model. With Gemma 3, that assumption breaks:

From user reports: The Gemma 3 family shows unexpected speed regressions compared to Gemma 2. On identical hardware, Gemma 3 runs ~30% slower than its predecessor despite architectural improvements claimed to increase efficiency.

The culprit appears to be the multimodal architecture. Even when processing text-only, the vision encoder overhead affects performance.

Consumer Hardware Reality

GPU Quantization VRAM Context Speed
RTX 3090 24GB Q4_K_M ~14GB 4K 25-30 t/s
RTX 4090 24GB Q4_K_M ~14GB 8K 35-40 t/s
RTX 4060 8GB Q4_K_M ~8GB 2K 15-20 t/s

The Q4_K_M quantization makes consumer deployment viable, but context limitations hurt usability for document processing or long conversations.

Framework Sensitivity

Gemma 3 12B performance varies significantly between deployment frameworks:

vLLM: Generally stable but requires specific version matching.

Ollama: Context issues reported across versions. Check release notes before deploying.

llama.cpp: Works but lacks some optimizations available for Llama-architecture models.

Test on your exact production framework and version before committing.

What Works

Multimodal in a small package. Image understanding without requiring datacenter hardware.

Multilingual. 140+ languages with reasonable quality.

Integration. Supported by Hugging Face, JAX, PyTorch, TensorFlow via Keras 3.0, vLLM, and Ollama.

Real Benchmarks

Benchmark Gemma 3 12B Qwen 3-8B Phi-4 14B
MMLU ~70 74+ 84.8
HumanEval 75+ 80+ 82

Trails both Qwen 3-8B (smaller) and Phi-4 (similar size) on most benchmarks. The multimodal capability is the differentiator.

Best For

Multimodal applications on consumer hardware. Google ecosystem integration. Quick prototyping where vision understanding matters.

Skip It If

Text-only workloads where Qwen 3-8B or Phi-4 would suffice. You need 8K+ context reliably. Maximum inference speed matters.


11. Qwen 3-8B

Parameters: 8.2B (dense)
License: Apache 2.0
Context: 128K tokens (32K native, YaRN extended)
Release: May 2025

Qwen 3-8B hits the sweet spot for consumer deployment. Runs on a 16GB GPU. Apache 2.0 licensed. Thinking mode available when you need it.

This is the model most developers should start with.

Consumer Hardware Performance

GPU Quantization VRAM Used Speed Notes
RTX 4090 24GB Q4_K_M ~6GB 40+ t/s Plenty of headroom
RTX 3080 10GB Q4_K_M ~6GB 30+ t/s Comfortable fit
RTX 4060 8GB Q4_K_M ~5.5GB 25+ t/s Sweet spot for 8GB cards
M2 Mac 16GB Q4_K_M ~6GB 20+ t/s Unified memory works well

The ~6GB footprint at Q4_K_M leaves headroom for KV cache even on 8GB GPUs. Most users report smooth operation at 2K-4K context without memory pressure.

Thinking Mode Tradeoffs

Like the larger Qwen 3 models, the 8B variant supports thinking/non-thinking modes. The behavior differs:

Thinking mode: Adds <think>...</think> blocks before responses. Improves accuracy on math, logic, and coding. Increases latency 2-3x. Consumes more tokens.

Non-thinking mode: Direct responses without reasoning traces. Faster. Suitable for conversational use.

Critical setting: Don't use greedy decoding with thinking mode. Qwen documentation explicitly warns this causes "performance degradation and endless repetitions." Use temperature 0.7, top_p 0.8, top_k 20 as recommended.
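When consuming thinking-mode output programmatically, the <think>...</think> block usually needs to be separated from the final answer. A minimal sketch, where the helper name is an assumption (vLLM and SGLang can also strip traces server-side via a reasoning parser):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Separate the reasoning trace from the final answer."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    trace = m.group(1).strip() if m else ""
    answer = THINK_RE.sub("", text).strip()
    return trace, answer

raw = "<think>2+2 is basic arithmetic.</think>\nThe answer is 4."
trace, answer = split_thinking(raw)
# trace holds the reasoning text; answer holds only the user-facing reply
```

Keeping the trace out of chat history also avoids paying its token cost on every subsequent turn.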

Looping Issues

Some users report the model getting stuck in output loops, particularly at low context lengths:

From Unsloth documentation: "If you're experiencing any looping, Ollama might have set your context length window to 2,048 or so. If this is the case, bump it up to 32,000 and see if the issue still persists."

The fix is usually increasing context length, even if you don't need the full window.
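That fix can be baked into an Ollama Modelfile so the larger window (and the recommended sampling parameters from above) persist across sessions. A minimal sketch; the custom model name below is illustrative:

```
FROM qwen3:8b
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER top_k 20
```

Build it once with `ollama create qwen3-8b-32k -f Modelfile`, then run `ollama run qwen3-8b-32k` instead of the base tag.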

100+ Language Support

Qwen 3-8B inherits the family's multilingual strength. Over 100 languages and dialects with reasonable quality. Strong on CJK languages (Chinese, Japanese, Korean) given Alibaba's development focus.

Deployment

# Ollama (simplest)
ollama run qwen3:8b

# Ollama's run command doesn't take sampling flags; set the recommended
# parameters inside the interactive session:
# >>> /set parameter temperature 0.7
# >>> /set parameter top_p 0.8
# >>> /set parameter top_k 20

For production, vLLM or SGLang provide better throughput:

vllm serve Qwen/Qwen3-8B \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9

Real Benchmarks

| Benchmark | Qwen 3-8B | Llama 3.1 8B | Gemma 3 12B |
|---|---|---|---|
| MMLU | 74+ | 73.0 | ~70 |
| HumanEval | 80+ | 72.6 | 75+ |
| MATH | 65+ | 51.9 | 55+ |

Outperforms Llama 3.1 8B across the board. Beats the larger Gemma 3 12B on most metrics. The thinking mode pushes scores higher on reasoning tasks.

Best For

First local LLM deployment. Consumer hardware with 8-16GB VRAM. Multilingual applications. Development and prototyping before scaling to larger models.

Skip It If

You need maximum capability and have the hardware for larger models. Vision/multimodal is required. You're deploying at massive scale where the 32B or 235B variants' efficiency gains matter.

For teams building small model strategies, Qwen 3-8B provides the best starting point.


12. Phi-4

Parameters: 14B (dense)
License: MIT
Context: 16K tokens
Release: December 2024, reasoning variants April 2025

Microsoft's Phi-4 demonstrates what's possible with careful data curation over raw scale. 14 billion parameters matching models 5x larger on reasoning benchmarks.

The MIT license removes all commercial restrictions.

Training Philosophy

Phi-4 diverges from the "more data, bigger model" approach. Microsoft used 400B+ tokens of synthetic, curriculum-style data designed to teach structured reasoning. The result: a model that punches above its weight class on math and logic.

Training infrastructure: 1,920 H100 GPUs over 21 days processing 9.8 trillion tokens. Significant investment for a "small" model.

Known Limitations

Microsoft's own documentation is unusually direct about Phi-4's constraints:

English only. "Phi-4 is not intended to support multilingual use." Languages other than English, especially non-standard American English varieties, perform worse.

Code scope limited. "Majority of Phi-4 training data is based in Python and uses common packages such as typing, math, random, collections, datetime, itertools." Other languages and packages require manual verification.

Multi-turn degradation. Users report quality drops in extended conversations. Phi-4 works better for single-query reasoning than ongoing dialogue.

Bias concerns. Like all models trained on public data, Phi-4 can perpetuate stereotypes or produce inappropriate content. Microsoft recommends additional safety measures for deployment.

The Phi-4 Family

Microsoft expanded Phi-4 into multiple variants:

| Variant | Parameters | Focus | Release |
|---|---|---|---|
| Phi-4 | 14B | General reasoning | Dec 2024 |
| Phi-4-mini | 3.8B | Speed, 128K context | Feb 2025 |
| Phi-4-multimodal | 5.6B | Vision + speech + text | Feb 2025 |
| Phi-4-reasoning | 14B | Enhanced reasoning | Apr 2025 |
| Phi-4-reasoning-plus | 14B | Maximum reasoning | Apr 2025 |

The reasoning variants were trained using distillation from DeepSeek R1, reportedly on ~1 million synthetic math problems.

Hardware Requirements

Phi-4's dense architecture makes hardware planning straightforward:

| Setup | VRAM | Notes |
|---|---|---|
| Full FP16 | ~28GB | Single A100 or H100 |
| BF16 | ~28GB | Preferred format |
| Q4_K_M | ~8-10GB | Consumer GPU viable |
| Q8_0 | ~16GB | High quality, RTX 4090 |

The 16K context limitation keeps memory requirements predictable. No surprise OOM errors from context expansion.
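That predictability follows from the KV cache growing linearly with context length. A rough calculator; the layer and head numbers below are illustrative assumptions, not Phi-4's published architecture:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_val: int = 2) -> float:
    """KV cache size for one sequence: 2x (keys and values) per layer.

    bytes_per_val=2 assumes an FP16/BF16 cache.
    """
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_val / 1e9

# Illustrative mid-size config: 40 layers, 10 KV heads, head_dim 128,
# at the full 16K window. A hard 16K cap keeps this bounded by design.
print(round(kv_cache_gb(40, 10, 128, 16_384), 2))
```

With models advertising 128K+ windows, the same formula is 8x larger at full context, which is where surprise OOMs come from.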

Inference Speed

According to Microsoft, Phi-4 runs 2-4x faster than comparable larger models. On consumer hardware:

| GPU | Quantization | Speed |
|---|---|---|
| RTX 4090 24GB | Q8_0 | 60-80 t/s |
| RTX 3090 24GB | Q4_K_M | 50-70 t/s |
| M2 Mac 16GB | Q4_K_M | 30-40 t/s |

The speed advantage makes Phi-4 viable for real-time applications where latency matters.

Real Benchmarks

| Benchmark | Phi-4 | Llama 3.3 70B | Qwen 2.5 72B |
|---|---|---|---|
| MMLU | 84.8 | 86.0 | 85.3 |
| MATH | 80.4 | 68.0 | 80.0 |
| HumanEval | 82 | 88.4 | 86.6 |
| GPQA | 56.1 | 49.0 | 49.0 |

Phi-4 matches or exceeds 70B models on reasoning benchmarks (MATH, GPQA) while requiring a fraction of the infrastructure. The tradeoff appears on code (HumanEval) where larger models maintain an edge.

For teams evaluating small models, systematic benchmark testing helps identify where Phi-4 excels versus where alternatives perform better.

Deployment

Available through multiple channels:

# Ollama
ollama run phi4

# Azure AI Foundry (managed)
# HuggingFace (open weights)
# NVIDIA NGC (containerized)
# ONNX Runtime (edge deployment)

The MIT license means no restrictions on any deployment path.

Best For

Reasoning-heavy applications on limited infrastructure. Single-query tasks where multi-turn isn't required. Teams needing MIT licensing. Math tutoring, code review, technical analysis.

Skip It If

Multilingual support is required. Extended conversations are the use case. You need maximum coding capability. Context beyond 16K matters.


Licensing Deep Dive

| License | Commercial | Modifications | Key Restriction |
|---|---|---|---|
| MIT | Yes | Yes | None |
| Apache 2.0 | Yes | Yes | Patent protection, attribution |
| Llama 4 | Yes | Yes | 700M MAU threshold |
| Gemma | Yes | Yes | Acceptable use policy |
| DeepSeek | Yes | Yes | ToS liability clauses |

For maximum freedom: DeepSeek V3.2 (MIT) or Phi-4 (MIT).

For enterprise standard: Qwen 3 or Mistral (Apache 2.0).

For Meta ecosystem: Llama 4 (review MAU thresholds).

Teams with data security requirements should prioritize self-hosting over API access regardless of license.


Hardware Reality Check

What vendors claim vs. what users report:

| Model | Claimed | Actual Minimum | Notes |
|---|---|---|---|
| DeepSeek V3.2 | 8xH100 | 8xH100 | Accurate, quantized reduces to 4x |
| Qwen 3-235B | 4xH100 | 8xH100 | OOM common at claimed minimum |
| Llama 4 Maverick | Single H100 host | 4-8xH100 | "Single host" = 8 GPUs |
| Gemma 3 27B | 1xH100 | 1xH100 | But slower than expected |
| Mistral Small 3.2 | 1xH100 | 2xH100 | Doubled from 3.1 version |

For self-hosted deployment, add 20-30% VRAM headroom to published requirements.
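The headroom rule is simple enough to encode directly when sizing hardware; the 25% default and the Gemma example below are illustrative:

```python
def provisioned_vram(published_gb: float, headroom: float = 0.25) -> float:
    """Published VRAM requirement plus 20-30% headroom (25% by default)."""
    return published_gb * (1 + headroom)

# Gemma 3 27B: 62 GB published -> provision ~77.5 GB,
# i.e. plan for an 80 GB card, not a 64 GB one.
print(round(provisioned_vram(62), 1))
```

The headroom absorbs KV cache growth, CUDA graph buffers, and framework overhead that published minimums routinely omit.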


Selection Framework

By Actual Capability (not benchmarks)

Reliable coding: Qwen 3-32B or DeepSeek-Coder V2
Reasoning: DeepSeek V3.2 (accept quirks) or Qwen 3-235B
Multimodal: Gemma 3 27B or Llama 4 Maverick
Long context: Llama 4 Scout (up to 300K practical)
Multilingual: Qwen 3 (119 languages) or Gemma 3 (140+)

By Hardware Budget

Consumer (16-24GB): Qwen 3-8B, Phi-4, Gemma 3 12B
Professional (48-80GB): Qwen 3-32B, Gemma 3 27B
Multi-GPU (4-8xH100): Llama 4 Scout, Qwen 3-235B
Enterprise cluster: DeepSeek V3.2, Mistral Large 3

By Risk Tolerance

Battle-tested: Qwen 3-32B, Mistral Small
High upside, known issues: DeepSeek V3.2, Llama 4 Maverick
Emerging, monitor closely: Mistral Small 3.2, Gemma 3


Deployment Patterns That Work

vLLM (Production Standard)

# Qwen 3-32B with tensor parallelism
pip install vllm
vllm serve Qwen/Qwen3-32B \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --enable-prefix-caching

SGLang (High Throughput)

# Better for Qwen 3 thinking mode
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
    --tp 8 \
    --context-length 65536 \
    --reasoning-parser deepseek-r1

Ollama (Development)

# Quick start - watch for version-specific issues
ollama run qwen3:32b
ollama run gemma3:27b

For production observability, add monitoring regardless of deployment method.


What the Benchmarks Actually Mean

MMLU: General knowledge. Scores above 80 indicate broad capability. But doesn't predict task-specific performance.

HumanEval: Python coding. Widely gamed through training data contamination. Take high scores with skepticism.

LiveCodeBench: Recent competition problems. Better signal for actual coding ability since problems are newer.

SWE-Bench: Real GitHub issues. Practical software engineering signal. Most models score under 60%.

MATH/GSM8K: Mathematical reasoning. GSM8K is easier (high school level), MATH is harder (competition level).

Benchmark scores predict capability ranges but not task-specific performance. Evaluation against your specific use case matters more than aggregate scores.


FAQ

Which open-source LLM is actually best for coding?

For pure code generation, DeepSeek-Coder V2 remains the specialist choice. For coding plus general capability, Qwen 3-32B offers the best stability. Avoid Llama 4 models for production code unless you've tested extensively on your specific stack.

Can I really run these on consumer hardware?

Qwen 3-8B and Phi-4 run well on 16GB GPUs. Quantized Qwen 3-32B works on 24GB. Anything larger needs professional or datacenter hardware despite marketing claims.

What's the actual cost to self-host?

Entry point: $15-20K for a single RTX 4090 workstation running 8B-32B models. Mid-tier: $150-200K for multi-H100 setups running 100B+ models. Enterprise: $500K+ for frontier model deployment at scale.

Should I wait for better models?

Current generation is production-ready. DeepSeek V3.2 and Qwen 3 match proprietary models on most tasks. The issues are known and workable. Ship now, upgrade later.

How do I evaluate for my use case?

Don't trust benchmarks. Create a test set from your actual production queries. Run 100+ examples through candidate models. Measure what matters for your application. Platforms like Prem Studio provide evaluation tooling for systematic comparison.
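That workflow fits in a tiny harness. A minimal sketch, where the stub model functions stand in for real API calls and the data is illustrative:

```python
def exact_match_rate(model_fn, test_set) -> float:
    """Score a candidate model on (query, expected) pairs by exact match.

    Swap in a semantic or rubric-based scorer for open-ended tasks.
    """
    hits = sum(1 for query, expected in test_set
               if model_fn(query).strip() == expected.strip())
    return hits / len(test_set)

# Stub "models" standing in for real inference calls (assumptions).
test_set = [("2+2?", "4"), ("Capital of France?", "Paris")]
candidate_a = lambda q: {"2+2?": "4", "Capital of France?": "Paris"}[q]
candidate_b = lambda q: "42"

print(exact_match_rate(candidate_a, test_set))  # scores all pairs correct
print(exact_match_rate(candidate_b, test_set))  # scores none correct
```

Run the same test set against every candidate and rank on your own metric; the ordering frequently disagrees with published leaderboards.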

MIT vs Apache 2.0 - does it matter?

For most enterprises, no. Both allow commercial use without restrictions. Apache 2.0 adds patent protection, which matters if you're in a patent-heavy industry. MIT is simpler.
