Reasoning Models Explained: OpenAI o1/o3 vs DeepSeek R1 vs QwQ-32B
Reasoning models think before they answer. Unlike standard LLMs that start answering immediately, reasoning models first generate an internal chain-of-thought trace, then output a response. This deliberate thinking time makes them dramatically better at math, coding, and multi-step logic problems.
The landscape shifted in January 2025 when DeepSeek released R1, an open-weight reasoning model that matched OpenAI's o1 at a fraction of the cost. Then Alibaba's QwQ-32B showed that a 32B parameter model could compete with models 20x its size. OpenAI responded with o3 and o4-mini, pushing benchmarks further.
This guide breaks down how these models work, where they excel, what they cost, and when to use each one. If you're evaluating reasoning models for production, this is the comparison you need.
How Reasoning Models Work
Standard LLMs predict the next token based on the previous tokens. Fast, but limited in complex reasoning. They can solve problems they've seen variations of during training, but struggle with novel multi-step logic.
Reasoning models add a thinking phase. Before generating the visible response, they produce an internal chain-of-thought that breaks the problem into steps, verifies intermediate results, and backtracks when needed. This thinking process consumes tokens and time, but dramatically improves accuracy on hard problems.
The key insight: test-time compute scales better than training compute for reasoning tasks. Instead of making the model bigger, you let it think longer. A smaller model that thinks for 30 seconds can outperform a larger model that answers immediately.
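One way to make the test-time compute idea concrete is self-consistency sampling: spend more inference compute by drawing several independent answers and taking the majority vote. This is an illustration of the principle, not how o1 or R1 implement thinking internally; `sample_answer` below is a hypothetical stand-in for a model call.

```python
from collections import Counter
import random

def majority_vote(sample_answer, prompt, k=5):
    """Trade inference compute (k samples) for accuracy by taking
    the most common answer across independent samples."""
    answers = [sample_answer(prompt) for _ in range(k)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / k  # winning answer and its agreement rate

# Toy demo with a noisy stand-in "model" that is right 75% of the time
random.seed(0)
noisy_model = lambda p: random.choice(["391", "391", "391", "401"])
answer, agreement = majority_vote(noisy_model, "What is 17 × 23?", k=9)
print(answer, agreement)
```

More samples raise the chance the correct answer wins the vote, at a linear cost in tokens; dedicated reasoning models get a similar tradeoff from a single long trace.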
The Technical Mechanism
Each reasoning model implements this differently, but the core pattern is:
- Receive input — the user's question or problem
- Generate reasoning trace — internal tokens exploring the solution space
- Self-verify — check intermediate steps for consistency
- Produce output — the final answer, often with the reasoning shown
OpenAI's o-series hides the reasoning tokens (you're billed for them but can't see them). DeepSeek R1 and QwQ expose reasoning in <think>...</think> tags, which is valuable for debugging and trust.
```
User: What is 17 × 23?

<think>
I need to multiply 17 by 23.
Let me break this down: 17 × 23 = 17 × (20 + 3)
17 × 20 = 340
17 × 3 = 51
340 + 51 = 391
Let me verify: 391 / 17 = 23. Correct.
</think>

17 × 23 = 391
```
The thinking tokens often outnumber the output tokens by 5-20x on complex problems. This is why reasoning models are expensive per query despite their effectiveness.
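A back-of-the-envelope helper shows what that ratio does to billing. The default prices match the o1 rates listed later in the cost section, and the 10x ratio is an assumption in the middle of the 5-20x range:

```python
def reasoning_query_cost(visible_output_tokens, input_tokens,
                         think_ratio=10.0,
                         input_price=15.0, output_price=60.0):
    """Estimate USD cost for one reasoning-model query.

    Hidden reasoning tokens are billed at the output rate, so the
    effective output count is visible tokens * (1 + think_ratio).
    Prices are per 1M tokens (o1-style defaults, as an assumption).
    """
    billed_output = visible_output_tokens * (1 + think_ratio)
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000

# 100 visible tokens bill like 1,100 output tokens at a 10x ratio
print(f"${reasoning_query_cost(100, 500):.4f}")  # → $0.0735
```

Note that the hidden tokens dominate: the visible 100 tokens alone would cost $0.006 at the output rate, less than a tenth of the actual bill.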
The Models
OpenAI o1 / o3 / o4-mini
OpenAI's reasoning models use reinforcement learning to train the model to think before answering. The architecture details are not public, but the training approach involves:
- Large-scale RL on reasoning tasks
- Hidden chain-of-thought (billed but not exposed)
- Adaptive compute based on problem difficulty
Current lineup (as of early 2026):
| Model | Best For | Context | Notes |
|---|---|---|---|
| o1 | Complex reasoning | 128K | Original reasoning model |
| o3 | Frontier performance | 200K | 10x compute vs o1, best benchmarks |
| o3-mini | Cost-efficient reasoning | 200K | 3 reasoning effort levels |
| o4-mini | Fast reasoning + vision | 128K | Best on AIME, supports images |
o3 represents the current frontier. At announcement it scored 96.7% on AIME 2024 (the released model benchmarks somewhat lower; see the AIME table later in this guide), 87.7% on GPQA Diamond (PhD-level science), and achieved 2706 Elo on Codeforces, approaching or exceeding human expert performance on several of these benchmarks.
The o4-mini variant adds vision capabilities and actually outperforms o3 on pure math (93.4% on AIME 2024 without tools). It's optimized for the common case where you need good reasoning without maximum compute.
DeepSeek R1
DeepSeek R1 is a 671B parameter Mixture-of-Experts model with 37B parameters active per token. It was trained using a novel approach: pure reinforcement learning without supervised fine-tuning for the initial version (R1-Zero), then refined with cold-start data for the production release.
Key characteristics:
- Architecture: MoE with 671B total, 37B active parameters
- Context: 128K tokens
- Training: RL-first, then SFT for readability
- License: MIT (fully open, commercial use allowed)
- Weights: Available on Hugging Face
The training process was unconventional. DeepSeek-R1-Zero was trained purely with RL, no supervised examples of good reasoning. The model discovered chain-of-thought reasoning on its own through trial and error. This produced strong reasoning but poor readability (language mixing, repetitive text).
R1 added cold-start data: a small set of high-quality reasoning examples to guide early training. This fixed readability while preserving the emergent reasoning capabilities.
Benchmark performance:
| Benchmark | R1 Score | o1 Score | Notes |
|---|---|---|---|
| AIME 2024 | 79.8% | 74.3% | Competition math |
| MATH-500 | 97.3% | 96.4% | Mathematical reasoning |
| GPQA Diamond | 71.5% | 75.7% | PhD-level science |
| Codeforces | 2029 Elo | 1891 Elo | Competitive programming |
| LiveCodeBench | 65.9% | — | Code generation |
R1 matches or exceeds o1 on math benchmarks while trailing slightly on science reasoning. The real differentiator is cost and openness.
DeepSeek R1 Distilled Models
DeepSeek also released distilled versions: smaller dense models trained on reasoning data generated by the full R1. These are practical for local deployment:
| Model | Parameters | AIME 2024 | MATH-500 | LiveCodeBench |
|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | 1.5B | — | — | — |
| R1-Distill-Qwen-7B | 7B | 55.5% | — | — |
| R1-Distill-Qwen-14B | 14B | — | — | — |
| R1-Distill-Qwen-32B | 32B | 72.6% | 94.3% | 57.2% |
| R1-Distill-Llama-70B | 70B | — | — | — |
The 32B distilled model outperforms o1-mini on most benchmarks. The 7B model beats non-reasoning models like GPT-4o. Distillation transfers reasoning capabilities from the teacher (R1-671B) to students (smaller dense models) surprisingly well.
QwQ-32B (Alibaba)
QwQ-32B proves you don't need 671B parameters for strong reasoning. Built on Qwen2.5-32B and trained with two-stage RL, it achieves performance comparable to R1 with 20x fewer total parameters.
Training approach:
- Stage 1 RL: Outcome-based rewards on math and coding. The model learns to reason by being rewarded only for correct final answers, not intermediate steps.
- Stage 2 RL: General capability rewards for instruction following, human preference alignment, and agent behaviors.
This two-stage approach is more efficient than R1's cold-start method and produces cleaner reasoning traces.
Benchmark performance:
| Benchmark | QwQ-32B | R1-671B | o1-mini |
|---|---|---|---|
| AIME 2024 | 79.5% | 79.8% | 63.6% |
| LiveCodeBench | 63.4% | 65.9% | 53.8% |
| LiveBench | 73.1% | 71.6% | 59.1% |
| IFEval | 83.9% | 83.8% | 84.8% |
| BFCL (function calling) | 66.4% | 60.3% | 62.8% |
QwQ-32B matches R1 on most benchmarks while being far easier to deploy. A 32B dense model fits on a single high-end consumer GPU (RTX 4090 with quantization), while R1-671B requires enterprise hardware.
The function-calling performance (BFCL) is notable. QwQ-32B outperforms both R1 and o1-mini on tool use, making it a strong choice for agentic applications that need reasoning plus action.
Benchmark Analysis
Math Reasoning (AIME)
The American Invitational Mathematics Examination tests competition-level high school math. Problems require multi-step reasoning, pattern recognition, and creative problem-solving.
| Model | AIME 2024 | AIME 2025 |
|---|---|---|
| o3 | 91.6% | 88.9% |
| o4-mini | 93.4% | 92.7% |
| o3-mini | 87.3% | 86.5% |
| DeepSeek R1 | 79.8% | — |
| QwQ-32B | 79.5% | — |
| o1 | 74.3% | — |
| o1-mini | 63.6% | — |
o4-mini leads, which is counterintuitive (the "mini" model beats the flagship). This reflects OpenAI's optimization for math specifically. For pure mathematical reasoning, o4-mini is currently the best option if you're using OpenAI's API.
Coding (Codeforces / LiveCodeBench)
Codeforces Elo measures competitive programming ability. LiveCodeBench tests practical code generation, repair, and testing.
| Model | Codeforces Elo | LiveCodeBench |
|---|---|---|
| o3 | 2706 | — |
| o4-mini | 2719 | 68.1% (SWE-bench) |
| DeepSeek R1 | 2029 | 65.9% |
| QwQ-32B | — | 63.4% |
| o1 | 1891 | — |
OpenAI's o-series dominates competitive programming. The gap is significant: o4-mini's 2719 Elo places it in the top 0.1% of human competitors.
For practical software engineering (SWE-bench), the gap narrows. R1 and QwQ perform well on real-world code tasks even if they lag on algorithmic competition problems.
Science Reasoning (GPQA Diamond)
GPQA Diamond contains PhD-level science questions across biology, physics, and chemistry. It tests deep domain knowledge plus multi-step reasoning.
| Model | GPQA Diamond |
|---|---|
| o3 | 87.7% |
| o4-mini | 81.4% |
| o1 | 75.7% |
| DeepSeek R1 | 71.5% |
| QwQ-32B | — |
OpenAI's models lead on science reasoning. The gap is larger here than on math, suggesting R1 and QwQ were optimized more heavily for mathematical tasks during RL training.
Frontier Math
EpochAI's Frontier Math benchmark contains research-level problems that take professional mathematicians hours or days to solve. Most AI models score under 2%.
| Model | Frontier Math |
|---|---|
| o3 | 25.2% |
| All others | <2% |
o3's performance here is a step change. Solving a quarter of research-level math problems puts it in territory that wasn't expected for years.
Cost Analysis
Reasoning models are expensive because they generate many tokens internally. A simple question might produce 2,000 thinking tokens for a 100-token visible response.
API Pricing (per 1M tokens)
| Model | Input | Output | Effective Cost* |
|---|---|---|---|
| o1 | $15.00 | $60.00 | $60-300 |
| o3 | ~$20.00 | ~$80.00 | $80-400 |
| o3-mini | ~$3.00 | ~$12.00 | $12-60 |
| DeepSeek R1 (API) | $0.55 | $2.19 | $2.19-11 |
| QwQ-32B (API) | ~$0.50 | ~$2.00 | $2-10 |
*Effective cost accounts for reasoning tokens, which are billed as output but not visible. Reasoning-heavy queries can use 5-20x more tokens than the visible output.
The cost gap is dramatic. DeepSeek R1 runs 20-50x cheaper than OpenAI o1 for equivalent tasks. For a task costing $50 on OpenAI, you'd pay $1-2 on DeepSeek.
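To see what this means for a real workload, here is a rough calculator using the table's o1 and R1 prices; the per-query token counts are assumptions, not measurements:

```python
# (input, output) USD per 1M tokens, from the pricing table above
PRICES = {"o1": (15.00, 60.00), "deepseek-r1": (0.55, 2.19)}

def monthly_cost(model, queries, in_tok=800, out_tok=2200):
    """Workload cost assuming 800 input and 2,200 billed output
    tokens per query (visible answer plus hidden reasoning)."""
    in_price, out_price = PRICES[model]
    return queries * (in_tok * in_price + out_tok * out_price) / 1_000_000

for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 10_000):,.2f} / 10k queries")
ratio = monthly_cost("o1", 10_000) / monthly_cost("deepseek-r1", 10_000)
print(f"o1 costs ~{ratio:.0f}x more")
```

Under these assumptions the gap lands around 27x, inside the 20-50x range; the exact multiple depends on how reasoning-heavy your queries are.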
Self-Hosting Costs
Self-hosting eliminates per-token charges but requires significant hardware investment.
Full R1-671B requirements:
| Configuration | Hardware | VRAM | Est. Cost |
|---|---|---|---|
| FP16 (unquantized) | 16x H100 | ~1.3TB | $200K+ |
| 4-bit quantized | 4x RTX 4090 + CPU offload | ~400GB total | $8-10K |
| 1.73-bit dynamic quant | Single high-RAM system | ~160GB | $4-6K |
Full R1 requires enterprise hardware even with aggressive quantization. The 4-bit version runs at 2-4 tok/s on 4x RTX 4090, which is usable but slow.
Distilled models are more practical:
| Model | VRAM (FP16) | VRAM (4-bit) | Consumer Hardware |
|---|---|---|---|
| R1-Distill-7B | ~14GB | ~6GB | RTX 3080+ |
| R1-Distill-14B | ~28GB | ~10GB | RTX 4090 |
| R1-Distill-32B | ~64GB | ~18GB | RTX 4090 (quantized) |
| QwQ-32B | ~65GB | ~18GB | RTX 4090 (quantized) |
The 32B distilled models fit on consumer hardware with quantization. Performance is strong: R1-Distill-32B outperforms o1-mini on most benchmarks.
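As a sanity check when sizing hardware, weights-only memory is just parameter count times bytes per parameter; a rough helper (the assumption here is weights only, with KV cache and activations on top):

```python
def vram_estimate_gb(params_b, bits=16):
    """Weights-only memory estimate: parameters (in billions) times
    bytes per parameter. KV cache and activations add more on top."""
    return params_b * bits / 8

for name, params in [("R1-Distill-7B", 7), ("R1-Distill-32B", 32), ("QwQ-32B", 32)]:
    print(f"{name}: FP16 ~{vram_estimate_gb(params):.0f}GB, "
          f"4-bit ~{vram_estimate_gb(params, bits=4):.0f}GB")
```

The estimates match the table's FP16 figures; the gap between the 16GB weights-only estimate and the table's ~18GB for the 4-bit 32B models is runtime overhead.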
Self-Hosting Guide
Ollama (Simplest)
Ollama provides one-command deployment for distilled models:
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a distilled model
ollama run deepseek-r1:14b

# Or the larger 32B version
ollama run deepseek-r1:32b

# QwQ-32B
ollama run qwq:32b
```

Ollama automatically selects an appropriate quantization based on your hardware. For explicit control:

```shell
# Request a specific quantization
ollama run deepseek-r1:32b-q4_K_M
```
A Modelfile for custom configuration:

```
FROM deepseek-r1:32b
PARAMETER temperature 0.6
PARAMETER num_ctx 32768
PARAMETER num_gpu 99
SYSTEM """You are a helpful reasoning assistant. Think step by step before answering."""
```
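To use the Modelfile, build a named model from it and run the result (the name `r1-reasoner` is arbitrary):

```shell
# Build a named model from the Modelfile in the current directory
ollama create r1-reasoner -f Modelfile

# Run it like any other model
ollama run r1-reasoner "What is 17 × 23?"
```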
vLLM (Production)
vLLM provides higher throughput for production deployments:
```shell
# Install vLLM
pip install vllm

# Serve the 32B distilled model
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enforce-eager

# Or with quantization
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 16384
```
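Once the server is up, vLLM exposes an OpenAI-compatible API on the chosen port. A small sketch that builds the request payload for `/v1/chat/completions` (the helper function is ours; model name and prompt come from the examples above):

```python
import json

def build_chat_request(model, prompt, max_tokens=2048, temperature=0.6):
    """Payload for an OpenAI-compatible /v1/chat/completions route,
    such as the one vLLM serves."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "Prove that √2 is irrational.")
print(json.dumps(payload)[:60])

# To send it against a running server (assumes localhost:8000):
# import requests
# r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
# print(r.json()["choices"][0]["message"]["content"])
```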
Docker Compose for production:
```yaml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
      --tensor-parallel-size 2
      --max-model-len 32768
      --port 8000
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
```
Performance Expectations
Measured on RTX 4090 (24GB VRAM):
| Model | Quantization | Tokens/sec | Max Context |
|---|---|---|---|
| R1-Distill-14B | Q4_K_M | 25-35 | 32K |
| R1-Distill-32B | Q4_K_M | 12-18 | 16K |
| QwQ-32B | Q4_K_M | 10-15 | 16K |
| R1-671B (4-bit) | Q4_K_M | 2-4 | 8K |
For full R1-671B on CPU-only systems (256GB+ RAM): roughly 5-8 tok/s with the 1.73-bit dynamic quant. Standard 4-bit quants need ~400GB and won't fit in 256GB of RAM.
When to Use Each Model
Use OpenAI o3/o4-mini when:
- You need frontier performance on hard problems
- Science reasoning is critical (GPQA-level tasks)
- Competitive programming accuracy matters
- You're already in the OpenAI ecosystem
- Budget allows $60-400 per 1M output tokens
Use DeepSeek R1 (API) when:
- Cost matters (20-50x cheaper than o1)
- Math reasoning is the primary use case
- You want visible reasoning traces for debugging
- Open weights aren't required (API is fine)
Use DeepSeek R1 (Self-hosted) when:
- Data privacy requires on-premises deployment
- You're processing enough volume to justify hardware
- You need to eliminate per-token costs
- You have the infrastructure expertise
Use R1 Distilled Models when:
- You need local deployment without enterprise hardware
- The 32B model's performance is sufficient (it beats o1-mini)
- You want to fine-tune on domain-specific reasoning
- Budget is constrained but reasoning quality matters
Use QwQ-32B when:
- You need strong reasoning in a deployable size
- Function calling / agentic use cases are important
- You want open weights with commercial licensing
- You're building agents that need to reason and act
Fine-Tuning Reasoning Models
Standard fine-tuning doesn't work well on reasoning models. The reasoning capability comes from RL training, not supervised examples. Adding more SFT can actually degrade performance.
What works:
- Distillation: Train a smaller model on reasoning traces from a larger model. This is how R1-Distill models were created.
- Continued RL: Apply reinforcement learning with task-specific rewards. Requires significant compute and expertise.
- Prompt engineering: Often more effective than fine-tuning. Reasoning models respond well to instructions like "think step by step" and "verify your answer."
For teams that need domain-specific reasoning models, the distillation approach is most accessible. Generate reasoning traces from R1 or QwQ on your domain's problems, then fine-tune a smaller model on those traces.
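A sketch of the dataset side of that workflow, assuming you have already collected (prompt, trace, answer) triples from the teacher. The record shape is a generic chat-SFT format with R1-style `<think>` tags, not any specific vendor's schema:

```python
import json

def to_sft_record(prompt, trace, answer):
    """One supervised example pairing a prompt with the teacher's
    full reasoning trace plus final answer (R1-style <think> format)."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant",
             "content": f"<think>\n{trace}\n</think>\n{answer}"},
        ]
    }

def write_jsonl(records, path):
    """Write records as JSON Lines, the common SFT dataset format."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# In practice these triples come from querying R1/QwQ on domain problems
sample = to_sft_record("What is 17 × 23?",
                       "17 × 20 = 340, 17 × 3 = 51, total 391.", "391")
write_jsonl([sample], "distill_train.jsonl")
```

Quality filtering matters here: keep only traces whose final answers verify as correct, or the student model learns the teacher's mistakes.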
Prem Studio provides fine-tuning infrastructure for creating specialized models using this distillation approach. You can train domain-specific reasoning capabilities using synthetic data generated by larger models, then deploy the resulting model at a fraction of the cost of running the full R1.
Common Pitfalls
1. Over-prompting
Reasoning models are sensitive to prompt length. Few-shot examples often degrade performance compared to zero-shot prompts. The model's internal reasoning can get confused by examples.
Don't do this:

```
Here are some examples of how to solve math problems:
Example 1: [long worked example]
Example 2: [long worked example]
Now solve: What is 17 × 23?
```

Do this:

```
What is 17 × 23? Think step by step.
```
2. Ignoring reasoning tokens in cost estimates
A response with 200 visible tokens might have 3,000 reasoning tokens. OpenAI bills for both. Your actual cost can be 10-20x higher than naive token counts suggest.
Always test with real queries and monitor actual token usage before budgeting.
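For OpenAI's o-series, the response's usage object breaks reasoning tokens out under `completion_tokens_details`. A small helper for monitoring the hidden overhead (the field shape mirrors OpenAI's documented usage payload, but verify it against your SDK version; the numbers below are illustrative):

```python
def reasoning_overhead(usage):
    """Ratio of hidden reasoning tokens to visible output tokens.

    `usage` mirrors the dict shape OpenAI returns for o-series models:
    completion_tokens includes reasoning; the details break it out.
    """
    reasoning = usage["completion_tokens_details"]["reasoning_tokens"]
    visible = usage["completion_tokens"] - reasoning
    return reasoning / max(visible, 1)

# Illustrative usage payload: 200 visible tokens, 3,000 hidden
usage = {"completion_tokens": 3200,
         "completion_tokens_details": {"reasoning_tokens": 3000}}
print(f"{reasoning_overhead(usage):.0f}x hidden overhead")  # → 15x
```

Logging this ratio per query type makes budget drift visible before the invoice does.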
3. Using reasoning models for simple tasks
Reasoning models add latency and cost. For tasks that don't benefit from multi-step thinking (simple Q&A, summarization, basic classification), standard models are faster and cheaper.
Use reasoning models for:
- Multi-step math problems
- Complex coding tasks
- Logic puzzles
- Planning and strategy
Use standard models for:
- Factual questions
- Text summarization
- Translation
- Simple classification
4. Expecting consistent formatting
Reasoning models can produce variable output formats. QwQ-32B is known for saying "wait" frequently during thinking. R1 can mix languages in reasoning traces. Build parsing logic that handles variability.
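A tolerant parser helps here. This sketch handles the two failure modes named above plus an unclosed `</think>` (which happens when generation is truncated):

```python
import re

def robust_parse(response):
    """Split reasoning from the answer, tolerating an unclosed
    </think> tag or a missing <think> block entirely."""
    match = re.search(r"<think>(.*?)(?:</think>|$)", response, re.DOTALL)
    if not match:
        return None, response.strip()  # no reasoning block at all
    answer = (response[:match.start()] + response[match.end():]).strip()
    return match.group(1).strip(), answer

assert robust_parse("<think>steps</think>42") == ("steps", "42")
assert robust_parse("<think>never closed")[1] == ""  # trace ran to EOF
assert robust_parse("plain answer") == (None, "plain answer")
```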
Real-World Performance: Beyond Benchmarks
Benchmarks tell part of the story. Real-world performance depends on your specific use case.
Mathematical Problem Solving
For pure math (competition problems, proofs, calculations), the ranking is clear:
- o3/o4-mini — Frontier performance, especially with tool access
- DeepSeek R1 — Matches o1, 20x cheaper
- QwQ-32B — 95% of R1's performance in a deployable package
If you're building a math tutoring system or automated proof assistant, any of these work. The choice depends on whether you need the absolute best (o3), cost efficiency (R1 API), or local deployment (QwQ-32B).
Code Generation and Debugging
For coding tasks, the picture is more nuanced:
| Task Type | Best Choice | Why |
|---|---|---|
| Competitive programming | o3/o4-mini | Highest Codeforces Elo |
| Production code | R1 or QwQ | Good enough, much cheaper |
| Code review | QwQ-32B | Function calling for tools |
| Debugging | R1 | Visible reasoning helps |
OpenAI leads on algorithmic problems but the gap shrinks for practical engineering tasks. R1's visible reasoning traces are valuable when you need to understand why the model made certain choices.
Multi-Step Planning
For tasks requiring planning (agent workflows, strategy, complex reasoning chains):
- QwQ-32B excels here due to strong function-calling performance (66.4% on BFCL vs R1's 60.3%)
- R1 is strong but slightly worse at tool coordination
- o3 is powerful but expensive for agentic loops that require many calls
Latency Considerations
Reasoning models trade speed for accuracy. Measured response times for a moderately complex math problem:
| Model | Time to First Token | Total Response Time |
|---|---|---|
| GPT-4o (non-reasoning) | ~300ms | ~2s |
| o1 | ~2s | ~15s |
| o3-mini (medium effort) | ~1.5s | ~8s |
| DeepSeek R1 (API) | ~1s | ~12s |
| QwQ-32B (local, RTX 4090) | ~500ms | ~20s |
For interactive applications, streaming is essential. Users tolerate delays when they see the model "thinking" in real-time.
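With a local Ollama model, streaming means reading the newline-delimited JSON chunks that `/api/generate` emits when `stream` is true. A minimal sketch (assumes Ollama is running on the default port with `qwq:32b` pulled):

```python
import json
import requests

def stream_qwq(prompt, url="http://localhost:11434/api/generate"):
    """Stream tokens from a local Ollama model so users see the
    'thinking' as it happens instead of a long silent wait."""
    with requests.post(url, json={"model": "qwq:32b", "prompt": prompt,
                                  "stream": True}, stream=True) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)  # one JSON object per line
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break

# stream_qwq("Prove that √2 is irrational.")  # requires a running Ollama
```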
Integration Patterns
OpenAI o-series
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    messages=[
        {"role": "user", "content": "Prove that √2 is irrational."}
    ],
    reasoning_effort="medium"  # low, medium, or high
)

print(response.choices[0].message.content)

# Note: reasoning tokens are billed but not visible
print(f"Total tokens: {response.usage.total_tokens}")
```
DeepSeek R1 API
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[
        {"role": "user", "content": "Prove that √2 is irrational."}
    ]
)

# The hosted API exposes the chain-of-thought as a separate field;
# self-hosted R1 emits it inline in <think> tags instead
thinking = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content
```
Local QwQ-32B with Ollama
```python
import requests

def query_qwq(prompt):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwq:32b",
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.6,
                "num_ctx": 32768
            }
        }
    )
    return response.json()["response"]

# Usage
result = query_qwq("Prove that √2 is irrational.")
print(result)
```
Parsing Reasoning Traces
For R1 and QwQ, you can extract and analyze reasoning:
```python
import re

def parse_reasoning(response: str) -> dict:
    """Extract thinking and final answer from reasoning model output."""
    think_match = re.search(r'<think>(.*?)</think>', response, re.DOTALL)
    if think_match:
        thinking = think_match.group(1).strip()
        answer = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
    else:
        # No explicit tags: assume the entire response is the answer
        thinking = None
        answer = response.strip()
    return {
        "thinking": thinking,
        "thinking_tokens": len(thinking.split()) if thinking else 0,
        "answer": answer,
        "answer_tokens": len(answer.split())
    }

# Analyze reasoning efficiency
result = parse_reasoning(model_output)
ratio = result["thinking_tokens"] / max(result["answer_tokens"], 1)
print(f"Thinking/Answer ratio: {ratio:.1f}x")
```
Evaluation: Testing Reasoning Quality
Before deploying reasoning models, evaluate on your domain. Generic benchmarks don't predict performance on your specific tasks.
Build a Test Set
Create 50-100 problems representative of your use case. Include:
- Easy problems (baseline sanity check)
- Medium problems (typical workload)
- Hard problems (stress test)
- Edge cases specific to your domain
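A minimal way to structure such a test set, with placeholder problems standing in for your domain; tagging difficulty lets you slice accuracy by tier later:

```python
# Each case carries a prompt, an expected answer (None when a human
# or judge model must grade it), and a difficulty tier
test_cases = [
    {"prompt": "What is 17 × 23?", "expected": "391",
     "difficulty": "easy"},
    {"prompt": "What is the sum of the first 100 primes?",
     "expected": "24133", "difficulty": "medium"},
    {"prompt": "Prove that √2 is irrational.", "expected": None,
     "difficulty": "hard"},  # graded by a human or judge model
]

# Group by tier so per-difficulty accuracy can be reported
by_tier = {}
for case in test_cases:
    by_tier.setdefault(case["difficulty"], []).append(case)
print({tier: len(cases) for tier, cases in by_tier.items()})
```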
Metrics to Track
```python
import time

def evaluate_reasoning_model(model, test_cases):
    """Assumes `model.generate(prompt)` returns raw text, and that
    verify_answer, estimate_cost, and parse_reasoning are defined
    for your domain (verify_answer may itself be a judge model)."""
    results = []
    for case in test_cases:
        start = time.time()
        response = model.generate(case["prompt"])
        latency = time.time() - start
        parsed = parse_reasoning(response)
        results.append({
            "correct": verify_answer(parsed["answer"], case["expected"]),
            "latency_s": latency,
            "thinking_tokens": parsed["thinking_tokens"],
            "answer_tokens": parsed["answer_tokens"],
            "cost": estimate_cost(parsed, model.pricing)
        })
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_latency": sum(r["latency_s"] for r in results) / len(results),
        "avg_cost": sum(r["cost"] for r in results) / len(results),
        "p95_latency": sorted(r["latency_s"] for r in results)[int(0.95 * len(results))]
    }
```
Compare Models on Your Data
Run the same test set across multiple models:
```python
# Each entry wraps a client behind the same generate() interface;
# pricing is (input, output) USD per 1M tokens
models = [
    {"name": "o3-mini", "api": openai_client, "pricing": (3, 12)},
    {"name": "deepseek-r1", "api": deepseek_client, "pricing": (0.55, 2.19)},
    {"name": "qwq-32b-local", "api": ollama_client, "pricing": (0, 0)}
]

comparison = {}
for model in models:
    comparison[model["name"]] = evaluate_reasoning_model(model, test_cases)

# Print comparison table
print(f"{'Model':<20} {'Accuracy':<10} {'Latency (s)':<12} {'Cost/query':<10}")
for name, metrics in comparison.items():
    print(f"{name:<20} {metrics['accuracy']:<10.1%} "
          f"{metrics['avg_latency']:<12.1f} ${metrics['avg_cost']:.4f}")
```
Frequently Asked Questions
What's the difference between o1 and o3?
o3 uses approximately 10x more compute for reasoning than o1. It scores significantly higher on benchmarks (96.7% vs 74.3% on AIME 2024, per OpenAI's announcement figures) and can handle harder problems. o3 also integrates tools (code execution, web search) into its reasoning loop. Use o3 for the hardest problems; use o3-mini or o4-mini for cost-efficiency.
Is DeepSeek R1 actually as good as the benchmarks claim?
On math benchmarks, R1 matches or exceeds o1. It scores 79.8% on AIME 2024 vs o1's 74.3%. On science reasoning (GPQA Diamond), R1 trails: 71.5% vs o1's 75.7%. R1 is genuinely competitive on reasoning tasks, especially math and coding.
Can I run R1 locally on consumer hardware?
Not the full 671B model. You need ~400GB+ VRAM for even heavily quantized versions. The distilled models (7B, 14B, 32B) run on consumer GPUs. R1-Distill-32B with 4-bit quantization fits on an RTX 4090 and outperforms o1-mini.
Why is QwQ-32B competitive with models 20x its size?
Two reasons. First, reasoning capability transfers well through distillation and RL. The model size matters less than the training approach. Second, MoE models like R1 only activate 37B parameters per token despite having 671B total. QwQ-32B activates all 32B parameters, so the effective gap is smaller than raw parameter counts suggest.
Should I use reasoning models for my chatbot?
Probably not. Reasoning models are slower and more expensive. They're optimized for tasks requiring multi-step logic. For general conversation, standard models (GPT-4o, Claude, Llama) are faster, cheaper, and often better at maintaining natural dialogue.
How do I see the reasoning process?
DeepSeek R1 and QwQ expose reasoning in <think>...</think> tags. OpenAI's o-series hides reasoning tokens (you pay for them but can't see them). If transparency into the model's thought process matters for your use case, use the open models.
Are reasoning tokens billed even though I can't see them?
Yes. OpenAI bills reasoning tokens as output tokens. A query might produce 200 visible tokens and 2,000 hidden reasoning tokens. You pay for all 2,200 at the output token rate.
What's the latency like for reasoning models?
Slower than standard models. Reasoning takes 5-30 seconds depending on problem complexity. For interactive applications, this is noticeable. Consider streaming the output so users see progress, or use async workflows where the user doesn't need immediate responses.
Can I fine-tune reasoning models?
Not effectively with standard SFT. The reasoning capability comes from RL training. Adding SFT can degrade performance. The best approach is distillation: generate reasoning traces from a large model, then train a smaller model on those traces.
Which model should I start with?
For experimentation: QwQ-32B via Ollama. It's free, runs locally on good consumer hardware, and provides competitive reasoning with visible thinking traces. For production with budget: DeepSeek R1 API. For frontier performance: OpenAI o3 or o4-mini.
Summary
Reasoning models represent a paradigm shift in AI capabilities. By trading latency and cost for accuracy, they solve problems that previous models couldn't touch.
The current landscape:
- OpenAI o3/o4-mini: Frontier performance, closed source, expensive ($60-400/M tokens)
- DeepSeek R1: Near-frontier performance, open weights, 20-50x cheaper via API
- QwQ-32B: Strong performance in a deployable size, excellent for agents
For most teams, the decision comes down to deployment model and budget. If you're using APIs and cost matters, DeepSeek R1 is compelling. If you need local deployment, the distilled models or QwQ-32B are practical options that run on consumer hardware while still beating o1-mini.
The reasoning model space is evolving quickly. DeepSeek-R2 is rumored, OpenAI continues pushing o-series, and Alibaba is iterating on QwQ. Expect continued price competition and capability improvements.
For teams building systems that need reliable reasoning—mathematical analysis, code generation, planning—these models change what's possible. The cost and deployment options mean reasoning capability is no longer limited to the largest players.
For production deployments that require domain-specific reasoning, fine-tuning smaller models on reasoning traces from R1 or QwQ provides a path to custom capabilities. Prem Studio supports this workflow with evaluation tools to validate reasoning quality before deployment.