Reasoning Models Explained: OpenAI o1/o3 vs DeepSeek R1 vs QwQ-32B

Reasoning models think before they answer. Unlike standard LLMs, which start generating the answer immediately, reasoning models produce an internal chain-of-thought trace before committing to a response. This deliberate thinking time makes them dramatically better at math, coding, and multi-step logic problems.

The landscape shifted in January 2025 when DeepSeek released R1, an open-weight reasoning model that matched OpenAI's o1 at a fraction of the cost. Then Alibaba's QwQ-32B showed that a 32B parameter model could compete with models 20x its size. OpenAI responded with o3 and o4-mini, pushing benchmarks further.

This guide breaks down how these models work, where they excel, what they cost, and when to use each one. If you're evaluating reasoning models for production, this is the comparison you need.

How Reasoning Models Work

Standard LLMs predict the next token based on the previous tokens. Fast, but limited in complex reasoning. They can solve problems they've seen variations of during training, but struggle with novel multi-step logic.

Reasoning models add a thinking phase. Before generating the visible response, they produce an internal chain-of-thought that breaks the problem into steps, verifies intermediate results, and backtracks when needed. This thinking process consumes tokens and time, but dramatically improves accuracy on hard problems.

The key insight: test-time compute scales better than training compute for reasoning tasks. Instead of making the model bigger, you let it think longer. A smaller model that thinks for 30 seconds can outperform a larger model that answers immediately.

The Technical Mechanism

Each reasoning model implements this differently, but the core pattern is:

  1. Receive input — the user's question or problem
  2. Generate reasoning trace — internal tokens exploring the solution space
  3. Self-verify — check intermediate steps for consistency
  4. Produce output — the final answer, often with the reasoning shown

OpenAI's o-series hides the reasoning tokens (you're billed for them but can't see them). DeepSeek R1 and QwQ expose reasoning in <think>...</think> tags, which is valuable for debugging and trust.

User: What is 17 × 23?

<think>
I need to multiply 17 by 23.
Let me break this down: 17 × 23 = 17 × (20 + 3)
17 × 20 = 340
17 × 3 = 51
340 + 51 = 391
Let me verify: 391 / 17 = 23. Correct.
</think>

17 × 23 = 391

The thinking tokens often outnumber the output tokens by 5-20x on complex problems. This is why reasoning models are expensive per query despite their effectiveness.
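To see how that ratio drives cost, here's a back-of-envelope sketch. The 15x ratio and o1-class pricing are illustrative values taken from this guide, not measurements:

```python
# Rough illustration: hidden reasoning tokens dominate per-query cost.
# The 15x ratio is an illustrative value within the 5-20x range above.
visible_output_tokens = 100
reasoning_ratio = 15  # thinking tokens per visible token (illustrative)
reasoning_tokens = visible_output_tokens * reasoning_ratio

price_per_m_output = 60.00  # $/1M output tokens (o1-class pricing)
billed_tokens = visible_output_tokens + reasoning_tokens
cost = billed_tokens / 1_000_000 * price_per_m_output

print(f"Billed output tokens: {billed_tokens}")  # 1600
print(f"Cost for this query:  ${cost:.4f}")      # $0.0960
```

A 100-token answer bills as 1,600 tokens, which is why per-query costs run an order of magnitude above naive estimates.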

The Models

OpenAI o1 / o3 / o4-mini

OpenAI's reasoning models use reinforcement learning to train the model to think before answering. The architecture details are not public, but the training approach involves:

  • Large-scale RL on reasoning tasks
  • Hidden chain-of-thought (billed but not exposed)
  • Adaptive compute based on problem difficulty

Current lineup (as of early 2026):

Model Best For Context Notes
o1 Complex reasoning 128K Original reasoning model
o3 Frontier performance 200K 10x compute vs o1, best benchmarks
o3-mini Cost-efficient reasoning 200K 3 reasoning effort levels
o4-mini Fast reasoning + vision 128K Best on AIME, supports images

o3 represents the current frontier. In OpenAI's announcement evaluations it scored 96.7% on AIME 2024, 87.7% on GPQA Diamond (PhD-level science), and reached 2706 Elo on Codeforces (published figures for the released model are somewhat lower). These numbers exceed human expert performance on most benchmarks.

The o4-mini variant adds vision capabilities and actually outperforms o3 on pure math (93.4% on AIME 2024 without tools). It's optimized for the common case where you need good reasoning without maximum compute.

DeepSeek R1

DeepSeek R1 is a 671B parameter Mixture-of-Experts model with 37B parameters active per token. It was trained using a novel approach: pure reinforcement learning without supervised fine-tuning for the initial version (R1-Zero), then refined with cold-start data for the production release.

Key characteristics:

  • Architecture: MoE with 671B total, 37B active parameters
  • Context: 128K tokens
  • Training: RL-first, then SFT for readability
  • License: MIT (fully open, commercial use allowed)
  • Weights: Available on Hugging Face

The training process was unconventional. DeepSeek-R1-Zero was trained purely with RL, no supervised examples of good reasoning. The model discovered chain-of-thought reasoning on its own through trial and error. This produced strong reasoning but poor readability (language mixing, repetitive text).

R1 added cold-start data: a small set of high-quality reasoning examples to guide early training. This fixed readability while preserving the emergent reasoning capabilities.

Benchmark performance:

Benchmark R1 Score o1 Score Notes
AIME 2024 79.8% 74.3% Competition math
MATH-500 97.3% 96.4% Mathematical reasoning
GPQA Diamond 71.5% 75.7% PhD-level science
Codeforces 2029 Elo 1891 Elo Competitive programming
LiveCodeBench 65.9% — Code generation

R1 matches or exceeds o1 on math benchmarks while trailing slightly on science reasoning. The real differentiator is cost and openness.

DeepSeek R1 Distilled Models

DeepSeek also released distilled versions: smaller dense models trained on reasoning data generated by the full R1. These are practical for local deployment:

Model Parameters AIME 2024 MATH-500 LiveCodeBench
R1-Distill-Qwen-1.5B 1.5B — — —
R1-Distill-Qwen-7B 7B 55.5% — —
R1-Distill-Qwen-14B 14B — — —
R1-Distill-Qwen-32B 32B 72.6% 94.3% 57.2%
R1-Distill-Llama-70B 70B — — —

The 32B distilled model outperforms o1-mini on most benchmarks. The 7B model beats non-reasoning models like GPT-4o. Distillation transfers reasoning capabilities from the teacher (R1-671B) to students (smaller dense models) surprisingly well.

QwQ-32B (Alibaba)

QwQ-32B proves you don't need 671B parameters for strong reasoning. Built on Qwen2.5-32B and trained with two-stage RL, it achieves performance comparable to R1 with 20x fewer total parameters.

Training approach:

  1. Stage 1 RL: Outcome-based rewards on math and coding. The model learns to reason by being rewarded only for correct final answers, not intermediate steps.
  2. Stage 2 RL: General capability rewards for instruction following, human preference alignment, and agent behaviors.

This two-stage approach is more efficient than R1's cold-start method and produces cleaner reasoning traces.

Benchmark performance:

Benchmark QwQ-32B R1-671B o1-mini
AIME 2024 79.5% 79.8% 63.6%
LiveCodeBench 63.4% 65.9% 53.8%
LiveBench 73.1% 71.6% 59.1%
IFEval 83.9% 83.8% 84.8%
BFCL (function calling) 66.4% 60.3% 62.8%

QwQ-32B matches R1 on most benchmarks while being far easier to deploy. A 32B dense model fits on a single high-end consumer GPU (RTX 4090 with quantization), while R1-671B requires enterprise hardware.

The function-calling performance (BFCL) is notable. QwQ-32B outperforms both R1 and o1-mini on tool use, making it a strong choice for agentic applications that need reasoning plus action.

Benchmark Analysis

Math Reasoning (AIME)

The American Invitational Mathematics Examination tests competition-level high school math. Problems require multi-step reasoning, pattern recognition, and creative problem-solving.

Model AIME 2024 AIME 2025
o3 91.6% 88.9%
o4-mini 93.4% 92.7%
o3-mini 87.3% 86.5%
DeepSeek R1 79.8% —
QwQ-32B 79.5% —
o1 74.3% —
o1-mini 63.6% —

o4-mini leads, which is counterintuitive (the "mini" model beats the flagship). This reflects OpenAI's optimization for math specifically. For pure mathematical reasoning, o4-mini is currently the best option if you're using OpenAI's API.

Coding (Codeforces / LiveCodeBench)

Codeforces Elo measures competitive programming ability. LiveCodeBench tests practical code generation, repair, and testing.

Model Codeforces Elo LiveCodeBench
o3 2706 —
o4-mini 2719 68.1% (SWE-bench)
DeepSeek R1 2029 65.9%
QwQ-32B — 63.4%
o1 1891 —

OpenAI's o-series dominates competitive programming. The gap is significant: o4-mini's 2719 Elo places it in the top 0.1% of human competitors.

For practical software engineering (SWE-bench), the gap narrows. R1 and QwQ perform well on real-world code tasks even if they lag on algorithmic competition problems.

Science Reasoning (GPQA Diamond)

GPQA Diamond contains PhD-level science questions across biology, physics, and chemistry. It tests deep domain knowledge plus multi-step reasoning.

Model GPQA Diamond
o3 87.7%
o4-mini 81.4%
o1 75.7%
DeepSeek R1 71.5%
QwQ-32B —

OpenAI's models lead on science reasoning. The gap is larger here than on math, suggesting R1 and QwQ were optimized more heavily for mathematical tasks during RL training.

Frontier Math

EpochAI's Frontier Math benchmark contains research-level problems that take professional mathematicians hours or days to solve. Most AI models score under 2%.

Model Frontier Math
o3 25.2%
All others <2%

o3's performance here is a step change. Solving a quarter of research-level math problems puts it in territory that wasn't expected for years.

Cost Analysis

Reasoning models are expensive because they generate many tokens internally. A simple question might produce 2,000 thinking tokens for a 100-token visible response.

API Pricing (per 1M tokens)

Model Input Output Effective Cost*
o1 $15.00 $60.00 $60-300
o3 ~$20.00 ~$80.00 $80-400
o3-mini ~$3.00 ~$12.00 $12-60
DeepSeek R1 (API) $0.55 $2.19 $2.19-11
QwQ-32B (API) ~$0.50 ~$2.00 $2-10

*Effective cost accounts for reasoning tokens, which are billed as output but not visible. Reasoning-heavy queries can use 5-20x more tokens than the visible output.

The cost gap is dramatic. DeepSeek R1 runs 20-50x cheaper than OpenAI o1 for equivalent tasks. For a task costing $50 on OpenAI, you'd pay $1-2 on DeepSeek.

Self-Hosting Costs

Self-hosting eliminates per-token charges but requires significant hardware investment.

Full R1-671B requirements:

Configuration Hardware Memory Footprint Est. Cost
FP16 (unquantized) 8x H100 ~1.3TB $200K+
4-bit quantized 4x RTX 4090 + system RAM ~400GB $8-10K
1.73-bit dynamic quant Single high-RAM system ~160GB $4-6K

Full R1 requires enterprise hardware even with aggressive quantization. The 4-bit footprint far exceeds the 96GB of VRAM on 4x RTX 4090, so much of the model is offloaded to system RAM; the result runs at 2-4 tok/s, which is usable but slow.

Distilled models are more practical:

Model VRAM (FP16) VRAM (4-bit) Consumer Hardware
R1-Distill-7B ~14GB ~6GB RTX 3080+
R1-Distill-14B ~28GB ~10GB RTX 4090
R1-Distill-32B ~64GB ~18GB RTX 4090 (quantized)
QwQ-32B ~65GB ~18GB RTX 4090 (quantized)

The 32B distilled models fit on consumer hardware with quantization. Performance is strong: R1-Distill-32B outperforms o1-mini on most benchmarks.
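The VRAM figures above follow a simple rule of thumb: weights-only memory is parameter count times bits per weight, divided by 8. A minimal sketch (real deployments need extra headroom for KV cache and activations, and quantized builds often keep some layers at higher precision, which is why the table's 4-bit figures run slightly above the raw math):

```python
def weight_memory_gb(params_b: float, bits_per_weight: int) -> float:
    """Weights-only memory: billions of params x bits per weight / 8 -> GB."""
    return params_b * bits_per_weight / 8

print(f"32B @ FP16:  ~{weight_memory_gb(32, 16):.0f} GB")  # ~64 GB, matches the table
print(f"32B @ 4-bit: ~{weight_memory_gb(32, 4):.0f} GB")   # ~16 GB before overhead
```

Run the same arithmetic on 671B at 4 bits (~335GB) to see why full R1 never fits on consumer cards.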

Self-Hosting Guide

Ollama (Simplest)

Ollama provides one-command deployment for distilled models:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a distilled model
ollama run deepseek-r1:14b

# Or the larger 32B version
ollama run deepseek-r1:32b

# QwQ-32B
ollama run qwq:32b

Ollama automatically selects appropriate quantization based on your hardware. For explicit control:

# Specific quantization
ollama run deepseek-r1:32b-q4_K_M

Modelfile for custom configuration:

FROM deepseek-r1:32b

PARAMETER temperature 0.6
PARAMETER num_ctx 32768
PARAMETER num_gpu 99

SYSTEM """You are a helpful reasoning assistant. Think step by step before answering."""

vLLM (Production)

vLLM provides higher throughput for production deployments:

# Install vLLM
pip install vllm

# Serve the 32B distilled model
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enforce-eager

# Or with quantization
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 16384
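vLLM exposes an OpenAI-compatible API, so the server above can be queried with a plain HTTP POST. A minimal standard-library sketch, assuming the server is running on localhost:8000 (vLLM's default port):

```python
import json
import urllib.request

def ask_local_r1(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST a chat completion to the vLLM OpenAI-compatible endpoint."""
    payload = {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires the server to be running):
# print(ask_local_r1("What is 17 x 23? Think step by step."))
```

Because the endpoint is OpenAI-compatible, the official openai client also works by setting base_url to the same address.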

Docker Compose for production:

version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
      --tensor-parallel-size 2
      --max-model-len 32768
      --port 8000
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]

Performance Expectations

Measured on RTX 4090 (24GB VRAM):

Model Quantization Tokens/sec Max Context
R1-Distill-14B Q4_K_M 25-35 32K
R1-Distill-32B Q4_K_M 12-18 16K
QwQ-32B Q4_K_M 10-15 16K
R1-671B (4-bit) Q4_K_M 2-4 8K

For full R1-671B on CPU-only (256GB+ RAM): 5-8 tok/s with IQ4 quantization.
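These throughput numbers translate directly into wall-clock latency once reasoning tokens are counted. A quick sketch with illustrative values (a hard problem producing a 3,000-token trace on the 32B distilled model):

```python
# Wall-clock estimate for a reasoning-heavy query at local throughput.
# The 3,000-token trace and 15 tok/s are illustrative values from the
# ranges above (hard problems; R1-Distill-32B Q4_K_M on an RTX 4090).
reasoning_tokens = 3000
answer_tokens = 200
throughput_tps = 15

total_seconds = (reasoning_tokens + answer_tokens) / throughput_tps
print(f"~{total_seconds:.0f}s per query")  # ~213s
```

At local speeds, a single hard problem can take minutes, which is why streaming output (discussed below) matters so much for perceived latency.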

When to Use Each Model

Use OpenAI o3/o4-mini when:

  • You need frontier performance on hard problems
  • Science reasoning is critical (GPQA-level tasks)
  • Competitive programming accuracy matters
  • You're already in the OpenAI ecosystem
  • Budget allows $60-400 per 1M output tokens

Use DeepSeek R1 (API) when:

  • Cost matters (20-50x cheaper than o1)
  • Math reasoning is the primary use case
  • You want visible reasoning traces for debugging
  • Open weights aren't required (API is fine)

Use DeepSeek R1 (Self-hosted) when:

  • Data privacy requires on-premises deployment
  • You're processing enough volume to justify hardware
  • You need to eliminate per-token costs
  • You have the infrastructure expertise

Use R1 Distilled Models when:

  • You need local deployment without enterprise hardware
  • The 32B model's performance is sufficient (it beats o1-mini)
  • You want to fine-tune on domain-specific reasoning
  • Budget is constrained but reasoning quality matters

Use QwQ-32B when:

  • You need strong reasoning in a deployable size
  • Function calling / agentic use cases are important
  • You want open weights with commercial licensing
  • You're building agents that need to reason and act

Fine-Tuning Reasoning Models

Standard fine-tuning doesn't work well on reasoning models. The reasoning capability comes from RL training, not supervised examples. Adding more SFT can actually degrade performance.

What works:

  1. Distillation: Train a smaller model on reasoning traces from a larger model. This is how R1-Distill models were created.
  2. Continued RL: Apply reinforcement learning with task-specific rewards. Requires significant compute and expertise.
  3. Prompt engineering: Often more effective than fine-tuning. Reasoning models respond well to instructions like "think step by step" and "verify your answer."

For teams that need domain-specific reasoning models, the distillation approach is most accessible. Generate reasoning traces from R1 or QwQ on your domain's problems, then fine-tune a smaller model on those traces.
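A minimal sketch of that trace-generation step, assuming your teacher model emits the <think>...</think> format shown earlier. The function name and the substring-based correctness check are illustrative, not a fixed recipe:

```python
# Sketch: turn a teacher model's raw output into an SFT record for
# distillation. `raw_output` is assumed to contain <think>...</think>
# followed by the final answer, as shown earlier in this guide.
import json
import re

def build_distillation_record(problem, raw_output, expected):
    match = re.search(r"<think>(.*?)</think>\s*(.*)", raw_output, re.DOTALL)
    if not match:
        return None
    thinking, answer = match.group(1).strip(), match.group(2).strip()
    # Keep only traces whose final answer is verifiably correct
    if expected not in answer:
        return None
    return {"prompt": problem, "completion": f"<think>\n{thinking}\n</think>\n\n{answer}"}

# Example with a synthetic trace (no API call):
record = build_distillation_record(
    "What is 17 x 23?",
    "<think>17 x 20 = 340, 17 x 3 = 51, total 391.</think>\n17 x 23 = 391",
    expected="391",
)
print(json.dumps(record, indent=2))
```

Filtering for verified-correct answers before training is the key step: it is what lets the student inherit reasoning quality rather than the teacher's mistakes.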

Prem Studio provides fine-tuning infrastructure for creating specialized models using this distillation approach. You can train domain-specific reasoning capabilities using synthetic data generated by larger models, then deploy the resulting model at a fraction of the cost of running the full R1.

Common Pitfalls

1. Over-prompting

Reasoning models are sensitive to prompt length. Few-shot examples often degrade performance compared to zero-shot prompts. The model's internal reasoning can get confused by examples.

Don't do this:

Here are some examples of how to solve math problems:

Example 1: [long worked example]
Example 2: [long worked example]

Now solve: What is 17 × 23?

Do this:

What is 17 × 23? Think step by step.

2. Ignoring reasoning tokens in cost estimates

A response with 200 visible tokens might have 3,000 reasoning tokens. OpenAI bills for both. Your actual cost can be 10-20x higher than naive token counts suggest.

Always test with real queries and monitor actual token usage before budgeting.
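With the OpenAI SDK, the hidden reasoning count is reported in usage.completion_tokens_details.reasoning_tokens on o-series responses. A sketch using a mocked response object so the arithmetic is visible without an API call:

```python
# Inspect hidden reasoning token usage on an o-series response.
# A real response object has the same shape; here we mock it so the
# arithmetic runs without an API call.
from types import SimpleNamespace

response = SimpleNamespace(
    usage=SimpleNamespace(
        completion_tokens=3200,
        completion_tokens_details=SimpleNamespace(reasoning_tokens=3000),
    )
)

reasoning = response.usage.completion_tokens_details.reasoning_tokens
visible = response.usage.completion_tokens - reasoning
multiplier = response.usage.completion_tokens / visible
print(f"Visible: {visible}, reasoning: {reasoning}, multiplier: {multiplier:.0f}x")
```

Logging this multiplier per query type gives you real cost data instead of estimates.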

3. Using reasoning models for simple tasks

Reasoning models add latency and cost. For tasks that don't benefit from multi-step thinking (simple Q&A, summarization, basic classification), standard models are faster and cheaper.

Use reasoning models for:

  • Multi-step math problems
  • Complex coding tasks
  • Logic puzzles
  • Planning and strategy

Use standard models for:

  • Factual questions
  • Text summarization
  • Translation
  • Simple classification

4. Expecting consistent formatting

Reasoning models can produce variable output formats. QwQ-32B is known for saying "wait" frequently during thinking. R1 can mix languages in reasoning traces. Build parsing logic that handles variability.

Real-World Performance: Beyond Benchmarks

Benchmarks tell part of the story. Real-world performance depends on your specific use case.

Mathematical Problem Solving

For pure math (competition problems, proofs, calculations), the ranking is clear:

  1. o3/o4-mini — Frontier performance, especially with tool access
  2. DeepSeek R1 — Matches o1, 20x cheaper
  3. QwQ-32B — 95% of R1's performance in a deployable package

If you're building a math tutoring system or automated proof assistant, any of these work. The choice depends on whether you need the absolute best (o3), cost efficiency (R1 API), or local deployment (QwQ-32B).

Code Generation and Debugging

For coding tasks, the picture is more nuanced:

Task Type Best Choice Why
Competitive programming o3/o4-mini Highest Codeforces Elo
Production code R1 or QwQ Good enough, much cheaper
Code review QwQ-32B Function calling for tools
Debugging R1 Visible reasoning helps

OpenAI leads on algorithmic problems but the gap shrinks for practical engineering tasks. R1's visible reasoning traces are valuable when you need to understand why the model made certain choices.

Multi-Step Planning

For tasks requiring planning (agent workflows, strategy, complex reasoning chains):

  • QwQ-32B excels here due to strong function-calling performance (66.4% on BFCL vs R1's 60.3%)
  • R1 is strong but slightly worse at tool coordination
  • o3 is powerful but expensive for agentic loops that require many calls

Latency Considerations

Reasoning models trade speed for accuracy. Measured response times for a moderately complex math problem:

Model Time to First Token Total Response Time
GPT-4o (non-reasoning) ~300ms ~2s
o1 ~2s ~15s
o3-mini (medium effort) ~1.5s ~8s
DeepSeek R1 (API) ~1s ~12s
QwQ-32B (local, RTX 4090) ~500ms ~20s

For interactive applications, streaming is essential. Users tolerate delays when they see the model "thinking" in real-time.
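With Ollama, streaming is the default behavior of /api/generate: the server returns newline-delimited JSON chunks. A standard-library sketch for surfacing tokens as they arrive, assuming the local setup shown earlier:

```python
# Stream tokens from a local Ollama model so users see progress in real time.
# Assumes Ollama is running locally with qwq:32b pulled.
import json
import urllib.request

def stream_qwq(prompt, url="http://localhost:11434/api/generate"):
    payload = json.dumps({"model": "qwq:32b", "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # Ollama streams newline-delimited JSON chunks
            chunk = json.loads(line)
            yield chunk.get("response", "")
            if chunk.get("done"):
                break

# Usage: print tokens as they arrive
# for token in stream_qwq("What is 17 x 23?"):
#     print(token, end="", flush=True)
```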

Integration Patterns

OpenAI o-series

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    messages=[
        {"role": "user", "content": "Prove that √2 is irrational."}
    ],
    reasoning_effort="medium"  # low, medium, or high
)

print(response.choices[0].message.content)
# Note: reasoning tokens are billed but not visible
print(f"Total tokens: {response.usage.total_tokens}")

DeepSeek R1 API

from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[
        {"role": "user", "content": "Prove that √2 is irrational."}
    ]
)

# The official DeepSeek API returns the chain-of-thought separately in
# `reasoning_content`; self-hosted R1 weights emit <think>...</think> tags instead
reasoning = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content

Local QwQ-32B with Ollama

import requests

def query_qwq(prompt):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwq:32b",
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.6,
                "num_ctx": 32768
            }
        }
    )
    return response.json()["response"]

# Usage
result = query_qwq("Prove that √2 is irrational.")
print(result)

Parsing Reasoning Traces

For R1 and QwQ, you can extract and analyze reasoning:

import re

def parse_reasoning(response: str) -> dict:
    """Extract thinking and final answer from reasoning model output."""
    think_match = re.search(r'<think>(.*?)</think>', response, re.DOTALL)
    
    if think_match:
        thinking = think_match.group(1).strip()
        answer = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
    else:
        # No explicit tags, assume entire response is the answer
        thinking = None
        answer = response.strip()
    
    return {
        "thinking": thinking,
        "thinking_tokens": len(thinking.split()) if thinking else 0,
        "answer": answer,
        "answer_tokens": len(answer.split())
    }

# Analyze reasoning efficiency
result = parse_reasoning(model_output)
ratio = result["thinking_tokens"] / max(result["answer_tokens"], 1)
print(f"Thinking/Answer ratio: {ratio:.1f}x")

Evaluation: Testing Reasoning Quality

Before deploying reasoning models, evaluate on your domain. Generic benchmarks don't predict performance on your specific tasks.

Build a Test Set

Create 50-100 problems representative of your use case. Include:

  • Easy problems (baseline sanity check)
  • Medium problems (typical workload)
  • Hard problems (stress test)
  • Edge cases specific to your domain

Metrics to Track

import time

def evaluate_reasoning_model(model, test_cases):
    # assumes parse_reasoning(), verify_answer(), and estimate_cost() are defined
    results = []

    for case in test_cases:
        start = time.time()
        response = model.generate(case["prompt"])
        latency = time.time() - start
        
        parsed = parse_reasoning(response)
        
        results.append({
            "correct": verify_answer(parsed["answer"], case["expected"]),
            "latency_s": latency,
            "thinking_tokens": parsed["thinking_tokens"],
            "answer_tokens": parsed["answer_tokens"],
            "cost": estimate_cost(parsed, model.pricing)
        })
    
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_latency": sum(r["latency_s"] for r in results) / len(results),
        "avg_cost": sum(r["cost"] for r in results) / len(results),
        "p95_latency": sorted([r["latency_s"] for r in results])[int(0.95 * len(results))]
    }

Compare Models on Your Data

Run the same test set across multiple models:

# openai_client, deepseek_client, and ollama_client are assumed to be
# configured API clients; pricing is (input, output) in $ per 1M tokens
models = [
    {"name": "o3-mini", "api": openai_client, "pricing": (3, 12)},
    {"name": "deepseek-r1", "api": deepseek_client, "pricing": (0.55, 2.19)},
    {"name": "qwq-32b-local", "api": ollama_client, "pricing": (0, 0)}
]

comparison = {}
for model in models:
    comparison[model["name"]] = evaluate_reasoning_model(model, test_cases)

# Print comparison table
print(f"{'Model':<20} {'Accuracy':<10} {'Latency (s)':<12} {'Cost/query':<10}")
for name, metrics in comparison.items():
    print(f"{name:<20} {metrics['accuracy']:.1%}      {metrics['avg_latency']:.1f}          ${metrics['avg_cost']:.4f}")

Frequently Asked Questions

What's the difference between o1 and o3?

o3 uses approximately 10x more compute for reasoning than o1. It scores significantly higher on benchmarks (96.7% vs 74.3% on AIME 2024, per OpenAI's announcement evaluations) and can handle harder problems. o3 also integrates tools (code execution, web search) into its reasoning loop. Use o3 for the hardest problems; use o3-mini or o4-mini for cost-efficiency.

Is DeepSeek R1 actually as good as the benchmarks suggest?

On math benchmarks, R1 matches or exceeds o1. It scores 79.8% on AIME 2024 vs o1's 74.3%. On science reasoning (GPQA Diamond), R1 trails: 71.5% vs o1's 75.7%. R1 is genuinely competitive on reasoning tasks, especially math and coding.

Can I run R1 locally on consumer hardware?

Not the full 671B model. You need ~400GB+ VRAM for even heavily quantized versions. The distilled models (7B, 14B, 32B) run on consumer GPUs. R1-Distill-32B with 4-bit quantization fits on an RTX 4090 and outperforms o1-mini.

Why is QwQ-32B competitive with models 20x its size?

Two reasons. First, reasoning capability transfers well through distillation and RL. The model size matters less than the training approach. Second, MoE models like R1 only activate 37B parameters per token despite having 671B total. QwQ-32B activates all 32B parameters, so the effective gap is smaller than raw parameter counts suggest.

Should I use reasoning models for my chatbot?

Probably not. Reasoning models are slower and more expensive. They're optimized for tasks requiring multi-step logic. For general conversation, standard models (GPT-4o, Claude, Llama) are faster, cheaper, and often better at maintaining natural dialogue.

How do I see the reasoning process?

DeepSeek R1 and QwQ expose reasoning in <think>...</think> tags. OpenAI's o-series hides reasoning tokens (you pay for them but can't see them). If transparency into the model's thought process matters for your use case, use the open models.

Are reasoning tokens billed even though I can't see them?

Yes. OpenAI bills reasoning tokens as output tokens. A query might produce 200 visible tokens and 2,000 hidden reasoning tokens. You pay for all 2,200 at the output token rate.

What's the latency like for reasoning models?

Slower than standard models. Reasoning takes 5-30 seconds depending on problem complexity. For interactive applications, this is noticeable. Consider streaming the output so users see progress, or use async workflows where the user doesn't need immediate responses.

Can I fine-tune reasoning models?

Not effectively with standard SFT. The reasoning capability comes from RL training. Adding SFT can degrade performance. The best approach is distillation: generate reasoning traces from a large model, then train a smaller model on those traces.

Which model should I start with?

For experimentation: QwQ-32B via Ollama. It's free, runs locally on good consumer hardware, and provides competitive reasoning with visible thinking traces. For production with budget: DeepSeek R1 API. For frontier performance: OpenAI o3 or o4-mini.

Summary

Reasoning models represent a paradigm shift in AI capabilities. By trading latency and cost for accuracy, they solve problems that previous models couldn't touch.

The current landscape:

  • OpenAI o3/o4-mini: Frontier performance, closed source, expensive ($60-400/M tokens)
  • DeepSeek R1: Near-frontier performance, open weights, 20-50x cheaper via API
  • QwQ-32B: Strong performance in a deployable size, excellent for agents

For most teams, the decision comes down to deployment model and budget. If you're using APIs and cost matters, DeepSeek R1 is compelling. If you need local deployment, the distilled models or QwQ-32B are practical options that run on consumer hardware while still beating o1-mini.

The reasoning model space is evolving quickly. DeepSeek-R2 is rumored, OpenAI continues pushing o-series, and Alibaba is iterating on QwQ. Expect continued price competition and capability improvements.

For teams building systems that need reliable reasoning—mathematical analysis, code generation, planning—these models change what's possible. The cost and deployment options mean reasoning capability is no longer limited to the largest players.

For production deployments that require domain-specific reasoning, fine-tuning smaller models on reasoning traces from R1 or QwQ provides a path to custom capabilities. Prem Studio supports this workflow with evaluation tools to validate reasoning quality before deployment.
