How to Self-Host DeepSeek R1: Hardware, Setup, and Privacy Guide (2026)

DeepSeek R1 is the first open-weight model that genuinely competes with OpenAI o1 on reasoning tasks. It scores 79.8% on AIME 2024, writes code at a Codeforces 2,029 Elo level, and ships under an MIT license.

There's one problem. Every prompt you send to DeepSeek's API ends up on servers in China.

Their privacy policy doesn't hide it. Text, prompts, uploaded files, chat history, IP addresses, keystroke patterns. All stored in the People's Republic of China, retained "as long as necessary." Under China's National Intelligence Law, DeepSeek has to hand over data when the government asks. No external oversight. No appeals process.

Six governments have already banned the hosted API. Italy pulled it from app stores. Australia blocked it on all government systems. NASA, the US Navy, the Commerce Department, and Texas all followed. Thirteen EU countries have opened investigations.

None of those bans apply to the model weights. The weights are MIT licensed. You can download them, run them on your own hardware, and keep every prompt and output on your own infrastructure. That's what this guide covers.

Why Self-Host DeepSeek R1 Instead of Using the API

Before we get into deployment, let's be clear about what you're choosing between.

Use the DeepSeek API if data privacy doesn't matter for your use case. It's the cheapest and easiest option. Your prompts go to China, but maybe that's fine for your application.

Self-host DeepSeek R1 if you need data sovereignty. Your prompts stay on your infrastructure. You control logging, retention, and deletion. No cross-border data transfers to worry about for GDPR or CCPA.

Use managed private deployment if you want sovereignty without running GPU infrastructure. PremAI deploys models in your VPC with Swiss jurisdiction and zero data retention. You get the privacy benefits without becoming a GPU ops team.

Most of this article covers self-hosting. But I'll be honest: running the full 671B model on 8x H100s is not a weekend project. If your team doesn't have GPU infrastructure experience, managed deployment will get you to production faster.

DeepSeek R1 Model Variants: Which One to Choose

DeepSeek R1 isn't one model. It's eight: the 671B flagship, its R1-Zero research sibling, and six distills.

The flagship is a 671 billion parameter Mixture-of-Experts model. Only 37 billion parameters activate per token, so inference is faster than the parameter count suggests. But you still need to fit all 671B in VRAM because the router needs access to every expert.

For most teams, the distilled models are the better starting point. DeepSeek trained six smaller dense models on 800K samples from the full R1. They kept most of the reasoning capability without the MoE complexity.

DeepSeek R1 Distill Qwen 32B is the sweet spot. It fits on a single RTX 4090 at INT4 quantization. It's Apache 2.0 licensed with no commercial restrictions. And it reasons well enough that you can validate your use case before scaling up.

DeepSeek R1 Distill Llama 70B is stronger but needs two RTX 4090s or an A100. It also inherits Meta's 700M MAU license threshold.

Start with the 32B. Seriously. You can deploy it in an afternoon, test your application, and only move to the full 671B when you've confirmed you need that extra capability.

DeepSeek R1 Hardware Requirements and GPU Sizing

Here's what you actually need to run each variant.

DeepSeek R1 Distill 32B Hardware Requirements

One RTX 4090 (24GB) at INT4 quantization. Cloud rental runs $250-540/month depending on provider. Or buy the card for around $1,600.

DeepSeek R1 Distill 70B Hardware Requirements

Two RTX 4090s or one A100 80GB. Cloud rental runs $800-1,200/month.

DeepSeek R1 Full 671B Hardware Requirements

At FP8 precision (recommended): Eight H200 141GB GPUs. This configuration has enough VRAM to fit the model plus KV cache plus overhead. Monthly cost: $14,000+.

At INT4 quantization: Eight H100 80GB GPUs. This works but you're pushing limits. The model needs 335-400GB for weights, leaving less headroom. Monthly cost: $8,600-11,500.

A common mistake: people see "8x H100 80GB = 640GB total" and assume the 671B FP8 model will fit. It won't. The model needs ~685GB including the MTP module. Either quantize to INT4 or go multi-node.
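The arithmetic behind that mistake is worth sketching. Here's an illustrative Python check using the approximate figures above (1B parameters ≈ 1 GB at FP8, 0.5 GB at INT4), ignoring KV cache, activations, and the MTP module:

```python
# Rough weight-memory arithmetic for the 671B model. These are the
# back-of-envelope figures from this guide, not exact measurements.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

fp8_gb = weight_memory_gb(671, 1.0)    # ~671 GB before MTP/KV-cache overhead
int4_gb = weight_memory_gb(671, 0.5)   # ~335 GB
h100_total_gb = 8 * 80                 # 640 GB across 8x H100

print(f"FP8 fits on 8x H100?  {fp8_gb < h100_total_gb}")   # False
print(f"INT4 fits on 8x H100? {int4_gb < h100_total_gb}")  # True
```

Weights alone already exceed the cluster at FP8, and the real requirement is higher once you add the MTP module and KV cache.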

If managing GPU clusters isn't your team's strength, PremAI handles deployment in your VPC. You get sovereignty without the hardware headaches.

How to Deploy DeepSeek R1 with vLLM

vLLM is the workhorse inference engine. It fully supports DeepSeek R1, both the full model and all distills.

vLLM Docker Setup for DeepSeek R1 32B

Here's a Docker Compose config for the 32B distill on two GPUs:

services:
  deepseek-r1:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    ipc: host
    command: >
      --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
      --tensor-parallel-size 2
      --max-model-len 32768
      --trust-remote-code
      --enforce-eager
      --gpu-memory-utilization 0.8
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Run docker compose up -d and you're serving.

Two flags matter. --enforce-eager disables CUDA graph compilation, which avoids startup bugs. --gpu-memory-utilization 0.8 prevents the OOM errors multiple teams have reported at the default of 0.9.

vLLM Setup for DeepSeek R1 Full 671B

For the full model on eight GPUs:

vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.8 \
  --trust-remote-code

Testing Your DeepSeek R1 Endpoint

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [{"role": "user", "content": "Prove sqrt(2) is irrational."}],
    "max_tokens": 4096
  }'

The endpoint is OpenAI-compatible. Any code that works with the OpenAI SDK works here with a URL swap.
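To illustrate that compatibility, here's a minimal stdlib client for the same endpoint. The base URL is the only thing that differs from calling OpenAI; any OpenAI SDK works the same way after swapping it. The helper names here are mine, not part of vLLM:

```python
import json
import urllib.request

# vLLM serves the OpenAI chat-completions protocol on localhost:8000
# (per the Docker config above); no real API key is required.
BASE_URL = "http://localhost:8000/v1"
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

def build_request(prompt: str, max_tokens: int = 4096) -> urllib.request.Request:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def ask(prompt: str) -> str:
    """Send one chat completion and return the assistant's reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# ask("Prove sqrt(2) is irrational.")  # requires the server above to be running
```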

How to Deploy DeepSeek R1 with SGLang

SGLang is DeepSeek's own recommendation. It has lower time-to-first-token and better performance at low concurrency.

SGLang Docker Setup for DeepSeek R1

docker run --gpus all \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --tp 2 \
    --trust-remote-code \
    --chat-template deepseek-r1

DeepSeek R1 Performance: SGLang vs vLLM

In benchmarks on 8x H200, SGLang hits 6,311 tokens/second, outpacing vLLM on the same hardware. Mean time-to-first-token is 79ms against vLLM's 103ms.

But vLLM handles high concurrency better. If you're serving more than 32 concurrent users, vLLM scales more gracefully.

My take: start with vLLM because the ecosystem is larger and you'll find more help when things break. Switch to SGLang if you're optimizing latency on a single node.

DeepSeek R1 Self-Hosting Costs vs API Pricing

Let's do the math on whether self-hosting makes financial sense.

DeepSeek API Pricing

DeepSeek charges $0.55 per million input tokens (cache miss) and $2.19 per million output tokens. Processing 1 billion output tokens monthly costs $2,190.

Self-Hosted DeepSeek R1 Costs

Running 8x H100 at $2/GPU/hour costs about $11,520 per month.

Raw break-even is around 5 billion output tokens monthly. Below that, the API is cheaper on pure cost.
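That break-even figure falls straight out of the numbers above; a quick sketch (output tokens dominate reasoning workloads, so input costs are ignored here):

```python
# Back-of-envelope break-even: 8x H100 at $2/GPU/hour versus
# DeepSeek's $2.19 per million output tokens.
GPU_COUNT = 8
USD_PER_GPU_HOUR = 2.00
HOURS_PER_MONTH = 24 * 30
API_USD_PER_M_OUTPUT = 2.19

cluster_monthly = GPU_COUNT * USD_PER_GPU_HOUR * HOURS_PER_MONTH  # $11,520
break_even_billion = cluster_monthly / API_USD_PER_M_OUTPUT / 1000

print(f"Self-hosting: ${cluster_monthly:,.0f}/month")
print(f"Break-even: ~{break_even_billion:.1f}B output tokens/month")
```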

But cost isn't the whole picture. If your compliance team says no data to China, the API isn't an option at any price. Self-hosting becomes the cost of doing business.

For distilled models, the math is friendlier. The 32B on an RTX 4090 runs $250-540/month, breaking even against API costs at a few hundred million tokens.

The Managed Alternative

If engineering time matters more than GPU costs, PremAI sits between API and DIY. Your data stays in your VPC under Swiss jurisdiction. Zero data retention with cryptographic verification. SOC2, GDPR, HIPAA compliance. No GPU management on your side.

Common DeepSeek R1 Deployment Issues and Fixes

Here's what actually goes wrong when teams deploy R1.

CUDA out of memory errors: Set --gpu-memory-utilization 0.8 from the start. The default 0.9 fails on many configurations.

vLLM memory leaks: The V1 engine has a known leak where GPU memory grows across requests. Monitor and restart periodically.

AWQ quantization corrupts long context: Outputs get weird above 18K tokens with AWQ on vLLM 0.7.2. Keep inputs shorter or verify outputs at longer lengths.

8x H100 can't fit FP8: 640GB total VRAM isn't enough for the 685GB model. Use INT4/AWQ or go multi-node.

INT4 hurts reasoning quality: The irony is painful. You're running a reasoning model, and quantization artifacts hit exactly where it matters most. Use FP8 or higher if reasoning accuracy is critical.

R1-0528 is slower: The May 2025 update generates ~23K thinking tokens per query versus ~12K. Better reasoning, double the latency.

Ollama breaks tool calling: DeepSeek R1 via Ollama produces garbage with tools. Use vLLM or SGLang instead.
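The memory-leak workaround above ("monitor and restart periodically") is easy to automate. Here's a watchdog sketch; the container name deepseek-r1 and the 70 GB threshold are assumptions you'd adjust for your deployment:

```python
import subprocess

# Poll GPU memory via nvidia-smi and restart the serving container
# once usage crosses the threshold. Run this from cron or a loop.
THRESHOLD_MIB = 70_000          # assumption: tune to your GPU size
CONTAINER_NAME = "deepseek-r1"  # assumption: match your compose service

def parse_used_mib(nvidia_smi_csv: str) -> list[int]:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`."""
    return [int(line) for line in nvidia_smi_csv.strip().splitlines() if line.strip()]

def check_and_restart() -> None:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    if any(used > THRESHOLD_MIB for used in parse_used_mib(out)):
        subprocess.run(["docker", "restart", CONTAINER_NAME], check=True)
```

A blunt instrument, but restarting during a low-traffic window beats an OOM crash mid-request.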

DeepSeek R1 Commercial License and Usage Rights

The full 671B is MIT licensed. Commercial use is fine with attribution.

Qwen-based distills (1.5B, 7B, 14B, 32B) are Apache 2.0. No restrictions.

Llama-based distills (8B, 70B) inherit Meta's license with a 700 million MAU threshold. If you're building something that might get that big, stick with Qwen variants.

For most teams, the 32B Qwen distill offers the cleanest commercial path.

FAQ: DeepSeek R1 Self-Hosting Questions

Is DeepSeek R1 safe to use?

The data sovereignty concern is solved by self-hosting. Your prompts never leave your infrastructure. Whether there are backdoors in the weights is an open question that applies to any model you didn't train yourself.

Which DeepSeek R1 model should I start with?

The 32B Qwen distill. It fits on one RTX 4090 at INT4, has no commercial restrictions, and keeps most of R1's reasoning capability.

Should I use SGLang or vLLM for DeepSeek R1?

vLLM has the larger ecosystem and more community support. SGLang has better latency at low concurrency. Start with vLLM unless you have a specific latency requirement.

What if I need DeepSeek R1 in production but can't run GPU infrastructure?

That's what managed deployment solves. PremAI deploys in your VPC with Swiss jurisdiction. Your data stays yours without the infrastructure burden.

How do I fix DeepSeek R1 out of memory errors?

Set --gpu-memory-utilization 0.8 and start with --max-model-len 8192. Increase context length only after confirming baseline stability.

Can I use DeepSeek R1 commercially?

Yes. The full 671B is MIT licensed. Qwen-based distills are Apache 2.0. Llama-based distills have Meta's 700M MAU threshold.


Getting Started with DeepSeek R1 Self-Hosting

  1. Rent a single RTX 4090 for a day. Deploy the 32B distill with vLLM using the Docker config above.
  2. Test your actual use case. Does R1's reasoning capability help? Is the latency acceptable?
  3. If yes, decide your path: scale up your own infrastructure or use managed deployment.
  4. If going DIY, budget for ops time. Memory leaks, OOM debugging, and framework updates are ongoing work.

For infrastructure deep-dives, see the self-hosted LLM guide. For compliance requirements, GDPR-compliant AI chat covers the regulatory side.
