LLM Docker Deployment: Complete Production Guide (2026)
Getting an LLM running in a container takes maybe 20 minutes. Getting it to stay running under real traffic, survive restarts, and give your ops team something to monitor takes a lot longer.
This guide covers the full path. Base image selection, CUDA setup, vLLM vs TGI tradeoffs, single and multi-GPU configuration, a production docker-compose stack, health checks, CI/CD integration, and the failures that take down containers in production.
Prerequisites: Linux host with NVIDIA GPU, Docker Engine 23.0+, NVIDIA driver 525+.
Why Docker for LLM Deployment
Running LLM inference without containers means every environment difference becomes a debugging session. CUDA toolkit mismatches, Python dependency conflicts, and driver version gaps compound into hours of lost time.
Docker gives you environment parity across dev, staging, and production. You define dependencies once and they stay consistent.
For self-hosted inference specifically, containerization gives you things managed cloud APIs cannot: zero data leaving your infrastructure, predictable per-request compute costs, and full control over model versions. Teams trying to build production AI without heavy ML overhead need this foundation before anything else.
Step 1: CUDA and Base Image Setup
This is where most deployments break before writing a line of application code.
Your container's CUDA version must be compatible with your host's NVIDIA driver. The host driver does not need to match exactly, but it must support the CUDA version inside the container.
Compatibility reference:
| CUDA Version | Min Host Driver |
|---|---|
| 12.1 | 525.60+ |
| 12.4 | 550.54+ |
| 12.6 | 560.28+ |
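A deploy script can check this table before pulling an image. A minimal sketch, assuming GNU `sort -V` is available; the version numbers come from the table above, and on a real host you would swap in the live value from nvidia-smi:

```shell
# true if driver version $1 >= minimum $2 (version-aware comparison)
driver_ok() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# on a real host, feed in the live value instead:
#   host_driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
host_driver="550.54.15"   # example value

if driver_ok "$host_driver" "525.60"; then
  echo "driver $host_driver supports CUDA 12.1 images"
else
  echo "driver $host_driver is too old for CUDA 12.1 images"
fi
```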
Always verify GPU access before building anything:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
If nvidia-smi shows your GPU, you are ready. If it fails, the NVIDIA Container Toolkit is either not installed or misconfigured. Fix this first.
Install NVIDIA Container Toolkit on Ubuntu:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Base image choices by use case:
| Use Case | Base Image |
|---|---|
| vLLM deployment | vllm/vllm-openai:v0.6.0 |
| TGI deployment | ghcr.io/huggingface/text-generation-inference:2.3.0 |
| Custom Dockerfile | nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 |
| Lightweight inference | nvidia/cuda:12.1.0-runtime-ubuntu22.04 |
Pin image tags in production. Using :latest means a redeploy can pull breaking changes without warning.
Step 2: vLLM vs TGI vs Ollama
All three serve LLMs over an OpenAI-compatible API. They make different tradeoffs.
| | vLLM | TGI | Ollama |
|---|---|---|---|
| Throughput | Highest at high concurrency | Strong, best on long prompts | Lower, single-user optimized |
| Memory efficiency | PagedAttention, under 4% waste | Chunked prefill from v3 | Moderate |
| Observability | Basic metrics, improving | Prometheus + OpenTelemetry built in | Minimal |
| Model support | Broadest | HuggingFace ecosystem | Popular open-source models |
| Setup complexity | Low | Moderate | Lowest |
| Best for | Production API serving | Enterprise with monitoring stack | Local dev, prototyping |
Use vLLM when you need maximum throughput for concurrent users. Its PagedAttention algorithm treats KV cache like OS virtual memory pages, reducing waste from 60-80% down to under 4%. That translates directly to more concurrent requests per GPU.
Use TGI when you need production-grade observability out of the box or work with very long context prompts. TGI v3 is up to 13x faster than vLLM on long-prompt workloads because of chunked prefill and prefix caching.
Use Ollama when you are prototyping locally or need the simplest possible setup. Developers go from zero to running Llama 3.1 in under five minutes with Ollama. For production multi-user serving, it does not scale well.
Step 3: Single-GPU vLLM Deployment
docker run -d \
--name llm-server \
--gpus all \
--shm-size 16g \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
vllm/vllm-openai:v0.6.0 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dtype auto \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--port 8000
What each flag does:
--shm-size 16g is not optional. vLLM uses shared memory for tensor operations during concurrent inference. Without enough shared memory, containers crash under load and the error message is not obvious about why. Start at 16g for single GPU, 32g for multi-GPU.
--gpu-memory-utilization 0.90 allocates 90% of VRAM to vLLM. The remaining 10% is headroom for CUDA context overhead. Pushing above 0.95 causes OOM errors when concurrent requests spike.
-v ~/.cache/huggingface:/root/.cache/huggingface mounts the model cache as a volume. Without this, every container restart downloads weights from scratch. A 7B model is 14GB. A 70B model exceeds 100GB. This one flag saves you hours of downtime.
Verify the server loaded correctly:
# Health check
curl http://localhost:8000/health
# Confirm model is loaded
curl http://localhost:8000/v1/models
# Test inference
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 100
}'
Step 4: TGI Deployment
docker run -d \
--name tgi-server \
--gpus all \
--shm-size 1g \
-p 8080:80 \
-v ~/.cache/huggingface:/data \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
ghcr.io/huggingface/text-generation-inference:2.3.0 \
--model-id meta-llama/Llama-3.1-8B-Instruct \
--dtype bfloat16 \
--max-concurrent-requests 128 \
--port 80
TGI exposes Prometheus metrics at /metrics out of the box and can emit OpenTelemetry traces when pointed at a collector. If your team already runs a Grafana stack, TGI's instrumentation needs almost no additional setup compared to vLLM.
Test TGI is running:
curl http://localhost:8080/health
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tgi",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 100
}'
Step 5: Multi-GPU Configuration
Single-GPU inference runs out of headroom quickly: an A100 80GB fits a 70B model only with INT4 quantization, and at FP16 a 70B model needs two cards. Tensor parallelism splits the model across multiple GPUs.
docker run -d \
--name vllm-multi-gpu \
--gpus '"device=0,1,2,3"' \
--shm-size 32g \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:v0.6.0 \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 16384 \
--gpu-memory-utilization 0.90
--ipc=host is required for multi-GPU. It shares the host's IPC namespace with the container, which NCCL uses for GPU-to-GPU communication. It also means the container uses the host's /dev/shm directly, so --shm-size is effectively superseded (keeping the flag is harmless). Without --ipc=host, multi-GPU initialization fails or runs significantly slower.
--tensor-parallel-size must exactly match the number of GPUs you allocate. Mismatch causes a startup error.
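A pre-flight check in a deploy script catches the mismatch faster than waiting for the container to error out. A small sketch using the values from the command above:

```shell
gpus="0,1,2,3"   # the list passed to --gpus '"device=0,1,2,3"'
tp=4             # the value passed to --tensor-parallel-size

gpu_count=$(printf '%s\n' "$gpus" | tr ',' '\n' | wc -l)
if [ "$gpu_count" -eq "$tp" ]; then
  echo "OK: $gpu_count GPUs for tensor-parallel-size $tp"
else
  echo "mismatch: $gpu_count GPUs but tensor-parallel-size is $tp" >&2
fi
```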
GPU memory requirements by model size:
| Model Size | Precision | GPUs Needed (A100 80GB) |
|---|---|---|
| 7B | FP16 | 1 |
| 13B | FP16 | 1 |
| 70B | FP16 | 2 |
| 70B | INT4 | 1 |
| 405B | FP16 | 16 |
Quantization makes larger models practical on fewer GPUs. INT4 quantization cuts memory requirements by roughly 75% with modest accuracy loss on most tasks. PremAI's own tests showed 8x throughput improvement on commodity hardware worth $12,200 by optimizing the serving stack rather than buying more GPU.
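The table's figures fall straight out of parameter count times bytes per parameter. A quick back-of-envelope helper (weights only; the KV cache and CUDA context claim additional VRAM on top):

```shell
# weights-only VRAM in GB: params (billions) × bytes per parameter
# FP16 = 2 bytes, INT8 = 1, INT4 = 0.5
weights_gb() { awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f", p * b }'; }

echo "70B FP16: $(weights_gb 70 2) GB"     # the 140 GB figure used throughout
echo "70B INT4: $(weights_gb 70 0.5) GB"   # why one 80 GB card is enough
```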
Step 6: Production Docker Compose Stack
A single container gets you inference. A production deployment needs inference, reverse proxy, and monitoring coordinated together.
version: '3.8'

services:
  llm:
    image: vllm/vllm-openai:v0.6.0
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - hf_cache:/root/.cache/huggingface
    shm_size: '16gb'
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --dtype auto
      --gpu-memory-utilization 0.90
      --max-model-len 8192
      --port 8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
    ports:
      - "8000:8000"
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
    depends_on:
      llm:
        condition: service_healthy
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  hf_cache:
  prometheus_data:
  grafana_data:
start_period: 120s in the health check is important. vLLM takes 30-90 seconds to load model weights depending on model size and disk speed. Without enough start period, the check starts failing mid-load, and anything that acts on health status, such as compose's service_healthy condition or an orchestrator, treats the container as dead. It looks like a crash when it is actually an impatient health check.
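The same patience belongs in deploy scripts: do not cut traffic over until /health answers. A generic polling helper, sketched in shell (the 300-second budget and URL are illustrative):

```shell
# wait_for TIMEOUT CMD...: poll CMD until it succeeds or TIMEOUT seconds pass
wait_for() {
  timeout="$1"; shift
  start=$(date +%s)
  until "$@"; do
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      return 1
    fi
    sleep 2
  done
}

# in a deploy script, gate the traffic cutover on readiness:
# wait_for 300 curl -sf http://localhost:8000/health
```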
restart: unless-stopped recovers from crashes automatically but respects manual stops for maintenance.
Named volumes (hf_cache) instead of host path bind mounts keep the cache portable and avoid permission issues across environments.
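The compose file mounts ./nginx.conf without showing it. A minimal sketch that proxies to the llm service (ports and timeouts are illustrative; TLS is omitted):

```nginx
events {}

http {
    upstream llm_backend {
        server llm:8000;  # the compose service name resolves on the compose network
    }

    server {
        listen 80;

        location / {
            proxy_pass http://llm_backend;
            proxy_read_timeout 300s;  # long generations can stream for minutes
            proxy_buffering off;      # do not buffer token streams
        }
    }
}
```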
Step 7: Health Checks and Monitoring
Prometheus configuration (prometheus.yml):
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['llm:8000']
    metrics_path: '/metrics'
    scrape_interval: 15s
Key metrics to monitor:
| Metric | What it signals |
|---|---|
| vllm:gpu_cache_usage_perc | KV cache utilization. Consistently above 90% means you need more VRAM or lower max concurrency |
| vllm:num_requests_running | Active requests in flight |
| vllm:num_requests_waiting | Queued requests. A rising trend signals a capacity problem |
| vllm:time_to_first_token_seconds | Latency from request receipt to first output token |
| vllm:e2e_request_latency_seconds | Full request round-trip latency |
TGI metrics are available at the same /metrics path. TGI additionally exposes OpenTelemetry traces if you pass --otlp-endpoint at startup, useful for distributed tracing across services.
Step 8: Dockerfile for Custom Models
When you need to build a custom image (fine-tuned model, custom dependencies, security requirements):
# Multi-stage build to keep final image lean
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 AS builder
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /app
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3-pip \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*
RUN ln -s /usr/bin/python3.11 /usr/bin/python
COPY requirements.txt .
RUN python3.11 -m pip install --user --no-cache-dir -r requirements.txt
# Production stage
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /app
# Non-root user for security
RUN groupadd -r llmgroup && useradd -r -g llmgroup -m llmuser
RUN apt-get update && apt-get install -y \
python3.11 \
curl \
&& rm -rf /var/lib/apt/lists/*
RUN ln -s /usr/bin/python3.11 /usr/bin/python
# Copy installed packages from builder
COPY --from=builder /root/.local /home/llmuser/.local
ENV PATH=/home/llmuser/.local/bin:$PATH
# Set environment variables
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
ENV TRANSFORMERS_CACHE=/app/cache
ENV HF_HOME=/app/cache
COPY --chown=llmuser:llmgroup . .
RUN mkdir -p /app/cache && chown -R llmuser:llmgroup /app
USER llmuser
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
CMD ["python", "app.py"]
Multi-stage builds keep the final image smaller by leaving build tools in the builder stage. The production image only includes runtime dependencies.
Running as a non-root user (llmuser) is a security baseline. Do not run LLM containers as root in production.
Step 9: Secrets Management
Never put API tokens or credentials directly in docker-compose environment variables for production.
Using Docker secrets:
# docker-compose.yml
services:
  llm:
    image: vllm/vllm-openai:v0.6.0
    secrets:
      - hf_token
    environment:
      # the server itself reads HUGGING_FACE_HUB_TOKEN; an entrypoint must
      # export the secret file's contents into that variable
      - HUGGING_FACE_HUB_TOKEN_FILE=/run/secrets/hf_token

secrets:
  hf_token:
    file: ./secrets/hf_token.txt
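One caveat: vLLM and the HuggingFace libraries read HUGGING_FACE_HUB_TOKEN directly and do not, as far as I can tell, honor the *_FILE convention on their own, so something has to export the file's contents. A hypothetical wrapper entrypoint that bridges the two (mount it into the container and set it as the entrypoint; the filename is an assumption):

```shell
#!/bin/sh
# entrypoint-wrapper.sh (hypothetical name): bridge a Docker secret file to
# the plain environment variable the server actually reads
load_file_secret() {
  # print the contents of the secret file at $1, if it exists
  [ -n "$1" ] && [ -f "$1" ] && cat "$1"
}

if [ -n "${HUGGING_FACE_HUB_TOKEN_FILE:-}" ]; then
  export HUGGING_FACE_HUB_TOKEN="$(load_file_secret "$HUGGING_FACE_HUB_TOKEN_FILE")"
fi

exec "$@"
```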
Using a .env file (minimum for development):
# .env file (never commit this)
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
GRAFANA_PASSWORD=your_secure_password
Add .env to .gitignore. Use a secrets manager (AWS Secrets Manager, HashiCorp Vault, or similar) for production deployments.
Step 10: CI/CD Pipeline
Automating builds ensures you always deploy tested, consistent images.
GitHub Actions workflow:
# .github/workflows/deploy.yml
name: Build and Deploy LLM Container

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: false
          load: true  # without this, the built image is not available to `docker run` below
          tags: ghcr.io/${{ github.repository }}/llm-server:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Run smoke tests
        run: |
          docker run --rm \
            -e HF_TOKEN=${{ secrets.HF_TOKEN }} \
            ghcr.io/${{ github.repository }}/llm-server:${{ github.sha }} \
            python -m pytest tests/ -v

      - name: Push to registry
        if: github.ref == 'refs/heads/main' && success()
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/llm-server:latest
            ghcr.io/${{ github.repository }}/llm-server:${{ github.sha }}
cache-from: type=gha uses GitHub Actions cache to speed up subsequent builds. CUDA base images are large. Without layer caching, builds take 15+ minutes every time.
Common Production Failures and Fixes
Container OOM on startup
Reduce --gpu-memory-utilization to 0.80, or enable quantization:
--quantization gptq
# or for AWQ quantized models
--quantization awq
Model never loads (container healthy but no model)
docker logs llm-server --tail 100
Almost always a missing HuggingFace token, private model without access, or typo in the model ID.
Multi-GPU initialization hangs
docker run ... -e NCCL_DEBUG=WARN ...
NCCL errors surface the real cause. Usually missing --ipc=host or a driver version mismatch between GPUs.
Shared memory errors under concurrent load
Increase --shm-size to 32g. 16g handles sequential requests fine. Under concurrent load, vLLM needs more shared memory headroom.
Health check kills container during startup
Increase start_period to 300s for large models (70B+). vLLM loading a 70B model from disk can take 3-4 minutes.
Slow inference despite GPU being available
# Check GPU is actually being used
docker exec llm-server nvidia-smi
# Check CUDA_VISIBLE_DEVICES isn't accidentally empty
docker exec llm-server env | grep CUDA
An unintentionally empty CUDA_VISIBLE_DEVICES makes vLLM fall back to CPU inference. It runs but at a fraction of expected speed.
GPU Memory Quick Reference
| Model | Precision | VRAM | Recommended Config |
|---|---|---|---|
| Llama 3.1 8B | FP16 | 16 GB | 1x A100 40GB |
| Llama 3.1 8B | INT4 | 6 GB | 1x RTX 4090 |
| Llama 3.1 70B | FP16 | 140 GB | 2x A100 80GB |
| Llama 3.1 70B | INT4 | 40 GB | 1x A100 80GB |
| Mistral 7B | FP16 | 14 GB | 1x A100 40GB |
| Qwen2.5 72B | FP16 | 144 GB | 2x A100 80GB |
For enterprise inference on constrained hardware, INT4 quantization with AWQ or GPTQ gets most models running on a single consumer or mid-range enterprise GPU with acceptable quality loss.
What Comes After Deployment
A running container serving a base model is the start, not the end.
Most teams notice a gap between benchmark performance and real-world results on their own data. General-purpose models have not seen your internal documents, your product's terminology, or your edge cases. Fine-tuning on your domain data closes that gap, and the performance difference on your specific tasks is usually significant.
Before deploying any new model version to production, structured evaluation tells you whether the update is actually better for your use case. Skipping this leads to regressions that are hard to diagnose. Tying evaluation to real reliability metrics matters as much as the infrastructure itself.
On cost, self-hosted inference is already cheaper than API-based approaches at volume. There are additional strategies that push costs down further, including caching, batching, and model routing. This cost reduction guide covers several that apply to self-hosted setups.
For teams that want dataset management, fine-tuning, evaluation, and deployment handled in one place without wiring them together, Prem Studio covers the full AI development lifecycle.
FAQ
What CUDA version do I need for vLLM Docker? CUDA 12.1 minimum with NVIDIA driver 525+. Most setups in 2026 run CUDA 12.4 with driver 550+. Always check your driver version with nvidia-smi before pulling an image.
Can I run LLMs in Docker without a GPU? Yes. CPU inference works but is impractical for production. A 7B model on CPU runs at 1-3 tokens per second. Fine for local testing with small models, not for serving real users.
How much system RAM does the host need? Match your model's VRAM requirement in system RAM as a baseline. A 7B FP16 model needs 16GB VRAM and at least 16GB system RAM for the loading process. More is better when running other services on the same host.
vLLM or Ollama for Docker deployment? Ollama for local development and prototyping. vLLM for production. The throughput difference is real at multi-user scale. Ollama does not have the continuous batching and PagedAttention that makes vLLM handle concurrent requests efficiently.
Does this work with fine-tuned models? Yes. Mount your fine-tuned model directory as a volume and point --model to the local path instead of a HuggingFace model ID. The inference server treats a fine-tuned checkpoint identically to a base model.
How do I handle model updates without downtime? Use a load balancer in front of multiple containers. Bring up the new container, verify it passes health checks, then shift traffic and tear down the old one. Kubernetes rolling updates automate this pattern cleanly.
What is the difference between tensor parallelism and pipeline parallelism? Tensor parallelism splits individual model layers across GPUs, with each GPU handling part of every computation. This gives lower latency. Pipeline parallelism assigns different layers to different GPUs sequentially, which works better when GPUs are connected by slower links. vLLM defaults to tensor parallelism with --tensor-parallel-size.