LLM Docker Deployment: Complete Production Guide (2026)

Getting an LLM running in a container takes maybe 20 minutes. Getting it to stay running under real traffic, survive restarts, and give your ops team something to monitor takes a lot longer.

This guide covers the full path. Base image selection, CUDA setup, vLLM vs TGI tradeoffs, single and multi-GPU configuration, a production docker-compose stack, health checks, CI/CD integration, and the failures that take down containers in production.

Prerequisites: Linux host with NVIDIA GPU, Docker Engine 23.0+, NVIDIA driver 525+.

Why Docker for LLM Deployment

Running LLM inference without containers means every environment difference becomes a debugging session. CUDA toolkit mismatches, Python dependency conflicts, and driver version gaps compound into hours of lost time.

Docker gives you environment parity across dev, staging, and production. You define dependencies once and they stay consistent.

For self-hosted inference specifically, containerization gives you things managed cloud APIs cannot: zero data leaving your infrastructure, predictable per-request compute costs, and full control over model versions. Teams trying to build production AI without heavy ML overhead need this foundation before anything else.


Step 1: CUDA and Base Image Setup

This is where most deployments break before writing a line of application code.

Your container's CUDA version must be compatible with your host's NVIDIA driver. The host driver does not need to match exactly, but it must support the CUDA version inside the container.

Compatibility reference:

CUDA Version | Min Host Driver
12.1         | 525.60+
12.4         | 550.54+
12.6         | 560.28+
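Whether a given driver satisfies a CUDA release comes down to a version comparison. A minimal sketch using version-aware sort; the driver string is hard-coded here, but on a live host it would come from nvidia-smi --query-gpu=driver_version --format=csv,noheader:

```shell
# version_ge returns success (0) if version $1 >= version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

driver="550.54.15"          # example value; query nvidia-smi on a real host
min_for_cuda_12_4="550.54"

if version_ge "$driver" "$min_for_cuda_12_4"; then
  echo "driver OK for CUDA 12.4"
else
  echo "driver too old for CUDA 12.4"
fi
```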

Always verify GPU access before building anything:

docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

If nvidia-smi shows your GPU, you are ready. If it fails, the NVIDIA Container Toolkit is either not installed or misconfigured. Fix this first.

Install NVIDIA Container Toolkit on Ubuntu:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Base image choices by use case:

Use Case              | Base Image
vLLM deployment       | vllm/vllm-openai:v0.6.0
TGI deployment        | ghcr.io/huggingface/text-generation-inference:2.3.0
Custom Dockerfile     | nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
Lightweight inference | nvidia/cuda:12.1.0-runtime-ubuntu22.04

Pin image tags in production. Using :latest means a redeploy can pull breaking changes without warning.


Step 2: vLLM vs TGI vs Ollama

All three serve LLMs over an OpenAI-compatible API. They make different tradeoffs.

                  | vLLM                           | TGI                                 | Ollama
Throughput        | Highest at high concurrency    | Strong, best on long prompts        | Lower, single-user optimized
Memory efficiency | PagedAttention, under 4% waste | Chunked prefill from v3             | Moderate
Observability     | Basic metrics, improving       | Prometheus + OpenTelemetry built in | Minimal
Model support     | Broadest                       | HuggingFace ecosystem               | Popular open-source models
Setup complexity  | Low                            | Moderate                            | Lowest
Best for          | Production API serving         | Enterprise with monitoring stack    | Local dev, prototyping

Use vLLM when you need maximum throughput for concurrent users. Its PagedAttention algorithm treats KV cache like OS virtual memory pages, reducing waste from 60-80% down to under 4%. That translates directly to more concurrent requests per GPU.

Use TGI when you need production-grade observability out of the box or work with very long context prompts. TGI v3 is up to 13x faster than vLLM on long-prompt workloads because of chunked prefill and prefix caching.

Use Ollama when you are prototyping locally or need the simplest possible setup. Developers go from zero to running Llama 3.1 in under five minutes with Ollama. For production multi-user serving, it does not scale well.


Step 3: Single-GPU vLLM Deployment

docker run -d \
  --name llm-server \
  --gpus all \
  --shm-size 16g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:v0.6.0 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --port 8000

What each flag does:

--shm-size 16g is not optional. vLLM uses shared memory for tensor operations during concurrent inference. Without enough shared memory, containers crash under load, and the error message does not make the cause obvious. Start at 16g for a single GPU, 32g for multi-GPU.

--gpu-memory-utilization 0.90 allocates 90% of VRAM to vLLM. The remaining 10% is headroom for CUDA context overhead. Pushing above 0.95 causes OOM errors when concurrent requests spike.

-v ~/.cache/huggingface:/root/.cache/huggingface mounts the model cache as a volume. Without this, every container restart downloads weights from scratch. A 7B model is 14GB. A 70B model exceeds 100GB. This one flag saves you hours of downtime.

Verify the server loaded correctly:

# Health check
curl http://localhost:8000/health

# Confirm model is loaded
curl http://localhost:8000/v1/models

# Test inference
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100
  }'
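In scripts and CI it is better to poll the health endpoint than to sleep a fixed interval, since model loading time varies with model size and disk speed. A minimal polling helper; wait_for is my own name, not a vLLM utility:

```shell
# Poll a command until it succeeds or a timeout (in seconds) elapses.
wait_for() {
  cmd="$1"; timeout="${2:-120}"; waited=0
  until eval "$cmd" >/dev/null 2>&1; do
    sleep 2
    waited=$((waited + 2))
    [ "$waited" -ge "$timeout" ] && return 1
  done
  return 0
}

# Against a live server:
#   wait_for "curl -sf http://localhost:8000/health" 180 && echo "server ready"
```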

Step 4: TGI Deployment

docker run -d \
  --name tgi-server \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --max-concurrent-requests 128 \
  --port 80

TGI exposes Prometheus metrics at /metrics out of the box and can emit OpenTelemetry traces when started with --otlp-endpoint. If your team already runs a Grafana stack, TGI's instrumentation needs almost no additional setup compared to vLLM.

Test TGI is running:

curl http://localhost:8080/health

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100
  }'

Step 5: Multi-GPU Configuration

Single-GPU inference tops out around 70B-parameter models on an A100 80GB, and then only with quantization; at FP16, the weights alone for a 70B model run roughly 140 GB. Tensor parallelism splits the model across multiple GPUs.

docker run -d \
  --name vllm-multi-gpu \
  --gpus '"device=0,1,2,3"' \
  --shm-size 32g \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.6.0 \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90

--ipc=host is required for multi-GPU. It enables shared IPC namespace between the container and host, which NCCL uses for GPU-to-GPU communication. Without it, multi-GPU initialization fails or runs significantly slower.

--tensor-parallel-size must exactly match the number of GPUs you allocate. Mismatch causes a startup error.
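Because the mismatch only surfaces after you have waited through container startup, a cheap pre-launch guard is worth having. A sketch with the GPU count parameterized (check_tp_size is a hypothetical helper; on a real host the count would come from nvidia-smi -L | wc -l):

```shell
# Fail fast if tensor-parallel size does not match the GPUs being allocated.
check_tp_size() {
  gpu_count="$1"; tp_size="$2"
  if [ "$gpu_count" -ne "$tp_size" ]; then
    echo "error: tensor-parallel-size=$tp_size but $gpu_count GPUs allocated" >&2
    return 1
  fi
}

# On a live host: check_tp_size "$(nvidia-smi -L | wc -l)" 4 || exit 1
check_tp_size 4 4 && echo "tensor-parallel config OK"
```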

GPU memory requirements by model size:

Model Size | Precision | GPUs Needed (A100 80GB)
7B         | FP16      | 1
13B        | FP16      | 1
70B        | FP16      | 2
70B        | INT4      | 1
405B       | FP16      | 8

Quantization makes larger models practical on fewer GPUs. INT4 quantization cuts memory requirements by roughly 75% with modest accuracy loss on most tasks. PremAI's own tests showed an 8x throughput improvement on commodity hardware worth $12,200 by optimizing the serving stack rather than buying more GPUs.
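The table's figures follow from simple arithmetic: weight memory is parameter count times bytes per parameter (FP16 = 2 bytes, INT8 = 1, INT4 = 0.5), before KV cache and activation overhead. A quick back-of-envelope estimator:

```shell
# Rough weight-memory estimate in GB: params (in billions) x bytes per parameter.
# KV cache, activations, and CUDA context add on top of this.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b }'
}

weights_gb 70 2     # 70B at FP16 -> 140
weights_gb 70 0.5   # 70B at INT4 -> 35
weights_gb 8 2      # 8B at FP16  -> 16
```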


Step 6: Production Docker Compose Stack

A single container gets you inference. A production deployment needs inference, reverse proxy, and monitoring coordinated together.

version: '3.8'

services:
  llm:
    image: vllm/vllm-openai:v0.6.0
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - hf_cache:/root/.cache/huggingface
    shm_size: '16gb'
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --dtype auto
      --gpu-memory-utilization 0.90
      --max-model-len 8192
      --port 8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
    ports:
      - "8000:8000"
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
    depends_on:
      llm:
        condition: service_healthy
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  hf_cache:
  prometheus_data:
  grafana_data:

start_period: 120s in the health check is important. vLLM takes 30-90 seconds to load model weights depending on model size and disk speed. Without enough start period, Docker kills the container before the model finishes loading. It looks like a crash when it is actually an impatient health check.

restart: unless-stopped recovers from crashes automatically but respects manual stops for maintenance.

Named volumes (hf_cache) instead of host path bind mounts keep the cache portable and avoid permission issues across environments.
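The compose file mounts ./nginx.conf but the guide has not shown one. A minimal sketch that proxies to the llm service; no TLS shown, and the upstream name and timeouts are assumptions to adjust for your setup:

```nginx
events {}

http {
  upstream llm_backend {
    server llm:8000;             # compose service name resolves on the compose network
  }

  server {
    listen 80;

    location / {
      proxy_pass http://llm_backend;
      proxy_read_timeout 300s;   # long generations can stream for minutes
      proxy_buffering off;       # do not buffer token streams (SSE)
    }
  }
}
```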


Step 7: Health Checks and Monitoring

Prometheus configuration (prometheus.yml):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['llm:8000']
    metrics_path: '/metrics'
    scrape_interval: 15s

Key metrics to monitor:

Metric                           | What it signals
vllm:gpu_cache_usage_perc        | KV cache utilization. Consistently above 90% means you need more VRAM or lower max concurrency
vllm:num_requests_running        | Active requests in flight
vllm:num_requests_waiting        | Queued requests. Rising trend signals a capacity problem
vllm:time_to_first_token_seconds | Latency from request receipt to first output token
vllm:e2e_request_latency_seconds | Full request round-trip latency
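Dashboards show these metrics; alerts act on them. A hedged starting point for Prometheus alerting rules, assuming gpu_cache_usage_perc is reported as a 0-1 fraction (verify against your vLLM version) and with thresholds you should tune to your own traffic:

```yaml
groups:
  - name: llm-capacity
    rules:
      - alert: LLMRequestBacklog
        expr: vllm:num_requests_waiting > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM request queue is backing up"
      - alert: LLMKVCacheSaturated
        expr: vllm:gpu_cache_usage_perc > 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "KV cache above 90% for 10 minutes"
```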

TGI metrics are available at the same /metrics path. TGI additionally exposes OpenTelemetry traces if you pass --otlp-endpoint at startup, useful for distributed tracing across services.


Step 8: Dockerfile for Custom Models

When you need to build a custom image (fine-tuned model, custom dependencies, security requirements):

# Multi-stage build to keep final image lean
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 AS builder

ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /app

RUN apt-get update && apt-get install -y \
    python3.10 \
    python3.10-dev \
    python3-pip \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

RUN ln -s /usr/bin/python3.10 /usr/bin/python

COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Production stage
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /app

# Non-root user for security
RUN groupadd -r llmgroup && useradd -r -g llmgroup -m llmuser

RUN apt-get update && apt-get install -y \
    python3.10 \
    curl \
    && rm -rf /var/lib/apt/lists/*

RUN ln -s /usr/bin/python3.10 /usr/bin/python

# Copy installed packages from builder
COPY --from=builder --chown=llmuser:llmgroup /root/.local /home/llmuser/.local
ENV PATH=/home/llmuser/.local/bin:$PATH

# Set environment variables
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
ENV TRANSFORMERS_CACHE=/app/cache
ENV HF_HOME=/app/cache

COPY --chown=llmuser:llmgroup . .
RUN mkdir -p /app/cache && chown -R llmuser:llmgroup /app

USER llmuser

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

CMD ["python", "app.py"]

Multi-stage builds keep the final image smaller by leaving build tools in the builder stage. The production image only includes runtime dependencies.

Running as a non-root user (llmuser) is a security baseline. Do not run LLM containers as root in production.


Step 9: Secrets Management

Never put API tokens or credentials directly in docker-compose environment variables for production.

Using Docker secrets:

# docker-compose.yml
services:
  llm:
    image: vllm/vllm-openai:v0.6.0
    secrets:
      - hf_token
    environment:
      - HUGGING_FACE_HUB_TOKEN_FILE=/run/secrets/hf_token

secrets:
  hf_token:
    file: ./secrets/hf_token.txt
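One caveat: the stock vLLM image may not read *_FILE variables on its own (huggingface_hub looks for HUGGING_FACE_HUB_TOKEN directly), so verify this convention works with your image version. If it does not, a small entrypoint wrapper can export the file's contents before the server starts. A sketch; load_hf_token is a hypothetical helper name:

```shell
# Export the HuggingFace token from a mounted Docker secret file, if present.
load_hf_token() {
  secret_file="${1:-/run/secrets/hf_token}"
  if [ -f "$secret_file" ]; then
    HUGGING_FACE_HUB_TOKEN="$(cat "$secret_file")"
    export HUGGING_FACE_HUB_TOKEN
  fi
}

# In an entrypoint script, call it before exec-ing the server, e.g.:
#   load_hf_token && exec python -m vllm.entrypoints.openai.api_server "$@"
```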

Using a .env file (minimum for development):

# .env file (never commit this)
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
GRAFANA_PASSWORD=your_secure_password

Add .env to .gitignore. Use a secrets manager (AWS Secrets Manager, HashiCorp Vault, or similar) for production deployments.


Step 10: CI/CD Pipeline

Automating builds ensures you always deploy tested, consistent images.

GitHub Actions workflow:

# .github/workflows/deploy.yml
name: Build and Deploy LLM Container

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: false
          tags: ghcr.io/${{ github.repository }}/llm-server:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Run smoke tests
        run: |
          docker run --rm \
            -e HF_TOKEN=${{ secrets.HF_TOKEN }} \
            ghcr.io/${{ github.repository }}/llm-server:${{ github.sha }} \
            python -m pytest tests/ -v

      - name: Push to registry
        if: github.ref == 'refs/heads/main' && success()
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/llm-server:latest
            ghcr.io/${{ github.repository }}/llm-server:${{ github.sha }}

cache-from: type=gha uses GitHub Actions cache to speed up subsequent builds. CUDA base images are large. Without layer caching, builds take 15+ minutes every time.


Common Production Failures and Fixes

Container OOM on startup

Reduce --gpu-memory-utilization to 0.80, or enable quantization:

--quantization gptq
# or for AWQ quantized models
--quantization awq

Model never loads (container healthy but no model)

docker logs llm-server --tail 100

Almost always a missing HuggingFace token, private model without access, or typo in the model ID.

Multi-GPU initialization hangs

docker run ... -e NCCL_DEBUG=WARN ...

NCCL errors surface the real cause. Usually missing --ipc=host or a driver version mismatch between GPUs.

Shared memory errors under concurrent load

Increase --shm-size to 32g. 16g handles sequential requests fine. Under concurrent load, vLLM needs more shared memory headroom.

Health check kills container during startup

Increase start_period to 300s for large models (70B+). vLLM loading a 70B model from disk can take 3-4 minutes.

Slow inference despite GPU being available

# Check GPU is actually being used
docker exec llm-server nvidia-smi

# Check CUDA_VISIBLE_DEVICES isn't accidentally empty
docker exec llm-server env | grep CUDA

An unintentionally empty CUDA_VISIBLE_DEVICES makes vLLM fall back to CPU inference. It runs but at a fraction of expected speed.


GPU Memory Quick Reference

Model         | Precision | VRAM   | Recommended Config
Llama 3.1 8B  | FP16      | 16 GB  | 1x A100 40GB
Llama 3.1 8B  | INT4      | 6 GB   | 1x RTX 4090
Llama 3.1 70B | FP16      | 140 GB | 2x A100 80GB
Llama 3.1 70B | INT4      | 40 GB  | 1x A100 80GB
Mistral 7B    | FP16      | 14 GB  | 1x A100 40GB
Qwen2.5 72B   | FP16      | 144 GB | 2x A100 80GB

For enterprise inference on constrained hardware, INT4 quantization with AWQ or GPTQ gets most models running on a single consumer or mid-range enterprise GPU with acceptable quality loss.


What Comes After Deployment

A running container serving a base model is the start, not the end.

Most teams notice a gap between benchmark performance and real-world results on their own data. General-purpose models have not seen your internal documents, your product's terminology, or your edge cases. Fine-tuning on your domain data closes that gap, and the performance difference on your specific tasks is usually significant.

Before deploying any new model version to production, structured evaluation tells you whether the update is actually better for your use case. Skipping this leads to regressions that are hard to diagnose. Tying evaluation to real reliability metrics matters as much as the infrastructure itself.

On cost, self-hosted inference is already cheaper than API-based approaches at volume. There are additional strategies that push costs down further, including caching, batching, and model routing. This cost reduction guide covers several that apply to self-hosted setups.

For teams that want dataset management, fine-tuning, evaluation, and deployment handled in one place without wiring them together, Prem Studio covers the full AI development lifecycle.


FAQ

What CUDA version do I need for vLLM Docker? CUDA 12.1 minimum with NVIDIA driver 525+. Most setups in 2026 run CUDA 12.4 with driver 550+. Always check your driver version with nvidia-smi before pulling an image.

Can I run LLMs in Docker without a GPU? Yes. CPU inference works but is impractical for production. A 7B model on CPU runs at 1-3 tokens per second. Fine for local testing with small models, not for serving real users.

How much system RAM does the host need? Match your model's VRAM requirement in system RAM as a baseline. A 7B FP16 model needs 16GB VRAM and at least 16GB system RAM for the loading process. More is better when running other services on the same host.

vLLM or Ollama for Docker deployment? Ollama for local development and prototyping. vLLM for production. The throughput difference is real at multi-user scale. Ollama does not have the continuous batching and PagedAttention that makes vLLM handle concurrent requests efficiently.

Does this work with fine-tuned models? Yes. Mount your fine-tuned model directory as a volume and point --model to the local path instead of a HuggingFace model ID. The inference server treats a fine-tuned checkpoint identically to a base model.

How do I handle model updates without downtime? Use a load balancer in front of multiple containers. Bring up the new container, verify it passes health checks, then shift traffic and tear down the old one. Kubernetes rolling updates automate this pattern cleanly.

What is the difference between tensor parallelism and pipeline parallelism? Tensor parallelism splits individual model layers across GPUs, with each GPU handling part of every computation. This gives lower latency. Pipeline parallelism assigns different layers to different GPUs sequentially, which works better when GPUs are connected by slower links. vLLM defaults to tensor parallelism with --tensor-parallel-size.
