Multi-GPU LLM Inference: TP vs PP vs EP Parallelism Guide (2026)
Complete guide to multi-GPU LLM inference. Learn tensor parallelism, pipeline parallelism, expert parallelism with vLLM. Includes benchmarks, memory calculations, and decision framework.
Most teams reach for multi-GPU too early. A single H100 runs Llama 70B at INT4 with room for 32K context. A single A100 handles Mistral 7B with headroom for batching. Adding GPUs before you need them just adds complexity, failure modes, and wasted compute on communication overhead.
But when your model genuinely doesn't fit, or when one GPU can't sustain production throughput, you need to split the workload. And the choice you make here, tensor parallelism, pipeline parallelism, expert parallelism, or some combination, determines whether your deployment runs efficiently or burns 30-50% of your compute on synchronization.
This guide is for engineers who've hit the limits of single-GPU inference and need to make real decisions about scaling up. We'll cover when multi-GPU actually makes sense, which parallelism strategy fits which situation, the problems you'll hit that nobody warns you about, and the specific configurations that work in production.
First Question: Do You Actually Need Multiple GPUs?
Before committing to distributed inference, verify you've exhausted single-GPU options.
Check 1: Have you tried quantization?
Llama 70B in BF16 needs 140GB. In INT4, it needs 35GB. That's the difference between "impossible on one GPU" and "runs fine on a single H100."
Modern quantization (AWQ, GPTQ, GGUF Q4) typically loses 1-3% on benchmarks. For most production workloads, that's acceptable. Try INT4 or INT8 before adding hardware.
Check 2: Is your throughput actually bottlenecked?
One H100 running Llama 70B INT4 can push 40-60 tokens/second per request with batching. That's enough for many production workloads. Run actual load tests before assuming you need more compute.
Check 3: Could a smaller model work?
Llama 8B handles most simple tasks. Specialized 34B models often match 70B general models on domain tasks. Qwen 32B fits on one 24GB GPU with INT4. Don't use 70B because it sounds impressive; use it because your task requires it.
The honest thresholds:
| Situation | Multi-GPU Needed? |
|---|---|
| Model under 35B, any quantization | No |
| 70B at INT4, moderate throughput | Probably no |
| 70B at FP16/BF16 | Yes |
| 70B at any precision, high concurrent load | Likely yes |
| 405B at any precision | Definitely yes |
| DeepSeek-V3, Llama 4 Maverick | Yes |
If you're past these thresholds, read on. If not, stop here and optimize your single-GPU setup first.
The Memory Math That Actually Matters
Understanding memory requirements prevents the "why did it OOM on startup?" debugging sessions.
Model Weights
Simple formula: parameters × bytes per precision.
| Model | BF16 (2 bytes/param) | FP8 (1 byte/param) | INT4 (0.5 bytes/param) |
|---|---|---|---|
| Llama 8B | 16 GB | 8 GB | 4 GB |
| Llama 70B | 140 GB | 70 GB | 35 GB |
| Llama 405B | 810 GB | 405 GB | 203 GB |
| Mixtral 8x7B | 94 GB | 47 GB | 24 GB |
| DeepSeek-V3 (671B) | 1,342 GB | 671 GB | 336 GB |
| Qwen 72B | 144 GB | 72 GB | 36 GB |
With tensor parallelism, divide by GPU count. Llama 70B BF16 on TP=4 means 35GB per GPU for weights.
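The weights math above is simple enough to script. A minimal sketch (function names are illustrative, not from any library):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Model weight memory in GB: parameters x bytes per parameter.
    1B params at 1 byte/param is 1 GB."""
    return params_billion * bytes_per_param

def weights_per_gpu_gb(params_billion: float, bytes_per_param: float, tp: int) -> float:
    """Tensor parallelism splits the weights evenly across GPUs."""
    return weight_memory_gb(params_billion, bytes_per_param) / tp

# Llama 70B in BF16 (2 bytes/param) on TP=4
print(weights_per_gpu_gb(70, 2.0, 4))  # -> 35.0 GB per GPU
```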
KV Cache: The Hidden Memory Killer
Model weights are just the start. KV cache stores attention context and grows with sequence length and batch size.
Rough formula:
KV cache = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element
Real example: Llama 70B, 32K context, batch of 8, FP16:
2 × 80 layers × 8 kv_heads × 128 dim × 32,768 seq × 8 batch × 2 bytes = ~86 GB
That's 86GB on top of model weights. A single FP16 request at 32K context needs roughly 11GB of KV cache (about 5GB with an FP8 KV cache). Scale that with concurrency.
Rule of thumb: Reserve 40-50% of VRAM beyond model weights for KV cache and runtime overhead. If your weights take 35GB per GPU, you need 60GB+ GPUs, not 40GB.
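The formula is worth encoding so you can plug in your own context length and batch size. A minimal sketch using Llama 70B's published shape (80 layers, 8 KV heads via GQA, head_dim 128):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int) -> int:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x seq x batch x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Llama 70B, 32K context, batch of 8, FP16 (2 bytes)
gb = kv_cache_bytes(80, 8, 128, 32_768, 8, 2) / 1e9
print(f"{gb:.1f} GB")  # ~85.9 GB; an FP8 KV cache (1 byte) halves this
```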
Per-GPU Memory with Different Configs
For Llama 70B BF16 (140GB weights total):
| Config | Weights/GPU | Free for KV Cache | Verdict |
|---|---|---|---|
| 2× A100 80GB | 70 GB | 10 GB each | Tight, short context only |
| 4× A100 80GB | 35 GB | 45 GB each | Comfortable |
| 8× H100 80GB | 17.5 GB | 62.5 GB each | Generous, long context OK |
| 2× H200 141GB | 70 GB | 71 GB each | Generous |
More GPUs = more KV cache headroom = longer context or higher batch sizes.
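The table's "free for KV cache" column is just VRAM minus the sharded weights; a one-line helper makes it easy to check other configs (the function name is illustrative):

```python
def free_for_kv_gb(vram_per_gpu_gb: float, total_weights_gb: float, num_gpus: int) -> float:
    """VRAM left per GPU after TP shards the weights evenly across GPUs."""
    return vram_per_gpu_gb - total_weights_gb / num_gpus

# Llama 70B BF16 (140 GB weights) on 4x A100 80GB
print(free_for_kv_gb(80, 140, 4))  # -> 45.0 GB free per GPU
```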
The Three Parallelism Strategies (And When Each Makes Sense)
Tensor Parallelism: Split Every Layer Across GPUs
Tensor parallelism divides each layer's weight matrices across GPUs. Every GPU processes the same input simultaneously, computing different slices of each layer.
How it works in practice:
A transformer layer has attention and MLP blocks. With TP=4, each GPU holds 1/4 of the attention weights and 1/4 of the MLP weights. All four GPUs process the same token, compute their quarter of the result, then synchronize via all-reduce before the next layer.
The critical dependency: interconnect speed.
Every transformer layer requires two all-reduce operations to sync results. Llama 70B has 80 layers. That's 160 synchronization points per forward pass.
| Interconnect | Bandwidth | TP Efficiency |
|---|---|---|
| NVLink 4.0 (H100) | 900 GB/s | Excellent, TP=8 works well |
| NVLink 3.0 (A100) | 600 GB/s | Good, TP=8 acceptable |
| PCIe 5.0 | 128 GB/s | Marginal, TP=2 max |
| PCIe 4.0 | 64 GB/s | Poor, avoid TP |
On PCIe systems, communication can consume 40-50% of inference time at TP=4. NVLink is effectively mandatory for tensor parallelism beyond TP=2.
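To build intuition for why the interconnect dominates, here is a deliberately crude model: two all-reduces per layer, each moving roughly one hidden-state vector per token (hidden size 8192 for Llama 70B). It ignores latency, ring-reduce factors, and compute overlap, so treat it as intuition, not a benchmark:

```python
def allreduce_sync_points(layers: int, allreduces_per_layer: int = 2) -> int:
    """One all-reduce after attention and one after the MLP, per layer."""
    return layers * allreduces_per_layer

def comm_time_per_token_us(layers: int, hidden_size: int,
                           bytes_per_elem: int, bandwidth_gb_s: float) -> float:
    """Crude lower bound: total synced bytes per token over raw link bandwidth."""
    bytes_moved = allreduce_sync_points(layers) * hidden_size * bytes_per_elem
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e6  # seconds -> microseconds

print(allreduce_sync_points(80))                 # Llama 70B: 160 sync points
print(comm_time_per_token_us(80, 8192, 2, 900))  # NVLink 4.0: ~2.9 us/token
print(comm_time_per_token_us(80, 8192, 2, 64))   # PCIe 4.0: ~41 us/token
```

Even this lower bound shows PCIe paying ~14× more wall-clock time per token on communication alone.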
Scaling efficiency drops fast:
| TP Size | Theoretical Speedup | Actual Speedup | Efficiency |
|---|---|---|---|
| 2 | 2.0× | 1.7-1.9× | 85-95% |
| 4 | 4.0× | 2.8-3.4× | 70-85% |
| 8 | 8.0× | 4.5-6.0× | 56-75% |
At TP=8, you're losing 25-44% of potential speedup to communication. This is normal, not a misconfiguration.
When to use tensor parallelism:
- You have NVLink (DGX, HGX, or similar)
- Latency matters more than throughput
- Low to moderate concurrency (under ~200 concurrent requests)
- Single-node deployment
vLLM command:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4
Pipeline Parallelism: Stack Layer Blocks Sequentially
Pipeline parallelism assigns consecutive chunks of layers to different GPUs. Data flows through like an assembly line.
How it works:
With PP=4 on an 80-layer model, GPU 0 runs layers 0-19, GPU 1 runs 20-39, GPU 2 runs 40-59, GPU 3 runs 60-79. A request enters GPU 0, gets processed through its layers, then passes to GPU 1, and so on.
The bubble problem:
For a single request, only one GPU works at a time. With PP=4, each GPU sits idle 75% of the time waiting for its turn.
vLLM mitigates this with continuous batching: while GPU 0 processes request B, GPU 1 processes request A's next stage. But this only helps with concurrent traffic. Single-request latency is always worse with PP than with TP.
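The idle fraction follows the standard GPipe-style bubble approximation: with p pipeline stages and m concurrent microbatches in flight, the bubble is (p − 1) / (p − 1 + m). A quick sketch:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """GPipe-style pipeline bubble: idle fraction = (p - 1) / (p - 1 + m).
    One request (m=1) with PP=4 leaves each GPU idle 75% of the time."""
    return (stages - 1) / (stages - 1 + microbatches)

print(bubble_fraction(4, 1))   # single request: 0.75 idle
print(bubble_fraction(4, 13))  # heavy batching shrinks the bubble to ~0.19
```

This is why PP numbers look terrible at low concurrency and respectable under sustained load.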
When pipeline parallelism shines:
- PCIe-only systems (no NVLink)
- High throughput workloads with many concurrent requests
- Multi-node deployments (PP across nodes, TP within nodes)
- Memory-constrained setups where you need maximum weight distribution
Communication overhead is minimal. PP only requires point-to-point transfers between adjacent GPUs, not all-to-all synchronization. This is why PP works on PCIe while TP doesn't.
vLLM command:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--pipeline-parallel-size 4
Expert Parallelism: For MoE Models Only
Expert parallelism applies only to Mixture-of-Experts architectures: Mixtral, DeepSeek-V2/V3, Llama 4, Qwen MoE variants, and similar models.
Why MoE needs special handling:
MoE models have many "expert" sub-networks. Only a few experts activate per token, but all expert weights must be in memory. DeepSeek-V3 has 256 experts totaling 671B parameters, but only activates ~37B per token.
Standard TP shards all experts across all GPUs. Every GPU has a slice of every expert.
Expert parallelism distributes complete experts across GPUs. Each GPU holds different experts entirely. Tokens route to whichever GPU has the relevant expert.
The KV cache problem with MoE:
DeepSeek uses Multi-Latent Attention (MLA), which effectively has only one KV head. Standard tensor parallelism can't shard the KV cache across GPUs: there's only one head to split.
Result: With TP=8 on DeepSeek, the KV cache is duplicated 8 times. Massive memory waste.
Data parallelism with expert parallelism (DP+EP) solves this. DP partitions the KV cache by request, while EP distributes experts. Each GPU holds 1/8 of the KV cache and 1/8 of the experts.
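Conceptually, the router picks the top-k experts per token, and with EP each chosen expert ID maps to the GPU that physically holds that expert. A pure-Python sketch of top-k routing (illustrative only, not vLLM's implementation):

```python
def route_tokens(router_logits, top_k=2):
    """Pick the top-k highest-scoring experts for each token.
    Under expert parallelism, each expert ID maps to the GPU holding it."""
    return [sorted(range(len(scores)), key=lambda e: -scores[e])[:top_k]
            for scores in router_logits]

logits = [[0.1, 2.0, 0.3, 1.5],   # token 0 -> experts 1 and 3
          [1.9, 0.2, 2.5, 0.0]]   # token 1 -> experts 2 and 0
print(route_tokens(logits))  # -> [[1, 3], [2, 0]]
```

The all-to-all token shuffle implied by this routing is the communication cost EP pays in exchange for not replicating every expert on every GPU.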
Configuration matters enormously for MoE:
| Config | KV Cache | Experts | Best For |
|---|---|---|---|
| TP=8 | Duplicated 8× | Sharded | Low concurrency, latency-focused |
| DP=8 + EP | Partitioned | Distributed | High concurrency, throughput-focused |
vLLM commands:
# TP with expert parallelism (low concurrency)
vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--enable-expert-parallel
# DP with expert parallelism (high concurrency)
vllm serve deepseek-ai/DeepSeek-V3 \
--data-parallel-size 8 \
--enable-expert-parallel \
--disable-nccl-for-dp-synchronization
The Problems Nobody Warns You About
These issues cause real production failures. I've seen teams spend weeks debugging problems that are fundamental to distributed inference, not configuration mistakes.
Different Parallelism Configs Produce Different Outputs
Floating-point arithmetic isn't associative. Changing how computations are distributed changes rounding behavior.
TP=4 and TP=8 on the same model with the same prompt at temperature=0 produce different outputs: not just different token probabilities, but different actual tokens.
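You can demonstrate the root cause in two lines. Different TP sizes change the order in which partial sums are reduced, and floating-point addition is order-sensitive:

```python
# Floating-point addition is not associative: changing reduction order
# (as different TP sizes do) changes the rounded result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed by -1e16 before a cancels it
```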
This matters for:
- Regression testing across config changes
- A/B comparisons between deployments
- Reproducibility requirements
- Debugging ("it worked yesterday, we just changed TP from 4 to 8")
There's no fix. Accept that parallelism configuration is part of your model's identity. Don't change it without re-validating outputs.
PCIe Multi-GPU Is Worse Than You Think
Teams buy 4× RTX 4090s expecting 4× performance. They get 1.5-2×.
PCIe 4.0 delivers 64 GB/s. NVLink 4.0 delivers 900 GB/s. That's a 14× difference. Tensor parallelism synchronizes after every layer. At TP=4 on PCIe, you spend more time waiting for data transfers than computing.
If you don't have NVLink:
- Use pipeline parallelism instead of tensor parallelism
- Or use data parallelism (multiple independent model replicas)
- Or accept that TP=2 is your practical maximum
- Or buy NVLink hardware
Memory Fragmentation Kills Long Contexts
You calculate memory requirements, everything looks fine, then OOM hits at runtime.
KV cache allocates dynamically as sequences grow. With variable-length requests and continuous batching, memory fragments. Your GPU has 80GB total, 60GB free in theory, but no contiguous block larger than 8GB.
Mitigations:
- Use --kv-cache-dtype fp8 to halve KV cache memory
- Set --max-model-len to your actual needs, not the maximum possible
- Reduce --max-num-seqs to limit concurrent sequences
- Enable --enable-prefix-caching if requests share prefixes
Driver Version Mismatches Cause Silent Failures
All GPUs need identical driver versions. Mixed drivers cause hangs, incorrect outputs, or crashes that don't mention drivers in the error message.
Multi-node makes this worse. Node A has driver 535.104, Node B has 535.129. Everything appears fine until inference silently produces garbage.
Prevention:
- Use containerized deployments with fixed driver versions
- Pin driver versions in your infrastructure automation
- Verify driver versions across all nodes before deployment:
nvidia-smi --query-gpu=driver_version --format=csv
NCCL Hangs Under Load
NCCL (NVIDIA's communication library) can hang under high load without clear error messages. Your process just stops responding.
Common causes:
- Network issues between nodes
- Memory pressure causing allocation failures
- Timeout settings too short for large transfers
- Firewall rules blocking NCCL ports
Debug approach:
# Enable NCCL debugging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# Increase timeout
export NCCL_TIMEOUT=3600
# Run and check logs for actual failure point
Real Hardware Configurations
Single-Node Setups
| Config | Total VRAM | NVLink | Models It Handles | Monthly Cloud Cost |
|---|---|---|---|---|
| 2× A100 40GB | 80 GB | Yes | 70B FP8 (tight) | $2,600 |
| 2× A100 80GB | 160 GB | Yes | 70B BF16 | $3,600 |
| 4× A100 80GB | 320 GB | Yes | 405B FP8 | $7,200 |
| 8× H100 80GB | 640 GB | Yes | 405B BF16, DeepSeek-V3 FP8 | $23,000 |
| 8× H200 141GB | 1,128 GB | Yes | DeepSeek-V3 BF16, everything | $35,000 |
Multi-Node: When You Actually Need It
Multi-node adds complexity without adding efficiency. Each node needs identical setup. Network between nodes becomes a bottleneck. Failure modes multiply.
Only go multi-node when:
- Model physically doesn't fit on one node (DeepSeek-V3 BF16 needs ~1.3TB)
- You need more throughput than one node provides and can't use data parallelism
Standard pattern: TP within nodes (uses fast NVLink), PP across nodes (tolerates slower network).
# Two nodes, 8 GPUs each
# Node 0
vllm serve model-name \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--nnodes 2 \
--node-rank 0 \
--master-addr 10.0.0.1
# Node 1
vllm serve model-name \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--nnodes 2 \
--node-rank 1 \
--master-addr 10.0.0.1
Network requirements: InfiniBand strongly preferred. 100Gbps Ethernet minimum. 10Gbps will bottleneck PP transfers.
Complete vLLM Configuration Reference
Basic Parallelism
# Tensor parallel only (most common single-node setup)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4
# Pipeline parallel only (PCIe systems or throughput-focused)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--pipeline-parallel-size 4
# Hybrid (multi-node)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2
MoE Models
# DeepSeek-V3: TP+EP for latency
vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--enable-expert-parallel
# DeepSeek-V3: DP+EP for throughput
vllm serve deepseek-ai/DeepSeek-V3 \
--data-parallel-size 8 \
--enable-expert-parallel \
--disable-nccl-for-dp-synchronization
# Mixtral
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
--tensor-parallel-size 2 \
--enable-expert-parallel
Memory Optimization
# FP8 KV cache (halves KV memory, minimal quality loss)
--kv-cache-dtype fp8
# Limit context length (saves KV memory for batching)
--max-model-len 32768
# Limit concurrent sequences (prevents OOM under load)
--max-num-seqs 256
# Chunked prefill (prevents long prefills from blocking)
--enable-chunked-prefill
# Prefix caching (saves memory when requests share prefixes)
--enable-prefix-caching
Production Settings
# Full production example
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--max-num-seqs 256 \
--enable-chunked-prefill \
--enable-prefix-caching \
--disable-log-requests \
--port 8000
Making the Decision: A Framework
Step 1: Verify You Need Multi-GPU
- Can you run the model quantized on one GPU?
- Is throughput actually bottlenecked?
- Would a smaller model work for your task?
If all answers are "no," continue.
Step 2: Check Your Interconnect
- NVLink available? → Tensor parallelism works well
- PCIe only? → Prefer pipeline parallelism or data parallelism
Step 3: Dense or MoE Model?
- Dense model (Llama, Mistral, Qwen dense) → Standard TP or PP
- MoE model (Mixtral, DeepSeek, Llama 4) → Consider EP, watch out for KV cache issues with MLA
Step 4: Latency or Throughput Priority?
- Latency (interactive chat, low concurrency) → Tensor parallelism
- Throughput (batch processing, high concurrency) → Pipeline or data parallelism, DP+EP for MoE
Step 5: Single Node or Multi-Node?
- Fits on one node? → Stay single-node, much simpler ops
- Needs multi-node? → TP within nodes, PP across nodes
Quick Reference Table
| Situation | Config |
|---|---|
| 70B on 2× H100 with NVLink, latency-focused | TP=2 |
| 70B on 4× A100 PCIe, throughput-focused | PP=4 |
| 405B on 8× H100 single node | TP=8 |
| 405B on 2 nodes × 8 H100 | TP=8, PP=2 |
| DeepSeek-V3 on 8× H100, latency-focused | TP=8 + EP |
| DeepSeek-V3 on 8× H100, throughput-focused | DP=8 + EP |
| Mixtral on 2× A100 | TP=2 + EP |
When Multi-GPU Isn't Worth It
Sometimes the right answer is "don't."
Consider Managed Inference
If GPU infrastructure isn't your core competency, the operational burden of multi-GPU deployment might exceed the cost of managed services.
Setting up multi-GPU inference takes days to weeks. Maintaining it takes ongoing engineering time. Debugging distributed systems issues is specialized work.
PremAI offers multi-GPU deployment in your VPC with SOC2/HIPAA/GDPR compliance. You get the model capabilities without becoming a distributed systems team. For many organizations, that trade-off makes sense.
Consider Quantization Harder
Before running BF16 on 8 GPUs, try FP8 on 2 GPUs or INT4 on one. Modern quantization often costs less in quality than the extra engineering complexity costs you in time and reliability.
Consider Smaller Models
The gap between 70B and smaller models has narrowed. Task-specific fine-tuned 8B models sometimes beat general 70B models. Always benchmark your actual use case before assuming you need the biggest model.
Frequently Asked Questions
How many GPUs for Llama 70B? Minimum 2× A100/H100 80GB at FP8. Comfortable setup is 4× for good KV cache headroom. Single H100 works with INT4 quantization.
Should I use TP or PP? TP for latency with NVLink. PP for throughput or PCIe systems. Never use high TP on PCIe.
Why is my TP=8 only 5× faster than single GPU? Communication overhead. 60-75% efficiency at TP=8 is normal, not a bug.
Can I mix different GPU types? Don't. Different memory sizes and speeds cause load imbalance. All GPUs should be identical.
My outputs changed when I changed TP size. Is that a bug? No. Floating-point non-associativity means different parallelism configs legitimately produce different outputs.
What's expert parallelism? A mode for MoE models that distributes experts across GPUs instead of sharding them. Enable with --enable-expert-parallel in vLLM.
Do I need InfiniBand for multi-node? Strongly recommended. 100Gbps Ethernet is minimum viable. 10Gbps will bottleneck.
Why does my deployment OOM despite having enough total memory? Memory fragmentation. Reduce --max-model-len, use FP8 KV cache, or reduce --max-num-seqs.
The Bottom Line
Multi-GPU inference is for when your model genuinely doesn't fit on one GPU, or when one GPU can't sustain your throughput requirements. It's not for making small models faster or because "more GPUs sounds better."
Start with the smallest config that works. Verify you've exhausted single-GPU options first. Choose your parallelism strategy based on your interconnect and workload pattern. Budget for the operational complexity; this isn't set-and-forget infrastructure.
And if managing distributed GPU deployments isn't your team's strength, managed solutions exist for exactly that reason.
Start here:
vllm serve your-model --tensor-parallel-size 2
Scale up only when you hit actual limits.
For related reading, see the self-hosted LLM guide and RAG pipeline strategies.