Multi-GPU LLM Inference: TP vs PP vs EP Parallelism Guide (2026)


Most teams reach for multi-GPU too early. A single H100 runs Llama 70B at INT4 with room for 32K context. A single A100 handles Mistral 7B with headroom for batching. Adding GPUs before you need them just adds complexity, failure modes, and wasted compute on communication overhead.

But when your model genuinely doesn't fit, or when one GPU can't sustain production throughput, you need to split the workload. And the choice you make here (tensor parallelism, pipeline parallelism, expert parallelism, or some combination) determines whether your deployment runs efficiently or burns 30-50% of your compute on synchronization.

This guide is for engineers who've hit the limits of single-GPU inference and need to make real decisions about scaling up. We'll cover when multi-GPU actually makes sense, which parallelism strategy fits which situation, the problems you'll hit that nobody warns you about, and the specific configurations that work in production.


First Question: Do You Actually Need Multiple GPUs?

Before committing to distributed inference, verify you've exhausted single-GPU options.

Check 1: Have you tried quantization?

Llama 70B in BF16 needs 140GB. In INT4, it needs 35GB. That's the difference between "impossible on one GPU" and "runs fine on a single H100."

Modern quantization (AWQ, GPTQ, GGUF Q4) typically loses 1-3% on benchmarks. For most production workloads, that's acceptable. Try INT4 or INT8 before adding hardware.

Check 2: Is your throughput actually bottlenecked?

One H100 running Llama 70B INT4 can push 40-60 tokens/second per request with batching. That's enough for many production workloads. Run actual load tests before assuming you need more compute.
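A quick back-of-envelope capacity check makes "actually bottlenecked" concrete. The sketch below assumes an aggregate batched decode throughput and an average response length; both numbers are illustrative placeholders, not benchmarks, so substitute your own load-test figures.

```python
# Back-of-envelope capacity check: can one GPU sustain your load?
# The throughput and output-length numbers are assumptions for illustration.

def sustained_requests_per_sec(decode_tokens_per_sec: float,
                               avg_output_tokens: float) -> float:
    """Rough steady-state request rate a server can sustain."""
    return decode_tokens_per_sec / avg_output_tokens

aggregate_tps = 500.0   # assumed aggregate tokens/sec across the whole batch
avg_output = 250.0      # assumed tokens per response

rps = sustained_requests_per_sec(aggregate_tps, avg_output)
print(f"~{rps:.1f} req/s, ~{rps * 3600:.0f} req/hour")  # ~2.0 req/s, ~7200 req/hour
```

If that rate comfortably exceeds your peak traffic, a second GPU buys you nothing but complexity.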

Check 3: Could a smaller model work?

Llama 8B handles most simple tasks. Specialized 34B models often match 70B general models on domain tasks. Qwen 32B fits on one 24GB GPU with INT4. Don't use 70B because it sounds impressive; use it because your task requires it.

The honest thresholds:

| Situation | Multi-GPU Needed? |
|---|---|
| Model under 35B, any quantization | No |
| 70B at INT4, moderate throughput | Probably no |
| 70B at FP16/BF16 | Yes |
| 70B at any precision, high concurrent load | Likely yes |
| 405B at any precision | Definitely yes |
| DeepSeek-V3, Llama 4 Maverick | Yes |

If you're past these thresholds, read on. If not, stop here and optimize your single-GPU setup first.


The Memory Math That Actually Matters

Understanding memory requirements prevents the "why did it OOM on startup?" debugging sessions.

Model Weights

Simple formula: parameters × bytes per precision.

| Model | BF16 (2 bytes/param) | FP8 (1 byte/param) | INT4 (0.5 bytes/param) |
|---|---|---|---|
| Llama 8B | 16 GB | 8 GB | 4 GB |
| Llama 70B | 140 GB | 70 GB | 35 GB |
| Llama 405B | 810 GB | 405 GB | 203 GB |
| Mixtral 8x7B | 94 GB | 47 GB | 24 GB |
| DeepSeek-V3 (671B) | 1,342 GB | 671 GB | 336 GB |
| Qwen 72B | 144 GB | 72 GB | 36 GB |

With tensor parallelism, divide by GPU count. Llama 70B BF16 on TP=4 means 35GB per GPU for weights.
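The formula is simple enough to script. A minimal helper, using the same decimal-GB convention as the table above:

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Model weight footprint in GB: parameters x bytes per parameter."""
    return params_billion * bytes_per_param

def weights_per_gpu_gb(params_billion: float, bytes_per_param: float,
                       tp: int) -> float:
    """With tensor parallelism, weights divide evenly across GPUs."""
    return weights_gb(params_billion, bytes_per_param) / tp

print(weights_gb(70, 2))             # Llama 70B BF16 -> 140 GB total
print(weights_per_gpu_gb(70, 2, 4))  # TP=4 -> 35.0 GB per GPU
```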

KV Cache: The Hidden Memory Killer

Model weights are just the start. KV cache stores attention context and grows with sequence length and batch size.

Rough formula:

KV cache = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element

Real example: Llama 70B, 32K context, batch of 8, FP16:

2 × 80 layers × 8 kv_heads × 128 dim × 32,768 seq × 8 batch × 2 bytes = ~86 GB

That's 86GB on top of model weights (an FP8 KV cache halves it to ~43GB). A single request at 32K context needs roughly 10.7GB of FP16 KV cache. Scale that with concurrency.

Rule of thumb: Reserve 40-50% of VRAM beyond model weights for KV cache and runtime overhead. If your weights take 35GB per GPU, you need 60GB+ GPUs, not 40GB.
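The formula above is worth wrapping in a helper so you can budget KV cache before deploying. The Llama 70B geometry here (80 layers, 8 grouped-query KV heads, head_dim 128) comes from the example; plug in your own model's config values.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: float) -> float:
    """KV cache size in decimal GB: 2 (K and V) x layers x kv_heads
    x head_dim x seq_len x batch x bytes per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Llama 70B GQA geometry, 32K context, batch of 8
fp16 = kv_cache_gb(80, 8, 128, 32_768, 8, 2)
fp8  = kv_cache_gb(80, 8, 128, 32_768, 8, 1)
print(f"batch 8, 32K ctx: FP16 ~{fp16:.0f} GB, FP8 ~{fp8:.0f} GB")  # ~86 GB / ~43 GB
```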

Per-GPU Memory with Different Configs

For Llama 70B BF16 (140GB weights total):

| Config | Weights/GPU | Free for KV Cache | Verdict |
|---|---|---|---|
| 2× A100 80GB | 70 GB | 10 GB each | Tight, short context only |
| 4× A100 80GB | 35 GB | 45 GB each | Comfortable |
| 8× H100 80GB | 17.5 GB | 62.5 GB each | Generous, long context OK |
| 2× H200 141GB | 70 GB | 71 GB each | Generous |

More GPUs = more KV cache headroom = longer context or higher batch sizes.


The Three Parallelism Strategies (And When Each Makes Sense)

Tensor Parallelism: Split Every Layer Across GPUs

Tensor parallelism divides each layer's weight matrices across GPUs. Every GPU processes the same input simultaneously, computing different slices of each layer.

How it works in practice:

A transformer layer has attention and MLP blocks. With TP=4, each GPU holds 1/4 of the attention weights and 1/4 of the MLP weights. All four GPUs process the same token, compute their quarter of the result, then synchronize via all-reduce before the next layer.
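The arithmetic can be illustrated with a toy matrix-vector product in plain Python. No real GPUs or kernels here, just the split: each "GPU" owns a slice of the weight matrix along the input dimension, computes a partial output, and the all-reduce is an elementwise sum.

```python
# Toy row-parallel matmul: y = W @ x, with W split along its input
# dimension across 4 "GPUs". Each GPU computes a partial result; the
# all-reduce (here just a sum) reconstructs the full output.

def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 1, 1, 1]

full = matvec(W, x)  # the single-GPU reference result

tp = 4
partials = []
for k in range(tp):
    W_k = [[row[k]] for row in W]  # this "GPU"'s column slice of W
    x_k = [x[k]]                   # the matching slice of the input
    partials.append(matvec(W_k, x_k))

# "all-reduce": elementwise sum of the partial outputs
reduced = [sum(p[i] for p in partials) for i in range(len(full))]
print(reduced)  # [10, 26, 42, 58] -- identical to the unsplit result
```

Real implementations alternate column- and row-parallel splits so only the row-parallel half needs the all-reduce, but the principle is the same: partial results plus a synchronization step.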

The critical dependency: interconnect speed.

Every transformer layer requires two all-reduce operations to sync results. Llama 70B has 80 layers. That's 160 synchronization points per forward pass.

| Interconnect | Bandwidth | TP Efficiency |
|---|---|---|
| NVLink 4.0 (H100) | 900 GB/s | Excellent, TP=8 works well |
| NVLink 3.0 (A100) | 600 GB/s | Good, TP=8 acceptable |
| PCIe 5.0 | 128 GB/s | Marginal, TP=2 max |
| PCIe 4.0 | 64 GB/s | Poor, avoid TP |

On PCIe systems, communication can consume 40-50% of inference time at TP=4. NVLink is effectively mandatory for tensor parallelism beyond TP=2.
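A rough bandwidth-only estimate shows why the interconnect dominates. This sketch assumes a Llama-70B-like shape (hidden size 8192, 80 layers, 2 all-reduces per layer, BF16 activations) and the standard ring all-reduce cost of 2·(N−1)/N of the message per GPU; it deliberately ignores per-operation launch latency, which adds further overhead and often dominates at batch size 1.

```python
# Bandwidth-only estimate of per-token all-reduce time during decode.
# Assumptions: hidden=8192, 80 layers, 2 all-reduces/layer, BF16 (2 bytes),
# ring all-reduce transferring 2*(N-1)/N of the message per GPU.

def allreduce_seconds_per_token(hidden: int, layers: int, tp: int,
                                bandwidth_gbps: float,
                                bytes_per_elem: int = 2) -> float:
    msg = hidden * bytes_per_elem                              # one token's activation
    per_op = 2 * (tp - 1) / tp * msg / (bandwidth_gbps * 1e9)  # ring all-reduce
    return 2 * layers * per_op                                 # 2 all-reduces per layer

nvlink = allreduce_seconds_per_token(8192, 80, 4, 900)  # NVLink 4.0
pcie4  = allreduce_seconds_per_token(8192, 80, 4, 64)   # PCIe 4.0
print(f"NVLink 4.0: {nvlink * 1e6:.1f} us/token, PCIe 4.0: {pcie4 * 1e6:.1f} us/token")
```

The gap tracks the raw bandwidth ratio (900/64 ≈ 14×), and it compounds with batch size, since every token in the batch pays the transfer.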

Scaling efficiency drops fast:

| TP Size | Theoretical Speedup | Actual Speedup | Efficiency |
|---|---|---|---|
| 2 | 2.0× | 1.7-1.9× | 85-95% |
| 4 | 4.0× | 2.8-3.4× | 70-85% |
| 8 | 8.0× | 4.5-6.0× | 56-75% |

At TP=8, you're losing 25-44% of potential speedup to communication. This is normal, not a misconfiguration.

When to use tensor parallelism:

  • You have NVLink (DGX, HGX, or similar)
  • Latency matters more than throughput
  • Low to moderate concurrency (under ~200 concurrent requests)
  • Single-node deployment

vLLM command:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

Pipeline Parallelism: Stack Layer Blocks Sequentially

Pipeline parallelism assigns consecutive chunks of layers to different GPUs. Data flows through like an assembly line.

How it works:

With PP=4 on an 80-layer model, GPU 0 runs layers 0-19, GPU 1 runs 20-39, GPU 2 runs 40-59, GPU 3 runs 60-79. A request enters GPU 0, gets processed through its layers, then passes to GPU 1, and so on.
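The layer assignment above is just an even contiguous split, which a few lines can compute for any model depth (real schedulers sometimes shift a layer or two to balance embedding and LM-head memory, but the even split is the baseline):

```python
def layer_assignment(num_layers: int, pp: int) -> list[tuple[int, int]]:
    """Contiguous (first, last) layer range per pipeline stage, even split."""
    per_stage = num_layers // pp
    return [(g * per_stage, (g + 1) * per_stage - 1) for g in range(pp)]

print(layer_assignment(80, 4))  # [(0, 19), (20, 39), (40, 59), (60, 79)]
```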

The bubble problem:

For a single request, only one GPU works at a time. With PP=4, each GPU sits idle 75% of the time waiting for its turn.

vLLM mitigates this with continuous batching: while GPU 0 processes request B, GPU 1 processes request A's next stage. But this only helps with concurrent traffic. Single-request latency is always worse with PP than TP.
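The idle fraction can be sketched with the standard idealized pipeline-bubble formula, (p − 1) / (m + p − 1) for p stages and m in-flight microbatches. This assumes uniform stage times, so treat it as a first-order estimate rather than a prediction:

```python
# Idealized pipeline bubble fraction: the share of GPU time spent idle
# with p pipeline stages and m microbatches (concurrent work items) in flight.

def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

print(bubble_fraction(4, 1))              # 0.75 -> one request: 75% idle
print(round(bubble_fraction(4, 16), 3))   # 0.158 -> 16 in flight: ~16% idle
```

This is why PP is a throughput strategy: the bubble shrinks as concurrency grows, but never reaches zero.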

When pipeline parallelism shines:

  • PCIe-only systems (no NVLink)
  • High throughput workloads with many concurrent requests
  • Multi-node deployments (PP across nodes, TP within nodes)
  • Memory-constrained setups where you need maximum weight distribution

Communication overhead is minimal. PP only requires point-to-point transfers between adjacent GPUs, not all-to-all synchronization. This is why PP works on PCIe while TP doesn't.

vLLM command:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --pipeline-parallel-size 4

Expert Parallelism: For MoE Models Only

Expert parallelism applies only to Mixture-of-Experts architectures: Mixtral, DeepSeek-V2/V3, Llama 4, Qwen MoE variants, and similar models.

Why MoE needs special handling:

MoE models have many "expert" sub-networks. Only a few experts activate per token, but all expert weights must be in memory. DeepSeek-V3 has 256 experts totaling 671B parameters, but only activates ~37B per token.

Standard TP shards all experts across all GPUs. Every GPU has a slice of every expert.

Expert parallelism distributes complete experts across GPUs. Each GPU holds different experts entirely. Tokens route to whichever GPU has the relevant expert.
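Routing can be pictured with a toy ownership map. The sketch below uses a DeepSeek-V3-like shape (256 routed experts over EP=8, so 32 experts per GPU) and a made-up set of router-selected expert IDs; real routers pick experts per token via a learned gate.

```python
# Toy expert-parallel routing: each GPU owns a contiguous block of experts;
# a token's top-k expert IDs determine which GPUs it must visit.

def owner_gpu(expert_id: int, num_experts: int, ep: int) -> int:
    """GPU that holds a given expert under a contiguous block layout."""
    experts_per_gpu = num_experts // ep
    return expert_id // experts_per_gpu

num_experts, ep = 256, 8            # DeepSeek-V3-like: 32 experts per GPU
token_topk = [3, 70, 140, 255]      # illustrative router-selected expert IDs
gpus = sorted({owner_gpu(e, num_experts, ep) for e in token_topk})
print(gpus)  # [0, 2, 4, 7] -- this token's activations travel to 4 GPUs
```

That all-to-all token shuffle is EP's communication cost, traded against holding every expert's weights on every GPU.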

The KV cache problem with MoE:

DeepSeek uses Multi-head Latent Attention (MLA), which caches a single compressed latent, effectively one KV head. Standard tensor parallelism can't shard the KV cache across GPUs: there's only one head to shard.

Result: With TP=8 on DeepSeek, the KV cache is duplicated 8 times. Massive memory waste.

Data parallelism with expert parallelism (DP+EP) solves this. DP partitions the KV cache by request, while EP distributes experts. Each GPU holds 1/8 of the KV cache and 1/8 of the experts.

Configuration matters enormously for MoE:

| Config | KV Cache | Experts | Best For |
|---|---|---|---|
| TP=8 | Duplicated 8× | Sharded | Low concurrency, latency-focused |
| DP=8 + EP | Partitioned | Distributed | High concurrency, throughput-focused |

vLLM commands:

# TP with expert parallelism (low concurrency)
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel

# DP with expert parallelism (high concurrency)
vllm serve deepseek-ai/DeepSeek-V3 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --disable-nccl-for-dp-synchronization

The Problems Nobody Warns You About

These issues cause real production failures. I've seen teams spend weeks debugging problems that are fundamental to distributed inference, not configuration mistakes.

Different Parallelism Configs Produce Different Outputs

Floating-point arithmetic isn't associative. Changing how computations are distributed changes rounding behavior.

TP=4 and TP=8 on the same model with the same prompt at temperature=0 produce different outputs: not just different token probabilities, but different actual tokens.

This matters for:

  • Regression testing across config changes
  • A/B comparisons between deployments
  • Reproducibility requirements
  • Debugging ("it worked yesterday, we just changed TP from 4 to 8")

There's no fix. Accept that parallelism configuration is part of your model's identity. Don't change it without re-validating outputs.
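The root cause is easy to demonstrate in three lines: reordering a floating-point sum changes its last bits, and a different TP size is precisely a different reduction order.

```python
# Floating-point addition is not associative. A different parallelism
# config means a different summation order, hence different last bits.

a, b, c = 0.1, 0.2, 0.3
left  = (a + b) + c   # one "reduction order"
right = a + (b + c)   # another
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

Per-layer, those last-bit differences feed into the next layer, and over 80 layers they can flip an argmax and change the sampled token.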

PCIe Multi-GPU Is Worse Than You Think

Teams buy 4× RTX 4090s expecting 4× performance. They get 1.5-2×.

PCIe 4.0 delivers 64 GB/s; NVLink 4.0 delivers 900 GB/s. That's a 14× difference. Tensor parallelism synchronizes after every layer, so at TP=4 on PCIe you spend more time waiting for data transfer than computing.

If you don't have NVLink:

  • Use pipeline parallelism instead of tensor parallelism
  • Or use data parallelism (multiple independent model replicas)
  • Or accept that TP=2 is your practical maximum
  • Or buy NVLink hardware

Memory Fragmentation Kills Long Contexts

You calculate memory requirements, everything looks fine, then OOM hits at runtime.

KV cache allocates dynamically as sequences grow. With variable-length requests and continuous batching, memory fragments. Your GPU has 80GB total, 60GB free in theory, but no contiguous block larger than 8GB.

Mitigations:

  • Use --kv-cache-dtype fp8 to halve KV cache memory
  • Set --max-model-len to your actual needs, not maximum possible
  • Reduce --max-num-seqs to limit concurrent sequences
  • Enable --enable-prefix-caching if requests share prefixes

Driver Version Mismatches Cause Silent Failures

All GPUs need identical driver versions. Mixed drivers cause hangs, incorrect outputs, or crashes that don't mention drivers in the error message.

Multi-node makes this worse. Node A has driver 535.104, Node B has 535.129. Everything appears fine until inference silently produces garbage.

Prevention:

  • Use containerized deployments with fixed driver versions
  • Pin driver versions in your infrastructure automation
  • Verify driver versions across all nodes before deployment: nvidia-smi --query-gpu=driver_version --format=csv

NCCL Hangs Under Load

NCCL (NVIDIA's communication library) can hang under high load without clear error messages. Your process just stops responding.

Common causes:

  • Network issues between nodes
  • Memory pressure causing allocation failures
  • Timeout settings too short for large transfers
  • Firewall rules blocking NCCL ports

Debug approach:

# Enable NCCL debugging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Increase timeout
export NCCL_TIMEOUT=3600

# Run and check logs for actual failure point

Real Hardware Configurations

Single-Node Setups

| Config | Total VRAM | NVLink | Models It Handles | Monthly Cloud Cost |
|---|---|---|---|---|
| 2× A100 40GB | 80 GB | Yes | 70B FP8 (tight) | $2,600 |
| 2× A100 80GB | 160 GB | Yes | 70B BF16 | $3,600 |
| 4× A100 80GB | 320 GB | Yes | 405B INT4, 70B BF16 with headroom | $7,200 |
| 8× H100 80GB | 640 GB | Yes | 405B FP8, DeepSeek-V3 INT4 | $23,000 |
| 8× H200 141GB | 1,128 GB | Yes | 405B BF16, DeepSeek-V3 FP8 | $35,000 |

Multi-Node: When You Actually Need It

Multi-node adds complexity without adding efficiency. Each node needs identical setup. Network between nodes becomes a bottleneck. Failure modes multiply.

Only go multi-node when:

  • Model physically doesn't fit on one node (DeepSeek-V3 BF16 needs ~1.3TB)
  • You need more throughput than one node provides and can't use data parallelism

Standard pattern: TP within nodes (uses fast NVLink), PP across nodes (tolerates slower network).

# Two nodes, 8 GPUs each: vLLM runs multi-node over a Ray cluster
# Node 0 (head)
ray start --head --port=6379

# Node 1 (join the head node)
ray start --address=10.0.0.1:6379

# Launch once, from the head node
vllm serve model-name \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray

Network requirements: InfiniBand strongly preferred. 100Gbps Ethernet minimum. 10Gbps will bottleneck PP transfers.


Complete vLLM Configuration Reference

Basic Parallelism

# Tensor parallel only (most common single-node setup)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

# Pipeline parallel only (PCIe systems or throughput-focused)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --pipeline-parallel-size 4

# Hybrid (multi-node)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2

MoE Models

# DeepSeek-V3: TP+EP for latency
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel

# DeepSeek-V3: DP+EP for throughput
vllm serve deepseek-ai/DeepSeek-V3 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --disable-nccl-for-dp-synchronization

# Mixtral
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel

Memory Optimization

# FP8 KV cache (halves KV memory, minimal quality loss)
--kv-cache-dtype fp8

# Limit context length (saves KV memory for batching)
--max-model-len 32768

# Limit concurrent sequences (prevents OOM under load)
--max-num-seqs 256

# Chunked prefill (prevents long prefills from blocking)
--enable-chunked-prefill

# Prefix caching (saves memory when requests share prefixes)
--enable-prefix-caching

Production Settings

# Full production example
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --disable-log-requests \
  --port 8000

Making the Decision: A Framework

Step 1: Verify You Need Multi-GPU

  • Can you run the model quantized on one GPU?
  • Is throughput actually bottlenecked?
  • Would a smaller model work for your task?

If all answers are "no," continue.

Step 2: Check Your Interconnect

  • NVLink available? → Tensor parallelism works well
  • PCIe only? → Prefer pipeline parallelism or data parallelism

Step 3: Dense or MoE Model?

  • Dense model (Llama, Mistral, Qwen dense) → Standard TP or PP
  • MoE model (Mixtral, DeepSeek, Llama 4) → Consider EP, watch out for KV cache issues with MLA

Step 4: Latency or Throughput Priority?

  • Latency (interactive chat, low concurrency) → Tensor parallelism
  • Throughput (batch processing, high concurrency) → Pipeline or data parallelism, DP+EP for MoE

Step 5: Single Node or Multi-Node?

  • Fits on one node? → Stay single-node, much simpler ops
  • Needs multi-node? → TP within nodes, PP across nodes

Quick Reference Table

| Situation | Config |
|---|---|
| 70B on 2× H100 with NVLink, latency-focused | TP=2 |
| 70B on 4× A100 PCIe, throughput-focused | PP=4 |
| 405B on 8× H100 single node | TP=8 |
| 405B on 2 nodes × 8 H100 | TP=8, PP=2 |
| DeepSeek-V3 on 8× H100, latency-focused | TP=8 + EP |
| DeepSeek-V3 on 8× H100, throughput-focused | DP=8 + EP |
| Mixtral on 2× A100 | TP=2 + EP |

When Multi-GPU Isn't Worth It

Sometimes the right answer is "don't."

Consider Managed Inference

If GPU infrastructure isn't your core competency, the operational burden of multi-GPU deployment might exceed the cost of managed services.

Setting up multi-GPU inference takes days to weeks. Maintaining it takes ongoing engineering time. Debugging distributed systems issues is specialized work.

PremAI offers multi-GPU deployment in your VPC with SOC2/HIPAA/GDPR compliance. You get the model capabilities without becoming a distributed systems team. For many organizations, that trade-off makes sense.

Consider Quantization Harder

Before buying 8 GPUs to serve BF16, try 2 GPUs at FP8 or a single GPU at INT4. Modern quantization usually costs less in accuracy than the added engineering complexity costs you in time.

Consider Smaller Models

The gap between 70B and smaller models has narrowed. Task-specific fine-tuned 8B models sometimes beat general 70B models. Always benchmark your actual use case before assuming you need the biggest model.


Frequently Asked Questions

How many GPUs for Llama 70B? Minimum 2× A100/H100 80GB at FP8. Comfortable setup is 4× for good KV cache headroom. Single H100 works with INT4 quantization.

Should I use TP or PP? TP for latency with NVLink. PP for throughput or PCIe systems. Never use high TP on PCIe.

Why is my TP=8 only 5× faster than single GPU? Communication overhead. 60-75% efficiency at TP=8 is normal, not a bug.

Can I mix different GPU types? Don't. Different memory sizes and speeds cause load imbalance. All GPUs should be identical.

My outputs changed when I changed TP size. Is that a bug? No. Floating-point non-associativity means different parallelism configs legitimately produce different outputs.

What's expert parallelism? A mode for MoE models that distributes experts across GPUs instead of sharding them. Enable with --enable-expert-parallel in vLLM.

Do I need InfiniBand for multi-node? Strongly recommended. 100Gbps Ethernet is minimum viable. 10Gbps will bottleneck.

Why does my deployment OOM despite having enough total memory? Memory fragmentation. Reduce --max-model-len, use FP8 KV cache, or reduce --max-num-seqs.


The Bottom Line

Multi-GPU inference is for when your model genuinely doesn't fit on one GPU, or when one GPU can't sustain your throughput requirements. It's not for making small models faster or because "more GPUs sounds better."

Start with the smallest config that works. Verify you've exhausted single-GPU options first. Choose your parallelism strategy based on your interconnect and workload pattern. Budget for the operational complexity; this isn't set-and-forget infrastructure.

And if managing distributed GPU deployments isn't your team's strength, managed solutions exist for exactly that reason.

Start here:

vllm serve your-model --tensor-parallel-size 2

Scale up only when you hit actual limits.

For related reading, see the self-hosted LLM guide and RAG pipeline strategies.
