LLM Infrastructure Sizing: From Hardware Requirements to Production Capacity
A 70B model needs 35GB to load. Serving 50 concurrent users needs 80GB+. The gap is KV cache and batch size. Complete sizing guide for production LLM deployments.
Most VRAM calculators answer the wrong question. They tell you whether a model will load. They don't tell you whether it will serve your production traffic.
A Llama 3.1 70B model needs about 35GB to load at 4-bit quantization. But serving 50 concurrent users with 8K context windows? That pushes memory requirements past 80GB. The difference is KV cache, and most sizing guides ignore it.
This guide covers the complete memory equation, from model weights through KV cache and batch overhead. You'll learn to calculate throughput capacity, understand when self-hosting breaks even against APIs, and size infrastructure for actual production workloads rather than demo deployments.
The Complete Memory Equation
GPU memory during LLM inference splits into four components:
Total VRAM = Model Weights + KV Cache + Activations + Framework Overhead
Most discussions stop at model weights. Production deployments can't afford to.
Model Weights
The baseline. Every parameter needs storage, and the amount depends on precision.
| Precision | Bytes per Parameter | 7B Model | 70B Model |
|---|---|---|---|
| FP32 | 4 | 28GB | 280GB |
| FP16/BF16 | 2 | 14GB | 140GB |
| INT8 | 1 | 7GB | 70GB |
| INT4/Q4 | 0.5 | 3.5GB | 35GB |
The formula: VRAM (GB) = Parameters (B) × Bytes per Parameter
A 70B model at FP16 precision needs 140GB just for weights. That's two A100-80GB cards before you process a single token. Quantization to 4-bit cuts that to 35GB, fitting on a single A100 with room for inference overhead.
Quantization quality has improved dramatically. Modern GPTQ and AWQ methods preserve 95%+ of model quality at INT4. For most production use cases, the quality difference is imperceptible while memory savings are substantial.
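The weights formula reduces to a one-line helper. A minimal sketch using the precisions from the table above:

```python
# Bytes per parameter at each precision, per the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    """VRAM (GB) = parameters (billions) x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

# Reproduces the table: 70B at FP16 needs 140GB for weights alone; Q4 cuts it to 35GB.
print(weight_vram_gb(70, "fp16"))  # 140.0
print(weight_vram_gb(70, "int4"))  # 35.0
```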
KV Cache: The Hidden Memory Consumer
KV cache stores attention key-value pairs for every token in the context window. It grows with sequence length and batch size, and it's where most sizing estimates go wrong.
The formula:
KV Cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × sequence_length × batch_size × bytes_per_element

For full multi-head attention, num_kv_heads × head_dim equals the hidden size. Most recent models use grouped-query attention (GQA), which shares each key-value head across several query heads and shrinks the cache accordingly — for Llama 3.1 70B, by 8×.

For Llama 3.1 70B (80 layers, 8 KV heads × 128 head dim) at FP16:

| Context Length | Batch Size 1 | Batch Size 8 | Batch Size 32 |
|---|---|---|---|
| 2K tokens | 0.7GB | 5.4GB | 21.5GB |
| 8K tokens | 2.7GB | 21.5GB | 86GB |
| 32K tokens | 10.7GB | 86GB | 344GB |

At batch size 1, KV cache is manageable. At production batch sizes serving concurrent users at long contexts, it rivals or exceeds the weights themselves — and without GQA these numbers would be another 8× larger.
This is why a model that loads fine for testing fails under production load. The model weights fit. The KV cache for 32 concurrent requests doesn't.
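The KV cache formula can be sketched in its GQA-aware form (for full multi-head attention, num_kv_heads × head_dim simply equals the hidden size). The Llama 3.1 70B shape values below — 80 layers, 8 KV heads, 128 head dim — come from its public model config:

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_element: int = 2) -> float:
    """2 (K and V) x layers x kv_heads x head_dim x tokens x batch x bytes."""
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_element)
    return total_bytes / 1e9  # decimal GB

# Llama 3.1 70B (GQA) at FP16, 8K context, batch 8:
print(round(kv_cache_gb(80, 8, 128, 8192, 8), 1))  # 21.5
```

Running the same numbers at batch size 32 shows why a single 80GB card fails: ~86GB of cache before weights are even counted.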
KV Cache Optimization:
PagedAttention (used by vLLM) reduces KV cache waste from 60-80% to under 4% by managing memory in fixed-size blocks rather than pre-allocating for maximum sequence length. This optimization alone can double or triple the number of concurrent requests a given GPU configuration supports.
Activations
Intermediate computation results during the forward pass. These depend on batch size, sequence length, and model architecture. Typically 1-5GB for inference workloads, more for training.
Activations are usually the smallest component for inference, but they scale with batch size. At very large batch sizes (64+), activation memory becomes significant.
Framework Overhead
CUDA contexts, memory allocators, graph compilation, and runtime buffers. Expect 5-15% overhead on top of theoretical requirements. vLLM and TensorRT-LLM are more memory-efficient than naive HuggingFace implementations.
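Putting the four components together gives a rough end-to-end estimator. This is a sketch: the 2GB activation figure and 10% overhead default are assumptions drawn from the ranges above, not measured values.

```python
def total_vram_gb(weights_gb: float, kv_gb: float,
                  activations_gb: float = 2.0, overhead_frac: float = 0.10) -> float:
    """Total = (weights + KV cache + activations) x (1 + framework overhead)."""
    return (weights_gb + kv_gb + activations_gb) * (1 + overhead_frac)

# 70B at Q4 (35GB weights) with ~21GB of KV cache (8K context, batch 8):
print(round(total_vram_gb(35, 21)))  # 64 -> fits an 80GB card with headroom
```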
Sizing by Model Class
Practical recommendations based on real deployment patterns. The KV cache budgets below assume paged attention with typical context occupancy rather than every request pinned at maximum context; worst-case numbers from the formula above can be substantially higher, especially for models without GQA.
Small Models (1B-8B)
| Model | Precision | Weights | KV Cache Budget (batch 8) | Total Recommended |
|---|---|---|---|---|
| Llama 3.2 3B | Q4 | 1.5GB | 1.2GB | 8GB |
| Mistral 7B | Q4 | 3.5GB | 2.8GB | 12GB |
| Llama 3.1 8B | Q4 | 4GB | 3.2GB | 12-16GB |
| Llama 3.1 8B | FP16 | 16GB | 3.2GB | 24GB |
Hardware: RTX 4060 Ti 16GB handles most small models comfortably. RTX 4090 24GB provides headroom for larger batch sizes or longer contexts.
Use cases: Classification, extraction, simple Q&A, edge deployment, high-volume low-complexity tasks.
Medium Models (13B-34B)
| Model | Precision | Weights | KV Cache Budget (batch 8) | Total Recommended |
|---|---|---|---|---|
| Llama 2 13B | Q4 | 6.5GB | 5GB | 16-24GB |
| Qwen 2.5 32B | Q4 | 16GB | 12.8GB | 40-48GB |
| DeepSeek-V2-Lite (16B total, 2.4B active) | Q4 | ~8GB | ~2GB (MLA) | 16GB |
Note: Mixture-of-Experts models must keep all expert weights in VRAM; the active parameter count reduces compute per token, not memory.
Hardware: RTX 4090 24GB for smaller medium models. Dual RTX 4090s or single A100-40GB for 32B class.
Use cases: General assistants, coding, document analysis, moderate-complexity reasoning.
Large Models (65B-72B)
| Model | Precision | Weights | KV Cache Budget (batch 8) | Total Recommended |
|---|---|---|---|---|
| Llama 3.1 70B | Q4 | 35GB | 28GB | 80GB |
| Qwen 2.5 72B | Q4 | 36GB | 29GB | 80GB |
| Llama 3.3 70B | FP16 | 140GB | 28GB | 192GB+ |
Hardware: A100-80GB minimum for Q4 at production batch sizes. Dual A100-80GB or H100-80GB for FP16 or higher throughput.
Use cases: Complex reasoning, nuanced generation, tasks where quality matters more than latency.
Frontier Models (100B+)
| Model | Precision | Weights | Notes |
|---|---|---|---|
| Llama 3.1 405B | Q4 | 200GB+ | Requires 4-8× A100-80GB |
| DeepSeek-V3 (671B total, 37B active) | Q4 | ~335GB | MoE: all experts must stay resident; 37B active per token saves compute, not memory |
Hardware: Multi-node GPU clusters. H100 NVLink configurations. Enterprise infrastructure.
Throughput: From Memory to Capacity
Knowing a model fits tells you nothing about how many users it can serve. Throughput depends on batch efficiency and memory bandwidth.
Tokens Per Second Estimation
LLM inference is memory-bandwidth bound. The theoretical maximum throughput:
Max Tokens/Second ≈ Memory Bandwidth (GB/s) / Model Size (GB)
| GPU | Memory Bandwidth | 7B Model (Q4) | 70B Model (Q4) |
|---|---|---|---|
| RTX 4090 | 1,008 GB/s | ~288 tok/s | ~29 tok/s |
| A100-80GB | 2,039 GB/s | ~583 tok/s | ~58 tok/s |
| H100-80GB | 3,352 GB/s | ~958 tok/s | ~96 tok/s |
These are theoretical maximums at batch size 1. Real-world throughput depends heavily on batching efficiency.
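The roofline estimate above is trivial to compute; a sketch reproducing the table values:

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Memory-bandwidth roofline: each generated token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

print(round(max_tokens_per_second(1008, 3.5)))  # 288 (RTX 4090, 7B Q4)
print(round(max_tokens_per_second(2039, 35)))   # 58  (A100-80GB, 70B Q4)
```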
Batch Size Impact
Larger batches amortize the cost of loading model weights across more requests, dramatically improving throughput per GPU:
| Batch Size | RTX 4090 (7B Q4) | A100-80GB (70B Q4) |
|---|---|---|
| 1 | 40 tok/s | 12 tok/s |
| 4 | 140 tok/s | 42 tok/s |
| 8 | 240 tok/s | 70 tok/s |
| 16 | 350 tok/s | 95 tok/s |
| 32 | 420 tok/s | 110 tok/s |
Numbers vary by framework. vLLM with continuous batching achieves 2-4× higher throughput than static batching at the same batch size.
Users to Throughput Math
Convert business requirements to infrastructure needs:
Assumptions for a typical chat application:
- Average request: 200 input tokens + 300 output tokens
- Target latency: 5 seconds for complete response
- Concurrent users: peak simultaneous requests
Required Throughput (tok/s) = Concurrent Users × Output Tokens / Target Latency
Example: 50 concurrent users, 300 output tokens, 5-second target = 3,000 tok/s required
| Concurrent Users | Output Tokens | Target Latency | Required tok/s | GPU Configuration (70B Q4) |
|---|---|---|---|---|
| 10 | 300 | 5s | 600 | 2× A100-80GB |
| 50 | 300 | 5s | 3,000 | 8× A100-80GB or 4× H100 |
| 100 | 300 | 5s | 6,000 | 16× A100-80GB or 8× H100 |
| 200 | 500 | 3s | ~33,000 | Multi-node cluster |
Configurations assume roughly 300-400 aggregate tok/s per A100-80GB for a 70B Q4 model with continuous batching; benchmark your own stack before committing.
For smaller models at lower precision, these numbers improve dramatically. A 7B model at Q4 on a single RTX 4090 can often handle dozens of concurrent users for typical chat workloads, provided latency targets are modest.
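The users-to-GPUs arithmetic can be sketched directly. The 400 tok/s per-GPU aggregate used in the example is an assumption for illustration; substitute your own benchmark numbers.

```python
import math

def required_tokens_per_second(concurrent_users: int, output_tokens: int,
                               target_latency_s: float) -> float:
    """Each user needs output_tokens within target_latency_s, all in parallel."""
    return concurrent_users * output_tokens / target_latency_s

def gpus_needed(required_tps: float, per_gpu_tps: float) -> int:
    """Round up: you cannot provision a fraction of a GPU."""
    return math.ceil(required_tps / per_gpu_tps)

req = required_tokens_per_second(50, 300, 5)
print(req)                    # 3000.0
print(gpus_needed(req, 400))  # 8 (assumed 400 aggregate tok/s per GPU)
```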
Self-Hosting Economics
The decision between self-hosting and APIs comes down to utilization and volume.
Cost Per Token Comparison
| Option | Cost per 1M Tokens (Input) | Cost per 1M Tokens (Output) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Self-hosted 7B (H100 spot) | ~$0.10-0.20 | ~$0.10-0.20 |
| Self-hosted 70B (A100 cluster) | ~$0.50-2.00 | ~$0.50-2.00 |
Self-hosted costs assume amortized hardware or spot instance pricing at high utilization.
Break-Even Analysis
The crossover point depends on daily token volume and utilization rate. Using blended prices from the table above (weighted 40% input / 60% output, matching a 200-in / 300-out chat request):

Self-hosting 7B model (single H100 spot instance at ~$2/hour):
- Monthly infrastructure cost: ~$1,500 (including overhead)
- Throughput capacity: several thousand tok/s with continuous batching, roughly 100M+ tokens/day at full utilization
- Break-even vs GPT-4o (~$7.00 blended per 1M): ~7M tokens/day
- Break-even vs GPT-4o mini (~$0.42 blended per 1M): ~120M tokens/day — near the capacity ceiling, so mini is hard to beat on price alone

Self-hosting 70B model (2× A100-80GB at ~$4/hour):
- Monthly infrastructure cost: ~$3,000
- Throughput capacity: roughly 500-1,000 tok/s with continuous batching
- Break-even vs GPT-4o: ~14M tokens/day
- Break-even vs Claude 3.5 Sonnet (~$10.20 blended per 1M): ~10M tokens/day

| Daily Token Volume | Best Option |
|---|---|
| < 5M | API (GPT-4o mini or Haiku) |
| 5M - 20M | Hybrid (simple queries local, complex to API) |
| 20M - 100M | Self-hosted 7B-13B |
| > 100M | Self-hosted, potentially larger models |

These thresholds assume a self-hosted open model is acceptable for the workload; if only a frontier model will do, the comparison is about quality, not just price.
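One way to run the break-even arithmetic yourself. The 40/60 input/output blend is an assumption matching a typical 200-in / 300-out chat request; adjust it to your traffic mix.

```python
def blended_price_per_million(input_price: float, output_price: float,
                              input_frac: float = 0.4) -> float:
    """Blend per-1M-token API prices by the input/output traffic mix."""
    return input_frac * input_price + (1 - input_frac) * output_price

def break_even_tokens_per_day(monthly_infra_cost: float,
                              api_price_per_million: float,
                              days: float = 30) -> float:
    """Daily volume at which API spend equals the self-hosted monthly bill."""
    return monthly_infra_cost / api_price_per_million * 1e6 / days

gpt4o = blended_price_per_million(2.50, 10.00)  # $7.00 blended per 1M
print(round(break_even_tokens_per_day(1500, gpt4o) / 1e6, 1))  # ~7.1M tokens/day
```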
For detailed guidance on the self-hosting decision, including setup walkthroughs and tool comparisons, see our complete self-hosted LLM guide.
Hidden Costs
Infrastructure cost isn't just hardware:
| Cost Category | Monthly Estimate (Production Deployment) |
|---|---|
| GPU compute (4× A100-80GB) | $8,000-12,000 |
| Engineering time (0.25 FTE) | $3,000-5,000 |
| Power and cooling (on-prem) | $500-1,000 |
| Monitoring and observability | $200-500 |
| Storage (models, logs) | $100-300 |
| Total | $12,000-19,000 |
Compare against equivalent API spend. If your projected API bill exceeds $15K/month with consistent utilization, self-hosting likely makes sense. Below $5K/month, APIs are almost always more cost-effective when accounting for engineering overhead.
Multi-GPU Scaling
When single-GPU capacity isn't enough, you have two scaling strategies.
Tensor Parallelism (Model Sharding)
Split the model across GPUs. Each GPU holds a portion of every layer, and GPUs communicate during forward passes.
- Use when: Model doesn't fit on single GPU (70B+ at FP16)
- Scaling: Near-linear up to 4-8 GPUs, diminishing returns beyond
- Requirement: High-bandwidth interconnect (NVLink recommended)
- Latency impact: Minimal per-request, communication overhead ~5-15%
```python
# vLLM with tensor parallelism
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    tensor_parallel_size=4,  # Split across 4 GPUs
    dtype="float16",
)
```
Pipeline Parallelism (Layer Sharding)
Split layers across GPUs sequentially. GPU 1 processes layers 1-20, GPU 2 processes layers 21-40, etc.
- Use when: Memory is the bottleneck, throughput is secondary
- Scaling: Works across nodes, tolerates lower interconnect bandwidth
- Latency impact: Higher per-request (sequential dependency)
- Throughput: Lower than tensor parallelism for single requests, but scales batches well
Data Parallelism (Replicas)
Run multiple model copies, route requests across them.
- Use when: Throughput is the bottleneck, model fits on single GPU
- Scaling: Linear with GPU count
- Management: Load balancer required, simpler than model parallelism
- Cost: Most efficient for high-volume deployments
For most production deployments, data parallelism (replicas) combined with efficient single-GPU inference (vLLM) provides the best throughput per dollar.
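The routing logic for replicas can be as simple as round-robin. A minimal sketch (the replica addresses are hypothetical; a production balancer would also weigh queue depth and health checks):

```python
from itertools import cycle

class ReplicaRouter:
    """Round-robin request routing over identical model replicas."""

    def __init__(self, replica_urls: list[str]):
        self._replicas = cycle(replica_urls)

    def pick(self) -> str:
        # Each call returns the next replica in rotation.
        return next(self._replicas)

router = ReplicaRouter(["gpu-node-0:8000", "gpu-node-1:8000"])
print([router.pick() for _ in range(4)])  # alternates between the two replicas
```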
Production Checklist
Before deploying:
Memory Validation
- [ ] Model loads with 20% VRAM headroom
- [ ] KV cache calculated for max batch size × max context
- [ ] Tested under sustained load for 30+ minutes
- [ ] OOM errors handled gracefully (request rejection, not crash)
Throughput Verification
- [ ] Benchmarked at expected batch sizes
- [ ] Latency P50/P95/P99 meet SLA targets
- [ ] Throughput sustains during peak traffic simulation
- [ ] Continuous batching enabled (vLLM, TensorRT-LLM)
Reliability
- [ ] Health checks configured
- [ ] Auto-restart on failure
- [ ] Monitoring for VRAM usage, throughput, latency
- [ ] Alerting for degraded performance
Cost Controls
- [ ] Utilization monitoring (target 60%+ for cost efficiency)
- [ ] Auto-scaling policies if using cloud
- [ ] Spot instance handling (checkpointing, graceful migration)
Quick Reference Tables
VRAM by Model Size (Q4 Quantization)
| Model Size | Weights | Recommended VRAM (batch 8, 8K context) |
|---|---|---|
| 3B | 1.5GB | 8GB |
| 7B | 3.5GB | 12-16GB |
| 13B | 6.5GB | 24GB |
| 32B | 16GB | 48GB |
| 70B | 35GB | 80GB |
GPU Recommendations by Use Case
| Use Case | Model Class | Recommended GPU | Approximate Cost |
|---|---|---|---|
| Development/Testing | 7B | RTX 4060 Ti 16GB | $450 |
| Personal/Small Team | 7B-13B | RTX 4090 24GB | $1,600 |
| Production (low traffic) | 13B-32B | A10G 24GB | $1.50/hr (cloud) |
| Production (medium traffic) | 70B | A100-80GB | $3.50/hr (cloud) |
| Production (high traffic) | 70B+ | 4× A100-80GB | $14/hr (cloud) |
| Enterprise | 70B+ | H100 cluster | Custom pricing |
Serving Framework Comparison
| Framework | Throughput | Memory Efficiency | Ease of Use | Best For |
|---|---|---|---|---|
| vLLM | Excellent | Excellent (PagedAttention) | Moderate | Production |
| TensorRT-LLM | Excellent | Good | Complex | Maximum performance |
| Ollama | Good | Good | Easy | Development |
| HuggingFace TGI | Good | Good | Easy | Quick deployment |
For teams that want production-grade infrastructure without managing the stack, Prem's self-hosted platform handles optimization, scaling, and monitoring while keeping data on your infrastructure.
Summary
Sizing LLM infrastructure requires thinking beyond model weights:
- Calculate complete memory: Weights + KV cache + activations + overhead
- Size for production batch sizes: Single-request memory is misleading
- Match throughput to users: Convert concurrent users to required tokens/second
- Factor total cost: Hardware, engineering, power, not just GPU hours
- Break-even analysis: Self-hosting wins only at sustained multi-million-token daily volumes
The common mistake is sizing for "can it run" rather than "can it serve." A model that loads fine in testing may fail under production load when KV cache for concurrent requests exceeds available VRAM.
Start with clear requirements: concurrent users, target latency, acceptable cost per request. Work backward to infrastructure. Benchmark under realistic load before committing to hardware purchases or cloud reservations.
For managed infrastructure that handles the optimization complexity, Prem Studio provides fine-tuning, evaluation, and deployment as an integrated pipeline. You focus on the model; the platform handles the infrastructure math.