LLM Infrastructure Sizing: From Hardware Requirements to Production Capacity
A 70B model needs 35GB to load. Serving 50 concurrent users needs 80GB+. The gap is KV cache and batch size. Complete sizing guide for production LLM deployments.
Most VRAM calculators answer the wrong question. They tell you whether a model will load. They don't tell you whether it will serve your production traffic.
A Llama 3.1 70B model needs about 35GB to load at 4-bit quantization. But serving 50 concurrent users with 8K context windows? That pushes memory requirements past 80GB. The difference is KV cache, and most sizing guides ignore it.
This guide covers the complete memory equation, from model weights through KV cache and batch overhead. You'll learn to calculate throughput capacity, understand when self-hosting breaks even against APIs, and size infrastructure for actual production workloads rather than demo deployments.
The Complete Memory Equation
GPU memory during LLM inference splits into four components:
Total VRAM = Model Weights + KV Cache + Activations + Framework Overhead
Most discussions stop at model weights. Production deployments can't afford to.
Model Weights
The baseline. Every parameter needs storage, and the amount depends on precision.
| Precision | Bytes per Parameter | 7B Model | 70B Model |
|---|---|---|---|
| FP32 | 4 | 28GB | 280GB |
| FP16/BF16 | 2 | 14GB | 140GB |
| INT8 | 1 | 7GB | 70GB |
| INT4/Q4 | 0.5 | 3.5GB | 35GB |
The formula: VRAM (GB) = Parameters (B) × Bytes per Parameter
A 70B model at FP16 precision needs 140GB just for weights. That's two A100-80GB cards before you process a single token. Quantization to 4-bit cuts that to 35GB, fitting on a single A100 with room for inference overhead.
Quantization quality has improved dramatically. Modern GPTQ and AWQ methods preserve 95%+ of model quality at INT4. For most production use cases, the quality difference is imperceptible while memory savings are substantial.
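The weights formula reduces to a one-line helper. A minimal sketch using the precisions from the table above:

```python
# Bytes per parameter at each precision, per the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    """VRAM (GB) = parameters (billions) x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

# Reproduces the table: 70B at FP16 needs 140GB for weights alone; Q4 cuts it to 35GB.
print(weight_vram_gb(70, "fp16"))  # 140.0
print(weight_vram_gb(70, "int4"))  # 35.0
```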
KV Cache: The Hidden Memory Consumer
KV cache stores attention key-value pairs for every token in the context window. It grows with sequence length and batch size, and it's where most sizing estimates go wrong.
The formula:
KV Cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × sequence_length × batch_size × bytes_per_element

For full multi-head attention, num_kv_heads × head_dim equals the hidden size. Most recent models use grouped-query attention (GQA), which shares each key-value head across several query heads and shrinks the cache accordingly — for Llama 3.1 70B, by 8×.

For Llama 3.1 70B (80 layers, 8 KV heads × 128 head dim) at FP16:

| Context Length | Batch Size 1 | Batch Size 8 | Batch Size 32 |
|---|---|---|---|
| 2K tokens | 0.7GB | 5.4GB | 21.5GB |
| 8K tokens | 2.7GB | 21.5GB | 86GB |
| 32K tokens | 10.7GB | 86GB | 344GB |

At batch size 1, KV cache is manageable. At production batch sizes serving concurrent users at long contexts, it rivals or exceeds the weights themselves — and without GQA these numbers would be another 8× larger.
This is why a model that loads fine for testing fails under production load. The model weights fit. The KV cache for 32 concurrent requests doesn't.
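The KV cache formula can be sketched in its GQA-aware form (for full multi-head attention, num_kv_heads × head_dim simply equals the hidden size). The Llama 3.1 70B shape values below — 80 layers, 8 KV heads, 128 head dim — come from its public model config:

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_element: int = 2) -> float:
    """2 (K and V) x layers x kv_heads x head_dim x tokens x batch x bytes."""
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_element)
    return total_bytes / 1e9  # decimal GB

# Llama 3.1 70B (GQA) at FP16, 8K context, batch 8:
print(round(kv_cache_gb(80, 8, 128, 8192, 8), 1))  # 21.5
```

Running the same numbers at batch size 32 shows why a single 80GB card fails: ~86GB of cache before weights are even counted.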
KV Cache Optimization:
PagedAttention (used by vLLM) reduces KV cache waste from 60-80% to under 4% by managing memory in fixed-size blocks rather than pre-allocating for maximum sequence length. This optimization alone can double or triple the number of concurrent requests a given GPU configuration supports.
Activations
Intermediate computation results during the forward pass. These depend on batch size, sequence length, and model architecture. Typically 1-5GB for inference workloads, more for training.
Activations are usually the smallest component for inference, but they scale with batch size. At very large batch sizes (64+), activation memory becomes significant.
Framework Overhead
CUDA contexts, memory allocators, graph compilation, and runtime buffers. Expect 5-15% overhead on top of theoretical requirements. vLLM and TensorRT-LLM are more memory-efficient than naive HuggingFace implementations.
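Putting the four components together gives a rough end-to-end estimator. This is a sketch: the 2GB activation figure and 10% overhead default are assumptions drawn from the ranges above, not measured values.

```python
def total_vram_gb(weights_gb: float, kv_gb: float,
                  activations_gb: float = 2.0, overhead_frac: float = 0.10) -> float:
    """Total = (weights + KV cache + activations) x (1 + framework overhead)."""
    return (weights_gb + kv_gb + activations_gb) * (1 + overhead_frac)

# 70B at Q4 (35GB weights) with ~21GB of KV cache (8K context, batch 8):
print(round(total_vram_gb(35, 21)))  # 64 -> fits an 80GB card with headroom
```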
Sizing by Model Class
Practical recommendations based on real deployment patterns. The KV cache budgets below assume paged attention with typical context occupancy rather than every request pinned at maximum context; worst-case numbers from the formula above can be substantially higher, especially for models without GQA.
Small Models (1B-8B)
| Model | Precision | Weights | KV Cache Budget (batch 8) | Total Recommended |
|---|---|---|---|---|
| Llama 3.2 3B | Q4 | 1.5GB | 1.2GB | 8GB |
| Mistral 7B | Q4 | 3.5GB | 2.8GB | 12GB |
| Llama 3.1 8B | Q4 | 4GB | 3.2GB | 12-16GB |
| Llama 3.1 8B | FP16 | 16GB | 3.2GB | 24GB |
Hardware: RTX 4060 Ti 16GB handles most small models comfortably. RTX 4090 24GB provides headroom for larger batch sizes or longer contexts.
Use cases: Classification, extraction, simple Q&A, edge deployment, high-volume low-complexity tasks.
Medium Models (13B-34B)
| Model | Precision | Weights | KV Cache Budget (batch 8) | Total Recommended |
|---|---|---|---|---|
| Llama 2 13B | Q4 | 6.5GB | 5GB | 16-24GB |
| Qwen 2.5 32B | Q4 | 16GB | 12.8GB | 40-48GB |
| DeepSeek-V2-Lite (16B total, 2.4B active) | Q4 | ~8GB | ~2GB (MLA) | 16GB |
Note: Mixture-of-Experts models must keep all expert weights in VRAM; the active parameter count reduces compute per token, not memory.
Hardware: RTX 4090 24GB for smaller medium models. Dual RTX 4090s or single A100-40GB for 32B class.
Use cases: General assistants, coding, document analysis, moderate-complexity reasoning.
Large Models (65B-72B)
| Model | Precision | Weights | KV Cache Budget (batch 8) | Total Recommended |
|---|---|---|---|---|
| Llama 3.1 70B | Q4 | 35GB | 28GB | 80GB |
| Qwen 2.5 72B | Q4 | 36GB | 29GB | 80GB |
| Llama 3.3 70B | FP16 | 140GB | 28GB | 192GB+ |
Hardware: A100-80GB minimum for Q4 at production batch sizes. Dual A100-80GB or H100-80GB for FP16 or higher throughput.
Use cases: Complex reasoning, nuanced generation, tasks where quality matters more than latency.
Frontier Models (100B+)
| Model | Precision | Weights | Notes |
|---|---|---|---|
| Llama 3.1 405B | Q4 | 200GB+ | Requires 4-8× A100-80GB |
| DeepSeek-V3 (671B total, 37B active) | Q4 | ~335GB | MoE: all experts must stay resident; 37B active per token saves compute, not memory |
Hardware: Multi-node GPU clusters. H100 NVLink configurations. Enterprise infrastructure.
Throughput: From Memory to Capacity
Knowing a model fits tells you nothing about how many users it can serve. Throughput depends on batch efficiency and memory bandwidth.
Tokens Per Second Estimation
LLM inference is memory-bandwidth bound. The theoretical maximum throughput:
Max Tokens/Second ≈ Memory Bandwidth (GB/s) / Model Size (GB)
| GPU | Memory Bandwidth | 7B Model (Q4) | 70B Model (Q4) |
|---|---|---|---|
| RTX 4090 | 1,008 GB/s | ~288 tok/s | ~29 tok/s |
| A100-80GB | 2,039 GB/s | ~583 tok/s | ~58 tok/s |
| H100-80GB | 3,352 GB/s | ~958 tok/s | ~96 tok/s |
These are theoretical maximums at batch size 1. Real-world throughput depends heavily on batching efficiency.
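The roofline estimate above is trivial to compute; a sketch reproducing the table values:

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Memory-bandwidth roofline: each generated token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

print(round(max_tokens_per_second(1008, 3.5)))  # 288 (RTX 4090, 7B Q4)
print(round(max_tokens_per_second(2039, 35)))   # 58  (A100-80GB, 70B Q4)
```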
Batch Size Impact
Larger batches amortize the cost of loading model weights across more requests, dramatically improving throughput per GPU:
| Batch Size | RTX 4090 (7B Q4) | A100-80GB (70B Q4) |
|---|---|---|
| 1 | 40 tok/s | 12 tok/s |
| 4 | 140 tok/s | 42 tok/s |
| 8 | 240 tok/s | 70 tok/s |
| 16 | 350 tok/s | 95 tok/s |
| 32 | 420 tok/s | 110 tok/s |
Numbers vary by framework. vLLM with continuous batching achieves 2-4× higher throughput than static batching at the same batch size.
Users to Throughput Math
Convert business requirements to infrastructure needs:
Assumptions for a typical chat application:
- Average request: 200 input tokens + 300 output tokens
- Target latency: 5 seconds for complete response
- Concurrent users: peak simultaneous requests
Required Throughput (tok/s) = Concurrent Users × Output Tokens / Target Latency
Example: 50 concurrent users, 300 output tokens, 5-second target = 3,000 tok/s required
| Concurrent Users | Output Tokens | Target Latency | Required tok/s | GPU Configuration (70B Q4) |
|---|---|---|---|---|
| 10 | 300 | 5s | 600 | 2× A100-80GB |
| 50 | 300 | 5s | 3,000 | 8× A100-80GB or 4× H100 |
| 100 | 300 | 5s | 6,000 | 16× A100-80GB or 8× H100 |
| 200 | 500 | 3s | ~33,000 | Multi-node cluster |
Configurations assume roughly 300-400 aggregate tok/s per A100-80GB for a 70B Q4 model with continuous batching; benchmark your own stack before committing.
For smaller models at lower precision, these numbers improve dramatically. A 7B model at Q4 on a single RTX 4090 can often handle dozens of concurrent users for typical chat workloads, provided latency targets are modest.
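The users-to-GPUs arithmetic can be sketched directly. The 400 tok/s per-GPU aggregate used in the example is an assumption for illustration; substitute your own benchmark numbers.

```python
import math

def required_tokens_per_second(concurrent_users: int, output_tokens: int,
                               target_latency_s: float) -> float:
    """Each user needs output_tokens within target_latency_s, all in parallel."""
    return concurrent_users * output_tokens / target_latency_s

def gpus_needed(required_tps: float, per_gpu_tps: float) -> int:
    """Round up: you cannot provision a fraction of a GPU."""
    return math.ceil(required_tps / per_gpu_tps)

req = required_tokens_per_second(50, 300, 5)
print(req)                    # 3000.0
print(gpus_needed(req, 400))  # 8 (assumed 400 aggregate tok/s per GPU)
```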
Self-Hosting Economics
The decision between self-hosting and APIs comes down to utilization and volume.
Cost Per Token Comparison
| Option | Cost per 1M Tokens (Input) | Cost per 1M Tokens (Output) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Self-hosted 7B (H100 spot) | ~$0.10-0.20 | ~$0.10-0.20 |
| Self-hosted 70B (A100 cluster) | ~$0.50-2.00 | ~$0.50-2.00 |
Self-hosted costs assume amortized hardware or spot instance pricing at high utilization.
Break-Even Analysis
The crossover point depends on daily token volume and utilization rate. Using blended prices from the table above (weighted 40% input / 60% output, matching a 200-in / 300-out chat request):

Self-hosting 7B model (single H100 spot instance at ~$2/hour):
- Monthly infrastructure cost: ~$1,500 (including overhead)
- Throughput capacity: several thousand tok/s with continuous batching, roughly 100M+ tokens/day at full utilization
- Break-even vs GPT-4o (~$7.00 blended per 1M): ~7M tokens/day
- Break-even vs GPT-4o mini (~$0.42 blended per 1M): ~120M tokens/day — near the capacity ceiling, so mini is hard to beat on price alone

Self-hosting 70B model (2× A100-80GB at ~$4/hour):
- Monthly infrastructure cost: ~$3,000
- Throughput capacity: roughly 500-1,000 tok/s with continuous batching
- Break-even vs GPT-4o: ~14M tokens/day
- Break-even vs Claude 3.5 Sonnet (~$10.20 blended per 1M): ~10M tokens/day

| Daily Token Volume | Best Option |
|---|---|
| < 5M | API (GPT-4o mini or Haiku) |
| 5M - 20M | Hybrid (simple queries local, complex to API) |
| 20M - 100M | Self-hosted 7B-13B |
| > 100M | Self-hosted, potentially larger models |

These thresholds assume a self-hosted open model is acceptable for the workload; if only a frontier model will do, the comparison is about quality, not just price.
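One way to run the break-even arithmetic yourself. The 40/60 input/output blend is an assumption matching a typical 200-in / 300-out chat request; adjust it to your traffic mix.

```python
def blended_price_per_million(input_price: float, output_price: float,
                              input_frac: float = 0.4) -> float:
    """Blend per-1M-token API prices by the input/output traffic mix."""
    return input_frac * input_price + (1 - input_frac) * output_price

def break_even_tokens_per_day(monthly_infra_cost: float,
                              api_price_per_million: float,
                              days: float = 30) -> float:
    """Daily volume at which API spend equals the self-hosted monthly bill."""
    return monthly_infra_cost / api_price_per_million * 1e6 / days

gpt4o = blended_price_per_million(2.50, 10.00)  # $7.00 blended per 1M
print(round(break_even_tokens_per_day(1500, gpt4o) / 1e6, 1))  # ~7.1M tokens/day
```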
For detailed guidance on the self-hosting decision, including setup walkthroughs and tool comparisons, see our complete self-hosted LLM guide.
Hidden Costs
Infrastructure cost isn't just hardware:
| Cost Category | Monthly Estimate (Production Deployment) |
|---|---|
| GPU compute (4× A100-80GB) | $8,000-12,000 |
| Engineering time (0.25 FTE) | $3,000-5,000 |
| Power and cooling (on-prem) | $500-1,000 |
| Monitoring and observability | $200-500 |
| Storage (models, logs) | $100-300 |
| Total | $12,000-19,000 |
Compare against equivalent API spend. If your projected API bill exceeds $15K/month with consistent utilization, self-hosting likely makes sense. Below $5K/month, APIs are almost always more cost-effective when accounting for engineering overhead.
Multi-GPU Scaling
When single-GPU capacity isn't enough, you have two scaling strategies.
Tensor Parallelism (Model Sharding)
Split the model across GPUs. Each GPU holds a portion of every layer, and GPUs communicate during forward passes.
- Use when: Model doesn't fit on single GPU (70B+ at FP16)
- Scaling: Near-linear up to 4-8 GPUs, diminishing returns beyond
- Requirement: High-bandwidth interconnect (NVLink recommended)
- Latency impact: Minimal per-request, communication overhead ~5-15%
```python
# vLLM with tensor parallelism
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    tensor_parallel_size=4,  # Split across 4 GPUs
    dtype="float16",
)
```
Pipeline Parallelism (Layer Sharding)
Split layers across GPUs sequentially. GPU 1 processes layers 1-20, GPU 2 processes layers 21-40, etc.
- Use when: Memory is the bottleneck, throughput is secondary
- Scaling: Works across nodes, tolerates lower interconnect bandwidth
- Latency impact: Higher per-request (sequential dependency)
- Throughput: Lower than tensor parallelism for single requests, but scales batches well
Data Parallelism (Replicas)
Run multiple model copies, route requests across them.
- Use when: Throughput is the bottleneck, model fits on single GPU
- Scaling: Linear with GPU count
- Management: Load balancer required, simpler than model parallelism
- Cost: Most efficient for high-volume deployments
For most production deployments, data parallelism (replicas) combined with efficient single-GPU inference (vLLM) provides the best throughput per dollar.
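The routing logic for replicas can be as simple as round-robin. A minimal sketch (the replica addresses are hypothetical; a production balancer would also weigh queue depth and health checks):

```python
from itertools import cycle

class ReplicaRouter:
    """Round-robin request routing over identical model replicas."""

    def __init__(self, replica_urls: list[str]):
        self._replicas = cycle(replica_urls)

    def pick(self) -> str:
        # Each call returns the next replica in rotation.
        return next(self._replicas)

router = ReplicaRouter(["gpu-node-0:8000", "gpu-node-1:8000"])
print([router.pick() for _ in range(4)])  # alternates between the two replicas
```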
Production Checklist
Before deploying:
Memory Validation
- [ ] Model loads with 20% VRAM headroom
- [ ] KV cache calculated for max batch size × max context
- [ ] Tested under sustained load for 30+ minutes
- [ ] OOM errors handled gracefully (request rejection, not crash)
Throughput Verification
- [ ] Benchmarked at expected batch sizes
- [ ] Latency P50/P95/P99 meet SLA targets
- [ ] Throughput sustains during peak traffic simulation
- [ ] Continuous batching enabled (vLLM, TensorRT-LLM)
Reliability
- [ ] Health checks configured
- [ ] Auto-restart on failure
- [ ] Monitoring for VRAM usage, throughput, latency
- [ ] Alerting for degraded performance
Cost Controls
- [ ] Utilization monitoring (target 60%+ for cost efficiency)
- [ ] Auto-scaling policies if using cloud
- [ ] Spot instance handling (checkpointing, graceful migration)
Quick Reference Tables
VRAM by Model Size (Q4 Quantization)
| Model Size | Weights | Recommended VRAM (batch 8, 8K context) |
|---|---|---|
| 3B | 1.5GB | 8GB |
| 7B | 3.5GB | 12-16GB |
| 13B | 6.5GB | 24GB |
| 32B | 16GB | 48GB |
| 70B | 35GB | 80GB |
GPU Recommendations by Use Case
| Use Case | Model Class | Recommended GPU | Approximate Cost |
|---|---|---|---|
| Development/Testing | 7B | RTX 4060 Ti 16GB | $450 |
| Personal/Small Team | 7B-13B | RTX 4090 24GB | $1,600 |
| Production (low traffic) | 13B-32B | A10G 24GB | $1.50/hr (cloud) |
| Production (medium traffic) | 70B | A100-80GB | $3.50/hr (cloud) |
| Production (high traffic) | 70B+ | 4× A100-80GB | $14/hr (cloud) |
| Enterprise | 70B+ | H100 cluster | Custom pricing |
Serving Framework Comparison
| Framework | Throughput | Memory Efficiency | Ease of Use | Best For |
|---|---|---|---|---|
| vLLM | Excellent | Excellent (PagedAttention) | Moderate | Production |
| TensorRT-LLM | Excellent | Good | Complex | Maximum performance |
| Ollama | Good | Good | Easy | Development |
| HuggingFace TGI | Good | Good | Easy | Quick deployment |
For teams that want production-grade infrastructure without managing the stack, Prem's self-hosted platform handles optimization, scaling, and monitoring while keeping data on your infrastructure.
Summary
Sizing LLM infrastructure requires thinking beyond model weights:
- Calculate complete memory: Weights + KV cache + activations + overhead
- Size for production batch sizes: Single-request memory is misleading
- Match throughput to users: Convert concurrent users to required tokens/second
- Factor total cost: Hardware, engineering, power, not just GPU hours
- Break-even analysis: Self-hosting wins only at sustained multi-million-token daily volumes
The common mistake is sizing for "can it run" rather than "can it serve." A model that loads fine in testing may fail under production load when KV cache for concurrent requests exceeds available VRAM.
Start with clear requirements: concurrent users, target latency, acceptable cost per request. Work backward to infrastructure. Benchmark under realistic load before committing to hardware purchases or cloud reservations.
For managed infrastructure that handles the optimization complexity, Prem Studio provides fine-tuning, evaluation, and deployment as an integrated pipeline. You focus on the model; the platform handles the infrastructure math.