LLM Infrastructure Sizing: From Hardware Requirements to Production Capacity

A 70B model needs 35GB to load. Serving 50 concurrent users needs 80GB+. The gap is KV cache and batch size. Complete sizing guide for production LLM deployments.

Most VRAM calculators answer the wrong question. They tell you whether a model will load. They don't tell you whether it will serve your production traffic.

A Llama 3.1 70B model needs about 35GB to load at 4-bit quantization. But serving 50 concurrent users with 8K context windows? That pushes memory requirements past 80GB. The difference is KV cache, and most sizing guides ignore it.

This guide covers the complete memory equation, from model weights through KV cache and batch overhead. You'll learn to calculate throughput capacity, understand when self-hosting breaks even against APIs, and size infrastructure for actual production workloads rather than demo deployments.

The Complete Memory Equation

GPU memory during LLM inference splits into four components:

Total VRAM = Model Weights + KV Cache + Activations + Framework Overhead

Most discussions stop at model weights. Production deployments can't afford to.

Model Weights

The baseline. Every parameter needs storage, and the amount depends on precision.

| Precision | Bytes per Parameter | 7B Model | 70B Model |
|---|---|---|---|
| FP32 | 4 | 28GB | 280GB |
| FP16/BF16 | 2 | 14GB | 140GB |
| INT8 | 1 | 7GB | 70GB |
| INT4/Q4 | 0.5 | 3.5GB | 35GB |

The formula: VRAM (GB) = Parameters (B) × Bytes per Parameter

A 70B model at FP16 precision needs 140GB just for weights. That's two A100-80GB cards before you process a single token. Quantization to 4-bit cuts that to 35GB, fitting on a single A100 with room for inference overhead.

Quantization quality has improved dramatically. Modern GPTQ and AWQ methods preserve 95%+ of model quality at INT4. For most production use cases, the quality difference is imperceptible while memory savings are substantial.
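
The weights arithmetic is trivial but worth scripting when comparing precisions. A minimal helper (illustrative, not from any particular library):

```python
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """VRAM for model weights alone: parameters × bytes per parameter."""
    return params_billions * bytes_per_param

# 70B at FP16 vs. 4-bit quantization
print(weight_vram_gb(70, 2.0))   # 140.0 -- two A100-80GB cards
print(weight_vram_gb(70, 0.5))   # 35.0  -- fits a single A100-80GB
```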

KV Cache: The Hidden Memory Consumer

KV cache stores attention key-value pairs for every token in the context window. It grows with sequence length and batch size, and it's where most sizing estimates go wrong.

The formula:

KV Cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × sequence_length × batch_size × bytes_per_element

The factor of 2 covers keys and values. For classic multi-head attention, num_kv_heads × head_dim equals the hidden size, but grouped-query attention (GQA) shares each KV head across several query heads and shrinks the cache accordingly. Llama 3.1 70B (80 layers) uses GQA with 8 KV heads of dimension 128 against a hidden size of 8192, so its FP16 cache costs about 0.33MB per token:

| Context Length | Batch Size 1 | Batch Size 8 | Batch Size 32 |
|---|---|---|---|
| 2K tokens | 0.7GB | 5.4GB | 21.5GB |
| 8K tokens | 2.7GB | 21.5GB | 86GB |
| 32K tokens | 10.7GB | 86GB | 344GB |

At batch size 1, KV cache is manageable. At production batch sizes serving concurrent users, it dominates memory consumption. (An MHA model of the same shape would need 8× these figures.)

This is why a model that loads fine for testing fails under production load. The model weights fit. The KV cache for 32 concurrent requests doesn't.
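
To see why, compute the cache directly. A sketch of the per-request KV footprint, using Llama 3.1 70B's published shape (80 layers, grouped-query attention with 8 KV heads of dimension 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) × layers × kv_heads × head_dim per token.
    For classic multi-head attention, kv_heads × head_dim == hidden_size."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch / 1e9

# Llama 3.1 70B, FP16 cache, 8K context
print(round(kv_cache_gb(80, 8, 128, 8192, 1), 1))   # one request: ~2.7 GB
print(round(kv_cache_gb(80, 8, 128, 8192, 32), 1))  # 32 requests: ~85.9 GB
```

At 32 concurrent 8K requests, the cache alone exceeds what a single A100-80GB has left after loading 35GB of Q4 weights.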

KV Cache Optimization:

PagedAttention (used by vLLM) reduces KV cache waste from 60-80% to under 4% by managing memory in fixed-size blocks rather than pre-allocating for maximum sequence length. This optimization alone can double or triple the number of concurrent requests a given GPU configuration supports.

Activations

Intermediate computation results during the forward pass. These depend on batch size, sequence length, and model architecture. Typically 1-5GB for inference workloads, more for training.

Activations are usually the smallest component for inference, but they scale with batch size. At very large batch sizes (64+), activation memory becomes significant.

Framework Overhead

CUDA contexts, memory allocators, graph compilation, and runtime buffers. Expect 5-15% overhead on top of theoretical requirements. vLLM and TensorRT-LLM are more memory-efficient than naive HuggingFace implementations.

Sizing by Model Class

Practical recommendations based on real deployment patterns:

Small Models (1B-8B)

| Model | Precision | Weights | KV Cache (8K, batch 8) | Total Recommended |
|---|---|---|---|---|
| Llama 3.2 3B | Q4 | 1.5GB | 1.2GB | 8GB |
| Mistral 7B | Q4 | 3.5GB | 2.8GB | 12GB |
| Llama 3.1 8B | Q4 | 4GB | 3.2GB | 12-16GB |
| Llama 3.1 8B | FP16 | 16GB | 3.2GB | 24GB |

Hardware: RTX 4060 Ti 16GB handles most small models comfortably. RTX 4090 24GB provides headroom for larger batch sizes or longer contexts.

Use cases: Classification, extraction, simple Q&A, edge deployment, high-volume low-complexity tasks.

Medium Models (13B-34B)

| Model | Precision | Weights | KV Cache (8K, batch 8) | Total Recommended |
|---|---|---|---|---|
| Llama 2 13B | Q4 | 6.5GB | 5GB | 16-24GB |
| Qwen 2.5 32B | Q4 | 16GB | 12.8GB | 40-48GB |
| DeepSeek-V2 (21B active) | Q4 | 10.5GB | 8.4GB | 24-32GB |

Hardware: RTX 4090 24GB for smaller medium models. Dual RTX 4090s or single A100-40GB for 32B class.

Use cases: General assistants, coding, document analysis, moderate-complexity reasoning.

Large Models (65B-72B)

| Model | Precision | Weights | KV Cache (8K, batch 8) | Total Recommended |
|---|---|---|---|---|
| Llama 3.1 70B | Q4 | 35GB | ~21GB | 80GB |
| Qwen 2.5 72B | Q4 | 36GB | ~21GB | 80GB |
| Llama 3.3 70B | FP16 | 140GB | ~21GB | 192GB+ |

Hardware: A100-80GB minimum for Q4 at production batch sizes. Dual A100-80GB or H100-80GB for FP16 or higher throughput.

Use cases: Complex reasoning, nuanced generation, tasks where quality matters more than latency.

Frontier Models (100B+)

| Model | Precision | Weights | Notes |
|---|---|---|---|
| Llama 3.1 405B | Q4 | 200GB+ | Requires 4-8× A100-80GB |
| DeepSeek-V3 (671B total, 37B active) | Q4 | ~335GB | Mixture of Experts: every expert must stay resident in VRAM; MoE reduces compute per token (37B active), not weight memory |

Hardware: Multi-node GPU clusters. H100 NVLink configurations. Enterprise infrastructure.

Throughput: From Memory to Capacity

Knowing a model fits tells you nothing about how many users it can serve. Throughput depends on batch efficiency and memory bandwidth.

Tokens Per Second Estimation

LLM inference is memory-bandwidth bound. The theoretical maximum throughput:

Max Tokens/Second ≈ Memory Bandwidth (GB/s) / Model Size (GB)

| GPU | Memory Bandwidth | 7B Model (Q4) | 70B Model (Q4) |
|---|---|---|---|
| RTX 4090 | 1,008 GB/s | ~288 tok/s | ~29 tok/s |
| A100-80GB | 2,039 GB/s | ~583 tok/s | ~58 tok/s |
| H100-80GB | 3,352 GB/s | ~958 tok/s | ~96 tok/s |

These are theoretical maximums at batch size 1. Real-world throughput depends heavily on batching efficiency.
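
The bound is a one-liner. A hypothetical helper reproducing the table's figures:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Batch-1 decoding ceiling: each generated token streams every weight
    through the memory bus once, so bandwidth / model size bounds tok/s."""
    return bandwidth_gb_s / model_size_gb

print(round(max_tokens_per_sec(1008, 3.5)))  # RTX 4090, 7B Q4: ~288
print(round(max_tokens_per_sec(2039, 35)))   # A100-80GB, 70B Q4: ~58
```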

Batch Size Impact

Larger batches amortize the cost of loading model weights across more requests, dramatically improving throughput per GPU:

| Batch Size | RTX 4090 (7B Q4) | A100-80GB (70B Q4) |
|---|---|---|
| 1 | 40 tok/s | 12 tok/s |
| 4 | 140 tok/s | 42 tok/s |
| 8 | 240 tok/s | 70 tok/s |
| 16 | 350 tok/s | 95 tok/s |
| 32 | 420 tok/s | 110 tok/s |

Numbers vary by framework. vLLM with continuous batching achieves 2-4× higher throughput than static batching at the same batch size.

Users to Throughput Math

Convert business requirements to infrastructure needs:

Assumptions for a typical chat application:

  • Average request: 200 input tokens + 300 output tokens
  • Target latency: 5 seconds for complete response
  • Concurrent users: peak simultaneous requests

Required Throughput (tok/s) = Concurrent Users × Output Tokens / Target Latency

Example: 50 concurrent users, 300 output tokens, 5-second target = 3,000 tok/s required

| Concurrent Users | Output Tokens | Target Latency | Required tok/s | GPU Configuration (70B Q4) |
|---|---|---|---|---|
| 10 | 300 | 5s | 600 | 1× A100-80GB |
| 50 | 300 | 5s | 3,000 | 4× A100-80GB |
| 100 | 300 | 5s | 6,000 | 8× A100-80GB or 4× H100 |
| 200 | 500 | 3s | 33,000 | Multi-node cluster |

The GPU configurations assume continuous batching at high concurrency, which delivers far more aggregate throughput than the static batch-size figures above; benchmark your own stack before committing to a GPU count.

For smaller models at lower precision, these numbers improve dramatically. A 7B model at Q4 on a single RTX 4090 can handle 50+ concurrent users for typical chat workloads.
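
The conversion from users to GPUs can be sketched as below; the 750 tok/s per-GPU figure is an assumption standing in for your own benchmark, not a published number:

```python
import math

def required_tps(users: int, output_tokens: int, latency_s: float) -> float:
    """Aggregate decode throughput needed to hit the latency target."""
    return users * output_tokens / latency_s

def gpus_needed(tps: float, measured_tps_per_gpu: float) -> int:
    """Round up: a fractional GPU still means provisioning a whole one."""
    return math.ceil(tps / measured_tps_per_gpu)

tps = required_tps(50, 300, 5)   # 3000.0 tok/s
print(gpus_needed(tps, 750.0))   # 4 GPUs at an assumed 750 tok/s each
```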

Self-Hosting Economics

The decision between self-hosting and APIs comes down to utilization and volume.

Cost Per Token Comparison

| Option | Cost per 1M Tokens (Input) | Cost per 1M Tokens (Output) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Self-hosted 7B (H100 spot) | ~$0.05 | ~$0.05 |
| Self-hosted 70B (A100 cluster) | ~$0.50 | ~$0.50 |

Self-hosted costs assume amortized hardware or spot instance pricing at high utilization.

Break-Even Analysis

The crossover point depends on daily token volume and utilization rate.

Self-hosting 7B model (single H100 spot instance at $2/hour):

  • Monthly infrastructure cost: ~$1,500 (including overhead)
  • Throughput capacity: roughly 1M+ tokens/hour with continuous batching
  • Break-even vs GPT-4o (~$7 per 1M tokens, blended at the 200-in/300-out mix above): ~7M tokens/day
  • Break-even vs GPT-4o mini (~$0.40 per 1M blended): over 100M tokens/day, more than a single GPU can serve; a 7B rarely beats mini on price alone and wins instead on data control and latency

Self-hosting 70B model (2× A100-80GB at $4/hour):

  • Monthly infrastructure cost: ~$3,000
  • Throughput capacity: ~250,000 tokens/hour at batch size 8 (70 tok/s), higher with continuous batching
  • Break-even vs GPT-4o: ~14M tokens/day
  • Break-even vs Claude 3.5 Sonnet (~$10 per 1M blended): ~10M tokens/day

Both 70B break-even figures sit above batch-8 capacity, so the economics only work with continuous batching at high sustained utilization.

| Daily Token Volume | Best Option |
|---|---|
| < 500K | API (GPT-4o mini or Haiku) |
| 500K - 2M | Hybrid (simple queries local, complex to API) |
| 2M - 10M | Self-hosted 7B-13B |
| > 10M | Self-hosted, potentially larger models |

For detailed guidance on the self-hosting decision, including setup walkthroughs and tool comparisons, see our complete self-hosted LLM guide.

Hidden Costs

Infrastructure cost isn't just hardware:

| Cost Category | Monthly Estimate (Production Deployment) |
|---|---|
| GPU compute (4× A100-80GB) | $8,000-12,000 |
| Engineering time (0.25 FTE) | $3,000-5,000 |
| Power and cooling (on-prem) | $500-1,000 |
| Monitoring and observability | $200-500 |
| Storage (models, logs) | $100-300 |
| Total | $12,000-19,000 |

Compare against equivalent API spend. If your projected API bill exceeds $15K/month with consistent utilization, self-hosting likely makes sense. Below $5K/month, APIs are almost always more cost-effective when accounting for engineering overhead.
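
Rather than trusting anyone's break-even table, derive your own from the price list above. A sketch, where the 200-in/300-out request mix is the assumption carried over from the throughput section:

```python
def blended_price_per_million(input_price: float, output_price: float,
                              in_tokens: int = 200, out_tokens: int = 300) -> float:
    """Blend input/output API pricing for a typical request shape."""
    total = in_tokens + out_tokens
    return (input_price * in_tokens + output_price * out_tokens) / total

def breakeven_tokens_per_day(monthly_infra_cost: float,
                             api_price_per_million: float) -> float:
    """Daily volume at which self-hosting matches the API bill."""
    return monthly_infra_cost / 30 / api_price_per_million * 1e6

gpt4o = blended_price_per_million(2.50, 10.00)        # $7.00 per 1M tokens
print(round(breakeven_tokens_per_day(1500, gpt4o)))   # ~7.1M tokens/day
```

The result must also fall below your cluster's actual serving capacity, or the break-even point is unreachable in practice.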

Multi-GPU Scaling

When single-GPU capacity isn't enough, you have two scaling strategies.

Tensor Parallelism (Model Sharding)

Split the model across GPUs. Each GPU holds a portion of every layer, and GPUs communicate during forward passes.

  • Use when: Model doesn't fit on single GPU (70B+ at FP16)
  • Scaling: Near-linear up to 4-8 GPUs, diminishing returns beyond
  • Requirement: High-bandwidth interconnect (NVLink recommended)
  • Latency impact: Minimal per-request, communication overhead ~5-15%

# vLLM with tensor parallelism
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    tensor_parallel_size=4,  # Split across 4 GPUs
    dtype="float16"
)

Pipeline Parallelism (Layer Sharding)

Split layers across GPUs sequentially. GPU 1 processes layers 1-20, GPU 2 processes layers 21-40, etc.

  • Use when: Memory is the bottleneck, throughput is secondary
  • Scaling: Works across nodes, tolerates lower interconnect bandwidth
  • Latency impact: Higher per-request (sequential dependency)
  • Throughput: Lower than tensor parallelism for single requests, but scales batches well

Data Parallelism (Replicas)

Run multiple model copies, route requests across them.

  • Use when: Throughput is the bottleneck, model fits on single GPU
  • Scaling: Linear with GPU count
  • Management: Load balancer required, simpler than model parallelism
  • Cost: Most efficient for high-volume deployments

For most production deployments, data parallelism (replicas) combined with efficient single-GPU inference (vLLM) provides the best throughput per dollar.
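
A toy round-robin router shows the shape of the data-parallel path. Production deployments use a real load balancer (nginx, a Kubernetes Service) rather than this sketch, and the endpoint names here are made up:

```python
from itertools import cycle

class ReplicaRouter:
    """Minimal round-robin router over identical model replicas."""

    def __init__(self, endpoints):
        self._ring = cycle(endpoints)

    def pick(self):
        """Return the next replica endpoint to receive a request."""
        return next(self._ring)

router = ReplicaRouter(["gpu0:8000", "gpu1:8000", "gpu2:8000"])
print([router.pick() for _ in range(4)])  # wraps back to gpu0 on the 4th
```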

Production Checklist

Before deploying:

Memory Validation

  • [ ] Model loads with 20% VRAM headroom
  • [ ] KV cache calculated for max batch size × max context
  • [ ] Tested under sustained load for 30+ minutes
  • [ ] OOM errors handled gracefully (request rejection, not crash)
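
The last item, rejecting rather than crashing, amounts to simple admission control: before scheduling a request, check whether its projected KV cache fits the remaining budget. A sketch with assumed numbers (the 0.00033 GB/token figure corresponds to an FP16 GQA cache like Llama 3.1 70B's):

```python
def can_admit(active_kv_gb: float, request_max_tokens: int,
              kv_gb_per_token: float, kv_budget_gb: float) -> bool:
    """Admit a request only if its worst-case KV cache fits the budget,
    returning a clean rejection instead of risking an OOM crash."""
    projected = active_kv_gb + request_max_tokens * kv_gb_per_token
    return projected <= kv_budget_gb

# ~40GB KV budget left on an A100-80GB after 35GB of Q4 weights + overhead
print(can_admit(38.0, 8192, 0.00033, 40.0))  # False -> reject the request
print(can_admit(10.0, 8192, 0.00033, 40.0))  # True  -> safe to schedule
```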

Throughput Verification

  • [ ] Benchmarked at expected batch sizes
  • [ ] Latency P50/P95/P99 meet SLA targets
  • [ ] Throughput sustains during peak traffic simulation
  • [ ] Continuous batching enabled (vLLM, TensorRT-LLM)

Reliability

  • [ ] Health checks configured
  • [ ] Auto-restart on failure
  • [ ] Monitoring for VRAM usage, throughput, latency
  • [ ] Alerting for degraded performance

Cost Controls

  • [ ] Utilization monitoring (target 60%+ for cost efficiency)
  • [ ] Auto-scaling policies if using cloud
  • [ ] Spot instance handling (checkpointing, graceful migration)

Quick Reference Tables

VRAM by Model Size (Q4 Quantization)

| Model Size | Weights | Recommended VRAM (batch 8, 8K context) |
|---|---|---|
| 3B | 1.5GB | 8GB |
| 7B | 3.5GB | 12-16GB |
| 13B | 6.5GB | 24GB |
| 32B | 16GB | 48GB |
| 70B | 35GB | 80GB |

GPU Recommendations by Use Case

| Use Case | Model Class | Recommended GPU | Approximate Cost |
|---|---|---|---|
| Development/Testing | 7B | RTX 4060 Ti 16GB | $450 |
| Personal/Small Team | 7B-13B | RTX 4090 24GB | $1,600 |
| Production (low traffic) | 13B-32B | A10G 24GB | $1.50/hr (cloud) |
| Production (medium traffic) | 70B | A100-80GB | $3.50/hr (cloud) |
| Production (high traffic) | 70B+ | 4× A100-80GB | $14/hr (cloud) |
| Enterprise | 70B+ | H100 cluster | Custom pricing |

Serving Framework Comparison

| Framework | Throughput | Memory Efficiency | Ease of Use | Best For |
|---|---|---|---|---|
| vLLM | Excellent | Excellent (PagedAttention) | Moderate | Production |
| TensorRT-LLM | Excellent | Good | Complex | Maximum performance |
| Ollama | Good | Good | Easy | Development |
| HuggingFace TGI | Good | Good | Easy | Quick deployment |

For teams that want production-grade infrastructure without managing the stack, Prem's self-hosted platform handles optimization, scaling, and monitoring while keeping data on your infrastructure.

Summary

Sizing LLM infrastructure requires thinking beyond model weights:

  1. Calculate complete memory: Weights + KV cache + activations + overhead
  2. Size for production batch sizes: Single-request memory is misleading
  3. Match throughput to users: Convert concurrent users to required tokens/second
  4. Factor total cost: Hardware, engineering, power, not just GPU hours
  5. Break-even analysis: self-hosting wins only at sustained multi-million-token daily volumes; below that, APIs are usually cheaper once engineering time is counted

The common mistake is sizing for "can it run" rather than "can it serve." A model that loads fine in testing may fail under production load when KV cache for concurrent requests exceeds available VRAM.

Start with clear requirements: concurrent users, target latency, acceptable cost per request. Work backward to infrastructure. Benchmark under realistic load before committing to hardware purchases or cloud reservations.

For managed infrastructure that handles the optimization complexity, Prem Studio provides fine-tuning, evaluation, and deployment as an integrated pipeline. You focus on the model; the platform handles the infrastructure math.
