KV Cache Optimization: PagedAttention, Prefix Caching & Memory Management
KV cache optimization guide covering PagedAttention, prefix caching, FP8 quantization, and memory management. Practical strategies for production LLM inference.
Every token your LLM generates depends on every token that came before it. That dependency creates a memory problem that grows linearly with context length and batch size. The culprit is the KV cache.
Traditional inference systems waste 60-80% of allocated KV cache memory through fragmentation and over-allocation. A 70B model processing 8K context at batch size 32 needs roughly 640GB of KV cache with full multi-head attention, a number that exceeds the model weights themselves; even with grouped-query attention the cache runs close to 90GB.
This guide covers the optimization techniques that actually matter: PagedAttention for memory efficiency, prefix caching for latency reduction, and quantization for capacity gains. We will look at real implementations, not theory.
What the KV Cache Actually Does
Transformers generate text one token at a time. Each new token requires computing attention against all previous tokens. Without caching, generating 1,000 tokens means recomputing attention from scratch 1,000 times.
The KV cache stores key and value tensors from previous tokens so they can be reused. Each layer in the model maintains its own cache. The memory formula is straightforward:
KV Cache Size = 2 × num_layers × hidden_size × sequence_length × batch_size × bytes_per_param

For models with grouped-query attention, replace hidden_size with num_kv_heads × head_dim, since only the KV heads are cached.
For Llama 3.1-70B at FP16 precision, the full multi-head version of that formula works out to about 2.5MB per token: an 8K context window would consume roughly 20GB per request, and batch size 32 roughly 640GB just for caching. Grouped Query Attention (covered below) cuts those figures by 8x, but the cache remains one of the largest allocations on the GPU.
The LLaMA-2 7B model uses approximately 0.5MB per token. A 28,000-token context demands around 14GB, roughly equal to the memory needed for the model weights themselves.
This is why KV cache optimization matters more than almost any other inference optimization. The cache frequently becomes the binding constraint on throughput, context length, and batch size.
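As a sanity check, the formula above reproduces the LLaMA-2 7B figure (32 layers, hidden size 4096, standard multi-head attention, FP16):

```python
def kv_bytes_per_token(num_layers, hidden_size, bytes_per_param=2):
    # the factor of 2 accounts for storing both K and V at every layer
    return 2 * num_layers * hidden_size * bytes_per_param

per_token = kv_bytes_per_token(num_layers=32, hidden_size=4096)  # LLaMA-2 7B
print(per_token / 2**20)           # 0.5 MiB per token
print(per_token * 28_000 / 2**30)  # ~14 GiB for a 28,000-token context
```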
PagedAttention: How vLLM Fixed Memory Fragmentation
Traditional inference systems allocate KV cache as contiguous memory blocks. They reserve space for the maximum possible sequence length upfront. A 4K max context allocates 4K worth of cache even for 100-token requests, wasting 97.5% of reserved memory.
This leads to three types of waste:
Internal fragmentation happens when allocated slots go unused because the system cannot predict how many tokens the model will generate.
External fragmentation creates gaps between fixed memory blocks that cannot be reclaimed for other requests.
Reservation overhead forces the system to over-provision memory based on worst-case assumptions.
PagedAttention, introduced by Berkeley researchers in 2023, applies virtual memory concepts from operating systems to KV cache management. Instead of storing each conversation's cache as one giant contiguous block, PagedAttention breaks the KV cache into small fixed-size blocks (typically 16 tokens) that can be stored anywhere in memory.
Each request maintains a block table that maps logical sequence positions to physical memory locations. Sequences see continuous memory while physical storage remains non-contiguous.
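A toy sketch of that mapping, with hypothetical data structures rather than vLLM's actual internals:

```python
BLOCK_SIZE = 16  # tokens per block, vLLM's typical default

class BlockTable:
    """Maps logical block indices to physical block ids for one sequence."""
    def __init__(self):
        self.blocks = []

    def physical_location(self, token_pos):
        # logical token position -> (physical block id, offset within block)
        return self.blocks[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

free_pool = list(range(1000))  # ids of free physical blocks
table = BlockTable()

# blocks are allocated on demand as the sequence crosses block boundaries
for pos in range(40):  # a 40-token sequence needs 3 blocks
    if pos % BLOCK_SIZE == 0:
        table.blocks.append(free_pool.pop())

print(table.physical_location(35))  # (997, 3): offset 3 in the 3rd block
```

The sequence sees positions 0..39 as contiguous, while the three physical blocks can sit anywhere in GPU memory.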
The results speak for themselves. vLLM reduces KV cache waste from 60-80% down to under 4%. That efficiency gain translates directly to 2-4x throughput improvements compared to systems like HuggingFace Transformers or TGI.
```python
from vllm import LLM

# PagedAttention enabled by default in vLLM
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    max_model_len=32768,
)
```
Block allocation happens on demand. When a sequence completes, its blocks return to the free pool immediately. No more reserving memory for tokens that never get generated.
Automatic Prefix Caching: Skipping Redundant Computation
PagedAttention solves memory waste. Prefix caching solves computational waste.
Many workloads share common prefixes across requests. System prompts, few-shot examples, and long documents often repeat verbatim. Without prefix caching, the model recomputes KV tensors for these shared segments every time.
Automatic Prefix Caching (APC) in vLLM identifies when new queries share prefixes with cached requests and reuses those KV tensors directly. The new query skips computation of the shared part entirely.
Two workloads benefit most:
Long document queries where users repeatedly ask questions about the same document. Instead of processing a 50-page manual for every question, APC processes it once. All subsequent queries reuse that cached computation.
Multi-turn conversations where chat history accumulates across rounds. Rather than reprocessing the entire conversation history at each turn, APC maintains continuity through cache reuse.
Implementation in vLLM requires one flag:
```python
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)
```
Under the hood, vLLM hashes each KV cache block using the token sequence it contains. When a new request arrives, the system computes hashes for its blocks and checks for matches. Matching blocks get reused directly. The approach works because every block can be uniquely identified by hash(prefix_tokens, tokens_in_block).
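The hashing scheme can be sketched in a few lines (a simplified illustration, not vLLM's actual implementation):

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block

def block_hashes(token_ids):
    # hash each full block over the entire prefix up to and including it,
    # so two equal hashes imply two identical prefixes
    hashes = []
    for b in range(len(token_ids) // BLOCK_SIZE):
        prefix = token_ids[: (b + 1) * BLOCK_SIZE]
        digest = hashlib.sha256(",".join(map(str, prefix)).encode()).hexdigest()
        hashes.append(digest)
    return hashes

shared = list(range(32))                  # e.g. a shared system prompt
a = block_hashes(shared + [7, 8, 9] * 6)  # one request's 50-token prompt
b = block_hashes(shared + [1, 2, 3] * 6)  # another with the same prefix

# the first two blocks cover the shared prefix, so their cached KV
# tensors can be reused; the third block diverges and is recomputed
assert a[:2] == b[:2] and a[2] != b[2]
```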
APC primarily accelerates the prefill phase. It does not reduce decode time because that phase generates new tokens rather than reprocessing existing ones. Workloads with long outputs relative to shared prefixes see less benefit.
For enterprise AI fine-tuning pipelines that process the same evaluation datasets repeatedly, prefix caching can dramatically reduce total compute time.
KV Cache Quantization: Halving Memory With FP8
Reducing numerical precision cuts memory requirements in half. FP8 quantization stores KV tensors in 8-bit format instead of 16-bit, doubling effective cache capacity.
vLLM supports two FP8 formats:
FP8 E4M3 offers higher precision with 4 exponent bits and 3 mantissa bits. Its representable range tops out at ±448, requiring careful scale factor management.
FP8 E5M2 provides wider dynamic range at the cost of precision. Better for values with high variance.
Studies consistently show E4M3 quantization causes minimal accuracy degradation in practice. The tradeoff is compelling: 2x memory reduction with negligible quality loss.
```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    kv_cache_dtype="fp8",
    calculate_kv_scales=True,  # auto-calibrate scales during warmup
)
```
For production deployments, calibrating scales against representative data yields better results than runtime estimation. The llm-compressor library provides tooling for this:
```python
from llmcompressor import oneshot

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=calibration_ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```
NVIDIA's recent NVFP4 format pushes further, achieving 50% memory reduction versus FP8 with less than 1% accuracy loss on benchmarks like LiveCodeBench and MMLU-PRO. This requires Blackwell GPUs and represents the bleeding edge of KV cache compression.
For teams running self-hosted LLM deployments, FP8 quantization often provides the easiest path to doubling context length or batch size without hardware changes.
Platforms like Prem Studio handle these optimizations automatically when you deploy fine-tuned models. The infrastructure manages KV cache allocation, quantization settings, and memory configuration based on your workload patterns, so you can focus on model quality rather than inference engineering.
Memory Management Strategies
Beyond the core techniques, several strategies help manage KV cache at scale.
Grouped Query Attention
Llama 2 70B and newer models use Grouped Query Attention (GQA) instead of standard Multi-Head Attention. GQA shares key-value heads across multiple query heads, reducing the number of KV pairs stored.
Llama 2 70B uses 8 KV heads instead of 64, achieving an 8x reduction in KV cache size compared to equivalent MHA architecture. Without GQA, the 70B model would be impractical on most hardware for long-context inference.
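The arithmetic behind that 8x claim, using Llama 2 70B's published shape (80 layers, 128-dim heads, 64 query heads, 8 KV heads):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_param=2):
    # leading 2 accounts for K and V
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_param

mha = kv_bytes_per_token(80, 64, 128)  # hypothetical full multi-head variant
gqa = kv_bytes_per_token(80, 8, 128)   # the actual GQA configuration
print(mha / 2**20, gqa / 2**20)        # 2.5 MiB vs ~0.31 MiB per token
print(mha // gqa)                      # 8x reduction
```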
Cache Offloading
When GPU memory proves insufficient, offloading cache to CPU or SSD provides a fallback. LMCache integrates with vLLM to manage hierarchical caching across storage tiers.
```python
from lmcache import LMCacheEngine

cache_engine = LMCacheEngine(
    backend="cpu",
    max_gpu_cache_size="20GB",
    cpu_cache_size="100GB",
)
```
Latency impact ranges from 10-50ms per cache retrieval depending on transfer size. This makes offloading viable for prefill-heavy workloads but problematic for decode-intensive generation.
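That latency band is consistent with back-of-envelope PCIe arithmetic, assuming roughly 25 GB/s of effective host-to-device bandwidth (a rough figure for PCIe 4.0 x16; real rates vary by platform):

```python
def transfer_ms(cache_gb, effective_gb_per_s=25.0):
    # time to move a cache segment between CPU and GPU memory
    return cache_gb / effective_gb_per_s * 1000

print(transfer_ms(0.25))  # 10.0 ms for a 256MB segment
print(transfer_ms(1.0))   # 40.0 ms for a 1GB segment
```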
Combining LMCache with vLLM has shown 3-10x latency reductions in benchmarks where cache reuse is high.
Preemption Handling
When KV cache space runs short, vLLM can preempt lower-priority requests to free memory for others. Preempted requests get recomputed when capacity becomes available.
Watch for this warning in logs:
```
WARNING scheduler.py: Sequence group is preempted by PreemptionMode.RECOMPUTE
```
Frequent preemptions indicate undersized memory allocation. Solutions include increasing gpu_memory_utilization, adding tensor parallelism to distribute cache across GPUs, or reducing max batch size.
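As a sketch, the first two remedies map onto vLLM's standard constructor arguments (model name reused from earlier examples; requires multi-GPU hardware):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.95,  # more headroom for the KV cache
    tensor_parallel_size=2,       # spread the cache across two GPUs
)
```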
Practical Implementation Checklist
For production deployments, apply these optimizations in order:
- Enable PagedAttention by using vLLM. This is the highest-impact change, delivering 2-4x throughput improvement with no configuration required.
- Enable prefix caching if workloads share common prefixes. Set enable_prefix_caching=True and structure prompts so shared content appears at the start.
- Add FP8 quantization on Hopper/Ada GPUs or newer. Use calibrated scales for best accuracy.
- Monitor cache metrics through vLLM's Prometheus endpoint. Track cache utilization, hit rates, and eviction frequency.
- Configure alerting for cache utilization above 90%, hit rates below 50%, and elevated eviction rates.
For enterprise AI evaluation workflows that repeatedly process the same test sets, prefix caching alone can cut evaluation time by 60% or more.
When to Use What
| Optimization | Best For | Trade-offs |
|---|---|---|
| PagedAttention | All production workloads | None significant, use by default |
| Prefix Caching | Shared system prompts, document QA, multi-turn chat | Minimal overhead, disable only if no prefix sharing |
| FP8 KV Cache | Memory-bound deployments on modern GPUs | Small accuracy impact, requires calibration for best results |
| Cache Offloading | Extreme context lengths exceeding GPU memory | 10-50ms latency per retrieval |
| GQA Models | Long-context inference | Architecture must be baked into model |
Memory Calculation Reference
Quick reference for estimating KV cache requirements:
```python
def kv_cache_size_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    sequence_length: int,
    batch_size: int,
    bytes_per_param: int = 2,  # FP16
) -> int:
    # leading 2 for K and V matrices
    return (2 * num_layers * num_kv_heads * head_dim
            * sequence_length * batch_size * bytes_per_param)

# Llama 3.1 70B example
cache_bytes = kv_cache_size_bytes(
    num_layers=80,
    num_kv_heads=8,  # GQA reduces this from 64
    head_dim=128,
    sequence_length=8192,
    batch_size=32,
)
print(f"KV Cache: {cache_bytes / 1e9:.1f} GB")
# Output: KV Cache: 85.9 GB (with GQA)
# Without GQA (64 KV heads): ~687 GB
```
For teams working on model customization or fine-tuning pipelines, understanding these memory constraints helps with capacity planning and batch size selection.
What Comes Next
KV cache optimization continues evolving. Active research areas include:
Adaptive compression that varies precision based on layer importance. Some layers tolerate more aggressive quantization than others.
Entropy-guided caching that allocates budget based on attention patterns. Layers with broader attention distributions receive more cache, focused layers receive less.
Streaming LLM techniques that maintain attention sinks for theoretically unlimited generation with fixed memory. Quality degrades for tasks requiring long-range dependencies.
Cache-aware routing at cluster scale that directs requests with shared prefixes to the same replicas, maximizing hit rates across distributed deployments.
For most production deployments today, the combination of PagedAttention, prefix caching, and FP8 quantization addresses the immediate bottlenecks. These techniques are mature, well-supported in vLLM, and deliver measurable gains without exotic hardware or complex configuration.
Start with vLLM defaults. Add prefix caching for shared workloads. Enable FP8 when memory pressure demands it. Monitor, measure, and iterate from there.
If managing inference infrastructure sounds like more overhead than you need, Prem's deployment options handle KV cache optimization as part of the managed stack. You bring the model; the platform handles the memory engineering.