LLM Latency Optimization: From 5s to 500ms (2026)
Why your LLM is slow and how to fix it. TTFT reduction, quantization benchmarks, prefix caching, model selection, hardware sizing. From 5s to 500ms in practice.
A 5-second response time kills user trust. Research on interactive AI applications consistently shows abandonment spikes above 2 seconds for time-to-first-token. Most teams hit this problem and reach for the wrong fix first.
The mistake is treating LLM latency as a single number. It is two separate problems that require different solutions. Getting this distinction right is what separates teams that go from 5s to 500ms from teams that spend weeks optimizing the wrong thing.
TTFT and ITL: Two Problems, Two Optimization Stacks
Time to First Token (TTFT) is how long the user waits before seeing any output. It covers network round-trip, prompt processing (the prefill phase), and queuing delay. Prefill is compute-bound: the model processes all input tokens in parallel, which is GPU-FLOPs-limited.
Inter-Token Latency (ITL), sometimes called Time Per Output Token (TPOT), is the gap between each generated token after the first. Decode is memory-bound: each step loads the full model weights from GPU memory to generate one token. The GPU sits mostly idle on compute while waiting for memory transfers.
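A back-of-envelope check makes the memory-bound claim concrete: if every decode step must stream the full weights from GPU memory once, then memory bandwidth divided by model size bounds tokens per second. A minimal sketch (the 3,350 GB/s figure assumes H100 SXM HBM3; the function name is illustrative):

```python
def decode_ceiling_tokens_per_sec(model_params_b: float, bytes_per_param: float,
                                  mem_bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed for a memory-bound model:
    every decode step streams all weights from GPU memory once."""
    model_bytes_gb = model_params_b * bytes_per_param
    return mem_bandwidth_gb_s / model_bytes_gb

# Illustrative: 70B model in FP16 (2 bytes/param) on an H100 SXM (~3,350 GB/s)
ceiling = decode_ceiling_tokens_per_sec(70, 2, 3350)
print(f"{ceiling:.0f} tokens/sec ceiling, ~{1000 / ceiling:.0f} ms ITL floor")
# → 24 tokens/sec ceiling, ~42 ms ITL floor
```

That ~42ms floor is why halving weight bytes (quantization) or adding aggregate bandwidth (tensor parallelism) are the levers for ITL.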
These require different fixes:
| Problem | Root cause | Fix |
|---|---|---|
| High TTFT | Slow prefill, long prompts, cold cache | Prefix caching, chunked prefill, prompt optimization |
| High ITL | Memory bandwidth, large model | Quantization, speculative decoding, tensor parallelism |
| High end-to-end latency | Both, plus network | Streaming, model selection, hardware |
Most guides conflate them. A team optimizing ITL when their bottleneck is TTFT (slow first response on long-context RAG queries) will spend real effort for minimal user-perceived improvement.
Before Optimizing: Measure the Right Things
Every optimization decision should start from measured baselines, not intuition.
import time
import httpx

async def measure_latency(prompt: str):
    ttft = None
    tokens = 0
    start = time.perf_counter()
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            "http://localhost:8000/v1/chat/completions",
            json={
                "model": "your-model",
                "messages": [{"role": "user", "content": prompt}],
                "stream": True,
            },
        ) as response:
            async for chunk in response.aiter_lines():
                # Each SSE data line approximates one generated token
                if chunk.startswith("data: ") and chunk != "data: [DONE]":
                    if ttft is None:
                        ttft = time.perf_counter() - start
                    tokens += 1
    total = time.perf_counter() - start
    if ttft is None:
        return None  # no tokens came back
    itl = (total - ttft) / max(tokens - 1, 1)
    return {
        "ttft_ms": ttft * 1000,
        "itl_ms": itl * 1000,
        "total_ms": total * 1000,
        "tokens": tokens,
        "tokens_per_sec": tokens / total,
    }
Run this across your actual production query distribution. P50 and P95 numbers tell you different things. P50 drives median user experience. P95 shows where users occasionally hit bad delays, often from long prompts or queue buildup.
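Turning raw per-request measurements into those percentiles takes only the standard library. A minimal sketch, assuming you have collected per-request TTFT (or ITL) samples in milliseconds:

```python
import statistics

def summarize(samples_ms: list[float]) -> dict:
    """P50/P95 from per-request latency samples in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": statistics.median(samples_ms), "p95": cuts[94]}

# Illustrative distribution: 90 fast requests plus a long-context tail
samples = [300.0] * 90 + [2500.0] * 10
print(summarize(samples))
```

In this example P50 stays at 300ms while P95 sits at 2,500ms: the median user is fine, but the tail is broken, and each number points at a different fix.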
Knowing which number is your problem determines which optimization layer to attack first.
The Optimization Stack: Order Matters
These techniques compound. Apply them in this order. Each one improves your baseline for the next.
Layer 1: Prompt structure and output limits (free, immediate)
Layer 2: Streaming (free, perceived latency only)
Layer 3: Model selection (potentially free)
Layer 4: Quantization (moderate effort, high ROI)
Layer 5: KV cache and prefix caching (moderate effort, high ROI on RAG/multi-turn)
Layer 6: Speculative decoding (moderate effort, 2-3x ITL improvement)
Layer 7: Hardware and parallelism (high cost, diminishing ROI if earlier layers are done)
Teams that skip to Layer 7 before Layer 3-4 are spending $30k+ on hardware to compensate for fixable software problems.
Layer 1: Prompt Structure
Prompt length directly drives TTFT. Every token in your system prompt costs prefill compute on every request.
Two high-ROI changes that cost nothing:
Put static content first. System prompts, instructions, and context that do not change between requests should sit at the beginning of the prompt. Dynamic content (user messages, RAG results, conversation history) should come after. This layout maximizes prefix cache hit rate, which can cut TTFT dramatically. One benchmark saw a 10,000-token prompt's TTFT drop from 4.3 seconds to 0.6 seconds on cache hit.
Cap output length aggressively. Many deployments use default max_tokens of 2048 or 4096. If your actual P95 output is 300 tokens, you are paying the memory cost of the full KV cache allocation for nothing. Set max_tokens to your realistic P99 output length. For classification, summarization, or structured extraction tasks, this alone can cut end-to-end latency by 30-60%.
# Before: open-ended generation
response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=messages,
    max_tokens=2048,  # almost never needed
)

# After: bounded to your actual use case
response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=messages,
    max_tokens=300,  # based on P99 of your real outputs
    stop=["\n\n", "###"],  # domain-specific stop sequences
)
For RAG applications, trim retrieved context before injecting it. Sending 8,000 tokens of retrieved text when 2,000 is sufficient doubles your prefill cost without improving output quality.
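Trimming can be as simple as a token budget applied to ranked chunks. A sketch with an assumed ~4-characters-per-token heuristic (swap in your real tokenizer, e.g. tiktoken, for production):

```python
def trim_context(chunks: list[str], budget_tokens: int,
                 count_tokens=lambda s: len(s) // 4) -> list[str]:
    """Keep highest-ranked retrieved chunks until the token budget is hit.
    Assumes `chunks` is already sorted by retrieval score, best first.
    The default counter is a crude ~4-chars-per-token heuristic."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Keeping the best chunks whole (rather than truncating mid-chunk) preserves coherent passages for the model while holding prefill cost to the budget.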
Layer 2: Streaming
Streaming does not reduce total latency. It reduces perceived latency: the moment the user sees something happening.
Without streaming, a 3-second response shows nothing for 3 seconds, then dumps all output at once. With streaming, the user sees the first token in 300ms and reads as the model generates. Subjectively, this feels 5-10x faster even though wall-clock time is identical.
# vLLM / OpenAI-compatible streaming
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

stream = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Every production LLM application should stream. There is no valid reason not to once TTFT is reasonable. The implementation cost is one afternoon.
Layer 3: Model Selection
The fastest optimization is using a smaller model for tasks that do not need a large one. Teams routinely run 70B models on tasks where a fine-tuned 7B-13B model would match quality.
Latency scales roughly with parameter count on the same hardware, but not linearly. Databricks' inference benchmarks show MPT-30B at roughly 2.5x the latency of MPT-7B, and Llama 2-70B at roughly 2x the latency of Llama 2-13B. So going from 70B to a fine-tuned 8B can yield 2-4x latency improvement while often matching task-specific quality.
This is especially true for structured outputs, classification, domain Q&A, and document extraction. A fine-tuned small model trained on your specific task routinely beats a generic 70B at a fraction of the latency cost.
For routing mixed workloads, consider tiered model selection:
def select_model(query: str, context_len: int) -> str:
    # is_simple_query is a placeholder for your own heuristic or classifier
    # Simple queries → fast small model
    if context_len < 500 and is_simple_query(query):
        return "llama-3.2-3b-finetuned"
    # Complex reasoning → medium model
    elif context_len < 4000:
        return "llama-3.1-8b-instruct"
    # Long context or complex tasks → full model
    else:
        return "llama-3.1-70b-instruct"
Small models deployed efficiently on appropriate hardware consistently beat large models on latency-per-dollar.
Layer 4: Quantization
Quantization reduces model weight precision, which cuts memory footprint and speeds up memory-bound decode. It is one of the highest-ROI optimizations available because the quality tradeoff is minimal at FP8 and INT8.
FP8 is the current production standard for H100/L40s hardware. FP8 W8A8 is essentially lossless. Multiple studies confirm less than 0.5% accuracy degradation on standard benchmarks. Benchmarks on Mistral 7B with TensorRT-LLM on H100 show FP8 delivering 33% faster decode throughput versus FP16, with 8.5% lower TTFT. When combined with speculative decoding, the AMD MI300X benchmark showed 3.6x total improvement on Llama 3.1-405B.
INT4 is for memory-constrained scenarios or when you need to maximize concurrent users. Qwen3-32B on a single H100: BF16 uses 61GB and supports 4 concurrent users at 4,096-token context. INT4 drops to 18GB and supports 47 concurrent users at the same context length. INT4 shows roughly 1.9% accuracy drop on MMLU-Pro versus BF16, which is generally acceptable for production.
The accuracy picture: FP8 W8A8 is effectively lossless. INT8 W8A8 shows 1-3% degradation with proper tuning. INT4 W4A16 is competitive with INT8 on most benchmarks. You do not need FP16 for production inference unless you are running extremely precision-sensitive tasks.
# FP8 quantization with vLLM (H100/L40s)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8 \
--max-model-len 8192 \
-tp 2
# Or use a pre-quantized checkpoint
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--max-model-len 8192 \
-tp 2
# INT4 with AWQ + Marlin kernel (best performance)
vllm serve Qwen/Qwen3-32B-Instruct-AWQ \
--quantization marlin \
--max-model-len 8192
One non-obvious point: the kernel matters as much as the quantization method. Marlin-AWQ on H200 achieves 741 tokens/sec versus standard AWQ at 68 tokens/sec for Qwen2.5-32B. Same weights, different kernel. 10.9x throughput difference. Always use Marlin kernels for INT4 production deployments on supported hardware.
Hardware requirement: FP8 compute requires NVIDIA Ada Lovelace (RTX 4000 series) or Hopper (H100) GPUs. A100 GPUs support FP8 weight storage but not FP8 matrix multiply, so they get fewer benefits. Plan hardware accordingly.
Layer 5: KV Cache and Prefix Caching
The KV cache stores computed attention states (keys and values) for previously seen tokens, so the model does not recompute them on every decode step. Prefix caching extends this by saving the KV states for entire prompt prefixes across requests.
If two requests share the same system prompt and first few turns of conversation, prefix caching means the second request skips computing that shared prefix entirely. The TTFT for the cached portion becomes a memory read instead of a GPU compute operation.
Real measured impact: a single 10,000-token prompt's TTFT dropped from 4.3 seconds to 0.6 seconds on cache hit in a Qwen3-32B benchmark. For RAG applications with consistent system prompts and retrieved context, this is one of the largest latency wins available.
# Enable automatic prefix caching in vLLM
vllm serve your-model \
--enable-prefix-caching \
--max-model-len 8192 \
-tp 2
# Enable chunked prefill (reduces TTFT by up to 30% for long prompts)
vllm serve your-model \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 2048 \
-tp 2
Chunked prefill breaks long prompts into smaller chunks processed incrementally. This allows the model to start generating while still processing late chunks of a long prompt. vLLM's documentation shows up to 30% TTFT reduction and 1.4x ITL improvement from chunked prefill on long-context workloads.
For multi-turn applications: Structure prompts so the longest stable prefix comes first. If your system prompt is 2,000 tokens and conversation history is 3,000 tokens, put the system prompt before the history. This maximizes the prefix cache hit on the stable portion across all users.
For RAG applications: If many queries hit the same documents, the retrieved context prefix will cache. This makes repeated queries against the same document set dramatically faster. The data retrieval patterns in enterprise AI often make prefix caching one of the highest-ROI infrastructure decisions.
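One way to enforce the static-prefix-first rule is to centralize message assembly in a single function. A hedged sketch (function and field names are illustrative, not a vLLM API):

```python
def build_messages(system_prompt: str, doc_context: str,
                   history: list[dict], user_msg: str) -> list[dict]:
    """Order content from most stable to most dynamic so the longest
    shared prefix stays byte-identical across requests (prefix caching
    matches exact token prefixes)."""
    return [
        # Stable across all requests: caches on every request
        {"role": "system", "content": system_prompt},
        # Stable per document set: caches across queries on the same docs
        {"role": "user", "content": f"Context:\n{doc_context}"},
        # Dynamic: conversation history, then the new message last
        *history,
        {"role": "user", "content": user_msg},
    ]
```

The key property is that nothing dynamic (timestamps, request IDs, user names) appears before the stable content, since a single differing token invalidates the rest of the prefix.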
KV cache offloading to CPU via LMCache extends this further. When the same long context is needed again after it has been evicted from GPU memory, loading from CPU RAM instead of recomputing from scratch has shown 10x TTFT reduction for 128,000-token context windows.
Layer 6: Speculative Decoding
Speculative decoding uses a small draft model to propose multiple tokens, then verifies them in a single target model pass. At typical acceptance rates (0.6-0.8), this reduces ITL by 2-3x without any quality change.
This is covered in depth in the speculative decoding guide, but the quick production setup:
# EAGLE3 speculative decoding on vLLM (best acceptance rate)
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.3-70B-Instruct \
-tp 4 \
--speculative-config '{
"model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
"num_speculative_tokens": 3,
"method": "eagle3",
"draft_tensor_parallel_size": 1
}'
Where it helps most: low-to-moderate concurrency (under 10 simultaneous requests), interactive applications, large models (70B+). Where it hurts: high-throughput batch inference, very high concurrency (40+ requests), short outputs under 50 tokens.
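The expected gain can be estimated before deploying. Under the idealized assumption of an i.i.d. per-token acceptance rate alpha with k draft tokens, expected tokens accepted per verification pass follows a truncated geometric series. This ignores draft-model overhead, so treat it as an upper bound:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass with
    per-token acceptance rate alpha and k drafted tokens (idealized:
    independent acceptances, zero draft-model cost)."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# At alpha=0.7 with 3 draft tokens:
print(round(expected_tokens_per_step(0.7, 3), 2))  # → 2.53 tokens per pass
```

At alpha=0.7 and k=3 you get roughly 2.5 tokens per pass, which is consistent with the 2-3x ITL improvements quoted above once draft overhead is subtracted.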
Layer 7: Hardware and Parallelism
Hardware is the most expensive and least efficient fix if earlier layers are not addressed. That said, hardware configuration choices matter significantly once you reach it.
Tensor parallelism splits weight matrices across GPUs, increasing aggregate memory bandwidth. Each added GPU reduces per-token latency. Going from 1 to 2 H100s for a 70B model typically cuts latency by 40-50%. The communication overhead is low because weights are split across GPUs for each forward pass. This is the right form of parallelism for latency optimization.
Pipeline parallelism splits the model vertically across GPUs (early layers on GPU 1, later layers on GPU 2). It improves throughput but adds inter-GPU communication latency per token. Avoid it for latency-sensitive deployments.
# 2x H100 tensor parallel for 70B model
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 \
--enable-prefix-caching \
--quantization fp8 \
--max-model-len 8192
H100 vs A100: At batch size 1, H100 cuts latency by roughly 36% versus A100-40GB. At batch size 16, the improvement grows to about 52%. If you are running A100s and hitting latency limits, H100 hardware is a meaningful upgrade, but only after software optimizations are maxed out.
The GPU memory math: For latency, you need the entire model in GPU memory with room for KV cache. A Llama 3.1-70B in BF16 requires ~140GB. That means 2x H100-80GB minimum for comfortable deployment. With FP8 quantization, it drops to ~70GB and fits on a single H100-80GB, which reduces both cost and latency (no inter-GPU communication).
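The sizing arithmetic above is worth keeping as a one-liner. A sketch (weights only; budget extra headroom for KV cache and activations on top, which is why ~140GB of weights wants 2x 80GB GPUs):

```python
def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Weights-only GPU memory footprint in GB. KV cache and activation
    headroom come on top of this number."""
    return params_b * bytes_per_param

# 70B model at different precisions
for label, bpp in [("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"70B @ {label}: ~{weights_gb(70, bpp):.0f} GB")
```

This reproduces the numbers in the text: ~140GB in BF16 (2x H100-80GB), ~70GB in FP8 (1x H100-80GB), ~35GB in INT4.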
Before/After: A Realistic Case Study
Take a team running a document Q&A application with these initial specs:
- Model: Llama 3.1-70B in BF16 on 2x A100-80GB
- No streaming
- 4,000-token system prompt with instructions + RAG context mixed together
- max_tokens = 2048
- No prefix caching, no quantization
Baseline measurements:
- TTFT: 4.8s (P50), 8.2s (P95 on long queries)
- ITL: 42ms per token
- End-to-end on 300-token response: ~17s
- Max concurrent users: ~4
After each optimization layer:
| Change | TTFT | ITL | Notes |
|---|---|---|---|
| Baseline | 4.8s | 42ms | 2x A100, BF16, no caching |
| + Streaming | 4.8s | 42ms | Perceived latency: first output at 4.8s |
| + Prompt restructuring (static prefix first) | 4.8s first, 0.6s on cache hit | 42ms | ~80% of requests hit prefix cache |
| + max_tokens = 350 | 4.8s / 0.6s cached | 42ms | End-to-end drops from 17s to 9.5s |
| + FP8 quantization | 4.1s / 0.5s cached | 28ms | 33% ITL improvement; fits on 1x H100 |
| + Chunked prefill | 3.2s / 0.5s cached | 26ms | 30% TTFT improvement on uncached requests |
| + Speculative decoding (EAGLE3) | 3.2s / 0.5s cached | 13ms | 2x ITL improvement |
Final state:
- Typical request (cache hit): TTFT 0.5s, end-to-end ~4.5s for 300 tokens
- Cold request: TTFT 3.2s, end-to-end ~7.2s
- ITL: 13ms (vs 42ms original)
- Hardware: 1x H100-80GB (vs 2x A100-80GB)
- Max concurrent users: 47 (vs 4 original)
The hardware cost dropped. The performance improved significantly. None of this required custom kernels or infrastructure changes beyond configuration.
The "5s to 500ms" headline is achievable on cached requests. Uncached cold-start requests to a 70B model will not hit 500ms end-to-end, but the user experience of a typical session (where most requests are cached) can absolutely reach sub-second perceived response times.
What Does Not Compound Well
A few honest notes on optimization limits:
Quantization and speculative decoding stack multiplicatively. FP8 + speculative decoding produced 3.6x total improvement in the AMD benchmark. These two work together.
Prefix caching and speculative decoding also stack. Better cache hits reduce TTFT. Speculative decoding reduces ITL. Different phases, independent improvements.
High concurrency breaks speculative decoding. Once you hit 20-30 simultaneous requests, speculative decoding's overhead starts exceeding its benefit. At that point, continuous batching and quantization do more.
Tensor parallelism has diminishing returns. Going from 1 to 2 GPUs for a 70B model meaningfully cuts latency. Going from 4 to 8 GPUs adds communication overhead that partially offsets the bandwidth gain. The improvement is real but sub-linear.
Model distillation is the long game. Training a smaller distilled model on your specific task can outperform all inference optimizations combined. A well-distilled 7B model often matches 70B quality on narrow tasks at 5-8x lower latency. This takes months to execute correctly but produces the most durable latency improvements.
Monitoring in Production
Optimizations degrade. Cache hit rates drop as query distribution shifts. Model updates invalidate cached prefixes. Quantization behavior changes with new model versions.
Track these metrics continuously:
# Key metrics to monitor via vLLM /metrics endpoint
metrics_to_watch = {
    "vllm:time_to_first_token_seconds": "P50, P95, P99 - your user experience",
    "vllm:time_per_output_token_seconds": "ITL - decode performance",
    "vllm:gpu_cache_usage_perc": "KV cache utilization - if >90%, OOM risk",
    "vllm:prefix_cache_hit_rate": "Cache efficiency - target >60% for RAG apps",
    "vllm:spec_decode_draft_acceptance_rate": "Speculative decoding health",
    "vllm:num_requests_waiting": "Queue depth - spikes mean capacity issues",
}
Set alerts on P95 TTFT crossing 2x your P50 baseline. That gap usually means a cache problem, a queue buildup, or a long-context outlier hitting the system. Catching these early prevents them from compounding.
For teams using custom fine-tuned models on enterprise data, also track that your quantized deployed model matches your evaluated baseline. FP8 quantization is lossless in theory, but it is worth running your domain evaluation suite on the quantized checkpoint before deploying, not after users start complaining.
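That pre-deploy check can be a simple gate in CI. A hedged sketch (the threshold and accuracy numbers are illustrative; plug in the output of your own evaluation harness):

```python
def quantization_gate(baseline_acc: float, quantized_acc: float,
                      max_drop: float = 0.005) -> bool:
    """Pass only if the quantized checkpoint stays within an allowed
    accuracy drop on your domain eval (0.5 points here; tune to your
    task's tolerance)."""
    return (baseline_acc - quantized_acc) <= max_drop

assert quantization_gate(0.912, 0.910)       # 0.2-point drop: ship it
assert not quantization_gate(0.912, 0.885)   # 2.7-point drop: investigate
```

Wiring this into the deploy pipeline makes "evaluate before users complain" the default rather than a manual step someone forgets.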
Quick Reference: Latency Optimization Decision Tree
Is TTFT your main problem?
├── YES → Is it a long prompt?
│ ├── YES → Enable prefix caching + chunked prefill
│ │ Restructure prompt (static prefix first)
│ │ Trim RAG context aggressively
│ └── NO → Check network latency + queue depth
│ Consider smaller model for task
└── NO → Is ITL your main problem?
├── YES → Apply FP8 quantization first
│ Then speculative decoding
│ Then tensor parallelism
└── NO → Check if streaming is enabled
Set max_tokens to realistic limit
Check if model is appropriately sized
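For teams that want the tree in a runbook script, it maps directly to a function (labels and return strings are illustrative):

```python
def first_fix(ttft_is_problem: bool, long_prompt: bool,
              itl_is_problem: bool) -> str:
    """The decision tree above, expressed as code."""
    if ttft_is_problem:
        if long_prompt:
            return "prefix caching + chunked prefill; restructure prompt; trim RAG context"
        return "check network latency and queue depth; consider a smaller model"
    if itl_is_problem:
        return "FP8 quantization, then speculative decoding, then tensor parallelism"
    return "enable streaming; cap max_tokens; right-size the model"
```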
FAQ
What is a good target for TTFT in a production chatbot?
Under 500ms is the UX threshold where responses feel immediate. Under 1s is acceptable for most users. Above 2s leads to measurable abandonment increases. With prefix caching enabled and a reasonable prompt structure, most 70B deployments can hit sub-1s TTFT for cached requests.
Does FP8 quantization require re-evaluation of my fine-tuned model?
Yes. Run your domain evaluation suite on the FP8 checkpoint before deploying. The quality loss is minimal on standard benchmarks but your domain task may be more sensitive. This is a 1-2 hour step worth taking.
Can I use all these optimizations together?
FP8 + prefix caching + chunked prefill + speculative decoding + streaming: yes, these all stack. The configuration example in the case study section combines all of them. The main thing to watch is speculative decoding under high concurrency (drop it above ~20 simultaneous users if ITL degrades).
What is the minimum hardware to run Llama 3.1-70B with good latency?
With FP8 quantization: 1x H100-80GB (model fits at ~70GB). Without: 2x H100-80GB or 2x A100-80GB. The 1x H100 FP8 configuration actually outperforms 2x A100 BF16 on latency due to faster memory bandwidth and no inter-GPU communication overhead.
My model is fine-tuned on proprietary data. Does prefix caching still help?
Yes. Prefix caching works on any prompt structure. If your system prompt and instructions are consistent across requests, they will cache regardless of whether the underlying model is fine-tuned. Structure your prompts accordingly.
When should I consider model distillation instead of inference optimization?
When your latency requirements cannot be met by a 70B model running on available hardware, or when you need sub-100ms TTFT at scale. Distillation is a 2-4 month project but produces a model 5-8x faster with maintained domain quality. Data distillation is the path when inference optimization hits its ceiling.