LLM Batching: Static vs Continuous and Why It Matters for Throughput
Static batching wastes GPU cycles waiting for slow requests. Continuous batching fixes this by scheduling per-iteration. Benchmarks and implementation inside.
Your GPU streams the full model weights from memory for every forward pass; for a 70B-parameter model in FP16, that is roughly 140GB. That transfer takes time. Batching multiple requests together means you load those weights once and apply them to many inputs simultaneously. More requests per weight load equals higher throughput.
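A back-of-envelope sketch makes the arithmetic concrete. The numbers below are illustrative, not measurements: 140GB of weights (roughly a 70B model in FP16) and 2TB/s of memory bandwidth (A100-class hardware).

```python
# Back-of-envelope: decode throughput vs batch size when memory-bandwidth-bound.
# Illustrative numbers, not measurements.
WEIGHT_BYTES = 140e9  # ~70B params in FP16
BANDWIDTH = 2e12      # bytes/s of HBM bandwidth

def decode_tokens_per_second(batch_size: int) -> float:
    """Each forward pass streams all weights once and yields one token
    per sequence in the batch, so throughput scales with batch size
    until the GPU becomes compute-bound."""
    seconds_per_pass = WEIGHT_BYTES / BANDWIDTH  # ~0.07 s per pass
    return batch_size / seconds_per_pass

print(decode_tokens_per_second(1))   # ~14 tok/s for a single sequence
print(decode_tokens_per_second(32))  # ~457 tok/s with 32 sequences batched
```

The same weight transfer serves 32 sequences instead of one, which is the entire case for batching.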
The problem shows up when requests finish at different times.
Static batching waits for the slowest request before starting the next batch. If one request generates 500 tokens while others need 20, those shorter requests sit idle. The GPU wastes cycles on padding.
Continuous batching removes that constraint. Requests join and leave the batch independently at every iteration. A request that finishes after 20 tokens frees its slot immediately. Another request takes that slot. The GPU stays busy.
Anyscale benchmarks show vLLM achieves 23x throughput over HuggingFace Transformers using this approach. The difference grows as output lengths vary.
Static Batching Wastes Time
Static batching works like this:
- Collect N requests
- Run them all until every request finishes
- Return all responses
- Accept new requests
When sequences have different output lengths, shorter ones wait. Say the batch holds eight requests. One generates 400 tokens. The other seven finish in 25-35 tokens each. Those seven sit padded while the slow one runs.
```
Request A: ●●●●●●●●●●●●●●●●●●●●●●●●●●●● (400 tokens)
Request B: ●●●●●●···················· done at 30, waiting
Request C: ●●●●●●●··················· done at 35, waiting
Request D: ●●●●●····················· done at 25, waiting
```
The waste scales with variance. In Anyscale's benchmarks, static batching throughput drops from 200 tokens/second to 81 tokens/second as generation length variance increases.
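You can quantify the waste directly. A minimal sketch, using the toy lengths from the diagram above rather than benchmark data:

```python
def static_batch_waste(output_lengths: list[int]) -> float:
    """Fraction of slot-iterations wasted when a static batch runs
    until its longest sequence finishes."""
    longest = max(output_lengths)
    total_slots = longest * len(output_lengths)  # every slot held until the end
    useful = sum(output_lengths)                 # slots that produced tokens
    return 1 - useful / total_slots

# One 400-token request alongside three short ones, as in the diagram:
print(static_batch_waste([400, 30, 35, 25]))  # ~0.69: most slot-iterations idle
```

Nearly 70% of the batch's capacity goes to padding in this example, and the fraction grows with length variance.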
Offline batch jobs with uniform outputs work fine with static batching. The overhead is low and implementation is simple. Real-time serving with diverse queries is a different story.
Dynamic Batching: Still Request-Level
Dynamic batching improves on static by starting batches earlier. Instead of waiting for a full batch, it triggers on two conditions: batch reaches max size, or timeout expires.
This prevents early requests from waiting indefinitely. Light traffic means smaller, faster batches. Heavy traffic fills batches quickly.
The limitation: once a batch starts, it runs to completion. Short requests still wait for long ones within their batch. The scheduling granularity is wrong for token-by-token generation.
Triton and similar inference servers use dynamic batching. It works for image classifiers and other models with fixed output sizes. For LLMs generating variable-length text, you leave performance behind.
Continuous Batching: Per-Iteration Scheduling
The Orca paper (OSDI 2022) introduced iteration-level scheduling. The batch changes every forward pass. Finished requests leave immediately. Waiting requests join immediately.
```
Iteration 1:  [A, B, C, D] → all active
Iteration 15: [A, B, C, D] → still going
Iteration 30: [A, B, E, D] → C finished, E joined
Iteration 45: [A, F, E, D] → B finished, F joined
Iteration 60: [A, F, E, G] → D finished, G joined
```
No padding. No waiting. Slots fill as they open. Orca demonstrated 36.9x throughput improvement over FasterTransformer on GPT-3 175B at equivalent latency.
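The slot-refill behavior is easy to simulate. A toy scheduler, assuming one generated token per active request per iteration (illustrative only, not any framework's real scheduler):

```python
from collections import deque

def continuous_batching(jobs: dict[str, int], max_slots: int) -> dict[str, int]:
    """Toy iteration-level scheduler. `jobs` maps request name to the
    number of tokens it will generate. Each iteration, every active
    request produces one token; finished requests leave and waiting
    requests claim freed slots before the next forward pass.
    Returns the iteration at which each request finished."""
    waiting = deque(jobs)
    remaining = dict(jobs)
    active: list[str] = []
    finished_at: dict[str, int] = {}
    iteration = 0
    while waiting or active:
        # Fill freed slots before the next forward pass.
        while waiting and len(active) < max_slots:
            active.append(waiting.popleft())
        iteration += 1
        for req in list(active):
            remaining[req] -= 1
            if remaining[req] == 0:
                active.remove(req)
                finished_at[req] = iteration
    return finished_at

# Lengths chosen to roughly reproduce the trace above:
print(continuous_batching(
    {"A": 60, "B": 45, "C": 29, "D": 58, "E": 30, "F": 20, "G": 10},
    max_slots=4,
))
```

Request C leaves at iteration 29 and E takes its slot at iteration 30; no slot ever waits for the whole batch to drain.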
Every major inference framework now implements this: vLLM, TGI, TensorRT-LLM (calls it "in-flight batching"), SGLang, LMDeploy ("persistent batching"). The technique became standard because it works.
How vLLM Pulls It Off
Continuous batching sounds simple in concept. Implementation has complications.
Ragged Batching
Standard matrix operations need rectangular tensors. All sequences in a batch need matching shapes. That means padding shorter sequences.
vLLM removes the batch dimension entirely. Sequences get flattened into one long stream with position indices tracking boundaries. Custom attention kernels handle variable lengths without padding.
```python
# Traditional: pad to match shapes
# [[tok1, tok2, PAD,  PAD,  PAD ],
#  [tok1, tok2, tok3, tok4, tok5]]

# vLLM: concatenate with position tracking
# tokens:       [tok1, tok2, tok1, tok2, tok3, tok4, tok5]
# positions:    [0, 1, 0, 1, 2, 3, 4]
# sequence_ids: [0, 0, 1, 1, 1, 1, 1]
```
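The flattening itself is straightforward to sketch. A simplified illustration; vLLM's real kernels operate on GPU tensors, not Python lists:

```python
def flatten_ragged(sequences: list[list[str]]):
    """Concatenate variable-length sequences into one token stream,
    tracking per-token position and sequence membership instead of
    padding to a rectangle."""
    tokens, positions, sequence_ids = [], [], []
    for seq_id, seq in enumerate(sequences):
        for pos, tok in enumerate(seq):
            tokens.append(tok)
            positions.append(pos)
            sequence_ids.append(seq_id)
    return tokens, positions, sequence_ids

tokens, positions, seq_ids = flatten_ragged(
    [["tok1", "tok2"],
     ["tok1", "tok2", "tok3", "tok4", "tok5"]]
)
print(positions)  # [0, 1, 0, 1, 2, 3, 4]
print(seq_ids)    # [0, 0, 1, 1, 1, 1, 1]
```

No PAD tokens exist anywhere; the attention kernel uses `sequence_ids` to keep sequences from attending to each other.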
Mixing Prefill and Decode
LLM inference has two phases. Prefill processes the entire input prompt in parallel to build the KV cache. Decode generates tokens one at a time using that cache.
The phases have different resource profiles. Prefill is compute-bound. Decode is memory-bandwidth-bound.
vLLM mixes them in the same batch. While existing sequences decode their next token, new sequences can prefill. Both compute and memory bandwidth stay utilized.
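A toy token-budget view shows the idea. Illustrative only; vLLM's actual scheduler is considerably more involved:

```python
def build_iteration(decode_seqs: int, prefill_queue: list[int],
                    token_budget: int) -> tuple[int, list[int]]:
    """Toy mixed batch: decoding sequences (one token each) get priority,
    and leftover budget is filled with prompt tokens from waiting prefills.
    Returns (decode tokens, prompt tokens taken per waiting prefill)."""
    budget = token_budget - decode_seqs  # each decoding sequence costs 1 token
    prefill_tokens = []
    for prompt_len in prefill_queue:
        take = min(prompt_len, budget)
        if take <= 0:
            break
        prefill_tokens.append(take)
        budget -= take
    return decode_seqs, prefill_tokens

# 48 sequences decoding, two prompts waiting, 2048-token budget:
print(build_iteration(decode_seqs=48, prefill_queue=[1500, 800], token_budget=2048))
# → (48, [1500, 500]): decodes keep flowing while prefills consume spare budget
```

The compute-heavy prefill tokens and bandwidth-heavy decode tokens share one forward pass, so neither resource sits idle.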
PagedAttention for Memory
Continuous batching solves scheduling efficiency. It doesn't solve memory fragmentation.
Orca reserved memory for the full max_tokens per sequence up front. If a sequence generates fewer tokens than that, the rest sits unused.
PagedAttention allocates KV cache blocks on demand. Memory utilization jumps from roughly 40% to over 96%. That means more concurrent sequences fit in GPU memory, which means larger effective batch sizes.
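The difference shows up directly in block counts. A sketch assuming vLLM's default block size of 16 tokens; the sequence numbers are illustrative:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default)

def reserved_blocks(max_tokens: int) -> int:
    """Orca-style: reserve enough blocks for max_tokens up front."""
    return -(-max_tokens // BLOCK_SIZE)  # ceiling division

def on_demand_blocks(generated_tokens: int) -> int:
    """PagedAttention-style: a block is allocated only when needed."""
    return -(-generated_tokens // BLOCK_SIZE)

# A request allowed max_tokens=2048 that actually generated 100 tokens:
print(reserved_blocks(2048))   # 128 blocks held for the whole request
print(on_demand_blocks(100))   # 7 blocks actually touched
```

Every block not reserved is a block available to another sequence, which is where the larger effective batch sizes come from.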
The combination of continuous batching (scheduling) and PagedAttention (memory) is why vLLM hits such dramatic improvements. Neither technique alone gets you there.
For more on KV cache memory management, see our PagedAttention guide.
Benchmarks
Anyscale tested OPT-13B on an A100 40GB across different generation length distributions.
Throughput vs generation variance:
| Variance Level | Static | Continuous (TGI) | vLLM |
|---|---|---|---|
| Low (uniform) | 200 tok/s | 210 tok/s | 420 tok/s |
| Medium | 150 tok/s | 200 tok/s | 450 tok/s |
| High (1-1536) | 81 tok/s | 200 tok/s | 480 tok/s |
Static batching collapses under variance. Continuous batching stays stable. vLLM doubles continuous batching throughput again through PagedAttention.
The 23x claim compares vLLM against naive HuggingFace Transformers serving. Against TGI (which uses continuous batching), vLLM achieves roughly 2x improvement from memory efficiency alone.
Latency under load:
Continuous batching improves latency because requests don't queue behind slow batches. New requests enter immediately as slots open. Static batching builds queues during bursts.
For production AI deployments, continuous batching wins on both throughput and latency for real-world traffic patterns.
When to Use Each
Static batching works for:
- Offline processing without latency requirements
- Uniform input and output lengths
- Single-user batch jobs
- Maximum peak throughput on homogeneous data
Dynamic batching works for:
- Non-LLM models (image classifiers, embeddings)
- Legacy systems without continuous batching support
- Workloads with predictable request shapes
Continuous batching works for:
- Real-time LLM serving
- Variable output lengths
- Concurrent users
- Latency SLAs
For chat, assistants, and interactive applications, use continuous batching.
Implementation
vLLM enables continuous batching by default. No configuration needed.
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

prompts = [
    "Explain photosynthesis briefly.",
    "Write a function to reverse a string in Python.",
    "What causes thunder?",
]

outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
```
All three prompts batch together. As each finishes, its slot opens for more work.
For serving:
```shell
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct
```
Concurrent API requests batch automatically. The server handles scheduling.
Configuration parameters that matter:
- `max_num_seqs`: Maximum sequences per batch (default 256)
- `max_num_batched_tokens`: Token budget per iteration
- `gpu_memory_utilization`: Fraction of GPU memory for KV cache (default 0.9)
Defaults work for most cases. Tune max_num_seqs if you see memory pressure or preemption warnings in logs.
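For example, tightening the batch cap and leaving memory headroom looks like this (values are illustrative, not recommendations; requires a GPU with vLLM installed):

```python
from vllm import LLM

# Illustrative tuning, not a recommendation: cap concurrent sequences
# and leave KV-cache headroom if logs show preemption or OOMs.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=128,
    gpu_memory_utilization=0.85,
)
```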
Platforms like Prem Studio configure these parameters automatically based on model size and expected traffic when you deploy fine-tuned models.
Chunked Prefill
Long prompts create latency spikes. A 10,000-token prompt monopolizes an iteration, blocking decode operations for other sequences.
Chunked prefill splits long prompts across multiple iterations. Process 2,000 tokens, handle some decode operations, process the next 2,000 tokens.
```python
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
)
```
This smooths latency for workloads mixing short queries with long document inputs. RAG applications benefit because context injection creates exactly this pattern.
Comparison
| | Static | Dynamic | Continuous |
|---|---|---|---|
| Scheduling level | Batch | Batch | Iteration |
| New requests join | After batch | After batch | Every iteration |
| Padding | Yes | Yes | No (ragged) |
| Memory efficiency | Low | Low | High with PagedAttention |
| Best for | Offline uniform | Non-LLM | LLM serving |
| Throughput (variable) | Degrades | Degrades | Stable |
| Latency under load | Spikes | Moderate | Low |
Summary
Static batching forces short requests to wait for long ones. GPU cycles waste on padding. Throughput tanks when output lengths vary.
Continuous batching schedules per iteration. Requests join and leave independently. No waiting. No padding. Throughput stays high regardless of variance.
vLLM combines continuous batching with PagedAttention for memory efficiency. The result is 23x throughput over naive serving and 2x over continuous batching alone.
Use vLLM defaults. Monitor for preemption warnings. Tune batch size if needed. The framework handles the complexity.
For managed deployments, Prem's infrastructure handles batching configuration as part of the stack.