LLM Batching: Static vs Continuous and Why It Matters for Throughput

Static batching wastes GPU cycles waiting for slow requests. Continuous batching fixes this by scheduling per-iteration. Benchmarks and implementation inside.

Your GPU loads 140GB of model weights for every forward pass. That transfer takes time. Batching multiple requests together means you load those weights once and apply them to many inputs simultaneously. More requests per weight load equals higher throughput.

The problem shows up when requests finish at different times.

Static batching waits for the slowest request before starting the next batch. If one request generates 500 tokens while others need 20, those shorter requests sit idle. The GPU wastes cycles on padding.

Continuous batching removes that constraint. Requests join and leave the batch independently at every iteration. A request that finishes after 20 tokens frees its slot immediately. Another request takes that slot. The GPU stays busy.

Anyscale benchmarks show vLLM achieves 23x throughput over HuggingFace Transformers using this approach. The difference grows as output lengths vary.

Static Batching Wastes Time

Static batching works like this:

  1. Collect N requests
  2. Run them all until every request finishes
  3. Return all responses
  4. Accept new requests

When sequences have different output lengths, shorter ones wait. Suppose the batch holds four requests. One generates 400 tokens. The other three finish within 35 tokens. They sit padded while the slow one runs.

Request A: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●  (400 tokens)
Request B: ●●●●●●····················  done at 30, waiting
Request C: ●●●●●●●···················  done at 35, waiting
Request D: ●●●●●·····················  done at 25, waiting

The waste scales with variance. In Anyscale's benchmarks, static batching throughput drops from 200 tokens/second to 81 tokens/second as generation length variance increases.
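
The padding waste in a scenario like this can be estimated directly from the output lengths. A minimal sketch with illustrative numbers, not a benchmark:

```python
# Sketch: estimate wasted slot-iterations in a static batch.
# Each request occupies a slot for max(lengths) iterations,
# but only does useful work for its own length.
def static_batch_waste(output_lengths):
    """Return the fraction of slot-iterations spent idle (padding)."""
    longest = max(output_lengths)
    total_slots = longest * len(output_lengths)
    useful = sum(output_lengths)
    return 1 - useful / total_slots

# One 400-token request batched with seven ~30-token requests:
waste = static_batch_waste([400, 30, 30, 30, 30, 30, 30, 30])
print(f"{waste:.0%} of slot-iterations wasted")  # → 81%
```

With uniform lengths the waste is zero, which is why static batching holds up fine on homogeneous offline jobs.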

Offline batch jobs with uniform outputs work fine with static batching. The overhead is low and implementation is simple. Real-time serving with diverse queries is a different story.

Dynamic Batching: Still Request-Level

Dynamic batching improves on static by starting batches earlier. Instead of waiting for a full batch, it triggers on two conditions: batch reaches max size, or timeout expires.

This prevents early requests from waiting indefinitely. Light traffic means smaller, faster batches. Heavy traffic fills batches quickly.

The limitation: once a batch starts, it runs to completion. Short requests still wait for long ones within their batch. The scheduling granularity is wrong for token-by-token generation.

Triton and similar inference servers use dynamic batching. It works for image classifiers and other models with fixed output sizes. For LLMs generating variable-length text, it leaves throughput on the table.
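
A minimal sketch of the trigger logic, assuming a thread-safe request queue (function and parameter names are hypothetical):

```python
from queue import Queue, Empty
import time

def collect_batch(requests: Queue, max_size=8, timeout_s=0.01):
    """Block for one request, then fill until max_size or timeout expires."""
    batch = [requests.get()]                 # wait for the first request
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                            # timeout: launch a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break                            # queue drained before deadline
    return batch
```

Note the granularity: this decides when a batch *starts*, but once launched the whole batch still runs to completion.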

Continuous Batching: Per-Iteration Scheduling

The Orca paper (OSDI 2022) introduced iteration-level scheduling. The batch changes every forward pass. Finished requests leave immediately. Waiting requests join immediately.

Iteration 1:  [A, B, C, D]     → all active
Iteration 15: [A, B, C, D]     → still going  
Iteration 30: [A, B, E, D]     → C finished, E joined
Iteration 45: [A, F, E, D]     → B finished, F joined
Iteration 60: [A, F, E, G]     → D finished, G joined

No padding. No waiting. Slots fill as they open. Orca demonstrated 36.9x throughput improvement over FasterTransformer on GPT-3 175B at equivalent latency.
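
The iteration-level loop can be sketched in a few lines. `Seq` and `run` are hypothetical stand-ins, not framework code:

```python
from collections import deque

class Seq:
    """Stand-in for a decoding sequence with a known final length."""
    def __init__(self, name, target_len):
        self.name, self.target_len, self.generated = name, target_len, 0
    def step(self):                      # one decode iteration
        self.generated += 1
    def finished(self):
        return self.generated >= self.target_len

def run(waiting, max_batch=4):
    waiting = deque(waiting)
    active, completed = [], []
    while waiting or active:
        # Admit waiting sequences into any free slots.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One forward pass: every active sequence decodes one token.
        for seq in active:
            seq.step()
        # Finished sequences free their slots immediately.
        still_active = []
        for seq in active:
            (completed if seq.finished() else still_active).append(seq)
        active = still_active
    return [s.name for s in completed]

# Short sequences finish and exit without waiting for the long one:
print(run([Seq("A", 3), Seq("B", 1), Seq("C", 2)]))  # → ['B', 'C', 'A']
```

The key difference from the dynamic-batching sketch earlier: admission and eviction happen inside the decode loop, once per forward pass, not once per batch.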

Every major inference framework now implements this: vLLM, TGI, TensorRT-LLM (calls it "in-flight batching"), SGLang, LMDeploy ("persistent batching"). The technique became standard because it works.

How vLLM Pulls It Off

Continuous batching is simple in concept. The implementation has complications.

Ragged Batching

Standard matrix operations need rectangular tensors. All sequences in a batch need matching shapes. That means padding shorter sequences.

vLLM removes the batch dimension entirely. Sequences get flattened into one long stream with position indices tracking boundaries. Custom attention kernels handle variable lengths without padding.

# Traditional: pad to match shapes
# [[tok1, tok2, PAD, PAD, PAD],
#  [tok1, tok2, tok3, tok4, tok5]]

# vLLM: concatenate with position tracking
# tokens: [tok1, tok2, tok1, tok2, tok3, tok4, tok5]
# positions: [0, 1, 0, 1, 2, 3, 4]
# sequence_ids: [0, 0, 1, 1, 1, 1, 1]

Mixing Prefill and Decode

LLM inference has two phases. Prefill processes the entire input prompt in parallel to build the KV cache. Decode generates tokens one at a time using that cache.

The phases have different resource profiles. Prefill is compute-bound. Decode is memory-bandwidth-bound.

vLLM mixes them in the same batch. While existing sequences decode their next token, new sequences can prefill. Both compute and memory bandwidth stay utilized.
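
The split can be sketched as a token-budget planner: decoding sequences get one token each, and prefill consumes whatever budget remains. Names and the 2,048-token budget are hypothetical, not vLLM internals:

```python
# Sketch: plan one iteration's work under a fixed token budget.
def plan_iteration(running, admitted_prompt_lens, budget=2048):
    """Decodes first (one token per running sequence), then prefill."""
    decode_tokens = len(running)           # one new token per decoding seq
    remaining = budget - decode_tokens
    prefill = []
    for n in admitted_prompt_lens:
        take = min(n, remaining)           # prompts may be split mid-prefill
        if take <= 0:
            break
        prefill.append(take)
        remaining -= take
    return decode_tokens, prefill

# 100 decoding sequences plus two new prompts under a 2048-token budget:
print(plan_iteration(["seq"] * 100, [1500, 900]))  # → (100, [1500, 448])
```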

PagedAttention for Memory

Continuous batching solves scheduling efficiency. It doesn't solve memory fragmentation.

Orca reserved max_tokens of memory per sequence. If a sequence generates fewer tokens than allocated, that memory sits unused.

PagedAttention allocates KV cache blocks on demand. Memory utilization jumps from roughly 40% to over 96%. That means more concurrent sequences fit in GPU memory, which means larger effective batch sizes.
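
The difference can be sketched with illustrative numbers (16-token blocks, a 2,048-token reservation; not vLLM internals):

```python
# Sketch: KV-cache blocks under up-front reservation vs paged allocation.
BLOCK = 16  # tokens per KV-cache block (illustrative)

def reserved_blocks(max_tokens, n_seqs):
    """Orca-style: reserve max_tokens worth of blocks per sequence."""
    per_seq = -(-max_tokens // BLOCK)            # ceiling division
    return per_seq * n_seqs

def paged_blocks(actual_lengths):
    """PagedAttention-style: allocate blocks only as tokens arrive."""
    return sum(-(-n // BLOCK) for n in actual_lengths)

lengths = [37, 120, 512, 61]                     # tokens actually generated
print(reserved_blocks(2048, len(lengths)))       # → 512 blocks reserved
print(paged_blocks(lengths))                     # → 47 blocks actually used
```

The freed blocks translate directly into more concurrent sequences, which is what feeds the continuous-batching scheduler larger batches.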

The combination of continuous batching (scheduling) and PagedAttention (memory) is why vLLM hits such dramatic improvements. Neither technique alone gets you there.

For more on KV cache memory management, see our PagedAttention guide.

Benchmarks

Anyscale tested OPT-13B on an A100 40GB across different generation length distributions.

Throughput vs generation variance:

Variance Level   Static      Continuous (TGI)   vLLM
Low (uniform)    200 tok/s   210 tok/s          420 tok/s
Medium           150 tok/s   200 tok/s          450 tok/s
High (1-1536)    81 tok/s    200 tok/s          480 tok/s

Static batching collapses under variance. Continuous batching stays stable. vLLM doubles continuous batching throughput again through PagedAttention.

The 23x claim compares vLLM against naive HuggingFace Transformers serving. Against TGI (which uses continuous batching), vLLM achieves roughly 2x improvement from memory efficiency alone.

Latency under load:

Continuous batching improves latency because requests don't queue behind slow batches. New requests enter immediately as slots open. Static batching builds queues during bursts.

For production AI deployments, continuous batching wins on both throughput and latency for real-world traffic patterns.

When to Use Each

Static batching works for:

  • Offline processing without latency requirements
  • Uniform input and output lengths
  • Single-user batch jobs
  • Maximum peak throughput on homogeneous data

Dynamic batching works for:

  • Non-LLM models (image classifiers, embeddings)
  • Legacy systems without continuous batching support
  • Workloads with predictable request shapes

Continuous batching works for:

  • Real-time LLM serving
  • Variable output lengths
  • Concurrent users
  • Latency SLAs

For chat, assistants, and interactive applications, use continuous batching.

Implementation

vLLM enables continuous batching by default. No configuration needed.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

prompts = [
    "Explain photosynthesis briefly.",
    "Write a function to reverse a string in Python.",
    "What causes thunder?",
]

outputs = llm.generate(prompts, SamplingParams(max_tokens=256))

All three prompts batch together. As each finishes, its slot opens for more work.

For serving:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct

Concurrent API requests batch automatically. The server handles scheduling.

Configuration parameters that matter:

  • max_num_seqs: Maximum sequences per batch (default 256)
  • max_num_batched_tokens: Token budget per iteration
  • gpu_memory_utilization: Fraction of GPU memory for KV cache (default 0.9)

Defaults work for most cases. Tune max_num_seqs if you see memory pressure or preemption warnings in logs.
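
As a sketch, the same parameters go to the LLM constructor for offline use, or as the corresponding CLI flags (--max-num-seqs and so on) for the server. The values here are illustrative, not recommendations:

```python
from vllm import LLM

# Illustrative values only; tune against your own memory-pressure
# and preemption warnings, not these numbers.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=128,               # cap concurrent sequences per batch
    max_num_batched_tokens=8192,    # per-iteration token budget
    gpu_memory_utilization=0.85,    # leave headroom for activations
)
```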

Platforms like Prem Studio configure these parameters automatically based on model size and expected traffic when you deploy fine-tuned models.

Chunked Prefill

Long prompts create spikes. A 10,000 token prompt monopolizes an iteration, blocking decode operations for other sequences.

Chunked prefill splits long prompts across multiple iterations. Process 2,000 tokens, handle some decode operations, process the next 2,000 tokens.

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
)

This smooths latency for workloads mixing short queries with long document inputs. RAG applications benefit because context injection creates exactly this pattern.

Comparison

                       Static            Dynamic       Continuous
Scheduling level       Batch             Batch         Iteration
New requests join      After batch       After batch   Every iteration
Padding                Yes               Yes           No (ragged)
Memory efficiency      Low               Low           High with PagedAttention
Best for               Offline uniform   Non-LLM       LLM serving
Throughput (variable)  Degrades          Degrades      Stable
Latency under load     Spikes            Moderate      Low

Summary

Static batching forces short requests to wait for long ones. GPU cycles waste on padding. Throughput tanks when output lengths vary.

Continuous batching schedules per iteration. Requests join and leave independently. No waiting. No padding. Throughput stays high regardless of variance.

vLLM combines continuous batching with PagedAttention for memory efficiency. The result is 23x throughput over naive serving and 2x over continuous batching alone.

Use vLLM defaults. Monitor for preemption warnings. Tune batch size if needed. The framework handles the complexity.

For managed deployments, Prem's infrastructure handles batching configuration as part of the stack.
