Load Testing LLMs: Tools, Metrics & Realistic Traffic Simulation (2026)

LLM performance testing goes beyond basic API benchmarks. Learn to measure TTFT, tokens per second, p99 latency, and throughput under realistic concurrent load.

Your LLM works fine with one user. What happens with 100 concurrent requests?

Traditional API load testing measures requests per second and response times. LLM load testing is different. You're dealing with streaming responses, variable-length outputs, token-level metrics, and GPU saturation patterns that don't exist in typical REST APIs.

A chat endpoint might handle 50 requests per second with 200ms p50 latency. But if p99 latency spikes to 8 seconds under load, 1% of your users experience terrible performance. At scale, that's thousands of frustrated users daily.

This guide covers the metrics that matter for LLM deployments, the tools to measure them, and how to design tests that reflect real-world traffic.

Why LLM Load Testing Is Different

Standard API benchmarks measure request duration. LLMs require more granular metrics because the response unfolds over time.

Consider a streaming chat response:

  1. User sends prompt (request starts)
  2. Model processes input (prefill phase)
  3. First token appears (TTFT)
  4. Subsequent tokens stream (decode phase)
  5. Final token arrives (request ends)

Each phase has different performance characteristics. Prefill is compute-bound and scales with input length. Decode is memory-bound and determines streaming smoothness. A slow TTFT makes your app feel unresponsive even if total latency is acceptable.

Traditional load testing tools like Apache JMeter weren't designed for this. They measure total request duration but miss the streaming dynamics that determine user experience.

The Metrics That Matter

Time to First Token (TTFT)

The delay between sending a request and receiving the first token. This is what users perceive as "responsiveness."

TTFT matters for:

  • Chat interfaces where users expect immediate feedback
  • Streaming applications where perceived latency determines UX
  • Interactive coding assistants

Target SLOs vary by use case. Consumer chat apps often target TTFT under 500ms. Real-time copilots might need sub-200ms.

Inter-Token Latency (ITL)

The time between consecutive tokens during streaming. Low, consistent ITL creates smooth text appearance. High or variable ITL creates a choppy, stuttering effect.

ITL is calculated as:

ITL = (End-to-End Latency - TTFT) / (Output Tokens - 1)

For readable streaming, aim for ITL under 50ms. Above 100ms, users notice the lag between tokens.

Time Per Output Token (TPOT)

Similar to ITL but sometimes calculated differently across tools. TPOT represents the average time to generate each token after the first. The distinction from ITL varies in the literature, but both measure decode-phase performance.

End-to-End Latency (E2EL)

Total time from request sent to final token received.

E2E Latency = TTFT + (ITL × (Output Tokens − 1))

E2EL matters for non-streaming use cases: summarization, batch processing, or any workflow that waits for complete responses before proceeding.
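Both formulas can be sanity-checked from raw token arrival timestamps. A minimal sketch in plain Python, using synthetic timings:

```python
def stream_metrics(request_start, token_times):
    """Derive TTFT, mean ITL, and E2E latency from token arrival times (seconds)."""
    ttft = token_times[0] - request_start
    e2el = token_times[-1] - request_start
    # ITL = (E2EL - TTFT) / (output tokens - 1)
    itl = (e2el - ttft) / (len(token_times) - 1) if len(token_times) > 1 else 0.0
    return ttft, itl, e2el

# Synthetic example: first token at 0.4s, then one token every 30ms for 100 tokens
times = [0.4 + 0.03 * i for i in range(100)]
ttft, itl, e2el = stream_metrics(0.0, times)
print(round(ttft, 3), round(itl, 3), round(e2el, 3))  # 0.4 0.03 3.37
```

Any load-testing harness that records per-token timestamps can derive all three metrics this way.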

Throughput Metrics

Tokens per second (TPS): Total output tokens generated across all concurrent requests per second. This measures raw system capacity.

Requests per second (RPS): Number of completed requests per second. Less informative for LLMs because request complexity varies wildly. A 10-token response and a 1,000-token response count equally.

Percentile Latencies

Averages hide outliers. A system with 200ms average latency might have 4-second p99 latency.

  • p50 (median): The typical user experience
  • p95: Experience for the "unlucky" 5%
  • p99: Near worst-case scenario

Track p95 and p99 separately. Optimize for tail latency, not averages. User complaints come from the tail.
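The averages-hide-outliers point is easy to demonstrate with the standard library. The latencies below are synthetic, chosen so the average looks healthy while the tail is terrible:

```python
import statistics

# Synthetic latencies (ms): mostly fast, with a slow tail
latencies = [200] * 950 + [1500] * 40 + [4000] * 10

print(statistics.mean(latencies))      # 290 -- looks fine
pcts = statistics.quantiles(latencies, n=100)  # cut points p1..p99
p50, p95, p99 = pcts[49], pcts[94], pcts[98]
print(p50, p95, p99)                   # 200.0 1435.0 3975.0 -- the tail is 13x the average
```

A dashboard showing only the 290ms mean would completely miss the near-4-second p99.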

Goodput

The percentage of requests meeting all your SLOs. High throughput means nothing if half your requests violate latency targets.

Goodput = (Requests Meeting All SLOs / Total Requests) × 100%

Define goodput by setting thresholds: TTFT < 500ms, ITL < 50ms, E2EL < 5s. Goodput then reports what fraction satisfies all requirements.
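A sketch of the calculation, using the example thresholds above and hypothetical per-request measurements:

```python
def goodput(requests, slos):
    """Percentage of requests meeting every SLO. Each request is a dict of metrics."""
    ok = sum(all(req[m] < limit for m, limit in slos.items()) for req in requests)
    return 100.0 * ok / len(requests)

slos = {"ttft_ms": 500, "itl_ms": 50, "e2el_ms": 5000}
requests = [
    {"ttft_ms": 320, "itl_ms": 28, "e2el_ms": 2400},  # meets all SLOs
    {"ttft_ms": 450, "itl_ms": 35, "e2el_ms": 4800},  # meets all SLOs
    {"ttft_ms": 180, "itl_ms": 95, "e2el_ms": 3100},  # fast TTFT, choppy stream
    {"ttft_ms": 900, "itl_ms": 30, "e2el_ms": 4200},  # slow TTFT
]
print(goodput(requests, slos))  # 50.0
```

Note that the third request would look fine on a TTFT-only dashboard; goodput catches it because one violated SLO fails the whole request.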

Tools for LLM Load Testing

LLMPerf (Ray Project)

Purpose-built for LLM benchmarking. Spawns concurrent requests and measures token-level metrics.

Strengths:

  • Measures TTFT, ITL, and generation throughput per request
  • Supports configurable input/output token distributions
  • Works with LiteLLM, OpenAI, Anthropic, and custom endpoints

Basic usage:

python token_benchmark_ray.py \
  --model "meta-llama/Llama-3-70B" \
  --mean-input-tokens 550 \
  --stddev-input-tokens 150 \
  --mean-output-tokens 150 \
  --stddev-output-tokens 10 \
  --num-concurrent-requests 10 \
  --max-num-completed-requests 100 \
  --llm-api "openai" \
  --results-dir "./results"

This simulates realistic traffic: variable input lengths averaging 550 tokens, variable outputs averaging 150 tokens, 10 concurrent users.
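Under the hood, the mean/stddev flags describe a distribution the tool samples per-request lengths from. A quick sketch of what a normal distribution with those parameters produces (an assumption about the sampling model, not LLMPerf's exact code):

```python
import random

random.seed(7)
# Sample per-request input lengths from a normal distribution
# (mean 550, stddev 150), clamped to at least 1 token.
lengths = [max(1, round(random.gauss(550, 150))) for _ in range(1000)]
mean = sum(lengths) / len(lengths)
print(round(mean), min(lengths), max(lengths))
```

The spread matters: requests near the top of the distribution stress prefill far more than the mean suggests.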

Limitations:

  • Less flexible for custom traffic patterns
  • No built-in UI for real-time monitoring

NVIDIA GenAI-Perf

NVIDIA's official benchmarking tool for LLM inference. Deep integration with TensorRT-LLM but works with any OpenAI-compatible endpoint.

Strengths:

  • Comprehensive metrics including TTFT, TPOT, and system TPS
  • Concurrency sweeps to find saturation points
  • Designed for production performance tuning

Example:

genai-perf profile \
  -m llama-3-8b \
  --endpoint-type chat \
  --streaming \
  --concurrency 1,2,4,8,16,32 \
  --input-tokens-mean 128 \
  --output-tokens-mean 128

The concurrency sweep identifies where throughput saturates and latency degrades.

Limitations:

  • Primarily for performance benchmarking, less for load testing at scale
  • Steeper learning curve

GuideLLM (Red Hat)

Open-source toolkit for evaluating LLM deployment performance by simulating real-world traffic.

Strengths:

  • Simulates multiple simultaneous users at various rates
  • Real-time progress display during tests
  • JSON/YAML/CSV output for analysis
  • Kubernetes-friendly

Use cases:

  • Pre-deployment benchmarking
  • Regression testing after updates
  • Hardware evaluation across GPU configurations

Example:

guidellm benchmark \
  --model "your-model" \
  --endpoint "http://localhost:8000/v1/chat/completions" \
  --rate 10 \
  --duration 60s

k6 with LLM Extensions

Grafana's k6 is a general-purpose load testing tool. With extensions, it handles LLM-specific scenarios including streaming.

Strengths:

  • JavaScript scripting for complex scenarios
  • Excellent Grafana/Prometheus integration
  • Lightweight and scales well
  • Active community

LLM load test example:

import http from 'k6/http';
import { check, sleep } from 'k6';
import { Trend } from 'k6/metrics';

// Custom throughput metric; measuring TTFT would need a streaming client (see limitations below)
const tokensPerSecond = new Trend('tokens_per_second');

export const options = {
  stages: [
    { duration: '1m', target: 10 },   // Ramp up
    { duration: '3m', target: 50 },   // Sustained load
    { duration: '1m', target: 100 },  // Peak
    { duration: '1m', target: 0 },    // Ramp down
  ],
  thresholds: {
    'http_req_duration': ['p95<5000'],  // E2E latency p95 < 5s
    'http_req_failed': ['rate<0.01'],   // Error rate < 1%
  },
};

export default function () {
  const payload = JSON.stringify({
    model: 'llama-3-8b',
    messages: [{ role: 'user', content: 'Explain quantum computing briefly.' }],
    stream: false,
    max_tokens: 150,
  });

  const params = {
    headers: { 
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${__ENV.API_KEY}`,
    },
  };

  const start = Date.now();
  const res = http.post('http://localhost:8000/v1/chat/completions', payload, params);
  const duration = Date.now() - start;

  check(res, {
    'status is 200': (r) => r.status === 200,
  });

  // Parse response for token metrics
  if (res.status === 200) {
    const body = JSON.parse(res.body);
    const outputTokens = body.usage?.completion_tokens || 0;
    if (outputTokens > 0) {
      tokensPerSecond.add(outputTokens / (duration / 1000));
    }
  }

  sleep(1);
}

Limitations:

  • Streaming SSE support requires extra configuration
  • Token-level metrics need custom parsing

Locust with LLM Extensions

Python-based load testing with a web UI. LLM-Locust extends it for streaming and token metrics.

Strengths:

  • Python scripting for flexibility
  • Real-time web UI
  • Distributed testing support

Basic LLM test:

from locust import HttpUser, task, between
import os

class LLMUser(HttpUser):
    wait_time = between(1, 3)
    api_key = os.environ.get("API_KEY", "")
    
    @task
    def chat_completion(self):
        payload = {
            "model": "llama-3-8b",
            "messages": [{"role": "user", "content": "What is machine learning?"}],
            "max_tokens": 100
        }
        
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            headers={"Authorization": f"Bearer {self.api_key}"},
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed: {response.status_code}")

Limitations:

  • Standard Locust doesn't capture LLM-specific metrics
  • Need LLM-Locust extension for TTFT, ITL tracking

Gatling

Enterprise-grade load testing with recent SSE support for LLM APIs.

Strengths:

  • Strong streaming/SSE support
  • Detailed HTML reports
  • CI/CD integration

Gatling handles the streaming response parsing that trips up simpler tools.

Designing Realistic Test Scenarios

Input/Output Length Distribution

Real traffic has variable prompt and response lengths. A test that always sends 100-token prompts and expects 100-token responses misses important edge cases.

Use distributions matching your production data:

  • Short prompts, short responses: Quick Q&A, classification
  • Long prompts, short responses: Summarization, extraction
  • Short prompts, long responses: Creative writing, code generation
  • Long prompts, long responses: Document analysis, report generation

LLMPerf supports this with --mean-input-tokens and --stddev-input-tokens.

Concurrency Patterns

Different applications have different load patterns:

Steady state: Consistent request rate. Typical for background processing.

Ramp up/down: Gradual increase to peak, then decrease. Simulates daily traffic patterns.

Spike: Sudden burst of requests. Tests autoscaling and queue handling.

Soak: Moderate load for extended periods. Reveals memory leaks, gradual degradation.

// k6 spike test
export const options = {
  stages: [
    { duration: '2m', target: 10 },   // Baseline
    { duration: '10s', target: 100 }, // Spike
    { duration: '2m', target: 100 },  // Hold spike
    { duration: '10s', target: 10 },  // Recover
    { duration: '2m', target: 10 },   // Baseline
  ],
};

Prompt Diversity

Don't test with the same prompt repeatedly. Response caching, prefix caching, and speculative decoding can make repeated or highly similar prompts artificially fast, inflating your results.

Use diverse prompts that match your production distribution:

  • Different topics
  • Varying complexity
  • Multiple languages if applicable
  • Edge cases: very long prompts, special characters, malformed inputs
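A minimal sketch of generating varied prompts from templates (the topics and templates here are hypothetical; sampling from logged production prompts is better when you have them):

```python
import random

random.seed(0)
# Hypothetical prompt pool; in practice, sample from logged production traffic.
topics = ["quantum computing", "supply chains", "Rust lifetimes", "photosynthesis"]
templates = [
    "Explain {} briefly.",
    "Write a detailed guide to {}.",
    "List five common misconceptions about {}.",
]

def make_prompt():
    return random.choice(templates).format(random.choice(topics))

prompts = {make_prompt() for _ in range(200)}
print(len(prompts))  # up to 12 distinct prompts from this tiny pool
```

Even this small pool defeats naive response caching; varying the prompt *prefix* (not just the tail) is what defeats prefix caching.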

Streaming vs Non-Streaming

If your production app uses streaming, test with streaming enabled. Streaming changes:

  • Connection management overhead
  • Metrics you can capture
  • How failures manifest

A non-streaming test might show 1s latency. The same request streaming might show 300ms TTFT with 1s total, feeling much more responsive.

Identifying Bottlenecks

GPU Saturation

Signs of GPU saturation:

  • Throughput plateaus while concurrency increases
  • Latency spikes at specific concurrency levels
  • GPU utilization hits 100%

Test by sweeping concurrency from 1 to beyond your max batch size. Throughput typically saturates around max batch size while latency continues climbing.
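Given sweep results, the saturation point can be found programmatically. A sketch with synthetic numbers, flagging the level where doubling concurrency stops paying off:

```python
# Synthetic sweep results: (concurrency, tokens/sec). Throughput grows, then flattens.
sweep = [(1, 55), (2, 108), (4, 205), (8, 370), (16, 610), (32, 640), (64, 648)]

def saturation_point(sweep, min_gain=0.10):
    """First concurrency level where the next step adds <10% throughput."""
    for (c_prev, tps_prev), (c, tps) in zip(sweep, sweep[1:]):
        if (tps - tps_prev) / tps_prev < min_gain:
            return c_prev
    return sweep[-1][0]

print(saturation_point(sweep))  # 16
```

Past that point, extra concurrency mostly buys queueing delay, not throughput.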

KV Cache Pressure

Long contexts or high concurrency can exhaust KV cache memory. Symptoms:

  • Sudden latency spikes
  • Request failures or evictions
  • Memory errors in logs

Monitor KV cache utilization during load tests. vLLM exposes this via Prometheus metrics.

Queue Depth

When requests arrive faster than the model can process:

  • Queue depth grows
  • Latency increases linearly with queue depth
  • Timeouts become common

A healthy system maintains bounded queue depth. Unbounded growth indicates insufficient capacity.
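The linear relationship follows from Little's Law: expected queue wait is queue depth divided by service rate. A back-of-envelope sketch:

```python
# Little's Law: average queue wait grows linearly with queue depth.
def queue_wait(queue_depth, service_rate_rps):
    """Expected seconds a new request spends queued before processing starts."""
    return queue_depth / service_rate_rps

# A server completing 5 requests/sec with 40 requests already queued:
print(queue_wait(40, 5))  # 8.0 seconds of waiting before the prefill even begins
```

This is why queue depth is worth alerting on directly: it predicts latency before users feel it.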

Network and I/O

For streaming endpoints:

  • Proxy buffering can delay tokens (NGINX buffers by default)
  • Connection limits can throttle concurrent requests
  • SSL handshake overhead adds to TTFT

Test from multiple network locations to separate model latency from network latency.

Sample Test Scenarios

Scenario 1: Baseline Performance

Establish single-request performance before testing under load.

# Single request, measure latency distribution
genai-perf profile \
  -m your-model \
  --concurrency 1 \
  --request-count 100 \
  --input-tokens-mean 256 \
  --output-tokens-mean 128

Record p50, p95, p99 TTFT and E2EL. This is your baseline.

Scenario 2: Throughput Saturation

Find your system's maximum throughput.

# Sweep concurrency to find saturation
genai-perf profile \
  -m your-model \
  --concurrency 1,2,4,8,16,32,64,128 \
  --input-tokens-mean 256 \
  --output-tokens-mean 128

Plot TPS vs concurrency. Throughput will increase, then plateau, then potentially decrease as queuing overhead dominates.

Scenario 3: Production Traffic Simulation

Simulate realistic traffic for 30 minutes.

// k6 production simulation
export const options = {
  scenarios: {
    steady_traffic: {
      executor: 'ramping-arrival-rate',
      startRate: 1,
      timeUnit: '1s',
      preAllocatedVUs: 50,
      maxVUs: 200,
      stages: [
        { duration: '5m', target: 10 },  // Ramp up
        { duration: '20m', target: 10 }, // Steady state
        { duration: '5m', target: 0 },   // Ramp down
      ],
    },
  },
  thresholds: {
    http_req_duration: ['p95<3000'],
    http_req_failed: ['rate<0.01'],
  },
};

Scenario 4: Spike Handling

Test response to sudden traffic bursts.

export const options = {
  stages: [
    { duration: '2m', target: 5 },    // Normal
    { duration: '30s', target: 50 },  // 10x spike
    { duration: '2m', target: 50 },   // Hold
    { duration: '30s', target: 5 },   // Recovery
    { duration: '2m', target: 5 },    // Normal
  ],
};

Measure recovery time: how long until latency returns to baseline after the spike ends?
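A sketch of computing recovery time from post-spike latency samples (synthetic numbers; the 1.2× tolerance is an arbitrary choice):

```python
def recovery_time(series, baseline_p95, tolerance=1.2):
    """Time until p95 latency first falls back within tolerance of baseline.

    series: list of (seconds_since_spike_end, p95_latency_ms) samples.
    """
    for t, latency in series:
        if latency <= baseline_p95 * tolerance:
            return t
    return None  # never recovered within the observation window

# Synthetic post-spike samples: latency drains back toward a 400ms baseline
samples = [(0, 3200), (10, 2100), (20, 1100), (30, 620), (40, 430), (50, 410)]
print(recovery_time(samples, baseline_p95=400))  # 40
```

A `None` result is itself a finding: the queue never drained, which means the spike exceeded sustainable capacity.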

Scenario 5: Long-Running Soak

Catch memory leaks and gradual degradation.

# 4-hour soak test at moderate load
python token_benchmark_ray.py \
  --model your-model \
  --num-concurrent-requests 20 \
  --timeout 14400 \
  --max-num-completed-requests 10000

Compare metrics from first hour vs last hour. Degradation indicates resource leaks.
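A sketch of the first-hour vs last-hour comparison (the 15% growth threshold is an arbitrary assumption; pick one that matches your SLO headroom):

```python
def degradation(first_hour_p95, last_hour_p95, threshold=0.15):
    """Flag if p95 latency grew more than 15% between first and last hour."""
    growth = (last_hour_p95 - first_hour_p95) / first_hour_p95
    return growth, growth > threshold

growth, degraded = degradation(first_hour_p95=820, last_hour_p95=1140)
print(round(growth, 2), degraded)  # 0.39 True
```

Run the same comparison on throughput and error rate; degradation often shows up in one metric before the others.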

Setting SLOs

Before testing, define what "good" looks like.

Interactive chat (consumer):

  • TTFT p95 < 500ms
  • ITL p95 < 50ms
  • Error rate < 0.1%

Real-time copilot:

  • TTFT p95 < 200ms
  • ITL p95 < 30ms
  • Error rate < 0.01%

Batch processing:

  • E2EL p95 < 10s
  • Throughput > 1000 TPS
  • Error rate < 1%

Enterprise chatbot:

  • TTFT p99 < 1s
  • E2EL p99 < 30s
  • Goodput > 99%

These vary by use case. A document summarization pipeline has different requirements than a real-time coding assistant.

Avoiding Common Mistakes

Testing with Uniform Prompts

Repeated identical prompts trigger caching. Your results will be artificially good. Use diverse, realistic prompts.

Ignoring Token Costs

Cloud LLM APIs charge per token. A 4-hour load test at high concurrency can cost thousands of dollars. Start with short tests against cheaper models.

Missing Streaming Dynamics

If your app streams, test streaming. Non-streaming tests miss TTFT, ITL, and connection management issues.

Testing in Isolation

Your LLM might be fast, but what about the database queries, vector search, and post-processing in your pipeline? Test the full stack, not just the model.

Overlooking Error Handling

What happens when requests fail? Does your retry logic work under load? Test error scenarios: rate limits, timeouts, malformed responses.

Running from CI/CD

GitHub Actions runners have variable network conditions and resource constraints. Results aren't reproducible. Use dedicated infrastructure for performance tests.

Monitoring After Deployment

Load testing validates capacity before production. Monitoring maintains it after.

Key metrics to track continuously:

  • TTFT p95/p99: User-perceived responsiveness
  • Throughput (TPS): System capacity
  • Error rate: Reliability
  • Queue depth: Capacity utilization
  • GPU utilization: Resource efficiency
  • KV cache usage: Memory pressure

vLLM, TGI, and llama.cpp all expose Prometheus-compatible metrics. Feed them into Grafana for dashboards and alerts.
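Scraping these is straightforward because the Prometheus text format is line-oriented. A sketch of parsing a sample exposition (the metric names below are illustrative; check your server's /metrics endpoint for the exact names):

```python
# Parse a Prometheus text exposition into a name -> value dict.
sample = """\
vllm:num_requests_running 12.0
vllm:num_requests_waiting 3.0
vllm:gpu_cache_usage_perc 0.87
"""

metrics = {}
for line in sample.splitlines():
    if line and not line.startswith("#"):  # skip HELP/TYPE comment lines
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)

print(metrics["vllm:num_requests_waiting"])  # 3.0
```

In production you'd let Prometheus do the scraping and alert on these in Grafana, but the same parsing works for quick checks during a load test.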

For deeper observability including request tracing and quality monitoring, see the LLM observability guide.

When You Need More Than Testing

Load testing tells you what your system can handle today. It doesn't solve:

  • Model quality under load: Does output quality degrade at high concurrency?
  • Cost optimization: Is your infrastructure sized correctly?
  • Continuous monitoring: Does performance drift over time?

If you're spending more time on infrastructure than on your application, managed platforms handle this complexity. Prem provides production LLM deployment with built-in performance monitoring, evaluation pipelines to validate quality, and fine-tuning to optimize for your specific workload.

For teams with data sovereignty requirements, Prem deploys to your own infrastructure while handling the observability and scaling that would otherwise require dedicated MLOps resources.

The right approach depends on your team. If you have infrastructure expertise and specific optimization needs, build your own benchmarking pipeline. If you'd rather focus on your application, use managed infrastructure with production-grade monitoring built in.

FAQ

How many concurrent users should I test?

Start at 1 to establish baseline. Then test at expected peak load, 2x peak, and beyond until you find the breaking point. Understanding where your system fails helps capacity planning.

What's a good TTFT target?

Depends on use case. Consumer chat: 500ms. Real-time copilot: 200ms. Batch processing: doesn't matter much. Define based on user expectations, then verify you can meet it under load.

Should I test with streaming enabled?

If your production app uses streaming, yes. Streaming changes connection management, metrics capture, and failure modes. Non-streaming tests miss these dynamics.

How long should load tests run?

Baseline tests: a few minutes. Throughput tests: 10-30 minutes. Soak tests: 4+ hours. Short tests miss gradual degradation; very long tests cost money without additional insight.

How do I test without spending a fortune on API calls?

Start with shorter tests. Use cheaper models for initial validation. Test against self-hosted models during development. Reserve expensive cloud API tests for final validation.

What's the difference between load testing and benchmarking?

Load testing simulates production traffic to validate capacity and identify bottlenecks. Benchmarking measures raw model performance under controlled conditions. Both matter; they answer different questions.
