Deploying LLMs on Kubernetes: vLLM, Ray Serve & GPU Scheduling Guide (2026)
Most guides stop at kubectl apply and call it done. Then you hit production: GPU nodes sitting idle because the scheduler doesn't understand topology. Autoscaling that triggers on CPU while your inference queue backs up. Model updates that drop in-flight requests.
This guide covers the full stack. vLLM and Ray Serve deployment, GPU scheduling with MIG and topology awareness, autoscaling on queue depth and KV cache utilization, Prometheus/Grafana monitoring, and production patterns like canary rollouts and graceful shutdown. Configurations are verified against vLLM v0.17.0 and Ray 2.54.0.
Prerequisites: A Kubernetes cluster with GPU nodes (NVIDIA), kubectl, Helm 3+, and working knowledge of K8s concepts (Deployments, Services, PVCs).
Why Kubernetes for LLM Inference
Kubernetes isn't the only way to serve LLMs. But once you need to scale, it handles GPU workloads better than any alternative.
The NVIDIA GPU Operator (v25.10.1) gives you automatic GPU discovery, MIG partitioning, and time-slicing from a single Helm install. GPU Feature Discovery auto-labels nodes with hardware metadata — model, memory, CUDA version — so you can schedule a 70B model to H100 nodes and a 7B model to L40S nodes using node affinity rules.
Kubernetes HPA with custom metrics lets you scale on inference-specific signals like queue depth and KV cache utilization instead of CPU. The Gateway API Inference Extension (GA as of February 2026, v1.3.1) adds model-aware routing, KV-cache-aware scheduling, and traffic splitting by model name for A/B testing.
When not to use Kubernetes: one model, one GPU, no scaling needs. Standalone vLLM with Docker is enough. Don't add K8s complexity for a single-replica deployment.
Choosing Your Serving Engine
Three stacks dominate LLM inference on Kubernetes. Each fits a different scale.
| Feature | vLLM (standalone) | Ray Serve + vLLM | llm-d |
|---|---|---|---|
| Best for | Single-node, single-model | Multi-node, multi-model | Disaggregated serving at scale |
| Multi-node inference | Manual setup | Automatic placement groups | Native with NIXL KV transfer |
| Multi-model serving | Separate Deployments | Single cluster, shared resources | Shared infrastructure with SLO guarantees |
| Autoscaling | External (HPA/KEDA) | Built-in (replica + cluster + infra) | Workload-variant autoscaler |
| K8s integration | Raw manifests or Helm | KubeRay operator (RayService CRD) | Helm + K8s Inference Gateway |
| Operational complexity | Low | Medium | High |
| GitHub stars | 72.4k | 41.6k (Ray) / 2.4k (KubeRay) | 2.6k |
Stars as of March 2026. Sources: vLLM, Ray, KubeRay, llm-d.
The decision is simple. If your model fits on one node's GPUs, start with standalone vLLM. When you need multi-node inference or want to serve multiple models from one cluster, move to Ray Serve. The official Ray docs are direct about this: "Traditional vLLM serves single-node scenarios better; Ray Serve LLM adds coordination overhead justified only by distributed scaling requirements."
For disaggregated prefill/decode at massive scale, consider llm-d (co-created by Red Hat, Google, and IBM). The team reports ~3.1k tokens/sec per B200 decode GPU.
The K8s ecosystem also has higher-level operators worth knowing: vLLM Production Stack (2.2k stars) bundles vLLM with a KV-cache-aware request router and Prometheus/Grafana. AIBrix (4.7k stars) adds LoRA management and SLO-aware autoscaling. KubeAI (1.2k stars) is a lightweight operator with scale-from-zero and no Istio dependencies.
GPU Scheduling for LLM Workloads
Setting Up the NVIDIA GPU Stack
Install the GPU Operator via Helm:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator --create-namespace
This deploys the device plugin (exposes nvidia.com/gpu resources), GPU Feature Discovery (auto-labels nodes), and DCGM Exporter (GPU metrics for Prometheus). After install, verify with:
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
GPU Feature Discovery labels nodes automatically. Target specific GPU types like this:
nodeSelector:
nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
Node Affinity and Taints
Isolate GPU nodes from non-GPU workloads with taints:
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule
Add tolerations to your LLM pods:
tolerations:
- key: nvidia.com/gpu
operator: Equal
value: "true"
effect: NoSchedule
For multi-GPU-type clusters, use node affinity with GPU Feature Discovery labels to route models to the right hardware:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values: ["NVIDIA-H100-SXM5-80GB"]
- key: nvidia.com/gpu.memory
operator: Gt
values: ["40000"]
GPU Sharing: MIG and Time-Slicing
Multi-Instance GPU (MIG) partitions A100 and H100 GPUs into hardware-isolated instances, each with dedicated memory and compute. A single A100 80GB can run up to seven 1g.10gb instances.
Enable MIG by upgrading the existing GPU Operator release (a second helm install with the same release name will fail):
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --set mig.strategy=single
kubectl label nodes gpu-node nvidia.com/mig.config=all-1g.10gb
Pods request MIG slices instead of full GPUs:
resources:
limits:
nvidia.com/mig-1g.10gb: 1
Time-slicing shares a GPU across multiple workloads without hardware isolation. Configure via ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: time-slicing-config
data:
any: |-
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 4
Use MIG for production multi-tenant workloads where memory isolation matters. Use time-slicing for development and testing. Time-slicing works on all NVIDIA GPUs; MIG requires A100 or newer (source).
Topology-Aware Scheduling
For latency-sensitive inference, configure the Topology Manager to keep GPU and CPU on the same NUMA node:
# In kubelet configuration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
For multi-GPU jobs that require all GPUs allocated simultaneously (tensor parallelism across GPUs), Volcano (v1.12.0) provides gang scheduling. This prevents deadlocks where half the GPUs for a model are allocated on one node while the other half wait on a different node.
Deploying vLLM on Kubernetes
Model Storage with Persistent Volumes
LLM weights are large. A 70B model is ~140GB in FP16. Cache them on a PersistentVolumeClaim so pods don't re-download on every restart.
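Before sizing the PVC, a quick back-of-the-envelope check helps. This sketch (my own arithmetic, not from any vLLM tooling) estimates the weight footprint from parameter count and precision; real checkpoints add a few percent for tokenizer files and safetensors metadata:

```python
# Rough sizing for model weights on disk / in VRAM.
# Assumes a dense model with all parameters at one precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_size_gb(num_params_billions: float, dtype: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

print(weight_size_gb(70, "fp16"))  # 140.0 -> the ~140GB figure above
print(weight_size_gb(70, "int4"))  # 35.0  -> the ~35GB AWQ/GPTQ footprint
print(weight_size_gb(7, "fp16"))   # 14.0  -> a 7B model fits comfortably in 100Gi
```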
Create the PVC and a Secret for HuggingFace auth:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: vllm-models
namespace: llm-inference
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
---
apiVersion: v1
kind: Secret
metadata:
name: hf-token
namespace: llm-inference
type: Opaque
stringData:
token: "your-hf-token-here"
For multi-replica deployments sharing the same model weights, use ReadOnlyMany (ROX) access mode with NFS, Amazon EFS, or CephFS. This avoids duplicating 140GB per replica (K8s PV docs).
The vLLM Deployment Manifest
A complete production Deployment for vLLM serving Mistral 7B on a single GPU:
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-mistral
namespace: llm-inference
labels:
app: vllm
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
initContainers:
- name: model-download
image: bitnami/huggingface-hub-cli:latest
command:
- huggingface-cli
- download
- mistralai/Mistral-7B-Instruct-v0.3
- --cache-dir
- /models
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
volumeMounts:
- name: model-cache
mountPath: /models
containers:
- name: vllm
image: vllm/vllm-openai:latest
command:
- vllm
- serve
- mistralai/Mistral-7B-Instruct-v0.3
- --tensor-parallel-size
- "1"
- --max-model-len
- "8192"
- --enable-chunked-prefill
- --gpu-memory-utilization
- "0.9"
ports:
- containerPort: 8000
name: http
env:
- name: HF_HOME
value: /models
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
resources:
requests:
cpu: "4"
memory: 16Gi
nvidia.com/gpu: "1"
limits:
cpu: "8"
memory: 24Gi
nvidia.com/gpu: "1"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 5
volumeMounts:
- name: model-cache
mountPath: /models
- name: shm
mountPath: /dev/shm
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: vllm-models
- name: shm
emptyDir:
medium: Memory
sizeLimit: 2Gi
terminationGracePeriodSeconds: 300
---
apiVersion: v1
kind: Service
metadata:
name: vllm-service
namespace: llm-inference
labels:
app: vllm
spec:
selector:
app: vllm
ports:
- port: 8000
targetPort: 8000
name: http
Three things most guides skip:
- Shared memory volume. The emptyDir with medium: Memory at /dev/shm is required for tensor parallel inference. Without it, vLLM crashes with OOM errors on multi-GPU setups (source).
- terminationGracePeriodSeconds: 300. The default 30 seconds kills in-flight inference requests. Increase this to let ongoing generations finish before the pod shuts down.
- initialDelaySeconds: 120. A 7B model takes 30-60 seconds to load into GPU memory. Set readiness probes accordingly or you'll route traffic to a pod that isn't ready.
Deploying with the vLLM Helm Chart
For a faster setup, use the official Helm chart:
helm install vllm oci://ghcr.io/vllm-project/vllm-chart \
--set model=mistralai/Mistral-7B-Instruct-v0.3 \
--set gpu=1 \
--namespace llm-inference --create-namespace
The chart lives in the vLLM repo at examples/online_serving/chart-helm/ (docs).
For production clusters serving multiple models, the vLLM Production Stack (v0.1.10) adds a KV-cache-aware request router and bundled Prometheus/Grafana:
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f values.yaml
Testing the Deployment
Port-forward and send a request:
kubectl port-forward svc/vllm-service 8000:8000 -n llm-inference
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "What is Kubernetes?"}],
"max_tokens": 100
}'
Verify Prometheus metrics:
curl http://localhost:8000/metrics | grep vllm
You should see vllm:num_requests_running, vllm:gpu_cache_usage_perc, and vllm:time_to_first_token_seconds.
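If you want to sanity-check a scrape without standing up Grafana, the Prometheus text format is simple enough to parse by hand. A minimal stdlib sketch — the sample payload below is illustrative, not a real vLLM scrape:

```python
import re

def parse_prom_text(text: str, prefix: str = "vllm:") -> dict:
    """Parse Prometheus text-format exposition into {metric_name: value},
    keeping only metrics with the given prefix and skipping comment lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Matches both `name{labels} value` and `name value`.
        m = re.match(r'^([^\s{]+)(\{[^}]*\})?\s+([0-9.eE+-]+)', line)
        if m and m.group(1).startswith(prefix):
            metrics[m.group(1)] = float(m.group(3))
    return metrics

sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistral"} 3.0
vllm:num_requests_waiting{model_name="mistral"} 7.0
vllm:gpu_cache_usage_perc{model_name="mistral"} 0.42
"""
print(parse_prom_text(sample))
# {'vllm:num_requests_running': 3.0, 'vllm:num_requests_waiting': 7.0, 'vllm:gpu_cache_usage_perc': 0.42}
```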
Deploying with Ray Serve on Kubernetes
When you need multi-node inference or multi-model serving from a shared GPU cluster, Ray Serve adds the coordination layer that standalone vLLM lacks. If you're running self-hosted fine-tuned models, this is the setup that scales.
Installing KubeRay
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator \
--namespace kuberay-system --create-namespace
Verify the CRDs are installed:
kubectl get crd | grep ray
# rayclusters.ray.io, rayjobs.ray.io, rayservices.ray.io
KubeRay v1.5.1 provides three CRDs: RayCluster for raw clusters, RayJob for batch workloads, and RayService for serving with zero-downtime upgrades (source).
RayService Manifest for LLM Serving
Deploy a Qwen 2.5 7B model with autoscaling:
apiVersion: ray.io/v1
kind: RayService
metadata:
name: llm-serve
namespace: llm-inference
spec:
serveConfigV2: |
applications:
- name: llms
import_path: ray.serve.llm:build_openai_app
route_prefix: "/"
args:
llm_configs:
- model_loading_config:
model_id: qwen2.5-7b
model_source: Qwen/Qwen2.5-7B-Instruct
engine_kwargs:
dtype: bfloat16
max_model_len: 4096
gpu_memory_utilization: 0.85
deployment_config:
autoscaling_config:
min_replicas: 1
max_replicas: 4
target_ongoing_requests: 64
max_ongoing_requests: 128
accelerator_type: A10G
rayClusterConfig:
headGroupSpec:
rayStartParams:
dashboard-host: "0.0.0.0"
template:
spec:
containers:
- name: ray-head
image: rayproject/ray-ml:2.54.0
resources:
requests:
cpu: "4"
memory: 8Gi
workerGroupSpecs:
- groupName: gpu-workers
replicas: 2
minReplicas: 1
maxReplicas: 4
rayStartParams: {}
template:
spec:
containers:
- name: ray-worker
image: rayproject/ray-ml:2.54.0
resources:
requests:
cpu: "4"
memory: 16Gi
nvidia.com/gpu: "1"
limits:
nvidia.com/gpu: "1"
tolerations:
- key: nvidia.com/gpu
operator: Equal
value: "true"
effect: NoSchedule
Ray Serve autoscaling works at three levels simultaneously. The application autoscaler adjusts model replicas based on target_ongoing_requests. The Ray Autoscaler adds/removes worker pods based on logical resource demands. The Kubernetes Cluster Autoscaler provisions new GPU nodes when needed (source).
Multi-Model Serving
Pass multiple LLMConfig objects to serve multiple models from one cluster:
args:
llm_configs:
- model_loading_config:
model_id: mistral-7b
model_source: mistralai/Mistral-7B-Instruct-v0.3
engine_kwargs:
max_model_len: 8192
deployment_config:
autoscaling_config:
min_replicas: 1
max_replicas: 2
accelerator_type: A10G
- model_loading_config:
model_id: qwen-7b
model_source: Qwen/Qwen2.5-7B-Instruct
engine_kwargs:
max_model_len: 4096
deployment_config:
autoscaling_config:
min_replicas: 1
max_replicas: 2
accelerator_type: A10G
Each model gets independent autoscaling. Clients select the model via the model field in the request body — identical to OpenAI's API. For scenarios with many similar models (fine-tuned variants), Ray Serve's model multiplexing serves them from a shared replica pool with LRU eviction.
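Since the endpoint is OpenAI-compatible, a client targets a specific deployment purely through the request body. A small illustrative sketch (model IDs taken from the config above; the helper function name is my own):

```python
import json

def chat_request(model_id: str, prompt: str, max_tokens: int = 100) -> str:
    """Build an OpenAI-compatible /v1/chat/completions payload.
    The model field routes the request to one of the deployed models."""
    return json.dumps({
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

# Same endpoint, different model per request:
print(chat_request("mistral-7b", "Summarize Kubernetes in one line."))
print(chat_request("qwen-7b", "Summarize Kubernetes in one line."))
```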
Autoscaling LLM Inference on Kubernetes
Why CPU and Memory Metrics Don't Work
This is the most common mistake in LLM deployment. Standard HPA scales on CPU utilization, but LLM inference is GPU-bound. Your CPU can sit at 5% while your inference queue backs up with 50 waiting requests.
Google's GKE best practices document this clearly:
- GPU Utilization (DCGM_FI_DEV_GPU_UTIL) is a duty cycle measurement. A GPU at "100% utilization" could be processing 10 requests or 100. This metric won't tell you the difference.
- GPU Memory is pre-allocated by vLLM for the KV cache. Memory usage stays constant regardless of load, so it never triggers scale-down.
Scale on queue depth and batch size instead.
| Metric | vLLM Prometheus Name | Best For | Starting Threshold |
|---|---|---|---|
| Queue depth | vllm:num_requests_waiting | Maximizing throughput | 3-5 requests |
| Batch size | vllm:num_requests_running | Latency-sensitive workloads | Below max observed batch size |
| KV cache utilization | vllm:gpu_cache_usage_perc | Memory pressure detection | 0.85 (85%) |
| TTFT p99 | vllm:time_to_first_token_seconds | User experience SLOs | App-specific |
Thresholds from GKE best practices. Metric names from vLLM metrics docs.
HPA with vLLM Queue Depth (Prometheus Adapter)
Wire vLLM's queue depth to Kubernetes HPA using the Prometheus Adapter (v0.12.0).
Configure the adapter to expose vllm:num_requests_waiting as a custom metric:
# prometheus-adapter-config ConfigMap
rules:
- seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: 'vllm:num_requests_waiting'
as: 'vllm_queue_depth'
metricsQuery: 'sum(vllm:num_requests_waiting{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
Then create an HPA targeting that metric:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
namespace: llm-inference
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-mistral
minReplicas: 1
maxReplicas: 8
metrics:
- type: Pods
pods:
metric:
name: vllm_queue_depth
target:
type: AverageValue
averageValue: "5"
behavior:
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Pods
value: 2
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
GKE best practices recommend scale-up stabilization of 0 seconds (respond immediately to load) and scale-down stabilization of 300 seconds to avoid premature downscaling.
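It helps to internalize the replica math behind these settings. For an AverageValue metric, HPA computes desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric) and skips changes inside its tolerance (10% by default). A sketch of that formula — it ignores maxReplicas clamping and stabilization windows:

```python
import math

def hpa_desired_replicas(current_replicas: int, current_avg: float,
                         target_avg: float, tolerance: float = 0.1) -> int:
    """Kubernetes HPA scaling formula for an AverageValue metric.
    Within the default 10% tolerance, HPA leaves the replica count unchanged."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 replicas, average queue depth 12, target 5 -> scale up to 10 pods
print(hpa_desired_replicas(4, 12, 5))   # 10
# queue drained to roughly the target -> no change (within tolerance)
print(hpa_desired_replicas(4, 5.2, 5))  # 4
```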
Scale-to-Zero with KEDA
Standard HPA can't scale to zero replicas. KEDA (v2.19) adds this, which cuts GPU costs significantly for low-traffic models:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-scaledobject
namespace: llm-inference
spec:
scaleTargetRef:
name: vllm-mistral
minReplicaCount: 0
maxReplicaCount: 8
cooldownPeriod: 300
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
threshold: "1"
query: sum(rate(vllm:request_success_total{namespace="llm-inference"}[2m]))
activationThreshold: "0.5"
The tradeoff is cold start time. Scaling from zero means re-loading the model into GPU memory. A 7B model takes 30-60 seconds. A 70B model takes several minutes. Use scale-to-zero for models with predictable low-traffic windows, not for latency-critical endpoints.
GPU Node Autoscaling with Karpenter
Karpenter provisions GPU nodes automatically when pods can't be scheduled:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: gpu-inference
spec:
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
template:
spec:
requirements:
- key: node.kubernetes.io/instance-type
operator: In
values: ["p4d.24xlarge", "g6e.xlarge", "g6e.2xlarge"]
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand", "spot"]
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
consolidationPolicy: WhenEmptyOrUnderutilized bin-packs GPU workloads to minimize idle nodes. Including both on-demand and spot lets Karpenter fall back to on-demand when spot GPU instances are unavailable — which happens more than you'd expect.
Monitoring LLM Inference with Prometheus and Grafana
Scraping vLLM Metrics
vLLM exposes Prometheus metrics at /metrics on port 8000 with the vllm: prefix. Create a ServiceMonitor for the Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: vllm-metrics
namespace: llm-inference
labels:
release: prometheus
spec:
selector:
matchLabels:
app: vllm
endpoints:
- port: http
path: /metrics
interval: 15s
namespaceSelector:
matchNames:
- llm-inference
For Ray Serve deployments, use a PodMonitor targeting the head node on port 8080:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: ray-head-monitor
labels:
release: prometheus
spec:
selector:
matchLabels:
ray.io/node-type: head
podMetricsEndpoints:
- port: metrics
- port: as-metrics
- port: dash-metrics
Ray Serve includes a pre-built Grafana dashboard since Ray 2.51 (source).
Essential PromQL Queries
Time to First Token (P95):
histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))
Generation Tokens Per Second:
rate(vllm:generation_tokens_total[1m])
KV Cache Utilization:
vllm:gpu_cache_usage_perc
Request Queue Depth:
vllm:num_requests_waiting
End-to-End Latency (P99):
histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m]))
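It's worth knowing how histogram_quantile arrives at these numbers: it finds the bucket containing the target rank and linearly interpolates inside it, so the result is only as precise as your bucket boundaries. A simplified re-implementation for intuition (not the Prometheus source; assumes cumulative le-style buckets):

```python
def histogram_quantile(q, buckets):
    """Approximate Prometheus histogram_quantile():
    buckets is a sorted list of (upper_bound, cumulative_count),
    ending with the +Inf bucket. Linear interpolation within the
    bucket that contains the target rank."""
    total = buckets[-1][1]          # count in the +Inf bucket
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound   # rank falls in the open-ended bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# TTFT buckets (seconds): 60% of requests under 0.5s, 90% under 1s
b = [(0.25, 20), (0.5, 60), (1.0, 90), (2.5, 98), (float("inf"), 100)]
print(histogram_quantile(0.95, b))  # 1.9375 -> p95 lands in the 1.0-2.5s bucket
```

The takeaway: if your TTFT SLO is 5s, make sure a bucket boundary sits near 5s, or the interpolated quantile will smear across a wide bucket.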
Set up alerting rules for production:
groups:
- name: vllm-alerts
rules:
- alert: HighKVCacheUsage
expr: vllm:gpu_cache_usage_perc > 0.9
for: 5m
annotations:
summary: "KV cache usage above 90%, requests may be preempted"
- alert: HighQueueDepth
expr: vllm:num_requests_waiting > 10
for: 2m
annotations:
summary: "Request queue backing up, consider scaling replicas"
- alert: HighTTFT
expr: histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m])) > 5
for: 5m
annotations:
summary: "TTFT P99 above 5 seconds"
Production Patterns
Graceful Shutdown for Long-Running Inference
The default terminationGracePeriodSeconds of 30 seconds kills in-flight LLM requests. A streaming response generating 500 tokens can take 10-30 seconds. Batch requests take longer.
Increase the grace period and add a preStop hook:
spec:
terminationGracePeriodSeconds: 300
containers:
- name: vllm
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 10"]
The preStop hook runs before SIGTERM, giving the load balancer time to drain the pod from its endpoint list. For streaming workloads, 600+ seconds is more appropriate (source).
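The full sequence is: preStop runs (endpoints drain), SIGTERM arrives, the server stops accepting new work and finishes in-flight generations, then exits before the grace period expires. A minimal sketch of that drain loop — a generic pattern for a wrapper process, not vLLM's actual shutdown code:

```python
import signal
import threading
import time

shutting_down = threading.Event()
in_flight = 0                 # incremented/decremented by request handlers
lock = threading.Lock()

def handle_sigterm(signum, frame):
    # Stop accepting new requests; existing generations keep running.
    shutting_down.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def drain(timeout_s: float, poll_s: float = 0.05) -> bool:
    """Wait until in-flight requests reach zero or the grace period expires.
    Returns True if fully drained. timeout_s should stay below
    terminationGracePeriodSeconds minus the preStop sleep."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with lock:
            if in_flight == 0:
                return True
        time.sleep(poll_s)
    return False
```

In a real wrapper, the main loop would call drain() once shutting_down is set and exit 0 on success, letting Kubernetes reap the pod cleanly instead of SIGKILLing mid-generation.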
Canary Deployments for Model Updates
Hard cutovers on model updates are risky. Use Argo Rollouts (v1.8.4) to gradually shift traffic to the new version while monitoring TTFT and error rates.
The Gateway API Inference Extension gives you a more LLM-native approach: traffic splitting by model name. Route 10% of requests to the new model version, watch quality metrics, promote incrementally. This operates at the request routing layer rather than the replica layer, giving you finer control.
Security Hardening
Pod security. Apply the Baseline Pod Security Standard to your inference namespace. The Restricted standard conflicts with GPU driver requirements, so Baseline is the practical choice.
Secret management. Don't rely on base64-encoded Kubernetes Secrets alone for HuggingFace tokens. Use the External Secrets Operator (v2.1.0) to sync secrets from AWS Secrets Manager, HashiCorp Vault, or your cloud provider's KMS.
Network isolation. Create NetworkPolicies that restrict traffic to your inference namespace. Only the API gateway and monitoring stack should reach your vLLM pods.
Choosing the Right Stack
| Scenario | Recommended Stack | Why |
|---|---|---|
| Single model, single GPU | vLLM + K8s Deployment | Lowest complexity |
| Single model, multi-GPU (70B+) | vLLM + tensor parallelism | Set --tensor-parallel-size to match GPU count |
| Multiple models, shared cluster | Ray Serve + KubeRay | Built-in multi-model, independent autoscaling per model |
| Massive scale, latency SLOs | llm-d + K8s Inference Gateway | Disaggregated prefill/decode, KV-cache-aware routing |
| Managed, no K8s ops | PremAI Platform | Deploys in your VPC, zero data retention, no infra management |
If managing Kubernetes GPU infrastructure isn't where your team's time adds value, PremAI deploys LLM inference in your own cloud account with zero data retention and built-in autoscaling. Book a technical call to discuss your setup.
Common Pitfalls
OOMKilled on startup. Two causes: missing shared memory volume at /dev/shm for tensor parallelism, or the model is too large for available GPU VRAM. Fix: add the emptyDir with medium: Memory, or use quantization (--quantization awq) to reduce memory footprint.
Slow cold starts. Model download from HuggingFace takes 5-10 minutes for 7B models, 20+ minutes for 70B. Fix: use init containers to pre-download to a PVC, and set initialDelaySeconds: 120 on readiness probes.
CUDA version mismatch. A vLLM build compiled against a newer CUDA toolkit than the node's driver supports fails with PTX was compiled with an unsupported toolchain. Fix: use the official vllm/vllm-openai Docker image, which bundles a matching CUDA runtime, and let the GPU Operator keep node drivers current (source).
Pods stuck in Pending. The NVIDIA device plugin DaemonSet isn't running on GPU nodes, or all GPUs are allocated. Verify with kubectl get daemonset -n gpu-operator and check allocatable GPU count with kubectl describe node.
Autoscaling not working. You're scaling on CPU utilization, which stays flat during GPU inference. Switch to queue depth (vllm:num_requests_waiting) via Prometheus Adapter.
FAQ
How much GPU memory do I need for a 70B model?
In FP16, a 70B model needs ~140GB of GPU VRAM for weights, plus memory for the KV cache. That's 2x A100 80GB or 4x A100 40GB with tensor parallelism. With INT4 quantization (AWQ or GPTQ), the weight footprint drops to ~35GB — fitting on a single A100 80GB or H100.
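The "plus memory for the KV cache" part is computable too: each token stores a K and a V tensor per layer, so bytes per token = 2 × layers × kv_heads × head_dim × bytes per element. A sketch using Llama-3-70B-like shapes (80 layers, 8 KV heads under GQA, head dim 128 — illustrative assumptions, check your model's config):

```python
def kv_cache_gb(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GB: 2 tensors (K and V) per layer per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens / 1e9

# Llama-3-70B-like shapes, FP16 cache:
per_token_kb = kv_cache_gb(1, 80, 8, 128) * 1e6   # ~327.7 KB per token
print(kv_cache_gb(32_000, 80, 8, 128))            # ~10.5 GB for a 32k-token context
```

That per-context figure is why vLLM pre-allocates the cache up front via --gpu-memory-utilization: budget weights plus worst-case concurrent context tokens, not just the weights.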
Can I run multiple models on the same GPU?
Yes. Use MIG on A100/H100 for hardware-isolated partitions, or time-slicing for software-level sharing. Ray Serve's model multiplexing also supports multiple models on shared replicas with LRU eviction. Time-slicing has no memory isolation between models.
What's the difference between vLLM standalone and Ray Serve?
vLLM standalone runs a single inference engine on one node. Ray Serve wraps vLLM with distributed coordination for multi-node inference, multi-model serving, built-in autoscaling, and zero-downtime upgrades via KubeRay. Ray Serve uses the same vLLM engine underneath — you can migrate with zero code changes (source).
How do I scale LLM inference to zero?
Standard HPA can't go below 1 replica. Use KEDA with a Prometheus trigger monitoring request rate. When requests drop to zero, KEDA scales to 0. The tradeoff: cold start time when the first request arrives (30-60 seconds for a 7B model with cached weights).
How long does cold start take for LLM pods?
With weights pre-cached on a PVC: 30-60 seconds for a 7B model, 2-5 minutes for a 70B model. Without caching (downloading from HuggingFace): add 5-10 minutes for 7B and 20+ minutes for 70B.
Should I use MIG or time-slicing?
MIG gives hardware-level isolation with dedicated memory and compute per instance. Use it for production multi-tenant workloads on A100/H100. Time-slicing has no memory isolation but works on all NVIDIA GPUs. Use it for development, testing, and non-critical workloads.
How do I monitor LLM inference quality?
Track four metrics via vLLM's Prometheus endpoint: TTFT (time to first token) for perceived latency, inter-token latency for streaming quality, KV cache utilization for memory pressure, and queue depth for capacity planning. Starting alert thresholds: TTFT P99 above 5s, KV cache above 90%, queue depth above 10.
What Kubernetes version do I need?
1.26+ for stable GPU scheduling. 1.27+ for topology manager stability. 1.29+ for llm-d. 1.30+ for scheduling gates. 1.31+ for Image Volume (OCI model artifacts).
For teams evaluating managed LLM deployment without Kubernetes overhead, see the PremAI self-host guide or book a technical call to talk through your setup.