Serverless LLM Deployment: RunPod vs Modal vs Lambda (2026)
Cold starts: 5-120 seconds. Break-even: 40% GPU utilization. Lambda doesn't even offer serverless anymore.
If you're evaluating serverless GPU inference in 2026, here's the short version:
| If You Need | Use |
|---|---|
| Fastest setup (Hugging Face → endpoint in minutes) | RunPod |
| Lowest per-request cost | Modal |
| Lowest always-on cost | Lambda (on-demand VM) |
| Enterprise compliance on serverless | Cerebrium |
| Managed infrastructure + data sovereignty | PremAI |
The rest of this guide covers the pricing, cold start data, and break-even math behind those recommendations.
The Real Comparison: Cost Per Request
Skip the hourly rates. What matters is cost per inference request.
Llama 3.1 70B, 5-second inference, A100 80GB:
| Platform | Cost/Request | Cold Start | Scale to Zero |
|---|---|---|---|
| Modal | ~$0.004 | 15-45s | Yes |
| Replicate | $0.007 | 0s (pre-warmed) | Yes |
| RunPod Active | $0.067 | 10-15s (FlashBoot) | No |
| RunPod Flex | $0.095 | 10-15s (FlashBoot) | Yes |
| Lambda VM | $0.002* | 0s (always on) | No |
*Lambda requires managing your own inference stack. Others are managed.
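The per-request numbers above are simple arithmetic: GPU-seconds consumed times the per-second rate. A minimal sketch, using the per-second rates quoted in the pricing sections later in this guide and the table's 5-second inference assumption:

```python
# Per-second rates from the pricing sections below (A100 80GB).
RATES_PER_SEC = {
    "modal": 0.000694,          # Modal, GPU only
    "runpod_active": 0.0133,    # RunPod Active
    "runpod_flex": 0.0190,      # RunPod Flex
    "lambda_vm": 1.39 / 3600,   # Lambda hourly rate spread per second
}

def cost_per_request(rate_per_sec: float, inference_seconds: float = 5.0) -> float:
    """Cost of one request: GPU-seconds consumed times the per-second rate."""
    return rate_per_sec * inference_seconds
```

Modal's table entry (~$0.004) sits slightly above the bare GPU math because Modal bills CPU and memory separately, as noted in its pricing section.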
The trade-off is clear: Modal gives you the lowest managed serverless cost. RunPod gives you the fastest setup. Lambda gives you the cheapest compute if you're willing to manage infrastructure yourself.
If you want managed infrastructure without cold starts or per-second billing complexity, PremAI deploys in your VPC with dedicated GPUs. You get predictable costs, zero cold starts, and don't pay the serverless premium.
When Serverless Makes Sense (And When It Doesn't)
Use serverless when:
- Traffic is bursty with long idle periods (dev/staging, internal tools, batch jobs)
- You're serving multiple fine-tuned models with occasional traffic each
- Scale-to-zero is a hard requirement
- You're prototyping and want fast iteration
Use dedicated when:
- GPU utilization exceeds 40%
- Volume exceeds 500K tokens/minute
- Latency is critical (cold starts are unacceptable)
- Compliance requires dedicated infrastructure
Use managed dedicated (like PremAI) when:
- You need dedicated infrastructure but don't want to manage GPUs
- Data sovereignty matters (Swiss jurisdiction, zero data retention)
- You need SOC2/GDPR/HIPAA without the serverless complexity
RunPod: Fastest Path to Deployment
Best for: Getting a Hugging Face model running in under 5 minutes.
A few clicks: Serverless → Quick Deploy → Serverless vLLM → enter model name → deploy.
The vLLM worker image is pre-cached. Container deploys instantly. Model weights download separately.
Pricing (March 2026)
| GPU | Flex $/sec | Active $/sec | Flex $/hr |
|---|---|---|---|
| H100 80GB | 0.0272 | 0.0217 | ~$97.92 |
| A100 80GB | 0.0190 | 0.0133 | ~$68.40 |
| RTX 4090 | 0.0069 | 0.0048 | ~$24.84 |
Source: RunPod Pricing
Flex scales to zero. Active runs 24/7 at a roughly 20-30% discount (per the table above).
The hourly rate looks insane ($97.92/hr for H100 Flex vs $2.69/hr for on-demand pod). That's the serverless premium. You're paying for FlashBoot, orchestration, and per-second granularity. The math only works if utilization is very low.
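"Very low utilization" can be made concrete: find the utilization at which one hour of Flex billing equals one hour of on-demand pod rental, using the two rates quoted above.

```python
# H100 rates from the paragraph above.
flex_per_hr = 0.0272 * 3600   # $97.92/hr if the GPU were busy every second
pod_per_hr = 2.69             # on-demand H100 pod

# Utilization at which Flex spend equals always-on pod rental.
break_even_utilization = pod_per_hr / flex_per_hr       # ~2.7%
break_even_seconds = break_even_utilization * 3600      # ~99 busy seconds/hour
```

Below roughly 2.7% utilization (about 99 busy seconds per hour), Flex is cheaper; above it, the pod wins.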
FlashBoot Cold Starts
RunPod's cold start optimization retains worker state after spin-down.
- Popular endpoints: sub-200ms revival
- Infrequent endpoints: 8-30 seconds
- First cold start (7.5GB model, RTX 4090): 52.6 seconds
- Subsequent FlashBoot starts: 10-15 seconds
Source: GitHub Issue #111
Limitations
- 90s HTTP timeout (100s max)
- 2,000 requests per 10 seconds rate limit
- FlashBoot inconsistent for low-traffic endpoints
- Workers reinitialize ~1 minute after last request
Modal: Lowest Per-Request Cost
Best for: Teams who want infrastructure-as-code and care about cost optimization.
Everything is Python. No Docker. No YAML. No Kubernetes.
```python
import modal

app = modal.App("vllm-server")

# min_containers=0 scales to zero; max_containers caps burst scale-out.
@app.function(gpu="H100", min_containers=0, max_containers=10)
@modal.web_server(port=8000)
def serve():
    # Start vLLM's OpenAI-compatible server here
    ...
```

Deploy with `modal deploy app.py`.
Pricing (March 2026)
| GPU | $/sec | $/hr |
|---|---|---|
| H100 | 0.001097 | ~$3.95 |
| A100 80GB | 0.000694 | ~$2.50 |
| L40S | 0.000542 | ~$1.95 |
Source: Modal Pricing
CPU and memory billed separately. Add ~$0.43-0.89/hr for typical LLM workloads (8 cores, 64 GiB).
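The effective hourly cost is the GPU rate plus that overhead. A quick sketch using the A100 rate from the table and the overhead range quoted above:

```python
gpu_per_hr = 0.000694 * 3600        # A100 80GB: ~$2.50/hr, GPU only
overhead_low, overhead_high = 0.43, 0.89  # 8 cores / 64 GiB estimate from above

effective_low = gpu_per_hr + overhead_low    # ~$2.93/hr
effective_high = gpu_per_hr + overhead_high  # ~$3.39/hr
```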
Free tier: $30/month credits.
GPU Memory Snapshots
Checkpoints GPU state after model load. Restores instead of reloading.
- vLLM + Qwen2.5-0.5B: 45s → 5s startup
- Up to 10x faster cold starts
Enable with `experimental_options={"enable_gpu_snapshot": True}`.
Limitations
- Python-only infrastructure
- Can't bring arbitrary Docker images
- Separate CPU/memory billing adds 10-30%
- 30-second lag on sudden traffic spikes
Lambda: Cheapest Compute (Not Serverless)
Lambda deprecated serverless in September 2025. They now offer on-demand GPU VMs only.
| GPU | $/hr |
|---|---|
| H100 SXM | $3.29 |
| A100 80GB | ~$1.39 |
Source: Lambda Pricing
A100 at $1.39/hr is the cost baseline. Zero egress fees. Per-second billing.
No scale-to-zero. No autoscaling. You manage everything.
Lambda matters as a benchmark: if your serverless bill exceeds what Lambda would cost for always-on, you're overpaying.
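That benchmark is a one-liner: compare your serverless bill against the always-on cost of one A100 at Lambda's rate (the 30-day month here is an assumption for round numbers).

```python
lambda_a100_per_hr = 1.39
monthly_always_on = lambda_a100_per_hr * 24 * 30   # ~$1,000/month per A100

def overpaying(serverless_monthly_bill: float) -> bool:
    """True if the serverless bill exceeds always-on Lambda for one A100."""
    return serverless_monthly_bill > monthly_always_on
```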
Other Options Worth Knowing
| Platform | Cold Start | Compliance | Best For |
|---|---|---|---|
| Replicate | 0s (pre-warmed) | - | One-line deployment |
| Beam | <1s | - | Fastest cold starts |
| Cerebrium | 2-4s | HIPAA, SOC2, GDPR | Enterprise serverless |
| BentoML | 25x faster (streaming) | HIPAA, SOC2, GDPR | Self-host option |
What About Managed Dedicated?
Serverless isn't the only alternative to managing your own GPUs.
PremAI sits in a different category: managed dedicated infrastructure deployed in your VPC. No cold starts. No per-second billing complexity. Predictable monthly costs.
The differentiators:
- Zero data retention with cryptographic verification ("don't trust, verify")
- Swiss jurisdiction under FADP for data sovereignty
- SOC2, GDPR, HIPAA compliance
- Sub-100ms inference latency on dedicated GPUs
If you're choosing between serverless complexity and managing infrastructure yourself, managed dedicated is the third option most comparisons miss.
The Cold Start Problem
Model weight transfer accounts for 72% of time-to-first-token.
| Model Size | Weights | Cold Start |
|---|---|---|
| 1-3B | 2-6 GB | 5-15s |
| 7-13B | 14-26 GB | 15-45s |
| 30-70B | 60-140 GB | 45-120s+ |
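Since weight transfer dominates time-to-first-token (~72% per the figure above), you can roughly estimate cold start from weight size and effective bandwidth. The 1 GB/s default below is an assumption for illustration, not a measured value for any platform.

```python
def estimated_cold_start(weights_gb: float,
                         bandwidth_gb_per_s: float = 1.0,
                         transfer_fraction: float = 0.72) -> float:
    """Rough cold-start estimate: weight-transfer time scaled up by the
    share of time-to-first-token that transfer represents (~72% per the text).
    The 1 GB/s effective bandwidth default is an assumption."""
    transfer_s = weights_gb / bandwidth_gb_per_s
    return transfer_s / transfer_fraction
```

At 1 GB/s, a 7-13B model (14-26 GB of weights) lands at roughly 19-36 seconds, consistent with the table's 15-45s range.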
Solutions
- Warm pools. Set `min_workers=1`. Eliminates cold starts for baseline traffic. You pay for idle.
- GPU memory snapshots (Modal). Up to 10x faster restarts.
- FlashBoot (RunPod). Sub-200ms for popular endpoints.
- Smaller models. Distilled models in 1-3B range have manageable cold starts.
- Managed dedicated. Zero cold starts by design. PremAI and similar platforms keep models loaded.
Break-Even Math
The 40% Rule
Dedicated beats serverless when GPU utilization exceeds 40%.
The 500K Tokens/Minute Threshold
Above ~500K tokens/minute (~100 requests/minute at 5K tokens each), dedicated wins on cost.
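The threshold translates directly into fleet size. A sketch using the 5K tokens/request figure above and the 5-second inference assumption from the cost-per-request table:

```python
tokens_per_min = 500_000
tokens_per_request = 5_000
inference_seconds = 5.0  # same assumption as the cost-per-request table

requests_per_min = tokens_per_min / tokens_per_request      # 100 req/min
gpu_seconds_per_min = requests_per_min * inference_seconds  # 500 GPU-seconds/min
gpus_needed = gpu_seconds_per_min / 60                      # ~8.3 fully-busy GPUs
```

At that point you are effectively running a small cluster at full utilization, which is exactly the regime where dedicated pricing wins.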
RunPod Serverless vs RunPod Pod
- Serverless Active: $0.0133/sec (~$47.88/hr continuous)
- On-Demand Pod: ~$2.17/hr
- Break-even: 4.5% utilization
If GPU runs more than 163 seconds per hour, the pod is cheaper.
Modal vs Lambda
- Modal: ~$2.50/hr of actual use
- Lambda: ~$1.39/hr always-on
- Break-even: 13.3 hours of compute per day
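Both break-even calculations above follow the same pattern; this sketch reproduces the numbers from the rates quoted in this section.

```python
# RunPod: serverless Active vs on-demand pod (A100 80GB).
active_per_hr = 0.0133 * 3600                  # ~$47.88/hr if busy every second
pod_per_hr = 2.17
runpod_break_even = pod_per_hr / active_per_hr          # ~4.5% utilization
runpod_break_even_seconds = runpod_break_even * 3600    # ~163 s of GPU time/hour

# Modal vs Lambda: hours of actual compute per day at which always-on wins.
modal_per_hr = 2.50
lambda_per_day = 1.39 * 24
modal_break_even_hours = lambda_per_day / modal_per_hr  # ~13.3 h/day
```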
The Hidden Option
Both calculations assume you manage dedicated infrastructure yourself. If you factor in engineering time for GPU management, monitoring, and scaling, managed platforms like PremAI can be cheaper than DIY dedicated even at higher utilization rates.
Decision Framework
Step 1: Check your utilization
- Below 40% utilization → Serverless makes sense
- Above 40% → Dedicated wins on cost
Step 2: Check your volume
- Below 500K tokens/minute → Serverless can work
- Above 500K → Dedicated wins
Step 3: Check your constraints
| Constraint | Best Choice |
|---|---|
| Need fastest setup | RunPod |
| Need lowest cost | Modal |
| Need sub-second cold starts | Beam |
| Need compliance (HIPAA/SOC2/GDPR) | Cerebrium or PremAI |
| Need data sovereignty | PremAI (Swiss jurisdiction) |
| Can manage own infra | Lambda VMs |
| Want managed + dedicated | PremAI |
Step 4: Consider hybrid
Many production deployments use:
- Dedicated for baseline traffic
- Serverless for burst overflow
- Scale-to-zero for off-hours
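Steps 1-3 can be sketched as a small routing function. The thresholds come from this guide; the platform picks mirror the Step 3 table, and the function is illustrative, not exhaustive.

```python
def recommend(utilization: float, tokens_per_min: int,
              needs_compliance: bool = False,
              can_manage_infra: bool = False) -> str:
    """Illustrative routing of this guide's decision framework.
    Thresholds (40% utilization, 500K tokens/min) come from the article."""
    if utilization > 0.40 or tokens_per_min > 500_000:
        # Dedicated wins on cost at this scale.
        return "Lambda VMs" if can_manage_infra else "PremAI (managed dedicated)"
    if needs_compliance:
        return "Cerebrium"
    return "Modal"  # lowest serverless cost for bursty traffic
```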
FAQ
What's the cheapest way to deploy an LLM?
For low utilization (<40%): Modal serverless. For high utilization: Lambda VM at $1.39/hr (A100) if you can manage infra, or PremAI if you want managed.
How do I eliminate cold starts?
Set `min_workers=1` (RunPod) or `min_containers=1` (Modal) to keep one instance warm. Or use dedicated infrastructure (Lambda, PremAI) where models stay loaded.
Can serverless handle production traffic?
Yes, if utilization is below 40% and cold starts are acceptable. Above 500K tokens/minute, dedicated wins on both cost and latency.
What if I need compliance + don't want to manage GPUs?
Cerebrium offers HIPAA/SOC2/GDPR on serverless. PremAI offers compliance on managed dedicated with Swiss jurisdiction and zero data retention.
RunPod vs Modal: which is better?
RunPod for fastest setup (Quick Deploy). Modal for lowest cost and infrastructure-as-code. If cost is priority, Modal wins. If time-to-deploy matters, RunPod wins.
Start Here
- Calculate utilization. Pull your request logs. Estimate GPU-active time vs. total time.
- If utilization >40%: Skip serverless. Use Lambda (DIY) or PremAI (managed).
- If utilization <40%:
- Want fastest setup → RunPod
- Want lowest cost → Modal
- Need compliance → Cerebrium
- Set `min_workers=1` to kill cold starts for baseline traffic.
- Book a demo with PremAI if you need managed dedicated with data sovereignty.
For infrastructure deep-dives, see the self-hosted LLM guide or private deployment guide.