GPU Buying Guide for LLMs: RTX 5090 vs H100 vs H200 Complete Comparison (2026)
Which GPU should you buy for running LLMs? From $250 budget cards to $40K datacenter GPUs. Covers VRAM needs, tokens per second benchmarks, and total cost of ownership.
Choosing a GPU for LLMs is fundamentally different from choosing one for gaming or rendering.
Raw compute matters less than you'd expect. Memory bandwidth and VRAM capacity matter more. A $2,000 consumer card can outperform a $10,000 workstation GPU for inference. And sometimes the best answer isn't buying hardware at all.
This guide covers every tier: budget consumer cards, high-end gaming GPUs, professional workstation options, datacenter accelerators, and Apple Silicon. We'll look at what actually determines LLM performance, which models fit on which GPUs, real benchmark numbers, and total cost of ownership including cloud alternatives.
What Actually Determines LLM Performance
Before comparing specific GPUs, understand what drives LLM inference speed.
VRAM: The Hard Constraint
VRAM determines what models you can run. Period.
A 7B parameter model in FP16 needs approximately 14GB of VRAM. A 70B model needs around 140GB. No amount of compute power helps if you can't fit the model in memory.
Quantization changes these requirements dramatically:
| Model Size | FP16 | INT8 | INT4 (Q4) |
|---|---|---|---|
| 7B | 14GB | 7GB | 4GB |
| 13B | 26GB | 13GB | 7GB |
| 32B | 64GB | 32GB | 18GB |
| 70B | 140GB | 70GB | 35GB |
| 405B | 810GB | 405GB | 203GB |
These are minimums for model weights only. Add 2-6GB for KV cache, CUDA overhead, and context. Longer context windows require more KV cache memory, scaling linearly with sequence length.
Practical VRAM guidelines:
- 8GB: 7B models quantized to Q4, limited context
- 12-16GB: 7B-13B models comfortably, some 32B quantized
- 24GB: Most 7B-32B models, 70B heavily quantized
- 32GB: 70B at Q4 (a tight fit; Q3 variants leave headroom)
- 48GB: 70B at Q4 with generous context, or Q5/Q6
- 80GB: 70B at Q8 comfortably; FP16 still needs two GPUs
- 141GB+: 70B FP16 on a single GPU; the very largest models still need quantization or multiple GPUs
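These guidelines reduce to a back-of-the-envelope formula: weights take parameters times bytes per parameter, the KV cache grows linearly with context length, and a couple of GB go to runtime overhead. A minimal sketch, where the architecture defaults (layer count, KV head count, head dimension) are illustrative assumptions, not exact figures for any specific model:

```python
def estimate_vram_gb(params_b, bytes_per_param, context_len=4096,
                     n_layers=32, n_kv_heads=8, head_dim=128, overhead_gb=2.0):
    """Rough VRAM estimate: weights + KV cache + fixed runtime overhead.
    Architecture defaults are illustrative (roughly Llama-3-8B-like),
    not exact for any particular model."""
    weights_gb = params_b * bytes_per_param  # params in billions -> GB
    # KV cache per token: 2 tensors (K and V) * layers * KV heads * head dim * 2 bytes (FP16)
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2
    kv_gb = context_len * kv_bytes_per_token / 1e9
    return weights_gb + kv_gb + overhead_gb

print(round(estimate_vram_gb(7, 2.0), 1))                # 7B FP16 -> 16.5
print(round(estimate_vram_gb(70, 0.5, n_layers=80), 1))  # 70B Q4  -> 38.3
```

Swapping in 0.5 bytes per parameter for Q4 or 1.0 for INT8 reproduces the table above to within the stated 2-6GB overhead band.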
Memory Bandwidth: The Speed Determinant
For single-stream token generation, LLM inference is memory-bandwidth bound, not compute bound.
During token generation, the GPU reads model weights from memory for every single token. Generation speed is limited by how fast data moves from VRAM to compute units.
Memory bandwidth comparison:
| GPU | Bandwidth | Approx. tok/s (8B Q4) |
|---|---|---|
| RTX 3090 | 936 GB/s | 85 |
| RTX 4090 | 1,008 GB/s | 128 |
| RTX 5090 | 1,792 GB/s | 213 |
| A100 80GB | 2,039 GB/s | 138 |
| H100 80GB | 3,350 GB/s | 144 |
| H200 141GB | 4,800 GB/s | ~180 |
Notice the RTX 5090 outperforms the A100 despite costing roughly a seventh as much. For single-GPU inference, bandwidth matters more than datacenter pedigree.
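The bandwidth-bound argument yields a useful rule of thumb: since each generated token must stream the full weights from VRAM, single-stream decode speed is capped at bandwidth divided by model size. A sketch, where the ~4.5GB figure for an 8B Q4 model is an approximation:

```python
def decode_ceiling_tok_s(bandwidth_gb_s, model_size_gb):
    """Theoretical single-stream decode ceiling: every generated token
    must stream the full set of weights from VRAM once."""
    return bandwidth_gb_s / model_size_gb

# Llama 3 8B at Q4 is ~4.5GB of weights (approximate)
for name, bw in [("RTX 4090", 1008), ("RTX 5090", 1792), ("H100", 3350)]:
    print(f"{name}: ceiling ~{decode_ceiling_tok_s(bw, 4.5):.0f} tok/s")
```

The table's consumer-GPU numbers land at roughly 50-60% of this ceiling. The H100's 144 tok/s is far below its ~744 tok/s ceiling because single-stream decoding of a small model leaves a datacenter GPU mostly idle; its advantage shows up at high batch sizes.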
Tensor Cores and Precision
Modern GPUs have specialized tensor cores for matrix operations. For LLM inference:
- FP16/BF16: Default precision, good performance
- FP8: 2x throughput vs FP16, minor quality loss (H100+, Blackwell)
- INT8: Fast inference with proper calibration
- INT4: Fastest quantized inference
Newer architectures (Hopper, Blackwell) have transformer-specific optimizations. But for inference, bandwidth usually matters more than these architectural features.
Consumer GPUs: Best Value for Local LLM
Consumer GPUs offer the best price-to-performance for personal LLM use.
RTX 5090 (32GB) - The New Champion
Specs:
- VRAM: 32GB GDDR7
- Bandwidth: 1,792 GB/s
- CUDA Cores: 21,760
- TDP: 575W
- Price: $1,999 MSRP (street: $2,000-2,500)
The RTX 5090 is a game-changer for local LLM inference. Its 32GB VRAM handles 70B Q4 models on a single card. Its GDDR7 memory delivers 1.8 TB/s bandwidth, 77% faster than the 4090.
Benchmark highlights:
- 213 tok/s on Llama 3 8B Q4
- 61 tok/s on 32B models
- Outperforms A100 80GB on small-to-medium models
- 2.6x faster than A100 on Qwen 7B (RunPod benchmarks)
Best for: Enthusiasts, developers, researchers who need to run 70B models locally. The single best consumer GPU for LLMs in 2026.
Limitations: 575W TDP requires robust PSU and cooling. Street prices above MSRP due to demand. No ECC memory.
RTX 4090 (24GB) - Still Excellent
Specs:
- VRAM: 24GB GDDR6X
- Bandwidth: 1,008 GB/s
- CUDA Cores: 16,384
- TDP: 450W
- Price: $1,600-1,800 (new), $1,200-1,400 (used)
The 4090 remains an outstanding LLM GPU. Its 24GB fits most 7B-32B models. Performance is excellent for its price.
Benchmark highlights:
- 128 tok/s on Llama 3 8B Q4
- Mature ecosystem, excellent driver support
- Widely available and well-understood
Best for: Budget-conscious developers, those who don't need 70B models, gamers who also do ML work.
Limitations: 24GB VRAM limits model size. Can't run 70B without heavy quantization and reduced context.
RTX 3090 (24GB) - Budget King
Specs:
- VRAM: 24GB GDDR6X
- Bandwidth: 936 GB/s
- CUDA Cores: 10,496
- TDP: 350W
- Price: $700-900 (used)
Used 3090s offer exceptional value. Same 24GB VRAM as 4090 at half the price. Performance is ~30% slower but still very capable.
Best for: Budget builds, multi-GPU setups, hobbyists.
RTX 4060 Ti 16GB - Entry Point
Specs:
- VRAM: 16GB GDDR6
- Bandwidth: 288 GB/s
- Price: $450-500
The 16GB 4060 Ti is the cheapest path to running 7B-13B models locally. Performance is modest but adequate for development and experimentation.
Best for: Students, hobbyists, development/testing.
Intel Arc B580 - Budget Experimentation
Specs:
- VRAM: 12GB GDDR6
- Price: $249
Intel's Arc GPUs now support LLM inference through IPEX-LLM. The B580 offers 12GB VRAM at $249, enough for 7B Q4 models. Software support is maturing but still behind CUDA.
Best for: Extreme budget builds, experimentation.
Consumer GPU Summary
| GPU | VRAM | Bandwidth | Price | tok/s (8B Q4) | Best For |
|---|---|---|---|---|---|
| RTX 5090 | 32GB | 1,792 GB/s | $2,000 | 213 | 70B models, serious work |
| RTX 4090 | 24GB | 1,008 GB/s | $1,600 | 128 | Most users, great value |
| RTX 3090 | 24GB | 936 GB/s | $800 | 85 | Budget, multi-GPU |
| RTX 4060 Ti 16GB | 16GB | 288 GB/s | $450 | 45 | Entry level |
| Arc B580 | 12GB | 456 GB/s | $249 | ~30 | Experimentation |
Professional/Workstation GPUs
Workstation GPUs sit between consumer and datacenter. They offer enterprise features such as ECC memory, certified drivers, and (on older Ampere cards) NVLink, but at substantial price premiums.
RTX 6000 Ada (48GB)
Specs:
- VRAM: 48GB GDDR6 ECC
- Bandwidth: 960 GB/s
- CUDA Cores: 18,176
- TDP: 300W
- Price: $6,800
The RTX 6000 Ada provides 48GB VRAM with ECC memory. Its Ada architecture delivers strong performance.
Benchmark highlights:
- ~130 tok/s on Llama 3 8B
- 48GB fits 70B Q4 models with generous context
- Multi-GPU over PCIe only (the Ada generation dropped NVLink)
Best for: Professional workstations needing ECC and certified drivers. Video production with AI features.
Value assessment: For pure LLM work, the RTX 5090 offers better performance at 1/3 the price. The 6000 Ada's value comes from its professional features and 48GB VRAM.
RTX A6000 (48GB)
Specs:
- VRAM: 48GB GDDR6 ECC
- Bandwidth: 768 GB/s
- Architecture: Ampere (older)
- TDP: 300W
- Price: $4,500
The A6000 is the previous-generation workstation flagship. Its 48GB VRAM remains valuable, but Ampere architecture is slower than Ada.
Best for: Budget professional workstations, used market deals.
Value assessment: At current prices, the RTX 6000 Ada is worth the premium. Used A6000s around $2,500 can be good value.
L40S (48GB)
Specs:
- VRAM: 48GB GDDR6 ECC
- Bandwidth: 864 GB/s
- Architecture: Ada Lovelace
- TDP: 350W
- Price: $8,000-10,000
NVIDIA positions the L40S as a "universal datacenter GPU" for AI, graphics, and video. It's essentially a datacenter-packaged version of Ada architecture.
Benchmark highlights:
- ~114 tok/s on 8B models
- Optimized for inference workloads
- vGPU support for multi-tenant deployments (no MIG)
Best for: Datacenter deployments needing mixed AI/graphics/video workloads.
Value assessment: Caught between consumer 5090 (faster, cheaper) and H100 (more memory, faster). Hard to recommend for pure LLM work.
Professional GPU Summary
| GPU | VRAM | Bandwidth | Price | Notes |
|---|---|---|---|---|
| RTX 6000 Ada | 48GB | 960 GB/s | $6,800 | ECC, pro drivers |
| RTX A6000 | 48GB | 768 GB/s | $4,500 | Older, still capable |
| L40S | 48GB | 864 GB/s | $8,000+ | Datacenter universal GPU |
Datacenter GPUs: Maximum Performance
Datacenter GPUs are purpose-built for AI at scale. They're not cost-effective for individuals, but essential for production deployments.
A100 (80GB) - The Established Workhorse
Specs:
- VRAM: 80GB HBM2e
- Bandwidth: 2,039 GB/s
- Architecture: Ampere
- TDP: 300W (PCIe), 400W (SXM)
- Price: ~$15,000 (used), $1.79/hr cloud
The A100 dominated AI infrastructure from 2020-2023. It remains widely deployed and available.
Performance:
- 138 tok/s on 8B models (vLLM)
- 80GB fits 70B Q8 models; FP16 still needs two cards
- NVLink scales to 8-GPU systems
Best for: Organizations with existing Ampere infrastructure. Budget-conscious datacenter deployments.
Value assessment: Cloud rental at $1.79/hr makes more sense than purchasing for most use cases. The H100 offers 4x better performance at ~2x the cost.
H100 (80GB) - Current Production Standard
Specs:
- VRAM: 80GB HBM3
- Bandwidth: 3,350 GB/s
- Architecture: Hopper
- TDP: 700W
- Price: $25,000-40,000 (purchase), $2-4/hr (cloud)
The H100 is the current standard for production AI inference. Its Transformer Engine and FP8 support deliver substantial speedups.
Performance:
- 144 tok/s on 8B models
- 984 tok/s throughput on 70B (vLLM, high batch)
- 4x training performance vs A100
Cloud pricing (2026):
- Hyperscalers (AWS, GCP, Azure): $3-5/hr
- Specialized providers (Jarvislabs, RunPod): $2-3/hr
- Spot/preemptible: $1.50-2.50/hr
Best for: Production inference at scale. Training large models.
H200 (141GB) - Memory King
Specs:
- VRAM: 141GB HBM3e
- Bandwidth: 4,800 GB/s
- Architecture: Hopper
- TDP: 700W
- Price: $30,000-40,000 (purchase), $3.70-5/hr (cloud)
The H200 upgrades H100's memory from 80GB to 141GB while boosting bandwidth 43%. Same compute, more memory.
Performance:
- ~180 tok/s on 8B models
- Fits Llama 70B FP16 on single GPU (H100 requires 2)
- 1.9x inference improvement on memory-bound workloads
Best for: Large models where memory is the bottleneck. Long-context applications.
Value assessment: 20% price premium over H100 for 76% more memory. Excellent value for memory-constrained workloads.
B200 (192GB) - Next Generation
Specs:
- VRAM: 192GB HBM3e
- Bandwidth: 8,000 GB/s
- Architecture: Blackwell
- TDP: 1000W
- Price: Not yet widely available
NVIDIA's Blackwell architecture brings fifth-generation tensor cores, FP4 support, and massive memory improvements.
Performance claims:
- 15x inference improvement vs H100 (NVIDIA benchmarks)
- 450 tok/s on 8B models
- 192GB fits even larger models
Availability: Limited in early 2026. Sold out through mid-2026 according to reports.
Best for: Frontier model serving. Extreme-scale deployments. Wait for availability.
Datacenter GPU Summary
| GPU | VRAM | Bandwidth | Purchase | Cloud/hr | Use Case |
|---|---|---|---|---|---|
| A100 80GB | 80GB | 2,039 GB/s | ~$15K | $1.79 | Budget datacenter |
| H100 80GB | 80GB | 3,350 GB/s | $25-40K | $2-4 | Production standard |
| H200 141GB | 141GB | 4,800 GB/s | $30-40K | $3.70-5 | Large model serving |
| B200 192GB | 192GB | 8,000 GB/s | TBD | TBD | Next-gen (limited) |
Apple Silicon: The Unified Memory Advantage
Apple Silicon takes a different approach: unified memory shared between CPU and GPU.
Why Apple Silicon Works for LLMs
Traditional GPUs have separate VRAM. Model weights must fit in VRAM. A 70B FP16 model needs 140GB, exceeding any single consumer GPU.
Apple's unified memory architecture lets the GPU access all system RAM. A Mac Studio with 192GB can load a 70B FP16 model that no consumer NVIDIA GPU can touch.
Tradeoffs:
Pros:
- More memory than any consumer GPU
- Extremely power efficient (40-80W vs 450W)
- Silent operation
Cons:
- Slower tokens per second than equivalent NVIDIA
- Less software ecosystem support
- High upfront cost, and memory is fixed at purchase
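The efficiency claim can be sanity-checked against this article's own benchmark and TDP figures. A rough comparison (the Apple tok/s values are midpoints of ranges quoted in the sections that follow, so treat the ratios as approximate):

```python
# Perf-per-watt from this article's benchmark and TDP figures (approximate;
# Apple tok/s values are midpoints of quoted ranges)
systems = {
    "RTX 5090": (213, 575),   # (tok/s on 8B Q4, TDP in watts)
    "RTX 4090": (128, 450),
    "M4 Max":   (98, 100),
    "M3 Ultra": (80, 200),
}
for name, (tok_s, watts) in systems.items():
    print(f"{name}: {tok_s / watts:.2f} tok/s per watt")
```

By this rough measure the M4 Max delivers about 2.6x the tokens per watt of the RTX 5090, while the M3 Ultra roughly matches it.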
M4 Max (128GB)
Specs:
- Unified Memory: Up to 128GB
- Bandwidth: 546 GB/s
- Neural Engine: 38 TOPS
- TDP: ~100W
- Price: $3,999 (64GB), $4,999 (128GB) MacBook Pro
Performance:
- ~96-100 tok/s on 8B Q4 (projected)
- ~25-30 tok/s on 70B Q4
- Can run models that don't fit on any consumer NVIDIA GPU
Best for: Developers wanting portable LLM capability. Silent home setups.
M3 Ultra (512GB)
Specs:
- Unified Memory: Up to 512GB
- Bandwidth: 819 GB/s
- TDP: ~200W
- Price: $4,999 (96GB) to $9,499 (512GB) Mac Studio
Performance:
- 76-84 tok/s on 8B Q4
- ~17-18 tok/s on 671B parameter models (DeepSeek R1)
- Can run virtually any model with quantization
Best for: Research requiring very large models. Teams preferring macOS. Silent operation requirements.
Value assessment: The 512GB configuration at $9,499 can run models impossible on consumer NVIDIA hardware. But for models that fit in 32GB, an RTX 5090 is roughly 3x faster at about a fifth of the cost.
Apple vs NVIDIA: When to Choose Each
Choose Apple Silicon when:
- Model size exceeds 32GB VRAM
- Silent operation is required
- Power efficiency matters
- You're already in the Apple ecosystem
- Portability is important (MacBook Pro)
Choose NVIDIA when:
- Maximum tokens per second matters
- Model fits in available VRAM
- You need CUDA ecosystem
- Cloud deployment is the goal
- Budget is constrained
Multi-GPU Considerations
Running multiple GPUs introduces complexity but enables larger models and higher throughput.
Consumer Multi-GPU
Multiple consumer GPUs communicate over PCIe, which is slow (~32 GB/s for PCIe 4.0 x16) compared to NVLink (~900 GB/s on H100).
What works:
- Running different models on different GPUs (no communication needed)
- Pipeline parallelism for very large models
- Batch serving with one model per GPU
What doesn't work well:
- Tensor parallelism (requires high-bandwidth interconnect)
- Training (gradient synchronization is slow)
Practical guidance:
- 2x RTX 4090 (48GB total): 70B Q4 (~35GB) with comfortable headroom
- 4x RTX 4090 (96GB total): 70B Q8 (~70GB) with pipeline parallelism
- 8x RTX 4090 servers exist (192GB) but PCIe bandwidth limits scaling
A dual RTX 5090 setup (64GB total) often outperforms a single H100 for models that fit.
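A quick way to check whether a model fits a given multi-GPU setup under pipeline parallelism: each card holds roughly 1/N of the weights plus its own KV cache slice and runtime overhead. A sketch, where the 3GB per-GPU overhead is a rough assumption that varies with context length and framework:

```python
def fits_pipeline(total_weights_gb, n_gpus, vram_per_gpu_gb,
                  per_gpu_overhead_gb=3.0):
    """Pipeline parallelism puts ~1/N of the layers on each GPU; every
    card also needs room for its KV cache slice and runtime overhead
    (the 3GB default is a rough, workload-dependent assumption)."""
    per_gpu_gb = total_weights_gb / n_gpus + per_gpu_overhead_gb
    return per_gpu_gb <= vram_per_gpu_gb

print(fits_pipeline(70, 4, 24))  # 70B Q8 on 4x 3090: 17.5 + 3 = 20.5GB/card -> True
print(fits_pipeline(70, 2, 24))  # 70B Q8 on 2x 4090: 35 + 3 = 38GB/card -> False
```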
Datacenter Multi-GPU
NVLink enables efficient tensor parallelism. H100/H200 systems scale to 8 GPUs with 900 GB/s interconnect.
For multi-node scaling, NVLink Switch and InfiniBand provide high-bandwidth connectivity across servers.
Cloud vs Buy: The Economic Analysis
When Cloud Makes Sense
Utilization matters most. If you're using GPUs <40% of the time, cloud rental beats ownership.
Break-even analysis for H100:
- Purchase: $30,000 plus roughly $5,000/year for power, cooling, and hosting ≈ $45,000 over 3 years
- Cloud at $2.50/hr: $21,900/year at 24/7 usage = $65,700 over 3 years
- Break-even: ~70% utilization ($45,000 ÷ $65,700 ≈ 0.68)
Most users don't run 24/7. Intermittent usage strongly favors cloud.
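The break-even arithmetic is easy to rerun with your own numbers. A sketch using illustrative figures (the ~$5,000/year operating cost for power and hosting is an assumption; plug in your own):

```python
def breakeven_utilization(purchase_usd, operating_usd_per_year,
                          cloud_usd_per_hr, years=3):
    """Utilization above which owning beats renting: ownership cost is
    fixed, while cloud cost scales with hours actually used."""
    own_total = purchase_usd + operating_usd_per_year * years
    cloud_24_7 = cloud_usd_per_hr * 24 * 365 * years
    return own_total / cloud_24_7

u = breakeven_utilization(30_000, 5_000, 2.50)
print(f"break-even at ~{u:.0%} utilization")  # -> ~68%
```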
Cloud advantages:
- No upfront capital
- Access to latest hardware
- Geographic flexibility
- Elastic scaling
Cloud disadvantages:
- Higher cost at high utilization
- Data transfer costs
- Vendor lock-in risk
- Less control
Cloud Pricing (2026)
| GPU | Hyperscaler | Specialized | Spot |
|---|---|---|---|
| RTX 4090 | - | $0.40-0.65 | $0.25 |
| A100 80GB | $3.67 | $1.79 | $0.80 |
| H100 80GB | $3.50-5 | $2-3 | $1.50 |
| H200 141GB | $5-10 | $3.70-4.30 | $2.50 |
Specialized providers (RunPod, Vast.ai, Jarvislabs, Lambda) typically offer 40-60% lower prices than AWS/GCP/Azure.
When Buying Makes Sense
Buy consumer GPUs when:
- You'll use them daily
- Privacy/compliance requires local data
- Long-term cost matters more than capital efficiency
- You also use them for gaming/other workloads
Buy datacenter GPUs when:
- Utilization exceeds 70% sustained
- You have infrastructure expertise
- Compliance requires on-premises
- Multi-year budget is available
Buying Recommendations by Use Case
Hobbyist/Learning ($500-1,000)
Recommendation: Used RTX 3090 ($700-900)
24GB VRAM runs most useful models. Excellent community support. Can be upgraded later.
Alternative: RTX 4060 Ti 16GB ($450) for tighter budgets, but 16GB limits model options.
Serious Developer ($1,500-2,500)
Recommendation: RTX 5090 ($2,000)
The clear winner for local LLM work. 32GB handles 70B models. Performance rivals datacenter GPUs.
Alternative: RTX 4090 ($1,600) if 24GB is sufficient for your models.
Professional Workstation ($5,000-10,000)
Recommendation: RTX 5090 + cloud credits
Unless you need ECC memory or certified drivers, the 5090 outperforms professional cards. Use cloud for larger models.
Alternative: RTX 6000 Ada ($6,800) if professional features are required.
Small Team/Startup
Recommendation: Cloud-first approach
Start with H100/H200 cloud instances. Measure actual usage. Consider purchasing only after establishing utilization patterns.
Providers: RunPod, Lambda, Jarvislabs for cost efficiency.
Enterprise Production
Recommendation: H200 cloud or purchase depending on scale
- Below ~70% utilization: cloud with reserved instances
- Above ~70% utilization: H200 purchase with an operational team
The H200's 141GB VRAM simplifies deployment for large models that would require 2x H100.
Maximum Local Memory
Recommendation: Mac Studio M3 Ultra 512GB ($9,499)
When you need to run models that don't fit anywhere else, Apple's unified memory is the only consumer option.
Budget Multi-GPU
Recommendation: 2-4x used RTX 3090 ($1,400-3,600)
Used 3090s offer excellent value for multi-GPU setups. 96GB total VRAM across 4 cards handles most workloads.
The Real Bottleneck: When Hardware Isn't the Answer
Sometimes the right GPU decision is recognizing you don't need to make one.
API providers handle infrastructure entirely. You pay per token, scale instantly, and never worry about GPU availability.
For many applications, the complexity of managing GPU infrastructure exceeds its value. Teams spend weeks optimizing CUDA environments when they could be building products.
Prem sits between DIY infrastructure and pure API providers. The platform handles fine-tuning, evaluation, and deployment without requiring you to manage GPUs directly. For organizations with data sovereignty requirements, deployment options include your own AWS VPC or on-premise infrastructure.
The decision isn't just "which GPU" but "should I manage GPUs at all."
FAQ
What's the minimum VRAM for running local LLMs?
8GB runs 7B models with Q4 quantization. 16GB is comfortable for 7B-13B. 24GB handles most practical use cases. More is always better for flexibility.
Is the RTX 5090 worth it over the 4090?
For LLM inference, yes. The 32GB VRAM (vs 24GB) and 77% more bandwidth translate to significantly better performance. If you're buying new in 2026, the 5090 is the clear choice.
Should I buy an H100 or use cloud?
Cloud unless you have >70% sustained utilization and infrastructure expertise. H100 cloud instances cost $2-4/hr. Purchasing makes sense only at enterprise scale.
Can Apple Silicon compete with NVIDIA for LLMs?
In tokens per second, no. A 5090 is 2-3x faster than M3 Ultra for models that fit. But Apple Silicon runs models that don't fit on any consumer NVIDIA GPU, and it does so silently at a fraction of the power.
What about AMD GPUs?
AMD's ROCm has improved significantly, and MI300X competes with H100 at datacenter scale. Consumer AMD GPUs (RX 7900 XTX) have decent software support through ROCm but trail NVIDIA in ecosystem maturity. For most users, NVIDIA remains the safer choice.
How many RTX 4090s equal an H100?
Roughly 2-4 depending on workload. For inference with models that fit in 24GB, two 4090s often match or exceed H100 performance. For larger models or training, H100's NVLink and larger memory provide advantages that consumer cards can't match.
When will GPU prices drop?
Consumer GPUs follow predictable cycles: prices stabilize 6-12 months after launch. Datacenter GPU pricing has dropped 40-60% from 2023 peaks as supply caught up with demand. H100 cloud pricing should continue declining through 2026 as B200 becomes available.