GPU Buying Guide for LLMs: RTX 5090 vs H100 vs H200 Complete Comparison (2026)
Which GPU should you buy for running LLMs? From $250 budget cards to $40K datacenter GPUs. Covers VRAM needs, tokens per second benchmarks, and total cost of ownership.
Choosing a GPU for LLMs is fundamentally different from choosing one for gaming or rendering.
Raw compute matters less than you'd expect. Memory bandwidth and VRAM capacity matter more. A $2,000 consumer card can outperform a $10,000 workstation GPU for inference. And sometimes the best answer isn't buying hardware at all.
This guide covers every tier: budget consumer cards, high-end gaming GPUs, professional workstation options, datacenter accelerators, and Apple Silicon. We'll look at what actually determines LLM performance, which models fit on which GPUs, real benchmark numbers, and total cost of ownership including cloud alternatives.
What Actually Determines LLM Performance
Before comparing specific GPUs, understand what drives LLM inference speed.
VRAM: The Hard Constraint
VRAM determines what models you can run. Period.
A 7B parameter model in FP16 needs approximately 14GB of VRAM. A 70B model needs around 140GB. No amount of compute power helps if you can't fit the model in memory.
Quantization changes these requirements dramatically:
| Model Size | FP16 | INT8 | INT4 (Q4) |
|---|---|---|---|
| 7B | 14GB | 7GB | 4GB |
| 13B | 26GB | 13GB | 7GB |
| 32B | 64GB | 32GB | 18GB |
| 70B | 140GB | 70GB | 35GB |
| 405B | 810GB | 405GB | 203GB |
These are minimums for model weights only. Add 2-6GB for KV cache, CUDA overhead, and context. Longer context windows require more KV cache memory, scaling linearly with sequence length.
Practical VRAM guidelines:
- 8GB: 7B models quantized to Q4, limited context
- 12-16GB: 7B-13B models comfortably, some 32B quantized
- 24GB: Most 7B-32B models, 70B heavily quantized
- 32GB: 70B at Q4 (a tight fit; Q3 variants leave headroom)
- 48GB: 70B at Q4 with generous context, or Q5/Q6
- 80GB: 70B at Q8 comfortably; FP16 still needs two GPUs
- 141GB+: 70B FP16 on a single GPU; the very largest models still need quantization or multiple GPUs
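These guidelines reduce to a back-of-the-envelope formula: weights take parameters times bytes per parameter, the KV cache grows linearly with context length, and a couple of GB go to runtime overhead. A minimal sketch, where the architecture defaults (layer count, KV head count, head dimension) are illustrative assumptions, not exact figures for any specific model:

```python
def estimate_vram_gb(params_b, bytes_per_param, context_len=4096,
                     n_layers=32, n_kv_heads=8, head_dim=128, overhead_gb=2.0):
    """Rough VRAM estimate: weights + KV cache + fixed runtime overhead.
    Architecture defaults are illustrative (roughly Llama-3-8B-like),
    not exact for any particular model."""
    weights_gb = params_b * bytes_per_param  # params in billions -> GB
    # KV cache per token: 2 tensors (K and V) * layers * KV heads * head dim * 2 bytes (FP16)
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2
    kv_gb = context_len * kv_bytes_per_token / 1e9
    return weights_gb + kv_gb + overhead_gb

print(round(estimate_vram_gb(7, 2.0), 1))                # 7B FP16 -> 16.5
print(round(estimate_vram_gb(70, 0.5, n_layers=80), 1))  # 70B Q4  -> 38.3
```

Swapping in 0.5 bytes per parameter for Q4 or 1.0 for INT8 reproduces the table above to within the stated 2-6GB overhead band.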
Memory Bandwidth: The Speed Determinant
For single-stream token generation, LLM inference is memory-bandwidth bound, not compute bound.
During token generation, the GPU reads model weights from memory for every single token. Generation speed is limited by how fast data moves from VRAM to compute units.
Memory bandwidth comparison:
| GPU | Bandwidth | Approx. tok/s (8B Q4) |
|---|---|---|
| RTX 3090 | 936 GB/s | 85 |
| RTX 4090 | 1,008 GB/s | 128 |
| RTX 5090 | 1,792 GB/s | 213 |
| A100 80GB | 2,039 GB/s | 138 |
| H100 80GB | 3,350 GB/s | 144 |
| H200 141GB | 4,800 GB/s | ~180 |
Notice the RTX 5090 outperforms the A100 despite costing roughly a seventh as much. For single-GPU inference, bandwidth matters more than datacenter pedigree.
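The bandwidth-bound argument yields a useful rule of thumb: since each generated token must stream the full weights from VRAM, single-stream decode speed is capped at bandwidth divided by model size. A sketch, where the ~4.5GB figure for an 8B Q4 model is an approximation:

```python
def decode_ceiling_tok_s(bandwidth_gb_s, model_size_gb):
    """Theoretical single-stream decode ceiling: every generated token
    must stream the full set of weights from VRAM once."""
    return bandwidth_gb_s / model_size_gb

# Llama 3 8B at Q4 is ~4.5GB of weights (approximate)
for name, bw in [("RTX 4090", 1008), ("RTX 5090", 1792), ("H100", 3350)]:
    print(f"{name}: ceiling ~{decode_ceiling_tok_s(bw, 4.5):.0f} tok/s")
```

The table's consumer-GPU numbers land at roughly 50-60% of this ceiling. The H100's 144 tok/s is far below its ~744 tok/s ceiling because single-stream decoding of a small model leaves a datacenter GPU mostly idle; its advantage shows up at high batch sizes.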
Tensor Cores and Precision
Modern GPUs have specialized tensor cores for matrix operations. For LLM inference:
- FP16/BF16: Default precision, good performance
- FP8: 2x throughput vs FP16, minor quality loss (H100+, Blackwell)
- INT8: Fast inference with proper calibration
- INT4: Fastest quantized inference
Newer architectures (Hopper, Blackwell) have transformer-specific optimizations. But for inference, bandwidth usually matters more than these architectural features.
Consumer GPUs: Best Value for Local LLM
Consumer GPUs offer the best price-to-performance for personal LLM use.
RTX 5090 (32GB) - The New Champion
Specs:
- VRAM: 32GB GDDR7
- Bandwidth: 1,792 GB/s
- CUDA Cores: 21,760
- TDP: 575W
- Price: $1,999 MSRP (street: $2,000-2,500)
The RTX 5090 is a game-changer for local LLM inference. Its 32GB VRAM handles 70B Q4 models on a single card. Its GDDR7 memory delivers 1.8 TB/s bandwidth, 77% faster than the 4090.
Benchmark highlights:
- 213 tok/s on Llama 3 8B Q4
- 61 tok/s on 32B models
- Outperforms A100 80GB on small-to-medium models
- 2.6x faster than A100 on Qwen 7B (RunPod benchmarks)
Best for: Enthusiasts, developers, researchers who need to run 70B models locally. The single best consumer GPU for LLMs in 2026.
Limitations: 575W TDP requires robust PSU and cooling. Street prices above MSRP due to demand. No ECC memory.
RTX 4090 (24GB) - Still Excellent
Specs:
- VRAM: 24GB GDDR6X
- Bandwidth: 1,008 GB/s
- CUDA Cores: 16,384
- TDP: 450W
- Price: $1,600-1,800 (new), $1,200-1,400 (used)
The 4090 remains an outstanding LLM GPU. Its 24GB fits most 7B-32B models. Performance is excellent for its price.
Benchmark highlights:
- 128 tok/s on Llama 3 8B Q4
- Mature ecosystem, excellent driver support
- Widely available and well-understood
Best for: Budget-conscious developers, those who don't need 70B models, gamers who also do ML work.
Limitations: 24GB VRAM limits model size. Can't run 70B without heavy quantization and reduced context.
RTX 3090 (24GB) - Budget King
Specs:
- VRAM: 24GB GDDR6X
- Bandwidth: 936 GB/s
- CUDA Cores: 10,496
- TDP: 350W
- Price: $700-900 (used)
Used 3090s offer exceptional value. Same 24GB VRAM as 4090 at half the price. Performance is ~30% slower but still very capable.
Best for: Budget builds, multi-GPU setups, hobbyists.
RTX 4060 Ti 16GB - Entry Point
Specs:
- VRAM: 16GB GDDR6
- Bandwidth: 288 GB/s
- Price: $450-500
The 16GB 4060 Ti is the cheapest path to running 7B-13B models locally. Performance is modest but adequate for development and experimentation.
Best for: Students, hobbyists, development/testing.
Intel Arc B580 - Budget Experimentation
Specs:
- VRAM: 12GB GDDR6
- Price: $249
Intel's Arc GPUs now support LLM inference through IPEX-LLM. The B580 offers 12GB VRAM at $249, enough for 7B Q4 models. Software support is maturing but still behind CUDA.
Best for: Extreme budget builds, experimentation.
Consumer GPU Summary
| GPU | VRAM | Bandwidth | Price | tok/s (8B Q4) | Best For |
|---|---|---|---|---|---|
| RTX 5090 | 32GB | 1,792 GB/s | $2,000 | 213 | 70B models, serious work |
| RTX 4090 | 24GB | 1,008 GB/s | $1,600 | 128 | Most users, great value |
| RTX 3090 | 24GB | 936 GB/s | $800 | 85 | Budget, multi-GPU |
| RTX 4060 Ti 16GB | 16GB | 288 GB/s | $450 | 45 | Entry level |
| Arc B580 | 12GB | 456 GB/s | $249 | ~30 | Experimentation |
Professional/Workstation GPUs
Workstation GPUs sit between consumer and datacenter. They offer enterprise features such as ECC memory, certified drivers, and (on older Ampere cards) NVLink, but at substantial price premiums.
RTX 6000 Ada (48GB)
Specs:
- VRAM: 48GB GDDR6 ECC
- Bandwidth: 960 GB/s
- CUDA Cores: 18,176
- TDP: 300W
- Price: $6,800
The RTX 6000 Ada provides 48GB VRAM with ECC memory. Its Ada architecture delivers strong performance.
Benchmark highlights:
- ~130 tok/s on Llama 3 8B
- 48GB fits 70B Q4 models with generous context
- Multi-GPU over PCIe only (the Ada generation dropped NVLink)
Best for: Professional workstations needing ECC and certified drivers. Video production with AI features.
Value assessment: For pure LLM work, the RTX 5090 offers better performance at 1/3 the price. The 6000 Ada's value comes from its professional features and 48GB VRAM.
RTX A6000 (48GB)
Specs:
- VRAM: 48GB GDDR6 ECC
- Bandwidth: 768 GB/s
- Architecture: Ampere (older)
- TDP: 300W
- Price: $4,500
The A6000 is the previous-generation workstation flagship. Its 48GB VRAM remains valuable, but Ampere architecture is slower than Ada.
Best for: Budget professional workstations, used market deals.
Value assessment: At current prices, the RTX 6000 Ada is worth the premium. Used A6000s around $2,500 can be good value.
L40S (48GB)
Specs:
- VRAM: 48GB GDDR6 ECC
- Bandwidth: 864 GB/s
- Architecture: Ada Lovelace
- TDP: 350W
- Price: $8,000-10,000
NVIDIA positions the L40S as a "universal datacenter GPU" for AI, graphics, and video. It's essentially a datacenter-packaged version of Ada architecture.
Benchmark highlights:
- ~114 tok/s on 8B models
- Optimized for inference workloads
- vGPU support for multi-tenant deployments (no MIG)
Best for: Datacenter deployments needing mixed AI/graphics/video workloads.
Value assessment: Caught between consumer 5090 (faster, cheaper) and H100 (more memory, faster). Hard to recommend for pure LLM work.
Professional GPU Summary
| GPU | VRAM | Bandwidth | Price | Notes |
|---|---|---|---|---|
| RTX 6000 Ada | 48GB | 960 GB/s | $6,800 | ECC, pro drivers |
| RTX A6000 | 48GB | 768 GB/s | $4,500 | Older, still capable |
| L40S | 48GB | 864 GB/s | $8,000+ | Datacenter universal GPU |
Datacenter GPUs: Maximum Performance
Datacenter GPUs are purpose-built for AI at scale. They're not cost-effective for individuals, but essential for production deployments.
A100 (80GB) - The Established Workhorse
Specs:
- VRAM: 80GB HBM2e
- Bandwidth: 2,039 GB/s
- Architecture: Ampere
- TDP: 300W (PCIe), 400W (SXM)
- Price: ~$15,000 (used), $1.79/hr cloud
The A100 dominated AI infrastructure from 2020-2023. It remains widely deployed and available.
Performance:
- 138 tok/s on 8B models (vLLM)
- 80GB fits 70B Q8 models; FP16 still needs two cards
- NVLink scales to 8-GPU systems
Best for: Organizations with existing Ampere infrastructure. Budget-conscious datacenter deployments.
Value assessment: Cloud rental at $1.79/hr makes more sense than purchasing for most use cases. The H100 offers 4x better performance at ~2x the cost.
H100 (80GB) - Current Production Standard
Specs:
- VRAM: 80GB HBM3
- Bandwidth: 3,350 GB/s
- Architecture: Hopper
- TDP: 700W
- Price: $25,000-40,000 (purchase), $2-4/hr (cloud)
The H100 is the current standard for production AI inference. Its Transformer Engine and FP8 support deliver substantial speedups.
Performance:
- 144 tok/s on 8B models
- 984 tok/s throughput on 70B (vLLM, high batch)
- 4x training performance vs A100
Cloud pricing (2026):
- Hyperscalers (AWS, GCP, Azure): $3-5/hr
- Specialized providers (Jarvislabs, RunPod): $2-3/hr
- Spot/preemptible: $1.50-2.50/hr
Best for: Production inference at scale. Training large models.
H200 (141GB) - Memory King
Specs:
- VRAM: 141GB HBM3e
- Bandwidth: 4,800 GB/s
- Architecture: Hopper
- TDP: 700W
- Price: $30,000-40,000 (purchase), $3.70-5/hr (cloud)
The H200 upgrades H100's memory from 80GB to 141GB while boosting bandwidth 43%. Same compute, more memory.
Performance:
- ~180 tok/s on 8B models
- Fits Llama 70B FP16 on single GPU (H100 requires 2)
- 1.9x inference improvement on memory-bound workloads
Best for: Large models where memory is the bottleneck. Long-context applications.
Value assessment: 20% price premium over H100 for 76% more memory. Excellent value for memory-constrained workloads.
B200 (192GB) - Next Generation
Specs:
- VRAM: 192GB HBM3e
- Bandwidth: 8,000 GB/s
- Architecture: Blackwell
- TDP: 1000W
- Price: Not yet widely available
NVIDIA's Blackwell architecture brings fifth-generation tensor cores, FP4 support, and massive memory improvements.
Performance claims:
- 15x inference improvement vs H100 (NVIDIA benchmarks)
- 450 tok/s on 8B models
- 192GB fits even larger models
Availability: Limited in early 2026. Sold out through mid-2026 according to reports.
Best for: Frontier model serving. Extreme-scale deployments. Wait for availability.
Datacenter GPU Summary
| GPU | VRAM | Bandwidth | Purchase | Cloud/hr | Use Case |
|---|---|---|---|---|---|
| A100 80GB | 80GB | 2,039 GB/s | ~$15K | $1.79 | Budget datacenter |
| H100 80GB | 80GB | 3,350 GB/s | $25-40K | $2-4 | Production standard |
| H200 141GB | 141GB | 4,800 GB/s | $30-40K | $3.70-5 | Large model serving |
| B200 192GB | 192GB | 8,000 GB/s | TBD | TBD | Next-gen (limited) |
Apple Silicon: The Unified Memory Advantage
Apple Silicon takes a different approach: unified memory shared between CPU and GPU.
Why Apple Silicon Works for LLMs
Traditional GPUs have separate VRAM. Model weights must fit in VRAM. A 70B FP16 model needs 140GB, exceeding any single consumer GPU.
Apple's unified memory architecture lets the GPU access all system RAM. A Mac Studio with 192GB can load a 70B FP16 model that no consumer NVIDIA GPU can touch.
Tradeoffs:
Pros:
- More memory than any consumer GPU
- Extremely power efficient (40-80W vs 450W)
- Silent operation
Cons:
- Slower tokens per second than equivalent NVIDIA
- Less software ecosystem support
- High upfront cost, and memory is fixed at purchase
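The efficiency claim can be sanity-checked against this article's own benchmark and TDP figures. A rough comparison (the Apple tok/s values are midpoints of ranges quoted in the sections that follow, so treat the ratios as approximate):

```python
# Perf-per-watt from this article's benchmark and TDP figures (approximate;
# Apple tok/s values are midpoints of quoted ranges)
systems = {
    "RTX 5090": (213, 575),   # (tok/s on 8B Q4, TDP in watts)
    "RTX 4090": (128, 450),
    "M4 Max":   (98, 100),
    "M3 Ultra": (80, 200),
}
for name, (tok_s, watts) in systems.items():
    print(f"{name}: {tok_s / watts:.2f} tok/s per watt")
```

By this rough measure the M4 Max delivers about 2.6x the tokens per watt of the RTX 5090, while the M3 Ultra roughly matches it.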
M4 Max (128GB)
Specs:
- Unified Memory: Up to 128GB
- Bandwidth: 546 GB/s
- Neural Engine: 38 TOPS
- TDP: ~100W
- Price: $3,999 (64GB), $4,999 (128GB) MacBook Pro
Performance:
- ~96-100 tok/s on 8B Q4 (projected)
- ~25-30 tok/s on 70B Q4
- Can run models that don't fit on any consumer NVIDIA GPU
Best for: Developers wanting portable LLM capability. Silent home setups.
M3 Ultra (512GB)
Specs:
- Unified Memory: Up to 512GB
- Bandwidth: 819 GB/s
- TDP: ~200W
- Price: $4,999 (96GB) to $9,499 (512GB) Mac Studio
Performance:
- 76-84 tok/s on 8B Q4
- ~17-18 tok/s on 671B parameter models (DeepSeek R1)
- Can run virtually any model with quantization
Best for: Research requiring very large models. Teams preferring macOS. Silent operation requirements.
Value assessment: The 512GB configuration at $9,499 can run models impossible on consumer NVIDIA hardware. But for models that fit in 32GB, an RTX 5090 is roughly 3x faster at about a fifth of the cost.
Apple vs NVIDIA: When to Choose Each
Choose Apple Silicon when:
- Model size exceeds 32GB VRAM
- Silent operation is required
- Power efficiency matters
- You're already in the Apple ecosystem
- Portability is important (MacBook Pro)
Choose NVIDIA when:
- Maximum tokens per second matters
- Model fits in available VRAM
- You need CUDA ecosystem
- Cloud deployment is the goal
- Budget is constrained
Multi-GPU Considerations
Running multiple GPUs introduces complexity but enables larger models and higher throughput.
Consumer Multi-GPU
Multiple consumer GPUs communicate over PCIe, which is slow (~32 GB/s for PCIe 4.0 x16) compared to NVLink (~900 GB/s on H100).
What works:
- Running different models on different GPUs (no communication needed)
- Pipeline parallelism for very large models
- Batch serving with one model per GPU
What doesn't work well:
- Tensor parallelism (requires high-bandwidth interconnect)
- Training (gradient synchronization is slow)
Practical guidance:
- 2x RTX 4090 (48GB total): 70B Q4 (~35GB) with comfortable headroom
- 4x RTX 4090 (96GB total): 70B Q8 (~70GB) with pipeline parallelism
- 8x RTX 4090 servers exist (192GB) but PCIe bandwidth limits scaling
A dual RTX 5090 setup (64GB total) often outperforms a single H100 for models that fit.
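A quick way to check whether a model fits a given multi-GPU setup under pipeline parallelism: each card holds roughly 1/N of the weights plus its own KV cache slice and runtime overhead. A sketch, where the 3GB per-GPU overhead is a rough assumption that varies with context length and framework:

```python
def fits_pipeline(total_weights_gb, n_gpus, vram_per_gpu_gb,
                  per_gpu_overhead_gb=3.0):
    """Pipeline parallelism puts ~1/N of the layers on each GPU; every
    card also needs room for its KV cache slice and runtime overhead
    (the 3GB default is a rough, workload-dependent assumption)."""
    per_gpu_gb = total_weights_gb / n_gpus + per_gpu_overhead_gb
    return per_gpu_gb <= vram_per_gpu_gb

print(fits_pipeline(70, 4, 24))  # 70B Q8 on 4x 3090: 17.5 + 3 = 20.5GB/card -> True
print(fits_pipeline(70, 2, 24))  # 70B Q8 on 2x 4090: 35 + 3 = 38GB/card -> False
```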
Datacenter Multi-GPU
NVLink enables efficient tensor parallelism. H100/H200 systems scale to 8 GPUs with 900 GB/s interconnect.
For multi-node scaling, NVLink Switch and InfiniBand provide high-bandwidth connectivity across servers.
Cloud vs Buy: The Economic Analysis
When Cloud Makes Sense
Utilization matters most. If you're using GPUs <40% of the time, cloud rental beats ownership.
Break-even analysis for H100:
- Purchase: $30,000 plus roughly $5,000/year for power, cooling, and hosting ≈ $45,000 over 3 years
- Cloud at $2.50/hr: $21,900/year at 24/7 usage = $65,700 over 3 years
- Break-even: ~70% utilization ($45,000 ÷ $65,700 ≈ 0.68)
Most users don't run 24/7. Intermittent usage strongly favors cloud.
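The break-even arithmetic is easy to rerun with your own numbers. A sketch using illustrative figures (the ~$5,000/year operating cost for power and hosting is an assumption; plug in your own):

```python
def breakeven_utilization(purchase_usd, operating_usd_per_year,
                          cloud_usd_per_hr, years=3):
    """Utilization above which owning beats renting: ownership cost is
    fixed, while cloud cost scales with hours actually used."""
    own_total = purchase_usd + operating_usd_per_year * years
    cloud_24_7 = cloud_usd_per_hr * 24 * 365 * years
    return own_total / cloud_24_7

u = breakeven_utilization(30_000, 5_000, 2.50)
print(f"break-even at ~{u:.0%} utilization")  # -> ~68%
```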
Cloud advantages:
- No upfront capital
- Access to latest hardware
- Geographic flexibility
- Elastic scaling
Cloud disadvantages:
- Higher cost at high utilization
- Data transfer costs
- Vendor lock-in risk
- Less control
Cloud Pricing (2026)
| GPU | Hyperscaler | Specialized | Spot |
|---|---|---|---|
| RTX 4090 | - | $0.40-0.65 | $0.25 |
| A100 80GB | $3.67 | $1.79 | $0.80 |
| H100 80GB | $3.50-5 | $2-3 | $1.50 |
| H200 141GB | $5-10 | $3.70-4.30 | $2.50 |
Specialized providers (RunPod, Vast.ai, Jarvislabs, Lambda) typically offer 40-60% lower prices than AWS/GCP/Azure.
When Buying Makes Sense
Buy consumer GPUs when:
- You'll use them daily
- Privacy/compliance requires local data
- Long-term cost matters more than capital efficiency
- You also use them for gaming/other workloads
Buy datacenter GPUs when:
- Utilization exceeds 70% sustained
- You have infrastructure expertise
- Compliance requires on-premises
- Multi-year budget is available
Buying Recommendations by Use Case
Hobbyist/Learning ($500-1,000)
Recommendation: Used RTX 3090 ($700-900)
24GB VRAM runs most useful models. Excellent community support. Can be upgraded later.
Alternative: RTX 4060 Ti 16GB ($450) for tighter budgets, but 16GB limits model options.
Serious Developer ($1,500-2,500)
Recommendation: RTX 5090 ($2,000)
The clear winner for local LLM work. 32GB handles 70B models. Performance rivals datacenter GPUs.
Alternative: RTX 4090 ($1,600) if 24GB is sufficient for your models.
Professional Workstation ($5,000-10,000)
Recommendation: RTX 5090 + cloud credits
Unless you need ECC memory or certified drivers, the 5090 outperforms professional cards. Use cloud for larger models.
Alternative: RTX 6000 Ada ($6,800) if professional features are required.
Small Team/Startup
Recommendation: Cloud-first approach
Start with H100/H200 cloud instances. Measure actual usage. Consider purchasing only after establishing utilization patterns.
Providers: RunPod, Lambda, Jarvislabs for cost efficiency.
Enterprise Production
Recommendation: H200 cloud or purchase depending on scale
- Below ~70% utilization: cloud with reserved instances
- Above ~70% utilization: H200 purchase with an operational team
The H200's 141GB VRAM simplifies deployment for large models that would require 2x H100.
Maximum Local Memory
Recommendation: Mac Studio M3 Ultra 512GB ($9,499)
When you need to run models that don't fit anywhere else, Apple's unified memory is the only consumer option.
Budget Multi-GPU
Recommendation: 2-4x used RTX 3090 ($1,400-3,600)
Used 3090s offer excellent value for multi-GPU setups. 96GB total VRAM across 4 cards handles most workloads.
The Real Bottleneck: When Hardware Isn't the Answer
Sometimes the right GPU decision is recognizing you don't need to make one.
API providers handle infrastructure entirely. You pay per token, scale instantly, and never worry about GPU availability.
For many applications, the complexity of managing GPU infrastructure exceeds its value. Teams spend weeks optimizing CUDA environments when they could be building products.
Prem sits between DIY infrastructure and pure API providers. The platform handles fine-tuning, evaluation, and deployment without requiring you to manage GPUs directly. For organizations with data sovereignty requirements, deployment options include your own AWS VPC or on-premise infrastructure.
The decision isn't just "which GPU" but "should I manage GPUs at all."
FAQ
What's the minimum VRAM for running local LLMs?
8GB runs 7B models with Q4 quantization. 16GB is comfortable for 7B-13B. 24GB handles most practical use cases. More is always better for flexibility.
Is the RTX 5090 worth it over the 4090?
For LLM inference, yes. The 32GB VRAM (vs 24GB) and 77% more bandwidth translate to significantly better performance. If you're buying new in 2026, the 5090 is the clear choice.
Should I buy an H100 or use cloud?
Cloud unless you have >70% sustained utilization and infrastructure expertise. H100 cloud instances cost $2-4/hr. Purchasing makes sense only at enterprise scale.
Can Apple Silicon compete with NVIDIA for LLMs?
In tokens per second, no. A 5090 is 2-3x faster than M3 Ultra for models that fit. But Apple Silicon runs models that don't fit on any consumer NVIDIA GPU, and it does so silently at a fraction of the power.
What about AMD GPUs?
AMD's ROCm has improved significantly, and MI300X competes with H100 at datacenter scale. Consumer AMD GPUs (RX 7900 XTX) have decent software support through ROCm but trail NVIDIA in ecosystem maturity. For most users, NVIDIA remains the safer choice.
How many RTX 4090s equal an H100?
Roughly 2-4 depending on workload. For inference with models that fit in 24GB, two 4090s often match or exceed H100 performance. For larger models or training, H100's NVLink and larger memory provide advantages that consumer cards can't match.
When will GPU prices drop?
Consumer GPUs follow predictable cycles: prices stabilize 6-12 months after launch. Datacenter GPU pricing has dropped 40-60% from 2023 peaks as supply caught up with demand. H100 cloud pricing should continue declining through 2026 as B200 becomes available.