GPU Buying Guide for LLMs: RTX 5090 vs H100 vs H200 Complete Comparison (2026)

Which GPU should you buy for running LLMs? From $250 budget cards to $40K datacenter GPUs. Covers VRAM needs, tokens per second benchmarks, and total cost of ownership.

Choosing a GPU for LLMs is fundamentally different from choosing one for gaming or rendering.

Raw compute matters less than you'd expect. Memory bandwidth and VRAM capacity matter more. A $2,000 consumer card can outperform a $10,000 workstation GPU for inference. And sometimes the best answer isn't buying hardware at all.

This guide covers every tier: budget consumer cards, high-end gaming GPUs, professional workstation options, datacenter accelerators, and Apple Silicon. We'll look at what actually determines LLM performance, which models fit on which GPUs, real benchmark numbers, and total cost of ownership including cloud alternatives.

What Actually Determines LLM Performance

Before comparing specific GPUs, understand what drives LLM inference speed.

VRAM: The Hard Constraint

VRAM determines what models you can run. Period.

A 7B parameter model in FP16 needs approximately 14GB of VRAM. A 70B model needs around 140GB. No amount of compute power helps if you can't fit the model in memory.

Quantization changes these requirements dramatically:

| Model size | FP16  | INT8  | INT4 (Q4) |
|------------|-------|-------|-----------|
| 7B         | 14GB  | 7GB   | 4GB       |
| 13B        | 26GB  | 13GB  | 7GB       |
| 32B        | 64GB  | 32GB  | 18GB      |
| 70B        | 140GB | 70GB  | 35GB      |
| 405B       | 810GB | 405GB | 203GB     |

These are minimums for model weights only. Add 2-6GB for KV cache, CUDA overhead, and context. Longer context windows require more KV cache memory, scaling linearly with sequence length.
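As a sanity check, the table above can be approximated in a few lines of Python. The bytes-per-parameter figures are standard, and the KV-cache formula (keys plus values, per layer, per token, stored in FP16) uses layer and head counts that are my assumptions, loosely modeled on Llama-style configs:

```python
# Back-of-envelope VRAM estimate: weights + KV cache + runtime overhead.
# Bytes-per-parameter by precision; KV cache assumed FP16 (2 bytes/value).

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}

def estimate_vram_gb(params_b: float, precision: str = "q4",
                     n_layers: int = 32, n_kv_heads: int = 8,
                     head_dim: int = 128, context_len: int = 8192,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM need in GB for weights, KV cache, and overhead."""
    weights = params_b * 1e9 * BYTES_PER_PARAM[precision]
    # KV cache: 2 (keys + values) * layers * kv_heads * head_dim * 2 bytes/token
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * 2 * context_len
    return (weights + kv_cache) / 1e9 + overhead_gb

print(round(estimate_vram_gb(8, "q4"), 1))                 # → 6.6
print(round(estimate_vram_gb(70, "q4", n_layers=80), 1))   # → 39.2
```

The 8B Q4 estimate (~6.6GB) matches the guideline that 8GB cards run 7B-8B models at Q4 with limited context headroom.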

Practical VRAM guidelines:

  • 8GB: 7B models quantized to Q4, limited context
  • 12-16GB: 7B-13B models comfortably, some 32B quantized
  • 24GB: Most 7B-32B models, 70B heavily quantized
  • 32GB: 70B Q4 models, comfortable headroom
  • 48GB: 70B Q8, larger context windows
  • 80GB: 70B FP16, 405B quantized
  • 141GB+: Largest models without extreme quantization

Memory Bandwidth: The Speed Determinant

LLM inference is memory-bandwidth bound, not compute bound.

During token generation, the GPU reads model weights from memory for every single token. Generation speed is limited by how fast data moves from VRAM to compute units.

Memory bandwidth comparison:

| GPU        | Bandwidth  | Approx. tok/s (8B Q4) |
|------------|------------|-----------------------|
| RTX 3090   | 936 GB/s   | 85   |
| RTX 4090   | 1,008 GB/s | 128  |
| RTX 5090   | 1,792 GB/s | 213  |
| A100 80GB  | 2,039 GB/s | 138  |
| H100 80GB  | 3,350 GB/s | 144  |
| H200 141GB | 4,800 GB/s | ~180 |

Notice the RTX 5090 outperforms the A100 despite costing a small fraction of the price. Bandwidth is that important.
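The bandwidth numbers translate directly into a ceiling on decode speed: each generated token reads every weight once, so tokens per second can't exceed bandwidth divided by model size in bytes. A minimal sketch of that arithmetic (my simplification; real engines land well below the ceiling due to kernel overhead, KV-cache reads, and scheduling):

```python
# Upper bound on single-stream decode speed: tok/s <= bandwidth / model_bytes.

def max_tokens_per_sec(bandwidth_gb_s: float, params_b: float,
                       bytes_per_param: float = 0.5) -> float:
    """Theoretical decode ceiling; 0.5 bytes/param assumes Q4 weights."""
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# RTX 5090 (1,792 GB/s) on an 8B Q4 model (~4 GB of weights):
print(round(max_tokens_per_sec(1792, 8)))  # → 448 ceiling; measured ~213
```

Measured throughput reaching roughly half the theoretical ceiling is typical, which is why the table's rankings track bandwidth so closely.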

Tensor Cores and Precision

Modern GPUs have specialized tensor cores for matrix operations. For LLM inference:

  • FP16/BF16: Default precision, good performance
  • FP8: 2x throughput vs FP16, minor quality loss (H100+, Blackwell)
  • INT8: Fast inference with proper calibration
  • INT4: Fastest quantized inference

Newer architectures (Hopper, Blackwell) have transformer-specific optimizations. But for inference, bandwidth usually matters more than these architectural features.

Consumer GPUs: Best Value for Local LLM

Consumer GPUs offer the best price-to-performance for personal LLM use.

RTX 5090 (32GB) - The New Champion

Specs:

  • VRAM: 32GB GDDR7
  • Bandwidth: 1,792 GB/s
  • CUDA Cores: 21,760
  • TDP: 575W
  • Price: $1,999 MSRP (street: $2,000-2,500)

The RTX 5090 is a game-changer for local LLM inference. Its 32GB VRAM handles 70B Q4 models on a single card. Its GDDR7 memory delivers 1.8 TB/s bandwidth, 77% faster than the 4090.

Benchmark highlights:

  • 213 tok/s on Llama 3 8B Q4
  • 61 tok/s on 32B models
  • Outperforms A100 80GB on small-to-medium models
  • 2.6x faster than A100 on Qwen 7B (RunPod benchmarks)

Best for: Enthusiasts, developers, researchers who need to run 70B models locally. The single best consumer GPU for LLMs in 2026.

Limitations: 575W TDP requires robust PSU and cooling. Street prices above MSRP due to demand. No ECC memory.

RTX 4090 (24GB) - Still Excellent

Specs:

  • VRAM: 24GB GDDR6X
  • Bandwidth: 1,008 GB/s
  • CUDA Cores: 16,384
  • TDP: 450W
  • Price: $1,600-1,800 (new), $1,200-1,400 (used)

The 4090 remains an outstanding LLM GPU. Its 24GB fits most 7B-32B models. Performance is excellent for its price.

Benchmark highlights:

  • 128 tok/s on Llama 3 8B Q4
  • Mature ecosystem, excellent driver support
  • Widely available and well-understood

Best for: Budget-conscious developers, those who don't need 70B models, gamers who also do ML work.

Limitations: 24GB VRAM limits model size. Can't run 70B without heavy quantization and reduced context.

RTX 3090 (24GB) - Budget King

Specs:

  • VRAM: 24GB GDDR6X
  • Bandwidth: 936 GB/s
  • CUDA Cores: 10,496
  • TDP: 350W
  • Price: $700-900 (used)

Used 3090s offer exceptional value. Same 24GB VRAM as 4090 at half the price. Performance is ~30% slower but still very capable.

Best for: Budget builds, multi-GPU setups, hobbyists.

RTX 4060 Ti 16GB - Entry Point

Specs:

  • VRAM: 16GB GDDR6
  • Bandwidth: 288 GB/s
  • Price: $450-500

The 16GB 4060 Ti is the cheapest path to running 7B-13B models locally. Performance is modest but adequate for development and experimentation.

Best for: Students, hobbyists, development/testing.

Intel Arc B580 - Budget Experimentation

Specs:

  • VRAM: 12GB GDDR6
  • Price: $249

Intel's Arc GPUs now support LLM inference through IPEX-LLM. The B580 offers 12GB VRAM at $249, enough for 7B Q4 models. Software support is maturing but still behind CUDA.

Best for: Extreme budget builds, experimentation.

Consumer GPU Summary

| GPU              | VRAM | Bandwidth  | Price  | tok/s (8B Q4) | Best For                 |
|------------------|------|------------|--------|---------------|--------------------------|
| RTX 5090         | 32GB | 1,792 GB/s | $2,000 | 213           | 70B models, serious work |
| RTX 4090         | 24GB | 1,008 GB/s | $1,600 | 128           | Most users, great value  |
| RTX 3090         | 24GB | 936 GB/s   | $800   | 85            | Budget, multi-GPU        |
| RTX 4060 Ti 16GB | 16GB | 288 GB/s   | $450   | 45            | Entry level              |
| Arc B580         | 12GB | -          | $249   | ~30           | Experimentation          |

Professional/Workstation GPUs

Workstation GPUs sit between consumer and datacenter. They offer enterprise features like ECC memory, NVLink, and certified drivers, but at substantial price premiums.

RTX 6000 Ada (48GB)

Specs:

  • VRAM: 48GB GDDR6 ECC
  • Bandwidth: 960 GB/s
  • CUDA Cores: 18,176
  • TDP: 300W
  • Price: $6,800

The RTX 6000 Ada provides 48GB VRAM with ECC memory. Its Ada architecture delivers strong performance.

Benchmark highlights:

  • ~130 tok/s on Llama 3 8B
  • 48GB fits 70B Q8 models
  • Supports NVLink for multi-GPU scaling

Best for: Professional workstations needing ECC and certified drivers. Video production with AI features.

Value assessment: For pure LLM work, the RTX 5090 offers better performance at 1/3 the price. The 6000 Ada's value comes from its professional features and 48GB VRAM.

RTX A6000 (48GB)

Specs:

  • VRAM: 48GB GDDR6 ECC
  • Bandwidth: 768 GB/s
  • Architecture: Ampere (older)
  • TDP: 300W
  • Price: $4,500

The A6000 is the previous-generation workstation flagship. Its 48GB VRAM remains valuable, but Ampere architecture is slower than Ada.

Best for: Budget professional workstations, used market deals.

Value assessment: At current prices, the RTX 6000 Ada is worth the premium. Used A6000s around $2,500 can be good value.

L40S (48GB)

Specs:

  • VRAM: 48GB GDDR6 ECC
  • Bandwidth: 864 GB/s
  • Architecture: Ada Lovelace
  • TDP: 350W
  • Price: $8,000-10,000

NVIDIA positions the L40S as a "universal datacenter GPU" for AI, graphics, and video. It's essentially a datacenter-packaged version of Ada architecture.

Benchmark highlights:

  • ~114 tok/s on 8B models
  • Optimized for inference workloads
  • MIG support for multi-tenant deployments

Best for: Datacenter deployments needing mixed AI/graphics/video workloads.

Value assessment: Caught between consumer 5090 (faster, cheaper) and H100 (more memory, faster). Hard to recommend for pure LLM work.

Professional GPU Summary

| GPU          | VRAM | Bandwidth | Price   | Notes                    |
|--------------|------|-----------|---------|--------------------------|
| RTX 6000 Ada | 48GB | 960 GB/s  | $6,800  | ECC, NVLink, pro drivers |
| RTX A6000    | 48GB | 768 GB/s  | $4,500  | Older, still capable     |
| L40S         | 48GB | 864 GB/s  | $8,000+ | Datacenter universal GPU |

Datacenter GPUs: Maximum Performance

Datacenter GPUs are purpose-built for AI at scale. They're not cost-effective for individuals, but essential for production deployments.

A100 (80GB) - The Established Workhorse

Specs:

  • VRAM: 80GB HBM2e
  • Bandwidth: 2,039 GB/s
  • Architecture: Ampere
  • TDP: 400W (PCIe), 500W (SXM)
  • Price: ~$15,000 (used), $1.79/hr cloud

The A100 dominated AI infrastructure from 2020-2023. It remains widely deployed and available.

Performance:

  • 138 tok/s on 8B models (vLLM)
  • 80GB fits 70B FP16 models
  • NVLink scales to 8-GPU systems

Best for: Organizations with existing Ampere infrastructure. Budget-conscious datacenter deployments.

Value assessment: Cloud rental at $1.79/hr makes more sense than purchasing for most use cases. The H100 offers 4x better performance at ~2x the cost.

H100 (80GB) - Current Production Standard

Specs:

  • VRAM: 80GB HBM3
  • Bandwidth: 3,350 GB/s
  • Architecture: Hopper
  • TDP: 700W
  • Price: $25,000-40,000 (purchase), $2-4/hr (cloud)

The H100 is the current standard for production AI inference. Its Transformer Engine and FP8 support deliver substantial speedups.

Performance:

  • 144 tok/s on 8B models
  • 984 tok/s throughput on 70B (vLLM, high batch)
  • 4x training performance vs A100

Cloud pricing (2026):

  • Hyperscalers (AWS, GCP, Azure): $3-5/hr
  • Specialized providers (Jarvislabs, RunPod): $2-3/hr
  • Spot/preemptible: $1.50-2.50/hr

Best for: Production inference at scale. Training large models.

H200 (141GB) - Memory King

Specs:

  • VRAM: 141GB HBM3e
  • Bandwidth: 4,800 GB/s
  • Architecture: Hopper
  • TDP: 700W
  • Price: $30,000-40,000 (purchase), $3.70-5/hr (cloud)

The H200 upgrades H100's memory from 80GB to 141GB while boosting bandwidth 43%. Same compute, more memory.

Performance:

  • ~180 tok/s on 8B models
  • Fits Llama 70B FP16 on single GPU (H100 requires 2)
  • 1.9x inference improvement on memory-bound workloads

Best for: Large models where memory is the bottleneck. Long-context applications.

Value assessment: 20% price premium over H100 for 76% more memory. Excellent value for memory-constrained workloads.

B200 (192GB) - Next Generation

Specs:

  • VRAM: 192GB HBM3e
  • Bandwidth: 8,000 GB/s
  • Architecture: Blackwell
  • TDP: 1000W
  • Price: Not yet widely available

NVIDIA's Blackwell architecture brings fifth-generation tensor cores, FP4 support, and massive memory improvements.

Performance claims:

  • 15x inference improvement vs H100 (NVIDIA benchmarks)
  • 450 tok/s on 8B models
  • 192GB fits even larger models

Availability: Limited in early 2026. Sold out through mid-2026 according to reports.

Best for: Frontier model serving. Extreme-scale deployments. Wait for availability.

Datacenter GPU Summary

| GPU        | VRAM  | Bandwidth  | Purchase | Cloud/hr | Use Case            |
|------------|-------|------------|----------|----------|---------------------|
| A100 80GB  | 80GB  | 2,039 GB/s | ~$15K    | $1.79    | Budget datacenter   |
| H100 80GB  | 80GB  | 3,350 GB/s | $25-40K  | $2-4     | Production standard |
| H200 141GB | 141GB | 4,800 GB/s | $30-40K  | $3.70-5  | Large model serving |
| B200 192GB | 192GB | 8,000 GB/s | TBD      | TBD      | Next-gen (limited)  |

Apple Silicon: The Unified Memory Advantage

Apple Silicon takes a different approach: unified memory shared between CPU and GPU.

Why Apple Silicon Works for LLMs

Traditional GPUs have separate VRAM. Model weights must fit in VRAM. A 70B FP16 model needs 140GB, exceeding any single consumer GPU.

Apple's unified memory architecture lets the GPU access all system RAM. A Mac Studio with 192GB can load a 70B FP16 model that no consumer NVIDIA GPU can touch.

Tradeoffs:

Pros:

  • More memory than any consumer GPU
  • Extremely power efficient (40-80W vs 450W)
  • Silent operation

Cons:

  • Slower tokens per second than equivalent NVIDIA
  • Less software ecosystem support
  • Higher cost per GB of memory

M4 Max (128GB)

Specs:

  • Unified Memory: Up to 128GB
  • Bandwidth: 546 GB/s
  • Neural Engine: 38 TOPS
  • TDP: ~100W
  • Price: $3,999 (64GB), $4,999 (128GB) MacBook Pro

Performance:

  • ~96-100 tok/s on 8B Q4 (projected)
  • ~25-30 tok/s on 70B Q4
  • Can run models that don't fit on any consumer NVIDIA GPU

Best for: Developers wanting portable LLM capability. Silent home setups.

M3 Ultra (512GB)

Specs:

  • Unified Memory: Up to 512GB
  • Bandwidth: 819 GB/s
  • TDP: ~200W
  • Price: $4,999 (96GB) to $9,499 (512GB) Mac Studio

Performance:

  • 76-84 tok/s on 8B Q4
  • ~17-18 tok/s on 671B parameter models (DeepSeek R1)
  • Can run virtually any model with quantization

Best for: Research requiring very large models. Teams preferring macOS. Silent operation requirements.

Value assessment: The 512GB configuration at $9,499 can run models impossible on consumer NVIDIA hardware. But for models that fit in 32GB, an RTX 5090 is 3x faster at 1/4 the cost.

Apple vs NVIDIA: When to Choose Each

Choose Apple Silicon when:

  • Model size exceeds 32GB VRAM
  • Silent operation is required
  • Power efficiency matters
  • You're already in the Apple ecosystem
  • Portability is important (MacBook Pro)

Choose NVIDIA when:

  • Maximum tokens per second matters
  • Model fits in available VRAM
  • You need CUDA ecosystem
  • Cloud deployment is the goal
  • Budget is constrained

Multi-GPU Considerations

Running multiple GPUs introduces complexity but enables larger models and higher throughput.

Consumer Multi-GPU

Multiple consumer GPUs communicate over PCIe, which is slow (~32 GB/s) compared to NVLink (~900 GB/s).

What works:

  • Running different models on different GPUs (no communication needed)
  • Pipeline parallelism for very large models
  • Batch serving with one model per GPU

What doesn't work well:

  • Tensor parallelism (requires high-bandwidth interconnect)
  • Training (gradient synchronization is slow)

Practical guidance:

  • 2x RTX 4090 (48GB total): fits 70B Q4 (35GB) with headroom for KV cache and context
  • 4x RTX 4090 (96GB total): fits 70B Q8 (70GB) with pipeline parallelism; 405B still exceeds this even at Q4 (~203GB)
  • 8x RTX 4090 servers exist (192GB) but PCIe bandwidth limits scaling

A dual RTX 5090 setup (64GB total) often outperforms single H100 for models that fit.
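For planning a pipeline-parallel setup, a useful rule of thumb is to assign layers in proportion to each card's VRAM. This is a hypothetical planning helper of my own, not a real framework API (engines like vLLM or llama.cpp handle the split for you):

```python
# Sketch: assign contiguous transformer layers to GPUs in proportion to
# each card's VRAM, for pipeline parallelism. Planning aid, not a real API.

def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Return how many layers each GPU should host, summing to n_layers."""
    total = sum(vram_gb)
    counts = [int(n_layers * v / total) for v in vram_gb]
    counts[-1] += n_layers - sum(counts)  # hand rounding remainder to last GPU
    return counts

# An 80-layer 70B model across 4x RTX 3090 (24 GB each):
print(split_layers(80, [24, 24, 24, 24]))  # → [20, 20, 20, 20]
```

Proportional splits also work for mismatched cards (say, a 4090 plus a 3090), though the slowest stage sets the pipeline's pace.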

Datacenter Multi-GPU

NVLink enables efficient tensor parallelism. H100/H200 systems scale to 8 GPUs with 900 GB/s interconnect.

For multi-node scaling, NVLink Switch and InfiniBand provide high-bandwidth connectivity across servers.

Cloud vs Buy: The Economic Analysis

When Cloud Makes Sense

Utilization matters most. If you're using GPUs <40% of the time, cloud rental beats ownership.

Break-even analysis for H100:

  • Purchase: $30,000 upfront plus roughly $5,000/year for power, cooling, and hosting ≈ $45,000 over 3 years
  • Cloud at $2.50/hr: $21,900/year at 24/7 usage = $65,700 over 3 years
  • Break-even: ~70% utilization ($45,000 ÷ $65,700)

Most users don't run 24/7. Intermittent usage strongly favors cloud.
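The buy-vs-rent arithmetic is easy to redo with your own numbers. The figures below ($30K purchase, ~$5K/year operating, $2.50/hr cloud) are illustrative assumptions; swap in your actual quotes:

```python
# Break-even utilization: the fraction of 24/7 usage above which owning a
# GPU costs less than renting it. All input figures are assumptions.

HOURS_PER_YEAR = 8760

def breakeven_utilization(purchase: float, opex_per_year: float,
                          cloud_rate_hr: float, years: int = 3) -> float:
    """Ownership cost divided by the cost of renting 24/7 over the same period."""
    total_own = purchase + opex_per_year * years
    total_cloud_247 = cloud_rate_hr * HOURS_PER_YEAR * years
    return total_own / total_cloud_247

u = breakeven_utilization(30_000, 5_000, 2.50)
print(f"{u:.0%}")  # → 68%
```

A result above 100% would mean buying never pays off at that cloud rate; anywhere near 70%, as here, means only sustained near-constant workloads justify the purchase.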

Cloud advantages:

  • No upfront capital
  • Access to latest hardware
  • Geographic flexibility
  • Elastic scaling

Cloud disadvantages:

  • Higher cost at high utilization
  • Data transfer costs
  • Vendor lock-in risk
  • Less control

Cloud Pricing (2026)

| GPU        | Hyperscaler | Specialized | Spot  |
|------------|-------------|-------------|-------|
| RTX 4090   | -           | $0.40-0.65  | $0.25 |
| A100 80GB  | $3.67       | $1.79       | $0.80 |
| H100 80GB  | $3.50-5     | $2-3        | $1.50 |
| H200 141GB | $5-10       | $3.70-4.30  | $2.50 |

Specialized providers (RunPod, Vast.ai, Jarvislabs, Lambda) typically offer 40-60% lower prices than AWS/GCP/Azure.

When Buying Makes Sense

Buy consumer GPUs when:

  • You'll use them daily
  • Privacy/compliance requires local data
  • Long-term cost matters more than capital efficiency
  • You also use them for gaming/other workloads

Buy datacenter GPUs when:

  • Utilization exceeds 70% sustained
  • You have infrastructure expertise
  • Compliance requires on-premises
  • Multi-year budget is available

Buying Recommendations by Use Case

Hobbyist/Learning ($500-1,000)

Recommendation: Used RTX 3090 ($700-900)

24GB VRAM runs most useful models. Excellent community support. Can be upgraded later.

Alternative: RTX 4060 Ti 16GB ($450) for tighter budgets, but 16GB limits model options.

Serious Developer ($1,500-2,500)

Recommendation: RTX 5090 ($2,000)

The clear winner for local LLM work. 32GB handles 70B models. Performance rivals datacenter GPUs.

Alternative: RTX 4090 ($1,600) if 24GB is sufficient for your models.

Professional Workstation ($5,000-10,000)

Recommendation: RTX 5090 + cloud credits

Unless you need ECC memory or certified drivers, the 5090 outperforms professional cards. Use cloud for larger models.

Alternative: RTX 6000 Ada ($6,800) if professional features are required.

Small Team/Startup

Recommendation: Cloud-first approach

Start with H100/H200 cloud instances. Measure actual usage. Consider purchasing only after establishing utilization patterns.

Providers: RunPod, Lambda, Jarvislabs for cost efficiency.

Enterprise Production

Recommendation: H200 cloud or purchase depending on scale

  • Under 70% utilization: cloud with reserved instances
  • Over 70% utilization: consider an H200 purchase with an operational team

The H200's 141GB VRAM simplifies deployment for large models that would require 2x H100.

Maximum Local Memory

Recommendation: Mac Studio M3 Ultra 512GB ($9,499)

When you need to run models that don't fit anywhere else, Apple's unified memory is the only consumer option.

Budget Multi-GPU

Recommendation: 2-4x Used RTX 3090 ($2,800-3,600)

Used 3090s offer excellent value for multi-GPU setups. 96GB total VRAM across 4 cards handles most workloads.

The Real Bottleneck: When Hardware Isn't the Answer

Sometimes the right GPU decision is recognizing you don't need to make one.

API providers handle infrastructure entirely. You pay per token, scale instantly, and never worry about GPU availability.

For many applications, the complexity of managing GPU infrastructure exceeds its value. Teams spend weeks optimizing CUDA environments when they could be building products.

Prem sits between DIY infrastructure and pure API providers. The platform handles fine-tuning, evaluation, and deployment without requiring you to manage GPUs directly. For organizations with data sovereignty requirements, deployment options include your own AWS VPC or on-premise infrastructure.

The decision isn't just "which GPU" but "should I manage GPUs at all."

FAQ

What's the minimum VRAM for running local LLMs?

8GB runs 7B models with Q4 quantization. 16GB is comfortable for 7B-13B. 24GB handles most practical use cases. More is always better for flexibility.

Is the RTX 5090 worth it over the 4090?

For LLM inference, yes. The 32GB VRAM (vs 24GB) and 77% more bandwidth translate to significantly better performance. If you're buying new in 2026, the 5090 is the clear choice.

Should I buy an H100 or use cloud?

Cloud unless you have >70% sustained utilization and infrastructure expertise. H100 cloud instances cost $2-4/hr. Purchasing makes sense only at enterprise scale.

Can Apple Silicon compete with NVIDIA for LLMs?

In tokens per second, no. A 5090 is 2-3x faster than M3 Ultra for models that fit. But Apple Silicon runs models that don't fit on any consumer NVIDIA GPU, and it does so silently at a fraction of the power.

What about AMD GPUs?

AMD's ROCm has improved significantly, and MI300X competes with H100 at datacenter scale. Consumer AMD GPUs (RX 7900 XTX) have decent software support through ROCm but trail NVIDIA in ecosystem maturity. For most users, NVIDIA remains the safer choice.

How many RTX 4090s equal an H100?

Roughly 2-4 depending on workload. For inference with models that fit in 24GB, two 4090s often match or exceed H100 performance. For larger models or training, H100's NVLink and larger memory provide advantages that consumer cards can't match.

When will GPU prices drop?

Consumer GPUs follow predictable cycles: prices stabilize 6-12 months after launch. Datacenter GPU pricing has dropped 40-60% from 2023 peaks as supply caught up with demand. H100 cloud pricing should continue declining through 2026 as B200 becomes available.
