How to Self-Host Mistral Large 3: Hardware, vLLM Setup & Function Calling (2026)

Self-host Mistral Large 3 on your own infrastructure. Hardware sizing, vLLM commands, function calling, Eagle speculative decoding, and production tips for the 675B MoE model.

Most guides covering Mistral Large 3 stop at a single vllm serve command. That gets you running but leaves you guessing on the things that actually matter in production: which precision format fits your hardware, how to set context length without wasting VRAM, and why your function calls break if you skip two specific flags.

This guide covers all of it. Hardware requirements across GPU tiers, FP8 vs NVFP4 trade-offs, the full vLLM configuration, function calling setup, and speculative decoding for throughput gains. Tested against the mistralai/Mistral-Large-3-675B-Instruct-2512 checkpoint on vLLM v0.17.0.


What Mistral Large 3 Actually Is

Before touching deployment, it's worth understanding the architecture because it directly affects hardware planning.

Mistral Large 3 is a sparse Mixture-of-Experts model with 675B total parameters, but only 41B are active per forward pass (39B language model + 2.5B vision encoder). That MoE architecture is why deployment is cheaper than a 675B dense model — you're loading all 675B into memory but only computing 41B at inference time.

Key specs:

Property                     Value
Total parameters             675B
Active parameters            41B
Architecture                 Granular MoE + Vision Encoder
Context window               256k tokens
License                      Apache 2.0
Formats available            BF16, FP8, NVFP4
HuggingFace ID (instruct)    mistralai/Mistral-Large-3-675B-Instruct-2512

The Apache 2.0 license matters for enterprise deployment. No usage fees, no restrictions on commercial use, no need to contact Mistral for a commercial license. Mistral Large 2 required a separate commercial license for self-deployment. Large 3 does not.


Hardware Requirements

This is where most guides get vague. Here are the actual requirements, broken down by GPU tier.

Precision Formats and What They Mean

Mistral Large 3 ships in three formats:

  • FP8 - The recommended production format, and the one to use if you plan to fine-tune. Requires H200 or B200 GPUs. Best for long-context workloads (up to 256k tokens).
  • NVFP4 - More memory-efficient. Runs on A100s and H100s. Quality matches FP8 for contexts under 64k tokens; above that, expect some degradation.
  • BF16 - Full-precision base weights. Requires a multi-node configuration. Most teams don't need this unless they're doing custom post-training.
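To see why the hardware tiers below line up the way they do, here's a back-of-envelope weight-memory calculation. This is a sketch: it ignores KV cache, activations, and runtime overhead, and treats every parameter as stored at the quantized width.

```python
def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Approximate GPU memory needed just to hold the weights, in GB."""
    return total_params_b * 1e9 * bytes_per_param / 1e9

FORMATS = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}  # bytes per parameter

for fmt, bytes_per_param in FORMATS.items():
    total = weight_memory_gb(675, bytes_per_param)
    per_gpu = total / 8  # spread across an 8-GPU node with tensor parallelism
    print(f"{fmt}: ~{total:.0f} GB total, ~{per_gpu:.0f} GB per GPU on 8 GPUs")
```

Roughly: BF16 needs ~1,350 GB (~169 GB per GPU, more than any single node holds, hence multi-node); FP8 needs ~675 GB (~84 GB per GPU, which is why it wants H200/B200-class memory once KV cache is added); NVFP4 needs ~338 GB (~42 GB per GPU, comfortable on 8x80GB A100s).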

GPU Requirements by Format

Format    Minimum setup             Notes
FP8       8x H200 (141GB each)      Single-node, recommended for production
FP8       8x B200                   Faster on Blackwell, significant speed-up vs H200
NVFP4     8x A100 (80GB each)       Single-node, good for contexts under 64k
NVFP4     8x H100 (80GB each)       Single-node, same constraint
BF16      Multi-node                Not recommended unless you have a specific reason

The 8xA100 path is the most accessible for teams that already have A100 nodes. If you're on H100s, use NVFP4 for the same memory footprint with similar quality on typical workloads.

Text-only deployments can skip loading the vision encoder entirely, which frees up meaningful KV cache space:

--limit-mm-per-prompt '{"image": 0}'

Context Length and VRAM Trade-offs

The default --max-model-len is 262,144 (256k tokens). That's rarely needed and reserves significant VRAM for KV cache. For most production workloads, set it explicitly:

  • 32k or less: No performance gap between FP8 and NVFP4
  • 64k: NVFP4 starts to show minor degradation
  • 256k: Use FP8 only

A practical starting point for most RAG and agentic workloads is --max-model-len 32768. You can always increase it once you've validated memory headroom.
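The VRAM cost of context scales linearly with --max-model-len. Here's a rough per-request KV-cache estimator; note that the layer count, KV-head count, and head dimension below are placeholders, not Large 3's published architecture — substitute the real values from the model's config.json.

```python
def kv_cache_gb(max_model_len: int, num_layers: int, num_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 1) -> float:
    """Per-sequence KV cache: 2 (K and V) * layers * kv_heads * head_dim
    * seq_len * dtype bytes, in GB. bytes_per_elem=1 assumes FP8 KV cache."""
    return 2 * num_layers * num_kv_heads * head_dim * max_model_len * bytes_per_elem / 1e9

# Placeholder architecture values -- read the real ones from config.json.
for ctx in (8_192, 32_768, 262_144):
    gb = kv_cache_gb(ctx, num_layers=60, num_kv_heads=8, head_dim=128)
    print(f"{ctx:>7} tokens: ~{gb:.1f} GB per sequence")
```

The point is the linear scaling: whatever the true per-layer numbers are, reserving 256k of context costs 8x the KV cache of 32k, memory that would otherwise go to batching.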


Installing vLLM

Use uv for faster dependency resolution:

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

This automatically pulls in mistral_common >= 1.8.6, which vLLM needs for Mistral's tokenizer format. If that package is missing or too old, tokenization falls back to the HuggingFace default and tool calling breaks.

Set your HuggingFace token:

export HF_TOKEN=your_token_here

Accept the model card on HuggingFace before running. The model requires explicit agreement at huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512.
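Before starting the server, it's worth confirming the installed mistral_common actually meets the version floor. A quick check — this assumes the package exposes its version via importlib.metadata (any pip-installed package does), and the simple numeric comparison doesn't handle pre-release suffixes:

```python
from importlib.metadata import version, PackageNotFoundError

def meets_floor(installed: str, floor: str = "1.8.6") -> bool:
    """Compare dotted version strings numerically, not lexically."""
    parse = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return parse(installed) >= parse(floor)

try:
    v = version("mistral_common")
    print(f"mistral_common {v}: {'OK' if meets_floor(v) else 'too old -- upgrade'}")
except PackageNotFoundError:
    print("mistral_common is not installed -- reinstall vLLM")
```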


Deploying with vLLM

FP8 on 8x H200 or B200

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --max-model-len 32768

The three mistral flags (tokenizer_mode, config_format, load_format) are not optional. They tell vLLM to use Mistral's native format rather than the HuggingFace defaults. Skipping them causes silent tokenization differences that affect output quality, especially for function calling.

NVFP4 on 8x A100 or H100

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
  --tensor-parallel-size 4 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --max-model-len 32768

Note --tensor-parallel-size 4 here, not 8. NVFP4 is more memory-efficient and can run on a 4-GPU slice for most context lengths.

Docker Deployment

For containerized infrastructure:

docker run --runtime nvidia --gpus all \
  --shm-size=4g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --max-model-len 32768

The --shm-size=4g flag is required for tensor parallelism. Without shared memory, the process will crash on multi-GPU setups.

Verifying the Deployment

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token" \
  -d '{
    "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
    "messages": [{"role": "user", "content": "What is 12 * 144?"}],
    "max_tokens": 100
  }'

If the model responds correctly, your deployment is working. Check Prometheus metrics at http://localhost:8000/metrics — look for vllm:num_requests_running and vllm:gpu_cache_usage_perc to confirm the inference engine is healthy.


Function Calling Configuration

Function calling requires two specific flags that most basic deployment guides omit:

--enable-auto-tool-choice \
--tool-call-parser mistral

Without --tool-call-parser mistral, vLLM uses a generic parser that doesn't handle Mistral's tool call format correctly. You'll get malformed JSON in tool responses.

Here's a working function call example using the OpenAI SDK pointed at your local endpoint:

import json
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto"
)

# Handle tool call
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {args}")

Mistral recommends keeping the tool set well-defined and limited. Performance degrades with large tool lists — keep it under 10 tools per request for reliable results. For agentic workflows with many tools, consider grouping related functions or routing through a smaller model that handles tool selection before dispatching to Large 3 for generation.

For teams building production-ready AI models with custom tooling, the function calling quality of Large 3 is strong enough for most enterprise agentic use cases out of the box.


Speculative Decoding for Better Throughput

Mistral ships a companion Eagle draft model specifically for Large 3. It generates candidate tokens that the main model verifies in parallel, improving throughput without changing output quality.

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tensor-parallel-size 8 \
  --load-format mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --limit-mm-per-prompt '{"image": 10}' \
  --speculative_config '{
    "model": "mistralai/Mistral-Large-3-675B-Instruct-2512-Eagle",
    "num_speculative_tokens": 3,
    "method": "eagle",
    "max_model_len": "16384"
  }'

The num_speculative_tokens: 3 setting is Mistral's recommendation. Higher values increase throughput potential but also increase the cost of rejected drafts. For workloads with predictable output patterns (structured JSON, code), 3-5 speculative tokens tends to work well. For open-ended generation, stick with 3.

Note the max_model_len in the speculative config is separate from the main model's context length. Set it to match your typical request length, not the maximum.
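The throughput math behind the num_speculative_tokens choice: with per-token acceptance rate a and k drafted tokens, each verification pass emits an expected (1 - a^(k+1)) / (1 - a) tokens. The acceptance rates below are illustrative, not measured — profile your own workload to find the real rate.

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per verification step with k drafted tokens."""
    if accept_rate >= 1.0:
        return float(k + 1)
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)

for a in (0.6, 0.8, 0.9):  # illustrative acceptance rates
    for k in (3, 5):
        print(f"accept={a}, k={k}: ~{expected_tokens_per_step(a, k):.2f} tokens/step")
```

This shows why raising k only pays off when acceptance is high: at low acceptance rates the extra drafts are mostly rejected, and you've spent draft-model compute for nothing.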


Performance Benchmarks

Mistral Large 3 benchmarks comparably to DeepSeek 3.1 670B and Kimi K2 1.2T on standard evaluation suites, despite having fewer active parameters than either.

What this means for self-hosted deployments:

  • For pure text workloads, NVFP4 on A100s gives you frontier-class quality with an accessible hardware footprint
  • The MoE architecture means throughput scales well under concurrent load — you're activating 41B parameters per request, not 675B
  • Context performance above 64k tokens requires FP8; for most enterprise RAG pipelines, NVFP4 is sufficient

Teams running self-hosted LLM deployments at scale typically see that Large 3's MoE architecture handles concurrent requests more efficiently than dense models of comparable quality. The active parameter count stays constant regardless of batch size.

If you're comparing costs against an API alternative, the Apache 2.0 license means zero per-token licensing costs — only infrastructure. For high-volume internal workloads, the break-even against API pricing is typically 3-6 months on A100 cloud nodes.
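You can sanity-check that break-even claim against your own traffic with simple arithmetic. All numbers below are illustrative placeholders, not quotes — plug in your cloud rate and your API provider's per-token pricing:

```python
def monthly_gpu_cost(hourly_rate: float, hours: float = 730) -> float:
    """Cost of keeping a node up all month (~730 hours)."""
    return hourly_rate * hours

def monthly_api_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Equivalent API spend for the same token volume."""
    return tokens_per_month / 1e6 * price_per_million

gpu = monthly_gpu_cost(30.0)        # hypothetical 8-GPU node at $30/hr
api = monthly_api_cost(5e9, 6.0)    # hypothetical 5B tokens/month at $6 per 1M
print(f"self-hosted: ${gpu:,.0f}/mo, API: ${api:,.0f}/mo")
```

At those placeholder rates, self-hosting wins once monthly volume pushes API spend above the fixed node cost; below that volume, the API is cheaper.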


Production Considerations

Context Length vs. Throughput

Higher --max-model-len values reserve more VRAM for KV cache, reducing the memory available for batching. For production:

  • Start with --max-model-len 32768 and profile actual request lengths
  • Increase only if p99 context lengths regularly exceed that limit
  • Use --max-num-batched-tokens to tune the throughput/latency trade-off: higher values increase throughput at the cost of first-token latency

Graceful Shutdown

Set terminationGracePeriodSeconds to at least 300 in Kubernetes, or use a preStop hook in Docker to avoid dropping in-flight requests. Large 3's generation latency is higher than smaller models due to the MoE routing overhead — don't use the default 30-second shutdown window.
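In Kubernetes terms, that looks like the following minimal fragment (the preStop sleep gives the load balancer time to stop routing new requests before SIGTERM reaches vLLM; tune both numbers to your traffic):

```yaml
spec:
  terminationGracePeriodSeconds: 300
  containers:
    - name: vllm
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 30"]
```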

Mistral-Specific Tokenizer

Always pass the three mistral format flags. If you're migrating from a previous Mistral deployment that used the HuggingFace tokenizer, outputs will differ slightly. Run a regression test before switching production traffic.

Secret Management

Don't hardcode HF_TOKEN in container commands or Docker compose files. Use environment variables or a secrets manager. The token grants read access to your HuggingFace account — treat it accordingly.


When Self-Hosting Isn't the Right Call

Self-hosting a 675B model is not a small operational commitment. An 8xH200 node costs $25-35/hour on major cloud providers. You're responsible for model updates, monitoring, scaling, and uptime.

It makes sense when:

  • You have high-volume inference that crosses the API cost break-even point
  • Data sovereignty requirements prevent sending data to third-party APIs
  • You need fine-tuning on proprietary data with full control over the training environment

For teams that want data sovereignty without the infrastructure overhead, platforms like PremAI deploy inside your own VPC with zero data retention — you get the privacy guarantees of self-hosting with managed operations. Self-hosting fine-tuned models through a managed layer is often the right middle ground for enterprise teams.

For smaller teams or lower-volume workloads, our guide on how to save on LLM API costs while maintaining quality is worth reading before committing to full self-hosted infrastructure.


FAQ

Do I need a commercial license for Mistral Large 3?

No. All Mistral 3 models are Apache 2.0. Commercial use, self-hosting, and fine-tuning are all permitted without contacting Mistral or paying licensing fees. This is a change from Mistral Large 2, which required a commercial license for self-deployment.

Can I run Mistral Large 3 on A100s?

Yes, using the NVFP4 checkpoint on an 8xA100 80GB node. Performance matches FP8 for context lengths under 64k tokens. For longer contexts, use FP8 on H200s or B200s.

What's the minimum hardware for a test deployment?

The NVFP4 checkpoint on 4xA100s is the most accessible entry point for testing. Set --max-model-len 8192 to reduce KV cache requirements and --tensor-parallel-size 4.
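Putting that together, a minimal test-deployment command (the same flags as the production NVFP4 example, with the reduced context length):

```shell
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
  --tensor-parallel-size 4 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --max-model-len 8192
```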

Why does function calling produce malformed output?

Almost always because --tool-call-parser mistral is missing from the serve command. Without it, vLLM uses the default tool parser which doesn't match Mistral's format. Also verify mistral_common >= 1.8.6 is installed.

How does the NVFP4 format affect fine-tuning?

Use FP8 weights for fine-tuning. NVFP4 introduces quantization that can affect gradient quality at the tails. Mistral's own recommendation is FP8 for any use case that includes fine-tuning. For inference-only production deployments, NVFP4 on A100s is the practical choice. For full enterprise fine-tuning workflows, see the PremAI fine-tuning guide.

How do I monitor the deployment?

vLLM exposes Prometheus metrics at /metrics. The key ones to watch: vllm:num_requests_waiting for queue pressure, vllm:gpu_cache_usage_perc for KV cache saturation, and vllm:time_to_first_token_seconds for user-perceived latency. For broader LLM observability practices, the setup is similar to any vLLM-backed deployment.


Running fine-tuned variants of Mistral Large 3 in your own infrastructure? The PremAI self-host guide covers serving custom checkpoints with vLLM, or book a technical call to talk through your deployment.
