LLM Structured Output: From JSON Mode to Self-Hosted Inference (Complete Guide)
Get reliable JSON from any LLM. Compare OpenAI, Anthropic, Gemini structured outputs. Learn constrained decoding with XGrammar, Outlines, and Instructor. Production patterns included.
LLMs generate text. Your code needs JSON. The gap between them breaks production systems daily.
A prompt that returns valid JSON 95% of the time fails 5% of the time. At 10,000 requests per day, that means 500 parsing errors, 500 retries, 500 potential user-facing failures. The math works against you fast.
This guide covers four approaches to getting structured output from LLMs: prompting, provider APIs, constrained decoding, and fine-tuning. Each layer adds reliability but also complexity. The right choice depends on your infrastructure, compliance requirements, and tolerance for latency.
The Four Approaches
Before diving into specifics, here's how the methods compare:
| Approach | Compliance Rate | Latency Impact | Works With |
|---|---|---|---|
| Prompting | 70-85% | None | Any model |
| Provider APIs | 95-100% | Schema compilation on first call | OpenAI, Gemini, Anthropic |
| Constrained Decoding | 100% | +10-40μs/token | Self-hosted only |
| Fine-tuning | 90-98% | None after training | Models you control |
Prompting is unreliable but universal. Provider APIs work well but lock you to specific vendors. Constrained decoding guarantees compliance but requires self-hosted inference. Fine-tuning improves base model behavior but requires training infrastructure.
Most teams start with provider APIs and add constrained decoding when they move to self-hosted models.
Provider API Comparison
Every major LLM provider now supports structured outputs, but implementations differ significantly.
OpenAI
OpenAI offers the most mature structured output support. Their August 2024 release introduced schema-enforced generation that achieves 100% compliance on their evaluation dataset.
```python
from openai import OpenAI
from pydantic import BaseModel

class Event(BaseModel):
    title: str
    date: str
    location: str

client = OpenAI()
response = client.responses.parse(
    model="gpt-4o",
    input="Extract: PyData Sydney is on 2025-11-03 at Darling Harbour.",
    text_format=Event,
)
```
The SDK converts Pydantic models to JSON Schema and enforces compliance server-side. First requests with new schemas incur 1-10 seconds of compilation latency. Subsequent requests with the same schema run at normal speed.
Limitations: only a subset of JSON Schema is supported. Constraints like minimum, maximum, and minLength become prompt hints rather than hard constraints. Pydantic validates these after generation, not during.
Anthropic
Claude lacks native response_format support. Instead, you force structured output through tool definitions.
```python
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[{
        "name": "extract_event",
        "description": "Extract event details",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
                "location": {"type": "string"}
            },
            "required": ["title", "date", "location"]
        }
    }],
    tool_choice={"type": "tool", "name": "extract_event"},
    messages=[{"role": "user", "content": "Extract: PyData Sydney..."}]
)
```
Without the tool trick, Claude fails to produce valid JSON in roughly 14-20% of requests. With it, compliance improves substantially but still requires client-side validation.
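With the tool-use approach, the structured payload arrives as the input of a tool_use content block, not as message text. A minimal extraction sketch, using a stubbed response object shaped like the Anthropic SDK's Message (so it runs without a live API call):

```python
from types import SimpleNamespace

# Stub shaped like the Anthropic SDK's Message: content is a list of
# blocks, and the structured data lives in the tool_use block's `input`.
response = SimpleNamespace(content=[
    SimpleNamespace(type="text", text="Calling extract_event..."),
    SimpleNamespace(
        type="tool_use",
        name="extract_event",
        input={"title": "PyData Sydney", "date": "2025-11-03",
               "location": "Darling Harbour"},
    ),
])

def extract_tool_input(response, tool_name):
    """Pull the structured payload from the first matching tool_use block."""
    return next(
        block.input
        for block in response.content
        if block.type == "tool_use" and block.name == tool_name
    )

event = extract_tool_input(response, "extract_event")
```

The result is a plain dict; validating it against your schema (e.g. with Pydantic) is still your responsibility on the client side.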
Google Gemini
Gemini supports response_schema directly and achieves massive token efficiency gains:
```python
from google import genai
from google.genai import types

client = genai.Client()

schema = types.Schema(
    type=types.Type.OBJECT,
    properties={
        "title": types.Schema(type=types.Type.STRING),
        "date": types.Schema(type=types.Type.STRING),
        "location": types.Schema(type=types.Type.STRING),
    },
    required=["title", "date", "location"],
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Extract the event...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=schema,
    ),
)
```
Benchmarks show Gemini with structured outputs uses 56-61% fewer tokens than enhanced prompting approaches. For high-volume applications, this translates to significant cost savings.
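Because the schema is enforced server-side, parsing the result is a plain json.loads. A minimal sketch with a stubbed response object (shaped like the SDK's response, where response.text holds the raw JSON string when the MIME type is application/json):

```python
import json
from types import SimpleNamespace

# Stub standing in for the SDK's GenerateContentResponse; with a JSON
# MIME type, `text` is the raw JSON string matching the schema.
response = SimpleNamespace(
    text='{"title": "PyData Sydney", "date": "2025-11-03", '
         '"location": "Darling Harbour"}'
)

# Schema enforcement already happened server-side, so this parse
# should not fail on structure; value-level validation is still on you.
event = json.loads(response.text)
```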
Mistral and Cohere
Both support JSON mode with schema definitions. Mistral requires explicit "respond in JSON" instructions alongside the schema parameter. Cohere enforces schemas through their Chat API v2 with strict_tools.
Provider Summary
| Provider | Method | Schema Enforcement | First-Call Overhead |
|---|---|---|---|
| OpenAI | response_format | Server-side | 1-10s compilation |
| Anthropic | Tool-use trick | Client validates | Minimal |
| Gemini | response_schema | Server-side | Minimal |
| Mistral | response_format | Server-side | Minimal |
| Cohere | strict_tools | Server-side | Minimal |
For teams locked to a single provider, native APIs work well. For multi-provider setups, abstraction libraries like Instructor provide a unified interface.
How Constrained Decoding Works
Provider APIs are black boxes. If you need control over the generation process, or if you're running self-hosted models, constrained decoding gives you 100% compliance at the inference layer.
LLMs generate tokens by sampling from a probability distribution. At each step, the model outputs probabilities for every token in its vocabulary. Normally, sampling picks from these probabilities directly.
Constrained decoding modifies this process. Before sampling, it masks tokens that would violate the schema. If the model is partway through generating {"name": "Alice", "age": and the next token must be a number, the mask zeros out probabilities for everything except digits.
The mask updates dynamically. What tokens are valid depends on what came before. A finite state machine or pushdown automaton tracks the current position in the grammar and computes valid tokens for each step.
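The masking step can be illustrated with a toy sketch. This uses a hypothetical seven-token vocabulary; real engines mask over the model's full vocabulary and derive the allowed set from a grammar automaton rather than a hand-written set:

```python
import math

# Toy vocabulary and logits; a real engine works over the model's
# full vocabulary (tens of thousands of tokens).
vocab = ["{", "}", '"', "0", "1", "a", ","]
logits = [0.5, 1.2, 0.1, 2.0, 1.5, 3.0, 0.2]

def mask_and_sample(logits, allowed):
    """Set disallowed tokens' logits to -inf (probability zero after
    softmax), then pick greedily from what remains."""
    masked = [
        logit if tok in allowed else -math.inf
        for logit, tok in zip(logits, vocab)
    ]
    best = max(range(len(masked)), key=lambda i: masked[i])
    return vocab[best]

# Suppose the grammar says the next token must be a digit: even though
# "a" has the highest raw logit, it is masked out.
next_token = mask_and_sample(logits, allowed={"0", "1"})
```

Unmasked sampling here would pick "a" (the highest logit) and break the JSON; with the digit-only mask, generation stays inside the schema by construction.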
Engine Comparison
Three engines dominate the constrained decoding space:
XGrammar (CMU/MLC team) is the default in vLLM since version 0.8.5. It uses pushdown automata to handle context-free grammars, achieving roughly 40 microseconds per token for mask computation. XGrammar precomputes validity for 99% of tokens (those whose validity depends only on grammar position, not stack state) and handles the remaining 1% at runtime.
llguidance (Microsoft) uses a derivative-based regex engine and optimized Earley parser. It achieves approximately 50 microseconds per token with near-zero startup cost. OpenAI publicly credited llguidance for foundational work behind their Structured Outputs implementation.
Outlines pioneered the FSM approach. It compiles JSON Schemas into finite state machines with precomputed vocabulary indexes. The tradeoff: complex schemas can cause compilation times from 40 seconds to over 10 minutes. JSONSchemaBench found Outlines had the lowest compliance rate among tested engines, primarily due to these timeouts.
| Engine | Approach | Latency/Token | Startup Cost | Recursive Schemas |
|---|---|---|---|---|
| XGrammar | CFG/PDA | ~40μs | Low | Yes |
| llguidance | Earley parser | ~50μs | Negligible | Yes |
| Outlines | FSM | O(1) lookup | 40s-10min | No (flattened) |
For production use with complex schemas or recursive structures (nested comments, tree data, recursive $ref), XGrammar or llguidance are the safer choices.
Setting Up vLLM with Structured Output
vLLM integrates XGrammar as its default structured output backend. Configuration is straightforward:
```python
from openai import OpenAI
from pydantic import BaseModel

class UserProfile(BaseModel):
    name: str
    age: int
    email: str

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Extract user info from: John Smith, 32, [email protected]"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_profile",
            "schema": UserProfile.model_json_schema(),
        },
    },
)
```
vLLM handles schema compilation internally. Unlike cloud APIs, there's no network round-trip for schema validation. The constraint logic runs in the same process as inference.
For teams running self-hosted models, this setup gives full control over both the model and the output format. No data leaves your infrastructure. Compliance frameworks like HIPAA and GDPR often require this level of control.
The Instructor Library
Instructor wraps provider SDKs with Pydantic integration, automatic retries, and streaming support. It has over 3 million monthly downloads and supports 15+ providers.
```python
import instructor
from pydantic import BaseModel, Field, field_validator

class Product(BaseModel):
    name: str
    price: float = Field(gt=0)
    in_stock: bool

    # Redundant with gt=0 above; shown to illustrate custom validators.
    @field_validator('price')
    @classmethod
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Price must be positive')
        return v

client = instructor.from_provider("openai/gpt-4o-mini")
product = client.chat.completions.create(
    response_model=Product,
    messages=[{"role": "user", "content": "iPhone 15 Pro, $999, available"}],
    max_retries=3,
)
```
When validation fails, Instructor sends the error back to the model and retries. If the model returns price: -5 and your Pydantic model requires gt=0, Instructor catches the validation error, includes the error message in a follow-up prompt, and tries again.
This retry pattern matters. Without it, you need manual error handling for every extraction call. With it, transient failures resolve automatically. The tradeoff: retries mean additional API calls and higher costs.
Streaming and Partial Responses
Instructor supports streaming extractions where fields populate progressively:
```python
from instructor import Partial

for partial in client.chat.completions.create(
    response_model=Partial[Product],
    messages=[...],
    stream=True,
):
    print(partial)  # Fields fill in as tokens arrive
```
This works well for user-facing applications where showing partial results improves perceived responsiveness.
Failure Modes Nobody Mentions
Structured output implementations have edge cases that break in production.
Token Boundary Issues
LLMs generate tokens, not characters. A number like 12345 might tokenize as 123 + 45 or 1234 + 5. Constrained decoding must handle both cases.
The problem: some implementations check validity at token boundaries incorrectly. Take a schema requiring integers between 100 and 999. The valid value 100 might arrive as the token 1 followed by the token 00. A naive checker that inspects the token 1 in isolation cannot tell whether the continuation will land in range, and an implementation that demands every intermediate state already be valid will wrongly mask a path that would have completed to a valid value.
XGrammar and llguidance handle this correctly. Older implementations sometimes don't.
Key Ordering Effects
JSON key order shouldn't matter semantically, but it affects generation. LLMs generate left-to-right. If your schema has a reasoning field and an answer field, putting answer first forces the model to commit to an answer before generating its reasoning.
The recommendation: order schema fields so reasoning or justification comes before conclusions. This lets the model "think" before deciding.
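With Pydantic, field declaration order is the order keys appear in the generated schema, so applying this recommendation is just a matter of declaring fields in the right order. A minimal sketch (Verdict is a hypothetical model name):

```python
from pydantic import BaseModel

class Verdict(BaseModel):
    # Declared, and therefore generated, before the answer: the model
    # writes its justification first, then commits to a conclusion.
    reasoning: str
    answer: str

# Pydantic preserves declaration order in the schema's properties,
# which is the order keys are generated under schema enforcement.
order = list(Verdict.model_json_schema()["properties"])
```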
Recursive Schema Limitations
FSM-based tools like Outlines can't handle true recursion. Schemas with recursive $ref definitions (nested comments, tree structures) get flattened to a fixed depth or rejected entirely.
If your data model includes recursion, use CFG-based engines (XGrammar, llguidance, llama.cpp grammar mode).
Compilation Overhead
First request with a new schema incurs compilation cost. OpenAI reports under 10 seconds. Outlines can take 40 seconds to 10+ minutes for complex schemas.
For production systems with many distinct schemas, this overhead adds up. Caching compiled schemas helps, but schema updates still trigger recompilation.
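A schema cache can be sketched with a plain lru_cache keyed on the canonical schema text. The compile step below is a stand-in (real engines build an FSM or grammar automaton here), and all names are hypothetical:

```python
import json
from functools import lru_cache

def _expensive_compile(schema_str: str) -> dict:
    # Stand-in for a real engine's compile step (FSM/grammar construction),
    # which is the part that can take seconds to minutes.
    return {"compiled": json.loads(schema_str)}

@lru_cache(maxsize=256)
def compile_schema(schema_str: str) -> dict:
    """Cache compiled artifacts keyed on the serialized schema."""
    return _expensive_compile(schema_str)

def cache_key(schema: dict) -> str:
    # Canonicalize so key ordering differences don't defeat the cache.
    return json.dumps(schema, sort_keys=True)

artifact = compile_schema(cache_key({"type": "object"}))
```

Any change to the schema produces a new key, so updates still pay the compile cost once, matching the behavior described above.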
Model Refusals
Models can refuse requests even with structured output enabled. If a model decides the request violates its guidelines, it may output a refusal message that breaks your schema entirely. The schema can't force the model to comply with requests it deems inappropriate.
Handle refusals as a separate code path, not a parsing error.
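A sketch of that separate code path, using stubbed message objects shaped like OpenAI's parsed chat completion messages (which expose a refusal field alongside the parsed result):

```python
from types import SimpleNamespace

def handle_extraction(message):
    """Route refusals away from the JSON-parsing path."""
    if getattr(message, "refusal", None):
        # Separate path: log and escalate; retrying won't help and the
        # refusal text will never match your schema.
        return {"status": "refused", "detail": message.refusal}
    return {"status": "ok", "data": message.parsed}

# Stubs standing in for real API responses:
refused = SimpleNamespace(refusal="I can't help with that.", parsed=None)
ok = SimpleNamespace(refusal=None, parsed={"title": "PyData Sydney"})
```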
Fine-Tuning for Schema Compliance
Constrained decoding guarantees valid output structure. Fine-tuning improves how well the model follows the structure on its first attempt, reducing the overhead of constraint enforcement.
Research from SchemaBench shows that base models frequently ignore target schemas entirely, generating free-form text or incorrect JSON structures. Supervised fine-tuning on schema-following examples dramatically improves first-pass compliance.
The pattern:
- Create training pairs: (input text, filled schema) for your domain
- Include schema in training prompts so the model learns to follow different structures
- Fine-tune with LoRA or similar parameter-efficient methods
- Evaluate compliance rates on held-out schemas
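The first two steps can be sketched as a helper that formats one (input text, filled schema) pair into a chat-style training record with the target schema embedded in the prompt. The function name and record shape are illustrative, not tied to any particular training framework:

```python
import json

def make_training_example(schema: dict, text: str, filled: dict) -> dict:
    """Format one (input, output) pair for schema-following fine-tuning.
    Embedding the schema in the prompt teaches the model to follow
    whatever structure it is shown, not just one fixed layout."""
    return {
        "messages": [
            {"role": "system",
             "content": "Respond with JSON matching this schema:\n"
                        + json.dumps(schema)},
            {"role": "user", "content": text},
            {"role": "assistant", "content": json.dumps(filled)},
        ]
    }

example = make_training_example(
    {"type": "object", "properties": {"title": {"type": "string"}}},
    "PyData Sydney is on 2025-11-03.",
    {"title": "PyData Sydney"},
)
```

One such record per extraction, written out as JSONL, is the usual input format for supervised fine-tuning pipelines.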
SLOT (a recent approach from EMNLP 2025) demonstrates that even lightweight models (Llama-3.2 1B/3B, Mistral-7B) outperform larger base models at structured output when fine-tuned specifically for schema following.
For teams with fine-tuning infrastructure, this approach reduces reliance on constrained decoding at inference time. The model learns to produce valid JSON naturally, and constraint checking becomes a fallback rather than the primary mechanism.
When Fine-Tuning Makes Sense
Fine-tuning for structured output works best when:
- You have a stable set of schemas that won't change frequently
- Inference latency matters and you want to minimize constraint overhead
- You're running self-hosted models and control the training process
- Your schemas are complex enough that base models struggle with them
It's less useful when:
- Schemas change frequently (retraining is expensive)
- You're using cloud APIs exclusively (no fine-tuning access for latest models)
- Your schemas are simple and base models handle them reliably
For enterprise deployments with specific compliance or data handling requirements, combining fine-tuned models with constrained decoding provides both high first-pass accuracy and guaranteed compliance.
Production Patterns
Structured output in production requires more than schema enforcement.
Layered Validation
Don't trust any single layer completely:
- Schema enforcement (API or constrained decoding) handles structure
- Pydantic validation catches semantic constraints (value ranges, formats)
- Business logic validation ensures domain rules
```python
from pydantic import BaseModel, Field, field_validator

class OrderExtraction(BaseModel):
    order_id: str = Field(pattern=r'^ORD-\d{8}$')
    amount: float = Field(gt=0, lt=1000000)
    currency: str

    @field_validator('currency')
    @classmethod
    def valid_currency(cls, v):
        if v not in ['USD', 'EUR', 'GBP']:
            raise ValueError(f'Unsupported currency: {v}')
        return v
```
The schema guarantees JSON structure. Pydantic validates patterns and ranges. Custom validators enforce business rules.
Retry Strategies
Not all failures deserve retries:
- Validation failures: Retry with error feedback (Instructor handles this)
- Refusals: Log and escalate, don't retry
- Timeout: Retry with backoff
- Rate limits: Retry with exponential backoff
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
def extract_with_retry(client, messages, response_model):
    return client.chat.completions.create(
        response_model=response_model,
        messages=messages,
    )
```
Fallback Chains
When the primary model fails, fall back to alternatives:
```python
def extract_robust(content, schema):
    # Try the primary cloud model first
    try:
        return extract_openai(content, schema)
    except Exception:
        pass
    # Fall back to a local model with constrained decoding
    try:
        return extract_local_vllm(content, schema)
    except Exception:
        pass
    # Last resort: simpler extraction with regex
    return extract_regex_fallback(content)
```
This pattern is common in production systems where availability matters more than always using the best model.
Monitoring
Track these metrics:
- First-pass compliance rate: How often does the model produce valid output without retries?
- Retry rate: What percentage of requests need multiple attempts?
- Latency by schema complexity: Do complex schemas slow down generation?
- Cost per successful extraction: Including retry costs
These metrics reveal whether your structured output pipeline is healthy. A dropping first-pass rate might indicate model degradation or schema changes that confuse the model.
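A minimal tracker for the first two metrics can be sketched like this (names are illustrative; in production these counters would feed whatever metrics system you already run):

```python
from dataclasses import dataclass

@dataclass
class ExtractionMetrics:
    requests: int = 0
    first_pass_ok: int = 0
    retries: int = 0

    def record(self, attempts: int, succeeded: bool) -> None:
        """Record one extraction: how many attempts it took, and whether
        it ultimately produced valid output."""
        self.requests += 1
        self.retries += max(0, attempts - 1)
        if succeeded and attempts == 1:
            self.first_pass_ok += 1

    @property
    def first_pass_rate(self) -> float:
        return self.first_pass_ok / self.requests if self.requests else 0.0

m = ExtractionMetrics()
m.record(attempts=1, succeeded=True)   # clean first-pass success
m.record(attempts=3, succeeded=True)   # succeeded only after 2 retries
```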
For ongoing model reliability, automated monitoring catches regressions before users notice.
FAQ
What's the difference between JSON mode and structured outputs?
JSON mode guarantees syntactically valid JSON. Structured outputs enforce a specific schema. JSON mode might return {"foo": "bar"} when you wanted {"name": "...", "age": ...}. Structured outputs guarantee the schema you specified.
Can I use constrained decoding with cloud APIs?
No. Constrained decoding requires access to token probabilities during generation. Cloud APIs don't expose this. You need self-hosted models (vLLM, llama.cpp, TGI) to use constrained decoding.
How do I handle optional fields?
OpenAI's structured outputs don't support optional fields directly. All fields must appear in output. Workarounds: make fields nullable ("field": null allowed) or use union types. Constrained decoding engines handle optional fields natively since they work with full JSON Schema.
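The nullable workaround in Pydantic terms: declare the field Optional, and the generated schema allows null while the key itself stays required. A minimal sketch:

```python
from typing import Optional
from pydantic import BaseModel

class Event(BaseModel):
    title: str
    # Nullable workaround: the key is always emitted, but may be null.
    location: Optional[str] = None

schema = Event.model_json_schema()
```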
Does structured output affect model reasoning?
Research suggests yes. Wang et al. (2024) found that format constraints can degrade reasoning performance, especially on complex tasks. The model spends capacity following the format instead of solving the problem. For reasoning-heavy tasks, consider generating free-form reasoning first, then extracting structured data in a second pass.
How do I test structured output reliability?
Create a test set with diverse inputs. Run extractions. Measure: (1) percentage producing valid JSON, (2) percentage matching schema, (3) percentage with semantically correct values. Track these over time. JSONSchemaBench provides standardized benchmarks if you want to compare against published results.
Which approach is fastest?
For cloud APIs, structured outputs add compilation latency on the first request with each new schema (OpenAI reports 1-10 seconds), then minimal overhead. For self-hosted, constrained decoding adds roughly 40-50 microseconds per token. Fine-tuned models with high first-pass compliance have no per-token overhead but require training investment upfront.
Conclusion
Structured output from LLMs requires choosing the right tools for your constraints:
- Cloud-only teams: Use provider APIs (OpenAI, Gemini) with Instructor for validation and retries
- Self-hosted inference: Add XGrammar via vLLM for guaranteed compliance
- High-volume, latency-sensitive: Fine-tune for schema following, use constrained decoding as fallback
- Multi-provider flexibility: Instructor abstracts differences across 15+ providers
The field is maturing quickly. XGrammar and llguidance have made constrained decoding fast enough for production. Provider APIs have converged on similar interfaces. Fine-tuning for structured output is becoming a standard practice for teams with training infrastructure.
Start with provider APIs if you're on cloud. Add constrained decoding when you move to self-hosted. Consider fine-tuning when you have stable schemas and want to minimize runtime overhead.