LLM Observability: Setting Up Langfuse, LangSmith, Helicone & Phoenix
Implement LLM monitoring with step-by-step tool setup. Covers Langfuse, LangSmith, Helicone, and Phoenix with code, pricing tables, and production debugging.
Production LLMs fail quietly. The API returns 200, but the output is garbage. Costs spike without warning. Quality degrades after a prompt change and nobody notices for days.
Traditional APM tools track server health. They don't tell you whether your model is hallucinating. LLM observability fills that gap with tracing, cost tracking, and quality evaluation.
This guide covers four tools with actual setup code: Langfuse (open source, self-hostable), LangSmith (LangChain ecosystem), Helicone (proxy-based, simplest setup), and Phoenix (fully open source). You'll get real pricing, implementation examples, and guidance on which tool fits your stack.
Tool Comparison at a Glance
| Tool | Setup Time | Self-Host | Free Tier | Best For |
|---|---|---|---|---|
| Helicone | 5 minutes | Yes | 100K requests/mo | Fastest setup, proxy model |
| Langfuse | 30 minutes | Yes (free) | 50K events/mo | Open source, framework-agnostic |
| LangSmith | 15 minutes | Enterprise only | 5K traces/mo | LangChain/LangGraph users |
| Phoenix | 1 hour | Yes (free) | Unlimited self-host | Full control, evaluations |
Helicone: 5-Minute Setup via Proxy
Helicone uses a proxy model. Change your base URL, add a header, and you're logging. No SDK changes, no decorators.
Basic Setup
from openai import OpenAI
client = OpenAI(
api_key="your-openai-key",
base_url="https://oai.helicone.ai/v1",
default_headers={
"Helicone-Auth": "Bearer your-helicone-key"
}
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}]
)
That's it. Every request now logs to Helicone's dashboard with latency, tokens, and cost.
Adding Context
Track sessions, users, and custom properties:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Summarize this document"}],
extra_headers={
"Helicone-Session-Id": "session-abc123",
"Helicone-User-Id": "user-456",
"Helicone-Property-Feature": "document-summary",
"Helicone-Property-Environment": "production"
}
)
Custom properties let you filter by feature, environment, or any dimension you need.
Multi-Provider Support
Helicone supports 20+ providers. Change the base URL:
# Anthropic
client = OpenAI(
api_key="your-anthropic-key",
base_url="https://anthropic.helicone.ai/v1",
default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)
# Together AI
client = OpenAI(
api_key="your-together-key",
base_url="https://together.helicone.ai/v1",
default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)
Helicone Pricing
| Plan | Price | Requests | Features |
|---|---|---|---|
| Free | $0 | 100K/month | Basic logging, 30-day retention |
| Growth | $20/seat/month | Unlimited | Caching, alerts, 1-year retention |
| Enterprise | Custom | Unlimited | SSO, dedicated support, custom retention |
Helicone caps at $200/month for unlimited seats on Growth. Good for fast-growing teams.
When to Use Helicone
Choose Helicone when you want the simplest possible setup. It's ideal for teams using OpenAI or Anthropic directly without frameworks. The proxy model means zero code changes beyond the base URL.
Helicone lacks deep evaluation features. For quality scoring and LLM-as-judge workflows, pair it with a dedicated evaluation tool or choose Langfuse or Phoenix instead.
Langfuse: Open Source with Full Control
Langfuse is MIT-licensed and self-hostable. It provides tracing, prompt management, and evaluation in a framework-agnostic package.
Installation and Setup
pip install langfuse
Set environment variables:
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com" # or your self-hosted URL
OpenAI Integration
Langfuse wraps the OpenAI SDK to auto-capture traces:
from langfuse.openai import openai
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
Every call is traced automatically with inputs, outputs, tokens, cost, and latency.
Manual Tracing with Decorators
For custom functions and multi-step workflows:
from langfuse import observe
from langfuse.openai import openai
client = openai.OpenAI()
@observe()
def retrieve_context(query: str) -> list:
    # Your retrieval logic
    return ["doc1", "doc2", "doc3"]

@observe()
def generate_response(query: str, context: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

@observe()
def rag_pipeline(query: str) -> str:
    context = retrieve_context(query)
    return generate_response(query, context)

# This creates a nested trace: rag_pipeline > retrieve_context + generate_response
result = rag_pipeline("What is quantum entanglement?")
The @observe() decorator creates spans. Nested calls become child spans automatically.
LangChain Integration
from langfuse.langchain import CallbackHandler
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
langfuse_handler = CallbackHandler()
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("Explain {topic} simply")
chain = prompt | llm
response = chain.invoke(
{"topic": "machine learning"},
config={"callbacks": [langfuse_handler]}
)
Self-Hosting Langfuse
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up
Access at http://localhost:3000. Self-hosted is fully featured with no restrictions.
Langfuse Pricing
| Plan | Price | Events | Features |
|---|---|---|---|
| Hobby (Self-host) | Free | Unlimited | Full features, you manage infra |
| Cloud Free | $0 | 50K/month | Hosted, community support |
| Cloud Pro | $59/month | 100K included | Priority support, extended retention |
| Enterprise | Custom | Custom | SSO, SLA, dedicated support |
Pro plan charges $0.001 per additional event beyond 100K.
When to Use Langfuse
Langfuse fits teams wanting open-source flexibility without vendor lock-in. Self-hosting eliminates per-event costs entirely. The OpenTelemetry foundation means traces can export to existing observability infrastructure (Datadog, Grafana).
Langfuse's alerting is limited compared to commercial tools. For native Slack/PagerDuty alerts, export metrics to Grafana or Datadog.
LangSmith: Native LangChain Integration
LangSmith comes from the LangChain team. If you're building with LangChain or LangGraph, tracing is automatic with a single environment variable.
Setup
pip install langsmith
Set environment variables:
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="ls-..."
export LANGSMITH_PROJECT="my-project" # optional, defaults to "default"
Automatic LangChain Tracing
With environment variables set, LangChain traces automatically:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant."),
("user", "{input}")
])
llm = ChatOpenAI(model="gpt-4o")
chain = prompt | llm | StrOutputParser()
# Automatically traced to LangSmith
result = chain.invoke({"input": "What's the capital of France?"})
No callbacks, no decorators. Every chain execution appears in LangSmith.
Non-LangChain Tracing
Use the @traceable decorator for vanilla Python:
from langsmith import traceable
from langsmith.wrappers import wrap_openai
import openai
client = wrap_openai(openai.Client())
@traceable(name="generate_summary")
def generate_summary(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize: {text}"}]
    )
    return response.choices[0].message.content

# Traced to LangSmith
summary = generate_summary("Long document text here...")
wrap_openai auto-captures LLM calls. @traceable creates spans for your functions.
LangGraph Agent Tracing
LangGraph agents trace automatically:
from langgraph.graph import StateGraph
from langchain_openai import ChatOpenAI
# With LANGSMITH_TRACING=true, all agent steps are traced
# including tool calls, state transitions, and LLM invocations
Each node, edge, and tool call becomes a span in the trace. For complex agentic AI systems, this visibility is critical.
LangSmith Pricing
| Plan | Price | Traces | Features |
|---|---|---|---|
| Developer | Free | 5K/month | 14-day retention, 1 seat |
| Plus | $39/seat/month | 100K included | 400-day retention, dashboards, alerts |
| Enterprise | Custom | Custom | Self-host option, SSO, dedicated support |
Plus plan overage: ~$0.50 per 1,000 traces beyond 100K.
When to Use LangSmith
LangSmith is the obvious choice if LangChain or LangGraph powers your application. The zero-config tracing and deep framework understanding make debugging easier than any alternative.
The tradeoff is vendor coupling. If you migrate away from LangChain, LangSmith's value drops significantly. Self-hosting requires Enterprise pricing.
Phoenix: Fully Open Source with Evaluations
Phoenix from Arize AI is open source under Elastic License 2.0. It runs locally, in notebooks, or in production clusters. The evaluation features rival commercial tools.
Local Setup
pip install arize-phoenix
phoenix serve
Access at http://localhost:6006.
Docker Deployment
docker run -p 6006:6006 arizephoenix/phoenix:latest
Instrumentation
Phoenix uses OpenTelemetry with the OpenInference standard:
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
# Register Phoenix as trace collector
tracer_provider = register(
project_name="my-app",
endpoint="http://localhost:6006/v1/traces"
)
# Auto-instrument OpenAI
OpenAIInstrumentor().instrument()
# Now all OpenAI calls are traced
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}]
)
LangChain/LangGraph Integration
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register(project_name="my-agent")
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
# All LangChain/LangGraph executions are now traced
# (alternatively, pass auto_instrument=True to register() and skip the manual call)
Built-in Evaluations
Phoenix includes LLM-as-judge evaluators:
from phoenix.evals import llm_classify, OpenAIModel
eval_model = OpenAIModel(model="gpt-4o")
results = llm_classify(
dataframe=traces_df, # Export traces as DataFrame
model=eval_model,
template="Is this response factually accurate? {response}",
rails=["accurate", "inaccurate", "unclear"]
)
Pre-built templates cover hallucination detection, relevance scoring, and toxicity checks.
Phoenix Cloud
For managed hosting:
import os
os.environ["PHOENIX_API_KEY"] = "your-key"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/v1/traces"
from phoenix.otel import register
register(project_name="my-app", auto_instrument=True)
Phoenix Pricing
| Plan | Price | Features |
|---|---|---|
| Self-hosted | Free | Full features, unlimited |
| Cloud Free | $0 | Limited traces, community support |
| Cloud Pro | Contact | Higher limits, priority support |
Self-hosted Phoenix has no restrictions. Run it forever at no cost if you manage the infrastructure.
When to Use Phoenix
Phoenix suits teams wanting full control with no vendor lock-in. The built-in evaluation framework is genuinely useful, not a checkbox feature. Self-hosting means your prompts and responses never leave your infrastructure. For teams running fine-tuned models, this matters.
Setup requires more effort than Helicone or LangSmith. Budget 1-2 hours for initial instrumentation.
Metrics to Track
Regardless of tool, capture these metrics:
Latency
| Metric | Target | Why It Matters |
|---|---|---|
| Time to First Token (TTFT) | <500ms | Perceived responsiveness |
| Total Response Time | <3s (P95) | User experience |
| Time Per Output Token | 30-50ms | Streaming smoothness |
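TTFT and per-token latency can be measured from any streaming response without tool-specific hooks. A minimal, framework-agnostic sketch (the helper name and return shape are illustrative, not from any SDK):

```python
import time

def measure_stream_latency(stream):
    """Measure TTFT, total time, and per-token latency from any token iterator."""
    start = time.monotonic()
    ttft = None
    tokens = 0
    for _ in stream:  # each item = one streamed chunk/token
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        tokens += 1
    total = time.monotonic() - start
    # Average spacing between tokens after the first one
    per_token = (total - ttft) / max(tokens - 1, 1) if ttft is not None else None
    return {"ttft_s": ttft, "total_s": total, "tokens": tokens,
            "per_token_s": per_token}
```

It works with any iterable, e.g. an OpenAI streaming response created with `stream=True`; attach the resulting numbers to your traces as metadata.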
Cost
| Metric | Alert Threshold | Action |
|---|---|---|
| Cost per request (P95) | 2x average | Investigate expensive prompts |
| Daily spend | 80% of budget | Review before hitting limits |
| Cost per user/feature | Varies | Identify cost drivers |
Teams running small language models have different cost profiles but still need this tracking.
For teams focused on LLM cost optimization, observability data reveals which prompts waste tokens.
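The cost metrics above reduce to simple arithmetic over token counts. A sketch, assuming you supply per-million-token prices for your model (the multiplier and percentile method are illustrative choices, not from any tool):

```python
def request_cost(input_tokens, output_tokens, price_in_per_1m, price_out_per_1m):
    """Cost in USD from token counts and per-million-token prices."""
    return input_tokens / 1e6 * price_in_per_1m + output_tokens / 1e6 * price_out_per_1m

def p95(values):
    """Nearest-rank 95th percentile of a list of costs or latencies."""
    ordered = sorted(values)
    idx = max(0, int(len(ordered) * 0.95) - 1)
    return ordered[idx]

def flag_expensive(costs, multiplier=2.0):
    """Indices of requests costing more than multiplier x the average —
    the 'investigate expensive prompts' trigger from the table above."""
    avg = sum(costs) / len(costs)
    return [i for i, c in enumerate(costs) if c > multiplier * avg]
```

Run this over a day's traces grouped by feature tag to find cost drivers before they show up on the invoice.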
Error Rates
| Category | Alert Threshold | Meaning |
|---|---|---|
| Provider errors (429, 500) | >2% | Rate limits or API issues |
| Application errors | >1% | Your code is breaking |
| Silent failures | Any increase | Model returning garbage |
Quality
Quality metrics require evaluation. Run LLM evaluations on a sample of production traffic. Understanding evaluation benchmarks helps you choose the right metrics for your use case.
| Metric | Method | Frequency |
|---|---|---|
| Relevance | LLM-as-judge | 10% of traffic |
| Groundedness (RAG) | Compare to sources | All RAG requests |
| User feedback | Thumbs up/down | Ongoing |
Quality degrades over time as inputs shift. Continual learning approaches help, but require even more rigorous monitoring.
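The LLM-as-judge row in the table can be wired up with a few small pieces: sample traffic, build a judge prompt, parse the verdict. A minimal sketch — the prompt wording, function names, and label set are assumptions, and the judge call itself goes through whatever chat client you already use:

```python
import random

JUDGE_PROMPT = (
    "You are grading an assistant's answer.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with exactly one word: relevant or irrelevant."
)

def sample_for_eval(traces, rate=0.10, seed=None):
    """Pick roughly `rate` of production traces for quality evaluation."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]

def judge_messages(question, answer):
    """Build the LLM-as-judge request; send it with any chat client."""
    return [{"role": "user",
             "content": JUDGE_PROMPT.format(question=question, answer=answer)}]

def parse_verdict(raw):
    """Constrain free-text judge output to a fixed label set."""
    verdict = raw.strip().lower()
    return verdict if verdict in {"relevant", "irrelevant"} else "unclear"
```

Log each verdict back to the originating trace as a score so quality trends show up next to latency and cost.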
Debugging Workflow
When something breaks, follow this process:
Step 1: Identify the Scope
Check your dashboard:
- Which model(s) affected?
- Which time window?
- All users or a segment?
Step 2: Find Representative Traces
Filter to failing requests. In Langfuse, use the trace list filters. In LangSmith, use the "Threads" view to cluster similar issues.
Step 3: Inspect the Trace
Each tool shows the full request flow:
TRACE: user_query
├── SPAN: retrieve_context (latency: 120ms)
│ └── Vector search returned 5 documents
├── SPAN: generate_response (latency: 2.3s)
│ ├── Input: [system prompt + context + query]
│ ├── Output: [response text]
│ └── Tokens: 1,847 input, 523 output
└── SPAN: post_process (latency: 15ms)
Identify where time is spent and where errors occur.
Step 4: Correlate with Changes
Common failure patterns:
- Prompt change broke edge cases: Roll back or fix the prompt
- Provider rate limiting: Implement retry logic or switch models
- Retrieval returning bad context: Debug your RAG pipeline
- Input drift: User behavior changed, prompts need adjustment
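For the rate-limiting pattern above, the standard fix is retry with exponential backoff and jitter. A minimal sketch (in practice, catch your provider's specific 429/5xx exception types rather than bare `Exception`):

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry fn() on transient failures with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # narrow this to rate-limit/server errors in real code
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herd
```

Wrap your traced LLM call in this helper so retries show up as repeated spans under one trace rather than as separate failures.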
Step 5: Verify the Fix
Deploy the fix. Monitor for 15-30 minutes. Confirm error rates dropped and quality recovered.
Setting Up Alerts
Langfuse
Export metrics to Grafana or Datadog via OpenTelemetry:
# otel-collector-config.yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp:
    endpoint: "https://otel.datadoghq.com"
Create alerts in your monitoring tool based on exported metrics.
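Once metrics land in Prometheus, the alert itself is a standard alerting rule. A sketch with illustrative metric names — match them to whatever your exporter actually emits:

```yaml
# prometheus-alerts.yaml — example rule; metric names are placeholders
groups:
  - name: llm-observability
    rules:
      - alert: HighLLMErrorRate
        expr: rate(llm_requests_errors_total[5m]) / rate(llm_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "LLM error rate above 5% for 5 minutes"
```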
LangSmith
Native alerting in Plus/Enterprise:
- Go to Project > Monitoring > Alerts
- Set conditions: error rate >5%, latency P95 >10s, cost spike
- Configure Slack/email/webhook notifications
Helicone
Built-in alerts on Growth tier:
- Dashboard > Alerts > New Alert
- Set thresholds for cost, error rate, latency
- Receive notifications via configured channels
Phoenix
For self-hosted, export to Prometheus:
from phoenix.otel import register

register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces"
)
# Phoenix collects the traces; derive Prometheus metrics from the same
# OTLP stream via an OpenTelemetry Collector
Create Grafana dashboards and alerts from Prometheus metrics.
Integration with Existing Infrastructure
OpenTelemetry Export
Langfuse and Phoenix support OTLP export:
# Send traces to your existing collector alongside Phoenix or Langfuse
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

otlp_exporter = OTLPSpanExporter(
    endpoint="https://your-collector.com/v1/traces"
)
# Attach to the active tracer provider (e.g. the one returned by phoenix.otel.register)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
Datadog Integration
For teams on Datadog:
# Option 1: Datadog LLM Observability (native)
# Option 2: Export from Langfuse/Phoenix via OTLP
# Datadog native setup
export DD_API_KEY="your-key"
export DD_SITE="datadoghq.com"
export DD_LLMOBS_ENABLED=true
Datadog's LLM Observability provides unified APM + LLM monitoring but at enterprise pricing.
Making the Decision
Choose Helicone if:
- You want the fastest possible setup (5 minutes)
- You use OpenAI/Anthropic directly without frameworks
- Cost tracking and basic logging are your primary needs
- You don't need deep evaluation features
Choose Langfuse if:
- You want open source with no vendor lock-in
- Self-hosting matters (data residency, cost control)
- You use multiple frameworks or providers
- You need prompt versioning and management
Choose LangSmith if:
- LangChain or LangGraph powers your application
- Zero-config tracing is worth the framework coupling
- You want built-in evaluation workflows
- Budget allows $39+/seat/month
Choose Phoenix if:
- You need fully open source (self-host everything)
- Built-in evaluation is important
- You want to keep prompts/responses on your infrastructure
- You're comfortable with more setup effort
Getting Started
Week 1: Basic instrumentation
- Pick a tool based on your stack
- Instrument your main LLM calls
- Verify traces appear in dashboard
Week 2: Add context
- Tag requests with user, feature, environment
- Set up cost tracking
- Create overview dashboard
Week 3: Alerts and evaluation
- Configure critical alerts (error rate, cost spikes)
- Run quality evaluations on 10% of traffic
- Build debugging workflow
Week 4: Production hardening
- Instrument multi-step workflows fully
- Set up sampling for high-volume endpoints
- Document debugging runbooks
Enterprise AI trends show observability becoming a baseline requirement for production deployments.
For teams building AI applications without dedicated platform expertise, integrated solutions like Prem Studio include evaluation and observability as part of the development workflow. The documentation covers the full integration process.
FAQ
Which LLM observability tool has the best free tier?
Helicone offers 100K requests/month free. Langfuse Cloud provides 50K events/month. LangSmith gives 5K traces/month. Phoenix self-hosted is unlimited and free. For production use without paying, self-host Langfuse or Phoenix.
Can I use multiple observability tools together?
Yes. A common pattern is Helicone for cost tracking (proxy) plus Phoenix for evaluations. OpenTelemetry-based tools export traces in standard format, so you can send data to multiple backends.
How much engineering time does setup require?
Helicone: 5-30 minutes, depending on how much context tagging you add. LangSmith with LangChain: 15 minutes. Langfuse: 1-2 hours for full instrumentation. Phoenix: 2-4 hours including evaluation setup. Self-hosting any tool adds 2-4 hours for infrastructure.
Should I self-host or use cloud?
Self-host if: data residency requirements, high volume (cost savings), or you need full control. Use cloud if: team is small, want managed infrastructure, or need support SLAs. Most teams start on cloud and migrate to self-hosted as volume grows.
How do I handle high-volume logging?
Sample at high volume. Log 100% of errors and slow requests. Log 10-20% of normal requests for quality evaluation. All tools support sampling configuration. For teams with strict data security requirements, consider what you're logging since prompts may contain sensitive data.
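That sampling policy fits in a single decision function applied before each trace is exported. A minimal sketch (threshold and rate are the illustrative numbers from the answer above, not defaults from any tool):

```python
import random

def should_log(is_error, latency_s, slow_threshold_s=10.0, sample_rate=0.15):
    """Keep 100% of errors and slow requests; sample a fraction of the rest."""
    if is_error or latency_s >= slow_threshold_s:
        return True
    return random.random() < sample_rate
```

Most tools also support this natively (head-based sampling or per-request flags), so prefer the built-in mechanism once you outgrow a hand-rolled check.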
What metrics should I alert on?
Start with: error rate >5% for 5 minutes, P95 latency >10 seconds, cost exceeding 150% of daily budget. Add quality score drops once you have evaluation running. Tune thresholds based on your baseline to avoid alert fatigue.
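Those starting thresholds map directly to a periodic check, wherever your metrics live. A minimal sketch with hypothetical field names for the metrics window:

```python
def check_alerts(window):
    """Evaluate the starting thresholds above against one metrics window.
    window: dict with error_rate, p95_latency_s, daily_cost, daily_budget."""
    alerts = []
    if window["error_rate"] > 0.05:
        alerts.append("error rate above 5%")
    if window["p95_latency_s"] > 10:
        alerts.append("P95 latency above 10s")
    if window["daily_cost"] > 1.5 * window["daily_budget"]:
        alerts.append("cost above 150% of daily budget")
    return alerts
```

Start with these constants, then replace them with values derived from your own baseline once a few weeks of data exist.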