LLM Observability: Setting Up Langfuse, LangSmith, Helicone & Phoenix
Implement LLM monitoring with step-by-step tool setup. Covers Langfuse, LangSmith, Helicone, and Phoenix with code, pricing tables, and production debugging.
Production LLMs fail quietly. The API returns 200, but the output is garbage. Costs spike without warning. Quality degrades after a prompt change and nobody notices for days.
Traditional APM tools track server health. They don't tell you whether your model is hallucinating. LLM observability fills that gap with tracing, cost tracking, and quality evaluation.
This guide covers four tools with actual setup code: Langfuse (open source, self-hostable), LangSmith (LangChain ecosystem), Helicone (proxy-based, simplest setup), and Phoenix (fully open source). You'll get real pricing, implementation examples, and guidance on which tool fits your stack.
Tool Comparison at a Glance
| Tool | Setup Time | Self-Host | Free Tier | Best For |
|---|---|---|---|---|
| Helicone | 5 minutes | Yes | 100K requests/mo | Fastest setup, proxy model |
| Langfuse | 30 minutes | Yes (free) | 50K events/mo | Open source, framework-agnostic |
| LangSmith | 15 minutes | Enterprise only | 5K traces/mo | LangChain/LangGraph users |
| Phoenix | 1 hour | Yes (free) | Unlimited self-host | Full control, evaluations |
Helicone: 5-Minute Setup via Proxy
Helicone uses a proxy model. Change your base URL, add a header, and you're logging. No SDK changes, no decorators.
Basic Setup
from openai import OpenAI
client = OpenAI(
api_key="your-openai-key",
base_url="https://oai.helicone.ai/v1",
default_headers={
"Helicone-Auth": "Bearer your-helicone-key"
}
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}]
)
That's it. Every request now logs to Helicone's dashboard with latency, tokens, and cost.
Adding Context
Track sessions, users, and custom properties:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Summarize this document"}],
extra_headers={
"Helicone-Session-Id": "session-abc123",
"Helicone-User-Id": "user-456",
"Helicone-Property-Feature": "document-summary",
"Helicone-Property-Environment": "production"
}
)
Custom properties let you filter by feature, environment, or any dimension you need.
Multi-Provider Support
Helicone supports 20+ providers. Change the base URL:
# Anthropic
client = OpenAI(
api_key="your-anthropic-key",
base_url="https://anthropic.helicone.ai/v1",
default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)
# Together AI
client = OpenAI(
api_key="your-together-key",
base_url="https://together.helicone.ai/v1",
default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)
Helicone Pricing
| Plan | Price | Requests | Features |
|---|---|---|---|
| Free | $0 | 100K/month | Basic logging, 30-day retention |
| Growth | $20/seat/month | Unlimited | Caching, alerts, 1-year retention |
| Enterprise | Custom | Unlimited | SSO, dedicated support, custom retention |
Helicone caps at $200/month for unlimited seats on Growth. Good for fast-growing teams.
When to Use Helicone
Choose Helicone when you want the simplest possible setup. It's ideal for teams using OpenAI or Anthropic directly without frameworks. The proxy model means zero code changes beyond the base URL.
Helicone lacks deep evaluation features. For quality scoring and LLM-as-judge workflows, pair it with a dedicated evaluation tool or choose Langfuse or Phoenix instead.
Langfuse: Open Source with Full Control
Langfuse is MIT-licensed and self-hostable. It provides tracing, prompt management, and evaluation in a framework-agnostic package.
Installation and Setup
pip install langfuse
Set environment variables:
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com" # or your self-hosted URL
OpenAI Integration
Langfuse wraps the OpenAI SDK to auto-capture traces:
from langfuse.openai import openai
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
Every call is traced automatically with inputs, outputs, tokens, cost, and latency.
Manual Tracing with Decorators
For custom functions and multi-step workflows:
from langfuse import observe
from langfuse.openai import openai
client = openai.OpenAI()
@observe()
def retrieve_context(query: str) -> list:
    # Your retrieval logic
    return ["doc1", "doc2", "doc3"]

@observe()
def generate_response(query: str, context: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

@observe()
def rag_pipeline(query: str) -> str:
    context = retrieve_context(query)
    return generate_response(query, context)

# This creates a nested trace: rag_pipeline > retrieve_context + generate_response
result = rag_pipeline("What is quantum entanglement?")
The @observe() decorator creates spans. Nested calls become child spans automatically.
LangChain Integration
from langfuse.langchain import CallbackHandler
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
langfuse_handler = CallbackHandler()
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("Explain {topic} simply")
chain = prompt | llm
response = chain.invoke(
{"topic": "machine learning"},
config={"callbacks": [langfuse_handler]}
)
Self-Hosting Langfuse
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up
Access at http://localhost:3000. Self-hosted is fully featured with no restrictions.
Langfuse Pricing
| Plan | Price | Events | Features |
|---|---|---|---|
| Hobby (Self-host) | Free | Unlimited | Full features, you manage infra |
| Cloud Free | $0 | 50K/month | Hosted, community support |
| Cloud Pro | $59/month | 100K included | Priority support, extended retention |
| Enterprise | Custom | Custom | SSO, SLA, dedicated support |
Pro plan charges $0.001 per additional event beyond 100K.
When to Use Langfuse
Langfuse fits teams wanting open-source flexibility without vendor lock-in. Self-hosting eliminates per-event costs entirely. The OpenTelemetry foundation means traces can export to existing observability infrastructure (Datadog, Grafana).
Langfuse's alerting is limited compared to commercial tools. For native Slack/PagerDuty alerts, export metrics to Grafana or Datadog.
LangSmith: Native LangChain Integration
LangSmith comes from the LangChain team. If you're building with LangChain or LangGraph, tracing is automatic with a single environment variable.
Setup
pip install langsmith
Set environment variables:
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="ls-..."
export LANGSMITH_PROJECT="my-project" # optional, defaults to "default"
Automatic LangChain Tracing
With environment variables set, LangChain traces automatically:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant."),
("user", "{input}")
])
llm = ChatOpenAI(model="gpt-4o")
chain = prompt | llm | StrOutputParser()
# Automatically traced to LangSmith
result = chain.invoke({"input": "What's the capital of France?"})
No callbacks, no decorators. Every chain execution appears in LangSmith.
Non-LangChain Tracing
Use the @traceable decorator for vanilla Python:
from langsmith import traceable
from langsmith.wrappers import wrap_openai
import openai
client = wrap_openai(openai.Client())
@traceable(name="generate_summary")
def generate_summary(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize: {text}"}]
    )
    return response.choices[0].message.content

# Traced to LangSmith
summary = generate_summary("Long document text here...")
wrap_openai auto-captures LLM calls. @traceable creates spans for your functions.
LangGraph Agent Tracing
LangGraph agents trace automatically:
from langgraph.graph import StateGraph
from langchain_openai import ChatOpenAI
# With LANGSMITH_TRACING=true, all agent steps are traced
# including tool calls, state transitions, and LLM invocations
Each node, edge, and tool call becomes a span in the trace. For complex agentic AI systems, this visibility is critical.
LangSmith Pricing
| Plan | Price | Traces | Features |
|---|---|---|---|
| Developer | Free | 5K/month | 14-day retention, 1 seat |
| Plus | $39/seat/month | 100K included | 400-day retention, dashboards, alerts |
| Enterprise | Custom | Custom | Self-host option, SSO, dedicated support |
Plus plan overage: ~$0.50 per 1,000 traces beyond 100K.
When to Use LangSmith
LangSmith is the obvious choice if LangChain or LangGraph powers your application. The zero-config tracing and deep framework understanding make debugging easier than any alternative.
The tradeoff is vendor coupling. If you migrate away from LangChain, LangSmith's value drops significantly. Self-hosting requires Enterprise pricing.
Phoenix: Fully Open Source with Evaluations
Phoenix from Arize AI is open source under Elastic License 2.0. It runs locally, in notebooks, or in production clusters. The evaluation features rival commercial tools.
Local Setup
pip install arize-phoenix
phoenix serve
Access at http://localhost:6006.
Docker Deployment
docker run -p 6006:6006 arizephoenix/phoenix:latest
Instrumentation
Phoenix uses OpenTelemetry with the OpenInference standard:
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
# Register Phoenix as trace collector
tracer_provider = register(
project_name="my-app",
endpoint="http://localhost:6006/v1/traces"
)
# Auto-instrument OpenAI
OpenAIInstrumentor().instrument()
# Now all OpenAI calls are traced
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}]
)
LangChain/LangGraph Integration
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register(project_name="my-agent")
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
# All LangChain/LangGraph executions are now traced
# (alternatively, pass auto_instrument=True to register() and skip the manual call)
Built-in Evaluations
Phoenix includes LLM-as-judge evaluators:
from phoenix.evals import llm_classify, OpenAIModel
eval_model = OpenAIModel(model="gpt-4o")
results = llm_classify(
dataframe=traces_df, # Export traces as DataFrame
model=eval_model,
template="Is this response factually accurate? {response}",
rails=["accurate", "inaccurate", "unclear"]
)
Pre-built templates cover hallucination detection, relevance scoring, and toxicity checks.
Phoenix Cloud
For managed hosting:
import os
os.environ["PHOENIX_API_KEY"] = "your-key"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/v1/traces"
from phoenix.otel import register
register(project_name="my-app", auto_instrument=True)
Phoenix Pricing
| Plan | Price | Features |
|---|---|---|
| Self-hosted | Free | Full features, unlimited |
| Cloud Free | $0 | Limited traces, community support |
| Cloud Pro | Contact | Higher limits, priority support |
Self-hosted Phoenix has no restrictions. Run it forever at no cost if you manage the infrastructure.
When to Use Phoenix
Phoenix suits teams wanting full control with no vendor lock-in. The built-in evaluation framework is genuinely useful, not a checkbox feature. Self-hosting means your prompts and responses never leave your infrastructure. For teams running fine-tuned models, this matters.
Setup requires more effort than Helicone or LangSmith. Budget 1-2 hours for initial instrumentation.
Metrics to Track
Regardless of tool, capture these metrics:
Latency
| Metric | Target | Why It Matters |
|---|---|---|
| Time to First Token (TTFT) | <500ms | Perceived responsiveness |
| Total Response Time | <3s (P95) | User experience |
| Time Per Output Token | 30-50ms | Streaming smoothness |
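TTFT and per-token latency can be measured from any streaming response without tool-specific hooks. A minimal, framework-agnostic sketch (the helper name and return shape are illustrative, not from any SDK):

```python
import time

def measure_stream_latency(stream):
    """Measure TTFT, total time, and per-token latency from any token iterator."""
    start = time.monotonic()
    ttft = None
    tokens = 0
    for _ in stream:  # each item = one streamed chunk/token
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        tokens += 1
    total = time.monotonic() - start
    # Average spacing between tokens after the first one
    per_token = (total - ttft) / max(tokens - 1, 1) if ttft is not None else None
    return {"ttft_s": ttft, "total_s": total, "tokens": tokens,
            "per_token_s": per_token}
```

It works with any iterable, e.g. an OpenAI streaming response created with `stream=True`; attach the resulting numbers to your traces as metadata.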
Cost
| Metric | Alert Threshold | Action |
|---|---|---|
| Cost per request (P95) | 2x average | Investigate expensive prompts |
| Daily spend | 80% of budget | Review before hitting limits |
| Cost per user/feature | Varies | Identify cost drivers |
Teams running small language models have different cost profiles but still need this tracking.
For teams focused on LLM cost optimization, observability data reveals which prompts waste tokens.
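The cost metrics above reduce to simple arithmetic over token counts. A sketch, assuming you supply per-million-token prices for your model (the multiplier and percentile method are illustrative choices, not from any tool):

```python
def request_cost(input_tokens, output_tokens, price_in_per_1m, price_out_per_1m):
    """Cost in USD from token counts and per-million-token prices."""
    return input_tokens / 1e6 * price_in_per_1m + output_tokens / 1e6 * price_out_per_1m

def p95(values):
    """Nearest-rank 95th percentile of a list of costs or latencies."""
    ordered = sorted(values)
    idx = max(0, int(len(ordered) * 0.95) - 1)
    return ordered[idx]

def flag_expensive(costs, multiplier=2.0):
    """Indices of requests costing more than multiplier x the average —
    the 'investigate expensive prompts' trigger from the table above."""
    avg = sum(costs) / len(costs)
    return [i for i, c in enumerate(costs) if c > multiplier * avg]
```

Run this over a day's traces grouped by feature tag to find cost drivers before they show up on the invoice.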
Error Rates
| Category | Alert Threshold | Meaning |
|---|---|---|
| Provider errors (429, 500) | >2% | Rate limits or API issues |
| Application errors | >1% | Your code is breaking |
| Silent failures | Any increase | Model returning garbage |
Quality
Quality metrics require evaluation. Run LLM evaluations on a sample of production traffic. Understanding evaluation benchmarks helps you choose the right metrics for your use case.
| Metric | Method | Frequency |
|---|---|---|
| Relevance | LLM-as-judge | 10% of traffic |
| Groundedness (RAG) | Compare to sources | All RAG requests |
| User feedback | Thumbs up/down | Ongoing |
Quality degrades over time as inputs shift. Continual learning approaches help, but require even more rigorous monitoring.
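The LLM-as-judge row in the table can be wired up with a few small pieces: sample traffic, build a judge prompt, parse the verdict. A minimal sketch — the prompt wording, function names, and label set are assumptions, and the judge call itself goes through whatever chat client you already use:

```python
import random

JUDGE_PROMPT = (
    "You are grading an assistant's answer.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with exactly one word: relevant or irrelevant."
)

def sample_for_eval(traces, rate=0.10, seed=None):
    """Pick roughly `rate` of production traces for quality evaluation."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]

def judge_messages(question, answer):
    """Build the LLM-as-judge request; send it with any chat client."""
    return [{"role": "user",
             "content": JUDGE_PROMPT.format(question=question, answer=answer)}]

def parse_verdict(raw):
    """Constrain free-text judge output to a fixed label set."""
    verdict = raw.strip().lower()
    return verdict if verdict in {"relevant", "irrelevant"} else "unclear"
```

Log each verdict back to the originating trace as a score so quality trends show up next to latency and cost.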
Debugging Workflow
When something breaks, follow this process:
Step 1: Identify the Scope
Check your dashboard:
- Which model(s) affected?
- Which time window?
- All users or a segment?
Step 2: Find Representative Traces
Filter to failing requests. In Langfuse, use the trace list filters. In LangSmith, use the "Threads" view to cluster similar issues.
Step 3: Inspect the Trace
Each tool shows the full request flow:
TRACE: user_query
├── SPAN: retrieve_context (latency: 120ms)
│ └── Vector search returned 5 documents
├── SPAN: generate_response (latency: 2.3s)
│ ├── Input: [system prompt + context + query]
│ ├── Output: [response text]
│ └── Tokens: 1,847 input, 523 output
└── SPAN: post_process (latency: 15ms)
Identify where time is spent and where errors occur.
Step 4: Correlate with Changes
Common failure patterns:
- Prompt change broke edge cases: Roll back or fix the prompt
- Provider rate limiting: Implement retry logic or switch models
- Retrieval returning bad context: Debug your RAG pipeline
- Input drift: User behavior changed, prompts need adjustment
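For the rate-limiting pattern above, the standard fix is retry with exponential backoff and jitter. A minimal sketch (in practice, catch your provider's specific 429/5xx exception types rather than bare `Exception`):

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry fn() on transient failures with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # narrow this to rate-limit/server errors in real code
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herd
```

Wrap your traced LLM call in this helper so retries show up as repeated spans under one trace rather than as separate failures.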
Step 5: Verify the Fix
Deploy the fix. Monitor for 15-30 minutes. Confirm error rates dropped and quality recovered.
Setting Up Alerts
Langfuse
Export metrics to Grafana or Datadog via OpenTelemetry:
# otel-collector-config.yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp:
    endpoint: "https://otel.datadoghq.com"
Create alerts in your monitoring tool based on exported metrics.
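Once metrics land in Prometheus, the alert itself is a standard alerting rule. A sketch with illustrative metric names — match them to whatever your exporter actually emits:

```yaml
# prometheus-alerts.yaml — example rule; metric names are placeholders
groups:
  - name: llm-observability
    rules:
      - alert: HighLLMErrorRate
        expr: rate(llm_requests_errors_total[5m]) / rate(llm_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "LLM error rate above 5% for 5 minutes"
```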
LangSmith
Native alerting in Plus/Enterprise:
- Go to Project > Monitoring > Alerts
- Set conditions: error rate >5%, latency P95 >10s, cost spike
- Configure Slack/email/webhook notifications
Helicone
Built-in alerts on Growth tier:
- Dashboard > Alerts > New Alert
- Set thresholds for cost, error rate, latency
- Receive notifications via configured channels
Phoenix
For self-hosted, export to Prometheus:
from phoenix.otel import register

register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces"
)
# Phoenix collects the traces; derive Prometheus metrics from the same
# OTLP stream via an OpenTelemetry Collector
Create Grafana dashboards and alerts from Prometheus metrics.
Integration with Existing Infrastructure
OpenTelemetry Export
Langfuse and Phoenix support OTLP export:
# Send traces to your existing collector alongside Phoenix or Langfuse
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

otlp_exporter = OTLPSpanExporter(
    endpoint="https://your-collector.com/v1/traces"
)
# Attach to the active tracer provider (e.g. the one returned by phoenix.otel.register)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
Datadog Integration
For teams on Datadog:
# Option 1: Datadog LLM Observability (native)
# Option 2: Export from Langfuse/Phoenix via OTLP
# Datadog native setup
export DD_API_KEY="your-key"
export DD_SITE="datadoghq.com"
export DD_LLMOBS_ENABLED=true
Datadog's LLM Observability provides unified APM + LLM monitoring but at enterprise pricing.
Making the Decision
Choose Helicone if:
- You want the fastest possible setup (5 minutes)
- You use OpenAI/Anthropic directly without frameworks
- Cost tracking and basic logging are your primary needs
- You don't need deep evaluation features
Choose Langfuse if:
- You want open source with no vendor lock-in
- Self-hosting matters (data residency, cost control)
- You use multiple frameworks or providers
- You need prompt versioning and management
Choose LangSmith if:
- LangChain or LangGraph powers your application
- Zero-config tracing is worth the framework coupling
- You want built-in evaluation workflows
- Budget allows $39+/seat/month
Choose Phoenix if:
- You need fully open source (self-host everything)
- Built-in evaluation is important
- You want to keep prompts/responses on your infrastructure
- You're comfortable with more setup effort
Getting Started
Week 1: Basic instrumentation
- Pick a tool based on your stack
- Instrument your main LLM calls
- Verify traces appear in dashboard
Week 2: Add context
- Tag requests with user, feature, environment
- Set up cost tracking
- Create overview dashboard
Week 3: Alerts and evaluation
- Configure critical alerts (error rate, cost spikes)
- Run quality evaluations on 10% of traffic
- Build debugging workflow
Week 4: Production hardening
- Instrument multi-step workflows fully
- Set up sampling for high-volume endpoints
- Document debugging runbooks
Enterprise AI trends show observability becoming a baseline requirement for production deployments.
For teams building AI applications without dedicated platform expertise, integrated solutions like Prem Studio include evaluation and observability as part of the development workflow. The documentation covers the full integration process.
FAQ
Which LLM observability tool has the best free tier?
Helicone offers 100K requests/month free. Langfuse Cloud provides 50K events/month. LangSmith gives 5K traces/month. Phoenix self-hosted is unlimited and free. For production use without paying, self-host Langfuse or Phoenix.
Can I use multiple observability tools together?
Yes. A common pattern is Helicone for cost tracking (proxy) plus Phoenix for evaluations. OpenTelemetry-based tools export traces in standard format, so you can send data to multiple backends.
How much engineering time does setup require?
Helicone: 5-30 minutes, depending on how much context tagging you add. LangSmith with LangChain: 15 minutes. Langfuse: 1-2 hours for full instrumentation. Phoenix: 2-4 hours including evaluation setup. Self-hosting any tool adds 2-4 hours for infrastructure.
Should I self-host or use cloud?
Self-host if: data residency requirements, high volume (cost savings), or you need full control. Use cloud if: team is small, want managed infrastructure, or need support SLAs. Most teams start on cloud and migrate to self-hosted as volume grows.
How do I handle high-volume logging?
Sample at high volume. Log 100% of errors and slow requests. Log 10-20% of normal requests for quality evaluation. All tools support sampling configuration. For teams with strict data security requirements, consider what you're logging since prompts may contain sensitive data.
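That sampling policy fits in a single decision function applied before each trace is exported. A minimal sketch (threshold and rate are the illustrative numbers from the answer above, not defaults from any tool):

```python
import random

def should_log(is_error, latency_s, slow_threshold_s=10.0, sample_rate=0.15):
    """Keep 100% of errors and slow requests; sample a fraction of the rest."""
    if is_error or latency_s >= slow_threshold_s:
        return True
    return random.random() < sample_rate
```

Most tools also support this natively (head-based sampling or per-request flags), so prefer the built-in mechanism once you outgrow a hand-rolled check.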
What metrics should I alert on?
Start with: error rate >5% for 5 minutes, P95 latency >10 seconds, cost exceeding 150% of daily budget. Add quality score drops once you have evaluation running. Tune thresholds based on your baseline to avoid alert fatigue.
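Those starting thresholds map directly to a periodic check, wherever your metrics live. A minimal sketch with hypothetical field names for the metrics window:

```python
def check_alerts(window):
    """Evaluate the starting thresholds above against one metrics window.
    window: dict with error_rate, p95_latency_s, daily_cost, daily_budget."""
    alerts = []
    if window["error_rate"] > 0.05:
        alerts.append("error rate above 5%")
    if window["p95_latency_s"] > 10:
        alerts.append("P95 latency above 10s")
    if window["daily_cost"] > 1.5 * window["daily_budget"]:
        alerts.append("cost above 150% of daily budget")
    return alerts
```

Start with these constants, then replace them with values derived from your own baseline once a few weeks of data exist.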