LLM Observability: Setting Up Langfuse, LangSmith, Helicone & Phoenix

Implement LLM monitoring with step-by-step tool setup. Covers Langfuse, LangSmith, Helicone, Phoenix with code, pricing tables, and production debugging.

Production LLMs fail quietly. The API returns 200, but the output is garbage. Costs spike without warning. Quality degrades after a prompt change and nobody notices for days.

Traditional APM tools track server health. They don't tell you whether your model is hallucinating. LLM observability fills that gap with tracing, cost tracking, and quality evaluation.

This guide covers four tools with actual setup code: Langfuse (open source, self-hostable), LangSmith (LangChain ecosystem), Helicone (proxy-based, simplest setup), and Phoenix (fully open source). You'll get real pricing, implementation examples, and guidance on which tool fits your stack.

Tool Comparison at a Glance

| Tool | Setup Time | Self-Host | Free Tier | Best For |
| --- | --- | --- | --- | --- |
| Helicone | 5 minutes | Yes | 100K requests/mo | Fastest setup, proxy model |
| Langfuse | 30 minutes | Yes (free) | 50K events/mo | Open source, framework-agnostic |
| LangSmith | 15 minutes | Enterprise only | 5K traces/mo | LangChain/LangGraph users |
| Phoenix | 1 hour | Yes (free) | Unlimited self-host | Full control, evaluations |

Helicone: 5-Minute Setup via Proxy

Helicone uses a proxy model. Change your base URL, add a header, and you're logging. No SDK changes, no decorators.

Basic Setup

from openai import OpenAI

client = OpenAI(
    api_key="your-openai-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-key"
    }
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

That's it. Every request now logs to Helicone's dashboard with latency, tokens, and cost.

Adding Context

Track sessions, users, and custom properties:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document"}],
    extra_headers={
        "Helicone-Session-Id": "session-abc123",
        "Helicone-User-Id": "user-456",
        "Helicone-Property-Feature": "document-summary",
        "Helicone-Property-Environment": "production"
    }
)

Custom properties let you filter by feature, environment, or any dimension you need.

Multi-Provider Support

Helicone supports 20+ providers. Change the base URL:

# Anthropic (use the Anthropic SDK, since its API format differs from OpenAI's)
from anthropic import Anthropic

client = Anthropic(
    api_key="your-anthropic-key",
    base_url="https://anthropic.helicone.ai",
    default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)

# Together AI
client = OpenAI(
    api_key="your-together-key",
    base_url="https://together.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)

Helicone Pricing

| Plan | Price | Requests | Features |
| --- | --- | --- | --- |
| Free | $0 | 100K/month | Basic logging, 30-day retention |
| Growth | $20/seat/month | Unlimited | Caching, alerts, 1-year retention |
| Enterprise | Custom | Unlimited | SSO, dedicated support, custom retention |

Helicone caps at $200/month for unlimited seats on Growth. Good for fast-growing teams.

When to Use Helicone

Choose Helicone when you want the simplest possible setup. It's ideal for teams using OpenAI or Anthropic directly without frameworks. The proxy model means zero code changes beyond the base URL.

Helicone lacks deep evaluation features. For quality scoring and LLM-as-judge workflows, pair it with a dedicated evaluation tool or choose Langfuse or Phoenix instead.

Langfuse: Open Source with Full Control

Langfuse is MIT-licensed and self-hostable. It provides tracing, prompt management, and evaluation in a framework-agnostic package.

Installation and Setup

pip install langfuse

Set environment variables:

export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com"  # or your self-hosted URL

OpenAI Integration

Langfuse wraps the OpenAI SDK to auto-capture traces:

from langfuse.openai import openai

client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

Every call is traced automatically with inputs, outputs, tokens, cost, and latency.

Manual Tracing with Decorators

For custom functions and multi-step workflows:

from langfuse import observe
from langfuse.openai import openai

client = openai.OpenAI()

@observe()
def retrieve_context(query: str) -> list:
    # Your retrieval logic
    return ["doc1", "doc2", "doc3"]

@observe()
def generate_response(query: str, context: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

@observe()
def rag_pipeline(query: str) -> str:
    context = retrieve_context(query)
    return generate_response(query, context)

# This creates a nested trace: rag_pipeline > retrieve_context + generate_response
result = rag_pipeline("What is quantum entanglement?")

The @observe() decorator creates spans. Nested calls become child spans automatically.

LangChain Integration

from langfuse.langchain import CallbackHandler
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

langfuse_handler = CallbackHandler()

llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("Explain {topic} simply")
chain = prompt | llm

response = chain.invoke(
    {"topic": "machine learning"},
    config={"callbacks": [langfuse_handler]}
)

Self-Hosting Langfuse

git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up

Access at http://localhost:3000. Self-hosted is fully featured with no restrictions.

Langfuse Pricing

| Plan | Price | Events | Features |
| --- | --- | --- | --- |
| Hobby (Self-host) | Free | Unlimited | Full features, you manage infra |
| Cloud Free | $0 | 50K/month | Hosted, community support |
| Cloud Pro | $59/month | 100K included | Priority support, extended retention |
| Enterprise | Custom | Custom | SSO, SLA, dedicated support |

Pro plan charges $0.001 per additional event beyond 100K.

When to Use Langfuse

Langfuse fits teams wanting open-source flexibility without vendor lock-in. Self-hosting eliminates per-event costs entirely. The OpenTelemetry foundation means traces can export to existing observability infrastructure (Datadog, Grafana).

Langfuse's alerting is limited compared to commercial tools. For native Slack/PagerDuty alerts, export metrics to Grafana or Datadog.

LangSmith: Native LangChain Integration

LangSmith comes from the LangChain team. If you're building with LangChain or LangGraph, tracing is automatic with a single environment variable.

Setup

pip install langsmith

Set environment variables:

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="ls-..."
export LANGSMITH_PROJECT="my-project"  # optional, defaults to "default"

Automatic LangChain Tracing

With environment variables set, LangChain traces automatically:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])
llm = ChatOpenAI(model="gpt-4o")
chain = prompt | llm | StrOutputParser()

# Automatically traced to LangSmith
result = chain.invoke({"input": "What's the capital of France?"})

No callbacks, no decorators. Every chain execution appears in LangSmith.

Non-LangChain Tracing

Use the @traceable decorator for vanilla Python:

from langsmith import traceable
from langsmith.wrappers import wrap_openai
import openai

client = wrap_openai(openai.Client())

@traceable(name="generate_summary")
def generate_summary(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize: {text}"}]
    )
    return response.choices[0].message.content

# Traced to LangSmith
summary = generate_summary("Long document text here...")

wrap_openai auto-captures LLM calls. @traceable creates spans for your functions.

LangGraph Agent Tracing

LangGraph agents trace automatically:

# With LANGSMITH_TRACING=true set, any compiled LangGraph graph is traced
# automatically, including tool calls, state transitions, and LLM invocations.
# No callbacks or decorators are needed beyond the environment variables above.

Each node, edge, and tool call becomes a span in the trace. For complex agentic AI systems, this visibility is critical.

LangSmith Pricing

| Plan | Price | Traces | Features |
| --- | --- | --- | --- |
| Developer | Free | 5K/month | 14-day retention, 1 seat |
| Plus | $39/seat/month | 100K included | 400-day retention, dashboards, alerts |
| Enterprise | Custom | Custom | Self-host option, SSO, dedicated support |

Plus plan overage: ~$0.50 per 1,000 traces beyond 100K.

When to Use LangSmith

LangSmith is the obvious choice if LangChain or LangGraph powers your application. The zero-config tracing and deep framework understanding make debugging easier than any alternative.

The tradeoff is vendor coupling. If you migrate away from LangChain, LangSmith's value drops significantly. Self-hosting requires Enterprise pricing.

Phoenix: Fully Open Source with Evaluations

Phoenix from Arize AI is open source under Elastic License 2.0. It runs locally, in notebooks, or in production clusters. The evaluation features rival commercial tools.

Local Setup

pip install arize-phoenix
phoenix serve

Access at http://localhost:6006.

Docker Deployment

docker run -p 6006:6006 arizephoenix/phoenix:latest

Instrumentation

Phoenix uses OpenTelemetry with the OpenInference standard:

from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Register Phoenix as trace collector
tracer_provider = register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces"
)

# Auto-instrument OpenAI
OpenAIInstrumentor().instrument()

# Now all OpenAI calls are traced
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

LangChain/LangGraph Integration

from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

register(project_name="my-agent", auto_instrument=True)
LangChainInstrumentor().instrument()

# All LangChain/LangGraph executions are traced

Built-in Evaluations

Phoenix includes LLM-as-judge evaluators:

from phoenix.evals import llm_classify, OpenAIModel

eval_model = OpenAIModel(model="gpt-4o")

results = llm_classify(
    dataframe=traces_df,  # Export traces as DataFrame
    model=eval_model,
    template="Is this response factually accurate? {response}",
    rails=["accurate", "inaccurate", "unclear"]
)

Pre-built templates cover hallucination detection, relevance scoring, and toxicity checks.

Phoenix Cloud

For managed hosting:

import os
os.environ["PHOENIX_API_KEY"] = "your-key"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/v1/traces"

from phoenix.otel import register
register(project_name="my-app", auto_instrument=True)

Phoenix Pricing

| Plan | Price | Features |
| --- | --- | --- |
| Self-hosted | Free | Full features, unlimited |
| Cloud Free | $0 | Limited traces, community support |
| Cloud Pro | Contact | Higher limits, priority support |

Self-hosted Phoenix has no restrictions. Run it forever at no cost if you manage the infrastructure.

When to Use Phoenix

Phoenix suits teams wanting full control with no vendor lock-in. The built-in evaluation framework is genuinely useful, not a checkbox feature. Self-hosting means your prompts and responses never leave your infrastructure. For teams running fine-tuned models, this matters.

Setup requires more effort than Helicone or LangSmith. Budget 1-2 hours for initial instrumentation.

Metrics to Track

Regardless of tool, capture these metrics:

Latency

| Metric | Target | Why It Matters |
| --- | --- | --- |
| Time to First Token (TTFT) | <500ms | Perceived responsiveness |
| Total Response Time | <3s (P95) | User experience |
| Time Per Output Token | 30-50ms | Streaming smoothness |
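These numbers can be derived from a streamed response by recording the request start and each chunk's arrival time. A minimal sketch (function name and the timestamps below are illustrative):

```python
def streaming_latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT, total response time, and per-output-token latency (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # Per-token latency over the generation phase (after the first token arrives)
    per_token = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    return {"ttft": ttft, "total": total, "time_per_output_token": per_token}

# Example: first token at 0.40s, then four more tokens 40ms apart
metrics = streaming_latency_metrics(0.0, [0.40, 0.44, 0.48, 0.52, 0.56])
```

In production you would feed real timestamps from your streaming loop and report the results as trace metadata.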

Cost

| Metric | Alert Threshold | Action |
| --- | --- | --- |
| Cost per request (P95) | 2x average | Investigate expensive prompts |
| Daily spend | 80% of budget | Review before hitting limits |
| Cost per user/feature | Varies | Identify cost drivers |

Teams running small language models have different cost profiles but still need this tracking.

For teams focused on LLM cost optimization, observability data reveals which prompts waste tokens.
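Cost per request follows directly from token counts and per-token prices. A minimal sketch; the price table below is an illustrative placeholder, so check your provider's current price list before relying on it:

```python
# Illustrative per-million-token prices (USD); real prices vary and change often
PRICES_PER_MTOK = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request, computed from token counts."""
    p = PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Token counts taken from a typical trace
cost = request_cost("gpt-4o", input_tokens=1847, output_tokens=523)
```

Aggregating this per user, feature, or environment is what the custom properties and tags from earlier sections enable.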

Error Rates

| Category | Alert Threshold | Meaning |
| --- | --- | --- |
| Provider errors (429, 500) | >2% | Rate limits or API issues |
| Application errors | >1% | Your code is breaking |
| Silent failures | Any increase | Model returning garbage |

Quality

Quality metrics require evaluation. Run LLM evaluations on a sample of production traffic. Understanding evaluation benchmarks helps you choose the right metrics for your use case.

| Metric | Method | Frequency |
| --- | --- | --- |
| Relevance | LLM-as-judge | 10% of traffic |
| Groundedness (RAG) | Compare to sources | All RAG requests |
| User feedback | Thumbs up/down | Ongoing |
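Sampling 10% of traffic for LLM-as-judge evaluation is often done by hashing a stable request id rather than calling a random-number generator, so the decision is deterministic and reproducible across services. A sketch (the `sample_for_eval` helper is hypothetical):

```python
import hashlib

def sample_for_eval(request_id: str, rate: float = 0.10) -> bool:
    """Deterministically select ~`rate` of requests by hashing the request id."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Roughly 10% of these ids will be selected, and the same ids every run
selected = [rid for rid in (f"req-{i}" for i in range(10_000)) if sample_for_eval(rid)]
```

Because the decision depends only on the id, a retried or replayed request always lands in the same bucket, which keeps evaluation datasets stable.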

Quality degrades over time as inputs shift. Continual learning approaches help, but require even more rigorous monitoring.

Debugging Workflow

When something breaks, follow this process:

Step 1: Identify the Scope

Check your dashboard:

  • Which model(s) affected?
  • Which time window?
  • All users or a segment?

Step 2: Find Representative Traces

Filter to failing requests. In Langfuse, use the trace list filters. In LangSmith, use the "Threads" view to cluster similar issues.

Step 3: Inspect the Trace

Each tool shows the full request flow:

TRACE: user_query
├── SPAN: retrieve_context (latency: 120ms)
│   └── Vector search returned 5 documents
├── SPAN: generate_response (latency: 2.3s)
│   ├── Input: [system prompt + context + query]
│   ├── Output: [response text]
│   └── Tokens: 1,847 input, 523 output
└── SPAN: post_process (latency: 15ms)

Identify where time is spent and where errors occur.

Step 4: Correlate with Changes

Common failure patterns:

  • Prompt change broke edge cases: Roll back or fix the prompt
  • Provider rate limiting: Implement retry logic or switch models
  • Retrieval returning bad context: Debug your RAG pipeline
  • Input drift: User behavior changed, prompts need adjustment
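For the rate-limiting case, a small backoff wrapper is usually the first fix. A sketch, with a stand-in `RateLimitError` in place of your provider SDK's real exception type (e.g. `openai.RateLimitError`):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's 429 exception type."""

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `fn` on rate-limit errors with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt, with up to 100% jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Wrap the LLM call itself so the retries show up as repeated spans inside one trace, which makes rate-limit storms easy to spot on the dashboard.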

Step 5: Verify the Fix

Deploy the fix. Monitor for 15-30 minutes. Confirm error rates dropped and quality recovered.

Setting Up Alerts

Langfuse

Export metrics to Grafana or Datadog via OpenTelemetry:

# otel-collector-config.yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp:
    endpoint: "https://otel.datadoghq.com"

Create alerts in your monitoring tool based on exported metrics.
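A Prometheus alerting rule over such exported metrics might look like the following; the metric names are assumptions and must match what your exporter actually emits:

```yaml
# prometheus-alerts.yaml (metric names are illustrative)
groups:
  - name: llm-observability
    rules:
      - alert: LLMErrorRateHigh
        expr: rate(llm_requests_errors_total[5m]) / rate(llm_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "LLM error rate above 5% for 5 minutes"
```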

LangSmith

Native alerting in Plus/Enterprise:

  • Go to Project > Monitoring > Alerts
  • Set conditions: error rate >5%, latency P95 >10s, cost spike
  • Configure Slack/email/webhook notifications

Helicone

Built-in alerts on Growth tier:

  • Dashboard > Alerts > New Alert
  • Set thresholds for cost, error rate, latency
  • Receive notifications via configured channels

Phoenix

For self-hosted, export to Prometheus:

from phoenix.otel import register

register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces",
    # Add Prometheus endpoint for metrics
)

Create Grafana dashboards and alerts from Prometheus metrics.

Integration with Existing Infrastructure

OpenTelemetry Export

Langfuse and Phoenix support OTLP export:

# Send traces to both Phoenix and your existing collector by attaching a
# second exporter to the tracer provider (e.g. the one phoenix.otel.register returns)
from opentelemetry import trace
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

otlp_exporter = OTLPSpanExporter(endpoint="https://your-collector.com/v1/traces")
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))

Datadog Integration

For teams on Datadog:

# Option 1: Datadog LLM Observability (native)
# Option 2: Export from Langfuse/Phoenix via OTLP

# Datadog native setup
export DD_API_KEY="your-key"
export DD_SITE="datadoghq.com"
export DD_LLMOBS_ENABLED=true

Datadog's LLM Observability provides unified APM + LLM monitoring but at enterprise pricing.

Making the Decision

Choose Helicone if:

  • You want the fastest possible setup (5 minutes)
  • You use OpenAI/Anthropic directly without frameworks
  • Cost tracking and basic logging are your primary needs
  • You don't need deep evaluation features

Choose Langfuse if:

  • You want open source with no vendor lock-in
  • Self-hosting matters (data residency, cost control)
  • You use multiple frameworks or providers
  • You need prompt versioning and management

Choose LangSmith if:

  • LangChain or LangGraph powers your application
  • Zero-config tracing is worth the framework coupling
  • You want built-in evaluation workflows
  • Budget allows $39+/seat/month

Choose Phoenix if:

  • You need fully open source (self-host everything)
  • Built-in evaluation is important
  • You want to keep prompts/responses on your infrastructure
  • You're comfortable with more setup effort

Getting Started

Week 1: Basic instrumentation

  • Pick a tool based on your stack
  • Instrument your main LLM calls
  • Verify traces appear in dashboard

Week 2: Add context

  • Tag requests with user, feature, environment
  • Set up cost tracking
  • Create overview dashboard

Week 3: Alerts and evaluation

  • Configure critical alerts (error rate, cost spikes)
  • Run quality evaluations on 10% of traffic
  • Build debugging workflow

Week 4: Production hardening

  • Instrument multi-step workflows fully
  • Set up sampling for high-volume endpoints
  • Document debugging runbooks

Enterprise AI trends show observability becoming a baseline requirement for production deployments.

For teams building AI applications without dedicated platform expertise, integrated solutions like Prem Studio include evaluation and observability as part of the development workflow. The documentation covers the full integration process.


FAQ

Which LLM observability tool has the best free tier?

Helicone offers 100K requests/month free. Langfuse Cloud provides 50K events/month. LangSmith gives 5K traces/month. Phoenix self-hosted is unlimited and free. For production use without paying, self-host Langfuse or Phoenix.

Can I use multiple observability tools together?

Yes. A common pattern is Helicone for cost tracking (proxy) plus Phoenix for evaluations. OpenTelemetry-based tools export traces in standard format, so you can send data to multiple backends.

How much engineering time does setup require?

Helicone: 15-30 minutes. LangSmith with LangChain: 15 minutes. Langfuse: 1-2 hours for full instrumentation. Phoenix: 2-4 hours including evaluation setup. Self-hosting any tool adds 2-4 hours for infrastructure.

Should I self-host or use cloud?

Self-host if: data residency requirements, high volume (cost savings), or you need full control. Use cloud if: team is small, want managed infrastructure, or need support SLAs. Most teams start on cloud and migrate to self-hosted as volume grows.

How do I handle high-volume logging?

Sample at high volume. Log 100% of errors and slow requests. Log 10-20% of normal requests for quality evaluation. All tools support sampling configuration. For teams with strict data security requirements, consider what you're logging since prompts may contain sensitive data.

What metrics should I alert on?

Start with: error rate >5% for 5 minutes, P95 latency >10 seconds, cost exceeding 150% of daily budget. Add quality score drops once you have evaluation running. Tune thresholds based on your baseline to avoid alert fatigue.
