The Problem: AI Agents Are a Black Box in Production
Picture this: It's Tuesday afternoon. Your on-call engineer gets an alert — your customer-facing AI assistant has been giving wrong pricing information to enterprise accounts for the past three hours. You open your observability dashboard. Prometheus shows a normal request rate. Grafana shows p99 latency at 620ms — within SLO. Error rate: 0.2%. Nothing is obviously wrong.
But your agent has been confidently hallucinating prices.
This is the fundamental problem with applying traditional observability to AI agents. The tools that served us brilliantly for microservices — Prometheus, Grafana, structured logs, Jaeger for HTTP traces — were designed for deterministic, stateless request-response systems. AI agents are none of those things.
An AI agent in a multi-agent system performs a fundamentally different kind of work:
- It reasons — internal chain-of-thought steps that have no HTTP equivalent
- It retrieves — vector database lookups where result quality directly affects output quality
- It calls tools — spawning sub-agents, APIs, or code executors with non-deterministic results
- It synthesizes — combining retrieved context + prior reasoning into a final response
A latency metric at the boundary tells you nothing about which of these steps degraded. And the community has noticed: the r/devops thread "Observability For AI Models and GPU Inferencing" hit the front page this week, with hundreds of engineers sharing the same frustration — their traditional stacks are blind to what's happening inside their agents.
PostHog's 506 stars/week — one of the largest non-AI open-source growth rates in early 2026 — reflects this demand. Engineers are actively searching for new tools because the old ones don't fit.
Why Logs and Metrics Are Structurally Insufficient
Let me be precise here, because this is where I see enterprise teams make a costly mistake: they add more logging and call it observability. It isn't. There is a structural mismatch between what logs/metrics capture and what AI agents need.
| Observability Signal | Good for Microservices? | Good for AI Agents? | Why It Fails for AI |
|---|---|---|---|
| Prometheus Metrics | ✓ Excellent | ✗ Insufficient | Counters/gauges can't express the causal reasoning chain |
| Structured Logs | ✓ Good | ⚠️ Partial | No parent-child span context; can't correlate prompt → retrieval → response |
| Grafana Dashboards | ✓ Excellent | ✗ Misleading | Aggregated metrics mask per-request quality degradation |
| HTTP Distributed Tracing | ✓ Good | ⚠️ Incomplete | Captures network hops but not LLM reasoning steps or token budgets |
| OTel LLM Spans | N/A | ✓ Purpose-Built | Captures prompt, completion, tokens, model, temperature, retrieval scores, tool calls |
The core issue is context propagation. In a multi-agent system, a single user query might spawn three sub-agents, each making two LLM calls and one vector DB lookup. A traditional log aggregator captures those twelve operations — three sub-agent invocations plus nine LLM and DB calls — as disconnected log lines. An OTel trace captures them as a single causal tree: you can see exactly which sub-agent retrieved what, which LLM call exceeded its token budget, and which tool invocation returned a 404 that the agent then "hallucinated" around.
This is not a tooling preference. It is a structural capability difference that determines whether you can do root-cause analysis on agent failures at all.
I have trained teams at financial institutions where a single wrong AI response could trigger regulatory exposure. "We checked our Grafana dashboards and everything looked fine" is an answer that gets engineering leadership fired. You need traces.
OpenTelemetry for LLMs: The 2026 Standard
The good news: the industry has converged on a standard. At KubeCon EU 2026, OpenTelemetry was formally positioned as the canonical observability layer for AI/ML workloads — not a nice-to-have, but the reference architecture that cloud vendors, LLM providers, and orchestration frameworks are aligning behind.
The OTel GenAI semantic conventions (finalized in late 2025) define a rich set of span attributes specifically for LLM interactions:
- `gen_ai.system` — e.g., `openai`, `anthropic`, `ollama`
- `gen_ai.request.model` — the exact model version used
- `gen_ai.request.max_tokens`, `gen_ai.request.temperature`
- `gen_ai.usage.prompt_tokens`, `gen_ai.usage.completion_tokens`
- `gen_ai.response.finish_reason` — `stop` vs `length` vs `tool_calls`
- Agent-specific: tool name, tool input/output, retrieval score, span kind (`agent`, `chain`, `tool`, `retriever`)
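To make the conventions concrete, here is how those attributes look on a single LLM span, written as a plain Python dict. The attribute names follow the conventions listed above; the values are made up for illustration:

```python
# One LLM call's span attributes, per the OTel GenAI semantic conventions.
# Values are illustrative, not from a real request.
llm_span_attributes = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.request.max_tokens": 1024,
    "gen_ai.request.temperature": 0.1,
    "gen_ai.usage.prompt_tokens": 487,
    "gen_ai.usage.completion_tokens": 156,
    "gen_ai.response.finish_reason": "stop",
}

def truncated(attrs: dict) -> bool:
    """A finish_reason of 'length' means the completion was cut off at the
    token budget — a leading indicator of degraded answers."""
    return attrs.get("gen_ai.response.finish_reason") == "length"

print(truncated(llm_span_attributes))  # False — this call completed normally
```

Because the names are standardized, any backend — Langfuse, Jaeger, a custom query — can filter on them without vendor-specific schemas.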
eBPF 1.0 GA (released Q1 2026) adds another dimension: zero-instrumentation capture of LLM API calls at the kernel level, even from containers that have no OTel SDK installed. For legacy model serving infrastructure, this is transformative — you get traces without touching the application code at all.
The converging 2026 stack looks like this:
- OTel Python SDK with `opentelemetry-instrumentation-langchain` for automatic LangChain/LangGraph tracing
- OTel Collector as the telemetry gateway (batching, filtering, routing to multiple backends)
- Langfuse as the LLM-native trace backend (open source, Kubernetes-deployable, understands prompt/completion natively)
- Jaeger or Tempo for the broader distributed tracing picture (non-LLM service hops)
- Grafana for infrastructure metrics — retained, but now downstream of the OTel Collector, not the source of truth for AI behaviour
grafana/pyroscope (47 stars/week trending on GitHub) rounds this out with continuous profiling — especially valuable for GPU inference workloads where you want to correlate LLM latency spikes with CUDA kernel execution time.
Hands-On: Tracing a LangChain Agent with OTel + Langfuse
Enough theory. Let me show you exactly how to instrument a LangChain ReAct agent with OpenTelemetry and ship traces to Langfuse. This is the pattern I use in production environments and the one I teach in the gheWARE Agentic AI Workshop.
Step 1 — Install Dependencies
pip install \
  opentelemetry-sdk \
  opentelemetry-api \
  opentelemetry-exporter-otlp-proto-http \
  opentelemetry-instrumentation-langchain \
  langchain \
  langchain-openai \
  langfuse
Step 2 — Configure OTel Tracer Provider
"""
otel_setup.py — OTel tracer configuration for multi-agent AI observability
gheWARE Agentic AI Workshop pattern
"""
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
def configure_otel_for_agents(service_name: str = "multi-agent-pipeline") -> trace.Tracer:
"""
Set up OpenTelemetry tracing for a multi-agent AI system.
Exports to OTel Collector → Langfuse via OTLP HTTP.
"""
resource = Resource.create({
"service.name": service_name,
"service.version": "1.0.0",
"deployment.environment": os.getenv("ENVIRONMENT", "production"),
# AI-specific resource attributes
"ai.framework": "langchain",
"ai.orchestrator": "langgraph",
})
# OTLP exporter pointing to your OTel Collector (or directly to Langfuse)
otlp_exporter = OTLPSpanExporter(
endpoint=os.getenv(
"OTEL_EXPORTER_OTLP_ENDPOINT",
"http://otel-collector:4318/v1/traces"
),
headers={
"Authorization": f"Bearer {os.getenv('OTEL_AUTH_TOKEN', '')}",
}
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)
return trace.get_tracer(service_name)
Step 3 — Auto-Instrument LangChain and Run Your Agent
"""
agent_runner.py — LangChain ReAct agent with full OTel tracing
Every LLM call, tool invocation, and chain step emits a span.
"""
from opentelemetry import trace  # needed below for trace.StatusCode
from opentelemetry.instrumentation.langchain import LangChainInstrumentor
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain.tools import tool
from langchain import hub
from otel_setup import configure_otel_for_agents
# 1. Configure OTel first — the tracer provider must be set before
#    instrumenting or running any chains
tracer = configure_otel_for_agents(service_name="pricing-agent")
# 2. Auto-instrument LangChain — this patches all chains, LLM calls, and tools
LangChainInstrumentor().instrument()
# 3. Define your tools
@tool
def get_product_price(product_id: str) -> str:
"""Retrieve current pricing for a product from the pricing service."""
# In production: call your internal pricing API
prices = {"PROD-001": "$499/month", "PROD-002": "$999/month"}
return prices.get(product_id, "Product not found")
@tool
def search_knowledge_base(query: str) -> str:
"""Search the product knowledge base for relevant information."""
# In production: vector DB lookup (Chroma, Weaviate, Qdrant)
return f"Knowledge base result for: {query}"
# 4. Build the agent
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
tools = [get_product_price, search_knowledge_base]
prompt = hub.pull("hwchase17/react")  # standard ReAct prompt from the LangChain Hub
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=False)
# 5. Run with manual span wrapping for the top-level user request
def handle_user_query(query: str, user_id: str) -> str:
with tracer.start_as_current_span(
"user-query",
attributes={
"user.id": user_id,
"query.length": len(query),
"agent.type": "react",
}
) as span:
try:
result = agent_executor.invoke({"input": query})
span.set_attribute("agent.output_length", len(result["output"]))
span.set_status(trace.StatusCode.OK)
return result["output"]
except Exception as e:
span.record_exception(e)
span.set_status(trace.StatusCode.ERROR, str(e))
raise
if __name__ == "__main__":
response = handle_user_query(
query="What is the current price for PROD-001 and what are its key features?",
user_id="enterprise-user-42"
)
print(response)
With this setup, every run of handle_user_query produces a trace tree in Langfuse that looks like:
└─ user-query [342ms] ← your top-level span
├─ langchain.AgentExecutor [338ms]
│ ├─ langchain.ChatOpenAI [210ms] ← initial reasoning
│ │ ├─ gen_ai.request.model: gpt-4o
│ │ ├─ gen_ai.usage.prompt_tokens: 487
│ │ └─ gen_ai.response.finish_reason: tool_calls
│ ├─ langchain.tool.get_product_price [12ms]
│ │ └─ tool.input: "PROD-001"
│ ├─ langchain.tool.search_knowledge_base [18ms]
│ │ └─ tool.input: "PROD-001 features"
│ └─ langchain.ChatOpenAI [88ms] ← synthesis call
│ ├─ gen_ai.usage.prompt_tokens: 721
│ ├─ gen_ai.usage.completion_tokens: 156
│ └─ gen_ai.response.finish_reason: stop
Now when an agent produces a wrong answer, you open this trace and immediately see: was it the retrieval step that returned bad context? Was it the synthesis LLM call that truncated due to finish_reason: length? Was the tool call returning stale data? The trace answers the question that logs cannot.
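That triage can even be scripted once traces are flowing. The sketch below is deliberately simplified: a hypothetical failing trace represented as `(span_name, attributes)` pairs — a stand-in for the richer objects a real backend API would return — plus a helper that flags the two failure modes just described. All names and values are illustrative:

```python
# Hypothetical failing trace: the tool missed, and the synthesis call
# was truncated. Attribute names follow the OTel GenAI conventions;
# the trace itself is fabricated for illustration.
trace_spans = [
    ("langchain.ChatOpenAI",
     {"gen_ai.response.finish_reason": "tool_calls",
      "gen_ai.usage.prompt_tokens": 487}),
    ("langchain.tool.get_product_price",
     {"tool.input": "PROD-999", "tool.output": "Product not found"}),
    ("langchain.ChatOpenAI",
     {"gen_ai.response.finish_reason": "length",
      "gen_ai.usage.prompt_tokens": 721}),
]

def triage(spans):
    """Flag truncated completions and tool misses the agent may have
    papered over. The error-string check is illustrative only."""
    findings = []
    for name, attrs in spans:
        if attrs.get("gen_ai.response.finish_reason") == "length":
            findings.append(f"{name}: completion truncated at token budget")
        if attrs.get("tool.output") == "Product not found":
            findings.append(f"{name}: tool miss the LLM may hallucinate around")
    return findings

for finding in triage(trace_spans):
    print(finding)
```

In practice you would run a query like this against your backend's API (Langfuse and Jaeger both expose trace search), but the shape of the analysis is the same: filter spans by attribute, not grep logs by timestamp.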
Step 4 — Langfuse Integration (Alternative Backend)
Langfuse provides a hosted or self-hosted LLM-native trace backend with a UI purpose-built for LLM observability — prompt playground, cost analytics, evaluation scoring, and regression testing against trace history. For teams who want a zero-config start, use the Langfuse SDK directly:
"""
langfuse_tracing.py — Langfuse integration for LLM observability
Works alongside OTel or standalone; recommended for LLM-first teams.
"""
import os
from langfuse import Langfuse
from langfuse.callback import CallbackHandler
# Initialize Langfuse client
langfuse = Langfuse(
public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"), # or self-hosted
)
# Create a Langfuse callback handler for LangChain
langfuse_handler = CallbackHandler(
trace_name="pricing-agent-production",
user_id="enterprise-user-42",
session_id="session-2026-02-28-001",
tags=["production", "pricing", "react-agent"],
metadata={
"deployment_env": "production",
"agent_version": "2.1.0",
"kubernetes_namespace": "ai-agents",
}
)
# Use the handler in your AgentExecutor
result = agent_executor.invoke(
{"input": "What is the current price for PROD-001?"},
config={"callbacks": [langfuse_handler]}
)
# Flush traces before process exit (critical in serverless/short-lived containers)
langfuse.flush()
The Full OTel-Native Observability Stack for Multi-Agent AI
Individual instrumentation is step one. The full production-grade multi-agent observability stack requires all the layers working together — and deployed on Kubernetes where your agents actually run.
Reference Architecture
Here's the Kubernetes-native observability stack I deploy for enterprise teams running multi-agent AI in production:
# otel-collector-config.yaml
# OTel Collector as the central telemetry hub for multi-agent AI
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
name: ai-otel-collector
namespace: observability
spec:
config: |
receivers:
otlp:
protocols:
http:
endpoint: 0.0.0.0:4318
grpc:
endpoint: 0.0.0.0:4317
# eBPF receiver — zero-instrumentation capture of LLM API calls
ebpf:
endpoint: 0.0.0.0:4319
processors:
batch:
timeout: 1s
send_batch_size: 1024
# Enrich spans with K8s pod/namespace metadata
k8sattributes:
auth_type: serviceAccount
passthrough: false
extract:
metadata:
- k8s.pod.name
- k8s.namespace.name
- k8s.deployment.name
- k8s.node.name
# Filter sensitive PII from prompt content
attributes/redact_pii:
actions:
- key: gen_ai.prompt
action: hash
- key: gen_ai.completion
action: hash
exporters:
# Langfuse for LLM-native analytics
otlphttp/langfuse:
endpoint: https://cloud.langfuse.com/api/public/otel
headers:
Authorization: "Basic ${LANGFUSE_AUTH}"
# Jaeger for infrastructure-level distributed tracing
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
# Prometheus for infrastructure metrics (GPU, token rate, latency histograms)
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp, ebpf]
processors: [k8sattributes, attributes/redact_pii, batch]
exporters: [otlphttp/langfuse, jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
A few things worth calling out explicitly from production experience:
- PII Redaction at the Collector level — never let raw prompt content reach your trace backends unless you have explicit data governance approval. Hash it at the collector, store a reference, and have a secure audit path for when you need to inspect specific traces.
- k8sattributes processor — automatically enriches every span with Kubernetes metadata. When you're debugging a multi-agent failure, knowing exactly which pod and node served the request is invaluable.
- Dual export to Langfuse + Jaeger — Langfuse for LLM-specific analysis, Jaeger for the full service graph including your non-LLM services (databases, APIs, message queues) that your agents interact with.
- Prometheus for token budget monitoring — set up alerts on `gen_ai.usage.prompt_tokens` histograms. Token budget exhaustion (`finish_reason: length`) is a leading indicator of hallucination risk and often invisible to infrastructure metrics.
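As a sketch of that last point, here is a hypothetical Prometheus alerting rule. It assumes a counter like `llm_completions_total{finish_reason=...}` has been derived from spans (for example via the Collector's spanmetrics connector) — the metric name and threshold here are illustrative, not standard:

```yaml
# llm-token-budget-rules.yaml — hedged example, assumes a span-derived
# counter llm_completions_total labeled by finish_reason.
groups:
  - name: llm-token-budget
    rules:
      - alert: LLMTokenBudgetExhaustion
        expr: |
          sum(rate(llm_completions_total{finish_reason="length"}[5m]))
            / sum(rate(llm_completions_total[5m])) > 0.02
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: ">2% of LLM completions truncated at the token limit"
```

The point is the shape of the alert: you are alerting on a quality signal extracted from traces, not on CPU or latency.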
The 2026 Observability Maturity Model for AI Teams
Based on what I've seen across enterprise AI teams from financial services to e-commerce, here's how observability maturity maps to operational capability:
- Level 1 — Blind: Logs only. Can't explain why an agent failed. Post-mortems are guesswork.
- Level 2 — Reactive: Prometheus + Grafana. Know when latency spiked; can't explain why an agent hallucinated.
- Level 3 — Trace-Aware: OTel instrumentation on LangChain/LangGraph. Can trace individual agent runs. Manual investigation of failures.
- Level 4 — AI-Native Observability: OTel + Langfuse + automated evaluation. Detect quality regressions automatically. A/B test prompts with trace-level evidence. Real-time token budget alerts.
- Level 5 — Self-Healing: OTel traces feed back into agent behaviour. Anomalous reasoning patterns trigger automatic fallbacks. Continuous evaluation against ground truth in production.
Most enterprise teams in 2026 are at Level 1 or 2. The teams shipping reliable AI products to production are at Level 3-4. Level 5 is where the most advanced AI-native companies are heading — and the OTel ecosystem is the infrastructure layer that makes it possible.
The question for your organisation is not whether to adopt OTel for your AI systems. Given the KubeCon EU 2026 convergence, the CNCF standardisation, and the fact that every major LLM provider is shipping OTel-native SDKs, it will become as much of a default as Prometheus is for Kubernetes today. The question is how quickly you adopt it and whether your engineering teams have the skills to implement it correctly before a production incident forces the issue.
Frequently Asked Questions
Why does traditional observability fail for AI agents?
Traditional observability tools like Prometheus and Grafana track numerical metrics — CPU, memory, request rates, latency percentiles. AI agents produce causal chains: prompt → retrieval → reasoning → tool call → response. A latency spike metric tells you something was slow; it cannot tell you whether the LLM hit its token limit, whether the vector search returned irrelevant context, or whether a tool call returned stale data that the agent then incorporated incorrectly. You need distributed traces — specifically OTel traces with GenAI semantic conventions — to capture and correlate those causal steps end-to-end.
What is OpenTelemetry for LLMs and why does it matter in 2026?
OpenTelemetry (OTel) for LLMs extends the CNCF open standard to capture AI-specific telemetry as first-class trace spans: prompt content, token counts, model name, temperature, retrieval scores, tool invocations, and agent reasoning steps. With eBPF 1.0 GA and KubeCon EU 2026 convergence on OTel as the AI observability standard, teams can now get end-to-end traces from user query to final agent response without vendor lock-in. Every major LLM provider — OpenAI, Anthropic, Google — is aligning their SDKs to OTel GenAI semantic conventions.
How do I trace a LangChain agent with OpenTelemetry?
Install opentelemetry-sdk and opentelemetry-instrumentation-langchain. Set up a TracerProvider pointing to your OTel Collector or Langfuse endpoint. Call LangChainInstrumentor().instrument() before any agent runs. Every LLM call, tool invocation, and chain step will automatically emit spans with token counts, latency, model metadata, and tool inputs/outputs. See the full working code in Step 2 and Step 3 of this article — it is production-ready as written.
What is Langfuse and how does it compare to Jaeger for AI observability?
Langfuse is an open-source LLM observability platform with a UI purpose-built for LLM traces — it understands prompt/completion natively, provides cost analytics per session, supports evaluation scoring, and lets you compare prompt versions against historical traces. Jaeger is a general-purpose distributed tracing backend. For multi-agent AI systems, use both: Langfuse for LLM-native analytics and prompt debugging, Jaeger (or Tempo) for the full service graph including your non-LLM services. Route both from a single OTel Collector instance.
How do I prevent PII in prompts from leaking into my trace backend?
Configure an attributes/redact_pii processor in your OTel Collector pipeline (see the YAML in Section 5). Hash sensitive span attributes — particularly gen_ai.prompt and gen_ai.completion — before they reach any external backend. Store a mapping of hash → original in a separate, access-controlled store for audit purposes. Never log raw prompt content to general-purpose log aggregators. This is non-negotiable for financial services, healthcare, and any system processing personal data.
Conclusion: The Observability Gap Is Now a Competitive Risk
The question I started with — "Can you explain why your AI agent hallucinated in production last Tuesday?" — is not a gotcha. It's the baseline competency question for any team running AI agents in production in 2026. If your answer is "we checked our Grafana dashboard and it looked fine," you have a structural observability gap that will eventually manifest as a production incident, a customer complaint, or worse.
The OTel-native observability stack I've walked through here — OTel Python SDK + LangChain instrumentation + OTel Collector + Langfuse — is battle-tested, open source, vendor-neutral, and deployable on any Kubernetes cluster in under two hours. The code in this article is the exact pattern we teach and deploy in enterprise engagements.
The teams who invest in AI observability now, before they're forced to by a production incident, will have a significant operational advantage as multi-agent AI systems become core business infrastructure. The teams who wait will be doing forensic archaeology on logs, trying to reconstruct why their agent behaved the way it did, with no causal evidence to work from.
Build the traces-first culture now. Your future on-call engineer will thank you.