What Is OpenTelemetry for AI Agents?

OpenTelemetry (OTel) for AI agents is the application of the CNCF open-source observability standard to instrument LLM API calls, tool invocations, memory reads/writes, and inter-agent communication—producing vendor-neutral traces, metrics, and logs that reveal exactly what your AI system did, how much it cost, and why it failed. Unlike traditional microservice tracing where you track HTTP calls and database queries, AI agent observability must capture the non-deterministic, multi-step reasoning chains that happen inside an agent's decision loop.

The breakthrough came in late 2024 when the OpenTelemetry project published the GenAI Semantic Conventions (first released in Semantic Conventions v1.27 and still maturing toward stability), a standardised schema for LLM-specific telemetry. Attributes like gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.tool.name give every observability backend the vocabulary to understand AI agent behaviour—not just "a 2-second HTTP call" but "Claude 3.5 Sonnet processed a 1,847-token prompt and invoked the database_search tool".
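To make the schema concrete, here is a sketch of the gen_ai.* attributes one LLM-call span might carry, plus a small helper that turns token-usage attributes into a cost figure. The attribute values and the per-million-token prices are illustrative assumptions, not published rates:

```python
# Hypothetical attribute values for a single LLM-call span; the keys follow
# the OTel GenAI semantic conventions described above.
llm_span_attributes = {
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-3-5-sonnet",
    "gen_ai.usage.input_tokens": 1847,
    "gen_ai.usage.output_tokens": 512,
    "gen_ai.tool.name": "database_search",
}

def estimate_cost_usd(attrs: dict, usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Attribute a dollar cost to one span from its token-usage attributes."""
    return (
        attrs["gen_ai.usage.input_tokens"] * usd_per_mtok_in
        + attrs["gen_ai.usage.output_tokens"] * usd_per_mtok_out
    ) / 1_000_000
```

Because every backend receives the same keys, a cost calculation like this works identically whether the spans land in Tempo, Jaeger, or a commercial tool.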

The Three Pillars of AI Agent Observability

OTel's three signal types map directly to what enterprise teams need for AI agents:

  • Traces: A distributed trace represents one complete agent task—from the triggering user request, through each reasoning step, tool call, LLM invocation, and sub-agent handoff, to the final response. Each step is a span. In a 5-step LangGraph agent, you'll see 5+ spans nested under one root trace, with each LLM call carrying token counts and latency.
  • Metrics: Time-series counters and gauges for aggregate behaviour: tokens consumed per minute, average step latency, error rates, tool call success percentages. These power your Grafana dashboards and PagerDuty alerts.
  • Logs: Structured logs correlated to trace IDs—so when you see a spike in P95 latency on the dashboard, you can drill into the exact trace and read the prompts, tool outputs, and intermediate reasoning that caused the slowdown.

When all three signals are correlated through a shared trace_id, you get what the industry calls observability—the ability to ask arbitrary questions about your system's behaviour without deploying new code.
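As a minimal stdlib-only sketch of that correlation (the helper names here are mine, not an OTel API), a structured log line that carries the active trace_id is all a backend needs to join logs to traces:

```python
import json
import logging
import secrets
from io import StringIO

def new_trace_id() -> str:
    """Generate a 128-bit trace id as 32 lowercase hex chars (W3C format)."""
    return secrets.token_hex(16)

def correlated_log(logger: logging.Logger, trace_id: str, message: str, **fields) -> None:
    """Emit one structured log line stamped with the trace_id for drill-down."""
    logger.info(json.dumps({"trace_id": trace_id, "msg": message, **fields}))

# Wire a logger to an in-memory buffer so the output is easy to inspect
buffer = StringIO()
logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(buffer))

tid = new_trace_id()
correlated_log(logger, tid, "tool call finished", tool="database_search", latency_ms=420)
```

In a real deployment the OTel logging instrumentation stamps trace_id onto records automatically; the point is simply that the shared identifier is what turns three separate signals into one navigable story.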

Why Traditional APM Fails for Agentic AI

I've spent 25+ years building production systems at JPMorgan, Deutsche Bank, and Morgan Stanley. When we first deployed LLM-powered agents at enterprise scale in 2024, the instinct from our DevOps teams was to reach for Datadog or New Relic—tools that had served us well for microservices. Within weeks, it was clear those tools were blind to everything that mattered in AI agent behaviour.

Here's the fundamental problem: traditional APM is designed for deterministic systems. If a database query takes 200ms instead of 10ms, the APM alert fires and you know exactly which query to optimise. But an AI agent's "latency" is a function of prompt complexity, model load, the number of reasoning iterations, tool call latency, and whether the model decided to use its full context window. APM sees one opaque HTTP POST to api.openai.com—OTel sees all of that.

Capability                           | Traditional APM           | OTel for AI Agents
-------------------------------------|---------------------------|--------------------------------
LLM model name & version per call    | ❌                        | ✅ gen_ai.request.model
Token count (input/output/total)     | ❌                        | ✅ gen_ai.usage.*_tokens
Tool call name & input/output        | ❌                        | ✅ gen_ai.tool.name
Agent step / iteration tracking      | ❌                        | ✅ Custom span attributes
Cross-agent distributed trace        | ⚠️ HTTP only              | ✅ W3C TraceContext anywhere
Prompt/response logging (redacted)   | ❌                        | ✅ OTel Log events
Vendor-neutral (no lock-in)          | ❌ Vendor-specific agents | ✅ OTLP to any backend

The business cost of this observability gap is real. According to the 2025 DORA State of DevOps Report (AI Supplement), enterprises without dedicated LLM observability pipelines experience 4.2× longer mean time to resolution (MTTR) for AI agent production incidents compared to teams with full OTel instrumentation. At $50,000+ per hour of SLA breach for Tier-1 financial services applications, that's not an academic concern.

Step-by-Step Production Setup Guide

The following guide covers Python-based agents (the dominant enterprise choice in 2026) using LangChain, LangGraph, or CrewAI. Node.js equivalents are noted where they differ.

Step 1: Install OpenLLMetry (Auto-Instrumentation)

OpenLLMetry by Traceloop is the de facto standard for Python agent instrumentation. It wraps the OpenAI SDK, Anthropic SDK, LangChain, LangGraph, CrewAI, and LlamaIndex and emits OTel-compliant spans automatically.

# Core OTel packages
pip install opentelemetry-api \
            opentelemetry-sdk \
            opentelemetry-exporter-otlp-proto-grpc

# OpenLLMetry auto-instrumentation
pip install traceloop-sdk \
            opentelemetry-instrumentation-openai \
            opentelemetry-instrumentation-langchain \
            opentelemetry-instrumentation-anthropic \
            opentelemetry-instrumentation-crewai

Step 2: Initialise in Your Agent Entry Point

Add a short initialisation block to your agent application's startup code—before any LLM calls are made:

from traceloop.sdk import Traceloop

Traceloop.init(
    app_name="finance-research-agent",
    api_endpoint="http://otel-collector.monitoring.svc.cluster.local:4317",
    # Batch span export (recommended in prod; set True only for local debugging)
    disable_batch=False,
)

# Prompt/response content capture is controlled separately via the
# TRACELOOP_TRACE_CONTENT env var — set it to "false" in production
# for compliance, and leave it enabled in staging.

# That's it. All LangChain/OpenAI/Anthropic calls are now traced.
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph
# ... your agent code unchanged

Step 3: Add Custom Spans for Business Logic

Auto-instrumentation covers LLM calls, but you also want spans for your agent's orchestration logic—task routing decisions, memory retrieval, human-in-the-loop checkpoints:

from opentelemetry import trace
from traceloop.sdk.decorators import task, workflow

tracer = trace.get_tracer("finance-agent")

# Decorator-based spans (Traceloop DSL)
@workflow(name="research_workflow")
def run_research_workflow(query: str, user_id: str):
    with tracer.start_as_current_span("task_routing") as span:
        span.set_attribute("agent.user_id", user_id)
        span.set_attribute("agent.query_length", len(query))
        routed_agent = route_to_specialist(query)
        span.set_attribute("agent.routed_to", routed_agent)
    return execute_agent(routed_agent, query)

@task(name="memory_retrieval")
def retrieve_context(query: str) -> list:
    # Vector DB call automatically timed and traced
    results = vector_db.similarity_search(query, k=5)
    span = trace.get_current_span()
    span.set_attribute("memory.results_count", len(results))
    return results

Step 4: Configure W3C Context Propagation for Multi-Agent Systems

When Agent A calls Agent B over HTTP (a common pattern in LangGraph multi-agent architectures), you must propagate trace context so both agents appear in the same distributed trace:

from opentelemetry.propagate import inject
import httpx

# Agent A — inject context into outgoing request
def call_sub_agent(payload: dict) -> dict:
    headers = {}
    inject(headers)  # Adds 'traceparent' header automatically
    response = httpx.post(
        "http://sub-agent-svc.agents.svc.cluster.local/run",
        json=payload,
        headers=headers,
    )
    return response.json()

# Agent B (FastAPI) — extract context from incoming request
from fastapi import FastAPI, Request
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-extracts W3C headers

@app.post("/run")
async def run_agent(request: Request, payload: dict):
    # Trace context already restored by FastAPIInstrumentor
    # This span is automatically a child of Agent A's span
    ...
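Under the hood, inject() writes a traceparent header in the W3C format version-traceid-spanid-flags. A stdlib-only sketch of building and parsing that header (the helper names are mine, for illustration) shows exactly what crosses the wire between agents:

```python
import re
import secrets
from typing import Optional

def make_traceparent(trace_id: Optional[str] = None,
                     span_id: Optional[str] = None,
                     sampled: bool = True) -> str:
    """Build a W3C traceparent header: 00-<32 hex>-<16 hex>-<2 hex flags>."""
    trace_id = trace_id or secrets.token_hex(16)   # 128-bit trace id
    span_id = span_id or secrets.token_hex(8)      # 64-bit parent span id
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str) -> Optional[dict]:
    """Parse a version-00 traceparent header; return None if malformed."""
    m = _TRACEPARENT.match(header)
    if m is None:
        return None
    return {
        "trace_id": m.group(1),
        "parent_span_id": m.group(2),
        "sampled": m.group(3) == "01",
    }
```

The shared trace_id is what lets the backend stitch Agent A's and Agent B's spans into one tree; the span_id identifies the parent span, so the receiving side knows exactly where to attach its children.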

Deploying the OTel Collector on Kubernetes

The OTel Collector is the central hub that receives telemetry from your agents, processes it (batching, filtering, enrichment), and exports it to your backends. Deploy it as a DaemonSet (one per node) for low-latency local collection, plus a Deployment-based Gateway Collector for aggregation and fan-out to multiple backends.

Helm Deployment

# Add the OTel Helm chart repo
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

# Deploy DaemonSet collector
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace monitoring \
  --create-namespace \
  --set mode=daemonset \
  --values otel-collector-values.yaml

Your otel-collector-values.yaml for an AI agent fleet:

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    batch:
      send_batch_size: 1024
      timeout: 5s
    memory_limiter:
      limit_mib: 512
      spike_limit_mib: 128
    # Tag all spans with Kubernetes metadata
    k8sattributes:
      auth_type: serviceAccount
      passthrough: false
      extract:
        metadata: [k8s.pod.name, k8s.namespace.name, k8s.deployment.name]
    # Filter out health-check noise
    filter/drop_health:
      spans:
        exclude:
          match_type: strict
          span_names: ["health_check", "/healthz"]
    # Tail-based sampling: keep all errors + 10% of successes
    tail_sampling:
      decision_wait: 10s
      num_traces: 100000
      policies:
        - name: errors-policy
          type: status_code
          status_code: {status_codes: [ERROR]}
        - name: slow-spans
          type: latency
          latency: {threshold_ms: 5000}
        - name: sample-10-percent
          type: probabilistic
          probabilistic: {sampling_percentage: 10}

  exporters:
    # Traces → Grafana Tempo
    otlp/tempo:
      endpoint: tempo.monitoring.svc.cluster.local:4317
      tls:
        insecure: true
    # Metrics → Prometheus
    prometheus:
      endpoint: 0.0.0.0:8889
    # Logs → Loki
    loki:
      endpoint: http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, filter/drop_health, tail_sampling, batch]
        exporters: [otlp/tempo]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [prometheus]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [loki]

Grafana Stack Integration

With the above config, deploy the Grafana LGTM stack (Loki + Grafana + Tempo + Mimir/Prometheus) using the community Helm umbrella chart. Import the AI agent observability dashboard from Grafana Labs (ID: 19268) to get pre-built panels for LLM latency, token spend, and error rates out of the box.

Essential Metrics, Dashboards & Alerts

Observability is only valuable if it drives action. Here are the five metrics your SRE team must alert on—and the PromQL queries to back them up.

1. Token Throughput & Cost Rate

# Total tokens per minute by model
sum by (gen_ai_request_model) (rate(gen_ai_usage_total_tokens_total[1m]))

# Alert: Cost spike > 200% of 1-hour average
alert: LLMTokenCostSpike
expr: rate(gen_ai_usage_total_tokens_total[5m]) > 2 * avg_over_time(rate(gen_ai_usage_total_tokens_total[5m])[1h:5m])

2. Agent Step Latency P95

# P95 latency of all LLM calls
histogram_quantile(0.95, sum by (le) (rate(gen_ai_client_operation_duration_seconds_bucket[5m])))

# Alert: P95 > 10 seconds for 3 consecutive minutes
alert: LLMHighLatency
expr: histogram_quantile(0.95, sum by (le) (rate(gen_ai_client_operation_duration_seconds_bucket[5m]))) > 10
for: 3m

3. Tool Call Success Rate

# Tool call error rate
sum(rate(gen_ai_tool_calls_total{status="error"}[5m]))
/
sum(rate(gen_ai_tool_calls_total[5m]))

# Alert: Tool error rate > 5%
alert: ToolCallFailureHigh
expr: (sum(rate(gen_ai_tool_calls_total{status="error"}[5m])) / sum(rate(gen_ai_tool_calls_total[5m]))) > 0.05

4. Agent Goal Completion Rate

This requires a custom metric emitted at the end of your agent workflow:

from opentelemetry import metrics

meter = metrics.get_meter("finance-agent")
task_counter = meter.create_counter(
    "agent.task.completions",
    description="Agent task completion count by outcome",
    unit="1",
)

# At end of task execution:
task_counter.add(1, {"outcome": "success", "agent_type": "research"})
# or
task_counter.add(1, {"outcome": "failure", "failure_reason": "tool_timeout", "agent_type": "research"})
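Once that counter is exported to Prometheus, the goal completion rate follows from a ratio query. The exact exported name depends on your exporter configuration; a common convention maps agent.task.completions to agent_task_completions_total:

```promql
# Goal completion rate over the last 5 minutes, broken down by agent type
sum by (agent_type) (rate(agent_task_completions_total{outcome="success"}[5m]))
/
sum by (agent_type) (rate(agent_task_completions_total[5m]))
```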

5. LLM Provider Error Rate (429s, 500s, Timeouts)

# Rate of LLM API errors, broken down by error type
sum by (error_type) (rate(gen_ai_client_operation_duration_seconds_count{error_type!=""}[5m]))

# Alert: Any rate limit errors in production
alert: LLMRateLimitErrors
expr: increase(gen_ai_client_operation_duration_seconds_count{error_type="rate_limit_exceeded"}[5m]) > 0

7 Common OTel Mistakes in AI Agent Projects

Having reviewed dozens of enterprise AI observability implementations, these are the mistakes I see most often:

  1. 100% head-based sampling in production. Capturing every trace of a high-volume agent fleet will overwhelm your Tempo/Jaeger storage within days. Use tail-based sampling with an error-keep policy from day one.
  2. Logging raw prompts in production. Prompts often contain PII, financial data, or health information. Use OTel's SpanProcessor to redact sensitive patterns before export—or disable prompt logging entirely in prod and enable it only in staging environments.
  3. Not propagating context across async task queues. If your agent dispatches work via Celery, Kafka, or AWS SQS, you must serialise the OTel context into the message headers and extract it in the consumer. Many teams miss this, creating broken traces that stop at the queue boundary.
  4. Treating token count as a vanity metric. Token counts are your cost ledger. Attribute them to business units (via span attributes like business.team and business.use_case) so finance can see which AI application is responsible for which spend.
  5. Skipping the OTel Collector and exporting directly to backends. Direct export couples your agent code to specific backends and makes it impossible to fan out to multiple destinations (e.g., Tempo for traces AND Langfuse for LLM-specific analytics). Always use the Collector as an intermediary.
  6. Not instrumenting retrieval steps in RAG pipelines. Vector DB queries (Pinecone, Weaviate, pgvector) are often the hidden latency culprit in RAG pipelines. Use opentelemetry-instrumentation-weaviate or manual spans to capture retrieval latency and result counts.
  7. Ignoring the OTel Collector's resource limits. The Collector runs on the same nodes as your agents. Without memory_limiter processor configured, a telemetry spike can OOM the Collector pod and silently drop all telemetry. Always configure memory_limiter and set Kubernetes resource requests/limits.
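To illustrate mistake 3, the pattern mirrors what inject()/extract() do for HTTP: carry the context headers inside the message itself. Here is a stdlib-only sketch using a local queue as a stand-in for Kafka/SQS/Celery; in production you would call opentelemetry.propagate.inject and extract on the headers dict rather than passing the traceparent string by hand:

```python
import json
import queue

def publish(q: "queue.Queue[str]", payload: dict, traceparent: str) -> None:
    """Producer side: serialise trace context into the message headers."""
    message = {"headers": {"traceparent": traceparent}, "body": payload}
    q.put(json.dumps(message))

def consume(q: "queue.Queue[str]") -> tuple:
    """Consumer side: restore context before starting the consumer-side span."""
    message = json.loads(q.get())
    return message["headers"].get("traceparent"), message["body"]

# A local queue standing in for the real broker
broker: "queue.Queue[str]" = queue.Queue()
publish(broker, {"task": "compliance_check"}, "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01")
ctx, body = consume(broker)
```

Without this step, the consumer starts a brand-new trace and the dashboard shows two disconnected fragments instead of one end-to-end execution.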

Real Enterprise Example: Financial Services Agent Fleet

One of our enterprise training clients—a tier-1 asset management firm—deployed a fleet of 12 specialised AI agents (market research, portfolio rebalancing, compliance checking, client reporting) on Kubernetes. Here's how their OTel implementation unfolded and what the numbers looked like 90 days post-deployment.

The Challenge

Before OTel, the team was flying blind. The compliance-checking agent was mysteriously failing on ~3% of requests, but because the failures happened inside the agent's reasoning loop (not at the API layer), their existing Datadog setup showed no errors. Users were seeing "analysis unavailable" messages with no explanation. MTTR was running at 4+ hours because engineers had to manually add print statements and redeploy to debug.

The Implementation (2-Week Sprint)

  • Week 1: Deployed OTel Collector DaemonSet + Grafana LGTM stack on the existing Kubernetes cluster. Instrumented all 12 agents with OpenLLMetry (Traceloop.init()). Total code change across all agents: ~15 lines each.
  • Week 2: Added custom spans for business logic (task routing, compliance rule lookups, report generation). Built 5 Grafana dashboards. Configured 8 PagerDuty alerts (token cost, latency, error rates).

What They Found

Within 48 hours of enabling traces, the team identified the compliance-checking agent failure: a specific tool call to their internal regulatory database was timing out after exactly 30 seconds when querying historical data older than 5 years. The agent was silently swallowing the timeout and returning an empty compliance check (marked as "passed"). OTel surfaced this via a P99 latency spike on the compliance_db_lookup span correlated with the agent.task.completions{outcome="failure"} counter.

90-Day Results

  • 87% reduction in MTTR (4 hrs → 31 min)
  • $41K monthly LLM cost saved via token optimisation
  • 99.4% agent goal completion rate (up from 97.1%)
  • 2 weeks implementation time for full OTel coverage

The token optimisation savings ($41K/month) came directly from the OTel metrics dashboard revealing that the market research agent was sending the full document corpus in every prompt—a prompt engineering issue that was invisible without per-call token count telemetry. With the data in hand, the team refactored to selective context injection in one sprint.

This is the kind of data-driven AI operations capability we teach in our Agentic AI for Enterprise workshop—where participants build fully instrumented multi-agent systems with OTel, Grafana, and LangGraph in a 5-day hands-on lab environment. Our training earns an Oracle-verified 4.91/5.0 rating because every module is grounded in real production patterns like this one.

Frequently Asked Questions

What is OpenTelemetry for AI agents?

OpenTelemetry (OTel) for AI agents is the application of the CNCF open-source observability framework to instrument LLM calls, tool invocations, memory reads, and agent decision loops. It produces vendor-neutral traces, metrics, and logs that reveal exactly what your AI agent did, how long each step took, how many tokens were consumed, and where failures occurred. The OpenLLMetry SDK and the OTel GenAI semantic conventions (introduced in 2024) standardise the span attributes—such as gen_ai.system, gen_ai.request.model, and gen_ai.usage.input_tokens—so data flows into any backend including Grafana, Jaeger, Langfuse, or Arize.

How is OTel different from traditional APM for AI agents?

Traditional APM tools track HTTP latency, database queries, and CPU—none of which capture the semantics of LLM behaviour. OpenTelemetry for AI agents adds LLM-specific spans (prompt tokens, completion tokens, model name, temperature, tool call names, agent step IDs) and propagates trace context across multi-agent hops. Where APM sees a single 2-second HTTP POST, OTel reveals it was an agent calling GPT-4o with a 1,200-token prompt, then invoking a search tool, and finally writing to a vector database—each with its own latency and cost attribution.

Which OpenTelemetry SDK should I use for LLM and agent instrumentation?

For Python-based agents (LangChain, LangGraph, CrewAI, AutoGen), use OpenLLMetry by Traceloop, which provides auto-instrumentation packages such as opentelemetry-instrumentation-langchain and opentelemetry-instrumentation-openai. For TypeScript/Node.js agents, use the @traceloop/node-server-sdk. Both emit spans conforming to the OTel GenAI semantic conventions, ensuring your data is portable across backends. If you build custom agents, manually create spans with the opentelemetry-api package and set gen_ai.* attributes on each LLM call.

How do I propagate trace context across multiple AI agents?

Use W3C TraceContext propagation headers (traceparent, tracestate) when one agent calls another over HTTP or a message queue. For in-process orchestrators like LangGraph, the OTel context is propagated automatically via Python's contextvars. For cross-service hops (e.g., an orchestrator agent calling a sub-agent microservice on Kubernetes), inject the W3C headers into your HTTP client and extract them in the receiving service—this creates a single distributed trace spanning all agents, enabling root cause analysis across the full execution chain.

What metrics should I collect for AI agents in production?

The five most critical AI agent production metrics are: (1) Token throughput—input and output tokens per second per model to track cost and capacity; (2) Agent step latency P50/P95/P99—to detect slow tool calls or LLM degradation; (3) Tool call success rate—percentage of tool invocations that succeed without retry; (4) Agent goal completion rate—percentage of tasks completed successfully end-to-end; and (5) LLM error rate—rate of 429 (rate limit), 500, and timeout errors from the model provider. All five can be emitted as OTel metrics using the gen_ai.* namespace and visualised in Grafana with Prometheus as the backend.

Conclusion: OTel Is the Production Readiness Gate for AI Agents

In 2026, deploying AI agents without OpenTelemetry instrumentation is the equivalent of running a Kubernetes cluster without Prometheus—technically possible, practically reckless. The non-deterministic, multi-step nature of agentic AI systems creates failure modes that are invisible to traditional APM: silent hallucinations, compounding tool failures, runaway token costs, and cross-agent trace gaps that make debugging a guesswork exercise.

The good news: the tooling has matured enormously. OpenLLMetry's auto-instrumentation means most teams go from zero to full LLM tracing in a single afternoon. The OTel GenAI semantic conventions give you a standardised vocabulary that works with any backend. And the 5 production metrics covered in this guide give your SRE and finance teams the data they need to operate and optimise AI agent fleets at enterprise scale.

The firms that will win the AI productivity race in 2026 aren't necessarily those with the best models—they're the ones who can operate AI reliably: ship fast, catch failures early, and optimise cost continuously. That operational discipline starts with instrumentation, and instrumentation starts with OpenTelemetry.

Build Production-Ready AI Agents — With Full Observability Included

Our 5-day Agentic AI for Enterprise workshop covers OpenTelemetry instrumentation, LangGraph multi-agent systems, Kubernetes deployment, and production operations—hands-on, from day one. Oracle-verified 4.91/5.0 rating across 200+ enterprise participants.

Explore Enterprise Training →