The Observability Gap: What Prometheus Cannot See
I spent years building observability platforms at JPMorgan and Deutsche Bank. The mantra was always the same: instrument everything, alert on thresholds, hire smart SREs. It worked — until microservices and Kubernetes made the failure surface orders of magnitude more complex.
Here is the uncomfortable truth about Prometheus-only observability in 2026:
- You can only alert on failure modes you already know about. PromQL expressions are human-authored. Novel failure patterns — a gradual memory leak in a specific pod, a correlated slowdown across three services during a traffic spike, a subtle CPU throttling cascade — go undetected until they become outages.
- Alert fatigue is built into threshold alerting. Gartner reports the average Kubernetes cluster generates 4,000+ alerts per day. Teams progressively loosen alert thresholds to cut the noise, which means progressively more genuine incidents slip through.
- Correlation is manual and slow. A P1 incident at 3 AM typically requires a senior SRE to manually correlate metrics from Prometheus, logs from Loki or Elasticsearch, and traces from Jaeger or Tempo. Mean Time To Detect (MTTD) averages 28 minutes. Mean Time To Resolve (MTTR) averages 4.5 hours. Neither number is acceptable.
The gap is not a tooling failure — Prometheus is genuinely excellent at what it does. The gap is a cognitive failure: the volume and complexity of telemetry data has exceeded human capacity to reason about it in real time.
AIOps is the solution to this cognitive gap. It does not replace Prometheus; it reasons over its output at machine speed.
The AIOps Architecture That Closes the Gap
A production AIOps stack on Kubernetes has four layers. Think of it as a pyramid built on top of your existing LGTM stack (Loki, Grafana, Tempo, Mimir):
```
┌──────────────────────────────────────────────────┐
│ Layer 4: AUTONOMOUS REMEDIATION                  │
│ keephq/keep · AWS DevOps Agent · custom runbooks │
├──────────────────────────────────────────────────┤
│ Layer 3: AI ROOT CAUSE ANALYSIS                  │
│ HolmesGPT · LangGraph SRE agent · LLM router     │
├──────────────────────────────────────────────────┤
│ Layer 2: ANOMALY DETECTION & CORRELATION         │
│ Prophet · Isolation Forest · ADTK · Arize        │
├──────────────────────────────────────────────────┤
│ Layer 1: UNIFIED TELEMETRY (existing stack)      │
│ OpenTelemetry → Prometheus + Loki + Tempo        │
└──────────────────────────────────────────────────┘
```
Each layer feeds the one above. OpenTelemetry-correlated data makes anomaly detection vastly more accurate (signals are labeled, structured, and cross-referenced). Anomaly alerts with correlation context make AI root cause analysis precise. Precise root cause enables confident autonomous remediation.
The key architectural principle: never throw away your existing stack. Every LGTM investment continues to pay dividends. The AIOps layers are additive, not replacement.
Why This Matters for Enterprise Kubernetes
At a typical enterprise running 50+ microservices across 3 clusters, the math is compelling. Assume 1 P1 incident per week (conservative) and a four-engineer incident bridge, both typical at this scale: reducing MTTR from 4.5 hours to 45 minutes saves 3.75 hours × 4 engineers × 13 weeks ≈ 195 engineering-hours per quarter, more than a full engineer-month. That calculation doesn't include the downstream revenue impact of reduced downtime.
Building an AI Anomaly Detection Layer on Kubernetes
The most practical starting point is deploying an Isolation Forest model against your Prometheus metrics. It is unsupervised (no labeled training data needed), computationally cheap, and effective for multivariate metric anomalies.
Here is a workable starting pattern: a Python sidecar that scrapes Prometheus and publishes anomaly alerts back to Alertmanager:
anomaly-detector/detector.py

```python
import time
from collections import deque

import numpy as np
import requests
from sklearn.ensemble import IsolationForest

PROM_URL = "http://prometheus-kube-prometheus-prometheus:9090"
ALERT_WEBHOOK = "http://alertmanager:9093/api/v2/alerts"
WINDOW = 60           # 60 data points (~60 minutes at a 1m scrape interval)
CONTAMINATION = 0.05  # assume ~5% of readings are anomalous

# PromQL expressions to monitor
METRICS = [
    'rate(container_cpu_usage_seconds_total[5m])',
    'container_memory_working_set_bytes',
    'rate(kube_pod_container_status_restarts_total[5m])',
    'http_request_duration_p99',
]

history = {m: deque(maxlen=WINDOW) for m in METRICS}
model = IsolationForest(contamination=CONTAMINATION, random_state=42)

def query_prom(metric):
    """Run an instant PromQL query; return one value per matching series."""
    r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": metric}, timeout=10)
    r.raise_for_status()
    results = r.json().get("data", {}).get("result", [])
    return [float(series["value"][1]) for series in results]

def fire_alert(metric, score, value):
    """Push an anomaly alert to the Alertmanager v2 API."""
    alert = [{
        "labels": {
            "alertname": "AnomalyDetected",
            "severity": "warning",
            "metric": metric,
        },
        "annotations": {
            "summary": f"AI anomaly in {metric} (score={score:.3f}, value={value:.4f})",
            "runbook_url": "https://devops.gheware.com/runbooks/anomaly-response",
        },
    }]
    requests.post(ALERT_WEBHOOK, json=alert, timeout=10)
    print(f"[ALERT] {metric} | score={score:.3f}")

while True:
    for metric in METRICS:
        values = query_prom(metric)
        if not values:
            continue
        # Note: values from all matching series are pooled into one window.
        history[metric].extend(values)
        if len(history[metric]) >= 20:  # wait for a minimal training window
            X = np.array(list(history[metric])).reshape(-1, 1)
            model.fit(X)  # retrain on the rolling window every cycle
            latest = np.array(values).reshape(-1, 1)
            scores = model.decision_function(latest)  # negative = anomalous
            for i, s in enumerate(scores):
                if s < -0.1:  # empirical anomaly threshold
                    fire_alert(metric, s, values[i])
    time.sleep(60)
```
Deploy this as a Kubernetes Deployment with a ConfigMap for metric targets. The model trains on a rolling 60-minute window and fires Alertmanager alerts for anomalous readings — no manual threshold tuning required.
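As a sketch of that Deployment (the image name, namespace, and resource sizes are placeholders, and it assumes detector.py is adapted to read its endpoints from the environment rather than hard-coding them):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: anomaly-detector
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: anomaly-detector
  template:
    metadata:
      labels:
        app: anomaly-detector
    spec:
      containers:
        - name: detector
          image: registry.example.com/anomaly-detector:0.1.0  # placeholder image
          env:
            - name: PROM_URL
              value: "http://prometheus-kube-prometheus-prometheus:9090"
            - name: ALERT_WEBHOOK
              value: "http://alertmanager:9093/api/v2/alerts"
          resources:
            requests: {cpu: 100m, memory: 256Mi}
            limits: {cpu: 500m, memory: 512Mi}
```

The metric list itself is a good candidate for the ConfigMap, so new targets can be added without rebuilding the image.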
Advanced Option: ADTK for Kubernetes Metrics
ADTK (Anomaly Detection Toolkit) provides higher-level abstractions for common Kubernetes anomaly patterns: seasonal decomposition for traffic (handles daily/weekly patterns), level shift detection for memory leaks, and persist anomaly detection for stuck pods. For production, ADTK is often a faster path than raw scikit-learn:
```python
from adtk.detector import SeasonalAD, LevelShiftAD, PersistAD
import pandas as pd

# Detect seasonal anomalies in request latency (daily rhythm)
seasonal = SeasonalAD(c=3.0, side="positive")

# Detect memory leaks (level shift upward)
level_shift = LevelShiftAD(c=6.0, side="positive", window=10)

# Detect hung pods (persist anomaly: metric frozen for > 5 points)
persist = PersistAD(c=3.0, side="both", window=5)

# Apply to Prometheus time series (fetched as pandas Series)
latency_anomalies = seasonal.fit_detect(latency_series)
memory_anomalies = level_shift.fit_detect(memory_series)
hung_pod_anomalies = persist.fit_detect(restart_series)
```
Autonomous Incident Response: HolmesGPT + keephq/keep
Anomaly detection fires the alert. Now what? In a traditional stack, an SRE gets paged. In an AIOps stack, an AI agent investigates first — and resolves if it can.
HolmesGPT: Your AI SRE on Kubernetes
HolmesGPT (CNCF Sandbox) is one of the most production-ready open-source AI SRE tools available today. When an Alertmanager alert fires, HolmesGPT:
- Fetches relevant pod logs, events, and metrics from the Kubernetes API
- Correlates them across the call graph using OpenTelemetry trace IDs (if available)
- Sends the full context bundle to a configurable LLM (GPT-4o, Claude, or a local Ollama model)
- Returns a root cause summary and recommended next action within 15–30 seconds
holmesgpt-values.yaml (Helm)

```yaml
holmesgpt:
  enabled: true
  llm:
    provider: "anthropic"
    model: "claude-3-5-sonnet-20241022"
    apiKeySecret:
      name: "holmes-secrets"
      key: "ANTHROPIC_API_KEY"
  alertmanager:
    url: "http://alertmanager:9093"
  integrations:
    slack:
      enabled: true
      channel: "#sre-alerts"
      webhookSecret:
        name: "holmes-secrets"
        key: "SLACK_WEBHOOK"
    pagerduty:
      enabled: true
      routingKey: "your-pd-routing-key"
  # Auto-remediation: safe actions only (restart, scale)
  remediation:
    enabled: true
    safeActions:
      - "kubectl rollout restart"
      - "kubectl scale --replicas"
    requireApproval: true  # Slack approval before execution
```
keephq/keep: Intelligent Alert Management and Autonomous Runbooks
Keep complements HolmesGPT by handling the alert management layer: deduplication, correlation across multiple alerting sources (Prometheus, Datadog, CloudWatch), and executing structured runbooks when confidence is high.
A Keep runbook for a common Kubernetes scenario (OOMKilled pod):
runbooks/oom-killed.yaml

```yaml
id: oom-killed-auto-heal
description: "Auto-heal OOMKilled pods: increase limits and restart"
triggers:
  - type: alert
    filters:
      - key: name
        value: "KubePodCrashLooping"
      - key: annotations.reason
        value: "OOMKilled"
steps:
  - name: get-pod-context
    provider:
      type: kubectl
      config: "{{ providers.k8s }}"
      with:
        command: "describe pod {{ alert.labels.pod }} -n {{ alert.labels.namespace }}"
  - name: llm-analysis
    provider:
      type: anthropic
      config: "{{ providers.claude }}"
      with:
        prompt: |
          Pod {{ alert.labels.pod }} was OOMKilled.
          Context: {{ steps.get-pod-context.results }}
          Recommend: new memory limits (in Mi), justification, and risk level (low/medium/high).
  - name: patch-memory-limits
    condition:  # Only auto-apply when the LLM assesses risk as low
      - type: threshold
        value: "{{ steps.llm-analysis.risk_level }}"
        compare_to: "low"
    provider:
      type: kubectl
      config: "{{ providers.k8s }}"
      with:
        command: >
          kubectl patch deployment {{ alert.labels.deployment }}
          -n {{ alert.labels.namespace }}
          --patch '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"{{ steps.llm-analysis.recommended_limit }}"}}}]}}}}'
  - name: notify-slack
    provider:
      type: slack
      config: "{{ providers.slack }}"
      with:
        message: |
          ✅ Auto-healed OOMKilled pod {{ alert.labels.pod }}
          New memory limit: {{ steps.llm-analysis.recommended_limit }}
          Justification: {{ steps.llm-analysis.justification }}
```
This runbook runs end-to-end in under 60 seconds. For a low-risk LLM assessment, no human intervention is needed. For medium/high risk, Keep posts to Slack with a one-click approve/reject button before applying any changes.
Unified Telemetry with OpenTelemetry: The AIOps Data Foundation
All of the above only works well if your telemetry data is correlated. An isolated CPU spike means little. The same spike correlated with a trace showing 99th-percentile latency on a specific service endpoint, and a log line showing database connection pool exhaustion — that's an actionable signal.
OpenTelemetry is the key to correlation. Every span carries a trace_id. When your anomaly detector identifies an anomalous metric at timestamp T, it can look up the trace IDs active at T, retrieve the full distributed trace, and give HolmesGPT a complete picture of the failure.
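As a sketch of that lookup, the helper below builds a Tempo search request for traces active in a window around the anomaly timestamp. The service URL, attribute names, and window size are assumptions to adapt to your cluster; Tempo's `/api/search` endpoint accepts `tags`, `start`, and `end` (unix seconds) parameters:

```python
import requests

TEMPO_URL = "http://tempo:3200"  # assumed in-cluster Tempo service (HTTP API port)

def trace_search_window(anomaly_ts: float, pod: str, namespace: str, window_s: int = 120):
    """Build Tempo search parameters for traces active around an anomaly timestamp."""
    return {
        "tags": f"k8s.pod.name={pod} k8s.namespace.name={namespace}",
        "start": int(anomaly_ts - window_s),  # Tempo expects unix seconds
        "end": int(anomaly_ts + window_s),
        "limit": 20,
    }

def traces_near_anomaly(anomaly_ts: float, pod: str, namespace: str) -> list[str]:
    """Return the trace IDs Tempo found in the anomaly window for this pod."""
    params = trace_search_window(anomaly_ts, pod, namespace)
    r = requests.get(f"{TEMPO_URL}/api/search", params=params, timeout=10)
    r.raise_for_status()
    return [t["traceID"] for t in r.json().get("traces", [])]
```

The returned trace IDs are exactly what gets bundled into the context handed to the RCA layer.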
Configure the OTel Collector to route to all three backends simultaneously:
otel-collector-config.yaml

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod

processors:
  # Enrich all signals with K8s metadata (pod, node, namespace, deployment)
  k8sattributes:
    passthrough: false
    extract:
      metadata: [k8s.pod.name, k8s.namespace.name, k8s.deployment.name, k8s.node.name]
  # Tail-based sampling: capture ALL traces for anomalous requests
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: high-latency-policy
        type: latency
        latency: {threshold_ms: 500}
      - name: always-sample-anomaly
        type: string_attribute
        string_attribute: {key: "aiops.anomaly", values: ["true"]}

exporters:
  # Metrics → Prometheus/Mimir
  prometheusremotewrite:
    endpoint: "http://mimir:9009/api/v1/push"
  # Traces → Tempo
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  # Logs → Loki
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
    labels:
      attributes:
        k8s.pod.name: "pod"
        k8s.namespace.name: "namespace"
        k8s.deployment.name: "deployment"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, tail_sampling]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [k8sattributes]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [k8sattributes]
      exporters: [loki]
```
With this configuration, every piece of telemetry — metric, trace, log — carries the same Kubernetes metadata labels. The AIOps layer can perform precise cross-signal lookups: "show me all logs and traces associated with the pod that just triggered an anomaly alert."
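One concrete form of that lookup: fetch the anomalous pod's logs from Loki using the same label names the exporter above attaches (`pod`, `namespace`). A minimal sketch; the in-cluster service URL and window size are assumptions:

```python
import requests

LOKI_URL = "http://loki:3100"  # assumed in-cluster Loki service

def loki_pod_logs_params(pod: str, namespace: str, anomaly_ts: float, window_s: int = 300):
    """Build /loki/api/v1/query_range parameters for the anomalous pod's logs."""
    start_s = int(anomaly_ts - window_s)
    end_s = int(anomaly_ts + window_s)
    return {
        # Label names match the Loki exporter mapping above (pod, namespace)
        "query": f'{{namespace="{namespace}", pod="{pod}"}}',
        "start": start_s * 1_000_000_000,  # Loki expects nanosecond timestamps
        "end": end_s * 1_000_000_000,
        "limit": 500,
    }

def pod_logs_near_anomaly(pod: str, namespace: str, anomaly_ts: float) -> list[str]:
    """Fetch raw log lines for the pod in a window around the anomaly."""
    params = loki_pod_logs_params(pod, namespace, anomaly_ts)
    r = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params, timeout=10)
    r.raise_for_status()
    streams = r.json().get("data", {}).get("result", [])
    return [line for s in streams for _, line in s.get("values", [])]
```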
Learn how to integrate this with your existing Kubernetes setup in our OpenTelemetry for AI Agents guide, or see how it supports broader GenAI architecture patterns on Kubernetes.
Real-World AIOps Implementation Roadmap (90 Days)
Based on enterprise deployments across financial services and tech companies, here is a realistic 90-day AIOps adoption timeline. The key principle: prove value fast at each layer before moving to the next.
| Phase | Days | Deliverable | Success Metric |
|---|---|---|---|
| 1 — Telemetry Unification | 1–21 | OTel Collector deployed, all services instrumented, correlated metrics/traces/logs in Grafana | 100% of P1 services instrumented |
| 2 — Anomaly Detection | 22–45 | Isolation Forest detector running on top 10 metrics; anomaly alerts in Alertmanager | Detect 1+ real incident before human pages |
| 3 — AI Root Cause | 46–65 | HolmesGPT deployed, posting RCA summaries to Slack for every P1/P2 alert | MTTD < 5 minutes for 80% of incidents |
| 4 — Autonomous Runbooks | 66–90 | keephq/keep live with 5 low-risk runbooks; Slack approval flow for medium-risk | 30%+ of incidents resolved without manual SRE action |
Enterprise Pitfalls to Avoid
- Don't start with autonomous remediation. Earn trust with RCA-only mode first. Teams accept AI suggestions long before they accept AI actions.
- Watch for model drift. Isolation Forest trained on normal weekday traffic will flag Black Friday spikes as anomalies. Implement seasonal retraining on a weekly cron.
- Define blast radius limits. Every autonomous runbook must have a maximum impact scope — e.g., "never scale above X replicas" or "never restart more than Y pods per hour." Hard-code these limits, not soft guidelines.
- Log every AI decision. Regulatory requirements (especially in financial services) require that every automated remediation action have a traceable audit log with the AI's reasoning. Store these in your existing SIEM.
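The seasonal-retraining advice above can be sketched as one Isolation Forest per time-of-week bucket, so Saturday traffic is never scored against a Tuesday baseline. Bucket granularity, the minimum-sample cutoff, and the retrain trigger (e.g. a weekly CronJob) are illustrative choices, not prescriptions:

```python
from datetime import datetime

import numpy as np
from sklearn.ensemble import IsolationForest

def season_bucket(ts: datetime) -> str:
    """Map a timestamp to a seasonal bucket: weekday vs weekend, per hour."""
    kind = "weekend" if ts.weekday() >= 5 else "weekday"
    return f"{kind}-{ts.hour:02d}"

class SeasonalDetector:
    """One Isolation Forest per time-of-week bucket, retrained from a rolling window."""

    def __init__(self, contamination: float = 0.05):
        self.contamination = contamination
        self.models = {}

    def retrain(self, samples):
        """samples: iterable of (datetime, value) pairs from the last few weeks."""
        buckets = {}
        for ts, value in samples:
            buckets.setdefault(season_bucket(ts), []).append(value)
        for key, vals in buckets.items():
            if len(vals) >= 20:  # skip buckets without enough points to fit
                X = np.array(vals).reshape(-1, 1)
                self.models[key] = IsolationForest(
                    contamination=self.contamination, random_state=42
                ).fit(X)

    def is_anomaly(self, ts: datetime, value: float) -> bool:
        model = self.models.get(season_bucket(ts))
        if model is None:
            return False  # no baseline for this bucket yet
        return bool(model.predict(np.array([[value]]))[0] == -1)
```

Run `retrain` on the weekly cron; between retrains, each new reading is scored only against its own bucket's baseline.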
For teams building the full AI-native DevOps pipeline, our Zero-Trust Security for AI Agents guide covers the security model for autonomous remediation agents in regulated environments.
Frequently Asked Questions
What is AIOps on Kubernetes?
AIOps on Kubernetes means adding an AI layer on top of your existing observability stack (Prometheus, Grafana, OpenTelemetry) that can detect anomalies, correlate multi-signal incidents, and trigger autonomous remediation — tasks traditional threshold-based alerting cannot handle.
Why is Prometheus not enough for modern Kubernetes observability?
Prometheus is excellent for metrics collection, but it can only alert on conditions you pre-define with PromQL. It cannot detect novel failure patterns, correlate logs with traces automatically, or take autonomous remediation actions. AIOps fills these gaps by applying machine learning to telemetry data in real time.
What is HolmesGPT and how does it help with Kubernetes incidents?
HolmesGPT is a CNCF-sandbox AI SRE tool that receives a PagerDuty or Alertmanager alert, fetches relevant logs, metrics, and events from your cluster, and uses an LLM to diagnose root cause and suggest runbook steps — all within seconds, before a human engineer is even paged.
How does OpenTelemetry enable AIOps?
OpenTelemetry provides a vendor-neutral way to collect traces, metrics, and logs in a unified schema. When this correlated telemetry is fed into a vector store or time-series anomaly model, the AI has a complete multi-signal view of each request's journey — dramatically improving root cause accuracy compared to siloed metric alerting.
Can AIOps tools replace on-call engineers?
Not entirely — but they dramatically reduce MTTR and alert fatigue. Tools like HolmesGPT, keephq/keep, and AWS DevOps Agent can handle 60–80% of routine incidents autonomously (restarts, scale-outs, config rollbacks), freeing engineers to focus on complex, novel failures that genuinely require human judgment.
Conclusion: The AIOps Layer Is No Longer Optional
In 2026, Kubernetes clusters are too complex and failure modes too numerous for purely human-reactive observability. The cognitive gap between the volume of telemetry produced and the human capacity to reason over it in real time has become the primary reliability bottleneck for enterprise engineering teams.
AIOps doesn't replace your Prometheus, Loki, Tempo, and Grafana investment. It multiplies it — by adding a reasoning layer that can detect what you didn't know to look for, correlate what you can't correlate manually at 3 AM, and act on what is safe to automate before your pager even buzzes.
The tools are production-ready today. HolmesGPT, keephq/keep, ADTK, and the OTel Collector give you everything needed to build all four layers of the AIOps stack on open-source foundations. The 90-day roadmap above is proven across enterprise deployments — the only prerequisite is starting.
If your team is ready to build AIOps capability from scratch — not just consume vendor tooling but actually build and operate these systems — our Agentic AI & DevOps training programs cover the full stack in 5 intensive hands-on days.