Your Kubernetes cluster is silently burning money right now. I have seen it at JPMorgan, Deutsche Bank, and in every enterprise I have trained over the last five years: clusters running at 15–25% average CPU utilisation while the cloud bill climbs every month. In 2026, the difference between the teams that solved this and those still chasing manual right-sizing tickets is one thing — AI agents.
This post is the practical playbook. I am going to show you exactly how AI agents automate Kubernetes cost optimisation — from intelligent right-sizing and idle workload elimination to FinOps-aware scheduling and autonomous chargeback reporting. By the end, you will have the architecture, the code patterns, and the KPIs to make your CFO's quarterly cloud review a source of pride rather than panic.
At JPMorgan, we had a team of four platform engineers whose primary job was Kubernetes right-sizing. They ran weekly reports, opened Jira tickets to application teams, waited for change control windows, and manually adjusted resource requests. It took on average three weeks from insight to action — by which time the workload profile had changed again.
This is the fundamental problem with manual FinOps in Kubernetes: the feedback loop is too slow for the rate of change. Deployments happen daily. Traffic patterns shift hourly. Seasonal events compress budgets. A human-driven process simply cannot keep pace.
The three root causes of Kubernetes waste are consistent across every enterprise I encounter:

1. **Over-provisioned resource requests** — teams copy generous defaults into every manifest and never revisit them.
2. **Idle workloads** — forgotten demos, abandoned experiments, and stale environments that keep running with non-zero replicas.
3. **Scheduling fragmentation** — pods spread thinly across nodes, so cluster utilisation stays low even when individual requests are sane.
The hidden multiplier: Over-provisioned requests do not just waste compute — they inflate your cluster autoscaler's minimum footprint. A team requesting 10× what they need forces the autoscaler to maintain nodes that are 90% idle. Fix the requests, and node count drops automatically.
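The multiplier is easy to quantify. Here is a minimal sketch with illustrative numbers (the 16-vCPU node size and the 10× over-request are assumptions, not figures from any specific cluster):

```python
import math

def nodes_needed(total_vcpus_requested: float, vcpus_per_node: float) -> int:
    """The cluster autoscaler sizes the node pool to requests, not to actual usage."""
    return math.ceil(total_vcpus_requested / vcpus_per_node)

# A team actually using 20 vCPU but requesting 10x that amount, on 16-vCPU nodes:
actual_usage, requested = 20.0, 200.0
print(nodes_needed(requested, 16))      # nodes the autoscaler must keep warm: 13
print(nodes_needed(actual_usage, 16))   # nodes needed after right-sizing: 2
```

Fixing the requests in this toy example drops the footprint from 13 nodes to 2, with no change to the workload itself.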
The architecture I recommend in our Agentic AI Workshop follows a four-agent pattern that maps directly to the FinOps observe-optimise-operate cycle:
The first agent, the Cost Observer, continuously ingests cost and utilisation data from the Kubernetes Metrics Server, Prometheus, and your cloud provider's cost APIs (AWS Cost Explorer, GCP Billing Export, Azure Cost Management). It computes per-pod, per-namespace, and per-team cost attribution in near real time.
```python
# Cost Observer Agent — LangGraph state definition
import operator
from typing import TypedDict, Annotated, List

from langgraph.graph import StateGraph, START, END

class FinOpsState(TypedDict):
    cluster_metrics: dict       # Raw Prometheus/metrics-server data
    cost_attribution: dict      # Per-namespace cost breakdown
    anomalies: Annotated[List, operator.add]        # Cost spikes detected
    recommendations: Annotated[List, operator.add]
    approved_actions: List
    executed_actions: List

def observe_costs(state: FinOpsState) -> FinOpsState:
    """Pull Prometheus metrics and compute cost attribution."""
    # PrometheusClient, get_cloud_cost_rate, get_requested_vcpus and
    # get_namespace_owner are project-specific helpers.
    prom = PrometheusClient("http://prometheus:9090")

    # Actual vs requested resource efficiency per namespace
    efficiency_query = """
    sum by (namespace) (
      rate(container_cpu_usage_seconds_total{container!=""}[5m])
    ) / sum by (namespace) (
      kube_pod_container_resource_requests{resource="cpu"}
    )
    """
    efficiency = prom.query(efficiency_query)

    # Cloud cost per vCPU-hour (pulled from the billing API); 730 ~ hours per month
    cost_per_vcpu_hour = get_cloud_cost_rate()

    attribution = {}
    for ns, eff in efficiency.items():
        monthly_waste = (1 - eff) * get_requested_vcpus(ns) * cost_per_vcpu_hour * 730
        attribution[ns] = {
            "efficiency_pct": eff * 100,
            "monthly_waste_usd": monthly_waste,
            "team_owner": get_namespace_owner(ns),
        }
    return {**state, "cost_attribution": attribution}
```
The second agent is the intelligence layer. Using 30 days of utilisation histograms, it generates right-sizing recommendations that go beyond simple Vertical Pod Autoscaler (VPA) suggestions. It understands traffic patterns, deployment schedules, and business context — so it does not recommend scaling down your payments service at 11:55 PM on a Friday.
```python
import json

from langchain_anthropic import ChatAnthropic

def analyse_and_recommend(state: FinOpsState) -> FinOpsState:
    """Generate right-sizing and workload lifecycle recommendations."""
    llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")
    recommendations = []

    for namespace, data in state["cost_attribution"].items():
        if data["efficiency_pct"] < 30:  # Below 30% = strong over-provision signal
            # Get 30-day P95 utilisation (not peak — protects against spikes)
            p95_cpu = get_p95_utilisation(namespace, metric="cpu", days=30)
            p95_mem = get_p95_utilisation(namespace, metric="memory", days=30)

            # Ask the LLM to generate a safe recommendation with context
            prompt = f"""
            Namespace: {namespace}
            Current CPU request: {get_requested_vcpus(namespace)} vCPU
            P95 actual CPU: {p95_cpu} vCPU
            Current memory request: {get_requested_memory(namespace)} GB
            P95 actual memory: {p95_mem} GB
            Business criticality: {get_criticality(namespace)}
            Last deployment: {get_last_deploy(namespace)}

            Generate a right-sizing recommendation with:
            1. Safe new CPU/memory requests (P95 + 25% safety buffer for critical, +15% for non-critical)
            2. Estimated monthly savings in USD
            3. Risk level (LOW/MEDIUM/HIGH)
            4. Recommended change window

            Return as JSON only, with no surrounding prose.
            """
            rec = llm.invoke(prompt)
            recommendations.append(json.loads(rec.content))

    return {**state, "recommendations": recommendations}
```
The third agent implements the recommendations — but with a critical safety gate. LOW-risk optimisations (non-production namespaces, idle deployments, >80% waste) execute autonomously. MEDIUM and HIGH risk recommendations go to a Slack/Teams approval queue where a platform engineer can approve or dismiss with one click.
```python
from datetime import datetime

from langgraph.types import interrupt

def route_by_risk(state: FinOpsState) -> str:
    """Conditional edge: any MEDIUM/HIGH risk recommendation requires a human."""
    high_risk = [r for r in state["recommendations"]
                 if r["risk"] in ("MEDIUM", "HIGH")]
    return "needs_approval" if high_risk else "auto_execute"

def request_approval(state: FinOpsState) -> FinOpsState:
    """Pause the graph and send a Slack approval request."""
    high_risk_recs = [r for r in state["recommendations"]
                      if r["risk"] in ("MEDIUM", "HIGH")]
    send_slack_approval_request(high_risk_recs)

    # LangGraph interrupt — the graph pauses here until approval is received
    approved = interrupt("Waiting for FinOps approval on high-risk recommendations")
    return {**state, "approved_actions": approved}

def auto_execute(state: FinOpsState) -> FinOpsState:
    """Execute low-risk optimisations autonomously via kubectl patch."""
    low_risk = [r for r in state["recommendations"] if r["risk"] == "LOW"]
    executed = []
    for rec in low_risk:
        result = kubectl_patch_resources(
            namespace=rec["namespace"],
            deployment=rec["deployment"],
            cpu_request=rec["recommended_cpu"],
            memory_request=rec["recommended_memory"],
        )
        executed.append({**rec, "patch_result": result,
                         "executed_at": datetime.now().isoformat()})
    return {**state, "executed_actions": executed}
```
Weekly and monthly, the fourth agent compiles cost savings, trend analysis, and team-level benchmarks into a report delivered to engineering leadership and finance. It answers the CFO's three questions: how much did we spend, how much did we save, and what is the plan for next month?
Kubernetes' standard VPA (an optional add-on, not a built-in) is reactive and naive — it looks at recent history and sets requests accordingly. An AI agent goes further: it understands cron job patterns, deploy-time spikes, and business-hours traffic cycles, and sets requests that are tight without being dangerous. At Deutsche Bank, we saw a 34% CPU cost reduction from right-sizing alone.
Pro tip: Always use P95 (not P99 or max) as your baseline for CPU requests. P99 over-provisions by 25–40% for bursty workloads. Add a static safety buffer on top — 25% for Tier-1 services, 15% for everything else.
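The P95-plus-buffer rule can be sketched in a few lines. This is an illustrative implementation using nearest-rank P95 over raw usage samples; the function name and toy data are mine, not from any production agent:

```python
import math

def recommended_cpu_request(samples: list[float], critical: bool) -> float:
    """P95 of observed CPU usage plus a static safety buffer.

    25% buffer for Tier-1 (critical) services, 15% for everything else.
    """
    ordered = sorted(samples)
    # Nearest-rank P95: the smallest value covering 95% of observations
    idx = math.ceil(0.95 * len(ordered)) - 1
    p95 = ordered[idx]
    buffer = 1.25 if critical else 1.15
    return round(p95 * buffer, 3)

# 30 days of per-minute samples would go here; toy data for illustration,
# including one spike that P95 deliberately ignores:
usage = [0.2] * 90 + [0.5] * 9 + [2.0]
print(recommended_cpu_request(usage, critical=True))   # 0.625 vCPU
```

Note how the single 2.0 vCPU spike does not drag the recommendation up, which is exactly why P95 beats P99 or max for bursty workloads.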
The agent queries Prometheus for deployments that have served zero HTTP requests in the last 7 days yet still run non-zero replicas. It cross-references the owning team's Slack channel and sends a "clean up or we scale to zero" notification. If there is no response within 48 hours, it scales the deployment to zero automatically. In enterprises with active developer communities, this alone typically recovers 8–12% of total cluster cost.
```python
# Identify idle deployments (zero HTTP traffic for >7 days, still running)
idle_query = """
sum by (namespace, pod) (
  increase(http_requests_total[7d])
) == 0
and
sum by (namespace, pod) (
  kube_pod_container_status_running
) > 0
"""
idle_workloads = prometheus.query(idle_query)

for workload in idle_workloads:
    # notify_team and schedule_scale_to_zero are project-specific helpers
    notify_team(workload["namespace"], deadline_hours=48)
    schedule_scale_to_zero(workload, after_hours=48)
```
AI agents analyse workload interruption tolerance and automatically annotate deployments with the correct node affinity rules to shift batch, ML training, and non-critical workloads to spot instances. On AWS EKS, this typically reduces EC2 cost by 65–80% for eligible workloads. The agent tracks interruption history and avoids spot instance types with high eviction rates.
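A sketch of what the spot-steering patch might look like. This assumes EKS managed node groups, which label spot nodes with `eks.amazonaws.com/capacityType: SPOT`; Karpenter and other provisioners use different label keys, and the taint key shown is a convention you would set yourself:

```python
def spot_affinity_patch(interruption_tolerant: bool) -> dict:
    """Build a strategic-merge patch steering a Deployment onto spot capacity.

    Assumes the EKS managed node group label eks.amazonaws.com/capacityType;
    adjust the key for Karpenter (karpenter.sh/capacity-type) or other clouds.
    """
    if not interruption_tolerant:
        return {}  # leave payment-style workloads on on-demand capacity
    return {
        "spec": {
            "template": {
                "spec": {
                    "nodeSelector": {"eks.amazonaws.com/capacityType": "SPOT"},
                    # Tolerate a cluster-specific taint on spot nodes
                    "tolerations": [{
                        "key": "spot",
                        "operator": "Equal",
                        "value": "true",
                        "effect": "NoSchedule",
                    }],
                }
            }
        }
    }

patch = spot_affinity_patch(interruption_tolerant=True)
print(patch["spec"]["template"]["spec"]["nodeSelector"])
```

The agent would apply this via a `kubectl patch` equivalent only after checking the workload's interruption-tolerance annotation and the instance type's recent eviction history.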
Each team gets a monthly compute budget. The agent monitors spend in real time and sends alerts at 70%, 85%, and 100% of budget. At 100%, it automatically throttles new deployments in that namespace and escalates to the engineering manager. This has transformed cloud cost culture in every organisation I have worked with — engineers start caring about resource efficiency when it directly affects their team's ability to deploy.
| Budget Threshold | Agent Action | Notification Target |
|---|---|---|
| 70% | Alert only | Team Slack channel |
| 85% | Alert + top 3 waste culprits identified | Team lead + Slack |
| 100% | New deployment throttle + escalation | Eng manager + Finance |
| 110% | Scale non-critical services to 1 replica | CTO dashboard |
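The escalation ladder in the table reduces to a simple threshold function. A minimal sketch (action names are illustrative labels, not a real API):

```python
def budget_action(spend_pct: float) -> str:
    """Map budget consumption to the escalation ladder in the table above."""
    if spend_pct >= 110:
        return "scale_noncritical_to_one_replica"
    if spend_pct >= 100:
        return "throttle_deployments_and_escalate"
    if spend_pct >= 85:
        return "alert_with_waste_culprits"
    if spend_pct >= 70:
        return "alert_only"
    return "no_action"

print(budget_action(72))   # alert_only
print(budget_action(104))  # throttle_deployments_and_escalate
```

The important design choice is that thresholds are checked from the top down, so a team at 112% gets the strongest action, not four stacked alerts.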
Standard Kubernetes scheduling is a one-time placement decision. Over time, fragmentation grows — you end up with 40 nodes at 35% utilisation instead of 25 nodes at 56% utilisation. The descheduler agent runs nightly, identifies fragmented pods, and evicts and reschedules them onto denser nodes. Combined with cluster autoscaler, this typically reduces node count by 20–30%.
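The consolidation arithmetic behind that claim is straightforward: the work being done is nodes times utilisation, and the target node count is that work divided by the denser packing you can achieve. A sketch, using the figures from the paragraph above:

```python
import math

def consolidated_node_count(nodes: int, avg_utilisation: float,
                            target_utilisation: float) -> int:
    """How many nodes the same workload needs at a denser packing."""
    used_capacity = nodes * avg_utilisation  # node-equivalents of real work
    return math.ceil(used_capacity / target_utilisation)

# The fragmentation example from the text: 40 nodes at 35% utilisation
# consolidates to 25 nodes at 56%.
print(consolidated_node_count(40, 0.35, 0.56))  # 25
```

In practice the achievable target utilisation is capped by PodDisruptionBudgets, anti-affinity rules, and headroom for traffic spikes, so the agent should treat it as a tunable ceiling rather than a constant.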
The full FinOps agent system deploys as a set of Kubernetes CronJobs and a persistent LangGraph server. Here is the production-grade configuration:
```yaml
# finops-agent-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: finops-cost-observer
  namespace: platform-ops
spec:
  schedule: "*/15 * * * *"   # Every 15 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: finops-agent-sa
          containers:
            - name: finops-observer
              image: ghcr.io/gheware/finops-agent:1.2.0
              env:
                - name: PROMETHEUS_URL
                  value: "http://prometheus.monitoring:9090"
                - name: ANTHROPIC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: ai-secrets
                      key: anthropic-api-key
                - name: SLACK_WEBHOOK
                  valueFrom:
                    secretKeyRef:
                      name: finops-secrets
                      key: slack-webhook
              resources:
                requests:
                  cpu: 200m
                  memory: 256Mi
                limits:
                  cpu: 500m
                  memory: 512Mi
          restartPolicy: OnFailure
---
# RBAC: FinOps agent needs read access to pods/deployments + patch for right-sizing
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: finops-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "patch"]   # patch only — no delete
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "create", "patch"]
```
Before you deploy, establish your baseline metrics. These are the KPIs I track in every enterprise deployment — and the targets that tell you the agent system is working:
| KPI | Baseline (Typical) | Target at 90 Days |
|---|---|---|
| Cluster CPU utilisation | 18–25% | 45–60% |
| Memory utilisation | 30–40% | 55–70% |
| % workloads right-sized | 5–15% | 80%+ |
| Idle deployments | 15–30% of total | <5% |
| Node count (same workload) | Baseline | –20 to –35% |
| Cloud bill reduction | 0% | 30–45% |
| Mean time to right-size | 3–6 weeks | <48 hours |
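The first three KPIs can be captured with a handful of PromQL queries. These are a hedged starting point: the exact metric names depend on your kube-state-metrics and cAdvisor versions, so verify them against your own Prometheus before wiring them into dashboards:

```python
# Baseline KPI queries (assumes kube-state-metrics v2+ metric names)
BASELINE_QUERIES = {
    "cluster_cpu_utilisation":
        "sum(rate(container_cpu_usage_seconds_total{container!=''}[5m])) "
        "/ sum(kube_node_status_allocatable{resource='cpu'})",
    "memory_utilisation":
        "sum(container_memory_working_set_bytes{container!=''}) "
        "/ sum(kube_node_status_allocatable{resource='memory'})",
    "node_count": "count(kube_node_info)",
}

for kpi, query in BASELINE_QUERIES.items():
    print(f"{kpi}: {query}")
```

Record these daily for two weeks before deploying any agent, so the 90-day comparison has an honest starting point.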
Never right-size in week one. You need at least 14 days of baseline data — 30 days is better. Teams that right-size immediately often clip a deployment's memory requests below its JVM heap size, causing OOMKills that trigger 2 AM pages.
The descheduler must respect PodDisruptionBudgets. Configure your descheduler agent to check PDB availability before evicting pods. I have seen clusters lose quorum on critical stateful services because an over-eager descheduler evicted too many pods simultaneously.
Do not simply move everything possible to spot. Model interruption cost — a spot eviction during a payment processing batch costs real money in retry logic and delayed settlement. The agent needs to understand the business cost of interruption, not just the infrastructure cost of on-demand.
From the field: At a fintech client, we built an "interruption cost model" into the spot scheduling agent. Payment processing workloads were excluded automatically. Batch analytics and model training moved to spot. Net saving: $180K/year. No incidents.
Here is the phased rollout I recommend for enterprise teams:

1. **Weeks 1–2: Observe only.** Deploy the Cost Observer, gather baseline data, and publish attribution reports. No changes to workloads.
2. **Weeks 3–6: Approval-gated action.** Every recommendation, regardless of risk level, goes through the Slack/Teams approval queue so the team learns to trust the agent's judgement.
3. **Weeks 7+: Autonomous low-risk execution.** LOW-risk optimisations execute automatically; MEDIUM and HIGH remain gated behind human approval.
The teams that see the fastest results are the ones that trust the data before they trust the actions. Build observability first. The automation follows naturally once the organisation believes the numbers.
This is the exact framework we cover in our Agentic AI Workshop — with hands-on lab time building real FinOps agent workflows using LangGraph, Prometheus, and live Kubernetes clusters. The workshop was rated 4.91/5.0 at Oracle, and it is zero-risk: 100% refund plus $1,000 if your team does not achieve measurable results in 90 days. Your team can build this in five days.
View the Agentic AI Workshop →