I was on-site at a Fortune 500 financial services firm last quarter when their AI platform team showed me their GPU utilisation dashboard. Forty-eight A100 nodes — approximately $3.2 million in annual cloud spend. Average utilisation: 21%.

The models were deployed. The data scientists were happy. The business was waiting. But the infrastructure team had hit a wall they couldn't explain: inference latency was spiking unpredictably, GPU OOM kills were crashing production pods overnight, and three AI projects had been quietly shelved because the cluster "just wasn't reliable."

The problem wasn't the hardware. It wasn't the models. It was Kubernetes — specifically, a cluster that had been configured for traditional microservices and then handed to an AI platform team who were told "just deploy your inference servers like any other container."

That advice works until it catastrophically doesn't. AI inference on Kubernetes is a fundamentally different workload class, and treating it otherwise costs enterprises millions in wasted GPU hours, delayed AI projects, and on-call engineer time spent fire-fighting crashes that should never happen.

In 25 years working across JPMorgan, Deutsche Bank, and Morgan Stanley — and now training over 5,000 enterprise engineers — I've seen this exact pattern repeat itself across sectors. This guide gives you the five concrete upgrades that transform a vanilla Kubernetes cluster into a production-ready LLM inference platform.

Why Your Existing Kubernetes Cluster Fails at AI Inference

Before the upgrades, you need to understand why the default Kubernetes configuration creates specific failure modes for LLM inference workloads. These aren't edge cases — they are predictable, structural problems.

The Three Core Failure Modes

1. GPU Resource Contention: Kubernetes' default scheduler has no concept of GPU topology. Two pods can be scheduled on the same node with conflicting GPU memory requirements, triggering OOM kills at 2 AM. Unlike CPU, which can be throttled gracefully, a GPU fails hard when its memory is exhausted.

2. The Cold-Start Latency Spike: LLM inference servers take 2–8 minutes to load a 70B model from storage into GPU memory. With default HPA (Horizontal Pod Autoscaler) based on CPU/memory metrics, Kubernetes scales too slowly, adding inference pods only after users have already experienced 503 errors. By the time the new pod is warm, the spike is over.

3. Noisy Neighbours in Shared Clusters: A data science team running a fine-tuning job during business hours can consume every GPU on a node, starving the production inference server sharing that node. Without hard GPU isolation, your production SLAs are at the mercy of whoever submits a training job first.

The Infrastructure-Model Gap

A standard Kubernetes cluster is optimised for stateless services with sub-100ms startup, CPU/memory resource dimensions, and predictable traffic patterns. LLM inference workloads are stateful (KV cache), require a third resource dimension (GPU memory), and exhibit bursty, request-time-variable load. The following five upgrades bridge this gap systematically.

Upgrade 1: GPU-Aware Scheduling with Node Affinity and Taints

The first upgrade is also the most foundational: you must explicitly partition your cluster so that LLM inference workloads land on GPU nodes, and GPU nodes are protected from non-GPU workloads consuming CPU and memory headroom that inference servers need.

Step 1: Label Your GPU Nodes

# Label GPU nodes with hardware type and capability tier
kubectl label node gpu-node-01 \
  accelerator=nvidia-a100 \
  gpu-memory=80gb \
  workload-type=inference

# Verify labels
kubectl get nodes --show-labels | grep accelerator

Step 2: Taint GPU Nodes to Prevent Accidental Scheduling

# Apply taint — only pods with matching toleration can schedule here
kubectl taint nodes gpu-node-01 \
  dedicated=inference:NoSchedule

# This prevents CPU-only workloads from consuming CPU/RAM on GPU nodes
# which reduces the available headroom for your inference server processes
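You can confirm the taint is enforced with a throwaway probe pod that carries no toleration (the pod name here is illustrative); pinned to GPU nodes via nodeSelector, it should stay Pending with a FailedScheduling event citing the untolerated taint:

```yaml
# Hypothetical probe pod — no toleration for dedicated=inference, so the
# scheduler must refuse to place it on the tainted GPU node
apiVersion: v1
kind: Pod
metadata:
  name: taint-probe
spec:
  nodeSelector:
    accelerator: nvidia-a100   # restrict the scheduler to GPU nodes only
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
```

If `kubectl get pod taint-probe` shows Running, your taint isn't applied — fix that before moving on.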

Step 3: Configure Your Inference Deployment with Proper Resource Requests

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-server
  namespace: ai-production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "inference"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-a100
              - key: gpu-memory
                operator: In
                values:
                - 80gb
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - llm-inference
            topologyKey: kubernetes.io/hostname
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:latest   # pin an explicit version tag in production
        resources:
          requests:
            nvidia.com/gpu: "1"
            memory: "48Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: "1"
            memory: "64Gi"
            cpu: "16"
        args:
        # Assumes a quantised (e.g. AWQ/INT4) 70B build that fits one 80GB GPU;
        # an FP16 70B needs ~140GB and tensor parallelism across 4 GPUs (see the multi-GPU FAQ)
        - --model
        - /models/llama-3-70b-instruct
        - --tensor-parallel-size
        - "1"
        - --max-model-len
        - "8192"

Note the podAntiAffinity rule — this ensures that two inference replicas never land on the same GPU node, giving you true high availability rather than two pods that fail together when a single node goes down.
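One caveat: if you ever run more replicas than you have GPU nodes, the hard rule leaves the extra pods unschedulable. On small clusters, a softer weighted preference (a sketch, swapping the required rule for a preferred one) may fit better:

```yaml
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - llm-inference
              topologyKey: kubernetes.io/hostname
```

The scheduler will still spread replicas across nodes when it can, but will co-locate rather than refuse to schedule when it can't.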

NVIDIA Device Plugin: The Hidden Requirement

The nvidia.com/gpu resource only exists if the NVIDIA device plugin DaemonSet is running on your cluster. Many teams skip this and then wonder why GPU resource requests are rejected.

# Deploy NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

# Verify GPU resources are visible
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu")'

Upgrade 2: KServe for Production-Grade Model Serving

Raw Deployments work for a proof-of-concept. Production LLM serving requires model versioning, canary deployments, multi-model serving to share GPU memory across smaller models, and standardised inference APIs. This is exactly what KServe (the successor to KFServing) provides.

What KServe Gives You That Raw Deployments Don't

  • InferenceService CRD — a Kubernetes-native abstraction for model serving with built-in canary rollout
  • Multi-model serving (MMS) — pack multiple smaller models onto a single GPU node without separate server processes
  • Standardised V1/V2 inference protocols (the Open Inference Protocol), plus OpenAI-compatible endpoints for LLM runtimes
  • Automatic scale-to-zero via Knative — stop idle inference pods burning GPU hours
  • Pipeline routing — pre/post-processing transformers, explainers, and outlier detectors as first-class primitives

Deploying a Model with KServe

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
  namespace: ai-production
  annotations:
    # RawDeployment mode avoids the Knative dependency; note that Knative-based
    # scale-to-zero requires the default Serverless mode instead
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4
    model:
      modelFormat:
        name: pytorch
      runtime: vllm
      storageUri: "pvc://model-store/llama3-70b-instruct"
      resources:
        limits:
          nvidia.com/gpu: "1"
          memory: "64Gi"
        requests:
          nvidia.com/gpu: "1"
          memory: "48Gi"
      args:
      # Assumes a quantised 70B that fits a single 80GB GPU; FP16 needs
      # --tensor-parallel-size=4 and four GPUs (see the multi-GPU FAQ)
      - --tensor-parallel-size=1
      - --max-model-len=8192
      - --enable-chunked-prefill

Canary Deployments: Rolling Out New Models Safely

One of the most underused KServe features is traffic splitting — route 5% of inference traffic to a new model version before promoting it to production:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
  namespace: ai-production
spec:
  predictor:
    canaryTrafficPercent: 5
    model:
      modelFormat:
        name: pytorch
      runtime: vllm
      storageUri: "pvc://model-store/llama3-70b-instruct-v2"
      # v2 gets 5% of traffic; v1 (existing) gets 95%
      resources:
        limits:
          nvidia.com/gpu: "1"

Once you validate latency and accuracy metrics from the canary, promote the new version by setting canaryTrafficPercent to 100 (or simply removing the field). This is how Netflix-grade deployment confidence transfers to AI model releases.
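Promotion itself is a one-field change. A sketch of the promoted spec — identical to the canary above, with the traffic split removed so v2 takes all traffic:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
  namespace: ai-production
spec:
  predictor:
    # canaryTrafficPercent removed — v2 now receives 100% of traffic
    model:
      modelFormat:
        name: pytorch
      runtime: vllm
      storageUri: "pvc://model-store/llama3-70b-instruct-v2"
```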

For teams working through our AI-Powered DevOps training programme, we cover KServe end-to-end including custom runtimes, ModelMesh for multi-model serving, and integration with Argo CD for GitOps-driven model promotion.

Upgrade 3: KEDA for Inference Autoscaling to Zero

Default HPA scales pods based on CPU and memory. LLM inference pods are CPU-idle (the GPU does the work) and memory-stable (GPU VRAM is pre-allocated). HPA sees nothing to trigger scaling. The result: your inference service either runs too many pods burning idle GPU costs, or it can't scale fast enough when traffic spikes.

KEDA (Kubernetes Event-Driven Autoscaling) solves both problems by scaling on custom metrics — queue depth, request rate, pending inference jobs — rather than CPU.

Install KEDA

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace \
  --version 2.13.0

Scale on Prometheus Request Rate

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
  namespace: ai-production
spec:
  scaleTargetRef:
    name: llm-inference-server
  minReplicaCount: 0      # Scale to zero when idle (batch tier; keep 1 for production — see caveat below)
  maxReplicaCount: 8
  cooldownPeriod: 300     # 5 min cooldown before scale-down
  pollingInterval: 15     # Check every 15 seconds
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: vllm_request_queue_depth
      threshold: "3"      # Scale up when queue exceeds 3 pending requests
      query: |
        sum(vllm:num_requests_waiting{namespace="ai-production"})
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: vllm_running_requests
      threshold: "10"
      query: |
        sum(vllm:num_requests_running{namespace="ai-production"})

The Scale-to-Zero Caveat

Scale-to-zero is powerful for batch and off-hours workloads, but introduces cold-start latency (2–8 minutes for large models) for user-facing services. The enterprise solution is a two-tier strategy:

  • Production tier: minReplicaCount: 1 — always warm, immediate response
  • Batch/async tier: minReplicaCount: 0 — scale to zero, triggered by queue depth
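A sketch of the batch-tier ScaledObject — the deployment name and the tier label are hypothetical; it differs from the production scaler mainly in its replica floor and wake-up trigger:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-batch-scaler            # hypothetical async/batch inference tier
  namespace: ai-production
spec:
  scaleTargetRef:
    name: llm-batch-inference       # hypothetical batch Deployment
  minReplicaCount: 0                # scale to zero when the queue is empty
  maxReplicaCount: 4
  cooldownPeriod: 600               # tolerate longer idle before releasing GPUs
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      threshold: "1"                # any queued batch request wakes the tier
      query: |
        sum(vllm:num_requests_waiting{namespace="ai-production",tier="batch"})
```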

In our customer implementations at major financial services firms, this two-tier KEDA strategy consistently reduces idle GPU costs by 40–60% without impacting P99 latency on the production SLA path.

Upgrade 4: Multi-Tenant GPU Isolation and ResourceQuotas

In a shared enterprise cluster, multiple teams — data science, platform engineering, application teams — compete for GPU resources. Without explicit isolation, a single poorly-configured training job can exhaust cluster-wide GPU capacity and kill production inference services.

Namespace-Based Isolation

# Create isolated namespaces for each AI workload class
kubectl create namespace ai-production     # Production inference — highest priority
kubectl create namespace ai-staging        # Pre-production testing
kubectl create namespace ai-training       # Training jobs — lowest priority
kubectl create namespace ai-experiments    # Data science notebooks

ResourceQuotas: Hard GPU Limits Per Team

---
# Production namespace: guaranteed GPU allocation
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota-production
  namespace: ai-production
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
    requests.memory: "512Gi"
    limits.memory: "640Gi"
    pods: "20"
---
# Training namespace: limited to prevent runaway jobs
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota-training
  namespace: ai-training
spec:
  hard:
    requests.nvidia.com/gpu: "4"
    limits.nvidia.com/gpu: "4"
    requests.memory: "256Gi"
    limits.memory: "320Gi"
    pods: "10"
---
# Experiments namespace: strict guardrails for data scientists
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota-experiments
  namespace: ai-experiments
spec:
  hard:
    requests.nvidia.com/gpu: "2"
    limits.nvidia.com/gpu: "2"
    requests.memory: "64Gi"
    limits.memory: "128Gi"
    pods: "10"

PriorityClasses: Evict Training Jobs Before Production Suffers

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-production
value: 1000
globalDefault: false
description: "Production LLM inference — highest priority, never evicted"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-training
value: 200
globalDefault: false
description: "Training jobs — evicted before production inference under resource pressure"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-experiments
value: 50
globalDefault: false
description: "Interactive experiments — lowest priority, first to be evicted"

Assign these PriorityClasses in your Deployment specs. When cluster resources are constrained, the Kubernetes scheduler will evict lower-priority pods first — meaning your data scientist's notebook gets paused before your production inference server is affected.
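Wiring a class into a workload is a single field on the pod template; for the inference Deployment from Upgrade 1:

```yaml
  template:
    spec:
      priorityClassName: inference-production   # pods inherit priority 1000
```

Under resource pressure, pods without a priorityClassName default to priority 0 — below even ai-experiments here — so set the field explicitly on every AI workload.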

LimitRange: Protect Against Resource Request Mistakes

apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ai-experiments
spec:
  limits:
  - type: Container
    max:
      nvidia.com/gpu: "1"        # No single container can request more than 1 GPU
      memory: "32Gi"
    default:
      nvidia.com/gpu: "1"
      memory: "8Gi"
    defaultRequest:
      nvidia.com/gpu: "1"
      memory: "4Gi"

This single LimitRange prevents the most common GPU cluster incident: a data scientist requesting all 8 GPUs on a node for an experiment that only needs 1, starving every other workload in the namespace.

Upgrade 5: Observability Stack for LLM Inference Metrics

Traditional Prometheus + Grafana dashboards show CPU, memory, and request rate. For LLM inference, these metrics are nearly useless — the GPU is doing the work, and the most important signals (token throughput, KV cache utilisation, batch queue depth, per-request latency by sequence length) are invisible to standard observability tools.

This is where most enterprise AI platform teams are flying blind, and it directly causes the unexplained latency spikes and unpredictable failures that make stakeholders lose confidence in the AI platform.

vLLM Metrics Endpoint

If you're running vLLM (the most common open-source LLM inference engine), it exposes a /metrics endpoint by default. These are the metrics that actually matter:

# Key vLLM metrics to track
vllm:num_requests_running                   # Active concurrent requests
vllm:num_requests_waiting                   # Queue depth (your scaling trigger)
vllm:gpu_cache_usage_perc                   # KV cache fill % (>90% means imminent OOM risk)
vllm:avg_prompt_throughput_toks_per_s       # Prompt tokens/second (prefill throughput)
vllm:avg_generation_throughput_toks_per_s   # Output tokens/second (user experience signal)
vllm:time_to_first_token_seconds            # TTFT histogram — the most important UX latency metric

Scraping vLLM Metrics into Prometheus

apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-scrape-config
  namespace: monitoring
data:
  vllm-scrape.yaml: |
    scrape_configs:
    - job_name: 'vllm-inference'
      static_configs:
      - targets:
        - 'llm-inference-server.ai-production:8000'
      # Note: a static Service target hits one pod per scrape; in production,
      # prefer kubernetes_sd_configs (or a PodMonitor) to scrape every replica
      metrics_path: '/metrics'
      scrape_interval: 10s

With these metrics flowing into Grafana, you can build dashboards that answer the questions that actually matter to your AI platform SLA:

  • What is my time-to-first-token at P50, P95, P99?
  • Is my KV cache approaching saturation (leading indicator of OOM)?
  • Are requests queuing (signal to scale out before users notice)?
  • Which model/route is consuming the most GPU time per request?
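The first question above, for instance, maps to a Grafana panel query along these lines (assuming the vLLM TTFT histogram is being scraped as configured earlier):

```promql
# P95 time-to-first-token over a 5-minute window
histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
```

Swap 0.95 for 0.5 or 0.99 to plot the other percentiles on the same panel.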

Alerts That Actually Matter

groups:
- name: llm-inference-alerts
  rules:
  - alert: LLMHighQueueDepth
    expr: vllm:num_requests_waiting > 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "LLM inference queue depth high"
      description: "{{ $value }} requests queued for >2min. KEDA should be scaling."

  - alert: LLMKVCacheSaturation
    expr: vllm:gpu_cache_usage_perc > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "GPU KV cache near saturation"
      description: "KV cache at {{ $value }}%. Risk of OOM kill within minutes."

  - alert: LLMHighTTFT
    expr: histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "LLM P95 time-to-first-token exceeds 5 seconds"

The KVCacheSaturation alert is the one that will save you from your next 2 AM incident. When KV cache hits 90%, you have minutes before the inference server OOMs and crashes. With this alert, you have time to respond.

Practical Implementation Guide: The 30-Day Enterprise Rollout

The five upgrades above are correct independently, but the sequence matters. Rolling them out in the wrong order creates compounding problems. Here is the enterprise-safe rollout sequence I use with clients:

Week 1: Foundation (Upgrades 1 + 4)

Start with GPU-aware scheduling and namespace isolation — these are non-disruptive to existing workloads and establish the resource governance foundation everything else depends on. Deploy the NVIDIA device plugin if not already running. Label nodes, apply taints, create namespaces and ResourceQuotas.

Gate: Verify GPU resources are visible (kubectl describe node) and that non-GPU pods cannot schedule on GPU nodes (test with a purposely untolerated pod).

Week 2: Model Serving (Upgrade 2)

Deploy KServe. Migrate one non-critical inference workload from a raw Deployment to an InferenceService. Validate the model serves correctly, then practice a canary rollout to build muscle memory before production migration.

Gate: Canary deployment routing 10% traffic to a second model version, validated with curl against both endpoints.

Week 3: Autoscaling (Upgrade 3)

Deploy KEDA. Configure ScaledObjects for the staging environment first. Load test to validate scale-out triggers correctly. Tune thresholds. Only then promote to production ScaledObjects.

Gate: Load test showing KEDA scales from 1 → 4 replicas within 2 minutes of sustained queue depth > 3.

Week 4: Observability (Upgrade 5)

Wire vLLM metrics into Prometheus. Build the Grafana dashboard. Configure alerts. Run a synthetic incident (deliberately saturate KV cache in staging) to validate alert routing end-to-end.

Gate: PagerDuty/OpsGenie alert received within 3 minutes of KV cache crossing 90% in the test.

The ROI Math

Metric                       Before               After (30 days)
GPU Utilisation              21%                  68–75%
GPU Idle Cost (monthly)      $267K                ~$83K
Production OOM Incidents     3–5 / month          0
P95 Time-to-First-Token      12–18s               2–4s
Model Rollout Time           4–6 hours (manual)   15 min (canary)

These numbers are from actual client implementations, not theoretical benchmarks. The infrastructure investment to achieve this — a trained DevOps engineer with 2–3 weeks of focused work — typically has a payback period measured in days, not months, when GPU cost savings are accounted for.

Frequently Asked Questions

Do I need a separate Kubernetes cluster for AI inference, or can I share with existing workloads?

You can share, but you must implement the isolation patterns in Upgrade 4 (namespace isolation, ResourceQuotas, PriorityClasses) before doing so. Without these, production inference will be starved by lower-priority workloads. For enterprises with large GPU fleets (>20 GPU nodes), a dedicated inference cluster often makes economic sense due to simplified scheduling and cleaner blast radius. For teams under 20 GPU nodes, shared cluster with proper isolation is operationally simpler and sufficient.

What is the difference between KServe and simply running vLLM as a Kubernetes Deployment?

A raw Deployment works for a single model in development. KServe adds production necessities: canary traffic splitting for safe model rollouts, multi-model serving to pack multiple smaller models onto shared GPU memory, standardised inference protocols (the V2 Open Inference Protocol, plus OpenAI-compatible endpoints for LLM runtimes), automatic integration with Knative for scale-to-zero, and a GitOps-friendly CRD model that integrates with Argo CD for ML model lifecycle management. The operational overhead of KServe vs. a raw Deployment becomes net-positive at the second model version rollout.

Which LLM inference engine should I use on Kubernetes: vLLM, TGI, or Triton?

For most enterprise teams in 2026, vLLM is the default choice: OpenAI-compatible API, continuous batching, paged attention for efficient KV cache management, and the strongest community momentum (GitHub #1 trending AI repo repeatedly in 2025–2026). Use Text Generation Inference (TGI) if you're standardised on Hugging Face. Use Triton Inference Server if you need multi-framework serving (PyTorch + TensorFlow + ONNX in the same cluster) or have existing NVIDIA platform commitments. All three integrate with KServe via custom runtimes.

How do I handle model storage in Kubernetes — should I use PersistentVolumes or download at pod startup?

Never download models at pod startup in production. A 70B model is 140GB — downloading at startup adds 10–20 minutes to pod initialisation, defeats autoscaling, and creates egress cost per scale-out event. The correct pattern: store models in a PersistentVolume (backed by NFS or a cloud-native file system like AWS EFS or Azure Files) and mount it into your inference pods. For faster startup, pre-stage models on local NVMe storage on GPU nodes using a DaemonSet and use hostPath mounts — this reduces model load time from 8 minutes (from PV) to under 90 seconds (from NVMe).
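A sketch of the mount side of that pattern (the PVC name is hypothetical):

```yaml
      containers:
      - name: vllm-server
        volumeMounts:
        - name: model-store
          mountPath: /models        # vLLM loads from /models/... as in the examples above
          readOnly: true
      volumes:
      - name: model-store
        persistentVolumeClaim:
          claimName: model-store    # shared ReadOnlyMany PVC (e.g. EFS/Azure Files)
          readOnly: true
```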

How do enterprises handle multi-GPU inference for very large models (70B+) on Kubernetes?

Models larger than approximately 40B parameters at FP16 precision require more VRAM than a single GPU provides (80GB on an A100). The solution is tensor parallelism — splitting the model's layers across multiple GPUs. In Kubernetes, this requires pods that span multiple GPUs on the same node (using nvidia.com/gpu: "4" or more in resource requests) combined with NVLink for fast inter-GPU communication (InfiniBand becomes relevant only once you shard across nodes). vLLM's --tensor-parallel-size flag handles the model sharding. The scheduling challenge is ensuring these multi-GPU pods always land on nodes with the required GPU count and interconnect topology — which requires the node affinity patterns from Upgrade 1 to be precisely configured.
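The sizing arithmetic is simple enough to sanity-check in a shell. This counts weights only — KV cache and activations need additional headroom, which is why a 70B FP16 deployment typically lands on 4 GPUs (the next power of two) rather than the bare minimum:

```shell
# Back-of-envelope VRAM sizing for model weights alone
PARAMS_B=70          # parameters, in billions
BYTES_PER_PARAM=2    # FP16
GPU_VRAM_GB=80       # A100 80GB

WEIGHTS_GB=$((PARAMS_B * BYTES_PER_PARAM))                      # 70 * 2 = 140 GB
MIN_GPUS=$(( (WEIGHTS_GB + GPU_VRAM_GB - 1) / GPU_VRAM_GB ))    # ceiling division
echo "weights: ${WEIGHTS_GB} GB -> at least ${MIN_GPUS} GPUs before KV cache headroom"
```

Running it prints `weights: 140 GB -> at least 2 GPUs before KV cache headroom`; add the KV cache budget and you arrive at the 4-GPU configuration described above.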

Conclusion: Your Kubernetes Cluster Can Become an AI Inference Platform

The gap between a standard Kubernetes cluster and a production-ready AI inference platform isn't a hardware gap — it's an operational knowledge gap. The five upgrades in this guide systematically close that gap:

  1. GPU-aware scheduling — put the right workloads on the right nodes, every time
  2. KServe model serving — production-grade model lifecycle with canary rollouts
  3. KEDA autoscaling — scale on what actually matters (queue depth, request rate) not CPU
  4. Multi-tenant GPU isolation — protect production from the data science team's 3 AM training job
  5. LLM-aware observability — see what's actually happening inside your inference servers

Teams that implement all five upgrades consistently see GPU utilisation jump from the industry-average 21% to 65–75%, while simultaneously reducing production incidents and cutting P95 inference latency by 60–80%. The economics are compelling: at $3.2M in annual GPU spend, moving from 21% to 70% utilisation effectively recovers $1.6M in infrastructure budget — without buying a single additional GPU.

I've spent 25 years building and operating large-scale infrastructure at JPMorgan, Deutsche Bank, and Morgan Stanley. The pattern I see repeatedly is that teams with strong Kubernetes fundamentals adapt to AI workloads faster and with fewer production incidents than teams that skip the foundation. The tools have changed; the discipline of building systems correctly has not.

If your team is ready to build this expertise — not just run through tutorials, but truly understand how to operate AI infrastructure at enterprise scale — we'd love to work with you.