GPU-Optimized Kubernetes for LLM Inference: The Enterprise Playbook for 2026
In early 2026, Kubernetes crossed a threshold: it became the de facto runtime for enterprise LLM inference, not just model training. But running a Llama-3 70B or Mistral Large model on Kubernetes is a completely different engineering challenge from running microservices. Without proper GPU-optimized Kubernetes design, teams waste 60-70% of their GPU budget and see inference latency that makes users abandon the product. I've watched this exact failure pattern happen at scale. Here's how to avoid it.
⚡ Key Takeaways
- GPU-optimized Kubernetes requires purpose-built node pools: don't mix CPU and GPU workloads on the same nodes
- Model quantization (INT4/INT8, BitNet) can cut GPU memory requirements by 50-75% without meaningful accuracy loss for enterprise use cases
- Kubernetes Device Plugin + extended resources are mandatory for GPU scheduling, not optional add-ons
- NVIDIA MIG (Multi-Instance GPU) partitioning lets you run 7 independent inference pods on a single A100, which is critical for cost control
- Right-sized GPU node pools with autoscaling can reduce cloud GPU spend by 40-60% vs. static provisioning
Why Kubernetes Has Become the LLM Inference Runtime of Choice
At JPMorgan, we never ran GPU workloads on Kubernetes in 2015. We provisioned bare-metal GPU servers, dedicated them to specific models, and prayed they stayed up. The operational overhead was enormous: patching, scaling, failover, load balancing, all custom-built. Today, enterprises running LLMs on Kubernetes inherit a decade of container orchestration maturity. That is a genuine competitive advantage, if you configure it correctly.
The enterprise pull toward Kubernetes for LLM inference has accelerated dramatically in Q1 2026. Three convergent forces drove this:
- Cost pressure: A100 80GB instances cost $3-8/hour on AWS/GCP/Azure. Static GPU provisioning at enterprise scale burns millions before you serve your first inference request. Kubernetes autoscaling changes the math entirely.
- Multi-model deployments: Enterprises are no longer running one LLM. L&D teams use one model, customer service uses another, code review uses a third. Kubernetes multi-tenancy makes this tractable.
- Operational consistency: Your CI/CD pipelines, monitoring (Prometheus/Grafana), secrets management, and RBAC already live in Kubernetes. Inference should too.
Designing GPU Node Pools: The Architecture That Actually Scales
The most common mistake I see in enterprise Kubernetes setups: one GPU node pool for everything. Training jobs compete with inference pods. Batch embedding tasks starve latency-sensitive chat inference. You end up with neither workload performing well.
The correct architecture uses purpose-built node pools with strict affinity and tainting:
# Node pool taint for LLM inference (GKE example)
# a2-highgpu-1g = 1x A100 40GB
gcloud container node-pools create llm-inference-pool \
  --cluster=enterprise-ai-cluster \
  --machine-type=a2-highgpu-1g \
  --accelerator type=nvidia-tesla-a100,count=1 \
  --num-nodes=3 \
  --min-nodes=1 \
  --max-nodes=10 \
  --enable-autoscaling \
  --node-taints=workload=llm-inference:NoSchedule \
  --node-labels=gpu-type=a100,workload-class=inference

# Separate pool for training/fine-tuning (bursty, long-running)
# a2-highgpu-8g = 8x A100 for distributed training
gcloud container node-pools create llm-training-pool \
  --cluster=enterprise-ai-cluster \
  --machine-type=a2-highgpu-8g \
  --num-nodes=0 \
  --min-nodes=0 \
  --max-nodes=4 \
  --enable-autoscaling \
  --node-taints=workload=llm-training:NoSchedule
The min-nodes=0 on the training pool is critical: training runs are infrequent but expensive. Scale to zero when idle, scale up within minutes for a scheduled fine-tuning run. Your inference pool stays always-on at minimum capacity (1 node) to avoid cold-start latency.
The corresponding Pod spec that respects this architecture:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama3-70b-inference
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama3-inference
  template:
    metadata:
      labels:
        app: llama3-inference
    spec:
      # Target the inference-specific node pool
      nodeSelector:
        workload-class: inference
        gpu-type: a100
      tolerations:
      - key: "workload"
        operator: "Equal"
        value: "llm-inference"
        effect: "NoSchedule"
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:v0.6.2
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3-70B-Instruct"   # with --quantization awq, point this at AWQ-quantized weights (see quantization section below)
        - "--quantization"
        - "awq"   # INT4 quantization
        - "--gpu-memory-utilization"
        - "0.90"
        - "--max-model-len"
        - "8192"
        - "--tensor-parallel-size"
        - "1"
        resources:
          limits:
            nvidia.com/gpu: "1"   # Extended resource - requires Device Plugin
            memory: "40Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: "1"
            memory: "32Gi"
            cpu: "4"
        ports:
        - containerPort: 8000
          name: http
Kubernetes GPU Scheduling: The Device Plugin Layer You Cannot Skip
One of the most misunderstood areas in GPU Kubernetes deployments: the NVIDIA Device Plugin is not optional infrastructure. Without it, Kubernetes has no concept of a GPU as a schedulable resource. You cannot request nvidia.com/gpu: 1 in a Pod spec without the Device Plugin DaemonSet running on every GPU node.
Deployment is straightforward but must happen before any inference workloads:
# Deploy NVIDIA Device Plugin (Helm - recommended for production)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.16.2 \
  --set gfd.enabled=true \
  --set migStrategy=mixed \
  --set deviceListStrategy=volume-mounts
# gfd.enabled=true  -> GPU Feature Discovery
# migStrategy=mixed -> enable MIG support
With gfd.enabled=true, GPU Feature Discovery automatically labels your nodes with GPU capabilities: model, memory, CUDA version, MIG topology. This enables sophisticated scheduling rules like "only schedule Llama-3 70B on A100 80GB nodes" without manual label management.
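As an illustration, here is a sketch of a nodeAffinity block keyed on GFD labels. The label keys (nvidia.com/gpu.product, nvidia.com/gpu.memory) come from GPU Feature Discovery, but the exact values vary by driver and GPU SKU, so check what your nodes actually report with kubectl describe node before copying these:

# Pod template fragment: pin 70B replicas to 80GB A100 nodes via GFD labels
# (values below are illustrative - verify them on your own nodes)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.product
          operator: In
          values:
          - NVIDIA-A100-SXM4-80GB
        - key: nvidia.com/gpu.memory
          operator: Gt
          values:
          - "70000"   # MiB - require more than ~70 GB of GPU memory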
NVIDIA MIG: The Cost-Optimization Most Enterprises Miss
Multi-Instance GPU (MIG) is arguably the highest-ROI configuration change available on A100 and H100 GPUs. A single A100 80GB can be partitioned into up to 7 independent GPU instances, each with guaranteed memory and compute isolation. For inference workloads running smaller models (7B-13B parameters after quantization), this transforms one $4/hour GPU node into seven independent inference endpoints.
# Enable MIG on A100 nodes
nvidia-smi -i 0 -mig 1

# Create 7x 1g.10gb instances (smallest partition)
nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C

# Verify: each partition appears as an nvidia.com/mig-1g.10gb resource in K8s
kubectl describe node gpu-node-1 | grep nvidia
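With migStrategy=mixed in the Device Plugin, each partition is advertised as its own extended resource, so a small-model pod requests one slice instead of a whole GPU. A sketch of the container section under that assumption (the model path is a placeholder and must be quantized small enough to fit in the 10 GB slice):

# Deployment container fragment: one MIG slice per 7B-class replica
containers:
- name: vllm-7b
  image: vllm/vllm-openai:v0.6.2
  args:
  - "--model"
  - "your-registry/mistral-7b-awq"   # placeholder - any INT4-quantized 7B-class model
  - "--quantization"
  - "awq"
  - "--max-model-len"
  - "4096"
  resources:
    requests:
      nvidia.com/mig-1g.10gb: "1"
    limits:
      nvidia.com/mig-1g.10gb: "1"    # one of the seven 1g.10gb partitions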
At Deutsche Bank, our risk analytics team ran 12 concurrent model scoring services on 2 A100 GPUs using MIG partitioning. The cost per inference endpoint dropped by 71% compared to dedicated GPU allocation.
GPU-Optimized Kubernetes Starts With the Model: Quantization in 2026
Running a full-precision (FP16) Llama-3 70B model requires approximately 140GB of GPU memory. That's two A100 80GBs just to load the model, before you process a single token. For most enterprise use cases, this is unnecessary. The BitNet 1-bit quantization breakthrough (and more practically, AWQ and GPTQ INT4 methods) changes the math entirely.
| Model (Llama-3 70B) | Precision | VRAM Required | Quality Impact | Best For |
|---|---|---|---|---|
| FP16 (full precision) | 16-bit | ~140 GB | Baseline | Research, fine-tuning source |
| BF16 | 16-bit | ~140 GB | <0.5% vs FP16 | Training on H100 |
| AWQ INT4 | 4-bit | ~38 GB | 1-3% MMLU drop | Enterprise inference ✅ |
| GPTQ INT4 | 4-bit | ~36 GB | 2-4% MMLU drop | Enterprise inference ✅ |
| BitNet b1.58 (1-bit) | 1.58-bit | ~12 GB | 5-8% quality loss | Edge, cost-extreme scenarios |
For enterprise applications (customer service chatbots, internal knowledge assistants, code review tools), AWQ INT4 is the sweet spot. A 1-3% accuracy drop on MMLU benchmarks is imperceptible to end users, while GPU memory requirements drop by 73%. You go from 2 A100s to 1 A100 per model replica.
# Pre-quantize Llama-3 70B to AWQ before pushing to Kubernetes
# Run this as a one-time job, store in private container registry
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Meta-Llama-3-70B-Instruct"
quant_path = "llama3-70b-awq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
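If you'd rather run that one-time job inside the cluster than on a workstation, one option is a Kubernetes Job pinned to the training pool. This is a sketch under stated assumptions: the container image (with autoawq and the script above baked in), the script path, the workload-class=training label, and the PVC name are all placeholders for whatever your registry and storage actually use:

apiVersion: batch/v1
kind: Job
metadata:
  name: llama3-70b-awq-quantize
  namespace: ai-serving
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        workload-class: training               # assumes you also label the training pool
      tolerations:
      - key: "workload"
        operator: "Equal"
        value: "llm-training"
        effect: "NoSchedule"
      containers:
      - name: quantize
        image: registry.example.com/ai/awq-quantizer:latest    # placeholder image with autoawq installed
        command: ["python", "/scripts/quantize_llama3_awq.py"] # the script shown above
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "200Gi"                    # loading FP16 70B weights needs substantial host RAM
        volumeMounts:
        - name: model-output
          mountPath: /output                   # point save_quantized() at this path
      volumes:
      - name: model-output
        persistentVolumeClaim:
          claimName: quantized-models          # placeholder PVC for the saved AWQ weights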
Kubernetes Autoscaling for GPU Inference: Keeping Costs Rational
Standard HPA (Horizontal Pod Autoscaler) works poorly for LLM inference: CPU/memory metrics don't correlate well with inference load. The correct signal is GPU utilization plus pending request queue depth. In 2026, the two recommended approaches are:
Option 1: KEDA with Custom Metrics (Recommended for Enterprise)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama3-inference-scaler
  namespace: ai-serving
spec:
  scaleTargetRef:
    name: llama3-70b-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 120   # 2 min cooldown (GPU pod startup is slow)
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: vllm_num_requests_waiting
      threshold: "5"    # Scale up if >5 requests waiting
      query: |
        avg(vllm_num_requests_waiting{namespace="ai-serving"})
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: gpu_utilization
      threshold: "75"   # Scale up if GPU >75% utilized
      query: |
        avg(DCGM_FI_DEV_GPU_UTIL{namespace="ai-serving"})
Option 2: Cluster Autoscaler + GPU Node Pool Autoscaling
At the node level, Cluster Autoscaler removes idle GPU nodes within the configured cooldown period. Combined with KEDA pod-level scaling, you have a two-tier cost control mechanism: pods scale down first, then nodes scale to zero. For infrequent workloads (like nightly batch inference), this can reduce GPU costs by 60-80% versus static clusters.
One critical Cluster Autoscaler setting for GPU nodes that most teams miss:
# Cluster Autoscaler annotation: prevent scale-down of GPU nodes running inference pods
# Without this, CA may evict running inference pods aggressively
kubectl annotate node gpu-node-1 \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true

# Better: use a dedicated autoscaling profile for GPU pools
# In the cluster-autoscaler config:
#   --scale-down-unneeded-time=5m            # Wait 5 min before scaling down underused GPU nodes
#   --scale-down-utilization-threshold=0.3   # Only scale down nodes whose requested resources are <30% of capacity
Observability for GPU Kubernetes: What You Must Monitor
GPU inference in Kubernetes introduces failure modes that standard Prometheus/Grafana setups don't catch. After watching multiple production incidents at enterprise clients, here are the non-negotiable metrics:
# DCGM Exporter: NVIDIA's official GPU metrics exporter for Kubernetes
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true   # Auto-discover with Prometheus Operator
Critical GPU metrics to alert on:
| Metric | Alert Threshold | What It Catches |
|---|---|---|
| DCGM_FI_DEV_GPU_TEMP | > 85°C | Thermal throttling (kills throughput) |
| DCGM_FI_DEV_FB_USED | > 90% of total | GPU OOM approaching |
| DCGM_FI_DEV_GPU_UTIL | < 10% for 10m | Idle GPU node (scale-down candidate) |
| vllm_num_requests_waiting | > 20 | Inference queue backlog (latency SLA breach) |
| vllm_gpu_cache_usage_perc | > 95% | KV-cache full (throughput collapse) |
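If Prometheus runs via the Prometheus Operator, the table above translates directly into a PrometheusRule. A minimal sketch covering two of the rows; the severity labels and the for: windows are assumptions to tune against your own SLAs:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-inference-alerts
  namespace: monitoring
spec:
  groups:
  - name: gpu-inference
    rules:
    - alert: GPUThermalThrottleRisk
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "GPU above 85°C - throughput will degrade from thermal throttling"
    - alert: InferenceQueueBacklog
      expr: avg(vllm_num_requests_waiting{namespace="ai-serving"}) > 20
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "vLLM request queue backlog - latency SLA at risk"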
Practical Implementation Sequence: Getting to Production in 4 Weeks
Based on training 14 enterprise batches on Kubernetes AI infrastructure, here's the sequence that consistently produces a production-ready GPU inference cluster:
- Week 1 - Foundation: Audit the existing K8s cluster. Create dedicated GPU node pools with taints/labels. Deploy NVIDIA Device Plugin + GPU Feature Discovery. Verify that nvidia.com/gpu extended resources appear (a smoke-test Pod sketch follows this list).
- Week 2 - Model Preparation: Select and quantize target models to AWQ INT4. Package as container images with model weights baked in (or mount from a PVC). Deploy vLLM or Triton Inference Server. Run load tests with Locust to establish baseline throughput and latency.
- Week 3 - Autoscaling + Cost Control: Install KEDA. Configure ScaledObjects based on vLLM queue depth. Enable Cluster Autoscaler with GPU-appropriate thresholds. Implement MIG partitioning for smaller model replicas.
- Week 4 - Observability + Hardening: Deploy DCGM Exporter. Build Grafana dashboards for GPU utilization, temperature, memory, and inference queue. Configure alerting. Run chaos engineering (kill GPU pods, simulate thermal events). Document runbooks.
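For the Week 1 verification step, the quickest end-to-end check is a throwaway Pod that requests one GPU and runs nvidia-smi; if it schedules onto a GPU node and prints the device, the Device Plugin, taints, and tolerations are wired correctly. A minimal sketch (the image tag is illustrative; any CUDA base image works):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "llm-inference"
    effect: "NoSchedule"
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"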
The 5 Most Expensive GPU Kubernetes Mistakes in 2026
- Mixed workload node pools: CPU and GPU workloads competing on the same nodes cause inconsistent latency and GPU contention. Always isolate with taints.
- Skipping model quantization: running FP16 models when AWQ INT4 gives near-identical quality costs 2-4x more GPU budget. No enterprise workload should ship without a quantization evaluation.
- No KEDA, relying on CPU-based HPA: HPA scaling on CPU usage is meaningless for GPU inference. Queue depth and GPU utilization are the right signals.
- Ignoring KV-cache configuration: vLLM's KV cache is where most throughput wins or losses happen. An undersized cache means constant cache misses and 5-10x latency spikes under load (a sketch of the relevant flags follows this list).
- Static GPU node provisioning with no scale-to-zero: if your inference traffic drops 80% on weekends, a static GPU cluster bleeds money 24/7. Configure min-nodes=0 for off-peak GPU pools.
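On the KV-cache point, these are the vLLM flags that most directly control cache headroom; treat the values below as illustrative starting points to tune against your own traffic, not universal recommendations:

# Container args fragment: KV-cache-related vLLM flags (values are illustrative)
args:
- "--gpu-memory-utilization"
- "0.92"        # fraction of GPU memory vLLM may use; whatever is left after weights becomes KV cache
- "--max-model-len"
- "8192"        # longer contexts reserve proportionally more KV-cache blocks per request
- "--max-num-seqs"
- "64"          # cap concurrent sequences so the cache is not oversubscribed
- "--swap-space"
- "8"           # GiB of CPU swap per GPU for preempted sequences' KV blocks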
🚀 Build This in 5 Days With Hands-On Labs
Our Kubernetes Mastery and Agentic AI Workshop programs cover GPU-optimized Kubernetes architecture, LLM inference deployment, and production observability: 119 hands-on labs, no death-by-PowerPoint.
Rated 4.91/5.0 at Oracle. Zero-risk guarantee: 40% faster deployments in 90 days or full refund + ₹83,000.
Explore Training Programs →