GPU-Optimized Kubernetes for LLM Inference: The Enterprise Playbook for 2026
In early 2026, Kubernetes crossed a threshold: it became the de facto runtime for enterprise LLM inference, not just model training. But running a Llama-3 70B or Mistral Large model on Kubernetes is a completely different engineering challenge from running microservices. Without proper GPU-optimized Kubernetes design, teams waste 60-70% of their GPU budget and see inference latency that makes users abandon the product. I've watched this exact failure pattern happen at scale. Here's how to avoid it.
⚡ Key Takeaways
- GPU-optimized Kubernetes requires purpose-built node pools: don't mix CPU and GPU workloads on the same nodes
- Model quantization (INT4/INT8, BitNet) can cut GPU memory requirements by 50-75% without meaningful accuracy loss for enterprise use cases
- Kubernetes Device Plugin + extended resources are mandatory for GPU scheduling, not optional add-ons
- NVIDIA MIG (Multi-Instance GPU) partitioning lets you run 7 independent inference pods on a single A100, which is critical for cost control
- Right-sized GPU node pools with autoscaling can reduce cloud GPU spend by 40-60% vs. static provisioning
Why Kubernetes Has Become the LLM Inference Runtime of Choice
At JPMorgan, we never ran GPU workloads on Kubernetes in 2015. We provisioned bare-metal GPU servers, dedicated them to specific models, and prayed they stayed up. The operational overhead was enormous: patching, scaling, failover, load balancing, all custom-built. Today, enterprises running LLMs on Kubernetes inherit a decade of container orchestration maturity. That is a genuine competitive advantage, if you configure it correctly.
The enterprise pull toward Kubernetes for LLM inference has accelerated dramatically in Q1 2026. Three convergent forces drove this:
- Cost pressure: A100 80GB instances cost $3-8/hour on AWS/GCP/Azure. Static GPU provisioning at enterprise scale burns millions before you serve your first inference request. Kubernetes autoscaling changes the math entirely.
- Multi-model deployments: Enterprises are no longer running one LLM. L&D teams use one model, customer service uses another, code review uses a third. Kubernetes multi-tenancy makes this tractable.
- Operational consistency: Your CI/CD pipelines, monitoring (Prometheus/Grafana), secrets management, and RBAC already live in Kubernetes. Inference should too.
Designing GPU Node Pools: The Architecture That Actually Scales
The most common mistake I see in enterprise Kubernetes setups: one GPU node pool for everything. Training jobs compete with inference pods. Batch embedding tasks starve latency-sensitive chat inference. You end up with neither workload performing well.
The correct architecture uses purpose-built node pools with strict affinity and tainting:
# Node pool taint for LLM inference (GKE example)
# a2-highgpu-1g = 1x A100 40GB
gcloud container node-pools create llm-inference-pool \
  --cluster=enterprise-ai-cluster \
  --machine-type=a2-highgpu-1g \
  --accelerator type=nvidia-tesla-a100,count=1 \
  --num-nodes=3 \
  --min-nodes=1 \
  --max-nodes=10 \
  --enable-autoscaling \
  --node-taints=workload=llm-inference:NoSchedule \
  --node-labels=gpu-type=a100,workload-class=inference

# Separate pool for training/fine-tuning (bursty, long-running)
# a2-highgpu-8g = 8x A100 for distributed training
gcloud container node-pools create llm-training-pool \
  --cluster=enterprise-ai-cluster \
  --machine-type=a2-highgpu-8g \
  --num-nodes=0 \
  --min-nodes=0 \
  --max-nodes=4 \
  --enable-autoscaling \
  --node-taints=workload=llm-training:NoSchedule
The min-nodes=0 on the training pool is critical: training runs are infrequent but expensive. Scale to zero when idle, scale up within minutes for a scheduled fine-tuning run. Your inference pool stays always-on at minimum capacity (1 node) to avoid cold-start latency.
The corresponding Pod spec that respects this architecture:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama3-70b-inference
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama3-inference
  template:
    metadata:
      labels:
        app: llama3-inference
    spec:
      # Target the inference-specific node pool
      nodeSelector:
        workload-class: inference
        gpu-type: a100
      tolerations:
      - key: "workload"
        operator: "Equal"
        value: "llm-inference"
        effect: "NoSchedule"
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:v0.6.2
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3-70B-Instruct"   # with --quantization awq, point this at AWQ-quantized weights (see quantization section below)
        - "--quantization"
        - "awq"   # INT4 quantization
        - "--gpu-memory-utilization"
        - "0.90"
        - "--max-model-len"
        - "8192"
        - "--tensor-parallel-size"
        - "1"
        resources:
          limits:
            nvidia.com/gpu: "1"   # Extended resource - requires Device Plugin
            memory: "40Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: "1"
            memory: "32Gi"
            cpu: "4"
        ports:
        - containerPort: 8000
          name: http
Kubernetes GPU Scheduling: The Device Plugin Layer You Cannot Skip
One of the most misunderstood areas in GPU Kubernetes deployments: the NVIDIA Device Plugin is not optional infrastructure. Without it, Kubernetes has no concept of a GPU as a schedulable resource. You cannot request nvidia.com/gpu: 1 in a Pod spec without the Device Plugin DaemonSet running on every GPU node.
Deployment is straightforward but must happen before any inference workloads:
# Deploy NVIDIA Device Plugin (Helm - recommended for production)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.16.2 \
  --set gfd.enabled=true \
  --set migStrategy=mixed \
  --set deviceListStrategy=volume-mounts
# gfd.enabled=true  -> GPU Feature Discovery
# migStrategy=mixed -> enable MIG support
With gfd.enabled=true, GPU Feature Discovery automatically labels your nodes with GPU capabilities: model, memory, CUDA version, MIG topology. This enables sophisticated scheduling rules like "only schedule Llama-3 70B on A100 80GB nodes" without manual label management.
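As an illustration, here is a sketch of a nodeAffinity block keyed on GFD labels. The label keys (nvidia.com/gpu.product, nvidia.com/gpu.memory) come from GPU Feature Discovery, but the exact values vary by driver and GPU SKU, so check what your nodes actually report with kubectl describe node before copying these:

# Pod template fragment: pin 70B replicas to 80GB A100 nodes via GFD labels
# (values below are illustrative - verify them on your own nodes)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.product
          operator: In
          values:
          - NVIDIA-A100-SXM4-80GB
        - key: nvidia.com/gpu.memory
          operator: Gt
          values:
          - "70000"   # MiB - require more than ~70 GB of GPU memory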
NVIDIA MIG: The Cost-Optimization Most Enterprises Miss
Multi-Instance GPU (MIG) is arguably the highest-ROI configuration change available on A100 and H100 GPUs. A single A100 80GB can be partitioned into up to 7 independent GPU instances, each with guaranteed memory and compute isolation. For inference workloads running smaller models (7B-13B parameters after quantization), this transforms one $4/hour GPU node into seven independent inference endpoints.
# Enable MIG on A100 nodes
nvidia-smi -i 0 -mig 1

# Create 7x 1g.10gb instances (smallest partition)
nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C

# Verify: each partition appears as an nvidia.com/mig-1g.10gb resource in K8s
kubectl describe node gpu-node-1 | grep nvidia
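With migStrategy=mixed in the Device Plugin, each partition is advertised as its own extended resource, so a small-model pod requests one slice instead of a whole GPU. A sketch of the container section under that assumption (the model path is a placeholder and must be quantized small enough to fit in the 10 GB slice):

# Deployment container fragment: one MIG slice per 7B-class replica
containers:
- name: vllm-7b
  image: vllm/vllm-openai:v0.6.2
  args:
  - "--model"
  - "your-registry/mistral-7b-awq"   # placeholder - any INT4-quantized 7B-class model
  - "--quantization"
  - "awq"
  - "--max-model-len"
  - "4096"
  resources:
    requests:
      nvidia.com/mig-1g.10gb: "1"
    limits:
      nvidia.com/mig-1g.10gb: "1"    # one of the seven 1g.10gb partitions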
At Deutsche Bank, our risk analytics team ran 12 concurrent model scoring services on 2 A100 GPUs using MIG partitioning. The cost per inference endpoint dropped by 71% compared to dedicated GPU allocation.
GPU-Optimized Kubernetes Starts With the Model: Quantization in 2026
Running a full-precision (FP16) Llama-3 70B model requires approximately 140GB of GPU memory. That's two A100 80GBs just to load the model, before you process a single token. For most enterprise use cases, this is unnecessary. The BitNet 1-bit quantization breakthrough (and more practically, AWQ and GPTQ INT4 methods) changes the math entirely.
| Model (Llama-3 70B) | Precision | VRAM Required | Quality Impact | Best For |
|---|---|---|---|---|
| FP16 (full precision) | 16-bit | ~140 GB | Baseline | Research, fine-tuning source |
| BF16 | 16-bit | ~140 GB | <0.5% vs FP16 | Training on H100 |
| AWQ INT4 | 4-bit | ~38 GB | 1-3% MMLU drop | Enterprise inference ✅ |
| GPTQ INT4 | 4-bit | ~36 GB | 2-4% MMLU drop | Enterprise inference ✅ |
| BitNet b1.58 (1-bit) | 1.58-bit | ~12 GB | 5-8% quality loss | Edge, cost-extreme scenarios |
For enterprise applications (customer service chatbots, internal knowledge assistants, code review tools), AWQ INT4 is the sweet spot. A 1-3% accuracy drop on MMLU benchmarks is imperceptible to end users, while GPU memory requirements drop by 73%. You go from 2 A100s to 1 A100 per model replica.
# Pre-quantize Llama-3 70B to AWQ before pushing to Kubernetes
# Run this as a one-time job, store in private container registry
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Meta-Llama-3-70B-Instruct"
quant_path = "llama3-70b-awq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
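If you'd rather run that one-time job inside the cluster than on a workstation, one option is a Kubernetes Job pinned to the training pool. This is a sketch under stated assumptions: the container image (with autoawq and the script above baked in), the script path, the workload-class=training label, and the PVC name are all placeholders for whatever your registry and storage actually use:

apiVersion: batch/v1
kind: Job
metadata:
  name: llama3-70b-awq-quantize
  namespace: ai-serving
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        workload-class: training               # assumes you also label the training pool
      tolerations:
      - key: "workload"
        operator: "Equal"
        value: "llm-training"
        effect: "NoSchedule"
      containers:
      - name: quantize
        image: registry.example.com/ai/awq-quantizer:latest    # placeholder image with autoawq installed
        command: ["python", "/scripts/quantize_llama3_awq.py"] # the script shown above
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "200Gi"                    # loading FP16 70B weights needs substantial host RAM
        volumeMounts:
        - name: model-output
          mountPath: /output                   # point save_quantized() at this path
      volumes:
      - name: model-output
        persistentVolumeClaim:
          claimName: quantized-models          # placeholder PVC for the saved AWQ weights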
Kubernetes Autoscaling for GPU Inference: Keeping Costs Rational
Standard HPA (Horizontal Pod Autoscaler) works poorly for LLM inference: CPU/memory metrics don't correlate well with inference load. The correct signal is GPU utilization plus pending request queue depth. In 2026, the two recommended approaches are:
Option 1: KEDA with Custom Metrics (Recommended for Enterprise)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama3-inference-scaler
  namespace: ai-serving
spec:
  scaleTargetRef:
    name: llama3-70b-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 120   # 2 min cooldown (GPU pod startup is slow)
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: vllm_num_requests_waiting
      threshold: "5"    # Scale up if >5 requests waiting
      query: |
        avg(vllm_num_requests_waiting{namespace="ai-serving"})
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: gpu_utilization
      threshold: "75"   # Scale up if GPU >75% utilized
      query: |
        avg(DCGM_FI_DEV_GPU_UTIL{namespace="ai-serving"})
Option 2: Cluster Autoscaler + GPU Node Pool Autoscaling
At the node level, Cluster Autoscaler removes idle GPU nodes within the configured cooldown period. Combined with KEDA pod-level scaling, you have a two-tier cost control mechanism: pods scale down first, then nodes scale to zero. For infrequent workloads (like nightly batch inference), this can reduce GPU costs by 60-80% versus static clusters.
One critical Cluster Autoscaler setting for GPU nodes that most teams miss:
# Cluster Autoscaler annotation: prevent scale-down of GPU nodes running inference pods
# Without this, CA may evict running inference pods aggressively
kubectl annotate node gpu-node-1 \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true

# Better: use a dedicated autoscaling profile for GPU pools
# In the cluster-autoscaler config:
#   --scale-down-unneeded-time=5m            # Wait 5 min before scaling down underused GPU nodes
#   --scale-down-utilization-threshold=0.3   # Only scale down nodes whose requested resources are <30% of capacity
Observability for GPU Kubernetes: What You Must Monitor
GPU inference in Kubernetes introduces failure modes that standard Prometheus/Grafana setups don't catch. After watching multiple production incidents at enterprise clients, here are the non-negotiable metrics:
# DCGM Exporter: NVIDIA's official GPU metrics exporter for Kubernetes
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true   # Auto-discover with Prometheus Operator
Critical GPU metrics to alert on:
| Metric | Alert Threshold | What It Catches |
|---|---|---|
| DCGM_FI_DEV_GPU_TEMP | > 85°C | Thermal throttling (kills throughput) |
| DCGM_FI_DEV_FB_USED | > 90% of total | GPU OOM approaching |
| DCGM_FI_DEV_GPU_UTIL | < 10% for 10m | Idle GPU node (scale-down candidate) |
| vllm_num_requests_waiting | > 20 | Inference queue backlog (latency SLA breach) |
| vllm_gpu_cache_usage_perc | > 95% | KV-cache full (throughput collapse) |
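If Prometheus runs via the Prometheus Operator, the table above translates directly into a PrometheusRule. A minimal sketch covering two of the rows; the severity labels and the for: windows are assumptions to tune against your own SLAs:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-inference-alerts
  namespace: monitoring
spec:
  groups:
  - name: gpu-inference
    rules:
    - alert: GPUThermalThrottleRisk
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "GPU above 85°C - throughput will degrade from thermal throttling"
    - alert: InferenceQueueBacklog
      expr: avg(vllm_num_requests_waiting{namespace="ai-serving"}) > 20
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "vLLM request queue backlog - latency SLA at risk"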
Practical Implementation Sequence: Getting to Production in 4 Weeks
Based on training 14 enterprise batches on Kubernetes AI infrastructure, here's the sequence that consistently produces a production-ready GPU inference cluster:
- Week 1 - Foundation: Audit the existing K8s cluster. Create dedicated GPU node pools with taints/labels. Deploy NVIDIA Device Plugin + GPU Feature Discovery. Verify that nvidia.com/gpu extended resources appear (a smoke-test Pod sketch follows this list).
- Week 2 - Model Preparation: Select and quantize target models to AWQ INT4. Package as container images with model weights baked in (or mount from a PVC). Deploy vLLM or Triton Inference Server. Run load tests with Locust to establish baseline throughput and latency.
- Week 3 - Autoscaling + Cost Control: Install KEDA. Configure ScaledObjects based on vLLM queue depth. Enable Cluster Autoscaler with GPU-appropriate thresholds. Implement MIG partitioning for smaller model replicas.
- Week 4 - Observability + Hardening: Deploy DCGM Exporter. Build Grafana dashboards for GPU utilization, temperature, memory, and inference queue. Configure alerting. Run chaos engineering (kill GPU pods, simulate thermal events). Document runbooks.
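For the Week 1 verification step, the quickest end-to-end check is a throwaway Pod that requests one GPU and runs nvidia-smi; if it schedules onto a GPU node and prints the device, the Device Plugin, taints, and tolerations are wired correctly. A minimal sketch (the image tag is illustrative; any CUDA base image works):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "llm-inference"
    effect: "NoSchedule"
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"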
The 5 Most Expensive GPU Kubernetes Mistakes in 2026
- Mixed workload node pools: CPU and GPU workloads competing on the same nodes cause inconsistent latency and GPU contention. Always isolate with taints.
- Skipping model quantization: running FP16 models when AWQ INT4 gives near-identical quality costs 2-4x more GPU budget. No enterprise workload should ship without a quantization evaluation.
- No KEDA, relying on CPU-based HPA: HPA scaling on CPU usage is meaningless for GPU inference. Queue depth and GPU utilization are the right signals.
- Ignoring KV-cache configuration: vLLM's KV cache is where most throughput wins or losses happen. An undersized cache means constant cache misses and 5-10x latency spikes under load (a sketch of the relevant flags follows this list).
- Static GPU node provisioning with no scale-to-zero: if your inference traffic drops 80% on weekends, a static GPU cluster bleeds money 24/7. Configure min-nodes=0 for off-peak GPU pools.
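On the KV-cache point, these are the vLLM flags that most directly control cache headroom; treat the values below as illustrative starting points to tune against your own traffic, not universal recommendations:

# Container args fragment: KV-cache-related vLLM flags (values are illustrative)
args:
- "--gpu-memory-utilization"
- "0.92"        # fraction of GPU memory vLLM may use; whatever is left after weights becomes KV cache
- "--max-model-len"
- "8192"        # longer contexts reserve proportionally more KV-cache blocks per request
- "--max-num-seqs"
- "64"          # cap concurrent sequences so the cache is not oversubscribed
- "--swap-space"
- "8"           # GiB of CPU swap per GPU for preempted sequences' KV blocks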
🚀 Build This in 5 Days With Hands-On Labs
Our Kubernetes Mastery and Agentic AI Workshop programs cover GPU-optimized Kubernetes architecture, LLM inference deployment, and production observability: 119 hands-on labs, no death-by-PowerPoint.
Rated 4.91/5.0 at Oracle. Zero-risk guarantee: 40% faster deployments in 90 days or full refund + ₹83,000.
Explore Training Programs →