Why Your K8s Cluster Is Already an AI Supercomputer
Here's the conversation I hear constantly from platform engineering leads at Fortune 500 companies: "We want to run our own LLMs in production, but we can't get budget for another GPU cluster."
I spent 12 years running infrastructure at JPMorgan, Deutsche Bank, and Morgan Stanley. I understand budget cycles. And I also know that most teams asking for new GPU hardware already have what they need — they just haven't configured it correctly.
In 2026, LLM quantization on Kubernetes has matured to the point where the hardware conversation is largely over. BitNet 1.58-bit models trending on HackerFront this week can run on CPU clusters. An INT4-quantized Llama 3 8B fits on a single NVIDIA T4. And vLLM's PagedAttention delivers throughput that would have required a cluster of GPUs just 18 months ago.
The actual challenge is no longer hardware — it's knowing how to configure, deploy, and operate quantized LLMs on the Kubernetes infrastructure you already have. That's what this guide covers.
The State of LLM Inference in 2026
Three converging trends have made this the right moment to solve LLM inference on existing K8s:
- Model efficiency explosion: Microsoft's BitNet b1.58, Google's Gemma 3 (the 4B and 12B variants), and Meta's Llama 3 8B are all production-capable at aggressive quantization levels. The "small is powerful" trend is real.
- Serving framework maturity: vLLM 0.5.x, TGI 2.x, and Ollama have production-grade Kubernetes integrations. Helm charts, HPA/KEDA support, and OpenAI-compatible APIs are now standard.
- Kubernetes GPU primitives: Dynamic Resource Allocation (DRA) in K8s 1.32+, GPU time-slicing on A100/H100, and fractional GPU scheduling via NVIDIA MIG have transformed how clusters handle AI workloads.
The benchmark numbers are stark. On a single NVIDIA A100 80GB:
| Model / Precision | GPU Memory | Throughput (tok/s) | Quality vs FP16 |
|---|---|---|---|
| Llama 3 70B — FP16 | 140 GB (2x A100) | 45 tok/s | Baseline |
| Llama 3 70B — INT8 | 70 GB (1x A100) | 68 tok/s | -0.8% |
| Llama 3 70B — INT4 (GPTQ) | 35 GB (1x A100) | 112 tok/s | -3.2% |
| Mistral 7B — INT4 | 4 GB (T4) | 280 tok/s | -2.1% |
| BitNet b1.58 — 7B | CPU-only (32 cores) | 85 tok/s (CPU) | ~FP16 7B |
The INT4 numbers are not a typo. A quantized Llama 3 70B running on a single A100 delivers more throughput than the FP16 version running on two A100s — because quantization reduces memory bandwidth pressure, which is the primary bottleneck in LLM inference.
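The memory numbers in that table are easy to sanity-check yourself. A back-of-the-envelope sketch (weights only — real deployments also need KV cache and activation memory, so treat these as lower bounds):

```python
# Weights-only GPU memory estimate per precision.
# Ignores KV cache and activation overhead, so these are lower bounds.

BYTES_PER_WEIGHT = {
    "fp16": 2.0,   # 16-bit float
    "int8": 1.0,   # 8-bit integer
    "int4": 0.5,   # 4-bit integer (GPTQ)
}

def weight_memory_gb(n_params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB for a model of the given size."""
    return n_params_billion * 1e9 * BYTES_PER_WEIGHT[precision] / 1e9

for precision in ("fp16", "int8", "int4"):
    gb = weight_memory_gb(70, precision)
    print(f"Llama 3 70B @ {precision}: ~{gb:.0f} GB")
# fp16 -> 140 GB (2x A100 80GB), int8 -> 70 GB, int4 -> 35 GB
```

The arithmetic lines up with the table: FP16 at 2 bytes/weight puts a 70B model at 140 GB (two A100 80GB cards), while INT4 at half a byte per weight brings it down to 35 GB — comfortably inside a single A100.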
LLM Quantization Explained: INT8, INT4, GPTQ, and BitNet
Quantization is the process of reducing the numerical precision of model weights from 32-bit or 16-bit floating point to lower-precision integers. For Kubernetes operators deploying LLMs in production, you need to understand four formats:
INT8 — The Safe Default
INT8 quantization (via bitsandbytes or LLM.int8()) converts weights to 8-bit integers while keeping activations in FP16. Memory halves vs FP16. Quality degradation is under 1% on standard benchmarks. This is the recommended starting point for enterprise deployments where quality cannot be compromised — RAG pipelines, legal document processing, financial analysis.
Enabling INT8 in vLLM is a one-line configuration change:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-instruct \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16
```
INT4/GPTQ — The Production Sweet Spot
GPTQ (post-training quantization for generative pre-trained Transformers) is the dominant INT4 format for enterprise deployments in 2026. Unlike naive rounding, GPTQ uses calibration data to minimize quantization error per layer — the result is 4-bit weights that behave far better than their precision would suggest.
Pre-quantized GPTQ models are available on Hugging Face for most major LLMs. Community repositories such as TheBloke's publish GPTQ builds of Llama, Mistral, Phi-3, Gemma, and Qwen at multiple quantization levels (4-bit, 5-bit, 6-bit, and 8-bit, with varying group sizes).
```bash
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-3-70B-Instruct-GPTQ \
  --quantization gptq \
  --dtype float16 \
  --max-model-len 8192
```
GGUF — For CPU and Edge K8s Nodes
GGUF (the successor to GGML) is the format for running quantized models on CPU. Llama.cpp uses GGUF and has a production-grade Kubernetes integration via the ghcr.io/ggerganov/llama.cpp:server container. For workloads with loose latency requirements — batch summarization, overnight document processing — GGUF models on CPU pods can replace GPU workloads entirely.
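As a sketch, a CPU-only llama.cpp server Deployment might look like the following. The model filename, flag values, and resource sizes are illustrative assumptions — check the llama.cpp server documentation for the flags your image version supports:

```yaml
# Illustrative llama.cpp server on CPU nodes (model path and flags are examples)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llamacpp-cpu
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llamacpp-cpu
  template:
    metadata:
      labels:
        app: llamacpp-cpu
    spec:
      containers:
      - name: server
        image: ghcr.io/ggerganov/llama.cpp:server
        args:
        - "--model"
        - "/models/mistral-7b-instruct-q4_k_m.gguf"  # example GGUF file
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8080"
        - "--threads"
        - "8"
        resources:
          requests: { cpu: "8", memory: "12Gi" }
          limits: { cpu: "8", memory: "16Gi" }
        volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc  # assumes a shared RWX model cache exists
```

Pin `--threads` to the pod's CPU request and set requests equal to limits so the scheduler gives the pod a predictable, uncontended core count.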
BitNet 1.58-bit — The 2026 Disruptor
Microsoft's BitNet b1.58 quantizes every weight to a ternary value: -1, 0, or +1. The result is a 1.58-bit model that requires no floating-point multiply operations — only additions. On modern CPUs with AVX-512, BitNet inference is competitive with INT4 GPU inference for models up to 7B parameters.
For Kubernetes platform teams, this is significant: BitNet models can run on standard CPU node pools, completely eliminating GPU resource requests. Microsoft's reference implementation (microsoft/BitNet) ships a production-grade inference server, and the community has packaged it as a Kubernetes Helm chart.
```yaml
# BitNet on K8s — CPU-only deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bitnet-inference
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: bitnet-inference
  template:
    metadata:
      labels:
        app: bitnet-inference   # must match the selector above
    spec:
      containers:
      - name: bitnet
        image: ghcr.io/microsoft/bitnet:latest
        resources:
          requests:
            cpu: "8"
            memory: "16Gi"
          limits:
            cpu: "16"
            memory: "32Gi"
          # Note: NO gpu resource requests required
        env:
        - name: MODEL_PATH
          value: /models/bitnet-b1.58-3B-q1_5.gguf
        - name: N_THREADS
          value: "16"
```
vLLM on Kubernetes: The Production Deployment Architecture
vLLM is the production standard for high-throughput LLM serving on Kubernetes in 2026. Its PagedAttention algorithm — which manages the KV cache like virtual memory in an OS — delivers 23x higher throughput than naive Hugging Face pipeline inference at the same GPU utilization. At scale, that's the difference between 1 GPU node and 23.
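The virtual-memory analogy is easiest to see in code. The toy sketch below illustrates the block-table bookkeeping at the heart of the idea — it is not vLLM's actual implementation. Each sequence holds a table mapping its KV-cache positions to fixed-size physical blocks, and a new block is allocated only when the current one fills, instead of reserving the maximum sequence length up front:

```python
# Toy illustration of the paged KV-cache idea (not vLLM code):
# the cache is carved into fixed-size blocks, and each sequence
# grabs a new block only when its current block fills up.

BLOCK_SIZE = 16  # tokens per physical block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids
        self.seq_lens: dict[int, int] = {}

    def append_token(self, seq_id: int) -> int:
        """Record one generated token; allocate a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:          # current block full (or first token)
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1]                 # physical block holding this token

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):                      # 20 tokens span ceil(20/16) = 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))        # 2
```

The payoff: per-request memory waste shrinks from "reserve max-model-len tokens" to at most one partially filled block per sequence, which is what lets vLLM batch far more concurrent requests onto the same GPU.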
The Reference Architecture
Here is the production vLLM architecture we deploy for enterprise clients:
```yaml
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b-int4
  namespace: ai-serving
  labels:
    app: vllm
    model: llama3-70b-int4
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        accelerator: nvidia-a100
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.5.4
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model=TheBloke/Llama-3-70B-Instruct-GPTQ"
        - "--quantization=gptq"
        - "--dtype=float16"
        - "--max-model-len=8192"
        - "--tensor-parallel-size=1"
        - "--gpu-memory-utilization=0.92"
        - "--served-model-name=llama3-70b"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "80Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "48Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ai-serving
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
    protocol: TCP
  type: ClusterIP
```
Model Caching — The Critical Performance Layer
In production, model loading time (120–300 seconds for a 70B model) is unacceptable for pod restarts and scaling. The solution is a shared ReadWriteMany PVC backed by EFS, Google Cloud Filestore, or NFS that all inference pods mount. Models are downloaded once and cached; pod startup time drops to 15–30 seconds.
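The manifests in this guide reference a `model-cache-pvc` claim. A sketch of that claim — the `storageClassName` is an assumption and depends on your cluster's RWX-capable CSI driver:

```yaml
# Shared ReadWriteMany model cache mounted by all inference pods.
# storageClassName is cluster-specific: e.g. "efs-sc" on EKS,
# a Filestore CSI class on GKE, or an NFS provisioner on-prem.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: ai-serving
spec:
  accessModes:
  - ReadWriteMany           # every pod mounts the same cache
  storageClassName: efs-sc  # assumption — replace with your RWX class
  resources:
    requests:
      storage: 200Gi        # room for a 70B GPTQ checkpoint plus headroom
```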
```yaml
# Model pre-loader Job — runs once, caches model for all pods
apiVersion: batch/v1
kind: Job
metadata:
  name: model-preloader-llama3-70b
  namespace: ai-serving
spec:
  template:
    spec:
      containers:
      - name: preloader
        image: python:3.11-slim
        # slim image has no huggingface_hub preinstalled; install it first
        command: ["sh", "-c"]
        args:
        - |
          pip install --quiet huggingface_hub
          python - <<'PY'
          from huggingface_hub import snapshot_download
          snapshot_download(
              repo_id='TheBloke/Llama-3-70B-Instruct-GPTQ',
              local_dir='/models/llama3-70b-gptq',
              ignore_patterns=['*.bin'],  # GPTQ uses .safetensors
          )
          print('Model cached successfully')
          PY
        volumeMounts:
        - name: model-cache
          mountPath: /models
        env:
        - name: HF_HUB_CACHE
          value: /models
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      restartPolicy: OnFailure
```
OpenAI-Compatible API — Drop-In Replacement
vLLM exposes an OpenAI-compatible REST API. Every LangChain, LlamaIndex, or custom application that uses openai.ChatCompletion can route to vLLM with a single environment variable change:
```bash
# Before (OpenAI API)
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-...

# After (vLLM on K8s — same code, zero changes)
OPENAI_API_BASE=http://vllm-service.ai-serving.svc.cluster.local/v1
OPENAI_API_KEY=EMPTY  # vLLM doesn't require auth (add a proxy for production)
```
This is the migration path that makes vLLM adoption frictionless for enterprise teams with existing OpenAI integrations — and it's exactly the architecture we cover in the Agentic AI Workshop (5 days, hands-on labs, 4.91/5.0 at Oracle).
Autoscaling LLM Inference with KEDA + Prometheus
This is where most teams get it wrong. Do not use standard HPA for LLM inference pods. HPA scales on CPU or memory utilization, but a GPU-bound inference pod under heavy load can report low CPU% simply because the GPU is doing all the work. CPU and memory are the wrong scaling signals for LLM workloads.
The correct approach is KEDA with Prometheus custom metrics. vLLM exposes a rich set of Prometheus metrics including:
- `vllm:num_requests_waiting` — Requests queued waiting for a GPU slot
- `vllm:gpu_cache_usage_perc` — KV cache utilization (a proxy for memory pressure)
- `vllm:num_requests_running` — Active concurrent requests
- `vllm:e2e_request_latency_seconds` — End-to-end latency histogram
```yaml
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: ai-serving
spec:
  scaleTargetRef:
    name: vllm-llama3-70b-int4
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 120   # 2 min cooldown (GPU pod startup is slow)
  pollingInterval: 15
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: vllm_requests_waiting
      threshold: "5"    # Scale out when 5+ requests are queued
      query: sum(vllm:num_requests_waiting{namespace="ai-serving"})
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: vllm_kv_cache_usage
      threshold: "80"   # Scale out at 80% KV cache utilization
      query: avg(vllm:gpu_cache_usage_perc{namespace="ai-serving"}) * 100
```
The cooldownPeriod: 120 is important. GPU pods take 30–90 seconds to become ready after scheduling. Aggressive scale-down will trigger constant churn. Set a minimum of 2 minutes cooldown, and for large models (70B+), use 5 minutes.
Node Pool Pre-Warming
For enterprise workloads with predictable traffic patterns (8 AM – 8 PM IST for India-based deployments), use the cluster-autoscaler with a scheduled scale-up to pre-warm GPU node pools before peak traffic. This avoids the 5–10 minute cold-start penalty of bringing up a new GPU VM:
```yaml
# Scheduled node pre-warm using a sentinel Deployment
# Keeps 1 GPU node always warm; KEDA scales vLLM pods on top
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-node-sentinel
  namespace: ai-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-node-sentinel
  template:
    metadata:
      labels:
        app: gpu-node-sentinel
    spec:
      nodeSelector:
        accelerator: nvidia-t4
      containers:
      - name: sentinel
        image: nvidia/cuda:12.3.2-base-ubuntu22.04
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: "1"  # Keeps GPU node alive, costs ~$0.35/hr
```
GitOps Model Deployment: Zero-Downtime Rollouts with Argo CD
Deploying a new model version — quantization level, fine-tuned checkpoint, or entirely different model — is a high-risk operation without the right pipeline. We've seen production outages at financial services firms caused by manual kubectl set image commands during model updates. The fix is GitOps.
The Helm Chart Structure
```text
llm-inference/
├── Chart.yaml
├── values.yaml               # Default: production config
├── values-staging.yaml       # Staging overrides (smaller model, 1 replica)
├── values-int8.yaml          # INT8 variant for quality-critical paths
├── values-int4.yaml          # INT4 variant for high-throughput paths
└── templates/
    ├── deployment.yaml       # vLLM Deployment
    ├── service.yaml
    ├── serviceaccount.yaml
    ├── hpa-or-keda.yaml      # ScaledObject
    ├── pvc-model-cache.yaml
    └── configmap-env.yaml
```
The key parameters in values.yaml that control the model:
```yaml
# values.yaml
model:
  repo: "TheBloke/Llama-3-70B-Instruct-GPTQ"
  quantization: "gptq"
  maxModelLen: 8192
  gpuMemoryUtilization: 0.92
  tensorParallelSize: 1
resources:
  gpu:
    count: 1
    type: "nvidia-a100"
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 8
  waitingRequestsThreshold: 5
```
To roll out a new model, you change model.repo in Git. Argo CD detects the diff and triggers a rolling update with the configured surge/unavailable settings. The old pods stay live until new pods pass readiness probes — zero downtime.
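Those surge/unavailable settings live in the vLLM Deployment's rollout strategy. A minimal sketch (the values are illustrative; `maxSurge: 1` assumes a spare GPU is available during the rollout):

```yaml
# Rolling-update settings on the vLLM Deployment (illustrative values):
# bring one new pod up before taking an old one down, so serving
# capacity never dips during a model swap.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # one extra GPU pod during rollout (needs a spare GPU)
      maxUnavailable: 0    # never drop below the configured replica count
```

The trade-off: `maxSurge: 1` with `maxUnavailable: 0` is zero-downtime but briefly doubles the GPU footprint per rolling pod, so budget one idle GPU node for rollout windows.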
Argo CD Application — Multi-Environment Setup
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-inference-prod
  namespace: argocd
spec:
  project: ai-platform
  source:
    repoURL: https://github.com/your-org/ai-platform-gitops
    targetRevision: main
    path: helm/llm-inference
    helm:
      valueFiles:
      - values.yaml
      - values-int4.yaml   # INT4 overlay for production
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - RespectIgnoreDifferences=true
    - CreateNamespace=true
```
Model Promotion Pipeline
The production workflow for promoting a new model version:
- Evaluate: Run `lm-evaluation-harness` against the new checkpoint in a CI job. Gate on: MMLU ≥ baseline − 3%, latency p95 ≤ 300 ms at target load.
- Stage: Merge to the `staging` branch → Argo CD auto-deploys to the staging namespace. Run integration tests against the staging endpoint.
- Canary: Promote to `main` with a 10% canary weight using Argo Rollouts or a weighted VirtualService (Istio/Linkerd). Monitor error rate and latency for 30 minutes.
- Full rollout: Shift 100% of traffic to the new pods. Old pods terminate after cooldown.
- Rollback: If any gate fails, `git revert` → Argo CD restores the previous model in under 3 minutes.
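The canary step can be sketched with Argo Rollouts. The weights and pause duration below are illustrative, and the manifest assumes the Rollout resource replaces the plain vLLM Deployment:

```yaml
# Sketch: canary traffic shift for a new model version via Argo Rollouts
# (illustrative values; assumes a Rollout replaces the vLLM Deployment)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: vllm-llama3-70b-int4
  namespace: ai-serving
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10             # 10% of traffic to the new model version
      - pause: {duration: 30m}    # watch error rate and p95 latency
      - setWeight: 100            # full rollout if the gates hold
```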
This pipeline — Evaluate → Stage → Canary → Full → Rollback — is what separates teams that run LLMs in production from teams that run LLMs in demos. We build this entire pipeline hands-on in the gheWARE Agentic AI Workshop, including live labs on vLLM deployment, KEDA configuration, and Argo CD GitOps for AI workloads.
Frequently Asked Questions
Can I run a quantized LLM on Kubernetes without a GPU?
Yes — highly quantized models (INT4/GGUF) can run on CPU-only nodes for low-throughput use cases. For production-grade latency (under 200ms), a GPU node is still recommended, but a single NVIDIA T4 (available on most cloud providers for ~$0.35/hr) running an INT4-quantized Mistral 7B can handle 50–80 concurrent requests — a fraction of the cost of running full-precision models. BitNet 1.58-bit models (like BitNet b1.58 3B from Microsoft) can run entirely on CPU clusters and are viable for enterprise classification and summarization workloads.
What is the difference between INT8 and INT4 quantization for LLMs?
INT8 quantization stores model weights as 8-bit integers, cutting memory usage roughly in half versus FP16 with less than 1% quality degradation on most benchmarks. INT4 goes further — 4-bit integers — cutting memory by 4x vs FP16 at a 2–5% quality drop. For most enterprise use cases (summarization, extraction, classification), INT4 is acceptable. For RAG-based reasoning or code generation, INT8 is the safer default.
Which is better for production Kubernetes LLM inference — vLLM or TGI?
vLLM is generally the production choice in 2026 for high-throughput enterprise deployments. Its PagedAttention mechanism handles concurrent requests far more efficiently than standard batching, and it integrates cleanly with Helm charts and Kubernetes HPA/KEDA autoscaling. Hugging Face TGI is excellent for fast prototyping and supports more model formats natively. For enterprise at-scale (50+ concurrent users), vLLM wins on throughput; TGI wins on simplicity. KServe is a third option worth evaluating for teams already running KFServing or Kubeflow.
How do I autoscale LLM inference pods on Kubernetes?
Use KEDA (Kubernetes Event-Driven Autoscaling) with a Prometheus scaler targeting GPU utilization or queue depth metrics. Standard HPA can only scale on CPU/memory, which is a poor proxy for LLM inference load. A typical KEDA ScaledObject sets minReplicas: 1, maxReplicas: 8, and triggers on a custom metric like vllm:num_requests_waiting > 5. This gives you zero-cost idle state with rapid scale-out during traffic spikes. Set cooldownPeriod: 120 minimum to avoid GPU pod churn.
What is BitNet and why is it trending for Kubernetes LLM deployment?
BitNet (Microsoft Research, 2024) takes quantization to its extreme — 1.58-bit weights where each parameter is -1, 0, or +1. BitNet b1.58 models can run on CPU clusters without any GPU at all, achieving performance competitive with FP16 models at their scale. For Kubernetes deployments, this means you can run LLM inference on standard compute nodes, eliminating GPU node pools entirely for certain workloads. The BitNet.cpp implementation supports efficient inference on standard x86 and ARM64 architectures — highly relevant for teams running ARM64 Kubernetes nodes (AWS Graviton, GKE Tau T2A, Azure Cobalt).
Conclusion: The Hardware Excuse Is Over
In 2024, "we can't afford the GPU cluster" was a legitimate blocker for enterprise LLM deployments. In 2026, it's an infrastructure knowledge gap.
The combination of INT4/GPTQ quantization (4x memory reduction), vLLM's PagedAttention (23x throughput improvement), KEDA autoscaling (zero cost at idle), and GitOps model management (safe, zero-downtime rollouts) has fundamentally changed the economics of LLM inference on Kubernetes.
A single NVIDIA A100 node running an INT4-quantized Llama 3 70B via vLLM can serve 50+ concurrent enterprise users at under 200ms p95 latency — a workload that would have required 4–6 GPU nodes at full precision 18 months ago.
Your existing Kubernetes cluster is already the platform. Quantization is the key that unlocks it.
Build This Architecture Hands-On
Learn vLLM deployment, KEDA autoscaling, GitOps model pipelines, and quantization strategies in the gheWARE Agentic AI Workshop — 5 days, 119 hands-on labs, rated 4.91/5.0 by Oracle engineers.
View Workshop → 📖 Get the Book