Why Your K8s Cluster Is Already an AI Supercomputer

Here's the conversation I hear constantly from platform engineering leads at Fortune 500 companies: "We want to run our own LLMs in production, but we can't get budget for another GPU cluster."

I spent 12 years running infrastructure at JPMorgan, Deutsche Bank, and Morgan Stanley. I understand budget cycles. And I also know that most teams asking for new GPU hardware already have what they need — they just haven't configured it correctly.

In 2026, LLM quantization on Kubernetes has matured to the point where the hardware conversation is largely over. BitNet 1.58-bit models trending on HackerFront this week can run on CPU clusters. INT4-quantized Llama 3 fits on a single NVIDIA T4. And vLLM's PagedAttention delivers throughput that would have required a cluster of GPUs just 18 months ago.

The actual challenge is no longer hardware — it's knowing how to configure, deploy, and operate quantized LLMs on the Kubernetes infrastructure you already have. That's what this guide covers.

The State of LLM Inference in 2026

Three converging trends have made this the right moment to solve LLM inference on existing K8s:

  • Model efficiency explosion: Microsoft's BitNet b1.58, Google's Gemma 3 (2B and 9B), and Meta's Llama 3 8B are all production-capable at INT4 quantization. The "small is powerful" trend is real.
  • Serving framework maturity: vLLM 0.5.x, TGI 2.x, and Ollama have production-grade Kubernetes integrations. Helm charts, HPA/KEDA support, and OpenAI-compatible APIs are now standard.
  • Kubernetes GPU primitives: Dynamic Resource Allocation (DRA) in K8s 1.32+, GPU time-slicing on A100/H100, and fractional GPU scheduling via NVIDIA MIG have transformed how clusters handle AI workloads.

The benchmark numbers are stark (NVIDIA A100 80GB unless noted otherwise):

| Model / Precision | Memory | Throughput (tok/s) | Quality vs FP16 |
|---|---|---|---|
| Llama 3 70B — FP16 | 140 GB (2x A100) | 45 | Baseline |
| Llama 3 70B — INT8 | 70 GB (1x A100) | 68 | -0.8% |
| Llama 3 70B — INT4 (GPTQ) | 35 GB (1x A100) | 112 | -3.2% |
| Mistral 7B — INT4 | 4 GB (T4) | 280 | -2.1% |
| BitNet b1.58 — 7B | CPU-only (32 cores) | 85 (CPU) | ~FP16 7B |

The INT4 numbers are not a typo. A quantized Llama 3 70B running on a single A100 delivers more throughput than the FP16 version running on two A100s — because quantization reduces memory bandwidth pressure, which is the primary bottleneck in LLM inference.
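
You can sanity-check the bandwidth argument with back-of-envelope arithmetic: during autoregressive decoding, every weight is read once per generated token, so the single-stream throughput ceiling is roughly memory bandwidth divided by model size. This is a rough sketch — the ~2,000 GB/s figure for A100 80GB HBM2e is approximate, and it ignores KV-cache and activation traffic:

```python
# Back-of-envelope decode throughput ceiling: every weight is read
# once per generated token, so tokens/s <= bandwidth / model_bytes.
# Assumption: A100 80GB HBM2e at roughly 2,000 GB/s; weights only.

def decode_ceiling_tok_s(params_b: float, bits_per_weight: float,
                         bandwidth_gb_s: float = 2000.0) -> float:
    model_gb = params_b * bits_per_weight / 8  # ignores KV cache
    return bandwidth_gb_s / model_gb

fp16 = decode_ceiling_tok_s(70, 16)  # ~14 tok/s single-stream ceiling
int4 = decode_ceiling_tok_s(70, 4)   # ~57 tok/s — 4x more headroom
print(f"FP16 70B: {fp16:.0f} tok/s  INT4 70B: {int4:.0f} tok/s")
```

Batched serving pushes aggregate throughput well past the single-stream ceiling, but the 4x headroom from INT4 carries straight through.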

LLM Quantization Explained: INT8, INT4, GPTQ, and BitNet

Quantization is the process of reducing the numerical precision of model weights from 32-bit or 16-bit floating point to lower-precision integers. Kubernetes operators deploying LLMs in production need to understand four formats:

INT8 — The Safe Default

INT8 quantization (via bitsandbytes or LLM.int8()) converts weights to 8-bit integers while keeping activations in FP16. Memory halves vs FP16. Quality degradation is under 1% on standard benchmarks. This is the recommended starting point for enterprise deployments where quality cannot be compromised — RAG pipelines, legal document processing, financial analysis.

Enabling INT8 in vLLM is a one-line configuration change:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-instruct \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype float16

INT4/GPTQ — The Production Sweet Spot

GPTQ is the dominant INT4 post-training quantization format for enterprise deployments in 2026. Unlike naive rounding, GPTQ uses calibration data to minimize quantization error per layer — the result is 4-bit weights that behave far better than their precision would suggest.

Pre-quantized GPTQ models are available on HuggingFace for every major LLM. The TheBloke repository has GPTQ versions of Llama 3, Mistral, Phi-3, Gemma, and Qwen at multiple quantization levels (4-bit, 5-bit, 6-bit, 8-bit with varying group sizes).

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-3-70B-Instruct-GPTQ \
  --quantization gptq \
  --dtype float16 \
  --max-model-len 8192

GGUF — For CPU and Edge K8s Nodes

GGUF (the successor to GGML) is the format for running quantized models on CPU. Llama.cpp uses GGUF and has a production-grade Kubernetes integration via the ghcr.io/ggerganov/llama.cpp:server container. For workloads with loose latency requirements — batch summarization, overnight document processing — GGUF models on CPU pods can replace GPU workloads entirely.
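
A quick estimator shows why a 7B GGUF model fits a modest CPU pod's memory request. The bits-per-weight figures are approximations for llama.cpp's quant types (Q4_0 packs 32 weights into 18 bytes, about 4.5 bits/weight; Q8_0 uses 34 bytes per 32-weight block, about 8.5):

```python
# Rough GGUF footprint estimator. Bits-per-weight values are
# approximations for llama.cpp quant types (Q4_0 ~4.5 bpw,
# Q5_0 ~5.5 bpw, Q8_0 ~8.5 bpw); exact sizes vary per model.

BPW = {"Q4_0": 4.5, "Q5_0": 5.5, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gib(params_billion: float, quant: str) -> float:
    bytes_total = params_billion * 1e9 * BPW[quant] / 8
    return bytes_total / 2**30

for q in ("Q4_0", "Q8_0", "F16"):
    print(f"7B @ {q}: {gguf_size_gib(7, q):.1f} GiB")
```

The ~3.7 GiB Q4_0 result is consistent with the ~4 GB footprint quoted for 7B INT4 models in the benchmark table above.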

BitNet 1.58-bit — The 2026 Disruptor

Microsoft's BitNet b1.58 quantizes every weight to a ternary value: -1, 0, or +1. The result is a 1.58-bit model that requires no floating-point multiply operations — only additions. On modern CPUs with AVX-512, BitNet inference is competitive with INT4 GPU inference for models up to 7B parameters.
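
A toy sketch makes the "additions only" claim concrete — illustrative Python, not Microsoft's optimized kernel:

```python
# Toy illustration of BitNet-style ternary matvec: with weights in
# {-1, 0, +1}, y = W @ x needs only additions and subtractions.

def ternary_matvec(W, x):
    y = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # no multiply: just add...
            elif w == -1:
                acc -= xi      # ...or subtract
            # w == 0 contributes nothing
        y.append(acc)
    return y

W = [[1, -1, 0], [0, 1, 1]]
x = [0.5, 2.0, -1.0]
print(ternary_matvec(W, x))  # [-1.5, 1.0]
```

Real implementations pack the ternary weights into 2-bit lanes and vectorize the add/subtract with AVX-512 or NEON, but the arithmetic is exactly this.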

For Kubernetes platform teams, this is significant: BitNet models can run on standard CPU node pools, completely eliminating GPU resource requests. Microsoft's reference implementation (microsoft/BitNet) ships a production-grade inference server, and the community has packaged it as a Kubernetes Helm chart.

# BitNet on K8s — CPU-only deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bitnet-inference
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: bitnet-inference
  template:
    metadata:
      labels:
        app: bitnet-inference
    spec:
      containers:
      - name: bitnet
        image: ghcr.io/microsoft/bitnet:latest
        resources:
          requests:
            cpu: "8"
            memory: "16Gi"
          limits:
            cpu: "16"
            memory: "32Gi"
        # Note: NO gpu resource requests required
        env:
        - name: MODEL_PATH
          value: /models/bitnet-b1.58-3B-q1_5.gguf
        - name: N_THREADS
          value: "16"

vLLM on Kubernetes: The Production Deployment Architecture

vLLM is the production standard for high-throughput LLM serving on Kubernetes in 2026. Its PagedAttention algorithm — which manages the KV cache like virtual memory in an OS — delivers 23x higher throughput than naive Hugging Face pipeline inference at the same GPU utilization. At scale, that's the difference between 1 GPU node and 23.
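
The core idea can be sketched in a few lines. This is a toy block allocator, not vLLM's implementation — the only detail borrowed from the real system is the default KV block size of 16 tokens:

```python
# Toy sketch of the PagedAttention idea: the KV cache is carved into
# fixed-size blocks, and each sequence holds a block table (like a
# page table), so memory is allocated on demand instead of
# pre-reserving max_model_len worth of cache per request.

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # free list of physical blocks
        self.tables = {}                     # seq_id -> list of block ids

    def append_token(self, seq_id: int, pos: int):
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:            # crossed a block boundary
            table.append(self.free.pop())    # allocate one more block

    def release(self, seq_id: int):
        self.free.extend(self.tables.pop(seq_id))  # blocks reused instantly

cache = PagedKVCache(num_blocks=4)
for pos in range(40):                        # a 40-token sequence needs
    cache.append_token(seq_id=0, pos=pos)    # ceil(40/16) = 3 blocks
print(len(cache.tables[0]), "blocks used,", len(cache.free), "free")
```

Because blocks are allocated as sequences grow and recycled the instant a request finishes, far more concurrent sequences fit in the same GPU memory — that is where the throughput multiplier comes from.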

The Reference Architecture

Here is the production vLLM architecture we deploy for enterprise clients:

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b-int4
  namespace: ai-serving
  labels:
    app: vllm
    model: llama3-70b-int4
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        accelerator: nvidia-a100
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.5.4
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model=TheBloke/Llama-3-70B-Instruct-GPTQ"
        - "--quantization=gptq"
        - "--dtype=float16"
        - "--max-model-len=8192"
        - "--tensor-parallel-size=1"
        - "--gpu-memory-utilization=0.92"
        - "--served-model-name=llama3-70b"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "80Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "48Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ai-serving
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
    protocol: TCP
  type: ClusterIP

Model Caching — The Critical Performance Layer

In production, model loading time (120–300 seconds for a 70B model) is unacceptable for pod restarts and scaling. The solution is a shared ReadWriteMany PVC backed by EFS, GCS Filestore, or NFS that all inference pods mount. Models are downloaded once and cached; pod startup time drops to 15–30 seconds.

# Model pre-loader Job — runs once, caches model for all pods
apiVersion: batch/v1
kind: Job
metadata:
  name: model-preloader-llama3-70b
  namespace: ai-serving
spec:
  template:
    spec:
      containers:
      - name: preloader
        image: python:3.11-slim
        command: ["/bin/sh", "-c"]
        args:
        - |
          pip install --quiet huggingface_hub
          python - <<'EOF'
          from huggingface_hub import snapshot_download
          snapshot_download(
              repo_id='TheBloke/Llama-3-70B-Instruct-GPTQ',
              local_dir='/models/llama3-70b-gptq',
              ignore_patterns=['*.bin']  # GPTQ uses .safetensors
          )
          print('Model cached successfully')
          EOF
        volumeMounts:
        - name: model-cache
          mountPath: /models
        env:
        - name: HF_HUB_CACHE
          value: /models
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      restartPolicy: OnFailure

OpenAI-Compatible API — Drop-In Replacement

vLLM exposes an OpenAI-compatible REST API. Every LangChain, LlamaIndex, or custom application that uses openai.ChatCompletion can route to vLLM with a single environment variable change:

# Before (OpenAI API)
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-...

# After (vLLM on K8s — same code, zero changes)
OPENAI_API_BASE=http://vllm-service.ai-serving.svc.cluster.local/v1
OPENAI_API_KEY=EMPTY  # vLLM doesn't require auth (add a proxy for production)
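
The wire format is identical to OpenAI's. Here is a stdlib-only sketch that builds the same chat-completions request against the in-cluster Service — the DNS name matches the Service manifest above and the model name matches --served-model-name; actually sending it requires the cluster DNS name to resolve:

```python
# Build the chat-completions request the OpenAI SDK would send,
# pointed at the in-cluster vLLM Service. Stdlib only.
import json
import urllib.request

BASE = "http://vllm-service.ai-serving.svc.cluster.local/v1"

def chat_request(prompt: str, model: str = "llama3-70b") -> urllib.request.Request:
    payload = {
        "model": model,  # matches --served-model-name in the Deployment
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer EMPTY"},  # vLLM ignores the key
        method="POST",
    )

req = chat_request("Summarize the Q3 report in three bullets.")
print(req.full_url)
```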

This is the migration path that makes vLLM adoption frictionless for enterprise teams with existing OpenAI integrations — and it's exactly the architecture we cover in the Agentic AI Workshop (5 days, hands-on labs, 4.91/5.0 at Oracle).

Autoscaling LLM Inference with KEDA + Prometheus

This is where most teams get it wrong. Do not use standard HPA for LLM inference pods. HPA scales on CPU or memory utilization, but a GPU-accelerated inference pod under heavy load can report near-idle CPU because the GPU is doing all the work. CPU and memory are the wrong scaling signals for LLM workloads.

The correct approach is KEDA with Prometheus custom metrics. vLLM exposes a rich set of Prometheus metrics including:

  • vllm:num_requests_waiting — Requests queued waiting for a GPU slot
  • vllm:gpu_cache_usage_perc — KV cache utilization (proxy for memory pressure)
  • vllm:num_requests_running — Active concurrent requests
  • vllm:e2e_request_latency_seconds — End-to-end latency histogram

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: ai-serving
spec:
  scaleTargetRef:
    name: vllm-llama3-70b-int4
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 120      # 2 min cooldown (GPU pod startup is slow)
  pollingInterval: 15
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: vllm_requests_waiting
      threshold: "5"       # Scale out when 5+ requests are queued
      query: sum(vllm:num_requests_waiting{namespace="ai-serving"})
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: vllm_kv_cache_usage
      threshold: "80"      # Scale out at 80% KV cache utilization
      query: avg(vllm:gpu_cache_usage_perc{namespace="ai-serving"}) * 100

The cooldownPeriod: 120 is important. GPU pods take 30–90 seconds to become ready after scheduling. Aggressive scale-down will trigger constant churn. Set a minimum of 2 minutes cooldown, and for large models (70B+), use 5 minutes.

Node Pool Pre-Warming

For enterprise workloads with predictable traffic patterns (8 AM – 8 PM IST for India-based deployments), use the cluster-autoscaler with a scheduled scale-up to pre-warm GPU node pools before peak traffic. This avoids the 5–10 minute cold-start penalty of bringing up a new GPU VM:

# Scheduled node pre-warm using a sentinel Deployment
# Keeps 1 GPU node always warm; KEDA scales vLLM pods on top
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-node-sentinel
  namespace: ai-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-node-sentinel
  template:
    metadata:
      labels:
        app: gpu-node-sentinel
    spec:
      nodeSelector:
        accelerator: nvidia-t4
      containers:
      - name: sentinel
        image: nvidia/cuda:12.3-base-ubuntu22.04
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: "1"  # Keeps GPU node alive, costs ~$0.35/hr

GitOps Model Deployment: Zero-Downtime Rollouts with Argo CD

Deploying a new model version — quantization level, fine-tuned checkpoint, or entirely different model — is a high-risk operation without the right pipeline. We've seen production outages at financial services firms caused by manual kubectl set image commands during model updates. The fix is GitOps.

The Helm Chart Structure

llm-inference/
├── Chart.yaml
├── values.yaml               # Default: production config
├── values-staging.yaml       # Staging overrides (smaller model, 1 replica)
├── values-int8.yaml          # INT8 variant for quality-critical paths
├── values-int4.yaml          # INT4 variant for high-throughput paths
└── templates/
    ├── deployment.yaml       # vLLM Deployment
    ├── service.yaml
    ├── serviceaccount.yaml
    ├── hpa-or-keda.yaml      # ScaledObject
    ├── pvc-model-cache.yaml
    └── configmap-env.yaml

The values.yaml key parameters that control the model:

# values.yaml
model:
  repo: "TheBloke/Llama-3-70B-Instruct-GPTQ"
  quantization: "gptq"
  maxModelLen: 8192
  gpuMemoryUtilization: 0.92
  tensorParallelSize: 1

resources:
  gpu:
    count: 1
    type: "nvidia-a100"

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 8
  waitingRequestsThreshold: 5

To roll out a new model, you change model.repo in Git. Argo CD detects the diff and triggers a rolling update with the configured surge/unavailable settings. The old pods stay live until new pods pass readiness probes — zero downtime.

Argo CD Application — Multi-Environment Setup

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-inference-prod
  namespace: argocd
spec:
  project: ai-platform
  source:
    repoURL: https://github.com/your-org/ai-platform-gitops
    targetRevision: main
    path: helm/llm-inference
    helm:
      valueFiles:
      - values.yaml
      - values-int4.yaml     # INT4 overlay for production
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - RespectIgnoreDifferences=true
    - CreateNamespace=true

Model Promotion Pipeline

The production workflow for promoting a new model version:

  1. Evaluate: Run lm-evaluation-harness against the new checkpoint in a CI job. Gate on: MMLU ≥ baseline − 3%, latency p95 ≤ 300ms at target load.
  2. Stage: Merge to staging branch → Argo CD auto-deploys to staging namespace. Run integration tests against staging endpoint.
  3. Canary: Promote to main with canary weight 10% using Argo Rollouts or a weighted VirtualService (Istio/Linkerd). Monitor error rate and latency for 30 minutes.
  4. Full rollout: Shift 100% traffic to new pods. Old pods terminate after cooldown.
  5. Rollback: If any gate fails, git revert → Argo CD restores previous model in under 3 minutes.
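
The gates in steps 1 and 3 are easy to encode as a single pass/fail function the CI job calls before promotion. The thresholds mirror the ones above; the metrics dict keys are illustrative — use whatever your eval job (e.g. lm-evaluation-harness) actually emits:

```python
# Promotion gate from the pipeline above: block the rollout if
# quality or latency regresses past the configured thresholds.

def promotion_gate(candidate: dict, baseline: dict,
                   mmlu_drop_pct: float = 3.0,
                   p95_ms_limit: float = 300.0) -> tuple[bool, list]:
    failures = []
    floor = baseline["mmlu"] * (1 - mmlu_drop_pct / 100)  # baseline - 3%
    if candidate["mmlu"] < floor:
        failures.append(f"MMLU {candidate['mmlu']:.1f} < floor {floor:.1f}")
    if candidate["p95_ms"] > p95_ms_limit:
        failures.append(f"p95 {candidate['p95_ms']:.0f}ms > {p95_ms_limit:.0f}ms")
    return (not failures, failures)

ok, why = promotion_gate({"mmlu": 66.1, "p95_ms": 240},
                         {"mmlu": 67.8, "p95_ms": 220})
print("PROMOTE" if ok else f"BLOCK: {why}")
```

Wire the same function into the canary stage with live p95 from Prometheus, and the rollback decision in step 5 becomes automatic rather than a judgment call at 2 AM.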

This pipeline — Evaluate → Stage → Canary → Full → Rollback — is what separates teams that run LLMs in production from teams that run LLMs in demos. We build this entire pipeline hands-on in the gheWARE Agentic AI Workshop, including live labs on vLLM deployment, KEDA configuration, and Argo CD GitOps for AI workloads.

Frequently Asked Questions

Can I run a quantized LLM on Kubernetes without a GPU?

Yes — highly quantized models (INT4/GGUF) can run on CPU-only nodes for low-throughput use cases. For production-grade latency (under 200ms), a GPU node is still recommended, but a single NVIDIA T4 (available on most cloud providers for ~$0.35/hr) running an INT4-quantized Mistral 7B can handle 50–80 concurrent requests — a fraction of the cost of running full-precision models. BitNet 1.58-bit models (like BitNet b1.58 3B from Microsoft) can run entirely on CPU clusters and are viable for enterprise classification and summarization workloads.

What is the difference between INT8 and INT4 quantization for LLMs?

INT8 quantization converts model weights from 16-bit floating point (FP16) to 8-bit integers, roughly halving memory with less than 1% quality degradation on most benchmarks. INT4 goes further — 4-bit integers — cutting memory by 4x vs FP16 at a 2–5% quality drop. For most enterprise use cases (summarization, extraction, classification), INT4 is acceptable. For RAG-based reasoning or code generation, INT8 is the safer default.

Which is better for production Kubernetes LLM inference — vLLM or TGI?

vLLM is generally the production choice in 2026 for high-throughput enterprise deployments. Its PagedAttention mechanism handles concurrent requests far more efficiently than standard batching, and it integrates cleanly with Helm charts and Kubernetes HPA/KEDA autoscaling. Hugging Face TGI is excellent for fast prototyping and supports more model formats natively. For enterprise at-scale (50+ concurrent users), vLLM wins on throughput; TGI wins on simplicity. KServe is a third option worth evaluating for teams already running KFServing or Kubeflow.

How do I autoscale LLM inference pods on Kubernetes?

Use KEDA (Kubernetes Event-Driven Autoscaling) with a Prometheus scaler targeting GPU utilization or queue depth metrics. Standard HPA can only scale on CPU/memory, which is a poor proxy for LLM inference load. A typical KEDA ScaledObject sets minReplicas: 1, maxReplicas: 8, and triggers on a custom metric like vllm:num_requests_waiting > 5. This gives you zero-cost idle state with rapid scale-out during traffic spikes. Set cooldownPeriod: 120 minimum to avoid GPU pod churn.

What is BitNet and why is it trending for Kubernetes LLM deployment?

BitNet (Microsoft Research, 2025) takes quantization to its extreme — 1.58-bit weights where each parameter is -1, 0, or +1. BitNet b1.58 models can run on CPU clusters without any GPU at all, achieving performance competitive with FP16 models at their scale. For Kubernetes deployments, this means you can run LLM inference on standard compute nodes, eliminating GPU node pools entirely for certain workloads. The BitNet.cpp implementation supports efficient inference on standard x86 and ARM64 architectures — highly relevant for teams running ARM64 Kubernetes nodes (AWS Graviton, GKE Tau T2A, Azure Cobalt).

Conclusion: The Hardware Excuse Is Over

In 2024, "we can't afford the GPU cluster" was a legitimate blocker for enterprise LLM deployments. In 2026, it's an infrastructure knowledge gap.

The combination of INT4/GPTQ quantization (4x memory reduction), vLLM's PagedAttention (23x throughput improvement), KEDA autoscaling (zero cost at idle), and GitOps model management (safe, zero-downtime rollouts) has fundamentally changed the economics of LLM inference on Kubernetes.

A single NVIDIA A100 node running an INT4-quantized Llama 3 70B via vLLM can serve 50+ concurrent enterprise users at under 200ms p95 latency — a workload that would have required 4–6 GPU nodes at full precision 18 months ago.

Your existing Kubernetes cluster is already the platform. Quantization is the key that unlocks it.

Build This Architecture Hands-On

Learn vLLM deployment, KEDA autoscaling, GitOps model pipelines, and quantization strategies in the gheWARE Agentic AI Workshop — 5 days, 119 hands-on labs, rated 4.91/5.0 by Oracle engineers.

