In Q1 2026, I talked to the engineering leadership of a large financial services firm running 14 GenAI applications in production. Every single one was deployed on their managed cloud platform — not Kubernetes. The reason? "Our K8s team said it wasn't ready for AI."
They were paying $2.3 million per year for managed inference APIs when their GPU-capable Kubernetes clusters sat at 12% utilization. That number stuck with me.
The truth is: Kubernetes is ready for GenAI in 2026. The CNCF ecosystem has matured dramatically. KServe hit GA. vLLM is battle-tested at OpenAI, Microsoft, and hundreds of enterprises. KEDA's GPU scaler is production-proven. Dynamic Resource Allocation (DRA) landed in Kubernetes 1.32 as beta. The tooling is there — what's missing is the architectural playbook.
This guide gives you that playbook. Drawing on 25+ years of delivering enterprise infrastructure at JPMorgan, Deutsche Bank, and Morgan Stanley, here are the Kubernetes GenAI architecture patterns that actually work at enterprise scale.
Why Your Current Kubernetes Cluster Can't Handle GenAI (Yet)
Before the patterns, let's be precise about the problem. A standard Kubernetes cluster fails GenAI workloads in four specific ways:
1. GPU Scheduling is Primitive
Standard Kubernetes treats GPUs as indivisible resources. You either get a whole GPU or none. An A100 80GB serving a 7B parameter model at 40% utilization wastes 60% of a $10,000 card. At enterprise scale, this waste is existential.
2. LLM Serving is Memory-Pathological
LLMs require contiguous GPU memory for the KV cache. Naive inference frameworks fragment this memory across requests, causing 2-3x more GPU memory usage than necessary. Without PagedAttention (vLLM's key innovation), you serve 30% of the requests a properly configured cluster can handle.
3. HPA Can't Scale on What Matters
The Horizontal Pod Autoscaler scales on CPU and memory. LLM inference pressure shows up as GPU utilization, inference queue depth, and time-to-first-token latency. HPA is blind to all three. The result: your inference deployment either over-provisions (expensive) or under-provisions (SLA breaches).
4. No Native Model Lifecycle Management
Kubernetes knows how to manage container images. It knows nothing about model weights, version canaries, A/B testing between model variants, or rolling updates of 70GB model checkpoints. Without KServe, every model deployment is a bespoke engineering effort.
The good news: all four problems have production-grade solutions in 2026. Let's build the stack.
Building the GenAI Infrastructure Layer: GPU Nodes, DRA, and Device Plugins
The foundation of any Kubernetes GenAI architecture is a properly configured GPU infrastructure layer. Here's the minimum viable configuration for an enterprise production cluster.
Step 1: GPU Node Pool Setup
Dedicate separate node pools for GenAI workloads. This enables independent scaling, cost tracking, and security isolation:
```yaml
# GPU node pool with taints to prevent non-AI workloads.
# Illustrative view of a GPU node's labels and taints — in practice these are
# configured via your cloud provider's node pool settings, not applied directly.
apiVersion: v1
kind: Node
metadata:
  labels:
    node.kubernetes.io/instance-type: a3-highgpu-8g
    cloud.google.com/gke-accelerator: nvidia-h100-80gb
    workload-type: genai-inference
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
---
# Namespace-level GPU quota (per team)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: genai-team-quota
  namespace: team-alpha-genai
spec:
  hard:
    # Extended resources such as nvidia.com/gpu support only the requests.*
    # quota prefix (requests always equal limits for extended resources)
    requests.nvidia.com/gpu: "4"
    requests.memory: 320Gi
    limits.memory: 640Gi
```
Step 2: NVIDIA Device Plugin + DRA (Kubernetes 1.32+)
Install the NVIDIA device plugin for basic GPU scheduling, then enable Dynamic Resource Allocation (DRA) for fractional GPU sharing. On A100/H100 nodes, enable Multi-Instance GPU (MIG) to slice one physical GPU into up to 7 independent instances:
```shell
# Install NVIDIA device plugin via Helm
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set migStrategy=mixed \
  --set deviceListStrategy=envvar \
  --version 0.16.1

# Enable MIG mode on GPU 0, then partition with nvidia-mig-parted
# nvidia-smi -i 0 -mig 1
# nvidia-mig-parted apply -f mig-config.yaml -c all-1g.10gb
# Provides 7x 10GB MIG slices from one 80GB H100
```
With MIG enabled, a single H100 80GB can serve seven independent small-model instances simultaneously, a 7x density improvement over whole-GPU allocation. Note that a 7B parameter model only fits in a 10GB slice when quantized to 4-bit (roughly 4GB of weights); FP16 weights alone would need about 14GB.
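Under the mixed MIG strategy configured above, each slice is advertised to the scheduler as its own extended resource (e.g. `nvidia.com/mig-1g.10gb`). A minimal sketch of a pod requesting a single slice — the pod name, namespace, and image are illustrative:

```yaml
# Sketch: pod requesting one 1g.10gb MIG slice instead of a whole GPU
# (resource name assumes the device plugin's "mixed" MIG strategy)
apiVersion: v1
kind: Pod
metadata:
  name: phi3-mini-inference
  namespace: team-alpha-genai
spec:
  containers:
    - name: inference
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/mig-1g.10gb: "1"  # one MIG slice, not a whole GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```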
The 5 Core Kubernetes GenAI Architecture Patterns for 2026
Enterprise GenAI applications on Kubernetes fall into five recurring architectural patterns. Understanding which pattern fits your use case determines your entire infrastructure stack:
Pattern 1: Model Gateway (API-First Inference)
Use case: Your application teams need unified access to multiple LLMs (internal fine-tuned + external APIs) behind a single, rate-limited, observable endpoint.
Stack: LiteLLM Gateway → KServe → vLLM (local models) + direct passthrough (OpenAI/Anthropic/Gemini).
Why it works: LiteLLM provides OpenAI-compatible routing with fallback logic, cost tracking per team, and model-specific rate limits. Application teams write once and route anywhere.
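One way this routing can be expressed in a LiteLLM proxy config — a sketch only; the model aliases, service DNS names, and fallback pairing are illustrative assumptions, not a drop-in config:

```yaml
# Sketch: LiteLLM proxy routing internal vLLM + external API models
model_list:
  - model_name: chat-default            # alias application teams request
    litellm_params:
      model: hosted_vllm/llama3-70b     # internal vLLM-served model
      api_base: http://llama3-70b-instruct-predictor.model-serving.svc.cluster.local/v1
  - model_name: chat-fallback
    litellm_params:
      model: openai/gpt-4o              # external passthrough
      api_key: os.environ/OPENAI_API_KEY
router_settings:
  fallbacks:
    - chat-default: ["chat-fallback"]   # fail over if the internal model errors
```

Application teams call the gateway's OpenAI-compatible endpoint with `model: chat-default` and never need to know which backend served the request.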
Pattern 2: RAG Pipeline (Retrieval-Augmented Generation)
Use case: Enterprise knowledge retrieval, document Q&A, internal search, customer-facing support bots grounded in private data.
Stack: Ingestion Pipeline (Kubernetes CronJob) → pgvector/Weaviate → LangChain/LlamaIndex → LLM Serving (KServe). See our deep-dive: How to Build a Production RAG Pipeline on Kubernetes.
Key consideration: The embedding model (typically sentence-transformers or text-embedding-3-large) runs separately from the generation model. Embed with CPU pods; generate with GPU pods. Don't co-locate them.
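The CPU/GPU split above falls out naturally from the GPU node taints: an embedding Deployment that simply omits the `nvidia.com/gpu` toleration can never land on GPU nodes. A sketch, with illustrative names and image:

```yaml
# Sketch: embedding service pinned to CPU nodes (names/image are illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-service
  namespace: team-alpha-genai
spec:
  replicas: 3
  selector:
    matchLabels:
      app: embedding-service
  template:
    metadata:
      labels:
        app: embedding-service
    spec:
      # No nvidia.com/gpu toleration: the GPU node taint keeps these
      # pods on CPU nodes automatically
      containers:
        - name: embedder
          image: ghcr.io/myorg/sentence-embedder:v1  # illustrative
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
```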
Pattern 3: Agent Executor (Multi-Step AI Workflows)
Use case: Autonomous agents that invoke tools, execute code, query APIs, and iterate across multiple LLM calls to complete complex tasks.
Stack: Agent Orchestrator Pod → LLM Gateway → Tool Execution Sandbox (separate namespace with strict RBAC) → State Store (Redis/PostgreSQL).
Critical addition: Agent workloads are unpredictable in duration and resource consumption. Use Kubernetes Job objects with activeDeadlineSeconds hard limits, resource limits on sandbox pods, and OPA policies preventing agent-initiated privilege escalation.
```yaml
# Agent execution job with safety constraints
apiVersion: batch/v1
kind: Job
metadata:
  name: agent-task-abc123
  namespace: agent-sandbox
spec:
  activeDeadlineSeconds: 300  # 5-minute hard timeout
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: agent
          image: ghcr.io/myorg/agent-runner:v1.2
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
          env:
            - name: MAX_ITERATIONS
              value: "20"
      restartPolicy: Never
```
Pattern 4: Batch Inference (Offline Processing)
Use case: Document classification, bulk text transformation, dataset annotation, offline report generation.
Stack: Kubernetes CronJob or Argo Workflows → vLLM Batch API → object storage output (S3/GCS).
Cost win: Batch inference on Spot/Preemptible GPU nodes is 60-80% cheaper than On-Demand. Use Karpenter to provision Spot GPU nodes only when batch jobs are queued, and deprovision immediately after completion.
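A sketch of how this pattern might look as a CronJob that deliberately targets Spot capacity so Karpenter provisions a Spot GPU node only for the run — the schedule, names, and image are illustrative:

```yaml
# Sketch: nightly batch inference CronJob targeting Spot GPU capacity
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-doc-classification
  namespace: team-alpha-genai
spec:
  schedule: "0 2 * * *"  # 02:00 daily
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            karpenter.sh/capacity-type: spot  # steers Karpenter to Spot nodes
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: batch-inference
              image: ghcr.io/myorg/batch-runner:v1  # illustrative
              resources:
                limits:
                  nvidia.com/gpu: "1"
          restartPolicy: OnFailure
```

When the Job completes and the node sits empty, Karpenter's consolidation deprovisions it, so you pay for the GPU only for the duration of the batch window.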
Pattern 5: Streaming Inference (Real-Time Interactive)
Use case: Chatbots, coding assistants, real-time content generation where users see tokens streaming as they're generated.
Stack: WebSocket/SSE endpoint (FastAPI) → KServe InferenceService (streaming enabled) → vLLM with streaming=True.
Key difference: Streaming connections are long-lived. Configure your Ingress/Gateway with appropriate timeout values (300s minimum) and disable connection draining during model rollouts to avoid mid-stream disconnections.
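With the NGINX Ingress controller, for example, the timeout and buffering behavior can be set per-Ingress via annotations — a sketch with illustrative host and service names:

```yaml
# Sketch: NGINX Ingress tuned for long-lived streaming connections
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chat-streaming
  namespace: team-alpha-genai
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"  # flush tokens immediately
spec:
  ingressClassName: nginx
  rules:
    - host: chat.example.internal   # illustrative
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: chat-frontend
                port:
                  number: 8080
```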
Model Serving at Scale: KServe + vLLM Production Configuration
KServe is the CNCF standard for model serving on Kubernetes, and vLLM is the performance engine behind it. Here's the production configuration that handles 10,000+ daily inference requests per model:
```yaml
# KServe InferenceService with vLLM runtime
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b-instruct
  namespace: model-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4
    scaleTarget: 30  # avg concurrent requests per replica
    scaleMetric: concurrency
    model:
      modelFormat:
        name: huggingface
      storageUri: gs://my-models/llama3-70b-instruct-awq
      runtime: vllm-runtime
      resources:
        limits:
          nvidia.com/gpu: "2"
          memory: 160Gi
        requests:
          nvidia.com/gpu: "2"
          memory: 160Gi
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```
vLLM Performance Tuning for Production
Three vLLM settings that make the largest difference in production throughput:
1. AWQ Quantization: Run LLaMA 3 70B with AWQ 4-bit quantization. GPU memory footprint drops from 140GB to 38GB. You go from needing 2x H100 to fitting on a single H100 80GB. Throughput loss is typically less than 3%.
2. Continuous Batching: vLLM batches incoming requests dynamically rather than waiting for a fixed batch window. At peak load (100+ concurrent users), this alone doubles effective throughput compared to static batching frameworks.
3. KV Cache Sizing: Keep --gpu-memory-utilization at its default of 0.90 in your vLLM runtime args. This lets vLLM use 90% of GPU VRAM for model weights plus KV cache, maximizing the number of concurrent sequences. Many enterprise deployments mistakenly drop it lower "to be safe," cutting throughput by as much as 40%.
```yaml
# vLLM runtime args in KServe ServingRuntime
args:
  - --model
  - /mnt/models
  - --dtype
  - float16
  - --quantization
  - awq
  - --gpu-memory-utilization
  - "0.90"
  - --max-num-seqs
  - "256"
  - --enable-chunked-prefill
  - --served-model-name
  - llama3-70b
```
Autoscaling and Cost Control: KEDA + Karpenter for GPU Workloads
GPU nodes are expensive. An H100 On-Demand instance costs $8-12/hour. A poorly autoscaled GenAI cluster can run up $50,000/month in unnecessary GPU idle time. Here's the two-layer autoscaling strategy that solves this:
Layer 1: KEDA for Pod-Level Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) scales your inference deployments based on metrics that actually matter for LLMs — GPU utilization, queue depth, and p95 latency:
```yaml
# KEDA ScaledObject for LLM inference deployment
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama3-inference-scaler
  namespace: model-serving  # must live in the same namespace as its target
spec:
  scaleTargetRef:
    name: llama3-70b-instruct-predictor
  minReplicaCount: 1
  maxReplicaCount: 6
  pollingInterval: 30
  cooldownPeriod: 300  # 5 min cooldown (GPU pods are slow to terminate)
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm_gpu_utilization_percent
        threshold: "75"
        query: avg(vllm:gpu_utilization{namespace="model-serving"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm_request_queue_depth
        threshold: "50"
        query: sum(vllm:waiting_requests{namespace="model-serving"})
```
Layer 2: Karpenter for Node-Level Cost Control
Karpenter provisions GPU nodes just-in-time when KEDA needs new pods, and deprovisions them when idle. Configure it to prefer Spot instances for non-critical workloads and use consolidation to bin-pack GPU pods onto fewer nodes:
```yaml
# Karpenter NodePool for GenAI batch workloads (Spot)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: genai-batch-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # Spot preferred
        - key: node.kubernetes.io/instance-type
          operator: In
          # Use the GPU instance types your cloud offers (AWS shown;
          # substitute equivalents for your provider)
          values: ["p4d.24xlarge", "p3.16xlarge"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 10m
  limits:
    cpu: "200"
    memory: 2000Gi
    nvidia.com/gpu: "32"
```
Real-world result: Combining KEDA + Karpenter + AWQ quantization, one of our enterprise clients reduced their GenAI infrastructure bill from $180,000/month to $67,000/month — a 63% reduction — while handling 40% more inference volume than before.
Multi-Tenant GenAI Security and Isolation
Enterprise Kubernetes clusters host multiple teams, business units, and applications. When those applications include GenAI workloads processing sensitive data, isolation is not optional. Here are the four mandatory layers:
Layer 1: Namespace RBAC Isolation
Each team's GenAI workloads live in dedicated namespaces with strict RBAC. No cross-namespace service account permissions. Model serving endpoints are internal ClusterIP services, not NodePorts or LoadBalancers accessible outside the namespace.
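A sketch of a namespace-scoped Role/RoleBinding pair implementing this — the role name and the IdP group it binds are illustrative assumptions:

```yaml
# Sketch: namespace-scoped Role for a GenAI team (no cross-namespace rights)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: genai-developer
  namespace: team-alpha-genai
rules:
  - apiGroups: ["serving.kserve.io"]
    resources: ["inferenceservices"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services"]
    verbs: ["get", "list", "watch"]  # read-only on runtime objects
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: genai-developer-binding
  namespace: team-alpha-genai
subjects:
  - kind: Group
    name: team-alpha  # illustrative; map to your IdP group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: genai-developer
  apiGroup: rbac.authorization.k8s.io
```

Because the Role is namespaced, nothing here grants visibility into any other team's namespace, which is the point.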
Layer 2: Network Policies
Block all cross-namespace traffic by default, then explicitly allow only what's needed (e.g., the monitoring namespace can scrape metrics from all inference namespaces):
```yaml
# Default deny all ingress/egress in GenAI namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha-genai
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow ingress only from the API gateway namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-gateway
  namespace: team-alpha-genai
spec:
  podSelector:
    matchLabels:
      app: inference-server
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # automatic, immutable namespace label (Kubernetes 1.21+)
              kubernetes.io/metadata.name: api-gateway
```
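To satisfy the monitoring exception mentioned above, a companion policy can open just the metrics port to the monitoring namespace — the namespace name and port are illustrative (vLLM's OpenAI-compatible server exposes /metrics on its serving port, commonly 8000):

```yaml
# Sketch: allow Prometheus in the monitoring namespace to scrape metrics
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring-scrape
  namespace: team-alpha-genai
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8000  # metrics port; adjust to your runtime
```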
Layer 3: Pod Security Standards (Restricted)
Apply the restricted Pod Security Standard to all GenAI namespaces. This prevents containers from running as root, using host namespaces, or mounting sensitive host paths — all common attack vectors in compromised AI supply chains.
```shell
# Label namespace for Pod Security Standards
kubectl label namespace team-alpha-genai \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=v1.32 \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted
```
Layer 4: OPA/Gatekeeper Policy Enforcement
OPA Gatekeeper enforces cluster-wide policies that namespace admins cannot override. Key GenAI-specific policies to enforce: (1) GPU requests must equal GPU limits (prevent noisy neighbor); (2) Model storage must be read-only mounts; (3) Model serving containers must declare CPU/memory limits; (4) No container images from unapproved registries in inference namespaces.
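As one concrete instance of policy (4), a Gatekeeper ConstraintTemplate plus Constraint restricting inference namespaces to approved registries might look like this — a sketch; the template name, namespaces, and registry prefixes are illustrative:

```yaml
# Sketch: restrict inference namespaces to approved image registries
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sapprovedregistries
spec:
  crd:
    spec:
      names:
        kind: K8sApprovedRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sapprovedregistries

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not approved(container.image)
          msg := sprintf("image %v is not from an approved registry", [container.image])
        }

        approved(image) {
          startswith(image, input.parameters.registries[_])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sApprovedRegistries
metadata:
  name: approved-registries-genai
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["model-serving", "team-alpha-genai"]  # illustrative
  parameters:
    registries: ["ghcr.io/myorg/", "us-docker.pkg.dev/myorg/"]
```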
For an in-depth treatment of AI security on Kubernetes, see our earlier post: Zero-Trust Security for AI Agents: Enterprise Implementation Guide 2026.
30-Day Production Migration Playbook
Ready to migrate your GenAI workloads to Kubernetes? Here's the phased approach we use at gheWARE to take enterprise teams from zero to production in 30 days:
Week 1: Foundation (Days 1-7)
- Day 1-2: Audit existing cluster (K8s version 1.30+), GPU node inventory, current networking/security posture
- Day 3: Install NVIDIA device plugin, verify GPU scheduling with test pod
- Day 4: Install KServe via Helm, deploy first test InferenceService (small model: phi-3-mini)
- Day 5-6: Configure namespace RBAC and ResourceQuota for first GenAI team
- Day 7: Deploy KEDA, validate Prometheus metrics scraping from vLLM runtime
Week 2: Core Patterns (Days 8-14)
- Day 8-9: Deploy first production model (quantized LLaMA 3 8B) via KServe + vLLM
- Day 10: Configure LiteLLM gateway with routing to internal + external models
- Day 11-12: Build RAG pipeline: vector DB deployment + embedding pipeline CronJob
- Day 13-14: KEDA autoscaling tuning based on load test results
Week 3: Scale and Security (Days 15-21)
- Day 15-16: Network Policies for all GenAI namespaces
- Day 17-18: OPA Gatekeeper policies (GPU limits, approved registries)
- Day 19-20: OpenTelemetry instrumentation for inference observability
- Day 21: Karpenter for GPU node autoscaling, Spot instance configuration
Week 4: Optimize and Operationalize (Days 22-30)
- Day 22-23: Model canary deployment workflow (KServe canary split)
- Day 24-25: GitOps for model deployments (ArgoCD + InferenceService manifests)
- Day 26-27: Runbooks for GPU OOM, pod scheduling failures, model degradation
- Day 28-29: Load testing at 2x expected production volume
- Day 30: Team handoff, documentation, go-live sign-off
📊 The Business Case for GenAI on Kubernetes
One of our recent enterprise clients (500-person engineering org, 12 active GenAI projects) ran this analysis:
- Before: $2.3M/year in managed API costs (OpenAI + Anthropic + Azure OpenAI)
- After K8s migration: $670K/year (GPU nodes + open-weight models + ops overhead)
- Annual saving: $1.63M — 71% reduction
- Migration cost: $180K (3-month project, 4 engineers)
- ROI payback period: 1.3 months
This is why Kubernetes GenAI architecture is the highest-ROI infrastructure investment for enterprise teams in 2026.
Frequently Asked Questions
Can I run GenAI workloads on existing Kubernetes infrastructure?
Yes, but standard Kubernetes clusters require specific upgrades for GenAI workloads: GPU node pools with the NVIDIA device plugin, Dynamic Resource Allocation (DRA) for fractional GPU sharing, KEDA for AI-aware autoscaling, and model serving frameworks like KServe or vLLM. A standard Kubernetes setup without these additions will result in GPU underutilization, OOM crashes, and unpredictable latency under load. The upgrades are well-documented and take 1-2 weeks for an experienced team to implement.
What is the best model serving framework for Kubernetes GenAI apps in 2026?
For enterprise production workloads in 2026, vLLM + KServe is the gold standard. vLLM provides PagedAttention for 2-4x higher GPU throughput and continuous batching for lower latency. KServe provides the Kubernetes-native serving layer with autoscaling, canary deployments, explainability, and multi-model serving. For teams heavily invested in Hugging Face, TGI (Text Generation Inference) is a strong alternative that integrates well with KServe.
How do I handle GPU scheduling for multiple GenAI teams in the same Kubernetes cluster?
Use Kubernetes Dynamic Resource Allocation (DRA) combined with namespace-level ResourceQuotas. DRA allows fractional GPU sharing and GPU slicing (MIG on NVIDIA A100/H100), so multiple teams can share expensive GPU nodes efficiently. Set ResourceQuota per team namespace for GPU limits, use PriorityClasses to handle burst requests from critical workloads, and implement Karpenter for cost-aware GPU node provisioning that spins up new nodes only when existing capacity is fully allocated.
What autoscaling strategy works best for LLM inference on Kubernetes?
KEDA (Kubernetes Event-Driven Autoscaling) is the right choice for LLM inference because standard HPA cannot scale on custom metrics like GPU utilization or inference queue depth. Use KEDA with a Prometheus scaler targeting GPU utilization above 70% or request queue length above 50. Combine with Karpenter for node-level autoscaling to avoid cold start delays when new GPU nodes are needed. Set cooldownPeriod to 300s or longer — GPU pods have slow termination and you don't want thrashing.
How do I secure multi-tenant GenAI workloads on Kubernetes?
Multi-tenant GenAI security on Kubernetes requires four layers: (1) Namespace isolation with strict RBAC so teams cannot access each other's model endpoints or secrets; (2) Network Policies blocking cross-namespace model traffic; (3) Pod Security Standards (restricted profile) preventing privilege escalation; (4) OPA/Gatekeeper policies enforcing resource limits on GPU requests. For highest security, run each tenant's inference workloads in separate node pools with taints and tolerations so workloads are physically isolated at the node level.
The Bottom Line: Kubernetes Is Your GenAI Platform
The $2.3M/year company I mentioned at the start of this post? They're now in week 3 of their Kubernetes GenAI migration. First model (LLaMA 3 70B, AWQ quantized) is serving 2,400 daily inference requests at 340ms p95 latency. Their GPU cost for that workload: $2,100/month. Their previous managed API cost for the same workload: $18,400/month.
Kubernetes for GenAI is not a moonshot engineering project anymore. The five architecture patterns in this guide — Model Gateway, RAG Pipeline, Agent Executor, Batch Inference, and Streaming Inference — cover 95% of enterprise GenAI use cases. The tooling is mature. The economics are compelling. The only thing standing between your team and a production GenAI platform is the architectural knowledge to build it right.
The Kubernetes GenAI architecture patterns outlined here represent years of enterprise production experience distilled into a repeatable playbook. Start with the foundation: GPU nodes + KServe + vLLM. Add KEDA for autoscaling. Harden with network policies and OPA. Then scale to all five patterns as your GenAI portfolio grows.
Your cloud spend will thank you.