
Introduction

Kubernetes from First Principles

Why It Works the Way It Does


Most Kubernetes resources teach you how to write YAML. This book teaches you why the YAML looks the way it does. 45 chapters — each traced from the original design problem through the ecosystem’s evolution to today’s best practice.

About

This is an eight-part book that takes you from “why does Kubernetes exist?” to “I’m running GPU-accelerated ML workloads in production across multiple clusters.” It is written for engineers who understand Linux, networking, and how systems work — and want to understand Kubernetes deeply, not just follow tutorials.


Part 1: First Principles

Why Kubernetes was designed the way it was.

  1. The Road to Kubernetes — From bare metal to Borg to Kubernetes
  2. The Problems Kubernetes Solves — Bin packing, service discovery, self-healing, and the desired state model
  3. Architecture from First Principles — etcd, API server, controllers, scheduler, kubelet, kube-proxy
  4. The API Model — Resources, specs, status, reconciliation loops, labels, and CRDs
  5. The Networking Model — Flat networking, CNI, Services, Ingress, and Network Policies
  6. The Ecosystem — Operators, Helm, service meshes, and Kubernetes as a platform for platforms
  7. Key Design Principles — Declarative over imperative, control loops, level-triggered vs edge-triggered
  8. Why Kubernetes Won — The competitive landscape and the deeper architectural lesson
  9. References and Further Reading — Foundational papers, design documents, talks, and books

Part 2: The Tooling Ecosystem — History and Evolution

How the tools around Kubernetes evolved, and why they look the way they do today.

  1. The Container Runtime Wars — Docker to containerd to CRI-O: why Docker was deprecated
  2. Bootstrapping a Cluster — From kube-up.sh to kubeadm: how cluster setup evolved
  3. Package Management and GitOps — Helm v2/v3, Kustomize, ArgoCD, Flux
  4. The Networking Stack Evolution — Flannel to Calico to Cilium: how eBPF changed everything
  5. Kubernetes Version History — A guided tour of key releases and what they introduced

Part 3: From Theory to Practice

Connecting the principles from Part 1 to real-world usage.

  1. Setting Up a Cluster from Scratch — What kubeadm actually does: TLS bootstrapping, static pods
  2. Managed Kubernetes: EKS, GKE, and AKS — Cloud provider comparison and how to choose
  3. Cloud Networking and Storage — VPC CNI, CSI drivers, and how K8s maps to cloud infrastructure
  4. Your First Workloads — Hands-on: Deployments, Services, ConfigMaps, rolling updates
  5. Debugging Kubernetes — The kubectl toolkit and diagnosing common failures
  6. Production Readiness — Monitoring, logging, security basics, and backup

Part 4: Stateful Workloads

Running real applications with persistent state.

  1. StatefulSets Deep Dive — Stable identities, ordered operations, and headless Services
  2. Databases on Kubernetes — When to run databases on K8s, operators, and the trade-offs
  3. Persistent Storage Patterns — volumeClaimTemplates, reclaim policies, backup, and resize
  4. Jobs and CronJobs — Batch processing, indexed completions, and scheduling patterns

Part 5: Security Deep Dive

Understanding and implementing Kubernetes security from the ground up.

  1. RBAC from First Principles — Roles, bindings, ServiceAccounts, and multi-tenant design
  2. Network Policies — Default deny, namespace isolation, and egress control
  3. Supply Chain Security — Image signing, admission policies, scanning, and SLSA
  4. Secrets Management — Encryption at rest, Vault, External Secrets Operator, and best practices
  5. Pod Security Standards — Privileged, Baseline, Restricted profiles and enforcement

Part 6: Scaling and Performance

Making Kubernetes handle real-world load.

  1. Horizontal Pod Autoscaler — The scaling algorithm, custom metrics, KEDA, and tuning
  2. Vertical Pod Autoscaler and Right-Sizing — Recommendation mode, in-place resize, and resource tuning
  3. Node Scaling: Cluster Autoscaler and Karpenter — How nodes scale, Karpenter’s architecture, and consolidation
  4. Resource Tuning Deep Dive — CPU throttling, memory cgroups, NUMA, and overcommitment

Part 7: Multi-Cluster and Platform Engineering

Operating Kubernetes at organizational scale.

  1. Multi-Cluster Strategies — Federation, GitOps-driven, service mesh, and Cluster API
  2. Building Internal Developer Platforms — Backstage, the platform stack, and reducing cognitive load
  3. Crossplane: Infrastructure as CRDs — Managing cloud resources through Kubernetes
  4. Multi-Tenancy — Namespace isolation, virtual clusters, and tenant boundaries

Part 8: Advanced Topics

Deep dives for infrastructure engineers.

  1. Writing Controllers and Operators — controller-runtime, Kubebuilder, and the Reconcile pattern
  2. The Kubernetes API Internals — Aggregation, admission webhooks, API priority and fairness
  3. etcd Operations — Backup, restore, compaction, monitoring, and disaster recovery
  4. GPU Workloads and AI/ML on Kubernetes — Device plugins, DRA, GPU sharing, distributed training
  5. Running LLMs on Kubernetes — vLLM, TGI, KServe, multi-node inference, and model serving
  6. Disaster Recovery — Cluster backup, etcd snapshots, multi-region strategies
  7. Cost Optimization — Right-sizing, spot instances, Kubecost, and chargeback
  8. Observability with OpenTelemetry — Metrics, logs, traces, and the OTel Collector

How to Read This

Part 1 is the intellectual foundation. Read it first.

Part 2 fills in the historical context of the tooling. Read it after Part 1.

Part 3 is hands-on. Reference it as you work through your own cluster.

Parts 4-5 cover stateful workloads and security — essential for running real production systems.

Part 6 covers scaling — read it when your workloads need to handle real load.

Part 7 is for when you’re operating multiple clusters or building a platform team.

Part 8 is deep reference material. Read chapters as needed. The GPU/ML chapters (41-42) are especially relevant for AI infrastructure teams.

If you only have time for one chapter from each part:

Appendices

Companion Material

  • install.sh — The bootstrap script we built to provision Kubernetes nodes on EC2
  • Colophon — How this book was made, the prompts used, and accuracy notes

Chapter 1: The Road to Kubernetes

---
config:
    flowchart:
        nodeSpacing: 15
        rankSpacing: 30
---
flowchart LR
    subgraph industry ["Industry Timeline"]
        bm["Bare Metal<br>1990s"] --> vm["Virtualization<br>VMware, Xen, KVM<br>2000s"]
        vm --> cloud["AWS EC2<br>Cloud era<br>2006"]
        cloud --> cm["Chef / Puppet<br>2009"]
        cm --> docker["Docker<br>2013"]
        docker --> k8s["Kubernetes<br>2014"]
        k8s --> cncf["CNCF + Managed K8s<br>2015+"]
    end

    subgraph google ["Inside Google"]
        borg["Borg<br>2003"] --> cg["cgroups +<br>namespaces<br>2006–08"]
        cg --> omega["Omega<br>2011"]
        omega --> k8sg["Kubernetes<br>2014"]
    end

    k8sg -.->|"open-sourced<br>as"| k8s

The Bare Metal Era: One Application, One Server

In the earliest days of server computing, one application ran on one physical server — simple, isolated, but catastrophically wasteful. Most servers ran at 5-15% average CPU utilization. Organizations maintained vast fleets of underutilized machines, each dedicated to a single workload, each requiring its own power, cooling, network connectivity, and physical maintenance.

The fundamental problem was resource fragmentation. You could not easily share a physical machine between two applications because there was no reliable mechanism to prevent one application from consuming all available CPU, memory, or disk I/O and starving the other. Operating system process isolation was insufficient: processes could interfere with each other through shared filesystems, port conflicts, library version conflicts, and resource exhaustion. The result was an era of enormous waste, where the primary cost driver was not compute but rather the operational overhead of managing vast numbers of barely-utilized machines.

The Virtualization Revolution: Abstracting Hardware

Virtualization, pioneered commercially by VMware in the late 1990s and later commoditized by Xen, KVM, and cloud providers like Amazon Web Services, represented the first fundamental shift. By inserting a hypervisor between the hardware and the operating system, virtualization allowed multiple isolated virtual machines to share a single physical host. Each VM got its own kernel, its own filesystem, its own network stack — complete isolation without dedicated hardware.

This solved the resource fragmentation problem at a macro level. You could now pack multiple workloads onto a single physical machine with strong isolation guarantees. Cloud computing emerged from this capability: Amazon Web Services launched EC2 in 2006, offering on-demand virtual machines that could be provisioned in minutes rather than the weeks required to procure and rack physical servers.

But virtualization introduced its own problems. Virtual machines were heavy: each carried a full operating system kernel, consuming hundreds of megabytes of RAM just for the OS overhead. Boot times were measured in minutes. VM images were large and slow to transfer. The hypervisor itself consumed resources. And while VMs solved the isolation problem, they did not solve the management problem. With hundreds or thousands of VMs, organizations still needed to answer fundamental questions: which workload runs where? How do you update an application across fifty VMs without downtime? How do you recover when a VM’s host machine fails? How do you ensure that a critical application always has enough resources?

The Configuration Management Interlude: Puppet, Chef, Ansible

The late 2000s and early 2010s saw the rise of configuration management tools — Puppet (2005), Chef (2009), Ansible (2012), SaltStack (2011). These tools addressed the management problem by allowing operators to describe the desired state of a server (which packages should be installed, which services should be running, which configuration files should be present) and then converge the actual state toward that desired state.

This was a crucial intellectual contribution that directly influenced Kubernetes: the desired state model. Instead of writing imperative scripts that said “install package X, then start service Y, then modify file Z,” configuration management tools let you declare “package X should be present, service Y should be running, file Z should contain these contents” and let the tool figure out how to get there. This declarative approach was more robust because it was idempotent — you could run the tool multiple times and get the same result, regardless of the starting state.

But configuration management operated at the wrong level of abstraction for the emerging world of containerized microservices. They could ensure that a particular server had the right software installed, but they could not easily reason about a distributed application that spanned dozens of servers, needed to be updated without downtime, and had to automatically recover from server failures. The unit of management was the machine, not the application.

The Container Revolution: Docker and the Shipping Container Metaphor

Containers were not new when Docker launched in 2013. The underlying Linux kernel features — cgroups (for resource limits) and namespaces (for isolation) — had accumulated in the mainline kernel through the 2000s (cgroups merged in v2.6.24 in 2008). Google had been using containers internally since at least 2004, running everything from web search to Gmail inside Linux containers managed by its Borg system. FreeBSD had jails since 2000. Solaris had zones since 2004.

What Docker did was make containers accessible. It provided a simple command-line interface, a standardized image format (the Dockerfile and layered filesystem), and a distribution mechanism (Docker Hub). For the first time, a developer could package an application and all its dependencies into a single artifact, push it to a registry, and run it identically on any Linux machine. The shipping container metaphor was apt: just as standardized shipping containers revolutionized global trade by providing a uniform interface between ships, trains, and trucks, Docker containers provided a uniform interface between development, testing, and production.

Containers had profound advantages over VMs for application deployment:

  • Lightweight: Containers shared the host kernel, eliminating the OS overhead of VMs. A container image might be tens of megabytes instead of gigabytes.
  • Fast startup: Containers started in milliseconds to seconds, not minutes.
  • Density: You could run dozens or hundreds of containers on a single host, compared to perhaps a dozen VMs.
  • Reproducibility: The container image was immutable. The same image ran identically everywhere.
  • Composability: Complex applications could be decomposed into multiple containers, each with a single responsibility.

But Docker, by itself, solved only the packaging and isolation problem. It told you nothing about how to run containers at scale across a fleet of machines. If you had one hundred machines and one thousand containers, Docker could not tell you which container should run on which machine, what to do when a machine failed, how to route network traffic to the right container, or how to update a running application without downtime. This was the orchestration problem.

Google’s Borg: The Secret Precursor

To understand why Kubernetes looks the way it does, you must understand Google’s Borg system. Described publicly in a landmark 2015 EuroSys paper (“Large-scale cluster management at Google with Borg” by Verma et al.), Borg had been running inside Google since at least 2003-2004. It managed virtually everything Google ran: web search, Gmail, YouTube, Maps, BigTable, MapReduce — hundreds of thousands of jobs across tens of thousands of machines in each of dozens of clusters.

Borg introduced several concepts that directly shaped Kubernetes:

1. The declarative job specification. In Borg, users did not tell the system to “start a process on machine X.” They declared a job specification: “I need 100 instances of this binary, each with 2 GB of RAM and 0.5 CPU cores, and they should be spread across failure domains.” Borg figured out where to place them, and if instances died, Borg automatically restarted them. This declarative model — describe what you want, not how to get it — became the philosophical foundation of Kubernetes.

2. Bin packing and resource management. Borg treated a cluster of machines as a single pool of resources. Its scheduler solved a variant of the bin packing problem: given a set of tasks with resource requirements and a set of machines with resource capacities, place tasks on machines to maximize utilization while respecting constraints (failure domain isolation, hardware requirements, etc.). Borg achieved remarkably high utilization — published figures suggest 60-70% average CPU utilization across Google’s fleet, compared to the 5-15% typical of enterprise data centers.

3. Service discovery via naming. Borg provided a built-in naming service (BNS) that allowed tasks to find each other by name rather than by IP address and port. This was essential in an environment where tasks were constantly being started, stopped, and moved between machines.

4. Allocs and resource reservations. Borg introduced the concept of “allocs” — reserved resources on a machine that could be filled with tasks. This concept directly inspired the Kubernetes Pod: a group of containers that share resources and are co-scheduled on the same machine.

Google’s Omega: The Research System

Borg was a production system, evolved over a decade, carrying enormous technical debt. In 2011-2013, Google built Omega as a research project to explore alternative cluster management architectures. Omega’s key contribution was its approach to scheduling: instead of Borg’s monolithic scheduler, Omega used optimistic concurrency control with a shared state model. Multiple schedulers could operate in parallel, each reading the full cluster state, making scheduling decisions, and then atomically committing those decisions. If two schedulers made conflicting decisions, one would detect the conflict and retry.

This shared-state, optimistic-concurrency approach influenced Kubernetes’ design in a critical way: it demonstrated that you could have multiple independent controllers operating on shared state, each making progress independently, with conflicts resolved through mechanisms like resource versioning. This is exactly the model that Kubernetes uses for its controllers.

The Birth of Kubernetes: 2014

Kubernetes was born in mid-2014 at Google, created by Joe Beda, Brendan Burns, and Craig McLuckie, with significant contributions from Brian Grant, Tim Hockin, and many others. It was explicitly designed to be an open-source, vendor-neutral system that embodied the lessons of Borg and Omega without carrying their technical debt.

The founders made a crucial strategic decision: rather than open-sourcing Borg, they built a clean-room redesign intended to run anywhere. This meant:

  • Using standard open-source components (etcd for storage, instead of Google’s proprietary Chubby/Colossus)
  • Supporting multiple container runtimes (not just Google’s internal runtime)
  • Designing for extensibility from the start (a pluggable API that later evolved into ThirdPartyResources and then CRDs, custom controllers, pluggable networking)
  • Making the system portable across cloud providers and on-premises environments

Kubernetes was donated to the newly formed Cloud Native Computing Foundation (CNCF) in 2015, ensuring its governance was independent of any single company. This was a masterstroke of ecosystem building: by making Kubernetes vendor-neutral, Google ensured that every major cloud provider (AWS, Azure, GCP) would offer managed Kubernetes services, creating a de facto standard that benefited everyone — including Google, whose cloud platform was smaller than AWS but whose expertise in running Kubernetes was unmatched.

The Borg Lineage: Kubernetes (Greek: steersman or pilot) was reportedly codenamed “Project Seven” — a reference to Seven of Nine from Star Trek, a Borg who became an individual. The name is a deliberate allusion to Kubernetes’ origins in Google’s Borg system, while signaling that it had been liberated from Google’s proprietary infrastructure to become something independent.

Common Mistakes and Misconceptions

  • “Kubernetes is just Docker orchestration.” Kubernetes is a general-purpose container orchestrator. Docker is one of many supported container runtimes (and modern K8s clusters typically use containerd, not Docker). Kubernetes predates most Docker-native orchestration features and was designed to be runtime-agnostic from the start.

  • “Google open-sourced Borg.” Borg is still an internal Google system and has never been released. Kubernetes is a clean-room redesign inspired by Borg’s lessons and design principles, not a port or fork of Borg. The two systems share no code.

  • “Kubernetes was the first container orchestrator.” Apache Mesos (with Marathon), CoreOS Fleet, and Docker Swarm all predated or emerged alongside Kubernetes. K8s won the orchestration wars not by being first but through superior API design, extensibility, and community governance.

For a visual overview of how Part 1’s concepts connect, see Appendix B: Mental Models.


Next: The Problems Kubernetes Solves

Chapter 2: The Problems Kubernetes Solves

Kubernetes exists because running containerized applications at scale presents a set of interrelated problems that no individual tool solves.

The Bin Packing Problem

At its most fundamental, Kubernetes solves a resource allocation problem. You have N machines, each with some amount of CPU, memory, and other resources. You have M workloads, each requiring some amount of those resources. How do you assign workloads to machines to maximize utilization while respecting constraints?

This is a variant of the NP-hard bin packing problem. In the general case, finding the optimal solution is computationally intractable. But good heuristics exist, and Kubernetes’ scheduler implements several of them. The key insight is that centralized, automated scheduling dramatically outperforms human scheduling. When humans decide where to place workloads, they tend to be:

  • Conservative — over-provisioning resources to avoid contention
  • Forgetful — leaving old workloads running on machines long after they should have been decommissioned
  • Inconsistent — different operators making different decisions for similar workloads

Borg’s experience demonstrated that automated bin packing could improve cluster utilization from the 5-15% typical of manually managed environments to 60-70%. Even modest improvements in utilization translate to enormous cost savings at scale: Google’s fleet comprises millions of machines, so a 1% improvement in utilization saves tens of thousands of servers.

The Service Discovery Problem

In a static world, you can configure your web frontend to talk to your database at a known IP address and port. But in a dynamic, containerized world, nothing has a stable address. Containers are created and destroyed constantly. They are moved between machines when hosts fail or when the scheduler finds a more efficient placement. The set of containers backing a particular service changes every time a deployment rolls out.

This creates the service discovery problem: how does one service find and communicate with another in an environment where addresses are constantly changing? There are several classic approaches:

  • DNS-based discovery: Register service instances in DNS, and clients look up the DNS name. Simple, but DNS has caching and TTL issues that make it slow to reflect changes.
  • Client-side registries: Services register themselves with a central registry (like ZooKeeper or Consul), and clients query the registry. Flexible, but requires every service to include registry client code.
  • Load balancer-based discovery: A load balancer sits in front of service instances, and clients talk to the load balancer’s stable address. Simple for clients, but adds latency and a single point of failure.

Kubernetes provides service discovery as a first-class primitive through its Service abstraction. A Service has a stable IP address (the ClusterIP) and DNS name. The Kubernetes control plane automatically updates the set of endpoints (pod IP addresses) behind a Service as pods come and go. This is implemented transparently by kube-proxy (or the CNI plugin), which programs iptables/IPVS rules on every node to redirect traffic addressed to a Service’s ClusterIP to one of its backing pods.
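
A minimal sketch of this primitive (names and ports are illustrative): a Service that selects pods by label; the control plane maintains the endpoint set, and kube-proxy programs the ClusterIP rules on every node.

apiVersion: v1
kind: Service
metadata:
  name: backend              # stable DNS name: backend.<namespace>.svc.cluster.local
spec:
  selector:
    app: backend             # endpoints = all Ready pods carrying this label
  ports:
    - port: 80               # stable port on the Service’s ClusterIP
      targetPort: 8080       # port the backing pods actually listen on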

The Rolling Deployment Problem

Updating a running application without downtime is one of the hardest problems in operations. The naive approach — stop all old instances, start all new instances — causes downtime proportional to the startup time of the new instances. In a microservices architecture with hundreds of services, even brief downtime cascades into widespread failures.

The rolling deployment strategy addresses this by incrementally replacing old instances with new ones: start one new instance, wait for it to become healthy, then stop one old instance, and repeat. This maintains capacity throughout the update. But implementing rolling deployments correctly requires solving several sub-problems:

  • Health checking: How do you know when a new instance is ready to serve traffic? Kubernetes provides readiness probes and liveness probes.
  • Traffic draining: How do you gracefully stop sending traffic to an instance before terminating it? Kubernetes provides graceful shutdown periods and endpoint management.
  • Rollback: If the new version is broken, how do you quickly revert? Kubernetes maintains revision history and supports automatic rollback on failure.
  • Surge and unavailability budgets: How many extra instances can you run during the update (surge), and how many instances can be unavailable at once? Kubernetes’ Deployment controller supports configurable maxSurge and maxUnavailable parameters.
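
A hedged sketch of how these knobs surface in a Deployment (the name, image, and numbers are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2            # up to 2 extra pods during the rollout (12 total)
      maxUnavailable: 1      # never more than 1 pod below desired capacity
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:2.0   # hypothetical image
          ports:
            - containerPort: 8080

With these settings, the Deployment controller advances the rollout only as new pods pass their readiness probes.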

The Self-Healing Problem

In any sufficiently large system, failures are not exceptions — they are the normal operating condition. Machines crash, networks partition, disks fill up, processes crash, memory leaks accumulate. Google’s published data suggests that in a cluster of 10,000 machines, several will fail every day.

The self-healing problem is: how do you build a system that automatically detects and recovers from failures without human intervention? Kubernetes addresses this at multiple levels:

  • Container restart: If a container process crashes, the kubelet automatically restarts it, with exponential backoff to avoid restart storms.
  • Pod health monitoring: Liveness probes detect when a container is running but unhealthy (e.g., deadlocked). Kubernetes kills and restarts unhealthy containers.
  • Node failure detection: The control plane detects when a node stops reporting (via the node controller watching heartbeats) and automatically reschedules its pods onto healthy nodes.
  • Replica maintenance: If a Deployment specifies 3 replicas and one pod dies, the Deployment controller automatically creates a replacement.
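
A sketch of the first two mechanisms in a pod spec (paths, ports, and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  restartPolicy: Always          # kubelet restarts crashed containers, with backoff
  containers:
    - name: api
      image: registry.example.com/api:1.0   # hypothetical image
      livenessProbe:             # detects a running-but-unhealthy (e.g., deadlocked) container
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3      # kill and restart after 3 consecutive failures
      readinessProbe:            # gates whether the pod receives Service traffic
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5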

The key insight is that self-healing requires a control loop: continuously compare the actual state of the system to the desired state, and take action to reconcile any differences. This is the reconciliation loop — the central architectural pattern of Kubernetes, used by every controller from the scheduler to the kubelet.

The Desired State Model vs. Imperative Commands

Perhaps the most important conceptual contribution of Kubernetes is its commitment to the desired state (declarative) model over the imperative model.

In an imperative model, you issue commands: “start 3 instances of nginx,” “stop instance X,” “scale up to 5 instances.” The system executes each command as a one-shot action. If the command fails, or if the system state drifts after the command succeeds, the system does not automatically correct itself. The operator must detect the drift and issue corrective commands.

In a declarative model, you declare the desired state: “there should be 3 instances of nginx running.” The system continuously works to make reality match this declaration. If an instance crashes, the system automatically creates a replacement. If an extra instance somehow appears, the system terminates it. If the declaration changes to “5 instances,” the system creates 2 more.

The declarative model is fundamentally more robust because:

  1. It is self-correcting. The system continuously reconciles actual state toward desired state, handling failures automatically.
  2. It is idempotent. Applying the same desired state declaration multiple times has the same effect as applying it once.
  3. It separates intent from execution. The user says what they want; the system decides how to achieve it.
  4. It enables auditability. The desired state is a document (YAML or JSON) that can be version-controlled, reviewed, and diffed.
  5. It enables composition. Multiple controllers can independently reconcile different aspects of the desired state, each responsible for a single concern.

This is not merely a philosophical preference. The imperative model breaks down catastrophically in distributed systems where commands can be lost, duplicated, or reordered. The declarative model, by contrast, is eventually consistent by design: no matter what transient failures occur, the system will eventually converge to the desired state.
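
Concretely, the declaration “there should be 3 instances of nginx” is just a short document (a sketch; the labels are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3                # the entire statement of intent; change it to 5 and the system creates 2 more
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.21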

Imperative vs. Declarative: A Comparison

| Dimension | Imperative Model | Declarative (Kubernetes) Model |
| --- | --- | --- |
| User action | Issue commands: “start X”, “stop Y” | Declare desired state: “there should be 3 of X” |
| Failure recovery | Manual: operator detects drift and issues corrections | Automatic: reconciliation loop continuously corrects drift |
| Idempotency | Commands may not be safe to replay | Applying the same state is always safe to repeat |
| State visibility | State is the cumulative effect of past commands | State is a document that can be inspected, diffed, versioned |
| Scalability | Requires operator attention proportional to scale | Controller workload scales, but operator intent stays constant |
| Composition | Commands must be carefully ordered | Controllers reconcile independently and concurrently |

Common Mistakes and Misconceptions

  • “Kubernetes is only for microservices.” Kubernetes runs monoliths, batch jobs, stateful workloads, ML training pipelines, and more. Its primitives (Deployments, Jobs, StatefulSets, DaemonSets) are designed for a wide variety of workload patterns, not just microservices.

  • “I need Kubernetes for my small application.” For a single service with low traffic, a VM or platform-as-a-service (Heroku, Cloud Run, App Engine) is simpler, cheaper, and faster to operate. Kubernetes’ value emerges when you have the scaling, scheduling, and self-healing problems described in this chapter.

  • “Kubernetes replaces your CI/CD pipeline.” Kubernetes is a runtime platform, not a build or deploy tool. You still need a CI/CD system (GitHub Actions, Jenkins, ArgoCD, etc.) to build images, run tests, and push manifests. Kubernetes runs what your pipeline delivers.


Next: Architecture from First Principles

Chapter 3: Architecture from First Principles

The Big Picture

flowchart TD
    subgraph CP["Control Plane"]
        etcd["etcd<br>(state)"]
        API["API Server<br>(gateway)"]
        CM["kube-controller-manager<br>(reconciliation)"]
        SCHED["kube-scheduler<br>(placement)"]
        etcd <--> API
        API <--> CM
        API <--> SCHED
    end

    subgraph N1["Node 1"]
        kubelet1["kubelet"] --> containerd1["containerd"] --> pods1["Pod Pod"]
        kubeproxy1["kube-proxy"]
    end

    subgraph N2["Node 2"]
        kubelet2["kubelet"] --> containerd2["containerd"] --> pods2["Pod Pod"]
        kubeproxy2["kube-proxy"]
    end

    subgraph N3["Node 3"]
        kubelet3["kubelet"] --> containerd3["containerd"] --> pods3["Pod Pod"]
        kubeproxy3["kube-proxy"]
    end

    API --> kubelet1
    API --> kubelet2
    API --> kubelet3

Every arrow is through the API Server. There are no direct connections between components. This is the single most important architectural constraint.

If you encounter unfamiliar terms in this chapter, see Appendix A: Glossary for quick definitions.

Why etcd? The Case for a Consistent, Distributed Key-Value Store

Kubernetes needs to store the desired state of the entire cluster: every pod specification, every service definition, every configuration map. This state must be consistent (all readers see the same data), durable (data survives machine failures), and available (the store can be read and written to even when some machines fail).

The CAP theorem forces a choice between consistency and availability (since network partitions are inevitable), and Kubernetes chose consistency. This is the right choice for a cluster management system: it is better to temporarily refuse writes than to allow conflicting writes that could result in two different controllers making contradictory scheduling decisions.

etcd implements the Raft consensus algorithm, which provides strong consistency (linearizability) across a cluster of typically 3 or 5 nodes. Every write must be acknowledged by a majority (a quorum) before it is committed: a 3-node cluster needs 2 acknowledgments and tolerates 1 failure; a 5-node cluster needs 3 and tolerates 2. This is why etcd clusters always use an odd number of nodes — a 4-node cluster still requires 3 for quorum, giving no additional fault tolerance over 3.
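
In formula form (standard Raft majorities, not anything etcd-specific):

\text{quorum}(n) = \left\lfloor \tfrac{n}{2} \right\rfloor + 1, \qquad f(n) = n - \text{quorum}(n) = \left\lceil \tfrac{n}{2} \right\rceil - 1

so f(3) = f(4) = 1 while f(5) = 2: an even cluster size adds a member without adding fault tolerance.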

Why etcd specifically, rather than ZooKeeper, Consul, or a relational database?

  • ZooKeeper was the incumbent choice (used by Hadoop, Kafka, and many other systems). But ZooKeeper has a complex session-based model, a limited data model (tree of znodes with size limits), and a Java-based implementation that was harder to embed. etcd offered a simpler HTTP/gRPC API, a more flexible key-value model, and was written in Go (matching Kubernetes’ language).
  • Consul was not yet mature when Kubernetes was designed.
  • Relational databases provide strong consistency but are harder to operate in a distributed, fault-tolerant configuration. etcd’s Raft-based replication is simpler to reason about and deploy than MySQL/PostgreSQL with synchronous replication.

Critically, etcd provides a watch mechanism: clients can subscribe to changes on a key or key prefix and receive notifications when the data changes. This is the mechanism that powers Kubernetes’ reconciliation loops. Controllers do not poll etcd; they watch for changes and react to them. This makes the system event-driven and efficient.

Why a Single API Server? The Chokepoint That Enables Everything

All access to the Kubernetes cluster state — every read, every write, from every component — goes through the kube-apiserver. This seems like a bottleneck, and indeed it is a deliberate chokepoint. Why?

1. Authentication and authorization. The API server is the single enforcement point for access control. Every request is authenticated (who is making this request?) and authorized (is this identity allowed to perform this action on this resource?). Having a single enforcement point is a fundamental security principle: it eliminates the risk of inconsistent access control across multiple entry points.

2. Validation. The API server validates every object before storing it in etcd. This ensures that invalid state never enters the system. Validation includes schema validation (does this object have the right fields?), semantic validation (does this pod specification reference an existing service account?), and admission control (do custom policies allow this object?).

3. Admission control. The API server supports admission webhooks — external services that can examine, modify, or reject API requests. This enables powerful policy enforcement: injecting sidecar containers, enforcing naming conventions, requiring resource limits, preventing privilege escalation. The single-API-server model makes this possible because all mutations flow through one point. A sketch of registering such a webhook appears at the end of this section.

4. Watch multiplexing. The API server multiplexes watch connections. Hundreds of controllers and kubelets watch for changes to different resources, and the API server efficiently fans out notifications from etcd changes. Without the API server as intermediary, every client would need a direct connection to etcd, which would not scale.

5. API versioning and conversion. The API server handles conversion between different API versions. An object stored as apps/v1 can be read as apps/v1beta1 (with appropriate conversion). This enables gradual API evolution without breaking clients.

The API server is designed to be horizontally scalable. You can run multiple instances behind a load balancer. Each instance is stateless — all state is in etcd. This means the single logical API server does not become a single point of failure in practice.
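
Returning to admission control (point 3 above), here is a trimmed sketch of registering a validating webhook; the policy name, Service, and rule are all illustrative:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: require-resource-limits       # hypothetical policy
webhooks:
  - name: limits.policy.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail               # reject requests if the webhook is unreachable
    clientConfig:
      service:
        name: policy-webhook          # hypothetical Service fronting the webhook server
        namespace: policy-system
        path: /validate
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]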

The Controller Pattern: Reconciliation as Architecture

The controller pattern is the heart of Kubernetes’ architecture. A controller is a loop that:

  1. Observes the current state of the world (by watching the API server)
  2. Compares the current state to the desired state (as expressed in API objects)
  3. Takes action to move the current state toward the desired state
  4. Repeats indefinitely

This is sometimes called the reconciliation loop or the observe-diff-act pattern. It is borrowed from control theory, where it is known as a closed-loop controller. The thermostat in your house is a simple example: it observes the current temperature, compares it to the desired temperature, and turns the heater on or off.

Kubernetes runs dozens of controllers, each responsible for a specific aspect of the system:

  • The Deployment controller watches Deployment objects and ensures the right number of ReplicaSets exist with the right template.
  • The ReplicaSet controller watches ReplicaSet objects and ensures the right number of Pods exist.
  • The Node controller watches nodes and detects failures.
  • The Service controller watches Service objects and updates endpoints.
  • The Job controller watches Job objects and creates Pods to run tasks to completion.
  • The Endpoint controller watches Services and Pods to maintain the mapping between them.
flowchart TD
    OBSERVE["OBSERVE<br>current state"] -->|Compare actual<br>vs desired| DIFF["DIFF<br>actual vs spec"]
    DIFF -->|Create/delete/update<br>objects via API Server| ACT["ACT<br>to fix drift"]
    ACT -->|repeat| OBSERVE

    OBSERVE -.-|Watch API Server| API(("API<br>Server"))
    ACT -.-|Write to API Server| API

The genius of this pattern is decomposition. Each controller handles exactly one concern. The Deployment controller knows nothing about nodes or networking; the ReplicaSet controller knows nothing about rolling updates. Each controller reads from and writes to the API server, and the API server provides the shared state that coordinates them.

This decomposition also provides fault tolerance. If the Deployment controller crashes and restarts, it simply reads the current state from the API server and resumes reconciling. No state is lost because the controller is stateless — all state is in the API server (and ultimately in etcd). This is why Kubernetes components can be restarted at any time without corruption.

The Watch Mechanism: Event-Driven Efficiency

Controllers need to know when things change. Polling — periodically reading all objects — is wasteful and slow, introducing latency proportional to the polling interval.

Kubernetes uses a watch mechanism instead. A controller opens a long-lived HTTP connection to the API server and says, “notify me of any changes to Deployment objects.” The API server, in turn, watches etcd for changes and fans out notifications to all watching clients. This is event-driven: controllers react to changes immediately rather than discovering them on a polling interval.

The watch mechanism is implemented using HTTP chunked transfer encoding (or gRPC streaming). Each change event includes the type of change (ADDED, MODIFIED, DELETED), the object’s new state, and a resource version — a logical clock that enables clients to resume a watch from where they left off after a disconnection.

To handle the case where a watch connection breaks and events are missed, Kubernetes controllers use a pattern called list and watch: on startup, the controller lists all objects of interest (establishing a baseline), notes the resource version, and then watches for changes from that version forward. The client-go library provides an Informer abstraction that implements this pattern, including a local cache of objects and an event handler interface.

The Scheduler: Separation of Concerns

The Kubernetes scheduler is a separate component from the API server and the controllers. Its job is simple but computationally intensive: when a new Pod is created without a node assignment, the scheduler selects the best node for it.

Why is the scheduler separate? Because scheduling is a fundamentally different concern from state management. The API server manages state; controllers reconcile state; the scheduler makes placement decisions. Separating these concerns allows each to evolve independently. You can replace the default scheduler with a custom one, or run multiple schedulers for different workload types, without modifying any other component.

The scheduler operates in two phases:

  1. Filtering: Eliminate nodes that cannot run the pod (insufficient resources, incompatible taints, missing node selectors, etc.).
  2. Scoring: Rank the remaining nodes by desirability (resource balance, affinity/anti-affinity, topology spread, etc.).

The scheduler’s decision is recorded by binding the pod to a node — setting the spec.nodeName field in the Pod object via the API server. The kubelet on the target node watches for pods bound to it and starts the containers.
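
After binding, the decision is visible as ordinary data in the Pod object (names are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: web-7d4b9-x2k8f      # hypothetical pod name
spec:
  nodeName: node-2           # empty at creation; written by the scheduler at binding time
  containers:
    - name: web
      image: registry.example.com/web:2.0   # hypothetical image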

This design means the scheduler is advisory, not authoritative. It makes suggestions (by binding pods to nodes), but it does not directly start containers. If the kubelet cannot run the pod (perhaps a resource changed between scheduling and execution), the pod enters a failed state and the reconciliation loop handles it.

Kubelet: The Node Agent

The kubelet runs on every node in the cluster. It is the bridge between the Kubernetes control plane and the container runtime on the node. The kubelet:

  1. Watches the API server for pods assigned to its node
  2. Translates pod specifications into container runtime calls (via the Container Runtime Interface, CRI)
  3. Monitors container health (via liveness and readiness probes)
  4. Reports node status and pod status back to the API server

The kubelet is deliberately simple. It does not make scheduling decisions, manage networking, or handle service discovery. It is a single-responsibility agent that converts API state into running containers and reports back.

The kubelet’s design reflects a key Kubernetes principle: the control plane tells nodes what to do, not how to do it. The kubelet receives a PodSpec and is free to implement it however it wants, as long as the containers end up running with the specified resources and configuration. This abstraction is what allows Kubernetes to support multiple container runtimes (containerd, CRI-O, etc.) through the CRI interface.

Kube-Proxy: Transparent Service Networking

Kube-proxy runs on every node (or is replaced by equivalent CNI functionality) and implements the networking rules that make Services work. When a Service is created with a ClusterIP, kube-proxy programs iptables or IPVS rules on every node that intercept traffic destined for the ClusterIP and redirect it to one of the Service’s backing pods.

Why does kube-proxy run on every node instead of as a centralized load balancer? Because a centralized load balancer would be a bottleneck and a single point of failure. By distributing the load-balancing rules to every node, Kubernetes ensures that Service traffic takes the most direct path: a pod on Node A talking to a Service endpoint on Node B sends traffic directly from A to B, with no intermediary.

Kube-proxy watches the API server for Service and Endpoint changes and updates the local rules accordingly. It is another example of the controller pattern: observe desired state (Service definitions), observe actual state (current iptables rules), and reconcile.

The Controller Pattern is Kubernetes. If you understand only one thing about Kubernetes’ architecture, understand the controller pattern: observe, diff, act, repeat. Every component — from the scheduler to the kubelet to kube-proxy — is a controller that watches for state changes and reconciles actual state toward desired state. This single pattern, applied recursively across the entire system, is what makes Kubernetes self-healing, scalable, and extensible.

Common Mistakes and Misconceptions

  • “The master node runs my workloads.” Control plane nodes are dedicated to cluster management (API server, scheduler, controllers, etcd). Workloads run on worker nodes by default. Running application pods on control plane nodes is possible but strongly discouraged in production because it risks destabilizing the control plane.

  • “etcd is a general-purpose database.” etcd is optimized for small key-value metadata, with a default request-size limit of roughly 1.5 MB per object. It is designed for storing cluster state, not application data. Treating it as an application database will lead to performance degradation and cluster instability.

  • “If the API server goes down, my pods stop.” Running pods continue to execute on their nodes even if the API server is unavailable. The kubelet keeps containers running based on its last known state. What stops is your ability to make changes, schedule new pods, or observe cluster state through the API.

  • “The scheduler continuously moves pods for better balance.” Pods are scheduled once and stay on their assigned node unless they are evicted, deleted, or the node fails. The scheduler does not rebalance running pods. If you need rebalancing, you must use tools like the Descheduler, which evicts pods so the scheduler can place them on better nodes.


Next: The API Model

Chapter 4: The API Model — Declarative State and Reconciliation

Resources, Objects, Specs, and Status

The Kubernetes API is a resource-oriented API. Everything in Kubernetes — pods, services, deployments, config maps, custom resources — is a resource with a standard structure:

  • apiVersion: The API group and version (e.g., apps/v1)
  • kind: The type of resource (e.g., Deployment)
  • metadata: Name, namespace, labels, annotations, resource version, creation timestamp, finalizers, owner references
  • spec: The desired state — what the user wants
  • status: The actual state — what the system has observed

This spec/status split is fundamental. The user writes spec; controllers write status. This separation of concerns means that:

  1. The user owns intent. Only the user (or their tooling) should modify spec. Controllers never modify spec.
  2. Controllers own reality. Controllers update status to reflect what they have observed and what actions they have taken.
  3. Reconciliation bridges the gap. The controller’s job is to make the real world match spec, and to report the real world in status.

The resource version field in metadata is a critical coordination mechanism. It is an opaque string (typically derived from etcd’s revision number) that changes every time the object is modified. When a client wants to update an object, it must include the current resource version. If another client has modified the object in the meantime, the resource version will have changed, and the update will fail with a 409 Conflict error. This is optimistic concurrency control: clients assume they can make updates without locking, and the system detects and rejects conflicting updates.
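
Putting the pieces together, here is the shape of an object as stored (values are illustrative): spec carries user intent, status carries controller-observed reality, and resourceVersion guards concurrent updates.

apiVersion: v1
kind: Pod
metadata:
  name: api-0
  namespace: default
  labels:
    app: api
  resourceVersion: "4711"    # opaque; an update must echo it back or fail with 409 Conflict
spec:                        # written by the user: desired state
  containers:
    - name: api
      image: registry.example.com/api:1.0   # hypothetical image
status:                      # written by controllers and the kubelet: observed state
  phase: Running
  podIP: 10.244.1.7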

Declarative YAML: Configuration as Data

Kubernetes objects are typically expressed as YAML (or JSON) documents. This is not incidental — it is a deliberate design choice with deep implications.

By representing desired state as data (YAML files) rather than code (imperative scripts), Kubernetes enables:

  • Version control: YAML files can be committed to Git, creating a complete history of every change to the cluster’s desired state. This is the foundation of GitOps.
  • Code review: Changes to infrastructure can be reviewed with the same tools and processes used for application code.
  • Diffing: You can diff two versions of a deployment spec to see exactly what changed.
  • Templating: Tools like Helm can generate YAML from templates, enabling parameterized deployments.
  • Validation: YAML can be validated against schemas before being applied, catching errors before they reach the cluster.
  • Dry runs: kubectl apply --dry-run=server sends the YAML to the API server for validation without actually creating resources.

The choice of YAML specifically (rather than JSON, TOML, or a custom DSL) was pragmatic: YAML is human-readable, supports comments (unlike JSON), and was already widely used in the DevOps community (Ansible, Docker Compose). Its verbosity has been criticized, but its universality is a significant advantage.

Reconciliation Loops: The Engine of Self-Healing

The reconciliation loop is the mechanism by which Kubernetes achieves its declarative guarantees.

Consider what happens when you apply a Deployment object:

flowchart TD
    kubectl["kubectl apply"] --> API["API Server"]
    API -->|store| etcd["etcd"]
    API -->|watch| DC["Deployment Controller"]
    DC -->|create| RS["ReplicaSet"]
    API -->|watch| RSC["ReplicaSet Controller"]
    RSC -->|create| Pods["Pod<br>Pod<br>Pod"]
    API -->|watch| Scheduler["Scheduler"]
    Scheduler -->|"bind pod to node"| Pods
    API -->|watch| Kubelet["Kubelet (on node)<br>containerd<br>start container"]
    Pods --> Kubelet

The following sequence diagram shows the temporal flow — notice that every component communicates only through the API server:

sequenceDiagram
    participant User as User (kubectl)
    participant API as API Server
    participant etcd as etcd
    participant DC as Deployment Controller
    participant RSC as ReplicaSet Controller
    participant Sched as Scheduler
    participant KL as Kubelet (per node)
    participant EC as Endpoint Controller

    User->>API: POST /apis/apps/v1/deployments
    API->>etcd: store Deployment
    API-->>User: 201 Created

    API-->>DC: watch: new Deployment
    Note over DC: compare: 0 RS exist, need 1
    DC->>API: create ReplicaSet
    API->>etcd: store RS

    API-->>RSC: watch: new RS
    Note over RSC: compare: 0 pods, need 3
    RSC->>API: create 3 Pods
    API->>etcd: store Pods

    API-->>Sched: watch: 3 unbound Pods
    Note over Sched: assign nodeName per Pod
    Sched->>API: update Pod .spec.nodeName
    API->>etcd: store updated Pods

    API-->>KL: watch: Pod bound to my node
    Note over KL: start containers via CRI

    KL->>API: update Pod .status (Running, IP)
    API->>etcd: store Pod status

    API-->>EC: watch: Pod Ready
    Note over EC: add Pod IP to Service
    EC->>API: update Endpoints
    API->>etcd: store Endpoints

Here’s the same flow in words:

  1. The API server validates the Deployment and stores it in etcd.
  2. The Deployment controller observes the new Deployment. It compares the desired state (e.g., 3 replicas of nginx:1.21) to the actual state (no ReplicaSets exist yet). It creates a new ReplicaSet object.
  3. The ReplicaSet controller observes the new ReplicaSet. It compares the desired state (3 pods) to the actual state (0 pods). It creates 3 Pod objects.
  4. The Scheduler observes the 3 unscheduled Pods. For each, it selects a node and updates the Pod’s spec.nodeName.
  5. The Kubelet on each selected node observes the Pod assigned to it. It calls the container runtime to start the containers.
  6. The Kubelet reports the pod’s status (running, IP address, etc.) back to the API server.
  7. The Endpoint controller observes the running Pods and updates the Endpoints object for any matching Services.

Notice how many controllers are involved, each doing a small, independent job, communicating only through the API server. If any controller crashes, it simply restarts and resumes from the current state. If a pod crashes, the ReplicaSet controller detects that the actual count (2) differs from the desired count (3) and creates a replacement. This is self-healing through reconciliation.

Labels and Selectors: The Soft Linking Mechanism

Kubernetes objects are connected not by hard references (like foreign keys in a relational database) but by labels and selectors. A label is a key-value pair attached to an object’s metadata (e.g., app: nginx, env: production). A selector is a query that matches objects by their labels (e.g., app=nginx,env=production).

This soft linking is a deliberate design choice:

  • Loose coupling: A Service does not reference specific Pods by name. It references a label selector, and any Pod matching that selector is included. This means Pods can be created, destroyed, and replaced without updating the Service.
  • Flexibility: Labels can represent any dimension: application name, version, environment, team, cost center. Selectors can combine dimensions.
  • Composition: Multiple resources can select the same Pods. A Service, a NetworkPolicy, and a PodDisruptionBudget can all independently select the same set of Pods using the same or different labels.

The label/selector model is inspired by the way tagging works in cloud infrastructure and the way CSS selectors work in web development: you define properties on objects and use queries to match them, rather than building explicit relationship graphs.
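
A sketch of this composition (all names illustrative): a Service and a PodDisruptionBudget independently selecting the same pods, with no pod referenced by name.

apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api                 # matches any pod carrying this label, now or in the future
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2            # keep at least 2 matching pods during voluntary disruptions
  selector:
    matchLabels:
      app: api               # the same pods, selected independently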

Custom Resource Definitions: Extending the API

One of Kubernetes’ most powerful features is the ability to extend the API with custom resources. A Custom Resource Definition (CRD) tells the API server about a new type of object — say, PostgresCluster — including its schema, its API group, and its versions. Once the CRD is installed, users can create, read, update, and delete PostgresCluster objects just like built-in resources.

But a CRD alone is just data storage. The magic happens when you pair a CRD with a custom controller that watches for PostgresCluster objects and reconciles them — creating the underlying StatefulSets, Services, ConfigMaps, PersistentVolumeClaims, and other resources needed to run an actual PostgreSQL cluster. This combination of CRD + controller is the Operator pattern.
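
A trimmed sketch of such a CRD (the PostgresCluster group and schema are hypothetical and far from complete):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.db.example.com   # must be <plural>.<group>
spec:
  group: db.example.com
  scope: Namespaced
  names:
    kind: PostgresCluster
    plural: postgresclusters
    singular: postgrescluster
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                version:
                  type: string

Once applied, kubectl get postgresclusters behaves like any built-in resource; the paired controller supplies the actual behavior.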

CRDs are Kubernetes’ answer to the extensibility problem: how do you allow the platform to manage new types of resources without modifying Kubernetes itself? By making the API server a generic, extensible state store with a standard interface, Kubernetes enables an ecosystem of operators that teach the system how to manage everything from databases to message queues to machine learning pipelines.

This extensibility was a lesson from Borg, whose fixed API required modifying the system itself to support new workload types. Kubernetes’ CRD mechanism democratizes this: anyone can extend the API without forking the project.

Common Mistakes and Misconceptions

  • “kubectl apply and kubectl create are the same.” kubectl create is imperative and fails if the resource already exists. kubectl apply is declarative and merges your manifest with the existing resource, making it safe to run repeatedly. In production, always use apply for reproducible, idempotent deployments.

  • “I should use kubectl edit in production.” Imperative edits bypass GitOps workflows, code review, and audit trails. Changes made with kubectl edit are not tracked in version control and cannot be reproduced. Always use declarative YAML stored in Git and applied through a pipeline.

  • “All Kubernetes resources are namespaced.” Many important resources are cluster-scoped: Nodes, PersistentVolumes, ClusterRoles, ClusterRoleBindings, and Namespaces themselves. Understanding which resources are namespaced and which are cluster-scoped is essential for RBAC and multi-tenancy.

  • “Deleting a resource is instant.” Finalizers can block deletion indefinitely until a controller completes cleanup logic. Pods have a graceful termination period (default 30 seconds) during which they receive SIGTERM before being killed. A resource in “Terminating” state may remain for an extended time.
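
Both mechanisms are visible in the object itself (the finalizer name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: worker-1
  finalizers:
    - example.com/cleanup              # deletion blocks until a controller removes this entry
spec:
  terminationGracePeriodSeconds: 30    # window between SIGTERM and SIGKILL
  containers:
    - name: worker
      image: registry.example.com/worker:1.0   # hypothetical image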


Next: The Networking Model

Chapter 5: The Networking Model — Why Every Pod Gets an IP

The Fundamental Networking Problem

Kubernetes’ networking model is most easily understood by contrast with the Docker port-mapping model it rejected.

Docker Port-Mapping vs. Kubernetes Flat Network

 DOCKER PORT-MAPPING MODEL              KUBERNETES FLAT NETWORK MODEL
 ─────────────────────────              ────────────────────────────

 Host IP: 192.168.1.10                  Node 1                Node 2
 ┌──────────────────────┐               ┌──────────────┐      ┌──────────────┐
 │ Container A (:80)    │               │ Pod A        │      │ Pod C        │
 │  → mapped to :32768  │               │ 10.244.1.5   │      │ 10.244.2.8   │
 │                      │               │ :80 is :80   │      │ :80 is :80   │
 │ Container B (:80)    │               │              │      │              │
 │  → mapped to :32769  │               │ Pod B        │      │ Pod D        │
 │                      │               │ 10.244.1.6   │      │ 10.244.2.9   │
 │ Container C (:3000)  │               │ :3000 is     │      │ :3000 is     │
 │  → mapped to :32770  │               │  :3000       │      │  :3000       │
 └──────────────────────┘               └──────┬───────┘      └──────┬───────┘
                                               │    Flat network     │
 Client must know:                             └─────────────────────┘
  192.168.1.10:32768                     Any pod can reach any pod by IP.
  192.168.1.10:32769                     No port translation. No NAT.
  192.168.1.10:32770                     Apps bind to the port they expect.

Docker’s Port-Mapping Model (and Why Kubernetes Rejected It)

In Docker’s default bridge networking mode, each container gets its own network namespace connected to the docker0 bridge on the host. Since containers are isolated behind this bridge, reaching them from outside the host requires port mapping (-p), which maps a host port into the container’s namespace (e.g., host port 32768 to container port 80). This means:

  • The container’s address from outside is <host-ip>:<random-port>, not a predictable address.
  • Every service that needs to communicate with the container must know the host IP and the mapped port.
  • Port allocation must be coordinated across all containers on a host to avoid conflicts.
  • Applications must be aware of port mapping, or an intermediary must translate.

This model breaks a fundamental assumption of network programming: that you know your own address. A container that binds to port 80 thinks it is listening on port 80, but external clients reach it on port 32768. Google’s experience with Borg — which used a naming service (BNS) to map logical names to host:port pairs — confirmed that port-mapping models create cascading operational friction.

Kubernetes’ Flat Networking Model

Kubernetes’ networking model has three fundamental rules:

  1. Every Pod gets its own IP address. No port mapping. No NAT between pods. A pod that binds to port 80 is reachable on port 80 at its pod IP.
  2. All pods can communicate with all other pods without NAT. Any pod can reach any other pod using the other pod’s IP address, regardless of which node either pod is on.
  3. Agents on a node (kubelet, kube-proxy) can communicate with all pods on that node.

This is sometimes called the flat networking model because from the perspective of pods, the network is flat: every pod is directly reachable from every other pod. There are no layers of NAT or port mapping to navigate.

Why is this model superior? Because it preserves the assumptions of traditional network programming. Applications do not need to know about port mapping. They bind to the port they expect. They connect to other services at their expected ports. DNS, load balancers, and monitoring tools work as expected. The mental model is: “pods are like VMs on a flat network.”

How the Flat Network Is Implemented: CNI

Kubernetes does not implement networking itself. Instead, it defines the Container Network Interface (CNI) specification: a standard API that networking plugins must implement. The CNI plugin is responsible for:

  • Allocating an IP address for each pod
  • Configuring the pod’s network namespace (virtual ethernet pair, routes, etc.)
  • Ensuring pod-to-pod connectivity across nodes

Different CNI plugins implement this in different ways:

  • Flannel uses a simple overlay network: its default VXLAN mode encapsulates pod traffic in UDP packets, while its host-gw mode skips encapsulation entirely and programs routes on each host.
  • Calico uses BGP to distribute pod routes, avoiding encapsulation overhead and enabling network policies.
  • Cilium uses eBPF (extended Berkeley Packet Filter) programs in the Linux kernel for high-performance, programmable networking.
  • AWS VPC CNI assigns pod IPs from the AWS VPC address space, making pods first-class citizens in the VPC network.

The CNI abstraction is another example of Kubernetes’ design philosophy: define the interface, not the implementation. By specifying what networking must provide (unique pod IPs, flat connectivity) without specifying how, Kubernetes allows the networking layer to be optimized for different environments.

Services: Stable Endpoints for Ephemeral Pods

Pod IP addresses are ephemeral. When a pod is destroyed and recreated, it gets a new IP. This means you cannot rely on pod IPs for service discovery. This is where the Service abstraction comes in.

A Service provides a stable virtual IP address (the ClusterIP) and a stable DNS name that routes traffic to the set of pods matching the Service’s label selector. The mapping from Service to pods is maintained by the Endpoints (or EndpointSlice) controller, which watches for pod changes and updates the endpoint list.

flowchart TD
    SVC["Service: web-svc<br>ClusterIP: 10.96.0.42<br>DNS: web-svc.default.svc<br>Selector: app=web"]
    SVC -->|"kube-proxy / iptables<br>load-balances across"| Pod1["Pod app=web<br>10.244.1.5<br>Node 1"]
    SVC -->|"kube-proxy / iptables<br>load-balances across"| Pod2["Pod app=web<br>10.244.2.8<br>Node 2"]
    SVC -->|"kube-proxy / iptables<br>load-balances across"| Pod3["Pod app=web<br>10.244.1.9<br>Node 1"]

Client code: http://web-svc:80 is transparently routed to a pod.

Kube-proxy (or the CNI plugin) programs rules on every node that intercept traffic to the Service’s ClusterIP and redirect it to one of the backing pod IPs, using random selection (iptables mode) or configurable algorithms such as round-robin (IPVS mode). From the client’s perspective, the Service has a single, stable address; the fact that traffic is being distributed to ephemeral pods is transparent.

Services come in several types:

  • ClusterIP (default): Accessible only within the cluster.
  • NodePort: Exposes the Service on a static port on every node’s IP, making it accessible from outside the cluster.
  • LoadBalancer: Provisions an external load balancer (on cloud providers) that routes external traffic to the Service.
  • ExternalName: Maps the Service to a DNS CNAME record, providing a Kubernetes-native alias for an external service.
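
A minimal manifest for the web-svc Service in the diagram above might look like the following sketch; the names and port numbers mirror the example, and everything else is standard boilerplate.

apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  type: ClusterIP          # the default; shown here for clarity
  selector:
    app: web               # traffic goes to ready pods carrying this label
  ports:
    - port: 80             # the stable port on the ClusterIP
      targetPort: 80       # the port the pods actually listen on

Changing type to NodePort or LoadBalancer layers external exposure on top of the same ClusterIP machinery.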

Ingress and Gateway API: L7 Routing

Services operate at Layer 4 (TCP/UDP). For HTTP-level routing — path-based routing, host-based virtual hosting, TLS termination — Kubernetes provides the Ingress resource (and its successor, the Gateway API).

An Ingress is a declaration of routing rules: “route traffic for host foo.example.com to Service foo, and traffic for host bar.example.com to Service bar.” An Ingress Controller (a separate component, typically nginx, HAProxy, Traefik, or a cloud load balancer) watches for Ingress resources and configures the actual routing.

The Gateway API, introduced as a successor to Ingress, provides a more expressive and extensible model for routing, with better support for multi-tenancy, traffic splitting, and protocol-specific routing.
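
The routing rules described above translate directly into manifest form. A hedged sketch, assuming Services named foo and bar listening on port 80, and omitting ingressClassName and TLS configuration:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: name-based-routing
spec:
  rules:
    - host: foo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: foo
                port:
                  number: 80
    - host: bar.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: bar
                port:
                  number: 80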

Network Policies: The Missing Firewall

By default, Kubernetes’ flat networking model allows all pods to communicate with all other pods. This is convenient but not secure. Network Policies provide pod-level firewall rules: you can specify which pods can communicate with which other pods, based on labels, namespaces, and IP blocks.

Network Policies are implemented by the CNI plugin (not all plugins support them). They are another example of Kubernetes’ declarative model: you declare the desired network access rules, and the CNI plugin configures the underlying network to enforce them.
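
The canonical starting point is a default-deny policy. A minimal sketch, assuming a namespace named production (illustrative) and a CNI plugin that enforces policies:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}      # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules are listed, so all inbound traffic is denied

Additional policies then allow specific traffic back in, selected by pod labels, namespaces, or IP blocks.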

The Four Networking Problems

| Problem | Solution | Key Mechanism |
|---|---|---|
| Container-to-container on same pod | Shared network namespace (localhost) | Pods share a single IP; containers communicate via localhost |
| Pod-to-pod across nodes | Flat network via CNI plugin | Every pod gets a unique IP; CNI ensures cross-node connectivity |
| Pod-to-Service (service discovery) | Service abstraction with ClusterIP | kube-proxy/CNI programs iptables/IPVS rules for load balancing |
| External-to-Service | NodePort, LoadBalancer, Ingress | Expose services externally via port mapping, cloud LB, or L7 routing |

Common Mistakes and Misconceptions

  • “Pods need NAT to talk to each other.” Kubernetes requires a flat network where every pod can reach every other pod directly by IP without NAT. This is a fundamental requirement of the networking model, enforced by the CNI plugin. If you find yourself configuring NAT between pods, something is misconfigured.

  • “Services are load balancers.” A Service is a stable virtual IP (ClusterIP) with endpoint tracking and basic load distribution via kube-proxy rules. Only type: LoadBalancer provisions an actual external load balancer. ClusterIP and NodePort Services are internal routing constructs, not load balancer appliances.

  • “Pod IPs are stable.” Pod IPs are ephemeral and change every time a pod is restarted or rescheduled. Never hard-code pod IPs in configuration. Use Services for stable endpoints and DNS-based service discovery.

  • “NodePort is fine for production.” NodePort exposes a high-numbered port on every node in the cluster, making it difficult to manage, secure, and integrate with external DNS or TLS. For production external traffic, use Ingress controllers or type: LoadBalancer Services instead.

Next: The Ecosystem

Chapter 6: The Ecosystem — Why Operators, Helm, and Service Meshes Exist

block-beta
    columns 3

    APP["YOUR APPLICATION"]:3

    GITOPS["GitOps\nArgoCD / Flux"]
    PKG["Packaging\nHelm / Kustomize"]
    OBS["Observability\nPrometheus / Grafana"]

    MESH["SERVICE MESH (optional) — Istio / Linkerd / Cilium"]:3

    OPH["OPERATORS — CRD + Controller = domain knowledge as code\nPostgreSQL / Kafka / Any Domain Operator"]:3

    CORE["KUBERNETES CORE\nDeployments, Services, ConfigMaps, Secrets, RBAC, CRDs\nAPI Server, etcd, Scheduler, Controllers, Kubelet"]:3

    CNI["NETWORK (CNI)\nFlannel / Calico / \nCilium"]
    CRI["RUNTIME (CRI)\ncontainerd / CRI-O"]
    CSI["STORAGE (CSI)\nEBS / NFS / Ceph"]

    LINUX["LINUX + HARDWARE — cgroups, namespaces, iptables/eBPF, kernel"]:3

    style CORE fill:#1a4eb8,color:#fff,stroke:#0d2d6e,stroke-width:3px

Kubernetes provides the middle layers. Everything above and below is pluggable. This is “platform for platforms” by design.

Operators: Teaching Kubernetes Domain Knowledge

The Operator pattern, introduced by CoreOS in 2016, is the most significant architectural pattern to emerge from the Kubernetes ecosystem. An Operator encodes the operational knowledge of a human domain expert into a custom controller.

Consider the problem of running a PostgreSQL database on Kubernetes. Kubernetes knows how to run containers, but it does not know how to:

  • Initialize a PostgreSQL cluster with a primary and replicas
  • Configure streaming replication between primary and replicas
  • Perform a failover when the primary node fails
  • Take point-in-time backups using WAL archiving
  • Resize a cluster by adding or removing replicas
  • Upgrade PostgreSQL versions with minimal downtime

A human DBA knows all of these things. An Operator encodes this knowledge into code. The Operator defines a CRD (e.g., PostgresCluster) and a controller that watches for PostgresCluster objects and reconciles them by creating and managing the underlying Kubernetes resources (StatefulSets, Services, ConfigMaps, PersistentVolumeClaims, CronJobs for backups, etc.).
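
From the user’s side, all of this operational machinery hides behind a single object. A hypothetical PostgresCluster resource might look like the sketch below; the field names are illustrative, since each operator defines its own schema:

apiVersion: example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  version: "16"            # desired PostgreSQL major version
  replicas: 3              # one primary plus two streaming replicas
  storage:
    size: 100Gi            # typically becomes a volumeClaimTemplate under the hood
  backup:
    schedule: "0 2 * * *"  # the operator might create a CronJob for WAL backups

Scaling, upgrading, or deleting the database becomes an edit to this one object.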

The Operator pattern is powerful because it composes with Kubernetes’ existing primitives — the same API, controllers, and reconciliation loops — adding only the domain-specific logic on top.

Operators exist for virtually every stateful application: databases (PostgreSQL, MySQL, MongoDB, CockroachDB, Cassandra), message queues (Kafka, RabbitMQ), monitoring systems (Prometheus), and many more. The OperatorHub.io registry catalogs hundreds of them.

Helm: Package Management for Kubernetes

Helm addresses a different problem: parameterized deployment. A typical Kubernetes application consists of dozens of YAML files: Deployments, Services, ConfigMaps, Secrets, Ingresses, ServiceAccounts, RBAC rules. These files need to be customized for different environments (dev, staging, production) and different configurations (replicas, resource limits, feature flags).

Helm introduces the concept of a chart: a package of templated YAML files, a values.yaml file that provides default parameters, and metadata. Users install a chart with custom values, and Helm renders the templates and applies the resulting YAML to the cluster.
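
A hedged sketch of the moving parts, using a hypothetical chart for a web Deployment; the file names follow Helm’s conventions, while the specific values and template excerpt are illustrative:

# values.yaml — default parameters, overridable per environment
replicaCount: 2
image:
  repository: example/web
  tag: "1.4.2"

# templates/deployment.yaml — an excerpt of a templated manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: web
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"

Installing with helm install prod ./mychart --set replicaCount=5 renders the templates with the merged values and applies the result to the cluster.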

Helm also provides release management: it tracks which charts are installed, their versions, and their configuration, enabling upgrades and rollbacks. It fills the role of a package manager (like apt or npm) for Kubernetes.

Helm has been criticized for its complexity and for its templating approach (Go templates embedded in YAML is ergonomically challenging). Alternatives like Kustomize (which uses overlay-based patching rather than templating) have emerged, but Helm remains the most widely used packaging tool in the Kubernetes ecosystem, largely because of its enormous library of community-maintained charts.

Service Meshes: The Networking Layer That Kubernetes Lacks

Kubernetes provides basic service discovery and load balancing through Services, but it does not provide:

  • Mutual TLS (mTLS) between services: encrypting and authenticating all inter-service communication
  • Fine-grained traffic management: canary deployments, traffic splitting, fault injection, retries, timeouts, circuit breaking
  • Observability: distributed tracing, per-service metrics, access logging

A service mesh adds these capabilities by inserting a sidecar proxy (typically Envoy) alongside every pod. All inbound and outbound traffic flows through the sidecar, which can encrypt it, route it, observe it, and enforce policies on it. A control plane (Istio, Linkerd, Consul Connect) configures the sidecars.

Service meshes exist because Kubernetes deliberately does not implement application-level networking. Kubernetes provides the infrastructure-level network (pod IPs, Service ClusterIPs) but leaves application-level concerns (encryption, traffic management, observability) to the application or to a mesh. This is consistent with Kubernetes’ design philosophy of providing building blocks rather than a complete platform.

However, service meshes add significant complexity: they increase resource consumption (each sidecar consumes CPU and memory), add latency (each hop through a sidecar adds processing time), and introduce a large new control plane to operate. Many organizations find that they can achieve sufficient security and observability with simpler approaches (network policies + application-level TLS + centralized logging) and do not need a full mesh.

Why the Ecosystem Exists: Kubernetes as a Platform for Platforms

The common thread across Operators, Helm, and service meshes is that Kubernetes is deliberately incomplete. It provides primitives (pods, services, deployments, CRDs) and extension mechanisms (controllers, admission webhooks, CNI, CRI, CSI) but does not attempt to solve every problem itself — a lesson from Borg, which tried to be everything and became too tightly coupled to evolve. Kubernetes instead adopted the Unix philosophy: do one thing well, and compose with other tools.

The result is that Kubernetes is not a platform; it is a platform for building platforms. Organizations build their own internal developer platforms on top of Kubernetes, combining:

  • Operators for stateful services
  • Helm or Kustomize for packaging
  • A service mesh or CNI-level features for security
  • ArgoCD or Flux for GitOps
  • Prometheus and Grafana for monitoring
  • Custom CRDs and controllers for domain-specific needs

This composability is both Kubernetes’ greatest strength and its greatest source of complexity. The bare Kubernetes API is relatively simple; the ecosystem built on top of it is vast and sometimes overwhelming. Understanding that this is by design — that Kubernetes provides the kernel, not the full operating system — is essential to understanding the Kubernetes landscape.

Common Mistakes and Misconceptions

  • “CNCF graduated means production-ready for my use case.” Graduated status indicates mature governance, broad adoption, and a proven security audit process. It does not guarantee the project is the right fit for your specific workload, scale, or operational constraints. Always evaluate projects against your own requirements.

  • “I need to install every CNCF tool.” Most production clusters need only 5-10 ecosystem tools (a CNI plugin, an ingress controller, monitoring, logging, and perhaps a GitOps tool). The CNCF landscape contains 1000+ projects; installing everything would create an unmanageable operational burden.

  • “The CNCF landscape is the complete ecosystem.” Many important Kubernetes tools live outside the CNCF, including commercial products, independent open-source projects, and cloud-provider-specific integrations. The CNCF landscape is a significant subset, not the totality of the ecosystem.

Further Reading

  • CNCF Landscape – Interactive map of the entire cloud-native ecosystem, categorized by function (orchestration, observability, service mesh, etc.), with funding and maturity data.
  • CNCF Project Maturity Levels – Explanation of the Sandbox, Incubating, and Graduated tiers, along with a full list of CNCF projects and their current status.
  • Introducing Operators (CoreOS, 2016) – The original blog post by Brandon Philips that introduced the Operator pattern, explaining why encoding operational knowledge in code is a natural extension of Kubernetes controllers.
  • CNCF Annual Survey Results – Yearly survey data on Kubernetes adoption rates, ecosystem tool usage, and deployment patterns across organizations worldwide.
  • KubeCon + CloudNativeCon Talk Recordings – Full archives of KubeCon presentations covering operators, Helm, service meshes, and every other corner of the ecosystem.
  • Helm Documentation – Official docs for the most widely used Kubernetes package manager, covering chart authoring, templating, release management, and repository hosting.
  • Kubernetes Service Mesh: A Comparison of Istio, Linkerd and Consul (Platform9) – Detailed comparison of the major service mesh implementations across 16 factors, covering architectures, performance characteristics, and ideal use cases.

Next: Key Design Principles

Chapter 7: Key Design Principles

Declarative Over Imperative

Kubernetes favors declaring desired state over issuing commands. This principle pervades every level of the system, from the API (objects have spec and status, not a command queue) to the controllers (which reconcile rather than execute) to the tooling (kubectl apply rather than kubectl run).
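
The spec/status split is worth seeing side by side. In the sketch below, the spec is what you author; the status is written back by controllers and is shown here only for illustration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:                        # desired state: declared by you
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27  # illustrative image
status:                      # observed state: written by controllers, never by you
  replicas: 2                # reconciliation in progress: 2 of 3 pods running
  availableReplicas: 2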

Control Loops Over Orchestration

As the official documentation states: “Kubernetes is not a mere orchestration system. Orchestration means executing a defined workflow: first do A, then B, then C. Kubernetes comprises a set of independent, composable control processes that continuously drive the current state towards the desired state.”

This distinction is subtle but important. An orchestration system is fragile: if step B fails, the entire workflow may need to be restarted or manually intervened. A control-loop system is robust: each controller independently makes progress, and failures in one controller do not block others.

API-Centric Design

Everything in Kubernetes is an API object. Every component communicates through the API server. There are no hidden side channels, no direct component-to-component communication. This means:

  • The API is the complete description of the system’s state.
  • Any behavior can be observed by watching the API.
  • Any component can be replaced by one that speaks the same API.
  • The system can be extended by adding new API types (CRDs) and controllers.

Portability and Vendor Neutrality

Kubernetes was designed from the start to run anywhere: on any cloud provider, on bare metal, on a laptop. This is achieved through abstraction layers (CRI for container runtimes, CNI for networking, CSI for storage) that isolate Kubernetes from the underlying infrastructure. The goal is to prevent vendor lock-in and enable workload portability.

Extensibility as a First-Class Concern

Kubernetes does not try to solve every problem itself. Instead, it provides extension points at every level: CRDs for custom API types, admission webhooks for custom validation and mutation, custom schedulers, custom controllers, CNI/CRI/CSI plugins. This extensibility is what enables the vast Kubernetes ecosystem.

The Level-Triggered vs. Edge-Triggered Distinction

Kubernetes controllers are designed to be level-triggered, not edge-triggered. An edge-triggered system reacts to changes (events): “a pod was deleted.” A level-triggered system reacts to state: “the desired count is 3, but the actual count is 2.”

The level-triggered approach is more robust because it handles missed events gracefully. If a controller misses the “pod deleted” event (because it was restarting or the watch was disconnected), it will still notice that the actual count is wrong on its next reconciliation and take corrective action. Edge-triggered systems require reliable event delivery; level-triggered systems only require eventual state observation.

This is why Kubernetes controllers are built around Informers that maintain a cached copy of the current state, rather than simple event handlers. The Informer’s cache represents the current level, and the controller reconciles against it.

Level-Triggered Design: Kubernetes controllers react to the current state of the world (“there are 2 pods but 3 desired”), not to individual events (“a pod was deleted”). This makes them robust against missed events, disconnections, and restarts. If a controller misses an event, it will still observe the state discrepancy on its next reconciliation cycle and take corrective action.

Common Mistakes and Misconceptions

  • “Declarative means one-shot.” Declarative does not mean “apply once and walk away.” It means continuous reconciliation: Kubernetes constantly compares the actual state of the cluster to the desired state and drives toward convergence. The system is always working, not just at the moment you run kubectl apply.

  • “Controllers run once when you apply a change.” Controllers run in continuous loops, not as one-shot handlers. They watch for any drift from desired state, whether caused by your changes, hardware failures, resource pressure, or other controllers. A controller that only ran once would miss all subsequent drift.

  • Writing event-driven controllers instead of level-triggered ones. Controllers that react to individual events rather than reconciling against current state break when events are missed. A level-triggered controller simply observes the current state on the next reconciliation and converges regardless of what events it missed.

Next: Why Kubernetes Won

Chapter 8: Why Kubernetes Won

The Competitive Landscape

Kubernetes was not the only container orchestration system:

  • Docker Swarm (2015) offered a simpler, Docker-native orchestration experience.
  • Apache Mesos (2009) with Marathon provided a battle-tested, two-level scheduling architecture used at Twitter, Airbnb, and Apple.
  • Nomad (2015) from HashiCorp offered a simpler, more flexible orchestrator that could manage containers, VMs, and standalone binaries.

So why did Kubernetes win? Several factors:

1. The right abstraction level. Docker Swarm was too simple: it lacked the extensibility and abstraction depth needed for complex production workloads. Mesos was too low-level: it provided resource scheduling but left application management to frameworks like Marathon, creating a fragmented experience. Kubernetes hit a sweet spot: it provided a comprehensive API for managing applications (Deployments, Services, ConfigMaps, Secrets) while remaining extensible for new use cases (CRDs, Operators).

2. The declarative model. Kubernetes’ commitment to declarative, reconciliation-based state management was more robust than Swarm’s imperative commands or Mesos’ framework-specific APIs. The declarative model enabled GitOps, automated testing of infrastructure changes, and reliable self-healing.

3. The extensibility model. CRDs and custom controllers allowed the community to extend Kubernetes without forking it. This created a virtuous cycle that Docker Swarm and Mesos, lacking this extensibility, could not match.

4. Vendor neutrality. By donating Kubernetes to the CNCF and designing it to run on any infrastructure, Google ensured that no single vendor controlled the project. This convinced AWS, Azure, and every other cloud provider to offer managed Kubernetes services, creating a universal standard. Docker Swarm was controlled by Docker, Inc., and Mesos was associated with Mesosphere (later D2iQ).

5. Google’s credibility. Kubernetes was backed by Google’s decade of experience running Borg at unprecedented scale. This gave the project instant credibility in a way that a startup’s orchestrator could not match.

6. Community and ecosystem. Kubernetes built one of the largest open-source communities in history (by contributor count). The CNCF ecosystem of complementary projects (Prometheus, Envoy, Helm, ArgoCD, Cilium, etc.) created a complete platform story that no competitor could match.

The Deeper Lesson

But the deeper reason Kubernetes won is architectural. Its design — declarative state, reconciliation loops, extensible API, composable controllers — is not just a set of implementation choices. It is a theory of how to manage distributed systems.

The theory says: define the desired state of the world as data. Build independent controllers that each reconcile one aspect of the world toward the desired state. Communicate only through a shared, versioned state store. Make everything observable and extensible.

This theory is general enough to manage not just containers but anything: virtual machines, databases, DNS records, cloud resources, machine learning models. And that generality is what makes Kubernetes not just an orchestrator but a universal control plane — a platform for managing any infrastructure through declarative, reconciliation-based APIs.

Whether this generality justifies Kubernetes’ complexity is a fair debate. For simple applications, Kubernetes is overkill. But for organizations managing diverse, dynamic, distributed infrastructure at scale, Kubernetes’ architectural principles provide a coherent framework that no other system has matched.

Kubernetes’ ultimate contribution is not the code (which will be replaced someday) but the ideas: declarative state, reconciliation loops, level-triggered controllers, extensible APIs, operator patterns. These ideas will outlive Kubernetes itself and will influence the design of distributed systems for decades to come.

Complexity Is Not Free. Kubernetes’ generality comes at a cost. The system has hundreds of moving parts, a vast ecosystem of add-ons, and a steep learning curve. For many applications — a single service with modest scale, a batch processing pipeline, a static website — Kubernetes is dramatically overengineered. The right question is not “should I use Kubernetes?” but “do I have the problems that Kubernetes was designed to solve?” If you do not have bin-packing, service discovery, rolling deployment, or self-healing problems at meaningful scale, simpler alternatives (Docker Compose, a cloud provider’s native container service, even a well-managed VM fleet) may be more appropriate.

Key Contributors to Kubernetes’ Design

| Name | Role |
|---|---|
| Joe Beda | Co-founder of Kubernetes at Google. Led early architecture decisions. |
| Brendan Burns | Co-founder of Kubernetes. Author of key design patterns papers. Corporate VP at Microsoft Azure. |
| Craig McLuckie | Co-founder of Kubernetes. Founded Heptio (later acquired by VMware). Key advocate for CNCF donation. |
| Brian Grant | Principal engineer at Google. Led Kubernetes API design and declarative configuration model. |
| Tim Hockin | Principal engineer at Google. Key architect of Kubernetes networking and node components. |
| John Wilkes | Google Fellow. Architect of Borg and Omega. His research directly informed Kubernetes’ design. |
| Eric Tune | Google engineer. Co-author of the Borg paper and early Kubernetes contributor. |
| Clayton Coleman | Red Hat architect. Major contributor to Kubernetes API machinery, CRDs, and extensibility. |

Common Mistakes and Misconceptions

  • “Kubernetes won because it’s the simplest.” Kubernetes won despite its complexity, not because of simplicity. The decisive factors were API extensibility (CRDs and custom controllers), vendor-neutral governance through the CNCF, and the ecosystem flywheel these created. Simpler alternatives like Docker Swarm lost because they lacked these properties.

  • “Docker Swarm failed because Docker was bad.” Swarm’s user experience was widely praised as simpler and more intuitive than Kubernetes. Swarm lost on ecosystem breadth, not on technical quality or user experience.

  • “There are no alternatives to Kubernetes.” HashiCorp Nomad, AWS ECS, and various platform-as-a-service offerings (Cloud Run, Fly.io, Railway) are valid alternatives for many workloads. Kubernetes is the right choice for complex, multi-service, multi-team environments at scale, but not every application needs what Kubernetes provides.

Next: References and Further Reading

Chapter 9: References and Further Reading

Foundational Papers

Large-scale cluster management at Google with Borg (Verma et al., EuroSys 2015) — The landmark paper describing Google’s Borg system, which directly inspired Kubernetes. Covers the declarative job specification, bin packing scheduler, naming service, and lessons learned from a decade of production use.

Borg, Omega, and Kubernetes (Burns, Grant, Oppenheimer, Tune, Wilkes, ACM Queue 2016) — A retrospective by the architects of all three systems, explicitly discussing the lessons learned from Borg and Omega that were applied to Kubernetes.

Omega: flexible, scalable schedulers for large compute clusters (Schwarzkopf et al., EuroSys 2013) — Describes Google’s Omega scheduling system and its shared-state, optimistic-concurrency approach, which influenced Kubernetes’ multi-controller architecture.

Design Patterns for Container-Based Distributed Systems (Burns and Oppenheimer, USENIX HotCloud 2016) — By Brendan Burns, co-founder of Kubernetes. Identifies common patterns in containerized systems: sidecar, ambassador, adapter. These patterns became the foundation for service meshes and the Operator pattern.

Official Design Documents

Kubernetes Design Proposals Archive — The archive of Kubernetes Enhancement Proposals (KEPs) and design documents. Reading these documents reveals the reasoning behind specific design decisions.

Kubernetes Architecture Documentation — The official documentation of Kubernetes’ architecture, including descriptions of every control plane and node component.

Kubernetes API Concepts — Official documentation of the Kubernetes API model, versioning, and extension mechanisms.

Kubernetes Networking Model — Official documentation of the Kubernetes networking model and its requirements.

Key Talks

“Kubernetes: Changing the Way That We Think and Talk About Computing” (Brendan Burns, various conferences) — Burns’ talks consistently focus on the conceptual model rather than the mechanics, making them excellent introductions to the design philosophy.

“The History of Kubernetes and Where It’s Going” (Joe Beda, KubeCon keynotes) — Beda, co-founder of Kubernetes, discusses the project’s origins and design decisions.

“Borg, Omega, and Kubernetes: Lessons Learned” (John Wilkes, various) — Wilkes was a key architect of Borg and Omega, and his talks provide unparalleled insight into the design evolution.

Books

Kubernetes Up & Running (Burns, Beda, Hightower; O’Reilly) — Co-authored by two Kubernetes co-founders. Covers both how and why.

Kubernetes in Action (Luksa; Manning) — Deep technical coverage of Kubernetes internals, with excellent explanations of the control plane.

Programming Kubernetes (Hausenblas, Schimanski; O’Reilly) — Focuses on extending Kubernetes: writing controllers, operators, and CRDs.

Kubernetes Patterns (Ibryam, Huss; O’Reilly) — Catalogs recurring design patterns for Kubernetes applications.


This concludes Part 1: First Principles. You now have the conceptual foundation — the architecture, the API model, the networking model, and the design principles that explain why Kubernetes works the way it does. Part 2 shifts from “why was it designed this way?” to “how did the tooling around it evolve?” — starting with the container runtime wars that shaped the foundation Kubernetes runs on.

Next: The Container Runtime Wars

Chapter 10: The Container Runtime Wars

The Evolution of Container Runtimes in Kubernetes

Era 1 (2014-2016):   kubelet ---> Docker Engine (which used libcontainer/runc internally)
                     "The only option. Docker was a monolith."

Era 2 (2016-2020):   kubelet ---> dockershim ---> Docker Engine ---> containerd ---> runc
                     "CRI exists, but Docker doesn't speak it. Add another layer."

Era 3 (2018+):       kubelet ---> CRI ---> containerd ---> runc
                      kubelet ---> CRI ---> CRI-O ------> runc
                     "Direct communication. Docker removed from the chain."

Docker’s Original Role: From Monolith to Layers

To understand the container runtime wars, you must first understand how Docker evolved. In the early days (2014-2016), Docker Engine was a monolithic daemon. It used an internal library called libcontainer (later extracted and renamed to runc) to interact with the Linux kernel, setting up namespaces, cgroups, and filesystem mounts. There was no separate “containerd” layer yet — Docker Engine handled everything from the user-facing API down to container creation in a single process.

When Kubernetes launched in 2014-2015, it talked to Docker Engine directly. The kubelet called the Docker API, and Docker Engine internally used libcontainer/runc to create containers. Kubernetes was only using a fraction of what Docker Engine provided. It did not need Docker Compose, Docker Swarm, or Docker’s build system. It needed exactly one capability: run containers.

In December 2016, Docker began decomposing its monolith. It extracted the core container lifecycle management into a separate daemon called containerd, and the low-level container creation into runc (the graduated form of libcontainer). This produced the layered architecture that later versions used:

  • runc — a low-level tool that did exactly one thing: create and run a container according to the OCI (Open Container Initiative) runtime specification.
  • containerd — a daemon that managed the lifecycle of containers: image pulling, storage, container execution (by calling runc), and networking setup.
  • Docker Engine (dockerd) — the daemon that provided the Docker API, Docker CLI integration, Docker Compose support, Docker Swarm orchestration, build functionality, and all the user-facing features that made Docker popular. dockerd talked to containerd, which talked to runc.

With this decomposition, every container operation now went through three layers: kubelet called the Docker API, Docker Engine called containerd, containerd called runc. Each layer added latency, complexity, and potential failure modes.

This was like hiring a general contractor, a subcontractor, and a specialist every time you needed to hammer a single nail.

The CRI: Defining a Standard Interface

Before Kubernetes 1.5 (December 2016), the kubelet had direct knowledge of how to talk to Docker compiled into its source code. If you wanted to use a different container runtime — say, rkt from CoreOS — the code for that runtime also had to be compiled into the kubelet binary. This meant the kubelet was tightly coupled to every runtime it supported. Adding a new runtime required modifying kubelet source code, getting the changes reviewed and merged, and waiting for a Kubernetes release. This did not scale.

The Container Runtime Interface (CRI) was introduced in Kubernetes 1.5 to solve this problem. CRI defined a gRPC-based interface with two services:

  • RuntimeService: operations on containers and pods (create, start, stop, remove, list, status, exec, attach, port-forward)
  • ImageService: operations on container images (pull, list, remove, image status)

Any container runtime that implemented this gRPC interface could be plugged into the kubelet without modifying kubelet source code. The kubelet would communicate with the runtime over a Unix socket, and the runtime would handle everything from there.
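
Configuration-wise, the coupling reduces to a single endpoint. In recent Kubernetes versions (1.27 and later) the CRI socket is set in the kubelet’s config file; before that it was the --container-runtime-endpoint flag. A minimal sketch:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# point the kubelet at any CRI implementation listening on this socket
containerRuntimeEndpoint: unix:///run/containerd/containerd.sock

Swapping containerd for CRI-O means changing this one path (CRI-O conventionally listens on unix:///var/run/crio/crio.sock).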

This was a critical architectural decision — the same design philosophy that Kubernetes applied to networking (CNI), storage (CSI), and cloud providers. Define a clean interface, let implementations compete, and avoid coupling the core system to any particular vendor.

flowchart TD
    kubelet["kubelet"] -->|"gRPC over<br>Unix socket"| CRI["CRI gRPC Interface"]

    CRI --- RS["RuntimeService<br>RunPodSandbox, CreateContainer,<br>StartContainer, StopContainer,<br>ListContainers, ExecSync"]
    CRI --- IS["ImageService<br>PullImage, ListImages,<br>RemoveImage"]

    CRI -->|"Implemented by"| containerd["containerd<br>(via CRI plugin)"]
    CRI -->|"Implemented by"| CRIO["CRI-O"]
    CRI -->|"Formerly"| dockershim["dockershim<br>(removed in 1.24)"]

    style dockershim fill:#666,stroke:#999,color:#ccc

The Dockershim: A Bridge to Nowhere

There was a problem. Docker Engine predated CRI by several years and did not implement it. Docker had its own API, its own assumptions, its own way of doing things. But Docker was the dominant runtime — virtually every Kubernetes cluster in production used Docker. Kubernetes could not simply drop Docker support overnight.

The solution was the dockershim — a CRI-compatible shim layer that translated CRI calls into Docker Engine API calls. The kubelet would speak CRI to the dockershim, and the dockershim would translate those calls into the Docker API. The dockershim was maintained inside the kubelet codebase itself, making the kubelet responsible for keeping up with every Docker API change.

The call chain became even longer:

kubelet ---> dockershim ---> Docker Engine (dockerd) ---> containerd ---> runc

Four layers of indirection to start a container. And the dockershim added a particular kind of burden: because it lived in the kubelet codebase, every Kubernetes release had to ensure compatibility with Docker. Docker bugs became kubelet bugs. Docker’s release cycle constrained Kubernetes’ release cycle. The dockershim was approximately 2,000 lines of complex translation code that had to be maintained by the Kubernetes community despite being, conceptually, Docker’s problem.

This situation was unsustainable. The Kubernetes community was maintaining a compatibility shim for one specific vendor’s product inside its core codebase.

containerd Goes Standalone

Docker itself recognized the architectural problem. In 2017, Docker donated containerd to the Cloud Native Computing Foundation (CNCF) as an independent project. This was a significant move: containerd was no longer “Docker’s internal component” but an independent, community-governed container runtime.

containerd 1.1, released in 2018, added native CRI support through a built-in CRI plugin. This meant containerd could speak CRI directly — the kubelet could talk to containerd without Docker Engine in the middle. The call chain collapsed:

Before:   kubelet ---> dockershim ---> dockerd ---> containerd ---> runc
After:    kubelet ---> CRI ---> containerd ---> runc

Two layers were eliminated. The result was faster container operations, fewer potential failure points, and lower resource overhead (no Docker daemon consuming memory and CPU for features Kubernetes did not use).

CRI-O: The Minimal Alternative

Red Hat took a different approach. Rather than adapting an existing runtime, they built CRI-O from scratch as a minimal CRI implementation — just enough runtime to support Kubernetes, nothing more. CRI-O’s motto was effectively “Kubernetes’ container runtime.”

CRI-O did not support Docker Compose. It did not support Docker Swarm. It did not have a CLI for building images. It implemented the CRI gRPC interface and managed containers using runc (or any OCI-compliant low-level runtime). It was purpose-built for Kubernetes and nothing else.

This minimalism had real advantages:

  • Smaller attack surface: less code means fewer vulnerabilities
  • Version alignment: CRI-O versions are aligned 1:1 with Kubernetes versions (CRI-O 1.24 works with Kubernetes 1.24)
  • Predictable behavior: no features outside the CRI specification that could cause unexpected interactions
  • Lighter weight: lower memory and CPU overhead than Docker Engine or even containerd (which supports non-CRI use cases)

Red Hat adopted CRI-O as the default runtime for OpenShift, their enterprise Kubernetes distribution. This gave CRI-O a significant production footprint and a well-funded development team.

The Deprecation That Shook the Community

In December 2020, Kubernetes 1.20 included a deprecation warning: dockershim would be removed in a future release. In May 2022, Kubernetes 1.24 completed the removal.

The announcement caused widespread panic. Blog posts declared “Kubernetes is dropping Docker support!” Twitter erupted with confusion. Many users believed their Docker images would stop working, that their Dockerfiles were obsolete, that they needed to rebuild everything.

None of this was true. The confusion stemmed from conflating “Docker” the image format with “Docker” the runtime engine.

Here is what actually happened:

  • Docker images are OCI images. The Open Container Initiative defined a standard image format, and Docker images conform to it. containerd, CRI-O, and every other OCI-compliant runtime can pull and run Docker images. Nothing changed about images.
  • Dockerfiles continued to work. They produce OCI-compliant images. It does not matter what builds the image; what matters is that the output conforms to the OCI specification.
  • What was removed was the dockershim — the translation layer inside the kubelet that allowed Kubernetes to talk to Docker Engine. If you were running Docker Engine as your container runtime, you needed to switch to containerd or CRI-O. Since Docker Engine itself used containerd internally, this switch was straightforward: containerd was already on the machine; it just needed to be configured as the CRI endpoint.

The deprecation was, in a sense, a non-event for users. Their workflows did not change. Their images did not change. Their Dockerfiles did not change. What changed was which daemon the kubelet talked to on each node — an infrastructure detail that most application developers never interacted with directly.

But the deprecation was a significant event for the Kubernetes project. It removed approximately 2,000 lines of shim code from the kubelet, eliminated an entire class of compatibility bugs, and completed the transition to a clean, pluggable runtime architecture that CRI had promised since 2016.

The Current Landscape

Today, the container runtime landscape has settled into a clear pattern:

containerd is the dominant runtime. Amazon EKS, Google GKE, Microsoft AKS, and most managed Kubernetes services use containerd as their default runtime. containerd is mature, well-tested, and supports a broad range of use cases beyond just Kubernetes (it is also used by Docker Desktop, nerdctl, and other tools).

CRI-O is the standard runtime for Red Hat OpenShift and is used in environments where minimalism and strict Kubernetes alignment are priorities.

Docker Desktop remains the most popular tool for local container development. Developers build images with Docker, push them to registries, and those images run on containerd or CRI-O in production. Docker’s role shifted from “the runtime” to “the developer tool.”

Specialized runtimes exist for specific use cases: gVisor (Google) provides kernel-level sandboxing for stronger isolation, Kata Containers runs each container in a lightweight VM for hardware-level isolation, and Firecracker (AWS) powers Lambda and Fargate with microVMs. All of these can be plugged into Kubernetes through CRI, demonstrating the power of the pluggable interface.

flowchart TD
    subgraph Dev["Developer Workstation"]
        Docker["Docker Desktop<br>(build images)"] --> OCI["OCI Image"]
    end

    OCI -->|"push to registry"| kubelet

    subgraph Prod["Production Cluster"]
        kubelet["kubelet"]
        kubelet --> containerd
        kubelet --> CRIO["CRI-O"]
        containerd --> runc["runc<br>(or gVisor / Kata)"]
        CRIO --> runc
    end

The lesson of the container runtime wars is a design lesson: define clean interfaces early, and the ecosystem will sort itself out. CRI turned the container runtime from a hardwired dependency into a pluggable component, and the result was a healthier ecosystem where runtimes could compete on merit without requiring changes to Kubernetes core. The removal of dockershim, despite the community anxiety it caused, was the natural conclusion of a process that began six years earlier with the introduction of CRI.

Common Mistakes and Misconceptions

  • “I need Docker installed to run containers on Kubernetes.” Since K8s 1.24, dockershim was removed. Kubernetes uses containerd or CRI-O directly. Docker is a development tool, not a K8s runtime dependency.
  • “containerd and Docker are completely different.” containerd was extracted from Docker. Docker uses containerd internally. The difference is that K8s talks to containerd directly via CRI, skipping the Docker daemon.
  • “OCI images built with Docker won’t work with containerd.” OCI images are runtime-agnostic. An image built with docker build runs identically on containerd, CRI-O, or any OCI-compliant runtime.

For a visual timeline of how container runtimes evolved alongside the broader ecosystem, see Appendix E: Architecture Evolution Timeline.

Further Reading

  • Container Runtime Interface (CRI) specification – The formal API definition that decoupled Kubernetes from any single container runtime, enabling the pluggable ecosystem described in this chapter.
  • containerd documentation – Official docs for the dominant container runtime, covering architecture, configuration, and the plugin system that makes containerd extensible beyond Kubernetes.
  • CRI-O documentation – The lightweight, Kubernetes-dedicated runtime used by OpenShift. Useful for understanding the “do one thing well” design philosophy contrasted with containerd’s broader scope.
  • KEP-2221: Dockershim Removal – The Kubernetes Enhancement Proposal that formalized the dockershim removal, including the rationale, migration plan, and community discussion.
  • “Don’t Panic: Kubernetes and Docker” (Kubernetes blog) – The official blog post that clarified the Docker deprecation, explaining why Docker images still work and what actually changed. Essential reading for understanding the community communication around this transition.
  • OCI Runtime Specification – The standard that defines how a container runtime starts and manages containers at the lowest level. Understanding this spec clarifies the relationship between high-level runtimes (containerd, CRI-O) and low-level runtimes (runc).
  • runc GitHub repository – The reference implementation of the OCI runtime spec and the low-level runtime that actually creates containers for both containerd and CRI-O. Reading the README provides a clear picture of what happens at the bottom of the runtime stack.

Next: Bootstrapping a Cluster

Chapter 11: Bootstrapping a Cluster — From kube-up.sh to kubeadm

gantt
    title Cluster Bootstrap Tools — Increasing Abstraction
    dateFormat YYYY
    axisFormat %Y
    section Provision Everything
        kube-up.sh (GCE only)       :done, 2014, 2016
    section Bootstrap on Machines
        kubeadm (alpha → GA 2018)   :done, 2016, 2024
        kops (alpha → GA 2018)      :done, 2016, 2024
        minikube                    :done, 2016, 2024
        kubespray                   :done, 2018, 2024
    section Single Binary
        k3s                         :done, 2018, 2024
        kind                        :done, 2018, 2024
        k0s                         :done, 2020, 2024
    section Managed K8s Dominates
        GKE / EKS / AKS            :active, 2018, 2026

The Problem: What Does It Actually Take to Run Kubernetes?

Before examining the tools, consider what bootstrapping a Kubernetes cluster requires. This is not a trivial task. At minimum, you must:

  1. Generate a Public Key Infrastructure (PKI). Kubernetes components communicate over mutually authenticated TLS. The API server needs a certificate. Each kubelet needs a certificate. etcd needs certificates. The front proxy needs certificates. A typical cluster has 10+ certificates, each with specific Subject Alternative Names, key usages, and expiration policies. Getting any of these wrong results in cryptic TLS handshake failures.

  2. Deploy and configure etcd. etcd must form a quorum, which means each member must know about the others. In a multi-node etcd cluster, the initial bootstrap is a chicken-and-egg problem: members need to discover each other before they can form a cluster.

  3. Configure the API server. The API server needs to know where etcd is, which certificates to use, which admission controllers to enable, which authentication methods to support, and how to reach the kubelet on each node.

  4. Configure the controller manager and scheduler. Both need kubeconfig files with credentials to authenticate to the API server.

  5. Join worker nodes. Each worker node needs a kubelet configured with the correct API server address and authentication credentials. The node needs a container runtime installed and configured. It needs kube-proxy or a replacement for service networking.

  6. Install cluster networking. Kubernetes mandates that every pod can communicate with every other pod without NAT. This requires a CNI (Container Network Interface) plugin, which must be installed after the control plane is running but before workloads can function.

  7. Install DNS. Kubernetes assumes that a cluster DNS service (CoreDNS) is available. Services are discovered by DNS name, and without DNS, almost nothing works correctly.

Each of these steps has dependencies on the others. Certificates must be generated before any component can start. etcd must be running before the API server can start. The API server must be running before the controller manager, scheduler, or any kubelets can connect. Networking must be installed before pods can communicate. The ordering is strict, and mistakes are difficult to diagnose.

The Early Days: kube-up.sh

In 2014-2015, the primary way to create a Kubernetes cluster was a shell script called kube-up.sh. This script attempted to do everything: provision cloud resources (VMs, networks, firewalls, load balancers), install Kubernetes binaries, generate certificates, configure all components, and join nodes into a cluster.

The script was massive — thousands of lines of bash — and was built primarily for Google Compute Engine (GCE). It had branches for AWS and other providers, but these were maintained with varying degrees of quality. The fundamental problem was that kube-up.sh conflated two very different concerns:

  • Infrastructure provisioning: creating VMs, networks, and storage. This is cloud-provider-specific and depends on each provider’s API, authentication model, and resource semantics.
  • Cluster bootstrapping: installing and configuring Kubernetes on machines that already exist. This is (or should be) cloud-agnostic.

By combining both concerns in a single script, kube-up.sh was fragile, difficult to debug, and nearly impossible to extend. If you wanted to customize the VM size, the network topology, or the operating system, you had to modify the script. If the script failed halfway through, there was no reliable way to resume. If you wanted to bootstrap Kubernetes on bare metal or on a cloud provider that kube-up.sh did not support, you were on your own.

The script was also undocumented in any meaningful way. Understanding what it did required reading thousands of lines of bash, following variable expansions across multiple files, and understanding the implicit assumptions about the environment. This was the era when “setting up a Kubernetes cluster” was a multi-day project that required deep expertise.

kubeadm: Separating Bootstrap from Provisioning

The Kubernetes community recognized that the solution was to separate concerns. Infrastructure provisioning should be handled by tools designed for that purpose — Terraform, CloudFormation, Ansible, or cloud-provider CLIs. Cluster bootstrapping should be handled by a dedicated tool that assumed machines already existed and focused exclusively on turning those machines into a Kubernetes cluster.

kubeadm emerged from SIG Cluster Lifecycle in 2016, reached beta in Kubernetes 1.11, and became GA in Kubernetes 1.13 (December 2018). Its design principles were explicit:

  • Scope limitation: kubeadm bootstraps a cluster on existing machines. It does not provision infrastructure.
  • Composability: kubeadm is designed to be a building block for higher-level tools. kops, kubespray, and managed Kubernetes services all use kubeadm internally.
  • Phases: the bootstrap process is broken into discrete, independently executable phases. If something fails, you can re-run a specific phase without starting over.
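
kubeadm’s configuration surface reflects this limited scope: you describe the cluster, not the machines. A minimal sketch of a config passed to kubeadm init --config, with the version number and subnets as illustrative values:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0
networking:
  podSubnet: 10.244.0.0/16     # must match what your CNI plugin expects
  serviceSubnet: 10.96.0.0/12  # the default Service ClusterIP range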

What kubeadm Actually Does

When you run kubeadm init on a machine destined to be a control plane node, it executes the following phases:

Preflight checks. kubeadm verifies that the machine meets requirements: the container runtime is installed and running, required kernel modules are loaded, required ports are available, swap is disabled (Kubernetes historically required this because the scheduler’s resource accounting assumed no swap), and the machine has sufficient resources.

PKI generation. kubeadm generates the entire certificate authority hierarchy: a root CA, an API server certificate, kubelet client certificates, front proxy certificates, etcd CA and certificates, and service account signing keys. Each certificate has appropriate SANs and key usages. This single phase eliminates what was previously one of the most error-prone manual steps.

Static pod manifests. Rather than running control plane components as system services, kubeadm writes static pod manifests to /etc/kubernetes/manifests/. The kubelet watches this directory and automatically creates pods for any manifests it finds. This means the API server, controller manager, scheduler, and etcd all run as pods on the control plane node — Kubernetes managing itself. This approach is elegant: it means the same mechanisms that manage user workloads also manage the control plane.
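
A heavily abridged sketch of what kubeadm writes for the API server; real manifests carry dozens of flags, and the version tag here is illustrative:

# /etc/kubernetes/manifests/kube-apiserver.yaml (abridged)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
    - name: kube-apiserver
      image: registry.k8s.io/kube-apiserver:v1.29.0
      command:
        - kube-apiserver
        - --etcd-servers=https://127.0.0.1:2379
        - --client-ca-file=/etc/kubernetes/pki/ca.crt
      volumeMounts:
        - name: k8s-certs
          mountPath: /etc/kubernetes/pki
          readOnly: true
  volumes:
    - name: k8s-certs
      hostPath:
        path: /etc/kubernetes/pki

Because the kubelet watches /etc/kubernetes/manifests/, replacing a file in that directory is also how kubeadm upgrades the control plane in place.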

flowchart TD
    Start["Machine with kubelet +<br>container runtime installed"]

    Start --> P1
    P1["Phase 1: Preflight checks<br>Verify container runtime, ports,<br>kernel modules, resources"]

    P1 --> P2
    P2["Phase 2: Generate PKI<br>CA, API server cert, kubelet certs,<br>etcd certs, SA keys<br>Writes to /etc/kubernetes/pki/"]

    P2 --> P3
    P3["Phase 3: Generate kubeconfig files<br>admin.conf, kubelet.conf,<br>controller-manager.conf, scheduler.conf"]

    P3 --> P4
    P4["Phase 4: Write static pod manifests<br>kube-apiserver.yaml<br>kube-controller-manager.yaml<br>kube-scheduler.yaml<br>etcd.yaml"]

    P4 --> P5
    P5["Phase 5: Wait for control plane<br>kubelet reads manifests, starts pods,<br>API server becomes healthy"]

    P5 --> P6
    P6["Phase 6: Upload configuration<br>Store cluster config in ConfigMap<br>for future joins"]

    P6 --> P7
    P7["Phase 7: Generate bootstrap token<br>Short-lived token for<br>worker nodes to join"]

    P7 --> P8
    P8["Phase 8: Install addons<br>CoreDNS (cluster DNS)<br>kube-proxy (service networking)"]

Bootstrap tokens. kubeadm generates a short-lived token that worker nodes use to authenticate with the API server during the join process. This solves the chicken-and-egg problem of node authentication: the node needs credentials to talk to the API server, but the API server needs to verify the node’s identity. The bootstrap token provides initial trust, and the node uses it to request a proper kubelet certificate through the TLS bootstrap protocol.

Addon installation. kubeadm installs CoreDNS (for cluster DNS) and kube-proxy (for service networking) as cluster addons. These are deployed as regular Kubernetes workloads, managed by the same control plane they support.

The Alternatives: Different Problems, Different Tools

kubeadm solved the bootstrap problem but deliberately left the provisioning problem to others. This created space for tools that combined provisioning and bootstrapping, each optimized for different use cases. See Appendix C: Decision Trees for a flowchart to help choose the right bootstrap tool.

kops (Kubernetes Operations)

kops took the opposite approach from kubeadm: it handled both provisioning and bootstrapping. Originally built for AWS, kops could create VPCs, subnets, auto-scaling groups, security groups, IAM roles, Route53 DNS entries, and S3 state storage, then install and configure Kubernetes across the provisioned infrastructure.

kops was opinionated and comprehensive. It stored cluster state in a cloud storage bucket (S3 on AWS) and could perform rolling updates, upgrade Kubernetes versions, and resize clusters. For AWS users who wanted a production-grade, self-managed Kubernetes cluster without a managed service, kops was often the best choice.

The tradeoff: kops’ breadth makes it powerful on AWS but tightly coupled to cloud-provider APIs and harder to debug than kubeadm.
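A representative kops session on AWS looks like this (bucket, domain, and zone names are placeholders):

  # Cluster state lives in an S3 bucket, not on your machine
  export KOPS_STATE_STORE=s3://example-kops-state

  kops create cluster --name=k8s.example.com \
      --zones=us-east-1a,us-east-1b --node-count=3
  kops update cluster k8s.example.com --yes          # provision the AWS resources
  kops rolling-update cluster k8s.example.com --yes  # roll changes out node by node

Every one of those commands creates or mutates cloud infrastructure, which is exactly the breadth, and the coupling, described above.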

kubespray

kubespray used Ansible playbooks to install Kubernetes on existing machines. It supported a wide range of operating systems, container runtimes, and network plugins. kubespray was the tool of choice for organizations that already used Ansible for configuration management, had bare-metal infrastructure, or needed to customize every aspect of the installation.

kubespray occupied the middle ground between kubeadm’s minimalism and kops’ full-stack approach. It assumed you had provisioned machines (like kubeadm) but handled more of the prerequisite setup than kubeadm did (installing container runtimes, configuring kernel parameters, setting up load balancers for HA control planes).

k3s

k3s, created by Rancher Labs (later acquired by SUSE), took a radically different approach. Instead of a collection of separate binaries with complex interdependencies, k3s packaged the entire Kubernetes distribution into a single binary under 100 MB.

k3s achieved this by making several substitutions:

  • SQLite instead of etcd for the default datastore (etcd and other datastores available as options)
  • Flannel built-in for networking
  • Traefik built-in as the ingress controller
  • Local storage provider built-in
  • Removed legacy and alpha features, cloud provider integrations, and storage drivers that were not needed in edge/IoT scenarios

The result was a Kubernetes distribution that could run on a Raspberry Pi, start in 30 seconds, and be installed with a single curl command. k3s was certified conformant — it passed the CNCF conformance tests — meaning it was “real Kubernetes,” just packaged differently.
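That single command is the documented quick-start; it installs k3s as a systemd service and brings up a working single-node cluster:

  curl -sfL https://get.k3s.io | sh -

  # kubectl is bundled into the same binary
  sudo k3s kubectl get nodes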

k3s demonstrated that the complexity of Kubernetes installation was largely accidental, not essential. The core of Kubernetes is not that large; it was the matrix of configuration options, pluggable interfaces, and backward compatibility that made installation complex.

kind (Kubernetes IN Docker)

kind solved a different problem entirely: running Kubernetes in CI/CD pipelines and for local testing. kind created a multi-node Kubernetes cluster by running each “node” as a Docker container. Inside each container, it ran the kubelet and a container runtime (containerd), creating a nested container architecture.

kind was fast (cluster creation in under a minute), lightweight (no VMs required), and disposable (clusters could be created and destroyed as part of a test pipeline). It became the standard tool for testing Kubernetes itself — the Kubernetes CI infrastructure uses kind to run conformance tests.
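A multi-node test cluster is one short config file away; a sketch with illustrative names:

  # kind-config.yaml: one control plane node, two workers
  kind: Cluster
  apiVersion: kind.x-k8s.io/v1alpha4
  nodes:
  - role: control-plane
  - role: worker
  - role: worker

  # Create it, run your tests, throw it away
  kind create cluster --name ci-test --config kind-config.yaml
  kind delete cluster --name ci-test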

minikube

minikube was the original local Kubernetes development tool, created alongside kubeadm in 2016. It ran a single-node Kubernetes cluster inside a VM (or later, a container). minikube was the tool most developers encountered first when learning Kubernetes. It prioritized ease of use and supported add-ons for common development needs: dashboards, metrics, registries, and ingress controllers.

k0s

k0s (zero friction Kubernetes) followed k3s’ single-binary approach but aimed to be closer to upstream Kubernetes with fewer opinionated substitutions. k0s packaged all control plane components into a single binary and supported running the control plane and worker components separately, making it suitable for both single-node and multi-node deployments.

The Managed Service Explosion

The most significant development in cluster bootstrapping was the emergence of managed Kubernetes services that made bootstrapping irrelevant for a large portion of users.

Google Kubernetes Engine (GKE), launched in 2015, was the first. Google managed the control plane as a service. Users only managed worker nodes (and later, with Autopilot mode, not even that). GKE’s early availability gave it a lasting advantage: it had years of operational experience that competitors could not quickly replicate.

Azure Kubernetes Service (AKS) entered preview in 2017 (GA June 2018), and Amazon Elastic Kubernetes Service (EKS) launched in 2018. AWS was notably late to the Kubernetes party, having bet heavily on its own orchestration system (ECS) before market demand forced its hand. EKS’s eventual success validated Kubernetes as the industry standard: when the largest cloud provider builds a managed service for your project, you have won.

By the mid-2020s, managed Kubernetes services account for the majority of production Kubernetes usage. The bootstrapping tools — kubeadm, kops, kubespray — remain essential for on-premises deployments, specialized environments, and educational purposes, but the center of gravity has shifted decisively toward managed services.

Who Uses What (2024+)

  Use Case                          Tool
  ─────────────────────────────     ──────────────────────
  Production (cloud)                Managed: GKE, EKS, AKS
  Production (on-premises)          kubeadm + kubespray, or k0s/k3s
  Production (AWS, self-managed)    kops
  Edge / IoT / Raspberry Pi         k3s
  CI/CD testing                     kind
  Local development                 minikube, kind, Docker Desktop
  Learning                          minikube, kind, k3s

The evolution of bootstrapping tools mirrors a broader pattern in infrastructure software: complexity moves from the user to the platform. In 2014, bootstrapping a cluster required deep expertise in Linux administration, PKI, and distributed systems. By 2024, it requires a credit card and a cloud provider account. The knowledge is still valuable — someone has to build and operate those managed services — but the barrier to entry for Kubernetes users has dropped by orders of magnitude.

Common Mistakes and Misconceptions

  • “kubeadm is only for learning.” kubeadm is used in production by many organizations. It handles TLS bootstrapping, certificate rotation, and upgrade orchestration. Managed services are easier, but kubeadm is production-grade.
  • “k3s is not real Kubernetes.” k3s is a certified, conformant Kubernetes distribution. It passes the same conformance tests as full K8s. It just has a smaller binary and uses SQLite instead of etcd by default.
  • “I should use minikube/kind for production.” These tools are for local development and CI. They run single-node clusters without HA, proper networking, or persistent storage guarantees.

Further Reading

  • kubeadm documentation – Official reference for the standard cluster bootstrapping tool. Covers kubeadm init, kubeadm join, certificate management, and upgrade procedures.
  • kops GitHub repository – The Kubernetes Operations project for deploying production clusters on AWS, GCE, and other clouds. The docs include architecture decisions and comparison with other tools.
  • kubespray documentation – Ansible-based cluster provisioning that supports bare metal, AWS, GCE, Azure, and more. Useful for understanding the infrastructure-as-code approach to cluster bootstrapping.
  • k3s documentation – Rancher’s lightweight Kubernetes distribution designed for edge, IoT, and resource-constrained environments. Explains the trade-offs made to shrink Kubernetes into a single binary.
  • Rancher documentation – Multi-cluster management platform that abstracts over different bootstrap methods. Covers fleet management, RBAC, and the operational layer above individual clusters.
  • kind (Kubernetes in Docker) – A tool for running local Kubernetes clusters using Docker containers as nodes. Designed for testing Kubernetes itself, and widely used in CI/CD pipelines.
  • minikube documentation – The original local Kubernetes tool, supporting multiple drivers (Docker, VirtualBox, HyperKit, etc.). Remains the most approachable path for developers learning Kubernetes.

Chapter 12: Package Management and GitOps

The YAML Explosion

Every Kubernetes resource is defined by a YAML manifest. A simple web application requires, at minimum: a Deployment (to run the pods), a Service (to expose them), a ConfigMap (for configuration), a Secret (for credentials), an Ingress (for external access), a ServiceAccount (for identity), and resource quotas. That is seven YAML files for a single application. A real production application typically requires 15-30 manifests when you include HorizontalPodAutoscalers, PodDisruptionBudgets, NetworkPolicies, PersistentVolumeClaims, and RBAC rules.

The YAML Explosion: One Application's Manifests

  Minimal App (7 files)              Production App (15-30 files)
  ┌─────────────────────┐            ┌─────────────────────────────────┐
  │ deployment.yaml     │ Pods       │ deployment.yaml                 │
  │ service.yaml        │ Network    │ service.yaml                    │
  │ configmap.yaml      │ Config     │ configmap.yaml                  │
  │ secret.yaml         │ Creds      │ secret.yaml                     │
  │ ingress.yaml        │ External   │ ingress.yaml                    │
  │ serviceaccount.yaml │ Identity   │ serviceaccount.yaml             │
  │ resourcequota.yaml  │ Limits     │ resourcequota.yaml              │
  └─────────────────────┘            │─────────────────────────────────│
          7 files                    │ hpa.yaml              Scaling   │
             │                       │ pdb.yaml              Uptime    │
             │  "Just add            │ networkpolicy.yaml    Security  │
             │   production          │ pvc.yaml              Storage   │
             │   concerns..."        │ role.yaml             RBAC      │
             │                       │ rolebinding.yaml      RBAC      │
             ▼                       │ limitrange.yaml       Limits    │
      ┌───────────────┐               │ servicemonitor.yaml   Observe   │
     │ × 3 envs      │               │ prometheus-rules.yaml  Observe  │
     │ (dev/stg/prd) │               │ grafana-dashboard.json Observe  │
     └───────────────┘               └─────────────────────────────────┘
              │                                   15-30 files
             ▼                                       │
     ┌───────────────┐                               ▼
     │ 7 × 3 = 21    │                       ┌───────────────┐
     │ files minimum │                       │ × 3 envs      │
     └───────────────┘                       │ (dev/stg/prd) │
             │                               └───────────────┘
             │  "But each env differs:                │
             │   replicas, limits,                    ▼
             │   image tags, configs..."     ┌───────────────────┐
             ▼                               │ 20 × 3 = 60       │
     ┌────────────────────┐                  │ files to maintain │
     │  21-90 YAML files  │                  └───────────────────┘
     │  for ONE service   │
     └────────────────────┘

Now multiply by environments. Most organizations maintain at least three — development, staging, and production — with small differences between them: different replica counts, different resource limits, different image tags, different configuration values. If you manage this with raw YAML files, you either maintain three copies of every manifest (tripling the maintenance burden and guaranteeing drift) or you build some ad-hoc templating system with sed and environment variables (fragile and error-prone).

This is the YAML explosion problem, and it is the root cause behind every tool discussed in this chapter.

Helm v2: The Package Manager with a Fatal Flaw

Helm was introduced in 2016 as “the package manager for Kubernetes,” explicitly modeled on apt, yum, and Homebrew. The core abstraction was the Chart — a collection of templated YAML files, a values.yaml file containing default parameters, and metadata describing the package.

Helm Charts solved two problems simultaneously:

Distribution. A complex application like Prometheus (which requires 10+ Kubernetes resources) could be packaged as a single Chart and installed with one command. Charts could be stored in repositories and versioned. The ecosystem effect was powerful: instead of every user figuring out how to deploy Prometheus on Kubernetes, one person wrote a Chart and everyone benefited.

Parameterization. Charts used Go templates to inject values into YAML manifests. A Deployment’s replica count might be templated as {{ .Values.replicaCount }}, allowing users to override it without modifying the Chart. This addressed the multi-environment problem: you could install the same Chart with different values files for dev, staging, and production.
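A sketch of the mechanics, with illustrative chart and value names:

  # templates/deployment.yaml (a chart template fragment)
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: {{ .Release.Name }}-web
  spec:
    replicas: {{ .Values.replicaCount }}

  # values.yaml (the chart's defaults)
  replicaCount: 2

  # Install with environment-specific overrides
  helm install web ./mychart -f values-prod.yaml

The same chart, rendered with different values files, produces the dev, staging, and production variants.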

But Helm v2 had a critical architectural flaw: Tiller.

Tiller: The Security Nightmare

Tiller was a server-side component that ran inside the Kubernetes cluster. When you ran helm install, your local Helm client sent the rendered manifests to Tiller, which then applied them to the cluster. Tiller stored release state (which Charts were installed, at which versions, with which values) as ConfigMaps in the cluster.

flowchart TD
    helm["helm CLI<br>Chart + values.yaml"] -->|"gRPC"| Tiller

    subgraph Cluster["Kubernetes Cluster"]
        Tiller["Tiller (Deployment)<br>cluster-admin access"]
        Tiller --> API["API Server"]
        Tiller --> CM["ConfigMaps<br>(release state)"]
    end

    Warning["Problem: Tiller has<br>GOD MODE access to<br>the entire cluster"]

    style Tiller fill:#f44,color:#fff,stroke:#d00
    style Warning fill:#fee,stroke:#f44,color:#d00

The problem was that Tiller required cluster-admin privileges by default. It needed broad access because it had to create any type of resource in any namespace on behalf of any user. This meant:

  • Any user who could talk to Tiller had effective cluster-admin access. Tiller did not enforce per-user RBAC. If developer A had permission to deploy only to namespace “team-a,” they could use Tiller to deploy anything anywhere, because Tiller itself had cluster-admin access.
  • Tiller was a single point of attack. Compromise Tiller, and you had full control of the cluster. Tiller’s gRPC port was often accessible from any pod in the cluster without authentication.
  • Multi-tenant clusters were unsafe. Helm v2 was fundamentally incompatible with the principle of least privilege. You could not safely use Helm v2 in a cluster shared by multiple teams with different access levels.

The security community raised alarms repeatedly. Workarounds existed (running one Tiller per namespace, using TLS for the gRPC connection), but they were complex and undermined Helm’s ease-of-use promise.

Helm v3: The Tiller Excision

Helm v3, released in November 2019, removed Tiller entirely. The new architecture was client-only: the Helm CLI connected directly to the Kubernetes API server, using the user’s own kubeconfig credentials. The user’s RBAC permissions determined what Helm could do. If a user only had access to namespace “team-a,” Helm would only be able to deploy to namespace “team-a.”

Release state moved from ConfigMaps to Kubernetes Secrets, stored in the namespace of the release. This was both more secure (Secrets can be encrypted at rest) and more natural (the release metadata lived alongside the release resources).

Helm v3 also added:

  • JSON Schema validation: Chart authors could define schemas for their values.yaml, catching configuration errors before rendering (see the sketch after this list)
  • OCI registry support: Charts could be stored in container registries alongside images, unifying artifact management
  • Library charts: reusable chart fragments that could be imported by other charts, reducing duplication
  • Three-way merge for upgrades: comparing the old manifest, the live cluster state, and the new manifest, enabling safer upgrades when resources had been manually modified
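As an example of the first item, a minimal values.schema.json guarding the replicaCount value might look like this (a sketch, not taken from any particular chart):

  {
    "$schema": "https://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
      "replicaCount": { "type": "integer", "minimum": 1 }
    },
    "required": ["replicaCount"]
  }

With this file in the chart, helm install and helm upgrade fail fast when the supplied values violate the schema, instead of rendering broken manifests.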

The removal of Tiller was driven by a principle that applies broadly in systems design: do not bypass the access control layer. Tiller existed because it was architecturally convenient to have a server-side component that could apply resources. But convenience created a massive security hole. Helm v3 demonstrated that you could achieve the same functionality without a privileged intermediary, simply by having the client talk directly to the API server.

Kustomize: Template-Free Customization

Kustomize, developed by Google and released in 2018, took a fundamentally different approach to the YAML problem. Where Helm used Go templates to parameterize YAML, Kustomize used overlay-based patching. No templating language. No {{ }} syntax. No Tiller or server-side component. Just plain YAML, composed and patched using a declarative overlay system.

The core idea was simple. You start with a base — a set of plain, valid Kubernetes YAML files that represent your application. Then you create overlays — directories that contain patches describing how to modify the base for a specific environment. An overlay might change the replica count for production, add resource limits for staging, or change the image tag for development.

Kustomize Directory Structure

  base/
  ├── kustomization.yaml       # Lists resources
  ├── deployment.yaml           # Plain, valid Kubernetes YAML
  ├── service.yaml              # No templates, no {{ }}
  └── configmap.yaml

  overlays/
  ├── dev/
  │   ├── kustomization.yaml   # References base + patches
  │   └── replica-patch.yaml   # "Change replicas to 1"
  ├── staging/
  │   ├── kustomization.yaml
  │   └── resource-patch.yaml  # "Add resource limits"
  └── prod/
      ├── kustomization.yaml
      ├── replica-patch.yaml   # "Change replicas to 5"
      └── hpa.yaml             # "Add HorizontalPodAutoscaler"
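The kustomization.yaml files themselves are short. A sketch with illustrative names:

  # base/kustomization.yaml
  resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml

  # overlays/prod/kustomization.yaml
  resources:
  - ../../base
  patches:
  - path: replica-patch.yaml

  # overlays/prod/replica-patch.yaml: a strategic-merge patch;
  # only the fields listed here are changed in the base Deployment
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: web
  spec:
    replicas: 5

Running kubectl apply -k overlays/prod/ builds the base, applies the patch, and submits the result.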

Kustomize’s key advantage was diffability. Because the base files were plain YAML and the patches were plain YAML, you could diff any two environments and see exactly what was different. With Helm templates, understanding the difference between two rendered outputs required rendering both and diffing the result — a lossy process that made code review difficult.

Kustomize was integrated into kubectl itself (kubectl apply -k ./overlays/prod/), meaning it required no additional tooling. This made it attractive for organizations that wanted to minimize their dependency footprint.

Helm vs. Kustomize: Complementary, Not Competing

The community often framed Helm and Kustomize as competitors, but they solve different problems.

Helm excels at third-party package distribution. If you want to install Prometheus, PostgreSQL, or NGINX Ingress Controller on your cluster, Helm Charts are the standard distribution mechanism. The Chart author encapsulates the complexity of deploying the application, and you customize it through values. You would not want to maintain your own YAML files for every third-party application you use.

Kustomize excels at managing your own manifests across environments. If you are deploying your own application and need to manage small differences between dev, staging, and production, Kustomize’s overlay model is simpler and more transparent than Helm templates.

Many organizations use both: Helm for third-party software, Kustomize for their own applications. Some even use Kustomize to patch Helm-rendered output, combining both tools in a pipeline.

The GitOps Revolution

Helm and Kustomize solved the problem of parameterizing and organizing YAML. But they left a deeper problem unaddressed: how does the YAML get applied to the cluster?

The traditional workflow was: a developer modifies manifests, runs kubectl apply, and the cluster state changes. This approach has several serious deficiencies:

  • No audit trail. Who applied what, when? kubectl does not maintain a log. You can check the Kubernetes audit log if it is enabled, but correlating API server events to human actions is difficult.
  • No rollback mechanism. If a kubectl apply causes a problem, reverting requires knowing what the previous state was and manually applying it.
  • No access control beyond RBAC. Anyone with kubectl access and appropriate RBAC permissions can modify the cluster. There is no approval workflow, no review process, no gate.
  • Drift. If someone manually modifies a resource in the cluster (a “hot fix”), the cluster state diverges from the YAML files in the repository. Over time, the repository becomes a lie — it no longer represents what is actually running.

GitOps addresses all of these problems by applying a single principle: Git is the single source of truth for the desired state of the cluster.

flowchart TD
    Dev["Developer"] -->|"git push"| Git["Git Repo<br>(source of truth)"]
    Git -->|"watch"| Controller["GitOps Controller<br>(ArgoCD / Flux)<br><br>1. Watch Git for changes<br>2. Compare Git state to cluster state<br>3. Reconcile: apply diff to<br>make cluster match Git"]
    Controller -->|"apply"| K8s["Kubernetes Cluster"]
    K8s -->|"drift detection"| Controller

    Rollback["Rollback = git revert<br>Audit = git log<br>Review = pull request<br>Access = Git permissions"]

    style Rollback fill:#eff,stroke:#099,color:#066

The idea is that a controller running inside the cluster watches a Git repository. When the repository changes (new commit, merged pull request), the controller compares the desired state in Git to the actual state in the cluster and reconciles any differences. This is the Kubernetes reconciliation pattern applied to deployment itself — the same pattern that the Deployment controller uses to reconcile desired and actual pod counts, now applied at the level of the entire cluster configuration.

ArgoCD

ArgoCD, created by Intuit in 2018 and donated to the CNCF, is the most widely adopted GitOps tool. ArgoCD runs as a set of controllers in the cluster and provides a web UI, CLI, and API for managing applications. An ArgoCD “Application” resource defines the mapping: this Git repository, this path, this branch should be deployed to this cluster, this namespace.

ArgoCD supports Helm Charts, Kustomize overlays, plain YAML directories, and Jsonnet as input formats. It provides real-time sync status visualization, showing which resources are in sync with Git and which have drifted. It supports multi-cluster management, RBAC, SSO integration, and automated sync policies.
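A representative Application (repository URL and paths are placeholders):

  apiVersion: argoproj.io/v1alpha1
  kind: Application
  metadata:
    name: my-app
    namespace: argocd
  spec:
    project: default
    source:
      repoURL: https://github.com/example/deploy-config
      targetRevision: main
      path: overlays/prod       # a Kustomize overlay, Helm chart, or plain YAML
    destination:
      server: https://kubernetes.default.svc
      namespace: my-app
    syncPolicy:
      automated:
        prune: true     # delete resources that disappear from Git
        selfHeal: true  # revert manual changes made directly to the cluster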

Flux

Flux, created by Weaveworks in 2017 (and rebuilt as Flux v2 using the GitOps Toolkit), takes a more Kubernetes-native approach. Flux is a set of Custom Resource Definitions (CRDs) and controllers: a GitRepository resource tells Flux where to watch, a Kustomization resource tells Flux how to render and apply the manifests, and a HelmRelease resource tells Flux how to manage Helm releases.

Flux v2 was designed to be composable: each controller does one thing, and they communicate through Kubernetes resources. This makes Flux extensible (you can add image automation controllers, notification controllers, etc.) but also means there are more pieces to understand and configure.
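A minimal pairing of the two core resources, as a sketch (the URL and path are placeholders, and the API versions shown are those of recent Flux releases):

  apiVersion: source.toolkit.fluxcd.io/v1
  kind: GitRepository
  metadata:
    name: deploy-config
    namespace: flux-system
  spec:
    interval: 1m                 # how often to poll Git
    url: https://github.com/example/deploy-config
    ref:
      branch: main
  ---
  apiVersion: kustomize.toolkit.fluxcd.io/v1
  kind: Kustomization
  metadata:
    name: my-app
    namespace: flux-system
  spec:
    interval: 10m                # how often to re-reconcile against the cluster
    sourceRef:
      kind: GitRepository
      name: deploy-config
    path: ./overlays/prod
    prune: true                  # remove resources deleted from Git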

If you manage multiple clusters (dev, staging, production, or multiple production regions), GitOps ensures they are configured from the same source. Promoting a change from staging to production is a Git merge, not a series of manual kubectl commands against different clusters.

Common Mistakes and Misconceptions

  • “Helm charts are always safe to install.” Helm charts can contain arbitrary Kubernetes resources including ClusterRoles and webhooks. Always review chart templates before installing, especially from unknown sources.
  • “Kustomize replaces Helm.” They solve different problems. Helm templates generate YAML; Kustomize patches existing YAML. Many teams use both: Helm for third-party charts, Kustomize for environment overlays.
  • “Putting all configuration in values.yaml is good practice.” Over-parameterizing Helm charts makes them harder to maintain than raw YAML. Only expose values that actually change between environments.

Further Reading

  • Helm documentation – Official reference for Helm, covering chart structure, templating, release management, and the Helm SDK. Start with the “Chart Developer Guide” for understanding how charts are built.
  • Kustomize documentation – The template-free customization tool built into kubectl. The “Examples” section demonstrates the overlay pattern for managing environment-specific configurations.
  • Helm Chart Best Practices Guide – Official guidelines for writing production-quality Helm charts, covering values design, template conventions, labels, and dependency management.
  • Artifact Hub – The CNCF’s central repository for discovering Helm charts, OPA policies, and other Kubernetes packages. Browse to understand the breadth of the ecosystem and how charts are published and versioned.
  • “Helm vs Kustomize” (Harness) – A practical comparison of the two dominant approaches, covering strengths, weaknesses, when to use each, and how to combine them.
  • cdk8s documentation – AWS’s framework for defining Kubernetes manifests using general-purpose programming languages (TypeScript, Python, Go, Java). Represents the “code over YAML” approach to configuration.
  • “Stop Using Helm” and the counterarguments (archived) – A provocative critique of Helm’s templating approach, useful for understanding the trade-offs that led to alternatives like Kustomize and cdk8s.

Chapter 13: The Networking Stack Evolution

The Fundamental Requirement

Kubernetes imposes a single, non-negotiable networking requirement: every pod gets its own IP address, and every pod can communicate with every other pod without NAT. This is called the flat network model. A pod on Node A can reach a pod on Node B by sending a packet to that pod’s IP address directly. No port mapping. No address translation. No special routing configuration by the application developer.

This requirement is deceptively simple to state and remarkably difficult to implement. On a single machine, giving each container its own IP is straightforward using Linux network namespaces and virtual ethernet pairs. But across machines, you must somehow route packets from one node’s pod network to another node’s pod network, typically over a physical network that knows nothing about Kubernetes pods. This is the problem that CNI (Container Network Interface) plugins solve, and the evolution of these plugins reflects the broader maturation of Kubernetes networking from “good enough” to “production-grade at massive scale.”

Flannel: The First Answer (2014)

Flannel, created by CoreOS in 2014, was the first widely-adopted CNI plugin for Kubernetes. Flannel’s approach was simple: create a VXLAN overlay network. Each node was assigned a subnet (e.g., node 1 gets 10.244.1.0/24, node 2 gets 10.244.2.0/24), and VXLAN encapsulation handled cross-node communication. When a pod on node 1 sent a packet to a pod on node 2, Flannel encapsulated the pod-to-pod packet inside a UDP packet with the node-to-node addresses, sent it across the physical network, and de-encapsulated it on the other side.

flowchart LR
    subgraph Node1["Node 1 (10.0.0.1)"]
        PodA["Pod A<br>10.244.1.5"] --> flannel1["flannel.1<br>(VXLAN dev)"]
        flannel1 --> encap["Encapsulate:<br>src=10.0.0.1, dst=10.0.0.2"]
    end

    encap -->|"Physical network<br>(VXLAN encapsulated)"| decap

    subgraph Node2["Node 2 (10.0.0.2)"]
        decap["Decapsulate:<br>unwrap original packet"] --> flannel2["flannel.1<br>(VXLAN dev)"]
        flannel2 --> PodB["Pod B<br>10.244.2.8"]
    end

Flannel worked. It was simple to deploy, easy to understand, and provided basic pod-to-pod connectivity. But it had significant limitations:

  • No network policy support. Flannel could not restrict which pods could talk to which other pods. In a multi-tenant cluster, any pod could reach any other pod. This was a non-starter for security-conscious organizations.
  • VXLAN overhead. Encapsulation added 50 bytes of header overhead to every packet, reducing the effective MTU. It also added CPU overhead for encapsulation and decapsulation.
  • Limited performance. The overlay approach was inherently slower than native routing because of the encapsulation overhead and the need to traverse the kernel’s network stack twice (once for the inner packet, once for the outer packet).

Flannel was “good enough” for getting started, and it remains useful in simple environments and edge deployments (k3s includes Flannel by default). But production clusters needed more.

Calico: Production-Grade Networking (2016)

Calico, developed under Project Calico (later commercialized by Tigera), took a fundamentally different approach. Instead of overlay networking, Calico used BGP (Border Gateway Protocol) to distribute pod routes across the physical network. Each node announced its pod subnet to its neighbors using BGP, and the physical network infrastructure routed packets natively. No encapsulation. No overlay. Packets traveled from pod to pod using the same routing mechanisms that power the internet.

This approach had significant advantages:

  • Native performance. Without encapsulation overhead, Calico achieved near-line-rate performance. Packets were not wrapped and unwrapped; they were simply routed.
  • Rich network policies. Calico implemented Kubernetes NetworkPolicy and extended it with its own CRD-based policies that supported L3/L4 rules, namespace selectors, global policies, and CIDR-based rules for external traffic.
  • Visibility. Because packets were not encapsulated, standard network debugging tools (tcpdump, traceroute) worked as expected. With overlays, debugging required understanding the encapsulation layer.

The tradeoff was that BGP-based routing required cooperation from the physical network infrastructure. In cloud environments where you could not run BGP (because the cloud provider controlled the network), Calico could fall back to VXLAN or IP-in-IP encapsulation. This hybrid approach made Calico viable everywhere while providing optimal performance on bare metal.

Calico became the de facto standard for production Kubernetes networking. Its combination of performance, network policy support, and operational maturity made it the default choice for most serious deployments.

The eBPF Revolution

To understand why eBPF changed Kubernetes networking, you must first understand how kube-proxy and iptables work — and why they fail at scale.

The iptables Problem

kube-proxy is the Kubernetes component responsible for implementing Services. When you create a Service with three backend pods, kube-proxy programs the node’s packet filtering rules so that traffic to the Service’s ClusterIP is load-balanced across the three pods. Historically, kube-proxy used iptables to implement this.

iptables is a Linux kernel packet filtering framework that evaluates rules sequentially. For each Service, kube-proxy creates a chain of iptables rules that use probability-based matching to distribute traffic across endpoints. If a Service has three endpoints, the first rule matches with probability 1/3, the second matches with probability 1/2 (of the remaining traffic), and the third catches everything else.

The problem is scale. iptables rules are evaluated linearly for each packet. In a cluster with 10,000 Services, each with multiple endpoints, the iptables rule set can grow to hundreds of thousands of rules. Every packet entering the node traverses this list. The result is measurable latency increases at scale, slow rule updates (programming 100,000 iptables rules takes seconds to minutes), and high CPU overhead.
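The generated rules look roughly like this for one Service with three endpoints (chain names are abridged; real kube-proxy chains carry hashed suffixes):

  -A KUBE-SERVICES -d 10.96.0.10/32 -p tcp --dport 80 -j KUBE-SVC-XYZ
  -A KUBE-SVC-XYZ -m statistic --mode random --probability 0.33333 -j KUBE-SEP-A
  -A KUBE-SVC-XYZ -m statistic --mode random --probability 0.50000 -j KUBE-SEP-B
  -A KUBE-SVC-XYZ -j KUBE-SEP-C
  -A KUBE-SEP-A -p tcp -j DNAT --to-destination 10.244.1.5:8080

Multiply this handful of rules by every Service in the cluster and the linear-scan cost becomes visible.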

iptables vs eBPF: The Performance Cliff

  Packet processing time vs. number of Services

  Latency
  (us)
  |
  |                                           * iptables
  |                                        *
  |                                     *
  |                                  *
  |                              *
  |                          *
  |                      *
  |                  *
  |             *
  |         *
  |     *
  |  ──────────────────────────────────────── eBPF (constant)
  |
  └─────────────────────────────────────────── Number of Services
          1K       5K       10K      20K

  iptables: O(n) linear scan for each packet
  eBPF:     O(1) hash table lookup for each packet

IPVS (IP Virtual Server) was introduced as an alternative kube-proxy mode to address some of these issues. IPVS uses hash tables rather than linear rule chains, providing better performance at scale. But IPVS still runs in the kernel’s Netfilter framework and has limitations around custom packet manipulation and observability.

eBPF: Programs in the Kernel

eBPF (extended Berkeley Packet Filter) is a technology that allows running sandboxed programs directly in the Linux kernel without modifying kernel source code or loading kernel modules. Originally designed for packet filtering (hence the name), eBPF has evolved into a general-purpose in-kernel execution environment.

An eBPF program is compiled to bytecode that the kernel verifies for safety (no infinite loops, no invalid memory access, bounded execution time) and then JIT-compiles to native machine code. eBPF programs can be attached to various kernel hooks: network device ingress/egress, socket operations, system calls, tracepoints, and more.

For Kubernetes networking, eBPF is transformative because it allows implementing Service load-balancing, network policies, and observability at the earliest possible point in the kernel’s network stack, using hash table lookups instead of linear rule chains.

When a packet arrives at a node destined for a Service ClusterIP, an eBPF program attached to the network device performs a single hash lookup in a BPF map to find the backend pod, rewrites the packet’s destination address, and forwards it. O(1) regardless of how many Services exist. No iptables traversal. No Netfilter overhead.

Cilium: eBPF-Native Networking (2017)

Cilium, created by Isovalent in 2017 (Isovalent was acquired by Cisco in 2024), was built from the ground up on eBPF. Where Calico added eBPF support as an alternative to its iptables-based datapath, Cilium was eBPF-native from day one.

Cilium’s capabilities extend well beyond basic networking:

kube-proxy replacement. Cilium can fully replace kube-proxy, implementing Service load-balancing with eBPF programs. This eliminates the iptables bottleneck entirely and provides features like Maglev consistent hashing (for better load distribution), DSR (Direct Server Return) for reduced latency on reply packets, and graceful connection handling during backend changes.
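Enabling the replacement is a Helm values change; a sketch (flag names follow recent Cilium releases and vary by version, and the API server address is a placeholder):

  helm install cilium cilium/cilium --namespace kube-system \
      --set kubeProxyReplacement=true \
      --set k8sServiceHost=<api-server-address> \
      --set k8sServicePort=6443

The explicit API server address is needed because, without kube-proxy, Cilium cannot rely on the kubernetes Service ClusterIP to reach the control plane during bootstrap.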

L7-aware network policies. Traditional network policies operate at L3/L4 — IP addresses and TCP/UDP ports. Cilium’s eBPF programs can inspect L7 protocol headers, enabling policies like “allow HTTP GET to /api/v1/users but deny HTTP DELETE” or “allow gRPC calls to the ProductService.GetProduct method but deny ProductService.DeleteProduct.” This level of granularity was previously only available through service meshes.
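Such a rule is expressed through Cilium's CiliumNetworkPolicy CRD; a sketch of the first example (labels are illustrative):

  apiVersion: cilium.io/v2
  kind: CiliumNetworkPolicy
  metadata:
    name: api-readonly
  spec:
    endpointSelector:
      matchLabels:
        app: api
    ingress:
    - fromEndpoints:
      - matchLabels:
          app: frontend
      toPorts:
      - ports:
        - port: "8080"
          protocol: TCP
        rules:
          http:
          - method: GET          # only GET /api/v1/users is allowed;
            path: /api/v1/users  # a DELETE to the same path is denied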

Hubble observability. Cilium includes Hubble, an observability platform that provides real-time visibility into network flows, DNS queries, HTTP requests, and connection state — all captured by eBPF programs with minimal overhead. This is networking observability without sampling, without agents, without instrumentation.

Transparent encryption. Cilium can encrypt all pod-to-pod traffic using WireGuard or IPsec, transparently and without application changes. The encryption and decryption happen in eBPF programs attached to the network interfaces, so applications are unaware of the encryption layer.

Bandwidth management. eBPF programs can implement EDT (Earliest Departure Time) based rate limiting, providing better bandwidth management than traditional tc (traffic control) approaches.

Cilium became the default CNI on Google Kubernetes Engine in 2023, a significant endorsement. Its adoption reflects a broader trend: the kernel’s programmability through eBPF is displacing decades of networking infrastructure built on iptables, ipvs, and userspace proxies.

The kube-proxy Replacement Story

The move to replace kube-proxy deserves special attention because it illustrates how architectural assumptions age.

When Kubernetes was designed, iptables was the standard way to implement packet manipulation in the Linux kernel. It was well-understood, widely deployed, and sufficient for the cluster sizes of the time (dozens to hundreds of nodes, hundreds to thousands of Services). kube-proxy’s iptables mode was a reasonable engineering choice.

But Kubernetes clusters grew. Cloud providers ran clusters with tens of thousands of nodes and tens of thousands of Services. The linear scaling characteristics of iptables became untenable. Rule update latency meant Service changes took minutes to propagate. Connection tracking table overflow caused packet drops.

The progression was:

  1. iptables mode (original): simple, O(n) per packet, slow updates at scale
  2. IPVS mode (Kubernetes 1.11 GA): hash-based, better at scale, but still Netfilter-based
  3. eBPF mode (Cilium, Calico): O(1) per packet, fast updates, additional features
  4. nftables mode (Kubernetes 1.31): successor to iptables within the Netfilter framework, better performance and maintainability than iptables but still not eBPF-level

Today, organizations running at scale increasingly use Cilium or Calico’s eBPF datapath in place of kube-proxy. The kube-proxy component remains the default for backward compatibility and for environments where eBPF is not available (older kernels, certain cloud VMs), but the trajectory is clear.

Service Mesh Evolution

Istio, announced in 2017 and jointly developed by Google and IBM with Lyft’s Envoy proxy as its data plane, was the first major service mesh for Kubernetes. Istio’s architecture injected an Envoy sidecar proxy into every pod. All traffic to and from the pod passed through this proxy, which could enforce mTLS (mutual TLS), collect metrics, perform traffic routing, implement circuit breakers, and enforce access policies.

sequenceDiagram
    box Traditional (Istio with sidecars)
        participant A1 as Pod A: App
        participant E1 as Pod A: Envoy
        participant E2 as Pod B: Envoy
        participant B1 as Pod B: App
    end

    A1->>E1: request
    E1->>E2: mTLS
    E2->>B1: request

    Note over A1,B1: Per-pod proxy: 50-100 MB memory each

    box Sidecar-less (Cilium / Istio Ambient)
        participant A2 as Pod A: App
        participant NP as Per-Node eBPF Proxy
        participant B2 as Pod B: App
    end

    A2->>NP: request
    NP->>B2: mTLS + L7 policy

    Note over A2,B2: One shared proxy per node: 10x less memory

The sidecar approach was powerful but expensive. Each sidecar consumed memory (50-100 MB per pod was common for Envoy), added latency (traffic traversed two proxies for each hop), and increased the complexity of the pod lifecycle (the sidecar had to start before the application, and shutting down required careful ordering). In a cluster with 10,000 pods, the sidecar overhead was 500 GB to 1 TB of memory just for the mesh infrastructure.

Linkerd, created by Buoyant in 2017, was the lighter-weight alternative. Linkerd’s Rust-based proxy (linkerd2-proxy) consumed significantly less memory than Envoy and focused on a smaller, well-defined feature set: mTLS, observability, and reliability features.

The most significant recent trend is the sidecar-less mesh. Cilium Service Mesh uses eBPF programs in the kernel to provide mTLS, L7 policy, and observability without any sidecar proxies. Istio’s Ambient Mesh mode uses per-node ztunnel proxies for L4 features (mTLS, L4 policy) and optional waypoint proxies for L7 features, eliminating the per-pod sidecar overhead.

The sidecar-less approach reflects a broader realization: much of what sidecars do can be done more efficiently at the node level or in the kernel. The sidecar was an architectural choice driven by the constraints of 2017 (limited eBPF support, no per-node proxy infrastructure). As the infrastructure has evolved, the architecture is evolving with it.

The Current Landscape

The Kubernetes networking stack in 2024+ looks nothing like it did in 2015:

  • CNI plugin: Cilium (dominant, especially on cloud), Calico (strong on-premises), Flannel (edge/simple deployments)
  • Service implementation: eBPF (Cilium, Calico) replacing iptables/IPVS (kube-proxy)
  • Network policy: Cilium or Calico, both supporting L3/L4 and increasingly L7
  • Service mesh: consolidating around sidecar-less approaches; Istio Ambient and Cilium Service Mesh
  • Encryption: WireGuard-based transparent encryption (Cilium, Calico)

The evolution from Flannel’s simple VXLAN overlay to Cilium’s eBPF-native stack represents one of the most dramatic technical shifts in the Kubernetes ecosystem. It was driven by scale: the solutions that worked for hundreds of nodes failed at thousands. And it was enabled by a foundational technology shift (eBPF) that changed what was possible inside the Linux kernel. For a quick flowchart on choosing a CNI, see Appendix C: Decision Trees.

Common Mistakes and Misconceptions

  • “eBPF replaces all of iptables immediately.” Cilium’s eBPF datapath replaces kube-proxy’s iptables rules for service routing, but iptables still exists on the host for other purposes. Migration is incremental.
  • “I need a service mesh from day one.” Service meshes add complexity (sidecars, mTLS certificate management, control plane). Start without one; add it when you have a concrete need for mTLS, traffic splitting, or observability between services.
  • “Flannel is obsolete.” Flannel is simpler and lighter than Calico or Cilium. For small clusters that don’t need NetworkPolicy, Flannel is a perfectly valid choice.

Further Reading

  • Cilium documentation – Comprehensive reference for the eBPF-based CNI plugin that has become the dominant networking solution. The “Concepts” section explains how eBPF replaces iptables for service routing, network policy, and observability.
  • Calico documentation – Covers Calico’s BGP-based networking, network policy engine, and eBPF dataplane. Particularly strong on network policy design patterns for enterprise environments.
  • eBPF.io – The definitive resource for understanding eBPF, the kernel technology underpinning modern Kubernetes networking. Includes tutorials, reference material, and a curated list of eBPF-based projects.
  • Isovalent blog: “eBPF-based Networking, Observability, Security” – Technical deep-dives from the creators of Cilium on how eBPF is applied to networking, including kube-proxy replacement, transparent encryption, and service mesh without sidecars.
  • Thomas Graf – “Accelerating Envoy with the Linux Kernel” (KubeCon EU 2018) — Cilium creator on how eBPF fundamentally changes Kubernetes networking performance and architecture.
  • Flannel GitHub repository – The simple overlay network that was the default CNI for early Kubernetes. Reading the design docs helps understand the baseline that more advanced CNI plugins improved upon.
  • Cilium Service Mesh – Documentation on Cilium’s sidecar-less service mesh implementation, showing how eBPF enables mTLS, L7 policy, and traffic management without per-pod proxy overhead.
  • Kubernetes Network Policy documentation – The official reference for the NetworkPolicy API, essential for understanding the baseline that CNI plugins like Cilium and Calico extend with their own CRDs.

Chapter 14: Kubernetes Version History — A Guided Tour

For a visual timeline showing how the entire ecosystem evolved in parallel, see Appendix E: Architecture Evolution Timeline.

timeline
    title Kubernetes Release Timeline — The Inflection Points
    section Can it run?
        v1.0 (2015) : CNCF launch
                    : Pods, Services, Secrets
    section Can it handle state?
        v1.5 (2016) : CRI introduced
                    : StatefulSets (beta)
                    : PodDisruptionBudgets
    section Can we extend it?
        v1.7 (2017) : CRDs replace TPR
                    : RBAC GA
    section Is it production ready?
        v1.9 (2018) : Apps/v1 GA
                    : Workloads API stable
    section Can anyone set it up?
        v1.13 (2019) : kubeadm GA
                     : CSI 1.0
    section Cleaning house
        v1.20 (2020) : Docker deprecation announced
    section Removing the debt
        v1.22 (2021) : API removals forced migration
                     : beta APIs off by default
    section Runtime independence
        v1.24 (2022) : dockershim removed
                     : Kubernetes runs on containerd/CRI-O only
    section Mature platform
        v1.29 (2023) : Sidecar containers (beta)
                     : KMS v2 GA
        v1.31 (2024) : nftables kube-proxy (beta)
                     : AppArmor GA

v1.0 (July 2015): The Starting Line

Kubernetes 1.0 was released at OSCON 2015 alongside the announcement that Google was donating the project to the newly formed Cloud Native Computing Foundation (CNCF). The CNCF donation (covered in Chapter 8) gave competitors reason to contribute rather than fork.

The 1.0 release was sparse by modern standards. It had Pods, ReplicationControllers (the precursor to ReplicaSets and Deployments), Services, and Secrets. There were no Deployments, no StatefulSets, no RBAC, no CRDs. The scheduler was basic. Networking was primitive. But the core architectural decisions were already in place: the declarative API model, the reconciliation loop pattern, etcd as the state store, and the API server as the single point of access.

The significance of 1.0 was not its feature set but its commitment to stability. By calling it 1.0, the project promised backward compatibility. API resources marked as stable would not be removed or changed in breaking ways. This promise — which Kubernetes has largely kept — gave enterprises the confidence to invest in the platform.

v1.2 (March 2016): First Usable for Production

Kubernetes 1.2 introduced three features that transformed it from a promising experiment into something you could actually run in production.

ConfigMaps provided a way to inject configuration data into pods without baking it into the container image. Before ConfigMaps, you had two options: environment variables (limited and inflexible) or mounting Secrets (semantically wrong for non-secret configuration).

DaemonSets ensured that a specific pod ran on every node (or a selected subset of nodes). This was essential for infrastructure agents: log collectors, monitoring agents, network plugins, storage drivers. Without DaemonSets, operators had to manually ensure these agents were running on every node and handle new nodes joining the cluster.

Deployments (in beta) introduced declarative rolling updates. Before Deployments, updating an application required manually managing ReplicationControllers — creating a new one, scaling it up, scaling the old one down, and handling failures during the transition. Deployments automated this entire process and added rollback capability. The Deployment controller became the workhorse of Kubernetes, managing the vast majority of stateless workloads.

v1.3 (July 2016): The State Problem

PetSets (later renamed StatefulSets) arrived in alpha, providing stable network identities, per-pod persistent storage, and ordered scaling — the features stateful workloads need that Deployments do not provide.

The name “PetSets” reflected the “pets vs. cattle” metaphor that dominated DevOps thinking: stateless containers were “cattle” (identical, replaceable) while stateful services were “pets” (unique, requiring individual care). The rename to StatefulSets in 1.5 was driven by the community’s desire for a more descriptive, less metaphorical name.

Cluster federation also appeared in alpha, reflecting an early attempt to manage multiple clusters as a single entity. Federation proved premature — the problem was real but the approach was wrong — and it was eventually replaced by tools like Loft’s vCluster, Admiralty, and the multi-cluster capabilities of service meshes and GitOps tools.

v1.5 (December 2016): The Plugin Architecture Emerges

Kubernetes 1.5 was architecturally pivotal. The Container Runtime Interface (CRI) was introduced, beginning the process that would eventually lead to Docker’s removal (covered in Chapter 10). StatefulSets reached beta, making stateful workloads viable for early adopters. PodDisruptionBudgets appeared, giving operators a way to express how much disruption a workload could tolerate during maintenance operations.

PodDisruptionBudgets solved a subtle but critical problem. When a node needed to be drained for maintenance (kernel upgrade, hardware repair), the system needed to know whether it was safe to evict a pod. For a Deployment with 10 replicas, losing one pod during a drain is fine. For a three-node etcd cluster, losing one node when another is already down would break quorum. PodDisruptionBudgets let operators express constraints like “at least 2 of 3 replicas must always be available,” giving the drain process the information it needed to proceed safely.
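The etcd example from above is a few lines of YAML (labels are illustrative):

  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: etcd-pdb
  spec:
    minAvailable: 2         # evictions that would drop below 2 pods are refused
    selector:
      matchLabels:
        app: etcd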

v1.6 (March 2017): Security Gets Serious

RBAC (Role-Based Access Control) reached beta and was enabled by default. Before RBAC, Kubernetes had ABAC (Attribute-Based Access Control), which required restarting the API server to change policies. However, the bigger problem was that many clusters simply ran with the AlwaysAllow authorizer — the permissive default — which allowed any authenticated user to do anything. ABAC was available as a more restrictive alternative, but its static file-based configuration made it cumbersome to adopt. RBAC changed this fundamentally.

RBAC introduced Roles (permissions scoped to a namespace), ClusterRoles (permissions scoped to the cluster), RoleBindings, and ClusterRoleBindings. It allowed fine-grained access control: developer A can create Deployments in namespace “team-a” but not in namespace “team-b.” Service accounts can read ConfigMaps but not Secrets. CI/CD pipelines can deploy but not modify RBAC rules.
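The first of those examples, written out as a Role and RoleBinding (names are illustrative):

  apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    namespace: team-a
    name: deployment-manager
  rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "create", "update", "delete"]
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    namespace: team-a
    name: developer-a-deployments
  subjects:
  - kind: User
    name: developer-a
    apiGroup: rbac.authorization.k8s.io
  roleRef:
    kind: Role
    name: deployment-manager
    apiGroup: rbac.authorization.k8s.io

The binding grants the Role’s verbs in team-a only; the same user gets nothing in team-b unless a separate binding exists there.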

This release also migrated the default storage backend from etcd v2 to etcd v3, a significant change. etcd v3 introduced a flat key-value model (replacing v2’s directory tree), a more efficient storage format, and support for watchers at scale. The migration was transparent to most users but was essential for supporting larger clusters.

Dynamic storage provisioning reached GA, allowing PersistentVolumeClaims to automatically trigger the creation of underlying storage (EBS volumes, GCE persistent disks, NFS shares) without manual administrator intervention. This completed the self-service model: developers could request storage in their manifests and the cluster would provision it automatically.

v1.7 (June 2017): Extensibility Unlocked

Custom Resource Definitions (CRDs) replaced the earlier ThirdPartyResources, fundamentally changing what Kubernetes could do. CRDs allowed anyone to define new resource types in the Kubernetes API. Combined with custom controllers, CRDs turned Kubernetes from a container orchestration platform into a general-purpose platform for managing any kind of resource.

The significance of CRDs cannot be overstated. They enabled the “operator pattern” — custom controllers that encode domain-specific operational knowledge. A PostgreSQL operator could define a PostgresCluster CRD, and a controller could watch for these resources and automatically provision databases, configure replication, manage backups, and handle failover. The operator pattern turned Kubernetes into a platform for automating the operation of complex software systems, not just running containers.

Network Policies reached GA, providing a mechanism to restrict pod-to-pod communication. Before Network Policies, the flat network model meant any pod could talk to any other pod — a security model that was unacceptable for multi-tenant clusters or environments handling sensitive data.

v1.8 (September 2017): RBAC Stabilizes

RBAC reached GA, completing its journey from alpha to stable. This was the release where Kubernetes’ security model was considered production-ready. From this point forward, the expectation was that all clusters would use RBAC, and tools and documentation assumed its presence.

CronJobs reached beta, providing scheduled job execution (the Kubernetes equivalent of cron). While conceptually simple, CronJobs were important because they addressed a common pattern — batch processing, report generation, database maintenance — that previously required external scheduling systems.

v1.9 (December 2017): The Apps API Stabilizes

This release marked the moment Kubernetes’ core workload APIs became stable. Deployments, ReplicaSets, StatefulSets, and DaemonSets all reached GA under the apps/v1 API group.

The Container Storage Interface (CSI) appeared in alpha. CSI would do for storage what CRI did for container runtimes and CNI did for networking: define a standard interface so storage providers could be plugged in without modifying Kubernetes core code. Before CSI, storage drivers were compiled into Kubernetes, meaning a new storage provider required a change to the Kubernetes codebase. CSI decoupled storage from the Kubernetes release cycle.

v1.11 (June 2018): Infrastructure Refresh

CoreDNS replaced kube-dns as the default cluster DNS provider. kube-dns was a composite of three containers (kube-dns, dnsmasq, sidecar) that was complex to debug and had known performance issues. CoreDNS was a single binary, written in Go, with a plugin-based architecture that made it flexible and easy to extend. The switch reflected a maturation of the ecosystem: better tools replaced adequate ones.

IPVS-based kube-proxy reached GA, providing an alternative to iptables mode for Service load-balancing. IPVS used hash tables instead of linear iptables chains, offering better performance at scale (thousands of Services). This was a stopgap improvement; the eventual answer would be eBPF, but IPVS provided meaningful improvements for clusters that could not yet adopt eBPF-based solutions.

v1.13 (December 2018): The Bootstrap Milestone

kubeadm reached GA, meaning the Kubernetes project now had a stable, supported way to bootstrap clusters. This was the culmination of two years of development by SIG Cluster Lifecycle and was essential for the ecosystem of higher-level tools (kops, kubespray) that built on kubeadm.

CSI 1.0 was released, completing the storage plugin interface specification. Storage vendors could now build drivers that worked with any Kubernetes version without compiling code into Kubernetes. This accelerated the storage ecosystem enormously: vendors shipped CSI drivers for their proprietary storage systems, and the community built CSI drivers for NFS, Ceph, and other open-source storage systems.

v1.16 (September 2019): CRDs Grow Up

CRDs reached GA with structural schemas, meaning CRD authors could define OpenAPI v3 schemas for their custom resources. The API server would validate custom resources against these schemas, rejecting invalid objects. Before structural schemas, CRDs accepted any JSON object, which meant validation errors were only caught at the controller level. Structural schemas moved validation to the API server, matching the behavior of built-in resources.

This release also stopped serving the extensions/v1beta1 versions of Deployments, DaemonSets, and ReplicaSets, forcing users to migrate to apps/v1. This was the beginning of a pattern: Kubernetes would aggressively deprecate beta APIs to prevent permanent dependence on unstable interfaces.

v1.20 (December 2020): The Docker Announcement

The dockershim deprecation announcement dominated this release’s narrative (covered in detail in Chapter 10). Beyond the Docker story, v1.20 introduced graceful node shutdown, allowing the kubelet to detect that the node’s operating system was shutting down and gracefully terminate pods in priority order. Before this, a node shutdown simply killed all pods, potentially interrupting critical workloads mid-operation.

v1.22 (August 2021): The Great API Migration

This release removed many deprecated beta APIs that had been deprecated since v1.16. Ingress moved from extensions/v1beta1 to networking.k8s.io/v1. CRD moved from apiextensions.k8s.io/v1beta1 to v1. ValidatingWebhookConfiguration and MutatingWebhookConfiguration moved to admissionregistration.k8s.io/v1.

The removals caused significant disruption. Many Helm charts, operators, and deployment scripts still referenced the old API versions. Tools that generated Kubernetes manifests had to be updated. The community learned a painful lesson about the cost of depending on beta APIs and the importance of migration planning.

Server-side apply reached GA, moving manifest merging logic from kubectl to the API server. This enabled conflict detection (two controllers modifying the same field), field ownership tracking, and consistent behavior across all API clients. Server-side apply was foundational for the emerging GitOps ecosystem, where multiple tools might manage different fields of the same resource.
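A sketch of how this surfaces in kubectl (the manifest path and the ci-pipeline manager name are placeholders):

# Apply with server-side merging; field ownership is recorded under "ci-pipeline"
kubectl apply --server-side --field-manager=ci-pipeline -f deployment.yaml

# If another field manager owns a field being changed, the apply fails with a
# conflict; taking ownership must be requested explicitly
kubectl apply --server-side --force-conflicts -f deployment.yaml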

v1.24 (May 2022): Runtime Independence

The dockershim was removed, completing the deprecation announced in v1.20. Clusters using Docker as their container runtime needed to switch to containerd or CRI-O. In practice, most managed Kubernetes services had already made this switch, and the impact on self-managed clusters was modest because containerd — the actual runtime inside Docker — was already present on most nodes.

v1.25 (August 2022): Security Model Modernization

PodSecurityPolicy (PSP) was removed, ending a contentious chapter in Kubernetes security. PSP had been the mechanism for restricting what pods could do (run as root, use host networking, mount host paths), but it was widely regarded as confusing, difficult to use correctly, and prone to misconfiguration. Its replacement was Pod Security Standards enforced through the Pod Security Admission controller, which defined three profiles — Privileged, Baseline, and Restricted — that were simpler to understand and apply.

Ephemeral containers reached GA, allowing operators to inject temporary debugging containers into running pods. Before ephemeral containers, debugging a distroless or minimal container (which lacked shells, debugging tools, or even a writable filesystem) required rebuilding the image with debugging tools, redeploying, and reproducing the problem. Ephemeral containers solved this by allowing you to attach a container with debugging tools to a running pod without restarting it.
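In practice this is exposed through kubectl debug; a minimal sketch, where my-pod and its app container are placeholder names:

# Inject a temporary busybox container that shares the app container's
# process namespace; the pod is not restarted
kubectl debug -it my-pod --image=busybox:1.36 --target=app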

v1.27 (April 2023): Resource Flexibility

In-place pod resource resize appeared in alpha, addressing a long-standing limitation. Before this feature, changing a pod’s CPU or memory limits required deleting and recreating the pod. For stateful workloads, this meant downtime. In-place resize allowed changing resource limits on a running pod, with the kubelet adjusting cgroup limits without restarting the container.
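A sketch of the accompanying API surface, assuming a cluster with the InPlacePodVerticalScaling feature gate enabled (pod and container names are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: resizable-app
spec:
  containers:
    - name: app
      image: nginx:1.27.3
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 256Mi
      resizePolicy:                        # introduced alongside in-place resize
        - resourceName: cpu
          restartPolicy: NotRequired       # CPU changes adjust cgroups live
        - resourceName: memory
          restartPolicy: RestartContainer  # memory changes restart the container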

SeccompDefault reached GA, letting cluster administrators apply the RuntimeDefault Seccomp profile to every pod on a node through kubelet configuration. Seccomp restricts which system calls a container can make, reducing the kernel attack surface. Making the default profile easy to enforce was a security hardening step that moved the ecosystem toward defense-in-depth.

v1.29 (December 2023): Sidecar Containers and Secrets at Scale

Sidecar containers reached beta and became enabled by default (formally: native sidecar support via init containers with restartPolicy: Always, targeting GA in v1.33). This addressed a long-standing problem with the sidecar pattern: Kubernetes had no native concept of a container that started before and stopped after the main container. Log collectors, service mesh proxies, and monitoring agents were deployed as sidecars, but Kubernetes treated them as ordinary containers. This led to startup ordering issues (the sidecar proxy might not be ready when the application started) and shutdown ordering issues (the sidecar might be killed before the application finished draining connections).
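A minimal sketch of the native pattern (the fluent-bit log shipper is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  initContainers:
    - name: log-shipper
      image: fluent/fluent-bit:3.0
      restartPolicy: Always    # marks this init container as a native sidecar:
                               # it starts before the main containers, stops after them
  containers:
    - name: app
      image: nginx:1.27.3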

KMS v2 reached GA for secrets encryption at rest. Kubernetes Secrets are stored in etcd, and without encryption at rest, anyone with access to etcd’s data directory can read all Secrets in plaintext. KMS v2 provided a standard interface for integrating with external key management services (AWS KMS, Google Cloud KMS, Azure Key Vault, HashiCorp Vault), ensuring Secrets were encrypted in etcd using keys managed by a dedicated, auditable, access-controlled key management system.

v1.30 (April 2024): Authentication and Resource Management

Structured authentication configuration allowed administrators to configure authentication using a file-based configuration rather than a proliferation of API server flags. This made authentication setup more manageable, auditable, and version-controllable.

Dynamic Resource Allocation (DRA) continued its progression, providing a framework for managing non-traditional resources (GPUs, FPGAs, network devices) through a structured API rather than the opaque extended resources mechanism. DRA was driven by the explosive growth of AI/ML workloads that required fine-grained GPU allocation and sharing.

v1.31 (August 2024): Kernel-Level Security and Modern Networking

AppArmor support reached GA, providing mandatory access control profiles that restrict container capabilities at the kernel level. AppArmor profiles could limit filesystem access, network operations, and capability usage, providing a defense-in-depth layer beyond Seccomp and Linux capabilities.

The nftables kube-proxy backend was promoted to beta (it first appeared as alpha in v1.29), replacing iptables with its successor in the Linux kernel. nftables provides better performance, a cleaner rule syntax, and improved maintainability. While eBPF-based solutions (Cilium, Calico) offer superior performance, nftables modernized the default kube-proxy for environments that prefer to use the standard kernel networking stack.

v1.32+ (2025-2026): Continued Maturation

Recent and upcoming releases continue the trend of maturation rather than revolution. Dynamic Resource Allocation improvements address the growing demand for GPU and accelerator scheduling in AI/ML workloads. In-place pod resource resize progresses toward GA. The overall pattern is one of stabilization: making alpha features beta, making beta features GA, and improving the operational experience of features that are already stable.

The Pattern Behind the Versions

Reading the version history as a narrative rather than a changelog reveals a clear pattern of maturation:

2015-2016: Can it run at all? The early releases focused on basic functionality — scheduling pods, running stateless workloads, providing services. Kubernetes was proving that the architecture worked.

2016-2017: Can it handle real workloads? StatefulSets, RBAC, CRDs, Network Policies. These features addressed the requirements of production systems: state, security, extensibility, and network isolation.

2018-2019: Can anyone set it up? kubeadm, CSI, Helm v3. The focus shifted from what Kubernetes could do to how people could deploy and manage it. The tooling ecosystem matured alongside the platform.

2020-2022: Cleaning up the debt. Docker deprecation, API removals, PSP removal. Kubernetes spent these years removing technical debt and forcing the ecosystem to migrate away from deprecated interfaces. This was painful but necessary for long-term health.

2023-2026: Mature platform. Sidecar containers, in-place resize, DRA, security hardening. The features being added are refinements, not revolutions. Kubernetes is no longer proving itself; it is optimizing for the workloads and operational patterns that have emerged over a decade of production use.

The version history also reveals the disciplined API lifecycle that makes Kubernetes trustworthy as a platform. Features progress through alpha (disabled by default, may change or be removed), beta (enabled by default, API may change), and GA (stable, backward compatible, will not be removed). This lifecycle gives users clear signals about what is safe to depend on and gives the community space to iterate on APIs before committing to them permanently.

Common Mistakes and Misconceptions

  • “I should always run the latest Kubernetes version.” New versions may have bugs, and your tools/operators may not support them yet. Use release channels (Stable or Regular) and wait 1-2 months after a minor release before upgrading production.
  • “Skipping minor versions during upgrades is fine.” Kubernetes supports upgrading one minor version at a time (e.g., 1.28 → 1.29 → 1.30). Skipping versions can break API compatibility and is unsupported.
  • “Deprecated APIs will keep working forever.” Deprecated APIs are removed after a defined period (typically 2-3 releases). Plan migrations early using kubectl convert or tools like Pluto to detect deprecated APIs, as sketched below.
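
A hedged sketch of that detection workflow, assuming Pluto and the kubectl-convert plugin are installed (paths are placeholders):

# Scan local manifests for API versions deprecated or removed in newer releases
pluto detect-files -d ./manifests

# Rewrite a manifest to a newer API version
kubectl convert -f ingress-old.yaml --output-version networking.k8s.io/v1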

Further Reading

  • Kubernetes Release Notes (official) – The canonical list of all Kubernetes releases with links to changelogs, release notes, and upgrade guides. Start here to understand what changed in any specific version.
  • Kubernetes Deprecation Policy – The formal rules governing how APIs and features are deprecated and removed, including the minimum version guarantees for GA, beta, and alpha APIs.
  • Kubernetes Enhancement Proposals (KEP) process – How new features go from idea to implementation. Understanding KEPs explains why features take multiple releases to mature and how the community coordinates large changes.
  • SIG Release – The Special Interest Group responsible for the release process, cadence, and tooling. The README and meeting notes provide insight into how the three-releases-per-year cadence is managed.
  • Kubernetes CHANGELOG on GitHub – The raw changelogs for every release, useful for detailed investigation of specific changes, bug fixes, and API modifications.
  • “Kubernetes Release Cadence Change: Here’s What You Need To Know” (Kubernetes blog) – Explains the move from four to three releases per year and how the new cadence balances stability with velocity.
  • API Version Lifecycle documentation – Official reference for understanding alpha, beta, and GA API stages, which directly maps to the feature maturation pattern described in this chapter.

This concludes Part 2: The Tooling Ecosystem. You now understand how the tools around Kubernetes evolved and why they look the way they do today. Part 3 takes all of this context and puts it into practice — setting up real clusters, deploying real workloads, and learning to debug when things go wrong.

Next: Setting Up a Cluster from Scratch

Chapter 15: Setting Up a Cluster from Scratch

Every Kubernetes cluster begins as a collection of Linux machines that know nothing about each other. Something must generate the certificates, write the configuration files, start the control plane processes, and establish the trust relationships that let workers join.

What kubeadm Actually Does

kubeadm is the official bootstrapping tool. When you run kubeadm init, it executes 12 phases in sequence. Each phase solves a specific problem in the bootstrap chain.

kubeadm init
│
├── 1.  preflight           Validate the node can become a control plane
├── 2.  certs               Generate the entire PKI hierarchy
├── 3.  kubeconfig          Generate kubeconfig files for each component
├── 4.  kubelet-start       Configure and start the kubelet
├── 5.  control-plane       Write static pod manifests for control plane
├── 6.  etcd                Write static pod manifest for local etcd
├── 7.  upload-config       Store kubeadm and kubelet config in ConfigMaps
├── 8.  upload-certs        (optional) Upload certs for HA join
├── 9.  mark-control-plane  Taint and label the node
├── 10. bootstrap-token     Create token for worker node joining
├── 11. kubelet-finalize    Update kubelet config for TLS bootstrap
├── 12. addon               Install CoreDNS and kube-proxy
│
▼
Control plane is running. Workers can join.

Let us walk through each phase in detail.

Phase 1: Preflight Checks

Before touching anything, kubeadm validates that the system meets the minimum requirements; each check can be reproduced by hand, as the commands after this list show. The checks include:

  • Swap is disabled. By default, the kubelet refuses to start if swap is enabled: the scheduler’s resource accounting assumes no swap, and swap breaks memory limit enforcement.
  • Required ports are available. The API server needs port 6443, etcd needs 2379-2380, the scheduler needs 10259, the controller manager needs 10257. If another process occupies these ports, the control plane cannot start.
  • Container runtime is reachable. kubeadm checks for a CRI-compatible runtime (containerd or CRI-O) at the expected socket path.
  • cgroup driver matches. The kubelet and the container runtime must agree on whether to use cgroupfs or systemd as the cgroup driver. A mismatch causes containers to start in the wrong cgroup hierarchy, breaking resource accounting. Since Kubernetes 1.22, systemd is the recommended default.
  • Required kernel modules and sysctl settings are present (br_netfilter, ip_forward).
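
A quick sketch of reproducing these checks manually (the socket path assumes the containerd default):

# Spot-check the same requirements kubeadm's preflight phase validates
swapon --show                                    # no output means swap is off
ss -tlnp | grep -E '6443|2379|2380|10257|10259'  # these ports should be free
lsmod | grep br_netfilter                        # kernel module loaded?
sysctl net.ipv4.ip_forward                       # should print 1
crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock info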

Phase 2: Certificate Generation

This is the most important phase. Kubernetes is a distributed system where every component authenticates to every other component using mutual TLS. kubeadm generates the entire PKI hierarchy and writes it to /etc/kubernetes/pki/.

PKI HIERARCHY
─────────────

/etc/kubernetes/pki/
│
├── ca.crt / ca.key                    ◄── Cluster Root CA
│   │
│   ├── apiserver.crt / apiserver.key         API server serving cert
│   ├── apiserver-kubelet-client.crt/key      API server → kubelet client cert
│   │
│   └── (kubeconfig embedded certs)
│       ├── admin client cert                 kubectl access
│       ├── controller-manager client cert    controller-manager → API server
│       └── scheduler client cert             scheduler → API server
│
├── front-proxy-ca.crt / .key          ◄── Front Proxy CA (aggregation layer)
│   └── front-proxy-client.crt/key            Aggregation layer client cert
│
├── etcd/
│   ├── ca.crt / ca.key                ◄── etcd Root CA (separate trust domain)
│   ├── server.crt / server.key               etcd server serving cert
│   ├── peer.crt / peer.key                   etcd peer-to-peer communication
│   └── healthcheck-client.crt/key            Health check client cert
│
├── apiserver-etcd-client.crt/key             API server → etcd client cert
│
└── sa.key / sa.pub                           Service account signing keypair

Three separate CAs exist by design. The cluster CA signs all Kubernetes component certificates, the etcd CA signs all etcd certificates, and the front-proxy CA secures the API aggregation layer. This separation means a compromise of one CA does not automatically grant access to the trust domains of the others. The API server holds a client certificate signed by the etcd CA, which is how it authenticates to etcd.

The service account keypair (sa.key / sa.pub) is used by the controller manager to sign service account tokens and by the API server to verify them. This is not a CA — it is a signing key for JWTs.
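
These are ordinary X.509 files, and you can inspect them with standard tools; kubeadm also ships a summary command:

# Examine any certificate in the hierarchy
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -subject -issuer -dates

# Summarize expiry across the whole PKI
kubeadm certs check-expiration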

Phase 3: kubeconfig Generation

kubeadm generates four kubeconfig files in /etc/kubernetes/:

File                     Used By                              Purpose
admin.conf               kubectl (cluster admin)              Full cluster access
controller-manager.conf  kube-controller-manager              Authenticate to API server
scheduler.conf           kube-scheduler                       Authenticate to API server
kubelet.conf             kubelet on the control plane node    Authenticate to API server

Each kubeconfig file embeds a client certificate (signed by the cluster CA) and the CA certificate for verifying the API server. This is mutual TLS: the component authenticates to the API server, and the API server authenticates back to the component.
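
You can verify this directly; without --raw, kubectl redacts the embedded certificate data:

# Show the admin kubeconfig structure (cert data appears as DATA+OMITTED)
kubectl config view --kubeconfig /etc/kubernetes/admin.conf

# Decode the embedded client certificate to check its subject and expiry
kubectl config view --kubeconfig /etc/kubernetes/admin.conf --raw \
  -o jsonpath='{.users[0].user.client-certificate-data}' | base64 -d \
  | openssl x509 -noout -subject -dates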

Phases 4-6: Static Pod Manifests and the Bootstrap Problem

Here is the fundamental bootstrap problem: the API server, controller manager, scheduler, and etcd must run as containers, but the kubelet cannot pull their pod specs from an API server that does not yet exist. This is a circular dependency.

Kubernetes solves this with static pods. The kubelet can read pod manifests directly from a local directory (/etc/kubernetes/manifests/) and run them without any API server involvement. kubeadm writes four manifest files:

/etc/kubernetes/manifests/
├── kube-apiserver.yaml
├── kube-controller-manager.yaml
├── kube-scheduler.yaml
└── etcd.yaml

The kubelet detects these files, creates the pods, and monitors them. If a static pod crashes, the kubelet restarts it. Once the API server is running, the kubelet creates mirror pods in the API — read-only representations that make static pods visible through kubectl get pods -n kube-system, even though the API server does not manage them.

This is one of the most elegant solutions in Kubernetes’ design. The kubelet operates in two modes simultaneously: it manages static pods from local files (for bootstrapping) and regular pods from the API server (for everything else).
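
For orientation, a heavily abridged sketch of what one of these manifests contains (real manifests carry many more flags and volume mounts):

apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true                  # control plane pods use the node's network
  containers:
    - name: kube-apiserver
      image: registry.k8s.io/kube-apiserver:v1.32.0
      command:
        - kube-apiserver
        - --etcd-servers=https://127.0.0.1:2379
        - --client-ca-file=/etc/kubernetes/pki/ca.crt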

Phases 7-9: Configuration and Node Marking

kubeadm stores its own configuration and the kubelet’s configuration as ConfigMaps in the kube-system namespace. This serves two purposes: it documents how the cluster was initialized, and it provides configuration for worker nodes joining later.

The control plane node is tainted with node-role.kubernetes.io/control-plane:NoSchedule so that regular workloads are not scheduled onto it. This is a convention, not a hard rule — you can remove this taint on single-node clusters.

Phase 10: Bootstrap Tokens and the TLS Bootstrap Handshake

When a worker node joins the cluster, it needs to authenticate to the API server. But it has no certificate yet — that is what it is trying to obtain. This is solved by the TLS bootstrap protocol.

sequenceDiagram
    participant K as Kubelet (Worker)
    participant A as API Server
    participant C as CSR Approving Controller
    participant CA as CA (Signer)

    Note over K: kubeadm join --token abc123

    K->>A: TLS connect (verify CA cert hash)
    K->>A: Authenticate with bootstrap token
    Note right of A: Token valid — grants CSR create permission

    K->>A: POST CertificateSigningRequest
    Note left of K: "I am node X, give me a cert"

    A->>C: CSR created
    Note right of C: Auto-approve (first cert from bootstrap token)

    C->>CA: Sign request
    CA-->>C: Signed certificate

    C-->>A: CSR approved + signed cert
    A-->>K: Signed certificate returned

    rect rgba(50, 108, 229, 0.1)
        Note over K,CA: Kubelet now uses real certificate for all API calls.<br/>Bootstrap token can be safely revoked.
        K->>A: Authenticated API calls (using real cert)
    end

The bootstrap token is a short-lived, low-privilege credential. It grants exactly one permission: the ability to create a CertificateSigningRequest. The csrapproving controller in the controller manager automatically approves CSRs from bootstrap tokens (for the first certificate). The worker receives a signed certificate and uses it for all subsequent communication. The bootstrap token can now be revoked.

The --discovery-token-ca-cert-hash flag prevents man-in-the-middle attacks during the initial connection. The worker verifies the API server’s certificate against this hash before sending the bootstrap token.
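
Bootstrap tokens expire after 24 hours by default; generating a fresh one together with the full join command is a single kubeadm call (the output shape below uses placeholder values):

# On the control plane: mint a new token and print the matching join command
kubeadm token create --print-join-command
# kubeadm join k8s-api.example.com:6443 --token <token> \
#     --discovery-token-ca-cert-hash sha256:<hash>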

Phases 11-12: Kubelet Finalize and Addons

The kubelet-finalize phase updates the kubelet’s configuration after TLS bootstrap (pointing kubelet.conf at the rotating client certificate). kubeadm then installs two mandatory addons:

  • CoreDNS: Provides cluster DNS. Pods resolve service names (e.g., my-svc.my-namespace.svc.cluster.local) through CoreDNS.
  • kube-proxy: Runs as a DaemonSet on every node. Manages iptables or IPVS rules that implement Service routing.

Note that kubeadm does not install a CNI plugin. This is deliberate: the choice of CNI plugin is a critical networking decision that kubeadm leaves to the operator. Until a CNI plugin is installed, pods on the control plane node will be stuck in Pending and nodes will show as NotReady.

Using a kubeadm Configuration File

While kubeadm init accepts dozens of CLI flags, production usage should always use a YAML configuration file. This makes the cluster setup reproducible and auditable.

apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.32.0
controlPlaneEndpoint: "k8s-api.example.com:6443"
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
  dnsDomain: "cluster.local"
apiServer:
  certSANs:
    - "k8s-api.example.com"
    - "10.0.0.100"
  extraArgs:
    - name: "audit-log-path"
      value: "/var/log/kubernetes/audit.log"
etcd:
  local:
    dataDir: "/var/lib/etcd"
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
nodeRegistration:
  criSocket: "unix:///var/run/containerd/containerd.sock"
  kubeletExtraArgs:
    - name: "cgroup-driver"
      value: "systemd"

Run with: kubeadm init --config=kubeadm-config.yaml

The controlPlaneEndpoint is critical for HA clusters. It should point to a load balancer in front of multiple API server instances. Setting it during initial setup avoids painful reconfiguration later.

Kubernetes the Hard Way

Kelsey Hightower’s Kubernetes the Hard Way is a 13-lab exercise that provisions a cluster by hand, without kubeadm. The labs (updated for v1.32.x) walk you through:

  1. Generating every certificate by hand (you will appreciate kubeadm’s Phase 2 after this)
  2. Writing every kubeconfig file manually
  3. Configuring etcd from scratch
  4. Writing systemd unit files for every component
  5. Configuring kubelet and kube-proxy on each worker
  6. Setting up pod networking manually

What The Hard Way teaches that kubeadm hides:

  • The CA is just files. There is no magic PKI server. You generate a CA key, use it to sign certificates, and distribute them. Understanding this demystifies all of Kubernetes’ authentication.
  • The API server is just a binary with flags. Every feature — authentication methods, authorization modes, admission controllers — is controlled by command-line flags to kube-apiserver.
  • Networking is not built-in. You must configure routing tables or install a CNI plugin yourself. This makes you understand why CNI exists.
  • etcd is independent. It runs as its own cluster and can be inspected with etcdctl independently of Kubernetes.

Do The Hard Way once, then use kubeadm for everything after. The exercise takes 4-8 hours and permanently changes how you think about clusters.

Common Pitfalls

Problem                   Symptom                                     Fix
Swap enabled              kubelet refuses to start                    swapoff -a and remove swap from /etc/fstab
cgroup driver mismatch    Pods fail with cgroup errors                Ensure kubelet and containerd both use systemd
Port 6443 in use          API server fails to bind                    Check for existing processes: ss -tlnp | grep 6443
Firewall blocking         Workers cannot join                         Open 6443, 2379-2380, 10250, 10259, 10257
CNI not installed         All pods stuck in Pending, nodes NotReady   Install a CNI plugin (Calico, Cilium, Flannel)
Wrong podSubnet           CNI and kubeadm disagree on pod CIDR        Match podSubnet in kubeadm config with CNI config
Expired bootstrap token   Workers cannot join after 24h               Generate new token: kubeadm token create --print-join-command

For a comprehensive error-to-fix mapping, see Appendix D: Troubleshooting Quick Reference.

Common Mistakes and Misconceptions

  • “One control plane node is enough for production.” A single control plane is a single point of failure. Production clusters need 3 or 5 control plane nodes for etcd quorum and API server HA.
  • “Worker nodes should be as large as possible.” Fewer large nodes means each node failure impacts more pods. Balance node size against blast radius — many medium nodes are often better than few huge ones.
  • “I can skip configuring kubelet flags.” Defaults work for learning, but production kubelets need tuning: eviction thresholds, max-pods, image garbage collection, and system reserved resources.

Next: Managed Kubernetes: EKS, GKE, and AKS

Chapter 16: Managed Kubernetes: EKS, GKE, and AKS

Running your own control plane is an excellent way to learn Kubernetes. For most teams, managed services reduce operational overhead significantly. The control plane — etcd, the API server, the controller manager, the scheduler — requires careful backup, monitoring, upgrade orchestration, and high-availability configuration. Managed Kubernetes services take this burden off your team so you can focus on what runs on the cluster rather than what runs the cluster.

But “managed” does not mean “fully operated.” Every cloud provider draws the line differently between what they manage and what remains your responsibility. Understanding exactly where that line falls is essential for making an informed choice.

The Shared Responsibility Model

MANAGED KUBERNETES: WHO MANAGES WHAT?
──────────────────────────────────────

  Cloud Provider Manages               │  You Manage
  ─────────────────────                │  ──────────
                                       │
  ┌────────────────────────────────┐   │  ┌────────────────────────────────┐
  │ Control Plane                  │   │  │ Worker Nodes                   │
  │ ┌───────────┐  ┌────────────┐  │   │  │ ┌───────────┐  ┌───────────┐   │
  │ │ API Server│  │ Controller │  │   │  │ │  kubelet  │  │ Your Pods │   │
  │ │ (HA, TLS) │  │ Manager    │  │   │  │ │           │  │           │   │
  │ └───────────┘  └────────────┘  │   │  │ └───────────┘  └───────────┘   │
  │ ┌───────────┐  ┌────────────┐  │   │  │ ┌───────────┐  ┌───────────┐   │
  │ │ Scheduler │  │    etcd    │  │   │  │ │ kube-proxy│  │ CNI agent │   │
  │ │           │  │ (backups)  │  │   │  │ │           │  │           │   │
  │ └───────────┘  └────────────┘  │   │  │ └───────────┘  └───────────┘   │
  │                                │   │  │                                │
  │ Upgrades, patches, HA,         │   │  │ OS patching, scaling,          │
  │ etcd backups, API cert         │   │  │ node upgrades, app deploys,    │
  │ rotation                       │   │  │ networking config, storage,    │
  └────────────────────────────────┘   │  │ security policies, RBAC        │
                                       │  └────────────────────────────────┘
                                       │
  * GKE Autopilot: Google also manages │  * With node auto-upgrade enabled,
    the worker nodes and their sizing  │    the provider patches node OS

What remains your responsibility in all cases: your application workloads, your RBAC policies, your network policies, your storage configuration, your monitoring, your cost management.

GKE: Google Kubernetes Engine

GKE is the most mature managed Kubernetes service; it is typically the first to adopt new Kubernetes features and the most opinionated about best practices.

Networking. GKE uses a VPC-native networking model with Alias IPs. Each node is allocated a secondary IP range from the VPC. Pods receive IPs from this secondary range. These are real VPC IPs — they are routable within the VPC without overlay networks or encapsulation. This means VPC firewall rules, routes, and VPC peering work natively with pod IPs.

Autopilot mode. GKE offers two modes: Standard (you manage node pools) and Autopilot (Google manages everything, including node provisioning and sizing). In Autopilot mode, you submit workloads and Google provisions the right amount of compute. You pay per pod resource request, not per node. Autopilot enforces security best practices by default: workloads run as non-root, privilege escalation is blocked, and host path mounts are disallowed.

Upgrades. GKE is typically the fastest to support new Kubernetes versions. It offers release channels (Rapid, Regular, Stable) that automatically upgrade the control plane and node pools on a schedule. Surge upgrades create extra nodes to maintain capacity during rolling node upgrades.

Pricing. $0.10/hr for the cluster management fee (Standard mode). Autopilot charges per pod resource request instead.

GKE Strengths

  • Fastest Kubernetes version adoption
  • Autopilot removes node management entirely
  • VPC-native networking eliminates overlay complexity
  • Tight integration with Google Cloud networking (Cloud NAT, Cloud Armor, Internal Load Balancers)
  • Binary Authorization for supply chain security

GKE Weaknesses

  • Smaller ecosystem of third-party integrations compared to AWS
  • Autopilot restrictions may be too opinionated for some workloads
  • Vendor lock-in to GCP networking model

EKS: Amazon Elastic Kubernetes Service

EKS is the most widely used managed Kubernetes service, reflecting AWS’s dominant market position. It is also the most “assembly required” of the three — AWS provides the control plane and expects you to configure everything else.

Networking. EKS uses the AWS VPC CNI plugin, which assigns pods real VPC IP addresses from Elastic Network Interfaces (ENIs). Each EC2 instance has a limit on the number of ENIs it can attach and the number of secondary IPs per ENI. This means pod density is limited by instance type:

Instance Type   Max ENIs   IPs per ENI   Max Pods
t3.nano         2          2             ~4
t3.medium       3          6             ~17
m5.large        3          10            ~29
m5.xlarge       4          15            ~58
m5.24xlarge     15         50            ~737

This is a critical capacity planning consideration. If you run many small pods, you may hit the pod limit before you exhaust CPU or memory. AWS offers prefix delegation to increase pod density by assigning /28 prefixes instead of individual IPs.
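
The arithmetic behind the table is worth internalizing (the +2 accounts for host-network pods such as aws-node and kube-proxy, which do not consume secondary IPs):

# max pods = ENIs x (IPs per ENI - 1) + 2
# m5.large: 3 x (10 - 1) + 2 = 29
# With prefix delegation, each secondary-IP slot yields a /28 prefix
# (16 addresses) instead of a single address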

Node management. EKS offers three options: self-managed nodes (EC2 instances you configure), managed node groups (AWS manages the EC2 lifecycle), and Fargate (serverless pods, similar to GKE Autopilot but per-pod). Karpenter is AWS’s open-source node autoscaler, which provisions right-sized nodes based on pending pod requirements — it is faster and more flexible than the Cluster Autoscaler.

Upgrades. EKS upgrades are the most manual of the three providers. You upgrade the control plane first (one API call or console click), then upgrade each node group separately. There is no automatic release channel for control plane upgrades in the standard configuration — you must actively track Kubernetes versions and initiate upgrades. However, EKS Auto Mode (launched December 2024) manages node and upgrade operations automatically for clusters that opt in.

Pricing. $0.10/hr for the cluster ($72/month). EKS on Fargate adds a per-pod charge.

EKS Strengths

  • Largest ecosystem — most third-party tools are tested on EKS first
  • Deep AWS integration (IAM roles for service accounts, ALB Ingress Controller, EBS CSI driver)
  • Karpenter for intelligent, fast node autoscaling
  • Most flexibility in configuration
  • AWS marketplace of EKS add-ons

EKS Weaknesses

  • Most manual upgrade process
  • VPC CNI pod density limits require careful instance type selection
  • More “assembly required” than GKE or AKS

AKS: Azure Kubernetes Service

AKS differentiates primarily on pricing: the control plane is free in the Free tier. You pay only for the worker node VMs.

Networking. AKS offers two networking models. kubenet is a basic overlay in which pods draw IPs from an address space that is not routable in the virtual network (VNet, Azure’s equivalent of a VPC). Azure CNI assigns pods real VNet IPs, similar to AWS VPC CNI and GKE Alias IPs. Azure CNI Overlay is a newer option that provides Azure CNI features without consuming VNet IPs for every pod.

Upgrades. AKS has a rapid security patching cadence. It supports automatic upgrades through channels (none, patch, stable, rapid, node-image). Node image upgrades can be applied independently from Kubernetes version upgrades.

Pricing. Free tier: $0 for the control plane. Standard tier: $0.10/hr (adds SLA and more features). Premium tier: $0.60/hr (adds long-term support versions).

AKS Strengths

  • Free control plane in Free tier
  • Rapid security patching
  • Strong integration with Azure Active Directory for RBAC
  • Azure Arc extends AKS management to on-premises and other clouds
  • AKS Automatic mode (similar to GKE Autopilot)

AKS Weaknesses

  • Azure networking can be complex (VNet peering, NSG interactions)
  • Historically slower Kubernetes version adoption than GKE
  • Smaller Kubernetes-specific community than AWS

Comparison Table

Feature                   GKE                              EKS                                                      AKS
Control plane cost        $0.10/hr                         $0.10/hr                                                 Free (Free tier)
Serverless pods           Autopilot                        Fargate                                                  Virtual Nodes
Pod networking            Alias IPs (VPC-native)           VPC CNI (ENI-based)                                      Azure CNI or kubenet
Pod IP routable in VPC?   Yes                              Yes                                                      Yes (Azure CNI)
Default node autoscaler   Cluster Autoscaler               Karpenter / CA                                           Cluster Autoscaler / KEDA
Upgrade automation        Release channels                 Manual initiation; EKS Auto Mode (Dec 2024) automates    Upgrade channels
Version adoption speed    Fastest                          Moderate                                                 Moderate
Identity integration      Google IAM + Workload Identity   IAM Roles for Service Accounts                           Azure AD + Workload Identity
Service mesh              Anthos Service Mesh              Istio / Linkerd (App Mesh deprecated Sept 2024)          Istio add-on (Open Service Mesh archived Sept 2023)
GPU support               Yes (multi-GPU, TPU)             Yes (GPU, Inferentia, Trainium)                          Yes (GPU)
Max nodes per cluster     15,000                           5,000 (soft limit)                                       5,000

When to Choose Each

See also Appendix C: Decision Trees for a quick decision flowchart.

Choose GKE when:

  • You want the most automated, opinionated experience
  • You are already on Google Cloud or are starting fresh
  • You want Autopilot to eliminate node management
  • You need fast access to the latest Kubernetes features
  • You are running ML/AI workloads with TPU requirements

Choose EKS when:

  • You are already on AWS (most organizations are)
  • You need maximum flexibility and control
  • Your team has AWS expertise
  • You need deep integration with the AWS ecosystem (Lambda, SQS, DynamoDB)
  • You want Karpenter for intelligent autoscaling

Choose AKS when:

  • You are already on Azure or have an Enterprise Agreement
  • You want a free control plane for dev/test
  • You use Azure Active Directory for identity management
  • You need hybrid cloud with Azure Arc
  • You want the cheapest entry point for learning

Choose self-managed (kubeadm) when:

  • You are on-premises with no cloud option
  • You have strict regulatory requirements about where the control plane runs
  • You are learning Kubernetes internals
  • You need control over every component’s configuration

The Hidden Costs

The control plane fee is the smallest part of the bill. The real costs are:

  • Worker node compute: The VMs or instances running your pods (typically 80-90% of the bill)
  • Load balancers: Each Service of type LoadBalancer creates a cloud load balancer ($15-25/month each)
  • NAT gateways: Required for private clusters to reach the internet ($30-45/month + data processing fees)
  • Persistent storage: EBS volumes, Persistent Disks, Managed Disks ($0.08-0.10/GB/month for SSD)
  • Data transfer: Cross-AZ traffic is charged on all three clouds ($0.01-0.02/GB)
  • Monitoring and logging: CloudWatch, Cloud Monitoring, Azure Monitor charges for ingestion and storage

A “free” AKS control plane cluster running three mid-sized worker VMs (e.g., Standard_D2s_v3) with a load balancer, NAT gateway, and 100 GB of persistent storage will cost approximately $300-400/month before data transfer.

Common Mistakes and Misconceptions

  • “Managed Kubernetes means fully managed.” You still manage worker nodes (unless using Autopilot/Fargate), networking, storage, RBAC, monitoring, and your applications. “Managed” refers primarily to the control plane.
  • “EKS/GKE/AKS clusters are identical to vanilla Kubernetes.” Each provider adds proprietary networking (VPC CNI, Alias IPs), identity (IRSA, Workload Identity), and storage integrations that don’t exist in upstream K8s.
  • “The control plane fee is my main Kubernetes cost.” The $72-74/month control plane fee is typically under 5% of the total bill. Worker node compute, load balancers, NAT gateways, and data transfer dominate costs.

Next: Cloud Networking and Storage

Chapter 17: Cloud Networking and Storage

Kubernetes defines abstractions — Services, PersistentVolumes, Ingress — but it does not implement them. The implementation is provided by cloud-specific controllers and plugins. Understanding how these abstractions map to real cloud infrastructure is the difference between writing YAML that works and writing YAML that works well.

How Pod Networking Maps to Cloud Networking

Recall the flat network model from Chapter 5: every pod gets a unique IP, reachable without NAT.

AWS VPC CNI: Pods as First-Class VPC Citizens

The AWS VPC CNI plugin gives each pod a real VPC IP address. It does this by leveraging Elastic Network Interfaces (ENIs), which are virtual network cards that can be attached to EC2 instances.

AWS VPC CNI: HOW PODS GET IPS
──────────────────────────────

EC2 Instance (m5.large)
┌──────────────────────────────────────────────────┐
│                                                  │
│  Primary ENI (eth0)                              │
│  ┌────────────────────────────────────┐          │
│  │ Primary IP: 10.0.1.100 (node IP)   │          │
│  │ Secondary IP: 10.0.1.101 → Pod A   │          │
│  │ Secondary IP: 10.0.1.102 → Pod B   │          │
│  │ Secondary IP: 10.0.1.103 → Pod C   │          │
│  │ ...up to 10 IPs per ENI            │          │
│  └────────────────────────────────────┘          │
│                                                  │
│  Secondary ENI (eth1)                            │
│  ┌────────────────────────────────────┐          │
│  │ Primary IP: 10.0.1.200             │          │
│  │ Secondary IP: 10.0.1.201 → Pod D   │          │
│  │ Secondary IP: 10.0.1.202 → Pod E   │          │
│  │ ...up to 10 IPs per ENI            │          │
│  └────────────────────────────────────┘          │
│                                                  │
│  Secondary ENI (eth2)                            │
│  ┌────────────────────────────────────┐          │
│  │ Primary IP: 10.0.1.210             │          │
│  │ Secondary IP: 10.0.1.211 → Pod F   │          │
│  │ ...                                │          │
│  └────────────────────────────────────┘          │
│                                                  │
│  m5.large: 3 ENIs x 10 IPs = ~29 max pods        │
│                                                  │
└──────────────────────────────────────────────────┘

Pod A (10.0.1.101) can reach Pod X (10.0.2.55) on another
node directly through VPC routing. No encapsulation.
No overlay. Just VPC route tables.

The IPAMD (IP Address Management Daemon) runs on each node as part of the VPC CNI. It pre-allocates ENIs and warms secondary IPs so that new pods get IPs quickly. When a pod is scheduled, the CNI assigns a pre-warmed IP from the pool.

Advantages: No overlay network. No encapsulation overhead. Pod IPs are routable in the VPC, so VPC security groups, NACLs, VPC Flow Logs, and VPC peering work natively with pod traffic.

Trade-off: Pod density is constrained by the instance type’s ENI and IP limits. A t3.nano can run approximately 4 pods. An m5.large can run approximately 29. This matters: if you run many small pods (sidecars, agents), you may exhaust the IP limit before CPU or memory. Enable prefix delegation to assign /28 prefixes (16 IPs each) instead of individual IPs, dramatically increasing pod density.
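
Prefix delegation is toggled through an environment variable on the VPC CNI’s aws-node DaemonSet; a sketch (Nitro-based instance types are required):

# Enable /28 prefix assignment on the AWS VPC CNI
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true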

GKE Alias IPs: VPC-Native Pods

GKE’s Alias IP model (described in Chapter 16) gives each node a secondary VPC range; pods draw IPs from it without overlay.

On-Premises: Why Overlay Networks Are Necessary

On-premises clusters lack the cloud’s SDN (Software-Defined Networking) layer. The physical network routers do not know about pod CIDRs. An overlay network — VXLAN (used by Flannel, Calico), Geneve (used by Cilium), or IP-in-IP (used by Calico) — encapsulates pod traffic inside packets addressed to node IPs, which the physical network can route.

OVERLAY vs. CLOUD-NATIVE NETWORKING
────────────────────────────────────

Cloud-Native (AWS VPC CNI, GKE Alias IPs):
  Pod A ──► [Packet: src=10.0.1.101, dst=10.0.2.55] ──► VPC Router ──► Pod X
  No encapsulation. Direct routing.

On-Prem Overlay (VXLAN):
  Pod A ──► [Outer: src=192.168.1.10, dst=192.168.1.20]
            [VXLAN header]
            [Inner: src=10.244.1.5, dst=10.244.2.8]     ──► Physical Switch ──► Pod X
  Pod packet wrapped inside a node-to-node packet.
  ~50 bytes overhead per packet. Physical network only sees node IPs.

The overlay approach works everywhere but adds latency (encapsulation/decapsulation), reduces MTU (the inner packet must be smaller than the outer packet), and makes network debugging harder (tcpdump on the physical network shows encapsulated traffic). Cloud-native CNI plugins avoid all of this by integrating with the cloud’s routing layer.

How Storage Maps to Cloud Infrastructure

Kubernetes storage abstractions — PersistentVolumes (PV), PersistentVolumeClaims (PVC), and StorageClasses — map to specific cloud storage services through the Container Storage Interface (CSI).

Storage Access Modes

Access Mode        Abbreviation   Meaning                           Cloud Examples
ReadWriteOnce      RWO            One node can mount read-write     EBS, GCE PD, Azure Managed Disk
ReadOnlyMany       ROX            Many nodes can mount read-only    EBS (snapshot-based), GCE PD
ReadWriteMany      RWX            Many nodes can mount read-write   EFS, Filestore, Azure Files
ReadWriteOncePod   RWOP           One pod can mount read-write      EBS (since CSI spec 1.5)

The most common mistake is requesting RWX for a workload that only needs RWO. Block storage (EBS, GCE PD, Azure Managed Disks) is RWO — a single volume can only be attached to one node at a time. If you need shared storage across multiple pods on different nodes, you must use a file storage service (EFS, Filestore, Azure Files) or a distributed storage system (Ceph, GlusterFS).
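
A sketch of a correctly scoped RWX claim, assuming an EFS CSI driver installed with a StorageClass named efs-sc (both names are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets
spec:
  accessModes:
    - ReadWriteMany          # file storage: many nodes, read-write
  storageClassName: efs-sc   # hypothetical class backed by efs.csi.aws.com
  resources:
    requests:
      storage: 100Gi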

Cloud Storage Mapping

Kubernetes Concept         AWS               GCP                                         Azure
RWO PersistentVolume       EBS (gp3, io2)    GCE Persistent Disk (pd-balanced, pd-ssd)   Azure Managed Disk (Premium SSD, Standard SSD)
RWX PersistentVolume       EFS               Cloud Filestore                             Azure Files
StorageClass provisioner   ebs.csi.aws.com   pd.csi.storage.gke.io                       disk.csi.azure.com
Volume snapshots           EBS snapshots     PD snapshots                                Azure Disk snapshots

The CSI Architecture

CSI (Container Storage Interface) is the standard that allows storage vendors to write plugins for Kubernetes without modifying Kubernetes itself. The architecture has two components deployed differently:

CSI ARCHITECTURE
────────────────

┌─────────────────────────────────────────────────────────┐
│                    CONTROL PLANE                        │
│                                                         │
│  CSI Controller Plugin (Deployment, 1-3 replicas)       │
│  ┌───────────────────────────────────────────────────┐  │
│  │                                                   │  │
│  │  ┌──────────────────┐  ┌──────────────────────┐   │  │
│  │  │ external-        │  │ external-            │   │  │
│  │  │ provisioner      │  │ attacher             │   │  │
│  │  │                  │  │                      │   │  │
│  │  │ Watches PVCs,    │  │ Watches VolumeAttach │   │  │
│  │  │ calls CSI        │  │ objects, calls CSI   │   │  │
│  │  │ CreateVolume()   │  │ ControllerPublish()  │   │  │
│  │  └────────┬─────────┘  └────────┬─────────────┘   │  │
│  │           │                     │                 │  │
│  │  ┌────────▼─────────────────────▼──────────────┐  │  │
│  │  │         CSI Driver (controller mode)        │  │  │
│  │  │                                             │  │  │
│  │  │  Translates CSI calls to cloud API calls:   │  │  │
│  │  │  CreateVolume() → aws ec2 create-volume     │  │  │
│  │  │  ControllerPublish() → aws ec2 attach-vol   │  │  │
│  │  └─────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                    EVERY NODE                           │
│                                                         │
│  CSI Node Plugin (DaemonSet, one per node)              │
│  ┌───────────────────────────────────────────────────┐  │
│  │                                                   │  │
│  │  ┌──────────────────┐                             │  │
│  │  │ node-driver-     │                             │  │
│  │  │ registrar        │  Registers the CSI driver   │  │
│  │  │                  │  with the kubelet           │  │
│  │  └────────┬─────────┘                             │  │
│  │           │                                       │  │
│  │  ┌────────▼───────────────────────────────────┐   │  │
│  │  │         CSI Driver (node mode)             │   │  │
│  │  │                                            │   │  │
│  │  │  NodeStageVolume() → format + mount to     │   │  │
│  │  │                      staging path          │   │  │
│  │  │  NodePublishVolume() → bind mount into     │   │  │
│  │  │                       pod's filesystem     │   │  │
│  │  └────────────────────────────────────────────┘   │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

The Controller Plugin runs as a Deployment (typically 1-3 replicas). It handles volume lifecycle operations that do not require node-level access: creating volumes, deleting volumes, creating snapshots, and attaching volumes to nodes (at the cloud API level).

The Node Plugin runs as a DaemonSet (one per node). It handles operations that require access to the node’s filesystem: formatting the volume, mounting it, and bind-mounting it into the pod’s filesystem.

Sidecar containers bridge between Kubernetes and CSI. They watch Kubernetes API objects and translate them into CSI calls:

  • external-provisioner: Watches PVCs, calls CreateVolume()
  • external-attacher: Watches VolumeAttachment objects, calls ControllerPublishVolume()
  • external-snapshotter: Watches VolumeSnapshot objects, calls CreateSnapshot()
  • external-resizer: Watches PVC size changes, calls ControllerExpandVolume()
  • node-driver-registrar: Registers the CSI driver with kubelet

Dynamic Provisioning Flow

When you create a PVC with a StorageClass, six different components coordinate to turn it into a mounted filesystem inside a running pod. The sequence diagram below shows the full lifecycle:

CSI DYNAMIC PROVISIONING: PVC TO MOUNTED VOLUME
─────────────────────────────────────────────────

  User         API Server    external-       CSI Controller   Cloud API     external-      kubelet /       Pod
  (kubectl)                  provisioner     Plugin           (EBS/GCE)     attacher       CSI Node Plugin
    │              │             │               │               │              │              │              │
    │ create PVC   │             │               │               │              │              │              │
    ├─────────────▶│             │               │               │              │              │              │
    │              │             │               │               │              │              │              │
    │              │  watch:     │               │               │              │              │              │
    │              │  new PVC    │               │               │              │              │              │
    │              ├────────────▶│               │               │              │              │              │
    │              │             │               │               │              │              │              │
    │              │             │ CreateVolume  │               │              │              │              │
    │              │             ├──────────────▶│               │              │              │              │
    │              │             │               │  create disk  │              │              │              │
    │              │             │               ├──────────────▶│              │              │              │
    │              │             │               │  disk ready   │              │              │              │
    │              │             │               │◀──────────────┤              │              │              │
    │              │             │  volume ID    │               │              │              │              │
    │              │             │◀──────────────┤               │              │              │              │
    │              │             │               │               │              │              │              │
    │              │  create PV, │               │               │              │              │              │
    │              │  bind PVC   │               │               │              │              │              │
    │              │◀────────────┤               │               │              │              │              │
    │              │             │               │               │              │              │              │
    │              │  (Pod scheduled to node)    │               │              │              │              │
    │              │             │               │               │              │              │              │
    │              │  watch:     │               │               │              │              │              │
    │              │  VolumeAttachment           │               │              │              │              │
    │              ├────────────────────────────────────────────────────────────▶│              │              │
    │              │             │               │               │              │              │              │
    │              │             │               │ ControllerPublish            │              │              │
    │              │             │               │ (attach disk  │              │              │              │
    │              │             │               │  to node)     │              │              │              │
    │              │             │  ┌────────────┤◀─────────────────────────────┤              │              │
    │              │             │  │            │  attach to    │              │              │              │
    │              │             │  │            ├──────────────▶│              │              │              │
    │              │             │  │            │  attached     │              │              │              │
    │              │             │  │            │◀──────────────┤              │              │              │
    │              │             │  └───────────▶│               │              │              │              │
    │              │             │               │               │              │              │              │
    │              │  kubelet: mount volume      │               │              │              │              │
    │              ├──────────────────────────────────────────────────────────────────────────▶│              │
    │              │             │               │               │              │              │              │
    │              │             │               │ NodeStage +   │              │              │              │
    │              │             │               │ NodePublish   │              │              │              │
    │              │             │               │ (format, mount)              │              │              │
    │              │             │               │◀─────────────────────────────────────────────┤              │
    │              │             │               │  mounted      │              │              │              │
    │              │             │               ├──────────────────────────────────────────────▶│              │
    │              │             │               │               │              │              │              │
    │              │             │               │               │              │  start       │              │
    │              │             │               │               │              │  container   │              │
    │              │             │               │               │              │  with mount  │              │
    │              │             │               │               │              ├─────────────▶│              │
    │              │             │               │               │              │              │   Running    │

The same steps in a compact numbered form:

DYNAMIC PROVISIONING FLOW
──────────────────────────

1. User creates PVC
   ┌──────────────────────────────┐
   │ kind: PersistentVolumeClaim  │
   │ spec:                        │
   │   storageClassName: gp3      │
   │   resources:                 │
   │     requests:                │
   │       storage: 50Gi          │
   └──────────┬───────────────────┘
              │
              ▼
2. external-provisioner sees unbound PVC
   with storageClassName matching its driver
              │
              ▼
3. Calls CSI CreateVolume() → cloud creates EBS volume
              │
              ▼
4. Creates PV object bound to the PVC
              │
              ▼
5. Pod is scheduled to a node
              │
              ▼
6. external-attacher sees VolumeAttachment →
   calls CSI ControllerPublishVolume() →
   cloud attaches EBS to EC2 instance
              │
              ▼
7. Node plugin: NodeStageVolume() formats + mounts
              │
              ▼
8. Node plugin: NodePublishVolume() bind-mounts into pod
              │
              ▼
9. Pod sees /data with 50Gi filesystem

WaitForFirstConsumer: Why It Matters

StorageClasses have a volumeBindingMode field with two options:

  • Immediate: The volume is created as soon as the PVC is created.
  • WaitForFirstConsumer: The volume is not created until a pod using the PVC is scheduled.

WaitForFirstConsumer is critical for availability-zone-aware storage. EBS volumes, GCE PDs, and Azure Managed Disks are zonal — they exist in a specific availability zone. If a PVC creates an EBS volume in us-east-1a immediately, but the scheduler places the pod in us-east-1b, the volume cannot be attached. WaitForFirstConsumer delays volume creation until the pod is scheduled, so the volume is created in the same AZ as the node.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-waitforfirstconsumer
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
allowVolumeExpansion: true

Always use WaitForFirstConsumer for zonal block storage. The only exception is if you are running a single-AZ cluster.

Volume Snapshots

CSI volume snapshots allow point-in-time copies of PersistentVolumes. The workflow uses three objects:

# 1. VolumeSnapshotClass (like StorageClass but for snapshots)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete

---
# 2. VolumeSnapshot (request a snapshot of an existing PVC)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-app-snapshot
spec:
  volumeSnapshotClassName: ebs-snapshot-class
  source:
    persistentVolumeClaimName: my-app-data

---
# 3. Restore from snapshot (create a PVC from the snapshot)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data-restored
spec:
  storageClassName: gp3
  dataSource:
    name: my-app-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi

This is the foundation for backup workflows. Tools like Velero use CSI snapshots internally.

Common Mistakes and Misconceptions

  • “All storage classes perform the same.” gp3 vs io2 vs local NVMe have vastly different IOPS, throughput, and cost profiles. Match storage class to workload requirements, especially for databases.
  • “Cross-AZ traffic is free.” All three major clouds charge $0.01-0.02/GB for cross-AZ data transfer. High-traffic services with pods spread across AZs can accumulate significant costs.
  • “I should use one big VPC for everything.” Separate VPCs (or at least subnets) for dev/staging/production provide network-level isolation. VPC peering connects them when needed.

Next: Your First Workloads

Chapter 18: Your First Workloads

This chapter is hands-on. Every YAML example is complete — you can apply it to a running cluster and observe the result.

Exercise 1: Deployment + Service

A Deployment manages a set of identical pods. A Service provides a stable network endpoint to reach them. Together, they are the fundamental building block of every Kubernetes application.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: default
  labels:
    app: web-app
spec:
  replicas: 3                    # Run 3 identical pods
  selector:
    matchLabels:
      app: web-app               # The Deployment manages pods with this label
  template:                      # Pod template: every pod is created from this
    metadata:
      labels:
        app: web-app             # Must match selector.matchLabels
    spec:
      containers:
        - name: web
          image: nginx:1.27.3    # Always pin a specific version. Never use :latest.
          ports:
            - containerPort: 80
              protocol: TCP
          resources:
            requests:            # Scheduler uses these for placement decisions
              cpu: 100m          # 100 millicores = 0.1 CPU core
              memory: 128Mi      # 128 mebibytes
            limits:              # Hard ceiling the container cannot exceed
              cpu: 250m
              memory: 256Mi
          readinessProbe:        # Pod is added to Service endpoints only when ready
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:         # Pod is restarted if this fails
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 15
            periodSeconds: 20

Key fields explained:

  • spec.replicas: The desired number of pod instances. The Deployment controller continuously reconciles the actual count to match this number.
  • spec.selector.matchLabels: How the Deployment identifies which pods it owns. This must match the pod template labels. If it does not, the API server rejects the Deployment.
  • spec.template: The blueprint for each pod. Every pod created by this Deployment is identical (same image, same resources, same probes).
  • resources.requests: The minimum resources the scheduler guarantees. A pod with 100m CPU requests is guaranteed 0.1 cores. The scheduler will not place the pod on a node that cannot satisfy this request.
  • resources.limits: The maximum resources the container can use. Exceeding CPU limits causes throttling (the container is slowed down). Exceeding memory limits causes OOMKill (the container is terminated).

Now the Service:

apiVersion: v1
kind: Service
metadata:
  name: web-app
  namespace: default
spec:
  type: ClusterIP              # Internal-only. Reachable within the cluster.
  selector:
    app: web-app               # Route traffic to pods with this label
  ports:
    - port: 80                 # The port the Service listens on
      targetPort: 80           # The port on the pod to forward to
      protocol: TCP

The Service creates a stable virtual IP (ClusterIP) that load-balances across all pods matching the selector. When pods are created, destroyed, or become unready, the Service automatically updates its endpoints. This decouples clients from the pod lifecycle.

flowchart TD
    Client["Client Pod"] -- "GET http://web-app/" --> Service["Service<br>ClusterIP: 10.96.45.12"]
    Service -- "load balance" --> Pod1["Pod 1<br>10.244.1.5:80"]
    Service -- "load balance" --> Pod2["Pod 2<br>10.244.2.8:80"]
    Service -- "load balance" --> Pod3["Pod 3<br>10.244.1.6:80"]

kube-proxy maintains iptables/IPVS rules that distribute traffic across healthy pods.

Apply both:

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl get pods -l app=web-app
kubectl get endpoints web-app

The endpoints command shows which pod IPs are currently backing the Service. Pods that fail their readiness probe are removed from endpoints.

Exercise 2: Scaling

Scaling a Deployment is a single field change:

kubectl scale deployment web-app --replicas=5

Or declaratively, change spec.replicas: 5 and kubectl apply. The Deployment controller creates 2 new pods. The scheduler places them on nodes with available resources. The Service automatically includes them in its endpoint list once they pass their readiness probe.

Scale down to 2:

kubectl scale deployment web-app --replicas=2

The Deployment controller selects 3 pods for termination. Kubernetes sends SIGTERM, waits for terminationGracePeriodSeconds (default 30 seconds), then sends SIGKILL. During this window, the pod is removed from Service endpoints so it stops receiving new traffic.
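
If your application needs more time to drain in-flight requests, lengthen the grace period and add a short preStop sleep so that endpoint removal has time to propagate before the process receives SIGTERM. A minimal sketch (the sleep duration and 60-second grace period are illustrative values):

# In the pod template spec:
terminationGracePeriodSeconds: 60
containers:
  - name: web
    image: nginx:1.27.3
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "10"]   # let endpoint updates propagate first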

Exercise 3: Rolling Updates

Change the image version to trigger a rolling update:

kubectl set image deployment/web-app web=nginx:1.27.4

Or change the image in the YAML and kubectl apply. The Deployment controller performs a rolling update controlled by two parameters:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # At most 1 extra pod above desired count
      maxUnavailable: 0    # Zero pods can be unavailable during update

ROLLING UPDATE (replicas=3, maxSurge=1, maxUnavailable=0)
──────────────────────────────────────────────────────────

Time    Old Pods (v1)    New Pods (v2)    Total Running
────    ─────────────    ─────────────    ─────────────
 t0     [A] [B] [C]                       3 (all v1)
 t1     [A] [B] [C]     [D]creating       3 + 1 surge
 t2     [A] [B] [C]     [D]ready          4 (surge = 1)
 t3     [A] [B]  X      [D]               3 (C terminated)
 t4     [A] [B]         [D] [E]creating   3 + 1 surge
 t5     [A] [B]         [D] [E]ready      4
 t6     [A]  X          [D] [E]           3 (B terminated)
 t7     [A]             [D] [E] [F]creat  3 + 1 surge
 t8     [A]             [D] [E] [F]ready  4
 t9      X              [D] [E] [F]       3 (all v2)

maxSurge: 1 means at most 1 extra pod can exist above the desired replica count. This provides capacity during the transition.

maxUnavailable: 0 means every old pod must be replaced by a ready new pod before it is terminated. This ensures zero downtime. The trade-off is that the update requires temporarily running 4 pods (3 desired + 1 surge), which needs extra cluster capacity.

Alternative strategies:

  • maxSurge: 0, maxUnavailable: 1: No extra pods, but one pod is unavailable during each step. Saves resources, risks reduced capacity.
  • maxSurge: 25%, maxUnavailable: 25%: The default. Balances speed and availability.

Roll back if something goes wrong:

kubectl rollout undo deployment/web-app
kubectl rollout status deployment/web-app
kubectl rollout history deployment/web-app

Exercise 4: ConfigMaps and Secrets

Configuration should be separated from container images. ConfigMaps hold non-sensitive configuration. Secrets hold sensitive data (passwords, tokens, certificates).

apiVersion: v1
kind: ConfigMap
metadata:
  name: web-config
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  config.json: |
    {
      "database_pool_size": 10,
      "cache_ttl_seconds": 300,
      "feature_flags": {
        "new_dashboard": true
      }
    }
---
apiVersion: v1
kind: Secret
metadata:
  name: web-secrets
type: Opaque
stringData:                    # stringData accepts plain text (base64 encoded on save)
  DATABASE_URL: "postgres://user:pass@db:5432/myapp"
  API_KEY: "sk-abc123secret"

Mount as volumes, not environment variables. This is a best practice for two reasons:

  1. Volume mounts can be updated without restarting the pod (if subPath is not used).
  2. Environment variables are exposed in kubectl describe pod, process listings, and crash dumps. Volume-mounted files are more contained.

# In the Deployment spec.template.spec:
containers:
  - name: web
    image: my-app:v1.2.0
    volumeMounts:
      - name: config-volume
        mountPath: /etc/app/config
        readOnly: true
      - name: secret-volume
        mountPath: /etc/app/secrets
        readOnly: true
volumes:
  - name: config-volume
    configMap:
      name: web-config
  - name: secret-volume
    secret:
      secretName: web-secrets
      defaultMode: 0400         # Read-only by owner

The application reads /etc/app/config/config.json and /etc/app/secrets/DATABASE_URL as files. When the ConfigMap is updated, the kubelet updates the mounted files within 1-2 minutes (the sync period). The application must watch for file changes or be signaled to reload.

Note: Kubernetes Secrets are base64-encoded, not encrypted at rest by default. For actual security, enable encryption at rest (EncryptionConfiguration) or use an external secret store (AWS Secrets Manager, HashiCorp Vault) with the External Secrets Operator.
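
For self-managed control planes, encryption at rest is configured with a file passed to the API server via --encryption-provider-config. A minimal sketch, where the key is a placeholder you generate yourself:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>   # e.g. head -c 32 /dev/urandom | base64
      - identity: {}    # fallback so old, unencrypted Secrets remain readable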

Exercise 5: Resource Requests, Limits, and QoS

Understanding the difference between CPU and memory limits is fundamental to running stable workloads.

CPU is compressible. When a container exceeds its CPU limit, it is throttled — the kernel’s CFS (Completely Fair Scheduler) restricts the container’s CPU time. The container runs slower but continues to run. It is never killed for using too much CPU.

Memory is non-compressible. When a container exceeds its memory limit, it is OOMKilled — the kernel’s OOM killer terminates the process. There is no way to “slow down” memory usage. The container either fits in its limit or it dies.

CPU vs MEMORY: WHAT HAPPENS WHEN YOU EXCEED LIMITS
───────────────────────────────────────────────────

CPU (compressible):
  ┌─────────┐     ┌─────────┐     ┌─────────┐
  │ Request │     │ Using   │     │  Limit  │
  │  100m   │ ... │  300m   │ ... │  250m   │
  └─────────┘     └────┬────┘     └─────────┘
                       │
                  Container is THROTTLED.
                  Runs slower. Not killed.
                  CFS quota enforced.

Memory (non-compressible):
  ┌─────────┐     ┌─────────┐     ┌─────────┐
  │ Request │     │ Using   │     │  Limit  │
  │  128Mi  │ ... │  300Mi  │ ... │  256Mi  │
  └─────────┘     └────┬────┘     └─────────┘
                       │
                  Container is OOMKilled.
                  Exit code 137 (128 + SIGKILL=9).
                  Pod restarts (CrashLoopBackOff if repeated).

QoS classes are assigned automatically based on resource configuration:

| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | Every container has requests == limits for both CPU and memory | Last to be evicted |
| Burstable | At least one container has a request or limit set, but they are not all equal | Middle priority |
| BestEffort | No requests or limits set on any container | First to be evicted |

When a node runs out of memory, the kubelet evicts pods in order: BestEffort first, then Burstable (sorted by how much they exceed their requests), then Guaranteed (only under extreme pressure). Always set both requests and limits. Setting them equal gives you Guaranteed QoS — the strongest protection against eviction.
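
The assigned class is visible on the pod's status, so you can verify what you actually got:

kubectl get pod web-app-7d4f8b6c9-x2z4p -o jsonpath='{.status.qosClass}'
# Guaranteed, Burstable, or BestEffort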

Exercise 6: Ingress

A Service of type ClusterIP is only reachable inside the cluster. Ingress exposes HTTP/HTTPS routes from outside the cluster to Services inside the cluster.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx       # Which Ingress controller handles this
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls-cert  # Secret containing TLS cert and key
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app
                port:
                  number: 80
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 8080

Ingress requires an Ingress controller — a pod that watches Ingress resources and configures a reverse proxy (typically NGINX, Traefik, or HAProxy). The Ingress resource itself is just configuration; the controller is the data plane that routes traffic.

INGRESS TRAFFIC FLOW
────────────────────

Internet
    │
    ▼
Load Balancer (cloud LB or NodePort)
    │
    ▼
Ingress Controller Pod (NGINX)
    │
    ├── Host: app.example.com, Path: /     → Service: web-app:80
    │                                         → Pod 10.244.1.5:80
    │                                         → Pod 10.244.2.8:80
    │
    └── Host: app.example.com, Path: /api  → Service: api-service:8080
                                              → Pod 10.244.1.9:8080

Install an Ingress controller (it is not included by default):

# NGINX Ingress Controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.12.0/deploy/static/provider/cloud/deploy.yaml

Putting It All Together

A complete application typically combines all of the above:

COMPLETE APPLICATION STACK
──────────────────────────

Ingress (app.example.com)
    │
    ▼
Service (ClusterIP)
    │
    ├──► Pod 1 ──► ConfigMap (config files)
    │              Secret (credentials)
    │              PVC (persistent data)
    │
    ├──► Pod 2 ──► (same mounts)
    │
    └──► Pod 3 ──► (same mounts)

Apply resources in dependency order: Namespace, ConfigMap, Secret, PVC, Deployment, Service, Ingress. Or put them all in one file separated by ---: kubectl applies the documents in file order, and because controllers reconcile continuously, transient ordering hiccups (such as a Deployment referencing a ConfigMap defined later in the file) resolve on their own once everything is applied.

Common Mistakes and Misconceptions

  • “Using the latest tag is convenient and fine.” latest is mutable — it can point to different images over time. This breaks reproducibility and rollbacks. Always use specific version tags or digests.
  • “Pods don’t need resource requests and limits.” Without requests, the scheduler can’t make good placement decisions. Without limits, a single pod can consume all node resources and crash neighbors.
  • “Restarting a Deployment means deleting and recreating it.” Use kubectl rollout restart deployment/<name> to trigger a rolling restart without downtime or losing the Deployment’s history.
  • “I should use kubectl run for everything.” kubectl run creates bare pods without a controller. Use Deployments for services (self-healing, rolling updates) and Jobs for batch work.

When things go wrong, see Appendix D: Troubleshooting Quick Reference for a mapping of error messages to root causes.


Chapter 19: Debugging Kubernetes

Kubernetes failures are often opaque. A pod does not start, a service does not route traffic, a node disappears — and the system gives you a status word and expects you to figure out the rest. This chapter builds a systematic debugging methodology and a reference for the most common failure modes. For a quick-reference cheat sheet of errors and fixes, see Appendix D: Troubleshooting Quick Reference.

The Debugging Workflow

Every Kubernetes debugging session follows the same escalation path:

THE DEBUGGING ESCALATION PATH
──────────────────────────────

  kubectl get        What exists? What state is it in?
       │
       ▼
  kubectl describe   Why is it in that state? What events occurred?
       │
       ▼
  kubectl logs       What did the application say?
       │
       ▼
  kubectl exec       Get inside the container and investigate
       │
       ▼
  kubectl debug      Cannot exec? Use an ephemeral debug container
       │
       ▼
  Node-level debug   SSH to the node, check kubelet logs, check runtime

Step 1: kubectl get — What Exists?

Start broad and narrow down.

# Overview of all resources in a namespace
kubectl get all -n my-namespace

# Pods with extra detail
kubectl get pods -n my-namespace -o wide

# Watch pods in real time
kubectl get pods -n my-namespace -w

# Filter by label
kubectl get pods -l app=web-app -o wide

The -o wide flag shows node placement and pod IPs. The -w flag watches for changes in real time — invaluable for observing rolling updates, scaling events, or crash loops.

Step 2: kubectl describe — Why?

kubectl describe shows the full history of an object: its current spec, its status, and the events that affected it. Events are the single most important debugging data source in Kubernetes.

kubectl describe pod web-app-7d4f8b6c9-x2z4p

The output includes:

  • Status: The pod’s current phase (Pending, Running, Succeeded, Failed, Unknown)
  • Conditions: Ready, Initialized, ContainersReady, PodScheduled — each with a reason if false
  • Container state: Waiting (with reason), Running, or Terminated (with exit code)
  • Events: Time-ordered log of what happened to this pod

Events decay after 1 hour by default. If you are debugging something that happened hours ago, events may be gone. Use a monitoring system to persist events (more on this in Chapter 20).

Step 3: kubectl logs — What Did the Application Say?

# Current logs
kubectl logs web-app-7d4f8b6c9-x2z4p

# Previous container's logs (after a restart)
kubectl logs web-app-7d4f8b6c9-x2z4p --previous

# Follow logs in real time
kubectl logs web-app-7d4f8b6c9-x2z4p -f

# Logs from a specific container in a multi-container pod
kubectl logs web-app-7d4f8b6c9-x2z4p -c sidecar

# Logs from all pods matching a label
kubectl logs -l app=web-app --all-containers

The --previous flag is critical for CrashLoopBackOff debugging. The current container has just started (and may have nothing useful in its logs yet), but the previous container’s logs show why it crashed.

Step 4: kubectl exec — Get Inside

# Interactive shell
kubectl exec -it web-app-7d4f8b6c9-x2z4p -- /bin/sh

# Run a single command
kubectl exec web-app-7d4f8b6c9-x2z4p -- cat /etc/app/config/config.json

# Check DNS resolution from inside the pod
kubectl exec web-app-7d4f8b6c9-x2z4p -- nslookup my-service

# Check network connectivity
kubectl exec web-app-7d4f8b6c9-x2z4p -- wget -qO- http://my-service:8080/health

Step 5: kubectl debug — When exec Is Not Enough

Many production images are distroless — they contain only the application binary, with no shell, no curl, no debugging tools. You cannot exec into something that has no shell.

Ephemeral debug containers solve this. They inject a temporary container into a running pod that shares the pod’s network namespace (and optionally its process namespace).

# Attach a debug container with networking tools
kubectl debug -it web-app-7d4f8b6c9-x2z4p \
  --image=nicolaka/netshoot \
  --target=web

# The --target flag shares the process namespace with the specified container
# You can now see the target container's processes with ps aux

The debug container runs alongside the existing containers in the same pod. It shares the network namespace (same IP, same ports) but has its own filesystem with the debugging tools you need. When you exit, the ephemeral container is cleaned up.

You can also debug nodes:

# Create a debugging pod on a specific node
kubectl debug node/worker-1 -it --image=ubuntu

# This creates a pod with hostPID, hostNetwork, and the node's
# filesystem mounted at /host. You can inspect the node as if
# you had SSH access.

Understanding Pod Status

Pod status words are the first signal in any debugging session. Here is what each one means and how to investigate it.

POD LIFECYCLE
─────────────

  Pending ──► Running ──► Succeeded
     │            │
     │            └──► Failed
     │
     └──► (stuck here: scheduling or volume issues)


  Container States:
  Waiting ──► Running ──► Terminated
     │                        │
     │                        └──► (exit code 0 = success)
     │                        └──► (exit code non-zero = error)
     │
     └──► CrashLoopBackOff (repeated Terminated → Waiting cycle)

Status Reference Table

| Status | Likely Cause | Diagnostic Command |
|---|---|---|
| Pending (no events) | No node has enough resources | kubectl describe pod — look for “Insufficient cpu/memory” in events |
| Pending (FailedScheduling) | Node selector, affinity, or taint preventing scheduling | kubectl describe pod — check node affinity/selector rules and taints |
| Pending (volume) | PVC unbound, StorageClass missing, or AZ mismatch | kubectl get pvc and kubectl describe pvc |
| ContainerCreating (stuck) | Image pull in progress, or volume mount failing | kubectl describe pod — check events for pull progress or mount errors |
| ImagePullBackOff | Wrong image name, tag does not exist, or registry auth failure | kubectl describe pod — read the exact error. Check image name and imagePullSecrets |
| CrashLoopBackOff | Container starts and immediately exits | kubectl logs --previous — read the application’s error output |
| CrashLoopBackOff (exit 1) | Application error (bad config, missing dependency) | kubectl logs --previous and check ConfigMap/Secret mounts |
| CrashLoopBackOff (exit 137) | OOMKilled — container exceeded memory limit | kubectl describe pod — look for “OOMKilled”. Increase memory limit or fix memory leak |
| CrashLoopBackOff (exit 139) | Segfault in the application | kubectl logs --previous — check for native crash logs |
| Running but not ready | Readiness probe failing | kubectl describe pod — check readiness probe events |
| OOMKilled | Memory limit exceeded | kubectl describe pod — confirm OOMKilled reason. Check resources.limits.memory |
| Evicted | Node under memory or disk pressure | kubectl describe pod — check eviction reason. kubectl describe node — check conditions |
| Terminating (stuck) | Finalizers blocking deletion, or process ignoring SIGTERM | kubectl get pod -o json \| jq '.metadata.finalizers' |
| Unknown | Kubelet on the node is not reporting | kubectl get nodes — check if the node is NotReady. Investigate the node. |
| Error (on Job/CronJob) | Container exited with non-zero exit code | kubectl logs <pod> |

Common Failure Patterns

Pattern 1: DNS Resolution Failure

Symptom: Application logs show connection timeouts or “name not found” errors for service names.

Diagnosis:

# Check if CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Test DNS from inside a pod
kubectl exec -it debug-pod -- nslookup kubernetes.default
kubectl exec -it debug-pod -- nslookup my-service.my-namespace.svc.cluster.local

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

Common causes:

  • CoreDNS pods are not running
  • The pod’s DNS policy is misconfigured
  • A NetworkPolicy is blocking DNS traffic (port 53 UDP/TCP to the kube-dns Service)

Pattern 2: Service Not Routing Traffic

Symptom: Requests to a Service ClusterIP time out or return connection refused.

Diagnosis:

# Check if the Service has endpoints
kubectl get endpoints my-service

# If endpoints list is empty:
# 1. Check that pods exist with the right labels
kubectl get pods -l app=web-app
# 2. Check that pods are Ready
kubectl get pods -l app=web-app -o jsonpath='{.items[*].status.conditions}'
# 3. Check that the Service selector matches the pod labels
kubectl get svc my-service -o yaml | grep -A5 selector

The most common cause is a selector mismatch — the Service’s spec.selector labels do not match the pod’s metadata.labels. This is a silent failure: no error, just no traffic.

Pattern 3: Node NotReady

Symptom: kubectl get nodes shows a node in NotReady status.

Diagnosis:

# Check node conditions
kubectl describe node worker-1

# Look for conditions:
#   MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable

# If you can SSH to the node:
# Check kubelet status
systemctl status kubelet
journalctl -u kubelet -n 100

# Check container runtime
systemctl status containerd
crictl ps

Common causes:

  • kubelet crashed
  • Container runtime is down
  • The node ran out of disk space
  • Network connectivity to the API server was lost

Pattern 4: Persistent Volume Claim Stuck in Pending

Symptom: PVC stays in Pending state indefinitely.

Diagnosis:

kubectl describe pvc my-claim

# Look for events like:
# - "no persistent volumes available for this claim"
# - "storageclass not found"
# - "waiting for first consumer to be created before binding"

Common causes:

  • StorageClass does not exist
  • The CSI driver is not installed
  • WaitForFirstConsumer volume binding mode is waiting for a pod to be scheduled
  • The requested storage exceeds available capacity

Pattern 5: Intermittent OOMKills

Symptom: Pods restart periodically with exit code 137.

Diagnosis:

# Confirm OOMKill
kubectl describe pod my-pod | grep -A5 "Last State"

# Check current memory usage
kubectl top pod my-pod

# Check the memory limit
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources.limits.memory}'

The fix is either to increase the memory limit or to fix the memory leak in the application. If kubectl top shows memory growing over time without plateauing, suspect a leak. If it grows to a stable level that exceeds the limit, the limit is too low.
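
Raising the limit is a one-line patch (the value here is illustrative); for a durable fix, change the manifest in version control:

kubectl patch deployment web-app --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "512Mi"}]'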

Advanced: Reading Events Cluster-Wide

Events are namespaced objects. To see all events across the cluster:

# All events in a namespace, sorted by time
kubectl get events -n my-namespace --sort-by='.lastTimestamp'

# All events cluster-wide
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

# Watch for new events in real time
kubectl get events -n my-namespace -w

# Filter events by type (Warning events are usually the interesting ones)
kubectl get events -n my-namespace --field-selector type=Warning

Events are the system’s audit trail. When something goes wrong, the event stream usually tells you what happened, when, and why — if you look quickly enough before the events expire.

Common Mistakes and Misconceptions

  • “kubectl logs shows everything.” Logs only show stdout/stderr from the current (or previous with -p) container. For multi-container pods, specify -c container-name. For crashed init containers, use -c init-container-name.
  • “If the pod is Running, it’s healthy.” Running means the container process is alive, not that it’s serving traffic correctly. Readiness probes determine if a pod receives traffic; a Running pod can be unready.
  • “kubectl exec is safe in production.” exec gives shell access to running containers, bypassing RBAC audit trails for in-container actions. Use it for debugging but not as a routine operational tool. Audit exec usage.
  • “Events tell the full story.” Events are garbage-collected after 1 hour by default. For historical debugging, you need persistent logging (Loki, CloudWatch, etc.).


Chapter 20: Production Readiness

A cluster that runs workloads is not the same as a cluster that is ready for production. Production readiness is a checklist of capabilities that, taken together, ensure your cluster is observable, secure, recoverable, and cost-efficient.

The Production Readiness Checklist

PRODUCTION READINESS
────────────────────

  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
  │  Monitoring  │  │   Logging    │  │   Security   │
  │  Prometheus  │  │  Loki / EFK  │  │  RBAC, PSS,  │
  │  Grafana     │  │              │  │  NetworkPol  │
  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
         │                 │                 │
  ┌──────▼───────┐  ┌──────▼───────┐  ┌──────▼───────┐
  │   Backup     │  │   Health     │  │  Resource    │
  │   Velero     │  │   Probes     │  │  Management  │
  │   etcd snap  │  │   PDBs       │  │  QoS, Quotas │
  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
         │                 │                 │
         └─────────────────┼─────────────────┘
                           │
                    ┌──────▼───────┐
                    │     Cost     │
                    │  Management  │
                    │  Labels,     │
                    │  Kubecost    │
                    └──────────────┘

Monitoring: Prometheus + Grafana

Monitoring answers one question: “Is my system healthy right now, and if not, where is the problem?”

kube-prometheus-stack

The kube-prometheus-stack Helm chart deploys a complete monitoring pipeline:

  • Prometheus: Scrapes metrics from all Kubernetes components, node exporters, and application pods
  • Grafana: Dashboards for visualization and alerting
  • Alertmanager: Routes alerts to Slack, PagerDuty, email, or other channels
  • node-exporter: DaemonSet that exports node-level metrics (CPU, memory, disk, network)
  • kube-state-metrics: Exports Kubernetes object state as metrics (pod status, deployment replicas, PVC capacity)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=changeme

This single command deploys 5+ components with pre-configured dashboards and alert rules. The default dashboards cover node health, pod resource usage, API server latency, etcd performance, and CoreDNS metrics.
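
To get Prometheus scraping your own application, the usual route with this stack is a ServiceMonitor. A minimal sketch, assuming your Service exposes a port named metrics and the Helm release is named monitoring (the release label is how this stack's Prometheus discovers monitors):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-app
  namespace: monitoring
  labels:
    release: monitoring          # must match the kube-prometheus-stack release name
spec:
  selector:
    matchLabels:
      app: web-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics              # named port on the Service
      interval: 30s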

What to Monitor

| Layer | Key Metrics | Why |
|---|---|---|
| Nodes | CPU utilization, memory utilization, disk I/O, network I/O | Detect resource exhaustion before it causes evictions |
| Pods | CPU usage vs request, memory usage vs limit, restart count | Detect misconfigured resource limits and crash loops |
| API Server | Request latency (p99), error rate, request count | The API server is the heart of the cluster |
| etcd | Disk fsync duration, leader elections, DB size | etcd performance directly affects cluster responsiveness |
| Application | Request latency, error rate, throughput (RED metrics) | Your users care about application health, not node health |

Golden Signals

Monitor the four golden signals for every service:

  1. Latency: How long requests take (distinguish successful vs failed requests)
  2. Traffic: How many requests per second
  3. Errors: How many requests fail
  4. Saturation: How full is the system (CPU, memory, queue depth)

Logging: Loki or EFK

Metrics tell you that something is wrong. Logs tell you why.

Option 1: Loki

Loki is a log aggregation system designed for Kubernetes. Unlike Elasticsearch, Loki indexes only labels, not full text. This makes it an order of magnitude cheaper to operate while remaining fast for label-based queries (which is how you search logs in Kubernetes: by pod, namespace, container, node).

helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set grafana.enabled=false    # Use the Grafana from kube-prometheus-stack

Promtail runs as a DaemonSet, reads container logs from /var/log/pods/, and ships them to Loki with Kubernetes labels attached.

Option 2: EFK Stack (Elasticsearch + Fluentd + Kibana)

The traditional choice. Elasticsearch provides full-text search, which is more powerful than Loki’s label-based queries. The trade-off is operational complexity: Elasticsearch clusters require significant memory, careful index management, and regular maintenance.

Choose Loki if you want simplicity and cost efficiency. Choose EFK if you need full-text search across log content.

Security

Kubernetes security is defense in depth: multiple layers, each reducing the attack surface.

RBAC: Principle of Least Privilege

Every human user, service account, and CI/CD pipeline should have the minimum permissions required for their function. Never use cluster-admin for applications.

# A Role that allows reading pods and logs in a specific namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: my-app
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# Bind the role to a service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: my-app
subjects:
  - kind: ServiceAccount
    name: my-app-sa
    namespace: my-app
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Key RBAC principles:

  • Use Roles (namespaced) over ClusterRoles (cluster-wide) whenever possible
  • Never grant * (all) verbs unless absolutely necessary
  • Audit RBAC regularly: kubectl auth can-i --list --as=system:serviceaccount:my-app:my-app-sa
  • Use kubectl auth can-i create deployments --as=jane to test permissions

NetworkPolicies: Default Deny

By default, every pod can communicate with every other pod — a compromised pod can reach the entire cluster network.

Start with a default-deny policy in every namespace, then explicitly allow the traffic you need:

# Default deny all ingress and egress in a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: my-app
spec:
  podSelector: {}              # Applies to ALL pods in the namespace
  policyTypes:
    - Ingress
    - Egress

---
# Allow the web pods to receive traffic on port 80
# and make DNS queries (port 53) and reach the database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-traffic
  namespace: my-app
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - port: 80
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - port: 5432
    - to:                       # Allow DNS
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP

Note: NetworkPolicies require a CNI plugin that supports them (Calico, Cilium, Weave). Flannel does not enforce NetworkPolicies.

Pod Security Standards

Pod Security Standards (PSS) replace the deprecated PodSecurityPolicy. They define three levels:

| Level | Description | Key Restrictions |
|---|---|---|
| Privileged | Unrestricted | None |
| Baseline | Minimally restrictive | No hostNetwork, no hostPID, no privileged containers |
| Restricted | Heavily restricted | Must run as non-root, drop ALL capabilities (only NET_BIND_SERVICE may be added back), allowPrivilegeEscalation: false, seccomp RuntimeDefault or Localhost |

Apply them at the namespace level:

apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted

Every namespace running application workloads should enforce at least baseline. Use restricted for workloads that do not need elevated privileges.
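
Before enforcing a stricter level on an existing namespace, a server-side dry run of the label change reports which running pods would violate it:

kubectl label --dry-run=server --overwrite ns my-app \
  pod-security.kubernetes.io/enforce=restricted
# Warnings list every existing pod that would violate the restricted profile.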

Image Scanning

Scan container images for known vulnerabilities before deploying them. Trivy is the most widely used open-source scanner:

# Scan an image locally
trivy image nginx:1.27.3

# Integrate into CI/CD to fail builds with critical vulnerabilities
trivy image --exit-code 1 --severity CRITICAL my-app:v1.2.0

For continuous in-cluster scanning, deploy Trivy Operator, which scans running workloads and reports vulnerabilities as Kubernetes custom resources.

Backup: Velero + etcd Snapshots

etcd Snapshots

etcd contains the entire cluster state. Regular snapshots are non-negotiable:

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Managed Kubernetes services handle etcd backups automatically. For self-managed clusters, automate this with a CronJob or systemd timer.
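
Verify each snapshot after taking it. Newer etcd versions (v3.5+) ship etcdutl for offline operations like this, replacing the deprecated etcdctl snapshot status:

etcdutl snapshot status /backup/etcd-snapshot.db --write-out=table
# Reports hash, revision, total keys, and size. A zero-key snapshot is a red flag.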

Velero

Velero backs up Kubernetes resources (YAML manifests) and persistent volume data (via CSI snapshots). It can restore entire namespaces or specific resources to the same or a different cluster.

# Install Velero
velero install --provider aws --bucket my-backup-bucket \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true \
  --plugins velero/velero-plugin-for-aws:v1.10.0

# Create a backup of a namespace
velero backup create my-app-backup --include-namespaces my-app

# Schedule daily backups with 30-day retention
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces my-app \
  --ttl 720h

Test your restores regularly. A backup that has never been tested is not a backup — it is a hope.
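
A restore test can be as simple as restoring into a scratch namespace and checking that the workload comes up (names here are illustrative):

velero restore create test-restore \
  --from-backup my-app-backup \
  --namespace-mappings my-app:my-app-restore-test
velero restore describe test-restore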

Health Probes: Readiness vs. Liveness vs. Startup

These three probes serve different purposes. Conflating them is one of the most common production mistakes.

| Probe | Purpose | What Happens on Failure | When to Use |
|---|---|---|---|
| Readiness | Is the pod ready to serve traffic? | Removed from Service endpoints (stops receiving traffic) | Always. Check that the app can serve requests. |
| Liveness | Is the pod stuck in an unrecoverable state? | Pod is restarted | Only when the app can deadlock or hang. Check a lightweight endpoint. |
| Startup | Has the pod finished starting up? | Liveness/readiness probes are not run until startup succeeds | Slow-starting apps (JVM, large model loading). |

Critical rule: keep readiness and liveness probes different. The readiness probe should check that the application can serve requests (e.g., can it reach its database?). The liveness probe should check that the application process is not deadlocked (e.g., can it respond to a simple /healthz ping?). If you make them the same, a downstream dependency failure (database down) will cause liveness failures, which restarts the pod, which cannot fix a database outage, which creates a restart storm.

startupProbe:               # Allow up to 5 minutes for slow startup
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

readinessProbe:              # Check full readiness (dependencies included)
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

livenessProbe:               # Check basic aliveness (no dependency checks)
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 20
  failureThreshold: 3

PodDisruptionBudgets

When a node is drained (for upgrades, scaling down, or maintenance), Kubernetes evicts all pods on that node. Without a PodDisruptionBudget (PDB), all replicas of a Deployment on that node could be evicted simultaneously, causing downtime.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2              # At least 2 pods must remain running
  selector:
    matchLabels:
      app: web-app

Alternatively, use maxUnavailable: 1 to allow at most 1 pod to be disrupted at a time. PDBs are respected by kubectl drain, cluster autoscaler, and node upgrade processes.
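
You can see how much disruption headroom a PDB currently allows:

kubectl get pdb web-app-pdb
# ALLOWED DISRUPTIONS shows how many pods may be evicted right now;
# 0 means drains will block until more replicas become ready.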

Resource Management

LimitRanges

Set default requests and limits for a namespace, so that developers who forget to set them still get reasonable defaults:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-app
spec:
  limits:
    - default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      type: Container

ResourceQuotas

Prevent a single namespace from consuming the entire cluster:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: my-app
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    persistentvolumeclaims: "10"

QoS Classes Revisited

Guaranteed QoS (requests = limits) ensures critical pods are evicted last; Burstable QoS (requests < limits) allows efficient sharing for batch workloads. Avoid BestEffort — see Chapter 18 for detail.

Cost Management

Kubernetes makes it easy to provision resources and hard to track who is paying for them.

Labels for Cost Attribution

Apply consistent labels to every resource:

metadata:
  labels:
    team: platform
    environment: production
    cost-center: engineering
    app: web-app

Cloud providers can filter billing data by Kubernetes labels (if label propagation is enabled in the cloud integration).

Tools

  • Kubecost: Open-source cost monitoring. Shows cost per namespace, deployment, pod, and label. Identifies idle resources and right-sizing recommendations.
  • OpenCost: CNCF project for Kubernetes cost monitoring. Vendor-neutral alternative to Kubecost.

Spot Instances

Run non-critical, fault-tolerant workloads on spot/preemptible instances to reduce compute costs by 60-90%. Use node affinity and tolerations to separate spot-friendly workloads from those that need stable compute:

# Toleration for a spot-instance taint. The key below is illustrative:
# spot taints are applied by you or your node provisioner, not by
# Kubernetes itself, so match whatever taint your node pools carry.
tolerations:
  - key: "node.example.com/spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Combine with PDBs to ensure that spot instance reclamation does not take down all replicas simultaneously.

Chaos Engineering

Once your cluster is observable, secured, and backed up, test that it actually survives failure.

  • Chaos Mesh: CNCF project. Injects pod failures, network latency, disk I/O stress, and time skew.
  • Litmus: Another CNCF chaos engineering project with a library of pre-built experiments.
  • Manual chaos: kubectl delete pod <random-pod>, kubectl drain node <random-node>, kill a container runtime on a node. Start simple before adopting frameworks.

The goal is not to break things for fun. The goal is to verify that your monitoring catches the failure, your alerts fire, your PDBs prevent cascading outages, and your team knows how to respond.

The Complete Checklist

Before declaring a cluster production-ready, verify:

  • Monitoring: kube-prometheus-stack or equivalent deployed and dashboards reviewed
  • Alerting: Critical alerts configured (node down, pod CrashLoopBackOff, disk pressure, API server errors)
  • Logging: Loki or EFK collecting logs from all namespaces
  • RBAC: No unnecessary cluster-admin bindings; service accounts have minimal permissions
  • NetworkPolicies: Default-deny in application namespaces with explicit allow rules
  • Pod Security Standards: At least baseline enforced on all application namespaces
  • Image scanning: Trivy or equivalent in CI/CD pipeline
  • Backup: Velero or equivalent with scheduled backups and tested restores
  • Health probes: Readiness, liveness, and startup probes on all Deployments
  • PDBs: PodDisruptionBudgets on all production Deployments
  • Resource limits: Requests and limits set on all containers
  • LimitRanges: Default limits in every namespace
  • ResourceQuotas: Quotas on every namespace
  • Labels: Consistent labeling for cost attribution and filtering
  • etcd backups: Automated (managed K8s) or scripted (self-managed)
  • Upgrade plan: Documented process for upgrading Kubernetes and node OS

Common Mistakes and Misconceptions

  • “My app works in dev, so it’s production-ready.” Production requires health probes, resource requests/limits, PodDisruptionBudgets, anti-affinity rules, graceful shutdown handling, and monitoring. Dev-working is the starting line, not the finish.
  • “Setting replicas to 1 with a PDB is fine.” A PDB with minAvailable: 1 on a single-replica Deployment blocks all voluntary disruptions (node drains, upgrades). Use at least 2 replicas for anything that needs PDB protection.
  • “Liveness probes should check dependencies.” If your liveness probe checks the database and the database goes down, Kubernetes kills all your pods — making recovery impossible. Liveness checks should only verify the process itself is alive.

This concludes Part 3: From Theory to Practice. You have a running cluster, deployed workloads, and the debugging skills to keep them healthy. Part 4 tackles the next challenge: running applications that cannot simply be restarted and replaced — databases, queues, and other stateful systems that need stable identity and persistent storage.


Chapter 21: StatefulSets Deep Dive

Deployments treat pods as interchangeable. If pod web-abc123 dies, the replacement web-def456 is identical in every way that matters — same image, same configuration, same role. This works beautifully for stateless applications where any instance can handle any request. But some workloads are not interchangeable:

  • A database replica cannot simply replace the primary without coordination.
  • A distributed system that uses consistent hashing needs members with stable identities.
  • A clustered cache needs each node to own a predictable shard of data.

These workloads need something Deployments cannot provide: stable identity.

StatefulSets exist because some pods are not fungible. (For a visual map of how stateful workload concepts relate, see Appendix B: Mental Models.) Like every Kubernetes workload controller, a StatefulSet follows the controller pattern we covered in Chapter 3 — observe, diff, act — but with additional ordering and identity guarantees that the Deployment controller does not provide.

The Identity Problem

Consider what happens when a Deployment manages three pods:

DEPLOYMENT IDENTITY MODEL
──────────────────────────

Deployment: web (replicas=3)
    │
    ├── web-7b9f5d4c8-abc12   ← random suffix
    ├── web-7b9f5d4c8-def34   ← random suffix
    └── web-7b9f5d4c8-ghi56   ← random suffix

Pod dies → Replacement: web-7b9f5d4c8-xyz99   ← new random name
                                                  new IP address
                                                  new node (maybe)
                                                  no memory of its past life

Now compare with a StatefulSet:

STATEFULSET IDENTITY MODEL
───────────────────────────

StatefulSet: db (replicas=3)
    │
    ├── db-0   ← ordinal index 0 (always the first)
    ├── db-1   ← ordinal index 1 (always the second)
    └── db-2   ← ordinal index 2 (always the third)

Pod db-1 dies → Replacement: db-1   ← same name
                                       same PVC (data-db-1)
                                       same DNS record
                                       same identity, different incarnation

The difference is fundamental. A Deployment pod is a disposable worker. A StatefulSet pod is a named member of a group. When db-1 is replaced, the new pod inherits the identity of the old one — its name, its storage, its network address. This is what makes stateful workloads possible on Kubernetes.

StatefulSets vs Deployments

| Property | Deployment | StatefulSet |
|---|---|---|
| Pod names | Random hash suffix (web-7b9f5-abc12) | Ordinal index (web-0, web-1, web-2) |
| Pod creation order | All at once (parallel) | Sequential by default (web-0 → web-1 → web-2) |
| Pod deletion order | Any order | Reverse ordinal by default (web-2 → web-1 → web-0) |
| Storage | Shared or none | Per-pod PVC via volumeClaimTemplates |
| Network identity | ClusterIP Service (virtual IP) | Headless Service (individual DNS per pod) |
| Scaling | Instant (add/remove any pod) | Sequential (add highest, remove highest ordinal) |
| Use case | Stateless apps, web servers, APIs | Databases, message queues, distributed systems |

The cost of these guarantees is operational complexity. StatefulSets are harder to scale, harder to update, and require more careful planning. Use them only when your workload genuinely needs stable identity or per-pod storage.

Headless Services and Stable DNS

A normal ClusterIP Service creates a virtual IP that load-balances requests across all matching pods. A headless Service (one with clusterIP: None) does not create a virtual IP. Instead, it creates individual DNS records for each pod in the StatefulSet.

apiVersion: v1
kind: Service
metadata:
  name: db
  namespace: default
spec:
  clusterIP: None          # This makes it headless
  selector:
    app: db
  ports:
    - port: 5432
      targetPort: 5432

This headless Service produces the following DNS records:

flowchart TB
    subgraph sts["StatefulSet: db (replicas: 3)"]
        db0["db-0<br>10.244.1.5"]
        db1["db-1<br>10.244.2.8"]
        db2["db-2<br>10.244.1.9"]
    end

    subgraph dns["DNS Records (Headless Service: db, clusterIP: None)"]
        dns0["db-0.db.default.svc.cluster.local"]
        dns1["db-1.db.default.svc.cluster.local"]
        dns2["db-2.db.default.svc.cluster.local"]
        dnsAll["db.default.svc.cluster.local<br>(A record → all three IPs)"]
    end

    dns0 --> db0
    dns1 --> db1
    dns2 --> db2
    dnsAll -.-> db0
    dnsAll -.-> db1
    dnsAll -.-> db2

    style dns fill:#f0f0ff,stroke:#333
    style sts fill:#e0ffe0,stroke:#333

Applications connect to:

  • db-0.db.default.svc.cluster.local — always reaches db-0
  • db-1.db.default.svc.cluster.local — always reaches db-1
  • db.default.svc.cluster.local — reaches any (round-robin DNS)

The DNS naming convention is: <pod-name>.<service-name>.<namespace>.svc.cluster.local

The combination of a stable pod name (db-0) and a stable DNS entry (db-0.db.default.svc.cluster.local) gives each pod a persistent network identity that survives restarts, rescheduling, and node failures.

A PostgreSQL replica can be configured to always connect to db-0.db.default.svc.cluster.local as its primary, regardless of which node db-0 happens to be running on or what IP address it currently has.
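
You can verify the per-pod records from any pod in the cluster; a throwaway busybox pod works (the pod name and image tag are illustrative):

kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- \
  nslookup db-0.db.default.svc.cluster.local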

The StatefulSet Spec

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db               # Must match the headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: postgres
          image: postgres:16.2
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:          # Per-pod persistent storage
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 50Gi

The volumeClaimTemplates field is unique to StatefulSets. For each pod, Kubernetes creates a PersistentVolumeClaim named <template-name>-<statefulset-name>-<ordinal>. In this example: data-db-0, data-db-1, data-db-2. Each PVC is bound to its own PersistentVolume, giving each pod dedicated storage.
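
After the StatefulSet is created, you can see the per-pod claims:

kubectl get pvc
# data-db-0, data-db-1, data-db-2 -- one Bound claim per ordinal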

Ordered Operations

By default, StatefulSets use the OrderedReady pod management policy. This means:

  1. Creation: Pods are created in order: db-0 first, then db-1 only after db-0 is Running and Ready, then db-2 only after db-1 is Running and Ready.
  2. Scaling up: Same as creation — new pods are added one at a time in ordinal order.
  3. Scaling down: Pods are removed in reverse order: db-2 first, then db-1, then db-0.
  4. Deletion: If you delete a StatefulSet, pods are terminated in reverse ordinal order.

This ordering exists because stateful systems often need it. A database primary (db-0) must be running before replicas (db-1, db-2) can initialize and connect. Replicas should be drained before the primary is stopped.

For workloads that do not need ordered operations (for example, a distributed cache where all nodes are peers), you can set podManagementPolicy: Parallel:

spec:
  podManagementPolicy: Parallel    # All pods start/stop simultaneously

This removes the ordering constraint but retains stable names and per-pod storage.

Update Strategies

RollingUpdate (Default)

Pods are updated in reverse ordinal order: db-2 first, then db-1, then db-0. Each pod must become Ready before the next one is updated.

The partition parameter enables canary deployments:

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2             # Only pods with ordinal >= 2 are updated

With partition: 2 and 3 replicas, only db-2 receives the new pod template. db-0 and db-1 remain on the old version. After verifying db-2 is healthy, you lower the partition to 1, then 0, to roll out the update progressively. This is the safest way to update a stateful workload.
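
Lowering the partition is a small patch at each step; a sketch:

kubectl patch statefulset db \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'
# db-1 now updates; repeat with partition 0 to finish the rollout.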

OnDelete

Pods are updated only when you manually delete them. This gives you complete control over the update order:

spec:
  updateStrategy:
    type: OnDelete

This is useful when the update order matters and the default reverse-ordinal approach is not appropriate — for example, when you need to update replicas before the primary.

PVC Retention Policies

By default, PVCs created by volumeClaimTemplates are never deleted by Kubernetes. This is the safest behavior — you never accidentally lose data — but it means orphaned PVCs accumulate when you scale down or delete a StatefulSet.

Starting in v1.27, you can configure PVC retention:

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain        # When the StatefulSet is deleted
    whenScaledDown: Retain     # When replicas are scaled down

| Policy | whenDeleted | whenScaledDown | Behavior |
|---|---|---|---|
| Safest | Retain | Retain | PVCs always preserved (default) |
| Balanced | Delete | Retain | Cleanup on StatefulSet deletion, preserve on scale-down |
| Aggressive | Delete | Delete | PVCs deleted on both operations |

For production databases, always use Retain for both. Data recovery from a lost PVC is far more expensive than cleaning up unused PVCs.

Scale-Down and PVC Persistence

This behavior surprises many operators and deserves special emphasis:

PVC PERSISTENCE ON SCALE-DOWN
───────────────────────────────

BEFORE: replicas=5
  db-0  db-1  db-2  db-3  db-4        PVCs: data-db-0 through data-db-4
   │     │     │     │     │
   ▼     ▼     ▼     ▼     ▼
  PV0   PV1   PV2   PV3   PV4

AFTER: replicas=3 (scaled down)
  db-0  db-1  db-2                     Pods db-3, db-4 terminated
   │     │     │
   ▼     ▼     ▼
  PV0   PV1   PV2   PV3   PV4         PVCs data-db-3, data-db-4 STILL EXIST
                     ▲     ▲
                     │     │
                  Orphaned PVCs       ← data preserved but no pod using it

LATER: replicas=5 (scaled back up)
  db-0  db-1  db-2  db-3  db-4        db-3, db-4 reattach to existing PVCs
   │     │     │     │     │
   ▼     ▼     ▼     ▼     ▼
  PV0   PV1   PV2   PV3   PV4         Data from previous incarnation intact!

This is deliberate. When you scale back up, db-3 and db-4 get their old data back. But it also means you are paying for unused storage until you manually delete the orphaned PVCs (or configure whenScaledDown: Delete).
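
Cleaning up after a deliberate, permanent scale-down is manual; claim names follow the <template>-<set>-<ordinal> convention:

kubectl get pvc                            # confirm which claims are orphaned
kubectl delete pvc data-db-3 data-db-4     # irreversible if the StorageClass reclaimPolicy is Delete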

When to Use StatefulSets vs Deployments

Use a StatefulSet when:

  • Each pod needs a stable, unique network identity (databases, consensus systems)
  • Each pod needs its own persistent storage volume (data nodes)
  • Pod initialization or termination must happen in a defined order
  • Peers need to address each other by name (cluster membership protocols)

Use a Deployment when:

  • All pods are identical and interchangeable
  • Shared storage (or no storage) is sufficient
  • Order of creation and deletion does not matter
  • You need fast scaling (no sequential constraints)

A common anti-pattern is using StatefulSets for applications that just need persistent storage but do not need stable identity. If your application uses a single shared PVC (ReadWriteMany), a Deployment with a PVC is simpler and more appropriate.

Common Mistakes and Misconceptions

  • “StatefulSets are just Deployments with persistent storage.” StatefulSets provide ordered startup/shutdown, stable network identities (pod-0, pod-1), and per-replica PVCs. These guarantees come with trade-offs: slower scaling and more complex operations.
  • “Deleting a StatefulSet deletes its PVCs.” PVCs are deliberately retained to prevent data loss. You must delete PVCs manually. This is a safety feature, not a bug.
  • “I need a StatefulSet for any app that uses a database.” If your app is stateless but connects to an external database, use a Deployment. StatefulSets are for when the pod itself holds state (e.g., the pod IS the database).


Chapter 22: Databases on Kubernetes

“Should we run our database on Kubernetes?” is one of the most debated questions in the Kubernetes community, and the debate persists because the answer is genuinely nuanced. It depends on what database, what workload, what team, and what alternatives exist.

The Great Debate

The argument against databases on Kubernetes is simple: databases are the most important component in most architectures, and Kubernetes was designed for stateless, ephemeral workloads. Pods get rescheduled. Nodes fail. Network partitions happen. Storage has latency. Every one of these events is routine for a web server and potentially catastrophic for a database.

The argument for databases on Kubernetes is equally simple: managed database services are expensive, lock you into a cloud provider, and do not exist on-premises. Kubernetes operators can automate the same operational tasks that managed services handle — failover, backup, replication — and they work everywhere Kubernetes runs.

Both arguments are correct. The question is which trade-offs matter more for your specific situation.

The Honest Assessment

For revenue-critical systems, managed database services remain superior. AWS RDS, Google Cloud SQL, and Azure Database for PostgreSQL have teams of database engineers whose full-time job is handling failover, patching, backup, and recovery. They have years of operational experience encoded into their automation. The cost premium you pay for a managed service is insurance against the operational complexity you would otherwise absorb.

For development, testing, and non-critical workloads, Kubernetes databases are excellent. They provide consistent environments across dev/staging/production, they are easy to spin up and tear down, and they integrate naturally with the rest of your Kubernetes tooling.

For on-premises deployments, Kubernetes operators are often the best option available. When managed services do not exist, the choice is between hand-managing databases on VMs and using an operator that automates the hardest parts. The operator wins in most cases.

Decision Framework

| Tier | Description | Recommendation |
|---|---|---|
| Development / Test | Non-production, disposable data | Kubernetes — fast to create, fast to destroy |
| Tier 2-3 Services | Internal tools, analytics, non-revenue workloads | Kubernetes — acceptable with good operators and monitoring |
| Revenue-Critical | Customer-facing, SLA-bound, data-loss-intolerant | Managed service — unless you have strong database operations expertise |
| On-Premises | No managed services available | Kubernetes operators — best available option |
| Regulatory / Compliance | Data residency, air-gapped environments | Kubernetes operators — often the only option that satisfies constraints |

This is not a permanent ranking. The Kubernetes database ecosystem matures every year. Five years from now, running production databases on Kubernetes may be as routine as running web servers. But today, the operational gap between managed services and operators is real.

The Pets vs Cattle Nuance

The “pets vs cattle” metaphor says that modern infrastructure should treat servers like cattle (interchangeable, disposable) rather than pets (unique, irreplaceable). Kubernetes embodies this philosophy for stateless workloads. But databases are pets. A PostgreSQL primary node has unique state that cannot be recreated from a container image. Its data represents months or years of accumulated state.

StatefulSets are Kubernetes’s acknowledgment that some workloads are pets. The stable identity, ordered operations, and persistent storage guarantees exist specifically because not everything can be cattle. The question is not whether to treat databases as pets — they are pets — but whether Kubernetes provides the right tools for pet care.

Operators are the answer. An operator is a custom controller that encodes domain-specific operational knowledge into software. A PostgreSQL operator knows how to initialize a replica from a base backup, how to promote a replica to primary during failover, how to manage connection pooling, and how to schedule backups. It turns pet care into automated, repeatable processes.

The Operator Landscape

PostgreSQL

PostgreSQL has the most mature operator ecosystem on Kubernetes.

CloudNativePG — The strongest momentum in 2025. A CNCF Sandbox project with a clean architecture: each PostgreSQL pod runs a lightweight instance manager alongside the database process in the same container (no sidecar), while the operator itself runs as a separate Deployment. Supports automated failover, continuous backup to object storage (S3, GCS, Azure Blob), point-in-time recovery, connection pooling via PgBouncer, and declarative configuration. The project’s velocity and community engagement make it the default choice for new deployments.
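
As a concrete illustration, here is a minimal CloudNativePG Cluster manifest. This is a sketch: the bucket path and credentials Secret are placeholders, and field names should be checked against the CloudNativePG documentation for your operator version:

# Hedged sketch of a CloudNativePG cluster (API group postgresql.cnpg.io/v1)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main
spec:
  instances: 3                    # one primary + two streaming replicas
  storage:
    size: 100Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://my-pg-backups/pg-main   # placeholder bucket
      s3Credentials:
        accessKeyId:
          name: aws-creds         # placeholder Secret holding AWS keys
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY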

Crunchy Data PGO (postgres-operator) — The most battle-tested option. Crunchy Data has been running PostgreSQL on Kubernetes since before it was fashionable. PGO supports pgBackRest for backup (the gold standard for PostgreSQL backup), high availability via Patroni, connection pooling, monitoring integration, and multi-cluster replication. Choose this if you want the operator with the longest production track record.

Zalando postgres-operator — A simpler operator that grew out of Zalando’s internal Kubernetes usage. Good for straightforward PostgreSQL deployments but development velocity has slowed compared to CloudNativePG and PGO. Still a reasonable choice for teams that value simplicity.

MySQL

Percona Operator for MySQL — Supports both Percona XtraDB Cluster (Galera-based synchronous replication) and MySQL group replication. Backup to S3, automated failover, proxy via HAProxy or ProxySQL.

Vitess — Not strictly a MySQL operator but a database clustering system that runs on Kubernetes. Used by Slack, GitHub, and originally developed at YouTube. Vitess is the right choice when you need horizontal sharding of MySQL at massive scale. It is not the right choice for a single PostgreSQL-equivalent deployment.

Other Databases

MongoDB Community Operator — Manages MongoDB replica sets on Kubernetes. The enterprise version from MongoDB Inc. adds Ops Manager integration.

Redis (via Spotahome operator or Redis Enterprise) — Redis Sentinel and Redis Cluster topologies. Redis is simpler to operate than relational databases because it is primarily in-memory, but persistence and replication still require operational care.

Apache Kafka (Strimzi) — The dominant Kafka operator. Strimzi manages Kafka brokers, ZooKeeper (or KRaft), topics, users, and MirrorMaker. Kafka on Kubernetes is now mainstream, partly because Kafka’s distributed architecture maps well to StatefulSet semantics.

What Makes Database Operators Hard

A production database operator must handle:

OPERATOR RECONCILIATION LOOP
──────────────────────────────

  ┌─────────────────────────────────────────────────────┐
  │                  Desired State (CR)                 │
  │   PostgresCluster: replicas=3, backup=daily,        │
  │   version=16, storage=100Gi, pooler=pgbouncer       │
  └──────────────────────┬──────────────────────────────┘
                         │
                         ▼
  ┌──────────────────────────────────────────────────────┐
  │              Operator Controller                     │
  │                                                      │
  │   for each reconciliation loop:                      │
  │                                                      │
  │   1. Check cluster health                            │
  │      ├── Is primary alive?                           │
  │      ├── Are replicas streaming?                     │
  │      └── Is replication lag acceptable?              │
  │                                                      │
  │   2. Handle topology changes                         │
  │      ├── Scale up: init new replica from backup      │
  │      ├── Scale down: drain connections, remove       │
  │      └── Node failure: promote replica to primary    │
  │                                                      │
  │   3. Manage supporting services                      │
  │      ├── Connection pooler (PgBouncer/ProxySQL)      │
  │      ├── Backup schedule (base backup + WAL)         │
  │      ├── Monitoring endpoints (Prometheus)           │
  │      └── TLS certificates                            │
  │                                                      │
  │   4. Handle version upgrades                         │
  │      ├── Minor: rolling restart                      │
  │      └── Major: pg_upgrade or logical replication    │
  │                                                      │
  └──────────────────────┬───────────────────────────────┘
                         │
                         ▼
  ┌──────────────────────────────────────────────────────┐
  │              Managed Resources                       │
  │                                                      │
  │   StatefulSet(primary)  StatefulSet(replicas)        │
  │   Service(read-write)   Service(read-only)           │
  │   PVCs(data)            ConfigMaps(postgresql.conf)  │
  │   Secrets(passwords)    CronJob(backup)              │
  │   Deployment(pooler)    ServiceMonitor(metrics)      │
  └──────────────────────────────────────────────────────┘

Each of these responsibilities is a failure mode:

Leader Election and Failover — When the primary fails, the operator must detect the failure, select the most up-to-date replica, promote it, reconfigure all other replicas to follow the new primary, and update the read-write Service endpoint. This must happen in seconds, without data loss, and without split-brain (two nodes both believing they are primary). Getting this wrong is the single most dangerous failure mode for a database.

Replication — The operator must configure streaming replication (for PostgreSQL) or group replication (for MySQL), monitor replication lag, and handle replicas that fall behind. A replica that loses its replication slot must be rebuilt from a base backup, which can take hours for large databases.

Backup and Recovery — Continuous backup involves both periodic base backups (full snapshots of the data directory) and continuous WAL (write-ahead log) archival. The operator must verify backup integrity, manage backup retention, and support point-in-time recovery to any moment in the past.

Version Upgrades — Minor version upgrades (16.1 to 16.2) are typically rolling restarts. Major version upgrades (15 to 16) require data migration via pg_upgrade or logical replication. Both must be done without extended downtime.

Connection Pooling — Database connections are expensive (each consumes memory and a process/thread). A connection pooler like PgBouncer sits between the application and the database, multiplexing many application connections onto a smaller number of database connections. The operator manages the pooler’s lifecycle and configuration.

Dual Monitoring — You need both Kubernetes-level monitoring (pod health, resource usage, PVC capacity) and database-level monitoring (query latency, lock contention, replication lag, cache hit ratio). These are complementary and both are essential.

The Real Cost of Self-Managing

The real comparison is:

| Cost Factor | Managed Service | Kubernetes Operator |
|-------------|-----------------|---------------------|
| Compute | Higher (managed premium) | Lower (your nodes) |
| Engineering time | Low (vendor handles operations) | Significant (you handle operations the operator cannot) |
| Failure recovery | Vendor SLA | Your team’s expertise |
| Backup verification | Vendor responsibility | Your responsibility to test restores |
| Major version upgrades | Often push-button | Often manual coordination |
| Compliance auditing | Vendor provides documentation | You provide documentation |

If your team has strong database operations expertise and the time to invest in it, Kubernetes operators are a powerful tool. If your team’s expertise is in application development and they view the database as infrastructure that should just work, a managed service is the better choice.

A Pragmatic Path

Many organizations adopt a layered approach:

  1. Start with managed services for production databases. Do not optimize costs before you have a working system.
  2. Use Kubernetes databases for dev/test. This gives your team experience with the operator and ensures dev/test environments match production topology.
  3. Evaluate migration to Kubernetes for Tier 2-3 workloads after your team has built confidence with the operator in non-production environments.
  4. Keep revenue-critical databases on managed services unless you have a compelling reason to move them (cost, compliance, on-premises requirement).

This path minimizes risk while building the operational muscle needed to run databases on Kubernetes if and when it makes sense.

Common Mistakes and Misconceptions

  • “Never run databases on Kubernetes.” This was good advice in 2018. Modern operators (CloudNativePG, Percona, Vitess) handle replication, failover, backup, and restore. For many teams, K8s-native databases are simpler than managing separate DB infrastructure.
  • “Kubernetes storage is too slow for databases.” Cloud SSDs (gp3, pd-ssd) provide consistent IOPS. Local NVMe on dedicated node pools rivals bare-metal performance. The storage layer is rarely the bottleneck.
  • “A database operator means zero operational effort.” Operators automate routine tasks but still require monitoring, capacity planning, backup verification, and version upgrade planning. They reduce effort, not eliminate it.

Next: Persistent Storage Patterns — volumeClaimTemplates, reclaim policies, backup, and resize.

Chapter 23: Persistent Storage Patterns

Storage on Kubernetes is where the abstraction meets physical reality. A pod can be rescheduled to any node in seconds, but a 500GB disk cannot teleport. Persistent storage forces you to think about topology, data lifecycle, and failure modes that stateless workloads let you ignore. For a quick storage decision flowchart, see Appendix C: Decision Trees.

volumeClaimTemplates: The Naming Convention

As covered in Chapter 21, StatefulSets use volumeClaimTemplates to create per-pod PVCs. The naming convention is deterministic:

<template-name>-<statefulset-name>-<ordinal>

For a StatefulSet named db with a template named data:

data-db-0
data-db-1
data-db-2

This naming convention is not arbitrary. It is the mechanism by which Kubernetes reconnects pods to their storage after rescheduling. When db-1 is deleted and recreated (during an update, a node failure, or a manual restart), the new db-1 pod finds the PVC data-db-1 by name and reattaches to the same underlying volume. No operator intervention required.
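
You can observe the reattachment directly. A short sketch, assuming the db StatefulSet from above in the current namespace:

# Delete db-1; the StatefulSet controller recreates it
kubectl delete pod db-1

# The PVC was never touched and is still Bound to the same PV
kubectl get pvc data-db-1

# The replacement pod references the same claim by name
kubectl get pod db-1 -o yaml | grep -A1 persistentVolumeClaim
#   persistentVolumeClaim:
#     claimName: data-db-1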

If your StatefulSet has multiple volume templates (for example, separate volumes for data and write-ahead logs), each template produces its own set of PVCs:

volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
  - metadata:
      name: wal
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi

This creates: data-db-0, data-db-1, data-db-2, wal-db-0, wal-db-1, wal-db-2. Six PVCs for three pods, each with its own underlying PersistentVolume.

Reclaim Policies: Retain vs Delete

When a PersistentVolumeClaim is deleted, the underlying PersistentVolume’s reclaimPolicy determines what happens to the actual storage:

| Policy | What Happens | When to Use |
|--------|--------------|-------------|
| Retain | PV is preserved. Data remains. PV enters Released state and must be manually reclaimed. | Production databases. Any workload where accidental data loss is unacceptable. |
| Delete | PV and underlying storage (EBS volume, GCE PD, etc.) are deleted. | Development environments. Workloads where data can be recreated. |
| Recycle | Deprecated. Was rm -rf /thevolume/* followed by making the PV available again. | Never. Use Delete instead. |

The default reclaim policy for dynamically provisioned PVs is Delete in most StorageClasses. This is dangerous for production workloads. If someone accidentally deletes a PVC, the underlying data is gone.

For production, always set the reclaim policy to Retain. You can do this in the StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain              # PVs are preserved when PVCs are deleted
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"

The consequence of Retain is that you must manually clean up PVs when you are done with them. This is a feature, not a bug. Explicit deletion of persistent data should require a human decision.
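
What that manual cleanup looks like, as a sketch (pv-a is a placeholder name): a Released PV still holds a reference to its old claim, so you either clear the stale claimRef to reuse the volume or delete the PV deliberately.

# A Retain-policy PV survives PVC deletion in the Released state
kubectl get pv
# NAME   CAPACITY   RECLAIM POLICY   STATUS     CLAIM
# pv-a   100Gi      Retain           Released   production/data-db-0

# Make the volume bindable again (data intact) by removing the stale claimRef
kubectl patch pv pv-a --type json \
  -p '[{"op": "remove", "path": "/spec/claimRef"}]'

# Or discard it deliberately: delete the PV, then the cloud volume behind it
kubectl delete pv pv-a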

WaitForFirstConsumer: Topology Awareness

In a multi-zone cluster, where a PV is provisioned matters. An EBS volume in us-east-1a cannot be attached to a node in us-east-1b. If the volume is provisioned before the pod is scheduled, and the pod lands on a node in a different zone, the pod will be stuck in Pending forever.

WaitForFirstConsumer solves this by deferring volume provisioning until a pod actually needs it:

VOLUME BINDING MODES
─────────────────────

Immediate (default for some StorageClasses):
  1. PVC created              → PV provisioned in zone-a
  2. Pod scheduled to zone-b  → STUCK: PV is in zone-a, pod is in zone-b

WaitForFirstConsumer:
  1. PVC created              → PV provisioning deferred
  2. Pod scheduled to zone-b  → PV provisioned in zone-b (same zone as pod)
  3. Pod binds to PV          → SUCCESS: everything in the same zone

Always use WaitForFirstConsumer for cloud storage in multi-zone clusters. It is the only safe choice.

There is a subtle interaction with StatefulSets: once a PVC is bound to a PV in a specific zone, any future incarnation of that pod is constrained to that zone. If data-db-0 is provisioned in us-east-1a, then db-0 will always be scheduled to us-east-1a (assuming the PVC still exists). This is usually desirable for databases but means that zone failures affect specific StatefulSet members predictably.

Storage Resize

Most CSI drivers support volume expansion. The StorageClass must allow it:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-expandable
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true           # Enables resize

To resize, edit the PVC’s spec.resources.requests.storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-db-0
spec:
  resources:
    requests:
      storage: 200Gi    # Was 100Gi, now requesting 200Gi

The resize process has two phases:

  1. Controller expansion: The CSI driver expands the underlying volume (e.g., modifies the EBS volume size). This happens automatically.
  2. Node expansion: The filesystem on the volume is expanded to use the new space. This happens either when the pod using the volume restarts (offline expansion) or while it is running (online expansion, supported by most modern CSI drivers).

Important constraints:

  • Volumes can only grow, never shrink. There is no way to reduce a PVC’s size.
  • EBS volumes have a cooldown period. After modifying an EBS volume, you must wait 6 hours before modifying it again.
  • Some filesystems require a pod restart for the resize to take effect.
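
In practice a resize is a one-line patch followed by watching the PVC’s conditions. A sketch, assuming the PVC from above lives in a database namespace:

# Request the expansion (100Gi -> 200Gi)
kubectl patch pvc data-db-0 -n database --type merge \
  -p '{"spec": {"resources": {"requests": {"storage": "200Gi"}}}}'

# Watch progress: conditions such as Resizing and FileSystemResizePending
# indicate controller expansion is done and node expansion is pending
kubectl describe pvc data-db-0 -n database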

The PVC Lifecycle on Scale-Down

When you scale down a StatefulSet, the pods are deleted but the PVCs are not:

PVC LIFECYCLE ON SCALE-DOWN
─────────────────────────────

Step 1: Running at replicas=5
┌──────────────────────────────────────────────────────┐
│  Pod    │  PVC          │  PV     │  Status          │
├─────────┼───────────────┼─────────┼──────────────────┤
│  db-0   │  data-db-0    │  pv-a   │  Bound           │
│  db-1   │  data-db-1    │  pv-b   │  Bound           │
│  db-2   │  data-db-2    │  pv-c   │  Bound           │
│  db-3   │  data-db-3    │  pv-d   │  Bound           │
│  db-4   │  data-db-4    │  pv-e   │  Bound           │
└──────────────────────────────────────────────────────┘

Step 2: Scale to replicas=3 (kubectl scale sts db --replicas=3)
┌──────────────────────────────────────────────────────┐
│  Pod    │  PVC          │  PV     │  Status          │
├─────────┼───────────────┼─────────┼──────────────────┤
│  db-0   │  data-db-0    │  pv-a   │  Bound           │
│  db-1   │  data-db-1    │  pv-b   │  Bound           │
│  db-2   │  data-db-2    │  pv-c   │  Bound           │
│  ---    │  data-db-3    │  pv-d   │  Bound (no pod!) │
│  ---    │  data-db-4    │  pv-e   │  Bound (no pod!) │
└──────────────────────────────────────────────────────┘

  data-db-3 and data-db-4 still exist.
  You are still paying for pv-d and pv-e.
  The data in pv-d and pv-e is preserved.

Step 3: Scale back to replicas=5
┌──────────────────────────────────────────────────────┐
│  Pod    │  PVC          │  PV     │  Status          │
├─────────┼───────────────┼─────────┼──────────────────┤
│  db-0   │  data-db-0    │  pv-a   │  Bound           │
│  db-1   │  data-db-1    │  pv-b   │  Bound           │
│  db-2   │  data-db-2    │  pv-c   │  Bound           │
│  db-3   │  data-db-3    │  pv-d   │  Bound (data!)   │
│  db-4   │  data-db-4    │  pv-e   │  Bound (data!)   │
└──────────────────────────────────────────────────────┘

  db-3 and db-4 reattach to their old PVCs.
  All previous data is intact.

Operational implications:

  1. Cost: Orphaned PVCs consume storage and incur charges. Monitor with kubectl get pvc and cloud billing tools.
  2. Stale data: If you scale down, modify the application, and scale back up, the reattached pods may have stale data that does not match the current application state.
  3. Cleanup: If you genuinely want to discard the data, you must manually delete the orphaned PVCs: kubectl delete pvc data-db-3 data-db-4.

Backup Strategies: A Layered Approach

No single backup mechanism is sufficient for production data. Each approach has blind spots, and a robust strategy layers multiple approaches to cover each other’s weaknesses:

BACKUP STRATEGY LAYERS
────────────────────────

Layer 3: Application-Native Backup              ← Highest fidelity
  │  pg_basebackup + WAL archival
  │  mongodump / mysqldump
  │  Application-consistent snapshots
  │  Understands transactions, replication state
  │
Layer 2: Velero (Kubernetes-aware backup)        ← Kubernetes context
  │  Backs up K8s resources (StatefulSets, Services, ConfigMaps)
  │  Can trigger pre/post-backup hooks (e.g., pg_start_backup)
  │  Backs up PV data via snapshots or Restic/Kopia
  │  Restores entire namespaces with all resources
  │
Layer 1: Volume Snapshots (CSI)                  ← Fastest recovery
  │  Point-in-time snapshot of the block device
  │  Fast: typically copy-on-write, completes in seconds
  │  Can clone volumes from snapshots
  │  WARNING: crash-consistent, NOT application-consistent
  │
  ▼
Storage Layer (EBS, GCE PD, Ceph, etc.)

Why Each Layer Matters

Volume snapshots are fast but dangerous in isolation. A snapshot captures the block device at a point in time, like pulling the power cord on a running database. The filesystem will be consistent (journaling handles that), but the database may have in-flight transactions that are partially written. The snapshot is crash-consistent but not application-consistent. Restoring from a snapshot alone may require crash recovery, and some data may be lost.

Velero adds Kubernetes context. It backs up not just the data but the Kubernetes resources that define how the data is used — the StatefulSet, the Service, the ConfigMaps, the Secrets. Velero can also run pre-backup hooks (like pg_backup_start or FLUSH TABLES WITH READ LOCK) that put the database into a consistent state before snapshotting.

Application-native backup is the gold standard. PostgreSQL’s continuous archival (base backup + WAL shipping) provides point-in-time recovery to any second in the past. This is the only backup method that guarantees zero data loss for committed transactions.

Volume Snapshots in Practice

# Create a VolumeSnapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-snapshot-20260403
spec:
  volumeSnapshotClassName: csi-aws-ebs
  source:
    persistentVolumeClaimName: data-db-0

---
# Create a new PVC from a snapshot (cloning)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-db-0-restored
spec:
  storageClassName: gp3-retain
  dataSource:
    name: db-snapshot-20260403
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi

Volume cloning from snapshots is invaluable for creating test environments from production data. Snapshot a production PVC, create a new PVC from the snapshot, and attach it to a test StatefulSet. The clone is independent of the original — modifications to one do not affect the other.

Velero Configuration

# Install Velero with AWS provider
velero install \
  --provider aws \
  --bucket my-velero-bucket \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --plugins velero/velero-plugin-for-aws:v1.9.0

# Schedule daily backups with 30-day retention
velero schedule create daily-db-backup \
  --schedule="0 2 * * *" \
  --include-namespaces database \
  --ttl 720h

# Restore a namespace from backup
velero restore create --from-backup daily-db-backup-20260403020000

Velero’s pre-backup hooks let you ensure application consistency:

metadata:
  annotations:
    pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -c \"SELECT pg_backup_start(''velero'')\""]'
    pre.hook.backup.velero.io/container: postgres
    post.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -c \"SELECT pg_backup_stop()\""]'
    post.hook.backup.velero.io/container: postgres

The Backup Rule

Test your restores. A backup that has never been restored is a hypothesis, not a guarantee. Schedule regular restore tests to a separate namespace and verify data integrity. The time to discover that your backup process is broken is not during an incident.
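
A restore test can reuse the scheduled backups without touching the original namespace. A sketch using Velero’s namespace mapping (the backup name is from the example above; restore-test is a placeholder namespace):

# Restore the backup into a scratch namespace instead of the original
velero restore create restore-test-20260403 \
  --from-backup daily-db-backup-20260403020000 \
  --namespace-mappings database:restore-test

# Check restore status, then verify data integrity inside restore-test
velero restore describe restore-test-20260403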

Putting It All Together

A production storage configuration for a StatefulSet database combines everything in this chapter:

  1. StorageClass: reclaimPolicy: Retain, volumeBindingMode: WaitForFirstConsumer, allowVolumeExpansion: true
  2. volumeClaimTemplates: Separate templates for data and WAL if the database benefits from it
  3. PVC retention policy: Retain for both whenDeleted and whenScaledDown
  4. Backup: Application-native continuous backup (WAL archival) + Velero scheduled backups + periodic volume snapshots
  5. Monitoring: Alert on PVC usage approaching capacity, orphaned PVCs after scale-down, backup job failures

Storage is the foundation that stateful workloads rest on. Get it right and your databases can survive node failures, zone outages, and operational mistakes. Get it wrong and you will learn why the operations community repeats: “backups are worthless; restores are priceless.”

Common Mistakes and Misconceptions

  • “All PersistentVolumes are the same.” RWO (ReadWriteOnce) can only be mounted by one node. RWX (ReadWriteMany) works across nodes but requires NFS or cloud file systems (EFS, Filestore). Choosing the wrong access mode causes mount failures. Note: RWO allows multiple pods on the same node to mount the volume simultaneously. For databases that require exclusive single-pod access, use ReadWriteOncePod (RWOP), which restricts the volume to exactly one pod. RWOP has been GA since Kubernetes 1.29.
  • “Storage classes are just about disk type.” Storage classes also control reclaim policy (Delete vs Retain), volume binding mode (Immediate vs WaitForFirstConsumer), and provisioner. WaitForFirstConsumer is critical for zone-aware scheduling.
  • “I can resize PVCs freely.” Volume expansion must be enabled on the storage class (allowVolumeExpansion: true). Not all provisioners support it. Shrinking is never supported — plan initial sizes carefully.

Next: Jobs and CronJobs — Batch processing, indexed completions, and scheduling patterns.

Chapter 24: Jobs and CronJobs

Not every workload is a long-running service. Some workloads run to completion: a database migration, a batch data transformation, an ML training run, a nightly report. Deployments and StatefulSets are the wrong abstraction for these workloads because they try to keep pods running forever. Jobs and CronJobs are Kubernetes’s answer to batch and scheduled workloads. A Job creates one or more pods, runs them to completion, and then stops. A CronJob creates Jobs on a schedule. The concepts are simple, but the details — completion modes, parallelism, failure handling, and concurrency policies — matter enormously for production reliability.

Jobs: Run to Completion

A Job ensures that a specified number of pods successfully terminate. The simplest Job runs a single pod:

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: my-app:v2.1.0
          command: ["./migrate", "--target", "v2.1"]
      restartPolicy: Never        # Jobs require Never or OnFailure
  backoffLimit: 3                 # Retry up to 3 times on failure
  activeDeadlineSeconds: 600      # Kill the Job if it runs longer than 10 minutes
  ttlSecondsAfterFinished: 3600   # Clean up completed Job after 1 hour

Key fields:

  • restartPolicy: Must be Never or OnFailure. Jobs cannot use the default Always because that would restart the pod after successful completion.
  • backoffLimit: How many times to retry before marking the Job as failed. Each retry uses exponential backoff (10s, 20s, 40s, …).
  • activeDeadlineSeconds: A hard timeout for the entire Job. If the Job has not completed within this duration, all running pods are terminated and the Job is marked as failed.
  • ttlSecondsAfterFinished: How long to keep the completed (or failed) Job object before garbage collection. Without this, completed Jobs accumulate forever.

Completion Modes

Jobs support two completion modes that determine how “done” is defined:

NonIndexed (Default)

The Job is complete when .spec.completions pods have succeeded. Each pod is interchangeable — they all run the same work.

spec:
  completions: 5          # 5 pods must succeed
  parallelism: 3          # Run up to 3 pods at a time
  completionMode: NonIndexed

This creates 5 pods (3 at a time), each running the same task. If any pod fails, a replacement is created (up to backoffLimit). When 5 pods have exited with status 0, the Job is complete.

Indexed

Each pod gets a unique index (0 through completions-1) via the JOB_COMPLETION_INDEX environment variable. This enables work partitioning: each pod processes a different shard of data.

spec:
  completions: 10         # 10 indexed pods (0-9)
  parallelism: 5          # Run up to 5 pods at a time
  completionMode: Indexed

Each pod knows its identity: pod with index 3 reads JOB_COMPLETION_INDEX=3 from its environment and processes the corresponding data partition. The Job is complete when each index (0 through 9) has one successful pod.

Indexed Jobs are the Kubernetes-native way to implement MapReduce-style parallelism. Instead of a single pod processing a 1TB dataset, ten pods each process 100GB.
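
A sketch of what the pod template looks like when it consumes the index (process.sh and the shard layout are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: shard-processor
spec:
  completions: 10
  parallelism: 5
  completionMode: Indexed
  template:
    spec:
      containers:
        - name: worker
          image: data-tool:v1        # placeholder image
          command:
            - /bin/sh
            - -c
            # JOB_COMPLETION_INDEX is injected automatically for Indexed Jobs
            - ./process.sh --shard "$JOB_COMPLETION_INDEX" --of 10
      restartPolicy: Never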

Parallelism Patterns

The interaction between completions and parallelism produces different execution patterns:

JOB PARALLELISM PATTERNS
──────────────────────────

Pattern 1: Single Pod (default)
completions=1, parallelism=1

  Time ──────────────────────►
  ┌──────────────────┐
  │     Pod 0        │ ✓ Done
  └──────────────────┘


Pattern 2: Fixed Completion Count
completions=5, parallelism=2

  Time ──────────────────────────────────────►
  ┌──────────────┐
  │   Pod 0      │ ✓
  └──────────────┘
  ┌──────────────┐
  │   Pod 1      │ ✓
  └──────────────┘
                   ┌──────────────┐
                   │   Pod 2      │ ✓
                   └──────────────┘
                   ┌──────────────┐
                   │   Pod 3      │ ✓
                   └──────────────┘
                                    ┌──────────────┐
                                    │   Pod 4      │ ✓
                                    └──────────────┘

  2 pods run at a time. 5 must succeed total.


Pattern 3: Work Queue (external coordination)
completions=unset, parallelism=5

  Time ──────────────────────────────────────►
  ┌──────────────────────────────────────┐
  │   Pod 0 (processes items from queue) │ ✓
  └──────────────────────────────────────┘
  ┌────────────────────────────────┐
  │   Pod 1                        │ ✓
  └────────────────────────────────┘
  ┌────────────────────────────────────────────┐
  │   Pod 2                                    │ ✓
  └────────────────────────────────────────────┘
  ┌──────────────────────────────────┐
  │   Pod 3                          │ ✓
  └──────────────────────────────────┘
  ┌──────────────────────┐
  │   Pod 4              │ ✓
  └──────────────────────┘

  All 5 pods run simultaneously.
  Each pulls work from an external queue (SQS, Redis, RabbitMQ).
  When a pod exits successfully, it is not restarted.
  Job completes when at least one pod terminates successfully
  and all other pods have also terminated.


Pattern 4: Indexed Parallel
completions=4, parallelism=4, completionMode=Indexed

  Time ──────────────────────────────────────►
  ┌──────────────────────┐
  │  Pod 0 (index=0)     │ ✓ processes partition 0
  └──────────────────────┘
  ┌────────────────────────────┐
  │  Pod 1 (index=1)           │ ✓ processes partition 1
  └────────────────────────────┘
  ┌──────────────────┐
  │  Pod 2 (index=2) │ ✓ processes partition 2
  └──────────────────┘
  ┌──────────────────────────────────┐
  │  Pod 3 (index=3)                 │ ✓ processes partition 3
  └──────────────────────────────────┘

  Each pod gets JOB_COMPLETION_INDEX env var.
  Each processes its assigned data shard.

| Pattern | completions | parallelism | Use Case |
|---------|-------------|-------------|----------|
| Single pod | 1 (default) | 1 (default) | Database migration, one-off script |
| Fixed count | N | M (M <= N) | Batch processing with known work items |
| Work queue | unset | N | Queue-driven processing (SQS, RabbitMQ) |
| Indexed | N | M | Data partitioning, parallel map operations |

Failure Handling

backoffLimit

When a pod fails (exits with non-zero status or is evicted), Kubernetes retries with exponential backoff. The delay between retries starts at 10 seconds and doubles each time (10s, 20s, 40s, …), capped at 6 minutes.

spec:
  backoffLimit: 6           # Allow up to 6 failures before giving up

After backoffLimit failures, the Job is marked as Failed. The default is 6.

activeDeadlineSeconds

A safety net for Jobs that might hang. If the Job has not completed after this many seconds, all pods are killed and the Job fails:

spec:
  activeDeadlineSeconds: 3600    # Hard timeout: 1 hour

This is essential for production Jobs. Without it, a hung Job consumes resources indefinitely. Always set this to a value comfortably above the expected runtime.

Pod Failure Policy

Introduced in v1.26 (stable in v1.31), Pod Failure Policy gives fine-grained control over how specific failure types are handled. Instead of treating all failures the same, you can define rules:

spec:
  podFailurePolicy:
    rules:
      - action: FailJob                      # Immediately fail the entire Job
        onExitCodes:
          containerName: migrate
          operator: In
          values: [42]                        # Exit code 42 = unrecoverable error

      - action: Ignore                        # Do not count toward backoffLimit
        onPodConditions:
          - type: DisruptionTarget            # Node drain, preemption, etc.

      - action: Count                         # Default: count toward backoffLimit
        onExitCodes:
          containerName: migrate
          operator: In
          values: [1]                         # Exit code 1 = transient, worth retrying

This is powerful for distinguishing between transient failures (network timeout, node eviction) and permanent failures (invalid input, schema mismatch). Without Pod Failure Policy, a pod that fails due to node preemption counts toward backoffLimit the same as a pod that fails due to a bug — which is wasteful because the preempted pod should just be retried without penalty.

ttlSecondsAfterFinished

Completed Jobs (both successful and failed) remain in the cluster until garbage collected. Without ttlSecondsAfterFinished, they stay forever, cluttering kubectl get jobs output and consuming API server resources:

spec:
  ttlSecondsAfterFinished: 86400    # Remove 24 hours after completion

Set this on every Job. The appropriate TTL depends on how long you need the Job (and its pod logs) for debugging.

CronJobs: Scheduled Execution

A CronJob creates Jobs on a schedule. The scheduling uses standard cron syntax:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"               # 2:00 AM every day
  timeZone: "America/New_York"         # Stable since v1.27
  concurrencyPolicy: Forbid            # Do not start a new Job if the previous is still running
  startingDeadlineSeconds: 300         # If missed by more than 5 minutes, skip this run
  successfulJobsHistoryLimit: 3        # Keep last 3 successful Jobs
  failedJobsHistoryLimit: 5            # Keep last 5 failed Jobs
  jobTemplate:
    spec:
      activeDeadlineSeconds: 7200      # Each Job has a 2-hour timeout
      backoffLimit: 2
      template:
        spec:
          containers:
            - name: backup
              image: backup-tool:v1.3
              command: ["./backup.sh"]
          restartPolicy: OnFailure

Cron Syntax

┌───────────── minute (0-59)
│ ┌───────────── hour (0-23)
│ │ ┌───────────── day of month (1-31)
│ │ │ ┌───────────── month (1-12)
│ │ │ │ ┌───────────── day of week (0-6, Sunday=0)
│ │ │ │ │
* * * * *

Examples:

  • 0 2 * * * — Every day at 2:00 AM
  • */15 * * * * — Every 15 minutes
  • 0 0 1 * * — First day of every month at midnight
  • 0 9 * * 1-5 — Weekdays at 9:00 AM

timeZone

Before v1.25, CronJobs used the kube-controller-manager’s local timezone, which was usually UTC but not always. The timeZone field (stable since v1.27) lets you specify the timezone explicitly. “Every day at 2 AM” is meaningless without a timezone.

concurrencyPolicy

What happens when it is time to start a new Job but the previous one is still running?

| Policy | Behavior | When to Use |
|--------|----------|-------------|
| Allow | Start the new Job alongside the running one | Independent jobs where overlap is safe |
| Forbid | Skip the new Job if the previous is still running | Backups, database maintenance, anything that should not overlap |
| Replace | Kill the running Job and start a new one | Long-running jobs where the latest run supersedes previous runs |

Forbid is the safest default for most production CronJobs. Two concurrent backup jobs competing for the same database locks is a recipe for failures.

startingDeadlineSeconds

If the CronJob controller misses a scheduled run (for example, because the controller was down or the cluster was overloaded), startingDeadlineSeconds controls how long after the scheduled time Kubernetes will still attempt to start the Job:

spec:
  startingDeadlineSeconds: 300    # If missed by more than 5 minutes, skip

Without this, Kubernetes counts missed schedules and may try to start all of them at once when the controller recovers. If more than 100 schedules were missed, the CronJob is marked as unable to be scheduled. Setting startingDeadlineSeconds provides a clean cutoff.

History Limits

spec:
  successfulJobsHistoryLimit: 3    # Keep 3 successful completed Jobs
  failedJobsHistoryLimit: 5        # Keep 5 failed Jobs (more for debugging)

These control how many completed Job objects are retained. Keep enough for debugging (especially for failed Jobs) but not so many that they clutter the cluster.

Real-World Use Cases

Data Pipeline Stage

apiVersion: batch/v1
kind: Job
metadata:
  name: etl-daily-20260403
spec:
  completions: 10
  parallelism: 5
  completionMode: Indexed
  backoffLimit: 3
  activeDeadlineSeconds: 14400    # 4 hours max
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      containers:
        - name: etl
          image: data-pipeline:v3.0
          command: ["./process_partition.sh"]
          env:
            - name: TOTAL_PARTITIONS
              value: "10"
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
      restartPolicy: Never

Each of the 10 indexed pods processes one partition of the daily data. Five run in parallel. The JOB_COMPLETION_INDEX environment variable tells each pod which partition to process.

Database Backup CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
spec:
  schedule: "0 3 * * *"
  timeZone: "UTC"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 600
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 10
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600
      backoffLimit: 2
      ttlSecondsAfterFinished: 604800    # Keep for 7 days
      template:
        spec:
          containers:
            - name: backup
              image: postgres:16.2
              command:
                - /bin/bash
                - -c
                - |
                  pg_dump -h db-0.db.default.svc.cluster.local \
                    -U backup_user -Fc mydb | \
                    aws s3 cp - s3://my-backups/pg/$(date +%Y%m%d).dump
              envFrom:
                - secretRef:
                    name: pg-backup-credentials
          restartPolicy: OnFailure

ML Training Job

apiVersion: batch/v1
kind: Job
metadata:
  name: training-run-042
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 1                    # Do not retry expensive training
  activeDeadlineSeconds: 86400       # 24-hour timeout
  ttlSecondsAfterFinished: 604800   # Keep for a week (to check logs)
  template:
    spec:
      containers:
        - name: train
          image: ml-training:v2.1
          command: ["python", "train.py", "--epochs", "100"]
          resources:
            requests:
              cpu: "8"
              memory: 32Gi
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: 32Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-output
              mountPath: /output
      volumes:
        - name: model-output
          persistentVolumeClaim:
            claimName: training-output
      restartPolicy: Never

Jobs vs Other Workload Types

| Question | Answer |
|----------|--------|
| Should it run forever? | Use Deployment or StatefulSet |
| Should it run once and stop? | Use Job |
| Should it run on a schedule? | Use CronJob |
| Should it run on every node? | Use DaemonSet |
| Does it need stable identity? | Use StatefulSet |
| Does it need parallel indexed processing? | Use Job with completionMode: Indexed |

Common Mistakes and Misconceptions

  • “CronJobs are reliable for exactly-once execution.” CronJobs can create 0 or 2+ Jobs for a single schedule point (missed schedules, clock skew). Use concurrencyPolicy: Forbid and design jobs to be idempotent.
  • “Failed Jobs retry forever.” Jobs respect backoffLimit (default 6). After that many failures, the Job is marked Failed. Set activeDeadlineSeconds to prevent runaway jobs consuming resources.
  • “Jobs clean up after themselves.” Completed and Failed Jobs (and their pods) persist in the API until you or a TTL controller deletes them. Set ttlSecondsAfterFinished to auto-clean, or they accumulate and clutter kubectl get pods.

This concludes Part 4: Stateful Workloads. You now know how to run applications that need stable identity, persistent storage, and batch processing semantics. Part 5 turns to the question that becomes urgent once you are running real workloads: how do you secure them?

Next: RBAC from First Principles

Chapter 25: RBAC from First Principles

Kubernetes has no firewall between “can deploy an app” and “can delete the entire cluster.” Without access control, every user and every workload operates with full administrative privileges. Role-Based Access Control (RBAC) is the mechanism that prevents a junior developer’s typo from becoming a production incident and ensures that a compromised pod cannot read every secret in the cluster.

RBAC answers a single question: who can do what to which resources? Understanding it from first principles requires understanding the four objects that encode that answer, the subjects they reference, and the design patterns that make multi-tenant clusters safe.

The Authorization Model

In Chapter 3, we described the API server as the gateway that every request must pass through. The API server authenticates every request, then authorizes it — RBAC is the authorization module used by virtually every production cluster.

Every request to the Kubernetes API server carries three pieces of information relevant to authorization:

  1. Subject — who is making the request (user, group, or service account)
  2. Verb — what action is being attempted (get, list, create, update, delete, watch, patch)
  3. Resource — what is being acted upon (pods, services, secrets, configmaps, etc.)

RBAC evaluates these against a set of rules. If any rule grants the requested action, the request is allowed. If no rule matches, the request is denied. RBAC is additive-only — there is no way to write a “deny” rule. You grant permissions; you never revoke them. If a subject has no matching grants, the default is denial.
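
You can test the outcome of this evaluation directly with kubectl auth can-i:

# Check a single permission as yourself
kubectl auth can-i get pods --namespace production

# Check on behalf of another subject (requires impersonation rights)
kubectl auth can-i get pods --namespace production --as alice

# List everything a subject can do in a namespace
kubectl auth can-i --list --namespace production --as alice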

flowchart TD
    req["kubectl get pods -n production"]
    authn["API Server<br>(Authentication)"]
    extract["Who: user:alice<br>What: verb:get resource:pods<br>Where: namespace:production"]
    authz["API Server<br>(Authorization)"]
    scan["Scan RoleBindings<br>in 'production' namespace"]
    binding["RoleBinding 'dev-access'<br>subjects: group:developers<br>roleRef: role:pod-reader"]
    role["Role 'pod-reader'<br>resources: pods<br>verbs: get, list, watch"]
    checkUser{"alice in<br>group:developers?"}
    checkVerb{"verb:get on<br>pods allowed?"}
    allow["RESULT: ALLOW"]
    deny["RESULT: DENY"]

    req --> authn
    authn --> extract
    extract --> authz
    authz --> scan
    scan --> binding
    binding --> role
    role --> checkUser
    checkUser -- YES --> checkVerb
    checkUser -- NO --> deny
    checkVerb -- YES --> allow
    checkVerb -- NO --> deny

    style allow fill:#d4edda,stroke:#333
    style deny fill:#f8d7da,stroke:#333

The Four RBAC Objects

RBAC uses exactly four object types. Two define permissions, two bind permissions to subjects.

flowchart TD
    subgraph Permissions
        Role["Role<br>(namespaced)"]
        CR["ClusterRole<br>(cluster-wide)"]
    end

    subgraph Bindings
        RB["RoleBinding<br>(namespaced)"]
        CRB["ClusterRoleBinding<br>(cluster-wide)"]
    end

    subgraph Subjects
        S["User / Group /<br>ServiceAccount"]
    end

    RB -- "roleRef" --> Role
    RB -- "roleRef" --> CR
    CRB -- "roleRef" --> CR
    RB -- "subjects" --> S
    CRB -- "subjects" --> S
    style Bindings fill:#fff0e0,stroke:#333
    style Subjects fill:#e0ffe0,stroke:#333

Key insight: A RoleBinding can reference a ClusterRole. This grants the ClusterRole’s permissions only within the RoleBinding’s namespace. This is the most common pattern for multi-tenant clusters.

Role

A Role defines permissions within a single namespace. It lists which API resources can be accessed and which verbs are allowed.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]            # "" = core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]    # subresource
    verbs: ["get"]

ClusterRole

A ClusterRole defines permissions cluster-wide. It can also grant access to cluster-scoped resources (nodes, namespaces, persistentvolumes) that have no namespace.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get", "list"]

RoleBinding

A RoleBinding grants permissions defined in a Role (or ClusterRole) to a set of subjects within a specific namespace.

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-pod-access
  namespace: production
subjects:
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount
    name: ci-deployer
    namespace: ci-system
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

ClusterRoleBinding

A ClusterRoleBinding grants cluster-wide permissions: the referenced ClusterRole’s rules apply in every namespace and to cluster-scoped resources.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-admins
subjects:
  - kind: Group
    name: platform-team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io

Subjects: Who Can Be Granted Access

RBAC recognizes three kinds of subjects:

User — An external identity authenticated by the API server. Kubernetes has no User object; users are established through client certificates, bearer tokens, or an external identity provider. The username is a string extracted during authentication.

Group — A set of users. Groups are also strings extracted during authentication. The identity provider (OIDC, certificates) determines group membership. Key built-in groups: system:authenticated (all authenticated users), system:unauthenticated (anonymous requests), system:masters (unconditional full access).

ServiceAccount — A namespaced Kubernetes object representing a workload’s identity. Unlike users and groups, ServiceAccounts are managed through the API. Every pod runs as a ServiceAccount; if none is specified, it runs as the default ServiceAccount in its namespace.

Default ClusterRoles

Kubernetes ships with a set of default ClusterRoles designed for common access patterns. These are the building blocks for most RBAC configurations:

| ClusterRole | Scope | Permissions |
|-------------|-------|-------------|
| cluster-admin | Cluster-wide | Everything. Full access to all resources in all namespaces. Equivalent to root. |
| admin | Namespace (via RoleBinding) | Full access within a namespace: create/update/delete Roles, RoleBindings, all workloads, secrets, configmaps. Cannot modify namespace quotas or the namespace itself. |
| edit | Namespace (via RoleBinding) | Create/update/delete workloads, services, configmaps, secrets, PVCs. Cannot manage Roles or RoleBindings. |
| view | Namespace (via RoleBinding) | Read-only access to most namespace resources. Cannot view secrets. |

The typical pattern is to bind these ClusterRoles via RoleBindings in specific namespaces, not via ClusterRoleBindings:

# Grant "edit" in the "staging" namespace to the QA team
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: qa-edit
  namespace: staging
subjects:
  - kind: Group
    name: qa-team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole         # Reference a ClusterRole...
  name: edit
  apiGroup: rbac.authorization.k8s.io
# ...but the binding is namespaced, so permissions apply only in "staging"

Aggregated ClusterRoles

Aggregated ClusterRoles solve a subtle problem: when you install a CRD (Custom Resource Definition), how do the default roles (admin, edit, view) learn about the new resource types?

The answer is label-based aggregation. The default ClusterRoles have an aggregationRule that selects other ClusterRoles by label. When you create a CRD, you create small ClusterRoles with the appropriate labels, and their rules are automatically merged into the aggregated roles.

# This ClusterRole's rules get merged into "admin"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-app-admin
  labels:
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
rules:
  - apiGroups: ["mycompany.io"]
    resources: ["widgets"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]

Any user who has admin access in a namespace now automatically gets full access to widgets in that namespace. No manual RoleBinding updates required.

ServiceAccount Tokens: The Modern Model

Kubernetes v1.24 removed the automatic creation of long-lived ServiceAccount token secrets. The modern model uses bound service account tokens with four important properties:

  1. Time-bound — Tokens expire (default: 1 hour; the kubelet proactively rotates the token when 80% of its lifetime has elapsed, i.e., ~48 minutes by default)
  2. Audience-scoped — Tokens are valid only for specific audiences (typically the API server)
  3. Pod-bound — Tokens are invalidated when the pod is deleted
  4. Auto-rotated — The kubelet refreshes tokens before expiration

This is a significant security improvement over the old model, where a leaked ServiceAccount token granted permanent access until manually revoked.

# Explicit token request for non-Kubernetes consumers
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  serviceAccountName: my-app-sa
  containers:
    - name: app
      image: my-app:latest
      volumeMounts:
        - name: token
          mountPath: /var/run/secrets/tokens
  volumes:
    - name: token
      projected:
        sources:
          - serviceAccountToken:
              path: api-token
              expirationSeconds: 3600
              audience: my-external-service
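
The same bound-token machinery is available imperatively. Since v1.24, kubectl create token requests a short-lived token for a ServiceAccount; the name and audience below match the pod spec above:

kubectl create token my-app-sa \
  --duration=1h \
  --audience=my-external-service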

OIDC Integration for Human Users

Production clusters should authenticate human users via OpenID Connect (OIDC) rather than client certificates, which cannot be revoked once issued. OIDC delegates authentication to an external identity provider (Okta, Azure AD, Google Workspace, Dex).

The flow works as follows:

  1. User authenticates with the identity provider (browser-based login)
  2. Identity provider issues an ID token (JWT) containing username and groups
  3. kubectl sends the ID token with each API request
  4. API server validates the token signature against the OIDC provider’s public keys
  5. RBAC evaluates the extracted username and groups against bindings

This means group membership is managed in your identity provider, not in Kubernetes. When someone leaves the team, disabling their IdP account immediately revokes cluster access.
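
On self-managed clusters, OIDC is configured through API server flags. A representative sketch; the issuer URL and client ID are placeholders, and the claim names depend on your identity provider (managed offerings like EKS, GKE, and AKS expose this through their own configuration instead):

kube-apiserver \
  --oidc-issuer-url=https://idp.example.com \
  --oidc-client-id=kubernetes \
  --oidc-username-claim=email \
  --oidc-groups-claim=groups \
  --oidc-groups-prefix="oidc:"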

Multi-Tenant RBAC Design

Most production clusters serve multiple teams. The standard model is namespace-per-tenant with a three-tier access structure:

MULTI-TENANT NAMESPACE MODEL
──────────────────────────────

  Cluster
  ├── Namespace: team-alpha-dev
  │   ├── RoleBinding: alpha-devs → ClusterRole:edit
  │   ├── RoleBinding: alpha-leads → ClusterRole:admin
  │   ├── RoleBinding: platform-team → ClusterRole:admin
  │   └── ResourceQuota + LimitRange
  │
  ├── Namespace: team-alpha-prod
  │   ├── RoleBinding: alpha-ci → ClusterRole:edit   (ServiceAccount)
  │   ├── RoleBinding: alpha-leads → ClusterRole:admin
  │   ├── RoleBinding: platform-team → ClusterRole:admin
  │   └── ResourceQuota + LimitRange
  │
  ├── Namespace: team-beta-dev
  │   ├── RoleBinding: beta-devs → ClusterRole:edit
  │   ├── RoleBinding: beta-leads → ClusterRole:admin
  │   ├── RoleBinding: platform-team → ClusterRole:admin
  │   └── ResourceQuota + LimitRange
  │
  └── Namespace: kube-system (platform only)
      └── ClusterRoleBinding: platform-team → cluster-admin

  THREE-TIER MODEL
  ─────────────────
  Tier 1: Platform Team    → cluster-admin (ClusterRoleBinding)
  Tier 2: Team Leads       → admin per namespace (RoleBinding)
  Tier 3: Developers       → edit per namespace (RoleBinding)

Design principles:

  • Bind to Groups, not Users. When Alice joins team-alpha, add her to the alpha-devs group in your identity provider. No Kubernetes RBAC changes needed.
  • Use ClusterRoles with namespaced RoleBindings. Define permissions once, apply per-namespace.
  • Every namespace gets ResourceQuota and LimitRange. RBAC controls what you can do; quotas control how much (a sketch follows this list).
  • CI/CD uses dedicated ServiceAccounts with edit permissions scoped to specific namespaces. Never share ServiceAccounts across pipelines.
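
A minimal sketch of those per-namespace guardrails; the numbers are placeholders to size per team:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha-dev
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    persistentvolumeclaims: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-defaults
  namespace: team-alpha-dev
spec:
  limits:
    - type: Container
      default:                 # limits applied when a container sets none
        cpu: "1"
        memory: 1Gi
      defaultRequest:          # requests applied when a container sets none
        cpu: 250m
        memory: 256Mi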

Least Privilege Checklist

  1. Every workload has its own ServiceAccount
  2. automountServiceAccountToken: false unless the workload needs API access
  3. No ClusterRoleBindings to cluster-admin except for the platform team
  4. No wildcard verbs or resources in custom Roles
  5. Groups (not individual users) in all RoleBindings
  6. OIDC for human users, bound tokens for workloads
  7. Regular audits of who can access secrets and create pods (pod creation implies secret access)

Common Mistakes and Misconceptions

  • “cluster-admin for everyone is fine in dev.” Bad habits in dev carry to production. Practice least-privilege from the start. Create namespace-scoped roles that match what each team actually needs.
  • “RBAC denies by default, so I’m secure.” RBAC only controls API access. It doesn’t prevent a compromised pod from attacking the network, reading the filesystem, or accessing cloud metadata. RBAC is one layer of defense, not the whole strategy.
  • “I can see who has access by reading RoleBindings.” Aggregated ClusterRoles, group memberships, and impersonation make the effective permission set non-obvious. Use kubectl auth can-i --list --as=user to audit actual permissions.

Next: Network Policies — Controlling pod-to-pod traffic with ingress and egress rules.

Chapter 26: Network Policies

By default, every pod can reach every other pod — no firewalls, no segmentation. A compromised pod can reach databases, scan the cluster network, and exfiltrate data.

The Fundamental Model

Kubernetes Network Policies operate on three principles:

  1. Non-isolated by default. A pod with no Network Policy selecting it accepts all inbound and all outbound traffic. Network Policies are opt-in.

  2. Additive allow-only. There are no “deny” rules. Policies can only allow traffic. If you create a policy that selects a pod, that pod becomes isolated for the direction(s) specified (ingress, egress, or both). Once isolated, only traffic explicitly allowed by a policy is permitted.

  3. Both sides must allow. For traffic to flow from pod A to pod B, the egress policy on pod A must allow traffic to B, AND the ingress policy on pod B must allow traffic from A. If either side denies (by isolation without a matching allow), the traffic is dropped.

NETWORK POLICY TRAFFIC FLOW
─────────────────────────────

  Pod A (team-alpha)              Pod B (team-beta)
  ┌───────────────────┐           ┌───────────────────┐
  │                   │           │                   │
  │  Egress Policy:   │           │  Ingress Policy:  │
  │  "allow to        │──────────▶│  "allow from      │
  │  team-beta pods"  │  Traffic  │  team-alpha pods" │
  │                   │  flows    │                   │
  └───────────────────┘  only if  └───────────────────┘
                         BOTH
                         allow

  If Pod A has no egress policy → Pod A is non-isolated
    for egress → all egress allowed (A's side: OK)
  If Pod B has no ingress policy → Pod B is non-isolated
    for ingress → all ingress allowed (B's side: OK)
  If Pod A has egress policy that does NOT list Pod B
    → traffic BLOCKED at A's side

The Essential Policy Templates

Default Deny All Ingress

The most important policy in any cluster. Apply this to every namespace and then add specific allow rules.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}          # Empty selector = all pods in namespace
  policyTypes:
    - Ingress              # Isolate for ingress; no ingress rules = deny all
  # No ingress rules → all inbound traffic denied

Default Deny All Egress

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  # No egress rules → all outbound traffic denied

Warning: Denying all egress breaks DNS resolution. Pods will not be able to resolve service names. You almost always need to pair this with a DNS allow rule (see below).

Default Deny Both Directions

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Allow DNS Egress (Critical)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

Namespace Isolation

Allow traffic only from pods within the same namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # All pods in THIS namespace

Specific Pod-to-Pod Communication

Allow only the frontend to reach the backend, and only the backend to reach the database:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend
      ports:
        - protocol: TCP
          port: 5432

Egress to External IPs

Allow pods to reach a specific external service (e.g., a third-party API):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 203.0.113.0/24     # External API range
      ports:
        - protocol: TCP
          port: 443

The AND vs OR Selector Trap

This is the single most common source of Network Policy bugs. The behavior changes depending on whether selectors appear in the same from/to item or in separate list items.

THE SELECTOR LOGIC TRAP
─────────────────────────

  COMBINED (AND logic) --- both conditions must match:

  ingress:
    - from:
        - namespaceSelector:        ┐
            matchLabels:            │  AND
              env: production       │
          podSelector:              │  Both must be true:
            matchLabels:            │  namespace=production
              app: frontend         ┘  AND app=frontend

  SEPARATE (OR logic) --- either condition can match:

  ingress:
    - from:
        - namespaceSelector:        ← OR: any pod in namespace
            matchLabels:               with env=production
              env: production
        - podSelector:              ← OR: any pod in SAME namespace
            matchLabels:               with app=frontend
              app: frontend

  THE DIFFERENCE:
  Combined: Only frontend pods in production namespaces
  Separate: ALL pods in production namespaces
            OR frontend pods in the CURRENT namespace

The difference is a single - character (a new list item). Combined selectors are intersections (AND). Separate selectors are unions (OR). Getting this wrong can open your namespace to traffic from every pod in a production-labeled namespace.

CNI Support: The Enforcement Gap

Network Policies are a Kubernetes API object. Any cluster accepts them. But enforcing them requires a CNI plugin that implements the NetworkPolicy specification. If your CNI does not support Network Policies, policies exist in etcd but are silently ignored — no warning, no effect on traffic.

  • Calico (full support): The most widely deployed policy-capable CNI. Supports both Kubernetes NetworkPolicy and its own more expressive CRDs (GlobalNetworkPolicy, deny rules, application-layer policies).
  • Cilium (full, plus extensions): eBPF-based. Supports Kubernetes NetworkPolicy plus CiliumNetworkPolicy with L7 (HTTP, gRPC, Kafka) filtering, DNS-aware policies, and identity-based enforcement.
  • Weave Net (full support): Supports standard NetworkPolicy. Less common in new deployments.
  • Antrea (full support): VMware-backed, built on Open vSwitch. Good support for NetworkPolicy and its own Antrea-native policies.
  • Flannel (no support): Flannel provides connectivity only. If you apply a NetworkPolicy on a Flannel cluster, it is silently ignored. This is the most common enforcement gap in production.
  • kubenet (no support): Basic CNI for simple clusters. No policy support.

How to verify enforcement: Deploy two pods. Apply a deny-all ingress policy to the target pod’s namespace. Attempt to connect from the source pod. If the connection succeeds, your CNI is not enforcing policies.

# Quick verification test
kubectl run source --image=busybox --rm -it --restart=Never -- \
  wget -qO- --timeout=3 http://target-pod-ip:8080
# If this succeeds after a deny-all policy, your CNI does not enforce policies

A Complete Namespace Policy Set

A production namespace typically needs a layered set of policies. Here is a complete example for a three-tier application:

POLICY LAYERING FOR A NAMESPACE
─────────────────────────────────

  production namespace
  ┌──────────────────────────────────────────────────┐
  │                                                  │
  │  Policy 1: default-deny-all (ingress + egress)   │
  │  Policy 2: allow-dns (egress to kube-dns)        │
  │                                                  │
  │  ┌──────────┐    ┌──────────┐    ┌──────────┐    │
  │  │ frontend │───▶│ backend  │───▶│ database │    │
  │  │          │    │          │    │          │    │
  │  └──────────┘    └──────────┘    └──────────┘    │
  │       ▲                │              │          │
  │       │                ▼              │          │
  │  Policy 3:        Policy 5:      Policy 7:       │
  │  allow ingress    allow egress   deny all        │
  │  from ingress     to database    egress (no      │
  │  controller       on 5432        external)       │
  │                                                  │
  │  Policy 4:        Policy 6:                      │
  │  allow egress     allow ingress                  │
  │  to backend       from backend                   │
  │  on 8080          on 5432                        │
  │                                                  │
  └──────────────────────────────────────────────────┘

  External traffic → ingress controller → frontend → backend → database
  Every other path is blocked.
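
As one concrete piece of this set, Policy 3 might look like the following. This is a sketch that assumes the ingress controller runs in an ingress-nginx namespace with standard labels and that the frontend listens on port 3000 (both assumptions; adjust to your cluster):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-from-ingress-controller
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Combined selectors: pods labeled as the ingress controller,
        # in the ingress-nginx namespace (AND logic; see the trap above)
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
          podSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
      ports:
        - protocol: TCP
          port: 3000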

Debugging Network Policies

When traffic is unexpectedly blocked:

  1. Check that policies exist: kubectl get networkpolicy -n <namespace>
  2. Verify CNI enforcement: Test with a known-blocked connection
  3. Inspect the policy: kubectl describe networkpolicy <name> -n <namespace>
  4. Check labels: Policies select pods by label. A missing or misspelled label means the policy does not apply to the pod you think it does. kubectl get pods --show-labels -n <namespace>
  5. Check DNS: If pods can connect by IP but not by name, the egress DNS rule is missing or incorrect
  6. Remember the AND/OR trap: Review your from/to selectors for unintended union logic

Limitations of Kubernetes Network Policies

The standard NetworkPolicy API has real limitations:

  • No deny rules. You cannot explicitly block a specific source. You can only fail to allow it.
  • No logging. There is no built-in way to log dropped packets.
  • No cluster-wide policies. Each NetworkPolicy is namespaced. There is no way to apply a policy across all namespaces without creating it in each one.
  • No L7 filtering. Standard policies operate at L3/L4 (IP and port). They cannot distinguish between GET /api/public and DELETE /api/admin.

For these capabilities, use your CNI’s extended policy CRDs. Calico’s GlobalNetworkPolicy and Cilium’s CiliumNetworkPolicy both address these gaps.
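
As an illustration of what the extended CRDs can express, here is a sketch of an L7 rule as a CiliumNetworkPolicy (labels, port, and path are assumptions for this example):

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-read-only
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              # Allow GET on the public API; everything else
              # (including DELETE /api/admin) is dropped at L7
              - method: "GET"
                path: "/api/public/.*"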

Common Mistakes and Misconceptions

  • “Pods are isolated by default.” The opposite: all pods can reach all other pods by default. You must explicitly create NetworkPolicies to restrict traffic. No policy = fully open.
  • “A NetworkPolicy on ingress also blocks egress.” Ingress and egress are independent. A policy selecting only ingress rules does not restrict outbound traffic. You need separate egress rules.
  • “My CNI supports NetworkPolicy.” See the CNI support table above. Always verify enforcement with a test connection.
  • “NetworkPolicies work across namespaces automatically.” You must use namespaceSelector to allow cross-namespace traffic. A policy only applies to pods in its own namespace.

Next: Supply Chain Security — Image signing, admission policies, SBOMs, and the SLSA framework.

Chapter 27: Supply Chain Security

A container image passes through source code, build systems, registries, and your cluster — each step is an opportunity for compromise. Supply chain security verifies that nothing was tampered with along the way.

This is not a theoretical concern. The SolarWinds attack (2020) injected malicious code into a build pipeline. The Codecov breach (2021) modified a bash uploader to exfiltrate credentials. The xz utils backdoor (2024) hid a sophisticated compromise in a compression library used by SSH. Kubernetes clusters are particularly exposed because they pull images from external registries on every deployment, and a single compromised base image can propagate to hundreds of workloads.

The Problem in Layers

THE SOFTWARE SUPPLY CHAIN
──────────────────────────

  Source Code ──▶ Build System ──▶ Registry ──▶ Cluster
       │              │              │             │
       ▼              ▼              ▼             ▼
  Was the code     Was the build   Was the       Is the image
  reviewed?        tampered with?  image         allowed to
  Who authored     Was the build   modified      run? Was it
  this commit?     reproducible?   in transit    signed? Is it
                                   or at rest?   from a trusted
                                                 registry?

  ATTACK SURFACE AT EACH STAGE:
  ┌─────────┐     ┌─────────┐    ┌──────────┐  ┌─────────────┐
  │ Typo-   │     │ Build   │    │ Registry │  │ Deployment  │
  │ squatted│     │ system  │    │ compro-  │  │ of unsigned │
  │ deps    │     │ compro- │    │ mised    │  │ or outdated │
  │         │     │ mised   │    │          │  │ images      │
  └─────────┘     └─────────┘    └──────────┘  └─────────────┘

Image Signing with Sigstore/Cosign

Sigstore is the dominant open-source project for signing and verifying container images. Its key innovation is keyless signing — you do not need to manage long-lived signing keys. Instead, you prove your identity through an existing OIDC provider (GitHub Actions, Google, Microsoft), and Sigstore issues a short-lived certificate tied to that identity.

The Keyless Signing Flow

SIGSTORE KEYLESS SIGNING PIPELINE
───────────────────────────────────

  Developer / CI Pipeline
       │
       │ 1. Request identity token (OIDC)
       ▼
  ┌────────────┐
  │  OIDC      │  GitHub Actions, Google, etc.
  │  Provider  │  Issues JWT with identity claims
  └─────┬──────┘
        │ 2. Present OIDC token
        ▼
  ┌────────────┐
  │  Fulcio    │  Sigstore's certificate authority
  │  (CA)      │  Verifies OIDC token
  │            │  Issues short-lived X.509 cert
  │            │  (valid ~20 minutes)
  └─────┬──────┘
        │ 3. Ephemeral certificate + private key
        ▼
  ┌────────────┐
  │  Cosign    │  Signs the image digest using
  │  (client)  │  the ephemeral private key
  │            │  Pushes signature to registry
  └─────┬──────┘
        │ 4. Record signing event
        ▼
  ┌────────────┐
  │  Rekor     │  Sigstore's transparency log
  │  (log)     │  Immutable, append-only record
  │            │  Proves signing happened at
  │            │  a specific time with a
  │            │  specific identity
  └────────────┘

  VERIFICATION:
  cosign verify checks:
  ✓ Signature matches image digest
  ✓ Certificate was issued by Fulcio
  ✓ Certificate identity matches expected signer
  ✓ Signing event exists in Rekor transparency log

Cosign in Practice

# Sign an image (keyless, in CI)
cosign sign ghcr.io/myorg/myapp@sha256:abc123...

# Verify an image
cosign verify \
  --certificate-identity=https://github.com/myorg/myapp/.github/workflows/build.yml@refs/heads/main \
  --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
  ghcr.io/myorg/myapp@sha256:abc123...

# Sign with a key pair (traditional, for air-gapped environments)
cosign generate-key-pair
cosign sign --key cosign.key ghcr.io/myorg/myapp@sha256:abc123...
cosign verify --key cosign.pub ghcr.io/myorg/myapp@sha256:abc123...

Notation / Notary v2

Notation (the CNCF’s Notary v2 project) takes a traditional PKI approach. You manage your own signing keys and certificates, sign images using the notation CLI, and store signatures as OCI artifacts alongside the image in the registry.

Notation is the right choice when your organization already has a PKI infrastructure, when you need to comply with regulations that require specific key management practices, or when you operate in air-gapped environments where Sigstore’s online services (Fulcio, Rekor) are not reachable.
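
The CLI workflow mirrors Cosign's key-pair mode. A sketch (key registration, certificate, and trust policy setup omitted):

# Sign an image with a key previously registered via `notation key add`
notation sign registry.internal.company.com/myapp@sha256:abc123...

# Verify against the locally configured trust policy and trust store
notation verify registry.internal.company.com/myapp@sha256:abc123...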

  Feature                 Cosign (Sigstore)            Notation (Notary v2)
  ───────                 ─────────────────            ────────────────────
  Key management          Keyless (OIDC) or key-pair   Key-pair with PKI
  Certificate authority   Fulcio (public)              Your own CA
  Transparency log        Rekor (public)               None (optional)
  Air-gapped support      Requires key-pair mode       Native
  Ecosystem adoption      Wider (GitHub, GCP, AWS)     Growing (Azure ACR native)
  Signature storage       OCI registry                 OCI registry

Admission Control: Enforcing Policy at Deploy Time

Signing images is useless unless you verify signatures before deployment. This is the job of admission controllers — webhook-based components that intercept API requests and enforce policies before objects are created.

OPA Gatekeeper vs Kyverno

  Feature              OPA Gatekeeper                                 Kyverno
  ───────              ──────────────                                 ───────
  Policy language      Rego (purpose-built, steep learning curve)     YAML (Kubernetes-native, familiar)
  Mutation             Supported (via assign/modify)                  Native (mutate rules in policy)
  Generation           Not supported                                  Native (generate resources from policy)
  Image verification   Via external data or custom Rego               Built-in verifyImages rule
  Validation           Core strength                                  Core strength
  Audit mode           Built-in (audit violations without blocking)   Built-in (audit/enforce modes)
  Learning curve       High (Rego is a new language)                  Low (YAML-native)
  Community            Mature, CNCF Graduated                         Fast-growing, CNCF Incubating
  Policy library       Gatekeeper Library                             Kyverno Policies

For image verification specifically, Kyverno has a significant advantage: signature verification is a first-class feature, not something you bolt on with Rego functions.

# Kyverno policy: require Cosign signature from trusted identity
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "ghcr.io/myorg/*"
          attestors:
            - entries:
                - keyless:
                    issuer: "https://token.actions.githubusercontent.com"
                    subject: "https://github.com/myorg/*"
                    rekor:
                      url: "https://rekor.sigstore.dev"

SBOM: Software Bill of Materials

An SBOM is a machine-readable inventory of every component in a container image — every package and dependency. It answers the question: “When the next Log4Shell happens, are we affected?”

Generation tools:

  • Trivy — Generates SBOMs as part of its scanning workflow. Supports SPDX and CycloneDX formats. Can scan container images, filesystems, and Git repositories.
  • Syft — Anchore’s dedicated SBOM generator. Deeper catalog of package types. Outputs SPDX, CycloneDX, and its own JSON format.

Formats:

  • SPDX — Linux Foundation standard. Widely adopted for compliance. Verbose.
  • CycloneDX — OWASP standard. More focused on security use cases. Lighter.

# Generate SBOM with Trivy
trivy image --format cyclonedx --output sbom.json ghcr.io/myorg/myapp:latest

# Generate SBOM with Syft
syft ghcr.io/myorg/myapp:latest -o spdx-json > sbom.spdx.json

# Attach SBOM to image with Cosign
cosign attach sbom --sbom sbom.json ghcr.io/myorg/myapp@sha256:abc123...
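
The stored SBOM pays off later: when the next critical CVE drops, you can scan the SBOM directly instead of re-pulling and re-analyzing every image (a sketch using Trivy's SBOM scanning mode):

# Scan a previously generated SBOM for known vulnerabilities
trivy sbom sbom.json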

Image Scanning

Scanning should happen in CI, in the registry, and at admission time (via Kyverno/Gatekeeper).
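
A minimal CI gate with Trivy looks like this (the flags shown are standard Trivy options; thresholds are a policy choice):

# Fail the pipeline if the image has critical or high, fixable CVEs
trivy image --severity CRITICAL,HIGH --ignore-unfixed --exit-code 1 \
  ghcr.io/myorg/myapp:latest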

The SLSA Framework

SLSA (Supply-chain Levels for Software Artifacts, pronounced “salsa”) is a framework from Google that defines increasingly rigorous levels of supply chain integrity.

  Level   Name                Requirements
  ─────   ────                ────────────
  0       No guarantees       No SLSA compliance
  1       Provenance exists   The build process generates provenance metadata
                              documenting how the artifact was built
  2       Hosted build        Build runs on a hosted service (not a developer
                              laptop). Provenance is signed.
  3       Hardened builds     Build service is hardened against tampering.
                              Provenance is non-forgeable. Build is isolated.
                              Source is version-controlled.

GitHub Actions with reusable workflows can achieve SLSA Level 3 using the slsa-framework/slsa-github-generator action, which produces signed provenance attestations.
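
A sketch of the calling workflow. The generator's inputs and version tag below are taken from the project's documentation and may change; verify against the slsa-github-generator repository before use:

# .github/workflows/release.yml (fragment; assumes a `build` job that
# pushes the image and exposes its digest as an output)
jobs:
  provenance:
    needs: [build]
    permissions:
      actions: read        # Read the workflow path for provenance
      id-token: write      # Sign via Sigstore keyless (OIDC)
      packages: write      # Push the attestation to the registry
    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@v2.0.0
    with:
      image: ghcr.io/myorg/myapp
      digest: ${{ needs.build.outputs.digest }}
    secrets:
      registry-username: ${{ github.actor }}
      registry-password: ${{ secrets.GITHUB_TOKEN }}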

Restricting Image Registries

A fundamental control: only allow images from registries you trust.

# Kyverno: restrict to approved registries
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-registries
spec:
  validationFailureAction: Enforce
  rules:
    - name: allowed-registries
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must come from approved registries."
        pattern:
          spec:
            containers:
              - image: "ghcr.io/myorg/* | registry.internal.company.com/*"
            initContainers:
              - image: "ghcr.io/myorg/* | registry.internal.company.com/*"

Putting It Together: The Secure Pipeline

END-TO-END SUPPLY CHAIN SECURITY
──────────────────────────────────

  ┌─────────────┐
  │ Source Code │  Signed commits, code review,
  │             │  dependency scanning (Dependabot)
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │  CI Build   │  SLSA Level 2+: hosted, signed provenance
  │  (GitHub    │  Trivy scan: fail on CRITICAL
  │   Actions)  │  SBOM generation (Syft/Trivy)
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │  Sign &     │  Cosign keyless sign
  │  Attest     │  Attach SBOM attestation
  │             │  Record in Rekor transparency log
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │  Registry   │  Continuous scanning
  │  (GHCR/ECR) │  Image retention policy
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │  Admission  │  Kyverno/Gatekeeper:
  │  Control    │  ✓ Signature verified
  │             │  ✓ Registry allowed
  │             │  ✓ No critical CVEs
  │             │  ✓ SBOM attached
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │  Runtime    │  Pod Security Standards
  │  Cluster    │  Network Policies
  │             │  Runtime monitoring (Falco)
  └─────────────┘

Common Mistakes and Misconceptions

  • “I scan images once and they’re secure.” New CVEs are discovered daily. Images that were clean yesterday may have critical vulnerabilities today. Continuous scanning in the registry (not just at build time) is essential.
  • “Using official base images means no vulnerabilities.” Even official images contain OS packages with CVEs. Use distroless or scratch-based images to minimize attack surface. Regularly rebuild images to pick up base image patches.
  • “Image signing is enough.” Signing proves provenance but not safety. A signed image can still contain vulnerabilities. Signing + scanning + admission policy (Kyverno/Gatekeeper) together form the chain.

Next: Secrets Management — Encryption at rest, KMS integration, and external secrets operators.

Chapter 28: Secrets Management

Kubernetes Secrets are base64-encoded. This is not encryption. Base64 is a reversible encoding — echo "cGFzc3dvcmQxMjM=" | base64 -d produces password123 instantly. Every tutorial mentions this, yet production clusters routinely store database passwords, API keys, and TLS certificates in Secrets with no additional protection. The data sits in etcd in plaintext (or rather, in trivially decodable base64), readable by anyone with access to the etcd data directory or sufficient RBAC permissions.

This chapter covers the full spectrum of secrets protection: encrypting data at rest in etcd, integrating with external key management systems, and using external secrets operators that keep sensitive data out of Kubernetes entirely.

The Default: No Encryption

When you create a Secret, Kubernetes stores it in etcd. By default, the identity provider is used, which means the data is stored as-is (base64-encoded, not encrypted). Anyone with read access to etcd — a backup, a compromised node, a misconfigured endpoint — can read every secret in the cluster.

flowchart LR
    kubectl["<b>kubectl</b><br>create secret generic db-creds<br>--from-literal=password=hunter2"]
    api["<b>API Server</b>"]
    etcd["<b>etcd</b><br>/registry/secrets/default/db-creds<br><br>data:<br>&nbsp;&nbsp;password: aHVudGVyMg==<br><br>base64 'hunter2' — NOT encrypted"]

    kubectl --> api --> etcd
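
You can see this directly on a control plane node. A sketch assuming kubeadm's default etcd certificate paths:

# Read the raw etcd value for a Secret
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get /registry/secrets/default/db-creds
# Without encryption at rest, the output contains the secret payload
# in plain (base64-decodable) form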

Encryption at Rest

Kubernetes supports encrypting Secret data before it reaches etcd. You configure this through an EncryptionConfiguration file referenced by the API server’s --encryption-provider-config flag.

Encryption Providers

  • identity: No encryption (plaintext), no key management. The default. Insecure.
  • aescbc: AES-256-CBC with a static key in the config file. Simple encryption, but the key lives on disk alongside the API server.
  • aesgcm: AES-256-GCM with a static key in the config file. Authenticated encryption (integrity + confidentiality), but its random 96-bit nonces make frequent key rotation mandatory; the Kubernetes documentation recommends rotating roughly every 200,000 writes to keep nonce-collision risk negligible.
  • secretbox: XSalsa20-Poly1305 with a static key in the config file. Modern authenticated encryption; preferred over aescbc/aesgcm for static-key setups.
  • kms v2: Envelope encryption backed by an external KMS (AWS KMS, GCP KMS, Azure Key Vault, HashiCorp Vault). Production-grade; keys never leave the KMS.

Basic EncryptionConfiguration

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - secretbox:                    # Primary: encrypt with secretbox
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}                  # Fallback: read unencrypted data

The provider order matters. The first provider is used for writing (encrypting new secrets). All listed providers are tried for reading (so you can decrypt data written by a previous provider during key rotation). The identity provider at the end ensures that secrets written before encryption was enabled can still be read.
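
After enabling encryption, re-encrypt existing Secrets and confirm that etcd now stores ciphertext. Encrypted values carry a provider prefix (k8s:enc:secretbox:v1:key1 for the configuration above):

# Rewrite every Secret so it passes through the new encryption provider
kubectl get secrets --all-namespaces -o json | kubectl replace -f -

# Inspect the raw value (same etcdctl flags as in the earlier example);
# the payload should begin with k8s:enc:secretbox:v1:key1, not readable data
ETCDCTL_API=3 etcdctl get /registry/secrets/default/db-creds | hexdump -C | head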

KMS v2 Envelope Encryption

Static keys stored in configuration files have an obvious weakness: the key is on the same machine as the encrypted data. If someone compromises the API server node, they have both the ciphertext and the key. KMS v2 solves this with envelope encryption.

flowchart TD
    A["<b>1. API Server</b><br>Generates random DEK<br>(plaintext, cached in memory)"]
    B["<b>2. External KMS</b><br>AWS KMS / GCP Cloud KMS /<br>Vault / Azure Key Vault"]
    C["<b>3. API Server</b><br>Encrypts secret data with plaintext DEK"]
    D["<b>4. etcd</b><br>Stores encrypted DEK + encrypted data"]

    A -- "Send plaintext DEK<br>for encryption" --> B
    B -- "Return encrypted DEK<br>(wrapped with KEK;<br>KEK never leaves KMS)" --> C
    C -- "Store enc(DEK) + enc(data)" --> D

Key insight: The KEK never leaves the KMS. Even if etcd is fully compromised, the attacker has encrypted data and an encrypted DEK but no way to decrypt either without KMS access. The plaintext DEK is cached in API Server memory and never written to disk.

KMS v2 Configuration

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          apiVersion: v2
          name: aws-kms-provider
          endpoint: unix:///var/run/kmsplugin/socket.sock
          timeout: 3s
      - identity: {}

The KMS plugin runs as a separate process (typically a DaemonSet or static pod on control plane nodes) that translates between the Kubernetes KMS gRPC protocol and your cloud provider’s KMS API.

Key Rotation

For static key providers, rotation requires four steps:

  1. Add the new key as the first entry in the keys list (so new writes use it)
  2. Restart the API server to pick up the configuration change
  3. Re-encrypt all existing secrets: kubectl get secrets --all-namespaces -o json | kubectl replace -f -
  4. Remove the old key from the configuration

For KMS v2, rotation happens in the KMS itself. When you rotate the KEK in AWS KMS or GCP Cloud KMS, new DEKs are wrapped with the new KEK. Existing secrets are re-encrypted on next write or via the re-encryption command above.

External Secrets Solutions

Encrypting at rest protects data in etcd, but the secrets still exist as Kubernetes Secret objects — visible to anyone with RBAC read access, exposed in pod environment variables, logged by admission webhooks. External secrets solutions keep the canonical secret in an external system and sync or inject it into pods.

Sealed Secrets

What it is: A controller that encrypts secrets with a public key so they can be safely stored in Git. Only the controller running in the cluster has the private key to decrypt them.

How it works: You use kubeseal to encrypt a Secret into a SealedSecret custom resource. The SealedSecret can be committed to Git. The controller decrypts it and creates the corresponding Secret in the cluster.

# Encrypt a secret for Git storage
kubectl create secret generic db-creds \
  --from-literal=password=hunter2 --dry-run=client -o yaml \
  | kubeseal --controller-namespace kube-system \
    --controller-name sealed-secrets -o yaml > sealed-db-creds.yaml

Trade-offs: Simple to deploy, works with GitOps, no external dependencies beyond the controller. But the decrypted Secret still exists in etcd as a standard Kubernetes Secret. Sealed Secrets protect the Git side, not the runtime side.

External Secrets Operator (ESO)

What it is: A controller that syncs secrets from external providers (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, HashiCorp Vault, 1Password, and many more) into Kubernetes Secrets.

flowchart TD
    aws["<b>AWS Secrets Manager</b><br>prod/db-pass"]
    gcp["<b>GCP Secret Manager</b><br>api-key"]
    store["<b>SecretStore / ClusterSecretStore</b><br>(auth config per provider)"]
    eso["<b>External Secrets Operator</b><br>Reads ExternalSecret CRs<br>Fetches values from providers<br>Creates/updates K8s Secrets<br>Syncs on interval (e.g. every 1h)"]
    secret["<b>Kubernetes Secret</b><br>(auto-created, kept in sync)"]
    pods["<b>Pods</b><br>Mounted as files or env vars"]

    aws --> store
    gcp --> store
    store --> eso
    eso --> secret
    secret --> pods

# SecretStore: how to authenticate to the provider
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets
  namespace: production
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: eso-sa    # Uses IRSA for authentication
---
# ExternalSecret: what to sync
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets
  target:
    name: db-credentials     # Name of the K8s Secret to create
  data:
    - secretKey: password
      remoteRef:
        key: prod/database/password

HashiCorp Vault

Vault provides dynamic secret generation (short-lived database credentials created on demand), PKI certificate issuance, transit encryption (encrypt data without exposing keys), and detailed audit logging.

Vault integrates with Kubernetes in three ways:

Agent Sidecar Injector — A mutating webhook injects a Vault Agent sidecar into pods. The agent authenticates to Vault using the pod’s ServiceAccount, retrieves secrets, and writes them to a shared volume. The application reads secrets from files.

CSI Provider — The Vault CSI provider mounts secrets as a CSI volume. Simpler than the sidecar approach but with fewer features (no dynamic renewal).

Vault Secrets Operator (VSO) — The newest approach. A Kubernetes operator that syncs Vault secrets into Kubernetes Secret objects, similar to ESO but Vault-specific and with native Vault features like dynamic secrets and lease renewal.
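
For the sidecar injector, integration is annotation-driven. A sketch (the Vault role and secret path are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
  namespace: production
spec:
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "payments"
        # Renders the secret to /vault/secrets/db-creds inside the pod
        vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/payments"
    spec:
      serviceAccountName: payments    # Vault authenticates this identity
      containers:
        - name: app
          image: ghcr.io/myorg/payments:1.0.0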

Comparison

See Appendix C: Decision Trees for a secret management decision flowchart.

  Feature               Sealed Secrets           ESO                               Vault
  ───────               ──────────────           ───                               ─────
  Complexity            Low                      Medium                            High
  External dependency   None (controller only)   Cloud provider secrets service    Vault cluster
  Git-safe secrets      Yes (primary purpose)    No (syncs from cloud)             No
  Dynamic secrets       No                       No                                Yes (database creds, PKI certs)
  Multi-cloud           N/A                      Yes (many providers)              Yes (one Vault, many consumers)
  Audit logging         No                       Provider-dependent                Yes (detailed)
  Cost                  Free                     Free + cloud secrets              Free (OSS) or paid (Enterprise),
                                                 service pricing                   plus operational cost
  Best for              Small teams, GitOps      Cloud-native, multi-provider      Enterprise, strict compliance,
                                                                                   dynamic secrets

Best Practices

Mount secrets as files, not environment variables. Environment variables are exposed in /proc/<pid>/environ, appear in crash dumps, and are inherited by child processes. File-mounted secrets can have restrictive file permissions and are not leaked through process inspection.

# Preferred: mount as file
containers:
  - name: app
    volumeMounts:
      - name: db-creds
        mountPath: /etc/secrets
        readOnly: true
volumes:
  - name: db-creds
    secret:
      secretName: db-credentials
      defaultMode: 0400     # Read-only by owner

Use short-lived credentials. A database password that never expires is a permanently valid attack vector. Vault’s dynamic secrets generate credentials with a TTL (e.g., 1 hour). When the lease expires, Vault revokes the credentials. ESO’s refresh interval keeps synced secrets current.

Scope RBAC for secrets. Not every developer needs kubectl get secrets. Restrict Secret read access to the specific ServiceAccounts and namespaces that need it. Remember that pod creation implies secret access (anyone who can create a pod can mount any secret in the namespace).

Audit secret access. Enable Kubernetes audit logging for Secret read operations. In Vault, audit logging is built in and records every secret access with the requesting identity.

Rotate regularly. Automate key rotation for encryption at rest. Automate credential rotation for application secrets. Test that rotation does not cause downtime.

Never log secrets. Ensure admission webhooks, logging sidecars, and debug tools do not capture secret values. Mask sensitive fields in application logs.

Common Mistakes and Misconceptions

  • “Kubernetes Secrets are encrypted.” By default, Secrets are stored as base64 in etcd — which is encoding, not encryption. You must enable encryption at rest (EncryptionConfiguration) or use an external KMS provider.
  • “Sealed Secrets or External Secrets solve everything.” These tools solve the GitOps problem (how to store secrets in Git). They don’t solve rotation, access auditing, or least-privilege access. Use them with a proper vault backend.

Next: Pod Security Standards — Privileged, Baseline, and Restricted profiles with Pod Security Admission.

Chapter 29: Pod Security Standards

A container is a process running on a Linux host. By default, Kubernetes places remarkably few restrictions on what that process can do. A pod can run as root, mount the host filesystem, share the host network namespace, escalate privileges, and disable security profiles. Each of these capabilities is a legitimate attack surface. Pod Security Standards define three profiles — Privileged, Baseline, and Restricted — that progressively lock down what pods are allowed to do. Pod Security Admission (PSA) enforces these profiles at the namespace level, providing a built-in mechanism to prevent dangerous pod configurations from ever reaching the cluster.

This chapter covers the standards themselves, the admission controller that enforces them, and the migration path from the now-removed PodSecurityPolicy (PSP) to the current model.

Why Pod-Level Security Matters

Consider what an attacker gains from a compromised container with no security restrictions:

  • Privileged mode: Full access to host devices, effectively root on the node
  • Host PID namespace: See and signal every process on the node
  • Host network namespace: Bind to any port on the node, sniff network traffic
  • hostPath volumes: Read and write any file on the node filesystem
  • Root user: Write to container filesystem, install tools, exploit kernel vulnerabilities
  • Privilege escalation: Gain capabilities beyond the container’s initial set
  • No seccomp profile: Access the full set of ~300+ Linux syscalls, including dangerous ones like ptrace, mount, and reboot

Without pod security controls, every container is one exploit away from full node compromise. The standards exist to define a sensible default: what should a “normal” pod look like?
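
To make this concrete, here is a pod that a cluster with no pod security controls will happily admit, and that amounts to root on the node (do not run this outside a lab):

apiVersion: v1
kind: Pod
metadata:
  name: node-takeover
spec:
  hostPID: true                  # See and signal every process on the node
  hostNetwork: true              # Share the node's network namespace
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]
      securityContext:
        privileged: true         # Full access to host devices
      volumeMounts:
        - name: host-root
          mountPath: /host       # The entire node filesystem, writable
  volumes:
    - name: host-root
      hostPath:
        path: /

Every profile stricter than Privileged rejects this pod: Baseline alone blocks the privileged flag, the host namespaces, and the hostPath volume.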

The Three Profiles

Controls Matrix

  Control                            Privileged    Baseline                           Restricted
  ───────                            ──────────    ────────                           ──────────
  Privileged containers              Allowed       Forbidden                          Forbidden
  Host namespaces (hostPID,          Allowed       Forbidden                          Forbidden
    hostIPC, hostNetwork)
  Host ports                         Allowed       Limited (known ranges)             Limited (known ranges)
  hostPath volumes                   Allowed       Forbidden                          Forbidden
  Privilege escalation               Allowed       Allowed                            Forbidden (must be false)
    (allowPrivilegeEscalation)
  Running as root (runAsNonRoot)     Allowed       Allowed                            Forbidden (must be true)
  Root user (runAsUser: 0)           Allowed       Allowed                            Forbidden
  Capabilities                       All           Cannot add beyond the default      Drop ALL; may add only
                                                   Docker set (AUDIT_WRITE, CHOWN,    NET_BIND_SERVICE
                                                   DAC_OVERRIDE, FOWNER, FSETID,
                                                   KILL, MKNOD, NET_BIND_SERVICE,
                                                   SETFCAP, SETGID, SETPCAP,
                                                   SETUID, SYS_CHROOT)
  Seccomp profile                    Any or none   Any or none                        Must set RuntimeDefault
                                                                                      or Localhost
  Volume types                       All           All except hostPath                Restricted set: configMap, csi,
                                                                                      downwardAPI, emptyDir, ephemeral,
                                                                                      persistentVolumeClaim, projected,
                                                                                      secret
  Sysctls                            All           Safe set only                      Safe set only
  AppArmor                           Any or none   Any or none                        Must not opt out of the
                                                                                      default profile
  SELinux                            Any           Must not set escalating            Same as Baseline
                                                   seLinuxOptions types; user
                                                   and role must stay unset
  /proc mount type                   Any           Default only                       Default only
  Seccomp (ephemeral containers)     Any           Any                                Must set RuntimeDefault
                                                                                      or Localhost

Profile Descriptions

Privileged — No restrictions. Used for system-level workloads that genuinely need full host access: CNI plugins, storage drivers, logging agents that read /var/log, monitoring agents that access /proc and /sys. This profile should apply only to system namespaces (kube-system) and never to application workloads.

Baseline — Prevents known privilege escalation paths. Blocks privileged containers, host namespaces, and hostPath volumes. Allows running as root and does not require seccomp profiles. This is the minimum viable security policy for application workloads. Most applications work under Baseline without modification.

Restricted — Enforces current security best practices. Requires non-root execution, drops all capabilities, mandates seccomp profiles, and limits volume types. Many applications need modification to work under Restricted (switching from root to a non-root user, updating file permissions in the container image). This is the target state for all application workloads.

Pod Security Admission (PSA)

Pod Security Admission is the built-in admission controller (enabled by default since Kubernetes 1.25) that enforces Pod Security Standards. It operates at the namespace level via labels.

flowchart TD
    kubectl["<b>kubectl apply -f deployment.yaml</b>"]
    api["<b>API Server</b>"]
    psa["<b>Pod Security Admission Controller</b><br>Namespace: production"]
    lookup["Look up namespace labels:<br>enforce: baseline<br>warn: restricted<br>audit: restricted"]

    enforce{"<b>ENFORCE (baseline)</b><br>hostNetwork? hostPath?<br>privileged?"}
    reject["REJECT (403)"]
    warn{"<b>WARN (restricted)</b><br>runs as root?<br>no seccomp profile?"}
    audit{"<b>AUDIT (restricted)</b><br>same checks as warn"}
    allow["ALLOW<br>(pod admitted)"]

    kubectl --> api --> psa --> lookup --> enforce
    enforce -- "violation" --> reject
    enforce -- "pass" --> warn
    warn -- "violation" --> allow
    warn -. "warnings shown to user" .-> allow
    allow --> audit
    audit -. "violations logged for review" .-> allow

Namespace Labels

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Enforce: reject pods that violate the profile
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: v1.30

    # Warn: allow but show warnings for violations
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.30

    # Audit: allow but log violations
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.30

The three modes serve different purposes:

  • enforce — Hard block. The pod is rejected with a 403 error. Use for the profile you are confident about.
  • warn — Soft signal. The pod is admitted, but the user sees a warning in their kubectl output. Use for the profile you are migrating toward.
  • audit — Silent logging. The pod is admitted, and the violation is recorded in the audit log. Use for monitoring before tightening.

The recommended pattern is to enforce the current standard and warn/audit at the next level up. This gives teams visibility into what would break if you tightened the policy.
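
The kubectl side of this pattern, including a server-side dry run to preview what stricter enforcement would break:

# Enforce baseline, warn/audit at restricted
kubectl label ns production \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

# Preview: which existing workloads would violate restricted enforcement?
kubectl label --dry-run=server --overwrite ns production \
  pod-security.kubernetes.io/enforce=restricted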

Version Pinning

The *-version labels pin the profile to a specific Kubernetes version’s definition. This prevents surprise breakage when you upgrade the cluster: a new Kubernetes version might add new checks to the Restricted profile, and pinning ensures the old definition is used until you explicitly update.

# Pin to v1.30 definitions regardless of cluster version
pod-security.kubernetes.io/enforce-version: v1.30

# Use "latest" to always get the current version's definitions
pod-security.kubernetes.io/enforce-version: latest

Migration from PodSecurityPolicy

PodSecurityPolicy (PSP) was removed in Kubernetes 1.25. If your cluster still relies on PSP, the migration to PSA follows a deliberate progression:

  Step 1: AUDIT
    Action: Add audit labels to all namespaces. Review audit logs for
    violations. No impact on running workloads.
    Namespace label: audit: restricted

  Step 2: WARN
    Action: Add warn labels. Developers see warnings when deploying
    non-compliant pods. Still no enforcement.
    Namespace label: warn: restricted

  Step 3: FIX
    Action: Update workloads to comply: runAsNonRoot: true,
    seccompProfile: RuntimeDefault, drop all capabilities, switch to
    non-root base images.
    Namespace label: (no change)

  Step 4: ENFORCE
    Action: Add enforce labels. Non-compliant pods are rejected.
    Remove PSP resources.
    Namespace labels: enforce: baseline, warn: restricted

  Step 5: TIGHTEN
    Action: Move enforcement from baseline to restricted as workloads
    are updated.
    Namespace label: enforce: restricted

Common Migration Fixes

Running as non-root:

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        seccompProfile:
          type: RuntimeDefault

Choosing a non-root base image:

FROM node:20-slim
# Create non-root user
RUN groupadd -r app && useradd -r -g app -d /app app
WORKDIR /app
COPY --chown=app:app . .
USER app

Namespace Exemptions

Some namespaces legitimately need Privileged access. The PSA admission controller supports exemptions configured at the API server level:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      defaults:
        enforce: baseline
        enforce-version: latest
        warn: restricted
        warn-version: latest
        audit: restricted
        audit-version: latest
      exemptions:
        usernames: []
        runtimeClasses: []
        namespaces:
          - kube-system        # System components need privileges
          - monitoring         # Node exporters need host access
          - storage-system     # CSI drivers need host access

This configuration sets cluster-wide defaults (enforce baseline, warn restricted) and exempts specific namespaces. Exempt namespaces are not subject to PSA checks at all, so apply RBAC and other controls carefully.

When to Supplement with Kyverno or Gatekeeper

Pod Security Standards cover the most critical pod-level controls, but they are intentionally limited in scope. They do not address:

  • Image registry restrictions (only allow images from approved registries)
  • Required labels or annotations (every pod must have team and cost-center labels)
  • Resource limits (every container must have CPU and memory limits)
  • Specific capability requirements (allow NET_RAW for ping utilities)
  • Per-workload exceptions (allow hostNetwork for a specific DaemonSet but not others)
  • Custom validation (container images must use digest references, not tags)

For these use cases, supplement PSA with Kyverno or OPA Gatekeeper. The recommended pattern:

  1. PSA handles the broad security baseline (enforce at the namespace level, zero configuration per workload)
  2. Kyverno/Gatekeeper handles fine-grained policies (per-resource exceptions, organizational standards, image policies)

# Kyverno: require resource limits on all containers
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must have CPU and memory limits."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"

A Practical Security Posture

RECOMMENDED PSA CONFIGURATION
───────────────────────────────

  Namespace Type          Enforce      Warn         Audit
  ──────────────          ───────      ────         ─────
  kube-system             privileged   ---          ---
  monitoring              privileged   ---          ---
  storage-system          privileged   ---          ---
  application-dev         baseline     restricted   restricted
  application-staging     restricted   restricted   restricted
  application-production  restricted   restricted   restricted

  Start with baseline enforcement for dev namespaces.
  Move to restricted as workloads are updated.
  Production should enforce restricted from the start
  for new applications.

Common Mistakes and Misconceptions

  • “Running as root in a container is the same as root on the host.” Without proper configuration, it can be. Container root can escape to host root via privileged containers, host mounts, or kernel exploits. Always set runAsNonRoot: true and drop capabilities.
  • “Pod Security Standards are optional.” Without enforcement (via PSS labels or admission controllers), any user who can create pods can create privileged pods. Default to restricted and grant exceptions explicitly.
  • “My application needs privileged: true.” Very few applications genuinely need host-level access. Most cases can be solved with specific Linux capabilities (NET_BIND_SERVICE, SYS_PTRACE) instead of full privilege.

Part 6 shifts from securing workloads to scaling them.

Next: Horizontal Pod Autoscaler

Chapter 30: Horizontal Pod Autoscaler

A deployment with a fixed replica count is a bet that traffic will stay constant. Traffic never stays constant. If you guess too low, pods become overloaded and latency spikes. If you guess too high, you pay for idle compute around the clock. The Horizontal Pod Autoscaler (HPA) replaces this guessing game with a feedback loop: measure demand, compute the right number of replicas, and adjust — continuously.

Understanding HPA from first principles requires understanding the algorithm it uses, the metrics it consumes, how to extend it beyond built-in metrics, and the tuning knobs that prevent it from behaving erratically.

The Scaling Algorithm

The HPA controller runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). Each iteration executes a single formula:

desiredReplicas = ceil( currentReplicas * ( currentMetricValue / desiredMetricValue ) )

This is a proportional controller. If you have 4 replicas running at 80% CPU and your target is 50% CPU, the math is:

desiredReplicas = ceil( 4 * (80 / 50) ) = ceil( 6.4 ) = 7

The HPA will scale from 4 to 7 replicas. When those 7 replicas bring average CPU down to 45%, the formula produces:

desiredReplicas = ceil( 7 * (45 / 50) ) = ceil( 6.3 ) = 7

No change. The system has stabilized.

The 10% Tolerance Band

To prevent constant oscillation around the target, the HPA applies a tolerance of 0.1 (10%). If the ratio currentMetric / desiredMetric falls within [0.9, 1.1], the HPA takes no action. This dead zone prevents the controller from chasing noise.
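
For example, 10 replicas running at 54% CPU against a 50% target:

ratio = 54 / 50 = 1.08 → inside [0.9, 1.1] → no action

The raw formula would produce ceil(10 × 1.08) = 11 replicas, but because the ratio sits inside the tolerance band, the HPA holds at 10.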

sequenceDiagram
    participant M as Metrics API
    participant H as HPA Controller
    participant D as Deployment
    participant P as Pods

    loop Every 15 seconds
        H->>M: Fetch current metrics
        M-->>H: CPU 80%, target 60%
        Note right of H: ratio = 80/60 = 1.33<br>Outside tolerance (0.9–1.1)<br>desiredReplicas = ceil(current * 1.33)<br>Clamp to [min, max]
        H->>D: Patch .spec.replicas
        D->>P: Create new pods (or terminate)
        P-->>M: Report metrics via cAdvisor
    end

Default Metrics: CPU and Memory

The simplest HPA targets CPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

Critical prerequisite: CPU utilization is computed as a percentage of the pod’s resource request. If your pods do not have resources.requests.cpu set, the HPA cannot compute utilization and will refuse to scale. This is the single most common HPA misconfiguration.

You can target memory the same way, but memory-based scaling is tricky. Many applications (JVM, Python, Go with large heaps) allocate memory and never release it. Scaling up works, but scaling down may never trigger because memory consumption does not drop when load drops.

Custom Metrics via Prometheus Adapter

Built-in CPU and memory metrics are crude. Most services should scale on business-relevant metrics: requests per second, queue depth, p99 latency. The custom metrics API (custom.metrics.k8s.io) provides the abstraction; Prometheus Adapter is the most common implementation that bridges Prometheus metrics into this API.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

The Prometheus Adapter configuration maps PromQL queries to Kubernetes metric names. When the HPA asks “what is the current value of http_requests_per_second for deployment api-server?”, the adapter executes the corresponding PromQL query and returns the result.
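
A sketch of such a mapping in the adapter's rules configuration (the series and label names are illustrative; the template placeholders are the adapter's own syntax):

rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"        # Exposes http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'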

KEDA: Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) does not replace HPA — it extends it. KEDA solves two problems that HPA cannot:

  1. Zero-to-one scaling. HPA’s minReplicas must be at least 1. KEDA can scale a deployment to zero and activate it when an event arrives.

  2. Diverse event sources. KEDA ships with 60+ scalers: Kafka consumer lag, AWS SQS queue depth, Azure Service Bus, Redis streams, Cron schedules, PostgreSQL query results, and more. Adding a new metric source requires no adapter installation — just a ScaledObject manifest.

KEDA Architecture

KEDA installs two components:

  • Operator (keda-operator): Watches ScaledObject and ScaledJob CRDs. When scaling from 0 to 1, KEDA directly modifies the deployment’s replica count. For scaling from 1 to N, KEDA creates and manages an HPA resource, feeding it metrics through the second component.

  • Metrics Adapter (keda-operator-metrics-apiserver): Implements the external metrics API (external.metrics.k8s.io). The HPA that KEDA creates targets metrics served by this adapter.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0       # Scale to zero when idle
  maxReplicaCount: 100
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: orders
        topic: incoming-orders
        lagThreshold: "50"

When the Kafka consumer lag for the orders group exceeds 50, KEDA activates the deployment (0 to 1), then the HPA scales from 1 to N based on how far the lag exceeds the threshold.

HPAv2 Behavior Tuning

The autoscaling/v2 API introduced the behavior field, which provides fine-grained control over how fast the HPA scales up and down. Without tuning, the HPA can oscillate: a traffic spike causes rapid scale-up, load drops as new pods absorb traffic, the HPA immediately scales down, load spikes again.

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300      # Wait 5 minutes before scaling down
    policies:
      - type: Percent
        value: 10
        periodSeconds: 60               # Remove at most 10% of pods per minute
      - type: Pods
        value: 2
        periodSeconds: 60               # Or at most 2 pods per minute
    selectPolicy: Min                    # Use whichever policy removes FEWER pods
  scaleUp:
    stabilizationWindowSeconds: 0        # Scale up immediately
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15               # Double pod count every 15 seconds
      - type: Pods
        value: 4
        periodSeconds: 15               # Or add 4 pods every 15 seconds
    selectPolicy: Max                    # Use whichever policy adds MORE pods

Key Concepts

  • stabilizationWindowSeconds: The HPA tracks every replica recommendation it computed over this window. For scale-down it acts on the highest recommendation in the window, so replicas drop only when no recent recommendation called for more; for scale-up (when a window is set) it acts on the lowest, damping reaction to momentary spikes. A 300-second scale-down window therefore means the HPA will not reduce replicas until no recommendation within the past 5 minutes called for a higher count, which prevents premature scale-down after a traffic burst.

  • Policies (Percent vs Pods): Each policy defines a maximum change rate. Percent: 10 means remove at most 10% of current replicas. Pods: 2 means remove at most 2 pods. You can combine multiple policies.

  • selectPolicy: When multiple policies exist, Min picks the one that changes the least (conservative), Max picks the one that changes the most (aggressive), and Disabled prevents scaling in that direction entirely.

General wisdom: Scale up aggressively (fast selectPolicy: Max), scale down conservatively (slow selectPolicy: Min with a stabilization window). It is always cheaper to run a few extra pods for a few minutes than to drop requests during a scale-up delay.

Common Pitfalls

Metrics lag. The metrics pipeline introduces latency. cAdvisor scrapes every 10–15 seconds. Metrics Server aggregates. The HPA polls every 15 seconds. End-to-end, there can be 30–60 seconds between a load spike and the HPA deciding to scale. For latency-sensitive services, consider scaling on leading indicators (queue depth, connection count) rather than lagging indicators (CPU).

Thrashing. Without behavior tuning, the HPA can oscillate between two replica counts every loop iteration. The stabilization window and policy limits exist to prevent this. If the HPA's event stream shows it alternating between scale-up and scale-down, increase the stabilization window.

Cold start. New pods take time to start (image pull, init containers, JVM warmup, cache loading). The HPA sees new pods as “not yet reporting metrics” and may scale up further before the first wave is ready. Use readiness probes with appropriate initial delays and consider scaleUp.stabilizationWindowSeconds to give new pods time to absorb load.

Missing resource requests. If pods lack resources.requests.cpu, the HPA cannot compute utilization percentages and will emit FailedGetResourceMetric events. Always set resource requests on pods that will be autoscaled.

Scaling both on CPU and a custom metric. When multiple metrics are specified, the HPA computes the desired replica count for each and takes the maximum. This is usually correct (scale up if either metric is hot), but can lead to over-provisioning if metrics are poorly correlated.
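
When diagnosing any of these pitfalls, start with the HPA's own status, which records the metrics it sees, the conditions it evaluates, and recent scaling events:

# Inspect HPA status, conditions, and recent scaling decisions
kubectl describe hpa web-frontend
kubectl get hpa web-frontend \
  -o jsonpath='{.status.conditions}{"\n"}{.status.currentMetrics}'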

Putting It Together

A production-ready HPA configuration typically combines:

  1. A primary business metric (requests per second, queue depth)
  2. A safety-net CPU metric (catches runaway computation)
  3. Conservative scale-down behavior (stabilization window of 5–10 minutes)
  4. Aggressive scale-up behavior (double capacity every 15–30 seconds)
  5. Reasonable min/max bounds (min = 2 for HA, max = cost limit)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 5
          periodSeconds: 15
      selectPolicy: Max

Common Mistakes and Misconceptions

  • “HPA reacts instantly to traffic spikes.” End-to-end reaction time is 1–2 minutes due to metrics lag and stabilization windows (see above).
  • “I can use HPA and VPA together on CPU.” HPA and VPA both try to act on CPU metrics, creating a conflict. Use HPA for horizontal scaling on CPU/memory and VPA only for non-HPA-targeted resources, or use the VPA recommendation-only mode alongside HPA.
  • “Setting target CPU utilization to 50% wastes resources.” 50% target means HPA scales up when average utilization exceeds 50%. This headroom absorbs traffic spikes during the scaling delay. Setting it to 90% means pods are overloaded before new ones arrive.
  • “HPA works without resource requests.” Utilization is computed as a percentage of requests; without them, CPU/memory-based HPA cannot function (see above).

Further Reading


Next: Vertical Pod Autoscaler — Right-sizing pod resource requests with VPA, in-place resize, and Goldilocks.

Chapter 31: Vertical Pod Autoscaler and Right-Sizing

The Vertical Pod Autoscaler (VPA) adjusts pod resource requests and limits rather than replica count — a harder problem because changing resources historically required restarting the pod. This constraint shaped VPA’s design from the beginning, and in-place pod resize — alpha since Kubernetes 1.27 and beta since 1.33 — is expected to reach general availability around Kubernetes 1.35, after more than six years of development.

Note: In-place pod resize is a rapidly evolving feature. GA timing and API details may change; check KEP-1287 for current status.

Understanding VPA requires understanding why right-sizing matters, how VPA’s three modes work, the new in-place resize mechanism, the critical interaction between VPA and HPA, and the practical workflow for using VPA in production.

Why Right-Sizing Matters

Most teams set resource requests once during initial deployment and never revisit them. Studies of large Kubernetes clusters consistently show that only 10–15% of requested CPU is actually consumed. The remaining 85–90% is reserved but idle — the scheduler cannot assign it to other workloads because it is “spoken for.”

This waste compounds:

  • Overprovisioned pods reserve resources they never use. The scheduler treats requests as firm commitments, so idle reservations block other pods from being scheduled.
  • Underprovisioned pods hit CPU throttling and memory OOM kills. Teams respond by doubling requests, creating more waste.
  • Node scaling follows requests, not usage. The Cluster Autoscaler adds nodes when pods cannot be scheduled, which depends on requested resources. Bloated requests cause premature node scaling.

VPA closes this loop by observing actual usage over time and recommending (or applying) appropriate resource requests.

VPA Architecture

VPA consists of three components:

flowchart TD
    subgraph VPA["VPA Components"]
        rec["<b>Recommender</b><br>Watches pod metrics over time<br>Builds usage histogram<br>Emits target, lowerBound,<br>upperBound"]
        upd["<b>Updater</b><br>Evicts pods outside<br>recommended range<br>(Auto mode only)"]
        adm["<b>Admission Webhook</b><br>Mutates pod spec at<br>creation time<br>(applies recs to new pods)"]
    end

    metrics["<b>Metrics Server / Prometheus</b>"]
    pods["<b>Running Pods</b><br>(evict + recreate)"]
    api["<b>API Server</b><br>(pod create admission)"]

    rec --> metrics
    upd --> pods
    adm --> api

  1. Recommender: Continuously observes pod resource usage (via Metrics API or Prometheus) and computes recommendations. It maintains a decaying histogram of usage patterns and outputs four values per container: lowerBound, target, uncappedTarget, and upperBound.

  2. Updater: In Auto mode, compares running pods’ resource requests against the recommended range. If a pod’s requests fall outside the [lowerBound, upperBound] range, the Updater evicts it so it can be recreated with updated requests.

  3. Admission Webhook: Intercepts pod creation requests and mutates the resource requests to match the VPA’s current recommendation. This is how the updated values actually get applied — the Updater evicts the old pod, the Deployment creates a replacement, and the Admission Webhook sets the recommended requests on the new pod.

The Three Modes (Plus One)

VPA operates in one of four modes, set via updatePolicy.updateMode:

Off (Recommendation Only)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"

The Recommender computes and stores recommendations, but neither the Updater nor the Admission Webhook applies them. You read recommendations from the VPA status and decide manually whether to act. This is the safest starting point.

kubectl get vpa api-server-vpa -o jsonpath='{.status.recommendation}' | jq .

Initial

The Admission Webhook applies recommendations to new pods at creation time, but the Updater does not evict running pods. Existing pods keep their current requests until they are restarted for other reasons (deployment rollout, node drain, crash). This is useful for gradual rollouts — new pods get right-sized, old pods are unaffected.

Auto (Recreate)

Both the Updater and Admission Webhook are active. The Updater will evict pods whose requests are outside the recommended range, causing them to be recreated with new requests. This provides fully automated right-sizing but causes pod restarts, which can be disruptive for stateful workloads or services with long startup times.
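
The disruption in Auto mode can be bounded. A hedged sketch, reusing the Deployment from the Off-mode example: updatePolicy.minReplicas tells the Updater never to evict if doing so would leave fewer than that many live pods.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"
    minReplicas: 2    # never evict below 2 running replicas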

InPlaceOrRecreate (New)

With in-place pod resize approaching GA, VPA is gaining a fourth mode (the name InPlaceOrRecreate is used here but may differ in the final implementation — check the VPA documentation for your version). In this mode, VPA first attempts to resize the pod in place — updating its resource requests without restarting it. If in-place resize is not possible (for example, the new requests exceed node capacity), VPA falls back to the Recreate behavior and evicts the pod.

This is the mode most teams should target once their clusters support in-place resize at GA.

In-Place Pod Resize

In-place pod resize was proposed in KEP-1287 and has been in development for over six years (alpha in 1.27, beta in 1.33). The core challenge was that Kubernetes originally treated a pod’s resource requests as immutable — changing them required deleting and recreating the pod.

With in-place resize, you can patch a running pod’s spec.containers[*].resources.requests and spec.containers[*].resources.limits, and the kubelet will apply the change to the running container’s cgroup without restarting it. The pod’s status.resize field reports whether the resize was accepted (InProgress, Proposed, Deferred, Infeasible).

For CPU, this is straightforward — the kubelet adjusts the CFS quota. For memory, it is more complex. Increasing memory limits is safe (just raise the cgroup limit). Decreasing memory limits can only succeed if the container’s current resident memory is below the new limit.
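
As a sketch of the mechanics (pod and container names are hypothetical, and this assumes a cluster and kubectl new enough to support the resize subresource):

# Resize a running pod's CPU request without restarting it
kubectl patch pod api-server-0 --subresource resize \
  --patch '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"750m"}}}]}}'

# Inspect the resources the kubelet actually applied
kubectl get pod api-server-0 \
  -o jsonpath='{.status.containerStatuses[0].resources}'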

VPA and HPA Interaction

Never use VPA and HPA on the same metric for the same workload. This is the most critical rule. If both VPA and HPA target CPU:

  1. Load increases. CPU utilization rises.
  2. HPA wants to add more pods.
  3. VPA wants to increase per-pod CPU requests.
  4. VPA increases requests. Utilization (relative to new, higher request) drops.
  5. HPA sees lower utilization and scales down.
  6. Fewer pods mean higher per-pod load. Cycle repeats.

The result is oscillation and instability.

Safe combinations:

  • HPA scales on a custom metric (requests per second, queue depth). VPA manages CPU and memory requests. They operate on orthogonal signals (see the sketch after this list).
  • HPA scales on CPU. VPA is in Off mode (recommendation only), and a human periodically adjusts requests based on VPA’s suggestions.
  • Use Multidimensional Pod Autoscaler (MPA) from Google, which coordinates horizontal and vertical scaling decisions in a single controller.
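
A minimal sketch of the first combination, reusing the checkout-service HPA from the previous chapter (which scales on http_requests_per_second): the VPA owns CPU and memory requests and never touches the metric the HPA acts on.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  updatePolicy:
    updateMode: "Initial"    # size new pods only; HPA-driven churn applies recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]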

The Right-Sizing Workflow

For production workloads, the recommended approach is deliberate and manual:

Step 1: Deploy VPA in Off mode. Attach a VPA with updateMode: "Off" to your deployment. Let it observe for at least 7 days to capture weekly traffic patterns.

Step 2: Collect recommendations. Read the VPA status. The target field is what VPA would set. The lowerBound and upperBound define the acceptable range.

Step 3: Analyze. Compare VPA’s target against current requests. If the target is significantly lower, your pods are overprovisioned. If higher, they are underprovisioned. Cross-reference with actual OOM kills and CPU throttling events.

Step 4: Set manual requests. Update your deployment manifests with the recommended values. Use the target as the request and upperBound as the limit (or no limit for CPU — see Chapter 33). Deploy via your normal rollout process.

Step 5: Repeat. Traffic patterns change. Revisit VPA recommendations quarterly.

Goldilocks: Automated Recommendations at Scale

Running kubectl get vpa across hundreds of deployments is tedious. Goldilocks (by Fairwinds) automates VPA recommendation collection and presents it as a dashboard.

Goldilocks creates a VPA in Off mode for every deployment in labeled namespaces, then serves a web UI showing current requests versus VPA recommendations for every container. It provides both “guaranteed” (VPA upper bound) and “burstable” (VPA target) suggestions.

# Label namespaces for Goldilocks
kubectl label namespace production goldilocks.fairwinds.com/enabled=true

# Goldilocks creates VPAs automatically and serves a dashboard

This is the fastest path to answering “how much are we wasting across the entire cluster?” without changing any workload behavior.
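
One possible installation path, via Fairwinds' Helm chart (service name and ports may differ by chart version):

helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace

# Then open the dashboard locally
kubectl -n goldilocks port-forward svc/goldilocks-dashboard 8080:80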

Resource Policy: Constraining VPA

You can constrain VPA’s recommendations with a resource policy to prevent it from setting values too low (risking OOM kills) or too high (wasting resources):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
      - containerName: sidecar
        mode: "Off"          # Don't touch the sidecar

The mode: "Off" per container is particularly useful for sidecars (Istio proxies, log collectors) that should retain their manually tuned requests.

Common Mistakes and Misconceptions

  • “VPA automatically right-sizes my pods.” In updateMode: Auto, VPA evicts and recreates pods with new resources, causing restarts. Use Off mode to get recommendations without disruption, then apply them during planned maintenance.
  • “VPA recommendations are immediately correct.” VPA needs days to weeks of historical data to produce good recommendations. Initial recommendations based on hours of data are often wrong. Let it observe through at least one full traffic cycle.
  • “I should apply VPA to every workload.” VPA’s pod-restart behavior makes it unsuitable for workloads that can’t tolerate restarts (single-replica databases, leader-election services). Use it for stateless services with multiple replicas.

Further Reading


Next: Node Scaling — Cluster Autoscaler, Karpenter, and the architecture of node-level scaling.

Chapter 32: Node Scaling: Cluster Autoscaler and Karpenter

When pods cannot be scheduled due to insufficient cluster capacity, the system must provision new nodes. When nodes sit idle, it must remove them.

Two tools dominate this space: the Cluster Autoscaler, which has been the standard since 2016, and Karpenter, which rethinks node provisioning from first principles. Understanding both requires understanding why one emerged to replace the other and the architectural difference that makes Karpenter faster, cheaper, and simpler.

Cluster Autoscaler

The Cluster Autoscaler (CA) is a Kubernetes controller that watches for pods stuck in the Pending state due to insufficient resources. When it finds them, it asks the cloud provider to add nodes. When nodes are underutilized, it drains and removes them.

How It Works

flowchart LR
    pending["Pending Pods"] --> CA["Cluster<br>Autoscaler"]
    CA -->|"which group<br>can fit?"| A["Node Group A<br>m5.large"]
    CA -->|"which group<br>can fit?"| B["Node Group B<br>m5.4xlarge"]
    CA -->|"which group<br>can fit?"| C["Node Group C<br>p3.2xlarge (GPU)"]
    A --> cloud["Cloud API<br>+1 node"]
    B --> cloud
    C --> cloud

Key constraint: Each node group is a fixed pool of identical instances. CA picks a group and increments its count — it cannot mix instance types or optimize across groups.

The critical abstraction is the node group (called Auto Scaling Group on AWS, Managed Instance Group on GCP, VM Scale Set on Azure). Each node group is a pool of identically configured nodes: same instance type, same labels, same taints. The Cluster Autoscaler does not provision individual machines — it increments or decrements a node group’s desired count.
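
What this looks like in practice, as an illustrative fragment of a Cluster Autoscaler Deployment on AWS (the image version, ASG name, and discovery tags are assumptions):

containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # Static registration of one node group: min:max:ASG-name
      - --nodes=2:20:my-cluster-general-asg
      # Or discover node groups by ASG tags instead
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
      - --scale-down-unneeded-time=10m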

The Latency Problem

Cluster Autoscaler’s end-to-end scaling latency typically runs 3–4 minutes:

  1. Detection (0–30s): CA polls for unschedulable pods every 10 seconds.
  2. Decision (10–30s): CA simulates scheduling against each node group template.
  3. Cloud API (30–60s): The cloud provider acknowledges the scale-up request.
  4. Instance launch (60–120s): The VM boots, pulls the OS image, starts kubelet.
  5. Node ready (10–30s): kubelet registers with the API server, node passes health checks.

For workloads that can tolerate minutes of latency, this is acceptable. For latency-sensitive services, it is not.

Multi-Cloud Support

CA’s strength is breadth. It supports AWS, GCP, Azure, OpenStack, vSphere, and more. For teams running Kubernetes on-premise or on non-AWS clouds, CA is often the only option.

Karpenter

Karpenter takes a fundamentally different approach. Instead of managing node groups, it provisions individual nodes with the exact size and configuration needed for the pending pods. There are no pre-defined node groups. Karpenter looks at what pods need and picks the optimal instance type, availability zone, and purchase option (on-demand vs spot) in a single step.

Architecture

flowchart LR
    pending2["Pending Pods<br>(batched 10s)"] --> K["Karpenter<br>bin-pack + select"]
    K -->|"best fit from<br>full catalog"| cloud2["Cloud API<br>launch exact instance"]

    cloud2 --> ex1["m5.2xlarge<br>spot, us-east-1b"]
    cloud2 --> ex2["c5.xlarge<br>on-demand, us-east-1a"]

Key difference: No node groups. Karpenter evaluates the full instance type catalog, bin-packs pending pods, and launches exactly the right instance — type, size, AZ, and purchase option — in a single API call.

Why Karpenter Is Architecturally Superior

The node group abstraction that Cluster Autoscaler depends on is the root cause of most of its limitations:

Instance type rigidity. A node group has a single instance type (or a mixed-instance policy with limitations). If your workload needs 7.5 GB of memory, and your node group uses m5.large (8 GB), you waste very little. But if it needs 9 GB, you must either use a different node group with m5.xlarge (16 GB) — wasting 7 GB — or create a new node group for every size bracket. In practice, teams maintain 3–10 node groups, each an imperfect approximation.

Karpenter eliminates this entirely. It evaluates the full instance type catalog and picks the cheapest instance that fits the pending pods after bin-packing. If three pending pods need 2 CPU + 4 GB each, Karpenter might choose a single m5.xlarge (4 CPU, 16 GB) rather than three separate nodes.

Scaling speed. Karpenter’s end-to-end latency is approximately 60–90 seconds — roughly 2–3x faster than Cluster Autoscaler. It skips the node group indirection and calls the cloud API directly. It also batches pending pods for 10 seconds before making a decision, which produces better bin-packing.

The following sequence diagram shows the timing of each step in Karpenter’s scaling cascade — notice the 10-second batching window that enables better bin-packing:

sequenceDiagram
    participant PP as Pending Pod
    participant K as Karpenter
    participant EC2 as EC2 Fleet API
    participant VM as New EC2
    participant KL as kubelet
    participant S as Scheduler
    participant P as Pod

    Note over PP,K: 0-10s
    PP->>K: detected (unschedulable)

    Note over K: batch pending pods (10s wait)
    Note over K: select optimal instance type (bin-pack)

    K->>EC2: CreateFleet (spot/OD, AZ, instance type)
    EC2-->>K: fleet accepted (~5s)

    EC2->>VM: launch VM

    Note over VM,KL: ~20-30s
    VM->>KL: boot OS, start kubelet

    Note over KL: ~5s
    KL->>KL: register node with API server

    KL->>S: node Ready
    S->>P: bind pod to node

    Note over PP,P: ~60-90s total end-to-end
    Note over P: Running

Consolidation. Karpenter continuously evaluates whether existing nodes can be consolidated. If node A is 30% utilized and node B is 25% utilized, Karpenter can cordon both, move their pods to a single smaller node, and terminate the originals. Cluster Autoscaler can only scale down nodes that are underutilized — it cannot replace a node with a smaller one.

Disruption budgets. Karpenter respects NodePool disruption budgets that control how many nodes can be disrupted simultaneously during consolidation, drift remediation, or node expiry. This prevents consolidation from causing service disruptions.

Karpenter Configuration

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m6i", "m6a", "c5", "c6i"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%"

Comparison

Aspect              Cluster Autoscaler             Karpenter
Abstraction         Node groups (ASG/MIG)          Direct instance provisioning
Instance selection  Fixed per node group           Dynamic per scheduling batch
Scale-up latency    3–4 minutes                    ~60–90 seconds
Scale-down          Remove underutilized nodes     Consolidation (replace + remove)
Bin-packing         Limited (one group at a time)  Cross-instance-type optimization
Spot handling       Mixed instance policies        First-class, per-node decisions
Cloud support       AWS, GCP, Azure, others        AWS (GA), Azure (GA via AKS Node Autoprovision since late 2024)
Configuration       Node groups + CA flags         NodePool CRDs
Maturity            8+ years, battle-tested        Younger, rapidly maturing

When to Use Each

Use Cluster Autoscaler when:

  • You run on GCP, OpenStack, vSphere, or bare metal
  • Your organization requires the stability of a long-established project
  • You have existing node group infrastructure and limited appetite for migration

Use Karpenter when:

  • You run on AWS or Azure (AKS Node Autoprovision, GA since late 2024)
  • You want faster scaling, better bin-packing, and automated cost optimization
  • You are starting a new cluster or willing to migrate from node groups
  • You run diverse workloads that benefit from flexible instance type selection

For most AWS-based clusters starting today, Karpenter is the better default. Its consolidation alone typically reduces node costs by 20–35% compared to Cluster Autoscaler with static node groups.

Common Mistakes and Misconceptions

  • “Cluster Autoscaler scales down immediately.” CA waits 10 minutes (default scale-down-delay-after-add) before considering a node for removal, then checks if pods can be moved safely. Scale-down is intentionally conservative.
  • “Spot/preemptible instances are unreliable for anything.” With proper pod disruption budgets, multiple instance types, and spread across availability zones, spot instances work well for stateless services. Karpenter handles spot interruptions by proactively replacing nodes.

Further Reading


Next: Resource Tuning Deep Dive — CFS quotas, CPU throttling, QoS classes, and why removing CPU limits sometimes improves performance.

Chapter 33: Resource Tuning Deep Dive

CPU requests and limits translate directly into Linux cgroup parameters. Getting them wrong causes throttling on idle nodes, random OOM kills, and wasted capacity at scale. Understanding resource tuning from first principles requires understanding the kernel mechanisms themselves.

CFS Quota Mechanics

When you set a CPU limit on a container, Kubernetes translates it into two cgroup v2 parameters (or cgroup v1 equivalents):

  • cpu.cfs_period_us: The length of the scheduling period, always 100,000 microseconds (100ms).
  • cpu.cfs_quota_us: The total CPU time the container may consume within each period.

The formula is:

cpu.cfs_quota_us = cpu_limit * cpu.cfs_period_us

For a container with a CPU limit of 500m (half a core):

cpu.cfs_quota_us = 0.5 * 100,000 = 50,000 us

This means the container can use at most 50ms of CPU time in every 100ms period. If it uses its 50ms in the first 30ms of the period, the kernel throttles it — the container gets zero CPU for the remaining 70ms, even if the node’s other cores are completely idle.

CFS PERIOD AND QUOTA
─────────────────────

  cpu.cfs_period_us = 100,000 (100ms)
  cpu.cfs_quota_us  =  50,000 (50ms)  ← limit: 500m

  Period 1                    Period 2
  ├──────────────────────────┤──────────────────────────┤
  │████████████░░░░░░░░░░░░░░│████████████░░░░░░░░░░░░░░│
  │← 50ms used →│← throttled │← 50ms used →│← throttled │
  │             │   50ms    →│             │   50ms    →│
  └──────────────────────────┘──────────────────────────┘

  Container bursts to full speed, exhausts quota in 50ms,
  then sits idle for 50ms. Latency spikes every 100ms.


  Multi-threaded container with limit: 1000m (1 core)
  and 4 threads running simultaneously:

  ├──────────────────────────┤
  │ Thread 1: ██████ (25ms)  │
  │ Thread 2: ██████ (25ms)  │  Total: 100ms of CPU time
  │ Thread 3: ██████ (25ms)  │  consumed in first 25ms
  │ Thread 4: ██████ (25ms)  │  of wall-clock time
  │                          │
  │ ALL THREADS THROTTLED    │  Quota exhausted.
  │ for remaining 75ms       │  75ms of wall-clock
  │░░░░░░░░░░░░░░░░░░░░░░░░░ │  latency added.
  └──────────────────────────┘

The Throttling Paradox

This is the most counterintuitive aspect of CPU limits: a container can be heavily throttled even when the node has plenty of idle CPU. CFS quotas are enforced per-container, regardless of overall node utilization. The kernel does not say “the node is 30% utilized, let this container use more.” It says “this container has used its quota for this period, stop.”

You can observe throttling via:

# cgroup v2
cat /sys/fs/cgroup/<pod-cgroup>/cpu.stat

# Look for:
#   nr_throttled    ← number of times throttled
#   throttled_usec  ← total time spent throttled (microseconds)

Or via Prometheus:

rate(container_cpu_cfs_throttled_periods_total[5m])
  / rate(container_cpu_cfs_periods_total[5m])

A throttle ratio above 10–20% indicates the limit is actively harming performance.

Why NOT Setting CPU Limits Is Sometimes Better

For bursty workloads — web servers, API gateways, batch processors — CPU usage is spiky. A request handler might be idle for 95ms, then need 40ms of CPU to process a request. With a 500m limit, the container has 50ms of quota per period, which is enough for the burst. But if two requests arrive in the same period, the container needs 80ms, exhausts its 50ms quota partway through, and is throttled until the next period begins, adding up to 50ms of latency to the second request.

Removing the CPU limit entirely allows the container to burst to whatever the node can provide. The container still has a CPU request, which guarantees it a minimum share of CPU via CFS weight (the cpu.weight cgroup parameter). Requests affect scheduling and provide a proportional minimum, but without a limit, there is no hard ceiling.

resources:
  requests:
    cpu: 500m       # Guaranteed minimum share
    memory: 256Mi
  limits:
    # cpu: omitted  # No hard ceiling --- container can burst
    memory: 512Mi   # Memory limits should ALWAYS be set

When to remove CPU limits:

  • Web servers, API handlers, and other latency-sensitive, bursty workloads
  • When throttling metrics show significant throttle ratios
  • When the cluster has spare CPU capacity (requests < node allocatable)

When to keep CPU limits:

  • Multi-tenant clusters where one workload could starve others
  • Batch jobs that would happily consume every available core
  • Environments that require Guaranteed QoS class (limits must equal requests)

Always keep memory limits. Unlike CPU (which throttles), exceeding a memory limit causes the OOM killer to terminate the container. Memory is an incompressible resource — the kernel cannot “slow down” memory usage the way it can pause CPU access.

QoS Classes

Kubernetes assigns every pod a Quality of Service class based on its resource configuration. QoS determines eviction priority when a node runs out of resources.

QoS CLASSES AND EVICTION ORDER
────────────────────────────────

  EVICTED FIRST                              EVICTED LAST
  ◄──────────────────────────────────────────────────────►

  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐
  │  BestEffort  │  │  Burstable   │  │   Guaranteed     │
  │              │  │              │  │                  │
  │  No requests │  │  Requests    │  │  requests ==     │
  │  No limits   │  │  set, but    │  │  limits for      │
  │              │  │  limits !=   │  │  every container │
  │  First to    │  │  requests    │  │  in every pod    │
  │  die under   │  │  (or limits  │  │                  │
  │  pressure    │  │  missing)    │  │  Last to die     │
  └──────────────┘  └──────────────┘  └──────────────────┘

BestEffort: No resource requests or limits on any container. These pods are scheduled wherever there is room and are the first evicted. Appropriate only for truly disposable workloads (background log cleanup, test pods).

Burstable: At least one container has a request or limit, but they are not equal. This is the most common class. Eviction order within Burstable is based on how far the pod exceeds its requests.

Guaranteed: Every container in the pod has requests equal to limits for both CPU and memory. These pods get the highest scheduling priority and are evicted last. The trade-off is that Guaranteed pods cannot burst above their limits, which means you must size them for peak usage or accept throttling.

# Guaranteed QoS
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"         # Must equal request
    memory: 4Gi      # Must equal request
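
You can verify which class Kubernetes assigned to a running pod (the pod name is a placeholder):

kubectl get pod api-server-0 -o jsonpath='{.status.qosClass}'
# Guaranteed, Burstable, or BestEffort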

Topology Manager and NUMA-Aware Scheduling

On multi-socket servers, memory access times vary depending on which CPU socket is accessing which memory bank. This is Non-Uniform Memory Access (NUMA). A process running on socket 0 accessing memory on socket 1 pays a latency penalty of 50–100 nanoseconds per access — irrelevant for most workloads, but significant for high-performance computing, machine learning inference, and network-intensive pods using SR-IOV.

The Topology Manager is a kubelet component that coordinates resource allocation across CPU Manager, Memory Manager, and Device Manager to ensure aligned NUMA placement. It operates in four policies:

Policy            Behavior
none              No topology alignment (default).
best-effort       Prefer aligned allocation but allow misalignment.
restricted        Require aligned allocation; reject pods that cannot be aligned.
single-numa-node  All resources must come from a single NUMA node.

Topology Manager only affects Guaranteed QoS pods. Burstable and BestEffort pods always get the default behavior.
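
The policy is set in the kubelet configuration. A sketch, assuming you also want exclusive CPU pinning (which requires the static CPU manager policy):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod     # align all containers in the pod together
cpuManagerPolicy: static      # required for exclusive CPU assignment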

Node Allocatable vs Capacity

A node’s total resources (capacity) are not entirely available for pods. The kubelet reserves resources for itself, the OS, and eviction thresholds:

block-beta
  columns 2
  block
    columns 1
    blockArrowId1<["<b>capacity</b>:<br>16 CPU, 64 Gi memory"]>(right)
    blockArrowId2<["<b>kube-reserved</b><br>kubelet, container runtime<br>cpu: 200m, memory: 1Gi"]>(right)
    blockArrowId3<["<b>system-reserved</b><br>OS daemons, sshd, journald<br>cpu: 100m, memory: 500Mi"]>(right)
    blockArrowId4<["<b>eviction-threshold</b><br>memory.available < 100Mi"]>(right)
  end
  block
    columns 1
    AL["= capacity<br>− kube-reserved<br>− system-reserved<br>− eviction-threshold<br><br>= 15.7 CPU, 62.4 Gi<br><br><b>This is what the scheduler<br>uses to place pods.</b>"]
end

The scheduler uses allocatable, not capacity, when deciding whether a pod fits on a node. If you do not set kube-reserved and system-reserved, the node can become unstable under load as the kubelet and OS compete with pods for resources.
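
The reservations from the diagram map to kubelet configuration fields. A sketch with the same illustrative values:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: 200m
  memory: 1Gi
systemReserved:
  cpu: 100m
  memory: 500Mi
evictionHard:
  memory.available: "100Mi"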

The Overcommitment Reality

In practice, most clusters are dramatically overcommitted on CPU and undercommitted on memory:

  • Developers set CPU requests based on peak usage to avoid throttling.
  • Actual average utilization is 10–15% of requested CPU across large clusters.
  • Memory is harder to reclaim, so teams set memory requests closer to actual usage.

This means a cluster with 100 CPUs of total requests might only be using 13 CPUs at any given time. The remaining 87 CPUs are reserved but idle.

Strategies for handling overcommitment:

  1. VPA in Off mode to identify overprovisioned workloads (see Chapter 31).
  2. Remove CPU limits for bursty workloads so they can use idle CPU.
  3. Pod Priority and Preemption to ensure critical workloads can evict less important ones.
  4. Cluster-level overcommit policies (request-to-limit ratios in LimitRanges) to systematically set requests lower than limits (sketched below).
  5. Right-size nodes. A few large nodes waste less to fragmentation than many small nodes.
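
For strategy 4, a hedged LimitRange sketch (the namespace and ratios are assumptions): defaults apply to containers that omit requests or limits, and maxLimitRequestRatio caps how far limits may exceed requests.

apiVersion: v1
kind: LimitRange
metadata:
  name: overcommit-policy
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 250m        # request applied when none is set
      default:
        cpu: "1"         # limit applied when none is set
      maxLimitRequestRatio:
        cpu: "4"         # limit may be at most 4x the request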

Practical Guidelines

Resource                 Request           Limit              Rationale
CPU (latency-sensitive)  Set to P50 usage  Omit               Burst without throttling
CPU (batch/background)   Set to average    Set to max         Prevent neighbor starvation
Memory (all workloads)   Set to P95 usage  Set to P99 or max  Always limit memory

Start with VPA recommendations in Off mode, remove CPU limits for web workloads, always set memory limits, and monitor container_cpu_cfs_throttled_periods_total as a key performance indicator.

Common Mistakes and Misconceptions

  • “CPU limits prevent my app from using idle CPU.” CPU limits enforce CFS quotas regardless of available CPU. A pod hitting its CPU limit is throttled even if the node has idle cores. This is why some teams remove CPU limits entirely.
  • “Setting requests equal to limits (Guaranteed QoS) is always best.” Guaranteed QoS means your pod is never throttled or OOM-killed for using burst capacity, but it also means you pay for peak capacity at all times. Burstable QoS is more cost-effective for most workloads.
  • “Memory limits protect my application.” Memory limits protect the node by OOM-killing your container when it exceeds the limit. This is protection for neighbors, not for you. Your app crashes. Set limits above your expected peak, and profile memory usage to right-size.
  • “1 CPU means one full core.” 1 CPU = 1000 millicores of CFS bandwidth, enforced in 100ms periods: 100ms of CPU time per 100ms of wall clock, shared across all of the container’s threads (as described above). A multi-threaded container with a 1-CPU limit can burn its whole budget in a fraction of each period and spend the rest throttled, so “1 CPU” may deliver far less throughput than a dedicated core.

Further Reading


This concludes Part 6: Scaling and Performance. You now understand how to scale pods horizontally and vertically, scale nodes underneath them, and tune resource allocation down to the kernel level. Part 7 zooms out from a single cluster to the organizational challenge: running multiple clusters, building internal developer platforms, and managing multi-tenancy.

Next: Multi-Cluster Strategies

Chapter 34: Multi-Cluster Strategies

A single Kubernetes cluster is a single failure domain. One misconfigured admission webhook can block all deployments. One etcd corruption event can lose all state. One cloud region outage can take everything offline. As organizations move from “we run some things on Kubernetes” to “Kubernetes is our platform,” the question shifts from “how do we run a cluster?” to “how do we run many clusters, and how do they relate to each other?”

Multi-cluster is not about redundancy alone. Teams adopt multiple clusters for blast radius reduction, regulatory compliance, geographic latency, team isolation, and environment separation. The challenge is not running multiple clusters — it is managing them as a coherent system without reintroducing the operational complexity Kubernetes was supposed to eliminate. For a visual overview of Part 7’s platform engineering concepts, see Appendix B: Mental Models.

Why Multi-Cluster

  • Blast radius — Multiple clusters contain failures so a bad upgrade in staging does not take down production.
  • Compliance and data sovereignty — Regulations like GDPR and HIPAA may require per-region clusters to keep data local.
  • Latency — Geographic distribution puts compute close to users.
  • Team isolation — Separate clusters provide hard isolation (API servers, RBAC, upgrade schedules) beyond what namespaces offer.
  • Upgrade cadence — Running version N in production and N+1 in staging lets teams validate upgrades before rollout.

Approach 1: Independent Clusters

The simplest multi-cluster strategy is no strategy at all. Each cluster is independently provisioned, independently configured, and independently managed. Teams own their clusters end-to-end.

This works for small organizations with 2–3 clusters and dedicated platform teams per cluster. It fails at scale because every cluster drifts: different versions, different policies, different monitoring configurations, different security postures.

Approach 2: GitOps-Driven Multi-Cluster

The most widely adopted approach uses a GitOps tool to manage multiple clusters from a single source of truth. ArgoCD ApplicationSets are purpose-built for this.

flowchart TB
    subgraph git["Git Repository"]
        base["/base/<br>deployment.yaml<br>networkpolicy.yaml<br>monitoring.yaml"]
        usVals["/clusters/us-east/<br>values.yaml"]
        euVals["/clusters/eu-west/<br>values.yaml"]
        apVals["/clusters/ap-south/<br>values.yaml"]
    end

    subgraph hub["ArgoCD Hub Cluster"]
        appset["ApplicationSet generator<br>For each cluster:<br>- Create Application<br>- Inject cluster-specific values<br>- Sync state to match Git"]
    end

    subgraph regional["Regional Clusters"]
        usEast["us-east cluster<br>base + region overrides"]
        euWest["eu-west cluster<br>base + region overrides"]
        apSouth["ap-south cluster<br>base + region overrides"]
    end

    git -- "ArgoCD watches repo" --> hub
    appset --> usEast
    appset --> euWest
    appset --> apSouth

    style git fill:#f0f0ff,stroke:#333
    style hub fill:#fff0e0,stroke:#333
    style regional fill:#e0ffe0,stroke:#333

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production
  template:
    metadata:
      name: "platform-{{name}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/org/platform-config
        targetRevision: main
        path: "clusters/{{metadata.labels.region}}"
      destination:
        server: "{{server}}"
        namespace: platform
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

The ApplicationSet generator iterates over all clusters registered in ArgoCD that match the label selector, creates one Application per cluster, and injects cluster-specific values. A single Git commit can roll out a change to every production cluster worldwide.

This approach does not, however, provide cross-cluster service discovery or traffic management.

Approach 3: Federation

Federation projects attempt to provide a single API that spans multiple clusters. You submit a workload to the federation control plane, and it distributes replicas across member clusters.

KubeFed (Kubernetes Federation v2) was the original approach but is no longer actively developed. Karmada is the current leading project in this space. Karmada provides:

  • A dedicated API server that accepts standard Kubernetes resources
  • PropagationPolicy resources that define which clusters receive which workloads
  • OverridePolicy resources for per-cluster customization
  • Replica scheduling across clusters (weighted, by resource availability, or by policy)

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: api-server-spread
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: api-server
  placement:
    clusterAffinity:
      clusterNames:
        - us-east
        - eu-west
        - ap-south
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames: [us-east]
            weight: 2
          - targetCluster:
              clusterNames: [eu-west]
            weight: 1
          - targetCluster:
              clusterNames: [ap-south]
            weight: 1

Open Cluster Management (OCM), a CNCF sandbox project backed by Red Hat, takes a different approach to federation. Rather than a centralized control plane that pushes workloads to clusters, OCM uses a hub-and-spoke model where managed clusters pull their desired state from the hub via agents. This pull-based model can be easier to operate in environments with strict network policies or firewalls between clusters.

Federation is powerful but complex. It introduces a new control plane that must itself be highly available, and debugging failures requires understanding the federation layer, the per-cluster state, and the reconciliation between them.

Approach 4: Service Mesh Multi-Cluster

Service meshes solve the cross-cluster networking problem: how do services in cluster A discover and call services in cluster B?

Istio multi-cluster supports multiple topologies: shared control plane, replicated control planes, and multi-primary. In the multi-primary model, each cluster runs its own Istio control plane, and they exchange service endpoint information so that a service in cluster A can route traffic to pods in cluster B as if they were local.

Cilium ClusterMesh provides a similar capability at the CNI level. Cilium agents across clusters connect via a shared etcd (or KVStoreMesh proxy) and exchange pod identity and endpoint information. Services can be declared as “global,” making them accessible from any cluster in the mesh.

# Cilium global service annotation
apiVersion: v1
kind: Service
metadata:
  name: api-server
  annotations:
    service.cilium.io/global: "true"
    service.cilium.io/shared: "true"
spec:
  ports:
    - port: 80

With this annotation, any pod in any cluster in the ClusterMesh can resolve api-server and reach backends in the originating cluster. Cilium handles endpoint synchronization, identity-aware routing, and even affinity (prefer local cluster backends).
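
Wiring two clusters into a ClusterMesh is driven by the Cilium CLI. A sketch, assuming two kubectl contexts named cluster-a and cluster-b:

cilium clustermesh enable --context cluster-a
cilium clustermesh enable --context cluster-b
cilium clustermesh connect --context cluster-a --destination-context cluster-b
cilium clustermesh status --context cluster-a --wait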

Approach 5: Cluster API for Lifecycle Management

All the above approaches assume clusters already exist. Cluster API (CAPI) addresses the lifecycle problem: how do you create, upgrade, and delete clusters declaratively?

Cluster API treats clusters as Kubernetes resources. You define a Cluster, MachineDeployment, and infrastructure-specific resources (AWS, Azure, GCP, vSphere), and Cluster API controllers reconcile them into running clusters. Upgrading a cluster’s Kubernetes version is a spec change; Cluster API handles the rolling update of control plane and worker nodes.

Combining Cluster API with GitOps gives you a fully declarative multi-cluster lifecycle: Git commits create clusters, ArgoCD ApplicationSets configure them, and Cluster API manages their infrastructure.
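
A hedged sketch of the top-level resource (names are placeholders, and the API versions of the referenced control plane and infrastructure kinds vary by provider release):

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-us-east
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: prod-us-east-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: prod-us-east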

Choosing an Approach

Requirement                               Recommended Approach
Consistent configuration across clusters  GitOps (ArgoCD ApplicationSets)
Cross-cluster service discovery           Service mesh (Istio, Cilium ClusterMesh)
Workload distribution across clusters     Federation (Karmada)
Declarative cluster lifecycle             Cluster API
Simple, low-overhead                      Independent clusters + GitOps

Most organizations start with GitOps-driven multi-cluster and add service mesh or federation only when they have a concrete cross-cluster routing or scheduling requirement. Cluster API is orthogonal — it manages infrastructure regardless of the workload management strategy.

Common Mistakes and Misconceptions

  • “One big cluster is always better than multiple small ones.” Large clusters have larger blast radius, harder upgrades, and more complex RBAC. Many organizations use multiple clusters for environment isolation, team autonomy, and regional locality.
  • “Service mesh is required for cross-cluster communication.” DNS-based service discovery, cloud load balancers, or simple ingress routing can connect services across clusters. A mesh adds mTLS and observability but isn’t always necessary.

Further Reading


Next: Building Internal Developer Platforms — Backstage, golden paths, and the platform engineering stack.

Chapter 35: Building Internal Developer Platforms

Kubernetes gives you the building blocks of a platform. It does not give you a platform. A raw Kubernetes cluster presents developers with 60+ resource types, YAML manifests that regularly exceed 200 lines, and a debugging experience that requires understanding networking, storage, scheduling, and Linux internals. Platform engineering is the discipline of assembling these building blocks into something a product developer can use without a week of onboarding.

This is not an abstraction for its own sake. It is a response to a measurable problem: developer cognitive load. When deploying a service requires editing Kubernetes manifests, Terraform modules, CI pipelines, monitoring dashboards, and alerting rules across multiple repositories, developers spend more time on infrastructure plumbing than on the product they are building. Platform engineering inverts this by providing opinionated, pre-built paths that handle the infrastructure automatically.

The Platform Layers

An internal developer platform is a stack of tools, each handling a layer of the infrastructure problem. The typical production stack looks like this:

Layer                        Purpose                                                          Typical Tools
Developer Interface          Service catalog, scaffolding, docs, API registry, golden paths   Backstage
Delivery & Deployment        GitOps continuous delivery, CI pipelines                         ArgoCD / Flux, Tekton / GitHub Actions
Infrastructure Provisioning  Cloud resources as code                                          Crossplane (CRDs), Terraform (HCL)
Container Platform           Scheduling, networking, service discovery, autoscaling           Kubernetes
Observability                Metrics, logs, traces, alerting                                  Prometheus + Grafana, Loki, Tempo, PagerDuty

Each layer serves a distinct purpose, and the platform team’s job is to integrate them so that developers interact primarily with the top layer.

Backstage: The Developer Portal

Backstage, originally built at Spotify and now a CNCF incubating project, is the most widely adopted developer portal. It provides:

Service catalog. Every service, library, website, and infrastructure component registered in a single searchable catalog. Each entry tracks ownership, dependencies, documentation links, API definitions, CI/CD status, and deployment targets.

Software templates. Scaffolding that creates a new service with all the boilerplate pre-configured: repository, CI pipeline, Kubernetes manifests, monitoring dashboards, and Backstage catalog entry. A developer clicks “Create New Service,” fills in a form, and gets a production-ready repository in minutes.

TechDocs. Documentation generated from Markdown files in the service’s repository and rendered in Backstage. This solves the “where do I find docs?” problem by making documentation discoverable alongside the service catalog.

Plugin ecosystem. Backstage is extensible via plugins. The Kubernetes plugin shows pod status, deployment history, and logs. The ArgoCD plugin shows sync status. The PagerDuty plugin shows on-call schedules and incidents. This consolidation means developers check one portal instead of switching between five tools.

# backstage catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout-service
  description: Handles order checkout and payment processing
  annotations:
    backstage.io/techdocs-ref: dir:.
    argocd/app-name: checkout-service
    pagerduty.com/service-id: P123ABC
  tags:
    - python
    - grpc
spec:
  type: service
  lifecycle: production
  owner: team-payments
  system: commerce-platform
  dependsOn:
    - component:payment-gateway
    - resource:orders-database
  providesApis:
    - checkout-api

Golden Paths

A golden path is a pre-built, opinionated, end-to-end workflow for a common task. It is not a mandate — developers can deviate — but it is the supported, documented, well-tested way to do something.

Examples of golden paths:

  • Deploy a new microservice: Use the Backstage template. It creates a repo with Dockerfile, Helm chart, ArgoCD Application, Prometheus ServiceMonitor, and Grafana dashboard. Merge to main triggers CI, which builds the image and updates the Helm values. ArgoCD syncs to the cluster.

  • Add a PostgreSQL database: File a Crossplane Claim (see Chapter 36). The platform provisions an RDS instance, creates a Kubernetes Secret with credentials, and injects the connection string into the service via environment variables.

  • Scale for a traffic event: Set the HPA target metric and max replicas in the Helm values file. The platform handles the rest — HPA, node autoscaling, and monitoring adjustments are pre-configured.

The key property of a golden path is that it requires zero Kubernetes knowledge from the developer. They fill in business-level inputs (service name, language, database size) and the platform handles the infrastructure mapping.

The Platform Team

Platform engineering is a product discipline, not an infrastructure discipline. The platform team builds a product whose users are developers. This means:

Measure adoption, not features. A platform with 50 features that nobody uses is worse than one with 5 features that everyone uses. Track what percentage of services use the golden paths, how long it takes to go from “new service idea” to “running in production,” and how many support tickets the platform team receives.

Treat the platform as an internal product. Have a roadmap, gather user feedback, prioritize ruthlessly. The most successful platform teams run internal betas, have documentation budgets, and deprecate features deliberately.

Provide escape hatches. Golden paths should be the default, not a prison. When a team needs something non-standard (a GPU workload, a non-HTTP service, a custom CRD), the platform should not block them. The platform reduces friction for the 90% case; the 10% case gets manual support.

Anti-Patterns

Leaky abstractions. If the platform hides Kubernetes but developers still need to debug Kubernetes when things go wrong, the abstraction has not reduced cognitive load — it has added a layer. Good platforms either make the underlying system invisible (developers never need to know it is Kubernetes) or transparent (developers can drill down when they choose to).

Ignoring the developer experience. A platform that requires developers to learn a new DSL, install three CLI tools, and read 40 pages of documentation has failed. The best platforms feel like they were designed by someone who has deployed a service in anger.

No migration path. Organizations that build v1 of the platform without a plan for migrating existing services end up running two platforms indefinitely. Design for migration from the start.

A Minimal Starting Stack

For teams beginning their platform engineering journey, the minimal viable stack is:

  1. Kubernetes (managed: EKS, GKE, or AKS)
  2. ArgoCD for GitOps deployment
  3. Helm for templating with sensible defaults
  4. Prometheus + Grafana for monitoring (or a managed equivalent)
  5. A software template (even a shell script that generates a repo from a template)

Add Backstage when you have 10+ services and the catalog becomes valuable. Add Crossplane when you need self-service cloud resources. Add Tekton or a CI system when GitHub Actions is insufficient.

The goal is to make the most common developer workflows — deploy, observe, debug, rollback — take less than 5 minutes and require no Kubernetes-specific knowledge.

Common Mistakes and Misconceptions

  • “A platform team should build everything from scratch.” The best platforms compose existing tools (ArgoCD, Crossplane, Backstage) with thin glue layers. Building custom versions of solved problems wastes years and creates maintenance burdens.
  • “If we build it, developers will use it.” Platforms succeed when they’re easier than the alternative. If your platform is harder than kubectl apply, developers will bypass it. Invest in developer experience and documentation.
  • “Platform engineering is just DevOps renamed.” DevOps is a culture of shared responsibility. Platform engineering builds self-service products (internal developer platforms) that embed operational best practices. The platform is the product; developers are the customers.

Further Reading


Next: Crossplane — Managing cloud infrastructure as Kubernetes CRDs with the universal control plane.

Chapter 36: Crossplane: Infrastructure as CRDs

Crossplane extends Kubernetes’ reconciliation engine to any cloud resource — databases, storage buckets, DNS records, IAM roles — by representing each as a Kubernetes Custom Resource.

The Architecture

Crossplane installs as a set of controllers in your Kubernetes cluster. It extends the API server with CRDs that represent cloud resources, then reconciles those CRDs against the actual cloud state via provider plugins.

flowchart TD
    Claim["<b>Claim (XC)</b><br>I need a PostgreSQL DB,<br>medium size"]
    XR["<b>Composite Resource (XR)</b><br>cluster-scoped, created by<br>Crossplane from Claim"]
    Comp["Composition (template)<br>maps XR to managed resources"]

    Claim -->|Developer writes| XR
    XR --> Comp

    Comp --> MR1["<b>Managed Resource</b><br>RDS Instance<br>(provider-aws)"]
    Comp --> MR2["<b>Managed Resource</b><br>Subnet Group<br>(provider-aws)"]
    Comp --> MR3["<b>Managed Resource</b><br>Security Group<br>(provider-aws)"]

    MR1 --> AWS["<b>AWS API</b><br>Actual RDS instance, subnet group,<br>security group created and<br>continuously reconciled"]
    MR2 --> AWS
    MR3 --> AWS

Core Concepts

Providers

A Provider is a Crossplane package that installs CRDs and controllers for a specific cloud platform or service. provider-aws adds CRDs for RDS, S3, IAM, VPC, and hundreds of other AWS resources. provider-gcp, provider-azure, provider-helm, and provider-kubernetes do the same for their respective domains.

Providers authenticate to the cloud API using credentials stored in Kubernetes Secrets or via IRSA/Workload Identity.

apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-family-aws:v1.17.0
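
The Provider package alone does not know which credentials to use. That binding lives in a ProviderConfig, which Managed Resources reference by name (the Secret layout here is one common pattern, not the only one):

apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: aws-provider          # referenced by providerConfigRef on Managed Resources
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds
      key: credentials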

Managed Resources

A Managed Resource is a 1:1 representation of a cloud resource. One Managed Resource maps to exactly one external resource. The Crossplane controller for that resource type continuously reconciles: if the resource does not exist, create it. If it exists but has drifted from the spec, update it. If the Managed Resource is deleted, delete the cloud resource.

apiVersion: rds.aws.upbound.io/v1beta2
kind: Instance
metadata:
  name: my-database
spec:
  forProvider:
    region: us-east-1
    instanceClass: db.t3.medium
    engine: postgres
    engineVersion: "15"
    allocatedStorage: 20
    masterUsername: admin
    masterPasswordSecretRef:
      name: db-password
      namespace: crossplane-system
      key: password
  providerConfigRef:
    name: aws-provider

This is the lowest-level Crossplane abstraction. Platform teams rarely expose Managed Resources directly to developers — they are too detailed and cloud-specific.

Composite Resource Definitions (XRDs)

An XRD defines a new custom API — a higher-level abstraction that hides cloud-specific details. Think of it as defining a new Kubernetes resource type. The XRD specifies the schema (what fields developers can set) and optionally offers a namespaced Claim variant.

apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xpostgresqls.database.example.org
spec:
  group: database.example.org
  names:
    kind: XPostgreSQL
    plural: xpostgresqls
  claimNames:
    kind: PostgreSQL
    plural: postgresqls
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: string
                  enum: ["small", "medium", "large"]
                version:
                  type: string
                  default: "15"
              required:
                - size

This XRD creates two new resource types: XPostgreSQL (cluster-scoped composite resource) and PostgreSQL (namespaced claim). Developers only interact with the claim.

Compositions

A Composition is the template that maps a Composite Resource to one or more Managed Resources. It is where the platform team encodes organizational opinions: which instance types correspond to “small,” “medium,” and “large,” what security groups to attach, what backup policies to apply.

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: postgresql-aws
spec:
  compositeTypeRef:
    apiVersion: database.example.org/v1alpha1
    kind: XPostgreSQL
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta2
        kind: Instance
        spec:
          forProvider:
            region: us-east-1
            engine: postgres
            publiclyAccessible: false
            storageEncrypted: true
            backupRetentionPeriod: 7
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: spec.version
          toFieldPath: spec.forProvider.engineVersion
        - type: FromCompositeFieldPath
          fromFieldPath: spec.size
          toFieldPath: spec.forProvider.instanceClass
          transforms:
            - type: map
              map:
                small: db.t3.small
                medium: db.t3.medium
                large: db.r6g.xlarge

Claims

A Claim is the developer-facing interface. It is namespaced (unlike the Composite Resource), so it integrates naturally with team namespaces and RBAC. When a developer creates a Claim, Crossplane creates the corresponding Composite Resource, which the Composition expands into Managed Resources.

apiVersion: database.example.org/v1alpha1
kind: PostgreSQL
metadata:
  name: orders-db
  namespace: checkout-team
spec:
  size: medium
  version: "15"

Three lines of meaningful configuration. The developer does not need to know about RDS instance classes, security groups, subnet groups, or parameter groups. The platform team has encoded all of those decisions in the Composition.

Crossplane vs Terraform

Both Crossplane and Terraform manage cloud infrastructure declaratively. The differences are architectural:

Aspect               Crossplane                    Terraform
Execution model      Continuous reconciliation     On-demand apply
State storage        Kubernetes etcd (CRDs)        State files (S3, local, etc.)
Drift detection      Automatic, continuous         Manual (terraform plan)
Drift correction     Automatic                     Manual (terraform apply)
Developer interface  kubectl, Kubernetes RBAC      CLI, separate auth
Composition          XRDs + Compositions (CRDs)    Modules (HCL)
Ecosystem            Growing, CRD-based providers  Massive, mature provider ecosystem
Secret handling      Kubernetes Secrets, native    State file (encrypted via backend configuration, e.g. S3 SSE or Terraform Cloud encryption at rest)

Crossplane’s advantage: Continuous reconciliation means drift is detected and corrected automatically. If someone manually changes an RDS instance’s configuration via the AWS console, Crossplane will notice and revert it on the next reconciliation cycle (typically 1–10 minutes). Terraform only detects drift when someone runs terraform plan.

Terraform’s advantage: Maturity, ecosystem breadth, and the terraform plan workflow that lets teams review changes before applying them. Crossplane’s reconciliation model means changes to a Composition apply immediately to all resources that use it — there is no “plan” step.

In practice, many organizations use both: Terraform for foundational infrastructure (VPCs, IAM, Kubernetes clusters) managed by a platform team with manual review, and Crossplane for application-level resources (databases, caches, queues) managed self-service by development teams.

The Universal Control Plane Vision

Crossplane’s long-term vision is the “universal control plane” — a single Kubernetes API server that manages everything: containers, cloud resources, SaaS services, and internal tooling. Instead of developers learning kubectl for Kubernetes, the AWS console for cloud resources, and a CI tool’s web interface for pipelines, they interact with a single API that accepts declarative manifests for all of it.

The reality today is messier. Provider coverage is broad but not total. Complex multi-resource dependencies (create VPC, then subnet, then security group, then RDS instance) require careful ordering in Compositions. Error messages from failed cloud API calls can be opaque. But the trajectory is clear: the Kubernetes resource model is becoming the universal interface for infrastructure management, and Crossplane is the primary vehicle for that expansion.

Common Mistakes and Misconceptions

  • “Crossplane replaces Terraform.” See the comparison table above. Many organizations use both: Terraform for foundational infrastructure, Crossplane for application-level self-service.
  • “Compositions apply changes immediately with no review.” This one is true, and it catches many teams off guard. Unlike Terraform’s plan/apply workflow, changing a Composition affects all resources using it as soon as it is saved. Mitigate with Composition revisions and staged rollouts; a sketch follows this list.
  • “Crossplane providers cover every cloud resource.” Coverage is broad but not complete. Check the provider’s CRD list before committing to Crossplane for a specific resource. Some niche services may need Terraform or direct API calls.
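
Crossplane records each change to a Composition as a CompositionRevision, and a claim can opt out of automatic updates and pin itself to a specific revision. A minimal sketch, assuming the PostgreSQL claim from earlier; the revision name is hypothetical (you would look it up with kubectl get compositionrevisions):

apiVersion: database.example.org/v1alpha1
kind: PostgreSQL
metadata:
  name: orders-db
  namespace: checkout-team
spec:
  size: medium
  version: "15"
  compositionUpdatePolicy: Manual      # do not follow new revisions automatically
  compositionRevisionRef:
    name: postgresql-aws-1a2b3c4       # hypothetical revision name

Pinning production claims to a revision while letting staging claims follow Automatic updates gives you a staged rollout without a plan/apply step.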

Next: Multi-Tenancy — Namespace isolation, hierarchical namespaces, vCluster, and when soft boundaries are not enough.

Chapter 37: Multi-Tenancy

A Kubernetes cluster is expensive. Running one cluster per team, per environment, or per application multiplies that cost — not just in compute, but in operational overhead: patching, monitoring, upgrading, and securing each cluster independently. Multi-tenancy is the practice of sharing a single cluster among multiple tenants (teams, applications, customers) while maintaining isolation between them.

The fundamental tension in multi-tenancy is between sharing (for efficiency) and isolation (for safety). Too much sharing and one tenant’s misconfiguration affects others. Too much isolation and you lose the efficiency gains that motivated sharing in the first place. Kubernetes provides multiple isolation mechanisms at different strengths, and choosing the right combination depends on your trust model: are tenants friendly teams within the same organization, or are they untrusted customers running arbitrary code?

Namespace-Level Isolation

The namespace is Kubernetes’s primary unit of multi-tenancy. A namespace provides a scope for names and a target for access control, network policies, and resource quotas. For trusted, internal tenants, namespace isolation is often sufficient.

The Four Pillars

Effective namespace isolation requires four mechanisms working together:

1. RBAC (who can do what). Each tenant gets a Role and RoleBinding scoped to their namespace. Tenants can create Deployments, Services, and ConfigMaps in their namespace but cannot access other namespaces or cluster-scoped resources.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-developer
  namespace: team-alpha
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "services", "configmaps", "jobs"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]    # Read but not create --- secrets managed by platform

2. NetworkPolicies (who can talk to whom). Default-deny ingress and egress policies per namespace, with explicit allow rules for legitimate cross-namespace traffic. Without NetworkPolicies, pods in team-alpha can freely reach pods in team-beta.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to: []                          # any destination, but only the DNS ports below
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
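
With default-deny in place, legitimate cross-namespace traffic needs explicit allow rules. A sketch that admits traffic from a hypothetical shared ingress namespace, relying on the kubernetes.io/metadata.name label that Kubernetes sets automatically on every namespace (the namespace name and port are assumptions):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-system
      ports:
        - port: 8080
          protocol: TCP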

3. ResourceQuotas (how much can be consumed). Without quotas, one tenant can consume all cluster resources, starving others. ResourceQuotas set hard limits per namespace.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    services: "10"
    persistentvolumeclaims: "20"

4. LimitRanges (sane defaults). LimitRanges set default requests and limits for containers that do not specify them, and enforce minimum/maximum bounds. This prevents a developer from deploying a pod with requests.memory: 1Ti.

apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-limits
  namespace: team-alpha
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 256Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 50m
        memory: 64Mi

flowchart TD
    subgraph Cluster["Shared Cluster"]
        subgraph Alpha["team-alpha"]
            AlphaConfig["RBAC: own role<br>NetworkPolicy: deny by default<br>ResourceQuota: 10 CPU, 20Gi<br>LimitRange: defaults set"]
            PodA["Pod A"]
            PodB["Pod B"]
        end
        subgraph Beta["team-beta"]
            BetaConfig["RBAC: own role<br>NetworkPolicy: deny by default<br>ResourceQuota: 8 CPU, 16Gi<br>LimitRange: defaults set"]
            PodC["Pod C"]
            PodD["Pod D"]
        end
        Shared["<b>Shared:</b> API server, scheduler, kubelet,<br>etcd, container runtime, nodes"]
    end

Limitations of Namespace Isolation

  • CRDs are cluster-scoped. One tenant’s CRD installation affects all tenants. A buggy CRD controller can crash the API server for everyone.
  • Cluster-scoped resources cannot be isolated. ClusterRoles, PriorityClasses, IngressClasses, and StorageClasses are visible to all tenants.
  • Node-level resources are shared. Tenants share the Linux kernel, container runtime, and host filesystem. A container escape vulnerability gives access to all pods on the node.
  • API server rate limits affect everyone. One tenant’s controller making excessive API calls degrades performance for all tenants.
  • No per-tenant admission control. Admission webhooks are cluster-scoped. You cannot run different admission policies per namespace without complex webhook routing.

For internal teams with moderate trust, these limitations are acceptable. For untrusted tenants or strict compliance requirements, they are not.

Hierarchical Namespaces (HNC)

In flat namespace models, creating a new team or sub-team requires manual namespace provisioning with duplicated RBAC, NetworkPolicies, and ResourceQuotas. The Hierarchical Namespace Controller (HNC) adds parent-child relationships between namespaces. A child namespace automatically inherits Roles and RoleBindings from its parent, and HNC can be configured to propagate other object types such as NetworkPolicies and ResourceQuotas.

apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
  name: team-alpha-staging
  namespace: team-alpha

This creates team-alpha-staging as a child of team-alpha. RBAC bindings from team-alpha propagate automatically. When the parent’s policies change, children update. This is particularly useful for organizations with hierarchical team structures (org > division > team > project).

HNC does not add stronger isolation — it makes namespace management more scalable. The isolation boundary is still the namespace with its four pillars.
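
In day-to-day use, most interaction with HNC goes through its kubectl plugin. A short sketch, assuming the kubectl-hns plugin is installed:

# Create a subnamespace (equivalent to the SubnamespaceAnchor above)
kubectl hns create team-alpha-staging -n team-alpha

# Visualize the hierarchy from a given root
kubectl hns tree team-alpha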

vCluster: Virtual Clusters

When namespace isolation is insufficient, the next step is vCluster — a project that creates virtual Kubernetes clusters inside a host cluster. Each vCluster runs its own API server, controller manager, and (optionally) its own etcd, all as pods within a namespace of the host cluster. Tenants interact with their vCluster as if it were a standalone cluster.

flowchart TD
    subgraph Host["Host Cluster"]
        subgraph VCAlpha["Namespace: vc-alpha"]
            subgraph Alpha["vCluster alpha"]
                AlphaCP["Own API server<br>Own controller-mgr<br>Own scheduler<br>Own etcd (or SQLite)"]
                AlphaTenant["<b>Tenant sees:</b><br>Own namespaces<br>Own CRDs<br>Own RBAC<br>Own secrets"]
            end
            SyncerA["Syncer<br>Syncs pods to host<br>cluster for actual scheduling"]
            Alpha --> SyncerA
        end
        subgraph VCBeta["Namespace: vc-beta"]
            subgraph Beta["vCluster beta"]
                BetaCP["Own API server<br>Own ctrl-mgr<br>Own scheduler<br>Own etcd"]
                BetaTenant["<b>Tenant sees:</b><br>Own namespaces<br>Own CRDs<br>Own RBAC<br>Own secrets"]
            end
            SyncerB["Syncer<br>Syncs pods to host"]
            Beta --> SyncerB
        end
        HostProvides["Host cluster provides: nodes, networking, storage"]
    end

How vCluster Works

  1. The vCluster control plane (API server, controller manager, optional etcd) runs as pods in a host namespace.
  2. Tenants connect to the vCluster’s API server via a kubeconfig. They see a normal Kubernetes cluster with its own namespaces, CRDs, and RBAC.
  3. When a tenant creates a pod in the vCluster, the syncer translates it into a pod in the host namespace with a mangled name. The host cluster’s scheduler places it on a real node.
  4. The tenant’s pod runs on host infrastructure but appears in the vCluster’s API server with the tenant’s labels, annotations, and namespace.
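
The mechanics are easiest to see with the vcluster CLI. A sketch, assuming the CLI is installed and you have admin access to the host cluster:

# Create a virtual cluster inside its own host namespace
vcluster create alpha --namespace vc-alpha

# Connect: writes a kubeconfig and proxies to the vCluster API server
vcluster connect alpha --namespace vc-alpha

# From here, the tenant sees what looks like a full cluster
kubectl get namespaces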

What vCluster Provides

  • CRD isolation. Each vCluster can install its own CRDs without affecting other tenants.
  • Cluster-admin per tenant. Tenants can have cluster-admin inside their vCluster without affecting the host.
  • Independent upgrades. Each vCluster can run a different Kubernetes version.
  • Full RBAC isolation. ClusterRoles and ClusterRoleBindings are scoped to the vCluster.
  • Admission webhook isolation. Tenants can install their own admission webhooks.

What vCluster Does NOT Provide

  • Node-level isolation. Pods from different vClusters share the same nodes and Linux kernel. Container escape is still a risk.
  • Network isolation by default. You still need NetworkPolicies on the host cluster to isolate traffic between vClusters.
  • Zero overhead. Each vCluster’s control plane consumes resources (typically 0.5–1 CPU and 512Mi–1Gi memory for the API server and syncer).

Comparison

| Capability | Namespaces | Namespaces + HNC | vCluster |
| --- | --- | --- | --- |
| RBAC isolation | Namespace-scoped | Inherited + namespace | Full cluster-admin |
| CRD isolation | None | None | Full |
| Network isolation | Via NetworkPolicy | Via NetworkPolicy | Via NetworkPolicy + own Services |
| Resource quotas | Per namespace | Inherited | Per vCluster namespace on host |
| Independent K8s version | No | No | Yes |
| Own admission webhooks | No | No | Yes |
| Overhead per tenant | ~0 | ~0 | 0.5–1 CPU, 512Mi–1Gi |
| Node-level isolation | No | No | No (use Kata/gVisor) |
| Suitable for | Internal teams | Hierarchical orgs | SaaS, untrusted tenants |

When Namespaces Are Not Enough

Use separate physical clusters when:

  • Tenants are mutually untrusted and require node-level isolation
  • Compliance mandates physical separation (some PCI-DSS interpretations)
  • Failure domains must be completely independent

The Isolation Spectrum

Multi-tenancy is not binary. It is a spectrum from shared namespaces to dedicated clusters, and you can mix strategies:

  • Production workloads for different teams: namespaces with strict RBAC, NetworkPolicies, and quotas
  • Development and CI environments: vClusters (disposable, fast to create, cheap)
  • Customer-facing SaaS tenants: vClusters with NetworkPolicies and optional gVisor for runtime isolation
  • Regulated workloads: dedicated clusters with Cluster API lifecycle management

The right answer depends on your threat model, compliance requirements, and operational capacity. Start with namespaces, add vCluster when you hit a namespace limitation, and reach for dedicated clusters only when virtual isolation is insufficient.

Common Mistakes and Misconceptions

  • “Namespaces provide security isolation.” Namespaces are a grouping mechanism, not a security boundary. Without NetworkPolicies, RBAC, ResourceQuotas, and Pod Security Standards, pods in different namespaces can freely communicate and compete for resources.
  • “vCluster is overkill for multi-tenancy.” For strong isolation (e.g., different customers, untrusted workloads), namespace-level controls are often insufficient. vCluster provides full API isolation with lower overhead than separate physical clusters.

This concludes Part 7: Multi-Cluster and Platform Engineering. You now know how to operate Kubernetes at organizational scale — managing multiple clusters, building internal platforms, extending the API with Crossplane, and isolating tenants. Part 8 goes deeper into the machinery itself: writing your own controllers, understanding API internals, operating etcd, and running GPU and ML workloads.

Next: Writing Controllers and Operators

Chapter 38: Writing Controllers and Operators

Kubernetes ships with roughly thirty built-in controllers — the Deployment controller, the ReplicaSet controller, the Job controller, and so on. Each one watches a particular resource type, compares the desired state in the spec with the actual state in the cluster, and takes action to close the gap. This reconciliation pattern is the engine that makes Kubernetes declarative.

An operator is simply a custom controller that encodes domain-specific operational knowledge for a particular application. The Deployment controller knows how to roll out generic pods; a PostgreSQL operator knows how to initialize replicas, manage failover, and orchestrate backups. The extension mechanism is the same — only the knowledge embedded in the reconciliation logic differs.

This chapter covers how to build operators using the standard Go toolchain: the controller-runtime library and its scaffolding tool, Kubebuilder.

The Reconcile Loop

Every controller follows the same fundamental pattern. The control plane delivers a reconcile request — essentially a namespace/name pair — and the controller’s job is to make reality match the desired state for that object. The loop looks like this:

flowchart TD
    WQ["Work Queue<br>ns/name, ns/name, ..."] --> Fetch

    Fetch["1. FETCH<br>Get resource by ns/name"]
    Fetch -->|"found"| List
    Fetch -->|"not found<br>(deleted)"| Cleanup["Cleanup owned resources"] --> Done

    List["2. LIST<br>List owned child resources<br>(Deployments, Services, etc.)"]
    List --> Compare

    Compare["3. COMPARE<br>Diff desired state (spec)<br>vs actual state (children)"]
    Compare --> Act

    Act{"4. ACT"}
    Act -->|"missing"| Create["Create resources"]
    Act -->|"drifted"| Update["Update resources"]
    Act -->|"obsolete"| Delete["Delete resources"]
    Create --> Status
    Update --> Status
    Delete --> Status

    Status["5. STATUS<br>Update status subresource<br>(conditions, counts)"]
    Status --> Return

    Return{"6. RETURN"}
    Return -->|"error"| ErrRequeue["Requeue with<br>exponential backoff"] --> WQ
    Return -->|"RequeueAfter"| TimedRequeue["Requeue after<br>duration"] --> WQ
    Return -->|"nil, nil"| Done2["Done<br>(wait for next event)"]

Kubebuilder Scaffolding

Kubebuilder generates the boilerplate so you can focus on the reconciliation logic. A typical workflow:

# Initialize a new project
kubebuilder init --domain example.com --repo github.com/example/myoperator

# Create an API (CRD + controller)
kubebuilder create api --group apps --version v1alpha1 --kind MyApp

# Create a webhook (optional)
kubebuilder create webhook --group apps --version v1alpha1 --kind MyApp \
  --defaulting --programmatic-validation

This generates a directory structure with api/v1alpha1/myapp_types.go (your CRD schema), internal/controller/myapp_controller.go (your Reconcile function), and the wiring to register everything with the manager.

The Reconcile Function Skeleton

Here is the canonical structure in Go using controller-runtime:

package controller

import (
    "context"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    myappv1 "github.com/example/myoperator/api/v1alpha1"
)

type MyAppReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

func (r *MyAppReconciler) Reconcile(ctx context.Context,
    req ctrl.Request) (ctrl.Result, error) {

    logger := log.FromContext(ctx)

    // ── Step 1: Fetch the primary resource ──────────────────
    var app myappv1.MyApp
    if err := r.Get(ctx, req.NamespacedName, &app); err != nil {
        if errors.IsNotFound(err) {
            logger.Info("MyApp deleted, nothing to do")
            return ctrl.Result{}, nil
        }
        return ctrl.Result{}, err // requeue with backoff
    }

    // ── Step 2: List owned child resources ──────────────────
    var childDeploys appsv1.DeploymentList
    if err := r.List(ctx, &childDeploys,
        client.InNamespace(req.Namespace),
        // NOTE: This field selector requires a custom index. You must register it
        // in SetupWithManager using mgr.GetFieldIndexer().IndexField() — it does
        // not work out of the box. See the controller-runtime documentation for
        // how to set up custom field indexes.
        client.MatchingFields{"metadata.ownerReferences.uid": string(app.UID)},
    ); err != nil {
        return ctrl.Result{}, err
    }

    // ── Step 3: Compare desired vs actual ───────────────────
    desiredReplicas := app.Spec.Replicas
    if len(childDeploys.Items) == 0 {
        // ── Step 4a: Create ─────────────────────────────────
        deploy := r.buildDeployment(&app)
        if err := ctrl.SetControllerReference(&app, deploy, r.Scheme); err != nil {
            return ctrl.Result{}, err
        }
        if err := r.Create(ctx, deploy); err != nil {
            return ctrl.Result{}, err
        }
        logger.Info("Created Deployment", "replicas", desiredReplicas)
    } else {
        // ── Step 4b: Update if drifted ──────────────────────
        existing := &childDeploys.Items[0]
        if *existing.Spec.Replicas != desiredReplicas {
            existing.Spec.Replicas = &desiredReplicas
            if err := r.Update(ctx, existing); err != nil {
                return ctrl.Result{}, err
            }
        }
    }

    // ── Step 5: Update status ───────────────────────────────
    if len(childDeploys.Items) > 0 {
        app.Status.ReadyReplicas = childDeploys.Items[0].Status.ReadyReplicas
    }
    app.Status.Phase = "Running"
    if err := r.Status().Update(ctx, &app); err != nil {
        return ctrl.Result{}, err
    }

    // ── Step 6: Return result ───────────────────────────────
    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}

Notice that the status update uses r.Status().Update() — this hits the /status subresource, which has a separate authorization check and does not modify the spec. This separation is deliberate: it prevents a controller that only needs to report status from accidentally mutating the desired state.

Watches and Predicates

A controller must tell the manager which objects to watch. The SetupWithManager method configures this:

func (r *MyAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&myappv1.MyApp{}).               // primary resource
        Owns(&appsv1.Deployment{}).          // child resource
        Owns(&corev1.Service{}).             // another child (k8s.io/api/core/v1)
        // predicate is sigs.k8s.io/controller-runtime/pkg/predicate
        WithEventFilter(predicate.GenerationChangedPredicate{}).
        Complete(r)
}

.For() registers a watch on the primary resource. When a MyApp object is created, updated, or deleted, a reconcile request is enqueued.

.Owns() registers a watch on child resources and automatically maps events back to the owning parent. If someone manually edits a Deployment owned by your MyApp, the controller will reconcile the parent MyApp — and correct the drift.

Predicates filter which events actually trigger reconciliation. GenerationChangedPredicate skips status-only updates (since .metadata.generation only increments on spec changes). You can write custom predicates for arbitrary filtering:

withAnnotation := predicate.NewPredicateFuncs(func(obj client.Object) bool {
    return obj.GetAnnotations()["myapp.example.com/managed"] == "true"
})

Requeue Logic

The return value of Reconcile controls what happens next:

| Return Value | Behavior |
| --- | --- |
| ctrl.Result{}, nil | Terminal. No requeue. The controller is done until the next watch event. |
| ctrl.Result{}, err | Requeue with exponential backoff (controller-runtime's default rate limiter starts at 5ms and caps near 16 minutes). |
| ctrl.Result{Requeue: true}, nil | Immediate requeue, no backoff. Use sparingly. |
| ctrl.Result{RequeueAfter: 30s}, nil | Scheduled requeue. Useful for polling external systems. |

The exponential backoff on error is critical. Without it, a controller that encounters a persistent error (like a missing dependency) would hammer the API server in a tight loop. The backoff gives transient errors time to resolve and limits the blast radius of permanent failures.

Concurrency and Idempotency

By default, a controller processes one reconcile request at a time. You can increase parallelism:

ctrl.NewControllerManagedBy(mgr).
    WithOptions(controller.Options{MaxConcurrentReconciles: 5}).
    Complete(r)

But this means two reconcile calls for different objects may run simultaneously. Your Reconcile function must be idempotent — calling it twice with the same input must produce the same result. It must also be safe for concurrent execution across different keys. Never rely on in-memory state between reconciliations; always read from the API server.

Leader Election

In production, operators typically run with two or more replicas for availability. Only one replica should be actively reconciling at any time. controller-runtime supports leader election out of the box:

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    LeaderElection:   true,
    LeaderElectionID: "myapp-operator-lock",
})

Leader election uses a Lease object in the cluster. The active leader renews the lease periodically. If it fails to renew (crash, network partition), another replica acquires the lease and begins reconciling. The transition typically takes 15–30 seconds depending on configuration.

Webhook Development

Kubebuilder scaffolds two types of admission webhooks:

Mutating (Defaulting) webhooks modify incoming objects before they are persisted. Use these to inject default values, add labels, or set fields the user omitted:

func (r *MyApp) Default() {
    if r.Spec.Replicas == 0 {
        r.Spec.Replicas = 3
    }
    if r.Spec.Image == "" {
        r.Spec.Image = "myapp:latest"
    }
}

Validating webhooks reject invalid objects. They run after mutating webhooks and return an error if the object violates business rules:

func (r *MyApp) ValidateCreate() (admission.Warnings, error) {
    if r.Spec.Replicas > 100 {
        return nil, fmt.Errorf("replicas cannot exceed 100")
    }
    return nil, nil
}

Webhooks require TLS certificates. In production, use cert-manager to automate certificate provisioning and rotation.

The Operator Maturity Model

The Operator Framework defines five maturity levels. Most operators in the wild sit at Level 1 or 2. Reaching Level 5 is rare and typically reserved for complex stateful systems.

OPERATOR MATURITY MODEL
────────────────────────

  Level 5  │  AUTO PILOT
           │  Automatic scaling, tuning, anomaly detection.
           │  Horizontal/vertical scaling based on load.
           │  Self-healing beyond simple restart.
           │
  Level 4  │  DEEP INSIGHTS
           │  Expose metrics, alerts, log processing.
           │  Grafana dashboards, SLO tracking.
           │  Workload-specific telemetry.
           │
  Level 3  │  FULL LIFECYCLE
           │  Automated backup/restore.
           │  Version upgrades with data migration.
           │  Configuration tuning.
           │
  Level 2  │  SEAMLESS UPGRADES
           │  Patch and minor version upgrades.
           │  Operand configuration changes.
           │  No downtime during upgrades.
           │
  Level 1  │  BASIC INSTALL
           │  Automated deployment and configuration.
           │  Operator manages basic provisioning.
           │
           └──────────────────────────────────────────────
             Increasing automation and operational knowledge

Start at Level 1. Level 3 is the inflection point where automated backup and upgrades pay off. Level 5 (auto-pilot) is rare; CockroachDB and ECK are examples.

Putting It All Together

  1. Start with the API. Design your CRD spec and status carefully. They are a contract with your users. Changing them later requires conversion webhooks and migration paths.

  2. Keep Reconcile idempotent. If you create a resource, check whether it already exists first. If you update, compare before patching. Never assume the world has not changed between your List and your Create.

  3. Use owner references. They give you garbage collection for free and enable the .Owns() watch pattern. When the parent is deleted, all owned children are cleaned up automatically.

  4. Separate spec from status. Always use the status subresource. Never let the controller modify the spec.

  5. Test with envtest. controller-runtime includes an integration test harness that spins up a real API server and etcd without needing a full cluster. Use it; a minimal sketch follows this list.

  6. Think about failure modes. What happens when the API server is unreachable? When a child resource is stuck terminating? When two operators fight over the same resource? The answers should be in your code, not in a runbook.
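
A minimal envtest sketch, assuming a kubebuilder layout where the CRD YAML lives in config/crd/bases and setup-envtest has installed the test binaries:

package controller

import (
    "path/filepath"
    "testing"

    "k8s.io/apimachinery/pkg/runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/envtest"

    myappv1 "github.com/example/myoperator/api/v1alpha1"
)

func TestReconcileCreatesDeployment(t *testing.T) {
    // Spin up a real API server and etcd with our CRDs installed.
    testEnv := &envtest.Environment{
        CRDDirectoryPaths: []string{filepath.Join("..", "..", "config", "crd", "bases")},
    }
    cfg, err := testEnv.Start()
    if err != nil {
        t.Fatalf("starting envtest: %v", err)
    }
    defer func() { _ = testEnv.Stop() }()

    // Build a client that understands our custom types.
    scheme := runtime.NewScheme()
    if err := myappv1.AddToScheme(scheme); err != nil {
        t.Fatal(err)
    }
    k8sClient, err := client.New(cfg, client.Options{Scheme: scheme})
    if err != nil {
        t.Fatal(err)
    }

    // From here: create a MyApp, invoke the reconciler against cfg,
    // and assert that the expected Deployment appears.
    _ = k8sClient
}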

Common Mistakes and Misconceptions

  • “Every application needs an Operator.” Operators are for stateful, complex applications that need operational automation (databases, message queues). A stateless web service managed by a Deployment does not need an Operator.
  • “Writing an Operator is straightforward.” Operators encode operational expertise in code. The happy path is simple, but handling every failure mode (partial updates, resource conflicts, cascading failures) correctly takes significant engineering effort.
  • “Operators are always better than Helm charts.” Helm charts are simpler: apply once, done. Use Operators when you need active reconciliation; use Helm when install-time configuration is sufficient.
  • “All Operators on OperatorHub are production-quality.” OperatorHub lists community and vendor operators with varying maturity levels. Check the capability level (basic install through full lifecycle) and community adoption before deploying to production.

Further Reading

  • Operator pattern — the official Kubernetes documentation explaining the operator concept, when to use one, and how operators extend the API.
  • Operator SDK documentation — the full guide for building operators with the Operator SDK, covering Go, Ansible, and Helm-based operators.
  • The Kubebuilder Book — a comprehensive tutorial that walks through building a controller from scratch using kubebuilder, including CRD design, webhook configuration, and testing.
  • OperatorHub.io — a catalog of community and vendor operators you can install in your cluster, useful for understanding what problems operators solve in practice.
  • Introducing Operators — the original CoreOS blog post by Brandon Philips that coined the term “operator” and explained the motivation behind encoding operations knowledge in software.
  • controller-runtime documentation — API reference for the Go library that underpins both kubebuilder and Operator SDK, covering the Manager, Controller, Reconciler, and client interfaces.
  • Programming Kubernetes (O’Reilly) — a book by Michael Hausenblas and Stefan Schimanski that covers the Kubernetes API machinery, custom resources, and operator development in depth.

Next: The Kubernetes API Internals — how requests flow through admission, what aggregated API servers are, and how API priority and fairness protects the control plane.

Chapter 39: The Kubernetes API Internals

Every interaction with a Kubernetes cluster — every kubectl apply, every controller reconciliation, every kubelet heartbeat — is an HTTP request to the API server.

This chapter covers the internal request lifecycle, API versioning mechanics, aggregated API servers, admission webhooks, conversion webhooks, and the priority and fairness system that prevents any single tenant from overwhelming the control plane.

API Groups and Versioning

Kubernetes organizes its API into groups. The core group (the empty string, with paths under /api/v1) contains the original resources: Pods, Services, ConfigMaps, Secrets. Everything added later lives in named groups under /apis: apps/v1 for Deployments, batch/v1 for Jobs, networking.k8s.io/v1 for Ingress.

Each group can serve multiple versions simultaneously, but only one is the storage version — the format actually written to etcd. When you create a Deployment via apps/v1, the API server converts it to the storage version before writing. When you read via a different version, it converts from storage on the way out.
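
You can see both halves of this from the outside. A quick sketch, assuming access to a cluster and a hypothetical CRD named myapps.example.com:

# List every group/version the API server currently serves
kubectl api-versions

# For a CRD, show which version is marked as the storage version
kubectl get crd myapps.example.com \
  -o jsonpath='{.spec.versions[?(@.storage==true)].name}'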

Version Progression

Versions follow a strict graduation path:

| Stage | Convention | Meaning |
| --- | --- | --- |
| Alpha | v1alpha1 | Disabled by default. May change or be removed without notice. Never use in production. |
| Beta | v1beta1 | Historically enabled by default; since 1.24, new beta APIs are disabled by default and require explicit opt-in. Schema is mostly stable but may change. A migration path will be provided. |
| Stable | v1 | GA. The API is committed. Breaking changes require a new group or version. |

The version in the group name (v1, v2) is the API version, not the software version. autoscaling/v2 replaced autoscaling/v2beta2 when HPA’s extended metrics support graduated.

The API Request Lifecycle

Every request to the API server passes through a series of stages.

sequenceDiagram
    participant C as Client
    participant AuthN as Authentication
    participant AuthZ as Authorization
    participant MW as Mutating Webhooks
    participant SV as Schema Validation
    participant VW as Validating Webhooks
    participant E as etcd

    C->>AuthN: request (kubectl, controller, kubelet)
    Note right of AuthN: Who are you?<br>x509 certs, bearer tokens,<br>OIDC, webhook
    AuthN->>AuthZ: user: alice, groups: [devs]
    Note right of AuthZ: Are you allowed?<br>RBAC, Webhook, Node
    AuthZ->>MW: allowed
    Note right of MW: Modify object (serial)<br>Add defaults, inject sidecars,<br>set labels
    MW->>SV: mutated object
    Note right of SV: OpenAPI schema check<br>+ CEL validation rules
    SV->>VW: valid object
    Note right of VW: Policy checks (parallel)<br>Cannot modify, only reject
    VW->>E: approved
    Note right of E: Convert to storage version<br>Write to etcd
    E-->>C: response

The ordering of mutating before validating is deliberate. Mutating webhooks may add fields that validating webhooks then check. If validation ran first, it would reject objects that mutating webhooks would have fixed.

Aggregated API Servers

Not every API endpoint is served by the core API server. The aggregation layer allows you to register custom API servers that handle specific API groups. The core API server proxies requests to these backends transparently.

This is how the metrics API (metrics.k8s.io) works. The metrics-server registers an APIService object:

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  service:
    name: metrics-server
    namespace: kube-system
  group: metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100

When a client requests kubectl top pods, the API server sees that metrics.k8s.io is handled by the metrics-server Service and proxies the request there. Authentication and authorization still happen at the front door — the aggregated server receives the request with identity headers already set.

Aggregated API servers are powerful but operationally heavy: they require their own storage, their own availability guarantees, and careful certificate management. For most use cases, CRDs are the simpler extension mechanism. Use aggregated APIs when you need custom storage backends, custom admission logic baked into the server, or sub-resource behaviors that CRDs cannot express.
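
You can observe the aggregation layer directly with kubectl. A short sketch, assuming metrics-server is installed:

# List all registered APIServices; "Local" means served by the core API server
kubectl get apiservices

# Check that the aggregated metrics backend is healthy
kubectl get apiservice v1beta1.metrics.k8s.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'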

Admission Webhooks

Admission webhooks are the primary extension point for policy enforcement and object mutation. They intercept requests after authentication and authorization but before storage.

Mutating Admission Webhooks

Mutating webhooks are called in serial, but the invocation order is determined by the API server (alphabetically by webhook name) and should not be relied upon. Each webhook can modify the object, and subsequent webhooks see the modifications made by previous ones. Furthermore, the API server may re-invoke mutating webhooks if a later webhook modifies the object, to give earlier webhooks a chance to react. Design mutating webhooks to be idempotent and order-independent. Common uses:

  • Injecting sidecar containers (Istio, Linkerd)
  • Adding default labels and annotations
  • Setting resource requests/limits from policy
  • Rewriting image references to use a private registry mirror

Validating Admission Webhooks

Validating webhooks are called in parallel after all mutating webhooks have run. They cannot modify the object — they can only accept or reject it. If any validating webhook rejects, the request is denied. Common uses:

  • Enforcing naming conventions
  • Requiring specific labels (owner, cost-center)
  • Blocking privileged containers
  • Preventing deployments to protected namespaces

The Admission Webhook Pipeline

The following diagram traces a Pod creation through the full admission webhook pipeline, showing how mutating webhooks run serially (each receiving the output of the previous one) while validating webhooks run in parallel (all must allow for the request to succeed).

sequenceDiagram
    participant Client as Client (kubectl)
    participant API as API Server
    participant MW1 as Mutating Webhook 1<br>(e.g. Istio sidecar inject)
    participant MW2 as Mutating Webhook 2<br>(e.g. Vault secret inject)
    participant SV as Schema Validator
    participant VW1 as Validating Webhook 1<br>(e.g. OPA Gatekeeper)
    participant VW2 as Validating Webhook 2<br>(e.g. Kyverno)
    participant etcd as etcd

    Client->>API: CREATE Pod
    Note over API: authn/authz (internal)

    rect rgba(50, 108, 229, 0.1)
        Note over API,MW2: Mutating webhooks run SERIALLY
        API->>MW1: POST /mutate
        MW1-->>API: mutated obj
        API->>MW2: POST /mutate (with mutations from WH 1)
        MW2-->>API: mutated obj
    end

    API->>SV: validate schema
    SV-->>API: OK

    rect rgba(90, 142, 240, 0.1)
        Note over API,VW2: Validating webhooks run in PARALLEL
        par
            API->>VW1: POST /validate
            VW1-->>API: allowed: true
        and
            API->>VW2: POST /validate
            VW2-->>API: allowed: true
        end
    end

    API->>etcd: ALL passed -> store
    etcd-->>API: stored
    API-->>Client: 201 Created

Configuration Details

A webhook configuration includes several critical fields:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: require-labels
webhooks:
  - name: require-labels.example.com
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
    clientConfig:
      service:
        name: label-enforcer
        namespace: policy-system
      caBundle: <base64-encoded-CA>
    failurePolicy: Fail        # or Ignore
    sideEffects: None           # None, NoneOnDryRun, or Unknown
    timeoutSeconds: 5           # default 10, max 30
    matchConditions:            # CEL-based filtering (beta 1.28, GA 1.30)
      - name: exclude-system
        expression: "!object.metadata.namespace.startsWith('kube-')"

failurePolicy controls what happens when the webhook itself is unreachable or returns an error. Fail means the API request is rejected — safe but can block the entire cluster if the webhook goes down. Ignore means the request proceeds without the webhook check — available but potentially unsafe. In production, Fail is the correct default for security-critical webhooks, but you must ensure the webhook is highly available.

sideEffects declares whether the webhook has side effects beyond modifying the admission response. Webhooks that write to external systems should declare this honestly; it affects dry-run behavior.

matchConditions (beta in Kubernetes 1.28, GA in 1.30) use CEL expressions to filter which objects actually get sent to the webhook. This is far more efficient than filtering inside the webhook itself, because non-matching objects never leave the API server.

timeoutSeconds sets how long the API server waits for a webhook response. Keep this low (3–5 seconds). A slow webhook adds latency to every matching API request. A webhook that consistently times out under failurePolicy: Fail will make the cluster unusable.

Conversion Webhooks

When a CRD serves multiple versions, the API server needs a way to convert between them. For trivial changes, Kubernetes can handle this automatically. For structural changes, you deploy a conversion webhook.

The model is hub and spoke: you designate one version as the “hub” (typically the storage version), and the webhook converts between the hub and every other version. This avoids the combinatorial explosion of converting between every pair of versions.

flowchart LR
    Hub["v1<br>(Hub / storage version)"]
    A["v1alpha1"]
    B["v1beta1"]

    Hub -- "webhook converts<br>v1 → v1alpha1" --> A
    A -- "webhook converts<br>v1alpha1 → v1" --> Hub
    Hub -- "webhook converts<br>v1 → v1beta1" --> B
    B -- "webhook converts<br>v1beta1 → v1" --> Hub

Conversion webhooks must be lossless — converting v1beta1 → v1 → v1beta1 must produce the original object. If a new version adds fields that older versions lack, use annotations to preserve the data during round-trips. This is subtle and error-prone; test conversion extensively.
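
With kubebuilder, hub and spoke maps onto two small interfaces from controller-runtime's conversion package. A sketch, assuming v1 is the hub and v1beta1 is a spoke in which a field was renamed from Size to Replicas (a hypothetical change; the types reuse the MyApp example from the previous chapter):

// api/v1/myapp_conversion.go
package v1

// Hub marks v1 as the hub version; no conversion logic lives here.
func (*MyApp) Hub() {}

// api/v1beta1/myapp_conversion.go
package v1beta1

import (
    "sigs.k8s.io/controller-runtime/pkg/conversion"

    v1 "github.com/example/myoperator/api/v1"
)

// ConvertTo converts this v1beta1 object to the hub (v1) version.
func (src *MyApp) ConvertTo(dstRaw conversion.Hub) error {
    dst := dstRaw.(*v1.MyApp)
    dst.ObjectMeta = src.ObjectMeta
    dst.Spec.Replicas = src.Spec.Size // field renamed in v1 (hypothetical)
    return nil
}

// ConvertFrom converts from the hub (v1) version to v1beta1.
func (dst *MyApp) ConvertFrom(srcRaw conversion.Hub) error {
    src := srcRaw.(*v1.MyApp)
    dst.ObjectMeta = src.ObjectMeta
    dst.Spec.Size = src.Spec.Replicas
    return nil
}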

API Priority and Fairness

A single misbehaving controller can issue thousands of LIST requests and overwhelm the API server, starving kubelet heartbeats and other critical traffic. API Priority and Fairness (APF) prevents this by categorizing requests into priority levels and applying fair queuing within each level.

How It Works

  1. FlowSchemas classify incoming requests. Each FlowSchema matches requests by user, group, namespace, verb, or resource and assigns them to a priority level.

  2. Priority levels define how much of the API server’s capacity is allocated to that class of traffic. Higher-priority levels get more capacity and can borrow from lower levels.

  3. Within a priority level, fair queuing ensures that no single flow (e.g., requests from one service account) monopolizes the allocation.

The system ships with several built-in FlowSchemas:

| FlowSchema | Priority Level | Purpose |
| --- | --- | --- |
| exempt | exempt | Health checks, system:masters. No queuing. |
| system-leader-election | leader-election | Controller manager, scheduler leader election. |
| system-nodes | system | Kubelet requests. Must not be starved. |
| kube-controller-manager | workload-high | Built-in controllers. |
| service-accounts | workload-low | Default for service account traffic. |
| global-default | global-default | Catch-all for unmatched requests. |

The exempt level is special — requests skip all queuing and rate limiting. This ensures that the API server can always respond to its own health checks and that break-glass admin access is never throttled.

Diagnosing APF Issues

When requests are being throttled, the API server returns 429 Too Many Requests with a Retry-After header. The apiserver_flowcontrol_dispatched_requests_total and apiserver_flowcontrol_rejected_requests_total metrics reveal which priority levels are saturated.

If your operator is being throttled, the fix is usually one of:

  1. Reduce the request rate (use watches instead of polling, use caches, reduce list scope with label selectors).
  2. Create a dedicated FlowSchema that assigns your operator to a higher priority level.
  3. Increase the concurrency shares for the relevant priority level.

Option 1 is almost always the right answer. Options 2 and 3 just shift the problem to other tenants.
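
If you do go with option 2, the change is a single object. A hedged sketch, assuming an operator running as the ServiceAccount myapp-operator in namespace operators (flowcontrol.apiserver.k8s.io/v1 is GA as of Kubernetes 1.29; older clusters serve v1beta3):

apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: myapp-operator
spec:
  priorityLevelConfiguration:
    name: workload-high              # borrow the built-in level
  matchingPrecedence: 1000           # lower wins; evaluated before global-default
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: myapp-operator
            namespace: operators
      resourceRules:
        - verbs: ["get", "list", "watch"]
          apiGroups: ["*"]
          resources: ["*"]
          clusterScope: true
          namespaces: ["*"]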

Design Implications

Understanding the API internals changes how you build on Kubernetes:

Webhook placement matters. A mutating webhook that injects sidecars adds latency to every pod creation. Measure it. Keep webhook logic fast and simple. Avoid calling external services from inside a webhook.

Conversion webhooks are migration infrastructure. Plan for them from the start if your CRD is likely to evolve. Design your v1 storage version with enough flexibility that you do not need structural changes with every release.

APF protects the control plane from you. If your operator lists all pods in a 10,000-pod cluster every 30 seconds, APF will eventually throttle it. Use informer caches, label selectors, and field selectors to minimize API server load.

Authentication is pluggable. The API server does not care how you prove your identity — it supports client certificates, OIDC tokens, webhook-based token review, and service account tokens.

Common Mistakes and Misconceptions

  • “Admission webhooks are fire-and-forget.” A failing webhook can block all resource creation in your cluster. Always configure failurePolicy: Ignore for non-critical webhooks and ensure webhook services have high availability.
  • “Mutating and validating webhooks run in any order.” Mutating webhooks run first (and can run multiple rounds), then validating webhooks run. A validating webhook sees the final mutated object, not the original user submission.
  • “CRDs are free to create.” Each CRD adds load to the API server: storage in etcd, watch channels, discovery endpoints. Hundreds of CRDs (common with Crossplane providers) measurably increase API server memory and CPU usage.

Further Reading

  • Dynamic Admission Control — the official Kubernetes documentation on mutating and validating admission webhooks, including configuration, failure policies, and reinvocation.
  • Webhook Configuration Reference — details on configuring webhook authentication and authorization backends, including token review and subject access review webhooks.
  • API Aggregation Layer — how to extend the Kubernetes API with your own API server registered via APIService objects, including when to use aggregation versus CRDs.
  • Versions in CustomResourceDefinitions — how CRD versioning works, including storage versions, conversion webhooks, and strategies for evolving your API without breaking clients.
  • Extending the Kubernetes API — an overview comparing CRDs and aggregated API servers, covering the tradeoffs and use cases for each extension mechanism.
  • The Life of a Kubernetes API Request — a KubeCon talk that traces a request through authentication, authorization, admission, validation, and storage, visualizing the full request pipeline.
  • API Priority and Fairness — the official documentation on APF, covering FlowSchemas, PriorityLevelConfigurations, and how the API server prevents any single client from starving others.

Next: etcd Operations — the database that stores everything, and how to keep it healthy.

Chapter 40: etcd Operations

Every object in a Kubernetes cluster — every Pod, every Secret, every ConfigMap, every CRD instance — exists as a key-value pair in etcd. There is no secondary database, no cache that can reconstruct state. If etcd loses data, the cluster loses its memory. If etcd becomes slow, every API server call becomes slow. If etcd goes down, the cluster is effectively frozen: controllers cannot reconcile, the scheduler cannot place pods, and kubectl returns errors.

This chapter covers backup, restore, maintenance, monitoring, and the disaster recovery procedures you hope to never need.

What etcd Stores

etcd is a distributed key-value store that uses the Raft consensus protocol for replication. In a Kubernetes cluster, the API server is the only client — all reads and writes go through it. etcd stores:

  • All resource objects (Pods, Services, Deployments, Secrets, etc.)
  • Cluster configuration (RBAC rules, admission configurations)
  • Lease objects (node heartbeats, leader election)
  • Custom resources (anything registered via CRDs)
  • Events (though these are often short-lived)

The data is stored under a key hierarchy rooted at /registry/. A Pod named nginx in namespace default lives at /registry/pods/default/nginx.

etcd Cluster Architecture

flowchart TD
    subgraph Member1["etcd Member 1 (LEADER)"]
        WAL1["WAL<br>(write-ahead log)"]
        DB1["DB<br>(boltdb / bbolt)"]
    end
    subgraph Member2["etcd Member 2 (FOLLOWER)"]
        WAL2["WAL<br>(write-ahead log)"]
        DB2["DB<br>(boltdb / bbolt)"]
    end
    subgraph Member3["etcd Member 3 (FOLLOWER)"]
        WAL3["WAL<br>(write-ahead log)"]
        DB3["DB<br>(boltdb / bbolt)"]
    end

    Member1 -- "Raft: replicate<br>log entries" --> Member2
    Member1 -- "Raft: replicate<br>log entries" --> Member3

    Writes["Write requires agreement<br>from majority (2 of 3)"]
    Reads["Read can be served by<br>any member (with<br>consistency options)"]
    Member1 --- Writes
    Member2 --- Reads

    API["API Server connects via gRPC over TLS.<br>Only the API server talks to etcd directly.<br>All other components go through the API server."]
    Writes --- API
    Reads --- API

Raft requires a quorum — a majority of members — to commit writes. With 3 members, you can lose 1. With 5, you can lose 2. Always run an odd number of members. An even number provides no additional fault tolerance (4 members still tolerates only 1 failure, same as 3) while increasing the coordination overhead.

Backup

A backup you have never tested is wishful thinking.

Taking a Snapshot

# Using etcdctl (the network-aware CLI)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table

The snapshot captures the entire database at a point in time. Schedule snapshots at least every hour for production clusters. Store them off-cluster — in object storage (S3, GCS) with versioning enabled.
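
One common pattern is a CronJob pinned to the control-plane nodes that snapshots to a local path, which a second process ships to object storage. A sketch, assuming kubeadm-style certificate paths (as above); the backup directory and image tag are assumptions to adjust for your cluster:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 * * * *"                       # hourly
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true                   # reach etcd on 127.0.0.1
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          containers:
            - name: backup
              image: registry.k8s.io/etcd:3.5.12-0   # match your etcd version
              command: ["/bin/sh", "-c"]
              args:
                - etcdctl snapshot save /backup/etcd-$(date +%F-%H%M).db
                  --endpoints=https://127.0.0.1:2379
                  --cacert=/etc/kubernetes/pki/etcd/ca.crt
                  --cert=/etc/kubernetes/pki/etcd/server.crt
                  --key=/etc/kubernetes/pki/etcd/server.key
              volumeMounts:
                - name: etcd-pki
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          volumes:
            - name: etcd-pki
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /var/backups/etcd        # hypothetical local backup dir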

What Snapshots Do Not Capture

Snapshots capture etcd data only. They do not capture:

  • Container images (stored in registries)
  • Persistent volume data (stored on disks/NAS/cloud volumes)
  • External secrets (in Vault, AWS Secrets Manager, etc.)
  • Certificates (unless stored as Kubernetes Secrets)

A complete disaster recovery plan must address all of these.

Restore

Restoring from a snapshot is a destructive operation. It creates a new etcd data directory with a new cluster ID. All existing members must be stopped, and the new data directory must be distributed to all of them.

The Restore Command

Since etcd 3.5.x, the etcdctl snapshot restore command is deprecated. Use etcdutl instead:

# etcdutl operates on local files --- no network connection needed
etcdutl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-member-1 \
  --initial-cluster="etcd-member-1=https://10.0.1.10:2380,\
etcd-member-2=https://10.0.1.11:2380,\
etcd-member-3=https://10.0.1.12:2380" \
  --initial-cluster-token=etcd-cluster-restored \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

etcdctl vs etcdutl

This distinction confuses many operators:

| Tool | Scope | Example Operations |
| --- | --- | --- |
| etcdctl | Network operations. Talks to a running etcd server over gRPC. | snapshot save, get, put, member list, endpoint health |
| etcdutl | File operations. Works on local data files without a running server. | snapshot restore, snapshot status, defrag (offline) |

The rule of thumb: if the cluster is running and you are interacting with it, use etcdctl. If the cluster is down and you are operating on files, use etcdutl.

Compaction

etcd is a versioned key-value store. Every write creates a new revision. By default, etcd keeps all historical revisions, which means the database grows indefinitely. Compaction removes revisions older than a specified point.

# Get the current revision
rev=$(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')

# Compact everything older than current revision minus 10000
etcdctl compact $((rev - 10000))

Kubernetes API server handles compaction automatically via the --etcd-compaction-interval flag (default: 5 minutes). The API server compacts revisions older than the specified interval. You rarely need to run compaction manually unless the automatic process has fallen behind.

Defragmentation

Compaction marks old revisions as deleted but does not reclaim disk space. The database file retains its size (or grows) because bbolt uses a free-list internally. Defragmentation rewrites the database to reclaim this space.

# Online defragmentation (one member at a time)
etcdctl defrag --endpoints=https://10.0.1.10:2379

# Offline defragmentation (member must be stopped)
etcdutl defrag --data-dir=/var/lib/etcd

Defragmentation is an expensive operation that briefly blocks reads and writes on the affected member. In a multi-member cluster, defragment one member at a time, waiting for it to catch up with the leader before moving to the next. Never defragment the leader first — defragment followers, then transfer leadership, then defragment the old leader.

Performance Tuning

etcd is exquisitely sensitive to disk I/O latency. The single most impactful tuning decision is giving etcd dedicated, fast storage.

Hardware Recommendations

  • Disk: NVMe SSD or high-IOPS cloud volumes (gp3 with provisioned IOPS on AWS, pd-ssd on GCP). etcd’s WAL fsync is on the critical path for every write. Spinning disks are unacceptable. Network-attached storage with unpredictable latency is dangerous.
  • CPU: 2–4 dedicated cores. etcd is not CPU-intensive but is sensitive to scheduling delays.
  • Memory: 8 GB is sufficient for most clusters. etcd memory-maps its database, so larger databases need proportionally more RAM.
  • Network: Low-latency links between members. Raft consensus requires leader-to-follower round trips for every write. Cross-region etcd clusters are a recipe for latency problems.

Dedicated Machines

For production clusters, run etcd on dedicated nodes — not co-located with the API server or other control plane components. A CPU-hungry admission webhook or a memory leak in the scheduler should not be able to starve etcd of resources.

Tuning Parameters

# Increase heartbeat interval for high-latency networks (default 100ms)
--heartbeat-interval=250

# Increase election timeout proportionally (default 1000ms)
--election-timeout=2500

# Set snapshot count (how many transactions between snapshots)
--snapshot-count=10000

# Set quota backend bytes (database size limit, default 2GB, max 8GB)
--quota-backend-bytes=8589934592

--quota-backend-bytes is a safety valve. When the database exceeds this limit, etcd raises a NOSPACE alarm and stops accepting writes to prevent unbounded growth. If this happens, you must compact, defragment, and clear the alarm (etcdctl alarm disarm) to get below the limit before etcd will accept writes again.

Key Monitoring Metrics

Monitor these metrics to catch problems before they become outages:

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| etcd_mvcc_db_total_size_in_bytes | Database size. Indicates growth trends. | > 6 GB (approaching an 8 GB quota) |
| etcd_disk_wal_fsync_duration_seconds | WAL write latency. The canary for disk problems. | p99 > 10ms |
| etcd_disk_backend_commit_duration_seconds | Backend commit latency. | p99 > 25ms |
| etcd_network_peer_round_trip_time_seconds | Peer-to-peer latency. | p99 > 50ms |
| etcd_server_proposals_failed_total | Failed Raft proposals. Indicates leader instability. | Any increase |
| etcd_server_leader_changes_seen_total | Leader elections. Frequent changes signal network or disk issues. | > 3 per hour |
| etcd_server_has_leader | Whether this member sees a leader. | 0 for > 30s |

The WAL fsync duration is the single most important metric. When disk latency increases, writes slow down, Raft heartbeats are delayed, followers fall behind, and the leader may trigger unnecessary elections. Everything cascades from slow disks.
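
As a concrete starting point, a hedged PrometheusRule sketch for the WAL fsync alert, assuming the Prometheus Operator is installed and etcd metrics are scraped under their default names:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-alerts
  namespace: monitoring
spec:
  groups:
    - name: etcd
      rules:
        - alert: EtcdSlowWalFsync
          expr: |
            histogram_quantile(0.99,
              rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "etcd p99 WAL fsync latency above 10ms; check disk I/O"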

Scaling the Cluster

Adding Members

# Add a new member as a learner (non-voting), so it replicates
# the log without affecting quorum while it catches up
etcdctl member add etcd-member-4 \
  --peer-urls=https://10.0.1.13:2380 \
  --learner

# Start the new member with --initial-cluster-state=existing

# Once it has caught up with the leader's log, promote it to a voting member
etcdctl member promote <member-id>

Adding members as learners keeps a lagging newcomer from degrading quorum; etcd refuses to promote a learner that has not caught up. Always add one member at a time and wait for it to become healthy before adding the next.

Removing Members

# Get member ID
etcdctl member list

# Remove the member
etcdctl member remove <member-id>

When scaling from 3 to 5 members, you gain tolerance for 2 failures instead of 1, but you increase the write latency (the leader must wait for acknowledgment from 3 members instead of 2). Most clusters should stay at 3 or 5 members; 7+ members hurt write latency without meaningful availability gain.

Disaster Recovery Procedure

When etcd is down and you have a snapshot, follow this sequence.

The following sequence diagram shows the exact ordering of a disaster recovery restore. The order is critical — starting API servers before etcd is stopped, or restoring only some members, causes data loss or split-brain:

sequenceDiagram
    participant Op as SRE / Operator
    participant API1 as API Server (node 1)
    participant API23 as API Server (node 2+3)
    participant E1 as etcd member-1
    participant E2 as etcd member-2
    participant E3 as etcd member-3
    participant Tool as etcdutl

    Note over Op,Tool: 1. Stop ALL API servers (prevent writes during restore)
    Op->>API1: stop
    Op->>API23: stop
    API1-->>Op: stopped
    API23-->>Op: stopped

    Note over Op,Tool: 2. Stop ALL etcd members
    Op->>E1: stop
    Op->>E2: stop
    Op->>E3: stop
    E1-->>Op: stopped
    E2-->>Op: stopped
    E3-->>Op: stopped

    Note over Op,Tool: 3. Restore snapshot on EACH member
    Op->>Tool: etcdutl snapshot restore
    Tool->>E1: restore to new data-dir
    Tool->>E2: restore to new data-dir
    Tool->>E3: restore to new data-dir

    Note over Op,Tool: 4. Replace data directories (mv new-data -> /var/lib/etcd)
    Op->>E1: replace data-dir
    Op->>E2: replace data-dir
    Op->>E3: replace data-dir

    Note over Op,Tool: 5. Start etcd members (one at a time)
    Op->>E1: start
    E1-->>Op: running
    Op->>E2: start
    E2-->>Op: running
    Op->>E3: start
    E3-->>Op: running

    Note over Op,Tool: 6. Verify: etcdctl endpoint health + member list
    Op->>E1: health check
    E1-->>Op: all healthy

    Note over Op,Tool: 7. Start API servers
    Op->>API1: start
    Op->>API23: start
    API1-->>Op: running
    API23-->>Op: running

    Note over Op,Tool: CRITICAL: If you start API servers<br>before stopping etcd, new writes will be<br>lost when the snapshot overwrites the data directory.

Follow these steps:

  1. Stop all API servers. They will reconnect when etcd is back.
  2. Stop all etcd members.
  3. Restore the snapshot on each member using etcdutl snapshot restore with the correct --name, --initial-cluster, and --initial-advertise-peer-urls for each member.
  4. Replace the data directory on each member with the restored data.
  5. Start all etcd members simultaneously (or within a few seconds of each other).
  6. Verify cluster health: etcdctl endpoint health
  7. Start the API servers.
  8. Verify cluster state: kubectl get nodes, kubectl get pods --all-namespaces
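For example, restoring member-1 of a three-member cluster might look like this (names, IPs, and paths are illustrative; repeat on each member with its own --name and peer URL):

# Run on member-1
etcdutl snapshot restore /backup/snapshot.db \
  --name etcd-member-1 \
  --initial-cluster etcd-member-1=https://10.0.1.10:2380,etcd-member-2=https://10.0.1.11:2380,etcd-member-3=https://10.0.1.12:2380 \
  --initial-advertise-peer-urls https://10.0.1.10:2380 \
  --data-dir /var/lib/etcd-new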

After a restore, the cluster will be in the state captured by the snapshot. Any objects created between the snapshot and the failure are lost. This is why frequent snapshots and short RPO targets matter.

If only one member has failed (and quorum is maintained), do not restore from a snapshot. Instead, remove the failed member, provision a new one, and add it to the cluster. The Raft protocol will replicate the current state to the new member automatically.

Common Mistakes and Misconceptions

  • “etcd backs up automatically in managed Kubernetes.” True for the control plane etcd in EKS/GKE/AKS. But if you run self-managed clusters or use etcd for other purposes, backups are your responsibility. Test restores regularly.
  • “etcd can store large values.” etcd’s default request size limit is 1.5 MB, which effectively caps how large a single value can be. Storing large ConfigMaps, Secrets, or CRDs that approach this limit degrades performance. Keep resources small.
  • “Adding more etcd nodes improves write performance.” More nodes means more Raft acknowledgments per write — 7+ members hurt, not help, write performance.

Further Reading

  • etcd Documentation — the official etcd docs covering installation, configuration, clustering, authentication, and the client API.
  • etcd Performance — benchmarking methodology and tuning guidance for etcd, including disk I/O recommendations, network latency requirements, and how to interpret benchmark results.
  • etcd Disaster Recovery — step-by-step procedures for recovering an etcd cluster from snapshot backups, including single-member and multi-member restore workflows.
  • etcd FAQ — answers to common operational questions about etcd, including cluster sizing, data size limits, request size limits, and performance expectations.
  • Operating etcd Clusters for Kubernetes — the Kubernetes-specific guide for setting up, backing up, and upgrading etcd, including TLS configuration and snapshot best practices.
  • etcd-operator (archived) — the original CoreOS operator for managing etcd clusters on Kubernetes, now archived but valuable as a reference for understanding automated etcd lifecycle management.
  • Auger — a tool for directly decoding and inspecting Kubernetes objects stored in etcd, useful for debugging and understanding how the API server serializes resources.

Next: GPU Workloads and AI/ML on Kubernetes — how GPUs are exposed to the scheduler, shared between workloads, and orchestrated for distributed training.

Chapter 41: GPU Workloads and AI/ML on Kubernetes

Kubernetes was built to orchestrate stateless web services. GPUs were built to render triangles and multiply matrices. Bringing these two worlds together required years of extension work — device plugins, operator stacks, specialized schedulers, and high-speed networking — because none of the original Kubernetes abstractions anticipated hardware accelerators.

The Device Plugin Framework

Kubernetes has no native understanding of GPUs. It knows about CPU (millicores), memory (bytes), ephemeral storage, and hugepages. Everything else enters through the device plugin framework, a gRPC-based extension point introduced in Kubernetes 1.8. In Chapter 3, we described the kubelet as a single-responsibility agent that converts API state into running containers. The device plugin framework extends the kubelet’s vocabulary beyond CPU and memory, letting it manage hardware it was never designed to know about.

How It Works

A device plugin is a process (usually running as a DaemonSet on every GPU node) that participates in three gRPC interactions with the kubelet:

  1. Registration: The plugin connects to the kubelet’s Registration service at /var/lib/kubelet/device-plugins/kubelet.sock and announces a resource name (e.g., nvidia.com/gpu).

  2. ListAndWatch: The kubelet calls ListAndWatch on the plugin. The plugin returns a stream of device IDs — one per physical GPU (or virtual slice). If a GPU fails or is removed, the plugin sends an updated list. The kubelet forwards this inventory to the API server, which stores it in the Node’s .status.capacity and .status.allocatable fields.

  3. Allocate: When the scheduler places a pod requesting nvidia.com/gpu: 1 on this node, the kubelet calls Allocate with the chosen device ID. The plugin returns the environment variables, device mounts, and annotations needed to make the GPU visible inside the container (e.g., /dev/nvidia0, the NVIDIA device files, and NVIDIA_VISIBLE_DEVICES).

flowchart TD
    subgraph Node["GPU Node"]
        Plugin["NVIDIA Device Plugin (Pod)<br>Reports: GPU-0, GPU-1, GPU-2, GPU-3"]
        Kubelet["kubelet"]

        Plugin -- "1. Register('nvidia.com/gpu')" --> Kubelet
        Kubelet -- "2. ListAndWatch()<br>Plugin streams device IDs:<br>{GPU-0, GPU-1, GPU-2, GPU-3}" --> Plugin
        Kubelet -- "4. Allocate(GPU-2)" --> Plugin
        Plugin -- "Returns:<br>/dev/nvidia2<br>NVIDIA_VISIBLE_DEVICES=2<br>volume mounts" --> Kubelet
    end

    Kubelet -- "3. Updates Node status:<br>capacity: nvidia.com/gpu: 4" --> API["API Server<br>Node object .status:<br>allocatable: nvidia.com/gpu: 4"]

The device plugin protocol has two phases — registration (once at startup) and per-pod allocation. The following sequence diagram shows both:

sequenceDiagram
    participant Plugin as NVIDIA Device Plugin
    participant Kubelet as kubelet<br>(device manager)
    participant API as API Server
    participant Sched as Scheduler
    participant User as User (kubectl)

    rect rgba(50, 108, 229, 0.1)
        Note over Plugin,User: PHASE 1: REGISTRATION (startup)
        Plugin->>Kubelet: Register() via Unix socket
        Kubelet-->>Plugin: accepted
        Plugin->>Kubelet: ListAndWatch() stream:<br>[gpu-0, gpu-1, gpu-2, gpu-3]
        Kubelet->>API: update Node .status.capacity<br>nvidia.com/gpu: 4
    end

    rect rgba(90, 142, 240, 0.1)
        Note over Plugin,User: PHASE 2: PER-POD ALLOCATION
        User->>API: create Pod<br>nvidia.com/gpu: 1
        API->>Sched: schedule: node has available GPU
        Sched-->>API: bind Pod to node
        API-->>Kubelet: Pod assigned to node
        Kubelet->>Plugin: Allocate() request: gpu-0
        Plugin-->>Kubelet: response:<br>/dev/nvidia0<br>CUDA_VISIBLE_DEVICES=0<br>/usr/lib/nvidia
        Note over Kubelet: start container with<br>GPU access (via CRI)
        Kubelet->>API: update Pod status: Running
    end
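To make the allocation path concrete, here is a minimal pod that would exercise it (image tag illustrative). Creating this pod triggers steps 2-4 of the allocation phase above:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]                      # prints the allocated GPU
      resources:
        limits:
          nvidia.com/gpu: 1                        # integer only; no fractions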

Critical Constraints

The device plugin model has several hard limitations that shape everything downstream:

  • Integer-only quantities. You request nvidia.com/gpu: 1 or nvidia.com/gpu: 2. There is no nvidia.com/gpu: 0.5. Fractional GPUs do not exist in this model.
  • Non-sharable. A GPU allocated to one pod is exclusively allocated. Two pods cannot share the same device ID through the standard device plugin.
  • Not overcommittable. Unlike CPU, which can be overcommitted (requests < limits), GPU counts are absolute. If a node has 4 GPUs and 4 are allocated, a fifth pod cannot be scheduled there.
  • No memory management. Kubernetes has no visibility into GPU memory. There is no equivalent of resources.limits.memory for GPU VRAM. A pod requesting nvidia.com/gpu: 1 gets the full physical GPU.

These constraints are why MIG, MPS, time-slicing, and ultimately DRA were created.

The NVIDIA GPU Operator

Installing GPU drivers on bare metal is annoying. Installing GPU drivers on every node in a Kubernetes cluster, keeping them in sync with the CUDA toolkit version, ensuring the container runtime is configured correctly, and monitoring GPU health across hundreds of nodes — that is an operational nightmare. The NVIDIA GPU Operator solves this by packaging the entire GPU software stack as Kubernetes-native operators and containerized components.

The Eight Components

| Component | Function |
| --- | --- |
| Node Feature Discovery (NFD) | Labels nodes with hardware capabilities (PCI vendor IDs, CPU features). The GPU stack depends on NFD labels to identify GPU nodes. |
| GPU Driver Container | Runs the NVIDIA kernel driver inside a container, compiled for the host’s kernel version. No host-level driver installation needed. |
| NVIDIA Container Toolkit | Configures the container runtime (containerd/CRI-O) to expose GPUs to containers. Installs the nvidia-container-runtime hook. |
| Device Plugin | The gRPC device plugin described above. Reports GPUs to the kubelet. |
| GPU Feature Discovery (GFD) | Labels nodes with GPU-specific metadata: model (nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB), driver version, CUDA version, MIG capabilities. |
| DCGM Exporter | Exposes GPU metrics (utilization, temperature, memory usage, ECC errors, power draw) as Prometheus metrics. |
| MIG Manager | Configures Multi-Instance GPU partitioning on supported hardware (A100, H100, H200). Applies MIG profiles via node labels. |
| Operator Validator | Runs post-installation validation to confirm the entire stack is functional. Reports status as conditions on the ClusterPolicy CRD. |

The Modern Stack (2025-2026)

NVIDIA announced the evolution of the GPU management stack at KubeCon 2026:

GPU Operator –> DRA Driver –> KAI Scheduler

The GPU Operator now ships a DRA driver (replacing the legacy device plugin path) that exposes GPUs through the Dynamic Resource Allocation API. The KAI Scheduler is a topology-aware GPU scheduler that understands NVLink domains, MIG slices, and multi-node placement. This trio is the direction all production GPU infrastructure is heading.

GPU Sharing and Multi-Tenancy

A single H100 has 80 GB of HBM3 memory and massive compute throughput. Running a small inference model that uses 2 GB of VRAM on a dedicated H100 wastes 97.5% of the memory. GPU sharing exists to solve this economics problem.

Three Approaches

GPU SHARING MODELS
──────────────────

  MULTI-INSTANCE GPU (MIG)             MULTI-PROCESS SERVICE (MPS)
  Hardware Partitioning                Software Space Partitioning
  ┌──────────────────────┐             ┌───────────────────────┐
  │     Physical GPU     │             │     Physical GPU      │
  │  ┌─────┬─────┬─────┐ │             │                       │
  │  │ MIG │ MIG │ MIG │ │             │  ┌───┐ ┌───┐ ┌───┐    │
  │  │Inst │Inst │Inst │ │             │  │P1 │ │P2 │ │P3 │    │
  │  │ 0   │ 1   │ 2   │ │             │  │   │ │   │ │   │    │
  │  │     │     │     │ │             │  │30%│ │50%│ │20%│    │
  │  │1g.  │1g.  │1g.  │ │             │  │   │ │   │ │   │    │
  │  │10gb │10gb │10gb │ │             │  └───┘ └───┘ └───┘    │
  │  ├─────┼─────┼─────┤ │             │   Shared CUDA Context │
  │  │Own  │Own  │Own  │ │             │   Explicit memory     │
  │  │SM   │SM   │SM   │ │             │   and compute limits  │
  │  │+Mem │+Mem │+Mem │ │             │   per process         │
  │  └─────┴─────┴─────┘ │             └───────────────────────┘
  └──────────────────────┘
                                       TIME-SLICING
  Isolated compute engines,            CUDA Context Switching
  memory controllers, and              ┌───────────────────────┐
  cache partitions.                    │     Physical GPU      │
  Fault isolation: yes.                │                       │
  Memory isolation: yes.               │  ┌──────────────────┐ │
                                       │  │ Time T1: Pod A   │ │
                                       │  ├──────────────────┤ │
                                       │  │ Time T2: Pod B   │ │
                                       │  ├──────────────────┤ │
                                       │  │ Time T3: Pod C   │ │
                                       │  └──────────────────┘ │
                                       │  Round-robin context  │
                                       │  switching. All pods  │
                                       │  see full GPU memory. │
                                       │  No memory isolation. │
                                       └───────────────────────┘

Multi-Instance GPU (MIG) partitions a physical GPU at the hardware level. On an A100-80GB, you can create up to 7 instances, each with dedicated streaming multiprocessors, memory controllers, and L2 cache. Profiles include 1g.10gb (1 compute slice, 10 GB), 2g.20gb, 3g.40gb, 4g.40gb, and 7g.80gb (on the A100-40GB, the equivalents run from 1g.5gb to 7g.40gb). MIG provides true fault and memory isolation. One instance cannot see another’s memory, and a CUDA crash in one instance does not affect others.

Multi-Process Service (MPS) is a software-level sharing mechanism. An MPS server sits between CUDA clients and the GPU, multiplexing access. You can set explicit per-client limits: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT=0=4096M caps a process to 4 GB. MPS allows concurrent kernel execution (true parallelism on the SM level) but lacks the hard isolation of MIG.

Time-Slicing is the simplest approach. The NVIDIA device plugin is configured to advertise more “GPUs” than physically exist (e.g., 4 physical GPUs advertised as 16 time-sliced replicas). CUDA contexts are switched in round-robin fashion. There is no memory isolation — all pods see the full VRAM and can OOM-kill each other. Context switching adds latency overhead.
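As a sketch, the NVIDIA device plugin reads a sharing configuration along these lines (replica count illustrative); with replicas: 4, each physical GPU is advertised to the kubelet as four schedulable nvidia.com/gpu units:

version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # each physical GPU advertised as 4 time-sliced units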

When to Use Each

| Scenario | Recommended Approach | Rationale |
| --- | --- | --- |
| Production inference with SLAs | MIG | Hard isolation, predictable performance |
| Development and experimentation | Time-slicing | Simple setup, maximum flexibility |
| Batch inference pipelines | MPS | Concurrent execution, configurable limits |
| Multi-tenant cluster, untrusted workloads | MIG | Fault isolation between tenants |
| Cost optimization, trusted workloads | Time-slicing or MPS | Maximize utilization |

Dynamic Resource Allocation (DRA)

You cannot express “give me a MIG slice with 20 GB of memory on a GPU that has NVLink connectivity to another GPU already allocated to this pod” with nvidia.com/gpu: 1.

Why Device Plugins Were Insufficient

  1. Count-based only. No way to parameterize requests (memory size, compute capability, MIG profile).
  2. No sharing semantics. Two pods cannot request access to the same physical device.
  3. No topology awareness. No way to express “these two GPUs must be on the same NVLink domain.”
  4. No scheduling integration. Device allocation happens at the kubelet level, after scheduling. The scheduler has no visibility into device topology.
  5. Vendor-locked plugin logic. All allocation intelligence is inside the vendor’s plugin binary.

The DRA Model

DRA, graduating to GA in Kubernetes 1.34-1.35, introduces a structured, parameterized model for hardware allocation.

DEVICE PLUGIN MODEL vs DRA MODEL
─────────────────────────────────

  DEVICE PLUGIN (Legacy)                DRA (Modern)
  ──────────────────────                ────────────

  Pod spec:                             Pod spec:
    resources:                            resourceClaims:
      limits:                               - name: gpu
        nvidia.com/gpu: 1                     resourceClaimTemplateName: gpu-claim

  That's it. Count only.                ResourceClaimTemplate:
  No parameters.                          spec:
  No sharing.                               devices:
  No topology.                                requests:
                                              - name: gpu
                                                deviceClassName: gpu.nvidia.com
                                                selectors:
                                                - cel:
                                                    expression: >
                                                      device.attributes["gpu.nvidia.com"]
                                                      .productName == "H100" &&
                                                      device.attributes["gpu.nvidia.com"]
                                                      .memory.isGreaterThan(
                                                        quantity("40Gi"))

  ┌─────────┐  count=1  ┌─────────┐     ┌─────────┐ claim ┌────────────┐
  │  Pod    │──────────►│ kubelet │     │   Pod   │──────►│ Scheduler  │
  │         │           │ picks   │     │         │       │ evaluates  │
  │         │           │ any     │     │         │       │ CEL exprs, │
  │         │           │ GPU     │     │         │       │ topology,  │
  └─────────┘           └─────────┘     └─────────┘       │ sharing    │
                                                          └────────────┘
                                                               │
                                                          ┌─────▼──────┐
                                                          │ DRA Driver │
                                                          │ prepares   │
                                                          │ device     │
                                                          └────────────┘

The Four API Objects

| Object | Purpose |
| --- | --- |
| ResourceSlice | Published by the DRA driver. Describes available devices on a node: attributes, capacity, topology. The scheduler reads these to make placement decisions. |
| DeviceClass | Cluster-scoped. Defines a class of devices with admin-set constraints and configuration. Example: a gpu.nvidia.com class might set a default MIG profile. |
| ResourceClaim | Namespace-scoped. A pod’s request for a device, with CEL-based selectors. Allocated by the scheduler, bound to specific devices. |
| ResourceClaimTemplate | Creates ResourceClaims per pod, like PVCs from volumeClaimTemplates in StatefulSets. |

CEL selector expressions can match on any device attribute: product name, memory size, MIG capability, driver version, NUMA node, NVLink group. You can express prioritized alternatives (“prefer H100, accept A100”) and device sharing (“this claim can share a device with that claim”).
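A sketch of what such a claim looks like, following the v1beta1 schema (field layout and attribute names vary by Kubernetes release and DRA driver):

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: gpu.nvidia.com
          selectors:
            - cel:
                expression: device.attributes["gpu.nvidia.com"].productName == "H100"
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: gpu-claim   # one claim generated per pod
  containers:
    - name: main
      image: cuda-training:latest            # illustrative
      resources:
        claims:
          - name: gpu                        # binds the container to the claim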

NVIDIA donated their DRA driver to the CNCF at KubeCon 2026, making it a vendor-neutral component of the ecosystem.

ML Training on Kubernetes

Training a large model is a distributed systems problem. A single GPU can handle fine-tuning a 7B model. Training a 70B model from scratch requires hundreds of GPUs coordinated across dozens of nodes, all processing data in lockstep. Kubernetes needs specialized operators and schedulers to manage these workloads.

Training Operators

Kubeflow Training Operator provides CRDs for distributed training frameworks:

  • PyTorchJob: Launches distributed PyTorch with torchrun. Configures MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK automatically.
  • TFJob: TensorFlow distributed training with PS/Worker topology.
  • MPIJob: MPI-based training (Horovod). Launches an MPI ring with SSH between pods.
  • TrainJob (v2): The unified API that abstracts framework details behind a single CRD. Specify a model, dataset, and training runtime; the operator generates the correct distributed topology.

KubeRay is the Kubernetes operator for Ray, the distributed compute framework used by OpenAI for ChatGPT training infrastructure. It provides:

  • RayCluster: A persistent Ray cluster with head and worker nodes.
  • RayJob: Submits a job to a RayCluster (or creates an ephemeral one).
  • RayService: Serves Ray Serve deployments with rolling upgrades.

Ray’s advantage is its unified API for training, tuning, and serving. A single Ray program can orchestrate data preprocessing, distributed training with PyTorch, hyperparameter tuning, and model serving.

Gang Scheduling

Standard Kubernetes scheduling is pod-by-pod. For a distributed training job requiring 64 GPUs across 8 nodes, the default scheduler might place 7 pods and then get stuck waiting for the 8th. Those 7 pods sit idle, burning GPU-hours, waiting for a resource that may not free up for hours.

Gang scheduling (all-or-nothing scheduling) ensures that either all pods in a job are scheduled simultaneously, or none are. Volcano is the primary gang scheduler for Kubernetes. It introduces:

  • Job CRD with minAvailable (minimum pods required to start) — see the sketch after this list.
  • Queue-based scheduling with fair-sharing across teams.
  • Preemption policies for priority-based scheduling.
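A minimal sketch of a gang-scheduled Volcano Job (image illustrative): with minAvailable: 8, either all eight workers are scheduled together or none are.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dist-train
spec:
  schedulerName: volcano
  minAvailable: 8               # gang constraint: all 8 pods or nothing
  tasks:
    - name: worker
      replicas: 8
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: trainer
              image: pytorch-train:latest   # illustrative
              resources:
                limits:
                  nvidia.com/gpu: 8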

Job Queuing with Kueue

Kueue is the Kubernetes-native job queuing system. While Volcano is a full scheduler replacement, Kueue works with the default scheduler, adding queuing and quota semantics on top.

Core concepts:

  • ClusterQueue: Defines a pool of resources (e.g., 100 GPUs, 200 CPUs) with borrowing limits (sketched after this list).
  • LocalQueue: Namespace-scoped queue that points to a ClusterQueue. Users submit jobs here.
  • ResourceFlavor: Describes a class of nodes (e.g., a100-spot, h100-ondemand). Maps to node labels.
  • Cohort borrowing: ClusterQueues in the same cohort can borrow unused resources from each other. Team A’s unused GPU quota flows to Team B automatically.
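A sketch of these objects (quota numbers and flavor names illustrative; assumes a ResourceFlavor named a100-spot exists):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  namespaceSelector: {}         # admit workloads from any namespace with a LocalQueue
  cohort: ml-teams              # unused quota can be borrowed within this cohort
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: a100-spot
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-queue
  namespace: team-a
spec:
  clusterQueue: team-a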

Kueue vs Volcano: Use Kueue when you need multi-tenant quota management and work with the default scheduler. Use Volcano when you need a full scheduler replacement with gang scheduling, preemption, and topology-aware placement. Many production clusters use both: Kueue for queuing and quota, Volcano for gang scheduling.

Networking for Distributed Training

Distributed training spends a significant fraction of total time on communication. After each forward/backward pass, gradients must be synchronized across all workers (AllReduce).

Why Standard TCP Is Insufficient

Standard TCP networking (Pod-to-Pod via CNI) adds multiple copies and context switches per message:

  1. GPU memory –> CPU memory (PCIe DMA)
  2. CPU memory –> kernel socket buffer
  3. Kernel –> NIC (TCP/IP stack processing, segmentation)
  4. Network transit
  5. NIC –> kernel socket buffer –> CPU memory –> GPU memory (reverse path)

For a 70B-parameter model with fp16 gradients, each AllReduce exchanges ~140 GB of data. Over standard 25 Gbps Ethernet with TCP, this gradient sync alone takes minutes. Real-world benchmarks show the scale of the difference: a training run that completes in 1h40m with GPUDirect RDMA can take 5 hours over standard TCP — a roughly 3x slowdown.

GPU-TO-GPU COMMUNICATION PATHS
───────────────────────────────

  STANDARD TCP (SLOW)
  ┌──────┐  PCIe  ┌──────┐  TCP/IP  ┌──────┐  PCIe  ┌──────┐
  │ GPU  │───────►│ CPU  │─────────►│ CPU  │───────►│ GPU  │
  │ Node │  copy  │ RAM  │  stack   │ RAM  │  copy  │ Node │
  │  A   │◄───────│      │◄─────────│      │◄───────│  B   │
  └──────┘        └──────┘  NIC     └──────┘        └──────┘
  Copies: 4 (GPU→CPU, CPU→NIC, NIC→CPU, CPU→GPU)
  Latency: ~100μs+       Bandwidth: limited by TCP stack

  RDMA / RoCE (FAST)
  ┌──────┐  PCIe  ┌──────┐  RDMA   ┌──────┐  PCIe ┌──────┐
  │ GPU  │───────►│ CPU  │ bypass  │ CPU  │──────►│ GPU  │
  │ Node │  copy  │ RAM  │────────►│ RAM  │  copy │ Node │
  │  A   │        └──────┘ no TCP  └──────┘       │  B   │
  └──────┘        NIC does          NIC does      └──────┘
                  direct            direct
                  memory            memory
                  access            access
  Copies: 2 (GPU→CPU, CPU→GPU)
  Latency: ~2μs          Kernel bypass, zero-copy NIC

  GPUDirect RDMA (FASTEST)
  ┌──────┐         RDMA          ┌──────┐
  │ GPU  │──────────────────────►│ GPU  │
  │ Node │  NIC reads directly   │ Node │
  │  A   │  from GPU memory      │  B   │
  │      │◄──────────────────────│      │
  └──────┘  No CPU involved      └──────┘
  Copies: 0 (GPU memory → NIC → network → NIC → GPU memory)
  Latency: ~1μs          Maximum bandwidth, zero CPU overhead

The Communication Stack

NCCL (NVIDIA Collective Communications Library) is the standard for multi-GPU collective operations (AllReduce, AllGather, Broadcast). NCCL automatically selects the fastest available transport: NVLink for intra-node, InfiniBand or RoCE for inter-node, falling back to TCP if nothing better exists.

InfiniBand provides the highest bandwidth (400 Gbps NDR) with sub-microsecond latency and native RDMA. Most large GPU clusters (DGX SuperPOD, etc.) use InfiniBand fabrics.

RoCE (RDMA over Converged Ethernet) provides RDMA semantics over standard Ethernet. Lower cost than InfiniBand, but requires lossless Ethernet configuration (PFC, ECN).

NVIDIA Network Operator

The NVIDIA Network Operator brings RDMA networking to Kubernetes:

  • Multus CNI: Attaches multiple network interfaces to pods (one for standard traffic, one for RDMA).
  • SR-IOV Device Plugin: Exposes SR-IOV Virtual Functions as schedulable resources (nvidia.com/rdma_shared_device_a).
  • RDMA Shared Device Plugin: Enables RDMA device sharing across containers.
  • Host Device Network: Passes InfiniBand/RoCE interfaces directly into pods.

A distributed training pod spec requests both GPUs and RDMA devices:

resources:
  limits:
    nvidia.com/gpu: 8
    nvidia.com/rdma_shared_device_a: 1

Cost Optimization for GPU Workloads

H100 instances cost $25-35/hour on-demand in major clouds. A 64-GPU training cluster burns $50,000-$70,000 per day. Cost optimization is not a nice-to-have; it is an engineering requirement.

Spot/Preemptible GPU Instances

Cloud providers offer GPU instances at 60-80% discounts through spot/preemptible pricing. The tradeoff: instances can be reclaimed with 30-120 seconds notice. For training workloads with checkpointing (save state every N steps), this is viable. For inference with graceful draining, it works with proper pod disruption budgets.

Karpenter with GPU Node Pools

Karpenter provisions right-sized nodes on demand. For GPU workloads, configure separate NodePools:

  • GPU NodePool: Instance types restricted to GPU families (p5, p4d, g5). Spot pricing enabled. SpotToSpotConsolidation moves workloads between spot pools to maintain availability. A sketch follows this list.
  • CPU NodePool: Standard instances for non-GPU workloads. Prevents GPU nodes from being used for CPU-only pods.
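A sketch of the GPU NodePool under these assumptions (AWS instance labels shown; the required nodeClassRef is omitted for brevity):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p5", "p4d", "g5"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule    # keeps CPU-only pods off expensive GPU nodes
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized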

Scheduling: Bin-Packing

The default Kubernetes scheduler spreads pods across nodes. For GPU workloads, bin-packing is critical: fill GPU nodes completely before allocating new ones. A half-utilized 8-GPU node is a node you are paying full price for. Use NodeResourcesFit with MostAllocated scoring strategy, or Karpenter’s consolidation to continuously pack workloads onto fewer nodes.
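A minimal KubeSchedulerConfiguration sketch for this (profile name and weights illustrative):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpu-binpacking
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated       # prefer nodes that are already full
            resources:
              - name: nvidia.com/gpu
                weight: 5             # weight GPU packing over CPU packing
              - name: cpu
                weight: 1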

The Full Cost Stack

  1. Spot instances for fault-tolerant training (60-80% savings).
  2. GPU sharing (MIG/MPS/time-slicing) for inference and dev workloads (2-7x utilization improvement).
  3. Bin-packing scheduling to minimize partially-used nodes.
  4. Kueue quotas to prevent teams from hoarding GPUs.
  5. Scale-to-zero for inference endpoints with no traffic (via KServe or KEDA).
  6. Preemption policies to let high-priority training preempt low-priority batch jobs.

Common Mistakes and Misconceptions

  • GPUs are allocated exclusively as whole units by default. No fractional requests, no sharing between pods without MIG, MPS, or DRA.
  • “Any Kubernetes node can schedule GPU workloads.” Nodes need the NVIDIA device plugin (or GPU Operator) installed, proper drivers, and the nvidia container runtime configured. Without this stack, K8s doesn’t know GPUs exist.
  • “Training and inference need the same infrastructure.” Training needs high-bandwidth interconnects (NVLink, InfiniBand), gang scheduling, and checkpointing. Inference needs low latency, autoscaling, and model serving frameworks. Different workloads, different architectures.

Further Reading

  • NVIDIA GPU Operator Documentation — the complete guide to deploying and managing the GPU Operator, which automates driver installation, container runtime configuration, device plugin deployment, and GPU monitoring.
  • Device Plugins — the official Kubernetes documentation on the device plugin framework, explaining how hardware vendors expose accelerators, FPGAs, and other devices to the kubelet.
  • Dynamic Resource Allocation KEP — the Kubernetes Enhancement Proposal for DRA with structured parameters, replacing the opaque device plugin model with a richer, scheduler-integrated resource claim system.
  • NVIDIA Multi-Instance GPU User Guide — how to partition A100 and H100 GPUs into isolated MIG instances with dedicated compute, memory, and cache, including supported profiles and configuration procedures.
  • Kubeflow Documentation — the full guide to the Kubeflow ML platform, covering pipelines, training operators (TFJob, PyTorchJob, MPIJob), model serving with KServe, and experiment tracking.
  • KubeRay Documentation — deploying and managing Ray clusters on Kubernetes for distributed training, hyperparameter tuning, and Ray Serve inference workloads.
  • Volcano Scheduler — documentation for the batch scheduling system designed for high-performance and ML workloads, supporting gang scheduling, fair-share queuing, and resource reservation.
  • NVIDIA Container Toolkit — the low-level runtime that makes GPUs accessible inside containers, including installation, configuration, and CDI (Container Device Interface) support.
  • NVIDIA GPU Operator Quickstart — hands-on guide to setting up GPU scheduling on Kubernetes.

Next: Chapter 42: Running LLMs on Kubernetes

Chapter 42: Running LLMs on Kubernetes

Serving a large language model is not the same problem as serving a web application. A web app handles requests independently in milliseconds with megabytes of memory. An LLM loads 50-400 GB of weights into GPU memory, processes requests through billions of sequential matrix multiplications, generates tokens one at a time, and must manage a KV cache that grows with every token. The infrastructure required — specialized inference servers, GPU-aware autoscaling, multi-node parallelism, model caching, and intelligent routing — demands a purpose-built stack.

ML Inference on Kubernetes

The inference server sits between Kubernetes and the GPU. It loads model weights, manages batching, handles tokenization, and exposes an API. Choosing the right one determines your throughput, latency, and cost.

KServe

KServe is the Kubernetes-native model serving framework. It provides a standard InferenceService CRD that abstracts away the inference runtime:

  • Autoscaling with Knative (including scale-to-zero, so idle models release GPU nodes entirely).
  • Canary rollouts: Route 10% of traffic to a new model version, monitor metrics, promote or roll back.
  • Multi-framework support: TensorFlow, PyTorch, ONNX, XGBoost, Triton, vLLM, and custom containers.
  • Transformer/Predictor/Explainer pipeline: Pre-process, predict, and post-process in a single InferenceService.

KServe v0.16 introduced the LLMInferenceService CRD, purpose-built for large language models:

  • OpenAI-compatible API endpoints out of the box (/v1/chat/completions, /v1/completions).
  • Integration with Gateway API for traffic management.
  • Distributed parallelism: define tensor parallelism and pipeline parallelism directly in the CRD spec.
  • Backend support for vLLM, TGI, and SGLang.

apiVersion: serving.kserve.io/v1beta1
kind: LLMInferenceService
metadata:
  name: llama-3-70b
spec:
  modelId: meta-llama/Llama-3-70B-Instruct
  workerSpec:
    tensorParallelSize: 4
    resources:
      limits:
        nvidia.com/gpu: 4

NVIDIA Triton (Dynamo Triton)

Triton Inference Server (now part of the NVIDIA Dynamo framework) is the most feature-rich inference server:

  • Multi-framework: Load TensorRT, ONNX, PyTorch, TensorFlow, and Python models simultaneously.
  • Dynamic batching: Accumulates requests for a configurable window (e.g., 50ms) and batches them into a single GPU kernel launch. Transforms 100 serial requests into 1 batched operation.
  • Model ensembles: Chain multiple models (tokenizer –> encoder –> decoder –> post-processor) in a DAG with zero-copy tensor passing between stages.
  • Model repository: Hot-load and unload models from S3/GCS/local storage without restarting.

vLLM

vLLM changed LLM inference economics. Its two core innovations:

PagedAttention: Traditional inference servers pre-allocate a contiguous block of GPU memory for each request’s KV cache, sized for the maximum sequence length. Most of this memory is wasted (a 2048-token allocation for a 200-token response wastes 90%). PagedAttention borrows the concept of virtual memory paging from operating systems: the KV cache is stored in non-contiguous physical blocks, mapped through a block table. Memory is allocated on demand as tokens are generated.

Continuous batching: Traditional batching waits for all requests in a batch to complete before accepting new ones. If one request generates 500 tokens and another generates 10, the short request’s GPU cycles are wasted while waiting. Continuous batching (also called iteration-level scheduling) adds and removes requests from the batch at every decode step. The GPU is never idle.

Together, these deliver up to 24x throughput improvement over naive single-request serving (2–4x over production servers with static batching). Organizations adopting vLLM have reported 50–75% cost reductions compared to traditional serving stacks, thanks to the combination of PagedAttention’s memory efficiency and continuous batching’s GPU utilization gains.
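A minimal sketch of running vLLM’s OpenAI-compatible server in a pod spec (image tag and model illustrative):

containers:
  - name: vllm
    image: vllm/vllm-openai:latest            # illustrative tag
    args:
      - "--model=meta-llama/Llama-3-70B-Instruct"
      - "--tensor-parallel-size=4"            # shard weights across 4 GPUs
    ports:
      - containerPort: 8000                   # serves /v1/chat/completions
    resources:
      limits:
        nvidia.com/gpu: 4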

Inference Server Comparison

| Feature | KServe + vLLM | Triton (Dynamo) | vLLM standalone |
| --- | --- | --- | --- |
| Autoscaling (incl. scale-to-zero) | Yes (Knative/KPA) | Manual / custom | No (needs wrapper) |
| OpenAI-compatible API | Yes (v0.16+) | Via ensemble config | Yes, native |
| Dynamic batching | Continuous (vLLM) | Configurable window | Continuous |
| Multi-model serving | Via multiple InferenceServices | Single server, multiple models | One model per process |
| PagedAttention | Yes | Via vLLM backend | Yes |
| Canary / traffic splitting | Native | External (Istio/Gateway) | External |
| Model ensemble / chaining | Via Transformer pipeline | Native DAG | No |
| Production maturity | High (CNCF project) | High (NVIDIA supported) | High (growing fast) |
| Best for | Production serving with MLOps | Multi-framework, complex pipelines | Maximum single-model throughput |

The GPU Autoscaling Problem

Autoscaling GPU inference is fundamentally different from autoscaling web services.

GPU utilization is a misleading metric. A GPU running vLLM at 95% utilization might be handling 10 requests/sec or 200 requests/sec — utilization stays pinned high as long as any work is in-flight, and it tells you nothing about whether users are getting acceptable latency.

The right metrics to scale on:

  • Queue depth: Number of requests waiting to be processed. If the queue is growing, you need more replicas.
  • Time to First Token (TTFT): Latency from request receipt to first generated token. This is what users perceive as “responsiveness.”
  • Inter-Token Latency (ITL): Time between consecutive tokens. Affects streaming experience.
  • Request throughput: Requests completed per second vs requests arriving per second.

KEDA Configuration for GPU Workloads

KEDA scales based on external metrics. For LLM inference:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-scaler
spec:
  scaleTargetRef:
    name: llm-deployment
  pollingInterval: 10          # Check every 10s (not 30s default)
  cooldownPeriod: 300          # 5 min cooldown (GPU nodes are expensive to churn)
  minReplicaCount: 1           # Keep 1 warm pod always
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        query: |
          sum(vllm:num_requests_waiting{model="llama-3-70b"})
        threshold: "10"        # Scale up when >10 requests queued

Scaling Latency Benchmarks

Every second of scaling latency is a second of degraded user experience:

| Scenario | Time |
| --- | --- |
| Warm node, model already loaded | ~45 seconds (pod scheduling + container start) |
| Cold node, Karpenter provisioning | ~6.5 minutes (instance launch + GPU driver init + model load) |
| Model load from NVMe local storage | ~18 seconds (for a 70B fp16 model) |
| Model load from SATA/network PVC | ~74 seconds (same model) |
| Model load from S3/GCS | ~90–180 seconds (varies by region and model size) |

The implication: keep warm pods. The cost of one idle GPU pod ($25-35/hour) is almost always less than the cost of 6+ minutes of failed requests during cold scale-up.

Cost Circuit Breakers

KEDA supports maxReplicaCount, but that is a blunt instrument. For cost control, implement circuit breakers:

  • Set maxReplicaCount to cap worst-case spend.
  • Use KEDA’s fallback configuration to define behavior when the metrics source is unreachable.
  • Monitor scaling events with alerts: “LLM deployment scaled to max replicas” should trigger investigation.

Multi-Node Inference

A single 70B parameter model in fp16 requires ~140 GB of GPU memory. An H100 has 80 GB. The model does not fit on one GPU. You have two ways to split it across multiple GPUs, and they solve different bottlenecks.

Tensor Parallelism (TP)

Tensor parallelism splits individual matrix multiplications across GPUs. For a weight matrix W of shape [4096, 4096], TP=4 gives each GPU a [4096, 1024] slice. Each GPU computes its portion of the output, then an AllReduce synchronizes the results.

Requirement: TP demands extremely high inter-GPU bandwidth because synchronization happens within every layer (multiple times per token). NVLink (900 GB/s on H100) is required. TP across network-connected GPUs is impractical.

Pipeline Parallelism (PP)

Pipeline parallelism splits the model by layers. If a model has 80 layers and PP=2, GPU group A handles layers 0-39 and GPU group B handles layers 40-79. A request’s activations flow from A to B after layer 39. The communication is sequential and relatively infrequent (once per micro-batch per pipeline stage), so network bandwidth requirements are modest.

Advantage: PP works across nodes connected by standard (even Ethernet) networking.

TENSOR PARALLELISM vs PIPELINE PARALLELISM
───────────────────────────────────────────

  TENSOR PARALLELISM (TP=4)           PIPELINE PARALLELISM (PP=2)
  Split WITHIN each layer             Split BY layers

  Layer N:                            Node A (Layers 0-39):
  ┌─────────────────────────┐         ┌───────────────────────┐
  │  Weight Matrix [4096²]  │         │  Layer 0              │
  │                         │         │  Layer 1              │
  │  ┌────┬────┬────┬────┐  │         │  ...                  │
  │  │GPU │GPU │GPU │GPU │  │         │  Layer 39             │
  │  │ 0  │ 1  │ 2  │ 3  │  │         │                       │
  │  │1024│1024│1024│1024│  │         │  Activations ─────────┼──►
  │  │cols│cols│cols│cols│  │         └───────────────────────┘
  │  └──┬─┴──┬─┴──┬─┴──┬─┘  │                              Network
  │     │    │    │    │    │                              (modest BW)
  │     └────┴──┬─┴────┘    │
  │          AllReduce      │         Node B (Layers 40-79):
  │         (NVLink,        │         ┌───────────────────────┐
  │          900 GB/s)      │    ──►  │  Layer 40             │
  └─────────────────────────┘         │  Layer 41             │
                                      │  ...                  │
  Communication: per-layer,           │  Layer 79             │
  extremely frequent.                 │                       │
  Requires NVLink.                    │  Output ──────────────┼──►
                                      └───────────────────────┘

  COMBINED: 2 nodes x 8 H100s = TP=8 (within node) + PP=2 (across nodes)
  This is how Llama 3.1 405B runs: ~810 GB in fp16, split across 16 GPUs.

LeaderWorkerSet (LWS)

LeaderWorkerSet is a Kubernetes-native API for multi-node GPU workloads. It creates a group of pods where one is designated the leader and the rest are workers. The leader’s hostname and IP are injected into all workers, solving the distributed coordination problem:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama-405b
spec:
  replicas: 2               # 2 model replicas
  leaderWorkerTemplate:
    size: 2                  # 2 nodes per replica (PP=2)
    leaderTemplate:
      spec:
        containers:
          - name: vllm
            resources:
              limits:
                nvidia.com/gpu: 8  # TP=8 within each node
    workerTemplate:
      spec:
        containers:
          - name: vllm
            resources:
              limits:
                nvidia.com/gpu: 8

llm-d (CNCF Sandbox, March 2026)

Traditional LLM serving treats prefill and decode as a single operation on the same GPU. This is wasteful: prefill (processing the input prompt) is compute-bound and bursty, while decode (generating output tokens one at a time) is memory-bandwidth-bound and latency-sensitive. A GPU optimized for one is suboptimal for the other.

llm-d (accepted into CNCF Sandbox in March 2026) disaggregates these phases:

Prefill/Decode Disaggregation

  • Prefill nodes: Handle prompt processing. Can use batch-optimized configurations, larger batch sizes, and are less sensitive to per-request latency.
  • Decode nodes: Handle token generation. Optimized for low latency, smaller batches, dedicated KV cache memory.

Requests flow: client –> prefill node (processes prompt, generates KV cache) –> KV cache transfer –> decode node (generates tokens, streams back to client).

The following sequence diagram shows how llm-d splits a single inference request across two specialized node pools — the prefill node processes the full prompt, then hands off the KV cache to a decode node for token generation:

sequenceDiagram
    participant Client as Client (API request)
    participant GW as Gateway / EPP Router
    participant Prefill as Prefill Node<br>(vLLM worker)
    participant KV as KV Cache Transfer
    participant Decode as Decode Node<br>(vLLM worker)

    Client->>GW: POST /chat/completions<br>{prompt: 4096 tokens}
    GW->>Prefill: route to prefill pool (prompt-heavy)

    Note over Prefill: Process full prompt in one<br>forward pass (compute-bound,<br>high GPU utilization)

    Prefill->>KV: transfer KV cache state
    Note over Prefill: Prefill done,<br>GPU freed for next prompt
    KV->>Decode: deliver KV cache to decode worker

    Note over Decode: Auto-regressive token generation<br>(memory-bound, sequential)

    Decode->>GW: stream tokens
    loop Token streaming
        GW->>Client: token
    end
    GW->>Client: [DONE]

    Note over Client,Decode: Key insight: Prefill is compute-bound (one large forward pass).<br>Decode is memory-bound (sequential token generation).<br>Splitting them lets each pool use optimally-sized GPUs and scale independently.

KV Cache Management

llm-d implements hierarchical KV cache offloading:

  1. GPU HBM: Fastest, most expensive. Active decode requests.
  2. CPU DRAM: 10-50x cheaper per GB. Recently completed requests that may be reused (prefix caching).
  3. Local NVMe/distributed storage: Persistent cache for common prefixes (system prompts, few-shot examples).

When a new request arrives with a prefix matching a cached KV cache, the decode node skips recomputation entirely.

Performance

Benchmarks from the llm-d team show:

  • ~57x faster P90 Time to First Token compared to round-robin load balancing in prefix-cache-heavy workloads with high prompt reuse (because cache-aware routing eliminates redundant prefill).
  • ~2x throughput improvement versus round-robin distribution.

The Production Stack

The emerging production architecture is: KServe (model lifecycle, autoscaling, API) + llm-d (intelligent routing, disaggregated serving, KV cache management). KServe handles the Kubernetes-native concerns; llm-d handles the LLM-specific optimization.

Gateway API Inference Extension

As LLM endpoints proliferate in a cluster, standard load balancing (round-robin, least-connections) leaves performance on the table. A request whose prefix matches a warm KV cache on GPU-3 should be routed to GPU-3, not to GPU-7 which would recompute the cache from scratch. Round-robin ignores the most important variable: which GPU already has relevant computation cached in memory.

The Gateway API Inference Extension adds model-aware routing to Kubernetes. It extends the standard Gateway API (the successor to Ingress) with inference-specific semantics.

CRDs

  • InferencePool: Defines a pool of pods serving inference (analogous to a Service, but model-aware). Each pool has an Endpoint Selection Extension (ESE) sidecar that makes routing decisions based on real-time pod state.
  • InferenceModel: Maps a model name to an InferencePool with criticality levels and traffic policies. Multiple InferenceModel resources can point to the same pool, enabling multi-model routing through a single gateway.

Endpoint Selection Extension (ESE)

The ESE sidecar receives routing requests from the gateway and selects the optimal backend pod based on:

  • KV cache affinity: Route to the pod most likely to have the request’s prefix cached. This is the single biggest optimization — prefix cache hits eliminate redundant prefill computation, reducing TTFT from seconds to milliseconds for repeated system prompts.
  • Queue depth: Avoid overloaded pods. The ESE tracks per-pod pending request counts in real time.
  • Model version: Route to pods serving the requested model version during canary deployments.
  • LoRA adapter affinity: When serving multiple LoRA fine-tuned variants from a single base model, route to the pod that already has the requested adapter loaded in memory.

Request Criticality

InferenceModel supports criticality levels (Critical, Standard, Sheddable). During overload, the gateway sheds Sheddable requests first, protecting Critical traffic. This maps naturally to production use cases: customer-facing chat is Critical, background summarization is Standard, and internal testing is Sheddable.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-critical
spec:
  modelName: meta-llama/Llama-3-70B-Instruct
  criticality: Critical
  poolRef:
    name: llm-pool
  targetModels:
    - name: meta-llama/Llama-3-70B-Instruct
      weight: 100

Model Caching and Storage

The single biggest contributor to LLM cold-start latency is model loading. Llama 3.1 405B in fp16 is ~810 GB. Downloading this from object storage to GPU memory takes minutes. Every strategy in this section exists to minimize or eliminate that wait.

MODEL LOADING STRATEGIES
────────────────────────

  STRATEGY 1: Object Storage (Slow, Simple)
  ┌──────┐   download    ┌───────────┐   load    ┌──────┐
  │  S3  │──────────────►│ Pod       │──────────►│ GPU  │
  │  GCS │  90-180s      │ (ephemeral│  10-30s   │ VRAM │
  │  Hub │  (network)    │  storage) │           │      │
  └──────┘               └───────────┘           └──────┘
  Total: 100-210s.  Every scale-up pays this cost.

  STRATEGY 2: Shared PVC (NFS / ReadWriteMany)
  ┌──────────────────────────────────────────────────────┐
  │  NFS PVC (ReadWriteMany)                             │
  │  /models/llama-3-70b/  (pre-populated)               │
  │                                                      │
  │  ┌──────────┐  ┌──────────┐  ┌──────────┐            │
  │  │ Pod A    │  │ Pod B    │  │ Pod C    │            │
  │  │ mounts   │  │ mounts   │  │ mounts   │            │
  │  │ /models  │  │ /models  │  │ /models  │            │
  │  └──────────┘  └──────────┘  └──────────┘            │
  └──────────────────────────────────────────────────────┘
  Total: 30-74s (NFS read → GPU).  No download step.

  STRATEGY 3: KServe LocalModel + Local NVMe
  ┌───────────┐ pre-cached   ┌────────────┐  load   ┌──────┐
  │ LocalModel│─────────────►│ Node NVMe  │────────►│ GPU  │
  │ controller│ (background) │ /mnt/models│  ~18s   │ VRAM │
  └───────────┘              └────────────┘         └──────┘
  Total: ~18s.  Model pre-staged on node before pod starts.

  STRATEGY 4: GKE Hyperdisk ML
  ┌──────────┐   block device  ┌──────────┐  load   ┌──────┐
  │ Hyperdisk│────────────────►│ Pod      │────────►│ GPU  │
  │ ML volume│   1.2 TB/s read │          │  ~20min │ VRAM │
  │ (GKE)    │   throughput    │          │         │      │
  └──────────┘                 └──────────┘         └──────┘
  Total: ~20 min for 405B.  Was 90 min from GCS.

The Concurrent Download Corruption Problem

When multiple pods share an NFS PVC and a new model version is deployed, naive init containers in each pod will download the model simultaneously. This creates a race condition: Pod A writes half the file, Pod B overwrites it, both end up with corrupt weights.

The solution: Use a central Kubernetes Job that downloads the model once to the shared PVC. Pods wait (via an init container that checks for a sentinel file) until the Job completes. This pattern is simple but eliminates an entire class of data corruption bugs.

apiVersion: batch/v1
kind: Job
metadata:
  name: download-llama-70b
spec:
  template:
    spec:
      containers:
        - name: downloader
          image: python:3.11
          # python:3.11 has no huggingface-cli; install it, download the
          # model, then write a sentinel file so waiting pods know it's done
          command: ["sh", "-c"]
          args:
            - |
              pip install --quiet "huggingface_hub[cli]"
              huggingface-cli download meta-llama/Llama-3-70B-Instruct \
                --local-dir /models/llama-3-70b
              touch /models/llama-3-70b/.download-complete
          volumeMounts:
            - name: model-store
              mountPath: /models
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: shared-model-store
      restartPolicy: OnFailure
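Consumer pods then gate on the sentinel file the Job writes when it finishes. A sketch of the waiting init container (paths and sentinel name must match the download Job above):

initContainers:
  - name: wait-for-model
    image: busybox:1.36
    command:
      - sh
      - -c
      - until [ -f /models/llama-3-70b/.download-complete ]; do sleep 10; done
    volumeMounts:
      - name: model-store
        mountPath: /models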

GKE Hyperdisk ML

Google’s Hyperdisk ML volumes provide up to 1.2 TB/s read throughput from a block storage volume. For Llama 3.1 405B loading, GKE benchmarks show reduction from 90 minutes (GCS download) to approximately 20 minutes with Hyperdisk ML, with further improvement possible through multi-volume striping.

The Hugging Face Ecosystem on Kubernetes

Text Generation Inference (TGI)

TGI was the first production-grade open-source LLM inference server. It pioneered several techniques now considered industry standard: continuous batching, flash attention integration, tensor parallelism, quantization support (GPTQ, AWQ, EETQ), and speculative decoding. TGI continues to be actively developed by Hugging Face. TGI 3.x added structured generation and multi-LoRA support, keeping it competitive with other inference servers. For new deployments, evaluate TGI alongside vLLM (for throughput) and SGLang (for structured generation and agent workloads) based on your specific requirements.

Text Embeddings Inference (TEI)

TEI is purpose-built for embedding and reranking models. Key characteristics:

  • Small footprint: Embedding models (e.g., BAAI/bge-large-en-v1.5 at 1.3 GB) fit on a single GPU or even CPU.
  • Fast boot: Sub-second cold starts for small models.
  • Dynamic batching: Automatically batches concurrent requests.
  • Token-based API: POST /embed with OpenAI-compatible response format.

TEI is the right choice for embedding pipelines in RAG architectures. Run it on a small GPU (T4, L4) or CPU nodes to keep costs minimal.
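A sketch of a TEI container for such a pipeline (image tag illustrative; TEI listens on port 80 by default):

containers:
  - name: tei
    image: ghcr.io/huggingface/text-embeddings-inference:1.5   # cpu-1.5 for CPU nodes
    args:
      - "--model-id=BAAI/bge-large-en-v1.5"
    ports:
      - containerPort: 80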

Hub Integration on Kubernetes

Most Hugging Face models are served from the Hugging Face Hub. On Kubernetes, the integration pattern is:

  1. Authentication: Store your token in a Kubernetes Secret and mount as HUGGING_FACE_HUB_TOKEN (or HF_TOKEN) environment variable.
  2. Caching: The Hub client caches downloads in ~/.cache/huggingface/hub. Mount a PVC at this path to persist downloads across pod restarts.
  3. Multi-replica caching: For multiple replicas sharing a model, use a ReadWriteMany NFS PVC with a pre-population Job (as described above). This ensures one download, many readers.

env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-secret
        key: token
  - name: HF_HOME
    value: /models/cache
volumeMounts:
  - name: model-cache
    mountPath: /models/cache

NVIDIA NIM

NVIDIA NIM (NVIDIA Inference Microservices) provides pre-optimized inference containers. Rather than configuring TensorRT profiles, quantization settings, and parallelism parameters yourself, NIM containers ship with models already optimized for specific GPU configurations.

Why NIM Matters

Raw vLLM or Triton deployments require significant tuning: choosing the right quantization (GPTQ, AWQ, fp8), compiling TensorRT-LLM engines for your GPU architecture, setting optimal batch sizes and cache configurations. NIM pre-solves this optimization problem. NVIDIA benchmarks show 2.6x throughput improvement over off-the-shelf vLLM deployment for the same model on the same hardware.

NIM Operator 3.0.0

The NIM Operator manages NIM containers on Kubernetes:

  • Multi-LLM: Deploy and manage multiple models from a single operator.
  • Multi-node: Automatic configuration of tensor and pipeline parallelism across nodes.
  • DRA support: Integrates with Dynamic Resource Allocation for fine-grained GPU management.
  • NIMCache CRD: Pre-downloads and caches model engines on nodes, solving the cold-start problem.

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-70b-nim
spec:
  image: nvcr.io/nim/meta/llama-3-70b-instruct:latest
  replicas: 2
  resources:
    gpus: 4                  # TP=4 automatically configured
  storage:
    nimCache: llama-cache    # Pre-populated NIMCache

When to Use NIM vs vLLM

Use NIM when you need maximum performance with minimal tuning effort and are running NVIDIA-supported models on NVIDIA GPUs. The pre-optimization is the differentiator: NIM containers include TensorRT-LLM engines compiled for specific GPU architectures, with quantization, batching, and cache settings already tuned. You trade flexibility for performance.

Use vLLM directly when you need full control over serving configuration, run non-NVIDIA hardware (AMD ROCm, Intel Gaudi), serve models not in the NIM catalog, or need to customize the serving logic (custom sampling, constrained decoding, speculative decoding with draft models). vLLM’s open-source community moves fast — new model architectures are typically supported within days of release.

Common Mistakes and Misconceptions

  • “Serving an LLM is just deploying a container.” Large models need tensor parallelism across multiple GPUs, KV cache management, continuous batching, and careful memory planning.
  • “Bigger instances are always better for LLM serving.” Cost-per-token often favors multiple smaller GPU instances over fewer large ones, depending on model size and batching strategy. Profile your specific model to find the cost-optimal configuration.
  • “Auto-scaling LLM inference works like web services.” LLM pods take minutes to load models into GPU memory. Scale-from-zero is extremely slow. Maintain warm replicas and scale on custom metrics (queue depth, KV cache utilization) rather than CPU.
  • “All LLM serving frameworks are interchangeable.” vLLM excels at throughput with PagedAttention, TGI integrates tightly with Hugging Face models, Triton supports multi-model serving. Choose based on your specific model and serving requirements.

Further Reading


Next: Disaster Recovery — cluster backup, etcd snapshots, multi-region strategies, and the procedures you test before you need them.

Chapter 43: Disaster Recovery

A Kubernetes cluster is not a single thing that fails in a single way. The control plane can fail while workloads keep running. A namespace can be accidentally deleted while the rest of the cluster is fine. An entire region can go dark. Disaster recovery for Kubernetes requires thinking in layers: the cluster state layer and the workload layer, each with its own backup strategy, its own restore procedure, and its own failure modes.

Two-Layer Backup Strategy

Kubernetes disaster recovery operates on two distinct layers, and you need both.

flowchart TD
    subgraph Layer1["LAYER 1: CLUSTER STATE (etcd)"]
        Etcd["etcd snapshots capture ALL cluster state:<br>- Every resource object (Pods, Deployments, Services)<br>- RBAC rules, NetworkPolicies, CRDs<br>- Secrets, ConfigMaps<br>- Custom resources (operators, databases, etc.)"]
        NotCaptured["Does NOT capture:<br>- Persistent Volume data<br>- Container images<br>- External state (DNS, load balancers, IAM)"]
    end

    subgraph Layer2["LAYER 2: WORKLOAD BACKUP (Velero)"]
        Velero["Velero backs up K8s resources AND persistent volumes:<br>- Namespace-scoped resource manifests<br>- PersistentVolume snapshots (via CSI or cloud APIs)<br>- Label/annotation-based selection<br>- Scheduled backups on a cron cadence"]
        Storage["Stored externally in object storage<br>(S3, GCS, MinIO)"]
    end

    Layer1 --- Why{{"WHY BOTH?"}}
    Layer2 --- Why
    Why --> EtcdUse["etcd snapshots: Full cluster restore<br>after total loss. All or nothing."]
    Why --> VeleroUse["Velero backups: Surgical restore of<br>specific namespaces or workloads.<br>Includes PV data. Cross-cluster migration."]

etcd snapshots are your insurance against total cluster loss. They capture the complete cluster state at a point in time. But they are all-or-nothing — you cannot restore a single namespace from an etcd snapshot without restoring everything. They also do not include persistent volume data.

Velero (formerly Heptio Ark) fills the gaps. It backs up Kubernetes resource manifests and can snapshot persistent volumes via CSI snapshot support or cloud provider APIs. It supports selective backup by namespace, label, or resource type. And it can restore into a different cluster, which makes it invaluable for migration.

Velero in Practice

Backup Configuration

# Install Velero with AWS plugin
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-cluster-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Create a scheduled backup
velero schedule create daily-production \
  --schedule="0 2 * * *" \
  --include-namespaces=production,staging \
  --ttl=720h \
  --snapshot-volumes=true

Selective Backup and Restore

Velero’s label selectors allow targeted backups:

# Back up only resources with a specific label
velero backup create critical-apps \
  --selector app.kubernetes.io/tier=critical

# Back up everything except ephemeral namespaces
velero backup create full-backup \
  --exclude-namespaces=kube-system,monitoring,temp-*

Restore with Dependency Ordering

A common failure mode during restore is attempting to create resources before their dependencies exist — a Deployment that references a ConfigMap that has not been restored yet. Velero handles this through a priority-based restore order:

flowchart TD
    S1["1. Cluster-scoped resources<br>(Namespaces, ClusterRoles, CRDs, StorageClasses)"]
    S2["2. Namespace-scoped foundation<br>(ServiceAccounts, ConfigMaps, Secrets, PVCs)"]
    S3["3. Workload resources<br>(Deployments, StatefulSets, DaemonSets, Services)"]
    S4["4. Dependent resources<br>(Ingress, NetworkPolicies, HPA, PodDisruptionBudgets)"]
    S5["5. Custom Resources<br>(CRD instances -- restored after CRDs exist)"]
    S6["6. Volume data<br>(PV snapshots restored and bound to new PVCs)"]

    S1 --> S2 --> S3 --> S4 --> S5 --> S6

You can customize this order via restore hooks and init containers to wait for dependencies.
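
In practice, a surgical restore is driven from an existing backup. A sketch (the backup name is a placeholder; list real candidates with velero backup get):

# Restore a single namespace from an existing backup
velero restore create payments-restore \
  --from-backup daily-production-20250101020000 \
  --include-namespaces production \
  --wait

# Inspect what was restored, including warnings and errors
velero restore describe payments-restore --details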

RPO and RTO

Two metrics define your disaster recovery targets:

Recovery Point Objective (RPO): How much data can you afford to lose? If your etcd snapshots run hourly and Velero backups run daily, your RPO is the time since the last relevant backup. An RPO of 1 hour means you accept losing up to 1 hour of changes.

Recovery Time Objective (RTO): How quickly must you be back in service? This includes time to detect the failure, execute the recovery procedure, verify the cluster is healthy, and confirm application availability.

| Scenario | Typical RPO | Typical RTO |
|---|---|---|
| Single namespace deletion | Minutes (Velero) | 15–30 minutes |
| Control plane failure (etcd intact) | 0 (no data loss) | 5–15 minutes |
| Total cluster loss (etcd gone) | Last etcd snapshot interval | 1–4 hours |
| Full region failure | Last cross-region replication | 15 min – 4 hours |

The gap between your target RPO/RTO and your actual tested RPO/RTO is your risk. Measure both.

Multi-Region Strategies

For organizations that cannot tolerate the RTO of rebuilding from backup, multi-region architecture provides resilience at the infrastructure level.

Active-Active

Two or more clusters in different regions serve traffic simultaneously. A global load balancer distributes requests. Stateful workloads either use a multi-region database (CockroachDB, Spanner) or accept eventual consistency.

Pros: Near-zero RTO for region failure. No cold-start latency. Cons: Operationally complex. Data consistency is hard. Cost doubles (or more).

Active-Passive

A primary cluster serves all traffic. A standby cluster in another region has the same applications deployed but receives no traffic. On failure, DNS or the load balancer shifts traffic to the standby.

Pros: Simpler than active-active. Lower cost (standby can be smaller). Cons: RTO is limited by DNS propagation and application warm-up. Standby cluster may have stale data.

Partitioned (Regional Affinity)

Each region operates independently, serving only users or workloads in that region. There is no failover between regions — each is self-contained.

Pros: Simplest multi-region model. Data sovereignty compliance. Cons: No cross-region resilience. If a region goes down, its users are affected.

MULTI-REGION STRATEGY COMPARISON
──────────────────────────────────

  ACTIVE-ACTIVE                  ACTIVE-PASSIVE
  ┌──────────┐ ┌──────────┐    ┌──────────┐ ┌──────────┐
  │ Region A │ │ Region B │    │ Region A │ │ Region B │
  │ ████████ │ │ ████████ │    │ ████████ │ │ (standby)│
  │ traffic  │ │ traffic  │    │ traffic  │ │          │
  └─────┬────┘ └─────┬────┘    └─────┬────┘ └─────┬────┘
        │            │               │            │
        └──────┬─────┘               │        (failover)
               │                     │            │
          Global LB              Primary ──────▶ Promote
                                                on failure

  PARTITIONED
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │ Region A │ │ Region B │ │ Region C │
  │ Users: A │ │ Users: B │ │ Users: C │
  │ Data: A  │ │ Data: B  │ │ Data: C  │
  └──────────┘ └──────────┘ └──────────┘
  (independent, no cross-region failover)

Testing Recovery

Testing restores is not optional — an untested backup procedure is an untested promise. Teams routinely discover during an actual incident that:

  • The backup credentials have rotated and the backup job has been silently failing for weeks.
  • The etcd snapshot is from the wrong cluster (staging, not production).
  • The Velero restore fails because the target cluster has a different Kubernetes version and the CRD schemas are incompatible.
  • The persistent volume snapshots are in a different region from the recovery cluster.
  • The restore completes but the application does not start because it depends on an external service that was not part of the backup.

Testing Practices

  1. Schedule monthly restore drills. Restore to a separate cluster and verify application health. Automate as much as possible.

  2. Test at every layer. Restore a single namespace from Velero. Restore an entire cluster from an etcd snapshot. Fail over to a standby region.

  3. Measure actual RTO. Start a timer when the drill begins. Stop when the application is serving traffic. Compare against your target. If you miss the target, the plan needs work. A minimal drill sketch follows this list.

  4. Break things intentionally. Delete a namespace. Corrupt an etcd member. Simulate a region failure by blocking network traffic. Chaos engineering is the only honest test of resilience.

  5. Verify data integrity. After restore, do not just check that pods are running. Verify that the application data is consistent and correct. A running pod with a corrupted database is not a successful recovery.
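
A minimal drill sketch covering practices 1 and 3, assuming Velero and kubectl access to the recovery cluster (the backup name and namespace are placeholders):

# Time a namespace restore into a test cluster and gate on app health
START=$(date +%s)

velero restore create drill-restore \
  --from-backup daily-production-20250101020000 \
  --include-namespaces production \
  --wait

# Fail the drill if the application does not become available in time
kubectl wait deployment --all -n production \
  --for=condition=Available --timeout=15m

echo "Measured RTO: $(( $(date +%s) - START )) seconds"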

Documented Runbooks

Disaster recovery procedures must be written down, version-controlled, and accessible during an outage. A runbook stored in the cluster that just failed is useless.

A good runbook includes:

  • Prerequisites: What tools, credentials, and access are needed?
  • Decision tree: Which procedure applies to which failure scenario?
  • Step-by-step commands: Copy-pasteable, with placeholders clearly marked.
  • Verification steps: How to confirm each step succeeded before proceeding.
  • Rollback: What to do if the recovery makes things worse.
  • Communication plan: Who to notify, what channels to use, what to tell customers.

Store runbooks in a location that survives the failure of the thing they describe. A Git repository in a different cloud account. A wiki on a different provider. Printed copies in a binder (yes, really, for the truly catastrophic scenarios).

Putting It Together

A complete disaster recovery strategy for Kubernetes looks like this:

  1. etcd snapshots every hour, uploaded to cross-region object storage with versioning and lifecycle rules.
  2. Velero scheduled backups daily, with volume snapshots, stored in a separate object storage bucket.
  3. Multi-region standby cluster for production workloads that cannot tolerate multi-hour RTO.
  4. Monthly restore drills that exercise both etcd restore and Velero restore paths.
  5. Runbooks that have been used successfully in a drill within the last quarter.
  6. Monitoring and alerting on backup job success/failure, backup age, and storage health.
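
For that last point, a hedged example alert on Velero's Prometheus metrics (verify the metric name against your Velero version):

groups:
  - name: backups
    rules:
      # Fire if the daily schedule has not produced a successful backup in 24h
      - alert: VeleroBackupStale
        expr: time() - velero_backup_last_successful_timestamp{schedule="daily-production"} > 86400
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "No successful daily-production backup in the last 24 hours"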

Common Mistakes and Misconceptions

  • “Backing up etcd is enough for DR.” etcd contains cluster state, but not PersistentVolume data, external DNS records, cloud load balancers, or IAM configurations. A complete DR plan includes application data, infrastructure-as-code, and secrets.
  • “Velero backs up everything.” Velero backs up Kubernetes resources and can snapshot cloud volumes, but it doesn’t back up external databases, object storage contents, or resources managed outside K8s. Know what’s covered and what isn’t.
  • “I’ll figure out DR when I need it.” By definition, you need DR during an emergency when you have the least capacity for planning. Test restores quarterly. An untested backup is not a backup.

Next: Cost Optimization — making sure all this infrastructure is not more expensive than it needs to be.

Chapter 44: Cost Optimization

Kubernetes makes it easy to deploy applications and hard to understand what they cost. A developer requests 2 CPU cores and 4 GB of memory for a service that uses 0.3 cores and 800 MB at peak. Multiply that by hundreds of services across dozens of namespaces, and you arrive at the industry average: only 13% of requested CPU is actually used. The rest is reserved but idle, burning money on cloud provider invoices.

The Cost Problem

The disconnect between requested and used resources exists because of a rational incentive: nobody wants their service to be OOM-killed or CPU-throttled, so everyone over-provisions.

THE RESOURCE EFFICIENCY GAP
─────────────────────────────

  Requested CPU                   Actual CPU Used
  ┌──────────────────────────┐   ┌──────────────────────────┐
  │██████████████████████████│   │███░░░░░░░░░░░░░░░░░░░░░░░│
  │██████████████████████████│   │███░░░░░░░░░░░░░░░░░░░░░░░│
  │██████████████████████████│   │███░░░░░░░░░░░░░░░░░░░░░░░│
  │         100 cores        │   │  13 cores   87 wasted    │
  └──────────────────────────┘   └──────────────────────────┘

  ██ = Allocated/Used    ░░ = Allocated but Idle

  Industry average: 13% CPU utilization of requested resources
  Typical savings from right-sizing: 30-50%

This is not a Kubernetes problem per se — the same over-provisioning existed in the VM world. But Kubernetes makes it both more visible (you can measure it) and more actionable (you can change it without reprovisioning hardware).

Right-Sizing with VPA and Goldilocks

The Vertical Pod Autoscaler (VPA) observes actual resource usage over time and recommends (or automatically sets) CPU and memory requests. In recommendation mode, it does not change anything — it just tells you what the values should be.
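
A minimal VPA manifest in recommendation mode might look like this, assuming the VPA components are installed in the cluster:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or mutate pods

Running kubectl describe vpa api-server-vpa then shows the recommended target values and bounds.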

Goldilocks (from Fairwinds) wraps VPA in a dashboard that shows recommendations for every deployment in a namespace. It creates a VPA object in recommendation mode for each deployment and surfaces the results in a web UI.

# Install Goldilocks
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks

# Enable for a namespace
kubectl label namespace production goldilocks.fairwinds.com/enabled=true

After a few days of observation, Goldilocks will show you something like:

| Deployment | Current Request | Recommended | Monthly Savings |
|---|---|---|---|
| api-server | 2 CPU / 4 Gi | 500m CPU / 1 Gi | $340 |
| worker | 4 CPU / 8 Gi | 1.5 CPU / 3 Gi | $520 |
| frontend | 1 CPU / 2 Gi | 200m CPU / 512 Mi | $180 |
| cache | 2 CPU / 16 Gi | 500m CPU / 12 Gi | $85 |

Typical savings from right-sizing are 30–50% of compute cost. This is the lowest-effort, highest-impact optimization available.

Caution: Do not blindly apply VPA recommendations. Review them in the context of peak load, seasonal patterns, and latency requirements. A recommendation based on two weeks of low traffic will not survive Black Friday.

Spot and Preemptible Instances

Cloud providers sell unused capacity at steep discounts — 60–90% off on-demand pricing. The trade-off is that the instances can be reclaimed with as little as 30 seconds' notice (GCP Preemptible/Spot) or 2 minutes (AWS Spot).

Kubernetes makes spot instances practical because it was designed for failure. Pods are ephemeral. Deployments replace terminated pods automatically. The key is ensuring your workloads can tolerate interruption.

Karpenter and Spot

Karpenter excels at spot instance management. It can:

  • Diversify across many instance types to reduce interruption probability
  • Automatically replace interrupted nodes
  • Mix spot and on-demand in a single NodePool (when both are allowed, Karpenter prefers cheaper spot capacity where it is available)
  • Consolidate workloads onto fewer nodes as demand decreases

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workers
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.xlarge", "m5a.xlarge", "m6i.xlarge",
                   "m6a.xlarge", "c5.xlarge", "c6i.xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "200"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 60s

Best practice: Run control plane workloads (monitoring, CI, databases) on on-demand instances. Run stateless application workloads (web servers, API handlers, batch jobs) on spot. The cost savings typically range from 60–90% for the spot-eligible portion of your fleet.
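
One way to express that split is preferred node affinity on Karpenter's capacity-type label, so pods favor spot but can still schedule on on-demand. A sketch of the Pod template fragment:

# Pod template fragment: prefer spot capacity, fall back to on-demand
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]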

Cost Attribution with Kubecost and OpenCost

You cannot optimize what you cannot measure. Kubecost and OpenCost provide cost attribution — breaking down cluster costs by namespace, deployment, label, or any other dimension.

OpenCost is the open-source standard for Kubernetes cost monitoring, donated to the CNCF by Kubecost. It calculates costs by:

  1. Querying cloud provider pricing APIs for node costs
  2. Allocating node costs to pods based on resource requests (and optionally usage)
  3. Adding persistent volume and network costs
  4. Aggregating by any Kubernetes metadata (namespace, label, annotation)
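
With OpenCost or Kubecost running, allocation can be queried from the command line. A sketch using the kubectl-cost plugin and the OpenCost HTTP API — the service name, port, and flags are representative, so check your installed versions:

# Per-namespace cost over the last 7 days via the kubectl-cost plugin
kubectl cost namespace --window 7d

# Or query the OpenCost allocation API directly
curl -s "http://opencost.opencost:9003/allocation?window=7d&aggregate=namespace"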

Chargeback via Labels

The foundation of cost attribution is consistent labeling. Every workload should carry labels that identify its owner and purpose:

metadata:
  labels:
    app.kubernetes.io/name: checkout-service
    app.kubernetes.io/part-of: ecommerce
    cost-center: "CC-4521"
    team: payments
    environment: production

With these labels, you can generate reports like:

| Team | Namespace | Monthly Cost | CPU Efficiency | Memory Efficiency |
|---|---|---|---|---|
| Payments | payments-prod | $4,200 | 22% | 45% |
| Search | search-prod | $8,100 | 31% | 52% |
| ML | ml-training | $12,500 | 78% | 65% |
| Platform | monitoring | $2,300 | 15% | 40% |

The ML team has high efficiency because GPU workloads tend to saturate resources. The platform team has low efficiency because monitoring tools are sized for peak incident load. Context matters — not every namespace should target the same efficiency percentage.

Cluster Consolidation

Karpenter Consolidation

Karpenter’s consolidation feature continuously evaluates whether workloads can be packed onto fewer or cheaper nodes:

  • WhenEmpty: Remove nodes that have no non-daemonset pods.
  • WhenEmptyOrUnderutilized: Also replace nodes when their workloads could fit on other existing nodes or on a single cheaper node.

This is particularly powerful in clusters with variable load. During off-peak hours, Karpenter consolidates workloads onto fewer nodes and terminates the empties. During peak, it scales back out.

kube-green for Off-Hours

Many development and staging environments are used only during business hours. kube-green scales workloads to zero during off-hours:

apiVersion: kube-green.com/v1alpha1
kind: SleepInfo
metadata:
  name: working-hours
  namespace: development
spec:
  weekdays: "1-5"
  sleepAt: "20:00"
  wakeUpAt: "08:00"
  timeZone: "America/New_York"
  suspendDeployments: true
  suspendStatefulSets: true
  suspendCronJobs: true

If your development cluster costs $10,000/month and is used 10 hours a day, 5 days a week, kube-green can reduce that to roughly $3,000/month — a 70% savings with zero impact on developer productivity.

Unused Resource Detection

Waste hides in plain sight. Common sources of orphaned cost:

  • Unattached PersistentVolumes: PVCs deleted but PVs retained due to Retain reclaim policy. Cloud disks still billing.
  • Idle load balancers: Services of type LoadBalancer that no longer receive traffic.
  • Orphaned node groups: Managed node groups or ASGs with minimum size > 0 but no workloads scheduled.
  • Oversized namespaces: Test namespaces that were never cleaned up.
  • Unused ConfigMaps and Secrets: Resources referenced by nothing.

Tools like kubectl-cost (from Kubecost), pluto (for deprecated APIs), and custom scripts that compare resource references against actual usage can surface these.
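
Two representative read-only checks, assuming jq is installed (review the output before deleting anything):

# PersistentVolumes in Released state -- candidates for cleanup
kubectl get pv -o json | jq -r \
  '.items[] | select(.status.phase=="Released") | .metadata.name'

# LoadBalancer Services across all namespaces -- verify each still has traffic
kubectl get svc -A -o json | jq -r \
  '.items[] | select(.spec.type=="LoadBalancer") | "\(.metadata.namespace)/\(.metadata.name)"'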

Optimization Strategy Comparison

| Strategy | Effort | Typical Savings | Risk |
|---|---|---|---|
| Right-sizing (VPA/Goldilocks) | Low | 30–50% | Under-provisioning causes latency/OOM |
| Spot/Preemptible instances | Medium | 60–90% of eligible workloads | Interruption; requires fault tolerance |
| Off-hours scaling (kube-green) | Low | 50–70% for non-prod | Environment still asleep before a demo |
| Cluster consolidation (Karpenter) | Medium | 20–40% | Consolidation churn, scheduling delays |
| Unused resource cleanup | Low | 5–15% | Accidentally deleting needed resources |
| Reserved instances / savings plans | Low | 30–40% vs on-demand | Lock-in, less flexibility |
| Namespace resource quotas | Low | Preventive (caps waste) | Blocks legitimate scaling |

The highest-ROI strategy for most organizations is to start with right-sizing (immediate, low-risk, high-impact) and then layer on spot instances for eligible workloads. Together, these two strategies alone typically reduce compute costs by 50–70%.

Building a Cost-Aware Culture

Tools and automation are necessary but not sufficient. Cost optimization sticks only when teams have visibility and accountability:

  1. Dashboard visibility. Put cost dashboards where developers already look — Grafana, Backstage, Slack summaries. If people have to seek out cost data, they will not.

  2. Cost in the deploy pipeline. Show the cost impact of resource request changes in pull request comments. “This change increases monthly cost for checkout-service by $120.”

  3. Team-level budgets. Allocate cloud budgets to teams, not just to the organization. When a team sees that their namespace costs $8,000/month, they start asking whether that staging environment with 16 replicas is really necessary.

  4. Regular review cadence. Monthly cost reviews at the team level, quarterly at the organization level. Celebrate wins (a team that cut costs 40% through right-sizing) and investigate anomalies (a namespace that doubled in cost with no traffic increase).

The goal is not to minimize cost — it is to maximize the value per dollar.

Common Mistakes and Misconceptions

  • “Kubernetes saves money.” Kubernetes adds overhead: control plane costs, monitoring, engineer expertise, and operational complexity. It saves money at scale through bin-packing and automation, but small deployments often cost more than VMs.
  • “Spot instances are always 60-90% cheaper.” Spot pricing is dynamic. Popular instance types in busy regions may offer small discounts. Diversify across instance families and AZs. Karpenter handles this automatically.
  • “Right-sizing is a one-time task.” Application resource needs change with code changes, traffic patterns, and data growth. Continuous monitoring with VPA recommendations or tools like Kubecost is necessary to prevent drift.

Further Reading

  • OpenCost Documentation — the CNCF sandbox project and vendor-neutral open-source standard for real-time Kubernetes cost monitoring, with cost allocation by namespace, label, and deployment.
  • FinOps Foundation — the industry body defining FinOps practices, frameworks, and maturity models for managing cloud costs across engineering and finance teams.
  • AWS: Best Practices for EC2 Spot Instances — AWS guidance on diversifying instance types, handling interruptions, and using Spot with EKS node groups and Karpenter.
  • GKE Cost Optimization Guide — Google’s recommendations for GKE right-sizing, cluster autoscaling, committed use discounts, and Spot VMs.
  • Kubernetes Documentation: Resource Management for Pods and Containers — the official reference for requests, limits, QoS classes, and LimitRanges that form the foundation of cost control.
  • Goldilocks by Fairwinds — an open-source tool that runs VPA in recommendation mode and presents a dashboard of right-sizing suggestions per workload.

Next: Observability with OpenTelemetry — making sure you can see what is happening inside all these workloads.

Chapter 45: Observability with OpenTelemetry

Observability is the ability to understand the internal state of a system by examining its external outputs. In a Kubernetes environment, those outputs are metrics (numerical measurements over time), logs (discrete events with context), and traces (the path of a request through multiple services). These are the three pillars, and OpenTelemetry is the open standard that unifies how they are collected, processed, and exported.

The Three Pillars

| Signal | Answers | Strength | Limitation |
|---|---|---|---|
| Metrics | What is happening right now, and how does it compare to the past? (CPU utilization, request latency percentiles, error rates, queue depths) | Cheap to store, fast to query, excellent for dashboards and alerting | Terrible for debugging specific requests |
| Logs | What happened in this specific component at this specific time? (stack traces, failed SQL queries, loaded configuration values) | Rich in context, excellent for debugging | Expensive to store and slow to search at scale; terrible for detecting trends |
| Traces | What was the path of this specific request through the system? (timing and outcome of each hop across services) | Essential for debugging latency in distributed systems | Nearly useless for trend detection or component-level debugging |

No single pillar is sufficient. Effective observability requires all three, correlated so you can move from a metric anomaly to the relevant traces to the specific log lines that explain the root cause.

OpenTelemetry Architecture

OpenTelemetry (OTel) provides a vendor-neutral framework for instrumentation, collection, and export of telemetry data. The key components are:

  • SDKs: Language-specific libraries that instrument applications (auto-instrumentation or manual)
  • Collector: A standalone binary that receives, processes, and exports telemetry data
  • Protocol (OTLP): The wire format for transmitting telemetry between components

Collector Deployment Patterns

The Collector is the workhorse of the OTel pipeline. How you deploy it determines the reliability, scalability, and cost of your observability stack.

flowchart TD
    subgraph P1["PATTERN 1: DAEMONSET / AGENT (most widely adopted)"]
        subgraph N1["Node 1"]
            A1["App A"] --> C1["OTel Collector"]
            A2["App B"] --> C1
        end
        subgraph N2["Node 2"]
            A3["App C"] --> C2["OTel Collector"]
            A4["App D"] --> C2
        end
        C1 --> B1["Backends"]
        C2 --> B1
    end

    P1 --> P2

    subgraph P2["PATTERN 2: SIDECAR (per-pod collector, high isolation)"]
        subgraph Pod["Pod"]
            App["App"] --> OTel["OTel Collector"]
        end
        OTel --> B2["Backend"]
    end

    P2 --> P3

    subgraph P3["PATTERN 3: GATEWAY (centralized, scaled Deployment)"]
        GA["App A"] --> GW["OTel Collector Gateway"]
        GB["App B"] --> GW
        GC["App C"] --> GW
        GW --> B3["Backend"]
    end

    style P1 stroke:#326CE5
    style P2 stroke:#326CE5
    style P3 stroke:#326CE5

DaemonSet (Agent) is the recommended pattern for most clusters. Each node runs a collector pod that receives telemetry from all application pods on that node via localhost. This minimizes network hops, provides natural load distribution, and fails gracefully (a collector crash affects only one node).

Sidecar provides the strongest isolation — each application pod has its own collector. Use this when different applications require different collection configurations or when you need strict resource accounting per application. The cost is significant: every pod runs an additional container.

Gateway centralizes collection into a single deployment. Use this as a second tier behind agents (agent → gateway → backend) for cross-cutting processing like tail sampling, enrichment, or routing to multiple backends. Do not use a gateway as the sole collector tier — it creates a single point of failure.

The production pattern is Agent + Gateway: node-level agents forward to a gateway for sampling and export.
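
A minimal agent-tier Collector configuration for that pattern might look like this (the gateway endpoint is a placeholder for your gateway Service):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:          # protect the agent from OOM under load
    check_interval: 1s
    limit_mib: 1500
  batch: {}

exporters:
  otlp:
    endpoint: otel-gateway.observability:4317
    tls:
      insecure: true       # assumes in-cluster traffic; enable TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]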

The LGTM Stack

The most widely adopted open-source backend stack for Kubernetes observability is LGTM:

| Component | Signal | Role |
|---|---|---|
| Loki | Logs | Log aggregation. Indexes labels, not content. Cheap at scale. |
| Grafana | All | Visualization and dashboarding. Unified query interface. |
| Tempo | Traces | Distributed tracing backend. Object-storage-based. |
| Mimir | Metrics | Long-term metrics storage. Horizontally scalable Prometheus. |

This stack is entirely open source (all Grafana Labs projects under AGPLv3) and can be self-hosted or consumed as Grafana Cloud. The key advantage over alternatives is the tight integration — Grafana can correlate a metric spike to traces to logs without leaving the UI.

flowchart TD
    Apps["Applications"]
    Collector["OTel Collector Agent<br>(DaemonSet)"]
    Mimir["Mimir<br>(metrics)"]
    Tempo["Tempo<br>(traces)"]
    Loki["Loki<br>(logs)"]
    Grafana["Grafana<br>(query, visualize, alert)"]

    Apps --> Collector
    Collector -- "metrics" --> Mimir
    Collector -- "traces" --> Tempo
    Collector -- "logs" --> Loki
    Mimir --> Grafana
    Tempo --> Grafana
    Loki --> Grafana

The OpenTelemetry Operator

The OTel Operator is a Kubernetes operator that manages OTel Collectors and provides auto-instrumentation for application pods.

Auto-Instrumentation

Instead of modifying application code to import OTel SDKs, you annotate pods and the operator injects the instrumentation automatically:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  template:
    metadata:
      annotations:
        # the injection annotation goes on the Pod template, not the Deployment
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: checkout
          image: myapp/checkout:v1.2.3

The operator supports auto-instrumentation for:

  • Java — via a Java agent injected as an init container
  • Python — via the opentelemetry-instrument wrapper
  • .NET — via the .NET startup hook
  • Node.js — via the @opentelemetry/auto-instrumentations-node package
  • Go — via eBPF-based instrumentation (more limited than other languages)

Auto-instrumentation captures HTTP requests, database queries, gRPC calls, and messaging operations without any code changes. It is the fastest path to distributed tracing in an existing application.

Instrumentation Resource

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: production
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"          # sample 10% of traces
  java:
    # pin these to tested versions in production -- see "Version-Lock Everything" below
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest

Signal Correlation

The power of observability comes from connecting the three signal types. When a metric alert fires for high latency on the checkout service, you want to click through to the traces that show which downstream call is slow, and then to the log lines from that specific call.

This requires consistent identifiers across signals:

  • Trace context propagation: Every HTTP or gRPC call propagates traceparent headers (W3C Trace Context standard). The OTel SDKs handle this automatically.
  • Trace ID in logs: Configure your logging library to include the trace ID and span ID in every log line. This allows Grafana to jump from a trace to the exact log lines produced during that span.
  • Exemplars in metrics: Prometheus exemplars attach a trace ID to a specific metric observation, so you can click from a latency histogram bucket to a representative trace.

SIGNAL CORRELATION FLOW
─────────────────────────

  Grafana Dashboard
  ┌───────────────────────────────────────────────────┐
  │  Checkout Latency p99 = 1.2s  [▲ spike at 14:23]  │
  │                                    │              │
  │  Click exemplar ──────────────────▶│              │
  │                                    ▼              │
  │  Trace: abc123                                    │
  │  ├── checkout-svc    200ms                        │
  │  ├── inventory-svc   150ms                        │
  │  └── payment-svc     850ms  ◄── slow!             │
  │                        │                          │
  │  Click span ───────────▶                          │
  │                        ▼                          │
  │  Logs for payment-svc, traceID=abc123:            │
  │  14:23:01 WARN  Connection pool exhausted         │
  │  14:23:01 ERROR Timeout waiting for DB connection │
  └───────────────────────────────────────────────────┘

Production Lessons

Teams that have deployed OTel in production at scale converge on a common set of lessons:

Version-Lock Everything

The OTel ecosystem moves fast. The Operator, Collector, and auto-instrumentation images must be compatible. Pin all three to tested versions and upgrade them together:

# Do not use "latest" in production
operator: ghcr.io/open-telemetry/opentelemetry-operator:v0.96.0
collector: otel/opentelemetry-collector-contrib:0.96.0
java-agent: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.1.0

Memory Requirements

OTel Collectors buffer data in memory. Under load, a DaemonSet collector can easily consume 1–2 GB of memory. Gateway collectors handling high-cardinality traces may need 4 GB or more. Size your collector pods with appropriate requests and limits, and place the memory_limiter processor first in your pipeline:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 500

Start with Traces, Not Metrics

If you already have Prometheus for metrics, adding OTel for traces provides the most incremental value. Auto-instrumentation gives you distributed tracing with zero code changes. Migrating metrics to OTel can come later (and for many teams, Prometheus remains the better choice for metrics).

Sampling is Essential

Collecting 100% of traces is prohibitively expensive at scale. Use tail sampling at the gateway tier to keep:

  • All error traces
  • All slow traces (above a latency threshold)
  • A random sample of normal traces (1–10%)

This captures the traces you actually need for debugging while keeping storage costs manageable.
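
A sketch of a tail_sampling processor implementing those three policies — it ships in the contrib Collector distribution, and the thresholds here are illustrative:

processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans before deciding per-trace
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5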

Target Allocator for Prometheus Scraping

If you use the OTel Collector to scrape Prometheus endpoints (replacing Prometheus itself), the Target Allocator distributes scrape targets across collector replicas. Without it, every collector scrapes every target, duplicating data. The Target Allocator requires careful resource provisioning — plan for 4 GB+ nodes in the allocator pool.
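
With the OTel Operator, the Target Allocator is enabled via a field on the collector resource. A sketch — the API version and field names may vary across operator releases:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: prom-scraper
spec:
  mode: statefulset        # the Target Allocator requires statefulset mode
  targetAllocator:
    enabled: true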

What to Monitor About Your Monitoring

Observability infrastructure is itself a system that can fail. Monitor:

  • Collector memory and CPU usage (alert before OOM)
  • Export failures (collector cannot reach backend)
  • Queue depth (data backing up faster than it can be exported)
  • Span drop rate (how much data is being discarded)
  • Backend ingestion rate and storage growth

An observability system that silently drops data during the incident you need to debug is worse than no observability at all, because it gives you confidence that is not warranted.
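
A hedged starting point is an alert on the Collector's own self-telemetry. Metric names vary across Collector versions, so confirm the exact name before relying on this:

groups:
  - name: otel-collector
    rules:
      # Spans failing to export -- data is being dropped or backing up
      - alert: OTelCollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector {{ $labels.instance }} is failing to export spans"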

Common Mistakes and Misconceptions

  • “Prometheus can store data forever.” Prometheus is designed for real-time monitoring with limited retention (default 15 days). For long-term storage, use Thanos, Cortex, or Grafana Mimir as a remote write backend.
  • “More metrics are always better.” High-cardinality metrics (per-user, per-request-id labels) can overwhelm Prometheus and explode storage costs. Be intentional about labels. Cardinality is the primary cost driver in metrics systems.
  • “Logging everything to stdout is sufficient.” Unstructured logs are hard to query. Use structured logging (JSON) with consistent fields (request_id, user_id, trace_id). This makes log aggregation systems (Loki, Elasticsearch) actually useful.

Further Reading

  • Prometheus Documentation — the CNCF graduated project for metrics collection and alerting, covering PromQL, service discovery, recording rules, and alerting configuration.
  • OpenTelemetry Documentation — the CNCF observability framework unifying traces, metrics, and logs with auto-instrumentation, SDKs, and the Collector pipeline.
  • OpenTelemetry Collector Documentation — detailed reference for configuring receivers, processors, exporters, and connectors in the OTel Collector, including the Target Allocator for Prometheus scraping.
  • Grafana Documentation — the visualization platform for building dashboards across Prometheus, Loki, Tempo, and other data sources.
  • Grafana Loki Documentation — the log aggregation system designed for cost-effective storage with label-based indexing rather than full-text indexing.
  • Jaeger Documentation — the CNCF graduated distributed tracing platform for monitoring and troubleshooting microservice architectures.
  • kube-prometheus-stack (GitHub) — the Helm chart bundling Prometheus Operator, Grafana, Alertmanager, node-exporter, and pre-built Kubernetes dashboards and recording rules.
  • Kubernetes SIG Instrumentation — the upstream SIG responsible for Kubernetes metrics, structured logging, tracing standards, and the metrics stability framework.

Back to: Table of Contents (00-README.md)

Appendix A: Glossary

This appendix provides a quick-reference glossary for terms used throughout the book. Entries are organized alphabetically with cross-references to the chapter where each concept is covered in depth.


Admission Controller — A plugin that intercepts requests to the Kubernetes API server after authentication and authorization but before the object is persisted, used to validate or mutate resources. (see Chapter 39)

Affinity — A set of rules that constrain which nodes a Pod can be scheduled on, based on labels on nodes or other Pods. (see Chapter 33)

API Group — A logical grouping of related Kubernetes API resources (e.g., apps, batch, networking.k8s.io), enabling independent versioning and extension. (see Chapter 4)

API Server (kube-apiserver) — The central management component of the Kubernetes control plane that exposes the Kubernetes API, validates requests, and persists state to etcd. (see Chapter 3)

ArgoCD — A declarative, GitOps-based continuous delivery tool for Kubernetes that synchronizes cluster state with Git repositories. (see Chapter 6)

Backstage — An open-source developer portal framework, originally from Spotify, used to build internal developer platforms with service catalogs and templates. (see Chapter 35)

Cloud Controller Manager — A control plane component that embeds cloud-specific control logic, allowing Kubernetes to interact with the underlying cloud provider’s APIs for nodes, routes, and load balancers. (see Chapter 17)

Cluster Autoscaler — A component that automatically adjusts the number of nodes in a cluster based on pending Pod resource requests and node utilization. (see Chapter 32)

ClusterIP — The default Service type that exposes a Service on a cluster-internal virtual IP, reachable only from within the cluster. (see Chapter 5)

ClusterRole — An RBAC resource that defines a set of permissions across all namespaces or for cluster-scoped resources. (see Chapter 25)

ClusterRoleBinding — An RBAC resource that grants the permissions defined in a ClusterRole to a user, group, or ServiceAccount cluster-wide. (see Chapter 25)

CNI (Container Network Interface) — A specification and set of plugins for configuring networking in Linux containers, used by Kubernetes to set up Pod networking. (see Chapter 13)

ConfigMap — A Kubernetes object used to store non-confidential configuration data as key-value pairs, which can be consumed by Pods as environment variables or mounted files. (see Chapter 18)

containerd — An industry-standard container runtime that manages the complete container lifecycle on a host, commonly used as the runtime in Kubernetes nodes. (see Chapter 10)

Container Runtime — The software responsible for running containers on a node, such as containerd or CRI-O. (see Chapter 10)

Controller — A control loop that watches the state of the cluster through the API server and makes changes to move the current state toward the desired state. (see Chapter 38)

Controller Manager (kube-controller-manager) — A control plane component that runs the core set of built-in controllers (ReplicaSet, Deployment, etc.) as a single process. (see Chapter 3)

CoreDNS — The default cluster DNS server in Kubernetes, providing service discovery via DNS for Services and Pods. (see Chapter 5)

CRD (Custom Resource Definition) — An extension mechanism that allows users to define their own resource types in the Kubernetes API without modifying the API server. (see Chapter 4)

CRI (Container Runtime Interface) — A plugin interface that enables the kubelet to use different container runtimes without needing to recompile. (see Chapter 10)

CRI-O — A lightweight container runtime purpose-built for Kubernetes, implementing the CRI specification. (see Chapter 10)

CronJob — A Kubernetes resource that creates Jobs on a recurring schedule defined using cron syntax. (see Chapter 24)

Crossplane — An open-source framework that extends Kubernetes to provision and manage cloud infrastructure and services using CRDs and controllers. (see Chapter 36)

CSI (Container Storage Interface) — A standard interface for exposing block and file storage systems to container orchestrators like Kubernetes. (see Chapter 23)

DaemonSet — A resource that ensures a copy of a Pod runs on every (or a selected subset of) node in the cluster, commonly used for logging agents and monitoring. (see Chapter 18)

Deployment — A resource that provides declarative updates for Pods and ReplicaSets, supporting rolling updates and rollbacks. (see Chapter 18)

Device Plugin — A kubelet framework that allows hardware vendors to advertise specialized resources (GPUs, FPGAs, etc.) to the Kubernetes scheduler without modifying core code. (see Chapter 41)

Digest — A content-addressable identifier (usually a SHA-256 hash) that uniquely identifies a specific container image, providing an immutable reference. (see Chapter 10)

DRA (Dynamic Resource Allocation) — A Kubernetes framework for requesting and sharing specialized hardware resources (GPUs, accelerators) with fine-grained allocation semantics beyond the device plugin model. (see Chapter 41)

Edge-triggered — A reconciliation approach where the controller reacts only when a change event occurs, as opposed to level-triggered reconciliation. (see Chapter 38)

Endpoint — A network address (IP and port) that represents a single backend for a Service, historically tracked via Endpoints objects. (see Chapter 5)

EndpointSlice — A scalable replacement for the Endpoints resource that splits endpoint information across multiple objects to reduce API server and etcd load. (see Chapter 13)

etcd — A consistent, distributed key-value store used as the primary datastore for all Kubernetes cluster state and configuration. (see Chapter 3)

ExternalName — A Service type that maps a Service to an external DNS name, acting as a CNAME alias without proxying. (see Chapter 5)

Finalizer — A metadata key on a Kubernetes object that prevents deletion until a controller has performed its cleanup logic and removed the finalizer. (see Chapter 39)

Flux — A GitOps toolkit for Kubernetes that keeps clusters in sync with configuration stored in Git repositories. (see Chapter 6)

Gateway API — A next-generation Kubernetes API for modeling service networking, designed to be expressive, extensible, and role-oriented as a successor to Ingress. (see Chapter 13)

Grafana — An open-source observability platform for visualizing metrics, logs, and traces, commonly used alongside Prometheus in Kubernetes monitoring stacks. (see Chapter 45)

GVR (Group/Version/Resource) — The three-part coordinate system (API group, version, resource name) used to uniquely identify any resource type in the Kubernetes API. (see Chapter 4)

Helm — A package manager for Kubernetes that uses templated charts to define, install, and upgrade applications. (see Chapter 12)

HPA (Horizontal Pod Autoscaler) — A controller that automatically scales the number of Pod replicas based on observed CPU, memory, or custom metrics. (see Chapter 30)

Image — A lightweight, standalone, executable package that includes everything needed to run a piece of software: code, runtime, libraries, and settings. (see Chapter 10)

Informer — A client-side caching mechanism in client-go that watches API server resources and maintains a local cache to reduce API server load. (see Chapter 39)

Ingress — A Kubernetes resource that manages external HTTP/HTTPS access to Services, providing load balancing, TLS termination, and name-based virtual hosting. (see Chapter 5)

Ingress Controller — A controller that fulfills Ingress resources by configuring a load balancer or reverse proxy (e.g., NGINX, Envoy, Traefik). (see Chapter 13)

Init Container — A specialized container that runs to completion before any app containers start in a Pod, used for setup tasks like waiting for dependencies or populating shared volumes. (see Chapter 18)

Job — A Kubernetes resource that creates one or more Pods and ensures a specified number of them successfully terminate, used for batch and one-off tasks. (see Chapter 24)

Karpenter — A node provisioning tool that automatically launches right-sized compute nodes in response to unschedulable Pods, offering faster and more flexible scaling than Cluster Autoscaler. (see Chapter 32)

KServe — A Kubernetes-native platform for serving machine learning models with support for autoscaling, canary rollouts, and multi-framework inference. (see Chapter 42)

kube-proxy — A network component running on each node that maintains network rules for Service traffic forwarding using iptables, IPVS, or eBPF. (see Chapter 3)

Kubeflow — An open-source machine learning platform for Kubernetes that provides tools for ML pipelines, training, tuning, and serving. (see Chapter 41)

kubelet — The primary node agent that runs on every node, responsible for ensuring that containers described in PodSpecs are running and healthy. (see Chapter 3)

KubeRay — A Kubernetes operator for deploying and managing Ray clusters, commonly used for distributed ML training and inference workloads. (see Chapter 42)

Kustomize — A template-free configuration management tool built into kubectl that uses overlays to customize Kubernetes manifests for different environments. (see Chapter 12)

Kyverno — A Kubernetes-native policy engine that validates, mutates, and generates configurations using policies defined as Kubernetes resources. (see Chapter 29)

Label — A key-value pair attached to Kubernetes objects used for organizing and selecting subsets of resources. (see Chapter 4)

LeaderWorkerSet — A Kubernetes API for deploying multi-node distributed workloads with leader-worker topology, commonly used for distributed ML training. (see Chapter 42)

Level-triggered — A reconciliation approach where the controller continuously compares desired state to actual state and acts on the difference, regardless of what events occurred. (see Chapter 38)

Liveness Probe — A periodic check that determines whether a container is still running; if it fails, the kubelet restarts the container. (see Chapter 20)

LoadBalancer — A Service type that exposes the Service externally using a cloud provider’s load balancer, automatically provisioning an external IP. (see Chapter 5)

MIG (Multi-Instance GPU) — An NVIDIA technology that partitions a single GPU into multiple isolated instances, each with dedicated compute, memory, and bandwidth. (see Chapter 41)

Namespace — A virtual partition within a Kubernetes cluster that provides scope for resource names and a mechanism for applying policies and resource quotas. (see Chapter 37)

NetworkPolicy — A resource that specifies how groups of Pods are allowed to communicate with each other and with external endpoints, acting as a firewall for Pod traffic. (see Chapter 26)

Node — A worker machine (virtual or physical) in Kubernetes that runs Pods, managed by the control plane. (see Chapter 3)

NodePool — A Karpenter resource that defines a set of constraints and instance types for provisioning nodes, replacing the older Provisioner resource. (see Chapter 32)

NodePort — A Service type that exposes a Service on a static port on every node’s IP, making it accessible from outside the cluster. (see Chapter 5)

OCI (Open Container Initiative) — A set of industry standards for container image formats and runtimes, ensuring interoperability across container tools. (see Chapter 10)

OPA/Gatekeeper — Open Policy Agent integrated with Kubernetes via the Gatekeeper project, providing policy enforcement through admission control using the Rego policy language. (see Chapter 29)

OpenTelemetry — A vendor-neutral observability framework for generating, collecting, and exporting telemetry data (traces, metrics, logs) from applications. (see Chapter 45)

Operator — A pattern that combines a CRD with a custom controller to encode operational knowledge for managing complex applications on Kubernetes. (see Chapter 38)

Owner Reference — A metadata field on a Kubernetes object that identifies its parent object, enabling garbage collection when the parent is deleted. (see Chapter 39)

PersistentVolume (PV) — A cluster-level storage resource provisioned by an administrator or dynamically via a StorageClass, representing a piece of networked storage. (see Chapter 23)

PersistentVolumeClaim (PVC) — A user’s request for storage that binds to an available PersistentVolume, abstracting the underlying storage implementation. (see Chapter 23)

Pod — The smallest deployable unit in Kubernetes, consisting of one or more containers that share networking and storage and are co-scheduled on the same node. (see Chapter 3)

Pod Security Standards — A set of three built-in security profiles (Privileged, Baseline, Restricted) enforced at the namespace level to control Pod security contexts. (see Chapter 29)

PodDisruptionBudget (PDB) — A resource that limits the number of Pods of a replicated application that can be voluntarily disrupted at the same time, ensuring availability during maintenance. (see Chapter 20)

Priority Class — A resource that defines a priority value for Pods, influencing scheduling order and preemption decisions when cluster resources are scarce. (see Chapter 33)

Prometheus — An open-source monitoring and alerting toolkit that collects metrics via a pull model and stores them in a time-series database, widely used in Kubernetes environments. (see Chapter 45)

RBAC (Role-Based Access Control) — The Kubernetes authorization mechanism that regulates access to resources based on the roles assigned to users or service accounts. (see Chapter 25)

Readiness Probe — A periodic check that determines whether a container is ready to accept traffic; failing containers are removed from Service endpoints. (see Chapter 20)

Reconciliation Loop — The core control pattern in Kubernetes where a controller continuously observes the current state, compares it with the desired state, and takes action to converge them. (see Chapter 38)

Registry — A service that stores and distributes container images, such as Docker Hub, GitHub Container Registry, or a private registry. (see Chapter 10)

ReplicaSet — A resource that ensures a specified number of identical Pod replicas are running at any given time, typically managed by a Deployment. (see Chapter 18)

Resource Quota — A constraint that limits the aggregate resource consumption (CPU, memory, object count) within a Namespace. (see Chapter 37)

Role — An RBAC resource that defines a set of permissions within a specific Namespace. (see Chapter 25)

RoleBinding — An RBAC resource that grants the permissions defined in a Role to a user, group, or ServiceAccount within a specific Namespace. (see Chapter 25)

runc — The reference implementation of the OCI runtime specification, a low-level container runtime that spawns and runs containers. (see Chapter 10)

SBOM (Software Bill of Materials) — A formal inventory of all components, libraries, and dependencies in a software artifact, used for supply chain security and vulnerability tracking. (see Chapter 27)

Scheduler (kube-scheduler) — A control plane component that assigns newly created Pods to nodes based on resource requirements, constraints, affinity rules, and scheduling policies. (see Chapter 3)

Secret — A Kubernetes object for storing sensitive data (passwords, tokens, TLS certificates). Values are base64-encoded in YAML, but base64 is encoding, not encryption — configure encryption at rest for real protection. (see Chapter 28)

Selector — A query expression that uses labels to filter and identify a set of Kubernetes objects. (see Chapter 4)

Service — An abstraction that defines a stable network endpoint (virtual IP and DNS name) for accessing a set of Pods selected by labels. (see Chapter 5)

Service Mesh — An infrastructure layer that manages service-to-service communication with features like mutual TLS, traffic management, and observability (e.g., Istio, Linkerd). (see Chapter 13)

ServiceAccount — A Kubernetes identity assigned to Pods that enables them to authenticate with the API server and other services. (see Chapter 25)

Sidecar — A secondary container that runs alongside the main application container within a Pod, providing supporting functionality like logging, proxying, or configuration. (see Chapter 18)

Sigstore — An open-source project providing tools (Cosign, Fulcio, Rekor) for signing, verifying, and protecting the software supply chain for container images. (see Chapter 27)

StatefulSet — A resource for managing stateful applications that require stable network identities, persistent storage, and ordered deployment and scaling. (see Chapter 21)

StorageClass — A resource that defines a class of storage with a provisioner and parameters, enabling dynamic provisioning of PersistentVolumes. (see Chapter 23)

Tag — A human-readable label (e.g., v1.2.3, latest) applied to a container image in a registry, which can be overwritten and is therefore mutable. (see Chapter 10)

Taint — A property applied to a node that repels Pods unless those Pods have a matching Toleration, used to reserve nodes for specific workloads. (see Chapter 33)

Toleration — A Pod-level property that allows the Pod to be scheduled onto a node with a matching Taint. (see Chapter 33)

Topology Spread Constraints — Rules that control how Pods are distributed across failure domains (zones, nodes, etc.) to improve availability and resource utilization. (see Chapter 33)

Velero — An open-source tool for backing up, restoring, and migrating Kubernetes cluster resources and persistent volumes. (see Chapter 43)

vLLM — A high-throughput, memory-efficient inference engine for large language models that uses PagedAttention for optimized GPU memory management. (see Chapter 42)

VPA (Vertical Pod Autoscaler) — A component that automatically adjusts the CPU and memory resource requests of Pods based on historical usage patterns. (see Chapter 31)

Watch — An API mechanism that allows clients to receive streaming notifications of changes to Kubernetes resources, enabling reactive controllers. (see Chapter 39)

Webhook — An HTTP callback used in Kubernetes for admission control (validating or mutating webhooks) and for extending API server behavior. (see Chapter 39)


Back to Table of Contents

Appendix B: Mental Models

Each part of this book introduces a cluster of related concepts. These diagrams show how they connect — use them as maps when navigating the chapters.


Part 1: First Principles (Chapters 1-9)

The Reconciliation Loop — the heart of Kubernetes.

flowchart TD
    A["User writes YAML"] --> B["kubectl"]
    B --> C["API Server"]
    C --> D["etcd<br>(desired state stored here)"]

    C --> E["Controller Manager<br>(reconcile)"]
    C --> F["Scheduler<br>(assign to node)"]

    E --> G["kubelet (on node)"]
    F --> G

    G --> H["Container Runtime"]
    H --> I["Container"]

    subgraph loop ["The Watch / Reconciliation Loop"]
        direction LR
        W1["Controller watches"] --> W2["Detects drift"]
        W2 --> W3["Compares desired<br>vs. actual state"]
        W3 --> W4["Takes action<br>to converge"]
        W4 --> W1
    end

Part 2: Tooling Evolution (Chapters 10-14)

The Stack — what runs on what.

flowchart TD
    S1["Application"]
    S2["Helm / Kustomize (packaging)"]
    S3["kubeadm / k3s (bootstrap)"]
    S4["Kubernetes API"]
    S5["Container Runtime<br>containerd / CRI-O"]
    S6["CNI Plugin<br>Cilium, Calico, Flannel ..."]
    S7["OCI Runtime (runc)"]

    S1 --> S2 --> S3 --> S4
    S4 --> S5
    S4 --> S6
    S5 --> S7
    S6 --> S7

    subgraph kernel ["Linux Kernel"]
        K1["cgroups<br>(resource limits)"]
        K2["namespaces<br>(isolation)"]
    end

    S7 --> kernel

    subgraph cni ["CNI Virtual Network"]
        direction LR
        PA["Pod A"] <--> PB["Pod B"]
    end

    S6 --> cni

Part 3: Practical Setup (Chapters 15-19)

Your First Cluster — who talks to whom.

flowchart TD
    Cloud["Cloud Provider<br>(AWS / GCP / AZ)"]
    Cloud --> VPC

    kubectl["kubectl"] --> API
    CICD["CI/CD Pipeline"] --> VPC

    subgraph VPC
        subgraph CP ["Control Plane (managed)"]
            API["API Server"]
        end

        subgraph N1 ["Worker Node 1"]
            subgraph Pod1 ["Pod"]
                App["app"]
                Sidecar["sidecar"]
            end
        end

        subgraph N2 ["Worker Node 2"]
            Pod2a["Pod"]
            Pod2b["Pod"]
        end
    end

    API --> N1
    API --> N2

    subgraph debug ["Debugging Tools"]
        direction LR
        KC["kubectl"] --> Logs["logs"]
        KC --> Exec["exec"]
        KC --> Describe["describe (events)"]
    end

Part 4: Stateful Workloads (Chapters 20-24)

State — the hard problem.

flowchart TD
    Deploy["Deployment<br>(stateless)<br>Pods are fungible,<br>interchangeable"]
    SS["StatefulSet<br>(ordered, stable ID)<br>pod-0, pod-1, pod-2<br>each has stable name"]

    SS --> PVC["PVC<br>(claim storage)"]
    PVC --> PV["PV<br>(actual volume)"]
    PV --> SC["StorageClass<br>(provisioner)"]
    SC --> Disk["Cloud Disk<br>(EBS / PD / AzD)"]

    subgraph operators ["Operators manage databases on K8s"]
        direction LR
        Op["Operator"] -->|watches| CRD["CRD<br>(e.g. PostgresCluster)"]
        CRD -->|manages| Res["StatefulSet +<br>PVCs + Secrets"]
    end

    subgraph jobs ["Jobs and CronJobs"]
        direction LR
        Job["Job<br>(run once)"]
        CronJob["CronJob<br>(scheduled)"] -->|creates Job<br>on schedule| Job
    end

Part 5: Security (Chapters 25-29)

Defense in Depth.

    ┌────────────────────────────────────────────────────────┐
    │  Supply Chain (outermost ring)                         │
    │  Sigstore, SBOM, image scanning                        │
    │                                                        │
    │  ┌─────────────────────────────────────────────────┐   │
    │  │  Cluster                                        │   │
    │  │  RBAC, Admission Control (OPA/Kyverno)          │   │
    │  │                                                 │   │
    │  │  ┌─────────────────────────────────────────┐    │   │
    │  │  │  Namespace                              │    │   │
    │  │  │  NetworkPolicy, ResourceQuota           │    │   │
    │  │  │                                         │    │   │
    │  │  │  ┌─────────────────────────────────┐    │    │   │
    │  │  │  │  Pod                            │    │    │   │
    │  │  │  │  SecurityContext, Seccomp,      │    │    │   │
    │  │  │  │  AppArmor                       │    │    │   │
    │  │  │  │                                 │    │    │   │
    │  │  │  │  ┌─────────────────────────┐    │    │    │   │
    │  │  │  │  │  Container (innermost)  │    │    │    │   │
    │  │  │  │  │  read-only rootfs       │    │    │    │   │
    │  │  │  │  │  non-root user          │    │    │    │   │
    │  │  │  │  │  dropped capabilities   │    │    │    │   │
    │  │  │  │  └─────────────────────────┘    │    │    │   │
    │  │  │  └─────────────────────────────────┘    │    │   │
    │  │  └─────────────────────────────────────────┘    │   │
    │  └─────────────────────────────────────────────────┘   │
    └────────────────────────────────────────────────────────┘

    Secrets Management (cross-cutting concern):
    ┌──────────────────────────────────────────────┐
    │                                              │
    │  External Secrets ──▶ K8s Secret ──▶ Pod     │
    │       │                                      │
    │  Vault / AWS SM / GCP SM                     │
    │  (source of truth)                           │
    │                                              │
    │  Cuts across ALL rings above                 │
    └──────────────────────────────────────────────┘

Part 6: Scaling (Chapters 30-33)

The Scaling Cascade — metrics to machines.

flowchart TD
    M["Metrics<br>(CPU, memory, custom)"]
    M --> HPA["HPA"]
    HPA -->|"scale pods<br>horizontally"| Pods["More Pods"]
    HPA -->|"pods go Pending<br>(no capacity)"| KCA["Karpenter /<br>Cluster Autoscaler"]
    KCA -->|"scale nodes"| Cloud["Cloud API<br>(provision new VMs)"]

    subgraph vpa ["VPA (Vertical Pod Autoscaler)"]
        direction LR
        VM["Metrics"] --> VPA2["VPA"] --> Resize["Resize pods vertically<br>(adjust requests/limits)"]
    end

    subgraph scheduling ["Resource Tuning feeds Scheduling"]
        direction LR
        RL["requests and limits<br>(CPU, memory)"] --> Sched["Scheduler decisions"]
        RL --> Effects["Affects bin-packing,<br>QoS class, eviction<br>priority, HPA thresholds"]
    end

Part 7: Platform Engineering (Chapters 34-39)

The Platform — abstraction over infrastructure.

flowchart TD
    Dev["Developer"] -->|writes Claim| PlatAPI["Platform API<br>(Crossplane XRD / CRD)"]
    PlatAPI -->|provisions| CloudRes["Cloud Resources<br>(RDS, S3, etc.)"]

    Git["Git Repo<br>(source of truth)"] -->|GitOps loop| Argo["ArgoCD / Flux"]
    Argo -->|sync| Clusters["Cluster(s)"]

    subgraph ext ["Extension Mechanism"]
        direction LR
        CRD["CRD<br>(defines new API)"] --> Operator["Operator<br>(watches & reconciles)"] --> Resources["Manages resources"]
    end

    subgraph horiz ["Horizontal Concerns"]
        MC["Multi-Cluster<br>(fleet mgmt, federation)"]
        MT["Multi-Tenancy<br>(namespaces, vClusters,<br>resource quotas)"]
    end

Part 8: Advanced Topics (Chapters 40-45)

Running it for Real.

    Operational Concerns:
    ┌──────────────────────────────────────────────────────┐
    │                                                      │
    │  ┌──────────┐  ┌────────────────┐  ┌─────────────┐   │
    │  │ etcd ops │  │ Disaster       │  │ Cost        │   │
    │  │ (backup, │  │ Recovery       │  │ Optimization│   │
    │  │  defrag, │  │ (Velero)       │  │ (right-size,│   │
    │  │  health) │  │                │  │  spot, idle)│   │
    │  └──────────┘  │ backup ──▶     │  └─────────────┘   │
    │                │ restore ──▶    │                    │
    │                │ migrate        │                    │
    │                └────────────────┘                    │
    └──────────────────────────────────────────────────────┘

    Observability (the three pillars):
            ┌───────────┐
            │  Metrics  │
            │(Prometheus│
            │ / Mimir)  │
            └─────┬─────┘
                  │
        ┌─────────┼─────────┐
        │         │         │
        ▼         ▼         ▼
    ┌───────┐ ┌───────┐ ┌────────┐
    │ Logs  │ │Traces │ │Alerts  │
    │(Loki) │ │(Tempo)│ │(Grafana│
    └───────┘ └───────┘ │ / PD)  │
                        └────────┘

    GPU Scheduling:
    ┌──────────────────┐     ┌──────────────────┐     ┌────────────┐
    │  Pod with        │────▶│  Device Plugin / │────▶│ NVIDIA GPU │
    │  gpu request     │     │  DRA             │     │ (on node)  │
    │  (limits:        │     │  (allocates GPU) │     │            │
    │   nvidia.com/gpu)│     └──────────────────┘     └────────────┘
    └──────────────────┘

    LLM Serving:
    ┌─────────┐    ┌──────────────┐    ┌──────────┐    ┌────────────┐
    │ Model   │───▶│ vLLM / TGI   │───▶│ KServe   │───▶│ Inference  │
    │(weights)│    │ (serving     │    │ (routing,│    │ endpoint   │
    │         │    │  engine)     │    │  scaling)│    │ (/predict) │
    └─────────┘    └──────────────┘    └──────────┘    └────────────┘

Back to Table of Contents

Appendix C: Decision Trees

Kubernetes offers many options for the same problem. These decision trees encode the trade-offs discussed throughout the book into quick-reference flowcharts.


1. Which Workload Controller?

Kubernetes provides several controllers for running workloads, each designed for a different scheduling pattern. Chapter 18 covers Deployments and Services, Chapter 21 covers StatefulSets, Chapter 24 covers Jobs and CronJobs, and Chapter 42 covers LeaderWorkerSet for ML gang scheduling. Start by asking whether your workload is stateless.

flowchart TD
    A[New Workload] --> B{Stateless?}
    B -->|Yes| C([Deployment])
    B -->|No| D{Need stable identity<br>or ordering?}
    D -->|Yes| E([StatefulSet])
    D -->|No| F{Run on every node?}
    F -->|Yes| G([DaemonSet])
    F -->|No| H{Run to completion?}
    H -->|Yes| I([Job])
    H -->|No| J{Run on schedule?}
    J -->|Yes| K([CronJob])
    J -->|No| L(["LeaderWorkerSet + Volcano<br>(ML gang scheduling)"])

2. Which Service Type?

Every application that receives traffic needs a Service, but Kubernetes offers several Service types, plus the separate Ingress and Gateway API layer, each with very different behaviors. Chapter 18 introduces ClusterIP and NodePort, Chapter 17 covers LoadBalancer integration with cloud providers, and Chapter 13 discusses Ingress and the newer Gateway API.

flowchart TD
    A[Expose a Service] --> B{Internal only?}
    B -->|Yes| C([ClusterIP])
    B -->|No| D{External DNS name<br>only, no proxy?}
    D -->|Yes| E([ExternalName])
    D -->|No| F{Need L7 routing<br>by host or path?}
    F -->|Yes| G(["Ingress / Gateway API"])
    F -->|No| H{Dev/test only?}
    H -->|Yes| I([NodePort])
    H -->|No| J(["LoadBalancer<br>(L4 TCP/UDP)"])

3. Which Storage?

Storage decisions depend on durability, access patterns, and whether multiple pods need simultaneous access. Chapter 23 covers PersistentVolumes, StorageClasses, and CSI drivers in depth. Chapter 17 explains how cloud providers implement storage backends.

flowchart TD
    A[Need Storage] --> B{"Ephemeral scratch space<br>(pod-lifetime, survives<br>container restarts)?"}
    B -->|Yes| C([emptyDir])
    B -->|No| D{Shared across pods<br>ReadWriteMany?}
    D -->|Yes| E(["NFS / EFS<br>(RWX PVC)"])
    D -->|No| F{High IOPS database?}
    F -->|Yes| G(["Local SSD / io2<br>(PVC + StorageClass)"])
    F -->|No| H{Object storage?}
    H -->|Yes| I(["S3 / GCS<br>(use SDK, not a PV)"])
    H -->|No| J(["PVC + StorageClass<br>(general purpose)"])

4. Which Autoscaler?

Kubernetes scaling operates at two levels: pod-level (adding replicas or resizing resource requests) and node-level (adding machines when pods can’t be scheduled). Chapter 30 covers HPA, Chapter 31 covers VPA, Chapter 32 covers Karpenter and Cluster Autoscaler, and Chapter 33 explains how resource requests feed into scheduling.

flowchart TD
    A[Need Autoscaling] --> B{Scale pods or nodes?}
    B -->|Pods| C{Horizontal —<br>more replicas?}
    B -->|Nodes| D{Running on AWS or Azure?}
    C -->|Yes| E([HPA])
    C -->|No| F{Right-size resources?}
    F -->|Yes| G([VPA])
    F -->|No| H(["KEDA<br>(event-driven, queues, etc.)"])
    D -->|Yes| I(["Karpenter<br>(AWS native, Azure via AKS NAP)"])
    D -->|No| J(["Cluster Autoscaler<br>(GCP / on-prem)"])

5. Which Managed Kubernetes?

The choice between managed and self-managed Kubernetes depends on your infrastructure constraints and operational maturity. Chapter 16 compares EKS, GKE, and AKS in detail. Chapter 15 covers kubeadm for self-managed clusters, and Chapter 11 covers k3s and other lightweight distributions.

flowchart TD
    A[Choose Managed K8s] --> B{On-premises?}
    B -->|Yes| C(["kubeadm / k3s / Rancher"])
    B -->|No| D{Which cloud?}
    D -->|AWS| E(["EKS<br>(see Karpenter for node scaling)"])
    D -->|GCP| F[GKE] --> G{Zero node management?}
    G -->|Yes| H([GKE Autopilot])
    G -->|No| I([GKE Standard])
    D -->|Azure| J[AKS] --> K(["AKS Free tier<br>(free control plane, dev)"])

6. Which CNI?

The Container Network Interface plugin determines how pods get IP addresses and how network traffic flows between nodes. Most managed clusters default to the cloud provider’s native CNI, but self-managed clusters require an explicit choice. Chapter 13 traces the evolution from Flannel through Calico to Cilium and explains the eBPF performance advantage.

flowchart TD
    A[Choose a CNI] --> B{Managed cloud cluster?}
    B -->|Yes| C(["Use provider default<br>(VPC CNI / Azure CNI / GKE native)"])
    B -->|No| D{Need eBPF, no<br>iptables overhead?}
    D -->|Yes| E([Cilium])
    D -->|No| F{Need NetworkPolicy?}
    F -->|Yes| G([Calico])
    F -->|No| H(["Flannel<br>(simple overlay)"])

7. Which Package Manager?

Managing Kubernetes YAML at scale requires tooling — the question is which kind. Helm uses Go templates for parameterization and dominates third-party chart distribution. Kustomize uses overlay-based patching without any template language. Many teams combine both. Chapter 12 covers these approaches and explains when to use each.

flowchart TD
    A["Package / Template<br>K8s Manifests"] --> B{Need type-safe<br>code generation?}
    B -->|Yes| C([cdk8s])
    B -->|No| D{Need Go-template style<br>parameterization?}
    D -->|Yes| E(["Helm<br>(most popular for 3rd-party<br>chart distribution)"])
    D -->|No| F{Want patch-based overlays<br>without templates?}
    F -->|Yes| G([Kustomize])
    F -->|Both| H(["helm template |<br>kustomize build<br>(common hybrid)"])

8. Which Secret Management?

Secrets require special handling: they must not appear in plain text in Git, they may need to rotate automatically, and they often originate from an external vault or cloud provider. Chapter 28 covers Kubernetes Secrets and encryption at rest, Sealed Secrets for GitOps, and integration with HashiCorp Vault and cloud secret managers via the External Secrets Operator.

flowchart TD
    A[Manage Secrets] --> B{Need external secret store<br>— Vault, AWS SM?}
    B -->|Yes| C{Need auto-rotation?}
    B -->|No| D{Storing in Git<br>for GitOps?}
    C -->|Yes| E(["Vault + sidecar injector<br>(dynamic secrets)"])
    C -->|No| F(["External Secrets Operator<br>+ Vault / AWS Secrets Manager"])
    D -->|Yes| G(["Sealed Secrets<br>(encrypt before committing)"])
    D -->|No| H(["K8s Secrets +<br>encryption at rest<br>(simple, low security)"])

9. Which GitOps Tool?

GitOps applies the Kubernetes reconciliation pattern to deployment itself: a controller watches a Git repository and ensures the cluster matches. The two major tools are ArgoCD and Flux, which differ primarily in UI richness and multi-cluster management. Chapter 12 covers both, and Chapter 34 discusses multi-cluster GitOps patterns.

flowchart TD
    A[Adopt GitOps] --> B{Need rich UI, multi-cluster,<br>app-of-apps pattern?}
    B -->|Yes| C([ArgoCD])
    B -->|No| D{Want lightweight, Git-native,<br>Helm/Kustomize controller?}
    D -->|Yes| E([Flux])
    D -->|Both| F(["They can coexist:<br>Flux for infra clusters,<br>Argo for app clusters"])

10. StatefulSet vs Operator for Databases?

Running databases on Kubernetes is possible but requires careful consideration. A managed cloud database (RDS, Cloud SQL) avoids the operational burden entirely. If you must run on K8s, operators like CloudNativePG and Percona handle failover, backups, and scaling automatically. A raw StatefulSet works for dev/staging but lacks production automation. Chapter 22 covers this decision in depth, and Chapter 38 explains the operator pattern.

flowchart TD
    A[Run a Database on K8s] --> B{Managed DB available<br>— RDS, Cloud SQL, etc.?}
    B -->|Yes| C(["Use managed DB<br>(provision via Crossplane<br>or Terraform)"])
    B -->|No| D{Production with failover,<br>backup, scaling?}
    D -->|Yes| E(["Use an Operator<br>(CloudNativePG, Percona,<br>Strimzi, etc.)"])
    D -->|No| F(["StatefulSet<br>(simple, single instance,<br>dev/staging)"])

Back to Table of Contents

Appendix D: Troubleshooting Quick Reference

This appendix maps the error messages and symptoms you will encounter in practice to their most common root causes. Organized by where you see the error.


General Debugging Flowchart

flowchart LR
    Start["Pod not working?"]
    GetPods["kubectl get pods -n namespace"]
    Status{"What status do you see?"}

    Start --> GetPods --> Status

    Status --> Pending
    Status --> Crash["CrashLoopBackOff"]
    Status --> Image["ImagePullBackOff"]
    Status --> Running["Running but not working"]
    Status --> Evicted
    Status --> Unknown["Unknown / NodeLost"]

    Pending --> PDescribe["kubectl describe pod"]
    PDescribe --> P1{"Insufficient<br>cpu/memory?"}
    PDescribe --> P2{"No nodes match<br>selectors?"}
    PDescribe --> P3{"PVC not bound?"}
    P1 --> P1Fix["Scale up or adjust requests"]
    P2 --> P2Fix["Fix nodeSelector/affinity"]
    P3 --> P3Fix["Check PVC"]

    Crash --> CLogs["kubectl logs --previous"]
    CLogs --> C1{"OOMKilled?"}
    CLogs --> C2{"App error?"}
    CLogs --> C3{"Missing config?"}
    C1 --> C1Fix["Increase memory limits"]
    C2 --> C2Fix["Fix application startup"]
    C3 --> C3Fix["Check ConfigMaps/Secrets"]

    Image --> IDescribe["kubectl describe pod"]
    IDescribe --> I1{"repo does not exist?"}
    IDescribe --> I2{"unauthorized?"}
    IDescribe --> I3{"tag not found?"}
    I1 --> I1Fix["Fix image name"]
    I2 --> I2Fix["Fix imagePullSecrets"]
    I3 --> I3Fix["Fix image tag"]

    Running --> RLogs["kubectl logs -f"]
    RLogs --> R1["Check readiness probe:<br>kubectl describe pod"]
    RLogs --> R2["Check service endpoints:<br>kubectl get endpoints"]
    RLogs --> R3["Test from inside:<br>kubectl exec -it -- sh"]

    Evicted --> EDescribe["kubectl describe node"]
    EDescribe --> E1["Check for DiskPressure /<br>MemoryPressure"]

    Unknown --> UNodes["kubectl get nodes"]
    UNodes --> U1{"Node NotReady?"}
    U1 --> U1Fix["SSH to node, check kubelet"]

Pod Status Errors

Pending

What it means: The scheduler cannot find a node to place the pod on, or a prerequisite resource is not ready.

Common causes:

  • No node has enough CPU or memory to satisfy the pod’s resource requests.
  • nodeSelector, nodeAffinity, or tolerations do not match any available node.
  • A referenced PersistentVolumeClaim is not bound.
  • Resource quotas in the namespace are exhausted.
  • The cluster has no nodes at all (scaling from zero).

How to diagnose:

kubectl describe pod <pod-name> -n <namespace>    # Look at the Events section
kubectl get nodes -o wide                          # Check node status and capacity
kubectl describe node <node-name>                  # Check Allocatable vs Allocated
kubectl get pvc -n <namespace>                     # Check PVC status
kubectl get resourcequota -n <namespace>           # Check quota usage

Fix:

  1. If resource-constrained: lower the pod’s resource requests, add nodes, or remove idle workloads.
  2. If selector mismatch: correct nodeSelector/nodeAffinity labels or add labels to nodes.
  3. If PVC not bound: ensure a matching PV exists or the StorageClass can dynamically provision one.
  4. If quota exceeded: request a quota increase or free capacity in the namespace.

CrashLoopBackOff

What it means: The container starts, exits with an error, and Kubernetes keeps restarting it with exponential backoff.

Common causes:

  • Application crashes on startup (uncaught exception, missing dependency).
  • Required ConfigMap or Secret is mounted but contains wrong data (wrong key, wrong format).
  • OOMKilled – the container exceeds its memory limit on startup.
  • Liveness probe is too aggressive and kills the container before it finishes starting.
  • Entrypoint or command is misconfigured.

How to diagnose:

kubectl logs <pod-name> -n <namespace> --previous   # Logs from the last crashed container
kubectl describe pod <pod-name> -n <namespace>       # Check Last State, Exit Code, Reason
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState}'

Fix:

  1. Read the logs from --previous to find the actual error.
  2. If OOMKilled (exit code 137): increase resources.limits.memory.
  3. If liveness probe is killing the pod: increase initialDelaySeconds and failureThreshold (see the probe sketch after this list).
  4. If config is missing: verify ConfigMap/Secret exists and has the expected keys.
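
A sketch of fix 3 as a pod-spec fragment; the endpoint, port, and timings are illustrative, and on modern clusters a startupProbe is usually the cleaner way to protect slow starters:

containers:
  - name: app
    image: example/app:1.0      # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30   # wait before the first liveness check
      periodSeconds: 10
      failureThreshold: 6       # tolerate ~60s of failures before restart
    startupProbe:               # suppresses liveness checks until startup succeeds
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30      # allow up to 30 x 10s = 300s to start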

ImagePullBackOff / ErrImagePull

What it means: The kubelet cannot pull the container image from the registry.

Common causes:

  • Image name or tag is misspelled.
  • The image tag does not exist (e.g., latest was overwritten or a SHA was pruned).
  • The registry is private and imagePullSecrets are missing or contain invalid credentials.
  • The node cannot reach the registry (network/firewall issue).
  • Docker Hub rate limits are hit on unauthenticated pulls.

How to diagnose:

kubectl describe pod <pod-name> -n <namespace>       # Read the pull error message
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].image}'
# Verify the image exists:
docker pull <image>                                   # Or: crane manifest <image>
# Check imagePullSecrets:
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

Fix:

  1. Correct the image name and tag.
  2. Create or fix imagePullSecrets:
    kubectl create secret docker-registry regcred \
      --docker-server=<registry> \
      --docker-username=<user> \
      --docker-password=<pass> \
      -n <namespace>
    
  3. If rate-limited: configure a pull-through cache or authenticate to Docker Hub.

OOMKilled

What it means: The Linux kernel’s OOM killer terminated the container because it tried to use more memory than its cgroup limit allows.

Common causes:

  • Memory limit is set too low for the workload.
  • Java application is not configured for container-aware memory (-XX:MaxRAMPercentage not set, or old JVM ignoring cgroup limits).
  • Memory leak in the application.
  • Large file processing or caching loading entire datasets into memory.

How to diagnose:

kubectl describe pod <pod-name> -n <namespace>       # Look for "OOMKilled" in Last State
kubectl top pod <pod-name> -n <namespace>             # Current memory usage
kubectl logs <pod-name> -n <namespace> --previous     # App logs before kill
# On the node:
dmesg | grep -i "oom\|killed"                         # Kernel OOM killer logs

Fix:

  1. Increase resources.limits.memory to match the actual needs of the workload.
  2. For Java: set -XX:MaxRAMPercentage=75.0 instead of a fixed -Xmx, and use a container-aware JVM (JDK 10+, or 8u191 and later); see the sketch after this list.
  3. For memory leaks: profile the application, fix the leak, then right-size the limit.
  4. Set resources.requests.memory close to limits.memory to avoid scheduling on nodes that cannot support the workload.
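
A pod-spec fragment sketching fixes 1 and 2 together (the image and sizes are illustrative):

containers:
  - name: app
    image: example/java-app:1.0   # placeholder image
    env:
      - name: JAVA_TOOL_OPTIONS   # picked up by the JVM at startup
        value: "-XX:MaxRAMPercentage=75.0"
    resources:
      requests:
        memory: 1Gi               # close to the limit for predictable scheduling
      limits:
        memory: 1Gi               # the cgroup limit the JVM sizes itself against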

CreateContainerConfigError

What it means: The kubelet cannot create the container because a referenced ConfigMap or Secret does not exist.

Common causes:

  • The ConfigMap or Secret was not created before the pod.
  • Typo in the ConfigMap/Secret name in the pod spec.
  • The ConfigMap/Secret is in a different namespace (they are namespace-scoped).
  • It was accidentally deleted.

How to diagnose:

kubectl describe pod <pod-name> -n <namespace>       # Events will name the missing resource
kubectl get configmap -n <namespace>
kubectl get secret -n <namespace>

Fix:

  1. Create the missing ConfigMap or Secret.
  2. Correct any name typos in the pod spec.
  3. If it should be optional, set optional: true on the configMapRef/secretRef (sketch below).
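
Fix 3 as a pod-spec fragment (the ConfigMap name is illustrative):

containers:
  - name: app
    envFrom:
      - configMapRef:
          name: feature-flags   # illustrative name
          optional: true        # the pod starts even if this ConfigMap is absent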

Init:CrashLoopBackOff

What it means: An init container is repeatedly crashing, preventing the main containers from starting.

Common causes:

  • The init container is waiting for a service that is not yet available (e.g., database migration init container cannot connect to the DB).
  • Script error in the init container command.
  • Wrong image or command for the init container.

How to diagnose:

kubectl describe pod <pod-name> -n <namespace>       # Identify which init container is failing
kubectl logs <pod-name> -n <namespace> -c <init-container-name> --previous

Fix:

  1. Check the init container logs for the specific error.
  2. Verify the service it depends on is running and reachable.
  3. Fix the init container command, image, or configuration.

Evicted

What it means: The kubelet evicted the pod because the node was under resource pressure (disk, memory, or PID).

Common causes:

  • Node is under DiskPressure (ephemeral storage or container logs filled the disk).
  • Node is under MemoryPressure (too many pods with BestEffort QoS).
  • PID exhaustion on the node.

How to diagnose:

kubectl describe pod <pod-name> -n <namespace>       # Shows eviction reason
kubectl describe node <node-name>                     # Check Conditions for pressure
kubectl get pods -n <namespace> --field-selector=status.phase=Failed | grep Evicted

Fix:

  1. Clean up disk usage on the node (prune unused images, clear old logs).
  2. Set proper resources.requests so BestEffort pods are evicted first.
  3. Configure ephemeral-storage requests and limits (see the sketch after this list).
  4. Set up log rotation and image garbage collection on nodes.
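
Fix 3 as a pod-spec fragment (sizes are illustrative):

containers:
  - name: app
    resources:
      requests:
        ephemeral-storage: 1Gi   # counted at scheduling time
      limits:
        ephemeral-storage: 4Gi   # the pod is evicted if it exceeds this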

Node Issues

NotReady

What it means: The kubelet on the node is not communicating with the API server, so the control plane marks it NotReady.

Common causes:

  • Kubelet service is not running or has crashed.
  • CNI plugin is not installed or is misconfigured (the node cannot report Ready without a working CNI).
  • Node is under DiskPressure or MemoryPressure.
  • Network partition between the node and the control plane.
  • Expired kubelet client certificate.

How to diagnose:

kubectl describe node <node-name>                     # Check Conditions and Events
kubectl get node <node-name> -o yaml                  # Look at .status.conditions
# SSH to the node:
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago"
crictl ps                                              # Check container runtime
ls /etc/cni/net.d/                                     # Check CNI configuration

Fix:

  1. Restart kubelet: systemctl restart kubelet.
  2. If CNI is missing: install or reinstall the CNI plugin (Calico, Cilium, Flannel, etc.).
  3. If certificates expired: rotate certificates with kubeadm certs renew.
  4. If disk pressure: free disk space on the node.

SchedulingDisabled

What it means: The node has been cordoned – new pods will not be scheduled onto it.

Common causes:

  • An administrator ran kubectl cordon <node>.
  • A node drain is in progress (kubectl drain).
  • A cluster autoscaler is decommissioning the node.

How to diagnose:

kubectl get nodes                                      # Look for SchedulingDisabled
kubectl describe node <node-name>                      # Check Taints for NoSchedule

Fix:

  1. If the maintenance is complete: kubectl uncordon <node-name>.
  2. If autoscaler-managed: the node will be removed; no action needed.

DiskPressure / MemoryPressure

What it means: The node’s available disk or memory has dropped below the kubelet’s eviction threshold.

Common causes:

  • Container images consuming too much disk.
  • Application logs not rotated, filling up the filesystem.
  • Too many pods on the node relative to available memory.
  • Large emptyDir volumes.

How to diagnose:

kubectl describe node <node-name>                      # Check Conditions section
# SSH to the node:
df -h                                                   # Disk usage
free -m                                                 # Memory usage
crictl images | wc -l                                   # Number of cached images
du -sh /var/log/pods/*                                  # Pod log sizes

Fix:

  1. Disk: prune unused images (crictl rmi --prune), enable log rotation, clean /var/log.
  2. Memory: evict low-priority pods, increase node size, or add more nodes.
  3. Configure kubelet garbage collection thresholds in the KubeletConfiguration.

Networking

Connection refused on Service

What it means: A TCP connection to the Service IP and port is actively refused, meaning nothing is listening.

Common causes:

  • No ready endpoints behind the Service (pods are not running or not passing readiness probes).
  • The Service targetPort does not match the port the application is actually listening on.
  • The pod is running but the application inside has not started listening yet.

How to diagnose:

kubectl get endpoints <service-name> -n <namespace>    # Are there any endpoints?
kubectl get pods -n <namespace> -l <selector>           # Are pods running and Ready?
kubectl describe svc <service-name> -n <namespace>      # Check selector and ports
kubectl exec -it <pod> -n <namespace> -- ss -tlnp       # What is the pod listening on?

Fix:

  1. If no endpoints: ensure the Service selector matches the pod labels exactly.
  2. If targetPort is wrong: update the Service to match the container’s listening port.
  3. If pods are not Ready: fix the readiness probe or the underlying application issue.

DNS Resolution Failures

What it means: Pods cannot resolve Kubernetes service names or external hostnames.

Common causes:

  • CoreDNS pods are not running or are crashing.
  • The ndots setting (default: 5) causes excessive search domain lookups, leading to timeouts.
  • Pod’s dnsPolicy is set to Default (uses node DNS) instead of ClusterFirst.
  • Network policy blocking DNS traffic (UDP/TCP port 53 to kube-system).

How to diagnose:

kubectl get pods -n kube-system -l k8s-app=kube-dns    # Is CoreDNS running?
kubectl logs -n kube-system -l k8s-app=kube-dns        # CoreDNS errors
# Test from inside a pod:
kubectl exec -it <pod> -n <namespace> -- nslookup kubernetes.default
kubectl exec -it <pod> -n <namespace> -- cat /etc/resolv.conf

Fix:

  1. If CoreDNS is down: check its deployment, resource limits, and node resources.
  2. For ndots issues: add dnsConfig with ndots: 2 to the pod spec (sketch below), or use FQDNs (trailing dot).
  3. If NetworkPolicy is blocking: allow egress to kube-system on port 53.
  4. If dnsPolicy is wrong: set dnsPolicy: ClusterFirst.
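
Fix 2 as a pod-spec fragment:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"    # names with >= 2 dots are tried as absolute first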

Service has no endpoints

What it means: The Service exists but has no backing pods, so all traffic to it fails.

Common causes:

  • The Service’s label selector does not match any pod labels (typo or mismatch).
  • All matching pods are failing their readiness probes.
  • The pods are in a different namespace than expected (selectors are namespace-scoped).

How to diagnose:

kubectl describe svc <service-name> -n <namespace>     # Check Selector
kubectl get endpoints <service-name> -n <namespace>     # Should list pod IPs
kubectl get pods -n <namespace> --show-labels           # Compare labels to selector
kubectl get pods -n <namespace> -l <key>=<value>        # Test the selector directly

Fix:

  1. Align the Service selector with the pod’s labels (see the sketch after this list).
  2. Fix readiness probes so pods become Ready.
  3. Ensure pods are deployed in the correct namespace.
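
Fix 1 side by side: the Service selector must match the labels the Deployment stamps onto its pods (all names are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web                # must equal the pod template labels below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web            # the labels the Service selects on
    spec:
      containers:
        - name: web
          image: example/web:1.0   # placeholder image
          ports:
            - containerPort: 8080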

Timeout Connecting Between Pods

What it means: TCP connections between pods hang and eventually time out, rather than being refused.

Common causes:

  • A NetworkPolicy is blocking the traffic.
  • The CNI plugin is misconfigured or its pods are crashing.
  • iptables/eBPF rules are stale after a CNI upgrade or node reboot.
  • Nodes are in different subnets and inter-node routing is broken.

How to diagnose:

kubectl get networkpolicy -n <namespace>                # Are there policies restricting traffic?
kubectl describe networkpolicy <name> -n <namespace>
kubectl get pods -n kube-system -l k8s-app=calico-node  # Or your CNI's pods
# Test connectivity from inside a pod:
kubectl exec -it <pod> -n <namespace> -- curl -v --connect-timeout 5 <target-svc>:<port>
# On the node:
iptables-save | grep <service-cluster-ip>               # Check kube-proxy rules

Fix:

  1. If NetworkPolicy is blocking: update the policy to allow the required ingress/egress (see the sketch after this list).
  2. If CNI is broken: restart CNI pods, or reinstall the CNI plugin.
  3. If iptables are stale: restart kube-proxy (kubectl rollout restart ds/kube-proxy -n kube-system).
  4. Check cloud provider security groups and route tables for inter-node communication.
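
A sketch of fix 1: a policy admitting traffic from a labeled client to the target pods (the labels and port are assumptions):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend   # illustrative name
spec:
  podSelector:
    matchLabels:
      app: backend            # assumed target pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # assumed client pods
      ports:
        - protocol: TCP
          port: 8080          # assumed application port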

Storage

PVC Stuck in Pending

What it means: The PersistentVolumeClaim cannot be bound to a PersistentVolume.

Common causes:

  • No PV matches the PVC’s storageClassName, accessModes, or capacity.
  • The StorageClass does not exist or the provisioner is not installed.
  • In multi-zone clusters: the PV is in a different zone than the node running the pod.
  • WaitForFirstConsumer binding mode means the PVC will not bind until a pod using it is scheduled.

How to diagnose:

kubectl describe pvc <pvc-name> -n <namespace>          # Events explain why it is pending
kubectl get storageclass                                 # Does the StorageClass exist?
kubectl get pv                                           # Are there available PVs?
kubectl get events -n <namespace> --field-selector reason=ProvisioningFailed

Fix:

  1. If no StorageClass: create one or set a default StorageClass.
  2. If provisioner is missing: install the CSI driver (e.g., ebs-csi-driver, csi-driver-nfs).
  3. If zone mismatch: use volumeBindingMode: WaitForFirstConsumer to bind in the correct zone (sketch below).
  4. If capacity mismatch: create a PV with the required size, or adjust the PVC request.
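
Fix 3 as a StorageClass sketch; the provisioner shown is the AWS EBS CSI driver, so substitute your cluster’s own:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wffc              # illustrative name
provisioner: ebs.csi.aws.com  # assumed CSI driver
volumeBindingMode: WaitForFirstConsumer   # provision in the zone where the pod lands
parameters:
  type: gp3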

FailedMount / FailedAttachVolume

What it means: The kubelet cannot mount or attach the volume to the node.

Common causes:

  • The volume is still attached to another node (common when a pod is rescheduled – the old node has not detached yet).
  • The CSI driver is not installed or its pods are not running.
  • The volume does not exist (deleted out of band).
  • Filesystem corruption requiring manual fsck.
  • Exceeded the maximum number of volumes per node (e.g., AWS limit of EBS volumes per instance type).

How to diagnose:

kubectl describe pod <pod-name> -n <namespace>          # Look at Events for mount errors
kubectl get volumeattachments                            # Check if volume is attached elsewhere
kubectl get pods -n kube-system -l app=ebs-csi-node     # Check CSI driver pods
kubectl get pv <pv-name> -o yaml                        # Check volume status

Fix:

  1. If stuck attachment: wait for the VolumeAttachment to be cleaned up (up to 6 minutes), or manually delete the VolumeAttachment object.
  2. If CSI driver is missing: install it.
  3. If volume limit reached: use a larger instance type or distribute pods across more nodes.
  4. If volume was deleted: recreate it and restore from backup.

Control Plane

API Server connection refused

What it means: Clients cannot reach the Kubernetes API server.

Common causes:

  • The kube-apiserver process is not running.
  • TLS certificates have expired.
  • Firewall or security group is blocking port 6443.
  • Load balancer in front of API server is misconfigured or unhealthy.

How to diagnose:

# On a control plane node:
crictl ps | grep kube-apiserver                         # Is the container running?
crictl logs <apiserver-container-id> | tail -50          # API server logs
openssl s_client -connect <api-server>:6443             # Test TLS handshake
curl -k https://<api-server>:6443/healthz               # Health endpoint
journalctl -u kubelet | grep apiserver                  # Kubelet managing static pod?

Fix:

  1. If not running: check static pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml.
  2. If certificates expired: kubeadm certs renew all && systemctl restart kubelet.
  3. If firewall blocking: open port 6443 to the required source IPs.
  4. If load balancer: check health check configuration and backend targets.

etcd Errors

What it means: The etcd cluster backing the API server is unhealthy.

Common causes:

  • Disk latency is too high (etcd requires low-latency storage, ideally SSD).
  • Quorum lost (majority of etcd members are down).
  • Database size has exceeded the space quota (default 2 GB).
  • Clock skew between etcd members.

How to diagnose:

# If etcd is accessible:
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table
etcdctl alarm list
# Check disk latency:
etcdctl check perf
# From API server logs:
crictl logs <apiserver-container-id> | grep etcd

Fix:

  1. If disk latency: move etcd to SSD-backed storage, or use dedicated etcd nodes.
  2. If quorum lost: restore from snapshot (etcdctl snapshot restore).
  3. If space quota exceeded: compact to the current revision (etcdctl compaction <revision>), then defragment with etcdctl defrag.
  4. If alarms triggered: etcdctl alarm disarm after resolving the root cause.

Forbidden (RBAC)

What it means: The authenticated identity does not have permission to perform the requested action.

Common causes:

  • Missing Role/ClusterRole or RoleBinding/ClusterRoleBinding.
  • The binding references the wrong ServiceAccount, user, or group.
  • Namespace mismatch: a Role only grants permissions in its own namespace.
  • The ServiceAccount token is from a different namespace.

How to diagnose:

kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<ns>:<sa>
kubectl auth can-i <verb> <resource> --as=<user> -n <namespace>
kubectl get rolebinding,clusterrolebinding -A | grep <service-account-name>
kubectl describe clusterrole <role-name>                # What permissions does it grant?

Fix:

  1. Create the missing Role and RoleBinding (or ClusterRole/ClusterRoleBinding for cluster-wide access); see the sketch after this list.
  2. Verify the subjects in the binding match the actual identity making the request.
  3. Use kubectl auth can-i --list --as=<identity> to see all permissions for debugging.
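
Fix 1 as a sketch: a namespace-scoped Role plus a RoleBinding for a ServiceAccount (all names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: app              # a Role only grants within its namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: app
subjects:
  - kind: ServiceAccount
    name: app-sa              # must match the identity making the request
    namespace: app
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io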

Webhook Errors (failed calling webhook)

What it means: An admission webhook (validating or mutating) is failing, blocking resource creation or updates.

Common causes:

  • The webhook’s backing Service or pod is down.
  • The webhook’s TLS certificate has expired.
  • The webhook was installed with failurePolicy: Fail and the service is unreachable.
  • The webhook is rejecting the request due to policy (this is intentional, not an error in the webhook itself).

How to diagnose:

kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
kubectl describe validatingwebhookconfiguration <name>   # Check service and failurePolicy
kubectl get pods -n <webhook-namespace>                   # Is the webhook pod running?
kubectl logs -n <webhook-namespace> <webhook-pod>         # Webhook logs

Fix:

  1. If the webhook service is down: restart it or fix its deployment.
  2. If certificates expired: renew them (often managed by cert-manager).
  3. Emergency bypass: temporarily set failurePolicy: Ignore or delete the webhook configuration.
  4. To exclude a namespace: add the appropriate namespaceSelector to the webhook configuration.

Deployment Issues

Rollout Stuck

What it means: A Deployment rollout is not progressing – new pods are not becoming Ready or old pods are not being terminated.

Common causes:

  • New pods are failing (CrashLoopBackOff, ImagePullBackOff, Pending).
  • A PodDisruptionBudget is preventing old pods from being evicted.
  • Resource quota in the namespace is exhausted (cannot create new ReplicaSet pods).
  • The progressDeadlineSeconds window (default 600s) has not elapsed yet, so Kubernetes has not marked the rollout as failed.

How to diagnose:

kubectl rollout status deployment/<name> -n <namespace>
kubectl describe deployment <name> -n <namespace>        # Check Conditions and Events
kubectl get rs -n <namespace>                             # Compare old vs new ReplicaSet
kubectl get pods -n <namespace> -l <selector>             # What state are the new pods in?
kubectl get pdb -n <namespace>                            # Check PodDisruptionBudgets
kubectl get resourcequota -n <namespace>

Fix:

  1. Fix the underlying pod issue (image, config, resources) then let the rollout continue.
  2. If PDB is blocking: temporarily relax the PDB or scale up first.
  3. If stuck and unrecoverable: kubectl rollout undo deployment/<name> -n <namespace>.
  4. If quota exceeded: increase the quota or delete unused resources.

FailedCreate on ReplicaSet

What it means: The ReplicaSet controller cannot create new pods.

Common causes:

  • Resource quota in the namespace is fully consumed.
  • An admission webhook is rejecting pod creation.
  • LimitRange in the namespace is setting constraints the pod spec violates.
  • The ServiceAccount referenced by the pod does not exist.

How to diagnose:

kubectl describe rs <replicaset-name> -n <namespace>     # Events will show the error
kubectl get resourcequota -n <namespace> -o yaml          # Compare used vs hard
kubectl get limitrange -n <namespace> -o yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

Fix:

  1. If quota exhausted: increase the quota or reduce resource requests on the pods.
  2. If webhook rejecting: check webhook logs to understand the rejection reason.
  3. If LimitRange violation: adjust pod resource requests/limits to comply.
  4. If ServiceAccount missing: create it or correct the reference.

Useful Commands Cheat Sheet

# --- Inspecting Resources ---
kubectl get pods -n <ns> -o wide                        # Pod status with node and IP
kubectl describe pod <pod> -n <ns>                      # Full pod details and events
kubectl get events -n <ns> --sort-by='.lastTimestamp'   # Recent events in namespace
kubectl get events -A --field-selector type=Warning     # All warnings cluster-wide

# --- Logs ---
kubectl logs <pod> -n <ns>                              # Current container logs
kubectl logs <pod> -n <ns> --previous                   # Logs from last crashed container
kubectl logs <pod> -n <ns> -c <container>               # Specific container in multi-container pod
kubectl logs -l app=<label> -n <ns> --tail=100          # Logs by label selector

# --- Interactive Debugging ---
kubectl exec -it <pod> -n <ns> -- /bin/sh               # Shell into a running container
kubectl debug node/<node> -it --image=busybox           # Debug node-level issues
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash  # Ephemeral network debug pod

# --- Networking ---
kubectl get endpoints <svc> -n <ns>                     # Service endpoints
kubectl port-forward svc/<svc> 8080:80 -n <ns>         # Forward service port to localhost
kubectl exec <pod> -n <ns> -- nslookup <svc>            # Test DNS resolution from pod

# --- Resource Usage ---
kubectl top nodes                                       # Node CPU and memory usage
kubectl top pods -n <ns> --sort-by=memory               # Pod resource consumption
kubectl top pods -n <ns> --containers                   # Per-container resource usage

# --- Cluster State ---
kubectl get componentstatuses                           # Control plane health (deprecated but useful)
kubectl cluster-info dump | grep -i error               # Dump cluster state and search for errors
kubectl api-resources                                   # All available API resources

# --- Rollouts ---
kubectl rollout status deployment/<name> -n <ns>        # Watch rollout progress
kubectl rollout history deployment/<name> -n <ns>       # Rollout revision history
kubectl rollout undo deployment/<name> -n <ns>          # Roll back to previous revision

Back to Table of Contents

Appendix E: Architecture Evolution Timeline

Kubernetes and its ecosystem have evolved rapidly since 2014. This timeline shows the major architectural shifts — each one driven by real problems with the previous approach. Understanding this evolution explains why the current ecosystem looks the way it does.


Visual Timeline (2013-2026)

Container Runtimes, Orchestration, Networking, and Package Management

flowchart TD
    subgraph y2013 ["2013"]
        docker13["Docker released<br>(monolithic daemon)"]
    end

    subgraph y2015 ["2015"]
        oci["OCI founded<br>runc extracted"]
        k8s10["Kubernetes 1.0<br>CNCF founded"]
        flannel["Flannel (overlay)<br>kube-proxy + iptables"]
        yaml15["Raw YAML<br>kubectl apply"]
    end

    subgraph y2016 ["2016"]
        containerd16["containerd extracted<br>from Docker"]
        calico["Calico (BGP)<br>Canal"]
        helm2["Helm v2<br>(with Tiller)"]
    end

    subgraph y2017 ["2017"]
        cri17["CRI interface defined"]
        swarm17["Docker Swarm<br>embedded in Docker"]
        cni17["CNI spec matures"]
    end

    subgraph y2018 ["2018-2019"]
        kust["Kustomize<br>(patch-based)"]
        cilium["Cilium (eBPF-based)"]
        helm3["Helm v3 (no Tiller!)"]
        mesos["Docker Enterprise<br>sold to Mirantis"]
    end

    subgraph y2020 ["2020-2022"]
        deprec["K8s 1.20: dockershim<br>DEPRECATED"]
        removed["K8s 1.24: dockershim<br>REMOVED"]
        mesos21["Apache Mesos<br>RETIRED"]
        helmkust["Helm + Kustomize<br>combined pattern"]
    end

    subgraph y2023 ["2023-2024"]
        std24["containerd + CRI-O<br>are the standards"]
        k8sstd["Kubernetes is<br>THE standard"]
        gw["Gateway API GA"]
        ciliumdef["Cilium = default CNI<br>for many platforms"]
        cdk["cdk8s, Timoni<br>(CUE-based)"]
    end

    docker13 --> oci --> containerd16 --> cri17 --> deprec --> removed --> std24
    docker13 --> k8s10 --> swarm17 --> mesos --> mesos21 --> k8sstd
    flannel --> calico --> cni17 --> cilium --> gw --> ciliumdef
    yaml15 --> helm2 --> kust --> helm3 --> helmkust --> cdk

Four parallel evolutions that shaped the infrastructure layer: Docker’s monolith was decomposed into containerd and CRI-O. The orchestration wars ended with Kubernetes as the universal standard. Networking shifted from overlays and iptables to eBPF-native with Cilium. And YAML management evolved from raw manifests through Helm’s Tiller era to today’s Helm v3 + Kustomize hybrid.

Security, GitOps, Scaling, and GPU/ML

flowchart TD
    subgraph y2017b ["2016-2017"]
        rbac["RBAC GA (K8s 1.8)"]
        ca["Cluster Autoscaler"]
        hpa["HPA v2"]
    end

    subgraph y2018b ["2018"]
        psp["PodSecurityPolicy"]
        argo["ArgoCD, Flux v1<br>GitOps begins"]
        devplugin["Device plugins<br>for GPUs"]
    end

    subgraph y2020b ["2020-2021"]
        opa["OPA / Gatekeeper"]
        sig["Sigstore, Cosign<br>Kyverno matures"]
        flux2["Flux v2 rewrite"]
        crossplane["Crossplane<br>Backstage joins CNCF"]
        karpenter["Karpenter (AWS)"]
        kubeflow["Kubeflow, KubeRay"]
    end

    subgraph y2022b ["2022-2023"]
        pspdep["PSP DEPRECATED"]
        pss["Pod Security Standards<br>replace PSP"]
        plateng["Platform Engineering<br>as a discipline"]
        gpuop["NVIDIA GPU<br>Operator mature"]
        dra["DRA alpha<br>(Dynamic Resource Allocation)"]
    end

    subgraph y2024b ["2024-2025"]
        supply["Supply chain security:<br>SBOM, SLSA standard"]
        idp["Internal Developer<br>Platforms go mainstream"]
        karpga["Karpenter GA<br>+ Azure support"]
        llm["LLM serving explosion:<br>vLLM, TGI, KServe"]
        llmd["llm-d, LeaderWorkerSet<br>multi-node inference"]
    end

    rbac --> psp --> opa --> pspdep --> pss --> supply
    argo --> flux2 --> crossplane --> plateng --> idp
    ca --> hpa --> karpenter --> karpga
    sig ~~~ pss
    devplugin --> kubeflow --> gpuop --> dra --> llm --> llmd

Security moved from the flawed PodSecurityPolicy to the simpler Pod Security Standards, while policy engines like OPA and Kyverno filled the gap. GitOps went from manual kubectl to ArgoCD/Flux, then broadened into full Internal Developer Platforms. Scaling evolved from the slow, group-based Cluster Autoscaler to Karpenter’s per-pod provisioning. And GPU/ML infrastructure exploded from basic device plugins to DRA, vLLM, and disaggregated serving with llm-d.

Observability

timeline
    title Observability Evolution
    2016 : Prometheus joins CNCF
    2018 : Prometheus graduates CNCF
    2019 : OpenTelemetry formed
         : (OpenTracing + OpenCensus merger)
    2021 : Grafana Loki, Tempo mature
    2023 : OpenTelemetry GA
         : (traces, metrics)
    2024 : OpenTelemetry logging matures

Observability converged from three fragmented signals — Prometheus for metrics, various tools for logs, and Jaeger/Zipkin for traces — into a unified standard with OpenTelemetry. The Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) emerged as the dominant open-source backend.


Node Autoscaling: The CA-to-Karpenter Transition

                      Cluster Autoscaler (2016)    Karpenter (2021+)
    Abstraction       Node-group based             Groupless provisioning
    Scaling unit      Scale by group min/max       Per-pod scheduling
    Speed             Slow (minutes)               Fast (seconds)
    Bin-packing       No                           Cross-instance-type optimization
    Consolidation     Reactive only                Active consolidation
    Instance types    Fixed per group              Works across all types

Why it changed: Cluster Autoscaler couldn’t keep up with diverse GPU/ML workloads that needed fast, flexible provisioning across many instance types. Karpenter eliminated the node group abstraction entirely.


Summary: Architectural Shifts by Domain

Container Runtime
    Old way: Docker (monolithic daemon)
    New way: containerd / CRI-O via CRI
    Why it changed: Docker included too much (build, swarm, CLI). K8s only needs a runtime. CRI allows pluggable runtimes.

Orchestration
    Old way: Docker Swarm, Mesos, multiple options
    New way: Kubernetes (universal standard)
    Why it changed: K8s won on extensibility (CRDs, operators) and ecosystem. Swarm was too simple, Mesos too complex.

Networking
    Old way: Flannel overlay + iptables kube-proxy
    New way: Cilium (eBPF) + Gateway API
    Why it changed: iptables doesn’t scale. Overlay adds latency. eBPF gives kernel-level networking without kube-proxy.

Package Management
    Old way: Raw YAML / Helm v2 with Tiller
    New way: Helm v3 + Kustomize (or combined)
    Why it changed: Tiller was a security risk (cluster-admin in-cluster). Raw YAML doesn’t compose. Kustomize avoids templating.

Security
    Old way: PodSecurityPolicy (PSP)
    New way: Pod Security Standards (PSS) + Kyverno/OPA
    Why it changed: PSP was confusing, hard to audit, and couldn’t be extended. PSS is simpler; policy engines are more flexible.

GitOps & Platform
    Old way: Manual kubectl apply / CI pipelines
    New way: ArgoCD/Flux + Internal Developer Platforms
    Why it changed: Imperative deploys are fragile and unauditable. GitOps makes the desired state declarative and versioned.

Scaling
    Old way: Cluster Autoscaler (node-group based)
    New way: Karpenter (groupless, per-pod)
    Why it changed: CA was slow and inflexible with diverse workloads. Karpenter provisions exactly what’s needed, fast.

GPU/ML
    Old way: Basic device plugins
    New way: GPU Operator + DRA + specialized serving (vLLM, llm-d)
    Why it changed: LLM explosion demands multi-node GPU scheduling, fractional GPUs, and inference-optimized runtimes.

Observability
    Old way: Prometheus + ad-hoc logging/tracing
    New way: OpenTelemetry (unified) + Grafana stack
    Why it changed: Three separate telemetry signals (metrics, logs, traces) needed a unified collection and correlation standard.

Back to Table of Contents

Colophon

How This Book Was Made

Kubernetes from First Principles: Why It Works the Way It Does was written over a weekend (April 3-4, 2026) through a collaboration between Rajat Arya and Claude Code (Anthropic’s Claude Opus 4.6 with 1M context). The entire process — from “help me set up a Kubernetes cluster on EC2” to a published 98,000-word, 45-chapter textbook with 5 appendices — happened in one continuous conversation.

The Process

The book emerged organically from a hands-on Kubernetes learning session. Rajat was setting up a 3-node Kubernetes cluster on AWS EC2 from scratch using kubeadm. As we worked through real problems (containerd CRI disabled, SystemdCgroup mismatch, etcd crash loops, missing CNI plugins, API server CrashLoopBackOff), we documented what we learned. That documentation grew into Part 1 (First Principles), then expanded into a full curriculum.

Generation Method

  1. Research phase: For each part, specialized research agents were dispatched to search the web (via Actionbook browser automation), read official Kubernetes documentation, blog posts, cloud provider docs, CNCF project pages, and academic papers. Research agents ran in parallel — up to 4 simultaneously — to gather material for different topics.

  2. Writing phase: Writing agents received the research findings along with detailed chapter outlines specifying what to cover, what tone to use, and what diagrams to include. Writers also ran in parallel — up to 5 simultaneously — each producing 4-8 chapters.

  3. Coherence pass: A review agent read every chapter, verified all “Next:” links, added cross-references between chapters, wrote part transition paragraphs, and checked tone consistency.

  4. Link verification: All external URLs were tested for accessibility.

The Prompts

The book was generated through a series of natural-language prompts. Here are the key ones that shaped each part:

Part 1 (First Principles, chapters 1-9):

“I need to understand how Kubernetes and its ecosystem fit into the modern deployment landscape — but from first principles. I see an infinite number of resources online describing how to use k8s, but I don’t see any real information on where it comes from, why it was architected this way, and what problems it seeks to solve.”

Parts 2-3 (Tooling Evolution + Practice, chapters 10-20):

“Make a part 2 that includes tool ecosystem history and evolution. Has there always been kubeadm, kubelet, etc? And then part 3 can cover getting started with modern Kubernetes, including setting up a cluster from scratch, using public cloud kubernetes offerings, understanding how Kubernetes networking and storage map to public cloud VM offerings, and provide a more practical hands-on way to connect the theory to practice.”

Parts 4-8 (Stateful, Security, Scaling, Platform, Advanced, chapters 21-45):

“I want all of these. I also want the collection of individual topics. It is especially important that I understand the GPU workloads and AI/ML on Kubernetes. Go in as much depth as possible. I need all of these topics to understand the infrastructure at my work.”

Guiding Constraints

These instructions were consistent across all parts:

  • “Focus on WHY decisions were made, not HOW to use the tools.” — This shaped the entire tone. Every chapter explains the reasoning behind design decisions rather than just listing commands.
  • “I know Linux, I know the computer pretty well, and I know networking pretty well.” — This set the audience level. The book doesn’t explain what a process is or how TCP works, but it does explain why Kubernetes chose a flat networking model over Docker’s port-mapping.
  • “Liberally draw diagrams” — Every chapter includes ASCII diagrams illustrating architecture, data flow, or concept relationships.
  • “Same tone as part 1” — The first-principles, textbook-quality tone was established in Part 1 and maintained throughout by referencing the existing chapters as style guides.

Tools Used

  • Claude Code (Anthropic Claude Opus 4.6, 1M context) — conversation orchestration, research coordination, writing, and editing
  • Actionbook Browser — web research automation for gathering source material
  • GitHub CLI (gh) — repository creation and publishing

The Companion Cluster

The install.sh script in this repository is a real, working bootstrap script that was iteratively debugged during the conversation. It went through several revisions:

  • v1: Based on an outdated reference script, used deprecated apt-key, installed full Docker engine, had wrong CNI plugin version
  • v2: Fixed containerd CRI config, added SystemdCgroup, switched to containerd-only (no Docker engine), updated to modern keyring approach
  • v3: Fixed CNI plugin version (v1.6.1 didn’t exist, updated to v1.9.1), added conntrack dependency

Every error documented in the troubleshooting sections of the README and Chapter 15 was a real error encountered during the session.

Accuracy and Limitations

  • Research was conducted on April 3-4, 2026. Version numbers, feature statuses, and ecosystem information reflect this date. Kubernetes evolves rapidly — verify versions before following any specific instructions.
  • The AI-generated content was guided by web research from official documentation, CNCF project pages, and reputable technical blogs. However, AI can hallucinate details. When in doubt, consult the official Kubernetes documentation at https://kubernetes.io/docs/.
  • External links were verified at publication time but may break as pages move or are removed.
  • The book reflects one learning path. There are many valid ways to learn Kubernetes. This path emphasizes first-principles understanding over hands-on tutorials, which suits some learners better than others.

License

The book is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). Any additional code and supporting materials (e.g., install.sh) are licensed under the MIT License. See LICENSE for details.

Kubernetes is a registered trademark of The Linux Foundation.

Contributing

Found an error? A broken link? A concept that could be explained better? Contributions are welcome via pull requests.