Introduction
Kubernetes from First Principles: Why It Works the Way It Does
Most Kubernetes resources teach you how to write YAML. This book teaches you why the YAML looks the way it does. 45 chapters — each traced from the original design problem through the ecosystem’s evolution to today’s best practice.
About
This is an eight-part book that takes you from “why does Kubernetes exist?” to “I’m running GPU-accelerated ML workloads in production across multiple clusters.” It is written for engineers who understand Linux, networking, and how systems work — and want to understand Kubernetes deeply, not just follow tutorials.
Part 1: First Principles
Why Kubernetes was designed the way it was.
- The Road to Kubernetes — From bare metal to Borg to Kubernetes
- The Problems Kubernetes Solves — Bin packing, service discovery, self-healing, and the desired state model
- Architecture from First Principles — etcd, API server, controllers, scheduler, kubelet, kube-proxy
- The API Model — Resources, specs, status, reconciliation loops, labels, and CRDs
- The Networking Model — Flat networking, CNI, Services, Ingress, and Network Policies
- The Ecosystem — Operators, Helm, service meshes, and Kubernetes as a platform for platforms
- Key Design Principles — Declarative over imperative, control loops, level-triggered vs edge-triggered
- Why Kubernetes Won — The competitive landscape and the deeper architectural lesson
- References and Further Reading — Foundational papers, design documents, talks, and books
Part 2: The Tooling Ecosystem — History and Evolution
How the tools around Kubernetes evolved, and why they look the way they do today.
- The Container Runtime Wars — Docker to containerd to CRI-O: why Docker was deprecated
- Bootstrapping a Cluster — From kube-up.sh to kubeadm: how cluster setup evolved
- Package Management and GitOps — Helm v2/v3, Kustomize, ArgoCD, Flux
- The Networking Stack Evolution — Flannel to Calico to Cilium: how eBPF changed everything
- Kubernetes Version History — A guided tour of key releases and what they introduced
Part 3: From Theory to Practice
Connecting the principles from Part 1 to real-world usage.
- Setting Up a Cluster from Scratch — What kubeadm actually does: TLS bootstrapping, static pods
- Managed Kubernetes: EKS, GKE, and AKS — Cloud provider comparison and how to choose
- Cloud Networking and Storage — VPC CNI, CSI drivers, and how K8s maps to cloud infrastructure
- Your First Workloads — Hands-on: Deployments, Services, ConfigMaps, rolling updates
- Debugging Kubernetes — The kubectl toolkit and diagnosing common failures
- Production Readiness — Monitoring, logging, security basics, and backup
Part 4: Stateful Workloads
Running real applications with persistent state.
- StatefulSets Deep Dive — Stable identities, ordered operations, and headless Services
- Databases on Kubernetes — When to run databases on K8s, operators, and the trade-offs
- Persistent Storage Patterns — volumeClaimTemplates, reclaim policies, backup, and resize
- Jobs and CronJobs — Batch processing, indexed completions, and scheduling patterns
Part 5: Security Deep Dive
Understanding and implementing Kubernetes security from the ground up.
- RBAC from First Principles — Roles, bindings, ServiceAccounts, and multi-tenant design
- Network Policies — Default deny, namespace isolation, and egress control
- Supply Chain Security — Image signing, admission policies, scanning, and SLSA
- Secrets Management — Encryption at rest, Vault, External Secrets Operator, and best practices
- Pod Security Standards — Privileged, Baseline, Restricted profiles and enforcement
Part 6: Scaling and Performance
Making Kubernetes handle real-world load.
- Horizontal Pod Autoscaler — The scaling algorithm, custom metrics, KEDA, and tuning
- Vertical Pod Autoscaler and Right-Sizing — Recommendation mode, in-place resize, and resource tuning
- Node Scaling: Cluster Autoscaler and Karpenter — How nodes scale, Karpenter’s architecture, and consolidation
- Resource Tuning Deep Dive — CPU throttling, memory cgroups, NUMA, and overcommitment
Part 7: Multi-Cluster and Platform Engineering
Operating Kubernetes at organizational scale.
- Multi-Cluster Strategies — Federation, GitOps-driven, service mesh, and Cluster API
- Building Internal Developer Platforms — Backstage, the platform stack, and reducing cognitive load
- Crossplane: Infrastructure as CRDs — Managing cloud resources through Kubernetes
- Multi-Tenancy — Namespace isolation, virtual clusters, and tenant boundaries
Part 8: Advanced Topics
Deep dives for infrastructure engineers.
- Writing Controllers and Operators — controller-runtime, Kubebuilder, and the Reconcile pattern
- The Kubernetes API Internals — Aggregation, admission webhooks, API priority and fairness
- etcd Operations — Backup, restore, compaction, monitoring, and disaster recovery
- GPU Workloads and AI/ML on Kubernetes — Device plugins, DRA, GPU sharing, distributed training
- Running LLMs on Kubernetes — vLLM, TGI, KServe, multi-node inference, and model serving
- Disaster Recovery — Cluster backup, etcd snapshots, multi-region strategies
- Cost Optimization — Right-sizing, spot instances, Kubecost, and chargeback
- Observability with OpenTelemetry — Metrics, logs, traces, and the OTel Collector
How to Read This
Part 1 is the intellectual foundation. Read it first.
Part 2 fills in the historical context of the tooling. Read it after Part 1.
Part 3 is hands-on. Reference it as you work through your own cluster.
Parts 4-5 cover stateful workloads and security — essential for running real production systems.
Part 6 covers scaling — read it when your workloads need to handle real load.
Part 7 is for when you’re operating multiple clusters or building a platform team.
Part 8 is deep reference material. Read chapters as needed. The GPU/ML chapters (41-42) are especially relevant for AI infrastructure teams.
If you only have time for one chapter from each part:
- Part 1: Architecture from First Principles
- Part 2: The Container Runtime Wars
- Part 3: Debugging Kubernetes
- Part 4: StatefulSets Deep Dive
- Part 5: RBAC from First Principles
- Part 6: Node Scaling: Cluster Autoscaler and Karpenter
- Part 7: Building Internal Developer Platforms
- Part 8: GPU Workloads and AI/ML on Kubernetes
Appendices
- Appendix A: Glossary — Quick-reference definitions for 100+ Kubernetes terms
- Appendix B: Mental Models — Visual diagrams showing how concepts in each part connect
- Appendix C: Decision Trees — Flowcharts for choosing workload types, storage, networking, and tools
- Appendix D: Troubleshooting Quick Reference — Error messages mapped to root causes and fixes
- Appendix E: Architecture Evolution Timeline — How the Kubernetes ecosystem evolved from 2013 to today
Companion Material
- install.sh — The bootstrap script we built to provision Kubernetes nodes on EC2
- Colophon — How this book was made, the prompts used, and accuracy notes
Chapter 1: The Road to Kubernetes
```mermaid
---
config:
  flowchart:
    nodeSpacing: 15
    rankSpacing: 30
---
flowchart LR
    subgraph industry ["Industry Timeline"]
        bm["Bare Metal<br>1990s"] --> vm["Virtualization<br>VMware, Xen, KVM<br>2000s"]
        vm --> cloud["AWS EC2<br>Cloud era<br>2006"]
        cloud --> cm["Chef / Puppet<br>2009"]
        cm --> docker["Docker<br>2013"]
        docker --> k8s["Kubernetes<br>2014"]
        k8s --> cncf["CNCF + Managed K8s<br>2015+"]
    end
    subgraph google ["Inside Google"]
        borg["Borg<br>2003"] --> cg["cgroups +<br>namespaces<br>2006–08"]
        cg --> omega["Omega<br>2011"]
        omega --> k8sg["Kubernetes<br>2014"]
    end
    k8sg -.->|"open-sourced<br>as"| k8s
```
The Bare Metal Era: One Application, One Server
In the earliest days of server computing, one application ran on one physical server — simple, isolated, but catastrophically wasteful. Most servers ran at 5-15% average CPU utilization. Organizations maintained vast fleets of underutilized machines, each dedicated to a single workload, each requiring its own power, cooling, network connectivity, and physical maintenance.
The fundamental problem was resource fragmentation. You could not easily share a physical machine between two applications because there was no reliable mechanism to prevent one application from consuming all available CPU, memory, or disk I/O and starving the other. Operating system process isolation was insufficient: processes could interfere with each other through shared filesystems, port conflicts, library version conflicts, and resource exhaustion. The result was an era of enormous waste, where the primary cost driver was not compute but rather the operational overhead of managing vast numbers of barely-utilized machines.
The Virtualization Revolution: Abstracting Hardware
Virtualization, pioneered commercially by VMware in the late 1990s and later commoditized by Xen, KVM, and cloud providers like Amazon Web Services, represented the first fundamental shift. By inserting a hypervisor between the hardware and the operating system, virtualization allowed multiple isolated virtual machines to share a single physical host. Each VM got its own kernel, its own filesystem, its own network stack — complete isolation without dedicated hardware.
This solved the resource fragmentation problem at a macro level. You could now pack multiple workloads onto a single physical machine with strong isolation guarantees. Cloud computing emerged from this capability: Amazon Web Services launched EC2 in 2006, offering on-demand virtual machines that could be provisioned in minutes rather than the weeks required to procure and rack physical servers.
But virtualization introduced its own problems. Virtual machines were heavy: each carried a full operating system kernel, consuming hundreds of megabytes of RAM just for the OS overhead. Boot times were measured in minutes. VM images were large and slow to transfer. The hypervisor itself consumed resources. And while VMs solved the isolation problem, they did not solve the management problem. With hundreds or thousands of VMs, organizations still needed to answer fundamental questions: which workload runs where? How do you update an application across fifty VMs without downtime? How do you recover when a VM’s host machine fails? How do you ensure that a critical application always has enough resources?
The Configuration Management Interlude: Puppet, Chef, Ansible
The late 2000s and early 2010s saw the rise of configuration management tools — Puppet (2005), Chef (2009), Ansible (2012), SaltStack (2011). These tools addressed the management problem by allowing operators to describe the desired state of a server (which packages should be installed, which services should be running, which configuration files should be present) and then converge the actual state toward that desired state.
This was a crucial intellectual contribution that directly influenced Kubernetes: the desired state model. Instead of writing imperative scripts that said “install package X, then start service Y, then modify file Z,” configuration management tools let you declare “package X should be present, service Y should be running, file Z should contain these contents” and let the tool figure out how to get there. This declarative approach was more robust because it was idempotent — you could run the tool multiple times and get the same result, regardless of the starting state.
But configuration management tools operated at the wrong level of abstraction for the emerging world of containerized microservices. They could ensure that a particular server had the right software installed, but they could not easily reason about a distributed application that spanned dozens of servers, needed to be updated without downtime, and had to automatically recover from server failures. The unit of management was the machine, not the application.
The Container Revolution: Docker and the Shipping Container Metaphor
Containers were not new when Docker launched in 2013. The underlying Linux kernel features — cgroups (for resource limits) and namespaces (for isolation) — had been maturing in the kernel for years: namespaces arrived incrementally starting with mount namespaces in 2002, and cgroups merged in v2.6.24 in early 2008. Google had been using containers internally since at least 2004, running everything from web search to Gmail inside Linux containers managed by their Borg system. FreeBSD had jails since 2000. Solaris had zones since 2004.
What Docker did was make containers accessible. It provided a simple command-line interface, a standardized image format (the Dockerfile and layered filesystem), and a distribution mechanism (Docker Hub). For the first time, a developer could package an application and all its dependencies into a single artifact, push it to a registry, and run it identically on any Linux machine. The shipping container metaphor was apt: just as standardized shipping containers revolutionized global trade by providing a uniform interface between ships, trains, and trucks, Docker containers provided a uniform interface between development, testing, and production.
Containers had profound advantages over VMs for application deployment:
- Lightweight: Containers shared the host kernel, eliminating the OS overhead of VMs. A container image might be tens of megabytes instead of gigabytes.
- Fast startup: Containers started in milliseconds to seconds, not minutes.
- Density: You could run dozens or hundreds of containers on a single host, compared to perhaps a dozen VMs.
- Reproducibility: The container image was immutable. The same image ran identically everywhere.
- Composability: Complex applications could be decomposed into multiple containers, each with a single responsibility.
But Docker, by itself, solved only the packaging and isolation problem. It told you nothing about how to run containers at scale across a fleet of machines. If you had one hundred machines and one thousand containers, Docker could not tell you which container should run on which machine, what to do when a machine failed, how to route network traffic to the right container, or how to update a running application without downtime. This was the orchestration problem.
Google’s Borg: The Secret Precursor
To understand why Kubernetes looks the way it does, you must understand Google’s Borg system. Published in a landmark 2015 EuroSys paper (“Large-scale cluster management at Google with Borg” by Verma et al.), Borg had been running inside Google since at least 2003-2004. It managed virtually everything Google ran: web search, Gmail, YouTube, Maps, BigTable, MapReduce — hundreds of thousands of jobs across tens of thousands of machines in each of dozens of clusters.
Borg introduced several concepts that directly shaped Kubernetes:
1. The declarative job specification. In Borg, users did not tell the system to “start a process on machine X.” They declared a job specification: “I need 100 instances of this binary, each with 2 GB of RAM and 0.5 CPU cores, and they should be spread across failure domains.” Borg figured out where to place them, and if instances died, Borg automatically restarted them. This declarative model — describe what you want, not how to get it — became the philosophical foundation of Kubernetes.
2. Bin packing and resource management. Borg treated a cluster of machines as a single pool of resources. Its scheduler solved a variant of the bin packing problem: given a set of tasks with resource requirements and a set of machines with resource capacities, place tasks on machines to maximize utilization while respecting constraints (failure domain isolation, hardware requirements, etc.). Borg achieved remarkably high utilization — published figures suggest 60-70% average CPU utilization across Google’s fleet, compared to the 5-15% typical of enterprise data centers.
3. Service discovery via naming. Borg provided a built-in naming service (BNS) that allowed tasks to find each other by name rather than by IP address and port. This was essential in an environment where tasks were constantly being started, stopped, and moved between machines.
4. Allocs and resource reservations. Borg introduced the concept of “allocs” — reserved resources on a machine that could be filled with tasks. This concept directly inspired the Kubernetes Pod: a group of containers that share resources and are co-scheduled on the same machine.
Google’s Omega: The Research System
Borg was a production system, evolved over a decade, carrying enormous technical debt. In 2011-2013, Google built Omega as a research project to explore alternative cluster management architectures. Omega’s key contribution was its approach to scheduling: instead of Borg’s monolithic scheduler, Omega used optimistic concurrency control with a shared state model. Multiple schedulers could operate in parallel, each reading the full cluster state, making scheduling decisions, and then atomically committing those decisions. If two schedulers made conflicting decisions, one would detect the conflict and retry.
This shared-state, optimistic-concurrency approach influenced Kubernetes’ design in a critical way: it demonstrated that you could have multiple independent controllers operating on shared state, each making progress independently, with conflicts resolved through mechanisms like resource versioning. This is exactly the model that Kubernetes uses for its controllers.
The Birth of Kubernetes: 2014
Kubernetes was born in mid-2014 at Google, created by Joe Beda, Brendan Burns, and Craig McLuckie, with significant contributions from Brian Grant, Tim Hockin, and many others. It was explicitly designed to be an open-source, vendor-neutral system that embodied the lessons of Borg and Omega without carrying their technical debt.
The founders made a crucial strategic decision: rather than open-sourcing Borg itself, they built a clean-room redesign intended to run anywhere. This meant:
- Using standard open-source components (etcd for storage, instead of Google’s proprietary Chubby/Colossus)
- Supporting multiple container runtimes (not just Google’s internal runtime)
- Designing for extensibility from the start (a pluggable API that later evolved into ThirdPartyResources and then CRDs, custom controllers, pluggable networking)
- Making the system portable across cloud providers and on-premises environments
Kubernetes was donated to the newly formed Cloud Native Computing Foundation (CNCF) in 2015, ensuring its governance was independent of any single company. This was a masterstroke of ecosystem building: by making Kubernetes vendor-neutral, Google ensured that every major cloud provider (AWS, Azure, GCP) would offer managed Kubernetes services, creating a de facto standard that benefited everyone — including Google, whose cloud platform was smaller than AWS but whose expertise in running Kubernetes was unmatched.
The Borg Lineage: Kubernetes (Greek: steersman or pilot) was reportedly codenamed “Project Seven” — a reference to Seven of Nine from Star Trek, a Borg who became an individual. The name is a deliberate allusion to Kubernetes’ origins in Google’s Borg system, while signaling that it had been liberated from Google’s proprietary infrastructure to become something independent.
Common Mistakes and Misconceptions
- “Kubernetes is just Docker orchestration.” Kubernetes is a general-purpose container orchestrator. Docker is one of many supported container runtimes (and modern K8s clusters typically use containerd, not Docker). Kubernetes predates most Docker-native orchestration features and was designed to be runtime-agnostic from the start.
- “Google open-sourced Borg.” Borg is still an internal Google system and has never been released. Kubernetes is a clean-room redesign inspired by Borg’s lessons and design principles, not a port or fork of Borg. The two systems share no code.
- “Kubernetes was the first container orchestrator.” Apache Mesos with Marathon, Fleet, and Docker Swarm all predated or emerged alongside Kubernetes. K8s won the orchestration wars not by being first but through superior API design, extensibility, and community governance.
For a visual overview of how Part 1’s concepts connect, see Appendix B: Mental Models.
Further Reading
- Large-scale cluster management at Google with Borg – The original 2015 Borg paper from EuroSys, detailing how Google manages billions of containers across its fleet.
- Omega: flexible, scalable schedulers for large compute clusters – The 2013 Omega paper introducing shared-state scheduling with optimistic concurrency control.
- Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade – A 2016 ACM Queue retrospective by Burns, Grant, Oppenheimer, Brewer, and Wilkes tracing the lineage from Borg through Omega to Kubernetes.
- Kubernetes first commit on GitHub (June 2014) – The initial commit that started the open-source project, useful for understanding the original scope and design intent.
- Brendan Burns – “The Illustrated Children’s Guide to Kubernetes” (CNCF) — A surprisingly effective introduction to Kubernetes concepts through visual storytelling, by one of its co-founders.
- Google: Borg – The Predecessor to Kubernetes (Google Cloud Blog) – Google’s own account of how internal cluster management evolved into a public project.
- Solomon Hykes announces Docker (PyCon 2013 lightning talk) – The five-minute lightning talk that introduced Docker to the world and catalyzed the container ecosystem that Kubernetes was built to orchestrate.
Next: The Problems Kubernetes Solves
Chapter 2: The Problems Kubernetes Solves
Kubernetes exists because running containerized applications at scale presents a set of interrelated problems that no individual tool solves.
The Bin Packing Problem
At its most fundamental, Kubernetes solves a resource allocation problem. You have N machines, each with some amount of CPU, memory, and other resources. You have M workloads, each requiring some amount of those resources. How do you assign workloads to machines to maximize utilization while respecting constraints?
This is a variant of the NP-hard bin packing problem. In the general case, finding the optimal solution is computationally intractable. But good heuristics exist, and Kubernetes’ scheduler implements several of them. The key insight is that centralized, automated scheduling dramatically outperforms human scheduling. When humans decide where to place workloads, they tend to be:
- Conservative — over-provisioning resources to avoid contention
- Forgetful — leaving old workloads running on machines long after they should have been decommissioned
- Inconsistent — different operators making different decisions for similar workloads
Borg’s experience demonstrated that automated bin packing could improve cluster utilization from the 5-15% typical of manually managed environments to 60-70%. Even modest improvements in utilization translate to enormous cost savings at scale: Google’s fleet comprises millions of machines, so a 1% improvement in utilization saves tens of thousands of servers.
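To make the heuristic idea concrete, here is a minimal sketch of first-fit decreasing, a classic bin packing heuristic. This is illustrative Python, not scheduler code — the real kube-scheduler uses a much richer pipeline of filter and score plugins — but the core move is the same family of idea: place the largest workloads first, onto the first node with room.

```python
def first_fit_decreasing(workloads, node_capacity):
    """Assign workload sizes to fixed-capacity nodes.

    Returns a list of nodes, each a list of the workloads placed on it.
    Not optimal (bin packing is NP-hard), but a good fast heuristic.
    """
    nodes = []
    for w in sorted(workloads, reverse=True):   # largest first
        for node in nodes:
            if sum(node) + w <= node_capacity:  # first node with room
                node.append(w)
                break
        else:
            nodes.append([w])                   # nothing fits: add a node

    return nodes

# Seven workloads (CPU cores requested), nodes with 10 cores each.
placement = first_fit_decreasing([5, 7, 5, 2, 4, 2, 5], node_capacity=10)
print(len(placement), "nodes used:", placement)
```

Note that the heuristic is not guaranteed optimal — the example above uses four nodes where three would suffice — but it runs in near-linear time, which matters when scheduling thousands of pods per minute.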
The Service Discovery Problem
In a static world, you can configure your web frontend to talk to your database at a known IP address and port. But in a dynamic, containerized world, nothing has a stable address. Containers are created and destroyed constantly. They are moved between machines when hosts fail or when the scheduler finds a more efficient placement. The set of containers backing a particular service changes every time a deployment rolls out.
This creates the service discovery problem: how does one service find and communicate with another in an environment where addresses are constantly changing? There are several classic approaches:
- DNS-based discovery: Register service instances in DNS, and clients look up the DNS name. Simple, but DNS has caching and TTL issues that make it slow to reflect changes.
- Client-side registries: Services register themselves with a central registry (like ZooKeeper or Consul), and clients query the registry. Flexible, but requires every service to include registry client code.
- Load balancer-based discovery: A load balancer sits in front of service instances, and clients talk to the load balancer’s stable address. Simple for clients, but adds latency and a single point of failure.
Kubernetes provides service discovery as a first-class primitive through its Service abstraction. A Service has a stable IP address (the ClusterIP) and DNS name. The Kubernetes control plane automatically updates the set of endpoints (pod IP addresses) behind a Service as pods come and go. This is implemented transparently by kube-proxy (or the CNI plugin), which programs iptables/IPVS rules on every node to redirect traffic addressed to a Service’s ClusterIP to one of its backing pods.
The Rolling Deployment Problem
Updating a running application without downtime is one of the hardest problems in operations. The naive approach — stop all old instances, start all new instances — causes downtime proportional to the startup time of the new instances. In a microservices architecture with hundreds of services, even brief downtime cascades into widespread failures.
The rolling deployment strategy addresses this by incrementally replacing old instances with new ones: start one new instance, wait for it to become healthy, then stop one old instance, and repeat. This maintains capacity throughout the update. But implementing rolling deployments correctly requires solving several sub-problems:
- Health checking: How do you know when a new instance is ready to serve traffic? Kubernetes provides readiness probes and liveness probes.
- Traffic draining: How do you gracefully stop sending traffic to an instance before terminating it? Kubernetes provides graceful shutdown periods and endpoint management.
- Rollback: If the new version is broken, how do you quickly revert? Kubernetes maintains revision history and supports automatic rollback on failure.
- Surge and unavailability budgets: How many extra instances can you run during the update (surge), and how many instances can be unavailable at once? Kubernetes’ Deployment controller supports configurable maxSurge and maxUnavailable parameters.
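The interplay of surge and unavailability budgets is easiest to see in a small simulation. This is an illustrative sketch, not the Deployment controller, and it assumes new pods pass their readiness probes immediately. Two invariants hold at every step: total pods never exceed desired + maxSurge, and total pods never fall below desired − maxUnavailable. (As in Kubernetes itself, at least one of the two parameters must be nonzero, or the rollout cannot make progress.)

```python
def rolling_update(desired, max_surge, max_unavailable):
    """Simulate a rollout; return the (old, new) pod counts per step."""
    old, new = desired, 0
    steps = []
    while old > 0 or new < desired:
        if old + new < desired + max_surge and new < desired:
            new += 1                 # surge budget allows a new pod
        else:
            old -= 1                 # drain and stop an old pod
        # The two budget invariants, checked at every step:
        assert old + new <= desired + max_surge
        assert old + new >= desired - max_unavailable
        steps.append((old, new))
    return steps

# Zero-downtime rollout: surge by one, never dip below desired capacity.
print(rolling_update(desired=3, max_surge=1, max_unavailable=0))
```

With `max_surge=1, max_unavailable=0` the rollout needs one extra pod's worth of headroom but never loses capacity; with `max_surge=0` it trades temporary capacity loss for zero extra resource usage. That is exactly the trade-off the two Deployment parameters expose.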
The Self-Healing Problem
In any sufficiently large system, failures are not exceptions — they are the normal operating condition. Machines crash, networks partition, disks fill up, processes crash, memory leaks accumulate. Google’s published data suggests that in a cluster of 10,000 machines, several will fail every day.
The self-healing problem is: how do you build a system that automatically detects and recovers from failures without human intervention? Kubernetes addresses this at multiple levels:
- Container restart: If a container process crashes, the kubelet automatically restarts it, with exponential backoff to avoid restart storms.
- Pod health monitoring: Liveness probes detect when a container is running but unhealthy (e.g., deadlocked). Kubernetes kills and restarts unhealthy containers.
- Node failure detection: The control plane detects when a node stops reporting (via the node controller watching heartbeats) and automatically reschedules its pods onto healthy nodes.
- Replica maintenance: If a Deployment specifies 3 replicas and one pod dies, the Deployment controller automatically creates a replacement.
The key insight is that self-healing requires a control loop: continuously compare the actual state of the system to the desired state, and take action to reconcile any differences. This is the reconciliation loop — the central architectural pattern of Kubernetes, used by every controller from the scheduler to the kubelet.
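The reconciliation pattern fits in a few lines. The sketch below is illustrative Python, not controller code: one pass compares desired state to actual state and returns the actions needed to converge. Note the built-in properties — a crashed pod is replaced on the next pass (self-healing), and re-running the loop once converged is a no-op (idempotency).

```python
def reconcile(desired_replicas, actual_pods):
    """One pass of the control loop: diff desired vs. actual state."""
    diff = desired_replicas - len(actual_pods)
    if diff > 0:
        return [("create", f"pod-{i}") for i in range(diff)]
    if diff < 0:
        return [("delete", pod) for pod in actual_pods[diff:]]
    return []                        # converged: nothing to do

pods = ["pod-a", "pod-b", "pod-c"]
pods.pop()                           # a pod crashes
print(reconcile(3, pods))            # one "create" action: self-healing
print(reconcile(3, ["pod-a", "pod-b", "pod-c"]))  # [] -- converged
```

A real controller runs this loop continuously, triggered by watch events and periodic resyncs, but the logic of every Kubernetes controller — from the Deployment controller to the kubelet — reduces to this diff-and-act shape.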
The Desired State Model vs. Imperative Commands
Perhaps the most important conceptual contribution of Kubernetes is its commitment to the desired state (declarative) model over the imperative model.
In an imperative model, you issue commands: “start 3 instances of nginx,” “stop instance X,” “scale up to 5 instances.” The system executes each command as a one-shot action. If the command fails, or if the system state drifts after the command succeeds, the system does not automatically correct itself. The operator must detect the drift and issue corrective commands.
In a declarative model, you declare the desired state: “there should be 3 instances of nginx running.” The system continuously works to make reality match this declaration. If an instance crashes, the system automatically creates a replacement. If an extra instance somehow appears, the system terminates it. If the declaration changes to “5 instances,” the system creates 2 more.
The declarative model is fundamentally more robust because:
- It is self-correcting. The system continuously reconciles actual state toward desired state, handling failures automatically.
- It is idempotent. Applying the same desired state declaration multiple times has the same effect as applying it once.
- It separates intent from execution. The user says what they want; the system decides how to achieve it.
- It enables auditability. The desired state is a document (YAML or JSON) that can be version-controlled, reviewed, and diffed.
- It enables composition. Multiple controllers can independently reconcile different aspects of the desired state, each responsible for a single concern.
This is not merely a philosophical preference. The imperative model breaks down catastrophically in distributed systems where commands can be lost, duplicated, or reordered. The declarative model, by contrast, is eventually consistent by design: no matter what transient failures occur, the system will eventually converge to the desired state.
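A toy contrast makes the failure mode concrete (illustrative Python; `scale_up` and `apply` are invented names for the example). A relative command duplicated on the wire compounds the error; an absolute desired state re-applied is harmless.

```python
state = {"replicas": 3}

def scale_up(n):
    """Imperative: a relative command. Replaying it changes the result."""
    state["replicas"] += n

def apply(desired):
    """Declarative: an absolute desired state. Replaying it is a no-op."""
    state["replicas"] = desired

scale_up(2)
scale_up(2)                # command duplicated in transit -> wrong result
print(state["replicas"])   # 7, not the intended 5

apply(5)
apply(5)                   # apply duplicated -> still correct
print(state["replicas"])   # 5
```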
Imperative vs. Declarative: A Comparison
| Dimension | Imperative Model | Declarative (Kubernetes) Model |
|---|---|---|
| User action | Issue commands: “start X”, “stop Y” | Declare desired state: “there should be 3 of X” |
| Failure recovery | Manual: operator detects drift and issues corrections | Automatic: reconciliation loop continuously corrects drift |
| Idempotency | Commands may not be safe to replay | Applying same state is always safe to repeat |
| State visibility | State is the cumulative effect of past commands | State is a document that can be inspected, diffed, versioned |
| Scalability | Requires operator attention proportional to scale | Controller workload scales, but operator intent stays constant |
| Composition | Commands must be carefully ordered | Controllers reconcile independently and concurrently |
Common Mistakes and Misconceptions
- “Kubernetes is only for microservices.” Kubernetes runs monoliths, batch jobs, stateful workloads, ML training pipelines, and more. Its primitives (Deployments, Jobs, StatefulSets, DaemonSets) are designed for a wide variety of workload patterns, not just microservices.
- “I need Kubernetes for my small application.” For a single service with low traffic, a VM or platform-as-a-service (Heroku, Cloud Run, App Engine) is simpler, cheaper, and faster to operate. Kubernetes’ value emerges when you have the scaling, scheduling, and self-healing problems described in this chapter.
- “Kubernetes replaces your CI/CD pipeline.” Kubernetes is a runtime platform, not a build or deploy tool. You still need a CI/CD system (GitHub Actions, Jenkins, ArgoCD, etc.) to build images, run tests, and push manifests. Kubernetes runs what your pipeline delivers.
Further Reading
- The Twelve-Factor App – Adam Wiggins’ methodology for building software-as-a-service applications. Kubernetes’ design embodies many of these factors, particularly port binding, concurrency via process model, disposability, and dev/prod parity.
- Martin Fowler – Microservices: a definition of this new architectural term – The canonical articulation of the microservices architecture pattern and the operational challenges (deployment, monitoring, failure handling) that Kubernetes was designed to solve.
- Netflix TechBlog – Chaos Engineering – Netflix’s pioneering work on deliberately injecting failures to test system resilience, directly motivating the self-healing and fault-tolerance properties Kubernetes provides.
- Google SRE Book – Chapter 7: The Evolution of Automation at Google – How Google evolved from manual operations to fully autonomous systems, providing context for why Kubernetes automates deployment, scaling, and recovery.
- Kelsey Hightower – “Kubernetes for Sysadmins” (PuppetConf 2016) – A practical demonstration of the problems Kubernetes solves for operations teams, from a leading practitioner.
- Google SRE Book – Chapter 24: Distributed Periodic Scheduling with Cron – How distributed scheduling of workloads works in practice at scale, illustrating the bin packing and scheduling challenges Kubernetes addresses.
- James Lewis & Martin Fowler – Microservices Trade-Offs – A balanced look at the costs and benefits of microservices, helping contextualize which problems Kubernetes genuinely solves versus which it shifts.
Next: Architecture from First Principles
Chapter 3: Architecture from First Principles
The Big Picture
flowchart TD
subgraph CP["Control Plane"]
etcd["etcd<br>(state)"]
API["API Server<br>(gateway)"]
CM["kube-controller-manager<br>(reconciliation)"]
SCHED["kube-scheduler<br>(placement)"]
etcd <--> API
API <--> CM
API <--> SCHED
end
subgraph N1["Node 1"]
kubelet1["kubelet"] --> containerd1["containerd"] --> pods1["Pod Pod"]
kubeproxy1["kube-proxy"]
end
subgraph N2["Node 2"]
kubelet2["kubelet"] --> containerd2["containerd"] --> pods2["Pod Pod"]
kubeproxy2["kube-proxy"]
end
subgraph N3["Node 3"]
kubelet3["kubelet"] --> containerd3["containerd"] --> pods3["Pod Pod"]
kubeproxy3["kube-proxy"]
end
API --> kubelet1
API --> kubelet2
API --> kubelet3
Every arrow to or from the control plane passes through the API Server; components never communicate with each other directly. This is the single most important architectural constraint.
If you encounter unfamiliar terms in this chapter, see Appendix A: Glossary for quick definitions.
Why etcd? The Case for a Consistent, Distributed Key-Value Store
Kubernetes needs to store the desired state of the entire cluster: every pod specification, every service definition, every configuration map. This state must be consistent (all readers see the same data), durable (data survives machine failures), and available (the store can be read and written to even when some machines fail).
The CAP theorem forces a choice between consistency and availability (since network partitions are inevitable), and Kubernetes chose consistency. This is the right choice for a cluster management system: it is better to temporarily refuse writes than to allow conflicting writes that could result in two different controllers making contradictory scheduling decisions.
etcd implements the Raft consensus algorithm, which provides strong consistency (linearizability) across a cluster of typically 3 or 5 nodes. Every write must be acknowledged by a majority (a quorum) of nodes before it is committed: a 3-node cluster needs 2 acknowledgments and tolerates 1 failure; a 5-node cluster needs 3 and tolerates 2. This is why etcd clusters always use an odd number of nodes — a 4-node cluster still requires 3 for quorum, giving no additional fault tolerance over 3 nodes.
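The quorum arithmetic is worth checking for yourself. This is plain Python, not anything from etcd's API — just the majority rule stated as code:

```python
def quorum(members: int) -> int:
    """Minimum acknowledgments Raft needs before a write commits."""
    return members // 2 + 1

def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster retains quorum."""
    return members - quorum(members)

for n in (3, 4, 5):
    print(n, quorum(n), fault_tolerance(n))
# 3 -> quorum 2, tolerates 1
# 4 -> quorum 3, tolerates 1  (no better than 3 members)
# 5 -> quorum 3, tolerates 2
```

The 4-node row is the whole argument for odd cluster sizes: an extra member raises the quorum requirement without raising fault tolerance.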
Why etcd specifically, rather than ZooKeeper, Consul, or a relational database?
- ZooKeeper was the incumbent choice (used by Hadoop, Kafka, and many other systems). But ZooKeeper has a complex session-based model, a limited data model (tree of znodes with size limits), and a Java-based implementation that was harder to embed. etcd offered a simpler HTTP/gRPC API, a more flexible key-value model, and was written in Go (matching Kubernetes’ language).
- Consul was not yet mature when Kubernetes was designed.
- Relational databases provide strong consistency but are harder to operate in a distributed, fault-tolerant configuration. etcd’s Raft-based replication is simpler to reason about and deploy than MySQL/PostgreSQL with synchronous replication.
Critically, etcd provides a watch mechanism: clients can subscribe to changes on a key or key prefix and receive notifications when the data changes. This is the mechanism that powers Kubernetes’ reconciliation loops. Controllers do not poll etcd; they watch for changes and react to them. This makes the system event-driven and efficient.
Why a Single API Server? The Chokepoint That Enables Everything
All access to the Kubernetes cluster state — every read, every write, from every component — goes through the kube-apiserver. This seems like a bottleneck, and indeed it is a deliberate chokepoint. Why?
1. Authentication and authorization. The API server is the single enforcement point for access control. Every request is authenticated (who is making this request?) and authorized (is this identity allowed to perform this action on this resource?). Having a single enforcement point is a fundamental security principle: it eliminates the risk of inconsistent access control across multiple entry points.
2. Validation. The API server validates every object before storing it in etcd. This ensures that invalid state never enters the system. Validation includes schema validation (does this object have the right fields?), semantic validation (does this pod specification reference an existing service account?), and admission control (do custom policies allow this object?).
3. Admission control. The API server supports admission webhooks — external services that can examine, modify, or reject API requests. This enables powerful policy enforcement: injecting sidecar containers, enforcing naming conventions, requiring resource limits, preventing privilege escalation. The single-API-server model makes this possible because all mutations flow through one point.
4. Watch multiplexing. The API server multiplexes watch connections. Hundreds of controllers and kubelets watch for changes to different resources, and the API server efficiently fans out notifications from etcd changes. Without the API server as intermediary, every client would need a direct connection to etcd, which would not scale.
5. API versioning and conversion. The API server handles conversion between different API versions. An object stored as apps/v1 can be read as apps/v1beta1 (with appropriate conversion). This enables gradual API evolution without breaking clients.
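That conversion step can be sketched as a toy. The field names below are illustrative, not the exact historical apps/v1beta1 schema; the real API server converts every request through an internal hub version using generated (or webhook-provided) conversion functions:

```python
# Toy API version conversion sketch (assumed field shapes, not the real
# Deployment schema). Illustrates the idea: one stored object, readable
# at multiple versions via pure conversion functions.
def v1beta1_to_v1(obj):
    """Convert a hypothetical apps/v1beta1 Deployment to apps/v1 shape."""
    spec = dict(obj["spec"])
    # apps/v1 made spec.selector required; older versions defaulted it
    # from the pod template's labels.
    if "selector" not in spec:
        spec["selector"] = {"matchLabels": spec["template"]["metadata"]["labels"]}
    return dict(obj, apiVersion="apps/v1", spec=spec)

old = {"apiVersion": "apps/v1beta1", "kind": "Deployment",
       "spec": {"replicas": 3,
                "template": {"metadata": {"labels": {"app": "web"}}}}}
new = v1beta1_to_v1(old)
print(new["spec"]["selector"])  # {'matchLabels': {'app': 'web'}}
```

Because conversion is a pure function of the stored object, a client asking for an old version never forces the server to store the object twice.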
The API server is designed to be horizontally scalable. You can run multiple instances behind a load balancer. Each instance is stateless — all state is in etcd. This means the single logical API server does not become a single point of failure in practice.
The Controller Pattern: Reconciliation as Architecture
The controller pattern is the heart of Kubernetes’ architecture. A controller is a loop that:
- Observes the current state of the world (by watching the API server)
- Compares the current state to the desired state (as expressed in API objects)
- Takes action to move the current state toward the desired state
- Repeats indefinitely
This is sometimes called the reconciliation loop or the observe-diff-act pattern. It is borrowed from control theory, where it is known as a closed-loop controller. The thermostat in your house is a simple example: it observes the current temperature, compares it to the desired temperature, and turns the heater on or off.
Kubernetes runs dozens of controllers, each responsible for a specific aspect of the system:
- The Deployment controller watches Deployment objects and ensures the right number of ReplicaSets exist with the right template.
- The ReplicaSet controller watches ReplicaSet objects and ensures the right number of Pods exist.
- The Node controller watches nodes and detects failures.
- The Service controller watches Service objects and updates endpoints.
- The Job controller watches Job objects and creates Pods to run tasks to completion.
- The Endpoint controller watches Services and Pods to maintain the mapping between them.
flowchart TD
OBSERVE["OBSERVE<br>current state"] -->|Compare actual<br>vs desired| DIFF["DIFF<br>actual vs spec"]
DIFF -->|Create/delete/update<br>objects via API Server| ACT["ACT<br>to fix drift"]
ACT -->|repeat| OBSERVE
OBSERVE -.-|Watch API Server| API(("API<br>Server"))
ACT -.-|Write to API Server| API
The genius of this pattern is decomposition. Each controller handles exactly one concern. The Deployment controller knows nothing about nodes or networking; the ReplicaSet controller knows nothing about rolling updates. Each controller reads from and writes to the API server, and the API server provides the shared state that coordinates them.
This decomposition also provides fault tolerance. If the Deployment controller crashes and restarts, it simply reads the current state from the API server and resumes reconciling. No state is lost because the controller is stateless — all state is in the API server (and ultimately in etcd). This is why Kubernetes components can be restarted at any time without corruption.
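The observe-diff-act loop can be sketched in a few lines. This is a toy: a real controller observes through watches on the API server and acts by writing objects back, whereas here the “actual state” is just an in-memory list:

```python
# Toy reconciler for the observe/diff/act pattern. The diff between
# desired replicas and actual pods determines the actions; the loop
# never remembers anything between iterations.
def reconcile(spec, actual_pods):
    """Return the actions needed to converge actual state toward spec."""
    diff = spec["replicas"] - len(actual_pods)
    if diff > 0:                      # too few pods: create the difference
        return [("create", spec["name"])] * diff
    if diff < 0:                      # too many pods: delete the surplus
        return [("delete", name) for name in actual_pods[:-diff]]
    return []                         # converged: nothing to do

spec = {"name": "web", "replicas": 3}
print(reconcile(spec, ["web-abc12"]))         # two pods short: create two
print(reconcile(spec, ["a", "b", "c", "d"]))  # one pod over: delete one
```

Note that `reconcile` takes no history as input. Statelessness is not an optimization here; it is what makes crash-and-restart safe.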
The Watch Mechanism: Event-Driven Efficiency
Controllers need to know when things change. Polling — periodically reading all objects — is wasteful and slow, introducing latency proportional to the polling interval.
Kubernetes uses a watch mechanism instead. A controller opens a long-lived HTTP connection to the API server and says, “notify me of any changes to Deployment objects.” The API server, in turn, watches etcd for changes and fans out notifications to all watching clients. This is event-driven: controllers react to changes immediately rather than discovering them on a polling interval.
The watch mechanism is implemented using HTTP chunked transfer encoding (or gRPC streaming). Each change event includes the type of change (ADDED, MODIFIED, DELETED), the object’s new state, and a resource version — a logical clock that enables clients to resume a watch from where they left off after a disconnection.
To handle the case where a watch connection breaks and events are missed, Kubernetes controllers use a pattern called list and watch: on startup, the controller lists all objects of interest (establishing a baseline), notes the resource version, and then watches for changes from that version forward. The client-go library provides an Informer abstraction that implements this pattern, including a local cache of objects and an event handler interface.
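A toy model makes the list-and-watch handoff concrete. This sketch stands in for the real HTTP streaming protocol: the store's revision counter plays the role of the resource version, and `watch(since_rv)` plays the role of resuming a watch:

```python
# Toy list-and-watch. A client lists to get a baseline plus a resource
# version, then consumes only events newer than that version -- so
# nothing is missed between the list and the watch.
class Store:
    def __init__(self):
        self.rv = 0           # monotonically increasing logical clock
        self.objects = {}     # name -> rv of last write
        self.events = []      # (rv, type, name), in commit order

    def apply(self, type_, name):
        self.rv += 1
        if type_ == "DELETED":
            self.objects.pop(name, None)
        else:
            self.objects[name] = self.rv
        self.events.append((self.rv, type_, name))

    def list(self):
        return dict(self.objects), self.rv      # snapshot + version

    def watch(self, since_rv):
        return [e for e in self.events if e[0] > since_rv]

store = Store()
store.apply("ADDED", "pod-a")
baseline, rv = store.list()     # client builds its local cache
store.apply("ADDED", "pod-b")   # happens while the client is busy
missed = store.watch(rv)        # resume from the recorded version
print(missed)                   # [(2, 'ADDED', 'pod-b')]
```

The client-go Informer wraps exactly this handshake, plus the local cache and reconnection logic.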
The Scheduler: Separation of Concerns
The Kubernetes scheduler is a separate component from the API server and the controllers. Its job is simple to state but computationally intensive: when a new Pod is created without a node assignment, the scheduler selects the best node for it.
Why is the scheduler separate? Because scheduling is a fundamentally different concern from state management. The API server manages state; controllers reconcile state; the scheduler makes placement decisions. Separating these concerns allows each to evolve independently. You can replace the default scheduler with a custom one, or run multiple schedulers for different workload types, without modifying any other component.
The scheduler operates in two phases:
- Filtering: Eliminate nodes that cannot run the pod (insufficient resources, incompatible taints, missing node selectors, etc.).
- Scoring: Rank the remaining nodes by desirability (resource balance, affinity/anti-affinity, topology spread, etc.).
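The two phases can be sketched with a single, made-up scoring rule. The real scheduler composes many filter and score plugins; this toy uses only CPU and scores by the fraction of capacity left free after placement:

```python
# Toy two-phase scheduler: filter out nodes that cannot fit the pod,
# score the survivors, bind to the best. Scoring rule is illustrative.
def schedule(pod_cpu, nodes):
    feasible = [n for n in nodes if n["free_cpu"] >= pod_cpu]  # filtering
    if not feasible:
        return None  # no node fits: the pod stays Pending

    def score(n):  # scoring: prefer the node left least loaded
        return (n["free_cpu"] - pod_cpu) / n["capacity_cpu"]

    return max(feasible, key=score)["name"]

nodes = [
    {"name": "node-1", "capacity_cpu": 4, "free_cpu": 1},
    {"name": "node-2", "capacity_cpu": 4, "free_cpu": 3},
]
print(schedule(2, nodes))  # node-1 cannot fit 2 CPUs -> 'node-2'
```

Returning `None` mirrors real behavior: an unschedulable pod is not an error, just a pod that remains Pending until the cluster changes.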
The scheduler’s decision is recorded by binding the pod to a node — setting the spec.nodeName field in the Pod object via the API server. The kubelet on the target node watches for pods bound to it and starts the containers.
This design means the scheduler is advisory, not authoritative. It makes suggestions (by binding pods to nodes), but it does not directly start containers. If the kubelet cannot run the pod (perhaps the node’s available resources changed between scheduling and execution), the pod enters a failed state and the reconciliation loop handles it.
Kubelet: The Node Agent
The kubelet runs on every node in the cluster. It is the bridge between the Kubernetes control plane and the container runtime on the node. The kubelet:
- Watches the API server for pods assigned to its node
- Translates pod specifications into container runtime calls (via the Container Runtime Interface, CRI)
- Monitors container health (via liveness and readiness probes)
- Reports node status and pod status back to the API server
The kubelet is deliberately simple. It does not make scheduling decisions, manage networking, or handle service discovery. It is a single-responsibility agent that converts API state into running containers and reports back.
The kubelet’s design reflects a key Kubernetes principle: the control plane tells nodes what to do, not how to do it. The kubelet receives a PodSpec and is free to implement it however it wants, as long as the containers end up running with the specified resources and configuration. This abstraction is what allows Kubernetes to support multiple container runtimes (containerd, CRI-O, etc.) through the CRI interface.
Kube-Proxy: Transparent Service Networking
Kube-proxy runs on every node (or is replaced by equivalent CNI functionality) and implements the networking rules that make Services work. When a Service is created with a ClusterIP, kube-proxy programs iptables or IPVS rules on every node that intercept traffic destined for the ClusterIP and redirect it to one of the Service’s backing pods.
Why does kube-proxy run on every node instead of as a centralized load balancer? Because a centralized load balancer would be a bottleneck and a single point of failure. By distributing the load-balancing rules to every node, Kubernetes ensures that Service traffic takes the most direct path: a pod on Node A talking to a Service endpoint on Node B sends traffic directly from A to B, with no intermediary.
Kube-proxy watches the API server for Service and Endpoint changes and updates the local rules accordingly. It is another example of the controller pattern: observe desired state (Service definitions), observe actual state (current iptables rules), and reconcile.
The Controller Pattern is Kubernetes. If you understand only one thing about Kubernetes’ architecture, understand the controller pattern: observe, diff, act, repeat. Every component — from the scheduler to the kubelet to kube-proxy — is a controller that watches for state changes and reconciles actual state toward desired state. This single pattern, applied recursively across the entire system, is what makes Kubernetes self-healing, scalable, and extensible.
Common Mistakes and Misconceptions
- “The master node runs my workloads.” Control plane nodes are dedicated to cluster management (API server, scheduler, controllers, etcd). Workloads run on worker nodes by default. Running application pods on control plane nodes is possible but strongly discouraged in production because it risks destabilizing the control plane.
- “etcd is a general-purpose database.” etcd is optimized for small key-value metadata, with a default request size limit of roughly 1.5 MB per value. It is designed for storing cluster state, not application data. Treating it as an application database will lead to performance degradation and cluster instability.
- “If the API server goes down, my pods stop.” Running pods continue to execute on their nodes even if the API server is unavailable. The kubelet keeps containers running based on its last known state. What stops is your ability to make changes, schedule new pods, or observe cluster state through the API.
- “The scheduler continuously moves pods for better balance.” Pods are scheduled once and stay on their assigned node unless they are evicted, deleted, or the node fails. The scheduler does not rebalance running pods. If you need rebalancing, you must use tools like the Descheduler, which evicts pods so the scheduler can place them on better nodes.
Further Reading
- Kubernetes Official Documentation – Cluster Architecture – The authoritative reference for how control plane and node components fit together, including component-level diagrams.
- etcd Documentation – Documentation for the distributed key-value store at the heart of Kubernetes’ state management, covering Raft consensus, watch mechanics, and operational best practices.
- Kelsey Hightower – Kubernetes The Hard Way – A tutorial that walks through bootstrapping a Kubernetes cluster from scratch, component by component, providing deep understanding of how each piece interacts.
- Joe Beda – “The Road to More Usable Kubernetes” (KubeCon 2017) — Co-founder of Kubernetes discusses the design decisions and usability goals behind the system.
- Daniel Smith – “The Kubernetes Control Plane for Busy People Who Like Pictures” (KubeCon EU 2019) — Accessible visual walkthrough of how the control plane components interact.
- Kubernetes Documentation – Scheduler Performance Tuning – Details on the scheduler’s filtering and scoring phases, extension points, and how to tune scheduling behavior.
- Daniel Smith – “A Vision For API Machinery” (KubeCon 2018) — Google engineer on the architecture and future direction of the Kubernetes API server.
Next: The API Model
Chapter 4: The API Model — Declarative State and Reconciliation
Resources, Objects, Specs, and Status
The Kubernetes API is a resource-oriented API. Everything in Kubernetes — pods, services, deployments, config maps, custom resources — is a resource with a standard structure:
- apiVersion: The API group and version (e.g., apps/v1)
- kind: The type of resource (e.g., Deployment)
- metadata: Name, namespace, labels, annotations, resource version, creation timestamp, finalizers, owner references
- spec: The desired state — what the user wants
- status: The actual state — what the system has observed
This spec/status split is fundamental. The user writes spec; controllers write status. This separation of concerns means that:
- The user owns intent. Only the user (or their tooling) should modify spec. Controllers never modify spec.
- Controllers own reality. Controllers update status to reflect what they have observed and what actions they have taken.
- Reconciliation bridges the gap. The controller’s job is to make the real world match spec, and to report the real world in status.
The resource version field in metadata is a critical coordination mechanism. It is an opaque string (typically derived from etcd’s revision number) that changes every time the object is modified. When a client wants to update an object, it must include the current resource version. If another client has modified the object in the meantime, the resource version will have changed, and the update will fail with a 409 Conflict error. This is optimistic concurrency control: clients assume they can make updates without locking, and the system detects and rejects conflicting updates.
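Optimistic concurrency is easy to model in miniature. This toy store stands in for the API server: updates must carry the resource version they were read at, and a stale version gets the equivalent of a 409:

```python
# Toy optimistic-concurrency store. No locks: the losing writer is told
# to re-read and retry, exactly the contract the API server offers.
class ApiStore:
    def __init__(self, obj):
        self.obj = dict(obj, resourceVersion=1)

    def get(self):
        return dict(self.obj)

    def update(self, obj):
        if obj["resourceVersion"] != self.obj["resourceVersion"]:
            return 409  # conflict: object changed since this client read it
        self.obj = dict(obj, resourceVersion=self.obj["resourceVersion"] + 1)
        return 200

store = ApiStore({"replicas": 3})
a = store.get()          # two clients read the same version
b = store.get()
a["replicas"] = 5
print(store.update(a))   # 200: first writer wins, version bumps to 2
b["replicas"] = 1
print(store.update(b))   # 409: b's version is stale; re-read and retry
```

Controllers lean on this constantly: on a 409 they simply re-observe and reconcile again, which is why the conflict needs no special recovery path.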
Declarative YAML: Configuration as Data
Kubernetes objects are typically expressed as YAML (or JSON) documents. This is not incidental — it is a deliberate design choice with deep implications.
By representing desired state as data (YAML files) rather than code (imperative scripts), Kubernetes enables:
- Version control: YAML files can be committed to Git, creating a complete history of every change to the cluster’s desired state. This is the foundation of GitOps.
- Code review: Changes to infrastructure can be reviewed with the same tools and processes used for application code.
- Diffing: You can diff two versions of a deployment spec to see exactly what changed.
- Templating: Tools like Helm can generate YAML from templates, enabling parameterized deployments.
- Validation: YAML can be validated against schemas before being applied, catching errors before they reach the cluster.
- Dry runs: kubectl apply --dry-run=server sends the YAML to the API server for validation without actually creating resources.
The choice of YAML specifically (rather than JSON, TOML, or a custom DSL) was pragmatic: YAML is human-readable, supports comments (unlike JSON), and was already widely used in the DevOps community (Ansible, Docker Compose). Its verbosity has been criticized, but its universality is a significant advantage.
Reconciliation Loops: The Engine of Self-Healing
The reconciliation loop is the mechanism by which Kubernetes achieves its declarative guarantees.
Consider what happens when you apply a Deployment object:
flowchart TD
kubectl["kubectl apply"] --> API["API Server"]
API -->|store| etcd["etcd"]
API -->|watch| DC["Deployment Controller"]
DC -->|create| RS["ReplicaSet"]
API -->|watch| RSC["ReplicaSet Controller"]
RSC -->|create| Pods["Pod<br>Pod<br>Pod"]
API -->|watch| Scheduler["Scheduler"]
Scheduler -->|"bind pod to node"| Pods
API -->|watch| Kubelet["Kubelet (on node)<br>containerd<br>start container"]
Pods --> Kubelet
The following sequence diagram shows the temporal flow — notice that every component communicates only through the API server:
sequenceDiagram
participant User as User (kubectl)
participant API as API Server
participant etcd as etcd
participant DC as Deployment Controller
participant RSC as ReplicaSet Controller
participant Sched as Scheduler
participant KL as Kubelet (per node)
participant EC as Endpoint Controller
User->>API: POST /apis/apps/v1/deployments
API->>etcd: store Deployment
API-->>User: 201 Created
API-->>DC: watch: new Deployment
Note over DC: compare: 0 RS exist, need 1
DC->>API: create ReplicaSet
API->>etcd: store RS
API-->>RSC: watch: new RS
Note over RSC: compare: 0 pods, need 3
RSC->>API: create 3 Pods
API->>etcd: store Pods
API-->>Sched: watch: 3 unbound Pods
Note over Sched: assign nodeName per Pod
Sched->>API: update Pod .spec.nodeName
API->>etcd: store updated Pods
API-->>KL: watch: Pod bound to my node
Note over KL: start containers via CRI
KL->>API: update Pod .status (Running, IP)
API->>etcd: store Pod status
API-->>EC: watch: Pod Ready
Note over EC: add Pod IP to Service
EC->>API: update Endpoints
API->>etcd: store Endpoints
Here’s the same flow in words:
- The API server validates the Deployment and stores it in etcd.
- The Deployment controller observes the new Deployment. It compares the desired state (e.g., 3 replicas of nginx:1.21) to the actual state (no ReplicaSets exist yet). It creates a new ReplicaSet object.
- The ReplicaSet controller observes the new ReplicaSet. It compares the desired state (3 pods) to the actual state (0 pods). It creates 3 Pod objects.
- The Scheduler observes the 3 unscheduled Pods. For each, it selects a node and updates the Pod’s spec.nodeName.
- The Kubelet on each selected node observes the Pod assigned to it. It calls the container runtime to start the containers.
- The Kubelet reports the pod’s status (running, IP address, etc.) back to the API server.
- The Endpoint controller observes the running Pods and updates the Endpoints object for any matching Services.
Notice how many controllers are involved, each doing a small, independent job, communicating only through the API server. If any controller crashes, it simply restarts and resumes from the current state. If a pod crashes, the ReplicaSet controller detects that the actual count (2) differs from the desired count (3) and creates a replacement. This is self-healing through reconciliation.
Labels and Selectors: The Soft Linking Mechanism
Kubernetes objects are connected not by hard references (like foreign keys in a relational database) but by labels and selectors. A label is a key-value pair attached to an object’s metadata (e.g., app: nginx, env: production). A selector is a query that matches objects by their labels (e.g., app=nginx,env=production).
This soft linking is a deliberate design choice:
- Loose coupling: A Service does not reference specific Pods by name. It references a label selector, and any Pod matching that selector is included. This means Pods can be created, destroyed, and replaced without updating the Service.
- Flexibility: Labels can represent any dimension: application name, version, environment, team, cost center. Selectors can combine dimensions.
- Composition: Multiple resources can select the same Pods. A Service, a NetworkPolicy, and a PodDisruptionBudget can all independently select the same set of Pods using the same or different labels.
The label/selector model is inspired by the way tagging works in cloud infrastructure and the way CSS selectors work in web development: you define properties on objects and use queries to match them, rather than building explicit relationship graphs.
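Equality-based selector matching is small enough to write out. Here a selector is modeled as a dict of required label values (the real API also supports set-based expressions like `In` and `NotIn`, which this sketch omits):

```python
# Toy equality-based label selector: every key/value in the selector
# must be present in the object's labels. Extra labels are ignored.
def matches(selector, labels):
    return all(labels.get(k) == v for k, v in selector.items())

pods = [
    {"name": "web-1", "labels": {"app": "nginx", "env": "production"}},
    {"name": "web-2", "labels": {"app": "nginx", "env": "staging"}},
    {"name": "api-1", "labels": {"app": "api", "env": "production"}},
]
selector = {"app": "nginx", "env": "production"}
selected = [p["name"] for p in pods if matches(selector, p["labels"])]
print(selected)  # ['web-1']
```

Notice the asymmetry: objects may carry labels no selector ever queries, and selectors match any object past, present, or future. That is the loose coupling in code form.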
Custom Resource Definitions: Extending the API
One of Kubernetes’ most powerful features is the ability to extend the API with custom resources. A Custom Resource Definition (CRD) tells the API server about a new type of object — say, PostgresCluster — including its schema, its API group, and its versions. Once the CRD is installed, users can create, read, update, and delete PostgresCluster objects just like built-in resources.
But a CRD alone is just data storage. The magic happens when you pair a CRD with a custom controller that watches for PostgresCluster objects and reconciles them — creating the underlying StatefulSets, Services, ConfigMaps, PersistentVolumeClaims, and other resources needed to run an actual PostgreSQL cluster. This combination of CRD + controller is the Operator pattern.
CRDs are Kubernetes’ answer to the extensibility problem: how do you allow the platform to manage new types of resources without modifying Kubernetes itself? By making the API server a generic, extensible state store with a standard interface, Kubernetes enables an ecosystem of operators that teach the system how to manage everything from databases to message queues to machine learning pipelines.
This extensibility was a lesson from Borg, whose fixed API required modifying the system itself to support new workload types. Kubernetes’ CRD mechanism democratizes this: anyone can extend the API without forking the project.
Common Mistakes and Misconceptions
- “kubectl apply and kubectl create are the same.” kubectl create is imperative and fails if the resource already exists. kubectl apply is declarative and merges your manifest with the existing resource, making it safe to run repeatedly. In production, always use apply for reproducible, idempotent deployments.
- “I should use kubectl edit in production.” Imperative edits bypass GitOps workflows, code review, and audit trails. Changes made with kubectl edit are not tracked in version control and cannot be reproduced. Always use declarative YAML stored in Git and applied through a pipeline.
- “All Kubernetes resources are namespaced.” Many important resources are cluster-scoped: Nodes, PersistentVolumes, ClusterRoles, ClusterRoleBindings, and Namespaces themselves. Understanding which resources are namespaced and which are cluster-scoped is essential for RBAC and multi-tenancy.
- “Deleting a resource is instant.” Finalizers can block deletion indefinitely until a controller completes cleanup logic. Pods have a graceful termination period (default 30 seconds) during which they receive SIGTERM before being killed. A resource in “Terminating” state may remain for an extended time.
Further Reading
- Kubernetes API Conventions – The definitive guide to how Kubernetes API resources are structured, including naming, versioning, spec/status separation, and metadata conventions.
- Kubernetes API Machinery (apimachinery) – The Go library underpinning the Kubernetes API: Group-Version-Resource (GVR), Group-Version-Kind (GVK), runtime.Object, scheme registration, and serialization.
- Writing Controllers – Official Kubernetes Sample Controller – A minimal but complete example of writing a custom controller using client-go, demonstrating informers, work queues, and the reconciliation loop.
- client-go Examples – Practical examples of interacting with the Kubernetes API from Go: creating resources, setting up watches, using dynamic clients, and leader election.
- Michael Hausenblas & Stefan Schimanski – “Programming Kubernetes” (O’Reilly) – The most comprehensive book on the Kubernetes API machinery, custom resources, and controller development patterns.
- Stefan Schimanski & Antoine Pelisse – “Deep Dive Into API Machinery” (KubeCon 2019) — Detailed walkthrough of API versioning, conversion webhooks, and the request lifecycle.
- Kubernetes Documentation – Custom Resources – Official docs on CRDs, structural schemas, validation, versioning, and conversion webhooks for extending the API.
Next: The Networking Model
Chapter 5: The Networking Model — Why Every Pod Gets an IP
The Fundamental Networking Problem
Kubernetes’ networking model is most easily understood by contrast with the Docker port-mapping model it rejected.
Docker Port-Mapping vs. Kubernetes Flat Network
DOCKER PORT-MAPPING MODEL KUBERNETES FLAT NETWORK MODEL
───────────────────────── ────────────────────────────
Host IP: 192.168.1.10 Node 1 Node 2
┌──────────────────────┐ ┌──────────────┐ ┌──────────────┐
│ Container A (:80) │ │ Pod A │ │ Pod C │
│ → mapped to :32768 │ │ 10.244.1.5 │ │ 10.244.2.8 │
│ │ │ :80 is :80 │ │ :80 is :80 │
│ Container B (:80) │ │ │ │ │
│ → mapped to :32769 │ │ Pod B │ │ Pod D │
│ │ │ 10.244.1.6 │ │ 10.244.2.9 │
│ Container C (:3000) │ │ :3000 is │ │ :3000 is │
│ → mapped to :32770 │ │ :3000 │ │ :3000 │
└──────────────────────┘ └──────┬───────┘ └──────┬───────┘
│ Flat network │
Client must know: └─────────────────────┘
192.168.1.10:32768 Any pod can reach any pod by IP.
192.168.1.10:32769 No port translation. No NAT.
192.168.1.10:32770 Apps bind to the port they expect.
Docker’s Port-Mapping Model (and Why Kubernetes Rejected It)
In Docker’s default bridge networking mode, each container gets its own network namespace connected to the docker0 bridge on the host. Since containers are isolated behind this bridge, reaching them from outside the host requires port mapping (-p), which maps a host port into the container’s namespace (e.g., host port 32768 to container port 80). This means:
- The container’s address from outside is <host-ip>:<random-port>, not a predictable address.
- Every service that needs to communicate with the container must know the host IP and the mapped port.
- Port allocation must be coordinated across all containers on a host to avoid conflicts.
- Applications must be aware of port mapping, or an intermediary must translate.
This model breaks a fundamental assumption of network programming: that you know your own address. A container that binds to port 80 thinks it is listening on port 80, but external clients reach it on port 32768. Google’s experience with Borg — which used a naming service (BNS) to map logical names to host:port pairs — confirmed that port-mapping models create cascading operational friction.
Kubernetes’ Flat Networking Model
Kubernetes’ networking model has three fundamental rules:
- Every Pod gets its own IP address. No port mapping. No NAT between pods. A pod that binds to port 80 is reachable on port 80 at its pod IP.
- All pods can communicate with all other pods without NAT. Any pod can reach any other pod using the other pod’s IP address, regardless of which node either pod is on.
- Agents on a node (kubelet, kube-proxy) can communicate with all pods on that node.
This is sometimes called the flat networking model because from the perspective of pods, the network is flat: every pod is directly reachable from every other pod. There are no layers of NAT or port mapping to navigate.
Why is this model superior? Because it preserves the assumptions of traditional network programming. Applications do not need to know about port mapping. They bind to the port they expect. They connect to other services at their expected ports. DNS, load balancers, and monitoring tools work as expected. The mental model is: “pods are like VMs on a flat network.”
How the Flat Network Is Implemented: CNI
Kubernetes does not implement networking itself. Instead, it defines the Container Network Interface (CNI) specification: a standard API that networking plugins must implement. The CNI plugin is responsible for:
- Allocating an IP address for each pod
- Configuring the pod’s network namespace (virtual ethernet pair, routes, etc.)
- Ensuring pod-to-pod connectivity across nodes
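In practice, the kubelet discovers the plugin through a JSON config file on each node (conventionally under /etc/cni/net.d). A minimal sketch using the reference `bridge` and `host-local` plugins follows; the network name and subnet are illustrative:

```json
{
  "cniVersion": "1.0.0",
  "name": "example-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "subnet": "10.244.1.0/24"
      }
    }
  ]
}
```

The `ipam` section is what hands each pod its unique IP; real plugins such as Calico or Cilium replace both the interface setup and the IP allocation with their own mechanisms.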
Different CNI plugins implement this in different ways:
- Flannel uses a simple overlay network, encapsulating pod traffic in VXLAN (UDP) packets; its host-gateway backend instead routes traffic directly between nodes without encapsulation.
- Calico uses BGP to distribute pod routes, avoiding encapsulation overhead and enabling network policies.
- Cilium uses eBPF (extended Berkeley Packet Filter) programs in the Linux kernel for high-performance, programmable networking.
- AWS VPC CNI assigns pod IPs from the AWS VPC address space, making pods first-class citizens in the VPC network.
The CNI abstraction is another example of Kubernetes’ design philosophy: define the interface, not the implementation. By specifying what networking must provide (unique pod IPs, flat connectivity) without specifying how, Kubernetes allows the networking layer to be optimized for different environments.
Services: Stable Endpoints for Ephemeral Pods
Pod IP addresses are ephemeral. When a pod is destroyed and recreated, it gets a new IP. This means you cannot rely on pod IPs for service discovery. This is where the Service abstraction comes in.
A Service provides a stable virtual IP address (the ClusterIP) and a stable DNS name that routes traffic to the set of pods matching the Service’s label selector. The mapping from Service to pods is maintained by the Endpoints (or EndpointSlice) controller, which watches for pod changes and updates the endpoint list.
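A minimal manifest for such a Service might look like this (names and ports are illustrative):

```yaml
# Illustrative Service: a stable name and virtual IP for pods labeled app=web.
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    app: web        # endpoints: all ready pods carrying this label
  ports:
  - port: 80        # the port clients use on the ClusterIP / DNS name
    targetPort: 80  # the container port traffic is forwarded to
```

Nothing in the manifest names specific pods; the EndpointSlice controller continuously resolves the selector to whatever pods currently match.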
```mermaid
flowchart TD
SVC["Service: web-svc<br>ClusterIP: 10.96.0.42<br>DNS: web-svc.default.svc<br>Selector: app=web"]
SVC -->|"kube-proxy / iptables<br>load-balances across"| Pod1["Pod app=web<br>10.244.1.5<br>Node 1"]
SVC -->|"kube-proxy / iptables<br>load-balances across"| Pod2["Pod app=web<br>10.244.2.8<br>Node 2"]
SVC -->|"kube-proxy / iptables<br>load-balances across"| Pod3["Pod app=web<br>10.244.1.9<br>Node 1"]
```
Client code simply connects to http://web-svc:80; the request is transparently routed to one of the backing pods.
Kube-proxy (or the CNI plugin) programs rules on every node that intercept traffic to the Service's ClusterIP and redirect it to one of the backing pod IPs, using random selection (the default in iptables mode) or configurable algorithms such as round-robin (IPVS mode). From the client's perspective, the Service has a single, stable address; the fact that traffic is being distributed to ephemeral pods is transparent.
Services come in several types:
- ClusterIP (default): Accessible only within the cluster.
- NodePort: Exposes the Service on a static port on every node’s IP, making it accessible from outside the cluster.
- LoadBalancer: Provisions an external load balancer (on cloud providers) that routes external traffic to the Service.
- ExternalName: Maps the Service to a DNS CNAME record, providing a Kubernetes-native alias for an external service.
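Switching a Service between these types is a one-field change. A sketch of a cloud-facing variant (names are illustrative):

```yaml
# Hypothetical externally exposed Service. On a cloud provider, the cloud
# controller manager provisions a real load balancer and records its
# address in the Service's status.
apiVersion: v1
kind: Service
metadata:
  name: web-public
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - port: 80          # port exposed by the external load balancer
    targetPort: 8080  # port the containers actually listen on
```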
Ingress and Gateway API: L7 Routing
Services operate at Layer 4 (TCP/UDP). For HTTP-level routing — path-based routing, host-based virtual hosting, TLS termination — Kubernetes provides the Ingress resource (and its successor, the Gateway API).
An Ingress is a declaration of routing rules: “route traffic for host foo.example.com to Service foo, and traffic for host bar.example.com to Service bar.” An Ingress Controller (a separate component, typically nginx, HAProxy, Traefik, or a cloud load balancer) watches for Ingress resources and configures the actual routing.
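The routing rules just described translate almost literally into the resource. A sketch using the chapter's example hosts (which are placeholders, not real endpoints):

```yaml
# Host-based routing: foo.example.com -> Service foo, bar.example.com -> Service bar.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-routes
spec:
  rules:
  - host: foo.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: foo
            port:
              number: 80
  - host: bar.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: bar
            port:
              number: 80
```

Note that the Ingress itself routes nothing; it is inert data until an Ingress Controller reads it and programs a proxy accordingly.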
The Gateway API, introduced as a successor to Ingress, provides a more expressive and extensible model for routing, with better support for multi-tenancy, traffic splitting, and protocol-specific routing.
Network Policies: The Missing Firewall
By default, Kubernetes’ flat networking model allows all pods to communicate with all other pods. This is convenient but not secure. Network Policies provide pod-level firewall rules: you can specify which pods can communicate with which other pods, based on labels, namespaces, and IP blocks.
Network Policies are implemented by the CNI plugin (not all plugins support them). They are another example of Kubernetes’ declarative model: you declare the desired network access rules, and the CNI plugin configures the underlying network to enforce them.
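A sketch of a typical policy, with illustrative labels and port: only pods labeled `app=web` may reach the database pods, and only on the PostgreSQL port.

```yaml
# Illustrative NetworkPolicy: lock down ingress to app=db pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-web-only
spec:
  podSelector:
    matchLabels:
      app: db           # the pods this policy protects
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: web      # only these pods may connect
    ports:
    - protocol: TCP
      port: 5432
```

Once any policy selects a pod, that pod's ingress defaults to deny; traffic is allowed only if some policy explicitly permits it.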
The Four Networking Problems
| Problem | Solution | Key Mechanism |
|---|---|---|
| Container-to-container on same pod | Shared network namespace (localhost) | Pods share a single IP; containers communicate via localhost |
| Pod-to-pod across nodes | Flat network via CNI plugin | Every pod gets a unique IP; CNI ensures cross-node connectivity |
| Pod-to-Service (service discovery) | Service abstraction with ClusterIP | kube-proxy/CNI programs iptables/IPVS rules for load balancing |
| External-to-Service | NodePort, LoadBalancer, Ingress | Expose services externally via port mapping, cloud LB, or L7 routing |
Common Mistakes and Misconceptions
- “Pods need NAT to talk to each other.” Kubernetes requires a flat network where every pod can reach every other pod directly by IP without NAT. This is a fundamental requirement of the networking model, enforced by the CNI plugin. If you find yourself configuring NAT between pods, something is misconfigured.
- “Services are load balancers.” A Service is a stable virtual IP (ClusterIP) with endpoint tracking and basic load distribution via kube-proxy rules. Only `type: LoadBalancer` provisions an actual external load balancer. ClusterIP and NodePort Services are internal routing constructs, not load balancer appliances.
- “Pod IPs are stable.” Pod IPs are ephemeral and change every time a pod is restarted or rescheduled. Never hard-code pod IPs in configuration. Use Services for stable endpoints and DNS-based service discovery.
- “NodePort is fine for production.” NodePort exposes a high-numbered port on every node in the cluster, making it difficult to manage, secure, and integrate with external DNS or TLS. For production external traffic, use Ingress controllers or `type: LoadBalancer` Services instead.
Further Reading
- Kubernetes Networking Model – Official documentation explaining the fundamental requirement that every pod gets a unique IP and can communicate with every other pod without NAT.
- CNI Specification – The Container Network Interface spec that defines how network plugins integrate with container runtimes; essential for understanding how Calico, Cilium, Flannel, and others plug in.
- Life of a Packet in Kubernetes (KubeCon, Michael Rubin) — Traces a network packet from pod to pod, covering CNI, kube-proxy, and iptables/IPVS rules.
- CoreDNS Documentation – Reference for the default DNS server in Kubernetes, covering service discovery, custom DNS entries, and plugin-based extensibility.
- iptables vs. IPVS for kube-proxy – Tigera blog post comparing the two kube-proxy modes, including performance benchmarks and guidance on when to switch to IPVS.
- Kubernetes Networking Intro and Deep-Dive (KubeCon, Bowei Du & Tim Hockin) — Comprehensive walkthrough of the Kubernetes networking model, Services, DNS, and ingress.
- Gateway API Documentation – The next-generation Kubernetes API for L7 routing, replacing Ingress with a more expressive and role-oriented model.
Next: The Ecosystem
Chapter 6: The Ecosystem — Why Operators, Helm, and Service Meshes Exist
```mermaid
block-beta
columns 3
APP["YOUR APPLICATION"]:3
GITOPS["GitOps\nArgoCD / Flux"]
PKG["Packaging\nHelm / Kustomize"]
OBS["Observability\nPrometheus / Grafana"]
MESH["SERVICE MESH (optional) — Istio / Linkerd / Cilium"]:3
OPH["OPERATORS — CRD + Controller = domain knowledge as code\nPostgreSQL / Kafka / Any Domain Operator"]:3
CORE["KUBERNETES CORE\nDeployments, Services, ConfigMaps, Secrets, RBAC, CRDs\nAPI Server, etcd, Scheduler, Controllers, Kubelet"]:3
CNI["NETWORK (CNI)\nFlannel / Calico / \nCilium"]
CRI["RUNTIME (CRI)\ncontainerd / CRI-O"]
CSI["STORAGE (CSI)\nEBS / NFS / Ceph"]
LINUX["LINUX + HARDWARE — cgroups, namespaces, iptables/eBPF, kernel"]:3
style CORE fill:#1a4eb8,color:#fff,stroke:#0d2d6e,stroke-width:3px
```
Kubernetes provides the middle layers. Everything above and below is pluggable. This is “platform for platforms” by design.
Operators: Teaching Kubernetes Domain Knowledge
The Operator pattern, introduced by CoreOS in 2016, is the most significant architectural pattern to emerge from the Kubernetes ecosystem. An Operator encodes the operational knowledge of a human domain expert into a custom controller.
Consider the problem of running a PostgreSQL database on Kubernetes. Kubernetes knows how to run containers, but it does not know how to:
- Initialize a PostgreSQL cluster with a primary and replicas
- Configure streaming replication between primary and replicas
- Perform a failover when the primary node fails
- Take point-in-time backups using WAL archiving
- Resize a cluster by adding or removing replicas
- Upgrade PostgreSQL versions with minimal downtime
A human DBA knows all of these things. An Operator encodes this knowledge into code. The Operator defines a CRD (e.g., PostgresCluster) and a controller that watches for PostgresCluster objects and reconciles them by creating and managing the underlying Kubernetes resources (StatefulSets, Services, ConfigMaps, PersistentVolumeClaims, CronJobs for backups, etc.).
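From the user's side, all of that knowledge collapses into a single declarative object. The schema below is hypothetical (each real operator defines its own CRD fields):

```yaml
# Hypothetical custom resource; field names vary by operator.
apiVersion: example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  version: "16"             # desired PostgreSQL major version
  replicas: 3               # one primary plus two streaming replicas
  storage: 100Gi            # per-instance persistent volume size
  backups:
    schedule: "0 3 * * *"   # nightly WAL-archiving backup
```

The user declares the database they want; the operator's controller translates failover, replication, and backup procedures into reconciled Kubernetes resources.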
The Operator pattern is powerful because it composes with Kubernetes’ existing primitives — the same API, controllers, and reconciliation loops — adding only the domain-specific logic on top.
Operators exist for virtually every stateful application: databases (PostgreSQL, MySQL, MongoDB, CockroachDB, Cassandra), message queues (Kafka, RabbitMQ), monitoring systems (Prometheus), and many more. The OperatorHub.io registry catalogs hundreds of them.
Helm: Package Management for Kubernetes
Helm addresses a different problem: parameterized deployment. A typical Kubernetes application consists of dozens of YAML files: Deployments, Services, ConfigMaps, Secrets, Ingresses, ServiceAccounts, RBAC rules. These files need to be customized for different environments (dev, staging, production) and different configurations (replicas, resource limits, feature flags).
Helm introduces the concept of a chart: a package of templated YAML files, a values.yaml file that provides default parameters, and metadata. Users install a chart with custom values, and Helm renders the templates and applies the resulting YAML to the cluster.
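A sketch of the two halves of a chart (all names and values are illustrative):

```yaml
# values.yaml -- hypothetical defaults, overridable at install time
replicaCount: 2
image:
  repository: nginx
  tag: "1.27"
---
# templates/deployment.yaml -- the same fields referenced as Go templates
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
      - name: web
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

Running `helm install` with `--set replicaCount=5` (or a custom values file) renders the template with those overrides before applying it.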
Helm also provides release management: it tracks which charts are installed, their versions, and their configuration, enabling upgrades and rollbacks. It fills the role of a package manager (like apt or npm) for Kubernetes.
Helm has been criticized for its complexity and for its templating approach (Go templates embedded in YAML is ergonomically challenging). Alternatives like Kustomize (which uses overlay-based patching rather than templating) have emerged, but Helm remains the most widely used packaging tool in the Kubernetes ecosystem, largely because of its enormous library of community-maintained charts.
Service Meshes: The Networking Layer That Kubernetes Lacks
Kubernetes provides basic service discovery and load balancing through Services, but it does not provide:
- Mutual TLS (mTLS) between services: encrypting and authenticating all inter-service communication
- Fine-grained traffic management: canary deployments, traffic splitting, fault injection, retries, timeouts, circuit breaking
- Observability: distributed tracing, per-service metrics, access logging
A service mesh adds these capabilities by inserting a sidecar proxy (typically Envoy) alongside every pod. All inbound and outbound traffic flows through the sidecar, which can encrypt it, route it, observe it, and enforce policies on it. A control plane (Istio, Linkerd, Consul Connect) configures the sidecars.
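Sidecar injection is typically opt-in per namespace. Istio's convention, for example, is a namespace label (other meshes use different mechanisms; the namespace name here is illustrative):

```yaml
# Pods created in this namespace get an Envoy sidecar injected automatically
# by Istio's mutating admission webhook.
apiVersion: v1
kind: Namespace
metadata:
  name: shop
  labels:
    istio-injection: enabled
```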
Service meshes exist because Kubernetes deliberately does not implement application-level networking. Kubernetes provides the infrastructure-level network (pod IPs, Service ClusterIPs) but leaves application-level concerns (encryption, traffic management, observability) to the application or to a mesh. This is consistent with Kubernetes’ design philosophy of providing building blocks rather than a complete platform.
However, service meshes add significant complexity: they increase resource consumption (each sidecar consumes CPU and memory), add latency (each hop through a sidecar adds processing time), and introduce a large new control plane to operate. Many organizations find that they can achieve sufficient security and observability with simpler approaches (network policies + application-level TLS + centralized logging) and do not need a full mesh.
Why the Ecosystem Exists: Kubernetes as a Platform for Platforms
The common thread across Operators, Helm, and service meshes is that Kubernetes is deliberately incomplete. It provides primitives (pods, services, deployments, CRDs) and extension mechanisms (controllers, admission webhooks, CNI, CRI, CSI) but does not attempt to solve every problem itself — a lesson from Borg, which tried to be everything and became too tightly coupled to evolve. Kubernetes instead adopted the Unix philosophy: do one thing well, and compose with other tools.
The result is that Kubernetes is not a platform; it is a platform for building platforms. Organizations build their own internal developer platforms on top of Kubernetes, combining:
- Operators for stateful services
- Helm or Kustomize for packaging
- A service mesh or CNI-level features for security
- ArgoCD or Flux for GitOps
- Prometheus and Grafana for monitoring
- Custom CRDs and controllers for domain-specific needs
This composability is both Kubernetes’ greatest strength and its greatest source of complexity. The bare Kubernetes API is relatively simple; the ecosystem built on top of it is vast and sometimes overwhelming. Understanding that this is by design — that Kubernetes provides the kernel, not the full operating system — is essential to understanding the Kubernetes landscape.
Common Mistakes and Misconceptions
- “CNCF graduated means production-ready for my use case.” Graduated status indicates mature governance, broad adoption, and a proven security audit process. It does not guarantee the project is the right fit for your specific workload, scale, or operational constraints. Always evaluate projects against your own requirements.
- “I need to install every CNCF tool.” Most production clusters need only 5-10 ecosystem tools (a CNI plugin, an ingress controller, monitoring, logging, and perhaps a GitOps tool). The CNCF landscape contains 1000+ projects; installing everything would create an unmanageable operational burden.
- “The CNCF landscape is the complete ecosystem.” Many important Kubernetes tools live outside the CNCF, including commercial products, independent open-source projects, and cloud-provider-specific integrations. The CNCF landscape is a significant subset, not the totality of the ecosystem.
Further Reading
- CNCF Landscape – Interactive map of the entire cloud-native ecosystem, categorized by function (orchestration, observability, service mesh, etc.), with funding and maturity data.
- CNCF Project Maturity Levels – Explanation of the Sandbox, Incubating, and Graduated tiers, along with a full list of CNCF projects and their current status.
- Introducing Operators (CoreOS, 2016) – The original blog post by Brandon Philips that introduced the Operator pattern, explaining why encoding operational knowledge in code is a natural extension of Kubernetes controllers.
- CNCF Annual Survey Results – Yearly survey data on Kubernetes adoption rates, ecosystem tool usage, and deployment patterns across organizations worldwide.
- KubeCon + CloudNativeCon Talk Recordings – Full archives of KubeCon presentations covering operators, Helm, service meshes, and every other corner of the ecosystem.
- Helm Documentation – Official docs for the most widely used Kubernetes package manager, covering chart authoring, templating, release management, and repository hosting.
- Kubernetes Service Mesh: A Comparison of Istio, Linkerd and Consul (Platform9) – Detailed comparison of the major service mesh implementations across 16 factors, covering architectures, performance characteristics, and ideal use cases.
Next: Key Design Principles
Chapter 7: Key Design Principles
Declarative Over Imperative
Kubernetes favors declaring desired state over issuing commands. This principle pervades every level of the system, from the API (objects have spec and status, not a command queue) to the controllers (which reconcile rather than execute) to the tooling (kubectl apply rather than kubectl run).
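The spec/status split makes the principle concrete: users write spec, controllers write status, and there is no command anywhere. A sketch (the image tag is illustrative):

```yaml
# Declarative Deployment: describes an end state, not a sequence of actions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:                       # desired state: written by you
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.27   # assumed tag for illustration
# status (readyReplicas, conditions, ...) is written back by controllers,
# never by the user. Nowhere does this object say "start 3 pods".
```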
Control Loops Over Orchestration
As the official documentation states: “Kubernetes is not a mere orchestration system. Orchestration means executing a defined workflow: first do A, then B, then C. Kubernetes comprises a set of independent, composable control processes that continuously drive the current state towards the desired state.”
This distinction is subtle but important. An orchestration system is fragile: if step B fails, the entire workflow may need to be restarted or repaired by hand. A control-loop system is robust: each controller independently makes progress, and failures in one controller do not block others.
API-Centric Design
Everything in Kubernetes is an API object. Every component communicates through the API server. There are no hidden side channels, no direct component-to-component communication. This means:
- The API is the complete description of the system’s state.
- Any behavior can be observed by watching the API.
- Any component can be replaced by one that speaks the same API.
- The system can be extended by adding new API types (CRDs) and controllers.
Portability and Vendor Neutrality
Kubernetes was designed from the start to run anywhere: on any cloud provider, on bare metal, on a laptop. This is achieved through abstraction layers (CRI for container runtimes, CNI for networking, CSI for storage) that isolate Kubernetes from the underlying infrastructure. The goal is to prevent vendor lock-in and enable workload portability.
Extensibility as a First-Class Concern
Kubernetes does not try to solve every problem itself. Instead, it provides extension points at every level: CRDs for custom API types, admission webhooks for custom validation and mutation, custom schedulers, custom controllers, CNI/CRI/CSI plugins. This extensibility is what enables the vast Kubernetes ecosystem.
The Level-Triggered vs. Edge-Triggered Distinction
Kubernetes controllers are designed to be level-triggered, not edge-triggered. An edge-triggered system reacts to changes (events): “a pod was deleted.” A level-triggered system reacts to state: “the desired count is 3, but the actual count is 2.”
The level-triggered approach is more robust because it handles missed events gracefully. If a controller misses the “pod deleted” event (because it was restarting or the watch was disconnected), it will still notice that the actual count is wrong on its next reconciliation and take corrective action. Edge-triggered systems require reliable event delivery; level-triggered systems only require eventual state observation.
This is why Kubernetes controllers are built around Informers that maintain a cached copy of the current state, rather than simple event handlers. The Informer’s cache represents the current level, and the controller reconciles against it.
Level-Triggered Design: Kubernetes controllers react to the current state of the world (“there are 2 pods but 3 desired”), not to individual events (“a pod was deleted”). This makes them robust against missed events, disconnections, and restarts. If a controller misses an event, it will still observe the state discrepancy on its next reconciliation cycle and take corrective action.
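The loop itself is tiny. A sketch in Go (Kubernetes' implementation language); the function is illustrative, not real controller-runtime API:

```go
package main

import "fmt"

// reconcile is a level-triggered sketch: it compares observed state to
// desired state and derives the corrective action. It never looks at
// events, so missed events cannot break it.
func reconcile(desired, actual int) string {
	switch {
	case actual < desired:
		return fmt.Sprintf("create %d pod(s)", desired-actual)
	case actual > desired:
		return fmt.Sprintf("delete %d pod(s)", actual-desired)
	default:
		return "no action"
	}
}

func main() {
	// Whatever events were missed while the controller was down,
	// observing the current level is sufficient to converge.
	fmt.Println(reconcile(3, 2)) // one pod is missing
	fmt.Println(reconcile(3, 3)) // already converged
}
```

A real controller wraps this in a work queue fed by an Informer cache, and re-runs it on every resync, but the core logic is exactly this comparison of levels.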
Common Mistakes and Misconceptions
- “Declarative means one-shot.” Declarative does not mean “apply once and walk away.” It means continuous reconciliation: Kubernetes constantly compares the actual state of the cluster to the desired state and drives toward convergence. The system is always working, not just at the moment you run `kubectl apply`.
- “Controllers run once when you apply a change.” Controllers run in continuous loops, not as one-shot handlers. They watch for any drift from desired state, whether caused by your changes, hardware failures, resource pressure, or other controllers. A controller that only ran once would miss all subsequent drift.
- Writing event-driven controllers instead of level-triggered ones. Controllers that react to individual events rather than reconciling against current state break when events are missed. A level-triggered controller simply observes the current state on the next reconciliation and converges regardless of what events it missed.
Further Reading
- Level Triggering and Reconciliation in Kubernetes (Hackernoon) – Essential article explaining why Kubernetes controllers are level-triggered rather than edge-triggered, and how this design choice makes the system resilient to missed events.
- Kubernetes Enhancement Proposals (KEPs) – The formal process for proposing, discussing, and tracking significant changes to Kubernetes; reading KEPs is the best way to understand the reasoning behind design decisions.
- Kubernetes Design Proposals Archive – Historical archive of early Kubernetes design documents that shaped the API, controllers, and extensibility model before the KEP process was established.
- James Urquhart, “Flow Architectures” (O’Reilly, 2021) – Explores event-driven and declarative flow-based systems, providing broader context for why Kubernetes’ reconciliation-based approach is part of a larger trend in distributed system design.
- Kubernetes API Conventions – The official guide to Kubernetes API design: spec vs. status, metadata conventions, and the principles that make the API consistent and extensible.
- Brian Grant – “What is Kubernetes?” (KubeCon 2017) — Principal Engineer at Google on Kubernetes’ design philosophy, resource model, and architectural principles.
Next: Why Kubernetes Won
Chapter 8: Why Kubernetes Won
The Competitive Landscape
Kubernetes was not the only container orchestration system:
- Docker Swarm (2015) offered a simpler, Docker-native orchestration experience.
- Apache Mesos (2009) with Marathon provided a battle-tested, two-level scheduling architecture used at Twitter, Airbnb, and Apple.
- Nomad (2015) from HashiCorp offered a simpler, more flexible orchestrator that could manage containers, VMs, and standalone binaries.
So why did Kubernetes win? Several factors:
1. The right abstraction level. Docker Swarm was too simple: it lacked the extensibility and abstraction depth needed for complex production workloads. Mesos was too low-level: it provided resource scheduling but left application management to frameworks like Marathon, creating a fragmented experience. Kubernetes hit a sweet spot: it provided a comprehensive API for managing applications (Deployments, Services, ConfigMaps, Secrets) while remaining extensible for new use cases (CRDs, Operators).
2. The declarative model. Kubernetes’ commitment to declarative, reconciliation-based state management was more robust than Swarm’s imperative commands or Mesos’ framework-specific APIs. The declarative model enabled GitOps, automated testing of infrastructure changes, and reliable self-healing.
3. The extensibility model. CRDs and custom controllers allowed the community to extend Kubernetes without forking it. This created a virtuous cycle that Docker Swarm and Mesos, lacking this extensibility, could not match.
4. Vendor neutrality. By donating Kubernetes to the CNCF and designing it to run on any infrastructure, Google ensured that no single vendor controlled the project. This convinced AWS, Azure, and every other cloud provider to offer managed Kubernetes services, creating a universal standard. Docker Swarm was controlled by Docker, Inc., and Mesos was associated with Mesosphere (later D2iQ).
5. Google’s credibility. Kubernetes was backed by Google’s decade of experience running Borg at unprecedented scale. This gave the project instant credibility in a way that a startup’s orchestrator could not match.
6. Community and ecosystem. Kubernetes built the largest open-source community in history (by contributor count). The CNCF ecosystem of complementary projects (Prometheus, Envoy, Helm, ArgoCD, Cilium, etc.) created a complete platform story that no competitor could match.
The Deeper Lesson
But the deeper reason Kubernetes won is architectural. Its design — declarative state, reconciliation loops, extensible API, composable controllers — is not just a set of implementation choices. It is a theory of how to manage distributed systems.
The theory says: define the desired state of the world as data. Build independent controllers that each reconcile one aspect of the world toward the desired state. Communicate only through a shared, versioned state store. Make everything observable and extensible.
This theory is general enough to manage not just containers but anything: virtual machines, databases, DNS records, cloud resources, machine learning models. And that generality is what makes Kubernetes not just an orchestrator but a universal control plane — a platform for managing any infrastructure through declarative, reconciliation-based APIs.
Whether this generality justifies Kubernetes’ complexity is a fair debate. For simple applications, Kubernetes is overkill. But for organizations managing diverse, dynamic, distributed infrastructure at scale, Kubernetes’ architectural principles provide a coherent framework that no other system has matched.
Kubernetes’ ultimate contribution is not the code (which will be replaced someday) but the ideas: declarative state, reconciliation loops, level-triggered controllers, extensible APIs, operator patterns. These ideas will outlive Kubernetes itself and will influence the design of distributed systems for decades to come.
Complexity Is Not Free. Kubernetes’ generality comes at a cost. The system has hundreds of moving parts, a vast ecosystem of add-ons, and a steep learning curve. For many applications — a single service with modest scale, a batch processing pipeline, a static website — Kubernetes is dramatically overengineered. The right question is not “should I use Kubernetes?” but “do I have the problems that Kubernetes was designed to solve?” If you do not have bin-packing, service discovery, rolling deployment, or self-healing problems at meaningful scale, simpler alternatives (Docker Compose, a cloud provider’s native container service, even a well-managed VM fleet) may be more appropriate.
Key Contributors to Kubernetes’ Design
| Name | Role |
|---|---|
| Joe Beda | Co-founder of Kubernetes at Google. Led early architecture decisions. |
| Brendan Burns | Co-founder of Kubernetes. Author of key design patterns papers. Corporate VP at Microsoft Azure. |
| Craig McLuckie | Co-founder of Kubernetes. Founded Heptio (later acquired by VMware). Key advocate for CNCF donation. |
| Brian Grant | Principal engineer at Google. Led Kubernetes API design and declarative configuration model. |
| Tim Hockin | Principal engineer at Google. Key architect of Kubernetes networking and node components. |
| John Wilkes | Google Fellow. Architect of Borg and Omega. His research directly informed Kubernetes’ design. |
| Eric Tune | Google engineer. Co-author of the Borg paper and early Kubernetes contributor. |
| Clayton Coleman | Red Hat architect. Major contributor to Kubernetes API machinery, CRDs, and extensibility. |
Common Mistakes and Misconceptions
- “Kubernetes won because it’s the simplest.” Kubernetes won despite its complexity, not because of simplicity. The decisive factors were API extensibility (CRDs and custom controllers), vendor-neutral governance through the CNCF, and the ecosystem flywheel these created. Simpler alternatives like Docker Swarm lost because they lacked these properties.
- “Docker Swarm failed because Docker was bad.” Swarm’s user experience was widely praised as simpler and more intuitive than Kubernetes. Swarm lost on ecosystem breadth, not on technical quality or user experience.
- “There are no alternatives to Kubernetes.” HashiCorp Nomad, AWS ECS, and various platform-as-a-service offerings (Cloud Run, Fly.io, Railway) are valid alternatives for many workloads. Kubernetes is the right choice for complex, multi-service, multi-team environments at scale, but not every application needs what Kubernetes provides.
Further Reading
- Apache Mesos Retirement Announcement (Apache Foundation, 2021) – The official notice that Apache Mesos moved to the Attic, marking the end of active development for Kubernetes’ most technically sophisticated competitor.
- Docker Swarm to Mirantis Container Runtime Transition – Documents the transfer of Docker Swarm maintenance to Mirantis after Docker Inc. shifted focus, effectively ending Swarm as a competitive orchestrator.
- HashiCorp Nomad Documentation – Official docs for Nomad, the scheduler that remains a viable Kubernetes alternative for teams wanting simpler orchestration without the full Kubernetes ecosystem.
- CNCF Governance Documents – The charter and governance structure of the Cloud Native Computing Foundation, explaining how vendor-neutral governance gave Kubernetes an adoption advantage over vendor-controlled alternatives.
- The History of Containers (Red Hat) – Timeline from FreeBSD jails to Docker to Kubernetes, covering how container technologies evolved from early OS-level virtualization through modern orchestration.
- Kelsey Hightower – “Kubernetes and the Path to Serverless” (KubeCon keynote) — A keynote exploring Kubernetes’ role in the broader cloud-native ecosystem and its evolution toward serverless patterns.
- Why We Chose Kubernetes over ECS, Mesos, and Nomad (various engineering blogs) – Collection of case studies from organizations explaining their evaluation criteria and why they landed on Kubernetes.
Next: References and Further Reading
Chapter 9: References and Further Reading
Foundational Papers
Large-scale cluster management at Google with Borg (Verma et al., EuroSys 2015) — The landmark paper describing Google’s Borg system, which directly inspired Kubernetes. Covers the declarative job specification, bin packing scheduler, naming service, and lessons learned from a decade of production use.
Borg, Omega, and Kubernetes (Burns, Grant, Oppenheimer, Tune, Wilkes, ACM Queue 2016) — A retrospective by the architects of all three systems, explicitly discussing the lessons learned from Borg and Omega that were applied to Kubernetes.
Omega: flexible, scalable schedulers for large compute clusters (Schwarzkopf et al., EuroSys 2013) — Describes Google’s Omega scheduling system and its shared-state, optimistic-concurrency approach, which influenced Kubernetes’ multi-controller architecture.
Design Patterns for Container-Based Distributed Systems (Burns and Oppenheimer, USENIX HotCloud 2016) — By Brendan Burns, co-founder of Kubernetes. Identifies common patterns in containerized systems: sidecar, ambassador, adapter. These patterns became the foundation for service meshes and the Operator pattern.
Official Design Documents
Kubernetes Design Proposals Archive — The archive of Kubernetes Enhancement Proposals (KEPs) and design documents. Reading these documents reveals the reasoning behind specific design decisions.
Kubernetes Architecture Documentation — The official documentation of Kubernetes’ architecture, including descriptions of every control plane and node component.
Kubernetes API Concepts — Official documentation of the Kubernetes API model, versioning, and extension mechanisms.
Kubernetes Networking Model — Official documentation of the Kubernetes networking model and its requirements.
Key Talks
“Kubernetes: Changing the Way That We Think and Talk About Computing” (Brendan Burns, various conferences) — Burns’ talks consistently focus on the conceptual model rather than the mechanics, making them excellent introductions to the design philosophy.
“The History of Kubernetes and Where It’s Going” (Joe Beda, KubeCon keynotes) — Beda, co-founder of Kubernetes, discusses the project’s origins and design decisions.
“Borg, Omega, and Kubernetes: Lessons Learned” (John Wilkes, various) — Wilkes was a key architect of Borg and Omega, and his talks provide unparalleled insight into the design evolution.
Books
Kubernetes Up & Running (Burns, Beda, Hightower; O’Reilly) — Co-authored by two Kubernetes co-founders. Covers both how and why.
Kubernetes in Action (Luksa; Manning) — Deep technical coverage of Kubernetes internals, with excellent explanations of the control plane.
Programming Kubernetes (Hausenblas, Schimanski; O’Reilly) — Focuses on extending Kubernetes: writing controllers, operators, and CRDs.
Kubernetes Patterns (Ibryam, Huss; O’Reilly) — Catalogs recurring design patterns for Kubernetes applications.
This concludes Part 1: First Principles. You now have the conceptual foundation — the architecture, the API model, the networking model, and the design principles that explain why Kubernetes works the way it does. Part 2 shifts from “why was it designed this way?” to “how did the tooling around it evolve?” — starting with the container runtime wars that shaped the foundation Kubernetes runs on.
Next: The Container Runtime Wars
Chapter 10: The Container Runtime Wars
The Evolution of Container Runtimes in Kubernetes
Era 1 (2014-2016): kubelet ---> Docker Engine (which used libcontainer/runc internally)
"The only option. Docker was a monolith."
Era 2 (2016-2020): kubelet ---> dockershim ---> Docker Engine ---> containerd ---> runc
"CRI exists, but Docker doesn't speak it. Add another layer."
Era 3 (2018+): kubelet ---> CRI ---> containerd ---> runc
kubelet ---> CRI ---> CRI-O ------> runc
"Direct communication. Docker removed from the chain."
Docker’s Original Role: From Monolith to Layers
To understand the container runtime wars, you must first understand how Docker evolved. In the early days (2014-2016), Docker Engine was a monolithic daemon. It used an internal library called libcontainer (later extracted and renamed to runc) to interact with the Linux kernel, setting up namespaces, cgroups, and filesystem mounts. There was no separate “containerd” layer yet — Docker Engine handled everything from the user-facing API down to container creation in a single process.
When Kubernetes launched in 2014-2015, it talked to Docker Engine directly. The kubelet called the Docker API, and Docker Engine internally used libcontainer/runc to create containers. Kubernetes was only using a fraction of what Docker Engine provided. It did not need Docker Compose, Docker Swarm, or Docker’s build system. It needed exactly one capability: run containers.
In 2016, Docker decomposed its monolith. It extracted core container lifecycle management into a separate daemon called containerd, while low-level container creation lived in runc (the extracted and renamed libcontainer). This produced the layered architecture that later versions used:
- runc — a low-level tool that did exactly one thing: create and run a container according to the OCI (Open Container Initiative) runtime specification.
- containerd — a daemon that managed the lifecycle of containers: image pulling, storage, container execution (by calling runc), and networking setup.
- Docker Engine (dockerd) — the daemon that provided the Docker API, Docker CLI integration, Docker Compose support, Docker Swarm orchestration, build functionality, and all the user-facing features that made Docker popular. dockerd talked to containerd, which talked to runc.
With this decomposition, every container operation now went through three layers: kubelet called the Docker API, Docker Engine called containerd, containerd called runc. Each layer added latency, complexity, and potential failure modes.
This was like hiring a general contractor, a subcontractor, and a specialist every time you needed to hammer a single nail.
The CRI: Defining a Standard Interface
Before Kubernetes 1.5 (December 2016), the kubelet had direct knowledge of how to talk to Docker compiled into its source code. If you wanted to use a different container runtime — say, rkt from CoreOS — the code for that runtime also had to be compiled into the kubelet binary. This meant the kubelet was tightly coupled to every runtime it supported. Adding a new runtime required modifying kubelet source code, getting the changes reviewed and merged, and waiting for a Kubernetes release. This did not scale.
The Container Runtime Interface (CRI) was introduced in Kubernetes 1.5 to solve this problem. CRI defined a gRPC-based interface with two services:
- RuntimeService: operations on containers and pods (create, start, stop, remove, list, status, exec, attach, port-forward)
- ImageService: operations on container images (pull, list, remove, image status)
Any container runtime that implemented this gRPC interface could be plugged into the kubelet without modifying kubelet source code. The kubelet would communicate with the runtime over a Unix socket, and the runtime would handle everything from there.
This was a critical architectural decision — the same design philosophy that Kubernetes applied to networking (CNI), storage (CSI), and cloud providers. Define a clean interface, let implementations compete, and avoid coupling the core system to any particular vendor.
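The same gRPC surface the kubelet uses can be exercised by hand with crictl, the CRI debugging CLI, which reads its endpoint from /etc/crictl.yaml. A minimal sketch, assuming containerd's default socket path:

```yaml
# /etc/crictl.yaml — points crictl at the node's CRI socket.
# This path assumes containerd's default; CRI-O listens on
# unix:///var/run/crio/crio.sock instead.
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
```

With this in place, commands like crictl ps and crictl pull issue the same RuntimeService and ImageService calls the kubelet makes, which makes crictl the standard tool for debugging a node without going through the API server.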
flowchart TD
kubelet["kubelet"] -->|"gRPC over<br>Unix socket"| CRI["CRI gRPC Interface"]
CRI --- RS["RuntimeService<br>RunPodSandbox, CreateContainer,<br>StartContainer, StopContainer,<br>ListContainers, ExecSync"]
CRI --- IS["ImageService<br>PullImage, ListImages,<br>RemoveImage"]
CRI -->|"Implemented by"| containerd["containerd<br>(via CRI plugin)"]
CRI -->|"Implemented by"| CRIO["CRI-O"]
CRI -->|"Formerly"| dockershim["dockershim<br>(removed in 1.24)"]
style dockershim fill:#666,stroke:#999,color:#ccc
The Dockershim: A Bridge to Nowhere
There was a problem. Docker Engine predated CRI by several years and did not implement it. Docker had its own API, its own assumptions, its own way of doing things. But Docker was the dominant runtime — virtually every Kubernetes cluster in production used Docker. Kubernetes could not simply drop Docker support overnight.
The solution was the dockershim — a CRI-compatible shim layer that translated CRI calls into Docker Engine API calls. The kubelet would speak CRI to the dockershim, and the dockershim would translate those calls into the Docker API. The dockershim was maintained inside the kubelet codebase itself, making the kubelet responsible for keeping up with every Docker API change.
The call chain became even longer:
kubelet ---> dockershim ---> Docker Engine (dockerd) ---> containerd ---> runc
Four layers of indirection to start a container. And the dockershim added a particular kind of burden: because it lived in the kubelet codebase, every Kubernetes release had to ensure compatibility with Docker. Docker bugs became kubelet bugs. Docker’s release cycle constrained Kubernetes’ release cycle. The dockershim was approximately 2,000 lines of complex translation code that had to be maintained by the Kubernetes community despite being, conceptually, Docker’s problem.
This situation was unsustainable. The Kubernetes community was maintaining a compatibility shim for one specific vendor’s product inside its core codebase.
containerd Goes Standalone
Docker itself recognized the architectural problem. In 2017, Docker donated containerd to the Cloud Native Computing Foundation (CNCF) as an independent project. This was a significant move: containerd was no longer “Docker’s internal component” but an independent, community-governed container runtime.
containerd 1.1, released in 2018, added native CRI support through a built-in CRI plugin. This meant containerd could speak CRI directly — the kubelet could talk to containerd without Docker Engine in the middle. The call chain collapsed:
Before: kubelet ---> dockershim ---> dockerd ---> containerd ---> runc
After: kubelet ---> CRI ---> containerd ---> runc
Two layers were eliminated. The result was faster container operations, fewer potential failure points, and lower resource overhead (no Docker daemon consuming memory and CPU for features Kubernetes did not use).
CRI-O: The Minimal Alternative
Red Hat took a different approach. Rather than adapting an existing runtime, they built CRI-O from scratch as a minimal CRI implementation — just enough runtime to support Kubernetes, nothing more. CRI-O’s motto was effectively “Kubernetes’ container runtime.”
CRI-O did not support Docker Compose. It did not support Docker Swarm. It did not have a CLI for building images. It implemented the CRI gRPC interface and managed containers using runc (or any OCI-compliant low-level runtime). It was purpose-built for Kubernetes and nothing else.
This minimalism had real advantages:
- Smaller attack surface: less code means fewer vulnerabilities
- Version alignment: CRI-O versions are aligned 1:1 with Kubernetes versions (CRI-O 1.24 works with Kubernetes 1.24)
- Predictable behavior: no features outside the CRI specification that could cause unexpected interactions
- Lighter weight: lower memory and CPU overhead than Docker Engine or even containerd (which supports non-CRI use cases)
Red Hat adopted CRI-O as the default runtime for OpenShift, their enterprise Kubernetes distribution. This gave CRI-O a significant production footprint and a well-funded development team.
The Deprecation That Shook the Community
In December 2020, Kubernetes 1.20 included a deprecation warning: dockershim would be removed in a future release. In May 2022, Kubernetes 1.24 completed the removal.
The announcement caused widespread panic. Blog posts declared “Kubernetes is dropping Docker support!” Twitter erupted with confusion. Many users believed their Docker images would stop working, that their Dockerfiles were obsolete, that they needed to rebuild everything.
None of this was true. The confusion stemmed from conflating “Docker” the image format with “Docker” the runtime engine.
Here is what actually happened:
- Docker images are OCI images. The Open Container Initiative defined a standard image format, and Docker images conform to it. containerd, CRI-O, and every other OCI-compliant runtime can pull and run Docker images. Nothing changed about images.
- Dockerfiles continued to work. They produce OCI-compliant images. It does not matter what builds the image; what matters is that the output conforms to the OCI specification.
- What was removed was the dockershim — the translation layer inside the kubelet that allowed Kubernetes to talk to Docker Engine. If you were running Docker Engine as your container runtime, you needed to switch to containerd or CRI-O. Since Docker Engine itself used containerd internally, this switch was straightforward: containerd was already on the machine; it just needed to be configured as the CRI endpoint.
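In practice the switch is a one-line configuration change: point the kubelet at a different CRI socket. A sketch of the relevant kubeadm fragment, assuming containerd's default socket path:

```yaml
# kubeadm configuration fragment: tell the kubelet which CRI socket to use.
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  # Before 1.24 this could name the dockershim socket; afterwards
  # it must point at a real CRI runtime (containerd or CRI-O).
  criSocket: unix:///run/containerd/containerd.sock
```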
The deprecation was, in a sense, a non-event for users. Their workflows did not change. Their images did not change. Their Dockerfiles did not change. What changed was which daemon the kubelet talked to on each node — an infrastructure detail that most application developers never interacted with directly.
But the deprecation was a significant event for the Kubernetes project. It removed approximately 2,000 lines of shim code from the kubelet, eliminated an entire class of compatibility bugs, and completed the transition to a clean, pluggable runtime architecture that CRI had promised since 2016.
The Current Landscape
Today, the container runtime landscape has settled into a clear pattern:
containerd is the dominant runtime. Amazon EKS, Google GKE, Microsoft AKS, and most managed Kubernetes services use containerd as their default runtime. containerd is mature, well-tested, and supports a broad range of use cases beyond just Kubernetes (it is also used by Docker Desktop, nerdctl, and other tools).
CRI-O is the standard runtime for Red Hat OpenShift and is used in environments where minimalism and strict Kubernetes alignment are priorities.
Docker Desktop remains the most popular tool for local container development. Developers build images with Docker, push them to registries, and those images run on containerd or CRI-O in production. Docker’s role shifted from “the runtime” to “the developer tool.”
Specialized runtimes exist for specific use cases: gVisor (Google) provides kernel-level sandboxing for stronger isolation, Kata Containers runs each container in a lightweight VM for hardware-level isolation, and Firecracker (AWS) powers Lambda and Fargate with microVMs. All of these can be plugged into Kubernetes through CRI, demonstrating the power of the pluggable interface.
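Selecting among these runtimes happens through the RuntimeClass API: the cluster maps a name to a handler that the CRI runtime knows how to invoke, and pods opt in by name. A sketch, assuming a cluster where the gVisor handler (runsc) has already been registered with containerd:

```yaml
# RuntimeClass maps a cluster-visible name to a CRI runtime handler.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc          # handler name as configured in containerd (assumption)
---
# A pod opts into the sandboxed runtime by referencing the class.
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-nginx
spec:
  runtimeClassName: gvisor
  containers:
    - name: nginx
      image: nginx:1.25
```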
flowchart TD
subgraph Dev["Developer Workstation"]
Docker["Docker Desktop<br>(build images)"] --> OCI["OCI Image"]
end
OCI -->|"push to registry"| kubelet
subgraph Prod["Production Cluster"]
kubelet["kubelet"]
kubelet --> containerd
kubelet --> CRIO["CRI-O"]
containerd --> runc["runc<br>(or gVisor / Kata)"]
CRIO --> runc
end
The container runtime wars teach a design lesson: define clean interfaces early, and the ecosystem will sort itself out. CRI turned the container runtime from a hardwired dependency into a pluggable component, and the result was a healthier ecosystem where runtimes could compete on merit without requiring changes to Kubernetes core. The removal of dockershim, despite the community anxiety it caused, was the natural conclusion of a process that began six years earlier with the introduction of CRI.
Common Mistakes and Misconceptions
- “I need Docker installed to run containers on Kubernetes.” Since K8s 1.24, dockershim was removed. Kubernetes uses containerd or CRI-O directly. Docker is a development tool, not a K8s runtime dependency.
- “containerd and Docker are completely different.” containerd was extracted from Docker. Docker uses containerd internally. The difference is that K8s talks to containerd directly via CRI, skipping the Docker daemon.
- “OCI images built with Docker won’t work with containerd.” OCI images are runtime-agnostic. An image built with docker build runs identically on containerd, CRI-O, or any OCI-compliant runtime.
For a visual timeline of how container runtimes evolved alongside the broader ecosystem, see Appendix E: Architecture Evolution Timeline.
Further Reading
- Container Runtime Interface (CRI) specification – The formal API definition that decoupled Kubernetes from any single container runtime, enabling the pluggable ecosystem described in this chapter.
- containerd documentation – Official docs for the dominant container runtime, covering architecture, configuration, and the plugin system that makes containerd extensible beyond Kubernetes.
- CRI-O documentation – The lightweight, Kubernetes-dedicated runtime used by OpenShift. Useful for understanding the “do one thing well” design philosophy contrasted with containerd’s broader scope.
- KEP-2221: Dockershim Removal – The Kubernetes Enhancement Proposal that formalized the dockershim removal, including the rationale, migration plan, and community discussion.
- “Don’t Panic: Kubernetes and Docker” (Kubernetes blog) – The official blog post that clarified the Docker deprecation, explaining why Docker images still work and what actually changed. Essential reading for understanding the community communication around this transition.
- OCI Runtime Specification – The standard that defines how a container runtime starts and manages containers at the lowest level. Understanding this spec clarifies the relationship between high-level runtimes (containerd, CRI-O) and low-level runtimes (runc).
- runc GitHub repository – The reference implementation of the OCI runtime spec and the low-level runtime that actually creates containers for both containerd and CRI-O. Reading the README provides a clear picture of what happens at the bottom of the runtime stack.
Next: Chapter 11: Bootstrapping a Cluster — From kube-up.sh to kubeadm
Chapter 11: Bootstrapping a Cluster — From kube-up.sh to kubeadm
gantt
title Cluster Bootstrap Tools — Increasing Abstraction
dateFormat YYYY
axisFormat %Y
section Provision Everything
kube-up.sh (GCE only) :done, 2014, 2016
section Bootstrap on Machines
kubeadm (alpha → GA 2018) :done, 2016, 2024
kops (alpha → GA 2018) :done, 2016, 2024
minikube :done, 2016, 2024
kubespray :done, 2018, 2024
section Single Binary
k3s :done, 2018, 2024
kind :done, 2018, 2024
k0s :done, 2020, 2024
section Managed K8s Dominates
GKE / EKS / AKS :active, 2018, 2026
The Problem: What Does It Actually Take to Run Kubernetes?
Before examining the tools, consider what bootstrapping a Kubernetes cluster requires. This is not a trivial task. At minimum, you must:
- Generate a Public Key Infrastructure (PKI). Kubernetes components communicate over mutually authenticated TLS. The API server needs a certificate. Each kubelet needs a certificate. etcd needs certificates. The front proxy needs certificates. A typical cluster has 10+ certificates, each with specific Subject Alternative Names, key usages, and expiration policies. Getting any of these wrong results in cryptic TLS handshake failures.
- Deploy and configure etcd. etcd must form a quorum, which means each member must know about the others. In a multi-node etcd cluster, the initial bootstrap is a chicken-and-egg problem: members need to discover each other before they can form a cluster.
- Configure the API server. The API server needs to know where etcd is, which certificates to use, which admission controllers to enable, which authentication methods to support, and how to reach the kubelet on each node.
- Configure the controller manager and scheduler. Both need kubeconfig files with credentials to authenticate to the API server.
- Join worker nodes. Each worker node needs a kubelet configured with the correct API server address and authentication credentials. The node needs a container runtime installed and configured. It needs kube-proxy or a replacement for service networking.
- Install cluster networking. Kubernetes mandates that every pod can communicate with every other pod without NAT. This requires a CNI (Container Network Interface) plugin, which must be installed after the control plane is running but before workloads can function.
- Install DNS. Kubernetes assumes that a cluster DNS service (CoreDNS) is available. Services are discovered by DNS name, and without DNS, almost nothing works correctly.
Each of these steps has dependencies on the others. Certificates must be generated before any component can start. etcd must be running before the API server can start. The API server must be running before the controller manager, scheduler, or any kubelets can connect. Networking must be installed before pods can communicate. The ordering is strict, and mistakes are difficult to diagnose.
The Early Days: kube-up.sh
In 2014-2015, the primary way to create a Kubernetes cluster was a shell script called kube-up.sh. This script attempted to do everything: provision cloud resources (VMs, networks, firewalls, load balancers), install Kubernetes binaries, generate certificates, configure all components, and join nodes into a cluster.
The script was massive — thousands of lines of bash — and was built primarily for Google Compute Engine (GCE). It had branches for AWS and other providers, but these were maintained with varying degrees of quality. The fundamental problem was that kube-up.sh conflated two very different concerns:
- Infrastructure provisioning: creating VMs, networks, and storage. This is cloud-provider-specific and depends on each provider’s API, authentication model, and resource semantics.
- Cluster bootstrapping: installing and configuring Kubernetes on machines that already exist. This is (or should be) cloud-agnostic.
By combining both concerns in a single script, kube-up.sh was fragile, difficult to debug, and nearly impossible to extend. If you wanted to customize the VM size, the network topology, or the operating system, you had to modify the script. If the script failed halfway through, there was no reliable way to resume. If you wanted to bootstrap Kubernetes on bare metal or on a cloud provider that kube-up.sh did not support, you were on your own.
The script was also undocumented in any meaningful way. Understanding what it did required reading thousands of lines of bash, following variable expansions across multiple files, and understanding the implicit assumptions about the environment. This was the era when “setting up a Kubernetes cluster” was a multi-day project that required deep expertise.
kubeadm: Separating Bootstrap from Provisioning
The Kubernetes community recognized that the solution was to separate concerns. Infrastructure provisioning should be handled by tools designed for that purpose — Terraform, CloudFormation, Ansible, or cloud-provider CLIs. Cluster bootstrapping should be handled by a dedicated tool that assumed machines already existed and focused exclusively on turning those machines into a Kubernetes cluster.
kubeadm emerged from SIG Cluster Lifecycle in 2016, reached beta in Kubernetes 1.11, and became GA in Kubernetes 1.13 (December 2018). Its design principles were explicit:
- Scope limitation: kubeadm bootstraps a cluster on existing machines. It does not provision infrastructure.
- Composability: kubeadm is designed to be a building block for higher-level tools; installers such as kubespray use kubeadm internally.
- Phases: the bootstrap process is broken into discrete, independently executable phases. If something fails, you can re-run a specific phase without starting over.
What kubeadm Actually Does
When you run kubeadm init on a machine destined to be a control plane node, it executes the following phases:
Preflight checks. kubeadm verifies that the machine meets requirements: the container runtime is installed and running, required kernel modules are loaded, required ports are available, swap is disabled (Kubernetes historically required this because the scheduler’s resource accounting assumed no swap), and the machine has sufficient resources.
PKI generation. kubeadm generates the entire certificate authority hierarchy: a root CA, an API server certificate, kubelet client certificates, front proxy certificates, etcd CA and certificates, and service account signing keys. Each certificate has appropriate SANs and key usages. This single phase eliminates what was previously one of the most error-prone manual steps.
Static pod manifests. Rather than running control plane components as system services, kubeadm writes static pod manifests to /etc/kubernetes/manifests/. The kubelet watches this directory and automatically creates pods for any manifests it finds. This means the API server, controller manager, scheduler, and etcd all run as pods on the control plane node — Kubernetes managing itself. This approach is elegant: it means the same mechanisms that manage user workloads also manage the control plane.
flowchart TD
Start["Machine with kubelet +<br>container runtime installed"]
Start --> P1
P1["Phase 1: Preflight checks<br>Verify container runtime, ports,<br>kernel modules, resources"]
P1 --> P2
P2["Phase 2: Generate PKI<br>CA, API server cert, kubelet certs,<br>etcd certs, SA keys<br>Writes to /etc/kubernetes/pki/"]
P2 --> P3
P3["Phase 3: Generate kubeconfig files<br>admin.conf, kubelet.conf,<br>controller-manager.conf, scheduler.conf"]
P3 --> P4
P4["Phase 4: Write static pod manifests<br>kube-apiserver.yaml<br>kube-controller-manager.yaml<br>kube-scheduler.yaml<br>etcd.yaml"]
P4 --> P5
P5["Phase 5: Wait for control plane<br>kubelet reads manifests, starts pods,<br>API server becomes healthy"]
P5 --> P6
P6["Phase 6: Upload configuration<br>Store cluster config in ConfigMap<br>for future joins"]
P6 --> P7
P7["Phase 7: Generate bootstrap token<br>Short-lived token for<br>worker nodes to join"]
P7 --> P8
P8["Phase 8: Install addons<br>CoreDNS (cluster DNS)<br>kube-proxy (service networking)"]
Bootstrap tokens. kubeadm generates a short-lived token that worker nodes use to authenticate with the API server during the join process. This solves the chicken-and-egg problem of node authentication: the node needs credentials to talk to the API server, but the API server needs to verify the node’s identity. The bootstrap token provides initial trust, and the node uses it to request a proper kubelet certificate through the TLS bootstrap protocol.
Addon installation. kubeadm installs CoreDNS (for cluster DNS) and kube-proxy (for service networking) as cluster addons. These are deployed as regular Kubernetes workloads, managed by the same control plane they support.
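The phases above can be driven from a single configuration file passed to kubeadm init. A sketch under illustrative assumptions (the token value, Kubernetes version, and pod subnet are placeholders):

```yaml
# kubeadm-config.yaml — used as: kubeadm init --config kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
bootstrapTokens:
  - token: "abcdef.0123456789abcdef"   # illustrative; kubeadm can generate one
    ttl: 24h0m0s                       # short-lived, used only for node joins
    usages: [signing, authentication]
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0             # illustrative version
networking:
  podSubnet: 10.244.0.0/16             # must match the CNI plugin's config
```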
The Alternatives: Different Problems, Different Tools
kubeadm solved the bootstrap problem but deliberately left the provisioning problem to others. This created space for tools that combined provisioning and bootstrapping, each optimized for different use cases. See Appendix C: Decision Trees for a flowchart to help choose the right bootstrap tool.
kops (Kubernetes Operations)
kops took the opposite approach from kubeadm: it handled both provisioning and bootstrapping. Originally built for AWS, kops could create VPCs, subnets, auto-scaling groups, security groups, IAM roles, Route53 DNS entries, and S3 state storage, then install and configure Kubernetes across the provisioned infrastructure.
kops was opinionated and comprehensive. It stored cluster state in a cloud storage bucket (S3 on AWS) and could perform rolling updates, upgrade Kubernetes versions, and resize clusters. For AWS users who wanted a production-grade, self-managed Kubernetes cluster without a managed service, kops was often the best choice.
The tradeoff: kops’ breadth makes it powerful on AWS but tightly coupled to cloud-provider APIs and harder to debug than kubeadm.
kubespray
kubespray used Ansible playbooks to install Kubernetes on existing machines. It supported a wide range of operating systems, container runtimes, and network plugins. kubespray was the tool of choice for organizations that already used Ansible for configuration management, had bare-metal infrastructure, or needed to customize every aspect of the installation.
kubespray occupied the middle ground between kubeadm’s minimalism and kops’ full-stack approach. It assumed you had provisioned machines (like kubeadm) but handled more of the pre-requisite setup than kubeadm did (installing container runtimes, configuring kernel parameters, setting up load balancers for HA control planes).
k3s
k3s, created by Rancher Labs (later acquired by SUSE), took a radically different approach. Instead of a collection of separate binaries with complex interdependencies, k3s packaged the entire Kubernetes distribution into a single binary under 100MB.
k3s achieved this by making several substitutions:
- SQLite instead of etcd for the default datastore (etcd and other datastores available as options)
- Flannel built-in for networking
- Traefik built-in as the ingress controller
- Local storage provider built-in
- Removed legacy and alpha features, cloud provider integrations, and storage drivers that were not needed in edge/IoT scenarios
The result was a Kubernetes distribution that could run on a Raspberry Pi, start in 30 seconds, and be installed with a single curl command. k3s was certified conformant — it passed the CNCF conformance tests — meaning it was “real Kubernetes,” just packaged differently.
k3s demonstrated that the complexity of Kubernetes installation was largely accidental, not essential. The core of Kubernetes is not that large; it was the matrix of configuration options, pluggable interfaces, and backward compatibility that made installation complex.
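k3s's substitutions are also swappable: its defaults can be kept or disabled through /etc/rancher/k3s/config.yaml, whose keys mirror the k3s server CLI flags. A sketch with illustrative choices:

```yaml
# /etc/rancher/k3s/config.yaml — keys mirror k3s server CLI flags.
write-kubeconfig-mode: "0644"
disable:
  - traefik              # skip the bundled ingress controller
node-label:
  - "environment=edge"   # illustrative label for edge nodes
```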
kind (Kubernetes IN Docker)
kind solved a different problem entirely: running Kubernetes in CI/CD pipelines and for local testing. kind created a multi-node Kubernetes cluster by running each “node” as a Docker container. Inside each container, it ran the kubelet and a container runtime (containerd), creating a nested container architecture.
kind was fast (cluster creation in under a minute), lightweight (no VMs required), and disposable (clusters could be created and destroyed as part of a test pipeline). It became the standard tool for testing Kubernetes itself — the Kubernetes CI infrastructure uses kind to run conformance tests.
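The node layout of a kind cluster is declared in a small config file and passed to kind create cluster --config. A sketch of a three-node cluster:

```yaml
# kind-config.yaml — each "node" becomes a Docker container.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```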
minikube
minikube was the original local Kubernetes development tool, created alongside kubeadm in 2016. It ran a single-node Kubernetes cluster inside a VM (or later, a container). minikube was the tool most developers encountered first when learning Kubernetes. It prioritized ease of use and supported add-ons for common development needs: dashboards, metrics, registries, and ingress controllers.
k0s
k0s (zero friction Kubernetes) followed k3s’ single-binary approach but aimed to be closer to upstream Kubernetes with fewer opinionated substitutions. k0s packaged all control plane components into a single binary and supported running the control plane and worker components separately, making it suitable for both single-node and multi-node deployments.
The Managed Service Explosion
The most significant development in cluster bootstrapping was the emergence of managed Kubernetes services that made bootstrapping irrelevant for a large portion of users.
Google Kubernetes Engine (GKE), launched in 2015, was the first. Google managed the control plane as a service. Users only managed worker nodes (and later, with Autopilot mode, not even that). GKE’s early availability gave it a lasting advantage: it had years of operational experience that competitors could not quickly replicate.
Azure Kubernetes Service (AKS) entered preview in 2017 (GA June 2018), and Amazon Elastic Kubernetes Service (EKS) launched in 2018. AWS was notably late to the Kubernetes party, having bet heavily on its own orchestration system (ECS) before market demand forced its hand. EKS’s eventual success validated Kubernetes as the industry standard: when the largest cloud provider builds a managed service for your project, you have won.
By the mid-2020s, managed Kubernetes services account for the majority of production Kubernetes usage. The bootstrapping tools — kubeadm, kops, kubespray — remain essential for on-premises deployments, specialized environments, and educational purposes, but the center of gravity has shifted decisively toward managed services.
Who Uses What (2024+)
Use Case Tool
───────────────────────────── ──────────────────────
Production (cloud) Managed: GKE, EKS, AKS
Production (on-premises) kubeadm + kubespray, or k0s/k3s
Production (AWS, self-managed) kops
Edge / IoT / Raspberry Pi k3s
CI/CD testing kind
Local development minikube, kind, Docker Desktop
Learning minikube, kind, k3s
The evolution of bootstrapping tools mirrors a broader pattern in infrastructure software: complexity moves from the user to the platform. In 2014, bootstrapping a cluster required deep expertise in Linux administration, PKI, and distributed systems. By 2024, it requires a credit card and a cloud provider account. The knowledge is still valuable — someone has to build and operate those managed services — but the barrier to entry for Kubernetes users has dropped by orders of magnitude.
Common Mistakes and Misconceptions
- “kubeadm is only for learning.” kubeadm is used in production by many organizations. It handles TLS bootstrapping, certificate rotation, and upgrade orchestration. Managed services are easier, but kubeadm is production-grade.
- “k3s is not real Kubernetes.” k3s is a certified, conformant Kubernetes distribution. It passes the same conformance tests as full K8s. It just has a smaller binary and uses SQLite instead of etcd by default.
- “I should use minikube/kind for production.” These tools are for local development and CI. They run single-node clusters without HA, proper networking, or persistent storage guarantees.
Further Reading
- kubeadm documentation – Official reference for the standard cluster bootstrapping tool. Covers kubeadm init, kubeadm join, certificate management, and upgrade procedures.
- kops GitHub repository – The Kubernetes Operations project for deploying production clusters on AWS, GCE, and other clouds. The docs include architecture decisions and comparison with other tools.
- kubespray documentation – Ansible-based cluster provisioning that supports bare metal, AWS, GCE, Azure, and more. Useful for understanding the infrastructure-as-code approach to cluster bootstrapping.
- k3s documentation – Rancher’s lightweight Kubernetes distribution designed for edge, IoT, and resource-constrained environments. Explains the trade-offs made to shrink Kubernetes into a single binary.
- Rancher documentation – Multi-cluster management platform that abstracts over different bootstrap methods. Covers fleet management, RBAC, and the operational layer above individual clusters.
- kind (Kubernetes in Docker) – A tool for running local Kubernetes clusters using Docker containers as nodes. Designed for testing Kubernetes itself, and widely used in CI/CD pipelines.
- minikube documentation – The original local Kubernetes tool, supporting multiple drivers (Docker, VirtualBox, HyperKit, etc.). Remains the most approachable path for developers learning Kubernetes.
Next: Chapter 12: Package Management and GitOps
Chapter 12: Package Management and GitOps
The YAML Explosion
Every Kubernetes resource is defined by a YAML manifest. A simple web application requires, at minimum: a Deployment (to run the pods), a Service (to expose them), a ConfigMap (for configuration), a Secret (for credentials), an Ingress (for external access), a ServiceAccount (for identity), and resource quotas. That is seven YAML files for a single application. A real production application typically requires 15-30 manifests when you include HorizontalPodAutoscalers, PodDisruptionBudgets, NetworkPolicies, PersistentVolumeClaims, and RBAC rules.
The YAML Explosion: One Application's Manifests
Minimal App (7 files) Production App (15-30 files)
┌─────────────────────┐ ┌─────────────────────────────────┐
│ deployment.yaml │ Pods │ deployment.yaml │
│ service.yaml │ Network │ service.yaml │
│ configmap.yaml │ Config │ configmap.yaml │
│ secret.yaml │ Creds │ secret.yaml │
│ ingress.yaml │ External │ ingress.yaml │
│ serviceaccount.yaml │ Identity │ serviceaccount.yaml │
│ resourcequota.yaml │ Limits │ resourcequota.yaml │
└─────────────────────┘ │─────────────────────────────────│
7 files │ hpa.yaml Scaling │
│ │ pdb.yaml Uptime │
│ "Just add │ networkpolicy.yaml Security │
│ production │ pvc.yaml Storage │
│ concerns..." │ role.yaml RBAC │
│ │ rolebinding.yaml RBAC │
▼ │ limitrange.yaml Limits │
┌───────────────┐ │ poddisruptionbudget.yaml Uptime │
│ × 3 envs │ │ prometheus-rules.yaml Observe │
│ (dev/stg/prd) │ │ grafana-dashboard.json Observe │
└───────────────┘ └─────────────────────────────────┘
│ 15-20 files
▼ │
┌───────────────┐ ▼
│ 7 × 3 = 21 │ ┌───────────────┐
│ files minimum │ │ × 3 envs │
└───────────────┘ │ (dev/stg/prd) │
│ └───────────────┘
│ "But each env differs: │
│ replicas, limits, ▼
│ image tags, configs..." ┌───────────────────┐
▼ │ 20 × 3 = 60 │
┌────────────────────┐ │ files to maintain │
│ 21-90 YAML files │ └───────────────────┘
│ for ONE service │
└────────────────────┘
Now multiply by environments. Most organizations maintain at least three — development, staging, and production — with small differences between them: different replica counts, different resource limits, different image tags, different configuration values. If you manage this with raw YAML files, you either maintain three copies of every manifest (tripling the maintenance burden and guaranteeing drift) or you build some ad-hoc templating system with sed and environment variables (fragile and error-prone).
This is the YAML explosion problem, and it is the root cause behind every tool discussed in this chapter.
Helm v2: The Package Manager with a Fatal Flaw
Helm was introduced in 2016 as “the package manager for Kubernetes,” explicitly modeled on apt, yum, and Homebrew. The core abstraction was the Chart — a collection of templated YAML files, a values.yaml file containing default parameters, and metadata describing the package.
Helm Charts solved two problems simultaneously:
Distribution. A complex application like Prometheus (which requires 10+ Kubernetes resources) could be packaged as a single Chart and installed with one command. Charts could be stored in repositories and versioned. The ecosystem effect was powerful: instead of every user figuring out how to deploy Prometheus on Kubernetes, one person wrote a Chart and everyone benefited.
Parameterization. Charts used Go templates to inject values into YAML manifests. A Deployment’s replica count might be templated as {{ .Values.replicaCount }}, allowing users to override it without modifying the Chart. This addressed the multi-environment problem: you could install the same Chart with different values files for dev, staging, and production.
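As a sketch of that mechanism (the chart and values shown here are illustrative, not a real published chart): a template references values, and values.yaml supplies overridable defaults.

```yaml
# templates/deployment.yaml (fragment) -- Go template placeholders
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: web
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
---
# values.yaml -- defaults, overridable with --set or -f <file>
replicaCount: 2
image:
  repository: example/web
  tag: "1.0.0"
```

Installing with helm install web ./chart -f values-prod.yaml renders the same template with production overrides, which is how one chart serves dev, staging, and production.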
But Helm v2 had a critical architectural flaw: Tiller.
Tiller: The Security Nightmare
Tiller was a server-side component that ran inside the Kubernetes cluster. When you ran helm install, your local Helm client sent the rendered manifests to Tiller, which then applied them to the cluster. Tiller stored release state (which Charts were installed, at which versions, with which values) as ConfigMaps in the cluster.
flowchart TD
helm["helm CLI<br>Chart + values.yaml"] -->|"gRPC"| Tiller
subgraph Cluster["Kubernetes Cluster"]
Tiller["Tiller (Deployment)<br>cluster-admin access"]
Tiller --> API["API Server"]
Tiller --> CM["ConfigMaps<br>(release state)"]
end
Warning["Problem: Tiller has<br>GOD MODE access to<br>the entire cluster"]
style Tiller fill:#f44,color:#fff,stroke:#d00
style Warning fill:#fee,stroke:#f44,color:#d00
The problem was that Tiller required cluster-admin privileges by default. It needed broad access because it had to create any type of resource in any namespace on behalf of any user. This meant:
- Any user who could talk to Tiller had effective cluster-admin access. Tiller did not enforce per-user RBAC. If developer A had permission to deploy only to namespace “team-a,” they could use Tiller to deploy anything anywhere, because Tiller itself had cluster-admin access.
- Tiller was a single point of attack. Compromise Tiller, and you had full control of the cluster. Tiller’s gRPC port was often accessible from any pod in the cluster without authentication.
- Multi-tenant clusters were unsafe. Helm v2 was fundamentally incompatible with the principle of least privilege. You could not safely use Helm v2 in a cluster shared by multiple teams with different access levels.
The security community raised alarms repeatedly. Workarounds existed (running one Tiller per namespace, using TLS for the gRPC connection), but they were complex and undermined Helm’s ease-of-use promise.
Helm v3: The Tiller Excision
Helm v3, released in November 2019, removed Tiller entirely. The new architecture was client-only: the Helm CLI connected directly to the Kubernetes API server, using the user’s own kubeconfig credentials. The user’s RBAC permissions determined what Helm could do. If a user only had access to namespace “team-a,” Helm would only be able to deploy to namespace “team-a.”
Release state moved from ConfigMaps to Kubernetes Secrets, stored in the namespace of the release. This was both more secure (Secrets can be encrypted at rest) and more natural (the release metadata lived alongside the release resources).
Helm v3 also added:
- JSON Schema validation: Chart authors could define schemas for their values.yaml, catching configuration errors before rendering
- OCI registry support: Charts could be stored in container registries alongside images, unifying artifact management
- Library charts: reusable chart fragments that could be imported by other charts, reducing duplication
- Three-way merge for upgrades: comparing the old manifest, the live cluster state, and the new manifest, enabling safer upgrades when resources had been manually modified
The removal of Tiller was driven by a principle that applies broadly in systems design: do not bypass the access control layer. Tiller existed because it was architecturally convenient to have a server-side component that could apply resources. But convenience created a massive security hole. Helm v3 demonstrated that you could achieve the same functionality without a privileged intermediary, simply by having the client talk directly to the API server.
Kustomize: Template-Free Customization
Kustomize, developed by Google and released in 2018, took a fundamentally different approach to the YAML problem. Where Helm used Go templates to parameterize YAML, Kustomize used overlay-based patching. No templating language. No {{ }} syntax. No in-cluster component like Tiller. Just plain YAML, composed and patched using a declarative overlay system.
The core idea was simple. You start with a base — a set of plain, valid Kubernetes YAML files that represent your application. Then you create overlays — directories that contain patches describing how to modify the base for a specific environment. An overlay might change the replica count for production, add resource limits for staging, or change the image tag for development.
Kustomize Directory Structure
base/
├── kustomization.yaml # Lists resources
├── deployment.yaml # Plain, valid Kubernetes YAML
├── service.yaml # No templates, no {{ }}
└── configmap.yaml
overlays/
├── dev/
│ ├── kustomization.yaml # References base + patches
│ └── replica-patch.yaml # "Change replicas to 1"
├── staging/
│ ├── kustomization.yaml
│ └── resource-patch.yaml # "Add resource limits"
└── prod/
├── kustomization.yaml
├── replica-patch.yaml # "Change replicas to 5"
└── hpa.yaml # "Add HorizontalPodAutoscaler"
Kustomize’s key advantage was diffability. Because the base files were plain YAML and the patches were plain YAML, you could diff any two environments and see exactly what was different. With Helm templates, understanding the difference between two rendered outputs required rendering both and diffing the result — a lossy process that made code review difficult.
Kustomize was integrated into kubectl itself (kubectl apply -k ./overlay/prod/), meaning it required no additional tooling. This made it attractive for organizations that wanted to minimize their dependency footprint.
Helm vs. Kustomize: Complementary, Not Competing
The community often framed Helm and Kustomize as competitors, but they solve different problems.
Helm excels at third-party package distribution. If you want to install Prometheus, PostgreSQL, or NGINX Ingress Controller on your cluster, Helm Charts are the standard distribution mechanism. The Chart author encapsulates the complexity of deploying the application, and you customize it through values. You would not want to maintain your own YAML files for every third-party application you use.
Kustomize excels at managing your own manifests across environments. If you are deploying your own application and need to manage small differences between dev, staging, and production, Kustomize’s overlay model is simpler and more transparent than Helm templates.
Many organizations use both: Helm for third-party software, Kustomize for their own applications. Some even use Kustomize to patch Helm-rendered output, combining both tools in a pipeline.
The GitOps Revolution
Helm and Kustomize solved the problem of parameterizing and organizing YAML. But they left a deeper problem unaddressed: how does the YAML get applied to the cluster?
The traditional workflow was: a developer modifies manifests, runs kubectl apply, and the cluster state changes. This approach has several serious deficiencies:
- No audit trail. Who applied what, when? kubectl does not maintain a log. You can check the Kubernetes audit log if it is enabled, but correlating API server events to human actions is difficult.
- No rollback mechanism. If a kubectl apply causes a problem, reverting requires knowing what the previous state was and manually applying it.
- No access control beyond RBAC. Anyone with kubectl access and appropriate RBAC permissions can modify the cluster. There is no approval workflow, no review process, no gate.
- Drift. If someone manually modifies a resource in the cluster (a “hot fix”), the cluster state diverges from the YAML files in the repository. Over time, the repository becomes a lie — it no longer represents what is actually running.
GitOps addresses all of these problems by applying a single principle: Git is the single source of truth for the desired state of the cluster.
flowchart TD
Dev["Developer"] -->|"git push"| Git["Git Repo<br>(source of truth)"]
Git -->|"watch"| Controller["GitOps Controller<br>(ArgoCD / Flux)<br><br>1. Watch Git for changes<br>2. Compare Git state to cluster state<br>3. Reconcile: apply diff to<br>make cluster match Git"]
Controller -->|"apply"| K8s["Kubernetes Cluster"]
K8s -->|"drift detection"| Controller
Rollback["Rollback = git revert<br>Audit = git log<br>Review = pull request<br>Access = Git permissions"]
style Rollback fill:#eff,stroke:#099,color:#066
The idea is that a controller running inside the cluster watches a Git repository. When the repository changes (new commit, merged pull request), the controller compares the desired state in Git to the actual state in the cluster and reconciles any differences. This is the Kubernetes reconciliation pattern applied to deployment itself — the same pattern that the Deployment controller uses to reconcile desired and actual pod counts, now applied at the level of the entire cluster configuration.
ArgoCD
ArgoCD, created by Intuit in 2018 and donated to the CNCF, is the most widely adopted GitOps tool. ArgoCD runs as a set of controllers in the cluster and provides a web UI, CLI, and API for managing applications. An ArgoCD “Application” resource defines the mapping: this Git repository, this path, this branch should be deployed to this cluster, this namespace.
ArgoCD supports Helm Charts, Kustomize overlays, plain YAML directories, and Jsonnet as input formats. It provides real-time sync status visualization, showing which resources are in sync with Git and which have drifted. It supports multi-cluster management, RBAC, SSO integration, and automated sync policies.
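A minimal Application resource might look like this (repository URL, paths, and names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-repo.git
    targetRevision: main
    path: overlays/prod      # plain YAML, Kustomize, or a Helm chart
  destination:
    server: https://kubernetes.default.svc   # the local cluster
    namespace: web
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to Git state
```

The automated syncPolicy is what makes this GitOps rather than merely "deploy from Git": prune and selfHeal close the loop on drift in both directions.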
Flux
Flux, created by Weaveworks in 2017 (and rebuilt as Flux v2 using the GitOps Toolkit), takes a more Kubernetes-native approach. Flux is a set of Custom Resource Definitions (CRDs) and controllers: a GitRepository resource tells Flux where to watch, a Kustomization resource tells Flux how to render and apply the manifests, and a HelmRelease resource tells Flux how to manage Helm releases.
Flux v2 was designed to be composable: each controller does one thing, and they communicate through Kubernetes resources. This makes Flux extensible (you can add image automation controllers, notification controllers, etc.) but also means there are more pieces to understand and configure.
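The composability is easiest to see in the manifests themselves. A sketch of the two core resources (names, URL, and intervals illustrative):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: deploy-repo
  namespace: flux-system
spec:
  interval: 1m                 # how often to poll Git
  url: https://github.com/example/deploy-repo.git
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: web
  namespace: flux-system
spec:
  interval: 10m                # how often to re-reconcile
  sourceRef:
    kind: GitRepository        # consumes the artifact produced above
    name: deploy-repo
  path: ./overlays/prod
  prune: true                  # remove resources deleted from Git
```

One controller fetches the repository; a separate controller renders and applies it. They communicate only through these Kubernetes resources, which is the composability Flux v2 is built on.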
If you manage multiple clusters (dev, staging, production, or multiple production regions), GitOps ensures they are configured from the same source. Promoting a change from staging to production is a Git merge, not a series of manual kubectl commands against different clusters.
Common Mistakes and Misconceptions
- “Helm charts are always safe to install.” Helm charts can contain arbitrary Kubernetes resources including ClusterRoles and webhooks. Always review chart templates before installing, especially from unknown sources.
- “Kustomize replaces Helm.” They solve different problems. Helm templates generate YAML; Kustomize patches existing YAML. Many teams use both: Helm for third-party charts, Kustomize for environment overlays.
- “Putting all configuration in values.yaml is good practice.” Over-parameterizing Helm charts makes them harder to maintain than raw YAML. Only expose values that actually change between environments.
Further Reading
- Helm documentation – Official reference for Helm, covering chart structure, templating, release management, and the Helm SDK. Start with the “Chart Developer Guide” for understanding how charts are built.
- Kustomize documentation – The template-free customization tool built into kubectl. The “Examples” section demonstrates the overlay pattern for managing environment-specific configurations.
- Helm Chart Best Practices Guide – Official guidelines for writing production-quality Helm charts, covering values design, template conventions, labels, and dependency management.
- Artifact Hub – The CNCF’s central repository for discovering Helm charts, OPA policies, and other Kubernetes packages. Browse to understand the breadth of the ecosystem and how charts are published and versioned.
- “Helm vs Kustomize” (Harness) – A practical comparison of the two dominant approaches, covering strengths, weaknesses, when to use each, and how to combine them.
- cdk8s documentation – AWS’s framework for defining Kubernetes manifests using general-purpose programming languages (TypeScript, Python, Go, Java). Represents the “code over YAML” approach to configuration.
- “Stop Using Helm” and the counterarguments (archived) – A provocative critique of Helm’s templating approach, useful for understanding the trade-offs that led to alternatives like Kustomize and cdk8s.
Next: Chapter 13: The Networking Stack Evolution
Chapter 13: The Networking Stack Evolution
The Fundamental Requirement
Kubernetes imposes a single, non-negotiable networking requirement: every pod gets its own IP address, and every pod can communicate with every other pod without NAT. This is called the flat network model. A pod on Node A can reach a pod on Node B by sending a packet to that pod’s IP address directly. No port mapping. No address translation. No special routing configuration by the application developer.
This requirement is deceptively simple to state and remarkably difficult to implement. On a single machine, giving each container its own IP is straightforward using Linux network namespaces and virtual ethernet pairs. But across machines, you must somehow route packets from one node’s pod network to another node’s pod network, typically over a physical network that knows nothing about Kubernetes pods. This is the problem that CNI (Container Network Interface) plugins solve, and the evolution of these plugins reflects the broader maturation of Kubernetes networking from “good enough” to “production-grade at massive scale.”
Flannel: The First Answer (2014)
Flannel, created by CoreOS in 2014, was the first widely-adopted CNI plugin for Kubernetes. Flannel’s approach was simple: create a VXLAN overlay network. Each node was assigned a subnet (e.g., node 1 gets 10.244.1.0/24, node 2 gets 10.244.2.0/24), and VXLAN encapsulation handled cross-node communication. When a pod on node 1 sent a packet to a pod on node 2, Flannel encapsulated the pod-to-pod packet inside a UDP packet with the node-to-node addresses, sent it across the physical network, and de-encapsulated it on the other side.
flowchart LR
subgraph Node1["Node 1 (10.0.0.1)"]
PodA["Pod A<br>10.244.1.5"] --> flannel1["flannel.1<br>(VXLAN dev)"]
flannel1 --> encap["Encapsulate:<br>src=10.0.0.1, dst=10.0.0.2"]
end
encap -->|"Physical network<br>(VXLAN encapsulated)"| decap
subgraph Node2["Node 2 (10.0.0.2)"]
decap["Decapsulate:<br>unwrap original packet"] --> flannel2["flannel.1<br>(VXLAN dev)"]
flannel2 --> PodB["Pod B<br>10.244.2.8"]
end
Flannel worked. It was simple to deploy, easy to understand, and provided basic pod-to-pod connectivity. But it had significant limitations:
- No network policy support. Flannel could not restrict which pods could talk to which other pods. In a multi-tenant cluster, any pod could reach any other pod. This was a non-starter for security-conscious organizations.
- VXLAN overhead. Encapsulation added 50 bytes of header overhead to every packet, reducing the effective MTU. It also added CPU overhead for encapsulation and decapsulation.
- Limited performance. The overlay approach was inherently slower than native routing because of the encapsulation overhead and the need to traverse the kernel’s network stack twice (once for the inner packet, once for the outer packet).
Flannel was “good enough” for getting started, and it remains useful in simple environments and edge deployments (k3s includes Flannel by default). But production clusters needed more.
Calico: Production-Grade Networking (2016)
Calico, created by Project Calico (later commercialized by Tigera), took a fundamentally different approach. Instead of overlay networking, Calico used BGP (Border Gateway Protocol) to distribute pod routes across the physical network. Each node announced its pod subnet to its neighbors using BGP, and the physical network infrastructure routed packets natively. No encapsulation. No overlay. Packets traveled from pod to pod using the same routing mechanisms that power the internet.
This approach had significant advantages:
- Native performance. Without encapsulation overhead, Calico achieved near-line-rate performance. Packets were not wrapped and unwrapped; they were simply routed.
- Rich network policies. Calico implemented Kubernetes NetworkPolicy and extended it with its own CRD-based policies that supported L3/L4 rules, namespace selectors, global policies, and CIDR-based rules for external traffic.
- Visibility. Because packets were not encapsulated, standard network debugging tools (tcpdump, traceroute) worked as expected. With overlays, debugging required understanding the encapsulation layer.
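For reference, the standard Kubernetes NetworkPolicy that plugins like Calico enforce looks like this (labels and namespace are illustrative): allow only frontend pods to reach the backend on port 8080, and nothing else.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: app
spec:
  podSelector:
    matchLabels:
      role: backend          # policy applies to these pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```

Flannel ignores a resource like this entirely; Calico (and later Cilium) turn it into enforced dataplane rules.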
The tradeoff was that BGP-based routing required cooperation from the physical network infrastructure. In cloud environments where you could not run BGP (because the cloud provider controlled the network), Calico could fall back to VXLAN or IP-in-IP encapsulation. This hybrid approach made Calico viable everywhere while providing optimal performance on bare metal.
Calico became the de facto standard for production Kubernetes networking. Its combination of performance, network policy support, and operational maturity made it the default choice for most serious deployments.
The eBPF Revolution
To understand why eBPF changed Kubernetes networking, you must first understand how kube-proxy and iptables work — and why they fail at scale.
The iptables Problem
kube-proxy is the Kubernetes component responsible for implementing Services. When you create a Service with three backend pods, kube-proxy programs the node’s packet filtering rules so that traffic to the Service’s ClusterIP is load-balanced across the three pods. Historically, kube-proxy used iptables to implement this.
iptables is a Linux kernel packet filtering framework that evaluates rules sequentially. For each Service, kube-proxy creates a chain of iptables rules that use probability-based matching to distribute traffic across endpoints. If a Service has three endpoints, the first rule matches with probability 1/3, the second matches with probability 1/2 (of the remaining traffic), and the third catches everything else.
The problem is scale. iptables rules are evaluated linearly for each packet. In a cluster with 10,000 Services, each with multiple endpoints, the iptables rule set can grow to hundreds of thousands of rules. Every packet entering the node traverses this list. The result is measurable latency increases at scale, slow rule updates (programming 100,000 iptables rules takes seconds to minutes), and high CPU overhead.
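The probability chains described above look roughly like this in practice (chain suffixes are truncated hashes in real output; this is a simplified sketch of what kube-proxy programs, not verbatim):

```text
# One KUBE-SVC chain per Service, one KUBE-SEP chain per endpoint.
# For a Service with 3 endpoints:
-A KUBE-SVC-WEB -m statistic --mode random --probability 0.33333 -j KUBE-SEP-POD1
-A KUBE-SVC-WEB -m statistic --mode random --probability 0.50000 -j KUBE-SEP-POD2
-A KUBE-SVC-WEB -j KUBE-SEP-POD3

# Each KUBE-SEP chain DNATs to one pod IP:
-A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.244.1.5:8080
```

Multiply this handful of rules by 10,000 Services and the linear-scan cost becomes visible on every packet.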
iptables vs eBPF: The Performance Cliff
Packet processing time vs. number of Services
Latency
(us)
|
| * iptables
| *
| *
| *
| *
| *
| *
| *
| *
| *
| *
| ──────────────────────────────────────── eBPF (constant)
|
└─────────────────────────────────────────── Number of Services
1K 5K 10K 20K
iptables: O(n) linear scan for each packet
eBPF: O(1) hash table lookup for each packet
IPVS (IP Virtual Server) was introduced as an alternative kube-proxy mode to address some of these issues. IPVS uses hash tables rather than linear rule chains, providing better performance at scale. But IPVS still runs in the kernel’s Netfilter framework and has limitations around custom packet manipulation and observability.
eBPF: Programs in the Kernel
eBPF (extended Berkeley Packet Filter) is a technology that allows running sandboxed programs directly in the Linux kernel without modifying kernel source code or loading kernel modules. Originally designed for packet filtering (hence the name), eBPF has evolved into a general-purpose in-kernel execution environment.
An eBPF program is compiled to a bytecode that the kernel verifies for safety (no infinite loops, no invalid memory access, bounded execution time) and then JIT-compiles to native machine code. eBPF programs can be attached to various kernel hooks: network device ingress/egress, socket operations, system calls, tracepoints, and more.
For Kubernetes networking, eBPF is transformative because it allows implementing Service load-balancing, network policies, and observability at the earliest possible point in the kernel’s network stack, using hash table lookups instead of linear rule chains.
When a packet arrives at a node destined for a Service ClusterIP, an eBPF program attached to the network device performs a single hash lookup in a BPF map to find the backend pod, rewrites the packet’s destination address, and forwards it. O(1) regardless of how many Services exist. No iptables traversal. No Netfilter overhead.
Cilium: eBPF-Native Networking (2017)
Cilium, created by Isovalent in 2017 (Isovalent was acquired by Cisco in 2024), was built from the ground up on eBPF. Where Calico added eBPF support as an alternative to its iptables-based datapath, Cilium was eBPF-native from day one.
Cilium’s capabilities extend well beyond basic networking:
kube-proxy replacement. Cilium can fully replace kube-proxy, implementing Service load-balancing with eBPF programs. This eliminates the iptables bottleneck entirely and provides features like Maglev consistent hashing (for better load distribution), DSR (Direct Server Return) for reduced latency on reply packets, and graceful connection handling during backend changes.
L7-aware network policies. Traditional network policies operate at L3/L4 — IP addresses and TCP/UDP ports. Cilium’s eBPF programs can inspect L7 protocol headers, enabling policies like “allow HTTP GET to /api/v1/users but deny HTTP DELETE” or “allow gRPC calls to the ProductService.GetProduct method but deny ProductService.DeleteProduct.” This level of granularity was previously only available through service meshes.
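The first HTTP example above might be expressed as a CiliumNetworkPolicy like this (labels and path are illustrative):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: users-api-readonly
spec:
  endpointSelector:
    matchLabels:
      app: users-api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
          rules:
            http:
              - method: GET         # allow GET /api/v1/users;
                path: /api/v1/users # DELETE or other paths are denied
```

Note that the spec reads like a standard NetworkPolicy until the rules.http stanza, which is where it drops below L4 into request-level enforcement.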
Hubble observability. Cilium includes Hubble, an observability platform that provides real-time visibility into network flows, DNS queries, HTTP requests, and connection state — all captured by eBPF programs with minimal overhead. This is networking observability without sampling, without agents, without instrumentation.
Transparent encryption. Cilium can encrypt all pod-to-pod traffic using WireGuard or IPsec, transparently and without application changes. The encryption and decryption happen in eBPF programs attached to the network interfaces, so applications are unaware of the encryption layer.
Bandwidth management. eBPF programs can implement EDT (Earliest Departure Time) based rate limiting, providing better bandwidth management than traditional tc (traffic control) approaches.
Cilium became the default CNI on Google Kubernetes Engine in 2023, a significant endorsement. Its adoption reflects a broader trend: the kernel’s programmability through eBPF is displacing decades of networking infrastructure built on iptables, ipvs, and userspace proxies.
The kube-proxy Replacement Story
The move to replace kube-proxy deserves special attention because it illustrates how architectural assumptions age.
When Kubernetes was designed, iptables was the standard way to implement packet manipulation in the Linux kernel. It was well-understood, widely deployed, and sufficient for the cluster sizes of the time (dozens to hundreds of nodes, hundreds to thousands of Services). kube-proxy’s iptables mode was a reasonable engineering choice.
But Kubernetes clusters grew. Cloud providers ran clusters with tens of thousands of nodes and tens of thousands of Services. The linear scaling characteristics of iptables became untenable. Rule update latency meant Service changes took minutes to propagate. Connection tracking table overflow caused packet drops.
The progression was:
- iptables mode (original): simple, O(n) per packet, slow updates at scale
- IPVS mode (Kubernetes 1.11 GA): hash-based, better at scale, but still Netfilter-based
- eBPF mode (Cilium, Calico): O(1) per packet, fast updates, additional features
- nftables mode (Kubernetes 1.31): successor to iptables within the Netfilter framework, better performance and maintainability than iptables but still not eBPF-level
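The mode in that progression is selected through kube-proxy's own configuration file. A sketch showing only the relevant field:

```yaml
# KubeProxyConfiguration fragment -- passed to kube-proxy via --config
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"   # "" / "iptables" (default), "ipvs", or "nftables"
```

The eBPF option is different in kind: it is not a kube-proxy mode but a replacement, enabled in the CNI plugin itself (for example Cilium's kube-proxy replacement setting).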
Today, organizations running at scale increasingly use Cilium or Calico’s eBPF datapath in place of kube-proxy. The kube-proxy component remains the default for backward compatibility and for environments where eBPF is not available (older kernels, certain cloud VMs), but the trajectory is clear.
Service Mesh Evolution
Istio, announced in 2017 and jointly developed by Google and IBM with Lyft's Envoy proxy as its data plane, was the first major service mesh for Kubernetes. Istio's architecture injected an Envoy sidecar proxy into every pod. All traffic to and from the pod passed through this proxy, which could enforce mTLS (mutual TLS), collect metrics, perform traffic routing, implement circuit breakers, and enforce access policies.
sequenceDiagram
box Traditional (Istio with sidecars)
participant A1 as Pod A: App
participant E1 as Pod A: Envoy
participant E2 as Pod B: Envoy
participant B1 as Pod B: App
end
A1->>E1: request
E1->>E2: mTLS
E2->>B1: request
Note over A1,B1: Per-pod proxy: 50-100 MB memory each
box Sidecar-less (Cilium / Istio Ambient)
participant A2 as Pod A: App
participant NP as Per-Node eBPF Proxy
participant B2 as Pod B: App
end
A2->>NP: request
NP->>B2: mTLS + L7 policy
Note over A2,B2: One shared proxy per node: 10x less memory
The sidecar approach was powerful but expensive. Each sidecar consumed memory (50-100 MB per pod was common for Envoy), added latency (traffic traversed two proxies for each hop), and increased the complexity of the pod lifecycle (the sidecar had to start before the application, and shutting down required careful ordering). In a cluster with 10,000 pods, the sidecar overhead was 500 GB to 1 TB of memory just for the mesh infrastructure.
Linkerd, created by Buoyant in 2017, was the lighter-weight alternative. Linkerd's Rust-based proxy (linkerd2-proxy) consumed significantly less memory than Envoy and focused on a smaller, well-defined feature set: mTLS, observability, and reliability features.

The most significant recent trend is the sidecar-less mesh. Cilium Service Mesh uses eBPF programs in the kernel to provide mTLS, L7 policy, and observability without any sidecar proxies. Istio's Ambient Mesh mode uses per-node ztunnel proxies for L4 features (mTLS, L4 policy) and optional waypoint proxies for L7 features, eliminating the per-pod sidecar overhead.
The sidecar-less approach reflects a broader realization: much of what sidecars do can be done more efficiently at the node level or in the kernel. The sidecar was an architectural choice driven by the constraints of 2017 (limited eBPF support, no per-node proxy infrastructure). As the infrastructure has evolved, the architecture is evolving with it.
The Current Landscape
The Kubernetes networking stack in 2024+ looks nothing like it did in 2015:
- CNI plugin: Cilium (dominant, especially on cloud), Calico (strong on-premises), Flannel (edge/simple deployments)
- Service implementation: eBPF (Cilium, Calico) replacing iptables/IPVS (kube-proxy)
- Network policy: Cilium or Calico, both supporting L3/L4 and increasingly L7
- Service mesh: consolidating around sidecar-less approaches; Istio Ambient and Cilium Service Mesh
- Encryption: WireGuard-based transparent encryption (Cilium, Calico)
The evolution from Flannel’s simple VXLAN overlay to Cilium’s eBPF-native stack represents one of the most dramatic technical shifts in the Kubernetes ecosystem. It was driven by scale: the solutions that worked for hundreds of nodes failed at thousands. And it was enabled by a foundational technology shift (eBPF) that changed what was possible inside the Linux kernel. For a quick flowchart on choosing a CNI, see Appendix C: Decision Trees.
Common Mistakes and Misconceptions
- “eBPF replaces all of iptables immediately.” Cilium’s eBPF datapath replaces kube-proxy’s iptables rules for service routing, but iptables still exists on the host for other purposes. Migration is incremental.
- “I need a service mesh from day one.” Service meshes add complexity (sidecars, mTLS certificate management, control plane). Start without one; add it when you have a concrete need for mTLS, traffic splitting, or observability between services.
- “Flannel is obsolete.” Flannel is simpler and lighter than Calico or Cilium. For small clusters that don’t need NetworkPolicy, Flannel is a perfectly valid choice.
Further Reading
- Cilium documentation – Comprehensive reference for the eBPF-based CNI plugin that has become the dominant networking solution. The “Concepts” section explains how eBPF replaces iptables for service routing, network policy, and observability.
- Calico documentation – Covers Calico’s BGP-based networking, network policy engine, and eBPF dataplane. Particularly strong on network policy design patterns for enterprise environments.
- eBPF.io – The definitive resource for understanding eBPF, the kernel technology underpinning modern Kubernetes networking. Includes tutorials, reference material, and a curated list of eBPF-based projects.
- Isovalent blog: “eBPF-based Networking, Observability, Security” – Technical deep-dives from the creators of Cilium on how eBPF is applied to networking, including kube-proxy replacement, transparent encryption, and service mesh without sidecars.
- Thomas Graf – “Accelerating Envoy with the Linux Kernel” (KubeCon EU 2018) — Cilium creator on how eBPF fundamentally changes Kubernetes networking performance and architecture.
- Flannel GitHub repository – The simple overlay network that was the default CNI for early Kubernetes. Reading the design docs helps understand the baseline that more advanced CNI plugins improved upon.
- Cilium Service Mesh – Documentation on Cilium’s sidecar-less service mesh implementation, showing how eBPF enables mTLS, L7 policy, and traffic management without per-pod proxy overhead.
- Kubernetes Network Policy documentation – The official reference for the NetworkPolicy API, essential for understanding the baseline that CNI plugins like Cilium and Calico extend with their own CRDs.
Next: Chapter 14: Kubernetes Version History — A Guided Tour
Chapter 14: Kubernetes Version History — A Guided Tour
For a visual timeline showing how the entire ecosystem evolved in parallel, see Appendix E: Architecture Evolution Timeline.
timeline
title Kubernetes Release Timeline — The Inflection Points
section Can it run?
v1.0 (2015) : CNCF launch
: Pods, Services, Secrets
section Can it handle state?
v1.5 (2016) : CRI introduced
: StatefulSets (beta)
: PodDisruptionBudgets
section Can we extend it?
v1.7 (2017) : CRDs replace TPR
: RBAC GA
section Is it production ready?
v1.9 (2018) : Apps/v1 GA
: Workloads API stable
section Can anyone set it up?
v1.13 (2019) : kubeadm GA
: CSI 1.0
section Cleaning house
v1.20 (2020) : Docker deprecation announced
section Removing the debt
v1.22 (2021) : API removals forced migration
: beta APIs off by default
section Runtime independence
v1.24 (2022) : dockershim removed
: Kubernetes runs on containerd/CRI-O only
section Mature platform
v1.29 (2023) : Sidecar containers (beta)
: KMS v2 GA
v1.31 (2024) : nftables kube-proxy (beta)
: AppArmor GA
v1.0 (July 2015): The Starting Line
Kubernetes 1.0 was released at OSCON 2015 alongside the announcement that Google was donating the project to the newly formed Cloud Native Computing Foundation (CNCF). The CNCF donation (covered in Chapter 8) gave competitors reason to contribute rather than fork.
The 1.0 release was sparse by modern standards. It had Pods, ReplicationControllers (the precursor to ReplicaSets and Deployments), Services, and Secrets. There were no Deployments, no StatefulSets, no RBAC, no CRDs. The scheduler was basic. Networking was primitive. But the core architectural decisions were already in place: the declarative API model, the reconciliation loop pattern, etcd as the state store, and the API server as the single point of access.
The significance of 1.0 was not its feature set but its commitment to stability. By calling it 1.0, the project promised backward compatibility. API resources marked as stable would not be removed or changed in breaking ways. This promise — which Kubernetes has largely kept — gave enterprises the confidence to invest in the platform.
v1.2 (March 2016): First Usable for Production
Kubernetes 1.2 introduced three features that transformed it from a promising experiment into something you could actually run in production.
ConfigMaps provided a way to inject configuration data into pods without baking it into the container image. Before ConfigMaps, you had two options: environment variables (limited and inflexible) or mounting Secrets (semantically wrong for non-secret configuration).
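The pattern looks like this in practice. A minimal sketch, with illustrative names and a placeholder image; keys can be injected as environment variables, mounted as files, or both:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config            # illustrative name
data:
  LOG_LEVEL: "info"
  config.yaml: |
    featureFlags:
      newUI: true
---
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: app
      image: example.com/app:1.0     # placeholder image
      envFrom:
        - configMapRef:
            name: app-config         # each key becomes an environment variable
      volumeMounts:
        - name: config
          mountPath: /etc/app        # each key appears as a file here
  volumes:
    - name: config
      configMap:
        name: app-config
```

Because the ConfigMap is a separate object, the same image can run in dev and prod with different configuration, which was the point.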
DaemonSets ensured that a specific pod ran on every node (or a selected subset of nodes). This was essential for infrastructure agents: log collectors, monitoring agents, network plugins, storage drivers. Without DaemonSets, operators had to manually ensure these agents were running on every node and handle new nodes joining the cluster.
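A sketch of a typical infrastructure-agent DaemonSet (names and image are illustrative); the toleration lets it run on control plane nodes, and the hostPath mount gives it access to the node's logs:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector              # illustrative
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule       # also run on control plane nodes
      containers:
        - name: collector
          image: example.com/log-collector:1.0   # placeholder
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log         # read the node's log files
```

New nodes that join the cluster get a copy of this pod automatically; no operator intervention is needed.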
Deployments (in beta) introduced declarative rolling updates. Before Deployments, updating an application required manually managing ReplicationControllers — creating a new one, scaling it up, scaling the old one down, and handling failures during the transition. Deployments automated this entire process and added rollback capability. The Deployment controller became the workhorse of Kubernetes, managing the vast majority of stateless workloads.
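A minimal sketch of what that automation replaced, assuming a generic web workload. The strategy block encodes the scale-up/scale-down choreography that operators previously did by hand:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # at most one replica down during a rollout
      maxSurge: 1          # at most one extra replica created
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27   # changing this tag triggers a rolling update
```

Changing the image tag and re-applying triggers the rollout; `kubectl rollout undo deployment/web` reverts to the previous ReplicaSet.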
v1.3 (July 2016): The State Problem
PetSets (later StatefulSets) provided stable network identities, per-pod persistent storage, and ordered scaling — the features stateful workloads need that Deployments do not provide.
The name “PetSets” reflected the “pets vs. cattle” metaphor that dominated DevOps thinking: stateless containers were “cattle” (identical, replaceable) while stateful services were “pets” (unique, requiring individual care). The rename to StatefulSets in 1.5 was driven by the community’s desire for a more descriptive, less metaphorical name.
Cluster federation also appeared in alpha, reflecting an early attempt to manage multiple clusters as a single entity. Federation proved premature — the problem was real but the approach was wrong — and the v1 API was eventually retired in favor of a successor project (KubeFed, itself later archived) and, in practice, tools like Loft’s vCluster, Admiralty, and the multi-cluster capabilities of service meshes and GitOps tools.
v1.5 (December 2016): The Plugin Architecture Emerges
Kubernetes 1.5 was architecturally pivotal. The Container Runtime Interface (CRI) was introduced, beginning the process that would eventually lead to Docker’s removal (covered in Chapter 10). StatefulSets reached beta, making stateful workloads viable for early adopters. PodDisruptionBudgets appeared, giving operators a way to express how much disruption a workload could tolerate during maintenance operations.
PodDisruptionBudgets solved a subtle but critical problem. When a node needed to be drained for maintenance (kernel upgrade, hardware repair), the system needed to know whether it was safe to evict a pod. For a Deployment with 10 replicas, losing one pod during a drain is fine. For a three-node etcd cluster, losing one node when another is already down would break quorum. PodDisruptionBudgets let operators express constraints like “at least 2 of 3 replicas must always be available,” giving the drain process the information it needed to proceed safely.
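The etcd constraint described above maps directly onto a PodDisruptionBudget. A sketch (label selector illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-pdb
spec:
  minAvailable: 2        # "at least 2 of 3 replicas must always be available"
  selector:
    matchLabels:
      app: etcd          # illustrative label for the etcd pods
```

With this in place, `kubectl drain` will refuse to evict an etcd pod if doing so would drop the available count below two.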
v1.6 (March 2017): Security Gets Serious
RBAC (Role-Based Access Control) reached beta and was enabled by default. Before RBAC, Kubernetes had ABAC (Attribute-Based Access Control), which required restarting the API server to change policies. However, the bigger problem was that many clusters simply ran with the AlwaysAllow authorizer — the permissive default — which allowed any authenticated user to do anything. ABAC was available as a more restrictive alternative, but its static file-based configuration made it cumbersome to adopt. RBAC changed this fundamentally.
RBAC introduced Roles (permissions scoped to a namespace), ClusterRoles (permissions scoped to the cluster), RoleBindings, and ClusterRoleBindings. It allowed fine-grained access control: developer A can create Deployments in namespace “team-a” but not in namespace “team-b.” Service accounts can read ConfigMaps but not Secrets. CI/CD pipelines can deploy but not modify RBAC rules.
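The "developer A in namespace team-a" example translates to a Role and RoleBinding pair like this (names are illustrative; the User identity comes from whatever authentication layer the cluster uses):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-a
  name: deployer
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: team-a
  name: developer-a-deployer
subjects:
  - kind: User
    name: developer-a              # identity asserted by the authn layer
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployer
  apiGroup: rbac.authorization.k8s.io
```

The same Role could be bound in namespace team-a but simply not bound in team-b, which is how the namespace boundary is enforced.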
This release also migrated the default storage backend from etcd v2 to etcd v3, a significant change. etcd v3 introduced a flat key-value model (replacing v2’s directory tree), a more efficient storage format, and support for watchers at scale. The migration was transparent to most users but was essential for supporting larger clusters.
Dynamic storage provisioning reached GA, allowing PersistentVolumeClaims to automatically trigger the creation of underlying storage (EBS volumes, GCE persistent disks, NFS shares) without manual administrator intervention. This completed the self-service model: developers could request storage in their manifests and the cluster would provision it automatically.
v1.7 (June 2017): Extensibility Unlocked
Custom Resource Definitions (CRDs) replaced the earlier ThirdPartyResources, fundamentally changing what Kubernetes could do. CRDs allowed anyone to define new resource types in the Kubernetes API. Combined with custom controllers, CRDs turned Kubernetes from a container orchestration platform into a general-purpose platform for managing any kind of resource.
The significance of CRDs cannot be overstated. They enabled the “operator pattern” — custom controllers that encode domain-specific operational knowledge. A PostgreSQL operator could define a PostgresCluster CRD, and a controller could watch for these resources and automatically provision databases, configure replication, manage backups, and handle failover. The operator pattern turned Kubernetes into a platform for automating the operation of complex software systems, not just running containers.
Network Policies reached GA, providing a mechanism to restrict pod-to-pod communication. Before Network Policies, the flat network model meant any pod could talk to any other pod — a security model that was unacceptable for multi-tenant clusters or environments handling sensitive data.
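A sketch of a typical restriction, assuming illustrative labels: only the API tier may reach the database, and only on the PostgreSQL port:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-ingress
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: postgres            # the policy applies to these pods
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api         # only the API tier may connect
      ports:
        - protocol: TCP
          port: 5432
```

Once any policy selects a pod, all traffic not explicitly allowed is denied — the flat "anyone can talk to anyone" default applies only to unselected pods.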
v1.8 (September 2017): RBAC Stabilizes
RBAC reached GA, completing its journey from alpha to stable. This was the release where Kubernetes’ security model was considered production-ready. From this point forward, the expectation was that all clusters would use RBAC, and tools and documentation assumed its presence.
CronJobs reached beta, providing scheduled job execution (the Kubernetes equivalent of cron). While conceptually simple, CronJobs were important because they addressed a common pattern — batch processing, report generation, database maintenance — that previously required external scheduling systems.
v1.9 (December 2017): The Apps API Stabilizes
This release marked the moment Kubernetes’ core workload APIs became stable. Deployments, ReplicaSets, StatefulSets, and DaemonSets all reached GA under the apps/v1 API group.
The Container Storage Interface (CSI) appeared in alpha. CSI would do for storage what CRI did for container runtimes and CNI did for networking: define a standard interface so storage providers could be plugged in without modifying Kubernetes core code. Before CSI, storage drivers were compiled into Kubernetes, meaning a new storage provider required a change to the Kubernetes codebase. CSI decoupled storage from the Kubernetes release cycle.
v1.11 (June 2018): Infrastructure Refresh
CoreDNS replaced kube-dns as the default cluster DNS provider. kube-dns was a composite of three containers (kube-dns, dnsmasq, sidecar) that was complex to debug and had known performance issues. CoreDNS was a single binary, written in Go, with a plugin-based architecture that made it flexible and easy to extend. The switch reflected a maturation of the ecosystem: better tools replaced adequate ones.
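The plugin architecture is visible in the Corefile itself. This is close to the configuration kubeadm deploys by default (exact plugins vary by version); each line inside the block is a plugin in the processing chain:

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
}
```

The kubernetes plugin answers cluster-internal queries; everything else falls through to the node's upstream resolvers via forward.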
IPVS-based kube-proxy reached GA, providing an alternative to iptables mode for Service load-balancing. IPVS used hash tables instead of linear iptables chains, offering better performance at scale (thousands of Services). This was a stopgap improvement; the eventual answer would be eBPF, but IPVS provided meaningful improvements for clusters that could not yet adopt eBPF-based solutions.
v1.13 (December 2018): The Bootstrap Milestone
kubeadm reached GA, meaning the Kubernetes project now had a stable, supported way to bootstrap clusters. This was the culmination of two years of development by SIG Cluster Lifecycle and was essential for the ecosystem of higher-level tools (kops, kubespray) that built on kubeadm.
CSI 1.0 was released, completing the storage plugin interface specification. Storage vendors could now build drivers that worked with any Kubernetes version without compiling code into Kubernetes. This accelerated the storage ecosystem enormously: vendors shipped CSI drivers for their proprietary storage systems, and the community built CSI drivers for NFS, Ceph, and other open-source storage systems.
v1.16 (September 2019): CRDs Grow Up
CRDs reached GA with structural schemas, meaning CRD authors could define OpenAPI v3 schemas for their custom resources. The API server would validate custom resources against these schemas, rejecting invalid objects. Before structural schemas, CRDs accepted any JSON object, which meant validation errors were only caught at the controller level. Structural schemas moved validation to the API server, matching the behavior of built-in resources.
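A sketch of a CRD with a structural schema, using the hypothetical PostgresCluster example from Chapter 13's operator discussion (group, kind, and fields are illustrative):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: postgresclusters
    singular: postgrescluster
    kind: PostgresCluster
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:           # the structural schema
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                  minimum: 1       # the API server now rejects replicas: 0
                version:
                  type: string
```

An apply with `spec.replicas: 0` fails at the API server, before any controller sees the object — exactly the behavior built-in resources have always had.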
This release also deprecated extensions/v1beta1 for Deployments, DaemonSets, and ReplicaSets, forcing users to migrate to apps/v1. This was the beginning of a pattern: Kubernetes would aggressively deprecate beta APIs to prevent permanent dependence on unstable interfaces.
v1.20 (December 2020): The Docker Announcement
The dockershim deprecation announcement dominated this release’s narrative (covered in detail in Chapter 10). Beyond the Docker story, v1.20 introduced graceful node shutdown, allowing the kubelet to detect that the node’s operating system was shutting down and gracefully terminate pods in priority order. Before this, a node shutdown simply killed all pods, potentially interrupting critical workloads mid-operation.
v1.22 (August 2021): The Great API Migration
This release removed many deprecated beta APIs that had been deprecated since v1.16. Ingress moved from extensions/v1beta1 to networking.k8s.io/v1. CRD moved from apiextensions.k8s.io/v1beta1 to v1. ValidatingWebhookConfiguration and MutatingWebhookConfiguration moved to admissionregistration.k8s.io/v1.
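The Ingress move is representative because the fields changed along with the API group. A sketch of the networking.k8s.io/v1 shape (hostname and class name illustrative):

```yaml
apiVersion: networking.k8s.io/v1   # was extensions/v1beta1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx          # replaces the kubernetes.io/ingress.class annotation
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix       # required in v1
            backend:
              service:             # v1beta1 used flat serviceName/servicePort
                name: web
                port:
                  number: 80
```

Manifests written against the old shape failed outright on v1.22 clusters, which is why the removal was so disruptive.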
The removals caused significant disruption. Many Helm charts, operators, and deployment scripts still referenced the old API versions. Tools that generated Kubernetes manifests had to be updated. The community learned a painful lesson about the cost of depending on beta APIs and the importance of migration planning.
Server-side apply reached GA, moving manifest merging logic from kubectl to the API server. This enabled conflict detection (two controllers modifying the same field), field ownership tracking, and consistent behavior across all API clients. Server-side apply was foundational for the emerging GitOps ecosystem, where multiple tools might manage different fields of the same resource.
v1.24 (May 2022): Runtime Independence
The dockershim was removed, completing the deprecation announced in v1.20. Clusters using Docker as their container runtime needed to switch to containerd or CRI-O. In practice, most managed Kubernetes services had already made this switch, and the impact on self-managed clusters was modest because containerd — the actual runtime inside Docker — was already present on most nodes.
v1.25 (August 2022): Security Model Modernization
PodSecurityPolicy (PSP) was removed, ending a contentious chapter in Kubernetes security. PSP had been the mechanism for restricting what pods could do (run as root, use host networking, mount host paths), but it was widely regarded as confusing, difficult to use correctly, and prone to misconfiguration. Its replacement was Pod Security Standards enforced through the Pod Security Admission controller, which defined three profiles — Privileged, Baseline, and Restricted — that were simpler to understand and apply.
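Unlike PSP, Pod Security Admission is configured with plain namespace labels. A sketch applying the Restricted profile to a namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject violating pods
    pod-security.kubernetes.io/warn: restricted      # warn on apply
    pod-security.kubernetes.io/audit: restricted     # record in the audit log
```

The separate warn and audit modes make incremental rollout possible: warn first, observe, then enforce.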
Ephemeral containers reached GA, allowing operators to inject temporary debugging containers into running pods. Before ephemeral containers, debugging a distroless or minimal container (which lacked shells, debugging tools, or even a writable filesystem) required rebuilding the image with debugging tools, redeploying, and reproducing the problem. Ephemeral containers solved this by allowing you to attach a container with debugging tools to a running pod without restarting it.
v1.27 (April 2023): Resource Flexibility
In-place pod resource resize appeared in alpha, addressing a long-standing limitation. Before this feature, changing a pod’s CPU or memory limits required deleting and recreating the pod. For stateful workloads, this meant downtime. In-place resize allowed changing resource limits on a running pod, with the kubelet adjusting cgroup limits without restarting the container.
SeccompDefault reached GA, enabling Seccomp security profiles by default for all pods. Seccomp restricts which system calls a container can make, reducing the kernel attack surface. Making it default-on was a security hardening step that moved the ecosystem toward defense-in-depth.
v1.29 (December 2023): Sidecar Containers and Secrets at Scale
Sidecar containers reached beta and were enabled by default (formally: native sidecar support via init containers with restartPolicy: Always, with GA expected in v1.33). This addressed a long-standing problem with the sidecar pattern: Kubernetes had no native concept of a container that started before and stopped after the main container. Log collectors, service mesh proxies, and monitoring agents were deployed as sidecars, but Kubernetes treated them as ordinary containers. This led to startup ordering issues (the sidecar proxy might not be ready when the application started) and shutdown ordering issues (the sidecar might be killed before the application finished draining connections).
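The mechanism is deliberately minimal: an init container with restartPolicy: Always is treated as a native sidecar. A sketch with placeholder images:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  initContainers:
    - name: log-shipper
      image: example.com/log-shipper:1.0   # placeholder
      restartPolicy: Always
      # restartPolicy: Always marks this init container as a native sidecar:
      # it starts (and becomes ready) before the main containers, keeps
      # running alongside them, and is terminated after they exit.
  containers:
    - name: app
      image: example.com/app:1.0           # placeholder
```

Reusing the init container list, rather than adding a new pod field, kept the API change small and backward compatible.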
KMS v2 reached GA for secrets encryption at rest. Kubernetes Secrets are stored in etcd, and without encryption at rest, anyone with access to etcd’s data directory can read all Secrets in plaintext. KMS v2 provided a standard interface for integrating with external key management services (AWS KMS, Google Cloud KMS, Azure Key Vault, HashiCorp Vault), ensuring Secrets were encrypted in etcd using keys managed by a dedicated, auditable, access-controlled key management system.
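The API server is pointed at the KMS plugin via an EncryptionConfiguration file passed with the --encryption-provider-config flag. A sketch, with an illustrative plugin name and socket path:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      - kms:
          apiVersion: v2                       # select the KMS v2 plugin API
          name: my-kms-plugin                  # illustrative plugin name
          endpoint: unix:///var/run/kms.sock   # plugin's gRPC socket (illustrative)
      - identity: {}   # fallback so existing, unencrypted Secrets stay readable
```

Providers are tried in order on read, so the identity fallback allows a rolling migration from plaintext to encrypted storage.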
v1.30 (April 2024): Authentication and Resource Management
Structured authentication configuration allowed administrators to configure authentication using a file-based configuration rather than a proliferation of API server flags. This made authentication setup more manageable, auditable, and version-controllable.
Dynamic Resource Allocation (DRA) continued its progression, providing a framework for managing non-traditional resources (GPUs, FPGAs, network devices) through a structured API rather than the opaque extended resources mechanism. DRA was driven by the explosive growth of AI/ML workloads that required fine-grained GPU allocation and sharing.
v1.31 (August 2024): Kernel-Level Security and Modern Networking
AppArmor support reached GA, providing mandatory access control profiles that restrict container capabilities at the kernel level. AppArmor profiles could limit filesystem access, network operations, and capability usage, providing a defense-in-depth layer beyond Seccomp and Linux capabilities.
The nftables kube-proxy backend was promoted to beta (it first appeared as alpha in v1.29), replacing iptables with its successor in the Linux kernel. nftables provides better performance, a cleaner rule syntax, and improved maintainability. While eBPF-based solutions (Cilium, Calico) offer superior performance, nftables modernized the default kube-proxy for environments that prefer to use the standard kernel networking stack.
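Selecting the backend is a one-line change in the kube-proxy configuration (delivered via the kube-proxy ConfigMap or a kubeadm configuration file). A sketch:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "nftables"    # alternatives: "iptables" (the default), "ipvs"
```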
v1.32+ (2025-2026): Continued Maturation
Recent and upcoming releases continue the trend of maturation rather than revolution. Dynamic Resource Allocation improvements address the growing demand for GPU and accelerator scheduling in AI/ML workloads. In-place pod resource resize progresses toward GA. The overall pattern is one of stabilization: making alpha features beta, making beta features GA, and improving the operational experience of features that are already stable.
The Pattern Behind the Versions
Reading the version history as a narrative rather than a changelog reveals a clear pattern of maturation:
2015-2016: Can it run at all? The early releases focused on basic functionality — scheduling pods, running stateless workloads, providing services. Kubernetes was proving that the architecture worked.
2016-2017: Can it handle real workloads? StatefulSets, RBAC, CRDs, Network Policies. These features addressed the requirements of production systems: state, security, extensibility, and network isolation.
2018-2019: Can anyone set it up? kubeadm, CSI, Helm v3. The focus shifted from what Kubernetes could do to how people could deploy and manage it. The tooling ecosystem matured alongside the platform.
2020-2022: Cleaning up the debt. Docker deprecation, API removals, PSP removal. Kubernetes spent these years removing technical debt and forcing the ecosystem to migrate away from deprecated interfaces. This was painful but necessary for long-term health.
2023-2026: Mature platform. Sidecar containers, in-place resize, DRA, security hardening. The features being added are refinements, not revolutions. Kubernetes is no longer proving itself; it is optimizing for the workloads and operational patterns that have emerged over a decade of production use.
The version history also reveals the disciplined API lifecycle that makes Kubernetes trustworthy as a platform. Features progress through alpha (disabled by default, may change or be removed), beta (enabled by default, API may change), and GA (stable, backward compatible, will not be removed). This lifecycle gives users clear signals about what is safe to depend on and gives the community space to iterate on APIs before committing to them permanently.
Common Mistakes and Misconceptions
- “I should always run the latest Kubernetes version.” New versions may have bugs, and your tools/operators may not support them yet. Use release channels (Stable or Regular) and wait 1-2 months after a minor release before upgrading production.
- “Skipping minor versions during upgrades is fine.” Kubernetes supports upgrading one minor version at a time (e.g., 1.28 → 1.29 → 1.30). Skipping versions can break API compatibility and is unsupported.
- “Deprecated APIs will keep working forever.” Deprecated APIs are removed after a defined period (typically 2-3 releases). Plan migrations early using kubectl convert or tools like Pluto to detect deprecated APIs.
Further Reading
- Kubernetes Release Notes (official) – The canonical list of all Kubernetes releases with links to changelogs, release notes, and upgrade guides. Start here to understand what changed in any specific version.
- Kubernetes Deprecation Policy – The formal rules governing how APIs and features are deprecated and removed, including the minimum version guarantees for GA, beta, and alpha APIs.
- Kubernetes Enhancement Proposals (KEP) process – How new features go from idea to implementation. Understanding KEPs explains why features take multiple releases to mature and how the community coordinates large changes.
- SIG Release – The Special Interest Group responsible for the release process, cadence, and tooling. The README and meeting notes provide insight into how the three-releases-per-year cadence is managed.
- Kubernetes CHANGELOG on GitHub – The raw changelogs for every release, useful for detailed investigation of specific changes, bug fixes, and API modifications.
- “Kubernetes Release Cadence Change: Here’s What You Need To Know” (Kubernetes blog) – Explains the move from four to three releases per year and how the new cadence balances stability with velocity.
- API Version Lifecycle documentation – Official reference for understanding alpha, beta, and GA API stages, which directly maps to the feature maturation pattern described in this chapter.
This concludes Part 2: The Tooling Ecosystem. You now understand how the tools around Kubernetes evolved and why they look the way they do today. Part 3 takes all of this context and puts it into practice — setting up real clusters, deploying real workloads, and learning to debug when things go wrong.
Next: Setting Up a Cluster from Scratch
Chapter 15: Setting Up a Cluster from Scratch
Every Kubernetes cluster begins as a collection of Linux machines that know nothing about each other. Something must generate the certificates, write the configuration files, start the control plane processes, and establish the trust relationships that let workers join.
What kubeadm Actually Does
kubeadm is the official bootstrapping tool. When you run kubeadm init, it executes 12 phases in sequence. Each phase solves a specific problem in the bootstrap chain.
kubeadm init
│
├── 1. preflight Validate the node can become a control plane
├── 2. certs Generate the entire PKI hierarchy
├── 3. kubeconfig Generate kubeconfig files for each component
├── 4. kubelet-start Configure and start the kubelet
├── 5. control-plane Write static pod manifests for control plane
├── 6. etcd Write static pod manifest for local etcd
├── 7. upload-config Store kubeadm and kubelet config in ConfigMaps
├── 8. upload-certs (optional) Upload certs for HA join
├── 9. mark-control-plane Taint and label the node
├── 10. bootstrap-token Create token for worker node joining
├── 11. kubelet-finalize Update kubelet config for TLS bootstrap
├── 12. addon Install CoreDNS and kube-proxy
│
▼
Control plane is running. Workers can join.
Let us walk through each phase in detail.
Phase 1: Preflight Checks
Before touching anything, kubeadm validates that the system meets the minimum requirements. This includes:
- Swap is disabled. The kubelet refuses to start if swap is enabled (by default) because the scheduler’s resource accounting assumes no swap; swap breaks memory limit enforcement.
- Required ports are available. The API server needs port 6443, etcd needs 2379-2380, the scheduler needs 10259, the controller manager needs 10257. If another process occupies these ports, the control plane cannot start.
- Container runtime is reachable. kubeadm checks for a CRI-compatible runtime (containerd or CRI-O) at the expected socket path.
- cgroup driver matches. The kubelet and the container runtime must agree on whether to use cgroupfs or systemd as the cgroup driver. A mismatch causes containers to start in the wrong cgroup hierarchy, breaking resource accounting. Since Kubernetes 1.22, systemd is the recommended default.
- Required kernel modules and sysctl settings are present (br_netfilter, ip_forward).
Phase 2: Certificate Generation
This is the most important phase. Kubernetes is a distributed system where every component authenticates to every other component using mutual TLS. kubeadm generates the entire PKI hierarchy and writes it to /etc/kubernetes/pki/.
PKI HIERARCHY
─────────────
/etc/kubernetes/pki/
│
├── ca.crt / ca.key ◄── Cluster Root CA
│ │
│ ├── apiserver.crt / apiserver.key API server serving cert
│ ├── apiserver-kubelet-client.crt/key API server → kubelet client cert
│ ├── front-proxy-ca.crt / .key ◄── Front Proxy CA (aggregation layer)
│ │ └── front-proxy-client.crt/key Aggregation layer client cert
│ │
│ └── (kubeconfig embedded certs)
│ ├── admin client cert kubectl access
│ ├── controller-manager client cert controller-manager → API server
│ └── scheduler client cert scheduler → API server
│
├── etcd/
│ ├── ca.crt / ca.key ◄── etcd Root CA (separate trust domain)
│ ├── server.crt / server.key etcd server serving cert
│ ├── peer.crt / peer.key etcd peer-to-peer communication
│ └── healthcheck-client.crt/key Health check client cert
│
├── apiserver-etcd-client.crt/key API server → etcd client cert
│
└── sa.key / sa.pub Service account signing keypair
Two separate CAs exist by design. The cluster CA signs all Kubernetes component certificates. The etcd CA signs all etcd certificates. This separation means a compromise of the cluster CA does not automatically grant access to etcd, and vice versa. The API server holds a client certificate signed by the etcd CA, which is how it authenticates to etcd.
The service account keypair (sa.key / sa.pub) is used by the controller manager to sign service account tokens and by the API server to verify them. This is not a CA — it is a signing key for JWTs.
Phase 3: kubeconfig Generation
kubeadm generates four kubeconfig files in /etc/kubernetes/:
| File | Used By | Purpose |
|---|---|---|
| admin.conf | kubectl (cluster admin) | Full cluster access |
| controller-manager.conf | kube-controller-manager | Authenticate to API server |
| scheduler.conf | kube-scheduler | Authenticate to API server |
| kubelet.conf | kubelet on the control plane node | Authenticate to API server |
Each kubeconfig file embeds a client certificate (signed by the cluster CA) and the CA certificate for verifying the API server. This is mutual TLS: the component authenticates to the API server, and the API server authenticates back to the component.
Phase 4-6: Static Pod Manifests and the Bootstrap Problem
Here is the fundamental bootstrap problem: the API server, controller manager, scheduler, and etcd must run as containers, but the kubelet cannot pull their pod specs from an API server that does not yet exist. This is a circular dependency.
Kubernetes solves this with static pods. The kubelet can read pod manifests directly from a local directory (/etc/kubernetes/manifests/) and run them without any API server involvement. kubeadm writes four manifest files:
/etc/kubernetes/manifests/
├── kube-apiserver.yaml
├── kube-controller-manager.yaml
├── kube-scheduler.yaml
└── etcd.yaml
The kubelet detects these files, creates the pods, and monitors them. If a static pod crashes, the kubelet restarts it. Once the API server is running, the kubelet creates mirror pods in the API — read-only representations that make static pods visible through kubectl get pods -n kube-system, even though the API server does not manage them.
This is one of the most elegant solutions in Kubernetes’ design. The kubelet operates in two modes simultaneously: it manages static pods from local files (for bootstrapping) and regular pods from the API server (for everything else).
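To make this concrete, here is a heavily abridged sketch of the kind of manifest kubeadm writes to /etc/kubernetes/manifests/etcd.yaml (the real file carries many more flags, probes, and resource settings, and the image tag varies by release):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  hostNetwork: true                # no CNI plugin exists yet at bootstrap time
  priorityClassName: system-node-critical
  containers:
    - name: etcd
      image: registry.k8s.io/etcd:3.5.15-0   # tag varies by Kubernetes release
      command:                     # abridged; the real manifest passes TLS,
        - etcd                     # peer, and listen-address flags as well
        - --data-dir=/var/lib/etcd
      volumeMounts:
        - name: etcd-data
          mountPath: /var/lib/etcd
  volumes:
    - name: etcd-data
      hostPath:
        path: /var/lib/etcd        # state survives pod restarts via the host
```

Note what is absent: no Deployment, no controller, no scheduler involvement. The kubelet alone keeps this pod running, which is exactly what breaks the circular dependency.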
Phase 7-9: Configuration and Node Marking
kubeadm stores its own configuration and the kubelet’s configuration as ConfigMaps in the kube-system namespace. This serves two purposes: it documents how the cluster was initialized, and it provides configuration for worker nodes joining later.
The control plane node is tainted with node-role.kubernetes.io/control-plane:NoSchedule so that regular workloads are not scheduled onto it. This is a convention, not a hard rule: on a single-node cluster you can remove it with kubectl taint nodes --all node-role.kubernetes.io/control-plane- (the trailing hyphen means "remove this taint").
Phase 10: Bootstrap Tokens and the TLS Bootstrap Handshake
When a worker node joins the cluster, it needs to authenticate to the API server. But it has no certificate yet — that is what it is trying to obtain. This is solved by the TLS bootstrap protocol.
sequenceDiagram
participant K as Kubelet (Worker)
participant A as API Server
participant C as CSR Approving Controller
participant CA as CA (Signer)
Note over K: kubeadm join --token abc123
K->>A: TLS connect (verify CA cert hash)
K->>A: Authenticate with bootstrap token
Note right of A: Token valid — grants CSR create permission
K->>A: POST CertificateSigningRequest
Note left of K: "I am node X, give me a cert"
A->>C: CSR created
Note right of C: Auto-approve (first cert from bootstrap token)
C->>CA: Sign request
CA-->>C: Signed certificate
C-->>A: CSR approved + signed cert
A-->>K: Signed certificate returned
rect rgba(50, 108, 229, 0.1)
Note over K,CA: Kubelet now uses real certificate for all API calls.<br/>Bootstrap token can be safely revoked.
K->>A: Authenticated API calls (using real cert)
end
The bootstrap token is a short-lived, low-privilege credential. It grants exactly one permission: the ability to create a CertificateSigningRequest. The csrapproving controller in the controller manager automatically approves CSRs from bootstrap tokens (for the first certificate). The worker receives a signed certificate and uses it for all subsequent communication. The bootstrap token can now be revoked.
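Bootstrap tokens have a fixed, documented shape: a 6-character token ID and a 16-character token secret, both drawn from [a-z0-9]. A sketch of generating one by hand (in practice you would use kubeadm token generate or let kubeadm init mint one):

```shell
# Generate a string matching the bootstrap token format
# <token-id>.<token-secret>, 6 + 16 chars from [a-z0-9].
# (Hex output is a subset of the allowed alphabet.)
gen() { openssl rand -hex 16 | head -c "$1"; }
token="$(gen 6).$(gen 16)"
echo "$token"    # e.g. abcdef.0123456789abcdef
```

The token ID is public (it names the token's Secret in kube-system); the secret half is what the joining node must present.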
The --discovery-token-ca-cert-hash flag prevents man-in-the-middle attacks during the initial connection. The worker verifies the API server’s certificate against this hash before sending the bootstrap token.
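The expected hash is the SHA-256 digest of the CA's DER-encoded public key. The sketch below generates a throwaway CA so the pipeline is runnable end to end; on a real control plane you would skip that step and point openssl at /etc/kubernetes/pki/ca.crt:

```shell
# Throwaway CA for demonstration only (on a real cluster,
# use /etc/kubernetes/pki/ca.crt instead).
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca.key -out ca.crt \
  -subj "/CN=demo-ca" -days 1 2>/dev/null

# Compute the hash kubeadm expects: SHA-256 over the
# DER-encoded public key extracted from the certificate.
hash=$(openssl x509 -pubkey -in ca.crt \
  | openssl pkey -pubin -outform der \
  | openssl dgst -sha256 -hex | sed 's/^.* //')
echo "sha256:$hash"
```

The worker passes this value as --discovery-token-ca-cert-hash sha256:&lt;hash&gt; and refuses to talk to any API server whose CA does not match.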
Phase 11-12: Addons
kubeadm installs two essential addons:
- CoreDNS: Runs as a Deployment. Provides cluster DNS; pods resolve service names (e.g., my-svc.my-namespace.svc.cluster.local) through CoreDNS.
- kube-proxy: Runs as a DaemonSet on every node. Manages iptables or IPVS rules that implement Service routing.
Note that kubeadm does not install a CNI plugin. This is deliberate: the choice of CNI plugin is a critical networking decision that kubeadm leaves to the operator. Until a CNI plugin is installed, pods on the control plane node will be stuck in Pending and nodes will show as NotReady.
Using a kubeadm Configuration File
While kubeadm init accepts dozens of CLI flags, production usage should always use a YAML configuration file. This makes the cluster setup reproducible and auditable.
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.32.0
controlPlaneEndpoint: "k8s-api.example.com:6443"
networking:
podSubnet: "10.244.0.0/16"
serviceSubnet: "10.96.0.0/12"
dnsDomain: "cluster.local"
apiServer:
certSANs:
- "k8s-api.example.com"
- "10.0.0.100"
extraArgs:
- name: "audit-log-path"
value: "/var/log/kubernetes/audit.log"
etcd:
local:
dataDir: "/var/lib/etcd"
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
nodeRegistration:
criSocket: "unix:///var/run/containerd/containerd.sock"
kubeletExtraArgs:
- name: "cgroup-driver"
value: "systemd"
Run with: kubeadm init --config=kubeadm-config.yaml
The controlPlaneEndpoint is critical for HA clusters. It should point to a load balancer in front of multiple API server instances. Setting it during initial setup avoids painful reconfiguration later.
Kubernetes the Hard Way
Kelsey Hightower’s Kubernetes the Hard Way is a 13-lab exercise that provisions a cluster by hand, without kubeadm. The labs (updated for v1.32.x) walk you through:
- Generating every certificate by hand (you will appreciate kubeadm’s Phase 2 after this)
- Writing every kubeconfig file manually
- Configuring etcd from scratch
- Writing systemd unit files for every component
- Configuring kubelet and kube-proxy on each worker
- Setting up pod networking manually
What The Hard Way teaches that kubeadm hides:
- The CA is just files. There is no magic PKI server. You generate a CA key, use it to sign certificates, and distribute them. Understanding this demystifies all of Kubernetes’ authentication.
- The API server is just a binary with flags. Every feature — authentication methods, authorization modes, admission controllers — is controlled by command-line flags to kube-apiserver.
- Networking is not built-in. You must configure routing tables or install a CNI plugin yourself. This makes you understand why CNI exists.
- etcd is independent. It runs as its own cluster and can be inspected with etcdctl independently of Kubernetes.
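To make the first point concrete, the following sketch builds a CA and signs an admin client certificate with nothing but openssl, the same ceremony the labs walk through by hand (file names and subjects here are illustrative):

```shell
# A "cluster CA" is just a key pair on disk.
openssl genrsa -out ca.key 2048
openssl req -x509 -new -key ca.key -subj "/CN=kubernetes-ca" \
  -days 1 -out ca.crt

# Issue a client certificate. With x509 client auth, Kubernetes
# maps O= to the RBAC group and CN to the user name.
openssl genrsa -out admin.key 2048
openssl req -new -key admin.key \
  -subj "/O=system:masters/CN=kubernetes-admin" -out admin.csr
openssl x509 -req -in admin.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 1 -out admin.crt

openssl verify -CAfile ca.crt admin.crt   # prints: admin.crt: OK
```

Embed admin.crt, admin.key, and ca.crt in a kubeconfig and you have reproduced the essence of kubeadm's Phase 2 and Phase 3.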
Do The Hard Way once, then use kubeadm for everything after. The exercise takes 4-8 hours and permanently changes how you think about clusters.
Common Pitfalls
| Problem | Symptom | Fix |
|---|---|---|
| Swap enabled | kubelet refuses to start | swapoff -a and remove swap from /etc/fstab |
| cgroup driver mismatch | Pods fail with cgroup errors | Ensure kubelet and containerd both use systemd |
| Port 6443 in use | API server fails to bind | Check for existing processes: ss -tlnp \| grep 6443 |
| Firewall blocking | Workers cannot join | Open 6443, 2379-2380, 10250, 10259, 10257 |
| CNI not installed | All pods stuck in Pending, nodes NotReady | Install a CNI plugin (Calico, Cilium, Flannel) |
| Wrong podSubnet | CNI and kubeadm disagree on pod CIDR | Match podSubnet in kubeadm config with CNI config |
| Expired bootstrap token | Workers cannot join after 24h | Generate new token: kubeadm token create --print-join-command |
For a comprehensive error-to-fix mapping, see Appendix D: Troubleshooting Quick Reference.
Common Mistakes and Misconceptions
- “One control plane node is enough for production.” A single control plane is a single point of failure. Production clusters need 3 or 5 control plane nodes for etcd quorum and API server HA.
- “Worker nodes should be as large as possible.” Fewer large nodes means each node failure impacts more pods. Balance node size against blast radius — many medium nodes are often better than few huge ones.
- “I can skip configuring kubelet flags.” Defaults work for learning, but production kubelets need tuning: eviction thresholds, max-pods, image garbage collection, and system reserved resources.
Further Reading
- kubeadm init documentation — Official reference for all phases and flags
- Kubernetes the Hard Way — Kelsey Hightower’s 13-lab manual cluster setup (v1.32.x)
- PKI certificates and requirements — Full list of certificates and their purposes
- TLS bootstrapping — Deep dive into the bootstrap token protocol
- KillerCoda kubeadm scenarios — Interactive browser-based kubeadm exercises
- KodeKloud CKA course — Hands-on labs covering cluster setup
Next: Managed Kubernetes: EKS, GKE, and AKS
Chapter 16: Managed Kubernetes: EKS, GKE, and AKS
Running your own control plane is an excellent way to learn Kubernetes. For most teams, managed services reduce operational overhead significantly. The control plane — etcd, the API server, the controller manager, the scheduler — requires careful backup, monitoring, upgrade orchestration, and high-availability configuration. Managed Kubernetes services take this burden off your team so you can focus on what runs on the cluster rather than what runs the cluster.
But “managed” does not mean “fully operated.” Every cloud provider draws the line differently between what they manage and what remains your responsibility. Understanding exactly where that line falls is essential for making an informed choice.
The Shared Responsibility Model
MANAGED KUBERNETES: WHO MANAGES WHAT?
──────────────────────────────────────
Cloud Provider Manages │ You Manage
───────────────────── │ ──────────
│
┌────────────────────────────────┐ │ ┌────────────────────────────────┐
│ Control Plane │ │ │ Worker Nodes │
│ ┌───────────┐ ┌────────────┐ │ │ │ ┌───────────┐ ┌───────────┐ │
│ │ API Server│ │ Controller │ │ │ │ │ kubelet │ │ Your Pods │ │
│ │ (HA, TLS) │ │ Manager │ │ │ │ │ │ │ │ │
│ └───────────┘ └────────────┘ │ │ │ └───────────┘ └───────────┘ │
│ ┌───────────┐ ┌────────────┐ │ │ │ ┌───────────┐ ┌───────────┐ │
│ │ Scheduler │ │ etcd │ │ │ │ │ kube-proxy│ │ CNI agent │ │
│ │ │ │ (backups) │ │ │ │ │ │ │ │ │
│ └───────────┘ └────────────┘ │ │ │ └───────────┘ └───────────┘ │
│ │ │ │ │
│ Upgrades, patches, HA, │ │ │ OS patching, scaling, │
│ etcd backups, API cert │ │ │ node upgrades, app deploys, │
│ rotation │ │ │ networking config, storage, │
└────────────────────────────────┘ │ │ security policies, RBAC │
│ └────────────────────────────────┘
│
* GKE Autopilot: Google also manages │ * With node auto-upgrade enabled,
the worker nodes and their sizing │ the provider patches node OS
What remains your responsibility in all cases: your application workloads, your RBAC policies, your network policies, your storage configuration, your monitoring, your cost management.
GKE: Google Kubernetes Engine
GKE is the most mature managed Kubernetes service: it is typically the first to adopt new Kubernetes features and the most opinionated about best practices.
Networking. GKE uses a VPC-native networking model with Alias IPs. Each node is allocated a secondary IP range from the VPC. Pods receive IPs from this secondary range. These are real VPC IPs — they are routable within the VPC without overlay networks or encapsulation. This means VPC firewall rules, routes, and VPC peering work natively with pod IPs.
Autopilot mode. GKE offers two modes: Standard (you manage node pools) and Autopilot (Google manages everything, including node provisioning and sizing). In Autopilot mode, you submit workloads and Google provisions the right amount of compute. You pay per pod resource request, not per node. Autopilot enforces security best practices by default: workloads run as non-root, privilege escalation is blocked, and host path mounts are disallowed.
Upgrades. GKE is typically the fastest to support new Kubernetes versions. It offers release channels (Rapid, Regular, Stable) that automatically upgrade the control plane and node pools on a schedule. Surge upgrades create extra nodes to maintain capacity during rolling node upgrades.
Pricing. $0.10/hr for the cluster management fee (Standard mode). Autopilot charges per pod resource request instead.
GKE Strengths
- Fastest Kubernetes version adoption
- Autopilot removes node management entirely
- VPC-native networking eliminates overlay complexity
- Tight integration with Google Cloud networking (Cloud NAT, Cloud Armor, Internal Load Balancers)
- Binary Authorization for supply chain security
GKE Weaknesses
- Smaller ecosystem of third-party integrations compared to AWS
- Autopilot restrictions may be too opinionated for some workloads
- Vendor lock-in to GCP networking model
EKS: Amazon Elastic Kubernetes Service
EKS is the most widely used managed Kubernetes service, reflecting AWS’s dominant market position. It is also the most “assembly required” of the three — AWS provides the control plane and expects you to configure everything else.
Networking. EKS uses the AWS VPC CNI plugin, which assigns pods real VPC IP addresses from Elastic Network Interfaces (ENIs). Each EC2 instance has a limit on the number of ENIs it can attach and the number of secondary IPs per ENI. This means pod density is limited by instance type:
| Instance Type | Max ENIs | IPs per ENI | Max Pods |
|---|---|---|---|
| t3.nano | 2 | 2 | ~4 |
| t3.medium | 3 | 6 | ~17 |
| m5.large | 3 | 10 | ~29 |
| m5.xlarge | 4 | 15 | ~58 |
| m5.24xlarge | 15 | 50 | ~737 |
This is a critical capacity planning consideration. If you run many small pods, you may hit the pod limit before you exhaust CPU or memory. AWS offers prefix delegation to increase pod density by assigning /28 prefixes instead of individual IPs.
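The limit in the table follows AWS's published formula: each ENI's primary IP is reserved for the ENI itself, and two extra pods are allowed because host-networked pods (aws-node, kube-proxy) reuse the node's IP:

```shell
# AWS's default VPC CNI pod limit:
#   max_pods = ENIs * (IPs per ENI - 1) + 2
# One IP per ENI is its primary address; +2 covers the two
# host-networked agent pods that do not consume VPC IPs.
max_pods() { echo $(( $1 * ($2 - 1) + 2 )); }

max_pods 2 2     # t3.nano     -> 4
max_pods 3 10    # m5.large    -> 29
max_pods 15 50   # m5.24xlarge -> 737
```

With prefix delegation enabled, each slot hands out a /28 (16 addresses) instead of one IP, so the same formula yields far higher densities.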
Node management. EKS offers three options: self-managed nodes (EC2 instances you configure), managed node groups (AWS manages the EC2 lifecycle), and Fargate (serverless pods, similar to GKE Autopilot but per-pod). Karpenter is AWS’s open-source node autoscaler, which provisions right-sized nodes based on pending pod requirements — it is faster and more flexible than the Cluster Autoscaler.
Upgrades. EKS upgrades are the most manual of the three providers. You upgrade the control plane first (one API call or console click), then upgrade each node group separately. There is no automatic release channel for control plane upgrades in the standard configuration — you must actively track Kubernetes versions and initiate upgrades. However, EKS Auto Mode (launched December 2024) manages node and upgrade operations automatically for clusters that opt in.
Pricing. $0.10/hr for the cluster ($72/month). EKS on Fargate adds a per-pod charge.
EKS Strengths
- Largest ecosystem — most third-party tools are tested on EKS first
- Deep AWS integration (IAM roles for service accounts, ALB Ingress Controller, EBS CSI driver)
- Karpenter for intelligent, fast node autoscaling
- Most flexibility in configuration
- AWS marketplace of EKS add-ons
EKS Weaknesses
- Most manual upgrade process
- VPC CNI pod density limits require careful instance type selection
- More “assembly required” than GKE or AKS
AKS: Azure Kubernetes Service
AKS differentiates primarily on pricing: the control plane is free in the Free tier. You pay only for the worker node VMs.
Networking. AKS offers two networking models. kubenet is a basic overlay network: pods draw IPs from an address space that is not routable in the VNet (Azure's equivalent of a VPC). Azure CNI assigns pods real VNet IPs, similar to AWS VPC CNI and GKE Alias IPs. Azure CNI Overlay is a newer option that provides Azure CNI features without consuming a VNet IP for every pod.
Upgrades. AKS has a rapid security patching cadence. It supports automatic upgrades through channels (none, patch, stable, rapid, node-image). Node image upgrades can be applied independently from Kubernetes version upgrades.
Pricing. Free tier: $0 for the control plane. Standard tier: $0.10/hr (adds SLA and more features). Premium tier: $0.60/hr (adds long-term support versions).
AKS Strengths
- Free control plane in Free tier
- Rapid security patching
- Strong integration with Azure Active Directory for RBAC
- Azure Arc extends AKS management to on-premises and other clouds
- AKS Automatic mode (similar to GKE Autopilot)
AKS Weaknesses
- Azure networking can be complex (VNet peering, NSG interactions)
- Historically slower Kubernetes version adoption than GKE
- Smaller Kubernetes-specific community than AWS
Comparison Table
| Feature | GKE | EKS | AKS |
|---|---|---|---|
| Control plane cost | $0.10/hr | $0.10/hr | Free (Free tier) |
| Serverless pods | Autopilot | Fargate | Virtual Nodes |
| Pod networking | Alias IPs (VPC-native) | VPC CNI (ENI-based) | Azure CNI or kubenet |
| Pod IP routable in VPC? | Yes | Yes | Yes (Azure CNI) |
| Default node autoscaler | Cluster Autoscaler | Karpenter / CA | Cluster Autoscaler / KEDA |
| Upgrade automation | Release channels | Manual initiation; EKS Auto Mode (Dec 2024) manages upgrades automatically | Upgrade channels |
| Version adoption speed | Fastest | Moderate | Moderate |
| Identity integration | Google IAM + Workload Identity | IAM Roles for Service Accounts | Azure AD + Workload Identity |
| Service mesh | Anthos Service Mesh | Istio / Linkerd (App Mesh deprecated Sept 2024) | Istio add-on (Open Service Mesh archived Sept 2023) |
| GPU support | Yes (multi-GPU, TPU) | Yes (GPU, Inferentia, Trainium) | Yes (GPU) |
| Max nodes per cluster | 15,000 | 5,000 (soft limit) | 5,000 |
When to Choose Each
See also Appendix C: Decision Trees for a quick decision flowchart.
Choose GKE when:
- You want the most automated, opinionated experience
- You are already on Google Cloud or are starting fresh
- You want Autopilot to eliminate node management
- You need fast access to the latest Kubernetes features
- You are running ML/AI workloads with TPU requirements
Choose EKS when:
- You are already on AWS (most organizations are)
- You need maximum flexibility and control
- Your team has AWS expertise
- You need deep integration with the AWS ecosystem (Lambda, SQS, DynamoDB)
- You want Karpenter for intelligent autoscaling
Choose AKS when:
- You are already on Azure or have an Enterprise Agreement
- You want a free control plane for dev/test
- You use Azure Active Directory for identity management
- You need hybrid cloud with Azure Arc
- You want the cheapest entry point for learning
Choose self-managed (kubeadm) when:
- You are on-premises with no cloud option
- You have strict regulatory requirements about where the control plane runs
- You are learning Kubernetes internals
- You need control over every component’s configuration
The Hidden Costs
The control plane fee is the smallest part of the bill. The real costs are:
- Worker node compute: The VMs or instances running your pods (typically 80-90% of the bill)
- Load balancers: Each Service of type LoadBalancer creates a cloud load balancer ($15-25/month each)
- NAT gateways: Required for private clusters to reach the internet ($30-45/month + data processing fees)
- Persistent storage: EBS volumes, Persistent Disks, Managed Disks ($0.08-0.10/GB/month for SSD)
- Data transfer: Cross-AZ traffic is charged on all three clouds ($0.01-0.02/GB)
- Monitoring and logging: CloudWatch, Cloud Monitoring, Azure Monitor charges for ingestion and storage
A cluster with a “free” AKS control plane running three general-purpose worker VMs (2 vCPU / 8 GiB each), a load balancer, a NAT gateway, and 100 GB of persistent storage will cost approximately $300-400/month before data transfer.
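A rough reconstruction of that estimate (all prices below are illustrative assumptions, not provider quotes; check your provider's pricing calculator):

```shell
# Back-of-envelope monthly estimate for a small cluster.
node=85    # one general-purpose 2 vCPU / 8 GiB VM, per month
lb=20      # one cloud load balancer
nat=35     # NAT gateway, before data-processing fees
disk=10    # 100 GB SSD at ~$0.10/GB/month
echo "$(( 3 * node + lb + nat + disk ))"   # -> 320
```

Monitoring ingestion and cross-AZ data transfer then push the real bill toward the top of the range.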
Common Mistakes and Misconceptions
- “Managed Kubernetes means fully managed.” You still manage worker nodes (unless using Autopilot/Fargate), networking, storage, RBAC, monitoring, and your applications. “Managed” refers primarily to the control plane.
- “EKS/GKE/AKS clusters are identical to vanilla Kubernetes.” Each provider adds proprietary networking (VPC CNI, Alias IPs), identity (IRSA, Workload Identity), and storage integrations that don’t exist in upstream K8s.
- “The control plane fee is my main Kubernetes cost.” The $72-74/month control plane fee is typically under 5% of the total bill. Worker node compute, load balancers, NAT gateways, and data transfer dominate costs.
Further Reading
- GKE documentation — Comprehensive guides for Standard and Autopilot modes
- EKS documentation — Setup guides, best practices, and blueprints
- AKS documentation — Getting started, networking, and security guides
- EKS Best Practices Guide — AWS’s official best practices for EKS
- Karpenter documentation — Intelligent node autoscaling for Kubernetes
- GKE Autopilot overview — Understanding the fully managed mode
- KubeCon talks on YouTube — CNCF conference presentations on real-world managed K8s usage
- CNCF Slack #eks, #gke, #aks channels — Community support for each provider
Next: Cloud Networking and Storage
Chapter 17: Cloud Networking and Storage
Kubernetes defines abstractions — Services, PersistentVolumes, Ingress — but it does not implement them. The implementation is provided by cloud-specific controllers and plugins. Understanding how these abstractions map to real cloud infrastructure is the difference between writing YAML that works and writing YAML that works well.
How Pod Networking Maps to Cloud Networking
Recall the flat network model from Chapter 5: every pod gets a unique IP, reachable without NAT.
AWS VPC CNI: Pods as First-Class VPC Citizens
The AWS VPC CNI plugin gives each pod a real VPC IP address. It does this by leveraging Elastic Network Interfaces (ENIs), which are virtual network cards that can be attached to EC2 instances.
AWS VPC CNI: HOW PODS GET IPS
──────────────────────────────
EC2 Instance (m5.large)
┌──────────────────────────────────────────────────┐
│ │
│ Primary ENI (eth0) │
│ ┌────────────────────────────────────┐ │
│ │ Primary IP: 10.0.1.100 (node IP) │ │
│ │ Secondary IP: 10.0.1.101 → Pod A │ │
│ │ Secondary IP: 10.0.1.102 → Pod B │ │
│ │ Secondary IP: 10.0.1.103 → Pod C │ │
│ │ ...up to 10 IPs per ENI │ │
│ └────────────────────────────────────┘ │
│ │
│ Secondary ENI (eth1) │
│ ┌────────────────────────────────────┐ │
│ │ Primary IP: 10.0.1.200 │ │
│ │ Secondary IP: 10.0.1.201 → Pod D │ │
│ │ Secondary IP: 10.0.1.202 → Pod E │ │
│ │ ...up to 10 IPs per ENI │ │
│ └────────────────────────────────────┘ │
│ │
│ Secondary ENI (eth2) │
│ ┌────────────────────────────────────┐ │
│ │ Primary IP: 10.0.1.210 │ │
│ │ Secondary IP: 10.0.1.211 → Pod F │ │
│ │ ... │ │
│ └────────────────────────────────────┘ │
│ │
│ m5.large: 3 ENIs x 10 IPs = ~29 max pods │
│ │
└──────────────────────────────────────────────────┘
Pod A (10.0.1.101) can reach Pod X (10.0.2.55) on another
node directly through VPC routing. No encapsulation.
No overlay. Just VPC route tables.
The IPAMD (IP Address Management Daemon) runs on each node as part of the VPC CNI. It pre-allocates ENIs and warms secondary IPs so that new pods get IPs quickly. When a pod is scheduled, the CNI assigns a pre-warmed IP from the pool.
Advantages: No overlay network. No encapsulation overhead. Pod IPs are routable in the VPC, so VPC security groups, NACLs, VPC Flow Logs, and VPC peering work natively with pod traffic.
Trade-off: Pod density is constrained by the instance type’s ENI and IP limits. A t3.nano can run approximately 4 pods. An m5.large can run approximately 29. This matters: if you run many small pods (sidecars, agents), you may exhaust the IP limit before CPU or memory. Enable prefix delegation to assign /28 prefixes (16 IPs each) instead of individual IPs, dramatically increasing pod density.
GKE Alias IPs: VPC-Native Pods
GKE’s Alias IP model (described in Chapter 16) gives each node a secondary VPC range; pods draw IPs from it without overlay.
On-Premises: Why Overlay Networks Are Necessary
On-premises clusters lack the cloud’s SDN (Software-Defined Networking) layer: the physical network’s routers do not know about pod CIDRs. An overlay network — VXLAN (used by Flannel and Calico), VXLAN or Geneve (used by Cilium), or IP-in-IP (used by Calico) — encapsulates pod traffic inside packets addressed to node IPs, which the physical network can route.
OVERLAY vs. CLOUD-NATIVE NETWORKING
────────────────────────────────────
Cloud-Native (AWS VPC CNI, GKE Alias IPs):
Pod A ──► [Packet: src=10.0.1.101, dst=10.0.2.55] ──► VPC Router ──► Pod X
No encapsulation. Direct routing.
On-Prem Overlay (VXLAN):
Pod A ──► [Outer: src=192.168.1.10, dst=192.168.1.20]
[VXLAN header]
[Inner: src=10.244.1.5, dst=10.244.2.8] ──► Physical Switch ──► Pod X
Pod packet wrapped inside a node-to-node packet.
~50 bytes overhead per packet. Physical network only sees node IPs.
The overlay approach works everywhere but adds latency (encapsulation/decapsulation), reduces MTU (the inner packet must be smaller than the outer packet), and makes network debugging harder (tcpdump on the physical network shows encapsulated traffic). Cloud-native CNI plugins avoid all of this by integrating with the cloud’s routing layer.
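The “~50 bytes” figure comes from standard VXLAN framing (the outer Ethernet header is not counted against the node MTU):

```shell
# VXLAN overhead inside the outer IP packet:
#   outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14)
echo $(( 20 + 8 + 8 + 14 ))   # -> 50
# On a 1500-byte physical network, the pod-facing MTU drops to:
echo $(( 1500 - 50 ))         # -> 1450
```

This is why CNI plugins set a reduced MTU on pod interfaces; a pod interface left at 1500 would produce fragmentation or silently dropped large packets.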
How Storage Maps to Cloud Infrastructure
Kubernetes storage abstractions — PersistentVolumes (PV), PersistentVolumeClaims (PVC), and StorageClasses — map to specific cloud storage services through the Container Storage Interface (CSI).
Storage Access Modes
| Access Mode | Abbreviation | Meaning | Cloud Examples |
|---|---|---|---|
| ReadWriteOnce | RWO | One node can mount read-write | EBS, GCE PD, Azure Managed Disk |
| ReadOnlyMany | ROX | Many nodes can mount read-only | EBS (snapshot-based), GCE PD |
| ReadWriteMany | RWX | Many nodes can mount read-write | EFS, Filestore, Azure Files |
| ReadWriteOncePod | RWOP | One pod can mount read-write | EBS (since CSI spec 1.5) |
The most common mistake is requesting RWX for a workload that only needs RWO. Block storage (EBS, GCE PD, Azure Managed Disks) is RWO — a single volume can only be attached to one node at a time. If you need shared storage across multiple pods on different nodes, you must use a file storage service (EFS, Filestore, Azure Files) or a distributed storage system (Ceph, GlusterFS).
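As a sketch, a claim for shared read-write storage must name a file-backed StorageClass. The class name below (efs-sc) is an assumption: such a class exists only if you have installed a file-storage CSI driver (e.g., the EFS CSI driver) and created it.

```yaml
# Illustrative PVC for storage shared across nodes.
# Requesting ReadWriteMany against a block-storage class
# (EBS, GCE PD, Managed Disk) would fail at attach time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc   # assumed file-storage class
  resources:
    requests:
      storage: 20Gi
```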
Cloud Storage Mapping
| Kubernetes Concept | AWS | GCP | Azure |
|---|---|---|---|
| RWO PersistentVolume | EBS (gp3, io2) | GCE Persistent Disk (pd-balanced, pd-ssd) | Azure Managed Disk (Premium SSD, Standard SSD) |
| RWX PersistentVolume | EFS | Cloud Filestore | Azure Files |
| StorageClass provisioner | ebs.csi.aws.com | pd.csi.storage.gke.io | disk.csi.azure.com |
| Volume snapshots | EBS snapshots | PD snapshots | Azure Disk snapshots |
The CSI Architecture
CSI (Container Storage Interface) is the standard that allows storage vendors to write plugins for Kubernetes without modifying Kubernetes itself. The architecture has two components deployed differently:
CSI ARCHITECTURE
────────────────
┌─────────────────────────────────────────────────────────┐
│ CONTROL PLANE │
│ │
│ CSI Controller Plugin (Deployment, 1-3 replicas) │
│ ┌───────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────────┐ │ │
│ │ │ external- │ │ external- │ │ │
│ │ │ provisioner │ │ attacher │ │ │
│ │ │ │ │ │ │ │
│ │ │ Watches PVCs, │ │ Watches VolumeAttach │ │ │
│ │ │ calls CSI │ │ objects, calls CSI │ │ │
│ │ │ CreateVolume() │ │ ControllerPublish() │ │ │
│ │ └────────┬─────────┘ └────────┬─────────────┘ │ │
│ │ │ │ │ │
│ │ ┌────────▼─────────────────────▼──────────────┐ │ │
│ │ │ CSI Driver (controller mode) │ │ │
│ │ │ │ │ │
│ │ │ Translates CSI calls to cloud API calls: │ │ │
│ │ │ CreateVolume() → aws ec2 create-volume │ │ │
│ │ │ ControllerPublish() → aws ec2 attach-vol │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ EVERY NODE │
│ │
│ CSI Node Plugin (DaemonSet, one per node) │
│ ┌───────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌──────────────────┐ │ │
│ │ │ node-driver- │ │ │
│ │ │ registrar │ Registers the CSI driver │ │
│ │ │ │ with the kubelet │ │
│ │ └────────┬─────────┘ │ │
│ │ │ │ │
│ │ ┌────────▼───────────────────────────────────┐ │ │
│ │ │ CSI Driver (node mode) │ │ │
│ │ │ │ │ │
│ │ │ NodeStageVolume() → format + mount to │ │ │
│ │ │ staging path │ │ │
│ │ │ NodePublishVolume() → bind mount into │ │ │
│ │ │ pod's filesystem │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
The Controller Plugin runs as a Deployment (typically 1-3 replicas). It handles volume lifecycle operations that do not require node-level access: creating volumes, deleting volumes, creating snapshots, and attaching volumes to nodes (at the cloud API level).
The Node Plugin runs as a DaemonSet (one per node). It handles operations that require access to the node’s filesystem: formatting the volume, mounting it, and bind-mounting it into the pod’s filesystem.
Sidecar containers bridge between Kubernetes and CSI. They watch Kubernetes API objects and translate them into CSI calls:
- external-provisioner: Watches PVCs, calls CreateVolume()
- external-attacher: Watches VolumeAttachment objects, calls ControllerPublishVolume()
- external-snapshotter: Watches VolumeSnapshot objects, calls CreateSnapshot()
- external-resizer: Watches PVC size changes, calls ControllerExpandVolume()
- node-driver-registrar: Registers the CSI driver with kubelet
Dynamic Provisioning Flow
When you create a PVC with a StorageClass, the following sequence occurs. The following sequence diagram shows the full lifecycle — six different components coordinate to turn a PVC into a mounted filesystem inside a running pod:
CSI DYNAMIC PROVISIONING: PVC TO MOUNTED VOLUME
─────────────────────────────────────────────────
User API Server external- CSI Controller Cloud API external- kubelet / Pod
(kubectl) provisioner Plugin (EBS/GCE) attacher CSI Node Plugin
│ │ │ │ │ │ │ │
│ create PVC │ │ │ │ │ │ │
├─────────────▶│ │ │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ watch: │ │ │ │ │ │
│ │ new PVC │ │ │ │ │ │
│ ├────────────▶│ │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ CreateVolume │ │ │ │ │
│ │ ├──────────────▶│ │ │ │ │
│ │ │ │ create disk │ │ │ │
│ │ │ ├──────────────▶│ │ │ │
│ │ │ │ disk ready │ │ │ │
│ │ │ │◀──────────────┤ │ │ │
│ │ │ volume ID │ │ │ │ │
│ │ │◀──────────────┤ │ │ │ │
│ │ │ │ │ │ │ │
│ │ create PV, │ │ │ │ │ │
│ │ bind PVC │ │ │ │ │ │
│ │◀────────────┤ │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ (Pod scheduled to node) │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ watch: │ │ │ │ │ │
│ │ VolumeAttachment │ │ │ │ │
│ ├────────────────────────────────────────────────────────────▶│ │ │
│ │ │ │ │ │ │ │
│ │ │ │ ControllerPublish │ │ │
│ │ │ │ (attach disk │ │ │ │
│ │ │ │ to node) │ │ │ │
│ │ │ ┌────────────┤◀─────────────────────────────┤ │ │
│ │ │ │ │ attach to │ │ │ │
│ │ │ │ ├──────────────▶│ │ │ │
│ │ │ │ │ attached │ │ │ │
│ │ │ │ │◀──────────────┤ │ │ │
│ │ │ └───────────▶│ │ │ │ │
│ │ │ │ │ │ │ │
│ │ kubelet: mount volume │ │ │ │ │
│ ├──────────────────────────────────────────────────────────────────────────▶│ │
│ │ │ │ │ │ │ │
│ │ │ │ NodeStage + │ │ │ │
│ │ │ │ NodePublish │ │ │ │
│ │ │ │ (format, mount) │ │ │
│ │ │ │◀─────────────────────────────────────────────┤ │
│ │ │ │ mounted │ │ │ │
│ │ │ ├──────────────────────────────────────────────▶│ │
│ │ │ │ │ │ │ │
│ │ │ │ │ │ start │ │
│ │ │ │ │ │ container │ │
│ │ │ │ │ │ with mount │ │
│ │ │ │ │ ├─────────────▶│ │
│ │ │ │ │ │ │ Running │
The same steps in a compact numbered form:
DYNAMIC PROVISIONING FLOW
──────────────────────────
1. User creates PVC
┌──────────────────────────────┐
│ kind: PersistentVolumeClaim │
│ spec: │
│ storageClassName: gp3 │
│ resources: │
│ requests: │
│ storage: 50Gi │
└──────────┬───────────────────┘
│
▼
2. external-provisioner sees unbound PVC
with storageClassName matching its driver
│
▼
3. Calls CSI CreateVolume() → cloud creates EBS volume
│
▼
4. Creates PV object bound to the PVC
│
▼
5. Pod is scheduled to a node
│
▼
6. external-attacher sees VolumeAttachment →
calls CSI ControllerPublishVolume() →
cloud attaches EBS to EC2 instance
│
▼
7. Node plugin: NodeStageVolume() formats + mounts
│
▼
8. Node plugin: NodePublishVolume() bind-mounts into pod
│
▼
9. Pod sees /data with 50Gi filesystem
WaitForFirstConsumer: Why It Matters
StorageClasses have a volumeBindingMode field with two options:
- Immediate: The volume is created as soon as the PVC is created.
- WaitForFirstConsumer: The volume is not created until a pod using the PVC is scheduled.
WaitForFirstConsumer is critical for availability-zone-aware storage. EBS volumes, GCE PDs, and Azure Managed Disks are zonal — they exist in a specific availability zone. If a PVC creates an EBS volume in us-east-1a immediately, but the scheduler places the pod in us-east-1b, the volume cannot be attached. WaitForFirstConsumer delays volume creation until the pod is scheduled, so the volume is created in the same AZ as the node.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: gp3-waitforfirstconsumer
provisioner: ebs.csi.aws.com
parameters:
type: gp3
fsType: ext4
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
allowVolumeExpansion: true
Always use WaitForFirstConsumer for zonal block storage. The only exception is if you are running a single-AZ cluster.
Volume Snapshots
CSI volume snapshots allow point-in-time copies of PersistentVolumes. The workflow uses three objects:
# 1. VolumeSnapshotClass (like StorageClass but for snapshots)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete
---
# 2. VolumeSnapshot (request a snapshot of an existing PVC)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: my-app-snapshot
spec:
volumeSnapshotClassName: ebs-snapshot-class
source:
persistentVolumeClaimName: my-app-data
---
# 3. Restore from snapshot (create a PVC from the snapshot)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-app-data-restored
spec:
storageClassName: gp3
dataSource:
name: my-app-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
This is the foundation for backup workflows. Tools like Velero use CSI snapshots internally.
Common Mistakes and Misconceptions
- “All storage classes perform the same.” gp3 vs io2 vs local NVMe have vastly different IOPS, throughput, and cost profiles. Match storage class to workload requirements, especially for databases.
- “Cross-AZ traffic is free.” All three major clouds charge $0.01-0.02/GB for cross-AZ data transfer. High-traffic services with pods spread across AZs can accumulate significant costs.
- “I should use one big VPC for everything.” Separate VPCs (or at least subnets) for dev/staging/production provide network-level isolation. VPC peering connects them when needed.
Further Reading
- AWS VPC CNI documentation — Detailed explanation of ENI-based pod networking
- GKE VPC-native clusters — How Alias IPs work for pod networking
- Azure CNI overview — Azure CNI vs kubenet comparison
- CSI specification — The official CSI spec
- EBS CSI driver — AWS EBS CSI implementation
- Kubernetes storage documentation — Official PV, PVC, and StorageClass docs
- Volume snapshot documentation — CSI snapshot workflow
Next: Your First Workloads
Chapter 18: Your First Workloads
This chapter is hands-on. Every YAML example is complete — you can apply it to a running cluster and observe the result.
Exercise 1: Deployment + Service
A Deployment manages a set of identical pods. A Service provides a stable network endpoint to reach them. Together, they are the fundamental building block of every Kubernetes application.
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
namespace: default
labels:
app: web-app
spec:
replicas: 3 # Run 3 identical pods
selector:
matchLabels:
app: web-app # The Deployment manages pods with this label
template: # Pod template --- every pod created from this
metadata:
labels:
app: web-app # Must match selector.matchLabels
spec:
containers:
- name: web
image: nginx:1.27.3 # Always pin a specific version. Never use :latest.
ports:
- containerPort: 80
protocol: TCP
resources:
requests: # Scheduler uses these for placement decisions
cpu: 100m # 100 millicores = 0.1 CPU core
memory: 128Mi # 128 mebibytes
limits: # Hard ceiling the container cannot exceed
cpu: 250m
memory: 256Mi
readinessProbe: # Pod is added to Service endpoints only when ready
httpGet:
path: /
port: 80
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe: # Pod is restarted if this fails
httpGet:
path: /
port: 80
initialDelaySeconds: 15
periodSeconds: 20
Key fields explained:
- spec.replicas: The desired number of pod instances. The Deployment controller continuously reconciles the actual count to match this number.
- spec.selector.matchLabels: How the Deployment identifies which pods it owns. This must match the pod template labels. If it does not, the API server rejects the Deployment.
- spec.template: The blueprint for each pod. Every pod created by this Deployment is identical (same image, same resources, same probes).
- resources.requests: The minimum resources the scheduler guarantees. A pod with 100m CPU requests is guaranteed 0.1 cores. The scheduler will not place the pod on a node that cannot satisfy this request.
- resources.limits: The maximum resources the container can use. Exceeding CPU limits causes throttling (the container is slowed down). Exceeding memory limits causes OOMKill (the container is terminated).
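The resource values above use Kubernetes' quantity notation. As a rough sketch of the two suffixes used here (the real parser in Kubernetes' apimachinery handles many more, such as n, k, M, G, and Ti), they decode like this:

```python
def parse_cpu(quantity: str) -> float:
    """Parse a CPU quantity into cores. Handles only the plain and 'm' forms."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores: 100m -> 0.1 cores
    return float(quantity)

def parse_memory(quantity: str) -> int:
    """Parse a memory quantity into bytes. Handles only binary suffixes."""
    suffixes = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in suffixes.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)

print(parse_cpu("100m"))      # 0.1 (cores)
print(parse_memory("128Mi"))  # 134217728 (bytes)
```

Note the binary suffixes: Mi is mebibytes (1024²), not megabytes (1000²) — a 128Mi limit is slightly more than 128 MB.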
Now the Service:
apiVersion: v1
kind: Service
metadata:
name: web-app
namespace: default
spec:
type: ClusterIP # Internal-only. Reachable within the cluster.
selector:
app: web-app # Route traffic to pods with this label
ports:
- port: 80 # The port the Service listens on
targetPort: 80 # The port on the pod to forward to
protocol: TCP
The Service creates a stable virtual IP (ClusterIP) that load-balances across all pods matching the selector. When pods are created, destroyed, or become unready, the Service automatically updates its endpoints. This decouples clients from the pod lifecycle.
flowchart TD
Client["Client Pod"] -- "GET http://web-app/" --> Service["Service<br>ClusterIP: 10.96.45.12"]
Service -- "load balance" --> Pod1["Pod 1<br>10.244.1.5:80"]
Service -- "load balance" --> Pod2["Pod 2<br>10.244.2.8:80"]
Service -- "load balance" --> Pod3["Pod 3<br>10.244.1.6:80"]
kube-proxy maintains iptables/IPVS rules that distribute traffic across healthy pods.
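In iptables mode, that distribution is implemented as a chain of random-match rules: the first rule matches with probability 1/n, the next with 1/(n-1) of what remains, and the last always matches. A sketch of the arithmetic (illustrative, not kube-proxy's actual code):

```python
def iptables_probabilities(n: int) -> list[float]:
    """Per-rule match probability in the chain kube-proxy generates,
    so each of n endpoints receives an equal 1/n share of connections."""
    return [1 / (n - i) for i in range(n)]

def overall_shares(probs: list[float]) -> list[float]:
    """Overall share per endpoint = P(rule matches) * P(no earlier rule matched)."""
    shares, remaining = [], 1.0
    for p in probs:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares

print(iptables_probabilities(3))                 # [0.333..., 0.5, 1.0]
print(overall_shares(iptables_probabilities(3))) # each endpoint gets ~1/3
```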
Apply both:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl get pods -l app=web-app
kubectl get endpoints web-app
The endpoints command shows which pod IPs are currently backing the Service. Pods that fail their readiness probe are removed from endpoints.
Exercise 2: Scaling
Scaling a Deployment is a single field change:
kubectl scale deployment web-app --replicas=5
Or declaratively, change spec.replicas: 5 and kubectl apply. The Deployment controller creates 2 new pods. The scheduler places them on nodes with available resources. The Service automatically includes them in its endpoint list once they pass their readiness probe.
Scale down to 2:
kubectl scale deployment web-app --replicas=2
The Deployment controller selects 3 pods for termination. Kubernetes sends SIGTERM, waits for terminationGracePeriodSeconds (default 30 seconds), then sends SIGKILL. During this window, the pod is removed from Service endpoints so it stops receiving new traffic.
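Graceful termination only works if the application actually handles SIGTERM. A minimal sketch in Python (illustrative — most server frameworks expose their own shutdown hook); here the signal is delivered from within the process to simulate the kubelet:

```python
import os
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first. Set a flag so the main loop can
    # finish in-flight work instead of dying mid-request.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the kubelet delivering SIGTERM to this process:
os.kill(os.getpid(), signal.SIGTERM)

while not shutting_down:
    time.sleep(0.1)  # ... serve requests ...

# Drain connections here, then exit 0 before the grace period
# (terminationGracePeriodSeconds) expires and SIGKILL arrives.
print("SIGTERM received; draining and exiting cleanly")
```

A process that ignores SIGTERM makes every scale-down and rolling update wait the full grace period before being SIGKILLed.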
Exercise 3: Rolling Updates
Change the image version to trigger a rolling update:
kubectl set image deployment/web-app web=nginx:1.27.4
Or change the image in the YAML and kubectl apply. The Deployment controller performs a rolling update controlled by two parameters:
spec:
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # At most 1 extra pod above desired count
maxUnavailable: 0 # Zero pods can be unavailable during update
ROLLING UPDATE (replicas=3, maxSurge=1, maxUnavailable=0)
──────────────────────────────────────────────────────────
Time Old Pods (v1) New Pods (v2) Total Running
──── ───────────── ───────────── ─────────────
t0 [A] [B] [C] 3 (all v1)
t1 [A] [B] [C] [D]creating 3 + 1 surge
t2 [A] [B] [C] [D]ready 4 (surge = 1)
t3 [A] [B] X [D] 3 (C terminated)
t4 [A] [B] [D] [E]creating 3 + 1 surge
t5 [A] [B] [D] [E]ready 4
t6 [A] X [D] [E] 3 (B terminated)
t7 [A] [D] [E] [F]creating 3 + 1 surge
t8 [A] [D] [E] [F]ready 4
t9 X [D] [E] [F] 3 (all v2)
maxSurge: 1 means at most 1 extra pod can exist above the desired replica count. This provides capacity during the transition.
maxUnavailable: 0 means every old pod must be replaced by a ready new pod before it is terminated. This ensures zero downtime. The trade-off is that the update requires temporarily running 4 pods (3 desired + 1 surge), which needs extra cluster capacity.
Alternative strategies:
- maxSurge: 0, maxUnavailable: 1: No extra pods, but one pod is unavailable during each step. Saves resources, risks reduced capacity.
- maxSurge: 25%, maxUnavailable: 25%: The default. Balances speed and availability.
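Percentages are resolved against the replica count, and the rounding direction matters: maxSurge rounds up while maxUnavailable rounds down, which keeps the rollout both progressing and conservative. A sketch of the arithmetic (resolve_bounds is a hypothetical helper, not a Kubernetes API):

```python
import math

def resolve_bounds(replicas: int, max_surge, max_unavailable):
    """Resolve maxSurge / maxUnavailable (plain int or '25%' string) into
    the pod-count ceiling and ready-pod floor during a rolling update.
    Kubernetes rounds maxSurge up and maxUnavailable down."""
    def resolve(value, round_up):
        if isinstance(value, str) and value.endswith("%"):
            exact = replicas * int(value.rstrip("%")) / 100
            return math.ceil(exact) if round_up else math.floor(exact)
        return value

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas + surge, replicas - unavailable

print(resolve_bounds(3, 1, 0))          # (4, 3): the zero-downtime example above
print(resolve_bounds(3, "25%", "25%"))  # (4, 3): the defaults at 3 replicas
print(resolve_bounds(10, "25%", "25%")) # (13, 8)
```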
Roll back if something goes wrong:
kubectl rollout undo deployment/web-app
kubectl rollout status deployment/web-app
kubectl rollout history deployment/web-app
Exercise 4: ConfigMaps and Secrets
Configuration should be separated from container images. ConfigMaps hold non-sensitive configuration. Secrets hold sensitive data (passwords, tokens, certificates).
apiVersion: v1
kind: ConfigMap
metadata:
name: web-config
data:
APP_ENV: "production"
LOG_LEVEL: "info"
config.json: |
{
"database_pool_size": 10,
"cache_ttl_seconds": 300,
"feature_flags": {
"new_dashboard": true
}
}
---
apiVersion: v1
kind: Secret
metadata:
name: web-secrets
type: Opaque
stringData: # stringData accepts plain text (base64 encoded on save)
DATABASE_URL: "postgres://user:pass@db:5432/myapp"
API_KEY: "sk-abc123secret"
Mount as volumes, not environment variables. This is a best practice for two reasons:
- Volume mounts can be updated without restarting the pod (if subPath is not used).
- Environment variables are exposed in kubectl describe pod, process listings, and crash dumps. Volume-mounted files are more contained.
# In the Deployment spec.template.spec:
containers:
- name: web
image: my-app:v1.2.0
volumeMounts:
- name: config-volume
mountPath: /etc/app/config
readOnly: true
- name: secret-volume
mountPath: /etc/app/secrets
readOnly: true
volumes:
- name: config-volume
configMap:
name: web-config
- name: secret-volume
secret:
secretName: web-secrets
defaultMode: 0400 # Read-only by owner
The application reads /etc/app/config/config.json and /etc/app/secrets/DATABASE_URL as files. When the ConfigMap is updated, the kubelet updates the mounted files within 1-2 minutes (the sync period). The application must watch for file changes or be signaled to reload.
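An application can pick up those updates by polling the mounted file. A minimal sketch (reload_if_changed is a hypothetical helper; inotify or your config library's watcher would be the production choice) — the kubelet swaps the mount's backing file atomically, which changes the mtime seen through the path:

```python
import json
import os

def reload_if_changed(path, last_mtime, on_change):
    """Check the mounted config file once; call on_change with the parsed
    config if it changed since last_mtime. Returns the current mtime --
    call this periodically from the application's main loop."""
    mtime = os.stat(path).st_mtime
    if last_mtime is not None and mtime != last_mtime:
        with open(path) as f:
            on_change(json.load(f))
    return mtime

# Typical use inside the pod (path from the volumeMount above):
#   mtime = None
#   while True:
#       mtime = reload_if_changed("/etc/app/config/config.json", mtime, apply_config)
#       time.sleep(10)
```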
Note: Kubernetes Secrets are base64-encoded, not encrypted at rest by default. For actual security, enable encryption at rest (EncryptionConfiguration) or use an external secret store (AWS Secrets Manager, HashiCorp Vault) with the External Secrets Operator.
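To see why base64 is not a security boundary — decoding requires no key, only the encoded bytes:

```python
import base64

# What the API server stores in etcd for the stringData above:
encoded = base64.b64encode(b"sk-abc123secret").decode()

# Anyone who can read the Secret object (kubectl get secret -o yaml)
# or the raw etcd files can reverse it with no key at all:
decoded = base64.b64decode(encoded).decode()
assert decoded == "sk-abc123secret"
print(decoded)
```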
Exercise 5: Resource Requests, Limits, and QoS
Understanding the difference between CPU and memory limits is fundamental to running stable workloads.
CPU is compressible. When a container exceeds its CPU limit, it is throttled — the kernel’s CFS (Completely Fair Scheduler) restricts the container’s CPU time. The container runs slower but continues to run. It is never killed for using too much CPU.
Memory is non-compressible. When a container exceeds its memory limit, it is OOMKilled — the kernel’s OOM killer terminates the process. There is no way to “slow down” memory usage. The container either fits in its limit or it dies.
CPU vs MEMORY: WHAT HAPPENS WHEN YOU EXCEED LIMITS
───────────────────────────────────────────────────
CPU (compressible):
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Request │ │ Using │ │ Limit │
│ 100m │ ... │ 300m │ ... │ 250m │
└─────────┘ └────┬────┘ └─────────┘
│
Container is THROTTLED.
Runs slower. Not killed.
CFS quota enforced.
Memory (non-compressible):
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Request │ │ Using │ │ Limit │
│ 128Mi │ ... │ 300Mi │ ... │ 256Mi │
└─────────┘ └────┬────┘ └─────────┘
│
Container is OOMKilled.
Exit code 137 (128 + SIGKILL=9).
Pod restarts (CrashLoopBackOff if repeated).
QoS classes are assigned automatically based on resource configuration:
| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | Every container has requests == limits for both CPU and memory | Last to be evicted |
| Burstable | At least one container has a request or limit set, but they are not all equal | Middle priority |
| BestEffort | No requests or limits set on any container | First to be evicted |
When a node runs out of memory, the kubelet evicts pods in order: BestEffort first, then Burstable (sorted by how much they exceed their requests), then Guaranteed (only under extreme pressure). Always set both requests and limits. Setting them equal gives you Guaranteed QoS — the strongest protection against eviction.
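The assignment rule can be sketched as a function over each container's resources (simplified — it ignores the API server's defaulting, which copies limits into requests when only limits are set):

```python
def qos_class(containers):
    """Derive the QoS class from a list of containers, each a dict like
    {'requests': {'cpu': ..., 'memory': ...}, 'limits': {...}}."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"   # nothing set anywhere
    guaranteed = all(
        c.get("requests") and c.get("limits")
        and c["requests"].get("cpu") and c["requests"].get("memory")
        and c["requests"] == c["limits"]   # requests == limits for both resources
        for c in containers
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "100m", "memory": "128Mi"},
                  "limits":   {"cpu": "100m", "memory": "128Mi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "100m", "memory": "128Mi"},
                  "limits":   {"cpu": "250m", "memory": "256Mi"}}]))  # Burstable
print(qos_class([{}]))                                                # BestEffort
```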
Exercise 6: Ingress
A Service of type ClusterIP is only reachable inside the cluster. Ingress exposes HTTP/HTTPS routes from outside the cluster to Services inside the cluster.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: web-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
ingressClassName: nginx # Which Ingress controller handles this
tls:
- hosts:
- app.example.com
secretName: app-tls-cert # Secret containing TLS cert and key
rules:
- host: app.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: web-app
port:
number: 80
- path: /api
pathType: Prefix
backend:
service:
name: api-service
port:
number: 8080
Ingress requires an Ingress controller — a pod that watches Ingress resources and configures a reverse proxy (typically NGINX, Traefik, or HAProxy). The Ingress resource itself is just configuration; the controller is the data plane that routes traffic.
INGRESS TRAFFIC FLOW
────────────────────
Internet
│
▼
Load Balancer (cloud LB or NodePort)
│
▼
Ingress Controller Pod (NGINX)
│
├── Host: app.example.com, Path: / → Service: web-app:80
│ → Pod 10.244.1.5:80
│ → Pod 10.244.2.8:80
│
└── Host: app.example.com, Path: /api → Service: api-service:8080
→ Pod 10.244.1.9:8080
Install an Ingress controller (it is not included by default):
# NGINX Ingress Controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.12.0/deploy/static/provider/cloud/deploy.yaml
Putting It All Together
A complete application typically combines all of the above:
COMPLETE APPLICATION STACK
──────────────────────────
Ingress (app.example.com)
│
▼
Service (ClusterIP)
│
├──► Pod 1 ──► ConfigMap (config files)
│ Secret (credentials)
│ PVC (persistent data)
│
├──► Pod 2 ──► (same mounts)
│
└──► Pod 3 ──► (same mounts)
Apply resources in dependency order: Namespace, ConfigMap, Secret, PVC, Deployment, Service, Ingress. Or put them all in one file separated by --- and let kubectl apply handle the ordering.
Common Mistakes and Misconceptions
- “Using the latest tag is convenient and fine.” latest is mutable — it can point to different images over time. This breaks reproducibility and rollbacks. Always use specific version tags or digests.
- “Pods don’t need resource requests and limits.” Without requests, the scheduler can’t make good placement decisions. Without limits, a single pod can consume all node resources and crash neighbors.
- “Restarting a Deployment means deleting and recreating it.” Use kubectl rollout restart deployment/<name> to trigger a rolling restart without downtime or losing the Deployment’s history.
- “I should use kubectl run for everything.” kubectl run creates bare pods without a controller. Use Deployments for services (self-healing, rolling updates) and Jobs for batch work.
When things go wrong, see Appendix D: Troubleshooting Quick Reference for a mapping of error messages to root causes.
Further Reading
- Kubernetes Deployments — Official Deployment documentation
- Services documentation — Service types, selectors, and endpoints
- Ingress documentation — Ingress resource specification
- Resource Management — Requests, limits, and QoS classes
- ConfigMap and Secrets — Configuration management best practices
- KillerCoda interactive labs — Browser-based exercises for Deployments, Services, and Ingress
- KodeKloud CKAD course — Hands-on application deployment labs
- Kubernetes Basics Tutorial — Official interactive tutorial for deploying your first app
Next: Debugging Kubernetes
Chapter 19: Debugging Kubernetes
Kubernetes failures are often opaque. A pod does not start, a service does not route traffic, a node disappears — and the system gives you a status word and expects you to figure out the rest. This chapter builds a systematic debugging methodology and a reference for the most common failure modes. For a quick-reference cheat sheet of errors and fixes, see Appendix D: Troubleshooting Quick Reference.
The Debugging Workflow
Every Kubernetes debugging session follows the same escalation path:
THE DEBUGGING ESCALATION PATH
──────────────────────────────
kubectl get What exists? What state is it in?
│
▼
kubectl describe Why is it in that state? What events occurred?
│
▼
kubectl logs What did the application say?
│
▼
kubectl exec Get inside the container and investigate
│
▼
kubectl debug Cannot exec? Use an ephemeral debug container
│
▼
Node-level debug SSH to the node, check kubelet logs, check runtime
Step 1: kubectl get — What Exists?
Start broad and narrow down.
# Overview of all resources in a namespace
kubectl get all -n my-namespace
# Pods with extra detail
kubectl get pods -n my-namespace -o wide
# Watch pods in real time
kubectl get pods -n my-namespace -w
# Filter by label
kubectl get pods -l app=web-app -o wide
The -o wide flag shows node placement and pod IPs. The -w flag watches for changes in real time — invaluable for observing rolling updates, scaling events, or crash loops.
Step 2: kubectl describe — Why?
kubectl describe shows the full history of an object: its current spec, its status, and the events that affected it. Events are the single most important debugging data source in Kubernetes.
kubectl describe pod web-app-7d4f8b6c9-x2z4p
The output includes:
- Status: The pod’s current phase (Pending, Running, Succeeded, Failed, Unknown)
- Conditions: Ready, Initialized, ContainersReady, PodScheduled — each with a reason if false
- Container state: Waiting (with reason), Running, or Terminated (with exit code)
- Events: Time-ordered log of what happened to this pod
Events decay after 1 hour by default. If you are debugging something that happened hours ago, events may be gone. Use a monitoring system to persist events (more on this in Chapter 20).
Step 3: kubectl logs — What Did the Application Say?
# Current logs
kubectl logs web-app-7d4f8b6c9-x2z4p
# Previous container's logs (after a restart)
kubectl logs web-app-7d4f8b6c9-x2z4p --previous
# Follow logs in real time
kubectl logs web-app-7d4f8b6c9-x2z4p -f
# Logs from a specific container in a multi-container pod
kubectl logs web-app-7d4f8b6c9-x2z4p -c sidecar
# Logs from all pods matching a label
kubectl logs -l app=web-app --all-containers
The --previous flag is critical for CrashLoopBackOff debugging. The current container has just started (and may have nothing useful in its logs yet), but the previous container’s logs show why it crashed.
Step 4: kubectl exec — Get Inside
# Interactive shell
kubectl exec -it web-app-7d4f8b6c9-x2z4p -- /bin/sh
# Run a single command
kubectl exec web-app-7d4f8b6c9-x2z4p -- cat /etc/app/config/config.json
# Check DNS resolution from inside the pod
kubectl exec web-app-7d4f8b6c9-x2z4p -- nslookup my-service
# Check network connectivity
kubectl exec web-app-7d4f8b6c9-x2z4p -- wget -qO- http://my-service:8080/health
Step 5: kubectl debug — When exec Is Not Enough
Many production images are distroless — they contain only the application binary, with no shell, no curl, no debugging tools. You cannot exec into something that has no shell.
Ephemeral debug containers solve this. They inject a temporary container into a running pod that shares the pod’s network namespace (and optionally its process namespace).
# Attach a debug container with networking tools
kubectl debug -it web-app-7d4f8b6c9-x2z4p \
--image=nicolaka/netshoot \
--target=web
# The --target flag shares the process namespace with the specified container
# You can now see the target container's processes with ps aux
The debug container runs alongside the existing containers in the same pod. It shares the network namespace (same IP, same ports) but has its own filesystem with the debugging tools you need. When you exit, the ephemeral container is cleaned up.
You can also debug nodes:
# Create a debugging pod on a specific node
kubectl debug node/worker-1 -it --image=ubuntu
# This creates a pod with hostPID, hostNetwork, and the node's
# filesystem mounted at /host. You can inspect the node as if
# you had SSH access.
Understanding Pod Status
Pod status words are the first signal in any debugging session. Here is what each one means and how to investigate it.
POD LIFECYCLE
─────────────
Pending ──► Running ──► Succeeded
│ │
│ └──► Failed
│
└──► (stuck here: scheduling or volume issues)
Container States:
Waiting ──► Running ──► Terminated
│ │
│ └──► (exit code 0 = success)
│ └──► (exit code non-zero = error)
│
└──► CrashLoopBackOff (repeated Terminated → Waiting cycle)
Status Reference Table
| Status | Likely Cause | Diagnostic Command |
|---|---|---|
| Pending (no events) | No node has enough resources | kubectl describe pod — look for “Insufficient cpu/memory” in events |
| Pending (FailedScheduling) | Node selector, affinity, or taint preventing scheduling | kubectl describe pod — check node affinity/selector rules and taints |
| Pending (volume) | PVC unbound, StorageClass missing, or AZ mismatch | kubectl get pvc and kubectl describe pvc |
| ContainerCreating (stuck) | Image pull in progress, or volume mount failing | kubectl describe pod — check events for pull progress or mount errors |
| ImagePullBackOff | Wrong image name, tag does not exist, or registry auth failure | kubectl describe pod — read the exact error. Check image name and imagePullSecrets |
| CrashLoopBackOff | Container starts and immediately exits | kubectl logs --previous — read the application’s error output |
| CrashLoopBackOff (exit 1) | Application error (bad config, missing dependency) | kubectl logs --previous and check ConfigMap/Secret mounts |
| CrashLoopBackOff (exit 137) | OOMKilled — container exceeded memory limit | kubectl describe pod — look for “OOMKilled”. Increase memory limit or fix memory leak |
| CrashLoopBackOff (exit 139) | Segfault in the application | kubectl logs --previous — check for native crash logs |
| Running but not ready | Readiness probe failing | kubectl describe pod — check readiness probe events |
| OOMKilled | Memory limit exceeded | kubectl describe pod — confirm OOMKilled reason. Check resources.limits.memory |
| Evicted | Node under memory or disk pressure | kubectl describe pod — check eviction reason. kubectl describe node — check conditions |
| Terminating (stuck) | Finalizers blocking deletion, or process ignoring SIGTERM | kubectl get pod -o json \| jq '.metadata.finalizers' |
| Unknown | Kubelet on the node is not reporting | kubectl get nodes — check if the node is NotReady. Investigate the node. |
| Error (on Job/CronJob) | Container exited with non-zero exit code | kubectl logs <pod> |
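The exit codes above 128 in this table encode "killed by signal number (code - 128)", which is why 137 and 139 recur. A small decoder using standard POSIX signal numbers:

```python
import signal

def decode_exit_code(code: int) -> str:
    """Map a container exit code to its meaning. Codes above 128 mean
    the process was killed by signal (code - 128)."""
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name} ({sig.value})"
    return "success" if code == 0 else f"application exit ({code})"

print(decode_exit_code(137))  # killed by SIGKILL (9) -- the OOM killer's signal
print(decode_exit_code(139))  # killed by SIGSEGV (11) -- segfault
print(decode_exit_code(143))  # killed by SIGTERM (15) -- graceful shutdown
```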
Common Failure Patterns
Pattern 1: DNS Resolution Failure
Symptom: Application logs show connection timeouts or “name not found” errors for service names.
Diagnosis:
# Check if CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Test DNS from inside a pod
kubectl exec -it debug-pod -- nslookup kubernetes.default
kubectl exec -it debug-pod -- nslookup my-service.my-namespace.svc.cluster.local
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
Common causes:
- CoreDNS pods are not running
- The pod’s DNS policy is misconfigured
- A NetworkPolicy is blocking DNS traffic (port 53 UDP/TCP to the kube-dns Service)
Pattern 2: Service Not Routing Traffic
Symptom: Requests to a Service ClusterIP time out or return connection refused.
Diagnosis:
# Check if the Service has endpoints
kubectl get endpoints my-service
# If endpoints list is empty:
# 1. Check that pods exist with the right labels
kubectl get pods -l app=web-app
# 2. Check that pods are Ready
kubectl get pods -l app=web-app -o jsonpath='{.items[*].status.conditions}'
# 3. Check that the Service selector matches the pod labels
kubectl get svc my-service -o yaml | grep -A5 selector
The most common cause is a selector mismatch — the Service’s spec.selector labels do not match the pod’s metadata.labels. This is a silent failure: no error, just no traffic.
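Label selection is a subset check — every key/value pair in the Service's selector must appear in the pod's labels, and extra pod labels are ignored. A sketch of the comparison worth doing by eye whenever the endpoints list is empty:

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True if every selector key/value pair is present in the pod's labels
    (the subset semantics Kubernetes uses for equality-based selectors)."""
    return all(pod_labels.get(key) == value for key, value in selector.items())

# Extra pod labels do not matter:
print(selector_matches({"app": "web-app"},
                       {"app": "web-app", "tier": "frontend"}))  # True

# A one-character mismatch silently selects nothing:
print(selector_matches({"app": "webapp"}, {"app": "web-app"}))   # False
```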
Pattern 3: Node NotReady
Symptom: kubectl get nodes shows a node in NotReady status.
Diagnosis:
# Check node conditions
kubectl describe node worker-1
# Look for conditions:
# MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable
# If you can SSH to the node:
# Check kubelet status
systemctl status kubelet
journalctl -u kubelet -n 100
# Check container runtime
systemctl status containerd
crictl ps
Common causes:
- kubelet crashed
- Container runtime is down
- The node ran out of disk space
- Network connectivity to the API server was lost
Pattern 4: Persistent Volume Claim Stuck in Pending
Symptom: PVC stays in Pending state indefinitely.
Diagnosis:
kubectl describe pvc my-claim
# Look for events like:
# - "no persistent volumes available for this claim"
# - "storageclass not found"
# - "waiting for first consumer to be created before binding"
Common causes:
- StorageClass does not exist
- The CSI driver is not installed
- WaitForFirstConsumer volume binding mode is waiting for a pod to be scheduled
- The requested storage exceeds available capacity
Pattern 5: Intermittent OOMKills
Symptom: Pods restart periodically with exit code 137.
Diagnosis:
# Confirm OOMKill
kubectl describe pod my-pod | grep -A5 "Last State"
# Check current memory usage
kubectl top pod my-pod
# Check the memory limit
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources.limits.memory}'
The fix is either to increase the memory limit or to fix the memory leak in the application. If kubectl top shows memory growing over time without plateauing, suspect a leak. If it grows to a stable level that exceeds the limit, the limit is too low.
Advanced: Reading Events Cluster-Wide
Events are namespaced objects. To see all events across the cluster:
# All events in a namespace, sorted by time
kubectl get events -n my-namespace --sort-by='.lastTimestamp'
# All events cluster-wide
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
# Watch for new events in real time
kubectl get events -n my-namespace -w
# Filter events by type (Warning events are usually the interesting ones)
kubectl get events -n my-namespace --field-selector type=Warning
Events are the system’s audit trail. When something goes wrong, the event stream usually tells you what happened, when, and why — if you look quickly enough before the events expire.
Common Mistakes and Misconceptions
- “kubectl logs shows everything.” Logs only show stdout/stderr from the current (or previous with -p) container. For multi-container pods, specify -c container-name. For crashed init containers, use -c init-container-name.
- “If the pod is Running, it’s healthy.” Running means the container process is alive, not that it’s serving traffic correctly. Readiness probes determine if a pod receives traffic; a Running pod can be unready.
- “kubectl exec is safe in production.” exec gives shell access to running containers, bypassing RBAC audit trails for in-container actions. Use it for debugging but not as a routine operational tool. Audit exec usage.
- “Events tell the full story.” Events are garbage-collected after 1 hour by default. For historical debugging, you need persistent logging (Loki, CloudWatch, etc.).
Further Reading
- Kubernetes troubleshooting documentation — Official debugging guides for applications, clusters, and services
- kubectl debug documentation — Ephemeral debug container reference
- Pod lifecycle — Pod phases, conditions, and container states
- nicolaka/netshoot — Docker image with network debugging tools for ephemeral containers
- KillerCoda debugging scenarios — Interactive browser-based troubleshooting labs
- Learnk8s troubleshooting flowchart — Visual flowchart for debugging Deployments
- CNCF Slack #kubernetes-users — Community support for Kubernetes debugging
- Kubernetes Debugging Exercises — Practice debugging common Kubernetes issues
Next: Production Readiness
Chapter 20: Production Readiness
A cluster that runs workloads is not the same as a cluster that is ready for production. Production readiness is a checklist of capabilities that, taken together, ensure your cluster is observable, secure, recoverable, and cost-efficient.
The Production Readiness Checklist
PRODUCTION READINESS
────────────────────
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Monitoring │ │ Logging │ │ Security │
│ Prometheus │ │ Loki / EFK │ │ RBAC, PSS, │
│ Grafana │ │ │ │ NetworkPol │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
┌──────▼───────┐ ┌──────▼───────┐ ┌──────▼───────┐
│ Backup │ │ Health │ │ Resource │
│ Velero │ │ Probes │ │ Management │
│ etcd snap │ │ PDBs │ │ QoS, Quotas │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└─────────────────┼─────────────────┘
│
┌──────▼───────┐
│ Cost │
│ Management │
│ Labels, │
│ Kubecost │
└──────────────┘
Monitoring: Prometheus + Grafana
Monitoring answers one question: “Is my system healthy right now, and if not, where is the problem?”
kube-prometheus-stack
The kube-prometheus-stack Helm chart deploys a complete monitoring pipeline:
- Prometheus: Scrapes metrics from all Kubernetes components, node exporters, and application pods
- Grafana: Dashboards for visualization and alerting
- Alertmanager: Routes alerts to Slack, PagerDuty, email, or other channels
- node-exporter: DaemonSet that exports node-level metrics (CPU, memory, disk, network)
- kube-state-metrics: Exports Kubernetes object state as metrics (pod status, deployment replicas, PVC capacity)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set grafana.adminPassword=changeme
This single command deploys 5+ components with pre-configured dashboards and alert rules. The default dashboards cover node health, pod resource usage, API server latency, etcd performance, and CoreDNS metrics.
What to Monitor
| Layer | Key Metrics | Why |
|---|---|---|
| Nodes | CPU utilization, memory utilization, disk I/O, network I/O | Detect resource exhaustion before it causes evictions |
| Pods | CPU usage vs request, memory usage vs limit, restart count | Detect misconfigured resource limits and crash loops |
| API Server | Request latency (p99), error rate, request count | The API server is the heart of the cluster |
| etcd | Disk fsync duration, leader elections, DB size | etcd performance directly affects cluster responsiveness |
| Application | Request latency, error rate, throughput (RED metrics) | Your users care about application health, not node health |
Golden Signals
Monitor the four golden signals for every service:
- Latency: How long requests take (distinguish successful vs failed requests)
- Traffic: How many requests per second
- Errors: How many requests fail
- Saturation: How full is the system (CPU, memory, queue depth)
Logging: Loki or EFK
Metrics tell you that something is wrong. Logs tell you why.
Option 1: Grafana Loki (Recommended)
Loki is a log aggregation system designed for Kubernetes. Unlike Elasticsearch, Loki indexes only labels, not full text. This makes it an order of magnitude cheaper to operate while remaining fast for label-based queries (which is how you search logs in Kubernetes: by pod, namespace, container, node).
helm install loki grafana/loki-stack \
--namespace monitoring \
--set promtail.enabled=true \
--set grafana.enabled=false # Use the Grafana from kube-prometheus-stack
Promtail runs as a DaemonSet, reads container logs from /var/log/pods/, and ships them to Loki with Kubernetes labels attached.
Option 2: EFK Stack (Elasticsearch + Fluentd + Kibana)
The traditional choice. Elasticsearch provides full-text search, which is more powerful than Loki’s label-based queries. The trade-off is operational complexity: Elasticsearch clusters require significant memory, careful index management, and regular maintenance.
Choose Loki if you want simplicity and cost efficiency. Choose EFK if you need full-text search across log content.
Security
Kubernetes security is defense in depth: multiple layers, each reducing the attack surface.
RBAC: Principle of Least Privilege
Every human user, service account, and CI/CD pipeline should have the minimum permissions required for their function. Never use cluster-admin for applications.
# A Role that allows reading pods and logs in a specific namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: pod-reader
namespace: my-app
rules:
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]
---
# Bind the role to a service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: pod-reader-binding
namespace: my-app
subjects:
- kind: ServiceAccount
name: my-app-sa
namespace: my-app
roleRef:
kind: Role
name: pod-reader
apiGroup: rbac.authorization.k8s.io
Key RBAC principles:
- Use Roles (namespaced) over ClusterRoles (cluster-wide) whenever possible
- Never grant * (all) verbs unless absolutely necessary
- Audit RBAC regularly: kubectl auth can-i --list --as=system:serviceaccount:my-app:my-app-sa
- Use kubectl auth can-i create deployments --as=jane to test permissions
NetworkPolicies: Default Deny
By default, every pod can communicate with every other pod — a compromised pod can reach the entire cluster network.
Start with a default-deny policy in every namespace, then explicitly allow the traffic you need:
# Default deny all ingress and egress in a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny
namespace: my-app
spec:
podSelector: {} # Applies to ALL pods in the namespace
policyTypes:
- Ingress
- Egress
---
# Allow the web pods to receive traffic on port 80
# and make DNS queries (port 53) and reach the database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-web-traffic
namespace: my-app
spec:
podSelector:
matchLabels:
app: web-app
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
ports:
- port: 80
egress:
- to:
- podSelector:
matchLabels:
app: database
ports:
- port: 5432
- to: # Allow DNS
- namespaceSelector: {}
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
Note: NetworkPolicies require a CNI plugin that supports them (Calico, Cilium, Weave). Flannel does not enforce NetworkPolicies.
Pod Security Standards
Pod Security Standards (PSS) replace the deprecated PodSecurityPolicy. They define three levels:
| Level | Description | Key Restrictions |
|---|---|---|
| Privileged | Unrestricted | None |
| Baseline | Minimally restrictive | No hostNetwork, no hostPID, no privileged containers |
| Restricted | Heavily restricted | Must run as non-root, drop ALL capabilities (only NET_BIND_SERVICE may be added back), allowPrivilegeEscalation: false, seccomp RuntimeDefault or Localhost |
Apply them at the namespace level:
apiVersion: v1
kind: Namespace
metadata:
name: my-app
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/audit: restricted
Every namespace running application workloads should enforce at least baseline. Use restricted for workloads that do not need elevated privileges.
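For reference, a pod spec that passes the restricted level looks roughly like this (the image name is a placeholder):

```yaml
# Minimal securityContext that satisfies the "restricted" profile
apiVersion: v1
kind: Pod
metadata:
  name: restricted-example
  namespace: my-app
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault        # or Localhost
  containers:
  - name: app
    image: my-app:v1.2.0          # placeholder image
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]             # NET_BIND_SERVICE may be added back
```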
Image Scanning
Scan container images for known vulnerabilities before deploying them. Trivy is the most widely used open-source scanner:
# Scan an image locally
trivy image nginx:1.27.3
# Integrate into CI/CD to fail builds with critical vulnerabilities
trivy image --exit-code 1 --severity CRITICAL my-app:v1.2.0
For continuous in-cluster scanning, deploy Trivy Operator, which scans running workloads and reports vulnerabilities as Kubernetes custom resources.
Backup: Velero + etcd Snapshots
etcd Snapshots
etcd contains the entire cluster state. Regular snapshots are non-negotiable:
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Managed Kubernetes services handle etcd backups automatically. For self-managed clusters, automate this with a CronJob or systemd timer.
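One possible shape for that automation, sketched as a CronJob for a kubeadm-style cluster. The etcd image tag, hostPath locations, and scheduling constraints are assumptions; adapt them to your control plane layout:

```yaml
# Nightly etcd snapshot on a control-plane node (kubeadm-style layout assumed)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
          hostNetwork: true
          restartPolicy: OnFailure
          containers:
          - name: etcdctl
            image: registry.k8s.io/etcd:3.5.12-0   # match your etcd version
            command: ["/bin/sh", "-c"]
            args:
            - >
              ETCDCTL_API=3 etcdctl snapshot save
              /backup/etcd-$(date +%Y%m%d).db
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/server.crt
              --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - {name: pki, mountPath: /etc/kubernetes/pki/etcd, readOnly: true}
            - {name: backup, mountPath: /backup}
          volumes:
          - {name: pki, hostPath: {path: /etc/kubernetes/pki/etcd}}
          - {name: backup, hostPath: {path: /var/backups/etcd}}
```

Ship the snapshot directory off the node (object storage, NFS); a backup that lives only on the control-plane disk dies with the node.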
Velero
Velero backs up Kubernetes resources (YAML manifests) and persistent volume data (via CSI snapshots). It can restore entire namespaces or specific resources to the same or a different cluster.
# Install Velero
velero install --provider aws --bucket my-backup-bucket \
--secret-file ./credentials-velero \
--use-volume-snapshots=true \
--plugins velero/velero-plugin-for-aws:v1.10.0
# Create a backup of a namespace
velero backup create my-app-backup --include-namespaces my-app
# Schedule daily backups with 30-day retention
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--include-namespaces my-app \
--ttl 720h
Test your restores regularly. A backup that has never been tested is not a backup — it is a hope.
Health Probes: Readiness vs. Liveness vs. Startup
These three probes serve different purposes. Conflating them is one of the most common production mistakes.
| Probe | Purpose | What Happens on Failure | When to Use |
|---|---|---|---|
| Readiness | Is the pod ready to serve traffic? | Removed from Service endpoints (stops receiving traffic) | Always. Check that the app can serve requests. |
| Liveness | Is the pod stuck in an unrecoverable state? | Pod is restarted | Only when the app can deadlock or hang. Check a lightweight endpoint. |
| Startup | Has the pod finished starting up? | Liveness/readiness probes are not run until startup succeeds | Slow-starting apps (JVM, large model loading). |
Critical rule: keep readiness and liveness probes different. The readiness probe should check that the application can serve requests (e.g., can it reach its database?). The liveness probe should check that the application process is not deadlocked (e.g., can it respond to a simple /healthz ping?). If you make them the same, a downstream dependency failure (database down) will cause liveness failures, which restarts the pod, which cannot fix a database outage, which creates a restart storm.
startupProbe: # Allow up to 5 minutes for slow startup
httpGet:
path: /healthz
port: 8080
failureThreshold: 30
periodSeconds: 10
readinessProbe: # Check full readiness (dependencies included)
httpGet:
path: /ready
port: 8080
periodSeconds: 10
failureThreshold: 3
livenessProbe: # Check basic aliveness (no dependency checks)
httpGet:
path: /healthz
port: 8080
periodSeconds: 20
failureThreshold: 3
PodDisruptionBudgets
When a node is drained (for upgrades, scaling down, or maintenance), Kubernetes evicts all pods on that node. Without a PodDisruptionBudget (PDB), all replicas of a Deployment on that node could be evicted simultaneously, causing downtime.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-app-pdb
spec:
minAvailable: 2 # At least 2 pods must remain running
selector:
matchLabels:
app: web-app
Alternatively, use maxUnavailable: 1 to allow at most 1 pod to be disrupted at a time. PDBs are respected by kubectl drain, cluster autoscaler, and node upgrade processes.
Resource Management
LimitRanges
Set default requests and limits for a namespace, so that developers who forget to set them still get reasonable defaults:
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: my-app
spec:
limits:
- default:
cpu: 500m
memory: 512Mi
defaultRequest:
cpu: 100m
memory: 128Mi
type: Container
ResourceQuotas
Prevent a single namespace from consuming the entire cluster:
apiVersion: v1
kind: ResourceQuota
metadata:
name: namespace-quota
namespace: my-app
spec:
hard:
requests.cpu: "10"
requests.memory: 20Gi
limits.cpu: "20"
limits.memory: 40Gi
pods: "50"
persistentvolumeclaims: "10"
QoS Classes Revisited
Guaranteed QoS (requests = limits) ensures critical pods are evicted last; Burstable QoS (requests < limits) allows efficient sharing for batch workloads. Avoid BestEffort — see Chapter 18 for detail.
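The classification rule is mechanical enough to sketch in a few lines of Python (simplified: the real kubelet logic also considers init and ephemeral containers):

```python
def qos_class(containers):
    """Classify a pod the way Kubernetes assigns QoS (simplified sketch).

    `containers` is a list of dicts like
    {"requests": {"cpu": "100m", "memory": "128Mi"},
     "limits":   {"cpu": "100m", "memory": "128Mi"}}.
    """
    all_match = True   # Guaranteed: requests == limits everywhere
    any_set = False    # BestEffort: nothing set anywhere
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        if req or lim:
            any_set = True
        for resource in ("cpu", "memory"):
            # Guaranteed requires cpu AND memory requests to equal
            # limits in every container.
            if req.get(resource) is None or req.get(resource) != lim.get(resource):
                all_match = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_match else "Burstable"
```

For example, a container with requests of 100m CPU and limits of 500m is Burstable; one with identical requests and limits for both resources is Guaranteed.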
Cost Management
Kubernetes makes it easy to provision resources and hard to track who is paying for them.
Labels for Cost Attribution
Apply consistent labels to every resource:
metadata:
labels:
team: platform
environment: production
cost-center: engineering
app: web-app
Cloud providers can filter billing data by Kubernetes labels (if label propagation is enabled in the cloud integration).
Tools
- Kubecost: Open-source cost monitoring. Shows cost per namespace, deployment, pod, and label. Identifies idle resources and right-sizing recommendations.
- OpenCost: CNCF project for Kubernetes cost monitoring. Vendor-neutral alternative to Kubecost.
Spot Instances
Run non-critical, fault-tolerant workloads on spot/preemptible instances to reduce compute costs by 60-90%. Use node affinity and tolerations to separate spot-friendly workloads from those that need stable compute:
# Toleration for spot instance taint
tolerations:
- key: "kubernetes.io/spot"
operator: "Equal"
value: "true"
effect: "NoSchedule"
Combine with PDBs to ensure that spot instance reclamation does not take down all replicas simultaneously.
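The node-affinity half of that pairing might look like the snippet below. The kubernetes.io/spot label is an assumption that mirrors the taint above; each provider uses its own label (for example eks.amazonaws.com/capacityType or cloud.google.com/gke-spot), so substitute accordingly:

```yaml
# Require scheduling onto spot-labeled nodes; label key/value are assumptions
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/spot
          operator: In
          values: ["true"]
```

Use preferredDuringSchedulingIgnoredDuringExecution instead if falling back to on-demand nodes is acceptable when spot capacity is unavailable.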
Chaos Engineering
Once your cluster is observable, secured, and backed up, test that it actually survives failure.
- Chaos Mesh: CNCF project. Injects pod failures, network latency, disk I/O stress, and time skew.
- Litmus: Another CNCF chaos engineering project with a library of pre-built experiments.
- Manual chaos: kubectl delete pod <random-pod>, kubectl drain node <random-node>, kill a container runtime on a node. Start simple before adopting frameworks.
The goal is not to break things for fun. The goal is to verify that your monitoring catches the failure, your alerts fire, your PDBs prevent cascading outages, and your team knows how to respond.
The Complete Checklist
Before declaring a cluster production-ready, verify:
- Monitoring: kube-prometheus-stack or equivalent deployed and dashboards reviewed
- Alerting: Critical alerts configured (node down, pod CrashLoopBackOff, disk pressure, API server errors)
- Logging: Loki or EFK collecting logs from all namespaces
- RBAC: No unnecessary cluster-admin bindings; service accounts have minimal permissions
- NetworkPolicies: Default-deny in application namespaces with explicit allow rules
- Pod Security Standards: At least baseline enforced on all application namespaces
- Image scanning: Trivy or equivalent in CI/CD pipeline
- Backup: Velero or equivalent with scheduled backups and tested restores
- Health probes: Readiness, liveness, and startup probes on all Deployments
- PDBs: PodDisruptionBudgets on all production Deployments
- Resource limits: Requests and limits set on all containers
- LimitRanges: Default limits in every namespace
- ResourceQuotas: Quotas on every namespace
- Labels: Consistent labeling for cost attribution and filtering
- etcd backups: Automated (managed K8s) or scripted (self-managed)
- Upgrade plan: Documented process for upgrading Kubernetes and node OS
Common Mistakes and Misconceptions
- “My app works in dev, so it’s production-ready.” Production requires health probes, resource requests/limits, PodDisruptionBudgets, anti-affinity rules, graceful shutdown handling, and monitoring. Dev-working is the starting line, not the finish.
- “Setting replicas to 1 with a PDB is fine.” A PDB with minAvailable: 1 on a single-replica Deployment blocks all voluntary disruptions (node drains, upgrades). Use at least 2 replicas for anything that needs PDB protection.
- “Liveness probes should check dependencies.” If your liveness probe checks the database and the database goes down, Kubernetes kills all your pods — making recovery impossible. Liveness checks should only verify the process itself is alive.
Further Reading
- kube-prometheus-stack Helm chart — The standard monitoring deployment
- Grafana Loki documentation — Log aggregation setup and query language
- Velero documentation — Backup and restore for Kubernetes
- Pod Security Standards — Official PSS reference
- Trivy — Container image vulnerability scanner
- OpenCost — CNCF project for Kubernetes cost monitoring and optimization
- Chaos Mesh — CNCF chaos engineering for Kubernetes
- CNCF Slack — Community channels for Kubernetes operations
- KubeCon talks playlist — Real-world production Kubernetes talks
- EKS Best Practices Guide — Production checklist specific to EKS
- GKE hardening guide — Security best practices for GKE
This concludes Part 3: From Theory to Practice. You have a running cluster, deployed workloads, and the debugging skills to keep them healthy. Part 4 tackles the next challenge: running applications that cannot simply be restarted and replaced — databases, queues, and other stateful systems that need stable identity and persistent storage.
Next: StatefulSets Deep Dive
Chapter 21: StatefulSets Deep Dive
Deployments treat pods as interchangeable. If pod web-abc123 dies, the replacement web-def456 is identical in every way that matters — same image, same configuration, same role. This works beautifully for stateless applications where any instance can handle any request. But some workloads are not interchangeable.
- A database replica cannot simply replace the primary without coordination.
- A distributed system that uses consistent hashing needs members with stable identities.
- A clustered cache needs each node to own a predictable shard of data.
These workloads need something Deployments cannot provide: stable identity.
StatefulSets exist because some pods are not fungible. (For a visual map of how stateful workload concepts relate, see Appendix B: Mental Models.) Like every Kubernetes workload controller, a StatefulSet follows the controller pattern we covered in Chapter 3 — observe, diff, act — but with additional ordering and identity guarantees that the Deployment controller does not provide.
The Identity Problem
Consider what happens when a Deployment manages three pods:
DEPLOYMENT IDENTITY MODEL
──────────────────────────
Deployment: web (replicas=3)
│
├── web-7b9f5d4c8-abc12 ← random suffix
├── web-7b9f5d4c8-def34 ← random suffix
└── web-7b9f5d4c8-ghi56 ← random suffix
Pod dies → Replacement: web-7b9f5d4c8-xyz99 ← new random name
new IP address
new node (maybe)
no memory of its past life
Now compare with a StatefulSet:
STATEFULSET IDENTITY MODEL
───────────────────────────
StatefulSet: db (replicas=3)
│
├── db-0 ← ordinal index 0 (always the first)
├── db-1 ← ordinal index 1 (always the second)
└── db-2 ← ordinal index 2 (always the third)
Pod db-1 dies → Replacement: db-1 ← same name
same PVC (data-db-1)
same DNS record
same identity, different incarnation
The difference is fundamental. A Deployment pod is a disposable worker. A StatefulSet pod is a named member of a group. When db-1 is replaced, the new pod inherits the identity of the old one — its name, its storage, its network address. This is what makes stateful workloads possible on Kubernetes.
StatefulSets vs Deployments
| Property | Deployment | StatefulSet |
|---|---|---|
| Pod names | Random hash suffix (web-7b9f5-abc12) | Ordinal index (web-0, web-1, web-2) |
| Pod creation order | All at once (parallel) | Sequential by default (web-0 → web-1 → web-2) |
| Pod deletion order | Any order | Reverse ordinal by default (web-2 → web-1 → web-0) |
| Storage | Shared or none | Per-pod PVC via volumeClaimTemplates |
| Network identity | ClusterIP Service (virtual IP) | Headless Service (individual DNS per pod) |
| Scaling | Instant (add/remove any pod) | Sequential (add highest, remove highest ordinal) |
| Use case | Stateless apps, web servers, APIs | Databases, message queues, distributed systems |
The cost of these guarantees is operational complexity. StatefulSets are harder to scale, harder to update, and require more careful planning. Use them only when your workload genuinely needs stable identity or per-pod storage.
Headless Services and Stable DNS
A normal ClusterIP Service creates a virtual IP that load-balances requests across all matching pods. A headless Service (one with clusterIP: None) does not create a virtual IP. Instead, it creates individual DNS records for each pod in the StatefulSet.
apiVersion: v1
kind: Service
metadata:
name: db
namespace: default
spec:
clusterIP: None # This makes it headless
selector:
app: db
ports:
- port: 5432
targetPort: 5432
This headless Service produces the following DNS records:
flowchart TB
subgraph sts["StatefulSet: db (replicas: 3)"]
db0["db-0<br>10.244.1.5"]
db1["db-1<br>10.244.2.8"]
db2["db-2<br>10.244.1.9"]
end
subgraph dns["DNS Records (Headless Service: db, clusterIP: None)"]
dns0["db-0.db.default.svc.cluster.local"]
dns1["db-1.db.default.svc.cluster.local"]
dns2["db-2.db.default.svc.cluster.local"]
dnsAll["db.default.svc.cluster.local<br>(A record → all three IPs)"]
end
dns0 --> db0
dns1 --> db1
dns2 --> db2
dnsAll -.-> db0
dnsAll -.-> db1
dnsAll -.-> db2
style dns fill:#f0f0ff,stroke:#333
style sts fill:#e0ffe0,stroke:#333
Application connects to:
- db-0.db.default.svc.cluster.local — always reaches db-0
- db-1.db.default.svc.cluster.local — always reaches db-1
- db.default.svc.cluster.local — reaches any (round-robin DNS)
The DNS naming convention is: <pod-name>.<service-name>.<namespace>.svc.cluster.local
The combination of a stable pod name (db-0) and a stable DNS entry (db-0.db.default.svc.cluster.local) gives each pod a persistent network identity that survives restarts, rescheduling, and node failures.
A PostgreSQL replica can be configured to always connect to db-0.db.default.svc.cluster.local as its primary, regardless of which node db-0 happens to be running on or what IP address it currently has.
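The convention is regular enough to compute. A throwaway helper for assembling these names (pure string assembly, no API calls; the default cluster domain cluster.local is configurable per cluster):

```python
def stateful_pod_fqdn(statefulset, ordinal, service, namespace,
                      cluster_domain="cluster.local"):
    """Stable DNS name a StatefulSet pod gets via its headless Service.

    Pattern: <pod-name>.<service-name>.<namespace>.svc.<cluster-domain>
    """
    pod_name = f"{statefulset}-{ordinal}"
    return f"{pod_name}.{service}.{namespace}.svc.{cluster_domain}"
```

stateful_pod_fqdn("db", 0, "db", "default") yields db-0.db.default.svc.cluster.local, the name used in the example above.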
The StatefulSet Spec
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: db
spec:
serviceName: db # Must match the headless Service name
replicas: 3
selector:
matchLabels:
app: db
template:
metadata:
labels:
app: db
spec:
containers:
- name: postgres
image: postgres:16.2
ports:
- containerPort: 5432
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates: # Per-pod persistent storage
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: gp3
resources:
requests:
storage: 50Gi
The volumeClaimTemplates field is unique to StatefulSets. For each pod, Kubernetes creates a PersistentVolumeClaim named <template-name>-<statefulset-name>-<ordinal>. In this example: data-db-0, data-db-1, data-db-2. Each PVC is bound to its own PersistentVolume, giving each pod dedicated storage.
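That naming rule is easy to check in a few lines (a sketch of the naming convention, not an API call):

```python
def pvc_names(template, statefulset, replicas):
    """PVC names a StatefulSet derives from one volumeClaimTemplate.

    Pattern: <template-name>-<statefulset-name>-<ordinal>
    """
    return [f"{template}-{statefulset}-{i}" for i in range(replicas)]
```

pvc_names("data", "db", 3) produces data-db-0, data-db-1, data-db-2, matching the example above.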
Ordered Operations
By default, StatefulSets use the OrderedReady pod management policy. This means:
- Creation: Pods are created in order: db-0 first, then db-1 only after db-0 is Running and Ready, then db-2 only after db-1 is Running and Ready.
- Scaling up: Same as creation — new pods are added one at a time in ordinal order.
- Scaling down: Pods are removed in reverse order: db-2 first, then db-1, then db-0.
- Deletion: If you delete a StatefulSet, pods are terminated in reverse ordinal order.
This ordering exists because stateful systems often need it. A database primary (db-0) must be running before replicas (db-1, db-2) can initialize and connect. Replicas should be drained before the primary is stopped.
For workloads that do not need ordered operations (for example, a distributed cache where all nodes are peers), you can set podManagementPolicy: Parallel:
spec:
podManagementPolicy: Parallel # All pods start/stop simultaneously
This removes the ordering constraint but retains stable names and per-pod storage.
Update Strategies
RollingUpdate (Default)
Pods are updated in reverse ordinal order: db-2 first, then db-1, then db-0. Each pod must become Ready before the next one is updated.
The partition parameter enables canary deployments:
spec:
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 2 # Only pods with ordinal >= 2 are updated
With partition: 2 and 3 replicas, only db-2 receives the new pod template. db-0 and db-1 remain on the old version. After verifying db-2 is healthy, you lower the partition to 1, then 0, to roll out the update progressively. This is the safest way to update a stateful workload.
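The selection rule itself can be sketched in a couple of lines:

```python
def pods_updated_by_partition(replicas, partition):
    """Ordinals that receive the new pod template under a partitioned
    RollingUpdate: only pods with ordinal >= partition are updated,
    processed in reverse ordinal order (highest first)."""
    return [i for i in range(replicas - 1, -1, -1) if i >= partition]
```

With 3 replicas, partition 2 updates only pod 2; lowering the partition to 0 rolls the update across pods 2, 1, then 0.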
OnDelete
Pods are updated only when you manually delete them. This gives you complete control over the update order:
spec:
updateStrategy:
type: OnDelete
This is useful when the update order matters and the default reverse-ordinal approach is not appropriate — for example, when you need to update replicas before the primary.
PVC Retention Policies
By default, PVCs created by volumeClaimTemplates are never deleted by Kubernetes. This is the safest behavior — you never accidentally lose data — but it means orphaned PVCs accumulate when you scale down or delete a StatefulSet.
Starting in v1.27, you can configure PVC retention:
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Retain # When the StatefulSet is deleted
whenScaledDown: Retain # When replicas are scaled down
| Policy | whenDeleted | whenScaledDown | Behavior |
|---|---|---|---|
| Safest | Retain | Retain | PVCs always preserved (default) |
| Balanced | Delete | Retain | Cleanup on StatefulSet deletion, preserve on scale-down |
| Aggressive | Delete | Delete | PVCs deleted on both operations |
For production databases, always use Retain for both. Data recovery from a lost PVC is far more expensive than cleaning up unused PVCs.
Scale-Down and PVC Persistence
This behavior surprises many operators and deserves special emphasis:
PVC PERSISTENCE ON SCALE-DOWN
───────────────────────────────
BEFORE: replicas=5
db-0 db-1 db-2 db-3 db-4 PVCs: data-db-0 through data-db-4
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
PV0 PV1 PV2 PV3 PV4
AFTER: replicas=3 (scaled down)
db-0 db-1 db-2 Pods db-3, db-4 terminated
│ │ │
▼ ▼ ▼
PV0 PV1 PV2 PV3 PV4 PVCs data-db-3, data-db-4 STILL EXIST
▲ ▲
│ │
Orphaned PVCs ← data preserved but no pod using it
LATER: replicas=5 (scaled back up)
db-0 db-1 db-2 db-3 db-4 db-3, db-4 reattach to existing PVCs
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
PV0 PV1 PV2 PV3 PV4 Data from previous incarnation intact!
This is deliberate. When you scale back up, db-3 and db-4 get their old data back. But it also means you are paying for unused storage until you manually delete the orphaned PVCs (or configure whenScaledDown: Delete).
When to Use StatefulSets vs Deployments
Use a StatefulSet when:
- Each pod needs a stable, unique network identity (databases, consensus systems)
- Each pod needs its own persistent storage volume (data nodes)
- Pod initialization or termination must happen in a defined order
- Peers need to address each other by name (cluster membership protocols)
Use a Deployment when:
- All pods are identical and interchangeable
- Shared storage (or no storage) is sufficient
- Order of creation and deletion does not matter
- You need fast scaling (no sequential constraints)
A common anti-pattern is using StatefulSets for applications that just need persistent storage but do not need stable identity. If your application uses a single shared PVC (ReadWriteMany), a Deployment with a PVC is simpler and more appropriate.
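For that simpler case, the shape is just a Deployment mounting one shared claim. A sketch; the storage class name and image are placeholders, and the class must actually support ReadWriteMany (NFS- or CephFS-backed classes typically do, most block-storage classes do not):

```yaml
# Shared-storage Deployment: every replica mounts the same RWX claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: nfs-client     # placeholder; must support RWX
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web
        image: my-app:v1.2.0       # placeholder image
        volumeMounts:
        - name: shared
          mountPath: /data
      volumes:
      - name: shared
        persistentVolumeClaim:
          claimName: shared-data
```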
Common Mistakes and Misconceptions
- “StatefulSets are just Deployments with persistent storage.” StatefulSets provide ordered startup/shutdown, stable network identities (pod-0, pod-1), and per-replica PVCs. These guarantees come with trade-offs: slower scaling and more complex operations.
- “Deleting a StatefulSet deletes its PVCs.” PVCs are deliberately retained to prevent data loss. You must delete PVCs manually. This is a safety feature, not a bug.
- “I need a StatefulSet for any app that uses a database.” If your app is stateless but connects to an external database, use a Deployment. StatefulSets are for when the pod itself holds state (e.g., the pod IS the database).
Further Reading
- StatefulSet documentation — Official reference
- StatefulSet Basics tutorial — Hands-on walkthrough
- Headless Services — DNS behavior for headless Services
- PVC Retention Policy KEP — Design rationale for PVC retention
Next: Databases on Kubernetes — When to run databases on K8s, operators, and the trade-offs.
Chapter 22: Databases on Kubernetes
“Should we run our database on Kubernetes?” is one of the most debated questions in the Kubernetes community, and the debate persists because the answer is genuinely nuanced. It depends on what database, what workload, what team, and what alternatives exist.
The Great Debate
The argument against databases on Kubernetes is simple: databases are the most important component in most architectures, and Kubernetes was designed for stateless, ephemeral workloads. Pods get rescheduled. Nodes fail. Network partitions happen. Storage has latency. Every one of these events is routine for a web server and potentially catastrophic for a database.
The argument for databases on Kubernetes is equally simple: managed database services are expensive, lock you into a cloud provider, and do not exist on-premises. Kubernetes operators can automate the same operational tasks that managed services handle — failover, backup, replication — and they work everywhere Kubernetes runs.
Both arguments are correct. The question is which trade-offs matter more for your specific situation.
The Honest Assessment
For revenue-critical systems, managed database services remain superior. AWS RDS, Google Cloud SQL, and Azure Database for PostgreSQL have teams of database engineers whose full-time job is handling failover, patching, backup, and recovery. They have years of operational experience encoded into their automation. The cost premium you pay for a managed service is insurance against the operational complexity you would otherwise absorb.
For development, testing, and non-critical workloads, Kubernetes databases are excellent. They provide consistent environments across dev/staging/production, they are easy to spin up and tear down, and they integrate naturally with the rest of your Kubernetes tooling.
For on-premises deployments, Kubernetes operators are often the best option available. When managed services do not exist, the choice is between hand-managing databases on VMs and using an operator that automates the hardest parts. The operator wins in most cases.
Decision Framework
| Tier | Description | Recommendation |
|---|---|---|
| Development / Test | Non-production, disposable data | Kubernetes — fast to create, fast to destroy |
| Tier 2-3 Services | Internal tools, analytics, non-revenue workloads | Kubernetes — acceptable with good operators and monitoring |
| Revenue-Critical | Customer-facing, SLA-bound, data-loss-intolerant | Managed service — unless you have strong database operations expertise |
| On-Premises | No managed services available | Kubernetes operators — best available option |
| Regulatory / Compliance | Data residency, air-gapped environments | Kubernetes operators — often the only option that satisfies constraints |
This is not a permanent ranking. The Kubernetes database ecosystem matures every year. Five years from now, running production databases on Kubernetes may be as routine as running web servers. But today, the operational gap between managed services and operators is real.
The Pets vs Cattle Nuance
The “pets vs cattle” metaphor says that modern infrastructure should treat servers like cattle (interchangeable, disposable) rather than pets (unique, irreplaceable). Kubernetes embodies this philosophy for stateless workloads. But databases are pets. A PostgreSQL primary node has unique state that cannot be recreated from a container image. Its data represents months or years of accumulated state.
StatefulSets are Kubernetes’s acknowledgment that some workloads are pets. The stable identity, ordered operations, and persistent storage guarantees exist specifically because not everything can be cattle. The question is not whether to treat databases as pets — they are pets — but whether Kubernetes provides the right tools for pet care.
Operators are the answer. An operator is a custom controller that encodes domain-specific operational knowledge into software. A PostgreSQL operator knows how to initialize a replica from a base backup, how to promote a replica to primary during failover, how to manage connection pooling, and how to schedule backups. It turns the pet care into automated, repeatable processes.
The Operator Landscape
PostgreSQL
PostgreSQL has the most mature operator ecosystem on Kubernetes.
CloudNativePG — The strongest momentum in 2025. A CNCF Sandbox project with a clean architecture: each PostgreSQL pod runs a lightweight instance manager alongside the database process in the same container (no sidecar), while the operator itself runs as a separate Deployment. Supports automated failover, continuous backup to object storage (S3, GCS, Azure Blob), point-in-time recovery, connection pooling via PgBouncer, and declarative configuration. The project’s velocity and community engagement make it the default choice for new deployments.
Crunchy Data PGO (postgres-operator) — The most battle-tested option. Crunchy Data has been running PostgreSQL on Kubernetes since before it was fashionable. PGO supports pgBackRest for backup (the gold standard for PostgreSQL backup), high availability via Patroni, connection pooling, monitoring integration, and multi-cluster replication. Choose this if you want the operator with the longest production track record.
Zalando postgres-operator — A simpler operator that grew out of Zalando’s internal Kubernetes usage. Good for straightforward PostgreSQL deployments but development velocity has slowed compared to CloudNativePG and PGO. Still a reasonable choice for teams that value simplicity.
MySQL
Percona Operator for MySQL — Supports both Percona XtraDB Cluster (Galera-based synchronous replication) and MySQL group replication. Backup to S3, automated failover, proxy via HAProxy or ProxySQL.
Vitess — Not strictly a MySQL operator but a database clustering system that runs on Kubernetes. Used by Slack, GitHub, and originally developed at YouTube. Vitess is the right choice when you need horizontal sharding of MySQL at massive scale. It is not the right choice for a single PostgreSQL-equivalent deployment.
Other Databases
MongoDB Community Operator — Manages MongoDB replica sets on Kubernetes. The enterprise version from MongoDB Inc. adds Ops Manager integration.
Redis (via Spotahome operator or Redis Enterprise) — Redis Sentinel and Redis Cluster topologies. Redis is simpler to operate than relational databases because it is primarily in-memory, but persistence and replication still require operational care.
Apache Kafka (Strimzi) — The dominant Kafka operator. Strimzi manages Kafka brokers, ZooKeeper (or KRaft), topics, users, and MirrorMaker. Kafka on Kubernetes is now mainstream, partly because Kafka’s distributed architecture maps well to StatefulSet semantics.
What Makes Database Operators Hard
A production database operator must handle:
OPERATOR RECONCILIATION LOOP
──────────────────────────────
┌─────────────────────────────────────────────────────┐
│ Desired State (CR) │
│ PostgresCluster: replicas=3, backup=daily, │
│ version=16, storage=100Gi, pooler=pgbouncer │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Operator Controller │
│ │
│ for each reconciliation loop: │
│ │
│ 1. Check cluster health │
│ ├── Is primary alive? │
│ ├── Are replicas streaming? │
│ └── Is replication lag acceptable? │
│ │
│ 2. Handle topology changes │
│ ├── Scale up: init new replica from backup │
│ ├── Scale down: drain connections, remove │
│ └── Node failure: promote replica to primary │
│ │
│ 3. Manage supporting services │
│ ├── Connection pooler (PgBouncer/ProxySQL) │
│ ├── Backup schedule (base backup + WAL) │
│ ├── Monitoring endpoints (Prometheus) │
│ └── TLS certificates │
│ │
│ 4. Handle version upgrades │
│ ├── Minor: rolling restart │
│ └── Major: pg_upgrade or logical replication │
│ │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Managed Resources │
│ │
│ StatefulSet(primary) StatefulSet(replicas) │
│ Service(read-write) Service(read-only) │
│ PVCs(data) ConfigMaps(postgresql.conf) │
│ Secrets(passwords) CronJob(backup) │
│ Deployment(pooler) ServiceMonitor(metrics) │
└─────────────────────────────────────────────────────┘
Each of these responsibilities is a failure mode:
Leader Election and Failover — When the primary fails, the operator must detect the failure, select the most up-to-date replica, promote it, reconfigure all other replicas to follow the new primary, and update the read-write Service endpoint. This must happen in seconds, without data loss, and without split-brain (two nodes both believing they are primary). Getting this wrong is the single most dangerous failure mode for a database.
Replication — The operator must configure streaming replication (for PostgreSQL) or group replication (for MySQL), monitor replication lag, and handle replicas that fall behind. A replica that loses its replication slot must be rebuilt from a base backup, which can take hours for large databases.
Backup and Recovery — Continuous backup involves both periodic base backups (full snapshots of the data directory) and continuous WAL (write-ahead log) archival. The operator must verify backup integrity, manage backup retention, and support point-in-time recovery to any moment in the past.
Version Upgrades — Minor version upgrades (16.1 to 16.2) are typically rolling restarts. Major version upgrades (15 to 16) require data migration via pg_upgrade or logical replication. Both must be done without extended downtime.
Connection Pooling — Database connections are expensive (each consumes memory and a process/thread). A connection pooler like PgBouncer sits between the application and the database, multiplexing many application connections onto a smaller number of database connections. The operator manages the pooler’s lifecycle and configuration.
Dual Monitoring — You need both Kubernetes-level monitoring (pod health, resource usage, PVC capacity) and database-level monitoring (query latency, lock contention, replication lag, cache hit ratio). These are complementary and both are essential.
The Real Cost of Self-Managing
The real comparison is:
| Cost Factor | Managed Service | Kubernetes Operator |
|---|---|---|
| Compute | Higher (managed premium) | Lower (your nodes) |
| Engineering time | Low (vendor handles operations) | Significant (you handle operations the operator cannot) |
| Failure recovery | Vendor SLA | Your team’s expertise |
| Backup verification | Vendor responsibility | Your responsibility to test restores |
| Major version upgrades | Often push-button | Often manual coordination |
| Compliance auditing | Vendor provides documentation | You provide documentation |
If your team has strong database operations expertise and the time to invest in it, Kubernetes operators are a powerful tool. If your team’s expertise is in application development and they view the database as infrastructure that should just work, a managed service is the better choice.
A Pragmatic Path
Many organizations adopt a layered approach:
- Start with managed services for production databases. Do not optimize costs before you have a working system.
- Use Kubernetes databases for dev/test. This gives your team experience with the operator and ensures dev/test environments match production topology.
- Evaluate migration to Kubernetes for Tier 2-3 workloads after your team has built confidence with the operator in non-production environments.
- Keep revenue-critical databases on managed services unless you have a compelling reason to move them (cost, compliance, on-premises requirement).
This path minimizes risk while building the operational muscle needed to run databases on Kubernetes if and when it makes sense.
Common Mistakes and Misconceptions
- “Never run databases on Kubernetes.” This was good advice in 2018. Modern operators (CloudNativePG, Percona, Vitess) handle replication, failover, backup, and restore. For many teams, K8s-native databases are simpler than managing separate DB infrastructure.
- “Kubernetes storage is too slow for databases.” Cloud SSDs (gp3, pd-ssd) provide consistent IOPS. Local NVMe on dedicated node pools rivals bare-metal performance. The storage layer is rarely the bottleneck.
- “A database operator means zero operational effort.” Operators automate routine tasks but still require monitoring, capacity planning, backup verification, and version upgrade planning. They reduce effort, not eliminate it.
Further Reading
- CloudNativePG documentation — The leading PostgreSQL operator
- Crunchy Data PGO — Battle-tested PostgreSQL operator
- Strimzi documentation — Kafka on Kubernetes
- Data on Kubernetes community — Community focused on stateful workloads
- KubeCon talk: “Is Running Databases on Kubernetes Practical?” — Real-world experience reports
Next: Persistent Storage Patterns — volumeClaimTemplates, reclaim policies, backup, and resize.
Chapter 23: Persistent Storage Patterns
Storage on Kubernetes is where the abstraction meets physical reality. A pod can be rescheduled to any node in seconds, but a 500GB disk cannot teleport. Persistent storage forces you to think about topology, data lifecycle, and failure modes that stateless workloads let you ignore. For a quick storage decision flowchart, see Appendix C: Decision Trees.
volumeClaimTemplates: The Naming Convention
As covered in Chapter 21, StatefulSets use volumeClaimTemplates to create per-pod PVCs. The naming convention is deterministic:
<template-name>-<statefulset-name>-<ordinal>
For a StatefulSet named db with a template named data:
data-db-0
data-db-1
data-db-2
This naming convention is not arbitrary. It is the mechanism by which Kubernetes reconnects pods to their storage after rescheduling. When db-1 is deleted and recreated (during an update, a node failure, or a manual restart), the new db-1 pod finds the PVC data-db-1 by name and reattaches to the same underlying volume. No operator intervention required.
If your StatefulSet has multiple volume templates (for example, separate volumes for data and write-ahead logs), each template produces its own set of PVCs:
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
- metadata:
name: wal
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 20Gi
This creates: data-db-0, data-db-1, data-db-2, wal-db-0, wal-db-1, wal-db-2. Six PVCs for three pods, each with its own underlying PersistentVolume.
Reclaim Policies: Retain vs Delete
When a PersistentVolumeClaim is deleted, the underlying PersistentVolume’s reclaimPolicy determines what happens to the actual storage:
| Policy | What Happens | When to Use |
|---|---|---|
| Retain | PV is preserved. Data remains. PV enters Released state and must be manually reclaimed. | Production databases. Any workload where accidental data loss is unacceptable. |
| Delete | PV and underlying storage (EBS volume, GCE PD, etc.) are deleted. | Development environments. Workloads where data can be recreated. |
| Recycle | Deprecated. Was rm -rf /thevolume/* followed by making PV available again. | Never. Use Delete instead. |
The default reclaim policy for dynamically provisioned PVs is Delete in most StorageClasses. This is dangerous for production workloads. If someone accidentally deletes a PVC, the underlying data is gone.
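A quick way to audit your exposure is to list every StorageClass with its reclaim policy and binding mode (a read-only check; the column paths follow the `storage.k8s.io/v1` schema):

```
# List each StorageClass with its reclaim policy and volume binding mode.
kubectl get storageclass \
  -o custom-columns=NAME:.metadata.name,RECLAIM:.reclaimPolicy,BINDING:.volumeBindingMode
```

Any class showing `Delete` that backs production PVCs deserves a second look.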
For production, always set the reclaim policy to Retain. You can do this in the StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: gp3-retain
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain # PVs are preserved when PVCs are deleted
volumeBindingMode: WaitForFirstConsumer
parameters:
type: gp3
iops: "3000"
throughput: "125"
The consequence of Retain is that you must manually clean up PVs when you are done with them. This is a feature, not a bug. Explicit deletion of persistent data should require a human decision.
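When you do eventually reuse or discard a retained volume, the PV sits in `Released` state with a stale `claimRef` pointing at the deleted PVC. One common way to make it `Available` for re-binding is to clear that reference (a sketch; the PV name is illustrative, and you should confirm the data is genuinely no longer needed first):

```
# Find Released PVs, then clear the stale claimRef so the PV can be re-bound.
kubectl get pv                                  # look for STATUS: Released
kubectl patch pv pvc-1234-example --type merge \
  -p '{"spec":{"claimRef":null}}'
```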
WaitForFirstConsumer: Topology Awareness
In a multi-zone cluster, where a PV is provisioned matters. An EBS volume in us-east-1a cannot be attached to a node in us-east-1b. If the volume is provisioned before the pod is scheduled, and the pod lands on a node in a different zone, the pod will be stuck in Pending forever.
WaitForFirstConsumer solves this by deferring volume provisioning until a pod actually needs it:
VOLUME BINDING MODES
─────────────────────
Immediate (default for some StorageClasses):
1. PVC created → PV provisioned in zone-a
2. Pod scheduled to zone-b → STUCK: PV is in zone-a, pod is in zone-b
WaitForFirstConsumer:
1. PVC created → PV provisioning deferred
2. Pod scheduled to zone-b → PV provisioned in zone-b (same zone as pod)
3. Pod binds to PV → SUCCESS: everything in the same zone
Always use WaitForFirstConsumer for cloud storage in multi-zone clusters. It is the only safe choice.
There is a subtle interaction with StatefulSets: once a PVC is bound to a PV in a specific zone, any future incarnation of that pod is constrained to that zone. If data-db-0 is provisioned in us-east-1a, then db-0 will always be scheduled to us-east-1a (assuming the PVC still exists). This is usually desirable for databases but means that zone failures affect specific StatefulSet members predictably.
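You can observe this zone pinning directly on the PV: CSI-provisioned volumes carry a node affinity for their zone. A read-only inspection sketch (the exact topology key depends on your CSI driver; for the EBS CSI driver it is typically `topology.kubernetes.io/zone`):

```
# Print each PV alongside the zone it is allowed to attach in.
kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}{end}'
```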
Storage Resize
Most CSI drivers support volume expansion. The StorageClass must allow it:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: gp3-expandable
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true # Enables resize
To resize, edit the PVC’s spec.resources.requests.storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: data-db-0
spec:
resources:
requests:
storage: 200Gi # Was 100Gi, now requesting 200Gi
The resize process has two phases:
- Controller expansion: The CSI driver expands the underlying volume (e.g., modifies the EBS volume size). This happens automatically.
- Node expansion: The filesystem on the volume is expanded to use the new space. This happens when the pod using the volume restarts (for offline expansion) or live (for online expansion, supported by most modern CSI drivers).
Important constraints:
- Volumes can only grow, never shrink. There is no way to reduce a PVC’s size.
- EBS volumes have a cooldown period. After modifying an EBS volume, you must wait 6 hours before modifying it again.
- Some filesystems require a pod restart for the resize to take effect.
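In practice the edit is a one-line patch, and you can watch both phases in the PVC's status (a sketch; the PVC name matches the earlier example):

```
# Request the expansion (the controller phase starts automatically).
kubectl patch pvc data-db-0 --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Watch progress: a condition such as FileSystemResizePending means the
# node-expansion phase is still waiting on the pod/filesystem.
kubectl get pvc data-db-0 -o jsonpath='{.status.conditions}'
kubectl describe pvc data-db-0   # events show resize progress and errors
```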
The PVC Lifecycle on Scale-Down
When you scale down a StatefulSet, the pods are deleted but the PVCs are not:
PVC LIFECYCLE ON SCALE-DOWN
─────────────────────────────
Step 1: Running at replicas=5
┌──────────────────────────────────────────────────────┐
│ Pod │ PVC │ PV │ Status │
├─────────┼───────────────┼─────────┼──────────────────┤
│ db-0 │ data-db-0 │ pv-a │ Bound │
│ db-1 │ data-db-1 │ pv-b │ Bound │
│ db-2 │ data-db-2 │ pv-c │ Bound │
│ db-3 │ data-db-3 │ pv-d │ Bound │
│ db-4 │ data-db-4 │ pv-e │ Bound │
└──────────────────────────────────────────────────────┘
Step 2: Scale to replicas=3 (kubectl scale sts db --replicas=3)
┌──────────────────────────────────────────────────────┐
│ Pod │ PVC │ PV │ Status │
├─────────┼───────────────┼─────────┼──────────────────┤
│ db-0 │ data-db-0 │ pv-a │ Bound │
│ db-1 │ data-db-1 │ pv-b │ Bound │
│ db-2 │ data-db-2 │ pv-c │ Bound │
│ --- │ data-db-3 │ pv-d │ Bound (no pod!) │
│ --- │ data-db-4 │ pv-e │ Bound (no pod!) │
└──────────────────────────────────────────────────────┘
data-db-3 and data-db-4 still exist.
You are still paying for pv-d and pv-e.
The data in pv-d and pv-e is preserved.
Step 3: Scale back to replicas=5
┌──────────────────────────────────────────────────────┐
│ Pod │ PVC │ PV │ Status │
├─────────┼───────────────┼─────────┼──────────────────┤
│ db-0 │ data-db-0 │ pv-a │ Bound │
│ db-1 │ data-db-1 │ pv-b │ Bound │
│ db-2 │ data-db-2 │ pv-c │ Bound │
│ db-3 │ data-db-3 │ pv-d │ Bound (data!) │
│ db-4 │ data-db-4 │ pv-e │ Bound (data!) │
└──────────────────────────────────────────────────────┘
db-3 and db-4 reattach to their old PVCs.
All previous data is intact.
Operational implications:
- Cost: Orphaned PVCs consume storage and incur charges. Monitor with `kubectl get pvc` and cloud billing tools.
- Stale data: If you scale down, modify the application, and scale back up, the reattached pods may have stale data that does not match the current application state.
- Cleanup: If you genuinely want to discard the data, you must manually delete the orphaned PVCs: `kubectl delete pvc data-db-3 data-db-4`.
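If you want Kubernetes to handle this cleanup automatically, StatefulSets support a PVC retention policy (`persistentVolumeClaimRetentionPolicy`, stable in recent Kubernetes releases). A sketch that keeps data if the StatefulSet is deleted but discards it on scale-down — choose deliberately, since `Delete` removes the PVC, and with a `Delete` reclaim policy the underlying data as well:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain     # keep PVCs if the StatefulSet object is deleted
    whenScaledDown: Delete  # remove PVCs for pods removed by scale-down
  # ... serviceName, replicas, selector, template, volumeClaimTemplates ...
```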
Backup Strategies: A Layered Approach
No single backup mechanism is sufficient for production data. Each approach has blind spots, and a robust strategy layers multiple approaches to cover each other’s weaknesses:
BACKUP STRATEGY LAYERS
────────────────────────
Layer 3: Application-Native Backup ← Highest fidelity
│ pg_basebackup + WAL archival
│ mongodump / mysqldump
│ Application-consistent snapshots
│ Understands transactions, replication state
│
Layer 2: Velero (Kubernetes-aware backup) ← Kubernetes context
│ Backs up K8s resources (StatefulSets, Services, ConfigMaps)
│ Can trigger pre/post-backup hooks (e.g., pg_start_backup)
│ Backs up PV data via snapshots or Restic/Kopia
│ Restores entire namespaces with all resources
│
Layer 1: Volume Snapshots (CSI) ← Fastest recovery
│ Point-in-time snapshot of the block device
│ Fast: typically copy-on-write, completes in seconds
│ Can clone volumes from snapshots
│ WARNING: crash-consistent, NOT application-consistent
│
▼
Storage Layer (EBS, GCE PD, Ceph, etc.)
Why Each Layer Matters
Volume snapshots are fast but dangerous in isolation. A snapshot captures the block device at a point in time, like pulling the power cord on a running database. The filesystem will be consistent (journaling handles that), but the database may have in-flight transactions that are partially written. The snapshot is crash-consistent but not application-consistent. Restoring from a snapshot alone may require crash recovery, and some data may be lost.
Velero adds Kubernetes context. It backs up not just the data but the Kubernetes resources that define how the data is used — the StatefulSet, the Service, the ConfigMaps, the Secrets. Velero can also run pre-backup hooks (like pg_start_backup or FLUSH TABLES WITH READ LOCK) that put the database into a consistent state before snapshotting.
Application-native backup is the gold standard. PostgreSQL’s continuous archival (base backup + WAL shipping) provides point-in-time recovery to any second in the past. This is the only backup method that guarantees zero data loss for committed transactions.
Volume Snapshots in Practice
# Create a VolumeSnapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: db-snapshot-20260403
spec:
volumeSnapshotClassName: csi-aws-ebs
source:
persistentVolumeClaimName: data-db-0
---
# Create a new PVC from a snapshot (cloning)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: data-db-0-restored
spec:
storageClassName: gp3-retain
dataSource:
name: db-snapshot-20260403
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
Volume cloning from snapshots is invaluable for creating test environments from production data. Snapshot a production PVC, create a new PVC from the snapshot, and attach it to a test StatefulSet. The clone is independent of the original — modifications to one do not affect the other.
Velero Configuration
# Install Velero with AWS provider
velero install \
--provider aws \
--bucket my-velero-bucket \
--secret-file ./credentials-velero \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--plugins velero/velero-plugin-for-aws:v1.9.0
# Schedule daily backups with 30-day retention
velero schedule create daily-db-backup \
--schedule="0 2 * * *" \
--include-namespaces database \
--ttl 720h
# Restore a namespace from backup
velero restore create --from-backup daily-db-backup-20260403020000
Velero’s pre-backup hooks let you ensure application consistency:
metadata:
annotations:
pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -c \"SELECT pg_backup_start(''velero'')\""]'
pre.hook.backup.velero.io/container: postgres
post.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -c \"SELECT pg_backup_stop()\""]'
post.hook.backup.velero.io/container: postgres
The Backup Rule
Test your restores. A backup that has never been restored is a hypothesis, not a guarantee. Schedule regular restore tests to a separate namespace and verify data integrity. The time to discover that your backup process is broken is not during an incident.
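With Velero, a restore test can target a scratch namespace so it never touches production objects (a sketch; the backup and restore names are illustrative, and `--namespace-mappings` remaps the source namespace on restore):

```
# Restore the latest nightly backup into a throwaway namespace.
velero restore create restore-test-20260403 \
  --from-backup daily-db-backup-20260403020000 \
  --namespace-mappings database:restore-test

velero restore describe restore-test-20260403   # check phase and warnings
# Then run application-level checks (row counts, checksums) against restore-test.
```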
Putting It All Together
A production storage configuration for a StatefulSet database combines everything in this chapter:
- StorageClass: `reclaimPolicy: Retain`, `volumeBindingMode: WaitForFirstConsumer`, `allowVolumeExpansion: true`
- volumeClaimTemplates: Separate templates for data and WAL if the database benefits from it
- PVC retention policy: `Retain` for both `whenDeleted` and `whenScaledDown`
- Backup: Application-native continuous backup (WAL archival) + Velero scheduled backups + periodic volume snapshots
- Monitoring: Alert on PVC usage approaching capacity, orphaned PVCs after scale-down, backup job failures
Storage is the foundation that stateful workloads rest on. Get it right and your databases can survive node failures, zone outages, and operational mistakes. Get it wrong and you will learn why the operations community repeats: “backups are worthless; restores are priceless.”
Common Mistakes and Misconceptions
- “All PersistentVolumes are the same.” RWO (ReadWriteOnce) can only be mounted by one node. RWX (ReadWriteMany) works across nodes but requires NFS or cloud file systems (EFS, Filestore). Choosing the wrong access mode causes mount failures. Note: RWO allows multiple pods on the same node to mount the volume simultaneously. For databases that require exclusive single-pod access, use `ReadWriteOncePod` (RWOP), which restricts the volume to exactly one pod. RWOP has been GA since Kubernetes 1.29.
- “Storage classes are just about disk type.” Storage classes also control reclaim policy (Delete vs Retain), volume binding mode (Immediate vs WaitForFirstConsumer), and the provisioner. WaitForFirstConsumer is critical for zone-aware scheduling.
- “I can resize PVCs freely.” Volume expansion must be enabled on the storage class (`allowVolumeExpansion: true`). Not all provisioners support it. Shrinking is never supported — plan initial sizes carefully.
Further Reading
- Kubernetes Persistent Volumes — Official PV/PVC reference
- Volume Snapshots — CSI snapshot documentation
- Velero documentation — Kubernetes backup and restore
- CSI specification — The standard that all storage drivers implement
- Kubernetes Storage Best Practices (GKE) — Cloud-specific guidance
Next: Jobs and CronJobs — Batch processing, indexed completions, and scheduling patterns.
Chapter 24: Jobs and CronJobs
Not every workload is a long-running service. Some workloads run to completion: a database migration, a batch data transformation, an ML training run, a nightly report. Deployments and StatefulSets are the wrong abstraction for these workloads because they try to keep pods running forever. Jobs and CronJobs are Kubernetes’s answer to batch and scheduled workloads. A Job creates one or more pods, runs them to completion, and then stops. A CronJob creates Jobs on a schedule. The concepts are simple, but the details — completion modes, parallelism, failure handling, and concurrency policies — matter enormously for production reliability.
Jobs: Run to Completion
A Job ensures that a specified number of pods successfully terminate. The simplest Job runs a single pod:
apiVersion: batch/v1
kind: Job
metadata:
name: db-migration
spec:
template:
spec:
containers:
- name: migrate
image: my-app:v2.1.0
command: ["./migrate", "--target", "v2.1"]
restartPolicy: Never # Jobs require Never or OnFailure
backoffLimit: 3 # Retry up to 3 times on failure
activeDeadlineSeconds: 600 # Kill the Job if it runs longer than 10 minutes
ttlSecondsAfterFinished: 3600 # Clean up completed Job after 1 hour
Key fields:
- `restartPolicy`: Must be `Never` or `OnFailure`. Jobs cannot use the default `Always` because that would restart the pod after successful completion.
- `backoffLimit`: How many times to retry before marking the Job as failed. Each retry uses exponential backoff (10s, 20s, 40s, …).
- `activeDeadlineSeconds`: A hard timeout for the entire Job. If the Job has not completed within this duration, all running pods are terminated and the Job is marked as failed.
- `ttlSecondsAfterFinished`: How long to keep the completed (or failed) Job object before garbage collection. Without this, completed Jobs accumulate forever.
Completion Modes
Jobs support two completion modes that determine how “done” is defined:
NonIndexed (Default)
The Job is complete when .spec.completions pods have succeeded. Each pod is interchangeable — they all run the same work.
spec:
completions: 5 # 5 pods must succeed
parallelism: 3 # Run up to 3 pods at a time
completionMode: NonIndexed
This creates 5 pods (3 at a time), each running the same task. If any pod fails, a replacement is created (up to backoffLimit). When 5 pods have exited with status 0, the Job is complete.
Indexed
Each pod gets a unique index (0 through completions-1) via the JOB_COMPLETION_INDEX environment variable. This enables work partitioning: each pod processes a different shard of data.
spec:
completions: 10 # 10 indexed pods (0-9)
parallelism: 5 # Run up to 5 pods at a time
completionMode: Indexed
Each pod knows its identity: pod with index 3 reads JOB_COMPLETION_INDEX=3 from its environment and processes the corresponding data partition. The Job is complete when each index (0 through 9) has one successful pod.
Indexed Jobs are the Kubernetes-native way to implement MapReduce-style parallelism. Instead of a single pod processing a 1TB dataset, ten pods each process 100GB.
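A worker's entrypoint can turn the index into a shard assignment. A minimal sketch (`TOTAL_PARTITIONS` is an assumption — an ordinary env var you would set in the Job spec, not something Kubernetes injects):

```shell
# Sketch of an Indexed Job worker: derive this pod's shard from its index.
# JOB_COMPLETION_INDEX is injected by Kubernetes for completionMode: Indexed.
INDEX="${JOB_COMPLETION_INDEX:-0}"   # defaults to 0 when run outside a Job
TOTAL="${TOTAL_PARTITIONS:-10}"      # hypothetical: total shard count
SHARD="partition-${INDEX}-of-${TOTAL}"
echo "processing ${SHARD}"
# ...fetch and process only the rows/files belonging to ${SHARD}...
```

Because the index is stable across retries, a replacement pod for index 3 processes the same partition its failed predecessor was assigned.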
Parallelism Patterns
The interaction between completions and parallelism produces different execution patterns:
JOB PARALLELISM PATTERNS
──────────────────────────
Pattern 1: Single Pod (default)
completions=1, parallelism=1
Time ──────────────────────►
┌──────────────────┐
│ Pod 0 │ ✓ Done
└──────────────────┘
Pattern 2: Fixed Completion Count
completions=5, parallelism=2
Time ──────────────────────────────────────►
┌──────────────┐
│ Pod 0 │ ✓
└──────────────┘
┌──────────────┐
│ Pod 1 │ ✓
└──────────────┘
┌──────────────┐
│ Pod 2 │ ✓
└──────────────┘
┌──────────────┐
│ Pod 3 │ ✓
└──────────────┘
┌──────────────┐
│ Pod 4 │ ✓
└──────────────┘
2 pods run at a time. 5 must succeed total.
Pattern 3: Work Queue (external coordination)
completions=unset, parallelism=5
Time ──────────────────────────────────────►
┌──────────────────────────────────────┐
│ Pod 0 (processes items from queue) │ ✓
└──────────────────────────────────────┘
┌────────────────────────────────┐
│ Pod 1 │ ✓
└────────────────────────────────┘
┌────────────────────────────────────────────┐
│ Pod 2 │ ✓
└────────────────────────────────────────────┘
┌──────────────────────────────────┐
│ Pod 3 │ ✓
└──────────────────────────────────┘
┌──────────────────────┐
│ Pod 4 │ ✓
└──────────────────────┘
All 5 pods run simultaneously.
Each pulls work from an external queue (SQS, Redis, RabbitMQ).
When a pod exits successfully, it is not restarted.
Job completes when at least one pod terminates successfully
and all other pods have also terminated.
Pattern 4: Indexed Parallel
completions=4, parallelism=4, completionMode=Indexed
Time ──────────────────────────────────────►
┌──────────────────────┐
│ Pod 0 (index=0) │ ✓ processes partition 0
└──────────────────────┘
┌────────────────────────────┐
│ Pod 1 (index=1) │ ✓ processes partition 1
└────────────────────────────┘
┌──────────────────┐
│ Pod 2 (index=2) │ ✓ processes partition 2
└──────────────────┘
┌──────────────────────────────────┐
│ Pod 3 (index=3) │ ✓ processes partition 3
└──────────────────────────────────┘
Each pod gets JOB_COMPLETION_INDEX env var.
Each processes its assigned data shard.
| Pattern | completions | parallelism | Use Case |
|---|---|---|---|
| Single pod | 1 (default) | 1 (default) | Database migration, one-off script |
| Fixed count | N | M (M <= N) | Batch processing with known work items |
| Work queue | unset | N | Queue-driven processing (SQS, RabbitMQ) |
| Indexed | N | M | Data partitioning, parallel map operations |
Failure Handling
backoffLimit
When a pod fails (exits with non-zero status or is evicted), Kubernetes retries with exponential backoff. The delay between retries starts at 10 seconds and doubles each time (10s, 20s, 40s, …), capped at 6 minutes.
spec:
backoffLimit: 6 # Allow up to 6 failures before giving up
After backoffLimit failures, the Job is marked as Failed. The default is 6.
activeDeadlineSeconds
A safety net for Jobs that might hang. If the Job has not completed after this many seconds, all pods are killed and the Job fails:
spec:
activeDeadlineSeconds: 3600 # Hard timeout: 1 hour
This is essential for production Jobs. Without it, a hung Job consumes resources indefinitely. Always set this to a value comfortably above the expected runtime.
Pod Failure Policy
Introduced as alpha in v1.25 (stable in v1.31), Pod Failure Policy gives fine-grained control over how specific failure types are handled. Instead of treating all failures the same, you can define rules:
spec:
podFailurePolicy:
rules:
- action: FailJob # Immediately fail the entire Job
onExitCodes:
containerName: migrate
operator: In
values: [42] # Exit code 42 = unrecoverable error
- action: Ignore # Do not count toward backoffLimit
onPodConditions:
- type: DisruptionTarget # Node drain, preemption, etc.
- action: Count # Default: count toward backoffLimit
onExitCodes:
containerName: migrate
operator: In
values: [1] # Exit code 1 = transient, worth retrying
This is powerful for distinguishing between transient failures (network timeout, node eviction) and permanent failures (invalid input, schema mismatch). Without Pod Failure Policy, a pod that fails due to node preemption counts toward backoffLimit the same as a pod that fails due to a bug — which is wasteful because the preempted pod should just be retried without penalty.
ttlSecondsAfterFinished
Completed Jobs (both successful and failed) remain in the cluster until garbage collected. Without ttlSecondsAfterFinished, they stay forever, cluttering kubectl get jobs output and consuming API server resources:
spec:
ttlSecondsAfterFinished: 86400 # Remove 24 hours after completion
Set this on every Job. The appropriate TTL depends on how long you need the Job (and its pod logs) for debugging.
CronJobs: Scheduled Execution
A CronJob creates Jobs on a schedule. The scheduling uses standard cron syntax:
apiVersion: batch/v1
kind: CronJob
metadata:
name: nightly-backup
spec:
schedule: "0 2 * * *" # 2:00 AM every day
timeZone: "America/New_York" # Stable since v1.27
concurrencyPolicy: Forbid # Do not start a new Job if the previous is still running
startingDeadlineSeconds: 300 # If missed by more than 5 minutes, skip this run
successfulJobsHistoryLimit: 3 # Keep last 3 successful Jobs
failedJobsHistoryLimit: 5 # Keep last 5 failed Jobs
jobTemplate:
spec:
activeDeadlineSeconds: 7200 # Each Job has a 2-hour timeout
backoffLimit: 2
template:
spec:
containers:
- name: backup
image: backup-tool:v1.3
command: ["./backup.sh"]
restartPolicy: OnFailure
Cron Syntax
┌───────────── minute (0-59)
│ ┌───────────── hour (0-23)
│ │ ┌───────────── day of month (1-31)
│ │ │ ┌───────────── month (1-12)
│ │ │ │ ┌───────────── day of week (0-6, Sunday=0)
│ │ │ │ │
* * * * *
Examples:
- `0 2 * * *` — Every day at 2:00 AM
- `*/15 * * * *` — Every 15 minutes
- `0 0 1 * *` — First day of every month at midnight
- `0 9 * * 1-5` — Weekdays at 9:00 AM
timeZone
Before v1.25, CronJobs used the kube-controller-manager’s local timezone, which was usually UTC but not always. The timeZone field (stable since v1.27) lets you specify the timezone explicitly. “Every day at 2 AM” is meaningless without a timezone.
concurrencyPolicy
What happens when it is time to start a new Job but the previous one is still running?
| Policy | Behavior | When to Use |
|---|---|---|
| Allow | Start the new Job alongside the running one | Independent jobs where overlap is safe |
| Forbid | Skip the new Job if the previous is still running | Backups, database maintenance, anything that should not overlap |
| Replace | Kill the running Job and start a new one | Long-running jobs where the latest run supersedes previous runs |
Forbid is the safest default for most production CronJobs. Two concurrent backup jobs competing for the same database locks is a recipe for failures.
startingDeadlineSeconds
If the CronJob controller misses a scheduled run (for example, because the controller was down or the cluster was overloaded), startingDeadlineSeconds controls how long after the scheduled time Kubernetes will still attempt to start the Job:
spec:
startingDeadlineSeconds: 300 # If missed by more than 5 minutes, skip
Without this, Kubernetes counts missed schedules and may try to start all of them at once when the controller recovers. If more than 100 schedules were missed, the CronJob is marked as unable to be scheduled. Setting startingDeadlineSeconds provides a clean cutoff.
History Limits
spec:
successfulJobsHistoryLimit: 3 # Keep 3 successful completed Jobs
failedJobsHistoryLimit: 5 # Keep 5 failed Jobs (more for debugging)
These control how many completed Job objects are retained. Keep enough for debugging (especially for failed Jobs) but not so many that they clutter the cluster.
Real-World Use Cases
Data Pipeline Stage
apiVersion: batch/v1
kind: Job
metadata:
name: etl-daily-20260403
spec:
completions: 10
parallelism: 5
completionMode: Indexed
backoffLimit: 3
activeDeadlineSeconds: 14400 # 4 hours max
ttlSecondsAfterFinished: 86400
template:
spec:
containers:
- name: etl
image: data-pipeline:v3.0
command: ["./process_partition.sh"]
env:
- name: TOTAL_PARTITIONS
value: "10"
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi
restartPolicy: Never
Each of the 10 indexed pods processes one partition of the daily data. Five run in parallel. The JOB_COMPLETION_INDEX environment variable tells each pod which partition to process.
Database Backup CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
name: pg-backup
spec:
schedule: "0 3 * * *"
timeZone: "UTC"
concurrencyPolicy: Forbid
startingDeadlineSeconds: 600
successfulJobsHistoryLimit: 7
failedJobsHistoryLimit: 10
jobTemplate:
spec:
activeDeadlineSeconds: 3600
backoffLimit: 2
ttlSecondsAfterFinished: 604800 # Keep for 7 days
template:
spec:
containers:
- name: backup
image: postgres:16.2
command:
- /bin/bash
- -c
- |
pg_dump -h db-0.db.default.svc.cluster.local \
-U backup_user -Fc mydb | \
aws s3 cp - s3://my-backups/pg/$(date +%Y%m%d).dump
envFrom:
- secretRef:
name: pg-backup-credentials
restartPolicy: OnFailure
ML Training Job
apiVersion: batch/v1
kind: Job
metadata:
name: training-run-042
spec:
completions: 1
parallelism: 1
backoffLimit: 1 # Do not retry expensive training
activeDeadlineSeconds: 86400 # 24-hour timeout
ttlSecondsAfterFinished: 604800 # Keep for a week (to check logs)
template:
spec:
containers:
- name: train
image: ml-training:v2.1
command: ["python", "train.py", "--epochs", "100"]
resources:
requests:
cpu: "8"
memory: 32Gi
nvidia.com/gpu: "1"
limits:
cpu: "8"
memory: 32Gi
nvidia.com/gpu: "1"
volumeMounts:
- name: model-output
mountPath: /output
volumes:
- name: model-output
persistentVolumeClaim:
claimName: training-output
restartPolicy: Never
Jobs vs Other Workload Types
| Question | Answer |
|---|---|
| Should it run forever? | Use Deployment or StatefulSet |
| Should it run once and stop? | Use Job |
| Should it run on a schedule? | Use CronJob |
| Should it run on every node? | Use DaemonSet |
| Does it need stable identity? | Use StatefulSet |
| Does it need parallel indexed processing? | Use Job with completionMode: Indexed |
Common Mistakes and Misconceptions
- “CronJobs are reliable for exactly-once execution.” CronJobs can create 0 or 2+ Jobs for a single schedule point (missed schedules, clock skew). Use `concurrencyPolicy: Forbid` and design jobs to be idempotent.
- “Failed Jobs retry forever.” Jobs respect `backoffLimit` (default 6). After that many failures, the Job is marked Failed. Set `activeDeadlineSeconds` to prevent runaway jobs consuming resources.
- “Jobs clean up after themselves.” Completed and Failed Jobs (and their pods) persist in the API until you or a TTL controller deletes them. Set `ttlSecondsAfterFinished` to auto-clean, or they accumulate and clutter `kubectl get pods`.
Further Reading
- Jobs documentation — Official Job reference
- CronJob documentation — Official CronJob reference
- Pod Failure Policy — Fine-grained failure handling
- Indexed Job for Parallel Processing — Tutorial on indexed Jobs
- Crontab Guru — Interactive cron expression editor
This concludes Part 4: Stateful Workloads. You now know how to run applications that need stable identity, persistent storage, and batch processing semantics. Part 5 turns to the question that becomes urgent once you are running real workloads: how do you secure them?
Next: RBAC from First Principles
Chapter 25: RBAC from First Principles
Kubernetes has no firewall between “can deploy an app” and “can delete the entire cluster.” Without access control, every user and every workload operates with full administrative privileges. Role-Based Access Control (RBAC) is the mechanism that prevents a junior developer’s typo from becoming a production incident and ensures that a compromised pod cannot read every secret in the cluster.
RBAC answers a single question: who can do what to which resources? Understanding it from first principles requires understanding the four objects that encode that answer, the subjects they reference, and the design patterns that make multi-tenant clusters safe.
The Authorization Model
In Chapter 3, we described the API server as the gateway that every request must pass through. The API server authenticates every request, then authorizes it — RBAC is the authorization module used by virtually every production cluster.
Every request to the Kubernetes API server carries three pieces of information relevant to authorization:
- Subject — who is making the request (user, group, or service account)
- Verb — what action is being attempted (get, list, create, update, delete, watch, patch)
- Resource — what is being acted upon (pods, services, secrets, configmaps, etc.)
RBAC evaluates these against a set of rules. If any rule grants the requested action, the request is allowed; if no rule matches, it is denied. RBAC is additive-only: there is no way to write a “deny” rule. Rules can only grant access; no rule can subtract a permission granted elsewhere. A subject with no matching grants is denied by default.
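You can probe this evaluation directly from the CLI. `kubectl auth can-i` asks the authorization layer the same question the API server asks on every request (the namespace and subject names below are illustrative):

```shell
# Ask whether the current identity may perform an action
kubectl auth can-i get pods -n production
kubectl auth can-i delete nodes   # cluster-scoped resource

# Impersonate another subject to audit their effective permissions
kubectl auth can-i list secrets -n production --as=alice --as-group=developers

# Dump everything a subject can do in a namespace
kubectl auth can-i --list -n production --as=alice
```

Each command prints `yes` or `no` (or, with `--list`, a table of allowed verbs and resources), which makes it useful both for debugging and for scripted audits.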
flowchart TD
req["kubectl get pods -n production"]
authn["API Server<br>(Authentication)"]
extract["Who: user:alice<br>What: verb:get resource:pods<br>Where: namespace:production"]
authz["API Server<br>(Authorization)"]
scan["Scan RoleBindings<br>in 'production' namespace"]
binding["RoleBinding 'dev-access'<br>subjects: group:developers<br>roleRef: role:pod-reader"]
role["Role 'pod-reader'<br>resources: pods<br>verbs: get, list, watch"]
checkUser{"alice in<br>group:developers?"}
checkVerb{"verb:get on<br>pods allowed?"}
allow["RESULT: ALLOW"]
deny["RESULT: DENY"]
req --> authn
authn --> extract
extract --> authz
authz --> scan
scan --> binding
binding --> role
role --> checkUser
checkUser -- YES --> checkVerb
checkUser -- NO --> deny
checkVerb -- YES --> allow
checkVerb -- NO --> deny
style allow fill:#d4edda,stroke:#333
style deny fill:#f8d7da,stroke:#333
The Four RBAC Objects
RBAC uses exactly four object types. Two define permissions, two bind permissions to subjects.
flowchart TD
subgraph Permissions
Role["Role<br>(namespaced)"]
CR["ClusterRole<br>(cluster-wide)"]
end
subgraph Bindings
RB["RoleBinding<br>(namespaced)"]
CRB["ClusterRoleBinding<br>(cluster-wide)"]
end
subgraph Subjects
S["User / Group /<br>ServiceAccount"]
end
RB -- "roleRef" --> Role
RB -- "roleRef" --> CR
CRB -- "roleRef" --> CR
RB -- "subjects" --> S
CRB -- "subjects" --> S
style Bindings fill:#fff0e0,stroke:#333
style Subjects fill:#e0ffe0,stroke:#333
Key insight: A RoleBinding can reference a ClusterRole. This grants the ClusterRole’s permissions only within the RoleBinding’s namespace. This is the most common pattern for multi-tenant clusters.
Role
A Role defines permissions within a single namespace. It lists which API resources can be accessed and which verbs are allowed.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: pod-reader
namespace: production
rules:
- apiGroups: [""] # "" = core API group
resources: ["pods"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["pods/log"] # subresource
verbs: ["get"]
ClusterRole
A ClusterRole defines permissions cluster-wide. It can also grant access to cluster-scoped resources (nodes, namespaces, persistentvolumes) that have no namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: node-reader
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["namespaces"]
verbs: ["get", "list"]
RoleBinding
A RoleBinding grants permissions defined in a Role (or ClusterRole) to a set of subjects within a specific namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: dev-pod-access
namespace: production
subjects:
- kind: Group
name: developers
apiGroup: rbac.authorization.k8s.io
- kind: ServiceAccount
name: ci-deployer
namespace: ci-system
roleRef:
kind: Role
name: pod-reader
apiGroup: rbac.authorization.k8s.io
ClusterRoleBinding
A ClusterRoleBinding grants permissions cluster-wide: they apply in every namespace and to cluster-scoped resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: cluster-admins
subjects:
- kind: Group
name: platform-team
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: ClusterRole
name: cluster-admin
apiGroup: rbac.authorization.k8s.io
Subjects: Who Can Be Granted Access
RBAC recognizes three kinds of subjects:
User — An external identity authenticated by the API server. Kubernetes has no User object; users are established through client certificates, bearer tokens, or an external identity provider. The username is a string extracted during authentication.
Group — A set of users. Groups are also strings extracted during authentication. The identity provider (OIDC, certificates) determines group membership. Key built-in groups: system:authenticated (all authenticated users), system:unauthenticated (anonymous requests), system:masters (unconditional full access).
ServiceAccount — A namespaced Kubernetes object representing a workload’s identity. Unlike users and groups, ServiceAccounts are managed through the API. Every pod runs as a ServiceAccount; if none is specified, it runs as the default ServiceAccount in its namespace.
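Because the default ServiceAccount tends to accumulate permissions over time, the safer pattern is one ServiceAccount per workload. A minimal sketch (names are illustrative):

```yaml
# Dedicated identity for one workload, rather than the namespace "default" SA
apiVersion: v1
kind: ServiceAccount
metadata:
  name: billing-worker
  namespace: production
---
apiVersion: v1
kind: Pod
metadata:
  name: billing
  namespace: production
spec:
  serviceAccountName: billing-worker  # this pod authenticates as billing-worker
  containers:
  - name: app
    image: billing:1.0
```

RoleBindings can then grant `billing-worker` exactly the API access it needs and nothing more.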
Default ClusterRoles
Kubernetes ships with a set of default ClusterRoles designed for common access patterns. These are the building blocks for most RBAC configurations:
| ClusterRole | Scope | Permissions |
|---|---|---|
| cluster-admin | Cluster-wide | Everything. Full access to all resources in all namespaces. Equivalent to root. |
| admin | Namespace (via RoleBinding) | Full access within a namespace: create/update/delete Roles, RoleBindings, all workloads, secrets, configmaps. Cannot modify namespace quotas or the namespace itself. |
| edit | Namespace (via RoleBinding) | Create/update/delete workloads, services, configmaps, secrets, PVCs. Cannot manage Roles or RoleBindings. |
| view | Namespace (via RoleBinding) | Read-only access to most namespace resources. Cannot view secrets. |
The typical pattern is to bind these ClusterRoles via RoleBindings in specific namespaces, not via ClusterRoleBindings:
# Grant "edit" in the "staging" namespace to the QA team
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: qa-edit
namespace: staging
subjects:
- kind: Group
name: qa-team
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: ClusterRole # Reference a ClusterRole...
name: edit
apiGroup: rbac.authorization.k8s.io
# ...but the binding is namespaced, so permissions apply only in "staging"
Aggregated ClusterRoles
Aggregated ClusterRoles solve a subtle problem: when you install a CRD (Custom Resource Definition), how do the default roles (admin, edit, view) learn about the new resource types?
The answer is label-based aggregation. The default ClusterRoles have an aggregationRule that selects other ClusterRoles by label. When you create a CRD, you create small ClusterRoles with the appropriate labels, and their rules are automatically merged into the aggregated roles.
# This ClusterRole's rules get merged into "admin"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: custom-app-admin
labels:
rbac.authorization.k8s.io/aggregate-to-admin: "true"
rules:
- apiGroups: ["mycompany.io"]
resources: ["widgets"]
verbs: ["get", "list", "watch", "create", "update", "delete"]
Any user who has admin access in a namespace now automatically gets full access to widgets in that namespace. No manual RoleBinding updates required.
ServiceAccount Tokens: The Modern Model
Kubernetes v1.24 removed the automatic creation of long-lived ServiceAccount token secrets. The modern model uses bound service account tokens with four important properties:
- Time-bound — Tokens expire (default: 1 hour; the kubelet proactively rotates the token when 80% of its lifetime has elapsed, i.e., ~48 minutes by default)
- Audience-scoped — Tokens are valid only for specific audiences (typically the API server)
- Pod-bound — Tokens are invalidated when the pod is deleted
- Auto-rotated — The kubelet refreshes tokens before expiration
This is a significant security improvement over the old model, where a leaked ServiceAccount token granted permanent access until manually revoked.
# Explicit token request for non-Kubernetes consumers
apiVersion: v1
kind: Pod
metadata:
name: my-app
spec:
serviceAccountName: my-app-sa
containers:
- name: app
image: my-app:latest
volumeMounts:
- name: token
mountPath: /var/run/secrets/tokens
volumes:
- name: token
projected:
sources:
- serviceAccountToken:
path: api-token
expirationSeconds: 3600
audience: my-external-service
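For consumers outside a pod spec, the same bound tokens can be minted on demand with `kubectl create token`, which calls the TokenRequest API (available since v1.24; the ServiceAccount and audience names are illustrative):

```shell
# Mint a short-lived bound token for a ServiceAccount
kubectl create token my-app-sa -n production --duration=10m

# Scope the token to a non-API-server audience
kubectl create token my-app-sa -n production --audience=my-external-service
```

The token is printed to stdout and expires on its own; nothing persistent is created in the cluster.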
OIDC Integration for Human Users
Production clusters should authenticate human users via OpenID Connect (OIDC) rather than client certificates, which Kubernetes cannot revoke once issued. OIDC delegates authentication to an identity provider (Okta, Azure AD, Google Workspace, Dex).
The flow works as follows:
- User authenticates with the identity provider (browser-based login)
- Identity provider issues an ID token (JWT) containing username and groups
- kubectl sends the ID token with each API request
- API server validates the token signature against the OIDC provider’s public keys
- RBAC evaluates the extracted username and groups against bindings
This means group membership is managed in your identity provider, not in Kubernetes. When someone leaves the team, disabling their IdP account immediately revokes cluster access.
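On a self-managed control plane, the API server is pointed at the provider with the `--oidc-*` flags; a sketch (the issuer URL and client ID are placeholders, and managed platforms expose this through their own configuration instead):

```shell
kube-apiserver \
  --oidc-issuer-url=https://login.example.com \
  --oidc-client-id=kubernetes \
  --oidc-username-claim=email \
  --oidc-groups-claim=groups \
  --oidc-username-prefix="oidc:" \
  --oidc-groups-prefix="oidc:"
```

The prefixes are worth setting: they prevent a username or group from the identity provider colliding with built-in identities such as `system:masters`.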
Multi-Tenant RBAC Design
Most production clusters serve multiple teams. The standard model is namespace-per-tenant with a three-tier access structure:
MULTI-TENANT NAMESPACE MODEL
──────────────────────────────
Cluster
├── Namespace: team-alpha-dev
│ ├── RoleBinding: alpha-devs → ClusterRole:edit
│ ├── RoleBinding: alpha-leads → ClusterRole:admin
│ ├── RoleBinding: platform-team → ClusterRole:admin
│ └── ResourceQuota + LimitRange
│
├── Namespace: team-alpha-prod
│ ├── RoleBinding: alpha-ci → ClusterRole:edit (ServiceAccount)
│ ├── RoleBinding: alpha-leads → ClusterRole:admin
│ ├── RoleBinding: platform-team → ClusterRole:admin
│ └── ResourceQuota + LimitRange
│
├── Namespace: team-beta-dev
│ ├── RoleBinding: beta-devs → ClusterRole:edit
│ ├── RoleBinding: beta-leads → ClusterRole:admin
│ ├── RoleBinding: platform-team → ClusterRole:admin
│ └── ResourceQuota + LimitRange
│
└── Namespace: kube-system (platform only)
└── ClusterRoleBinding: platform-team → cluster-admin
THREE-TIER MODEL
─────────────────
Tier 1: Platform Team → cluster-admin (ClusterRoleBinding)
Tier 2: Team Leads → admin per namespace (RoleBinding)
Tier 3: Developers → edit per namespace (RoleBinding)
Design principles:
- Bind to Groups, not Users. When Alice joins team-alpha, add her to the `alpha-devs` group in your identity provider. No Kubernetes RBAC changes are needed.
- Use ClusterRoles with namespaced RoleBindings. Define permissions once, apply them per-namespace.
- Every namespace gets ResourceQuota and LimitRange. RBAC controls what you can do; quotas control how much.
- CI/CD uses dedicated ServiceAccounts with edit permissions scoped to specific namespaces. Never share ServiceAccounts across pipelines.
Least Privilege Checklist
- Every workload has its own ServiceAccount
- `automountServiceAccountToken: false` unless the workload needs API access
- No ClusterRoleBindings to cluster-admin except for the platform team
- No wildcard verbs or resources in custom Roles
- Groups (not individual users) in all RoleBindings
- OIDC for human users, bound tokens for workloads
- Regular audits of who can access secrets and create pods (pod creation implies secret access)
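The token-automount item from the checklist can be set at the ServiceAccount level (covering every pod that uses it) or per pod; a sketch with illustrative names:

```yaml
# ServiceAccount-level default: pods using this SA get no API token mounted
apiVersion: v1
kind: ServiceAccount
metadata:
  name: static-web
  namespace: production
automountServiceAccountToken: false
---
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  namespace: production
spec:
  serviceAccountName: static-web
  automountServiceAccountToken: false  # per-pod setting; overrides the SA's value
```

A workload without a mounted token simply has no credential to steal, which closes off the most common path from “compromised pod” to “API access.”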
Common Mistakes and Misconceptions
- “cluster-admin for everyone is fine in dev.” Bad habits in dev carry to production. Practice least-privilege from the start. Create namespace-scoped roles that match what each team actually needs.
- “RBAC denies by default, so I’m secure.” RBAC only controls API access. It doesn’t prevent a compromised pod from attacking the network, reading the filesystem, or accessing cloud metadata. RBAC is one layer of defense, not the whole strategy.
- “I can see who has access by reading RoleBindings.” Aggregated ClusterRoles, group memberships, and impersonation make the effective permission set non-obvious. Use `kubectl auth can-i --list --as=<user>` to audit actual permissions.
Further Reading
- RBAC documentation — Official reference
- Using RBAC Authorization — API details and examples
- Authenticating with OIDC — OIDC configuration
- Bound Service Account Tokens — Token projection and rotation
- RBAC.dev — Interactive RBAC lookup and visualization tool
Next: Network Policies — Controlling pod-to-pod traffic with ingress and egress rules.
Chapter 26: Network Policies
By default, every pod can reach every other pod — no firewalls, no segmentation. A compromised pod can reach databases, scan the cluster network, and exfiltrate data.
The Fundamental Model
Kubernetes Network Policies operate on three principles:
- Non-isolated by default. A pod with no Network Policy selecting it accepts all inbound and all outbound traffic. Network Policies are opt-in.
- Additive allow-only. There are no “deny” rules. Policies can only allow traffic. If you create a policy that selects a pod, that pod becomes isolated for the direction(s) specified (ingress, egress, or both). Once isolated, only traffic explicitly allowed by a policy is permitted.
- Both sides must allow. For traffic to flow from pod A to pod B, the egress policy on pod A must allow traffic to B, AND the ingress policy on pod B must allow traffic from A. If either side denies (by isolation without a matching allow), the traffic is dropped.
NETWORK POLICY TRAFFIC FLOW
─────────────────────────────
Pod A (team-alpha) Pod B (team-beta)
┌───────────────────┐ ┌───────────────────┐
│ │ │ │
│ Egress Policy: │ │ Ingress Policy: │
│ "allow to │──────────▶│ "allow from │
│ team-beta pods" │ Traffic │ team-alpha pods" │
│ │ flows │ │
└───────────────────┘ only if └───────────────────┘
BOTH
allow
If Pod A has no egress policy → Pod A is non-isolated
for egress → all egress allowed (A's side: OK)
If Pod B has no ingress policy → Pod B is non-isolated
for ingress → all ingress allowed (B's side: OK)
If Pod A has egress policy that does NOT list Pod B
→ traffic BLOCKED at A's side
The Essential Policy Templates
Default Deny All Ingress
The most important policy in any cluster. Apply this to every namespace and then add specific allow rules.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: production
spec:
podSelector: {} # Empty selector = all pods in namespace
policyTypes:
- Ingress # Isolate for ingress; no ingress rules = deny all
# No ingress rules → all inbound traffic denied
Default Deny All Egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-egress
namespace: production
spec:
podSelector: {}
policyTypes:
- Egress
# No egress rules → all outbound traffic denied
Warning: Denying all egress breaks DNS resolution. Pods will not be able to resolve service names. You almost always need to pair this with a DNS allow rule (see below).
Default Deny Both Directions
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
Allow DNS Egress (Critical)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns
namespace: production
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
Namespace Isolation
Allow traffic only from pods within the same namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-same-namespace
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
ingress:
- from:
- podSelector: {} # All pods in THIS namespace
Specific Pod-to-Pod Communication
Allow only the frontend to reach the backend, and only the backend to reach the database:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: backend-ingress
namespace: production
spec:
podSelector:
matchLabels:
app: backend
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: database-ingress
namespace: production
spec:
podSelector:
matchLabels:
app: database
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: backend
ports:
- protocol: TCP
port: 5432
Egress to External IPs
Allow pods to reach a specific external service (e.g., a third-party API):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-external-api
namespace: production
spec:
podSelector:
matchLabels:
app: payment-service
policyTypes:
- Egress
egress:
- to:
- ipBlock:
cidr: 203.0.113.0/24 # External API range
ports:
- protocol: TCP
port: 443
The AND vs OR Selector Trap
This is the single most common source of Network Policy bugs. The behavior changes depending on whether selectors appear in the same from/to item or in separate list items.
THE SELECTOR LOGIC TRAP
─────────────────────────
COMBINED (AND logic) --- both conditions must match:
ingress:
- from:
- namespaceSelector: ┐
matchLabels: │ AND
env: production │
podSelector: │ Both must be true:
matchLabels: │ namespace=production
app: frontend ┘ AND app=frontend
SEPARATE (OR logic) --- either condition can match:
ingress:
- from:
- namespaceSelector: ← OR: any pod in namespace
matchLabels: with env=production
env: production
- podSelector: ← OR: any pod in SAME namespace
matchLabels: with app=frontend
app: frontend
THE DIFFERENCE:
Combined: Only frontend pods in production namespaces
Separate: ALL pods in production namespaces
OR frontend pods in the CURRENT namespace
The difference is a single - character (a new list item). Combined selectors are intersections (AND). Separate selectors are unions (OR). Getting this wrong can open your namespace to traffic from every pod in a production-labeled namespace.
CNI Support: The Enforcement Gap
Network Policies are a Kubernetes API object. Any cluster accepts them. But enforcing them requires a CNI plugin that implements the NetworkPolicy specification. If your CNI does not support Network Policies, policies exist in etcd but are silently ignored — no warning, no effect on traffic.
| CNI Plugin | Network Policy Support | Notes |
|---|---|---|
| Calico | Full | The most widely deployed policy-capable CNI. Supports both Kubernetes NetworkPolicy and its own more expressive CRDs (GlobalNetworkPolicy, deny rules, application-layer policies). |
| Cilium | Full + extended | eBPF-based. Supports Kubernetes NetworkPolicy plus CiliumNetworkPolicy with L7 (HTTP, gRPC, Kafka) filtering, DNS-aware policies, and identity-based enforcement. |
| Weave Net | Full | Supports standard NetworkPolicy. Less common in new deployments. |
| Antrea | Full | VMware-backed, built on Open vSwitch. Good support for NetworkPolicy and its own Antrea-native policies. |
| Flannel | None | Flannel provides connectivity only. If you apply a NetworkPolicy on a Flannel cluster, it is silently ignored. This is the most common enforcement gap in production. |
| kubenet | None | Basic CNI for simple clusters. No policy support. |
How to verify enforcement: Deploy two pods. Apply a deny-all ingress policy to the target pod’s namespace. Attempt to connect from the source pod. If the connection succeeds, your CNI is not enforcing policies.
# Quick verification test
kubectl run source --image=busybox --rm -it --restart=Never -- \
wget -qO- --timeout=3 http://target-pod-ip:8080
# If this succeeds after a deny-all policy, your CNI does not enforce policies
A Complete Namespace Policy Set
A production namespace typically needs a layered set of policies. Here is a complete example for a three-tier application:
POLICY LAYERING FOR A NAMESPACE
─────────────────────────────────
production namespace
┌──────────────────────────────────────────────────┐
│ │
│ Policy 1: default-deny-all (ingress + egress) │
│ Policy 2: allow-dns (egress to kube-dns) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ frontend │───▶│ backend │───▶│ database │ │
│ │ │ │ │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ ▲ │ │ │
│ │ ▼ │ │
│ Policy 3: Policy 5: Policy 7: │
│ allow ingress allow egress deny all │
│ from ingress to database egress (no │
│ controller on 5432 external) │
│ │
│ Policy 4: Policy 6: │
│ allow egress allow ingress │
│ to backend from backend │
│ on 8080 on 5432 │
│ │
└──────────────────────────────────────────────────┘
External traffic → ingress controller → frontend → backend → database
Every other path is blocked.
Debugging Network Policies
When traffic is unexpectedly blocked:
- Check that policies exist: `kubectl get networkpolicy -n <namespace>`
- Verify CNI enforcement: test with a known-blocked connection
- Inspect the policy: `kubectl describe networkpolicy <name> -n <namespace>`
- Check labels: policies select pods by label, so a missing or misspelled label means the policy does not apply to the pod you think it does. Compare with `kubectl get pods --show-labels -n <namespace>`
- Check DNS: if pods can connect by IP but not by name, the egress DNS rule is missing or incorrect
- Remember the AND/OR trap: review your `from`/`to` selectors for unintended union logic
Limitations of Kubernetes Network Policies
The standard NetworkPolicy API has real limitations:
- No deny rules. You cannot explicitly block a specific source. You can only fail to allow it.
- No logging. There is no built-in way to log dropped packets.
- No cluster-wide policies. Each NetworkPolicy is namespaced. There is no way to apply a policy across all namespaces without creating it in each one.
- No L7 filtering. Standard policies operate at L3/L4 (IP and port). They cannot distinguish between `GET /api/public` and `DELETE /api/admin`.
For these capabilities, use your CNI’s extended policy CRDs. Calico’s GlobalNetworkPolicy and Cilium’s CiliumNetworkPolicy both address these gaps.
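As a sketch of what those extended CRDs buy you, a CiliumNetworkPolicy can allow only specific HTTP methods and paths, which the standard API cannot express (assumes Cilium as the CNI; the labels, port, and path pattern are illustrative):

```yaml
# Sketch: frontend may only issue GETs to the backend's public API
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-read-only
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:                    # L7 rule: parsed by Cilium's proxy
        - method: "GET"
          path: "/api/public.*"  # regex; anything else is rejected
```

Requests that match the L3/L4 rule but fail the L7 rule are rejected at the HTTP layer rather than silently dropped, which also makes violations visible in Cilium's observability tooling.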
Common Mistakes and Misconceptions
- “Pods are isolated by default.” The opposite: all pods can reach all other pods by default. You must explicitly create NetworkPolicies to restrict traffic. No policy = fully open.
- “A NetworkPolicy on ingress also blocks egress.” Ingress and egress are independent. A policy selecting only ingress rules does not restrict outbound traffic. You need separate egress rules.
- “My CNI supports NetworkPolicy.” See the CNI support table above. Always verify enforcement with a test connection.
- “NetworkPolicies work across namespaces automatically.” You must use `namespaceSelector` to allow cross-namespace traffic; a policy only applies to pods in its own namespace.
Further Reading
- Network Policies documentation — Official reference
- Network Policy recipes — Practical examples
- Calico Network Policy — Extended policy features
- Cilium Network Policy — L7-aware policies
- Network Policy Editor — Visual editor for building and understanding NetworkPolicies
Next: Supply Chain Security — Image signing, admission policies, SBOMs, and the SLSA framework.
Chapter 27: Supply Chain Security
A container image passes through source code, build systems, registries, and your cluster — each step is an opportunity for compromise. Supply chain security verifies that nothing was tampered with along the way.
This is not a theoretical concern. The SolarWinds attack (2020) injected malicious code into a build pipeline. The Codecov breach (2021) modified a bash uploader to exfiltrate credentials. The xz utils backdoor (2024) hid a sophisticated compromise in a compression library used by SSH. Kubernetes clusters are particularly exposed because they pull images from external registries on every deployment, and a single compromised base image can propagate to hundreds of workloads.
The Problem in Layers
THE SOFTWARE SUPPLY CHAIN
──────────────────────────
Source Code ──▶ Build System ──▶ Registry ──▶ Cluster
│ │ │ │
▼ ▼ ▼ ▼
Was the code Was the build Was the Is the image
reviewed? tampered with? image allowed to
Who authored Was the build modified run? Was it
this commit? reproducible? in transit signed? Is it
or at rest? from a trusted
registry?
ATTACK SURFACE AT EACH STAGE:
┌─────────┐ ┌─────────┐ ┌──────────┐ ┌─────────────┐
│ Typo- │ │ Build │ │ Registry │ │ Deployment │
│ squatted│ │ system │ │ compro- │ │ of unsigned │
│ deps │ │ compro- │ │ mised │ │ or outdated │
│ │ │ mised │ │ │ │ images │
└─────────┘ └─────────┘ └──────────┘ └─────────────┘
Image Signing with Sigstore/Cosign
Sigstore is the dominant open-source project for signing and verifying container images. Its key innovation is keyless signing — you do not need to manage long-lived signing keys. Instead, you prove your identity through an existing OIDC provider (GitHub Actions, Google, Microsoft), and Sigstore issues a short-lived certificate tied to that identity.
The Keyless Signing Flow
SIGSTORE KEYLESS SIGNING PIPELINE
───────────────────────────────────
Developer / CI Pipeline
│
│ 1. Request identity token (OIDC)
▼
┌────────────┐
│ OIDC │ GitHub Actions, Google, etc.
│ Provider │ Issues JWT with identity claims
└─────┬──────┘
│ 2. Present OIDC token
▼
┌────────────┐
│ Fulcio │ Sigstore's certificate authority
│ (CA) │ Verifies OIDC token
│ │ Issues short-lived X.509 cert
│ │ (valid ~20 minutes)
└─────┬──────┘
│ 3. Ephemeral certificate + private key
▼
┌────────────┐
│ Cosign │ Signs the image digest using
│ (client) │ the ephemeral private key
│ │ Pushes signature to registry
└─────┬──────┘
│ 4. Record signing event
▼
┌────────────┐
│ Rekor │ Sigstore's transparency log
│ (log) │ Immutable, append-only record
│ │ Proves signing happened at
│ │ a specific time with a
│ │ specific identity
└────────────┘
VERIFICATION:
cosign verify checks:
✓ Signature matches image digest
✓ Certificate was issued by Fulcio
✓ Certificate identity matches expected signer
✓ Signing event exists in Rekor transparency log
Cosign in Practice
# Sign an image (keyless, in CI)
cosign sign ghcr.io/myorg/myapp@sha256:abc123...
# Verify an image
cosign verify \
--certificate-identity=https://github.com/myorg/myapp/.github/workflows/build.yml@refs/heads/main \
--certificate-oidc-issuer=https://token.actions.githubusercontent.com \
ghcr.io/myorg/myapp@sha256:abc123...
# Sign with a key pair (traditional, for air-gapped environments)
cosign generate-key-pair
cosign sign --key cosign.key ghcr.io/myorg/myapp@sha256:abc123...
cosign verify --key cosign.pub ghcr.io/myorg/myapp@sha256:abc123...
Notation / Notary v2
Notation (the CNCF’s Notary v2 project) takes a traditional PKI approach. You manage your own signing keys and certificates, sign images using the notation CLI, and store signatures as OCI artifacts alongside the image in the registry.
Notation is the right choice when your organization already has a PKI infrastructure, when you need to comply with regulations that require specific key management practices, or when you operate in air-gapped environments where Sigstore’s online services (Fulcio, Rekor) are not reachable.
| Feature | Cosign (Sigstore) | Notation (Notary v2) |
|---|---|---|
| Key management | Keyless (OIDC) or key-pair | Key-pair with PKI |
| Certificate authority | Fulcio (public) | Your own CA |
| Transparency log | Rekor (public) | None (optional) |
| Air-gapped support | Requires key-pair mode | Native |
| Ecosystem adoption | Wider (GitHub, GCP, AWS) | Growing (Azure ACR native) |
| Signature storage | OCI registry | OCI registry |
Admission Control: Enforcing Policy at Deploy Time
Signing images is useless unless you verify signatures before deployment. This is the job of admission controllers — webhook-based components that intercept API requests and enforce policies before objects are created.
OPA Gatekeeper vs Kyverno
| Feature | OPA Gatekeeper | Kyverno |
|---|---|---|
| Policy language | Rego (purpose-built, steep learning curve) | YAML (Kubernetes-native, familiar) |
| Mutation | Supported (via assign/modify) | Native (mutate rules in policy) |
| Generation | Not supported | Native (generate resources from policy) |
| Image verification | Via external data or custom Rego | Built-in verifyImages rule |
| Validation | Core strength | Core strength |
| Audit mode | Built-in (audit violations without blocking) | Built-in (audit/enforce modes) |
| Learning curve | High (Rego is a new language) | Low (YAML-native) |
| Community | Mature, CNCF Graduated | Fast-growing, CNCF Incubating |
| Policy library | Gatekeeper Library | Kyverno Policies |
For image verification specifically, Kyverno has a significant advantage: signature verification is a first-class feature, not something you bolt on with Rego functions.
# Kyverno policy: require Cosign signature from trusted identity
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-signed-images
spec:
validationFailureAction: Enforce
rules:
- name: verify-signature
match:
any:
- resources:
kinds:
- Pod
verifyImages:
- imageReferences:
- "ghcr.io/myorg/*"
attestors:
- entries:
- keyless:
issuer: "https://token.actions.githubusercontent.com"
subject: "https://github.com/myorg/*"
rekor:
url: "https://rekor.sigstore.dev"
SBOM: Software Bill of Materials
An SBOM is a machine-readable inventory of every component in a container image — every package and dependency. It answers the question: “When the next Log4Shell happens, are we affected?”
Generation tools:
- Trivy — Generates SBOMs as part of its scanning workflow. Supports SPDX and CycloneDX formats. Can scan container images, filesystems, and Git repositories.
- Syft — Anchore’s dedicated SBOM generator. Deeper catalog of package types. Outputs SPDX, CycloneDX, and its own JSON format.
Formats:
- SPDX — Linux Foundation standard. Widely adopted for compliance. Verbose.
- CycloneDX — OWASP standard. More focused on security use cases. Lighter.
# Generate SBOM with Trivy
trivy image --format cyclonedx --output sbom.json ghcr.io/myorg/myapp:latest
# Generate SBOM with Syft
syft ghcr.io/myorg/myapp:latest -o spdx-json > sbom.spdx.json
# Attach SBOM to image with Cosign
cosign attach sbom --sbom sbom.json ghcr.io/myorg/myapp@sha256:abc123...
Image Scanning
Scanning should happen in CI, in the registry, and at admission time (via Kyverno/Gatekeeper).
The SLSA Framework
SLSA (Supply-chain Levels for Software Artifacts, pronounced “salsa”) is a framework, originally developed at Google and now maintained under the OpenSSF, that defines increasingly rigorous levels of supply chain integrity.
| Level | Name | Requirements |
|---|---|---|
| 0 | No guarantees | No SLSA compliance |
| 1 | Provenance exists | Build process generates provenance metadata documenting how the artifact was built |
| 2 | Hosted build | Build runs on a hosted service (not a developer laptop). Provenance is signed. |
| 3 | Hardened builds | Build service is hardened against tampering. Provenance is non-forgeable. Build is isolated. Source is version-controlled. |
GitHub Actions with reusable workflows can achieve SLSA Level 3 using the slsa-framework/slsa-github-generator action, which produces signed provenance attestations.
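As a sketch of what this looks like in CI — the job layout, version tag, and the hashes output are illustrative, while the reusable-workflow path and permissions follow the generator’s documented usage:

```yaml
# Hypothetical GitHub Actions jobs: build an artifact, then call the
# SLSA generator's reusable workflow to produce signed provenance.
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      hashes: ${{ steps.hash.outputs.hashes }}   # sha256 digests of artifacts
    steps:
      - uses: actions/checkout@v4
      - run: make build                           # assumed build step
      - id: hash
        run: echo "hashes=$(sha256sum dist/* | base64 -w0)" >> "$GITHUB_OUTPUT"
  provenance:
    needs: build
    permissions:
      actions: read     # read workflow details
      id-token: write   # sign via Sigstore keyless
      contents: write   # attach provenance to the release
    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.0.0
    with:
      base64-subjects: ${{ needs.build.outputs.hashes }}
```

Because the provenance job runs as a pinned reusable workflow on GitHub-hosted infrastructure, the attestation is generated outside the build job’s control, which is what qualifies it for Level 3.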
Restricting Image Registries
A fundamental control: only allow images from registries you trust.
# Kyverno: restrict to approved registries
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: restrict-registries
spec:
validationFailureAction: Enforce
rules:
- name: allowed-registries
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Images must come from approved registries."
pattern:
spec:
containers:
- image: "ghcr.io/myorg/* | registry.internal.company.com/*"
=(initContainers):  # conditional anchor: checked only when initContainers are present
- image: "ghcr.io/myorg/* | registry.internal.company.com/*"
Putting It Together: The Secure Pipeline
END-TO-END SUPPLY CHAIN SECURITY
──────────────────────────────────
┌─────────────┐
│ Source Code │ Signed commits, code review,
│ │ dependency scanning (Dependabot)
└──────┬──────┘
│
▼
┌─────────────┐
│ CI Build │ SLSA Level 2+: hosted, signed provenance
│ (GitHub │ Trivy scan: fail on CRITICAL
│ Actions) │ SBOM generation (Syft/Trivy)
└──────┬──────┘
│
▼
┌─────────────┐
│ Sign & │ Cosign keyless sign
│ Attest │ Attach SBOM attestation
│ │ Record in Rekor transparency log
└──────┬──────┘
│
▼
┌─────────────┐
│ Registry │ Continuous scanning
│ (GHCR/ECR) │ Image retention policy
└──────┬──────┘
│
▼
┌─────────────┐
│ Admission │ Kyverno/Gatekeeper:
│ Control │ ✓ Signature verified
│ │ ✓ Registry allowed
│ │ ✓ No critical CVEs
│ │ ✓ SBOM attached
└──────┬──────┘
│
▼
┌─────────────┐
│ Runtime │ Pod Security Standards
│ Cluster │ Network Policies
│ │ Runtime monitoring (Falco)
└─────────────┘
Common Mistakes and Misconceptions
- “I scan images once and they’re secure.” New CVEs are discovered daily. Images that were clean yesterday may have critical vulnerabilities today. Continuous scanning in the registry (not just at build time) is essential.
- “Using official base images means no vulnerabilities.” Even official images contain OS packages with CVEs. Use distroless or scratch-based images to minimize attack surface. Regularly rebuild images to pick up base image patches.
- “Image signing is enough.” Signing proves provenance but not safety. A signed image can still contain vulnerabilities. Signing + scanning + admission policy (Kyverno/Gatekeeper) together form the chain.
Further Reading
- Sigstore documentation — Cosign, Fulcio, Rekor
- Kyverno image verification — Policy examples
- SLSA framework — Levels and requirements
- Trivy documentation — Scanning and SBOM generation
- Notation documentation — Notary v2
Next: Secrets Management — Encryption at rest, KMS integration, and external secrets operators.
Chapter 28: Secrets Management
Kubernetes Secrets are base64-encoded. This is not encryption. Base64 is a reversible encoding — echo "cGFzc3dvcmQxMjM=" | base64 -d produces password123 instantly. Every tutorial mentions this, yet production clusters routinely store database passwords, API keys, and TLS certificates in Secrets with no additional protection. The data sits in etcd in plaintext (or rather, in trivially decodable base64), readable by anyone with access to the etcd data directory or sufficient RBAC permissions.
This chapter covers the full spectrum of secrets protection: encrypting data at rest in etcd, integrating with external key management systems, and using external secrets operators that keep sensitive data out of Kubernetes entirely.
The Default: No Encryption
When you create a Secret, Kubernetes stores it in etcd. By default, the identity provider is used, which means the data is stored as-is (base64-encoded, not encrypted). Anyone with read access to etcd — a backup, a compromised node, a misconfigured endpoint — can read every secret in the cluster.
flowchart LR
kubectl["<b>kubectl</b><br>create secret generic db-creds<br>--from-literal=password=hunter2"]
api["<b>API Server</b>"]
etcd["<b>etcd</b><br>/registry/secrets/default/db-creds<br><br>data:<br> password: aHVudGVyMg==<br><br>base64 'hunter2' — NOT encrypted"]
kubectl --> api --> etcd
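The diagram’s point is easy to verify in any shell: the value stored in etcd is plain base64, recoverable with no key (only standard coreutils are used here).

```shell
# base64 round-trip: encoding, not encryption. No key is involved.
encoded=$(printf 'hunter2' | base64)
echo "$encoded"                        # aHVudGVyMg==
printf '%s' "$encoded" | base64 -d     # prints: hunter2
```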
Encryption at Rest
Kubernetes supports encrypting Secret data before it reaches etcd. You configure this through an EncryptionConfiguration file referenced by the API server’s --encryption-provider-config flag.
Encryption Providers
| Provider | Algorithm | Key Management | Use Case |
|---|---|---|---|
| identity | None (plaintext) | N/A | Default. Insecure. |
| aescbc | AES-256-CBC | Static key in config file | Simple encryption. Key is on disk alongside the API server. |
| aesgcm | AES-256-GCM | Static key in config file | Authenticated encryption (integrity + confidentiality). Random 96-bit nonces mean the key must be rotated before nonce collision becomes likely (the Kubernetes docs recommend every ~200,000 writes). |
| secretbox | XSalsa20-Poly1305 | Static key in config file | Modern authenticated encryption. Preferred over aescbc/aesgcm for static key scenarios. |
| kms v2 | Envelope encryption | External KMS (AWS KMS, GCP KMS, Azure Key Vault, HashiCorp Vault) | Production-grade. Keys never leave the KMS. |
Basic EncryptionConfiguration
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
- secrets
providers:
- secretbox: # Primary: encrypt with secretbox
keys:
- name: key1
secret: <base64-encoded-32-byte-key>
- identity: {} # Fallback: read unencrypted data
The provider order matters. The first provider is used for writing (encrypting new secrets). All listed providers are tried for reading (so you can decrypt data written by a previous provider during key rotation). The identity provider at the end ensures that secrets written before encryption was enabled can still be read.
KMS v2 Envelope Encryption
Static keys stored in configuration files have an obvious weakness: the key is on the same machine as the encrypted data. If someone compromises the API server node, they have both the ciphertext and the key. KMS v2 solves this with envelope encryption.
flowchart TD
A["<b>1. API Server</b><br>Generates random DEK<br>(plaintext, cached in memory)"]
B["<b>2. External KMS</b><br>AWS KMS / GCP Cloud KMS /<br>Vault / Azure Key Vault"]
C["<b>3. API Server</b><br>Encrypts secret data with plaintext DEK"]
D["<b>4. etcd</b><br>Stores encrypted DEK + encrypted data"]
A -- "Send plaintext DEK<br>for encryption" --> B
B -- "Return encrypted DEK<br>(wrapped with KEK;<br>KEK never leaves KMS)" --> C
C -- "Store enc(DEK) + enc(data)" --> D
Key insight: The KEK never leaves the KMS. Even if etcd is fully compromised, the attacker has encrypted data and an encrypted DEK but no way to decrypt either without KMS access. The plaintext DEK is cached in API Server memory and never written to disk.
KMS v2 Configuration
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
- secrets
providers:
- kms:
apiVersion: v2
name: aws-kms-provider
endpoint: unix:///var/run/kmsplugin/socket.sock
timeout: 3s
- identity: {}
The KMS plugin runs as a separate process (typically a DaemonSet or static pod on control plane nodes) that translates between the Kubernetes KMS gRPC protocol and your cloud provider’s KMS API.
Key Rotation
For static key providers, rotation requires four steps:
- Add the new key as the first entry in the keys list (so new writes use it). In HA clusters, add it as the second entry first, restart every API server so all of them can decrypt, then promote it to first.
- Restart the API server to pick up the configuration change
- Re-encrypt all existing secrets: kubectl get secrets --all-namespaces -o json | kubectl replace -f -
- Remove the old key from the configuration
For KMS v2, rotation happens in the KMS itself. When you rotate the KEK in AWS KMS or GCP Cloud KMS, new DEKs are wrapped with the new KEK. Existing secrets are re-encrypted on next write or via the re-encryption command above.
External Secrets Solutions
Encrypting at rest protects data in etcd, but the secrets still exist as Kubernetes Secret objects — visible to anyone with RBAC read access, exposed in pod environment variables, logged by admission webhooks. External secrets solutions keep the canonical secret in an external system and sync or inject it into pods.
Sealed Secrets
What it is: A controller that encrypts secrets with a public key so they can be safely stored in Git. Only the controller running in the cluster has the private key to decrypt them.
How it works: You use kubeseal to encrypt a Secret into a SealedSecret custom resource. The SealedSecret can be committed to Git. The controller decrypts it and creates the corresponding Secret in the cluster.
# Encrypt a secret for Git storage
kubectl create secret generic db-creds \
--from-literal=password=hunter2 --dry-run=client -o yaml \
| kubeseal --controller-namespace kube-system \
--controller-name sealed-secrets -o yaml > sealed-db-creds.yaml
Trade-offs: Simple to deploy, works with GitOps, no external dependencies beyond the controller. But the decrypted Secret still exists in etcd as a standard Kubernetes Secret. Sealed Secrets protect the Git side, not the runtime side.
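The resulting resource looks roughly like this — the ciphertext is abbreviated and illustrative; only the controller’s private key can decrypt encryptedData:

```yaml
# Illustrative kubeseal output: safe to commit, because the payload
# is encrypted with the in-cluster controller's public key.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-creds
  namespace: default
spec:
  encryptedData:
    password: AgBy3i4OJSWK...   # truncated ciphertext (illustrative)
  template:
    metadata:
      name: db-creds            # the Secret the controller will create
```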
External Secrets Operator (ESO)
What it is: A controller that syncs secrets from external providers (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, HashiCorp Vault, 1Password, and many more) into Kubernetes Secrets.
flowchart TD
aws["<b>AWS Secrets Manager</b><br>prod/db-pass"]
gcp["<b>GCP Secret Manager</b><br>api-key"]
store["<b>SecretStore / ClusterSecretStore</b><br>(auth config per provider)"]
eso["<b>External Secrets Operator</b><br>Reads ExternalSecret CRs<br>Fetches values from providers<br>Creates/updates K8s Secrets<br>Syncs on interval (e.g. every 1h)"]
secret["<b>Kubernetes Secret</b><br>(auto-created, kept in sync)"]
pods["<b>Pods</b><br>Mounted as files or env vars"]
aws --> store
gcp --> store
store --> eso
eso --> secret
secret --> pods
# SecretStore: how to authenticate to the provider
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
name: aws-secrets
namespace: production
spec:
provider:
aws:
service: SecretsManager
region: us-east-1
auth:
jwt:
serviceAccountRef:
name: eso-sa # Uses IRSA for authentication
---
# ExternalSecret: what to sync
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: db-credentials
namespace: production
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets
target:
name: db-credentials # Name of the K8s Secret to create
data:
- secretKey: password
remoteRef:
key: prod/database/password
HashiCorp Vault
Vault provides dynamic secret generation (short-lived database credentials created on demand), PKI certificate issuance, transit encryption (encrypt data without exposing keys), and detailed audit logging.
Vault integrates with Kubernetes in three ways:
Agent Sidecar Injector — A mutating webhook injects a Vault Agent sidecar into pods. The agent authenticates to Vault using the pod’s ServiceAccount, retrieves secrets, and writes them to a shared volume. The application reads secrets from files.
CSI Provider — The Vault CSI provider mounts secrets as a CSI volume. Simpler than the sidecar approach but with fewer features (no dynamic renewal).
Vault Secrets Operator (VSO) — The newest approach. A Kubernetes operator that syncs Vault secrets into Kubernetes Secret objects, similar to ESO but Vault-specific and with native Vault features like dynamic secrets and lease renewal.
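For the sidecar injector, the request is expressed as pod-template annotations. A minimal sketch — the role name and secret path are assumptions, while the annotation keys are the injector’s documented ones:

```yaml
# Pod template annotations consumed by Vault's mutating webhook
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "my-app"                # Vault k8s auth role (assumed name)
    vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/my-app"
    # The agent renders the secret to /vault/secrets/db-creds in the pod
```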
Comparison
See Appendix C: Decision Trees for a secret management decision flowchart.
| Feature | Sealed Secrets | ESO | Vault |
|---|---|---|---|
| Complexity | Low | Medium | High |
| External dependency | None (controller only) | Cloud provider secrets service | Vault cluster |
| Git-safe secrets | Yes (primary purpose) | No (syncs from cloud) | No |
| Dynamic secrets | No | No | Yes (database creds, PKI certs) |
| Multi-cloud | N/A | Yes (many providers) | Yes (one Vault, many consumers) |
| Audit logging | No | Provider-dependent | Yes (detailed) |
| Cost | Free | Free + cloud secrets service pricing | Free (OSS) or paid (Enterprise) + operational cost |
| Best for | Small teams, GitOps | Cloud-native, multi-provider | Enterprise, strict compliance, dynamic secrets |
Best Practices
Mount secrets as files, not environment variables. Environment variables are exposed in /proc/<pid>/environ, appear in crash dumps, and are inherited by child processes. File-mounted secrets can have restrictive file permissions and are not leaked through process inspection.
# Preferred: mount as file
containers:
- name: app
volumeMounts:
- name: db-creds
mountPath: /etc/secrets
readOnly: true
volumes:
- name: db-creds
secret:
secretName: db-credentials
defaultMode: 0400 # Read-only by owner
Use short-lived credentials. A database password that never expires is a permanently valid attack vector. Vault’s dynamic secrets generate credentials with a TTL (e.g., 1 hour). When the lease expires, Vault revokes the credentials. ESO’s refresh interval keeps synced secrets current.
Scope RBAC for secrets. Not every developer needs kubectl get secrets. Restrict Secret read access to the specific ServiceAccounts and namespaces that need it. Remember that pod creation implies secret access (anyone who can create a pod can mount any secret in the namespace).
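Scoping can be as narrow as a single named Secret. A sketch, with illustrative names:

```yaml
# Role granting read access to exactly one Secret in one namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-db-credentials
  namespace: production
rules:
- apiGroups: [""]                       # core API group
  resources: ["secrets"]
  resourceNames: ["db-credentials"]     # note: list/watch ignore resourceNames
  verbs: ["get"]
```

Granting only get (not list or watch) matters: resourceNames restrictions do not apply to collection verbs.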
Audit secret access. Enable Kubernetes audit logging for Secret read operations. In Vault, audit logging is built in and records every secret access with the requesting identity.
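On the Kubernetes side, a minimal audit-policy rule for Secret reads might look like this; logging at the Metadata level records who accessed which Secret without ever writing the secret values to the audit log:

```yaml
# Audit policy fragment: record Secret reads, but not request/response bodies
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata          # metadata only: user, verb, resource, timestamp
  resources:
  - group: ""              # core API group
    resources: ["secrets"]
  verbs: ["get", "list", "watch"]
```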
Rotate regularly. Automate key rotation for encryption at rest. Automate credential rotation for application secrets. Test that rotation does not cause downtime.
Never log secrets. Ensure admission webhooks, logging sidecars, and debug tools do not capture secret values. Mask sensitive fields in application logs.
Common Mistakes and Misconceptions
- “Kubernetes Secrets are encrypted.” By default, Secrets are stored as base64 in etcd — which is encoding, not encryption. You must enable encryption at rest (EncryptionConfiguration) or use an external KMS provider.
- “Sealed Secrets or External Secrets solve everything.” These tools solve the GitOps problem (how to store secrets in Git). They don’t solve rotation, access auditing, or least-privilege access. Use them with a proper vault backend.
Further Reading
- Encrypting Secret Data at Rest — Official guide
- KMS v2 documentation — KMS plugin setup
- External Secrets Operator — Multi-provider secrets sync
- Sealed Secrets — Git-safe encrypted secrets
- Vault Kubernetes integration — Agent, CSI, VSO
Next: Pod Security Standards — Privileged, Baseline, and Restricted profiles with Pod Security Admission.
Chapter 29: Pod Security Standards
A container is a process running on a Linux host. By default, Kubernetes places remarkably few restrictions on what that process can do. A pod can run as root, mount the host filesystem, share the host network namespace, escalate privileges, and disable security profiles. Each of these capabilities is a legitimate attack surface. Pod Security Standards define three profiles — Privileged, Baseline, and Restricted — that progressively lock down what pods are allowed to do. Pod Security Admission (PSA) enforces these profiles at the namespace level, providing a built-in mechanism to prevent dangerous pod configurations from ever reaching the cluster.
This chapter covers the standards themselves, the admission controller that enforces them, and the migration path from the now-removed PodSecurityPolicy (PSP) to the current model.
Why Pod-Level Security Matters
Consider what an attacker gains from a compromised container with no security restrictions:
- Privileged mode: Full access to host devices, effectively root on the node
- Host PID namespace: See and signal every process on the node
- Host network namespace: Bind to any port on the node, sniff network traffic
- hostPath volumes: Read and write any file on the node filesystem
- Root user: Write to container filesystem, install tools, exploit kernel vulnerabilities
- Privilege escalation: Gain capabilities beyond the container’s initial set
- No seccomp profile: Access to the full set of 300+ Linux syscalls, including dangerous ones like ptrace, mount, and reboot
Without pod security controls, every container is one exploit away from full node compromise. The standards exist to define a sensible default: what should a “normal” pod look like?
The Three Profiles
Controls Matrix
| Control | Privileged | Baseline | Restricted |
|---|---|---|---|
| Privileged containers | Allowed | Forbidden | Forbidden |
| Host namespaces (hostPID, hostIPC, hostNetwork) | Allowed | Forbidden | Forbidden |
| Host ports | Allowed | Limited (known ranges) | Limited (known ranges) |
| HostPath volumes | Allowed | Forbidden | Forbidden |
| Privilege escalation (allowPrivilegeEscalation) | Allowed | Allowed | Forbidden (must be false) |
| Running as root (runAsNonRoot) | Allowed | Allowed | Forbidden (must be true) |
| Root user (runAsUser: 0) | Allowed | Allowed | Forbidden |
| Capabilities | All | Cannot add capabilities beyond the default Docker set (AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT) | Drop ALL, add only: NET_BIND_SERVICE |
| Seccomp profile | Any or none | Any or none | Must set RuntimeDefault or Localhost |
| Volume types | All | All except hostPath | Restricted set: configMap, csi, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, secret |
| Sysctls | All | Safe set only | Safe set only |
| AppArmor | Any or none | Any or none | Must not opt out of default profile |
| SELinux | Any | Cannot set MustRunAs type to escalating types | Cannot set MustRunAs type to escalating types |
| /proc mount type | Any | Default only | Default only |
| Seccomp (ephemeral containers) | Any | Any | Must set RuntimeDefault or Localhost |
Profile Descriptions
Privileged — No restrictions. Used for system-level workloads that genuinely need full host access: CNI plugins, storage drivers, logging agents that read /var/log, monitoring agents that access /proc and /sys. This profile should apply only to system namespaces (kube-system) and never to application workloads.
Baseline — Prevents known privilege escalation paths. Blocks privileged containers, host namespaces, and hostPath volumes. Allows running as root and does not require seccomp profiles. This is the minimum viable security policy for application workloads. Most applications work under Baseline without modification.
Restricted — Enforces current security best practices. Requires non-root execution, drops all capabilities, mandates seccomp profiles, and limits volume types. Many applications need modification to work under Restricted (switching from root to a non-root user, updating file permissions in the container image). This is the target state for all application workloads.
Pod Security Admission (PSA)
Pod Security Admission is the built-in admission controller (stable since Kubernetes 1.25, enabled by default since 1.23) that enforces Pod Security Standards. It operates at the namespace level via labels.
flowchart TD
kubectl["<b>kubectl apply -f deployment.yaml</b>"]
api["<b>API Server</b>"]
psa["<b>Pod Security Admission Controller</b><br>Namespace: production"]
lookup["Look up namespace labels:<br>enforce: baseline<br>warn: restricted<br>audit: restricted"]
enforce{"<b>ENFORCE (baseline)</b><br>hostNetwork? hostPath?<br>privileged?"}
reject["REJECT (403)"]
warn{"<b>WARN (restricted)</b><br>runs as root?<br>no seccomp profile?"}
audit{"<b>AUDIT (restricted)</b><br>same checks as warn"}
allow["ALLOW<br>(pod admitted)"]
kubectl --> api --> psa --> lookup --> enforce
enforce -- "violation" --> reject
enforce -- "pass" --> warn
warn -- "violation" --> allow
warn -. "warnings shown to user" .-> allow
allow --> audit
audit -. "violations logged for review" .-> allow
Namespace Labels
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
# Enforce: reject pods that violate the profile
pod-security.kubernetes.io/enforce: baseline
pod-security.kubernetes.io/enforce-version: v1.30
# Warn: allow but show warnings for violations
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: v1.30
# Audit: allow but log violations
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/audit-version: v1.30
The three modes serve different purposes:
- enforce — Hard block. The pod is rejected with a 403 error. Use for the profile you are confident about.
- warn — Soft signal. The pod is admitted, but the user sees a warning in their kubectl output. Use for the profile you are migrating toward.
- audit — Silent logging. The pod is admitted, and the violation is recorded in the audit log. Use for monitoring before tightening.
The recommended pattern is to enforce the current standard and warn/audit at the next level up. This gives teams visibility into what would break if you tightened the policy.
Version Pinning
The *-version labels pin the profile to a specific Kubernetes version’s definition. This prevents surprise breakage when you upgrade the cluster: a new Kubernetes version might add new checks to the Restricted profile, and pinning ensures the old definition is used until you explicitly update.
# Pin to v1.30 definitions regardless of cluster version
pod-security.kubernetes.io/enforce-version: v1.30
# Use "latest" to always get the current version's definitions
pod-security.kubernetes.io/enforce-version: latest
Migration from PodSecurityPolicy
PodSecurityPolicy (PSP) was removed in Kubernetes 1.25. If your cluster still relies on PSP, the migration to PSA follows a deliberate progression:
| Step | Action | Namespace Label |
|---|---|---|
| 1. AUDIT | Add audit labels to all namespaces. Review audit logs for violations. No impact on running workloads. | audit: restricted |
| 2. WARN | Add warn labels. Developers see warnings when deploying non-compliant pods. Still no enforcement. | warn: restricted |
| 3. FIX | Update workloads to comply: set runAsNonRoot: true and seccompProfile: RuntimeDefault, drop all capabilities, switch to non-root base images | (no label change) |
| 4. ENFORCE | Add enforce labels. Non-compliant pods are rejected. Remove PSP resources. | enforce: baseline, warn: restricted |
| 5. TIGHTEN | Move enforcement from baseline to restricted as workloads are updated. | enforce: restricted |
Common Migration Fixes
Running as non-root:
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
containers:
- name: app
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
seccompProfile:
type: RuntimeDefault
Choosing a non-root base image:
FROM node:20-slim
# Create non-root user
RUN groupadd -r app && useradd -r -g app -d /app app
WORKDIR /app
COPY --chown=app:app . .
USER app
Namespace Exemptions
Some namespaces legitimately need Privileged access. The PSA admission controller supports exemptions configured at the API server level:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
configuration:
apiVersion: pod-security.admission.config.k8s.io/v1
kind: PodSecurityConfiguration
defaults:
enforce: baseline
enforce-version: latest
warn: restricted
warn-version: latest
audit: restricted
audit-version: latest
exemptions:
usernames: []
runtimeClasses: []
namespaces:
- kube-system # System components need privileges
- monitoring # Node exporters need host access
- storage-system # CSI drivers need host access
This configuration sets cluster-wide defaults (enforce baseline, warn restricted) and exempts specific namespaces. Exempt namespaces are not subject to PSA checks at all, so apply RBAC and other controls carefully.
When to Supplement with Kyverno or Gatekeeper
Pod Security Standards cover the most critical pod-level controls, but they are intentionally limited in scope. They do not address:
- Image registry restrictions (only allow images from approved registries)
- Required labels or annotations (every pod must have team and cost-center labels)
- Resource limits (every container must have CPU and memory limits)
- Specific capability requirements (allow NET_RAW for ping utilities)
- Per-workload exceptions (allow hostNetwork for a specific DaemonSet but not others)
- Custom validation (container images must use digest references, not tags)
For these use cases, supplement PSA with Kyverno or OPA Gatekeeper. The recommended pattern:
- PSA handles the broad security baseline (enforce at the namespace level, zero configuration per workload)
- Kyverno/Gatekeeper handles fine-grained policies (per-resource exceptions, organizational standards, image policies)
# Kyverno: require resource limits on all containers
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-resource-limits
spec:
validationFailureAction: Enforce
rules:
- name: check-limits
match:
any:
- resources:
kinds:
- Pod
validate:
message: "All containers must have CPU and memory limits."
pattern:
spec:
containers:
- resources:
limits:
cpu: "?*"
memory: "?*"
A Practical Security Posture
RECOMMENDED PSA CONFIGURATION
───────────────────────────────
Namespace Type Enforce Warn Audit
────────────── ─────── ──── ─────
kube-system privileged --- ---
monitoring privileged --- ---
storage-system privileged --- ---
application-dev baseline restricted restricted
application-staging restricted restricted restricted
application-production restricted restricted restricted
Start with baseline enforcement for dev namespaces.
Move to restricted as workloads are updated.
Production should enforce restricted from the start
for new applications.
Common Mistakes and Misconceptions
- “Running as root in a container is the same as root on the host.” Without proper configuration, it can be. Container root can escape to host root via privileged containers, host mounts, or kernel exploits. Always set runAsNonRoot: true and drop capabilities.
- “Pod Security Standards are optional.” Without enforcement (via PSA namespace labels or admission controllers), any user who can create pods can create privileged pods. Default to restricted and grant exceptions explicitly.
- “My application needs privileged: true.” Very few applications genuinely need host-level access. Most cases can be solved with specific Linux capabilities (NET_BIND_SERVICE, SYS_PTRACE) instead of full privilege.
Further Reading
- Pod Security Standards — Profile definitions
- Pod Security Admission — Enforcement mechanism
- Migrate from PodSecurityPolicy — Migration guide
- Kyverno policies — Policy library for supplemental controls
Part 6 shifts from securing workloads to scaling them.
Next: Horizontal Pod Autoscaler
Chapter 30: Horizontal Pod Autoscaler
A deployment with a fixed replica count is a bet that traffic will stay constant. Traffic never stays constant. If you guess too low, pods become overloaded and latency spikes. If you guess too high, you pay for idle compute around the clock. The Horizontal Pod Autoscaler (HPA) replaces this guessing game with a feedback loop: measure demand, compute the right number of replicas, and adjust — continuously.
Understanding HPA from first principles requires understanding the algorithm it uses, the metrics it consumes, how to extend it beyond built-in metrics, and the tuning knobs that prevent it from behaving erratically.
The Scaling Algorithm
The HPA controller runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). Each iteration executes a single formula:
desiredReplicas = ceil( currentReplicas * ( currentMetricValue / desiredMetricValue ) )
This is a proportional controller. If you have 4 replicas running at 80% CPU and your target is 50% CPU, the math is:
desiredReplicas = ceil( 4 * (80 / 50) ) = ceil( 6.4 ) = 7
The HPA will scale from 4 to 7 replicas. When those 7 replicas bring average CPU down to 45%, the formula produces:
desiredReplicas = ceil( 7 * (45 / 50) ) = ceil( 6.3 ) = 7
No change. The system has stabilized.
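The two iterations above can be checked numerically; hpa_desired below is a throwaway helper for this page, not part of any Kubernetes tooling:

```shell
# desired = ceil(current * currentMetric / targetMetric)
hpa_desired() {
  awk -v c="$1" -v m="$2" -v t="$3" 'BEGIN {
    d = c * m / t
    i = int(d); if (d > i) i++   # ceil for positive values
    print i
  }'
}
hpa_desired 4 80 50   # 7  (scale up from 4)
hpa_desired 7 45 50   # 7  (stable)
```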
The 10% Tolerance Band
To prevent constant oscillation around the target, the HPA applies a tolerance of 0.1 (10%). If the ratio currentMetric / desiredMetric falls within [0.9, 1.1], the HPA takes no action. This dead zone prevents the controller from chasing noise.
sequenceDiagram
participant M as Metrics API
participant H as HPA Controller
participant D as Deployment
participant P as Pods
loop Every 15 seconds
H->>M: Fetch current metrics
M-->>H: CPU 80%, target 60%
Note right of H: ratio = 80/60 = 1.33<br>Outside tolerance (0.9–1.1)<br>desiredReplicas = ceil(current * 1.33)<br>Clamp to [min, max]
H->>D: Patch .spec.replicas
D->>P: Create new pods (or terminate)
P-->>M: Report metrics via cAdvisor
end
Default Metrics: CPU and Memory
The simplest HPA targets CPU utilization:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-frontend
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-frontend
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
Critical prerequisite: CPU utilization is computed as a percentage of the pod’s resource request. If your pods do not have resources.requests.cpu set, the HPA cannot compute utilization and will refuse to scale. This is the single most common HPA misconfiguration.
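Concretely, the Deployment’s pod template must carry a CPU request for the Utilization target above to be computable (container name and values are illustrative):

```yaml
# Pod template fragment: utilization % is measured against the request
containers:
- name: web
  resources:
    requests:
      cpu: 500m     # with a 60% target, the HPA aims for ~300m average usage
    limits:
      cpu: "1"
```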
You can target memory the same way, but memory-based scaling is tricky. Many applications (JVM, Python, Go with large heaps) allocate memory and never release it. Scaling up works, but scaling down may never trigger because memory consumption does not drop when load drops.
Custom Metrics via Prometheus Adapter
Built-in CPU and memory metrics are crude. Most services should scale on business-relevant metrics: requests per second, queue depth, p99 latency. The custom metrics API (custom.metrics.k8s.io) provides the abstraction; Prometheus Adapter is the most common implementation that bridges Prometheus metrics into this API.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-server
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 3
maxReplicas: 50
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
The Prometheus Adapter configuration maps PromQL queries to Kubernetes metric names. When the HPA asks “what is the current value of http_requests_per_second for deployment api-server?”, the adapter executes the corresponding PromQL query and returns the result.
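A sketch of such a mapping in the adapter’s configuration — the metric and label names are assumptions, while the rule fields (seriesQuery, resources.overrides, name, metricsQuery) follow prometheus-adapter’s rules format:

```yaml
# prometheus-adapter rule: expose http_requests_total as the
# per-pod custom metric http_requests_per_second
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}   # map Prometheus labels to K8s objects
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```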
KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) does not replace HPA — it extends it. KEDA solves two problems that HPA cannot:
- Zero-to-one scaling. HPA’s minReplicas must be at least 1. KEDA can scale a deployment to zero and activate it when an event arrives.
- Diverse event sources. KEDA ships with 60+ scalers: Kafka consumer lag, AWS SQS queue depth, Azure Service Bus, Redis streams, Cron schedules, PostgreSQL query results, and more. Adding a new metric source requires no adapter installation — just a ScaledObject manifest.
KEDA Architecture
KEDA installs two components:
- Operator (keda-operator): Watches ScaledObject and ScaledJob CRDs. When scaling from 0 to 1, KEDA directly modifies the deployment’s replica count. For scaling from 1 to N, KEDA creates and manages an HPA resource, feeding it metrics through the second component.
- Metrics Adapter (keda-operator-metrics-apiserver): Implements the external metrics API (external.metrics.k8s.io). The HPA that KEDA creates targets metrics served by this adapter.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: order-processor
spec:
scaleTargetRef:
name: order-processor
minReplicaCount: 0 # Scale to zero when idle
maxReplicaCount: 100
triggers:
- type: kafka
metadata:
bootstrapServers: kafka:9092
consumerGroup: orders
topic: incoming-orders
lagThreshold: "50"
When the Kafka consumer lag for the orders group exceeds 50, KEDA activates the deployment (0 to 1), then the HPA scales from 1 to N based on how far the lag exceeds the threshold.
HPAv2 Behavior Tuning
The autoscaling/v2 API introduced the behavior field, which provides fine-grained control over how fast the HPA scales up and down. Without tuning, the HPA can oscillate: a traffic spike causes rapid scale-up, load drops as new pods absorb traffic, the HPA immediately scales down, load spikes again.
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
policies:
- type: Percent
value: 10
periodSeconds: 60 # Remove at most 10% of pods per minute
- type: Pods
value: 2
periodSeconds: 60 # Or at most 2 pods per minute
selectPolicy: Min # Use whichever policy removes FEWER pods
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Percent
value: 100
periodSeconds: 15 # Double pod count every 15 seconds
- type: Pods
value: 4
periodSeconds: 15 # Or add 4 pods every 15 seconds
selectPolicy: Max # Use whichever policy adds MORE pods
Key Concepts
- stabilizationWindowSeconds: The HPA considers all recommended replica counts over this window and picks the most conservative one: the highest recommendation for scale-down (so it does not shed replicas while a recent recommendation still called for more) and the lowest recommendation for scale-up (so a momentary spike does not trigger a burst of new pods). A 300-second scale-down window means the HPA will not reduce replicas until no recommendation within the past 5 minutes called for a higher count, preventing premature scale-down after a traffic burst.
- Policies (Percent vs Pods): Each policy defines a maximum change rate. `Percent: 10` means remove at most 10% of current replicas; `Pods: 2` means remove at most 2 pods. You can combine multiple policies.
- selectPolicy: When multiple policies exist, `Min` picks the one that changes the least (conservative), `Max` picks the one that changes the most (aggressive), and `Disabled` prevents scaling in that direction entirely.
General wisdom: scale up aggressively (`selectPolicy: Max` with no stabilization window), scale down conservatively (`selectPolicy: Min` with a long stabilization window). Running a few extra pods for a few minutes is almost always cheaper than dropping requests during a scale-up delay.
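The policy arithmetic is simple enough to compute directly. An illustrative model of the semantics described above (not the controller's actual code):

```python
import math

def allowed_scale_down(current, policies, select_policy="Min"):
    """Lowest replica count one period may reach, given scale-down
    policies expressed as ("Percent", 10) or ("Pods", 2)."""
    floors = []
    for kind, value in policies:
        if kind == "Percent":
            removable = math.floor(current * value / 100)
        else:  # "Pods"
            removable = value
        floors.append(current - removable)
    # Min = conservative: the policy that removes FEWER pods (higher floor).
    # Max = aggressive: the policy that removes MORE pods (lower floor).
    return max(floors) if select_policy == "Min" else min(floors)

# 40 replicas, "10% per minute" vs "2 pods per minute":
print(allowed_scale_down(40, [("Percent", 10), ("Pods", 2)]))         # 38
print(allowed_scale_down(40, [("Percent", 10), ("Pods", 2)], "Max"))  # 36
```

With `Min`, the 2-pod policy wins (removes fewer); with `Max`, the 10% policy wins. This is why the scale-down example above drops at most 2 pods per minute.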
Common Pitfalls
Metrics lag. The metrics pipeline introduces latency. cAdvisor scrapes every 10–15 seconds. Metrics Server aggregates. The HPA polls every 15 seconds. End-to-end, there can be 30–60 seconds between a load spike and the HPA deciding to scale. For latency-sensitive services, consider scaling on leading indicators (queue depth, connection count) rather than lagging indicators (CPU).
Thrashing. Without behavior tuning, the HPA can oscillate between two replica counts every loop iteration. The stabilization window and policy limits exist to prevent this. If you see SuccessfulRescale events alternating between scale-up and scale-down, increase the stabilization window.
Cold start. New pods take time to start (image pull, init containers, JVM warmup, cache loading). The HPA sees new pods as “not yet reporting metrics” and may scale up further before the first wave is ready. Use readiness probes with appropriate initial delays and consider scaleUp.stabilizationWindowSeconds to give new pods time to absorb load.
Missing resource requests. If pods lack resources.requests.cpu, the HPA cannot compute utilization percentages and will emit FailedGetResourceMetric events. Always set resource requests on pods that will be autoscaled.
Scaling both on CPU and a custom metric. When multiple metrics are specified, the HPA computes the desired replica count for each and takes the maximum. This is usually correct (scale up if either metric is hot), but can lead to over-provisioning if metrics are poorly correlated.
Putting It Together
A production-ready HPA configuration typically combines:
- A primary business metric (requests per second, queue depth)
- A safety-net CPU metric (catches runaway computation)
- Conservative scale-down behavior (stabilization window of 5–10 minutes)
- Aggressive scale-up behavior (double capacity every 15–30 seconds)
- Reasonable min/max bounds (min = 2 for HA, max = cost limit)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: checkout-service
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: checkout-service
minReplicas: 3
maxReplicas: 40
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "500"
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
behavior:
scaleDown:
stabilizationWindowSeconds: 600
policies:
- type: Percent
value: 10
periodSeconds: 60
selectPolicy: Min
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 5
periodSeconds: 15
selectPolicy: Max
Common Mistakes and Misconceptions
- “HPA reacts instantly to traffic spikes.” End-to-end reaction time is 1–2 minutes due to metrics lag and stabilization windows (see above).
- “I can use HPA and VPA together on CPU.” HPA and VPA both try to act on CPU metrics, creating a conflict. Use HPA for horizontal scaling on CPU/memory and VPA only for non-HPA-targeted resources, or use the VPA recommendation-only mode alongside HPA.
- “Setting target CPU utilization to 50% wastes resources.” 50% target means HPA scales up when average utilization exceeds 50%. This headroom absorbs traffic spikes during the scaling delay. Setting it to 90% means pods are overloaded before new ones arrive.
- “HPA works without resource requests.” Utilization is computed as a percentage of requests; without them, CPU/memory-based HPA cannot function (see above).
Further Reading
- HPA Algorithm Details — Official algorithm documentation
- KEDA Documentation — Event-driven autoscaling
- Prometheus Adapter — Custom metrics bridge
- HPAv2 Behavior — Scaling policies reference
Next: Vertical Pod Autoscaler — Right-sizing pod resource requests with VPA, in-place resize, and Goldilocks.
Chapter 31: Vertical Pod Autoscaler and Right-Sizing
The Vertical Pod Autoscaler (VPA) adjusts pod resource requests and limits rather than replica count — a harder problem, because changing resources historically required restarting the pod. That constraint shaped VPA’s design from the beginning. In-place pod resize (alpha since Kubernetes 1.27, beta since 1.33) is expected to reach general availability around Kubernetes 1.35, after more than six years of development.
Note: In-place pod resize is a rapidly evolving feature. GA timing and API details may change; check KEP-1287 for current status.
Understanding VPA requires understanding why right-sizing matters, how VPA’s three modes work, the new in-place resize mechanism, the critical interaction between VPA and HPA, and the practical workflow for using VPA in production.
Why Right-Sizing Matters
Most teams set resource requests once during initial deployment and never revisit them. Studies of large Kubernetes clusters consistently show that only 10–15% of requested CPU is actually consumed. The remaining 85–90% is reserved but idle — the scheduler cannot assign it to other workloads because it is “spoken for.”
This waste compounds:
- Overprovisioned pods reserve resources they never use. The scheduler treats requests as firm commitments, so idle reservations block other pods from being scheduled.
- Underprovisioned pods hit CPU throttling and memory OOM kills. Teams respond by doubling requests, creating more waste.
- Node scaling follows requests, not usage. The Cluster Autoscaler adds nodes when pods cannot be scheduled, which depends on requested resources. Bloated requests cause premature node scaling.
VPA closes this loop by observing actual usage over time and recommending (or applying) appropriate resource requests.
VPA Architecture
VPA consists of three components:
flowchart TD
subgraph VPA["VPA Components"]
rec["<b>Recommender</b><br>Watches pod metrics over time<br>Builds usage histogram<br>Emits target, lowerBound,<br>upperBound"]
upd["<b>Updater</b><br>Evicts pods outside<br>recommended range<br>(Auto mode only)"]
adm["<b>Admission Webhook</b><br>Mutates pod spec at<br>creation time<br>(applies recs to new pods)"]
end
metrics["<b>Metrics Server / Prometheus</b>"]
pods["<b>Running Pods</b><br>(evict + recreate)"]
api["<b>API Server</b><br>(pod create admission)"]
rec --> metrics
upd --> pods
adm --> api
- Recommender: Continuously observes pod resource usage (via Metrics API or Prometheus) and computes recommendations. It maintains a decaying histogram of usage patterns and outputs four values per container: `lowerBound`, `target`, `uncappedTarget`, and `upperBound`.
- Updater: In Auto mode, compares running pods’ resource requests against the recommended range. If a pod’s requests fall outside the `[lowerBound, upperBound]` range, the Updater evicts it so it can be recreated with updated requests.
- Admission Webhook: Intercepts pod creation requests and mutates the resource requests to match the VPA’s current recommendation. This is how the updated values actually get applied — the Updater evicts the old pod, the Deployment creates a replacement, and the Admission Webhook sets the recommended requests on the new pod.
The Three Modes (Plus One)
VPA operates in one of four modes, set via updatePolicy.updateMode:
Off (Recommendation Only)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-server-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
updatePolicy:
updateMode: "Off"
The Recommender computes and stores recommendations, but neither the Updater nor the Admission Webhook applies them. You read recommendations from the VPA status and decide manually whether to act. This is the safest starting point.
kubectl get vpa api-server-vpa -o jsonpath='{.status.recommendation}' | jq .
Initial
The Admission Webhook applies recommendations to new pods at creation time, but the Updater does not evict running pods. Existing pods keep their current requests until they are restarted for other reasons (deployment rollout, node drain, crash). This is useful for gradual rollouts — new pods get right-sized, old pods are unaffected.
Auto (Recreate)
Both the Updater and Admission Webhook are active. The Updater will evict pods whose requests are outside the recommended range, causing them to be recreated with new requests. This provides fully automated right-sizing but causes pod restarts, which can be disruptive for stateful workloads or services with long startup times.
InPlaceOrRecreate (New)
With in-place pod resize approaching GA, VPA is gaining a fourth mode (the name InPlaceOrRecreate is used here but may differ in the final implementation — check the VPA documentation for your version). In this mode, VPA first attempts to resize the pod in place — updating its resource requests without restarting it. If in-place resize is not possible (for example, the new requests exceed node capacity), VPA falls back to the Recreate behavior and evicts the pod.
This is the mode most teams should target once their clusters support in-place resize at GA.
In-Place Pod Resize
In-place pod resize was proposed in KEP-1287 and has been in development for over six years (alpha in 1.27, beta in 1.33). The core challenge was that Kubernetes originally treated a pod’s resource requests as immutable — changing them required deleting and recreating the pod.
With in-place resize, you can patch a running pod’s spec.containers[*].resources.requests and spec.containers[*].resources.limits, and the kubelet will apply the change to the running container’s cgroup without restarting it. Resize progress is reported in pod status: the alpha API used a status.resize field (Proposed, InProgress, Deferred, Infeasible); the beta surfaces it via pod conditions such as PodResizePending and PodResizeInProgress.
For CPU, this is straightforward — the kubelet adjusts the CFS quota. For memory, it is more complex. Increasing memory limits is safe (just raise the cgroup limit). Decreasing memory limits can only succeed if the container’s current resident memory is below the new limit.
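Concretely, on a cluster and kubectl new enough to support it, a resize goes through the dedicated resize subresource. A sketch (pod and container names are illustrative):

```shell
# Raise CPU request/limit on a running container without a restart
kubectl patch pod api-server-0 --subresource resize --patch '
spec:
  containers:
  - name: api
    resources:
      requests: {cpu: "750m"}
      limits:   {cpu: "1500m"}'

# Inspect the resize status reported by the kubelet
kubectl describe pod api-server-0
```

In the beta API, resizes submitted directly against the pod spec (without the subresource) are rejected, which is why VPA's new mode and any tooling must target it explicitly.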
VPA and HPA Interaction
Never use VPA and HPA on the same metric for the same workload. This is the most critical rule. If both VPA and HPA target CPU:
- Load increases. CPU utilization rises.
- HPA wants to add more pods.
- VPA wants to increase per-pod CPU requests.
- VPA increases requests. Utilization (relative to new, higher request) drops.
- HPA sees lower utilization and scales down.
- Fewer pods mean higher per-pod load. Cycle repeats.
The result is oscillation and instability.
Safe combinations:
- HPA scales on a custom metric (requests per second, queue depth). VPA manages CPU and memory requests. They operate on orthogonal signals.
- HPA scales on CPU. VPA is in `Off` mode (recommendation only), and a human periodically adjusts requests based on VPA’s suggestions.
- Use the Multidimensional Pod Autoscaler (MPA) from Google, which coordinates horizontal and vertical scaling decisions in a single controller.
The Right-Sizing Workflow
For production workloads, the recommended approach is deliberate and manual:
Step 1: Deploy VPA in Off mode. Attach a VPA with updateMode: "Off" to your deployment. Let it observe for at least 7 days to capture weekly traffic patterns.
Step 2: Collect recommendations. Read the VPA status. The target field is what VPA would set. The lowerBound and upperBound define the acceptable range.
Step 3: Analyze. Compare VPA’s target against current requests. If the target is significantly lower, your pods are overprovisioned. If higher, they are underprovisioned. Cross-reference with actual OOM kills and CPU throttling events.
Step 4: Set manual requests. Update your deployment manifests with the recommended values. Use the target as the request and upperBound as the limit (or no limit for CPU — see Chapter 33). Deploy via your normal rollout process.
Step 5: Repeat. Traffic patterns change. Revisit VPA recommendations quarterly.
Goldilocks: Automated Recommendations at Scale
Running kubectl get vpa across hundreds of deployments is tedious. Goldilocks (by Fairwinds) automates VPA recommendation collection and presents it as a dashboard.
Goldilocks creates a VPA in Off mode for every deployment in labeled namespaces, then serves a web UI showing current requests versus VPA recommendations for every container. It provides both “guaranteed” (VPA upper bound) and “burstable” (VPA target) suggestions.
# Label namespaces for Goldilocks
kubectl label namespace production goldilocks.fairwinds.com/enabled=true
# Goldilocks creates VPAs automatically and serves a dashboard
This is the fastest path to answering “how much are we wasting across the entire cluster?” without changing any workload behavior.
Resource Policy: Constraining VPA
You can constrain VPA’s recommendations with a resource policy to prevent it from setting values too low (risking OOM kills) or too high (wasting resources):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-server-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: api-server
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 4
memory: 8Gi
controlledResources: ["cpu", "memory"]
- containerName: sidecar
mode: "Off" # Don't touch the sidecar
The mode: "Off" per container is particularly useful for sidecars (Istio proxies, log collectors) that should retain their manually tuned requests.
Common Mistakes and Misconceptions
- “VPA automatically right-sizes my pods.” In `updateMode: Auto`, VPA evicts and recreates pods with new resources, causing restarts. Use `Off` mode to get recommendations without disruption, then apply them during planned maintenance.
- “VPA recommendations are immediately correct.” VPA needs days to weeks of historical data to produce good recommendations. Initial recommendations based on hours of data are often wrong. Let it observe through at least one full traffic cycle.
- “I should apply VPA to every workload.” VPA’s pod-restart behavior makes it unsuitable for workloads that can’t tolerate restarts (single-replica databases, leader-election services). Use it for stateless services with multiple replicas.
Further Reading
- VPA Documentation — Official VPA repository
- KEP-1287: In-Place Pod Resize — The six-year journey to in-place resize
- Goldilocks — VPA recommendation dashboard
- Multidimensional Pod Autoscaler — Coordinated horizontal + vertical scaling
Next: Node Scaling — Cluster Autoscaler, Karpenter, and the architecture of node-level scaling.
Chapter 32: Node Scaling: Cluster Autoscaler and Karpenter
When pods cannot be scheduled due to insufficient cluster capacity, the system must provision new nodes. When nodes sit idle, it must remove them.
Two tools dominate this space: the Cluster Autoscaler, which has been the standard since 2016, and Karpenter, which rethinks node provisioning from first principles. Understanding both requires understanding why one emerged to replace the other and the architectural difference that makes Karpenter faster, cheaper, and simpler.
Cluster Autoscaler
The Cluster Autoscaler (CA) is a Kubernetes controller that watches for pods stuck in the Pending state due to insufficient resources. When it finds them, it asks the cloud provider to add nodes. When nodes are underutilized, it drains and removes them.
How It Works
flowchart LR
pending["Pending Pods"] --> CA["Cluster<br>Autoscaler"]
CA -->|"which group<br>can fit?"| A["Node Group A<br>m5.large"]
CA -->|"which group<br>can fit?"| B["Node Group B<br>m5.4xlarge"]
CA -->|"which group<br>can fit?"| C["Node Group C<br>p3.2xlarge (GPU)"]
A --> cloud["Cloud API<br>+1 node"]
B --> cloud
C --> cloud
Key constraint: Each node group is a fixed pool of identical instances. CA picks a group and increments its count — it cannot mix instance types or optimize across groups.
The critical abstraction is the node group (called Auto Scaling Group on AWS, Managed Instance Group on GCP, VM Scale Set on Azure). Each node group is a pool of identically configured nodes: same instance type, same labels, same taints. The Cluster Autoscaler does not provision individual machines — it increments or decrements a node group’s desired count.
The Latency Problem
Cluster Autoscaler’s end-to-end scaling latency typically runs 3–4 minutes:
- Detection (0–30s): CA polls for unschedulable pods every 10 seconds.
- Decision (10–30s): CA simulates scheduling against each node group template.
- Cloud API (30–60s): The cloud provider acknowledges the scale-up request.
- Instance launch (60–120s): The VM boots, pulls the OS image, starts kubelet.
- Node ready (10–30s): kubelet registers with the API server, node passes health checks.
For workloads that can tolerate minutes of latency, this is acceptable. For latency-sensitive services, it is not.
Multi-Cloud Support
CA’s strength is breadth. It supports AWS, GCP, Azure, OpenStack, vSphere, and more. For teams running Kubernetes on-premise or on non-AWS clouds, CA is often the only option.
Karpenter
Karpenter takes a fundamentally different approach. Instead of managing node groups, it provisions individual nodes with the exact size and configuration needed for the pending pods. There are no pre-defined node groups. Karpenter looks at what pods need and picks the optimal instance type, availability zone, and purchase option (on-demand vs spot) in a single step.
Architecture
flowchart LR
pending2["Pending Pods<br>(batched 10s)"] --> K["Karpenter<br>bin-pack + select"]
K -->|"best fit from<br>full catalog"| cloud2["Cloud API<br>launch exact instance"]
cloud2 --> ex1["m5.2xlarge<br>spot, us-east-1b"]
cloud2 --> ex2["c5.xlarge<br>on-demand, us-east-1a"]
Key difference: No node groups. Karpenter evaluates the full instance type catalog, bin-packs pending pods, and launches exactly the right instance — type, size, AZ, and purchase option — in a single API call.
Why Karpenter Is Architecturally Superior
The node group abstraction that Cluster Autoscaler depends on is the root cause of most of its limitations:
Instance type rigidity. A node group has a single instance type (or a mixed-instance policy with limitations). If your workload needs 7.5 GB of memory, and your node group uses m5.large (8 GB), you waste very little. But if it needs 9 GB, you must either use a different node group with m5.xlarge (16 GB) — wasting 7 GB — or create a new node group for every size bracket. In practice, teams maintain 3–10 node groups, each an imperfect approximation.
Karpenter eliminates this entirely. It evaluates the full instance type catalog and picks the cheapest instance that fits the pending pods after bin-packing. If three pending pods need 2 CPU + 4 GB each, Karpenter might choose a single m5.xlarge (4 CPU, 16 GB) rather than three separate nodes.
Scaling speed. Karpenter’s end-to-end latency is approximately 60–90 seconds — roughly 2–3x faster than Cluster Autoscaler. It skips the node group indirection and calls the cloud API directly. It also batches pending pods for 10 seconds before making a decision, which produces better bin-packing.
The following sequence diagram shows the timing of each step in Karpenter’s scaling cascade — notice the 10-second batching window that enables better bin-packing:
sequenceDiagram
participant PP as Pending Pod
participant K as Karpenter
participant EC2 as EC2 Fleet API
participant VM as New EC2
participant KL as kubelet
participant S as Scheduler
participant P as Pod
Note over PP,K: 0-10s
PP->>K: detected (unschedulable)
Note over K: batch pending pods (10s wait)
Note over K: select optimal instance type (bin-pack)
K->>EC2: CreateFleet (spot/OD, AZ, instance type)
EC2-->>K: fleet accepted (~5s)
EC2->>VM: launch VM
Note over VM,KL: ~20-30s
VM->>KL: boot OS, start kubelet
Note over KL: ~5s
KL->>KL: register node with API server
KL->>S: node Ready
S->>P: bind pod to node
Note over PP,P: ~60-90s total end-to-end
Note over P: Running
Consolidation. Karpenter continuously evaluates whether existing nodes can be consolidated. If node A is 30% utilized and node B is 25% utilized, Karpenter can cordon both, move their pods to a single smaller node, and terminate the originals. Cluster Autoscaler can only scale down nodes that are underutilized — it cannot replace a node with a smaller one.
Disruption budgets. Karpenter respects NodePool disruption budgets that control how many nodes can be disrupted simultaneously during consolidation, drift remediation, or node expiry. This prevents consolidation from causing service disruptions.
Karpenter Configuration
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: general-purpose
spec:
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand", "spot"]
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["m5", "m6i", "m6a", "c5", "c6i"]
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
limits:
cpu: "1000"
memory: 2000Gi
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 1m
budgets:
- nodes: "10%"
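The NodePool above references an `EC2NodeClass`, which carries the AWS-specific launch details. A minimal sketch — the IAM role name and discovery tag value are placeholders for your cluster's own values:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest              # track the latest Amazon Linux 2023 AMI
  role: KarpenterNodeRole-my-cluster    # placeholder IAM role for nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

Separating the NodePool (scheduling constraints, limits, disruption policy) from the NodeClass (cloud launch details) lets multiple pools share one launch configuration.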
Comparison
| Aspect | Cluster Autoscaler | Karpenter |
|---|---|---|
| Abstraction | Node groups (ASG/MIG) | Direct instance provisioning |
| Instance selection | Fixed per node group | Dynamic per scheduling batch |
| Scale-up latency | 3–4 minutes | ~60–90 seconds |
| Scale-down | Remove underutilized nodes | Consolidation (replace + remove) |
| Bin-packing | Limited (one group at a time) | Cross-instance-type optimization |
| Spot handling | Mixed instance policies | First-class, per-node decisions |
| Cloud support | AWS, GCP, Azure, others | AWS (GA), Azure (GA via AKS Node Autoprovision since late 2024) |
| Configuration | Node groups + CA flags | NodePool CRDs |
| Maturity | 8+ years, battle-tested | Younger, rapidly maturing |
When to Use Each
Use Cluster Autoscaler when:
- You run on GCP, OpenStack, vSphere, or bare metal
- Your organization requires the stability of a long-established project
- You have existing node group infrastructure and limited appetite for migration
Use Karpenter when:
- You run on AWS or Azure (AKS Node Autoprovision, GA since late 2024)
- You want faster scaling, better bin-packing, and automated cost optimization
- You are starting a new cluster or willing to migrate from node groups
- You run diverse workloads that benefit from flexible instance type selection
For most AWS-based clusters starting today, Karpenter is the better default. Its consolidation alone typically reduces node costs by 20–35% compared to Cluster Autoscaler with static node groups.
Common Mistakes and Misconceptions
- “Cluster Autoscaler scales down immediately.” CA waits 10 minutes (the default `scale-down-delay-after-add`) before considering a node for removal, then checks whether its pods can be moved safely. Scale-down is intentionally conservative.
- “Spot/preemptible instances are unreliable for anything.” With proper pod disruption budgets, multiple instance types, and spread across availability zones, spot instances work well for stateless services. Karpenter handles spot interruptions by proactively replacing nodes.
Further Reading
- Cluster Autoscaler FAQ — Detailed behavior documentation
- Karpenter Documentation — Official Karpenter docs
- Karpenter Best Practices — AWS EKS guide
- Karpenter Migration Guide — Migrating from Cluster Autoscaler
Next: Resource Tuning Deep Dive — CFS quotas, CPU throttling, QoS classes, and why removing CPU limits sometimes improves performance.
Chapter 33: Resource Tuning Deep Dive
CPU requests and limits translate directly into Linux cgroup parameters. Getting them wrong causes throttling on idle nodes, random OOM kills, and wasted capacity at scale. Understanding resource tuning from first principles requires understanding the kernel mechanisms themselves.
CFS Quota Mechanics
When you set a CPU limit on a container, Kubernetes translates it into two cgroup v2 parameters (or cgroup v1 equivalents):
- `cpu.cfs_period_us`: The length of the scheduling period, 100,000 microseconds (100ms) by default.
- `cpu.cfs_quota_us`: The total CPU time the container may consume within each period.
The formula is:
cpu.cfs_quota_us = cpu_limit * cpu.cfs_period_us
For a container with a CPU limit of 500m (half a core):
cpu.cfs_quota_us = 0.5 * 100,000 = 50,000 us
This means the container can use at most 50ms of CPU time in every 100ms period. If it uses its 50ms in the first 30ms of the period, the kernel throttles it — the container gets zero CPU for the remaining 70ms, even if the node’s other cores are completely idle.
CFS PERIOD AND QUOTA
─────────────────────
cpu.cfs_period_us = 100,000 (100ms)
cpu.cfs_quota_us = 50,000 (50ms) ← limit: 500m
Period 1 Period 2
├──────────────────────────┤──────────────────────────┤
│████████████░░░░░░░░░░░░░░│████████████░░░░░░░░░░░░░░│
│← 50ms used →│← throttled │← 50ms used →│← throttled │
│ │ 50ms →│ │ 50ms →│
└──────────────────────────┘──────────────────────────┘
Container bursts to full speed, exhausts quota in 50ms,
then sits idle for 50ms. Latency spikes every 100ms.
Multi-threaded container with limit: 1000m (1 core)
and 4 threads running simultaneously:
├──────────────────────────┤
│ Thread 1: ██████ (25ms) │
│ Thread 2: ██████ (25ms) │ Total: 100ms of CPU time
│ Thread 3: ██████ (25ms) │ consumed in first 25ms
│ Thread 4: ██████ (25ms) │ of wall-clock time
│ │
│ ALL THREADS THROTTLED │ Quota exhausted.
│ for remaining 75ms │ 75ms of wall-clock
│░░░░░░░░░░░░░░░░░░░░░░░░░ │ latency added.
└──────────────────────────┘
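The quota arithmetic, and the worst-case stall it produces, can be computed directly. An illustrative sketch of the formulas above:

```python
PERIOD_US = 100_000  # cpu.cfs_period_us default: 100ms

def cfs_quota_us(limit_millicores: int) -> int:
    """cpu.cfs_quota_us = limit (in cores) * period."""
    return limit_millicores * PERIOD_US // 1000

def worst_case_stall_ms(limit_millicores: int, threads: int) -> float:
    """If `threads` threads all run flat out, the quota is exhausted after
    quota/threads of wall-clock time; the rest of the period is a stall."""
    quota_ms = cfs_quota_us(limit_millicores) / 1000
    burn_ms = quota_ms / threads           # wall-clock time to exhaust quota
    return max(0.0, 100 - burn_ms)         # stalled remainder of the period

print(cfs_quota_us(500))              # 50000 us: 50ms of CPU per 100ms period
print(worst_case_stall_ms(1000, 4))   # 75.0 ms throttled, as in the diagram
```

Note how the multi-threaded case compounds: the same 1-core limit that never throttles a single thread stalls a 4-thread container for three quarters of every period.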
The Throttling Paradox
This is the most counterintuitive aspect of CPU limits: a container can be heavily throttled even when the node has plenty of idle CPU. CFS quotas are enforced per-container, regardless of overall node utilization. The kernel does not say “the node is 30% utilized, let this container use more.” It says “this container has used its quota for this period, stop.”
You can observe throttling via:
# cgroup v2
cat /sys/fs/cgroup/<pod-cgroup>/cpu.stat
# Look for:
# nr_throttled ← number of times throttled
# throttled_usec ← total time spent throttled (microseconds)
Or via Prometheus:
rate(container_cpu_cfs_throttled_periods_total[5m])
/ rate(container_cpu_cfs_periods_total[5m])
A throttle ratio above 10–20% indicates the limit is actively harming performance.
Why NOT Setting CPU Limits Is Sometimes Better
For bursty workloads — web servers, API gateways, batch processors — CPU usage is spiky. A request handler might be idle for 95ms, then need 40ms of CPU to process a request. With a 500m limit, the container has 50ms of quota per period, which is enough for the burst. But if two requests arrive in the same period, the container needs 80ms, exhausts its 50ms quota partway through, and stalls until the next period begins, adding tens of milliseconds of latency.
Removing the CPU limit entirely allows the container to burst to whatever the node can provide. The container still has a CPU request, which guarantees it a minimum share of CPU via CFS weight (the cpu.weight cgroup parameter). Requests affect scheduling and provide a proportional minimum, but without a limit, there is no hard ceiling.
resources:
requests:
cpu: 500m # Guaranteed minimum share
memory: 256Mi
limits:
# cpu: omitted # no hard ceiling; container can burst
memory: 512Mi # Memory limits should ALWAYS be set
When to remove CPU limits:
- Web servers, API handlers, and other latency-sensitive, bursty workloads
- When throttling metrics show significant throttle ratios
- When the cluster has spare CPU capacity (requests < node allocatable)
When to keep CPU limits:
- Multi-tenant clusters where one workload could starve others
- Batch jobs that would happily consume every available core
- Environments that require Guaranteed QoS class (limits must equal requests)
Always keep memory limits. Unlike CPU (which throttles), exceeding a memory limit causes the OOM killer to terminate the container. Memory is an incompressible resource — the kernel cannot “slow down” memory usage the way it can pause CPU access.
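The "guaranteed minimum share" comes from the request-to-weight mapping. A sketch of the standard conversions (the shares formula is how Kubernetes derives cgroup v1 `cpu.shares` from a request; the second formula is the shares-to-`cpu.weight` conversion used by OCI runtimes on cgroup v2):

```python
def millicores_to_shares(millicores: int) -> int:
    # Kubernetes: cpu.shares = milliCPU * 1024 / 1000 (minimum 2)
    return max(2, millicores * 1024 // 1000)

def shares_to_cgroup2_weight(shares: int) -> int:
    # OCI runtime conversion: map shares [2, 262144] onto weight [1, 10000]
    return 1 + ((shares - 2) * 9999) // 262142

req = 500  # a 500m CPU request
shares = millicores_to_shares(req)
print(shares)                             # 512
print(shares_to_cgroup2_weight(shares))   # 20
```

Under contention, CPU is divided proportionally to these weights, so a 1000m request always outcompetes a 500m request 2:1, limit or no limit.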
QoS Classes
Kubernetes assigns every pod a Quality of Service class based on its resource configuration. QoS determines eviction priority when a node runs out of resources.
QoS CLASSES AND EVICTION ORDER
────────────────────────────────
EVICTED FIRST EVICTED LAST
◄──────────────────────────────────────────────────────►
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│ BestEffort │ │ Burstable │ │ Guaranteed │
│ │ │ │ │ │
│ No requests │ │ Requests │ │ requests == │
│ No limits │ │ set, but │ │ limits for │
│ │ │ limits != │ │ every container │
│ First to │ │ requests │ │ in every pod │
│ die under │ │ (or limits │ │ │
│ pressure │ │ missing) │ │ Last to die │
└──────────────┘ └──────────────┘ └──────────────────┘
BestEffort: No resource requests or limits on any container. These pods are scheduled wherever there is room and are the first evicted. Appropriate only for truly disposable workloads (background log cleanup, test pods).
Burstable: At least one container has a request or limit, but they are not equal. This is the most common class. Eviction order within Burstable is based on how far the pod exceeds its requests.
Guaranteed: Every container in the pod has requests equal to limits for both CPU and memory. These pods get the highest scheduling priority and are evicted last. The trade-off is that Guaranteed pods cannot burst above their limits, which means you must size them for peak usage or accept throttling.
# Guaranteed QoS
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "2" # Must equal request
memory: 4Gi # Must equal request
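The classification rules reduce to a small decision function. A simplified sketch (the real kubelet logic also considers init containers and other edge cases):

```python
def qos_class(containers):
    """Classify a pod's QoS from its containers' resources.
    containers: [{"requests": {...}, "limits": {...}}, ...]."""
    reqs = [dict(c.get("requests") or {}) for c in containers]
    lims = [dict(c.get("limits") or {}) for c in containers]
    if all(not r and not l for r, l in zip(reqs, lims)):
        return "BestEffort"
    # The API server defaults requests to limits when only limits are set
    reqs = [{**l, **r} for r, l in zip(reqs, lims)]
    guaranteed = all(
        l.get("cpu") and l.get("memory")
        and r.get("cpu") == l["cpu"] and r.get("memory") == l["memory"]
        for r, l in zip(reqs, lims)
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{}]))                                 # BestEffort
print(qos_class([{"requests": {"cpu": "500m"}}]))      # Burstable
print(qos_class([{"requests": {"cpu": "2", "memory": "4Gi"},
                  "limits":   {"cpu": "2", "memory": "4Gi"}}]))  # Guaranteed
```

One consequence of the defaulting rule: a pod that sets only limits (no requests) still lands in Guaranteed, because its requests are defaulted to equal the limits.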
Topology Manager and NUMA-Aware Scheduling
On multi-socket servers, memory access times vary depending on which CPU socket is accessing which memory bank. This is Non-Uniform Memory Access (NUMA). A process running on socket 0 accessing memory on socket 1 pays a latency penalty of 50–100 nanoseconds per access — irrelevant for most workloads, but significant for high-performance computing, machine learning inference, and network-intensive pods using SR-IOV.
The Topology Manager is a kubelet component that coordinates resource allocation across the CPU Manager, Memory Manager, and Device Manager to ensure aligned NUMA placement. It supports four policies:
| Policy | Behavior |
|---|---|
| none | No topology alignment (default). |
| best-effort | Prefer aligned allocation but allow misalignment. |
| restricted | Require aligned allocation; reject pods that cannot be aligned. |
| single-numa-node | All resources must come from a single NUMA node. |
In practice, Topology Manager alignment applies only to Guaranteed QoS pods (and exclusive CPU allocation additionally requires integer CPU requests); Burstable and BestEffort pods always get the default behavior.
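Topology Manager is configured on the kubelet, not per pod. A hedged sketch of the relevant KubeletConfiguration fields (values illustrative; the static CPU Manager policy is what enables exclusive, alignable CPU allocation in the first place):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static            # exclusive CPUs for Guaranteed pods with integer requests
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod           # align the whole pod, not each container separately
```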
Node Allocatable vs Capacity
A node’s total resources (capacity) are not entirely available for pods. The kubelet reserves resources for itself, the OS, and eviction thresholds:
```mermaid
block-beta
  columns 2
  block
    columns 1
    blockArrowId1<["<b>capacity</b>:<br>16 CPU, 64 Gi memory"]>(right)
    blockArrowId2<["<b>kube-reserved</b><br>kubelet, container runtime<br>cpu: 200m, memory: 1Gi"]>(right)
    blockArrowId3<["<b>system-reserved</b><br>OS daemons, sshd, journald<br>cpu: 100m, memory: 500Mi"]>(right)
    blockArrowId4<["<b>eviction-threshold</b><br>memory.available < 100Mi"]>(right)
  end
  block
    columns 1
    AL["= capacity<br>− kube-reserved<br>− system-reserved<br>− eviction-threshold<br><br>= 15.7 CPU, 62.4 Gi<br><br><b>This is what the scheduler<br>uses to place pods.</b>"]
  end
```
The scheduler uses allocatable, not capacity, when deciding whether a pod fits on a node. If you do not set kube-reserved and system-reserved, the node can become unstable under load as the kubelet and OS compete with pods for resources.
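The reservations above are set in the kubelet configuration. A sketch using the same illustrative numbers:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:               # for the kubelet and container runtime
  cpu: 200m
  memory: 1Gi
systemReserved:             # for OS daemons (sshd, journald, ...)
  cpu: 100m
  memory: 500Mi
evictionHard:
  memory.available: 100Mi   # kubelet starts evicting pods below this threshold
```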
The Overcommitment Reality
In practice, most clusters are dramatically overcommitted on CPU and undercommitted on memory:
- Developers set CPU requests based on peak usage to avoid throttling.
- Actual average utilization is 10–15% of requested CPU across large clusters.
- Memory is harder to reclaim, so teams set memory requests closer to actual usage.
This means a cluster with 100 CPUs of total requests might only be using 13 CPUs at any given time. The remaining 87 CPUs are reserved but idle.
Strategies for handling overcommitment:
- VPA in Off mode to identify overprovisioned workloads (see Chapter 31).
- Remove CPU limits for bursty workloads so they can use idle CPU.
- Pod Priority and Preemption to ensure critical workloads can evict less important ones.
- Cluster-level overcommit policies (request-to-limit ratios in LimitRanges) to systematically set requests lower than limits.
- Right-size nodes. A few large nodes waste less to fragmentation than many small nodes.
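The LimitRange strategy from the list above might look like this (namespace and numbers illustrative): defaults keep unconfigured containers Burstable, and the ratio caps how far limits may exceed requests.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: overcommit-defaults
  namespace: team-alpha        # hypothetical tenant namespace
spec:
  limits:
  - type: Container
    defaultRequest:            # applied when a container sets no request
      cpu: 250m
      memory: 256Mi
    default:                   # applied when a container sets no limit
      cpu: "1"
      memory: 512Mi
    maxLimitRequestRatio:
      cpu: "4"                 # a CPU limit may be at most 4x its request
```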
Practical Guidelines
| Resource | Request | Limit | Rationale |
|---|---|---|---|
| CPU (latency-sensitive) | Set to P50 usage | Omit | Burst without throttling |
| CPU (batch/background) | Set to average | Set to max | Prevent neighbor starvation |
| Memory (all workloads) | Set to P95 usage | Set to P99 or max | Always limit memory |
Start with VPA recommendations in Off mode, remove CPU limits for web workloads, always set memory limits, and monitor container_cpu_cfs_throttled_periods_total as a key performance indicator.
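Applied to the first table row, a latency-sensitive web workload might be configured like this (numbers illustrative):

```yaml
# Latency-sensitive workload: no CPU limit, always a memory limit
resources:
  requests:
    cpu: 500m        # ~P50 CPU usage
    memory: 512Mi    # ~P95 memory usage
  limits:
    memory: 768Mi    # ~P99/max memory; CPU limit deliberately omitted to allow bursting
```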
Common Mistakes and Misconceptions
- “CPU limits prevent my app from using idle CPU.” CPU limits enforce CFS quotas regardless of available CPU. A pod hitting its CPU limit is throttled even if the node has idle cores. This is why some teams remove CPU limits entirely.
- “Setting requests equal to limits (Guaranteed QoS) is always best.” Guaranteed QoS means your pod is never throttled or OOM-killed for using burst capacity, but it also means you pay for peak capacity at all times. Burstable QoS is more cost-effective for most workloads.
- “Memory limits protect my application.” Memory limits protect the node by OOM-killing your container when it exceeds the limit. This is protection for neighbors, not for you. Your app crashes. Set limits above your expected peak, and profile memory usage to right-size.
- “1 CPU means one full core.” 1 CPU = 1000 millicores of CFS bandwidth, enforced in 100ms periods: 100ms of CPU time per 100ms of wall clock, shared across all of the container’s threads (as described above) and not pinned to any particular core. A container throttled at its quota can deliver far less throughput than “one core” suggests, especially when a multi-threaded process burns the budget early in each period.
Further Reading
- Managing Resources for Containers — Official resource management docs
- CPU CFS Bandwidth Control — Linux kernel CFS documentation
- Topology Manager — NUMA-aware resource allocation
- Node Allocatable — Reserving compute resources
This concludes Part 6: Scaling and Performance. You now understand how to scale pods horizontally and vertically, scale nodes underneath them, and tune resource allocation down to the kernel level. Part 7 zooms out from a single cluster to the organizational challenge: running multiple clusters, building internal developer platforms, and managing multi-tenancy.
Next: Multi-Cluster Strategies
Chapter 34: Multi-Cluster Strategies
A single Kubernetes cluster is a single failure domain. One misconfigured admission webhook can block all deployments. One etcd corruption event can lose all state. One cloud region outage can take everything offline. As organizations move from “we run some things on Kubernetes” to “Kubernetes is our platform,” the question shifts from “how do we run a cluster?” to “how do we run many clusters, and how do they relate to each other?”
Multi-cluster is not about redundancy alone. Teams adopt multiple clusters for blast radius reduction, regulatory compliance, geographic latency, team isolation, and environment separation. The challenge is not running multiple clusters — it is managing them as a coherent system without reintroducing the operational complexity Kubernetes was supposed to eliminate. For a visual overview of Part 7’s platform engineering concepts, see Appendix B: Mental Models.
Why Multi-Cluster
- Blast radius — Multiple clusters contain failures so a bad upgrade in staging does not take down production.
- Compliance and data sovereignty — Regulations like GDPR and HIPAA may require per-region clusters to keep data local.
- Latency — Geographic distribution puts compute close to users.
- Team isolation — Separate clusters provide hard isolation (API servers, RBAC, upgrade schedules) beyond what namespaces offer.
- Upgrade cadence — Running version N in production and N+1 in staging lets teams validate upgrades before rollout.
Approach 1: Independent Clusters
The simplest multi-cluster strategy is no strategy at all. Each cluster is independently provisioned, independently configured, and independently managed. Teams own their clusters end-to-end.
This works for small organizations with 2–3 clusters and dedicated platform teams per cluster. It fails at scale because every cluster drifts: different versions, different policies, different monitoring configurations, different security postures.
Approach 2: GitOps-Driven Multi-Cluster
The most widely adopted approach uses a GitOps tool to manage multiple clusters from a single source of truth. ArgoCD ApplicationSets are purpose-built for this.
```mermaid
flowchart TB
  subgraph git["Git Repository"]
    base["/base/<br>deployment.yaml<br>networkpolicy.yaml<br>monitoring.yaml"]
    usVals["/clusters/us-east/<br>values.yaml"]
    euVals["/clusters/eu-west/<br>values.yaml"]
    apVals["/clusters/ap-south/<br>values.yaml"]
  end
  subgraph hub["ArgoCD Hub Cluster"]
    appset["ApplicationSet generator<br>For each cluster:<br>- Create Application<br>- Inject cluster-specific values<br>- Sync state to match Git"]
  end
  subgraph regional["Regional Clusters"]
    usEast["us-east cluster<br>base + region overrides"]
    euWest["eu-west cluster<br>base + region overrides"]
    apSouth["ap-south cluster<br>base + region overrides"]
  end
  git -- "ArgoCD watches repo" --> hub
  appset --> usEast
  appset --> euWest
  appset --> apSouth
  style git fill:#f0f0ff,stroke:#333
  style hub fill:#fff0e0,stroke:#333
  style regional fill:#e0ffe0,stroke:#333
```
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          env: production
  template:
    metadata:
      name: "platform-{{name}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/org/platform-config
        targetRevision: main
        path: "clusters/{{metadata.labels.region}}"
      destination:
        server: "{{server}}"
        namespace: platform
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```
The ApplicationSet generator iterates over all clusters registered in ArgoCD that match the label selector, creates one Application per cluster, and injects cluster-specific values. A single Git commit can roll out a change to every production cluster worldwide.
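Each registered cluster is itself a Secret in the argocd namespace, and its labels are what the selector and `{{metadata.labels.region}}` above reference. A hedged sketch of such a registration (endpoint and names illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-us-east
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster    # marks this Secret as a cluster registration
    env: production                            # matched by the ApplicationSet selector
    region: us-east                            # referenced as {{metadata.labels.region}}
stringData:
  name: us-east
  server: https://us-east.example.com:6443     # illustrative API server endpoint
```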
This approach does not, however, provide cross-cluster service discovery or traffic management.
Approach 3: Federation
Federation projects attempt to provide a single API that spans multiple clusters. You submit a workload to the federation control plane, and it distributes replicas across member clusters.
KubeFed (Kubernetes Federation v2) was the original approach but is no longer actively developed. Karmada is the current leading project in this space. Karmada provides:
- A dedicated API server that accepts standard Kubernetes resources
- PropagationPolicy resources that define which clusters receive which workloads
- OverridePolicy resources for per-cluster customization
- Replica scheduling across clusters (weighted, by resource availability, or by policy)
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: api-server-spread
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: api-server
  placement:
    clusterAffinity:
      clusterNames:
      - us-east
      - eu-west
      - ap-south
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
        - targetCluster:
            clusterNames: [us-east]
          weight: 2
        - targetCluster:
            clusterNames: [eu-west]
          weight: 1
        - targetCluster:
            clusterNames: [ap-south]
          weight: 1
```
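Per-cluster customization is handled by OverridePolicy resources. A hedged sketch that swaps the image registry for the EU cluster (registry hostname illustrative):

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: api-server-eu-overrides
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: api-server
  overrideRules:
  - targetCluster:
      clusterNames: [eu-west]
    overriders:
      imageOverrider:
      - component: Registry          # rewrite only the registry part of the image
        operator: replace
        value: registry.eu.example.com
```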
Open Cluster Management (OCM), a CNCF sandbox project backed by Red Hat, takes a different approach to federation. Rather than a centralized control plane that pushes workloads to clusters, OCM uses a hub-and-spoke model where managed clusters pull their desired state from the hub via agents. This pull-based model can be easier to operate in environments with strict network policies or firewalls between clusters.
Federation is powerful but complex. It introduces a new control plane that must itself be highly available, and debugging failures requires understanding the federation layer, the per-cluster state, and the reconciliation between them.
Approach 4: Service Mesh Multi-Cluster
Service meshes solve the cross-cluster networking problem: how do services in cluster A discover and call services in cluster B?
Istio multi-cluster supports multiple topologies: primary-remote (a single shared control plane) and multi-primary (replicated control planes), each over one network or multiple networks. In the multi-primary model, each cluster runs its own Istio control plane, and they exchange service endpoint information so that a service in cluster A can route traffic to pods in cluster B as if they were local.
Cilium ClusterMesh provides a similar capability at the CNI level. Cilium agents across clusters connect via a shared etcd (or KVStoreMesh proxy) and exchange pod identity and endpoint information. Services can be declared as “global,” making them accessible from any cluster in the mesh.
```yaml
# Cilium global service annotation
apiVersion: v1
kind: Service
metadata:
  name: api-server
  annotations:
    service.cilium.io/global: "true"
    service.cilium.io/shared: "true"
spec:
  ports:
  - port: 80
```
With this annotation, any pod in any cluster in the ClusterMesh can resolve api-server and reach backends in the originating cluster. Cilium handles endpoint synchronization, identity-aware routing, and even affinity (prefer local cluster backends).
Approach 5: Cluster API for Lifecycle Management
All the above approaches assume clusters already exist. Cluster API (CAPI) addresses the lifecycle problem: how do you create, upgrade, and delete clusters declaratively?
Cluster API treats clusters as Kubernetes resources. You define a Cluster, MachineDeployment, and infrastructure-specific resources (AWS, Azure, GCP, vSphere), and Cluster API controllers reconcile them into running clusters. Upgrading a cluster’s Kubernetes version is a spec change; Cluster API handles the rolling update of control plane and worker nodes.
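An abridged sketch of the resource pairing described above, assuming the AWS infrastructure provider (names illustrative; the MachineDeployment’s selector, bootstrap, and infrastructure template references are omitted for brevity):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-us-east
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: prod-us-east-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: prod-us-east
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: prod-us-east-workers
spec:
  clusterName: prod-us-east
  replicas: 3
  template:
    spec:
      clusterName: prod-us-east
      version: v1.29.4       # bumping this triggers a rolling replacement of worker nodes
```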
Combining Cluster API with GitOps gives you a fully declarative multi-cluster lifecycle: Git commits create clusters, ArgoCD ApplicationSets configure them, and Cluster API manages their infrastructure.
Choosing an Approach
| Requirement | Recommended Approach |
|---|---|
| Consistent configuration across clusters | GitOps (ArgoCD ApplicationSets) |
| Cross-cluster service discovery | Service mesh (Istio, Cilium ClusterMesh) |
| Workload distribution across clusters | Federation (Karmada) |
| Declarative cluster lifecycle | Cluster API |
| Simple, low-overhead | Independent clusters + GitOps |
Most organizations start with GitOps-driven multi-cluster and add service mesh or federation only when they have a concrete cross-cluster routing or scheduling requirement. Cluster API is orthogonal — it manages infrastructure regardless of the workload management strategy.
Common Mistakes and Misconceptions
- “One big cluster is always better than multiple small ones.” Large clusters have larger blast radius, harder upgrades, and more complex RBAC. Many organizations use multiple clusters for environment isolation, team autonomy, and regional locality.
- “Service mesh is required for cross-cluster communication.” DNS-based service discovery, cloud load balancers, or simple ingress routing can connect services across clusters. A mesh adds mTLS and observability but isn’t always necessary.
Further Reading
- ArgoCD ApplicationSets — Multi-cluster GitOps
- Karmada Documentation — Multi-cluster federation
- Open Cluster Management — CNCF sandbox hub-and-spoke multi-cluster management
- Cilium ClusterMesh — Cross-cluster networking
- Cluster API — Declarative cluster lifecycle management
Next: Building Internal Developer Platforms — Backstage, golden paths, and the platform engineering stack.
Chapter 35: Building Internal Developer Platforms
Kubernetes gives you the building blocks of a platform. It does not give you a platform. A raw Kubernetes cluster presents developers with 60+ resource types, YAML manifests that regularly exceed 200 lines, and a debugging experience that requires understanding networking, storage, scheduling, and Linux internals. Platform engineering is the discipline of assembling these building blocks into something a product developer can use without a week of onboarding.
This is not an abstraction for its own sake. It is a response to a measurable problem: developer cognitive load. When deploying a service requires editing Kubernetes manifests, Terraform modules, CI pipelines, monitoring dashboards, and alerting rules across multiple repositories, developers spend more time on infrastructure plumbing than on the product they are building. Platform engineering inverts this by providing opinionated, pre-built paths that handle the infrastructure automatically.
The Platform Layers
An internal developer platform is a stack of tools, each handling a layer of the infrastructure problem. The typical production stack looks like this:
| Layer | Purpose | Typical Tools |
|---|---|---|
| Developer Interface | Service catalog, scaffolding, docs, API registry, golden paths | Backstage |
| Delivery & Deployment | GitOps continuous delivery, CI pipelines | ArgoCD / Flux, Tekton / GitHub Actions |
| Infrastructure Provisioning | Cloud resources as code | Crossplane (CRDs), Terraform (HCL) |
| Container Platform | Scheduling, networking, service discovery, autoscaling | Kubernetes |
| Observability | Metrics, logs, traces, alerting | Prometheus + Grafana, Loki, Tempo, PagerDuty |
Each layer serves a distinct purpose, and the platform team’s job is to integrate them so that developers interact primarily with the top layer.
Backstage: The Developer Portal
Backstage, originally built at Spotify and now a CNCF incubating project, is the most widely adopted developer portal. It provides:
Service catalog. Every service, library, website, and infrastructure component registered in a single searchable catalog. Each entry tracks ownership, dependencies, documentation links, API definitions, CI/CD status, and deployment targets.
Software templates. Scaffolding that creates a new service with all the boilerplate pre-configured: repository, CI pipeline, Kubernetes manifests, monitoring dashboards, and Backstage catalog entry. A developer clicks “Create New Service,” fills in a form, and gets a production-ready repository in minutes.
TechDocs. Documentation generated from Markdown files in the service’s repository and rendered in Backstage. This solves the “where do I find docs?” problem by making documentation discoverable alongside the service catalog.
Plugin ecosystem. Backstage is extensible via plugins. The Kubernetes plugin shows pod status, deployment history, and logs. The ArgoCD plugin shows sync status. The PagerDuty plugin shows on-call schedules and incidents. This consolidation means developers check one portal instead of switching between five tools.
```yaml
# backstage catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout-service
  description: Handles order checkout and payment processing
  annotations:
    backstage.io/techdocs-ref: dir:.
    argocd/app-name: checkout-service
    pagerduty.com/service-id: P123ABC
  tags:
  - python
  - grpc
spec:
  type: service
  lifecycle: production
  owner: team-payments
  system: commerce-platform
  dependsOn:
  - component:payment-gateway
  - resource:orders-database
  providesApis:
  - checkout-api
```
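The software templates described earlier are themselves catalog entities. A heavily abridged scaffolder template sketch (names and action inputs illustrative):

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: python-grpc-service
  title: Python gRPC Service
spec:
  owner: platform-team
  type: service
  parameters:
  - title: Service details
    required: [name]
    properties:
      name:
        type: string
        description: Service name
  steps:
  - id: fetch
    name: Scaffold repository contents
    action: fetch:template              # renders the skeleton with the form inputs
    input:
      url: ./skeleton
      values:
        name: ${{ parameters.name }}
  - id: publish
    name: Create repository
    action: publish:github
    input:
      repoUrl: github.com?owner=org&repo=${{ parameters.name }}
  - id: register
    name: Register in catalog
    action: catalog:register
    input:
      catalogInfoUrl: ${{ steps.publish.output.repoContentsUrl }}/catalog-info.yaml
```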
Golden Paths
A golden path is a pre-built, opinionated, end-to-end workflow for a common task. It is not a mandate — developers can deviate — but it is the supported, documented, well-tested way to do something.
Examples of golden paths:
- Deploy a new microservice: Use the Backstage template. It creates a repo with Dockerfile, Helm chart, ArgoCD Application, Prometheus ServiceMonitor, and Grafana dashboard. Merge to main triggers CI, which builds the image and updates the Helm values. ArgoCD syncs to the cluster.
- Add a PostgreSQL database: File a Crossplane Claim (see Chapter 36). The platform provisions an RDS instance, creates a Kubernetes Secret with credentials, and injects the connection string into the service via environment variables.
- Scale for a traffic event: Set the HPA target metric and max replicas in the Helm values file. The platform handles the rest — HPA, node autoscaling, and monitoring adjustments are pre-configured.
The key property of a golden path is that it requires zero Kubernetes knowledge from the developer. They fill in business-level inputs (service name, language, database size) and the platform handles the infrastructure mapping.
The Platform Team
Platform engineering is a product discipline, not an infrastructure discipline. The platform team builds a product whose users are developers. This means:
Measure adoption, not features. A platform with 50 features that nobody uses is worse than one with 5 features that everyone uses. Track what percentage of services use the golden paths, how long it takes to go from “new service idea” to “running in production,” and how many support tickets the platform team receives.
Treat the platform as an internal product. Have a roadmap, gather user feedback, prioritize ruthlessly. The most successful platform teams run internal betas, have documentation budgets, and deprecate features deliberately.
Provide escape hatches. Golden paths should be the default, not a prison. When a team needs something non-standard (a GPU workload, a non-HTTP service, a custom CRD), the platform should not block them. The platform reduces friction for the 90% case; the 10% case gets manual support.
Anti-Patterns
Leaky abstractions. If the platform hides Kubernetes but developers still need to debug Kubernetes when things go wrong, the abstraction has not reduced cognitive load — it has added a layer. Good platforms either make the underlying system invisible (developers never need to know it is Kubernetes) or transparent (developers can drill down when they choose to).
Ignoring the developer experience. A platform that requires developers to learn a new DSL, install three CLI tools, and read 40 pages of documentation has failed. The best platforms feel like they were designed by someone who has deployed a service in anger.
No migration path. Organizations that build v1 of the platform without a plan for migrating existing services end up running two platforms indefinitely. Design for migration from the start.
A Minimal Starting Stack
For teams beginning their platform engineering journey, the minimal viable stack is:
- Kubernetes (managed: EKS, GKE, or AKS)
- ArgoCD for GitOps deployment
- Helm for templating with sensible defaults
- Prometheus + Grafana for monitoring (or a managed equivalent)
- A software template (even a shell script that generates a repo from a template)
Add Backstage when you have 10+ services and the catalog becomes valuable. Add Crossplane when you need self-service cloud resources. Add Tekton or a CI system when GitHub Actions is insufficient.
The goal is to make the most common developer workflows — deploy, observe, debug, rollback — take less than 5 minutes and require no Kubernetes-specific knowledge.
Common Mistakes and Misconceptions
- “A platform team should build everything from scratch.” The best platforms compose existing tools (ArgoCD, Crossplane, Backstage) with thin glue layers. Building custom versions of solved problems wastes years and creates maintenance burdens.
- “If we build it, developers will use it.” Platforms succeed when they’re easier than the alternative. If your platform is harder than `kubectl apply`, developers will bypass it. Invest in developer experience and documentation.
- “Platform engineering is just DevOps renamed.” DevOps is a culture of shared responsibility. Platform engineering builds self-service products (internal developer platforms) that embed operational best practices. The platform is the product; developers are the customers.
Further Reading
- Backstage Documentation — Official Backstage guides
- CNCF Platforms White Paper — Principles of cloud-native platforms
- Team Topologies — Organizational patterns for platform teams
- Platform Engineering on Kubernetes — Comprehensive book on the topic
Next: Crossplane — Managing cloud infrastructure as Kubernetes CRDs with the universal control plane.
Chapter 36: Crossplane: Infrastructure as CRDs
Crossplane extends Kubernetes’ reconciliation engine to any cloud resource — databases, storage buckets, DNS records, IAM roles — by representing each as a Kubernetes Custom Resource.
The Architecture
Crossplane installs as a set of controllers in your Kubernetes cluster. It extends the API server with CRDs that represent cloud resources, then reconciles those CRDs against the actual cloud state via provider plugins.
```mermaid
flowchart TD
  Claim["<b>Claim (XRC)</b><br>I need a PostgreSQL DB,<br>medium size"]
  XR["<b>Composite Resource (XR)</b><br>cluster-scoped, created by<br>Crossplane from Claim"]
  Comp["Composition (template)<br>maps XR to managed resources"]
  Claim -->|Crossplane creates| XR
  XR --> Comp
  Comp --> MR1["<b>Managed Resource</b><br>RDS Instance<br>(provider-aws)"]
  Comp --> MR2["<b>Managed Resource</b><br>Subnet Group<br>(provider-aws)"]
  Comp --> MR3["<b>Managed Resource</b><br>Security Group<br>(provider-aws)"]
  MR1 --> AWS["<b>AWS API</b><br>Actual RDS instance, subnet group,<br>security group created and<br>continuously reconciled"]
  MR2 --> AWS
  MR3 --> AWS
```
Core Concepts
Providers
A Provider is a Crossplane package that installs CRDs and controllers for a specific cloud platform or service. provider-aws adds CRDs for RDS, S3, IAM, VPC, and hundreds of other AWS resources. provider-gcp, provider-azure, provider-helm, and provider-kubernetes do the same for their respective domains.
Providers authenticate to the cloud API using credentials stored in Kubernetes Secrets or via IRSA/Workload Identity.
```yaml
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-family-aws:v1.17.0
```
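The credentials mentioned above are wired up through a separate ProviderConfig object, which Managed Resources point at via providerConfigRef. A sketch assuming Secret-based credentials (Secret name and key illustrative):

```yaml
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: aws-provider            # referenced by providerConfigRef in Managed Resources
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds           # hypothetical Secret holding an AWS credentials file
      key: creds
```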
Managed Resources
A Managed Resource is a 1:1 representation of a cloud resource. One Managed Resource maps to exactly one external resource. The Crossplane controller for that resource type continuously reconciles: if the resource does not exist, create it. If it exists but has drifted from the spec, update it. If the Managed Resource is deleted, delete the cloud resource.
```yaml
apiVersion: rds.aws.upbound.io/v1beta2
kind: Instance
metadata:
  name: my-database
spec:
  forProvider:
    region: us-east-1
    instanceClass: db.t3.medium
    engine: postgres
    engineVersion: "15"
    allocatedStorage: 20
    masterUsername: admin
    masterPasswordSecretRef:
      name: db-password
      namespace: crossplane-system
      key: password
  providerConfigRef:
    name: aws-provider
```
This is the lowest-level Crossplane abstraction. Platform teams rarely expose Managed Resources directly to developers — they are too detailed and cloud-specific.
Composite Resource Definitions (XRDs)
An XRD defines a new custom API — a higher-level abstraction that hides cloud-specific details. Think of it as defining a new Kubernetes resource type. The XRD specifies the schema (what fields developers can set) and optionally offers a namespaced Claim variant.
```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xpostgresqls.database.example.org
spec:
  group: database.example.org
  names:
    kind: XPostgreSQL
    plural: xpostgresqls
  claimNames:
    kind: PostgreSQL
    plural: postgresqls
  versions:
  - name: v1alpha1
    served: true
    referenceable: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              size:
                type: string
                enum: ["small", "medium", "large"]
              version:
                type: string
                default: "15"
            required:
            - size
```
This XRD creates two new resource types: XPostgreSQL (cluster-scoped composite resource) and PostgreSQL (namespaced claim). Developers only interact with the claim.
Compositions
A Composition is the template that maps a Composite Resource to one or more Managed Resources. It is where the platform team encodes organizational opinions: which instance types correspond to “small,” “medium,” and “large,” what security groups to attach, what backup policies to apply.
```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: postgresql-aws
spec:
  compositeTypeRef:
    apiVersion: database.example.org/v1alpha1
    kind: XPostgreSQL
  resources:
  - name: rds-instance
    base:
      apiVersion: rds.aws.upbound.io/v1beta2
      kind: Instance
      spec:
        forProvider:
          region: us-east-1
          engine: postgres
          publiclyAccessible: false
          storageEncrypted: true
          backupRetentionPeriod: 7
    patches:
    - type: FromCompositeFieldPath
      fromFieldPath: spec.version
      toFieldPath: spec.forProvider.engineVersion
    - type: FromCompositeFieldPath
      fromFieldPath: spec.size
      toFieldPath: spec.forProvider.instanceClass
      transforms:
      - type: map
        map:
          small: db.t3.small
          medium: db.t3.medium
          large: db.r6g.xlarge
```
Claims
A Claim is the developer-facing interface. It is namespaced (unlike the Composite Resource), so it integrates naturally with team namespaces and RBAC. When a developer creates a Claim, Crossplane creates the corresponding Composite Resource, which the Composition expands into Managed Resources.
```yaml
apiVersion: database.example.org/v1alpha1
kind: PostgreSQL
metadata:
  name: orders-db
  namespace: checkout-team
spec:
  size: medium
  version: "15"
```
Three lines of meaningful configuration. The developer does not need to know about RDS instance classes, security groups, subnet groups, or parameter groups. The platform team has encoded all of those decisions in the Composition.
Crossplane vs Terraform
Both Crossplane and Terraform manage cloud infrastructure declaratively. The differences are architectural:
| Aspect | Crossplane | Terraform |
|---|---|---|
| Execution model | Continuous reconciliation | On-demand apply |
| State storage | Kubernetes etcd (CRDs) | State files (S3, local, etc.) |
| Drift detection | Automatic, continuous | Manual (terraform plan) |
| Drift correction | Automatic | Manual (terraform apply) |
| Developer interface | kubectl, Kubernetes RBAC | CLI, separate auth |
| Composition | XRDs + Compositions (CRDs) | Modules (HCL) |
| Ecosystem | Growing, CRD-based providers | Massive, mature provider ecosystem |
| Secret handling | Kubernetes Secrets, native | State file (encrypted via backend configuration; e.g., S3 SSE, Terraform Cloud encryption at rest) |
Crossplane’s advantage: Continuous reconciliation means drift is detected and corrected automatically. If someone manually changes an RDS instance’s configuration via the AWS console, Crossplane will notice and revert it on the next reconciliation cycle (typically 1–10 minutes). Terraform only detects drift when someone runs terraform plan.
Terraform’s advantage: Maturity, ecosystem breadth, and the terraform plan workflow that lets teams review changes before applying them. Crossplane’s reconciliation model means changes to a Composition apply immediately to all resources that use it — there is no “plan” step.
In practice, many organizations use both: Terraform for foundational infrastructure (VPCs, IAM, Kubernetes clusters) managed by a platform team with manual review, and Crossplane for application-level resources (databases, caches, queues) managed self-service by development teams.
The Universal Control Plane Vision
Crossplane’s long-term vision is the “universal control plane” — a single Kubernetes API server that manages everything: containers, cloud resources, SaaS services, and internal tooling. Instead of developers learning kubectl for Kubernetes, the AWS console for cloud resources, and a CI tool’s web interface for pipelines, they interact with a single API that accepts declarative manifests for all of it.
Provider coverage is broad but not total. Complex multi-resource dependencies (create VPC, then subnet, then security group, then RDS instance) require careful ordering in Compositions. Error messages from failed cloud API calls can be opaque. But the trajectory is clear: the Kubernetes resource model is becoming the universal interface for infrastructure management, and Crossplane is the primary vehicle for that expansion.
Common Mistakes and Misconceptions
- “Crossplane replaces Terraform.” See the comparison table above. Many organizations use both: Terraform for foundational infrastructure, Crossplane for application-level self-service.
- “Compositions apply changes immediately with no review.” This one is actually true, and it often surprises teams. Unlike Terraform’s plan/apply workflow, changing a Composition affects all resources using it immediately. Use Composition revisions and staged rollouts.
- “Crossplane providers cover every cloud resource.” Coverage is broad but not complete. Check the provider’s CRD list before committing to Crossplane for a specific resource. Some niche services may need Terraform or direct API calls.
Further Reading
- Crossplane Documentation — Official guides and reference
- Upbound Marketplace — Provider and configuration packages
- Crossplane Getting Started — Official introduction and tutorials
- Crossplane Concepts — Compositions, XRDs, Claims, and Providers
Next: Multi-Tenancy — Namespace isolation, hierarchical namespaces, vCluster, and when soft boundaries are not enough.
Chapter 37: Multi-Tenancy
A Kubernetes cluster is expensive. Running one cluster per team, per environment, or per application multiplies that cost — not just in compute, but in operational overhead: patching, monitoring, upgrading, and securing each cluster independently. Multi-tenancy is the practice of sharing a single cluster among multiple tenants (teams, applications, customers) while maintaining isolation between them.
The fundamental tension in multi-tenancy is between sharing (for efficiency) and isolation (for safety). Too much sharing and one tenant’s misconfiguration affects others. Too much isolation and you lose the efficiency gains that motivated sharing in the first place. Kubernetes provides multiple isolation mechanisms at different strengths, and choosing the right combination depends on your trust model: are tenants friendly teams within the same organization, or are they untrusted customers running arbitrary code?
Namespace-Level Isolation
The namespace is Kubernetes’s primary unit of multi-tenancy. A namespace provides a scope for names and a target for access control, network policies, and resource quotas. For trusted, internal tenants, namespace isolation is often sufficient.
The Four Pillars
Effective namespace isolation requires four mechanisms working together:
1. RBAC (who can do what). Each tenant gets a Role and RoleBinding scoped to their namespace. Tenants can create Deployments, Services, and ConfigMaps in their namespace but cannot access other namespaces or cluster-scoped resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: tenant-developer
namespace: team-alpha
rules:
- apiGroups: ["", "apps", "batch"]
resources: ["pods", "deployments", "services", "configmaps", "jobs"]
verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list"] # read but not create; secrets are managed by the platform
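A Role grants nothing on its own; it must be bound to the tenant's users or groups. A matching RoleBinding might look like this (the group name is a hypothetical identity-provider group):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-developer-binding
  namespace: team-alpha
subjects:
- kind: Group
  name: team-alpha-devs          # hypothetical group from your IdP
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-developer
  apiGroup: rbac.authorization.k8s.io
```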
2. NetworkPolicies (who can talk to whom). Default-deny ingress and egress policies per namespace, with explicit allow rules for legitimate cross-namespace traffic. Without NetworkPolicies, pods in team-alpha can freely reach pods in team-beta.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: team-alpha
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
egress:
- to: [] # empty "to" means any destination, but only on the DNS ports below
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
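The default-deny policy above still needs companion allow rules for legitimate cross-namespace traffic. A hedged sketch admitting ingress from one other tenant (the pod label and port are hypothetical; the namespace selector uses the well-known kubernetes.io/metadata.name label):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-team-beta
  namespace: team-alpha
spec:
  podSelector:
    matchLabels:
      app: api                   # hypothetical: only the API pods accept this traffic
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: team-beta
    ports:
    - port: 8080                 # hypothetical service port
      protocol: TCP
```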
3. ResourceQuotas (how much can be consumed). Without quotas, one tenant can consume all cluster resources, starving others. ResourceQuotas set hard limits per namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
name: tenant-quota
namespace: team-alpha
spec:
hard:
requests.cpu: "10"
requests.memory: 20Gi
limits.cpu: "20"
limits.memory: 40Gi
pods: "50"
services: "10"
persistentvolumeclaims: "20"
4. LimitRanges (sane defaults). LimitRanges set default requests and limits for containers that do not specify them, and enforce minimum/maximum bounds. This prevents a developer from deploying a pod with requests.memory: 1Ti.
apiVersion: v1
kind: LimitRange
metadata:
name: tenant-limits
namespace: team-alpha
spec:
limits:
- type: Container
default:
cpu: 500m
memory: 256Mi
defaultRequest:
cpu: 100m
memory: 128Mi
max:
cpu: "4"
memory: 8Gi
min:
cpu: 50m
memory: 64Mi
flowchart TD
subgraph Cluster["Shared Cluster"]
subgraph Alpha["team-alpha"]
AlphaConfig["RBAC: own role<br>NetworkPolicy: deny by default<br>ResourceQuota: 10 CPU, 20Gi<br>LimitRange: defaults set"]
PodA["Pod A"]
PodB["Pod B"]
end
subgraph Beta["team-beta"]
BetaConfig["RBAC: own role<br>NetworkPolicy: deny by default<br>ResourceQuota: 8 CPU, 16Gi<br>LimitRange: defaults set"]
PodC["Pod C"]
PodD["Pod D"]
end
Shared["<b>Shared:</b> API server, scheduler, kubelet,<br>etcd, container runtime, nodes"]
end
Limitations of Namespace Isolation
- CRDs are cluster-scoped. One tenant’s CRD installation affects all tenants. A buggy CRD controller can crash the API server for everyone.
- Cluster-scoped resources cannot be isolated. ClusterRoles, PriorityClasses, IngressClasses, and StorageClasses are visible to all tenants.
- Node-level resources are shared. Tenants share the Linux kernel, container runtime, and host filesystem. A container escape vulnerability gives access to all pods on the node.
- API server rate limits affect everyone. One tenant’s controller making excessive API calls degrades performance for all tenants.
- No per-tenant admission control. Admission webhooks are cluster-scoped. You cannot run different admission policies per namespace without complex webhook routing.
For internal teams with moderate trust, these limitations are acceptable. For untrusted tenants or strict compliance requirements, they are not.
Hierarchical Namespaces (HNC)
In flat namespace models, creating a new team or sub-team requires manual namespace provisioning with duplicated RBAC, NetworkPolicies, and ResourceQuotas. Hierarchical Namespace Controller (HNC) adds parent-child relationships between namespaces. A child namespace automatically inherits Roles, RoleBindings, NetworkPolicies, and ResourceQuotas from its parent.
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
name: team-alpha-staging
namespace: team-alpha
This creates team-alpha-staging as a child of team-alpha. RBAC bindings from team-alpha propagate automatically. When the parent’s policies change, children update. This is particularly useful for organizations with hierarchical team structures (org > division > team > project).
HNC does not add stronger isolation — it makes namespace management more scalable. The isolation boundary is still the namespace with its four pillars.
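Under the hood, HNC records the parent-child relationship in a HierarchyConfiguration object inside the child namespace. A sketch, assuming the v1alpha2 schema (field names may vary by HNC version):

```yaml
apiVersion: hnc.x-k8s.io/v1alpha2
kind: HierarchyConfiguration
metadata:
  name: hierarchy                # HNC expects this singleton name
  namespace: team-alpha-staging  # the child
spec:
  parent: team-alpha             # inherits RBAC, NetworkPolicies, quotas
```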
vCluster: Virtual Clusters
When namespace isolation is insufficient, the next step is vCluster — a project that creates virtual Kubernetes clusters inside a host cluster. Each vCluster runs its own API server, controller manager, and (optionally) its own etcd, all as pods within a namespace of the host cluster. Tenants interact with their vCluster as if it were a standalone cluster.
flowchart TD
subgraph Host["Host Cluster"]
subgraph VCAlpha["Namespace: vc-alpha"]
subgraph Alpha["vCluster alpha"]
AlphaCP["Own API server<br>Own controller-mgr<br>Own scheduler<br>Own etcd (or SQLite)"]
AlphaTenant["<b>Tenant sees:</b><br>Own namespaces<br>Own CRDs<br>Own RBAC<br>Own secrets"]
end
SyncerA["Syncer<br>Syncs pods to host<br>cluster for actual scheduling"]
Alpha --> SyncerA
end
subgraph VCBeta["Namespace: vc-beta"]
subgraph Beta["vCluster beta"]
BetaCP["Own API server<br>Own ctrl-mgr<br>Own scheduler<br>Own etcd"]
BetaTenant["<b>Tenant sees:</b><br>Own namespaces<br>Own CRDs<br>Own RBAC<br>Own secrets"]
end
SyncerB["Syncer<br>Syncs pods to host"]
Beta --> SyncerB
end
HostProvides["Host cluster provides: nodes, networking, storage"]
end
How vCluster Works
- The vCluster control plane (API server, controller manager, optional etcd) runs as pods in a host namespace.
- Tenants connect to the vCluster’s API server via a kubeconfig. They see a normal Kubernetes cluster with its own namespaces, CRDs, and RBAC.
- When a tenant creates a pod in the vCluster, the syncer translates it into a pod in the host namespace with a mangled name. The host cluster’s scheduler places it on a real node.
- The tenant’s pod runs on host infrastructure but appears in the vCluster’s API server with the tenant’s labels, annotations, and namespace.
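The name translation in step 3 can be sketched as a pure function. This is a dependency-free illustration, not vCluster's actual code; the "-x-" separator shown is one scheme vCluster has used, and the exact format is a version-specific implementation detail:

```go
package main

import "fmt"

// hostPodName sketches how a syncer might map a pod from a virtual cluster
// onto a unique name in the single host namespace: combining the pod name,
// its virtual namespace, and the vcluster name so tenants cannot collide.
func hostPodName(pod, virtualNS, vcluster string) string {
	return fmt.Sprintf("%s-x-%s-x-%s", pod, virtualNS, vcluster)
}

func main() {
	// Two tenants can both run a pod named "web-0" in namespace "default"
	// because the translated host names differ.
	fmt.Println(hostPodName("web-0", "default", "alpha"))
	fmt.Println(hostPodName("web-0", "default", "beta"))
}
```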
What vCluster Provides
- CRD isolation. Each vCluster can install its own CRDs without affecting other tenants.
- Cluster-admin per tenant. Tenants can have cluster-admin inside their vCluster without affecting the host.
- Independent upgrades. Each vCluster can run a different Kubernetes version.
- Full RBAC isolation. ClusterRoles and ClusterRoleBindings are scoped to the vCluster.
- Admission webhook isolation. Tenants can install their own admission webhooks.
What vCluster Does NOT Provide
- Node-level isolation. Pods from different vClusters share the same nodes and Linux kernel. Container escape is still a risk.
- Network isolation by default. You still need NetworkPolicies on the host cluster to isolate traffic between vClusters.
- Zero overhead. Each vCluster’s control plane consumes resources (typically 0.5–1 CPU and 512Mi–1Gi memory for the API server and syncer).
Comparison
| Capability | Namespaces | Namespaces + HNC | vCluster |
|---|---|---|---|
| RBAC isolation | Namespace-scoped | Inherited + namespace | Full cluster-admin |
| CRD isolation | None | None | Full |
| Network isolation | Via NetworkPolicy | Via NetworkPolicy | Via NetworkPolicy + own Services |
| Resource quotas | Per namespace | Inherited | Per vCluster namespace on host |
| Independent K8s version | No | No | Yes |
| Own admission webhooks | No | No | Yes |
| Overhead per tenant | ~0 | ~0 | 0.5–1 CPU, 512Mi–1Gi |
| Node-level isolation | No | No | No (use Kata/gVisor) |
| Suitable for | Internal teams | Hierarchical orgs | SaaS, untrusted tenants |
When Namespaces Are Not Enough
Use separate physical clusters when:
- Tenants are mutually untrusted and require node-level isolation
- Compliance mandates physical separation (some PCI-DSS interpretations)
- Failure domains must be completely independent
The Isolation Spectrum
Multi-tenancy is not binary. It is a spectrum from shared namespaces to dedicated clusters, and you can mix strategies:
- Production workloads for different teams: namespaces with strict RBAC, NetworkPolicies, and quotas
- Development and CI environments: vClusters (disposable, fast to create, cheap)
- Customer-facing SaaS tenants: vClusters with NetworkPolicies and optional gVisor for runtime isolation
- Regulated workloads: dedicated clusters with Cluster API lifecycle management
The right answer depends on your threat model, compliance requirements, and operational capacity. Start with namespaces, add vCluster when you hit a namespace limitation, and reach for dedicated clusters only when virtual isolation is insufficient.
Common Mistakes and Misconceptions
- “Namespaces provide security isolation.” Namespaces are a grouping mechanism, not a security boundary. Without NetworkPolicies, RBAC, ResourceQuotas, and Pod Security Standards, pods in different namespaces can freely communicate and compete for resources.
- “vCluster is overkill for multi-tenancy.” For strong isolation (e.g., different customers, untrusted workloads), namespace-level controls are often insufficient. vCluster provides full API isolation with lower overhead than separate physical clusters.
Further Reading
- Multi-tenancy Guide — Official Kubernetes multi-tenancy documentation
- Hierarchical Namespaces — HNC project
- vCluster Documentation — Virtual cluster project
- Kata Containers — VM-level pod isolation for node-level multi-tenancy
This concludes Part 7: Multi-Cluster and Platform Engineering. You now know how to operate Kubernetes at organizational scale — managing multiple clusters, building internal platforms, extending the API with Crossplane, and isolating tenants. Part 8 goes deeper into the machinery itself: writing your own controllers, understanding API internals, operating etcd, and running GPU and ML workloads.
Next: Writing Controllers and Operators
Chapter 38: Writing Controllers and Operators
Kubernetes ships with roughly thirty built-in controllers — the Deployment controller, the ReplicaSet controller, the Job controller, and so on. Each one watches a particular resource type, compares the desired state in the spec with the actual state in the cluster, and takes action to close the gap. This reconciliation pattern is the engine that makes Kubernetes declarative.
An operator is simply a custom controller that encodes domain-specific operational knowledge for a particular application. The Deployment controller knows how to roll out generic pods; a PostgreSQL operator knows how to initialize replicas, manage failover, and orchestrate backups. The extension mechanism is the same — only the knowledge embedded in the reconciliation logic differs.
This chapter covers how to build operators using the standard Go toolchain: the controller-runtime library and its scaffolding tool, Kubebuilder.
The Reconcile Loop
Every controller follows the same fundamental pattern. The control plane delivers a reconcile request — essentially a namespace/name pair — and the controller’s job is to make reality match the desired state for that object. The loop looks like this:
flowchart TD
WQ["Work Queue<br>ns/name, ns/name, ..."] --> Fetch
Fetch["1. FETCH<br>Get resource by ns/name"]
Fetch -->|"found"| List
Fetch -->|"not found<br>(deleted)"| Cleanup["Cleanup owned resources"] --> Done
List["2. LIST<br>List owned child resources<br>(Deployments, Services, etc.)"]
List --> Compare
Compare["3. COMPARE<br>Diff desired state (spec)<br>vs actual state (children)"]
Compare --> Act
Act{"4. ACT"}
Act -->|"missing"| Create["Create resources"]
Act -->|"drifted"| Update["Update resources"]
Act -->|"obsolete"| Delete["Delete resources"]
Create --> Status
Update --> Status
Delete --> Status
Status["5. STATUS<br>Update status subresource<br>(conditions, counts)"]
Status --> Return
Return{"6. RETURN"}
Return -->|"error"| ErrRequeue["Requeue with<br>exponential backoff"] --> WQ
Return -->|"RequeueAfter"| TimedRequeue["Requeue after<br>duration"] --> WQ
Return -->|"nil, nil"| Done2["Done<br>(wait for next event)"]
Kubebuilder Scaffolding
Kubebuilder generates the boilerplate so you can focus on the reconciliation logic. A typical workflow:
# Initialize a new project
kubebuilder init --domain example.com --repo github.com/example/myoperator
# Create an API (CRD + controller)
kubebuilder create api --group apps --version v1alpha1 --kind MyApp
# Create a webhook (optional)
kubebuilder create webhook --group apps --version v1alpha1 --kind MyApp \
--defaulting --programmatic-validation
This generates a directory structure with api/v1alpha1/myapp_types.go (your CRD schema), internal/controller/myapp_controller.go (your Reconcile function), and the wiring to register everything with the manager.
The Reconcile Function Skeleton
Here is the canonical structure in Go using controller-runtime:
package controller
import (
"context"
"time"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/runtime"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
myappv1 "github.com/example/myoperator/api/v1alpha1"
)
type MyAppReconciler struct {
client.Client
Scheme *runtime.Scheme
}
func (r *MyAppReconciler) Reconcile(ctx context.Context,
req ctrl.Request) (ctrl.Result, error) {
logger := log.FromContext(ctx)
// ── Step 1: Fetch the primary resource ──────────────────
var app myappv1.MyApp
if err := r.Get(ctx, req.NamespacedName, &app); err != nil {
if errors.IsNotFound(err) {
logger.Info("MyApp deleted, nothing to do")
return ctrl.Result{}, nil
}
return ctrl.Result{}, err // requeue with backoff
}
// ── Step 2: List owned child resources ──────────────────
var childDeploys appsv1.DeploymentList
if err := r.List(ctx, &childDeploys,
client.InNamespace(req.Namespace),
// NOTE: This field selector requires a custom index. You must register it
// in SetupWithManager using mgr.GetFieldIndexer().IndexField() — it does
// not work out of the box. See the controller-runtime documentation for
// how to set up custom field indexes.
client.MatchingFields{"metadata.ownerReferences.uid": string(app.UID)},
); err != nil {
return ctrl.Result{}, err
}
// ── Step 3: Compare desired vs actual ───────────────────
desiredReplicas := app.Spec.Replicas
if len(childDeploys.Items) == 0 {
// ── Step 4a: Create ─────────────────────────────────
deploy := r.buildDeployment(&app)
if err := ctrl.SetControllerReference(&app, deploy, r.Scheme); err != nil {
return ctrl.Result{}, err
}
if err := r.Create(ctx, deploy); err != nil {
return ctrl.Result{}, err
}
logger.Info("Created Deployment", "replicas", desiredReplicas)
} else {
// ── Step 4b: Update if drifted ──────────────────────
existing := &childDeploys.Items[0]
if *existing.Spec.Replicas != desiredReplicas {
existing.Spec.Replicas = &desiredReplicas
if err := r.Update(ctx, existing); err != nil {
return ctrl.Result{}, err
}
}
}
// ── Step 5: Update status ───────────────────────────────
if len(childDeploys.Items) > 0 {
app.Status.ReadyReplicas = childDeploys.Items[0].Status.ReadyReplicas
}
app.Status.Phase = "Running"
if err := r.Status().Update(ctx, &app); err != nil {
return ctrl.Result{}, err
}
// ── Step 6: Return result ───────────────────────────────
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
Notice that the status update uses r.Status().Update() — this hits the /status subresource, which has a separate authorization check and does not modify the spec. This separation is deliberate: it prevents a controller that only needs to report status from accidentally mutating the desired state.
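For custom resources, the /status subresource only exists if the CRD enables it. With kubebuilder you add the marker "+kubebuilder:subresource:status" above the Go type; in the raw CRD manifest it is this fragment (shown under a spec.versions[] entry):

```yaml
# Without this, r.Status().Update() has no /status endpoint to hit and
# status writes go through the main resource endpoint instead.
subresources:
  status: {}
```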
Watches and Predicates
A controller must tell the manager which objects to watch. The SetupWithManager method configures this:
func (r *MyAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&myappv1.MyApp{}). // primary resource
Owns(&appsv1.Deployment{}). // child resource
Owns(&corev1.Service{}). // another child
WithEventFilter(predicate.GenerationChangedPredicate{}).
Complete(r)
}
.For() registers a watch on the primary resource. When a MyApp object is created, updated, or deleted, a reconcile request is enqueued.
.Owns() registers a watch on child resources and automatically maps events back to the owning parent. If someone manually edits a Deployment owned by your MyApp, the controller will reconcile the parent MyApp — and correct the drift.
Predicates filter which events actually trigger reconciliation. GenerationChangedPredicate skips status-only updates (since .metadata.generation only increments on spec changes). You can write custom predicates for arbitrary filtering:
withAnnotation := predicate.NewPredicateFuncs(func(obj client.Object) bool {
return obj.GetAnnotations()["myapp.example.com/managed"] == "true"
})
Requeue Logic
The return value of Reconcile controls what happens next:
| Return Value | Behavior |
|---|---|
| ctrl.Result{}, nil | Terminal. No requeue. The controller is done until the next watch event. |
| ctrl.Result{}, err | Immediate requeue with exponential backoff (workqueue default: 5ms base, doubling to a cap of 1000s, roughly 16.7 minutes). |
| ctrl.Result{Requeue: true}, nil | Immediate requeue (no backoff). Use sparingly. |
| ctrl.Result{RequeueAfter: 30 * time.Second}, nil | Scheduled requeue. Useful for polling external systems. |
The exponential backoff on error is critical. Without it, a controller that encounters a persistent error (like a missing dependency) would hammer the API server in a tight loop. The backoff gives transient errors time to resolve and limits the blast radius of permanent failures.
Concurrency and Idempotency
By default, a controller processes one reconcile request at a time. You can increase parallelism:
ctrl.NewControllerManagedBy(mgr).
WithOptions(controller.Options{MaxConcurrentReconciles: 5}).
Complete(r)
But this means two reconcile calls for different objects may run simultaneously. Your Reconcile function must be idempotent — calling it twice with the same input must produce the same result. It must also be safe for concurrent execution across different keys. Never rely on in-memory state between reconciliations; always read from the API server.
Leader Election
In production, operators typically run with two or more replicas for availability. Only one replica should be actively reconciling at any time. controller-runtime supports leader election out of the box:
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
LeaderElection: true,
LeaderElectionID: "myapp-operator-lock",
})
Leader election uses a Lease object in the cluster. The active leader renews the lease periodically. If it fails to renew (crash, network partition), another replica acquires the lease and begins reconciling. The transition typically takes 15–30 seconds depending on configuration.
Webhook Development
Kubebuilder scaffolds two types of admission webhooks:
Mutating (Defaulting) webhooks modify incoming objects before they are persisted. Use these to inject default values, add labels, or set fields the user omitted:
func (r *MyApp) Default() {
if r.Spec.Replicas == 0 {
r.Spec.Replicas = 3
}
if r.Spec.Image == "" {
r.Spec.Image = "myapp:latest"
}
}
Validating webhooks reject invalid objects. They run after mutating webhooks and return an error if the object violates business rules:
func (r *MyApp) ValidateCreate() (admission.Warnings, error) {
if r.Spec.Replicas > 100 {
return nil, fmt.Errorf("replicas cannot exceed 100")
}
return nil, nil
}
Webhooks require TLS certificates. In production, use cert-manager to automate certificate provisioning and rotation.
The Operator Maturity Model
The Operator Framework defines five maturity levels. Most operators in the wild sit at Level 1 or 2. Reaching Level 5 is rare and typically reserved for complex stateful systems.
OPERATOR MATURITY MODEL
────────────────────────
Level 5 │ AUTO PILOT
│ Automatic scaling, tuning, anomaly detection.
│ Horizontal/vertical scaling based on load.
│ Self-healing beyond simple restart.
│
Level 4 │ DEEP INSIGHTS
│ Expose metrics, alerts, log processing.
│ Grafana dashboards, SLO tracking.
│ Workload-specific telemetry.
│
Level 3 │ FULL LIFECYCLE
│ Automated backup/restore.
│ Version upgrades with data migration.
│ Configuration tuning.
│
Level 2 │ SEAMLESS UPGRADES
│ Patch and minor version upgrades.
│ Operand configuration changes.
│ No downtime during upgrades.
│
Level 1 │ BASIC INSTALL
│ Automated deployment and configuration.
│ Operator manages basic provisioning.
│
└──────────────────────────────────────────────
Increasing automation and operational knowledge
Start at Level 1. Level 3 is the inflection point where automated backup and upgrades pay off. Level 5 (auto-pilot) is rare; CockroachDB and ECK are examples.
Putting It All Together
1. Start with the API. Design your CRD spec and status carefully. They are a contract with your users. Changing them later requires conversion webhooks and migration paths.
2. Keep Reconcile idempotent. If you create a resource, check whether it already exists first. If you update, compare before patching. Never assume the world has not changed between your List and your Create.
3. Use owner references. They give you garbage collection for free and enable the .Owns() watch pattern. When the parent is deleted, all owned children are cleaned up automatically.
4. Separate spec from status. Always use the status subresource. Never let the controller modify the spec.
5. Test with envtest. controller-runtime includes an integration test harness that spins up a real API server and etcd without needing a full cluster. Use it.
6. Think about failure modes. What happens when the API server is unreachable? When a child resource is stuck terminating? When two operators fight over the same resource? The answers should be in your code, not in a runbook.
Common Mistakes and Misconceptions
- “Every application needs an Operator.” Operators are for stateful, complex applications that need operational automation (databases, message queues). A stateless web service managed by a Deployment does not need an Operator.
- “Writing an Operator is straightforward.” Operators encode operational expertise in code. The happy path is simple, but handling every failure mode (partial updates, resource conflicts, cascading failures) correctly takes significant engineering effort.
- “Operators are always better than Helm charts.” Helm charts are simpler: apply once, done. Use Operators when you need active reconciliation; use Helm when install-time configuration is sufficient.
- “All Operators on OperatorHub are production-quality.” OperatorHub lists community and vendor operators with varying maturity levels. Check the capability level (basic install through full lifecycle) and community adoption before deploying to production.
Further Reading
- Operator pattern — the official Kubernetes documentation explaining the operator concept, when to use one, and how operators extend the API.
- Operator SDK documentation — the full guide for building operators with the Operator SDK, covering Go, Ansible, and Helm-based operators.
- The Kubebuilder Book — a comprehensive tutorial that walks through building a controller from scratch using kubebuilder, including CRD design, webhook configuration, and testing.
- OperatorHub.io — a catalog of community and vendor operators you can install in your cluster, useful for understanding what problems operators solve in practice.
- Introducing Operators — the original CoreOS blog post by Brandon Philips that coined the term “operator” and explained the motivation behind encoding operations knowledge in software.
- controller-runtime documentation — API reference for the Go library that underpins both kubebuilder and Operator SDK, covering the Manager, Controller, Reconciler, and client interfaces.
- Programming Kubernetes (O’Reilly) — a book by Michael Hausenblas and Stefan Schimanski that covers the Kubernetes API machinery, custom resources, and operator development in depth.
Next: The Kubernetes API Internals — how requests flow through admission, what aggregated API servers are, and how API priority and fairness protects the control plane.
Chapter 39: The Kubernetes API Internals
Every interaction with a Kubernetes cluster — every kubectl apply, every controller reconciliation, every kubelet heartbeat — is an HTTP request to the API server.
This chapter covers the internal request lifecycle, API versioning mechanics, aggregated API servers, admission webhooks, conversion webhooks, and the priority and fairness system that prevents any single tenant from overwhelming the control plane.
API Groups and Versioning
Kubernetes organizes its API into groups. The core group (empty string, paths under /api/v1) contains the original resources: Pods, Services, ConfigMaps, Secrets. Everything added later lives in named groups under /apis/ — apps/v1 for Deployments, batch/v1 for Jobs, networking.k8s.io/v1 for Ingress.
Each group can serve multiple versions simultaneously, but only one is the storage version — the format actually written to etcd. When you create a Deployment via apps/v1, the API server converts it to the storage version before writing. When you read via a different version, it converts from storage on the way out.
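For CRDs the same mechanism is visible in the manifest: each entry in spec.versions is independently served or not, and exactly one carries storage: true. A fragment:

```yaml
versions:
- name: v1alpha1
  served: true     # still readable and writable by clients
  storage: false
- name: v1
  served: true
  storage: true    # the format actually written to etcd
```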
Version Progression
Versions follow a strict graduation path:
| Stage | Convention | Meaning |
|---|---|---|
| Alpha | v1alpha1 | Disabled by default. May change or be removed without notice. Never use in production. |
| Beta | v1beta1 | Historically enabled by default; since 1.24, new beta APIs are off by default and require explicit opt-in. Schema is mostly stable but may change. A migration path will be provided. |
| Stable | v1 | GA. The API is committed. Breaking changes require a new group or version. |
The version in the group name (v1, v2) is the API version, not the software version. autoscaling/v2 replaced autoscaling/v2beta2 when HPA’s extended metrics support graduated.
The API Request Lifecycle
Every request to the API server passes through a series of stages.
sequenceDiagram
participant C as Client
participant AuthN as Authentication
participant AuthZ as Authorization
participant MW as Mutating Webhooks
participant SV as Schema Validation
participant VW as Validating Webhooks
participant E as etcd
C->>AuthN: request (kubectl, controller, kubelet)
Note right of AuthN: Who are you?<br>x509 certs, bearer tokens,<br>OIDC, webhook
AuthN->>AuthZ: user: alice, groups: [devs]
Note right of AuthZ: Are you allowed?<br>RBAC, Webhook, Node
AuthZ->>MW: allowed
Note right of MW: Modify object (serial)<br>Add defaults, inject sidecars,<br>set labels
MW->>SV: mutated object
Note right of SV: OpenAPI schema check<br>+ CEL validation rules
SV->>VW: valid object
Note right of VW: Policy checks (parallel)<br>Cannot modify, only reject
VW->>E: approved
Note right of E: Convert to storage version<br>Write to etcd
E-->>C: response
The ordering of mutating before validating is deliberate. Mutating webhooks may add fields that validating webhooks then check. If validation ran first, it would reject objects that mutating webhooks would have fixed.
Aggregated API Servers
Not every API endpoint is served by the core API server. The aggregation layer allows you to register custom API servers that handle specific API groups. The core API server proxies requests to these backends transparently.
This is how the metrics API (metrics.k8s.io) works. The metrics-server registers an APIService object:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
name: v1beta1.metrics.k8s.io
spec:
service:
name: metrics-server
namespace: kube-system
group: metrics.k8s.io
version: v1beta1
groupPriorityMinimum: 100
versionPriority: 100
When a client requests kubectl top pods, the API server sees that metrics.k8s.io is handled by the metrics-server Service and proxies the request there. Authentication and authorization still happen at the front door — the aggregated server receives the request with identity headers already set.
Aggregated API servers are powerful but operationally heavy: they require their own storage, their own availability guarantees, and careful certificate management. For most use cases, CRDs are the simpler extension mechanism. Use aggregated APIs when you need custom storage backends, custom admission logic baked into the server, or sub-resource behaviors that CRDs cannot express.
Admission Webhooks
Admission webhooks are the primary extension point for policy enforcement and object mutation. They intercept requests after authentication and authorization but before storage.
Mutating Admission Webhooks
Mutating webhooks are called in serial, but the invocation order is determined by the API server (alphabetically by webhook name) and should not be relied upon. Each webhook can modify the object, and subsequent webhooks see the modifications made by previous ones. Furthermore, the API server may re-invoke mutating webhooks if a later webhook modifies the object, to give earlier webhooks a chance to react. Design mutating webhooks to be idempotent and order-independent. Common uses:
- Injecting sidecar containers (Istio, Linkerd)
- Adding default labels and annotations
- Setting resource requests/limits from policy
- Rewriting image references to use a private registry mirror
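A mutating webhook responds with JSON Patch operations, and idempotency means re-running the same logic on an already-mutated object produces no further patches. A dependency-free sketch (the label key and value are hypothetical):

```go
package main

import "fmt"

// ensureLabel returns the JSON Patch ops needed so that a label exists.
// Called on an object that already carries the label, it returns no ops,
// which is what makes the webhook idempotent and safe to re-invoke.
func ensureLabel(labels map[string]string, key, value string) []string {
	if labels[key] == value {
		return nil // already mutated: nothing to do
	}
	return []string{fmt.Sprintf(
		`{"op":"add","path":"/metadata/labels/%s","value":"%s"}`, key, value)}
}

func main() {
	labels := map[string]string{}
	fmt.Println(len(ensureLabel(labels, "team", "alpha"))) // first pass: one op
	labels["team"] = "alpha"
	fmt.Println(len(ensureLabel(labels, "team", "alpha"))) // re-invocation: zero ops
}
```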
Validating Admission Webhooks
Validating webhooks are called in parallel after all mutating webhooks have run. They cannot modify the object — they can only accept or reject it. If any validating webhook rejects, the request is denied. Common uses:
- Enforcing naming conventions
- Requiring specific labels (owner, cost-center)
- Blocking privileged containers
- Preventing deployments to protected namespaces
The Admission Webhook Pipeline
The following diagram traces a Pod creation through the full admission webhook pipeline, showing how mutating webhooks run serially (each receiving the output of the previous one) while validating webhooks run in parallel (all must allow for the request to succeed).
sequenceDiagram
participant Client as Client (kubectl)
participant API as API Server
participant MW1 as Mutating Webhook 1<br>(e.g. Istio sidecar inject)
participant MW2 as Mutating Webhook 2<br>(e.g. Vault secret inject)
participant SV as Schema Validator
participant VW1 as Validating Webhook 1<br>(e.g. OPA Gatekeeper)
participant VW2 as Validating Webhook 2<br>(e.g. Kyverno)
participant etcd as etcd
Client->>API: CREATE Pod
Note over API: authn/authz (internal)
rect rgba(50, 108, 229, 0.1)
Note over API,MW2: Mutating webhooks run SERIALLY
API->>MW1: POST /mutate
MW1-->>API: mutated obj
API->>MW2: POST /mutate (with mutations from WH 1)
MW2-->>API: mutated obj
end
API->>SV: validate schema
SV-->>API: OK
rect rgba(90, 142, 240, 0.1)
Note over API,VW2: Validating webhooks run in PARALLEL
par
API->>VW1: POST /validate
VW1-->>API: allowed: true
and
API->>VW2: POST /validate
VW2-->>API: allowed: true
end
end
API->>etcd: ALL passed -> store
etcd-->>API: stored
API-->>Client: 201 Created
Configuration Details
A webhook configuration includes several critical fields:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: require-labels
webhooks:
- name: require-labels.example.com
rules:
- apiGroups: ["apps"]
apiVersions: ["v1"]
operations: ["CREATE", "UPDATE"]
resources: ["deployments"]
clientConfig:
service:
name: label-enforcer
namespace: policy-system
caBundle: <base64-encoded-CA>
failurePolicy: Fail # or Ignore
sideEffects: None # None, NoneOnDryRun, or Unknown
timeoutSeconds: 5 # default 10, max 30
matchConditions: # CEL-based filtering (beta 1.28, GA 1.30)
- name: exclude-system
expression: "!object.metadata.namespace.startsWith('kube-')"
`failurePolicy` controls what happens when the webhook itself is unreachable or returns an error. `Fail` means the API request is rejected — safe but can block the entire cluster if the webhook goes down. `Ignore` means the request proceeds without the webhook check — available but potentially unsafe. In production, `Fail` is the correct default for security-critical webhooks, but you must ensure the webhook is highly available.
`sideEffects` declares whether the webhook has side effects beyond modifying the admission response. Webhooks that write to external systems should declare this honestly; it affects dry-run behavior.
`matchConditions` (beta in Kubernetes 1.28, GA in 1.30) use CEL expressions to filter which objects actually get sent to the webhook. This is far more efficient than filtering inside the webhook itself, because non-matching objects never leave the API server.
`timeoutSeconds` sets how long the API server waits for a webhook response. Keep this low (3–5 seconds). A slow webhook adds latency to every matching API request. A webhook that consistently times out under `failurePolicy: Fail` will make the cluster unusable.
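Putting several of these fields together, a defensively configured mutating webhook registration might look like the following sketch (all names are hypothetical). `reinvocationPolicy: IfNeeded` opts into re-invocation, and the `namespaceSelector` keeps the webhook from intercepting its own namespace, a classic way to deadlock a cluster:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector              # hypothetical
webhooks:
- name: inject.example.com
  clientConfig:
    service:
      name: injector
      namespace: injector-system
    caBundle: <base64-encoded-CA>
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  reinvocationPolicy: IfNeeded        # allow re-invocation after later mutations
  failurePolicy: Ignore               # non-critical: do not block pod creation
  sideEffects: None
  timeoutSeconds: 3
  namespaceSelector:                  # never intercept the webhook's own namespace
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["injector-system", "kube-system"]
```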
Conversion Webhooks
When a CRD serves multiple versions, the API server needs a way to convert between them. For trivial changes, Kubernetes can handle this automatically. For structural changes, you deploy a conversion webhook.
The model is hub and spoke: you designate one version as the “hub” (typically the storage version), and the webhook converts between the hub and every other version. This avoids the combinatorial explosion of converting between every pair of versions.
flowchart LR
Hub["v1<br>(Hub / storage version)"]
A["v1alpha1"]
B["v1beta1"]
Hub -- "webhook converts<br>v1 → v1alpha1" --> A
A -- "webhook converts<br>v1alpha1 → v1" --> Hub
Hub -- "webhook converts<br>v1 → v1beta1" --> B
B -- "webhook converts<br>v1beta1 → v1" --> Hub
Conversion webhooks must be lossless — converting v1beta1 → v1 → v1beta1 must produce the original object. If a new version adds fields that older versions lack, use annotations to preserve the data during round-trips. This is subtle and error-prone; test conversion extensively.
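Wiring a conversion webhook into a CRD happens in the CRD's `spec.conversion` stanza. A sketch with hypothetical names (schemas elided):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com           # hypothetical CRD
spec:
  group: example.com
  names: {kind: Widget, plural: widgets, singular: widget}
  scope: Namespaced
  versions:
  - name: v1beta1
    served: true
    storage: false
    schema:
      openAPIV3Schema: {type: object, x-kubernetes-preserve-unknown-fields: true}
  - name: v1
    served: true
    storage: true                     # the hub / storage version
    schema:
      openAPIV3Schema: {type: object, x-kubernetes-preserve-unknown-fields: true}
  conversion:
    strategy: Webhook                 # "None" suffices for schema-identical versions
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: widget-converter
          namespace: widget-system
          path: /convert
        caBundle: <base64-encoded-CA>
```

The webhook behind `/convert` receives a batch of objects in one version and must return them in the requested version.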
API Priority and Fairness
A single misbehaving controller can issue thousands of LIST requests and overwhelm the API server, starving kubelet heartbeats and other critical traffic. API Priority and Fairness (APF) prevents this by categorizing requests into priority levels and applying fair queuing within each level.
How It Works
- FlowSchemas classify incoming requests. Each FlowSchema matches requests by user, group, namespace, verb, or resource and assigns them to a priority level.
- Priority levels define how much of the API server’s capacity is allocated to that class of traffic. Higher-priority levels get more capacity and can borrow from lower levels.
- Within a priority level, fair queuing ensures that no single flow (e.g., requests from one service account) monopolizes the allocation.
The system ships with several built-in FlowSchemas:
| FlowSchema | Priority Level | Purpose |
|---|---|---|
| `exempt` | `exempt` | Health checks, `system:masters`. No queuing. |
| `system-leader-election` | `leader-election` | Controller manager, scheduler leader election. |
| `system-nodes` | `system` | Kubelet requests. Must not be starved. |
| `kube-controller-manager` | `workload-high` | Built-in controllers. |
| `service-accounts` | `workload-low` | Default for service account traffic. |
| `global-default` | `global-default` | Catch-all for unmatched requests. |
The exempt level is special — requests skip all queuing and rate limiting. This ensures that the API server can always respond to its own health checks and that break-glass admin access is never throttled.
Diagnosing APF Issues
When requests are being throttled, the API server returns 429 Too Many Requests with a Retry-After header. The apiserver_flowcontrol_dispatched_requests_total and apiserver_flowcontrol_rejected_requests_total metrics reveal which priority levels are saturated.
If your operator is being throttled, the fix is usually one of:
- Reduce the request rate (use watches instead of polling, use caches, reduce list scope with label selectors).
- Create a dedicated FlowSchema that assigns your operator to a higher priority level.
- Increase the concurrency shares for the relevant priority level.
Option 1 is almost always the right answer. Options 2 and 3 just shift the problem to other tenants.
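For completeness, option 2 has roughly this shape (names, shares, and queue sizes are illustrative, using the `flowcontrol.apiserver.k8s.io/v1` API):

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: my-operator                 # hypothetical
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 20    # relative share of API server capacity
    limitResponse:
      type: Queue
      queuing:
        queues: 16
        queueLengthLimit: 50
        handSize: 4
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: my-operator
spec:
  priorityLevelConfiguration:
    name: my-operator               # bind to the level above
  matchingPrecedence: 1000          # lower wins; must beat service-accounts
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: my-operator
        namespace: operators
    resourceRules:
    - verbs: ["get", "list", "watch"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
```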
Design Implications
Understanding the API internals changes how you build on Kubernetes:
Webhook placement matters. A mutating webhook that injects sidecars adds latency to every pod creation. Measure it. Keep webhook logic fast and simple. Avoid calling external services from inside a webhook.
Conversion webhooks are migration infrastructure. Plan for them from the start if your CRD is likely to evolve. Design your v1 storage version with enough flexibility that you do not need structural changes with every release.
APF protects the control plane from you. If your operator lists all pods in a 10,000-pod cluster every 30 seconds, APF will eventually throttle it. Use informer caches, label selectors, and field selectors to minimize API server load.
Authentication is pluggable. The API server does not care how you prove your identity — it supports client certificates, OIDC tokens, webhook-based token review, and service account tokens.
Common Mistakes and Misconceptions
- “Admission webhooks are fire-and-forget.” A failing webhook can block all resource creation in your cluster. Always configure `failurePolicy: Ignore` for non-critical webhooks and ensure webhook services have high availability.
- “Mutating and validating webhooks run in any order.” Mutating webhooks run first (and can run multiple rounds), then validating webhooks run. A validating webhook sees the final mutated object, not the original user submission.
- “CRDs are free to create.” Each CRD adds load to the API server: storage in etcd, watch channels, discovery endpoints. Hundreds of CRDs (common with Crossplane providers) measurably increase API server memory and CPU usage.
Further Reading
- Dynamic Admission Control — the official Kubernetes documentation on mutating and validating admission webhooks, including configuration, failure policies, and reinvocation.
- Webhook Configuration Reference — details on configuring webhook authentication and authorization backends, including token review and subject access review webhooks.
- API Aggregation Layer — how to extend the Kubernetes API with your own API server registered via APIService objects, including when to use aggregation versus CRDs.
- Versions in CustomResourceDefinitions — how CRD versioning works, including storage versions, conversion webhooks, and strategies for evolving your API without breaking clients.
- Extending the Kubernetes API — an overview comparing CRDs and aggregated API servers, covering the tradeoffs and use cases for each extension mechanism.
- The Life of a Kubernetes API Request — a KubeCon talk that traces a request through authentication, authorization, admission, validation, and storage, visualizing the full request pipeline.
- API Priority and Fairness — the official documentation on APF, covering FlowSchemas, PriorityLevelConfigurations, and how the API server prevents any single client from starving others.
Next: etcd Operations — the database that stores everything, and how to keep it healthy.
Chapter 40: etcd Operations
Every object in a Kubernetes cluster — every Pod, every Secret, every ConfigMap, every CRD instance — exists as a key-value pair in etcd. There is no secondary database, no cache that can reconstruct state. If etcd loses data, the cluster loses its memory. If etcd becomes slow, every API server call becomes slow. If etcd goes down, the cluster is effectively frozen: controllers cannot reconcile, the scheduler cannot place pods, and kubectl returns errors.
This chapter covers backup, restore, maintenance, monitoring, and the disaster recovery procedures you hope to never need.
What etcd Stores
etcd is a distributed key-value store that uses the Raft consensus protocol for replication. In a Kubernetes cluster, the API server is the only client — all reads and writes go through it. etcd stores:
- All resource objects (Pods, Services, Deployments, Secrets, etc.)
- Cluster configuration (RBAC rules, admission configurations)
- Lease objects (node heartbeats, leader election)
- Custom resources (anything registered via CRDs)
- Events (though these are often short-lived)
The data is stored under a key hierarchy rooted at /registry/. A Pod named nginx in namespace default lives at /registry/pods/default/nginx.
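The key scheme can be sketched in shell; the layout below matches the default `/registry/` prefix, and cluster-scoped resources (such as Nodes) simply omit the namespace segment:

```shell
# Build the etcd key for a namespaced resource under the default /registry prefix.
resource=pods
ns=default
name=nginx
key="/registry/${resource}/${ns}/${name}"
echo "$key"
```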
etcd Cluster Architecture
flowchart TD
subgraph Member1["etcd Member 1 (LEADER)"]
WAL1["WAL<br>(write-ahead log)"]
DB1["DB<br>(boltdb / bbolt)"]
end
subgraph Member2["etcd Member 2 (FOLLOWER)"]
WAL2["WAL<br>(write-ahead log)"]
DB2["DB<br>(boltdb / bbolt)"]
end
subgraph Member3["etcd Member 3 (FOLLOWER)"]
WAL3["WAL<br>(write-ahead log)"]
DB3["DB<br>(boltdb / bbolt)"]
end
Member1 -- "Raft: replicate<br>log entries" --> Member2
Member1 -- "Raft: replicate<br>log entries" --> Member3
Writes["Write requires agreement<br>from majority (2 of 3)"]
Reads["Read can be served by<br>any member (with<br>consistency options)"]
Member1 --- Writes
Member2 --- Reads
API["API Server connects via gRPC over TLS.<br>Only the API server talks to etcd directly.<br>All other components go through the API server."]
Writes --- API
Reads --- API
Raft requires a quorum — a majority of members — to commit writes. With 3 members, you can lose 1. With 5, you can lose 2. Always run an odd number of members. An even number provides no additional fault tolerance (4 members still tolerates only 1 failure, same as 3) while increasing the coordination overhead.
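The quorum arithmetic is worth internalizing; a quick sketch:

```shell
# quorum = floor(n/2) + 1; tolerated failures = n - quorum.
# Note that 4 members tolerate no more failures than 3.
for n in 1 3 4 5 7; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerated=$tolerated"
done
```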
Backup
A backup you have never tested is wishful thinking.
Taking a Snapshot
# Using etcdctl (the network-aware CLI)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot
etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table
The snapshot captures the entire database at a point in time. Schedule snapshots at least every hour for production clusters. Store them off-cluster — in object storage (S3, GCS) with versioning enabled.
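A common pattern is a CronJob pinned to the control-plane nodes that snapshots hourly and ships the result off-cluster. A sketch; the image, paths, and upload step are assumptions specific to a kubeadm-style layout:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 * * * *"              # hourly
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true          # reach etcd on 127.0.0.1
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
          containers:
          - name: backup
            image: registry.example.com/etcdctl:3.5   # assumed image
            command: ["/bin/sh", "-c"]
            args:
            - |
              ETCDCTL_API=3 etcdctl snapshot save "/backup/etcd-$(date +%Y%m%d%H%M).db" \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key
              # Then upload /backup/* to versioned object storage (tooling-specific).
            volumeMounts:
            - {name: pki, mountPath: /etc/kubernetes/pki/etcd, readOnly: true}
            - {name: backup, mountPath: /backup}
          restartPolicy: OnFailure
          volumes:
          - name: pki
            hostPath: {path: /etc/kubernetes/pki/etcd}
          - name: backup
            hostPath: {path: /var/backups/etcd}
```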
What Snapshots Do Not Capture
Snapshots capture etcd data only. They do not capture:
- Container images (stored in registries)
- Persistent volume data (stored on disks/NAS/cloud volumes)
- External secrets (in Vault, AWS Secrets Manager, etc.)
- Certificates (unless stored as Kubernetes Secrets)
A complete disaster recovery plan must address all of these.
Restore
Restoring from a snapshot is a destructive operation. It creates a new etcd data directory with a new cluster ID. All existing members must be stopped, and the new data directory must be distributed to all of them.
The Restore Command
Since etcd 3.5.x, the etcdctl snapshot restore command is deprecated. Use etcdutl instead:
# etcdutl operates on local files --- no network connection needed
etcdutl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restored \
--name=etcd-member-1 \
--initial-cluster="etcd-member-1=https://10.0.1.10:2380,\
etcd-member-2=https://10.0.1.11:2380,\
etcd-member-3=https://10.0.1.12:2380" \
--initial-cluster-token=etcd-cluster-restored \
--initial-advertise-peer-urls=https://10.0.1.10:2380
etcdctl vs etcdutl
This distinction confuses many operators:
| Tool | Scope | Example Operations |
|---|---|---|
| `etcdctl` | Network operations. Talks to a running etcd server over gRPC. | `snapshot save`, `get`, `put`, `member list`, `endpoint health` |
| `etcdutl` | File operations. Works on local data files without a running server. | `snapshot restore`, `snapshot status`, `defrag` (offline) |
The rule of thumb: if the cluster is running and you are interacting with it, use etcdctl. If the cluster is down and you are operating on files, use etcdutl.
Compaction
etcd is a versioned key-value store. Every write creates a new revision. By default, etcd keeps all historical revisions, which means the database grows indefinitely. Compaction removes revisions older than a specified point.
# Get the current revision
rev=$(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')
# Compact everything older than current revision minus 10000
etcdctl compact $((rev - 10000))
Kubernetes API server handles compaction automatically via the --etcd-compaction-interval flag (default: 5 minutes). The API server compacts revisions older than the specified interval. You rarely need to run compaction manually unless the automatic process has fallen behind.
Defragmentation
Compaction marks old revisions as deleted but does not reclaim disk space. The database file retains its size (or grows) because bbolt uses a free-list internally. Defragmentation rewrites the database to reclaim this space.
# Online defragmentation (one member at a time)
etcdctl defrag --endpoints=https://10.0.1.10:2379
# Offline defragmentation (member must be stopped)
etcdutl defrag --data-dir=/var/lib/etcd
Defragmentation is an expensive operation that briefly blocks reads and writes on the affected member. In a multi-member cluster, defragment one member at a time, waiting for it to catch up with the leader before moving to the next. Never defragment the leader first — defragment followers, then transfer leadership, then defragment the old leader.
Performance Tuning
etcd is exquisitely sensitive to disk I/O latency. The single most impactful tuning decision is giving etcd dedicated, fast storage.
Hardware Recommendations
- Disk: NVMe SSD or high-IOPS cloud volumes (gp3 with provisioned IOPS on AWS, pd-ssd on GCP). etcd’s WAL fsync is on the critical path for every write. Spinning disks are unacceptable. Network-attached storage with unpredictable latency is dangerous.
- CPU: 2–4 dedicated cores. etcd is not CPU-intensive but is sensitive to scheduling delays.
- Memory: 8 GB is sufficient for most clusters. etcd memory-maps its database, so larger databases need proportionally more RAM.
- Network: Low-latency links between members. Raft consensus requires leader-to-follower round trips for every write. Cross-region etcd clusters are a recipe for latency problems.
Dedicated Machines
For production clusters, run etcd on dedicated nodes — not co-located with the API server or other control plane components. A CPU-hungry admission webhook or a memory leak in the scheduler should not be able to starve etcd of resources.
Tuning Parameters
# Increase heartbeat interval for high-latency networks (default 100ms)
--heartbeat-interval=250
# Increase election timeout proportionally (default 1000ms)
--election-timeout=2500
# Set snapshot count (how many transactions between snapshots)
--snapshot-count=10000
# Set quota backend bytes (database size limit, default 2GB, max 8GB)
--quota-backend-bytes=8589934592
The `--quota-backend-bytes` flag is a safety valve. When the database exceeds this limit, etcd raises an alarm and switches to read-only mode to prevent unbounded growth. If this happens, you must compact and defragment to get below the limit before etcd will accept writes again.
Key Monitoring Metrics
Monitor these metrics to catch problems before they become outages:
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| `etcd_mvcc_db_total_size_in_bytes` | Database size. Indicates growth trends. | > 6 GB (approaching 8 GB quota) |
| `etcd_disk_wal_fsync_duration_seconds` | WAL write latency. The canary for disk problems. | p99 > 10ms |
| `etcd_disk_backend_commit_duration_seconds` | Backend commit latency. | p99 > 25ms |
| `etcd_network_peer_round_trip_time_seconds` | Peer-to-peer latency. | p99 > 50ms |
| `etcd_server_proposals_failed_total` | Failed Raft proposals. Indicates leader instability. | Any increase |
| `etcd_server_leader_changes_seen_total` | Leader elections. Frequent changes signal network or disk issues. | > 3 per hour |
| `etcd_server_has_leader` | Whether this member sees a leader. | 0 for > 30s |
The WAL fsync duration is the single most important metric. When disk latency increases, writes slow down, Raft heartbeats are delayed, followers fall behind, and the leader may trigger unnecessary elections. Everything cascades from slow disks.
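These thresholds translate directly into alerting rules. A Prometheus sketch (rule names and evaluation windows are illustrative):

```yaml
groups:
- name: etcd-health                  # illustrative group name
  rules:
  - alert: EtcdWALFsyncSlow
    expr: >
      histogram_quantile(0.99,
        rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
    for: 10m
    labels: {severity: critical}
    annotations:
      summary: "etcd WAL fsync p99 above 10ms; check disk latency"
  - alert: EtcdNoLeader
    expr: etcd_server_has_leader == 0
    for: 30s
    labels: {severity: critical}
    annotations:
      summary: "etcd member reports no leader"
```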
Scaling the Cluster
Adding Members
# Add a new member (run from existing cluster)
etcdctl member add etcd-member-4 \
--peer-urls=https://10.0.1.13:2380
# Start the new member with --initial-cluster-state=existing
New members can join as learners (non-voting) via `etcdctl member add --learner`; once a learner has caught up with the leader’s log, promote it to a full voting member with `etcdctl member promote`. Always add one member at a time and wait for it to become healthy before adding the next.
Removing Members
# Get member ID
etcdctl member list
# Remove the member
etcdctl member remove <member-id>
When scaling from 3 to 5 members, you gain tolerance for 2 failures instead of 1, but you increase the write latency (the leader must wait for acknowledgment from 3 members instead of 2). Most clusters should stay at 3 or 5 members; 7+ members hurt write latency without meaningful availability gain.
Disaster Recovery Procedure
When etcd is down and you have a snapshot, follow this sequence.
The following sequence diagram shows the exact ordering of a disaster recovery restore. The order is critical — starting API servers before etcd is stopped, or restoring only some members, causes data loss or split-brain:
sequenceDiagram
participant Op as SRE / Operator
participant API1 as API Server (node 1)
participant API23 as API Server (node 2+3)
participant E1 as etcd member-1
participant E2 as etcd member-2
participant E3 as etcd member-3
participant Tool as etcdutl
Note over Op,Tool: 1. Stop ALL API servers (prevent writes during restore)
Op->>API1: stop
Op->>API23: stop
API1-->>Op: stopped
API23-->>Op: stopped
Note over Op,Tool: 2. Stop ALL etcd members
Op->>E1: stop
Op->>E2: stop
Op->>E3: stop
E1-->>Op: stopped
E2-->>Op: stopped
E3-->>Op: stopped
Note over Op,Tool: 3. Restore snapshot on EACH member
Op->>Tool: etcdutl snapshot restore
Tool->>E1: restore to new data-dir
Tool->>E2: restore to new data-dir
Tool->>E3: restore to new data-dir
Note over Op,Tool: 4. Replace data directories (mv new-data -> /var/lib/etcd)
Op->>E1: replace data-dir
Op->>E2: replace data-dir
Op->>E3: replace data-dir
Note over Op,Tool: 5. Start etcd members (one at a time)
Op->>E1: start
E1-->>Op: running
Op->>E2: start
E2-->>Op: running
Op->>E3: start
E3-->>Op: running
Note over Op,Tool: 6. Verify: etcdctl endpoint health + member list
Op->>E1: health check
E1-->>Op: all healthy
Note over Op,Tool: 7. Start API servers
Op->>API1: start
Op->>API23: start
API1-->>Op: running
API23-->>Op: running
Note over Op,Tool: CRITICAL: If you start API servers<br>before stopping etcd, new writes will be<br>lost when the snapshot overwrites the data directory.
Follow these steps:
- Stop all API servers. They will reconnect when etcd is back.
- Stop all etcd members.
- Restore the snapshot on each member using `etcdutl snapshot restore` with the correct `--name`, `--initial-cluster`, and `--initial-advertise-peer-urls` for each member.
- Replace the data directory on each member with the restored data.
- Start all etcd members simultaneously (or within a few seconds of each other).
- Verify cluster health: `etcdctl endpoint health`
- Start the API servers.
- Verify cluster state: `kubectl get nodes`, `kubectl get pods --all-namespaces`
After a restore, the cluster will be in the state captured by the snapshot. Any objects created between the snapshot and the failure are lost. This is why frequent snapshots and short RPO targets matter.
If only one member has failed (and quorum is maintained), do not restore from a snapshot. Instead, remove the failed member, provision a new one, and add it to the cluster. The Raft protocol will replicate the current state to the new member automatically.
Common Mistakes and Misconceptions
- “etcd backs up automatically in managed Kubernetes.” True for the control plane etcd in EKS/GKE/AKS. But if you run self-managed clusters or use etcd for other purposes, backups are your responsibility. Test restores regularly.
- “etcd can store large values.” etcd has a default per-value limit of 1.5 MB. Storing large ConfigMaps, Secrets, or CRDs that approach this limit degrades performance. Keep resources small.
- “Adding more etcd nodes improves write performance.” More nodes means more Raft acknowledgments per write — 7+ members hurt, not help, write performance.
Further Reading
- etcd Documentation — the official etcd docs covering installation, configuration, clustering, authentication, and the client API.
- etcd Performance — benchmarking methodology and tuning guidance for etcd, including disk I/O recommendations, network latency requirements, and how to interpret benchmark results.
- etcd Disaster Recovery — step-by-step procedures for recovering an etcd cluster from snapshot backups, including single-member and multi-member restore workflows.
- etcd FAQ — answers to common operational questions about etcd, including cluster sizing, data size limits, request size limits, and performance expectations.
- Operating etcd Clusters for Kubernetes — the Kubernetes-specific guide for setting up, backing up, and upgrading etcd, including TLS configuration and snapshot best practices.
- etcd-operator (archived) — the original CoreOS operator for managing etcd clusters on Kubernetes, now archived but valuable as a reference for understanding automated etcd lifecycle management.
- Auger — a tool for directly decoding and inspecting Kubernetes objects stored in etcd, useful for debugging and understanding how the API server serializes resources.
Next: GPU Workloads and AI/ML on Kubernetes — how GPUs are exposed to the scheduler, shared between workloads, and orchestrated for distributed training.
Chapter 41: GPU Workloads and AI/ML on Kubernetes
Kubernetes was built to orchestrate stateless web services. GPUs were built to render triangles and multiply matrices. Bringing these two worlds together required years of extension work — device plugins, operator stacks, specialized schedulers, and high-speed networking — because none of the original Kubernetes abstractions anticipated hardware accelerators.
The Device Plugin Framework
Kubernetes has no native understanding of GPUs. It knows about CPU (millicores), memory (bytes), ephemeral storage, and hugepages. Everything else enters through the device plugin framework, a gRPC-based extension point introduced in Kubernetes 1.8. In Chapter 3, we described the kubelet as a single-responsibility agent that converts API state into running containers. The device plugin framework extends the kubelet’s vocabulary beyond CPU and memory, letting it manage hardware it was never designed to know about.
How It Works
A device plugin is a process (usually running as a DaemonSet on every GPU node) that implements three gRPC services:
- Registration: The plugin connects to the kubelet’s Registration service at `/var/lib/kubelet/device-plugins/kubelet.sock` and announces a resource name (e.g., `nvidia.com/gpu`).
- ListAndWatch: The kubelet calls `ListAndWatch` on the plugin. The plugin returns a stream of device IDs — one per physical GPU (or virtual slice). If a GPU fails or is removed, the plugin sends an updated list. The kubelet forwards this inventory to the API server, which stores it in the Node’s `.status.capacity` and `.status.allocatable` fields.
- Allocate: When the scheduler places a pod requesting `nvidia.com/gpu: 1` on this node, the kubelet calls `Allocate` with the chosen device ID. The plugin returns the environment variables, device mounts, and annotations needed to make the GPU visible inside the container (e.g., `/dev/nvidia0`, the NVIDIA device files, and `NVIDIA_VISIBLE_DEVICES`).
flowchart TD
subgraph Node["GPU Node"]
Plugin["NVIDIA Device Plugin (Pod)<br>Reports: GPU-0, GPU-1, GPU-2, GPU-3"]
Kubelet["kubelet"]
Plugin -- "1. Register('nvidia.com/gpu')" --> Kubelet
Kubelet -- "2. ListAndWatch()<br>Plugin streams device IDs:<br>{GPU-0, GPU-1, GPU-2, GPU-3}" --> Plugin
Kubelet -- "4. Allocate(GPU-2)" --> Plugin
Plugin -- "Returns:<br>/dev/nvidia2<br>NVIDIA_VISIBLE_DEVICES=2<br>volume mounts" --> Kubelet
end
Kubelet -- "3. Updates Node status:<br>capacity: nvidia.com/gpu: 4" --> API["API Server<br>Node object .status:<br>allocatable: nvidia.com/gpu: 4"]
The device plugin protocol has two phases — registration (once at startup) and per-pod allocation. The following sequence diagram shows both:
sequenceDiagram
participant Plugin as NVIDIA Device Plugin
participant Kubelet as kubelet<br>(device manager)
participant API as API Server
participant Sched as Scheduler
participant User as User (kubectl)
rect rgba(50, 108, 229, 0.1)
Note over Plugin,User: PHASE 1: REGISTRATION (startup)
Plugin->>Kubelet: Register() via Unix socket
Kubelet-->>Plugin: accepted
Plugin->>Kubelet: ListAndWatch() stream:<br>[gpu-0, gpu-1, gpu-2, gpu-3]
Kubelet->>API: update Node .status.capacity<br>nvidia.com/gpu: 4
end
rect rgba(90, 142, 240, 0.1)
Note over Plugin,User: PHASE 2: PER-POD ALLOCATION
User->>API: create Pod<br>nvidia.com/gpu: 1
API->>Sched: schedule: node has available GPU
Sched-->>API: bind Pod to node
API-->>Kubelet: Pod assigned to node
Kubelet->>Plugin: Allocate() request: gpu-0
Plugin-->>Kubelet: response:<br>/dev/nvidia0<br>CUDA_VISIBLE_DEVICES=0<br>/usr/lib/nvidia
Note over Kubelet: start container with<br>GPU access (via CRI)
Kubelet->>API: update Pod status: Running
end
Critical Constraints
The device plugin model has several hard limitations that shape everything downstream:
- Integer-only quantities. You request `nvidia.com/gpu: 1` or `nvidia.com/gpu: 2`. There is no `nvidia.com/gpu: 0.5`. Fractional GPUs do not exist in this model.
- Non-sharable. A GPU allocated to one pod is exclusively allocated. Two pods cannot share the same device ID through the standard device plugin.
- Not overcommittable. Unlike CPU, which can be overcommitted (requests < limits), GPU counts are absolute. If a node has 4 GPUs and 4 are allocated, a fifth pod cannot be scheduled there.
- No memory management. Kubernetes has no visibility into GPU memory. There is no equivalent of `resources.limits.memory` for GPU VRAM. A pod requesting `nvidia.com/gpu: 1` gets the full physical GPU.
These constraints are why MIG, MPS, time-slicing, and ultimately DRA were created.
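Requesting a GPU under this model is just an extended resource in the pod spec. A minimal sketch (the image tag is an assumption):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04   # assumed tag
    command: ["nvidia-smi"]                      # prints visible GPUs and exits
    resources:
      limits:
        nvidia.com/gpu: 1   # integer only; for extended resources this also sets the request
```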
The NVIDIA GPU Operator
Installing GPU drivers on bare metal is annoying. Installing GPU drivers on every node in a Kubernetes cluster, keeping them in sync with the CUDA toolkit version, ensuring the container runtime is configured correctly, and monitoring GPU health across hundreds of nodes — that is an operational nightmare. The NVIDIA GPU Operator solves this by packaging the entire GPU software stack as Kubernetes-native operators and containerized components.
The Eight Components
| Component | Function |
|---|---|
| Node Feature Discovery (NFD) | Labels nodes with hardware capabilities (PCI vendor IDs, CPU features). The GPU stack depends on NFD labels to identify GPU nodes. |
| GPU Driver Container | Runs the NVIDIA kernel driver inside a container, compiled for the host’s kernel version. No host-level driver installation needed. |
| NVIDIA Container Toolkit | Configures the container runtime (containerd/CRI-O) to expose GPUs to containers. Installs the nvidia-container-runtime hook. |
| Device Plugin | The gRPC device plugin described above. Reports GPUs to the kubelet. |
| GPU Feature Discovery (GFD) | Labels nodes with GPU-specific metadata: model (nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB), driver version, CUDA version, MIG capabilities. |
| DCGM Exporter | Exposes GPU metrics (utilization, temperature, memory usage, ECC errors, power draw) as Prometheus metrics. |
| MIG Manager | Configures Multi-Instance GPU partitioning on supported hardware (A100, H100, H200). Applies MIG profiles via node labels. |
| Operator Validator | Runs post-installation validation to confirm the entire stack is functional. Reports status as conditions on the ClusterPolicy CRD. |
The Modern Stack (2025-2026)
NVIDIA announced the evolution of the GPU management stack at KubeCon 2026:
GPU Operator → DRA Driver → KAI Scheduler
The GPU Operator now ships a DRA driver (replacing the legacy device plugin path) that exposes GPUs through the Dynamic Resource Allocation API. The KAI Scheduler is a topology-aware GPU scheduler that understands NVLink domains, MIG slices, and multi-node placement. This trio is the direction all production GPU infrastructure is heading.
GPU Sharing and Multi-Tenancy
A single H100 has 80 GB of HBM3 memory and massive compute throughput. Running a small inference model that uses 2 GB of VRAM on a dedicated H100 wastes 97.5% of the memory. GPU sharing exists to solve this economics problem.
Three Approaches
GPU SHARING MODELS
──────────────────
MULTI-INSTANCE GPU (MIG) MULTI-PROCESS SERVICE (MPS)
Hardware Partitioning Software Space Partitioning
┌──────────────────────┐ ┌───────────────────────┐
│ Physical GPU │ │ Physical GPU │
│ ┌─────┬─────┬─────┐ │ │ │
│ │ MIG │ MIG │ MIG │ │ │ ┌───┐ ┌───┐ ┌───┐ │
│ │Inst │Inst │Inst │ │ │ │P1 │ │P2 │ │P3 │ │
│ │ 0 │ 1 │ 2 │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │30%│ │50%│ │20%│ │
│ │1g. │1g. │1g. │ │ │ │ │ │ │ │ │ │
│ │10gb │10gb │10gb │ │ │ └───┘ └───┘ └───┘ │
│ ├─────┼─────┼─────┤ │ │ Shared CUDA Context │
│ │Own │Own │Own │ │ │ Explicit memory │
│ │SM │SM │SM │ │ │ and compute limits │
│ │+Mem │+Mem │+Mem │ │ │ per process │
│ └─────┴─────┴─────┘ │ └───────────────────────┘
└──────────────────────┘
TIME-SLICING
Isolated compute engines, CUDA Context Switching
memory controllers, and ┌───────────────────────┐
cache partitions. │ Physical GPU │
Fault isolation: yes. │ │
Memory isolation: yes. │ ┌──────────────────┐ │
│ │ Time T1: Pod A │ │
│ ├──────────────────┤ │
│ │ Time T2: Pod B │ │
│ ├──────────────────┤ │
│ │ Time T3: Pod C │ │
│ └──────────────────┘ │
│ Round-robin context │
│ switching. All pods │
│ see full GPU memory. │
│ No memory isolation. │
└───────────────────────┘
Multi-Instance GPU (MIG) partitions a physical GPU at the hardware level. On an A100-80GB, you can create up to 7 instances, each with dedicated streaming multiprocessors, memory controllers, and L2 cache. Profiles include `1g.10gb` (1 compute slice, 10 GB), `2g.20gb`, `3g.40gb`, `4g.40gb`, and `7g.80gb`. MIG provides true fault and memory isolation. One instance cannot see another’s memory, and a CUDA crash in one instance does not affect others.
Multi-Process Service (MPS) is a software-level sharing mechanism. An MPS server sits between CUDA clients and the GPU, multiplexing access. You can set explicit per-client limits: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT=0=4096M caps a process to 4 GB. MPS allows concurrent kernel execution (true parallelism on the SM level) but lacks the hard isolation of MIG.
Time-Slicing is the simplest approach. The NVIDIA device plugin is configured to advertise more “GPUs” than physically exist (e.g., 4 physical GPUs advertised as 16 time-sliced replicas). CUDA contexts are switched in round-robin fashion. There is no memory isolation — all pods see the full VRAM and can OOM-kill each other. Context switching adds latency overhead.
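Time-slicing is enabled through the NVIDIA device plugin's own configuration file. A sketch that advertises each physical GPU as four schedulable replicas (the replica count is illustrative):

```yaml
# NVIDIA device plugin config: each physical GPU is advertised as 4
# time-sliced replicas. Pods still request nvidia.com/gpu: 1, but four
# pods now share one physical device with no memory isolation.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```

A node with 4 physical GPUs then reports `nvidia.com/gpu: 16` in its allocatable resources.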
When to Use Each
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Production inference with SLAs | MIG | Hard isolation, predictable performance |
| Development and experimentation | Time-slicing | Simple setup, maximum flexibility |
| Batch inference pipelines | MPS | Concurrent execution, configurable limits |
| Multi-tenant cluster, untrusted workloads | MIG | Fault isolation between tenants |
| Cost optimization, trusted workloads | Time-slicing or MPS | Maximize utilization |
Dynamic Resource Allocation (DRA)
You cannot express “give me a MIG slice with 20 GB of memory on a GPU that has NVLink connectivity to another GPU already allocated to this pod” with nvidia.com/gpu: 1.
Why Device Plugins Were Insufficient
- Count-based only. No way to parameterize requests (memory size, compute capability, MIG profile).
- No sharing semantics. Two pods cannot request access to the same physical device.
- No topology awareness. No way to express “these two GPUs must be on the same NVLink domain.”
- No scheduling integration. Device allocation happens at the kubelet level, after scheduling. The scheduler has no visibility into device topology.
- Vendor-locked plugin logic. All allocation intelligence is inside the vendor’s plugin binary.
The DRA Model
DRA, graduating to GA in Kubernetes 1.34-1.35, introduces a structured, parameterized model for hardware allocation.
DEVICE PLUGIN MODEL vs DRA MODEL
─────────────────────────────────
DEVICE PLUGIN (Legacy) DRA (Modern)
────────────────────── ────────────
Pod spec: Pod spec:
resources: resourceClaims:
limits: - name: gpu
nvidia.com/gpu: 1 resourceClaimTemplateName: gpu-claim
That's it. Count only. ResourceClaimTemplate:
No parameters. spec:
No sharing. devices:
No topology. requests:
- name: gpu
deviceClassName: gpu.nvidia.com
selectors:
- cel:
expression: >
device.attributes["gpu.nvidia.com"]
.productName == "H100" &&
device.attributes["gpu.nvidia.com"]
.memory.isGreaterThan(
quantity("40Gi"))
┌─────────┐ count=1 ┌─────────┐ ┌─────────┐ claim ┌────────────┐
│ Pod │──────────►│ kubelet │ │ Pod │──────►│ Scheduler │
│ │ │ picks │ │ │ │ evaluates │
│ │ │ any │ │ │ │ CEL exprs, │
│ │ │ GPU │ │ │ │ topology, │
└─────────┘ └─────────┘ └─────────┘ │ sharing │
└────────────┘
│
┌─────▼──────┐
│ DRA Driver │
│ prepares │
│ device │
└────────────┘
The Four API Objects
| Object | Purpose |
|---|---|
| ResourceSlice | Published by the DRA driver. Describes available devices on a node: attributes, capacity, topology. The scheduler reads these to make placement decisions. |
| DeviceClass | Cluster-scoped. Defines a class of devices with admin-set constraints and configuration. Example: gpu.nvidia.com class might set a default MIG profile. |
| ResourceClaim | Namespace-scoped. A pod’s request for a device, with CEL-based selectors. Allocated by the scheduler, bound to specific devices. |
| ResourceClaimTemplate | Creates ResourceClaims per pod, like PVCs from PVC templates in StatefulSets. |
CEL selector expressions can match on any device attribute: product name, memory size, MIG capability, driver version, NUMA node, NVLink group. You can express prioritized alternatives (“prefer H100, accept A100”) and device sharing (“this claim can share a device with that claim”).
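A minimal DeviceClass sketch tying this together; field names follow the `resource.k8s.io` API, and the CEL expression here is illustrative:

```yaml
# Admin-defined class: any device published by the NVIDIA DRA driver.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
```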
NVIDIA donated their DRA driver to the CNCF at KubeCon 2026, making it a vendor-neutral component of the ecosystem.
ML Training on Kubernetes
Training a large model is a distributed systems problem. A single GPU can handle fine-tuning a 7B model. Training a 70B model from scratch requires hundreds of GPUs coordinated across dozens of nodes, all processing data in lockstep. Kubernetes needs specialized operators and schedulers to manage these workloads.
Training Operators
Kubeflow Training Operator provides CRDs for distributed training frameworks:
- `PyTorchJob`: Launches distributed PyTorch with `torchrun`. Configures `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` automatically.
- `TFJob`: TensorFlow distributed training with PS/Worker topology.
- `MPIJob`: MPI-based training (Horovod). Launches an MPI ring with SSH between pods.
- `TrainJob` (v2): The unified API that abstracts framework details behind a single CRD. Specify a model, dataset, and training runtime; the operator generates the correct distributed topology.
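A minimal PyTorchJob sketch (the image, script, and replica counts are illustrative):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-finetune
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch            # Kubeflow expects this container name
            image: pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime
            command: ["torchrun", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime
            command: ["torchrun", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
```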
KubeRay is the Kubernetes operator for Ray, the distributed compute framework used by OpenAI for ChatGPT training infrastructure. It provides:
- `RayCluster`: A persistent Ray cluster with head and worker nodes.
- `RayJob`: Submits a job to a RayCluster (or creates an ephemeral one).
- `RayService`: Serves Ray Serve deployments with rolling upgrades.
Ray’s advantage is its unified API for training, tuning, and serving. A single Ray program can orchestrate data preprocessing, distributed training with PyTorch, hyperparameter tuning, and model serving.
Gang Scheduling
Standard Kubernetes scheduling is pod-by-pod. For a distributed training job requiring 64 GPUs across 8 nodes, the default scheduler might place 7 pods and then get stuck waiting for the 8th. Those 7 pods sit idle, burning GPU-hours, waiting for a resource that may not free up for hours.
Gang scheduling (all-or-nothing scheduling) ensures that either all pods in a job are scheduled simultaneously, or none are. Volcano is the primary gang scheduler for Kubernetes. It introduces:
- A `Job` CRD with `minAvailable` (minimum pods required to start).
- Queue-based scheduling with fair-sharing across teams.
- Preemption policies for priority-based scheduling.
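A minimal Volcano Job sketch showing the gang-scheduling contract (queue name, image, and sizes are illustrative):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dist-train
spec:
  schedulerName: volcano
  minAvailable: 8          # gang: start only when all 8 pods can be placed
  queue: ml-team
  tasks:
  - name: worker
    replicas: 8
    template:
      spec:
        containers:
        - name: trainer
          image: pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime
          resources:
            limits:
              nvidia.com/gpu: 8
```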
Job Queuing with Kueue
Kueue is the Kubernetes-native job queuing system. While Volcano is a full scheduler replacement, Kueue works with the default scheduler, adding queuing and quota semantics on top.
Core concepts:
- ClusterQueue: Defines a pool of resources (e.g., 100 GPUs, 200 CPUs) with borrowing limits.
- LocalQueue: Namespace-scoped queue that points to a ClusterQueue. Users submit jobs here.
- ResourceFlavor: Describes a class of nodes (e.g., `a100-spot`, `h100-ondemand`). Maps to node labels.
- Cohort borrowing: ClusterQueues in the same cohort can borrow unused resources from each other. Team A’s unused GPU quota flows to Team B automatically.
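A sketch of the queue wiring (names are illustrative, and the referenced `a100-spot` ResourceFlavor is assumed to exist as its own object):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-cluster-queue
spec:
  cohort: research              # enables borrowing within the cohort
  namespaceSelector: {}         # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100-spot
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a
  namespace: team-a
spec:
  clusterQueue: ml-cluster-queue   # users submit jobs against this queue
```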
Kueue vs Volcano: Use Kueue when you need multi-tenant quota management and work with the default scheduler. Use Volcano when you need a full scheduler replacement with gang scheduling, preemption, and topology-aware placement. Many production clusters use both: Kueue for queuing and quota, Volcano for gang scheduling.
Networking for Distributed Training
Distributed training spends a significant fraction of total time on communication. After each forward/backward pass, gradients must be synchronized across all workers (AllReduce).
Why Standard TCP Is Insufficient
Standard TCP networking (Pod-to-Pod via CNI) adds multiple copies and context switches per message:
- GPU memory → CPU memory (PCIe DMA)
- CPU memory → kernel socket buffer
- Kernel → NIC (TCP/IP stack processing, segmentation)
- Network transit
- NIC → kernel socket buffer → CPU memory → GPU memory (reverse path)
For a 70B parameter model with fp16 gradients, each AllReduce exchanges ~140 GB of data. Over standard 25 Gbps Ethernet with TCP, this gradient sync alone takes minutes. Real-world benchmarks show the cumulative cost: a training run that completes in 1h40m with GPUDirect RDMA can take 5 hours over standard TCP.
GPU-TO-GPU COMMUNICATION PATHS
───────────────────────────────
STANDARD TCP (SLOW)
┌──────┐ PCIe ┌──────┐ TCP/IP ┌──────┐ PCIe ┌──────┐
│ GPU │───────►│ CPU │─────────►│ CPU │───────►│ GPU │
│ Node │ copy │ RAM │ stack │ RAM │ copy │ Node │
│ A │◄───────│ │◄─────────│ │◄───────│ B │
└──────┘ └──────┘ NIC └──────┘ └──────┘
Copies: 4 (GPU→CPU, CPU→NIC, NIC→CPU, CPU→GPU)
Latency: ~100μs+ Bandwidth: limited by TCP stack
RDMA / RoCE (FAST)
┌──────┐ PCIe ┌──────┐ RDMA ┌──────┐ PCIe ┌──────┐
│ GPU │───────►│ CPU │ bypass │ CPU │──────►│ GPU │
│ Node │ copy │ RAM │────────►│ RAM │ copy │ Node │
│ A │ └──────┘ no TCP └──────┘ │ B │
└──────┘ NIC does NIC does └──────┘
direct direct
memory memory
access access
Copies: 2 (GPU→CPU, CPU→GPU)
Latency: ~2μs Kernel bypass, zero-copy NIC
GPUDirect RDMA (FASTEST)
┌──────┐ RDMA ┌──────┐
│ GPU │──────────────────────►│ GPU │
│ Node │ NIC reads directly │ Node │
│ A │ from GPU memory │ B │
│ │◄──────────────────────│ │
└──────┘ No CPU involved └──────┘
Copies: 0 (GPU memory → NIC → network → NIC → GPU memory)
Latency: ~1μs Maximum bandwidth, zero CPU overhead
The Communication Stack
NCCL (NVIDIA Collective Communications Library) is the standard for multi-GPU collective operations (AllReduce, AllGather, Broadcast). NCCL automatically selects the fastest available transport: NVLink for intra-node, InfiniBand or RoCE for inter-node, falling back to TCP if nothing better exists.
InfiniBand provides the highest bandwidth (400 Gbps NDR) with sub-microsecond latency and native RDMA. Most large GPU clusters (DGX SuperPOD, etc.) use InfiniBand fabrics.
RoCE (RDMA over Converged Ethernet) provides RDMA semantics over standard Ethernet. Lower cost than InfiniBand, but requires lossless Ethernet configuration (PFC, ECN).
NVIDIA Network Operator
The NVIDIA Network Operator brings RDMA networking to Kubernetes:
- Multus CNI: Attaches multiple network interfaces to pods (one for standard traffic, one for RDMA).
- SR-IOV Device Plugin: Exposes SR-IOV Virtual Functions as schedulable resources (e.g., `nvidia.com/rdma_shared_device_a`).
- RDMA Shared Device Plugin: Enables RDMA device sharing across containers.
- Host Device Network: Passes InfiniBand/RoCE interfaces directly into pods.
A distributed training pod spec requests both GPUs and RDMA devices:
```yaml
resources:
  limits:
    nvidia.com/gpu: 8
    nvidia.com/rdma_shared_device_a: 1
```
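In practice the pod also needs its secondary RDMA interface attached via Multus. A sketch combining both; the NetworkAttachmentDefinition name `rdma-net` and the image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net   # secondary RDMA interface via Multus
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.08-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        nvidia.com/rdma_shared_device_a: 1
```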
Cost Optimization for GPU Workloads
H100 instances cost $25-35/hour on-demand in major clouds. A 64-GPU training cluster burns $50,000-$70,000 per day. Cost optimization is not a nice-to-have; it is an engineering requirement.
Spot/Preemptible GPU Instances
Cloud providers offer GPU instances at 60-80% discounts through spot/preemptible pricing. The tradeoff: instances can be reclaimed with 30-120 seconds notice. For training workloads with checkpointing (save state every N steps), this is viable. For inference with graceful draining, it works with proper pod disruption budgets.
Karpenter with GPU Node Pools
Karpenter provisions right-sized nodes on demand. For GPU workloads, configure separate NodePools:
- GPU NodePool: Instance types restricted to GPU families (p5, p4d, g5). Spot pricing enabled. `SpotToSpotConsolidation` moves workloads between spot pools to maintain availability.
- CPU NodePool: Standard instances for non-GPU workloads. Prevents GPU nodes from being used for CPU-only pods.
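A GPU NodePool sketch for Karpenter on AWS (instance families from the text; the `EC2NodeClass` reference and taint policy are illustrative):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]     # prefer spot, fall back to on-demand
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["p5", "p4d", "g5"]
      taints:
      - key: nvidia.com/gpu
        effect: NoSchedule                # keep CPU-only pods off GPU nodes
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```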
Scheduling: Bin-Packing
The default Kubernetes scheduler spreads pods across nodes. For GPU workloads, bin-packing is critical: fill GPU nodes completely before allocating new ones. A half-utilized 8-GPU node is a node you are paying full price for. Use NodeResourcesFit with MostAllocated scoring strategy, or Karpenter’s consolidation to continuously pack workloads onto fewer nodes.
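A scheduler-configuration sketch enabling MostAllocated scoring (the profile name and weights are illustrative; weighting the GPU resource heavily biases placement toward already-busy GPU nodes):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: gpu-binpacking
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated
        resources:
        - name: nvidia.com/gpu
          weight: 5
        - name: cpu
          weight: 1
```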
The Full Cost Stack
- Spot instances for fault-tolerant training (60-80% savings).
- GPU sharing (MIG/MPS/time-slicing) for inference and dev workloads (2-7x utilization improvement).
- Bin-packing scheduling to minimize partially-used nodes.
- Kueue quotas to prevent teams from hoarding GPUs.
- Scale-to-zero for inference endpoints with no traffic (via KServe or KEDA).
- Preemption policies to let high-priority training preempt low-priority batch jobs.
Common Mistakes and Misconceptions
- GPUs are allocated exclusively as whole units by default. No fractional requests, no sharing between pods without MIG, MPS, or DRA.
- “Any Kubernetes node can schedule GPU workloads.” Nodes need the NVIDIA device plugin (or GPU Operator) installed, proper drivers, and the nvidia container runtime configured. Without this stack, K8s doesn’t know GPUs exist.
- “Training and inference need the same infrastructure.” Training needs high-bandwidth interconnects (NVLink, InfiniBand), gang scheduling, and checkpointing. Inference needs low latency, autoscaling, and model serving frameworks. Different workloads, different architectures.
Further Reading
- NVIDIA GPU Operator Documentation — the complete guide to deploying and managing the GPU Operator, which automates driver installation, container runtime configuration, device plugin deployment, and GPU monitoring.
- Device Plugins — the official Kubernetes documentation on the device plugin framework, explaining how hardware vendors expose accelerators, FPGAs, and other devices to the kubelet.
- Dynamic Resource Allocation KEP — the Kubernetes Enhancement Proposal for DRA with structured parameters, replacing the opaque device plugin model with a richer, scheduler-integrated resource claim system.
- NVIDIA Multi-Instance GPU User Guide — how to partition A100 and H100 GPUs into isolated MIG instances with dedicated compute, memory, and cache, including supported profiles and configuration procedures.
- Kubeflow Documentation — the full guide to the Kubeflow ML platform, covering pipelines, training operators (TFJob, PyTorchJob, MPIJob), model serving with KServe, and experiment tracking.
- KubeRay Documentation — deploying and managing Ray clusters on Kubernetes for distributed training, hyperparameter tuning, and Ray Serve inference workloads.
- Volcano Scheduler — documentation for the batch scheduling system designed for high-performance and ML workloads, supporting gang scheduling, fair-share queuing, and resource reservation.
- NVIDIA Container Toolkit — the low-level runtime that makes GPUs accessible inside containers, including installation, configuration, and CDI (Container Device Interface) support.
- NVIDIA GPU Operator Quickstart — a hands-on guide to setting up GPU scheduling on Kubernetes.
Chapter 42: Running LLMs on Kubernetes
Serving a large language model is not the same problem as serving a web application. A web app handles requests independently in milliseconds with megabytes of memory. An LLM loads 50-400 GB of weights into GPU memory, processes requests through billions of sequential matrix multiplications, generates tokens one at a time, and must manage a KV cache that grows with every token. The infrastructure required — specialized inference servers, GPU-aware autoscaling, multi-node parallelism, model caching, and intelligent routing — demands a purpose-built stack.
ML Inference on Kubernetes
The inference server sits between Kubernetes and the GPU. It loads model weights, manages batching, handles tokenization, and exposes an API. Choosing the right one determines your throughput, latency, and cost.
KServe
KServe is the Kubernetes-native model serving framework. It provides a standard InferenceService CRD that abstracts away the inference runtime:
- Autoscaling with Knative (including scale-to-zero, so idle models release GPU nodes entirely).
- Canary rollouts: Route 10% of traffic to a new model version, monitor metrics, promote or roll back.
- Multi-framework support: TensorFlow, PyTorch, ONNX, XGBoost, Triton, vLLM, and custom containers.
- Transformer/Predictor/Explainer pipeline: Pre-process, predict, and post-process in a single InferenceService.
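A canary rollout sketch with a standard InferenceService (the model name and `storageUri` bucket are hypothetical; `canaryTrafficPercent` is KServe's built-in traffic-splitting field):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment
spec:
  predictor:
    canaryTrafficPercent: 10          # 10% of traffic to the latest revision
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-models/sentiment/v2
```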
KServe v0.16 introduced the LLMInferenceService CRD, purpose-built for large language models:
- OpenAI-compatible API endpoints out of the box (`/v1/chat/completions`, `/v1/completions`).
- Integration with Gateway API for traffic management.
- Distributed parallelism: define tensor parallelism and pipeline parallelism directly in the CRD spec.
- Backend support for vLLM, TGI, and SGLang.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: LLMInferenceService
metadata:
  name: llama-3-70b
spec:
  modelId: meta-llama/Llama-3-70B-Instruct
  workerSpec:
    tensorParallelSize: 4
    resources:
      limits:
        nvidia.com/gpu: 4
```
NVIDIA Triton (Dynamo Triton)
Triton Inference Server (now part of the NVIDIA Dynamo framework) is the most feature-rich inference server:
- Multi-framework: Load TensorRT, ONNX, PyTorch, TensorFlow, and Python models simultaneously.
- Dynamic batching: Accumulates requests for a configurable window (e.g., 50ms) and batches them into a single GPU kernel launch. Transforms 100 serial requests into 1 batched operation.
- Model ensembles: Chain multiple models (tokenizer → encoder → decoder → post-processor) in a DAG with zero-copy tensor passing between stages.
- Model repository: Hot-load and unload models from S3/GCS/local storage without restarting.
vLLM
vLLM changed LLM inference economics. Its two core innovations:
PagedAttention: Traditional inference servers pre-allocate a contiguous block of GPU memory for each request’s KV cache, sized for the maximum sequence length. Most of this memory is wasted (a 2048-token allocation for a 200-token response wastes 90%). PagedAttention borrows the concept of virtual memory paging from operating systems: the KV cache is stored in non-contiguous physical blocks, mapped through a block table. Memory is allocated on demand as tokens are generated.
Continuous batching: Traditional batching waits for all requests in a batch to complete before accepting new ones. If one request generates 500 tokens and another generates 10, the short request’s GPU cycles are wasted while waiting. Continuous batching (also called iteration-level scheduling) adds and removes requests from the batch at every decode step. The GPU is never idle.
Together, these deliver up to 24x throughput improvement over naive single-request serving (2–4x over production servers with static batching). Organizations adopting vLLM have reported 50–75% cost reductions compared to traditional serving stacks, thanks to the combination of PagedAttention’s memory efficiency and continuous batching’s GPU utilization gains.
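A minimal Deployment sketch running vLLM's OpenAI-compatible server (the model and flag values are illustrative; `--gpu-memory-utilization` controls how much VRAM PagedAttention may claim for KV cache pages):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-8b-vllm
spec:
  replicas: 1
  selector:
    matchLabels: {app: llama-8b-vllm}
  template:
    metadata:
      labels: {app: llama-8b-vllm}
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=meta-llama/Llama-3.1-8B-Instruct
        - --gpu-memory-utilization=0.90   # headroom for weights + KV cache pages
        - --max-num-seqs=256              # upper bound on the continuous batch
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
```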
Inference Server Comparison
| Feature | KServe + vLLM | Triton (Dynamo) | vLLM standalone |
|---|---|---|---|
| Autoscaling (incl. scale-to-zero) | Yes (Knative/KPA) | Manual / custom | No (needs wrapper) |
| OpenAI-compatible API | Yes (v0.16+) | Via ensemble config | Yes, native |
| Dynamic batching | Continuous (vLLM) | Configurable window | Continuous |
| Multi-model serving | Via multiple InferenceServices | Single server, multiple models | One model per process |
| PagedAttention | Yes | Via vLLM backend | Yes |
| Canary / traffic splitting | Native | External (Istio/Gateway) | External |
| Model ensemble / chaining | Via Transformer pipeline | Native DAG | No |
| Production maturity | High (CNCF project) | High (NVIDIA supported) | High (growing fast) |
| Best for | Production serving with MLOps | Multi-framework, complex pipelines | Maximum single-model throughput |
The GPU Autoscaling Problem
Autoscaling GPU inference is fundamentally different from autoscaling web services.
GPU utilization is a misleading metric. A GPU running vLLM at 95% utilization might be handling 10 requests/sec or 200 requests/sec: utilization stays pinned high as long as any work is in-flight, and it says nothing about whether users are getting good latency.
The right metrics to scale on:
- Queue depth: Number of requests waiting to be processed. If the queue is growing, you need more replicas.
- Time to First Token (TTFT): Latency from request receipt to first generated token. This is what users perceive as “responsiveness.”
- Inter-Token Latency (ITL): Time between consecutive tokens. Affects streaming experience.
- Request throughput: Requests completed per second vs requests arriving per second.
KEDA Configuration for GPU Workloads
KEDA scales based on external metrics. For LLM inference:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-scaler
spec:
  scaleTargetRef:
    name: llm-deployment
  pollingInterval: 10    # check every 10s (not the 30s default)
  cooldownPeriod: 300    # 5 min cooldown (GPU nodes are expensive to churn)
  minReplicaCount: 1     # always keep 1 warm pod
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      query: |
        sum(vllm:num_requests_waiting{model="llama-3-70b"})
      threshold: "10"    # scale up when >10 requests queued
```
Scaling Latency Benchmarks
Every second of scaling latency is a second of degraded user experience:
| Scenario | Time |
|---|---|
| Warm node, model already loaded | ~45 seconds (pod scheduling + container start) |
| Cold node, Karpenter provisioning | ~6.5 minutes (instance launch + GPU driver init + model load) |
| Model load from NVMe local storage | ~18 seconds (for 70B fp16 model) |
| Model load from SATA/network PVC | ~74 seconds (same model) |
| Model load from S3/GCS | ~90-180 seconds (varies by region and model size) |
The implication: keep warm pods. The cost of one idle GPU pod ($25-35/hour) is almost always less than the cost of 6+ minutes of failed requests during cold scale-up.
Cost Circuit Breakers
KEDA supports maxReplicaCount, but that is a blunt instrument. For cost control, implement circuit breakers:
- Set `maxReplicaCount` to cap worst-case spend.
- Use KEDA’s `fallback` configuration to define behavior when the metrics source is unreachable.
- Monitor scaling events with alerts: “LLM deployment scaled to max replicas” should trigger investigation.
Multi-Node Inference
A single 70B parameter model in fp16 requires ~140 GB of GPU memory. An H100 has 80 GB. The model does not fit on one GPU. You have two ways to split it across multiple GPUs, and they solve different bottlenecks.
Tensor Parallelism (TP)
Tensor parallelism splits individual matrix multiplications across GPUs. For a weight matrix W of shape [4096, 4096], TP=4 gives each GPU a [4096, 1024] slice. Each GPU computes its portion of the output, then an AllReduce synchronizes the results.
Requirement: TP demands extremely high inter-GPU bandwidth because synchronization happens within every layer (multiple times per token). NVLink (900 GB/s on H100) is required. TP across network-connected GPUs is impractical.
Pipeline Parallelism (PP)
Pipeline parallelism splits the model by layers. If a model has 80 layers and PP=2, GPU group A handles layers 0-39 and GPU group B handles layers 40-79. A request’s activations flow from A to B after layer 39. The communication is sequential and relatively infrequent (once per micro-batch per pipeline stage), so network bandwidth requirements are modest.
Advantage: PP works across nodes connected by standard (even Ethernet) networking.
TENSOR PARALLELISM vs PIPELINE PARALLELISM
───────────────────────────────────────────
TENSOR PARALLELISM (TP=4) PIPELINE PARALLELISM (PP=2)
Split WITHIN each layer Split BY layers
Layer N: Node A (Layers 0-39):
┌─────────────────────────┐ ┌───────────────────────┐
│ Weight Matrix [4096²] │ │ Layer 0 │
│ │ │ Layer 1 │
│ ┌────┬────┬────┬────┐ │ │ ... │
│ │GPU │GPU │GPU │GPU │ │ │ Layer 39 │
│ │ 0 │ 1 │ 2 │ 3 │ │ │ │
│ │1024│1024│1024│1024│ │ │ Activations ─────────┼──►
│ │cols│cols│cols│cols│ │ └───────────────────────┘
│ └──┬─┴──┬─┴──┬─┴──┬─┘ │ Network
│ │ │ │ │ │ (modest BW)
│ └────┴──┬─┴────┘ │
│ AllReduce │ Node B (Layers 40-79):
│ (NVLink, │ ┌───────────────────────┐
│ 900 GB/s) │ ──► │ Layer 40 │
└─────────────────────────┘ │ Layer 41 │
│ ... │
Communication: per-layer, │ Layer 79 │
extremely frequent. │ │
Requires NVLink. │ Output ──────────────┼──►
└───────────────────────┘
COMBINED: 2 nodes x 8 H100s = TP=8 (within node) + PP=2 (across nodes)
This is how Llama 3.1 405B runs: ~810 GB in fp16, split across 16 GPUs.
LeaderWorkerSet (LWS)
LeaderWorkerSet is a Kubernetes-native API for multi-node GPU workloads. It creates a group of pods where one is designated the leader and the rest are workers. The leader’s hostname and IP are injected into all workers, solving the distributed coordination problem:
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama-405b
spec:
  replicas: 2                   # 2 model replicas
  leaderWorkerTemplate:
    size: 2                     # 2 nodes per replica (PP=2)
    leaderTemplate:
      spec:
        containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: 8   # TP=8 within each node
    workerTemplate:
      spec:
        containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: 8
```
llm-d (CNCF Sandbox, March 2026)
Traditional LLM serving treats prefill and decode as a single operation on the same GPU. This is wasteful: prefill (processing the input prompt) is compute-bound and bursty, while decode (generating output tokens one at a time) is memory-bandwidth-bound and latency-sensitive. A GPU optimized for one is suboptimal for the other.
llm-d (accepted into CNCF Sandbox in March 2026) disaggregates these phases:
Prefill/Decode Disaggregation
- Prefill nodes: Handle prompt processing. Can use batch-optimized configurations, larger batch sizes, and are less sensitive to per-request latency.
- Decode nodes: Handle token generation. Optimized for low latency, smaller batches, dedicated KV cache memory.
Requests flow: client → prefill node (processes prompt, generates KV cache) → KV cache transfer → decode node (generates tokens, streams back to client).
The following sequence diagram shows how llm-d splits a single inference request across two specialized node pools — the prefill node processes the full prompt, then hands off the KV cache to a decode node for token generation:
```mermaid
sequenceDiagram
    participant Client as Client (API request)
    participant GW as Gateway / EPP Router
    participant Prefill as Prefill Node<br>(vLLM worker)
    participant KV as KV Cache Transfer
    participant Decode as Decode Node<br>(vLLM worker)
    Client->>GW: POST /chat/completions<br>{prompt: 4096 tokens}
    GW->>Prefill: route to prefill pool (prompt-heavy)
    Note over Prefill: Process full prompt in one<br>forward pass (compute-bound,<br>high GPU utilization)
    Prefill->>KV: transfer KV cache state
    Note over Prefill: Prefill done,<br>GPU freed for next prompt
    KV->>Decode: deliver KV cache to decode worker
    Note over Decode: Auto-regressive token generation<br>(memory-bound, sequential)
    Decode->>GW: stream tokens
    loop Token streaming
        GW->>Client: token
    end
    GW->>Client: [DONE]
    Note over Client,Decode: Key insight: Prefill is compute-bound (one large forward pass).<br>Decode is memory-bound (sequential token generation).<br>Splitting them lets each pool use optimally-sized GPUs and scale independently.
```
KV Cache Management
llm-d implements hierarchical KV cache offloading:
- GPU HBM: Fastest, most expensive. Active decode requests.
- CPU DRAM: 10-50x cheaper per GB. Recently completed requests that may be reused (prefix caching).
- Local NVMe/distributed storage: Persistent cache for common prefixes (system prompts, few-shot examples).
When a new request arrives with a prefix matching a cached KV cache, the decode node skips recomputation entirely.
Performance
Benchmarks from the llm-d team show:
- ~57x faster P90 Time to First Token compared to round-robin load balancing in prefix-cache-heavy workloads with high prompt reuse (because cache-aware routing eliminates redundant prefill).
- ~2x throughput improvement versus round-robin distribution.
The Production Stack
The emerging production architecture is: KServe (model lifecycle, autoscaling, API) + llm-d (intelligent routing, disaggregated serving, KV cache management). KServe handles the Kubernetes-native concerns; llm-d handles the LLM-specific optimization.
Gateway API Inference Extension
As LLM endpoints proliferate in a cluster, standard load balancing (round-robin, least-connections) leaves performance on the table. A request whose prefix matches a warm KV cache on GPU-3 should be routed to GPU-3, not to GPU-7 which would recompute the cache from scratch. Round-robin ignores the most important variable: which GPU already has relevant computation cached in memory.
The Gateway API Inference Extension adds model-aware routing to Kubernetes. It extends the standard Gateway API (the successor to Ingress) with inference-specific semantics.
CRDs
- InferencePool: Defines a pool of pods serving inference (analogous to a Service, but model-aware). Each pool has an Endpoint Selection Extension (ESE) sidecar that makes routing decisions based on real-time pod state.
- InferenceModel: Maps a model name to an InferencePool with criticality levels and traffic policies. Multiple InferenceModel resources can point to the same pool, enabling multi-model routing through a single gateway.
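A sketch of an InferencePool; the API is still evolving, so treat the field names as following the v1alpha2 shape, with `llm-pool-epp` as an illustrative name for the endpoint-selection extension service:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llm-pool
spec:
  targetPortNumber: 8000      # port the vLLM pods listen on
  selector:
    app: vllm                 # pods backing this pool
  extensionRef:
    name: llm-pool-epp        # service making per-request routing decisions
```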
Endpoint Selection Extension (ESE)
The ESE sidecar receives routing requests from the gateway and selects the optimal backend pod based on:
- KV cache affinity: Route to the pod most likely to have the request’s prefix cached. This is the single biggest optimization — prefix cache hits eliminate redundant prefill computation, reducing TTFT from seconds to milliseconds for repeated system prompts.
- Queue depth: Avoid overloaded pods. The ESE tracks per-pod pending request counts in real time.
- Model version: Route to pods serving the requested model version during canary deployments.
- LoRA adapter affinity: When serving multiple LoRA fine-tuned variants from a single base model, route to the pod that already has the requested adapter loaded in memory.
Request Criticality
InferenceModel supports criticality levels (Critical, Standard, Sheddable). During overload, the gateway sheds Sheddable requests first, protecting Critical traffic. This maps naturally to production use cases: customer-facing chat is Critical, background summarization is Standard, internal testing is Sheddable.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-critical
spec:
  modelName: meta-llama/Llama-3-70B-Instruct
  criticality: Critical
  poolRef:
    name: llm-pool
  targetModels:
  - name: meta-llama/Llama-3-70B-Instruct
    weight: 100
```
Model Caching and Storage
The single biggest contributor to LLM cold-start latency is model loading. Llama 3.1 405B in fp16 is ~810 GB. Downloading this from object storage to GPU memory takes minutes. Every strategy in this section exists to minimize or eliminate that wait.
MODEL LOADING STRATEGIES
────────────────────────
STRATEGY 1: Object Storage (Slow, Simple)
┌──────┐ download ┌───────────┐ load ┌──────┐
│ S3 │──────────────►│ Pod │──────────►│ GPU │
│ GCS │ 90-180s │ (ephemeral│ 10-30s │ VRAM │
│ Hub │ (network) │ storage) │ │ │
└──────┘ └───────────┘ └──────┘
Total: 100-210s. Every scale-up pays this cost.
STRATEGY 2: Shared PVC (NFS / ReadWriteMany)
┌──────────────────────────────────────────────────────┐
│ NFS PVC (ReadWriteMany) │
│ /models/llama-3-70b/ (pre-populated) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Pod A │ │ Pod B │ │ Pod C │ │
│ │ mounts │ │ mounts │ │ mounts │ │
│ │ /models │ │ /models │ │ /models │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────────────────────────────────────────┘
Total: 30-74s (NFS read → GPU). No download step.
STRATEGY 3: KServe LocalModel + Local NVMe
┌───────────┐ pre-cached ┌────────────┐ load ┌──────┐
│ LocalModel│─────────────►│ Node NVMe │────────►│ GPU │
│ controller│ (background) │ /mnt/models│ ~18s │ VRAM │
└───────────┘ └────────────┘ └──────┘
Total: ~18s. Model pre-staged on node before pod starts.
STRATEGY 4: GKE Hyperdisk ML
┌──────────┐ block device ┌──────────┐ load ┌──────┐
│ Hyperdisk│────────────────►│ Pod │────────►│ GPU │
│ ML volume│ 1.2 TB/s read │ │ ~20min │ VRAM │
│ (GKE) │ throughput │ │ │ │
└──────────┘ └──────────┘ └──────┘
Total: ~20 min for 405B. Was 90 min from GCS.
The Concurrent Download Corruption Problem
When multiple pods share an NFS PVC and a new model version is deployed, naive init containers in each pod will download the model simultaneously. This creates a race condition: Pod A writes half the file, Pod B overwrites it, both end up with corrupt weights.
The solution: Use a central Kubernetes Job that downloads the model once to the shared PVC. Pods wait (via an init container that checks for a sentinel file) until the Job completes. This pattern is simple but eliminates an entire class of data corruption bugs.
apiVersion: batch/v1
kind: Job
metadata:
  name: download-llama-70b
spec:
  template:
    spec:
      containers:
      - name: downloader
        image: python:3.11
        # The base image does not ship huggingface-cli; install it first.
        # Touch a sentinel file on success so waiting pods know the
        # download is complete and intact.
        command: ["/bin/sh", "-c"]
        args:
        - |
          pip install --quiet "huggingface_hub[cli]" && \
          huggingface-cli download meta-llama/Llama-3-70B-Instruct \
            --local-dir /models/llama-3-70b && \
          touch /models/llama-3-70b/.complete
        volumeMounts:
        - name: model-store
          mountPath: /models
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
      volumes:
      - name: model-store
        persistentVolumeClaim:
          claimName: shared-model-store
      restartPolicy: OnFailure
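The consumer side of the pattern is an init container that blocks until the sentinel file appears. A minimal sketch of the pod-template fragment (the sentinel path `/models/llama-3-70b/.complete` and the busybox image are illustrative assumptions):

```yaml
# Init container that waits for the download Job's sentinel file
initContainers:
- name: wait-for-model
  image: busybox:1.36
  command: ["/bin/sh", "-c"]
  args:
  - |
    until [ -f /models/llama-3-70b/.complete ]; do
      echo "waiting for model download..."
      sleep 10
    done
  volumeMounts:
  - name: model-store
    mountPath: /models
    readOnly: true
```

The serving container mounts the same PVC read-only, so no replica can corrupt the weights after the Job completes.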
GKE Hyperdisk ML
Google’s Hyperdisk ML volumes provide up to 1.2 TB/s read throughput from a block storage volume. For Llama 3.1 405B loading, GKE benchmarks show reduction from 90 minutes (GCS download) to approximately 20 minutes with Hyperdisk ML, with further improvement possible through multi-volume striping.
The Hugging Face Ecosystem on Kubernetes
Text Generation Inference (TGI)
TGI was the first production-grade open-source LLM inference server. It pioneered several techniques now considered industry standard: continuous batching, flash attention integration, tensor parallelism, quantization support (GPTQ, AWQ, EETQ), and speculative decoding. TGI continues to be actively developed by Hugging Face; the 3.x releases added structured generation and multi-LoRA support, keeping it competitive with other inference servers. For new deployments, evaluate TGI alongside vLLM (for throughput) and SGLang (for structured generation and agent workloads) based on your specific requirements.
Text Embeddings Inference (TEI)
TEI is purpose-built for embedding and reranking models. Key characteristics:
- Small footprint: Embedding models (e.g., `BAAI/bge-large-en-v1.5` at 1.3 GB) fit on a single GPU or even CPU.
- Fast boot: Sub-second cold starts for small models.
- Dynamic batching: Automatically batches concurrent requests.
- Token-based API: `POST /embed` with an OpenAI-compatible response format.
TEI is the right choice for embedding pipelines in RAG architectures. Run it on a small GPU (T4, L4) or CPU nodes to keep costs minimal.
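A minimal TEI Deployment sketch under stated assumptions: the image tag, model, and GPU request are illustrative, and TEI publishes per-GPU-architecture image variants, so check the TEI documentation for the one matching your hardware.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tei-embeddings
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tei-embeddings
  template:
    metadata:
      labels:
        app: tei-embeddings
    spec:
      containers:
      - name: tei
        image: ghcr.io/huggingface/text-embeddings-inference:1.5  # assumed tag
        args: ["--model-id", "BAAI/bge-large-en-v1.5"]
        ports:
        - containerPort: 80   # TEI serves HTTP on port 80 by default
        resources:
          limits:
            nvidia.com/gpu: 1
```

Because the model is small, this pod cold-starts in seconds rather than minutes, which is why none of the heavyweight model-loading strategies above are needed for embedding workloads.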
Hub Integration on Kubernetes
Most Hugging Face models are served from the Hugging Face Hub. On Kubernetes, the integration pattern is:
- Authentication: Store your token in a Kubernetes Secret and mount it as the `HUGGING_FACE_HUB_TOKEN` (or `HF_TOKEN`) environment variable.
- Caching: The Hub client caches downloads in `~/.cache/huggingface/hub`. Mount a PVC at this path to persist downloads across pod restarts.
- Multi-replica caching: For multiple replicas sharing a model, use a ReadWriteMany NFS PVC with a pre-population Job (as described above). This ensures one download, many readers.
env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-secret
      key: token
- name: HF_HOME
  value: /models/cache
volumeMounts:
- name: model-cache
  mountPath: /models/cache
NVIDIA NIM
NVIDIA NIM (NVIDIA Inference Microservices) provides pre-optimized inference containers. Rather than configuring TensorRT profiles, quantization settings, and parallelism parameters yourself, NIM containers ship with models already optimized for specific GPU configurations.
Why NIM Matters
Raw vLLM or Triton deployments require significant tuning: choosing the right quantization (GPTQ, AWQ, fp8), compiling TensorRT-LLM engines for your GPU architecture, setting optimal batch sizes and cache configurations. NIM pre-solves this optimization problem. NVIDIA benchmarks show 2.6x throughput improvement over off-the-shelf vLLM deployment for the same model on the same hardware.
NIM Operator 3.0.0
The NIM Operator manages NIM containers on Kubernetes:
- Multi-LLM: Deploy and manage multiple models from a single operator.
- Multi-node: Automatic configuration of tensor and pipeline parallelism across nodes.
- DRA support: Integrates with Dynamic Resource Allocation for fine-grained GPU management.
- NIMCache CRD: Pre-downloads and caches model engines on nodes, solving the cold-start problem.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-70b-nim
spec:
  image: nvcr.io/nim/meta/llama-3-70b-instruct:latest
  replicas: 2
  resources:
    gpus: 4              # TP=4 automatically configured
  storage:
    nimCache: llama-cache  # Pre-populated NIMCache
When to Use NIM vs vLLM
Use NIM when you need maximum performance with minimal tuning effort and are running NVIDIA-supported models on NVIDIA GPUs. The pre-optimization is the differentiator: NIM containers include TensorRT-LLM engines compiled for specific GPU architectures, with quantization, batching, and cache settings already tuned. You trade flexibility for performance.
Use vLLM directly when you need full control over serving configuration, run non-NVIDIA hardware (AMD ROCm, Intel Gaudi), serve models not in the NIM catalog, or need to customize the serving logic (custom sampling, constrained decoding, speculative decoding with draft models). vLLM’s open-source community moves fast — new model architectures are typically supported within days of release.
Common Mistakes and Misconceptions
- “Serving an LLM is just deploying a container.” Large models need tensor parallelism across multiple GPUs, KV cache management, continuous batching, and careful memory planning.
- “Bigger instances are always better for LLM serving.” Cost-per-token often favors multiple smaller GPU instances over fewer large ones, depending on model size and batching strategy. Profile your specific model to find the cost-optimal configuration.
- “Auto-scaling LLM inference works like web services.” LLM pods take minutes to load models into GPU memory. Scale-from-zero is extremely slow. Maintain warm replicas and scale on custom metrics (queue depth, KV cache utilization) rather than CPU.
- “All LLM serving frameworks are interchangeable.” vLLM excels at throughput with PagedAttention, TGI integrates tightly with Hugging Face models, Triton supports multi-model serving. Choose based on your specific model and serving requirements.
Further Reading
- vLLM Documentation and GitHub — the open-source inference engine covering PagedAttention, continuous batching, tensor parallelism, and supported model architectures.
- KServe Documentation — the Kubernetes-native model inference platform, including its InferenceService CRD, model mesh, and autoscaling configuration.
- llm-d GitHub Repository — the Kubernetes-native LLM serving stack with disaggregated prefill/decode, KV-cache-aware routing, and LoRA adapter management.
- LeaderWorkerSet Documentation — the Kubernetes SIG-Apps project for deploying multi-node inference workloads where one leader coordinates multiple workers for tensor and pipeline parallelism.
- NVIDIA Triton Inference Server Documentation — NVIDIA’s production inference server covering model ensembles, dynamic batching, and multi-framework support.
- Text Generation Inference (TGI) by Hugging Face — Hugging Face’s optimized inference server with flash attention, quantization, watermarking, and grammar-constrained generation.
- Efficient Memory Management for Large Language Model Serving with PagedAttention (paper) — the foundational paper on PagedAttention that enables vLLM’s near-optimal KV cache memory management.
- Anyscale: How Continuous Batching Enables 23x Throughput — a practical explanation of why continuous (iteration-level) batching dramatically outperforms static batching for LLM serving.
Next: Disaster Recovery — cluster backup, etcd snapshots, multi-region strategies, and the procedures you test before you need them.
Chapter 43: Disaster Recovery
A Kubernetes cluster is not a single thing that fails in a single way. The control plane can fail while workloads keep running. A namespace can be accidentally deleted while the rest of the cluster is fine. An entire region can go dark. Disaster recovery for Kubernetes requires thinking in layers: the cluster state layer and the workload layer, each with its own backup strategy, its own restore procedure, and its own failure modes.
Two-Layer Backup Strategy
Kubernetes disaster recovery operates on two distinct layers, and you need both.
flowchart TD
subgraph Layer1["LAYER 1: CLUSTER STATE (etcd)"]
Etcd["etcd snapshots capture ALL cluster state:<br>- Every resource object (Pods, Deployments, Services)<br>- RBAC rules, NetworkPolicies, CRDs<br>- Secrets, ConfigMaps<br>- Custom resources (operators, databases, etc.)"]
NotCaptured["Does NOT capture:<br>- Persistent Volume data<br>- Container images<br>- External state (DNS, load balancers, IAM)"]
end
subgraph Layer2["LAYER 2: WORKLOAD BACKUP (Velero)"]
Velero["Velero backs up K8s resources AND persistent volumes:<br>- Namespace-scoped resource manifests<br>- PersistentVolume snapshots (via CSI or cloud APIs)<br>- Label/annotation-based selection<br>- Scheduled backups on a cron cadence"]
Storage["Stored externally in object storage<br>(S3, GCS, MinIO)"]
end
Layer1 --- Why{{"WHY BOTH?"}}
Layer2 --- Why
Why --> EtcdUse["etcd snapshots: Full cluster restore<br>after total loss. All or nothing."]
Why --> VeleroUse["Velero backups: Surgical restore of<br>specific namespaces or workloads.<br>Includes PV data. Cross-cluster migration."]
etcd snapshots are your insurance against total cluster loss. They capture the complete cluster state at a point in time. But they are all-or-nothing — you cannot restore a single namespace from an etcd snapshot without restoring everything. They also do not include persistent volume data.
Velero (formerly Heptio Ark) fills the gaps. It backs up Kubernetes resource manifests and can snapshot persistent volumes via CSI snapshot support or cloud provider APIs. It supports selective backup by namespace, label, or resource type. And it can restore into a different cluster, which makes it invaluable for migration.
Velero in Practice
Backup Configuration
# Install Velero with AWS plugin
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket my-cluster-backups \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1
# Create a scheduled backup
velero schedule create daily-production \
--schedule="0 2 * * *" \
--include-namespaces=production,staging \
--ttl=720h \
--snapshot-volumes=true
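The CLI is a convenience wrapper: `velero schedule create` stores a `Schedule` custom resource in the cluster, which you can also manage declaratively for GitOps workflows. A sketch whose field values mirror the command above:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production
  namespace: velero
spec:
  schedule: "0 2 * * *"      # daily at 02:00
  template:
    includedNamespaces:
    - production
    - staging
    ttl: 720h0m0s            # retain backups for 30 days
    snapshotVolumes: true
```

Keeping the Schedule in Git means your backup policy is itself backed up, outside the cluster it protects.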
Selective Backup and Restore
Velero’s label selectors allow targeted backups:
# Back up only resources with a specific label
velero backup create critical-apps \
--selector app.kubernetes.io/tier=critical
# Back up everything except infrastructure namespaces
# (namespace filters take literal names; Velero does not expand glob patterns)
velero backup create full-backup \
  --exclude-namespaces=kube-system,monitoring,temp-scratch
Restore with Dependency Ordering
A common failure mode during restore is attempting to create resources before their dependencies exist — a Deployment that references a ConfigMap that has not been restored yet. Velero handles this through a priority-based restore order:
flowchart TD
S1["1. Cluster-scoped resources<br>(Namespaces, ClusterRoles, CRDs, StorageClasses)"]
S2["2. Namespace-scoped foundation<br>(ServiceAccounts, ConfigMaps, Secrets, PVCs)"]
S3["3. Workload resources<br>(Deployments, StatefulSets, DaemonSets, Services)"]
S4["4. Dependent resources<br>(Ingress, NetworkPolicies, HPA, PodDisruptionBudgets)"]
S5["5. Custom Resources<br>(CRD instances -- restored after CRDs exist)"]
S6["6. Volume data<br>(PV snapshots restored and bound to new PVCs)"]
S1 --> S2 --> S3 --> S4 --> S5 --> S6
You can customize this order via restore hooks and init containers to wait for dependencies.
RPO and RTO
Two metrics define your disaster recovery targets:
Recovery Point Objective (RPO): How much data can you afford to lose? If your etcd snapshots run hourly and Velero backups run daily, your RPO is the time since the last relevant backup. An RPO of 1 hour means you accept losing up to 1 hour of changes.
Recovery Time Objective (RTO): How quickly must you be back in service? This includes time to detect the failure, execute the recovery procedure, verify the cluster is healthy, and confirm application availability.
| Scenario | Typical RPO | Typical RTO |
|---|---|---|
| Single namespace deletion | Minutes (Velero) | 15–30 minutes |
| Control plane failure (etcd intact) | 0 (no data loss) | 5–15 minutes |
| Total cluster loss (etcd gone) | Last etcd snapshot interval | 1–4 hours |
| Full region failure | Last cross-region replication | 15 min – 4 hours |
The gap between your target RPO/RTO and your actual tested RPO/RTO is your risk. Measure both.
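Measuring actual RPO starts with alerting on backup staleness. A hedged sketch using a PrometheusRule, assuming Velero's default Prometheus instrumentation (the metric name and the 26-hour threshold, a daily schedule plus slack, are assumptions to adapt):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-freshness
  namespace: velero
spec:
  groups:
  - name: velero
    rules:
    - alert: VeleroBackupTooOld
      # Fires when the last successful backup for the daily schedule
      # is older than 26 hours, i.e. the daily RPO target is breached
      expr: time() - velero_backup_last_successful_timestamp{schedule="daily-production"} > 26 * 3600
      for: 15m
      labels:
        severity: critical
```

Without an alert like this, a silently failing backup job turns your assumed RPO into a fiction, as the next section's war stories illustrate.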
Multi-Region Strategies
For organizations that cannot tolerate the RTO of rebuilding from backup, multi-region architecture provides resilience at the infrastructure level.
Active-Active
Two or more clusters in different regions serve traffic simultaneously. A global load balancer distributes requests. Stateful workloads either use a multi-region database (CockroachDB, Spanner) or accept eventual consistency.
Pros: Near-zero RTO for region failure. No cold-start latency. Cons: Operationally complex. Data consistency is hard. Cost doubles (or more).
Active-Passive
A primary cluster serves all traffic. A standby cluster in another region has the same applications deployed but receives no traffic. On failure, DNS or the load balancer shifts traffic to the standby.
Pros: Simpler than active-active. Lower cost (standby can be smaller). Cons: RTO is limited by DNS propagation and application warm-up. Standby cluster may have stale data.
Partitioned (Regional Affinity)
Each region operates independently, serving only users or workloads in that region. There is no failover between regions — each is self-contained.
Pros: Simplest multi-region model. Data sovereignty compliance. Cons: No cross-region resilience. If a region goes down, its users are affected.
MULTI-REGION STRATEGY COMPARISON
──────────────────────────────────
ACTIVE-ACTIVE ACTIVE-PASSIVE
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Region A │ │ Region B │ │ Region A │ │ Region B │
│ ████████ │ │ ████████ │ │ ████████ │ │ (standby)│
│ traffic │ │ traffic │ │ traffic │ │ │
└─────┬────┘ └─────┬────┘ └─────┬────┘ └─────┬────┘
│ │ │ │
└──────┬─────┘ │ (failover)
│ │ │
Global LB Primary ──────▶ Promote
on failure
PARTITIONED
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Region A │ │ Region B │ │ Region C │
│ Users: A │ │ Users: B │ │ Users: C │
│ Data: A │ │ Data: B │ │ Data: C │
└──────────┘ └──────────┘ └──────────┘
(independent, no cross-region failover)
Testing Recovery
Testing restores is not optional — an untested backup procedure is an untested promise. Teams routinely discover during an actual incident that:
- The backup credentials have rotated and the backup job has been silently failing for weeks.
- The etcd snapshot is from the wrong cluster (staging, not production).
- The Velero restore fails because the target cluster has a different Kubernetes version and the CRD schemas are incompatible.
- The persistent volume snapshots are in a different region from the recovery cluster.
- The restore completes but the application does not start because it depends on an external service that was not part of the backup.
Testing Practices
- Schedule monthly restore drills. Restore to a separate cluster and verify application health. Automate as much as possible.
- Test at every layer. Restore a single namespace from Velero. Restore an entire cluster from an etcd snapshot. Fail over to a standby region.
- Measure actual RTO. Start a timer when the drill begins. Stop when the application is serving traffic. Compare against your target. If you miss the target, the plan needs work.
- Break things intentionally. Delete a namespace. Corrupt an etcd member. Simulate a region failure by blocking network traffic. Chaos engineering is the only honest test of resilience.
- Verify data integrity. After restore, do not just check that pods are running. Verify that the application data is consistent and correct. A running pod with a corrupted database is not a successful recovery.
Documented Runbooks
Disaster recovery procedures must be written down, version-controlled, and accessible during an outage. A runbook stored in the cluster that just failed is useless.
A good runbook includes:
- Prerequisites: What tools, credentials, and access are needed?
- Decision tree: Which procedure applies to which failure scenario?
- Step-by-step commands: Copy-pasteable, with placeholders clearly marked.
- Verification steps: How to confirm each step succeeded before proceeding.
- Rollback: What to do if the recovery makes things worse.
- Communication plan: Who to notify, what channels to use, what to tell customers.
Store runbooks in a location that survives the failure of the thing they describe. A Git repository in a different cloud account. A wiki on a different provider. Printed copies in a binder (yes, really, for the truly catastrophic scenarios).
Putting It Together
A complete disaster recovery strategy for Kubernetes looks like this:
- etcd snapshots every hour, uploaded to cross-region object storage with versioning and lifecycle rules.
- Velero scheduled backups daily, with volume snapshots, stored in a separate object storage bucket.
- Multi-region standby cluster for production workloads that cannot tolerate multi-hour RTO.
- Monthly restore drills that exercise both etcd restore and Velero restore paths.
- Runbooks that have been used successfully in a drill within the last quarter.
- Monitoring and alerting on backup job success/failure, backup age, and storage health.
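The hourly etcd snapshot can be automated with a CronJob. This is a sketch, not a drop-in: the certificate paths and host paths assume a kubeadm-style cluster, the image tag must match your etcd version, and uploading the snapshot to cross-region object storage (the actual point of the exercise) is omitted for brevity.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
  namespace: kube-system
spec:
  schedule: "0 * * * *"   # hourly, matching the strategy above
  jobTemplate:
    spec:
      template:
        spec:
          # Must run on a control plane node to reach etcd and its certs
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
          hostNetwork: true
          containers:
          - name: snapshot
            image: registry.k8s.io/etcd:3.5.15-0   # assumed; match your etcd version
            command: ["/bin/sh", "-c"]
            args:
            - |
              etcdctl snapshot save "/backup/etcd-$(date +%Y%m%d-%H%M).db" \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/backups/etcd
          restartPolicy: OnFailure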
Common Mistakes and Misconceptions
- “Backing up etcd is enough for DR.” etcd contains cluster state, but not PersistentVolume data, external DNS records, cloud load balancers, or IAM configurations. A complete DR plan includes application data, infrastructure-as-code, and secrets.
- “Velero backs up everything.” Velero backs up Kubernetes resources and can snapshot cloud volumes, but it doesn’t back up external databases, object storage contents, or resources managed outside K8s. Know what’s covered and what isn’t.
- “I’ll figure out DR when I need it.” By definition, you need DR during an emergency when you have the least capacity for planning. Test restores quarterly. An untested backup is not a backup.
Further Reading
- Velero Documentation — the open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes, including scheduled backups, storage provider plugins, and restore workflows.
- Kubernetes Documentation: Operating etcd Clusters — the official guide covering etcd backup and restore procedures, snapshot management, and cluster upgrade strategies.
- Kubernetes Documentation: Backing Up an etcd Cluster — step-by-step instructions for taking etcd snapshots with etcdctl and restoring from them.
- Velero: Disaster Recovery — Velero’s official disaster recovery guide covering scheduled backups, storage location management, and step-by-step restore procedures.
- AWS EKS Best Practices: Disaster Recovery — AWS-specific patterns for multi-region EKS deployments, cross-region replication, and failover automation.
- Google Cloud: Disaster Recovery Planning Guide — Google’s framework for DR planning including cold, warm, and hot standby patterns applicable to GKE and hybrid deployments.
- Kubernetes SIG Cluster Lifecycle — the upstream SIG responsible for cluster provisioning, upgrades, and lifecycle tooling that underpins recovery automation.
Next: Cost Optimization — making sure all this infrastructure is not more expensive than it needs to be.
Chapter 44: Cost Optimization
Kubernetes makes it easy to deploy applications and hard to understand what they cost. A developer requests 2 CPU cores and 4 GB of memory for a service that uses 0.3 cores and 800 MB at peak. Multiply that by hundreds of services across dozens of namespaces, and you arrive at the industry average: only 13% of requested CPU is actually used. The rest is reserved but idle, burning money on cloud provider invoices.
The Cost Problem
The disconnect between requested and used resources exists because of a rational incentive: nobody wants their service to be OOM-killed or CPU-throttled, so everyone over-provisions.
THE RESOURCE EFFICIENCY GAP
─────────────────────────────
Requested CPU Actual CPU Used
┌──────────────────────────┐ ┌──────────────────────────┐
│██████████████████████████│ │███░░░░░░░░░░░░░░░░░░░░░░░│
│██████████████████████████│ │███░░░░░░░░░░░░░░░░░░░░░░░│
│██████████████████████████│ │███░░░░░░░░░░░░░░░░░░░░░░░│
│ 100 cores │ │ 13 cores 87 wasted │
└──────────────────────────┘ └──────────────────────────┘
██ = Allocated/Used ░░ = Allocated but Idle
Industry average: 13% CPU utilization of requested resources
Typical savings from right-sizing: 30-50%
This is not a Kubernetes problem per se — the same over-provisioning existed in the VM world. But Kubernetes makes it both more visible (you can measure it) and more actionable (you can change it without reprovisioning hardware).
Right-Sizing with VPA and Goldilocks
The Vertical Pod Autoscaler (VPA) observes actual resource usage over time and recommends (or automatically sets) CPU and memory requests. In recommendation mode, it does not change anything — it just tells you what the values should be.
Goldilocks (from Fairwinds) wraps VPA in a dashboard that shows recommendations for every deployment in a namespace. It creates a VPA object in recommendation mode for each deployment and surfaces the results in a web UI.
# Install Goldilocks
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace
# Enable for a namespace
kubectl label namespace production goldilocks.fairwinds.com/enabled=true
After a few days of observation, Goldilocks will show you something like:
| Deployment | Current Request | Recommended | Monthly Savings |
|---|---|---|---|
| api-server | 2 CPU / 4 Gi | 500m CPU / 1 Gi | $340 |
| worker | 4 CPU / 8 Gi | 1.5 CPU / 3 Gi | $520 |
| frontend | 1 CPU / 2 Gi | 200m CPU / 512 Mi | $180 |
| cache | 2 CPU / 16 Gi | 500m CPU / 12 Gi | $85 |
Typical savings from right-sizing are 30–50% of compute cost. This is the lowest-effort, highest-impact optimization available.
Caution: Do not blindly apply VPA recommendations. Review them in the context of peak load, seasonal patterns, and latency requirements. A recommendation based on two weeks of low traffic will not survive Black Friday.
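Under the hood, Goldilocks creates one VPA object per workload in recommendation mode. The hand-written equivalent for a single Deployment (the target name is illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or mutate pods
```

With `updateMode: "Off"`, recommendations appear in the VPA's status field (`kubectl describe vpa api-server-vpa`), so you can review them before changing any requests.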
Spot and Preemptible Instances
Cloud providers sell unused capacity at steep discounts — 60–90% off on-demand pricing. The trade-off is that the instances can be reclaimed on short notice: roughly 2 minutes for AWS Spot and as little as 30 seconds for GCP Spot VMs (formerly Preemptible).
Kubernetes makes spot instances practical because it was designed for failure. Pods are ephemeral. Deployments replace terminated pods automatically. The key is ensuring your workloads can tolerate interruption.
Karpenter and Spot
Karpenter excels at spot instance management. It can:
- Diversify across many instance types to reduce interruption probability
- Automatically replace interrupted nodes
- Mix spot and on-demand in a single NodePool via `capacity-type` weights
- Consolidate workloads onto fewer nodes as demand decreases
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workers
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m5.xlarge", "m5a.xlarge", "m6i.xlarge",
                 "m6a.xlarge", "c5.xlarge", "c6i.xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "200"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 60s
Best practice: Run stateful and platform-critical workloads (monitoring, CI, databases) on on-demand instances. Run stateless application workloads (web servers, API handlers, batch jobs) on spot. The cost savings typically range from 60–90% for the spot-eligible portion of your fleet.
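This split can be expressed in the pod spec, because Karpenter labels every node it provisions with `karpenter.sh/capacity-type`. A pod-template fragment that prefers spot but can fall back to on-demand when no spot capacity is available:

```yaml
# Pod template fragment: prefer spot capacity, fall back to on-demand
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
```

For workloads that must never land on spot, use a hard `nodeSelector` with `karpenter.sh/capacity-type: on-demand` instead of a soft preference.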
Cost Attribution with Kubecost and OpenCost
You cannot optimize what you cannot measure. Kubecost and OpenCost provide cost attribution — breaking down cluster costs by namespace, deployment, label, or any other dimension.
OpenCost is the open-source standard for Kubernetes cost monitoring, donated to the CNCF by Kubecost. It calculates costs by:
- Querying cloud provider pricing APIs for node costs
- Allocating node costs to pods based on resource requests (and optionally usage)
- Adding persistent volume and network costs
- Aggregating by any Kubernetes metadata (namespace, label, annotation)
Chargeback via Labels
The foundation of cost attribution is consistent labeling. Every workload should carry labels that identify its owner and purpose:
metadata:
  labels:
    app.kubernetes.io/name: checkout-service
    app.kubernetes.io/part-of: ecommerce
    cost-center: "CC-4521"
    team: payments
    environment: production
With these labels, you can generate reports like:
| Team | Namespace | Monthly Cost | CPU Efficiency | Memory Efficiency |
|---|---|---|---|---|
| Payments | payments-prod | $4,200 | 22% | 45% |
| Search | search-prod | $8,100 | 31% | 52% |
| ML | ml-training | $12,500 | 78% | 65% |
| Platform | monitoring | $2,300 | 15% | 40% |
The ML team has high efficiency because GPU workloads tend to saturate resources. The platform team has low efficiency because monitoring tools are sized for peak incident load. Context matters — not every namespace should target the same efficiency percentage.
Cluster Consolidation
Karpenter Consolidation
Karpenter’s consolidation feature continuously evaluates whether workloads can be packed onto fewer or cheaper nodes:
- WhenEmpty: Remove nodes that have no non-daemonset pods.
- WhenEmptyOrUnderutilized: Also replace nodes when their workloads could fit on other existing nodes or on a single cheaper node.
This is particularly powerful in clusters with variable load. During off-peak hours, Karpenter consolidates workloads onto fewer nodes and terminates the empties. During peak, it scales back out.
kube-green for Off-Hours
Many development and staging environments are used only during business hours. kube-green scales workloads to zero during off-hours:
apiVersion: kube-green.com/v1alpha1
kind: SleepInfo
metadata:
  name: working-hours
  namespace: development
spec:
  weekdays: "1-5"
  sleepAt: "20:00"
  wakeUpAt: "08:00"
  timeZone: "America/New_York"
  suspendDeployments: true
  suspendStatefulSets: true
  suspendCronJobs: true
If your development cluster costs $10,000/month and is used 10 hours a day, 5 days a week, kube-green can reduce that to roughly $3,000/month — a 70% savings with zero impact on developer productivity.
Unused Resource Detection
Waste hides in plain sight. Common sources of orphaned cost:
- Unattached PersistentVolumes: PVCs deleted but PVs retained due to the `Retain` reclaim policy. Cloud disks still billing.
- Idle load balancers: Services of type LoadBalancer that no longer receive traffic.
- Orphaned node groups: Managed node groups or ASGs with minimum size > 0 but no workloads scheduled.
- Oversized namespaces: Test namespaces that were never cleaned up.
- Unused ConfigMaps and Secrets: Resources referenced by nothing.
Tools like kubectl-cost (from Kubecost), pluto (for deprecated APIs), and custom scripts that compare resource references against actual usage can surface these.
Optimization Strategy Comparison
| Strategy | Effort | Typical Savings | Risk |
|---|---|---|---|
| Right-sizing (VPA/Goldilocks) | Low | 30–50% | Under-provisioning causes latency/OOM |
| Spot/Preemptible instances | Medium | 60–90% of eligible workloads | Interruption, requires fault tolerance |
| Off-hours scaling (kube-green) | Low | 50–70% for non-prod | Forgot to wake up before a demo |
| Cluster consolidation (Karpenter) | Medium | 20–40% | Consolidation churn, scheduling delays |
| Unused resource cleanup | Low | 5–15% | Accidentally deleting needed resources |
| Reserved instances / savings plans | Low | 30–40% vs on-demand | Lock-in, less flexibility |
| Namespace resource quotas | Low | Preventive (caps waste) | Blocks legitimate scaling |
The highest-ROI strategy for most organizations is to start with right-sizing (immediate, low-risk, high-impact) and then layer on spot instances for eligible workloads. Together, these two strategies alone typically reduce compute costs by 50–70%.
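The quota row in the table above is the preventive control: a ResourceQuota caps what a namespace can request in total, so over-provisioning cannot grow unbounded. The limits here are illustrative and should be sized from observed usage plus headroom:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments-prod
spec:
  hard:
    requests.cpu: "40"       # total CPU the namespace may request
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
```

Once a quota exists, a pod whose requests would exceed it is rejected at admission, which turns "please right-size your service" from a suggestion into a conversation.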
Building a Cost-Aware Culture
Tools and automation are necessary but not sufficient. Cost optimization sticks only when teams have visibility and accountability:
- Dashboard visibility. Put cost dashboards where developers already look — Grafana, Backstage, Slack summaries. If people have to seek out cost data, they will not.
- Cost in the deploy pipeline. Show the cost impact of resource request changes in pull request comments. “This change increases monthly cost for checkout-service by $120.”
- Team-level budgets. Allocate cloud budgets to teams, not just to the organization. When a team sees that their namespace costs $8,000/month, they start asking whether that staging environment with 16 replicas is really necessary.
- Regular review cadence. Monthly cost reviews at the team level, quarterly at the organization level. Celebrate wins (a team that cut costs 40% through right-sizing) and investigate anomalies (a namespace that doubled in cost with no traffic increase).
The goal is not to minimize cost — it is to maximize the value per dollar.
Common Mistakes and Misconceptions
- “Kubernetes saves money.” Kubernetes adds overhead: control plane costs, monitoring, engineer expertise, and operational complexity. It saves money at scale through bin-packing and automation, but small deployments often cost more than VMs.
- “Spot instances are always 60-90% cheaper.” Spot pricing is dynamic. Popular instance types in busy regions may offer small discounts. Diversify across instance families and AZs. Karpenter handles this automatically.
- “Right-sizing is a one-time task.” Application resource needs change with code changes, traffic patterns, and data growth. Continuous monitoring with VPA recommendations or tools like Kubecost is necessary to prevent drift.
Further Reading
- OpenCost Documentation — the CNCF open-source standard for real-time Kubernetes cost monitoring with allocation by namespace, label, and deployment.
- OpenCost Project — the CNCF sandbox project for Kubernetes cost monitoring, providing a vendor-neutral open-source specification and implementation for cost allocation.
- FinOps Foundation — the industry body defining FinOps practices, frameworks, and maturity models for managing cloud costs across engineering and finance teams.
- AWS: Best Practices for EC2 Spot Instances — AWS guidance on diversifying instance types, handling interruptions, and using Spot with EKS node groups and Karpenter.
- GKE Cost Optimization Guide — Google’s recommendations for GKE right-sizing, cluster autoscaling, committed use discounts, and Spot VMs.
- Kubernetes Documentation: Resource Management for Pods and Containers — the official reference for requests, limits, QoS classes, and LimitRanges that form the foundation of cost control.
- Goldilocks by Fairwinds — an open-source tool that runs VPA in recommendation mode and presents a dashboard of right-sizing suggestions per workload.
Next: Observability with OpenTelemetry — making sure you can see what is happening inside all these workloads.
Chapter 45: Observability with OpenTelemetry
Observability is the ability to understand the internal state of a system by examining its external outputs. In a Kubernetes environment, those outputs are metrics (numerical measurements over time), logs (discrete events with context), and traces (the path of a request through multiple services). These are the three pillars, and OpenTelemetry is the open standard that unifies how they are collected, processed, and exported.
The Three Pillars
| Signal | Answers | Strength | Limitation |
|---|---|---|---|
| Metrics | What is happening right now and how does it compare to the past? (CPU utilization, request latency percentiles, error rates, queue depths) | Cheap to store, fast to query, excellent for dashboards and alerting | Terrible for debugging specific requests |
| Logs | What happened in this specific component at this specific time? (stack traces, failed SQL queries, loaded configuration values) | Rich in context, excellent for debugging | Expensive to store and slow to search at scale; terrible for detecting trends |
| Traces | What was the path of this specific request through the system? (timing and outcome of each hop across services) | Essential for debugging latency in distributed systems | Nearly useless for trend detection or component-level debugging |
No single pillar is sufficient. Effective observability requires all three, correlated so you can move from a metric anomaly to the relevant traces to the specific log lines that explain the root cause.
OpenTelemetry Architecture
OpenTelemetry (OTel) provides a vendor-neutral framework for instrumentation, collection, and export of telemetry data. The key components are:
- SDKs: Language-specific libraries that instrument applications (auto-instrumentation or manual)
- Collector: A standalone binary that receives, processes, and exports telemetry data
- Protocol (OTLP): The wire format for transmitting telemetry between components
Collector Deployment Patterns
The Collector is the workhorse of the OTel pipeline. How you deploy it determines the reliability, scalability, and cost of your observability stack.
flowchart TD
subgraph P1["PATTERN 1: DAEMONSET / AGENT (most widely adopted)"]
subgraph N1["Node 1"]
A1["App A"] --> C1["OTel Collector"]
A2["App B"] --> C1
end
subgraph N2["Node 2"]
A3["App C"] --> C2["OTel Collector"]
A4["App D"] --> C2
end
C1 --> B1["Backends"]
C2 --> B1
end
P1 --> P2
subgraph P2["PATTERN 2: SIDECAR (per-pod collector, high isolation)"]
subgraph Pod["Pod"]
App["App"] --> OTel["OTel Collector"]
end
OTel --> B2["Backend"]
end
P2 --> P3
subgraph P3["PATTERN 3: GATEWAY (centralized, scaled Deployment)"]
GA["App A"] --> GW["OTel Collector Gateway"]
GB["App B"] --> GW
GC["App C"] --> GW
GW --> B3["Backend"]
end
style P1 stroke:#326CE5
style P2 stroke:#326CE5
style P3 stroke:#326CE5
DaemonSet (Agent) is the recommended pattern for most clusters. Each node runs a collector pod that receives telemetry from all application pods on that node via localhost. This minimizes network hops, provides natural load distribution, and fails gracefully (a collector crash affects only one node).
Sidecar provides the strongest isolation — each application pod has its own collector. Use this when different applications require different collection configurations or when you need strict resource accounting per application. The cost is significant: every pod runs an additional container.
Gateway centralizes collection into a single deployment. Use this as a second tier behind agents (agent → gateway → backend) for cross-cutting processing like tail sampling, enrichment, or routing to multiple backends. Do not use a gateway as the sole collector tier — it creates a single point of failure.
The production pattern is Agent + Gateway: node-level agents forward to a gateway for sampling and export.
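A minimal agent-tier configuration for this pattern might look like the following sketch. The gateway Service name (`otel-gateway.observability`) and the memory limits are illustrative assumptions, not fixed values:

```yaml
# Agent (DaemonSet) collector: receive OTLP from pods on this node,
# guard memory, batch, and forward everything to the gateway tier.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800
    spike_limit_mib: 200
  batch: {}
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc:4317
    tls:
      insecure: true  # assumes in-cluster traffic; use TLS/mTLS in production
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

The agent stays deliberately thin; anything that needs a global view of a trace (tail sampling, enrichment) belongs in the gateway tier.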
The LGTM Stack
The most widely adopted open-source backend stack for Kubernetes observability is LGTM:
| Component | Signal | Role |
|---|---|---|
| Loki | Logs | Log aggregation. Indexes labels, not content. Cheap at scale. |
| Grafana | All | Visualization and dashboarding. Unified query interface. |
| Tempo | Traces | Distributed tracing backend. Object-storage-based. |
| Mimir | Metrics | Long-term metrics storage. Horizontally scalable Prometheus. |
This stack is entirely open source (all Grafana Labs projects under AGPLv3) and can be self-hosted or consumed as Grafana Cloud. The key advantage over alternatives is the tight integration — Grafana can correlate a metric spike to traces to logs without leaving the UI.
flowchart TD
Apps["Applications"]
Collector["OTel Collector Agent<br>(DaemonSet)"]
Mimir["Mimir<br>(metrics)"]
Tempo["Tempo<br>(traces)"]
Loki["Loki<br>(logs)"]
Grafana["Grafana<br>(query, visualize, alert)"]
Apps --> Collector
Collector -- "metrics" --> Mimir
Collector -- "traces" --> Tempo
Collector -- "logs" --> Loki
Mimir --> Grafana
Tempo --> Grafana
Loki --> Grafana
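The fan-out in the diagram corresponds to one collector configuration with three pipelines. The following is a sketch, not a canonical config: the endpoints are assumed service names, and the choice of `otlphttp` for Loki assumes a recent Loki version with native OTLP ingest (older versions used the contrib `loki` exporter instead):

```yaml
exporters:
  prometheusremotewrite:            # metrics -> Mimir
    endpoint: http://mimir.observability:9009/api/v1/push
  otlp/tempo:                       # traces -> Tempo
    endpoint: tempo.observability:4317
    tls:
      insecure: true
  otlphttp/loki:                    # logs -> Loki (native OTLP ingest)
    endpoint: http://loki.observability:3100/otlp
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
```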
The OpenTelemetry Operator
The OTel Operator is a Kubernetes operator that manages OTel Collectors and provides auto-instrumentation for application pods.
Auto-Instrumentation
Instead of modifying application code to import OTel SDKs, you annotate pods and the operator injects the instrumentation automatically:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  template:
    metadata:
      annotations:
        # must be on the Pod template, not the Deployment itself —
        # the operator's webhook inspects Pod annotations at admission
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
      - name: checkout
        image: myapp/checkout:v1.2.3
The operator supports auto-instrumentation for:
- Java — via a Java agent injected as an init container
- Python — via the `opentelemetry-instrument` wrapper
- .NET — via the .NET startup hook
- Node.js — via the `@opentelemetry/auto-instrumentations-node` package
- Go — via eBPF-based instrumentation (more limited than other languages)
Auto-instrumentation captures HTTP requests, database queries, gRPC calls, and messaging operations without any code changes. It is the fastest path to distributed tracing in an existing application.
Instrumentation Resource
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: production
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"  # sample 10% of traces
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest  # pin a tested version in production
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest  # pin a tested version in production
Signal Correlation
The power of observability comes from connecting the three signal types. When a metric alert fires for high latency on the checkout service, you want to click through to the traces that show which downstream call is slow, and then to the log lines from that specific call.
This requires consistent identifiers across signals:
- Trace context propagation: Every HTTP or gRPC call propagates `traceparent` headers (W3C Trace Context standard). The OTel SDKs handle this automatically.
- Trace ID in logs: Configure your logging library to include the trace ID and span ID in every log line. This allows Grafana to jump from a trace to the exact log lines produced during that span.
- Exemplars in metrics: Prometheus exemplars attach a trace ID to a specific metric observation, so you can click from a latency histogram bucket to a representative trace.
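Concretely, a structured log line that supports this correlation carries the active trace and span IDs as first-class fields. The field names below are illustrative (there is no single standard), and the IDs are the W3C Trace Context example values:

```json
{
  "ts": "2025-01-15T14:23:01Z",
  "level": "error",
  "service": "payment-svc",
  "msg": "Timeout waiting for DB connection",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}
```

With the trace ID present, Loki's derived fields (or an equivalent mapping) can turn it into a click-through link to the corresponding Tempo trace.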
SIGNAL CORRELATION FLOW
─────────────────────────
Grafana Dashboard
┌───────────────────────────────────────────────────┐
│ Checkout Latency p99 = 1.2s [▲ spike at 14:23] │
│ │ │
│ Click exemplar ──────────────────▶│ │
│ ▼ │
│ Trace: abc123 │
│ ├── checkout-svc 200ms │
│ ├── inventory-svc 150ms │
│ └── payment-svc 850ms ◄── slow! │
│ │ │
│ Click span ───────────▶ │
│ ▼ │
│ Logs for payment-svc, traceID=abc123: │
│ 14:23:01 WARN Connection pool exhausted │
│ 14:23:01 ERROR Timeout waiting for DB connection │
└───────────────────────────────────────────────────┘
Production Lessons
Teams that have deployed OTel in production at scale converge on a common set of lessons:
Version-Lock Everything
The OTel ecosystem moves fast. The Operator, Collector, and auto-instrumentation images must be compatible. Pin all three to tested versions and upgrade them together:
# Do not use "latest" in production
operator: ghcr.io/open-telemetry/opentelemetry-operator:v0.96.0
collector: otel/opentelemetry-collector-contrib:0.96.0
java-agent: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.1.0
Memory Requirements
OTel Collectors buffer data in memory. Under load, a DaemonSet collector can easily consume 1–2 GB of memory. Gateway collectors handling high-cardinality traces may need 4 GB or more. Size your collector pods with appropriate requests and limits, and set the `memory_limiter` processor as the first processor in your pipeline:
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 500
Start with Traces, Not Metrics
If you already have Prometheus for metrics, adding OTel for traces provides the most incremental value. Auto-instrumentation gives you distributed tracing with zero code changes. Migrating metrics to OTel can come later (and for many teams, Prometheus remains the better choice for metrics).
Sampling is Essential
Collecting 100% of traces is prohibitively expensive at scale. Use tail sampling at the gateway tier to keep:
- All error traces
- All slow traces (above a latency threshold)
- A random sample of normal traces (1–10%)
This captures the traces you actually need for debugging while keeping storage costs manageable.
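These three policies map directly onto the `tail_sampling` processor in the collector-contrib distribution. A sketch, with illustrative thresholds:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # how long to buffer a trace before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Because the decision requires seeing all spans of a trace, this processor must run at the gateway tier, where traffic for a given trace ID can be routed to a single collector instance.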
Target Allocator for Prometheus Scraping
If you use the OTel Collector to scrape Prometheus endpoints (replacing Prometheus itself), the Target Allocator distributes scrape targets across collector replicas. Without it, every collector scrapes every target, duplicating data. The Target Allocator requires careful resource provisioning — plan for 4 GB+ nodes in the allocator pool.
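Enabling the Target Allocator is a field on the collector custom resource managed by the operator. The sketch below assumes the `v1beta1` API (where `config` is structured YAML) and elides the scrape configs themselves; the exporter endpoint is an assumption:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: prom-scraper
  namespace: observability
spec:
  mode: statefulset         # the Target Allocator requires statefulset mode
  replicas: 3
  targetAllocator:
    enabled: true           # shard scrape targets across the 3 replicas
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: []   # populated per-replica by the allocator
    exporters:
      prometheusremotewrite:
        endpoint: http://mimir.observability:9009/api/v1/push
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]
```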
What to Monitor About Your Monitoring
Observability infrastructure is itself a system that can fail. Monitor:
- Collector memory and CPU usage (alert before OOM)
- Export failures (collector cannot reach backend)
- Queue depth (data backing up faster than it can be exported)
- Span drop rate (how much data is being discarded)
- Backend ingestion rate and storage growth
An observability system that silently drops data during the incident you need to debug is worse than no observability at all, because it gives you confidence that is not warranted.
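The collector's self-telemetry exposes Prometheus metrics for several of these signals. A sketch of an alert on export failures (the exact metric name, with or without a `_total` suffix, varies across collector versions, so verify against your deployment):

```yaml
groups:
  - name: otel-collector-health
    rules:
      - alert: OtelCollectorExportFailures
        # spans the collector attempted but failed to deliver to the backend
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OTel Collector {{ $labels.pod }} is failing to export spans"
```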
Common Mistakes and Misconceptions
- “Prometheus can store data forever.” Prometheus is designed for real-time monitoring with limited retention (default 15 days). For long-term storage, use Thanos, Cortex, or Grafana Mimir as a remote write backend.
- “More metrics are always better.” High-cardinality metrics (per-user, per-request-id labels) can overwhelm Prometheus and explode storage costs. Be intentional about labels. Cardinality is the primary cost driver in metrics systems.
- “Logging everything to stdout is sufficient.” Unstructured logs are hard to query. Use structured logging (JSON) with consistent fields (request_id, user_id, trace_id). This makes log aggregation systems (Loki, Elasticsearch) actually useful.
Further Reading
- Prometheus Documentation — the CNCF graduated project for metrics collection and alerting, covering PromQL, service discovery, recording rules, and alerting configuration.
- OpenTelemetry Documentation — the CNCF observability framework unifying traces, metrics, and logs with auto-instrumentation, SDKs, and the Collector pipeline.
- OpenTelemetry Collector Documentation — detailed reference for configuring receivers, processors, exporters, and connectors in the OTel Collector, including the Target Allocator for Prometheus scraping.
- Grafana Documentation — the visualization platform for building dashboards across Prometheus, Loki, Tempo, and other data sources.
- Grafana Loki Documentation — the log aggregation system designed for cost-effective storage with label-based indexing rather than full-text indexing.
- Jaeger Documentation — the CNCF graduated distributed tracing platform for monitoring and troubleshooting microservice architectures.
- kube-prometheus-stack (GitHub) — the Helm chart bundling Prometheus Operator, Grafana, Alertmanager, node-exporter, and pre-built Kubernetes dashboards and recording rules.
- Kubernetes SIG Instrumentation — the upstream SIG responsible for Kubernetes metrics, structured logging, tracing standards, and the metrics stability framework.
Back to: Table of Contents (00-README.md)
Appendix A: Glossary
This appendix provides a quick-reference glossary for terms used throughout the book. Entries are organized alphabetically with cross-references to the chapter where each concept is covered in depth.
Admission Controller — A plugin that intercepts requests to the Kubernetes API server after authentication and authorization but before the object is persisted, used to validate or mutate resources. (see Chapter 39)
Affinity — A set of rules that constrain which nodes a Pod can be scheduled on, based on labels on nodes or other Pods. (see Chapter 33)
API Group — A logical grouping of related Kubernetes API resources (e.g., apps, batch, networking.k8s.io), enabling independent versioning and extension. (see Chapter 4)
API Server (kube-apiserver) — The central management component of the Kubernetes control plane that exposes the Kubernetes API, validates requests, and persists state to etcd. (see Chapter 3)
ArgoCD — A declarative, GitOps-based continuous delivery tool for Kubernetes that synchronizes cluster state with Git repositories. (see Chapter 6)
Backstage — An open-source developer portal framework, originally from Spotify, used to build internal developer platforms with service catalogs and templates. (see Chapter 35)
Cloud Controller Manager — A control plane component that embeds cloud-specific control logic, allowing Kubernetes to interact with the underlying cloud provider’s APIs for nodes, routes, and load balancers. (see Chapter 17)
Cluster Autoscaler — A component that automatically adjusts the number of nodes in a cluster based on pending Pod resource requests and node utilization. (see Chapter 32)
ClusterIP — The default Service type that exposes a Service on a cluster-internal virtual IP, reachable only from within the cluster. (see Chapter 5)
ClusterRole — An RBAC resource that defines a set of permissions across all namespaces or for cluster-scoped resources. (see Chapter 25)
ClusterRoleBinding — An RBAC resource that grants the permissions defined in a ClusterRole to a user, group, or ServiceAccount cluster-wide. (see Chapter 25)
CNI (Container Network Interface) — A specification and set of plugins for configuring networking in Linux containers, used by Kubernetes to set up Pod networking. (see Chapter 13)
ConfigMap — A Kubernetes object used to store non-confidential configuration data as key-value pairs, which can be consumed by Pods as environment variables or mounted files. (see Chapter 18)
containerd — An industry-standard container runtime that manages the complete container lifecycle on a host, commonly used as the runtime in Kubernetes nodes. (see Chapter 10)
Container Runtime — The software responsible for running containers on a node, such as containerd or CRI-O. (see Chapter 10)
Controller — A control loop that watches the state of the cluster through the API server and makes changes to move the current state toward the desired state. (see Chapter 38)
Controller Manager (kube-controller-manager) — A control plane component that runs the core set of built-in controllers (ReplicaSet, Deployment, etc.) as a single process. (see Chapter 3)
CoreDNS — The default cluster DNS server in Kubernetes, providing service discovery via DNS for Services and Pods. (see Chapter 5)
CRD (Custom Resource Definition) — An extension mechanism that allows users to define their own resource types in the Kubernetes API without modifying the API server. (see Chapter 4)
CRI (Container Runtime Interface) — A plugin interface that enables the kubelet to use different container runtimes without needing to recompile. (see Chapter 10)
CRI-O — A lightweight container runtime purpose-built for Kubernetes, implementing the CRI specification. (see Chapter 10)
CronJob — A Kubernetes resource that creates Jobs on a recurring schedule defined using cron syntax. (see Chapter 24)
Crossplane — An open-source framework that extends Kubernetes to provision and manage cloud infrastructure and services using CRDs and controllers. (see Chapter 36)
CSI (Container Storage Interface) — A standard interface for exposing block and file storage systems to container orchestrators like Kubernetes. (see Chapter 23)
DaemonSet — A resource that ensures a copy of a Pod runs on every (or a selected subset of) node in the cluster, commonly used for logging agents and monitoring. (see Chapter 18)
Deployment — A resource that provides declarative updates for Pods and ReplicaSets, supporting rolling updates and rollbacks. (see Chapter 18)
Device Plugin — A kubelet framework that allows hardware vendors to advertise specialized resources (GPUs, FPGAs, etc.) to the Kubernetes scheduler without modifying core code. (see Chapter 41)
Digest — A content-addressable identifier (usually a SHA-256 hash) that uniquely identifies a specific container image, providing an immutable reference. (see Chapter 10)
DRA (Dynamic Resource Allocation) — A Kubernetes framework for requesting and sharing specialized hardware resources (GPUs, accelerators) with fine-grained allocation semantics beyond the device plugin model. (see Chapter 41)
Edge-triggered — A reconciliation approach where the controller reacts only when a change event occurs, as opposed to level-triggered reconciliation. (see Chapter 38)
Endpoint — A network address (IP and port) that represents a single backend for a Service, historically tracked via Endpoints objects. (see Chapter 5)
EndpointSlice — A scalable replacement for the Endpoints resource that splits endpoint information across multiple objects to reduce API server and etcd load. (see Chapter 13)
etcd — A consistent, distributed key-value store used as the primary datastore for all Kubernetes cluster state and configuration. (see Chapter 3)
ExternalName — A Service type that maps a Service to an external DNS name, acting as a CNAME alias without proxying. (see Chapter 5)
Finalizer — A metadata key on a Kubernetes object that prevents deletion until a controller has performed its cleanup logic and removed the finalizer. (see Chapter 39)
Flux — A GitOps toolkit for Kubernetes that keeps clusters in sync with configuration stored in Git repositories. (see Chapter 6)
Gateway API — A next-generation Kubernetes API for modeling service networking, designed to be expressive, extensible, and role-oriented as a successor to Ingress. (see Chapter 13)
Grafana — An open-source observability platform for visualizing metrics, logs, and traces, commonly used alongside Prometheus in Kubernetes monitoring stacks. (see Chapter 45)
GVR (Group/Version/Resource) — The three-part coordinate system (API group, version, resource name) used to uniquely identify any resource type in the Kubernetes API. (see Chapter 4)
Helm — A package manager for Kubernetes that uses templated charts to define, install, and upgrade applications. (see Chapter 12)
HPA (Horizontal Pod Autoscaler) — A controller that automatically scales the number of Pod replicas based on observed CPU, memory, or custom metrics. (see Chapter 30)
Image — A lightweight, standalone, executable package that includes everything needed to run a piece of software: code, runtime, libraries, and settings. (see Chapter 10)
Informer — A client-side caching mechanism in client-go that watches API server resources and maintains a local cache to reduce API server load. (see Chapter 39)
Ingress — A Kubernetes resource that manages external HTTP/HTTPS access to Services, providing load balancing, TLS termination, and name-based virtual hosting. (see Chapter 5)
Ingress Controller — A controller that fulfills Ingress resources by configuring a load balancer or reverse proxy (e.g., NGINX, Envoy, Traefik). (see Chapter 13)
Init Container — A specialized container that runs to completion before any app containers start in a Pod, used for setup tasks like waiting for dependencies or populating shared volumes. (see Chapter 18)
Job — A Kubernetes resource that creates one or more Pods and ensures a specified number of them successfully terminate, used for batch and one-off tasks. (see Chapter 24)
Karpenter — A node provisioning tool that automatically launches right-sized compute nodes in response to unschedulable Pods, offering faster and more flexible scaling than Cluster Autoscaler. (see Chapter 32)
KServe — A Kubernetes-native platform for serving machine learning models with support for autoscaling, canary rollouts, and multi-framework inference. (see Chapter 42)
kube-proxy — A network component running on each node that maintains network rules for Service traffic forwarding using iptables, IPVS, or eBPF. (see Chapter 3)
Kubeflow — An open-source machine learning platform for Kubernetes that provides tools for ML pipelines, training, tuning, and serving. (see Chapter 41)
kubelet — The primary node agent that runs on every node, responsible for ensuring that containers described in PodSpecs are running and healthy. (see Chapter 3)
KubeRay — A Kubernetes operator for deploying and managing Ray clusters, commonly used for distributed ML training and inference workloads. (see Chapter 42)
Kustomize — A template-free configuration management tool built into kubectl that uses overlays to customize Kubernetes manifests for different environments. (see Chapter 12)
Kyverno — A Kubernetes-native policy engine that validates, mutates, and generates configurations using policies defined as Kubernetes resources. (see Chapter 29)
Label — A key-value pair attached to Kubernetes objects used for organizing and selecting subsets of resources. (see Chapter 4)
LeaderWorkerSet — A Kubernetes API for deploying multi-node distributed workloads with leader-worker topology, commonly used for distributed ML training. (see Chapter 42)
Level-triggered — A reconciliation approach where the controller continuously compares desired state to actual state and acts on the difference, regardless of what events occurred. (see Chapter 38)
Liveness Probe — A periodic check that determines whether a container is still running; if it fails, the kubelet restarts the container. (see Chapter 20)
LoadBalancer — A Service type that exposes the Service externally using a cloud provider’s load balancer, automatically provisioning an external IP. (see Chapter 5)
MIG (Multi-Instance GPU) — An NVIDIA technology that partitions a single GPU into multiple isolated instances, each with dedicated compute, memory, and bandwidth. (see Chapter 41)
Namespace — A virtual partition within a Kubernetes cluster that provides scope for resource names and a mechanism for applying policies and resource quotas. (see Chapter 37)
NetworkPolicy — A resource that specifies how groups of Pods are allowed to communicate with each other and with external endpoints, acting as a firewall for Pod traffic. (see Chapter 26)
Node — A worker machine (virtual or physical) in Kubernetes that runs Pods, managed by the control plane. (see Chapter 3)
NodePool — A Karpenter resource that defines a set of constraints and instance types for provisioning nodes, replacing the older Provisioner resource. (see Chapter 32)
NodePort — A Service type that exposes a Service on a static port on every node’s IP, making it accessible from outside the cluster. (see Chapter 5)
OCI (Open Container Initiative) — A set of industry standards for container image formats and runtimes, ensuring interoperability across container tools. (see Chapter 10)
OPA/Gatekeeper — Open Policy Agent integrated with Kubernetes via the Gatekeeper project, providing policy enforcement through admission control using the Rego policy language. (see Chapter 29)
OpenTelemetry — A vendor-neutral observability framework for generating, collecting, and exporting telemetry data (traces, metrics, logs) from applications. (see Chapter 45)
Operator — A pattern that combines a CRD with a custom controller to encode operational knowledge for managing complex applications on Kubernetes. (see Chapter 38)
Owner Reference — A metadata field on a Kubernetes object that identifies its parent object, enabling garbage collection when the parent is deleted. (see Chapter 39)
PersistentVolume (PV) — A cluster-level storage resource provisioned by an administrator or dynamically via a StorageClass, representing a piece of networked storage. (see Chapter 23)
PersistentVolumeClaim (PVC) — A user’s request for storage that binds to an available PersistentVolume, abstracting the underlying storage implementation. (see Chapter 23)
Pod — The smallest deployable unit in Kubernetes, consisting of one or more containers that share networking and storage and are co-scheduled on the same node. (see Chapter 3)
Pod Security Standards — A set of three built-in security profiles (Privileged, Baseline, Restricted) enforced at the namespace level to control Pod security contexts. (see Chapter 29)
PodDisruptionBudget (PDB) — A resource that limits the number of Pods of a replicated application that can be voluntarily disrupted at the same time, ensuring availability during maintenance. (see Chapter 20)
Priority Class — A resource that defines a priority value for Pods, influencing scheduling order and preemption decisions when cluster resources are scarce. (see Chapter 33)
Prometheus — An open-source monitoring and alerting toolkit that collects metrics via a pull model and stores them in a time-series database, widely used in Kubernetes environments. (see Chapter 45)
RBAC (Role-Based Access Control) — The Kubernetes authorization mechanism that regulates access to resources based on the roles assigned to users or service accounts. (see Chapter 25)
Readiness Probe — A periodic check that determines whether a container is ready to accept traffic; failing containers are removed from Service endpoints. (see Chapter 20)
Reconciliation Loop — The core control pattern in Kubernetes where a controller continuously observes the current state, compares it with the desired state, and takes action to converge them. (see Chapter 38)
Registry — A service that stores and distributes container images, such as Docker Hub, GitHub Container Registry, or a private registry. (see Chapter 10)
ReplicaSet — A resource that ensures a specified number of identical Pod replicas are running at any given time, typically managed by a Deployment. (see Chapter 18)
Resource Quota — A constraint that limits the aggregate resource consumption (CPU, memory, object count) within a Namespace. (see Chapter 37)
Role — An RBAC resource that defines a set of permissions within a specific Namespace. (see Chapter 25)
RoleBinding — An RBAC resource that grants the permissions defined in a Role to a user, group, or ServiceAccount within a specific Namespace. (see Chapter 25)
runc — The reference implementation of the OCI runtime specification, a low-level container runtime that spawns and runs containers. (see Chapter 10)
SBOM (Software Bill of Materials) — A formal inventory of all components, libraries, and dependencies in a software artifact, used for supply chain security and vulnerability tracking. (see Chapter 27)
Scheduler (kube-scheduler) — A control plane component that assigns newly created Pods to nodes based on resource requirements, constraints, affinity rules, and scheduling policies. (see Chapter 3)
Secret — A Kubernetes object for storing sensitive data (passwords, tokens, TLS certificates). Values are base64-encoded in YAML, but base64 is encoding, not encryption — configure encryption at rest for real protection. (see Chapter 28)
Selector — A query expression that uses labels to filter and identify a set of Kubernetes objects. (see Chapter 4)
Service — An abstraction that defines a stable network endpoint (virtual IP and DNS name) for accessing a set of Pods selected by labels. (see Chapter 5)
Service Mesh — An infrastructure layer that manages service-to-service communication with features like mutual TLS, traffic management, and observability (e.g., Istio, Linkerd). (see Chapter 13)
ServiceAccount — A Kubernetes identity assigned to Pods that enables them to authenticate with the API server and other services. (see Chapter 25)
Sidecar — A secondary container that runs alongside the main application container within a Pod, providing supporting functionality like logging, proxying, or configuration. (see Chapter 18)
Sigstore — An open-source project providing tools (Cosign, Fulcio, Rekor) for signing, verifying, and protecting the software supply chain for container images. (see Chapter 27)
StatefulSet — A resource for managing stateful applications that require stable network identities, persistent storage, and ordered deployment and scaling. (see Chapter 21)
StorageClass — A resource that defines a class of storage with a provisioner and parameters, enabling dynamic provisioning of PersistentVolumes. (see Chapter 23)
Taint — A property applied to a node that repels Pods unless those Pods have a matching Toleration, used to reserve nodes for specific workloads. (see Chapter 33)
Tag — A human-readable label (e.g., v1.2.3, latest) applied to a container image in a registry, which can be overwritten and is therefore mutable. (see Chapter 10)
Toleration — A Pod-level property that allows the Pod to be scheduled onto a node with a matching Taint. (see Chapter 33)
Topology Spread Constraints — Rules that control how Pods are distributed across failure domains (zones, nodes, etc.) to improve availability and resource utilization. (see Chapter 33)
Velero — An open-source tool for backing up, restoring, and migrating Kubernetes cluster resources and persistent volumes. (see Chapter 43)
vLLM — A high-throughput, memory-efficient inference engine for large language models that uses PagedAttention for optimized GPU memory management. (see Chapter 42)
VPA (Vertical Pod Autoscaler) — A component that automatically adjusts the CPU and memory resource requests of Pods based on historical usage patterns. (see Chapter 31)
Watch — An API mechanism that allows clients to receive streaming notifications of changes to Kubernetes resources, enabling reactive controllers. (see Chapter 39)
Webhook — An HTTP callback used in Kubernetes for admission control (validating or mutating webhooks) and for extending API server behavior. (see Chapter 39)
Back to Table of Contents
Appendix B: Mental Models
Each part of this book introduces a cluster of related concepts. These diagrams show how they connect — use them as maps when navigating the chapters.
Part 1: First Principles (Chapters 1-9)
The Reconciliation Loop — the heart of Kubernetes.
flowchart TD
A["User writes YAML"] --> B["kubectl"]
B --> C["API Server"]
C --> D["etcd<br>(desired state stored here)"]
C --> E["Controller Manager<br>(reconcile)"]
C --> F["Scheduler<br>(assign to node)"]
E --> G["kubelet (on node)"]
F --> G
G --> H["Container Runtime"]
H --> I["Container"]
subgraph loop ["The Watch / Reconciliation Loop"]
direction LR
W1["Controller watches"] --> W2["Detects drift"]
W2 --> W3["Compares desired<br>vs. actual state"]
W3 --> W4["Takes action<br>to converge"]
W4 --> W1
end
Part 2: Tooling Evolution (Chapters 10-14)
The Stack — what runs on what.
flowchart TD
S1["Application"]
S2["Helm / Kustomize (packaging)"]
S3["kubeadm / k3s (bootstrap)"]
S4["Kubernetes API"]
S5["Container Runtime<br>containerd / CRI-O"]
S6["CNI Plugin<br>Cilium, Calico, Flannel ..."]
S7["OCI Runtime (runc)"]
S1 --> S2 --> S3 --> S4
S4 --> S5
S4 --> S6
S5 --> S7
S6 --> S7
subgraph kernel ["Linux Kernel"]
K1["cgroups<br>(resource limits)"]
K2["namespaces<br>(isolation)"]
end
S7 --> kernel
subgraph cni ["CNI Virtual Network"]
direction LR
PA["Pod A"] <--> PB["Pod B"]
end
S6 --> cni
Part 3: Practical Setup (Chapters 15-19)
Your First Cluster — who talks to whom.
flowchart TD
Cloud["Cloud Provider<br>(AWS / GCP / AZ)"]
Cloud --> VPC
kubectl["kubectl"] --> API
CICD["CI/CD Pipeline"] --> VPC
subgraph VPC
subgraph CP ["Control Plane (managed)"]
API["API Server"]
end
subgraph N1 ["Worker Node 1"]
subgraph Pod1 ["Pod"]
App["app"]
Sidecar["sidecar"]
end
end
subgraph N2 ["Worker Node 2"]
Pod2a["Pod"]
Pod2b["Pod"]
end
end
API --> N1
API --> N2
subgraph debug ["Debugging Tools"]
direction LR
KC["kubectl"] --> Logs["logs"]
KC --> Exec["exec"]
KC --> Describe["describe (events)"]
end
Part 4: Stateful Workloads (Chapters 20-24)
State — the hard problem.
flowchart TD
Deploy["Deployment<br>(stateless)<br>Pods are fungible,<br>interchangeable"]
SS["StatefulSet<br>(ordered, stable ID)<br>pod-0, pod-1, pod-2<br>each has stable name"]
SS --> PVC["PVC<br>(claim storage)"]
PVC --> PV["PV<br>(actual volume)"]
PV --> SC["StorageClass<br>(provisioner)"]
SC --> Disk["Cloud Disk<br>(EBS / PD / AzD)"]
subgraph operators ["Operators manage databases on K8s"]
direction LR
Op["Operator"] -->|watches| CRD["CRD<br>(e.g. PostgresCluster)"]
CRD -->|manages| Res["StatefulSet +<br>PVCs + Secrets"]
end
subgraph jobs ["Jobs and CronJobs"]
direction LR
Job["Job<br>(run once)"]
CronJob["CronJob<br>(scheduled)"] -->|creates Job<br>on schedule| Job
end
Part 5: Security (Chapters 25-29)
Defense in Depth.
┌────────────────────────────────────────────────────────┐
│ Supply Chain (outermost ring) │
│ Sigstore, SBOM, image scanning │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Cluster │ │
│ │ RBAC, Admission Control (OPA/Kyverno) │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────┐ │ │
│ │ │ Namespace │ │ │
│ │ │ NetworkPolicy, ResourceQuota │ │ │
│ │ │ │ │ │
│ │ │ ┌─────────────────────────────────┐ │ │ │
│ │ │ │ Pod │ │ │ │
│ │ │ │ SecurityContext, Seccomp, │ │ │ │
│ │ │ │ AppArmor │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ ┌─────────────────────────┐ │ │ │ │
│ │ │ │ │ Container (innermost) │ │ │ │ │
│ │ │ │ │ read-only rootfs │ │ │ │ │
│ │ │ │ │ non-root user │ │ │ │ │
│ │ │ │ │ dropped capabilities │ │ │ │ │
│ │ │ │ └─────────────────────────┘ │ │ │ │
│ │ │ └─────────────────────────────────┘ │ │ │
│ │ └─────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
Secrets Management (cross-cutting concern):
┌──────────────────────────────────────────────┐
│ │
│ External Secrets ──▶ K8s Secret ──▶ Pod │
│ │ │
│ Vault / AWS SM / GCP SM │
│ (source of truth) │
│ │
│ Cuts across ALL rings above │
└──────────────────────────────────────────────┘
Part 6: Scaling (Chapters 30-33)
The Scaling Cascade — metrics to machines.
flowchart TD
M["Metrics<br>(CPU, memory, custom)"]
M --> HPA["HPA"]
HPA -->|"scale pods<br>horizontally"| Pods["More Pods"]
HPA -->|"pods go Pending<br>(no capacity)"| KCA["Karpenter /<br>Cluster Autoscaler"]
KCA -->|"scale nodes"| Cloud["Cloud API<br>(provision new VMs)"]
subgraph vpa ["VPA (Vertical Pod Autoscaler)"]
direction LR
VM["Metrics"] --> VPA2["VPA"] --> Resize["Resize pods vertically<br>(adjust requests/limits)"]
end
subgraph scheduling ["Resource Tuning feeds Scheduling"]
direction LR
RL["requests and limits<br>(CPU, memory)"] --> Sched["Scheduler decisions"]
RL --> Effects["Affects bin-packing,<br>QoS class, eviction<br>priority, HPA thresholds"]
end
Part 7: Platform Engineering (Chapters 34-39)
The Platform — abstraction over infrastructure.
flowchart TD
Dev["Developer"] -->|writes Claim| PlatAPI["Platform API<br>(Crossplane XRD / CRD)"]
PlatAPI -->|provisions| CloudRes["Cloud Resources<br>(RDS, S3, etc.)"]
Git["Git Repo<br>(source of truth)"] -->|GitOps loop| Argo["ArgoCD / Flux"]
Argo -->|sync| Clusters["Cluster(s)"]
subgraph ext ["Extension Mechanism"]
direction LR
CRD["CRD<br>(defines new API)"] --> Operator["Operator<br>(watches & reconciles)"] --> Resources["Manages resources"]
end
subgraph horiz ["Horizontal Concerns"]
MC["Multi-Cluster<br>(fleet mgmt, federation)"]
MT["Multi-Tenancy<br>(namespaces, vClusters,<br>resource quotas)"]
end
Part 8: Advanced Topics (Chapters 40-45)
Running it for Real.
Operational Concerns:
┌──────────────────────────────────────────────────────┐
│ │
│ ┌──────────┐ ┌────────────────┐ ┌─────────────┐ │
│ │ etcd ops │ │ Disaster │ │ Cost │ │
│ │ (backup, │ │ Recovery │ │ Optimization│ │
│ │ defrag, │ │ (Velero) │ │ (right-size,│ │
│ │ health) │ │ │ │ spot, idle)│ │
│ └──────────┘ │ backup ──▶ │ └─────────────┘ │
│ │ restore ──▶ │ │
│ │ migrate │ │
│ └────────────────┘ │
└──────────────────────────────────────────────────────┘
Observability (the three pillars):
┌───────────┐
│ Metrics │
│(Prometheus│
│ / Mimir) │
└─────┬─────┘
│
┌─────────┼─────────┐
│ │ │
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌────────┐
│ Logs │ │Traces │ │Alerts │
│(Loki) │ │(Tempo)│ │(Grafana│
└───────┘ └───────┘ │ / PD) │
└────────┘
GPU Scheduling:
┌──────────────────┐ ┌──────────────────┐ ┌────────────┐
│ Pod with │────▶│ Device Plugin / │────▶│ NVIDIA GPU │
│ gpu request │ │ DRA │ │ (on node) │
│ (limits: │ │ (allocates GPU) │ │ │
│ nvidia.com/gpu)│ └──────────────────┘ └────────────┘
└──────────────────┘
LLM Serving:
┌─────────┐ ┌──────────────┐ ┌──────────┐ ┌────────────┐
│ Model │───▶│ vLLM / TGI │───▶│ KServe │───▶│ Inference │
│(weights)│ │ (serving │ │ (routing,│ │ endpoint │
│ │ │ engine) │ │ scaling)│ │ (/predict) │
└─────────┘ └──────────────┘ └──────────┘ └────────────┘
Back to Table of Contents
Appendix C: Decision Trees
Kubernetes offers many options for the same problem. These decision trees encode the trade-offs discussed throughout the book into quick-reference flowcharts.
1. Which Workload Controller?
Kubernetes provides several controllers for running workloads, each designed for a different scheduling pattern. Chapter 18 covers Deployments and Services, Chapter 21 covers StatefulSets, Chapter 24 covers Jobs and CronJobs, and Chapter 42 covers LeaderWorkerSet for ML gang scheduling. Start by asking whether your workload is stateless.
flowchart TD
A[New Workload] --> B{Stateless?}
B -->|Yes| C([Deployment])
B -->|No| D{Need stable identity<br>or ordering?}
D -->|Yes| E([StatefulSet])
D -->|No| F{Run on every node?}
F -->|Yes| G([DaemonSet])
F -->|No| H{Run to completion?}
H -->|Yes| I([Job])
H -->|No| J{Run on schedule?}
J -->|Yes| K([CronJob])
J -->|No| L(["LeaderWorkerSet + Volcano<br>(ML gang scheduling)"])
2. Which Service Type?
Every application that receives traffic needs a Service, but Kubernetes offers five types with very different behaviors. Chapter 18 introduces ClusterIP and NodePort, Chapter 17 covers LoadBalancer integration with cloud providers, and Chapter 13 discusses Ingress and the newer Gateway API.
flowchart TD
A[Expose a Service] --> B{Internal only?}
B -->|Yes| C([ClusterIP])
B -->|No| D{External DNS name<br>only, no proxy?}
D -->|Yes| E([ExternalName])
D -->|No| F{Need L7 routing<br>by host or path?}
F -->|Yes| G(["Ingress / Gateway API"])
F -->|No| H{Dev/test only?}
H -->|Yes| I([NodePort])
H -->|No| J(["LoadBalancer<br>(L4 TCP/UDP)"])
3. Which Storage?
Storage decisions depend on durability, access patterns, and whether multiple pods need simultaneous access. Chapter 23 covers PersistentVolumes, StorageClasses, and CSI drivers in depth. Chapter 17 explains how cloud providers implement storage backends.
flowchart TD
A[Need Storage] --> B{Ephemeral — survives<br>container restart?}
B -->|Yes| C([emptyDir])
B -->|No| D{Shared across pods<br>ReadWriteMany?}
D -->|Yes| E(["NFS / EFS<br>(RWX PVC)"])
D -->|No| F{High IOPS database?}
F -->|Yes| G(["Local SSD / io2<br>(PVC + StorageClass)"])
F -->|No| H{Object storage?}
H -->|Yes| I(["S3 / GCS<br>(use SDK, not a PV)"])
H -->|No| J(["PVC + StorageClass<br>(general purpose)"])
4. Which Autoscaler?
Kubernetes scaling operates at two levels: pod-level (adding replicas or resizing resource requests) and node-level (adding machines when pods can’t be scheduled). Chapter 30 covers HPA, Chapter 31 covers VPA, Chapter 32 covers Karpenter and Cluster Autoscaler, and Chapter 33 explains how resource requests feed into scheduling.
flowchart TD
A[Need Autoscaling] --> B{Scale pods or nodes?}
B -->|Pods| C{Horizontal —<br>more replicas?}
B -->|Nodes| D{Running on AWS or Azure?}
C -->|Yes| E([HPA])
C -->|No| F{Right-size resources?}
F -->|Yes| G([VPA])
F -->|No| H(["KEDA<br>(event-driven, queues, etc.)"])
D -->|Yes| I(["Karpenter<br>(AWS native, Azure via AKS NAP)"])
D -->|No| J(["Cluster Autoscaler<br>(GCP / on-prem)"])
5. Which Managed Kubernetes?
The choice between managed and self-managed Kubernetes depends on your infrastructure constraints and operational maturity. Chapter 16 compares EKS, GKE, and AKS in detail. Chapter 15 covers kubeadm for self-managed clusters, and Chapter 11 covers k3s and other lightweight distributions.
flowchart TD
A[Choose Managed K8s] --> B{On-premises?}
B -->|Yes| C(["kubeadm / k3s / Rancher"])
B -->|No| D{Which cloud?}
D -->|AWS| E(["EKS<br>(see Karpenter for node scaling)"])
D -->|GCP| F[GKE] --> G{Zero node management?}
G -->|Yes| H([GKE Autopilot])
G -->|No| I([GKE Standard])
D -->|Azure| J[AKS] --> K(["AKS Free tier<br>(free control plane, dev)"])
6. Which CNI?
The Container Network Interface plugin determines how pods get IP addresses and how network traffic flows between nodes. Most managed clusters default to the cloud provider’s native CNI, but self-managed clusters require an explicit choice. Chapter 13 traces the evolution from Flannel through Calico to Cilium and explains the eBPF performance advantage.
flowchart TD
A[Choose a CNI] --> B{Managed cloud cluster?}
B -->|Yes| C(["Use provider default<br>(VPC CNI / Azure CNI / GKE native)"])
B -->|No| D{Need eBPF, no<br>iptables overhead?}
D -->|Yes| E([Cilium])
D -->|No| F{Need NetworkPolicy?}
F -->|Yes| G([Calico])
F -->|No| H(["Flannel<br>(simple overlay)"])
7. Which Package Manager?
Managing Kubernetes YAML at scale requires tooling — the question is which kind. Helm uses Go templates for parameterization and dominates third-party chart distribution. Kustomize uses overlay-based patching without any template language. cdk8s generates manifests from general-purpose programming languages. Many teams combine Helm and Kustomize. Chapter 12 covers all three approaches and explains when to use each.
flowchart TD
A["Package / Template<br>K8s Manifests"] --> B{Need type-safe<br>code generation?}
B -->|Yes| C([cdk8s])
B -->|No| D{Need Go-template style<br>parameterization?}
D -->|Yes| E(["Helm<br>(most popular for 3rd-party<br>chart distribution)"])
D -->|No| F{Want patch-based overlays<br>without templates?}
F -->|Yes| G([Kustomize])
F -->|Both| H(["helm template |<br>kustomize build<br>(common hybrid)"])
8. Which Secret Management?
Secrets require special handling: they must not appear in plain text in Git, they may need to rotate automatically, and they often originate from an external vault or cloud provider. Chapter 28 covers Kubernetes Secrets and encryption at rest, Sealed Secrets for GitOps, and integration with HashiCorp Vault and cloud secret managers via the External Secrets Operator.
flowchart TD
A[Manage Secrets] --> B{Need external secret store<br>— Vault, AWS SM?}
B -->|Yes| C{Need auto-rotation?}
B -->|No| D{Storing in Git<br>for GitOps?}
C -->|Yes| E(["Vault + sidecar injector<br>(dynamic secrets)"])
C -->|No| F(["External Secrets Operator<br>+ Vault / AWS Secrets Manager"])
D -->|Yes| G(["Sealed Secrets<br>(encrypt before committing)"])
D -->|No| H(["K8s Secrets +<br>encryption at rest<br>(simple, low security)"])
9. Which GitOps Tool?
GitOps applies the Kubernetes reconciliation pattern to deployment itself: a controller watches a Git repository and ensures the cluster matches. The two major tools are ArgoCD and Flux, which differ primarily in UI richness and multi-cluster management. Chapter 12 covers both, and Chapter 34 discusses multi-cluster GitOps patterns.
flowchart TD
A[Adopt GitOps] --> B{Need rich UI, multi-cluster,<br>app-of-apps pattern?}
B -->|Yes| C([ArgoCD])
B -->|No| D{Want lightweight, Git-native,<br>Helm/Kustomize controller?}
D -->|Yes| E([Flux])
D -->|Both| F(["They can coexist:<br>Flux for infra clusters,<br>Argo for app clusters"])
10. StatefulSet vs Operator for Databases?
Running databases on Kubernetes is possible but requires careful consideration. A managed cloud database (RDS, Cloud SQL) avoids the operational burden entirely. If you must run on K8s, operators like CloudNativePG and Percona handle failover, backups, and scaling automatically. A raw StatefulSet works for dev/staging but lacks production automation. Chapter 22 covers this decision in depth, and Chapter 38 explains the operator pattern.
flowchart TD
A[Run a Database on K8s] --> B{Managed DB available<br>— RDS, Cloud SQL, etc.?}
B -->|Yes| C(["Use managed DB<br>(provision via Crossplane<br>or Terraform)"])
B -->|No| D{Production with failover,<br>backup, scaling?}
D -->|Yes| E(["Use an Operator<br>(CloudNativePG, Percona,<br>Strimzi, etc.)"])
D -->|No| F(["StatefulSet<br>(simple, single instance,<br>dev/staging)"])
Back to Table of Contents
Appendix D: Troubleshooting Quick Reference
This appendix maps the error messages and symptoms you will encounter in practice to their most common root causes, organized by where you see the error.
General Debugging Flowchart
flowchart LR
Start["Pod not working?"]
GetPods["kubectl get pods -n namespace"]
Status{"What status do you see?"}
Start --> GetPods --> Status
Status --> Pending
Status --> Crash["CrashLoopBackOff"]
Status --> Image["ImagePullBackOff"]
Status --> Running["Running but not working"]
Status --> Evicted
Status --> Unknown["Unknown / NodeLost"]
Pending --> PDescribe["kubectl describe pod"]
PDescribe --> P1{"Insufficient<br>cpu/memory?"}
PDescribe --> P2{"No nodes match<br>selectors?"}
PDescribe --> P3{"PVC not bound?"}
P1 --> P1Fix["Scale up or adjust requests"]
P2 --> P2Fix["Fix nodeSelector/affinity"]
P3 --> P3Fix["Check PVC"]
Crash --> CLogs["kubectl logs --previous"]
CLogs --> C1{"OOMKilled?"}
CLogs --> C2{"App error?"}
CLogs --> C3{"Missing config?"}
C1 --> C1Fix["Increase memory limits"]
C2 --> C2Fix["Fix application startup"]
C3 --> C3Fix["Check ConfigMaps/Secrets"]
Image --> IDescribe["kubectl describe pod"]
IDescribe --> I1{"repo does not exist?"}
IDescribe --> I2{"unauthorized?"}
IDescribe --> I3{"tag not found?"}
I1 --> I1Fix["Fix image name"]
I2 --> I2Fix["Fix imagePullSecrets"]
I3 --> I3Fix["Fix image tag"]
Running --> RLogs["kubectl logs -f"]
RLogs --> R1["Check readiness probe:<br>kubectl describe pod"]
RLogs --> R2["Check service endpoints:<br>kubectl get endpoints"]
RLogs --> R3["Test from inside:<br>kubectl exec -it -- sh"]
Evicted --> EDescribe["kubectl describe node"]
EDescribe --> E1["Check for DiskPressure /<br>MemoryPressure"]
Unknown --> UNodes["kubectl get nodes"]
UNodes --> U1{"Node NotReady?"}
U1 --> U1Fix["SSH to node, check kubelet"]
Pod Status Errors
Pending
What it means: The scheduler cannot find a node to place the pod on, or a prerequisite resource is not ready.
Common causes:
- No node has enough CPU or memory to satisfy the pod’s resource requests.
- nodeSelector, nodeAffinity, or tolerations do not match any available node.
- A referenced PersistentVolumeClaim is not bound.
- Resource quotas in the namespace are exhausted.
- The cluster has no nodes at all (scaling from zero).
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Look at the Events section
kubectl get nodes -o wide # Check node status and capacity
kubectl describe node <node-name> # Check Allocatable vs Allocated
kubectl get pvc -n <namespace> # Check PVC status
kubectl get resourcequota -n <namespace> # Check quota usage
Fix:
- If resource-constrained: lower the pod’s resource requests, add nodes, or remove idle workloads.
- If selector mismatch: correct nodeSelector/nodeAffinity labels or add labels to nodes.
- If PVC not bound: ensure a matching PV exists or the StorageClass can dynamically provision one.
- If quota exceeded: request a quota increase or free capacity in the namespace.
CrashLoopBackOff
What it means: The container starts, exits with an error, and Kubernetes keeps restarting it with exponential backoff.
Common causes:
- Application crashes on startup (uncaught exception, missing dependency).
- Required ConfigMap or Secret is mounted but contains wrong data (wrong key, wrong format).
- OOMKilled – the container exceeds its memory limit on startup.
- Liveness probe is too aggressive and kills the container before it finishes starting.
- Entrypoint or command is misconfigured.
How to diagnose:
kubectl logs <pod-name> -n <namespace> --previous # Logs from the last crashed container
kubectl describe pod <pod-name> -n <namespace> # Check Last State, Exit Code, Reason
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState}'
Fix:
- Read the logs from --previous to find the actual error.
- If OOMKilled (exit code 137): increase resources.limits.memory.
- If liveness probe is killing the pod: increase initialDelaySeconds and failureThreshold.
- If config is missing: verify ConfigMap/Secret exists and has the expected keys.
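A concrete sketch of the probe fix (the path, port, and timings are illustrative; tune them to your app's actual startup time):

```yaml
# Pod spec fragment: give a slow-starting app time before liveness kicks in.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # was too low: container killed mid-startup
  periodSeconds: 10
  failureThreshold: 5       # tolerate transient failures
# Better still on modern clusters: a startupProbe that gates the liveness probe.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30      # 30 * 10s = up to 5 minutes to start
  periodSeconds: 10
```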
ImagePullBackOff / ErrImagePull
What it means: The kubelet cannot pull the container image from the registry.
Common causes:
- Image name or tag is misspelled.
- The image tag does not exist (e.g., latest was overwritten or a SHA was pruned).
- The registry is private and imagePullSecrets are missing or contain invalid credentials.
- Docker Hub rate limits are hit on unauthenticated pulls.
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Read the pull error message
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].image}'
# Verify the image exists:
docker pull <image> # Or: crane manifest <image>
# Check imagePullSecrets:
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
Fix:
- Correct the image name and tag.
- Create or fix imagePullSecrets:
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass> \
  -n <namespace>
- If rate-limited: configure a pull-through cache or authenticate to Docker Hub.
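Once the secret exists, the pod spec must reference it; a minimal sketch (the names and registry are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-app
spec:
  imagePullSecrets:
    - name: regcred        # the docker-registry secret created above
  containers:
    - name: app
      image: registry.example.com/team/app:1.2.3
```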
OOMKilled
What it means: The Linux kernel’s OOM killer terminated the container because it tried to use more memory than its cgroup limit allows.
Common causes:
- Memory limit is set too low for the workload.
- Java application is not configured for container-aware memory (-XX:MaxRAMPercentage not set, or an old JVM ignoring cgroup limits).
- Memory leak in the application.
- Large file processing or caching loading entire datasets into memory.
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Look for "OOMKilled" in Last State
kubectl top pod <pod-name> -n <namespace> # Current memory usage
kubectl logs <pod-name> -n <namespace> --previous # App logs before kill
# On the node:
dmesg | grep -i "oom\|killed" # Kernel OOM killer logs
Fix:
- Increase resources.limits.memory to match the actual needs of the workload.
- For Java: set -XX:MaxRAMPercentage=75.0 instead of a fixed -Xmx, and ensure JVM version 10+.
- For memory leaks: profile the application, fix the leak, then right-size the limit.
- Set resources.requests.memory close to limits.memory to avoid scheduling on nodes that cannot support the workload.
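A container spec fragment illustrating these fixes (the sizes are placeholders; set them from observed usage):

```yaml
# Container spec fragment: requests close to limits for predictable scheduling,
# and a container-aware heap for the JVM.
resources:
  requests:
    memory: "768Mi"
  limits:
    memory: "1Gi"
env:
  - name: JAVA_TOOL_OPTIONS            # JVM 10+: size the heap from the cgroup limit
    value: "-XX:MaxRAMPercentage=75.0"
```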
CreateContainerConfigError
What it means: The kubelet cannot create the container because a referenced ConfigMap or Secret does not exist.
Common causes:
- The ConfigMap or Secret was not created before the pod.
- Typo in the ConfigMap/Secret name in the pod spec.
- The ConfigMap/Secret is in a different namespace (they are namespace-scoped).
- It was accidentally deleted.
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Events will name the missing resource
kubectl get configmap -n <namespace>
kubectl get secret -n <namespace>
Fix:
- Create the missing ConfigMap or Secret.
- Correct any name typos in the pod spec.
- If it should be optional, set optional: true on the configMapRef / secretRef.
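A fragment showing the optional reference (the names are illustrative):

```yaml
# Container spec fragment: tolerate a missing ConfigMap or Secret.
envFrom:
  - configMapRef:
      name: app-config
      optional: true      # pod starts even if app-config does not exist
  - secretRef:
      name: app-secrets
      optional: true
```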
Init:CrashLoopBackOff
What it means: An init container is repeatedly crashing, preventing the main containers from starting.
Common causes:
- The init container is waiting for a service that is not yet available (e.g., database migration init container cannot connect to the DB).
- Script error in the init container command.
- Wrong image or command for the init container.
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Identify which init container is failing
kubectl logs <pod-name> -n <namespace> -c <init-container-name> --previous
Fix:
- Check the init container logs for the specific error.
- Verify the service it depends on is running and reachable.
- Fix the init container command, image, or configuration.
Evicted
What it means: The kubelet evicted the pod because the node was under resource pressure (disk, memory, or PID).
Common causes:
- Node is under DiskPressure (ephemeral storage or container logs filled the disk).
- Node is under MemoryPressure (too many pods with BestEffort QoS).
- PID exhaustion on the node.
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Shows eviction reason
kubectl describe node <node-name> # Check Conditions for pressure
kubectl get pods -n <namespace> --field-selector=status.phase=Failed | grep Evicted
Fix:
- Clean up disk usage on the node (prune unused images, clear old logs).
- Set proper resources.requests so BestEffort pods are evicted first.
- Configure ephemeral-storage requests and limits.
- Set up log rotation and image garbage collection on nodes.
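A container spec fragment declaring ephemeral storage (sizes illustrative):

```yaml
# Container spec fragment: declare ephemeral storage so the kubelet can make
# informed eviction decisions (and evict this pod if it exceeds its own limit
# rather than letting it push the whole node into DiskPressure).
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "4Gi"
```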
Node Issues
NotReady
What it means: The kubelet on the node is not communicating with the API server, so the control plane marks it NotReady.
Common causes:
- Kubelet service is not running or has crashed.
- CNI plugin is not installed or is misconfigured (the node cannot report Ready without a working CNI).
- Node is under DiskPressure or MemoryPressure.
- Network partition between the node and the control plane.
- Expired kubelet client certificate.
How to diagnose:
kubectl describe node <node-name> # Check Conditions and Events
kubectl get node <node-name> -o yaml # Look at .status.conditions
# SSH to the node:
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago"
crictl ps # Check container runtime
ls /etc/cni/net.d/ # Check CNI configuration
Fix:
- Restart kubelet: systemctl restart kubelet.
- If CNI is missing: install or reinstall the CNI plugin (Calico, Cilium, Flannel, etc.).
- If certificates expired: rotate certificates with kubeadm certs renew.
- If disk pressure: free disk space on the node.
SchedulingDisabled
What it means: The node has been cordoned – new pods will not be scheduled onto it.
Common causes:
- An administrator ran kubectl cordon <node>.
- A node drain is in progress (kubectl drain).
- A cluster autoscaler is decommissioning the node.
How to diagnose:
kubectl get nodes # Look for SchedulingDisabled
kubectl describe node <node-name> # Check Taints for NoSchedule
Fix:
- If the maintenance is complete: kubectl uncordon <node-name>.
- If autoscaler-managed: the node will be removed; no action needed.
DiskPressure / MemoryPressure
What it means: The node’s available disk or memory has dropped below the kubelet’s eviction threshold.
Common causes:
- Container images consuming too much disk.
- Application logs not rotated, filling up the filesystem.
- Too many pods on the node relative to available memory.
- Large emptyDir volumes.
How to diagnose:
kubectl describe node <node-name> # Check Conditions section
# SSH to the node:
df -h # Disk usage
free -m # Memory usage
crictl images | wc -l # Number of cached images
du -sh /var/log/pods/* # Pod log sizes
Fix:
- Disk: prune unused images (crictl rmi --prune), enable log rotation, clean /var/log.
- Memory: evict low-priority pods, increase node size, or add more nodes.
- Configure kubelet garbage collection thresholds in the KubeletConfiguration.
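A KubeletConfiguration sketch with illustrative thresholds:

```yaml
# KubeletConfiguration fragment (values are examples, not recommendations):
# when to evict pods, and when to garbage-collect cached images.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"       # evict when free memory drops below this
  nodefs.available: "10%"         # evict when root filesystem free space drops below this
imageGCHighThresholdPercent: 80   # start pruning images above 80% disk use
imageGCLowThresholdPercent: 60    # prune until usage is back down to 60%
```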
Networking
Connection refused on Service
What it means: A TCP connection to the Service IP and port is actively refused, meaning nothing is listening.
Common causes:
- No ready endpoints behind the Service (pods are not running or not passing readiness probes).
- The Service targetPort does not match the port the application is actually listening on.
- The pod is running but the application inside has not started listening yet.
How to diagnose:
kubectl get endpoints <service-name> -n <namespace> # Are there any endpoints?
kubectl get pods -n <namespace> -l <selector> # Are pods running and Ready?
kubectl describe svc <service-name> -n <namespace> # Check selector and ports
kubectl exec -it <pod> -n <namespace> -- ss -tlnp # What is the pod listening on?
Fix:
- If no endpoints: ensure the Service selector matches the pod labels exactly.
- If targetPort is wrong: update the Service to match the container’s listening port.
- If pods are not Ready: fix the readiness probe or the underlying application issue.
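A minimal Service sketch showing the port/targetPort distinction (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web            # must match the pod labels exactly
  ports:
    - port: 80          # what clients connect to on the Service IP
      targetPort: 8080  # what the container actually listens on
```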
DNS Resolution Failures
What it means: Pods cannot resolve Kubernetes service names or external hostnames.
Common causes:
- CoreDNS pods are not running or are crashing.
- The ndots setting (default: 5) causes excessive search domain lookups, leading to timeouts.
- Pod’s dnsPolicy is set to Default (uses node DNS) instead of ClusterFirst.
- Network policy blocking DNS traffic (UDP/TCP port 53 to kube-system).
How to diagnose:
kubectl get pods -n kube-system -l k8s-app=kube-dns # Is CoreDNS running?
kubectl logs -n kube-system -l k8s-app=kube-dns # CoreDNS errors
# Test from inside a pod:
kubectl exec -it <pod> -n <namespace> -- nslookup kubernetes.default
kubectl exec -it <pod> -n <namespace> -- cat /etc/resolv.conf
Fix:
- If CoreDNS is down: check its deployment, resource limits, and node resources.
- For ndots issues: add dnsConfig with ndots: 2 to the pod spec, or use FQDNs (trailing dot).
- If NetworkPolicy is blocking: allow egress to kube-system on port 53.
- If dnsPolicy is wrong: set dnsPolicy: ClusterFirst.
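A pod spec fragment applying the ndots fix:

```yaml
# Pod spec fragment: lower ndots so external hostnames resolve without
# first cycling through every cluster search domain.
dnsPolicy: ClusterFirst
dnsConfig:
  options:
    - name: ndots
      value: "2"
```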
Service has no endpoints
What it means: The Service exists but has no backing pods, so all traffic to it fails.
Common causes:
- The Service’s label selector does not match any pod labels (typo or mismatch).
- All matching pods are failing their readiness probes.
- The pods are in a different namespace than expected (selectors are namespace-scoped).
How to diagnose:
kubectl describe svc <service-name> -n <namespace> # Check Selector
kubectl get endpoints <service-name> -n <namespace> # Should list pod IPs
kubectl get pods -n <namespace> --show-labels # Compare labels to selector
kubectl get pods -n <namespace> -l <key>=<value> # Test the selector directly
Fix:
- Align the Service selector with the pod’s labels.
- Fix readiness probes so pods become Ready.
- Ensure pods are deployed in the correct namespace.
Timeout Connecting Between Pods
What it means: TCP connections between pods hang and eventually time out, rather than being refused.
Common causes:
- A NetworkPolicy is blocking the traffic.
- The CNI plugin is misconfigured or its pods are crashing.
- iptables/eBPF rules are stale after a CNI upgrade or node reboot.
- Nodes are in different subnets and inter-node routing is broken.
How to diagnose:
kubectl get networkpolicy -n <namespace> # Are there policies restricting traffic?
kubectl describe networkpolicy <name> -n <namespace>
kubectl get pods -n kube-system -l k8s-app=calico-node # Or your CNI's pods
# Test connectivity from inside a pod:
kubectl exec -it <pod> -n <namespace> -- curl -v --connect-timeout 5 <target-svc>:<port>
# On the node:
iptables-save | grep <service-cluster-ip> # Check kube-proxy rules
Fix:
- If NetworkPolicy is blocking: update the policy to allow the required ingress/egress.
- If CNI is broken: restart CNI pods, or reinstall the CNI plugin.
- If iptables rules are stale: restart kube-proxy (kubectl rollout restart ds/kube-proxy -n kube-system).
- Check cloud provider security groups and route tables for inter-node communication.
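If a NetworkPolicy turns out to be the blocker, an allow rule looks like this (the labels and port are illustrative, and this assumes a default-deny policy is already in place):

```yaml
# Allow pods labeled app=frontend to reach app=backend on port 8080;
# all other ingress to backend pods remains blocked by the default-deny policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```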
Storage
PVC Stuck in Pending
What it means: The PersistentVolumeClaim cannot be bound to a PersistentVolume.
Common causes:
- No PV matches the PVC’s storageClassName, accessModes, or capacity.
- The StorageClass does not exist or the provisioner is not installed.
- In multi-zone clusters: the PV is in a different zone than the node running the pod.
- WaitForFirstConsumer binding mode means the PVC will not bind until a pod using it is scheduled.
How to diagnose:
kubectl describe pvc <pvc-name> -n <namespace> # Events explain why it is pending
kubectl get storageclass # Does the StorageClass exist?
kubectl get pv # Are there available PVs?
kubectl get events -n <namespace> --field-selector reason=ProvisioningFailed
Fix:
- If no StorageClass: create one or set a default StorageClass.
- If provisioner is missing: install the CSI driver (e.g., ebs-csi-driver, csi-driver-nfs).
- If zone mismatch: use volumeBindingMode: WaitForFirstConsumer to bind in the correct zone.
- If capacity mismatch: create a PV with the required size, or adjust the PVC request.
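A StorageClass sketch for the zone-mismatch case (this assumes the AWS EBS CSI driver; the provisioner and parameters vary by platform):

```yaml
# Delay volume binding until the pod is scheduled, so the volume
# is provisioned in the same zone as the node that runs the pod.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
```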
FailedMount / FailedAttachVolume
What it means: The kubelet cannot mount or attach the volume to the node.
Common causes:
- The volume is still attached to another node (common when a pod is rescheduled – the old node has not detached yet).
- The CSI driver is not installed or its pods are not running.
- The volume does not exist (deleted out of band).
- Filesystem corruption requiring a manual fsck.
- Exceeded the maximum number of volumes per node (e.g., the AWS limit on EBS volumes per instance type).
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Look at Events for mount errors
kubectl get volumeattachments # Check if volume is attached elsewhere
kubectl get pods -n kube-system -l app=ebs-csi-node # Check CSI driver pods
kubectl get pv <pv-name> -o yaml # Check volume status
Fix:
- If stuck attachment: wait for the VolumeAttachment to be cleaned up (up to 6 minutes), or manually delete the VolumeAttachment object.
- If volume limit reached: use a larger instance type or distribute pods across more nodes.
- If volume was deleted: recreate it and restore from backup.
Control Plane
API Server connection refused
What it means: Clients cannot reach the Kubernetes API server.
Common causes:
- The kube-apiserver process is not running.
- TLS certificates have expired.
- Firewall or security group is blocking port 6443.
- Load balancer in front of API server is misconfigured or unhealthy.
How to diagnose:
# On a control plane node:
crictl ps | grep kube-apiserver # Is the container running?
crictl logs <apiserver-container-id> | tail -50 # API server logs
openssl s_client -connect <api-server>:6443 # Test TLS handshake
curl -k https://<api-server>:6443/healthz # Health endpoint
journalctl -u kubelet | grep apiserver # Kubelet managing static pod?
Fix:
- If not running: check the static pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml.
- If certificates expired: kubeadm certs renew all && systemctl restart kubelet.
- If firewall blocking: open port 6443 to the required source IPs.
- If load balancer: check health check configuration and backend targets.
etcd Errors
What it means: The etcd cluster backing the API server is unhealthy.
Common causes:
- Disk latency is too high (etcd requires low-latency storage, ideally SSD).
- Quorum lost (majority of etcd members are down).
- Database size has exceeded the space quota (default 2 GB).
- Clock skew between etcd members.
How to diagnose:
# If etcd is accessible:
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table
etcdctl alarm list
# Check disk latency:
etcdctl check perf
# From API server logs:
crictl logs <apiserver-container-id> | grep etcd
Fix:
- If disk latency: move etcd to SSD-backed storage, or use dedicated etcd nodes.
- If quorum lost: restore from snapshot (etcdctl snapshot restore).
- If space quota exceeded: compact and defragment: etcdctl compact then etcdctl defrag.
- If alarms triggered: etcdctl alarm disarm after resolving the root cause.
Forbidden (RBAC)
What it means: The authenticated identity does not have permission to perform the requested action.
Common causes:
- Missing Role/ClusterRole or RoleBinding/ClusterRoleBinding.
- The binding references the wrong ServiceAccount, user, or group.
- Namespace mismatch: a Role only grants permissions in its own namespace.
- The ServiceAccount token is from a different namespace.
How to diagnose:
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<ns>:<sa>
kubectl auth can-i <verb> <resource> --as=<user> -n <namespace>
kubectl get rolebinding,clusterrolebinding -A | grep <service-account-name>
kubectl describe clusterrole <role-name> # What permissions does it grant?
Fix:
- Create the missing Role and RoleBinding (or ClusterRole/ClusterRoleBinding for cluster-wide access).
- Verify the subjects in the binding match the actual identity making the request.
- Use kubectl auth can-i --list --as=<identity> to see all permissions for debugging.
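As a sketch of the first fix, here is a minimal Role and RoleBinding granting a ServiceAccount read-only access to pods in one namespace. All names (`ci-bot`, `dev`, `pod-reader`) are illustrative, not from this book's cluster:

```yaml
# Hypothetical example: let the "ci-bot" ServiceAccount read pods
# in the "dev" namespace, and nothing else.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader        # illustrative name
  namespace: dev          # a Role is namespace-scoped
rules:
- apiGroups: [""]         # "" is the core API group (pods live here)
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: dev
subjects:                 # must match the identity making the request
- kind: ServiceAccount
  name: ci-bot
  namespace: dev          # a token from another namespace will NOT match
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
```

After applying, `kubectl auth can-i list pods --as=system:serviceaccount:dev:ci-bot -n dev` should answer `yes`, while the same check in any other namespace answers `no`.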
Webhook Errors (failed calling webhook)
What it means: An admission webhook (validating or mutating) is failing, blocking resource creation or updates.
Common causes:
- The webhook’s backing Service or pod is down.
- The webhook’s TLS certificate has expired.
- The webhook was installed with
failurePolicy: Failand the service is unreachable. - The webhook is rejecting the request due to policy (this is intentional, not an error in the webhook itself).
How to diagnose:
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
kubectl describe validatingwebhookconfiguration <name> # Check service and failurePolicy
kubectl get pods -n <webhook-namespace> # Is the webhook pod running?
kubectl logs -n <webhook-namespace> <webhook-pod> # Webhook logs
Fix:
- If the webhook service is down: restart it or fix its deployment.
- If certificates expired: renew them (often managed by cert-manager).
- Emergency bypass: temporarily set failurePolicy: Ignore or delete the webhook configuration.
- To exclude a namespace: add the appropriate namespaceSelector to the webhook configuration.
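A namespaceSelector exclusion might look like the fragment below. The webhook name and the label key are hypothetical; only the selector mechanics are the point:

```yaml
# Illustrative fragment of a ValidatingWebhookConfiguration:
# skip admission for any namespace labeled webhooks.example.com/ignore=true.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy-webhook      # illustrative name
webhooks:
- name: validate.example.com        # illustrative name
  namespaceSelector:
    matchExpressions:
    - key: webhooks.example.com/ignore   # hypothetical label key
      operator: NotIn
      values: ["true"]
  # clientConfig, rules, sideEffects, failurePolicy,
  # and admissionReviewVersions omitted for brevity
```

Then `kubectl label namespace kube-system webhooks.example.com/ignore=true` (for example) takes that namespace out of the webhook's blast radius without deleting the configuration.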
Deployment Issues
Rollout Stuck
What it means: A Deployment rollout is not progressing – new pods are not becoming Ready or old pods are not being terminated.
Common causes:
- New pods are failing (CrashLoopBackOff, ImagePullBackOff, Pending).
- A PodDisruptionBudget is preventing old pods from being evicted.
- Resource quota in the namespace is exhausted (cannot create new ReplicaSet pods).
- The progressDeadlineSeconds has not yet been reached (default 600s).
How to diagnose:
kubectl rollout status deployment/<name> -n <namespace>
kubectl describe deployment <name> -n <namespace> # Check Conditions and Events
kubectl get rs -n <namespace> # Compare old vs new ReplicaSet
kubectl get pods -n <namespace> -l <selector> # What state are the new pods in?
kubectl get pdb -n <namespace> # Check PodDisruptionBudgets
kubectl get resourcequota -n <namespace>
Fix:
- Fix the underlying pod issue (image, config, resources) then let the rollout continue.
- If PDB is blocking: temporarily relax the PDB or scale up first.
- If stuck and unrecoverable: kubectl rollout undo deployment/<name> -n <namespace>.
- If quota exceeded: increase the quota or delete unused resources.
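The progressDeadlineSeconds and rolling-update knobs referenced above live on the Deployment spec. A sketch, with illustrative names, image, and values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # illustrative name
spec:
  replicas: 3
  progressDeadlineSeconds: 300     # after 5 min without progress, the Deployment
                                   # reports ProgressDeadlineExceeded (default 600)
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1            # at most 1 old pod down at a time
      maxSurge: 1                  # at most 1 extra pod above replicas
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
      - name: web
        image: nginx:1.27          # illustrative image
```

Note that hitting the deadline only sets a condition on the Deployment; it does not roll back automatically. The rollback is the explicit `kubectl rollout undo`.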
FailedCreate on ReplicaSet
What it means: The ReplicaSet controller cannot create new pods.
Common causes:
- Resource quota in the namespace is fully consumed.
- An admission webhook is rejecting pod creation.
- LimitRange in the namespace is setting constraints the pod spec violates.
- The ServiceAccount referenced by the pod does not exist.
How to diagnose:
kubectl describe rs <replicaset-name> -n <namespace> # Events will show the error
kubectl get resourcequota -n <namespace> -o yaml # Compare used vs hard
kubectl get limitrange -n <namespace> -o yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
Fix:
- If quota exhausted: increase the quota or reduce resource requests on the pods.
- If webhook rejecting: check webhook logs to understand the rejection reason.
- If LimitRange violation: adjust pod resource requests/limits to comply.
- If ServiceAccount missing: create it or correct the reference.
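To make the quota and LimitRange interactions concrete, here is a sketch of both objects in one namespace (namespace and numbers are illustrative):

```yaml
# Pod creation in "dev" fails with FailedCreate once these sums are hit.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "4"       # sum of all pod CPU requests capped at 4 cores
    requests.memory: 8Gi    # sum of all pod memory requests capped at 8 GiB
    pods: "20"              # at most 20 pods in the namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: dev
spec:
  limits:
  - type: Container
    default:                # injected as limits when a container sets none
      cpu: 500m
      memory: 512Mi
    defaultRequest:         # injected as requests when a container sets none
      cpu: 100m
      memory: 128Mi
```

One subtle interaction: once a ResourceQuota constrains requests, every pod must declare requests (explicitly or via LimitRange defaults), or the ReplicaSet's pod creation is rejected outright.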
Useful Commands Cheat Sheet
# --- Inspecting Resources ---
kubectl get pods -n <ns> -o wide # Pod status with node and IP
kubectl describe pod <pod> -n <ns> # Full pod details and events
kubectl get events -n <ns> --sort-by='.lastTimestamp' # Recent events in namespace
kubectl get events -A --field-selector type=Warning # All warnings cluster-wide
# --- Logs ---
kubectl logs <pod> -n <ns> # Current container logs
kubectl logs <pod> -n <ns> --previous # Logs from last crashed container
kubectl logs <pod> -n <ns> -c <container> # Specific container in multi-container pod
kubectl logs -l app=<label> -n <ns> --tail=100 # Logs by label selector
# --- Interactive Debugging ---
kubectl exec -it <pod> -n <ns> -- /bin/sh # Shell into a running container
kubectl debug node/<node> -it --image=busybox # Debug node-level issues
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash # Ephemeral network debug pod
# --- Networking ---
kubectl get endpoints <svc> -n <ns> # Service endpoints
kubectl port-forward svc/<svc> 8080:80 -n <ns> # Forward service port to localhost
kubectl exec <pod> -n <ns> -- nslookup <svc> # Test DNS resolution from pod
# --- Resource Usage ---
kubectl top nodes # Node CPU and memory usage
kubectl top pods -n <ns> --sort-by=memory # Pod resource consumption
kubectl top pods -n <ns> --containers # Per-container resource usage
# --- Cluster State ---
kubectl get componentstatuses # Control plane health (deprecated but useful)
kubectl cluster-info dump | grep -i error # Dump cluster state and search for errors
kubectl api-resources # All available API resources
# --- Rollouts ---
kubectl rollout status deployment/<name> -n <ns> # Watch rollout progress
kubectl rollout history deployment/<name> -n <ns> # Rollout revision history
kubectl rollout undo deployment/<name> -n <ns> # Roll back to previous revision
Back to Table of Contents
Appendix E: Architecture Evolution Timeline
Kubernetes and its ecosystem have evolved rapidly since 2014. This timeline shows the major architectural shifts — each one driven by real problems with the previous approach. Understanding this evolution explains why the current ecosystem looks the way it does.
Visual Timeline (2013-2026)
Container Runtimes, Orchestration, Networking, and Package Management
flowchart TD
subgraph y2013 ["2013"]
docker13["Docker released<br>(monolithic daemon)"]
end
subgraph y2015 ["2015"]
oci["OCI founded<br>runc extracted"]
k8s10["Kubernetes 1.0<br>CNCF founded"]
flannel["Flannel (overlay)<br>kube-proxy + iptables"]
yaml15["Raw YAML<br>kubectl apply"]
end
subgraph y2016 ["2016"]
containerd16["containerd extracted<br>from Docker"]
calico["Calico (BGP)<br>Canal"]
helm2["Helm v2<br>(with Tiller)"]
end
subgraph y2017 ["2017"]
cri17["CRI interface defined"]
swarm17["Docker Swarm<br>embedded in Docker"]
cni17["CNI spec matures"]
end
subgraph y2018 ["2018-2019"]
kust["Kustomize<br>(patch-based)"]
cilium["Cilium (eBPF-based)"]
helm3["Helm v3 (no Tiller!)"]
mesos["Docker Enterprise<br>sold to Mirantis"]
end
subgraph y2020 ["2020-2022"]
deprec["K8s 1.20: dockershim<br>DEPRECATED"]
removed["K8s 1.24: dockershim<br>REMOVED"]
mesos21["Apache Mesos<br>RETIRED"]
helmkust["Helm + Kustomize<br>combined pattern"]
end
subgraph y2023 ["2023-2024"]
std24["containerd + CRI-O<br>are the standards"]
k8sstd["Kubernetes is<br>THE standard"]
gw["Gateway API GA"]
ciliumdef["Cilium = default CNI<br>for many platforms"]
cdk["cdk8s, Timoni<br>(CUE-based)"]
end
docker13 --> oci --> containerd16 --> cri17 --> deprec --> removed --> std24
docker13 --> k8s10 --> swarm17 --> mesos --> mesos21 --> k8sstd
flannel --> calico --> cni17 --> cilium --> gw --> ciliumdef
yaml15 --> helm2 --> kust --> helm3 --> helmkust --> cdk
Four parallel evolutions that shaped the infrastructure layer: Docker’s monolith was decomposed into containerd and CRI-O. The orchestration wars ended with Kubernetes as the universal standard. Networking shifted from overlays and iptables to eBPF-native with Cilium. And YAML management evolved from raw manifests through Helm’s Tiller era to today’s Helm v3 + Kustomize hybrid.
Security, GitOps, Scaling, and GPU/ML
flowchart TD
subgraph y2017b ["2016-2017"]
rbac["RBAC GA (K8s 1.8)"]
ca["Cluster Autoscaler"]
hpa["HPA v2"]
end
subgraph y2018b ["2018"]
psp["PodSecurityPolicy"]
argo["ArgoCD, Flux v1<br>GitOps begins"]
devplugin["Device plugins<br>for GPUs"]
end
subgraph y2020b ["2020-2021"]
opa["OPA / Gatekeeper"]
sig["Sigstore, Cosign<br>Kyverno matures"]
flux2["Flux v2 rewrite"]
crossplane["Crossplane<br>Backstage joins CNCF"]
karpenter["Karpenter (AWS)"]
kubeflow["Kubeflow, KubeRay"]
end
subgraph y2022b ["2022-2023"]
pspdep["PSP DEPRECATED"]
pss["Pod Security Standards<br>replace PSP"]
plateng["Platform Engineering<br>as a discipline"]
gpuop["NVIDIA GPU<br>Operator mature"]
dra["DRA alpha<br>(Dynamic Resource Allocation)"]
end
subgraph y2024b ["2024-2025"]
supply["Supply chain security:<br>SBOM, SLSA standard"]
idp["Internal Developer<br>Platforms go mainstream"]
karpga["Karpenter GA<br>+ Azure support"]
llm["LLM serving explosion:<br>vLLM, TGI, KServe"]
llmd["llm-d, LeaderWorkerSet<br>multi-node inference"]
end
rbac --> psp --> opa --> pspdep --> pss --> supply
argo --> flux2 --> crossplane --> plateng --> idp
ca --> hpa --> karpenter --> karpga
sig ~~~ pss
devplugin --> kubeflow --> gpuop --> dra --> llm --> llmd
Security moved from the flawed PodSecurityPolicy to the simpler Pod Security Standards, while policy engines like OPA and Kyverno filled the gap. GitOps went from manual kubectl to ArgoCD/Flux, then broadened into full Internal Developer Platforms. Scaling evolved from the slow, group-based Cluster Autoscaler to Karpenter’s per-pod provisioning. And GPU/ML infrastructure exploded from basic device plugins to DRA, vLLM, and disaggregated serving with llm-d.
Observability
timeline
title Observability Evolution
2016 : Prometheus joins CNCF
2018 : Prometheus graduates CNCF
2019 : OpenTelemetry formed
: (OpenTracing + OpenCensus merger)
2021 : Grafana Loki, Tempo mature
2023 : OpenTelemetry GA
: (traces, metrics)
2024 : OpenTelemetry logging matures
Observability converged from three fragmented signals — Prometheus for metrics, various tools for logs, and Jaeger/Zipkin for traces — into a unified standard with OpenTelemetry. The Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) emerged as the dominant open-source backend.
Node Autoscaling: The CA-to-Karpenter Transition
| | Cluster Autoscaler (2016) | | Karpenter (2021+) |
|---|---|---|---|
| Abstraction | Node-group based | → | Groupless provisioning |
| Scaling unit | Scale by group min/max | → | Per-pod scheduling |
| Speed | Slow (minutes) | → | Fast (seconds) |
| Bin-packing | No | → | Cross-instance-type optimization |
| Consolidation | Reactive only | → | Active consolidation |
| Instance types | Fixed per group | → | Works across all types |
Why it changed: Cluster Autoscaler couldn’t keep up with diverse GPU/ML workloads that needed fast, flexible provisioning across many instance types. Karpenter eliminated the node group abstraction entirely.
Summary: Architectural Shifts by Domain
| Domain | Old Way | New Way | Why It Changed |
|---|---|---|---|
| Container Runtime | Docker (monolithic daemon) | containerd / CRI-O via CRI | Docker included too much (build, swarm, CLI). K8s only needs a runtime. CRI allows pluggable runtimes. |
| Orchestration | Docker Swarm, Mesos, multiple options | Kubernetes (universal standard) | K8s won on extensibility (CRDs, operators) and ecosystem. Swarm was too simple, Mesos too complex. |
| Networking | Flannel overlay + iptables kube-proxy | Cilium (eBPF) + Gateway API | iptables doesn’t scale. Overlay adds latency. eBPF gives kernel-level networking without kube-proxy. |
| Package Management | Raw YAML / Helm v2 with Tiller | Helm v3 + Kustomize (or combined) | Tiller was a security risk (cluster-admin in-cluster). Raw YAML doesn’t compose. Kustomize avoids templating. |
| Security | PodSecurityPolicy (PSP) | Pod Security Standards (PSS) + Kyverno/OPA | PSP was confusing, hard to audit, and couldn’t be extended. PSS is simpler; policy engines are more flexible. |
| GitOps & Platform | Manual kubectl apply / CI pipelines | ArgoCD/Flux + Internal Developer Platforms | Imperative deploys are fragile and unauditable. GitOps makes the desired state declarative and versioned. |
| Scaling | Cluster Autoscaler (node-group based) | Karpenter (groupless, per-pod) | CA was slow and inflexible with diverse workloads. Karpenter provisions exactly what’s needed, fast. |
| GPU/ML | Basic device plugins | GPU Operator + DRA + specialized serving (vLLM, llm-d) | LLM explosion demands multi-node GPU scheduling, fractional GPUs, and inference-optimized runtimes. |
| Observability | Prometheus + ad-hoc logging/tracing | OpenTelemetry (unified) + Grafana stack | Three separate telemetry signals (metrics, logs, traces) needed a unified collection and correlation standard. |
Back to Table of Contents
Colophon
How This Book Was Made
Kubernetes from First Principles: Why It Works the Way It Does was written over a weekend (April 3-4, 2026) through a collaboration between Rajat Arya and Claude Code (Anthropic’s Claude Opus 4.6 with 1M context). The entire process — from “help me set up a Kubernetes cluster on EC2” to a published 98,000-word, 45-chapter textbook with 5 appendices — happened in one continuous conversation.
The Process
The book emerged organically from a hands-on Kubernetes learning session. Rajat was setting up a 3-node Kubernetes cluster on AWS EC2 from scratch using kubeadm. As we worked through real problems (containerd CRI disabled, SystemdCgroup mismatch, etcd crash loops, missing CNI plugins, API server CrashLoopBackOff), we documented what we learned. That documentation grew into Part 1 (First Principles), then expanded into a full curriculum.
Generation Method
-
Research phase: For each part, specialized research agents were dispatched to search the web (via Actionbook browser automation), read official Kubernetes documentation, blog posts, cloud provider docs, CNCF project pages, and academic papers. Research agents ran in parallel — up to 4 simultaneously — to gather material for different topics.
-
Writing phase: Writing agents received the research findings along with detailed chapter outlines specifying what to cover, what tone to use, and what diagrams to include. Writers also ran in parallel — up to 5 simultaneously — each producing 4-8 chapters.
-
Coherence pass: A review agent read every chapter, verified all “Next:” links, added cross-references between chapters, wrote part transition paragraphs, and checked tone consistency.
-
Link verification: All external URLs were tested for accessibility.
The Prompts
The book was generated through a series of natural-language prompts. Here are the key ones that shaped each part:
Part 1 (First Principles, chapters 1-9):
“I need to understand how Kubernetes and its ecosystem fit into the modern deployment landscape — but from first principles. I see an infinite number of resources online describing how to use k8s, but I don’t see any real information on where it comes from, why it was architected this way, and what problems it seeks to solve.”
Parts 2-3 (Tooling Evolution + Practice, chapters 10-20):
“Make a part 2 that includes tool ecosystem history and evolution. Has there always been kubeadm, kubelet, etc? And then part 3 can cover getting started with modern Kubernetes, including setting up a cluster from scratch, using public cloud kubernetes offerings, understanding how Kubernetes networking and storage map to public cloud VM offerings, and provide a more practical hands-on way to connect the theory to practice.”
Parts 4-8 (Stateful, Security, Scaling, Platform, Advanced, chapters 21-45):
“I want all of these. I also want the collection of individual topics. It is especially important that I understand the GPU workloads and AI/ML on Kubernetes. Go in as much depth as possible. I need all of these topics to understand the infrastructure at my work.”
Guiding Constraints
These instructions were consistent across all parts:
- “Focus on WHY decisions were made, not HOW to use the tools.” — This shaped the entire tone. Every chapter explains the reasoning behind design decisions rather than just listing commands.
- “I know Linux, I know the computer pretty well, and I know networking pretty well.” — This set the audience level. The book doesn’t explain what a process is or how TCP works, but it does explain why Kubernetes chose a flat networking model over Docker’s port-mapping.
- “Liberally draw diagrams” — Every chapter includes ASCII diagrams illustrating architecture, data flow, or concept relationships.
- “Same tone as part 1” — The first-principles, textbook-quality tone was established in Part 1 and maintained throughout by referencing the existing chapters as style guides.
Tools Used
- Claude Code (Anthropic Claude Opus 4.6, 1M context) — conversation orchestration, research coordination, writing, and editing
- Actionbook Browser — web research automation for gathering source material
- GitHub CLI (gh) — repository creation and publishing
The Companion Cluster
The install.sh script in this repository is a real, working bootstrap script that was iteratively debugged during the conversation. It went through several revisions:
- v1: Based on an outdated reference script; used the deprecated apt-key, installed the full Docker engine, and had the wrong CNI plugin version
- v2: Fixed the containerd CRI config, added SystemdCgroup, switched to containerd-only (no Docker engine), and updated to the modern keyring approach
- v3: Fixed the CNI plugin version (v1.6.1 didn't exist; updated to v1.9.1) and added the conntrack dependency
Every error documented in the troubleshooting sections of the README and Chapter 15 was a real error encountered during the session.
Accuracy and Limitations
- Research was conducted on April 3-4, 2026. Version numbers, feature statuses, and ecosystem information reflect this date. Kubernetes evolves rapidly — verify versions before following any specific instructions.
- The AI-generated content was guided by web research from official documentation, CNCF project pages, and reputable technical blogs. However, AI can hallucinate details. When in doubt, consult the official Kubernetes documentation at https://kubernetes.io/docs/.
- External links were verified at publication time but may break as pages move or are removed.
- The book reflects one learning path. There are many valid ways to learn Kubernetes. This path emphasizes first-principles understanding over hands-on tutorials, which suits some learners better than others.
License
The book is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). Any additional code and supporting materials (e.g., install.sh) are licensed under the MIT License. See LICENSE for details.
Kubernetes is a registered trademark of The Linux Foundation.
Contributing
Found an error? A broken link? A concept that could be explained better? Contributions are welcome via pull requests.
