Kubernetes from First Principles: Why It Works the Way It Does
Most Kubernetes learning resources teach you how to write YAML. This book teaches you why the YAML looks the way it does. It has 45 chapters, each traced from the original design problem through the ecosystem’s evolution to today’s best practice.
About
This is an eight-part book that takes you from “why does Kubernetes exist?” to “I’m running GPU-accelerated ML workloads in production across multiple clusters.” It is written for engineers who understand Linux, networking, and how systems work — and want to understand Kubernetes deeply, not just follow tutorials.
Part 1: First Principles
Why Kubernetes was designed the way it was.
- The Road to Kubernetes — From bare metal to Borg to Kubernetes
- The Problems Kubernetes Solves — Bin packing, service discovery, self-healing, and the desired state model
- Architecture from First Principles — etcd, API server, controllers, scheduler, kubelet, kube-proxy
- The API Model — Resources, specs, status, reconciliation loops, labels, and CRDs
- The Networking Model — Flat networking, CNI, Services, Ingress, and Network Policies
- The Ecosystem — Operators, Helm, service meshes, and Kubernetes as a platform for platforms
- Key Design Principles — Declarative over imperative, control loops, level-triggered vs edge-triggered
- Why Kubernetes Won — The competitive landscape and the deeper architectural lesson
- References and Further Reading — Foundational papers, design documents, talks, and books
Part 2: The Tooling Ecosystem — History and Evolution
How the tools around Kubernetes evolved, and why they look the way they do today.
- The Container Runtime Wars — Docker to containerd to CRI-O: why Kubernetes deprecated dockershim, its built-in Docker integration
- Bootstrapping a Cluster — From kube-up.sh to kubeadm: how cluster setup evolved
- Package Management and GitOps — Helm v2/v3, Kustomize, ArgoCD, Flux
- The Networking Stack Evolution — Flannel to Calico to Cilium: how eBPF changed everything
- Kubernetes Version History — A guided tour of key releases and what they introduced
Part 3: From Theory to Practice
Connecting the principles from Part 1 to real-world usage.
- Setting Up a Cluster from Scratch — What kubeadm actually does: TLS bootstrapping, static pods
- Managed Kubernetes: EKS, GKE, and AKS — Cloud provider comparison and how to choose
- Cloud Networking and Storage — VPC CNI, CSI drivers, and how K8s maps to cloud infrastructure
- Your First Workloads — Hands-on: Deployments, Services, ConfigMaps, rolling updates
- Debugging Kubernetes — The kubectl toolkit and diagnosing common failures
- Production Readiness — Monitoring, logging, security basics, and backup
Part 4: Stateful Workloads
Running real applications with persistent state.
- StatefulSets Deep Dive — Stable identities, ordered operations, and headless Services
- Databases on Kubernetes — When to run databases on K8s, operators, and the trade-offs
- Persistent Storage Patterns — volumeClaimTemplates, reclaim policies, backup, and resize
- Jobs and CronJobs — Batch processing, indexed completions, and scheduling patterns
Part 5: Security Deep Dive
Understanding and implementing Kubernetes security from the ground up.
- RBAC from First Principles — Roles, bindings, ServiceAccounts, and multi-tenant design
- Network Policies — Default deny, namespace isolation, and egress control
- Supply Chain Security — Image signing, admission policies, scanning, and SLSA
- Secrets Management — Encryption at rest, Vault, External Secrets Operator, and best practices
- Pod Security Standards — Privileged, Baseline, Restricted profiles and enforcement
Part 6: Scaling and Performance
Making Kubernetes handle real-world load.
- Horizontal Pod Autoscaler — The scaling algorithm, custom metrics, KEDA, and tuning
- Vertical Pod Autoscaler and Right-Sizing — Recommendation mode, in-place resize, and resource tuning
- Node Scaling: Cluster Autoscaler and Karpenter — How nodes scale, Karpenter’s architecture, and consolidation
- Resource Tuning Deep Dive — CPU throttling, memory cgroups, NUMA, and overcommitment
Part 7: Multi-Cluster and Platform Engineering
Operating Kubernetes at organizational scale.
- Multi-Cluster Strategies — Federation, GitOps-driven, service mesh, and Cluster API
- Building Internal Developer Platforms — Backstage, the platform stack, and reducing cognitive load
- Crossplane: Infrastructure as CRDs — Managing cloud resources through Kubernetes
- Multi-Tenancy — Namespace isolation, virtual clusters, and tenant boundaries
Part 8: Advanced Topics
Deep dives for infrastructure engineers.
- Writing Controllers and Operators — controller-runtime, Kubebuilder, and the Reconcile pattern
- The Kubernetes API Internals — Aggregation, admission webhooks, API priority and fairness
- etcd Operations — Backup, restore, compaction, monitoring, and disaster recovery
- GPU Workloads and AI/ML on Kubernetes — Device plugins, DRA, GPU sharing, distributed training
- Running LLMs on Kubernetes — vLLM, TGI, KServe, multi-node inference, and model serving
- Disaster Recovery — Cluster backup, etcd snapshots, multi-region strategies
- Cost Optimization — Right-sizing, spot instances, Kubecost, and chargeback
- Observability with OpenTelemetry — Metrics, logs, traces, and the OTel Collector
How to Read This
Part 1 is the intellectual foundation. Read it first.
Part 2 fills in the historical context of the tooling. Read it after Part 1.
Part 3 is hands-on. Reference it as you work through your own cluster.
Parts 4-5 cover stateful workloads and security — essential for running real production systems.
Part 6 covers scaling — read it when your workloads need to handle real load.
Part 7 is for when you’re operating multiple clusters or building a platform team.
Part 8 is deep reference material. Read chapters as needed. The GPU/ML chapters (41-42) are especially relevant for AI infrastructure teams.
If you only have time for one chapter from each part:
- Part 1: Architecture from First Principles
- Part 2: The Container Runtime Wars
- Part 3: Debugging Kubernetes
- Part 4: StatefulSets Deep Dive
- Part 5: RBAC from First Principles
- Part 6: Node Scaling: Cluster Autoscaler and Karpenter
- Part 7: Building Internal Developer Platforms
- Part 8: GPU Workloads and AI/ML on Kubernetes
Appendices
- Appendix A: Glossary — Quick-reference definitions for 100+ Kubernetes terms
- Appendix B: Mental Models — Visual diagrams showing how concepts in each part connect
- Appendix C: Decision Trees — Flowcharts for choosing workload types, storage, networking, and tools
- Appendix D: Troubleshooting Quick Reference — Error messages mapped to root causes and fixes
- Appendix E: Architecture Evolution Timeline — How the Kubernetes ecosystem evolved from 2013 to today
Companion Material
- install.sh — The bootstrap script we built to provision Kubernetes nodes on EC2
- Colophon — How this book was made, the prompts used, and accuracy notes
