Kubernetes from First Principles

Why It Works the Way It Does


Most Kubernetes resources teach you how to write YAML. This book teaches you why the YAML looks the way it does. 45 chapters — each traced from the original design problem through the ecosystem’s evolution to today’s best practice.

About

This is an eight-part book that takes you from “why does Kubernetes exist?” to “I’m running GPU-accelerated ML workloads in production across multiple clusters.” It is written for engineers who understand Linux, networking, and how systems work — and want to understand Kubernetes deeply, not just follow tutorials.


Part 1: First Principles

Why Kubernetes was designed the way it was.

  1. The Road to Kubernetes — From bare metal to Borg to Kubernetes
  2. The Problems Kubernetes Solves — Bin packing, service discovery, self-healing, and the desired state model
  3. Architecture from First Principles — etcd, API server, controllers, scheduler, kubelet, kube-proxy
  4. The API Model — Resources, specs, status, reconciliation loops, labels, and CRDs
  5. The Networking Model — Flat networking, CNI, Services, Ingress, and Network Policies
  6. The Ecosystem — Operators, Helm, service meshes, and Kubernetes as a platform for platforms
  7. Key Design Principles — Declarative over imperative, control loops, level-triggered vs edge-triggered
  8. Why Kubernetes Won — The competitive landscape and the deeper architectural lesson
  9. References and Further Reading — Foundational papers, design documents, talks, and books

Part 2: The Tooling Ecosystem — History and Evolution

How the tools around Kubernetes evolved, and why they look the way they do today.

  1. The Container Runtime Wars — Docker to containerd to CRI-O: why Docker support in the kubelet (dockershim) was deprecated
  2. Bootstrapping a Cluster — From kube-up.sh to kubeadm: how cluster setup evolved
  3. Package Management and GitOps — Helm v2/v3, Kustomize, ArgoCD, Flux
  4. The Networking Stack Evolution — Flannel to Calico to Cilium: how eBPF changed everything
  5. Kubernetes Version History — A guided tour of key releases and what they introduced

Part 3: From Theory to Practice

Connecting the principles from Part 1 to real-world usage.

  1. Setting Up a Cluster from Scratch — What kubeadm actually does: TLS bootstrapping, static pods
  2. Managed Kubernetes: EKS, GKE, and AKS — Cloud provider comparison and how to choose
  3. Cloud Networking and Storage — VPC CNI, CSI drivers, and how K8s maps to cloud infrastructure
  4. Your First Workloads — Hands-on: Deployments, Services, ConfigMaps, rolling updates
  5. Debugging Kubernetes — The kubectl toolkit and diagnosing common failures
  6. Production Readiness — Monitoring, logging, security basics, and backup

Part 4: Stateful Workloads

Running real applications with persistent state.

  1. StatefulSets Deep Dive — Stable identities, ordered operations, and headless Services
  2. Databases on Kubernetes — When to run databases on K8s, operators, and the trade-offs
  3. Persistent Storage Patterns — volumeClaimTemplates, reclaim policies, backup, and resize
  4. Jobs and CronJobs — Batch processing, indexed completions, and scheduling patterns

Part 5: Security Deep Dive

Understanding and implementing Kubernetes security from the ground up.

  1. RBAC from First Principles — Roles, bindings, ServiceAccounts, and multi-tenant design
  2. Network Policies — Default deny, namespace isolation, and egress control
  3. Supply Chain Security — Image signing, admission policies, scanning, and SLSA
  4. Secrets Management — Encryption at rest, Vault, External Secrets Operator, and best practices
  5. Pod Security Standards — Privileged, Baseline, Restricted profiles and enforcement

Part 6: Scaling and Performance

Making Kubernetes handle real-world load.

  1. Horizontal Pod Autoscaler — The scaling algorithm, custom metrics, KEDA, and tuning
  2. Vertical Pod Autoscaler and Right-Sizing — Recommendation mode, in-place resize, and resource tuning
  3. Node Scaling: Cluster Autoscaler and Karpenter — How nodes scale, Karpenter’s architecture, and consolidation
  4. Resource Tuning Deep Dive — CPU throttling, memory cgroups, NUMA, and overcommitment

Part 7: Multi-Cluster and Platform Engineering

Operating Kubernetes at organizational scale.

  1. Multi-Cluster Strategies — Federation, GitOps-driven, service mesh, and Cluster API
  2. Building Internal Developer Platforms — Backstage, the platform stack, and reducing cognitive load
  3. Crossplane: Infrastructure as CRDs — Managing cloud resources through Kubernetes
  4. Multi-Tenancy — Namespace isolation, virtual clusters, and tenant boundaries

Part 8: Advanced Topics

Deep dives for infrastructure engineers.

  1. Writing Controllers and Operators — controller-runtime, Kubebuilder, and the Reconcile pattern
  2. The Kubernetes API Internals — Aggregation, admission webhooks, API priority and fairness
  3. etcd Operations — Backup, restore, compaction, monitoring, and disaster recovery
  4. GPU Workloads and AI/ML on Kubernetes — Device plugins, DRA, GPU sharing, distributed training
  5. Running LLMs on Kubernetes — vLLM, TGI, KServe, multi-node inference, and model serving
  6. Disaster Recovery — Cluster backup, etcd snapshots, multi-region strategies
  7. Cost Optimization — Right-sizing, spot instances, Kubecost, and chargeback
  8. Observability with OpenTelemetry — Metrics, logs, traces, and the OTel Collector

How to Read This

Part 1 is the intellectual foundation. Read it first.

Part 2 fills in the historical context of the tooling. Read it after Part 1.

Part 3 is hands-on. Reference it as you work through your own cluster.

Parts 4-5 cover stateful workloads and security — essential for running real production systems.

Part 6 covers scaling — read it when your workloads need to handle real load.

Part 7 is for when you’re operating multiple clusters or building a platform team.

Part 8 is deep reference material. Read chapters as needed. The GPU/ML chapters (chapters 41-42 of the book: "GPU Workloads and AI/ML on Kubernetes" and "Running LLMs on Kubernetes") are especially relevant for AI infrastructure teams.

If you only have time for one chapter from each part:

Appendices

Companion Material

  • install.sh — The bootstrap script we built to provision Kubernetes nodes on EC2
  • Colophon — How this book was made, the prompts used, and accuracy notes