Appendix D: Troubleshooting Quick Reference
This appendix maps the error messages and symptoms you will encounter in practice to their most common root causes. It is organized by where you see the error.
General Debugging Flowchart
flowchart LR
Start["Pod not working?"]
GetPods["kubectl get pods -n namespace"]
Status{"What status do you see?"}
Start --> GetPods --> Status
Status --> Pending
Status --> Crash["CrashLoopBackOff"]
Status --> Image["ImagePullBackOff"]
Status --> Running["Running but not working"]
Status --> Evicted
Status --> Unknown["Unknown / NodeLost"]
Pending --> PDescribe["kubectl describe pod"]
PDescribe --> P1{"Insufficient<br>cpu/memory?"}
PDescribe --> P2{"No nodes match<br>selectors?"}
PDescribe --> P3{"PVC not bound?"}
P1 --> P1Fix["Scale up or adjust requests"]
P2 --> P2Fix["Fix nodeSelector/affinity"]
P3 --> P3Fix["Check PVC"]
Crash --> CLogs["kubectl logs --previous"]
CLogs --> C1{"OOMKilled?"}
CLogs --> C2{"App error?"}
CLogs --> C3{"Missing config?"}
C1 --> C1Fix["Increase memory limits"]
C2 --> C2Fix["Fix application startup"]
C3 --> C3Fix["Check ConfigMaps/Secrets"]
Image --> IDescribe["kubectl describe pod"]
IDescribe --> I1{"repo does not exist?"}
IDescribe --> I2{"unauthorized?"}
IDescribe --> I3{"tag not found?"}
I1 --> I1Fix["Fix image name"]
I2 --> I2Fix["Fix imagePullSecrets"]
I3 --> I3Fix["Fix image tag"]
Running --> RLogs["kubectl logs -f"]
RLogs --> R1["Check readiness probe:<br>kubectl describe pod"]
RLogs --> R2["Check service endpoints:<br>kubectl get endpoints"]
RLogs --> R3["Test from inside:<br>kubectl exec -it -- sh"]
Evicted --> EDescribe["kubectl describe node"]
EDescribe --> E1["Check for DiskPressure /<br>MemoryPressure"]
Unknown --> UNodes["kubectl get nodes"]
UNodes --> U1{"Node NotReady?"}
U1 --> U1Fix["SSH to node, check kubelet"]
Pod Status Errors
Pending
What it means: The scheduler cannot find a node to place the pod on, or a prerequisite resource is not ready.
Common causes:
- No node has enough CPU or memory to satisfy the pod’s resource requests.
- nodeSelector, nodeAffinity, or tolerations do not match any available node.
- A referenced PersistentVolumeClaim is not bound.
- Resource quotas in the namespace are exhausted.
- The cluster has no nodes at all (scaling from zero).
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Look at the Events section
kubectl get nodes -o wide # Check node status and capacity
kubectl describe node <node-name> # Check Allocatable vs Allocated
kubectl get pvc -n <namespace> # Check PVC status
kubectl get resourcequota -n <namespace> # Check quota usage
Fix:
- If resource-constrained: lower the pod’s resource requests, add nodes, or remove idle workloads (example below).
- If selector mismatch: correct nodeSelector/nodeAffinity labels or add labels to nodes.
- If PVC not bound: ensure a matching PV exists or the StorageClass can dynamically provision one.
- If quota exceeded: request a quota increase or free capacity in the namespace.
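For illustration, a minimal Deployment with right-sized requests might look like this; the name, image, and values are placeholders, not taken from earlier chapters:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.example.com/web:1.2.3   # placeholder image
        resources:
          requests:
            cpu: "250m"       # small enough to fit on an existing node
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"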
CrashLoopBackOff
What it means: The container starts, exits with an error, and Kubernetes keeps restarting it with exponential backoff.
Common causes:
- Application crashes on startup (uncaught exception, missing dependency).
- Required ConfigMap or Secret is mounted but contains wrong data (wrong key, wrong format).
- OOMKilled – the container exceeds its memory limit on startup.
- Liveness probe is too aggressive and kills the container before it finishes starting.
- Entrypoint or command is misconfigured.
How to diagnose:
kubectl logs <pod-name> -n <namespace> --previous # Logs from the last crashed container
kubectl describe pod <pod-name> -n <namespace> # Check Last State, Exit Code, Reason
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState}'
Fix:
- Read the logs from --previous to find the actual error.
- If OOMKilled (exit code 137): increase resources.limits.memory.
- If the liveness probe is killing the pod: increase initialDelaySeconds and failureThreshold (example below).
- If config is missing: verify the ConfigMap/Secret exists and has the expected keys.
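If the liveness probe is the problem, a relaxed probe configuration along these lines usually stops the restart loop. The pod name, image, and /healthz endpoint are assumptions for this sketch:
apiVersion: v1
kind: Pod
metadata:
  name: slow-start-app                  # placeholder
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0 # placeholder
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz                  # assumed health endpoint
        port: 8080
      initialDelaySeconds: 60           # give the app time to finish starting
      periodSeconds: 10
      failureThreshold: 6               # tolerate transient failures before restarting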
ImagePullBackOff / ErrImagePull
What it means: The kubelet cannot pull the container image from the registry.
Common causes:
- Image name or tag is misspelled.
- The image tag does not exist (e.g., latest was overwritten or a SHA was pruned).
- The registry is private and imagePullSecrets are missing or contain invalid credentials.
- The node cannot reach the registry (network/firewall issue).
- Docker Hub rate limits are hit on unauthenticated pulls.
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Read the pull error message
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].image}'
# Verify the image exists:
docker pull <image> # Or: crane manifest <image>
# Check imagePullSecrets:
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
Fix:
- Correct the image name and tag.
- Create or fix imagePullSecrets:
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass> \
  -n <namespace>
- If rate-limited: configure a pull-through cache or authenticate to Docker Hub.
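Once the secret exists, the pod (or its ServiceAccount) has to reference it. A minimal pod sketch, with a placeholder image:
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo              # placeholder
spec:
  imagePullSecrets:
  - name: regcred                       # the secret created above
  containers:
  - name: app
    image: registry.example.com/team/app:1.0.0   # placeholder private image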
OOMKilled
What it means: The Linux kernel’s OOM killer terminated the container because it tried to use more memory than its cgroup limit allows.
Common causes:
- Memory limit is set too low for the workload.
- Java application is not configured for container-aware memory (-XX:MaxRAMPercentage not set, or an old JVM that ignores cgroup limits).
- Memory leak in the application.
- Large file processing or caching that loads entire datasets into memory.
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Look for "OOMKilled" in Last State
kubectl top pod <pod-name> -n <namespace> # Current memory usage
kubectl logs <pod-name> -n <namespace> --previous # App logs before kill
# On the node:
dmesg | grep -i "oom\|killed" # Kernel OOM killer logs
Fix:
- Increase resources.limits.memory to match the actual needs of the workload (example below).
- For Java: set -XX:MaxRAMPercentage=75.0 instead of a fixed -Xmx, and ensure JVM version 10+.
- For memory leaks: profile the application, fix the leak, then right-size the limit.
- Set resources.requests.memory close to limits.memory to avoid scheduling on nodes that cannot support the workload.
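As a sketch of the first two fixes combined, assuming a Java service (image and sizes are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: java-api                        # placeholder
spec:
  containers:
  - name: api
    image: registry.example.com/api:2.0 # placeholder
    env:
    - name: JAVA_TOOL_OPTIONS
      value: "-XX:MaxRAMPercentage=75.0"   # size the heap from the cgroup limit
    resources:
      requests:
        memory: "1Gi"                   # keep requests close to limits
      limits:
        memory: "1Gi"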
CreateContainerConfigError
What it means: The kubelet cannot create the container because a referenced ConfigMap or Secret does not exist.
Common causes:
- The ConfigMap or Secret was not created before the pod.
- Typo in the ConfigMap/Secret name in the pod spec.
- The ConfigMap/Secret is in a different namespace (they are namespace-scoped).
- It was accidentally deleted.
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Events will name the missing resource
kubectl get configmap -n <namespace>
kubectl get secret -n <namespace>
Fix:
- Create the missing ConfigMap or Secret.
- Correct any name typos in the pod spec.
- If it should be optional, set optional: true on the configMapRef/secretRef (example below).
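A sketch of the optional reference, with placeholder names:
apiVersion: v1
kind: Pod
metadata:
  name: optional-config-demo            # placeholder
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0 # placeholder
    envFrom:
    - configMapRef:
        name: app-config                # placeholder
        optional: true                  # a missing ConfigMap no longer blocks the container
    - secretRef:
        name: app-secrets               # placeholder
        optional: true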
Init:CrashLoopBackOff
What it means: An init container is repeatedly crashing, preventing the main containers from starting.
Common causes:
- The init container is waiting for a service that is not yet available (e.g., database migration init container cannot connect to the DB).
- Script error in the init container command.
- Wrong image or command for the init container.
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Identify which init container is failing
kubectl logs <pod-name> -n <namespace> -c <init-container-name> --previous
Fix:
- Check the init container logs for the specific error.
- Verify the service it depends on is running and reachable.
- Fix the init container command, image, or configuration.
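A common pattern is an init container that simply waits for its dependency. The sketch below assumes a database Service named db in the prod namespace and uses a simple nslookup-based wait loop; all names and images are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: app-with-init                   # placeholder
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.28
    command: ["sh", "-c", "until nslookup db.prod.svc.cluster.local; do echo waiting for db; sleep 2; done"]
  containers:
  - name: app
    image: registry.example.com/app:1.0 # placeholder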
Evicted
What it means: The kubelet evicted the pod because the node was under resource pressure (disk, memory, or PID).
Common causes:
- Node is under DiskPressure (ephemeral storage or container logs filled the disk).
- Node is under MemoryPressure (too many pods with BestEffort QoS).
- PID exhaustion on the node.
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Shows eviction reason
kubectl describe node <node-name> # Check Conditions for pressure
kubectl get pods -n <namespace> --field-selector=status.phase=Failed | grep Evicted
Fix:
- Clean up disk usage on the node (prune unused images, clear old logs).
- Set proper resources.requests so BestEffort pods are evicted first.
- Configure ephemeral-storage requests and limits (example below).
- Set up log rotation and image garbage collection on nodes.
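For the ephemeral-storage fix, a container sketch with placeholder values:
apiVersion: v1
kind: Pod
metadata:
  name: log-heavy-app                   # placeholder
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0 # placeholder
    resources:
      requests:
        ephemeral-storage: "1Gi"        # scheduler accounts for scratch and log usage
      limits:
        ephemeral-storage: "4Gi"        # kubelet evicts the pod if it exceeds this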
Node Issues
NotReady
What it means: The kubelet on the node is not communicating with the API server, so the control plane marks it NotReady.
Common causes:
- Kubelet service is not running or has crashed.
- CNI plugin is not installed or is misconfigured (the node cannot report Ready without a working CNI).
- Node is under DiskPressure or MemoryPressure.
- Network partition between the node and the control plane.
- Expired kubelet client certificate.
How to diagnose:
kubectl describe node <node-name> # Check Conditions and Events
kubectl get node <node-name> -o yaml # Look at .status.conditions
# SSH to the node:
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago"
crictl ps # Check container runtime
ls /etc/cni/net.d/ # Check CNI configuration
Fix:
- Restart kubelet: systemctl restart kubelet.
- If CNI is missing: install or reinstall the CNI plugin (Calico, Cilium, Flannel, etc.).
- If certificates expired: rotate certificates with kubeadm certs renew.
- If disk pressure: free disk space on the node.
SchedulingDisabled
What it means: The node has been cordoned – new pods will not be scheduled onto it.
Common causes:
- An administrator ran kubectl cordon <node>.
- A node drain is in progress (kubectl drain).
- A cluster autoscaler is decommissioning the node.
How to diagnose:
kubectl get nodes # Look for SchedulingDisabled
kubectl describe node <node-name> # Check Taints for NoSchedule
Fix:
- If the maintenance is complete: kubectl uncordon <node-name>.
- If autoscaler-managed: the node will be removed; no action needed.
DiskPressure / MemoryPressure
What it means: The node’s available disk or memory has dropped below the kubelet’s eviction threshold.
Common causes:
- Container images consuming too much disk.
- Application logs not rotated, filling up the filesystem.
- Too many pods on the node relative to available memory.
- Large emptyDir volumes.
How to diagnose:
kubectl describe node <node-name> # Check Conditions section
# SSH to the node:
df -h # Disk usage
free -m # Memory usage
crictl images | wc -l # Number of cached images
du -sh /var/log/pods/* # Pod log sizes
Fix:
- Disk: prune unused images (crictl rmi --prune), enable log rotation, clean /var/log.
- Memory: evict low-priority pods, increase node size, or add more nodes.
- Configure kubelet garbage collection and eviction thresholds in the KubeletConfiguration (fragment below).
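A KubeletConfiguration fragment covering image garbage collection, log rotation, and hard eviction thresholds might look like this; the values depend on node size and are only placeholders:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 80         # start image GC above 80% disk usage
imageGCLowThresholdPercent: 70          # garbage-collect down to 70%
containerLogMaxSize: "20Mi"             # rotate container logs at 20 MiB
containerLogMaxFiles: 5
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"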
Networking
Connection refused on Service
What it means: A TCP connection to the Service IP and port is actively refused, meaning nothing is listening.
Common causes:
- No ready endpoints behind the Service (pods are not running or not passing readiness probes).
- The Service targetPort does not match the port the application is actually listening on.
- The pod is running but the application inside has not started listening yet.
How to diagnose:
kubectl get endpoints <service-name> -n <namespace> # Are there any endpoints?
kubectl get pods -n <namespace> -l <selector> # Are pods running and Ready?
kubectl describe svc <service-name> -n <namespace> # Check selector and ports
kubectl exec -it <pod> -n <namespace> -- ss -tlnp # What is the pod listening on?
Fix:
- If no endpoints: ensure the Service selector matches the pod labels exactly.
- If targetPort is wrong: update the Service to match the container’s listening port.
- If pods are not Ready: fix the readiness probe or the underlying application issue.
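A minimal Service sketch showing the relationship between port and targetPort; names and ports are placeholders:
apiVersion: v1
kind: Service
metadata:
  name: web                             # placeholder
spec:
  selector:
    app: web                            # must match the pod labels exactly
  ports:
  - port: 80                            # port clients use on the Service
    targetPort: 8080                    # port the container actually listens on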
DNS Resolution Failures
What it means: Pods cannot resolve Kubernetes service names or external hostnames.
Common causes:
- CoreDNS pods are not running or are crashing.
- The ndots setting (default: 5) causes excessive search domain lookups, leading to timeouts.
- The pod’s dnsPolicy is set to Default (uses node DNS) instead of ClusterFirst.
- A NetworkPolicy is blocking DNS traffic (UDP/TCP port 53 to kube-system).
How to diagnose:
kubectl get pods -n kube-system -l k8s-app=kube-dns # Is CoreDNS running?
kubectl logs -n kube-system -l k8s-app=kube-dns # CoreDNS errors
# Test from inside a pod:
kubectl exec -it <pod> -n <namespace> -- nslookup kubernetes.default
kubectl exec -it <pod> -n <namespace> -- cat /etc/resolv.conf
Fix:
- If CoreDNS is down: check its deployment, resource limits, and node resources.
- For ndots issues: add dnsConfig with ndots: 2 to the pod spec, or use FQDNs with a trailing dot (example below).
- If a NetworkPolicy is blocking: allow egress to kube-system on port 53.
- If dnsPolicy is wrong: set dnsPolicy: ClusterFirst.
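A pod-level DNS tuning sketch; the pod name and image are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-app                   # placeholder
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
    - name: ndots
      value: "2"                        # avoid search-domain expansion for most external names
  containers:
  - name: app
    image: registry.example.com/app:1.0 # placeholder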
Service has no endpoints
What it means: The Service exists but has no backing pods, so all traffic to it fails.
Common causes:
- The Service’s label selector does not match any pod labels (typo or mismatch).
- All matching pods are failing their readiness probes.
- The pods are in a different namespace than expected (selectors are namespace-scoped).
How to diagnose:
kubectl describe svc <service-name> -n <namespace> # Check Selector
kubectl get endpoints <service-name> -n <namespace> # Should list pod IPs
kubectl get pods -n <namespace> --show-labels # Compare labels to selector
kubectl get pods -n <namespace> -l <key>=<value> # Test the selector directly
Fix:
- Align the Service selector with the pod’s labels.
- Fix readiness probes so pods become Ready.
- Ensure pods are deployed in the correct namespace.
Timeout Connecting Between Pods
What it means: TCP connections between pods hang and eventually time out, rather than being refused.
Common causes:
- A NetworkPolicy is blocking the traffic.
- The CNI plugin is misconfigured or its pods are crashing.
- iptables/eBPF rules are stale after a CNI upgrade or node reboot.
- Nodes are in different subnets and inter-node routing is broken.
How to diagnose:
kubectl get networkpolicy -n <namespace> # Are there policies restricting traffic?
kubectl describe networkpolicy <name> -n <namespace>
kubectl get pods -n kube-system -l k8s-app=calico-node # Or your CNI's pods
# Test connectivity from inside a pod:
kubectl exec -it <pod> -n <namespace> -- curl -v --connect-timeout 5 <target-svc>:<port>
# On the node:
iptables-save | grep <service-cluster-ip> # Check kube-proxy rules
Fix:
- If NetworkPolicy is blocking: update the policy to allow the required ingress/egress.
- If CNI is broken: restart CNI pods, or reinstall the CNI plugin.
- If iptables rules are stale: restart kube-proxy (kubectl rollout restart ds/kube-proxy -n kube-system).
- Check cloud provider security groups and route tables for inter-node communication.
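If a NetworkPolicy turns out to be the cause, the fix is usually an explicit allow rule. A sketch, assuming frontend pods need to reach backend pods on port 8080 in the prod namespace (all names and labels are placeholders):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend       # placeholder
  namespace: prod                       # placeholder
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080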
Storage
PVC Stuck in Pending
What it means: The PersistentVolumeClaim cannot be bound to a PersistentVolume.
Common causes:
- No PV matches the PVC’s storageClassName, accessModes, or capacity.
- The StorageClass does not exist or the provisioner is not installed.
- In multi-zone clusters: the PV is in a different zone than the node running the pod.
- WaitForFirstConsumer binding mode means the PVC will not bind until a pod using it is scheduled.
How to diagnose:
kubectl describe pvc <pvc-name> -n <namespace> # Events explain why it is pending
kubectl get storageclass # Does the StorageClass exist?
kubectl get pv # Are there available PVs?
kubectl get events -n <namespace> --field-selector reason=ProvisioningFailed
Fix:
- If no StorageClass: create one or set a default StorageClass.
- If the provisioner is missing: install the CSI driver (e.g., ebs-csi-driver, csi-driver-nfs).
- If zone mismatch: use volumeBindingMode: WaitForFirstConsumer to bind in the correct zone (example below).
- If capacity mismatch: create a PV with the required size, or adjust the PVC request.
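A StorageClass sketch using WaitForFirstConsumer; the name is a placeholder and the provisioner assumes the AWS EBS CSI driver is installed; substitute your own CSI driver:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait                        # placeholder
provisioner: ebs.csi.aws.com            # assumes the AWS EBS CSI driver
volumeBindingMode: WaitForFirstConsumer # bind only after the pod is scheduled
parameters:
  type: gp3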
FailedMount / FailedAttachVolume
What it means: The kubelet cannot mount or attach the volume to the node.
Common causes:
- The volume is still attached to another node (common when a pod is rescheduled – the old node has not detached yet).
- The CSI driver is not installed or its pods are not running.
- The volume does not exist (deleted out of band).
- Filesystem corruption requiring a manual fsck.
- The maximum number of volumes per node has been exceeded (e.g., the AWS limit on EBS volumes per instance type).
How to diagnose:
kubectl describe pod <pod-name> -n <namespace> # Look at Events for mount errors
kubectl get volumeattachments # Check if volume is attached elsewhere
kubectl get pods -n kube-system -l app=ebs-csi-node # Check CSI driver pods
kubectl get pv <pv-name> -o yaml # Check volume status
Fix:
- If the attachment is stuck: wait for the VolumeAttachment to be cleaned up (up to 6 minutes), or manually delete the VolumeAttachment object.
- If the CSI driver is missing: install it.
- If volume limit reached: use a larger instance type or distribute pods across more nodes.
- If volume was deleted: recreate it and restore from backup.
Control Plane
API Server connection refused
What it means: Clients cannot reach the Kubernetes API server.
Common causes:
- The kube-apiserver process is not running.
- TLS certificates have expired.
- Firewall or security group is blocking port 6443.
- Load balancer in front of API server is misconfigured or unhealthy.
How to diagnose:
# On a control plane node:
crictl ps | grep kube-apiserver # Is the container running?
crictl logs <apiserver-container-id> | tail -50 # API server logs
openssl s_client -connect <api-server>:6443 # Test TLS handshake
curl -k https://<api-server>:6443/healthz # Health endpoint
journalctl -u kubelet | grep apiserver # Kubelet managing static pod?
Fix:
- If not running: check the static pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml.
- If certificates expired: kubeadm certs renew all && systemctl restart kubelet.
- If a firewall is blocking: open port 6443 to the required source IPs.
- If load balancer: check health check configuration and backend targets.
etcd Errors
What it means: The etcd cluster backing the API server is unhealthy.
Common causes:
- Disk latency is too high (etcd requires low-latency storage, ideally SSD).
- Quorum lost (majority of etcd members are down).
- Database size has exceeded the space quota (default 2 GB).
- Clock skew between etcd members.
How to diagnose:
# If etcd is accessible:
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table
etcdctl alarm list
# Check disk latency:
etcdctl check perf
# From API server logs:
crictl logs <apiserver-container-id> | grep etcd
Fix:
- If disk latency: move etcd to SSD-backed storage, or use dedicated etcd nodes.
- If quorum lost: restore from snapshot (etcdctl snapshot restore).
- If the space quota is exceeded: compact (etcdctl compact <revision>) then defragment (etcdctl defrag).
- If alarms triggered: etcdctl alarm disarm after resolving the root cause.
Forbidden (RBAC)
What it means: The authenticated identity does not have permission to perform the requested action.
Common causes:
- Missing Role/ClusterRole or RoleBinding/ClusterRoleBinding.
- The binding references the wrong ServiceAccount, user, or group.
- Namespace mismatch: a Role only grants permissions in its own namespace.
- The ServiceAccount token is from a different namespace.
How to diagnose:
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<ns>:<sa>
kubectl auth can-i <verb> <resource> --as=<user> -n <namespace>
kubectl get rolebinding,clusterrolebinding -A | grep <service-account-name>
kubectl describe clusterrole <role-name> # What permissions does it grant?
Fix:
- Create the missing Role and RoleBinding (or ClusterRole/ClusterRoleBinding for cluster-wide access); see the sketch below.
- Verify the subjects in the binding match the actual identity making the request.
- Use kubectl auth can-i --list --as=<identity> to see all permissions for debugging.
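A minimal Role and RoleBinding sketch granting a ServiceAccount read access to pods in one namespace; all names are placeholders:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader                      # placeholder
  namespace: prod                       # placeholder
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: prod
subjects:
- kind: ServiceAccount
  name: app-sa                          # must match the identity in the Forbidden error
  namespace: prod
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io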
Webhook Errors (failed calling webhook)
What it means: An admission webhook (validating or mutating) is failing, blocking resource creation or updates.
Common causes:
- The webhook’s backing Service or pod is down.
- The webhook’s TLS certificate has expired.
- The webhook was installed with failurePolicy: Fail and its service is unreachable.
- The webhook is rejecting the request due to policy (this is intentional, not an error in the webhook itself).
How to diagnose:
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
kubectl describe validatingwebhookconfiguration <name> # Check service and failurePolicy
kubectl get pods -n <webhook-namespace> # Is the webhook pod running?
kubectl logs -n <webhook-namespace> <webhook-pod> # Webhook logs
Fix:
- If the webhook service is down: restart it or fix its deployment.
- If certificates expired: renew them (often managed by cert-manager).
- Emergency bypass: temporarily set failurePolicy: Ignore or delete the webhook configuration.
- To exclude a namespace: add the appropriate namespaceSelector to the webhook configuration (example below).
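For the namespace exclusion, a namespaceSelector on the webhook skips namespaces that carry (or lack) a given label. The configuration below is a generic sketch; the webhook name, service, and label key are placeholders, not a specific product’s configuration:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy                  # placeholder
webhooks:
- name: validate.policy.example.com     # placeholder
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail                   # switch to Ignore only as an emergency bypass
  clientConfig:
    service:
      name: policy-webhook              # placeholder webhook Service
      namespace: policy-system          # placeholder
      path: /validate
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments"]
  namespaceSelector:                    # skip namespaces labeled to opt out
    matchExpressions:
    - key: policy.example.com/ignore
      operator: DoesNotExist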
Deployment Issues
Rollout Stuck
What it means: A Deployment rollout is not progressing – new pods are not becoming Ready or old pods are not being terminated.
Common causes:
- New pods are failing (CrashLoopBackOff, ImagePullBackOff, Pending).
- A PodDisruptionBudget is preventing old pods from being evicted.
- Resource quota in the namespace is exhausted (cannot create new ReplicaSet pods).
- progressDeadlineSeconds (default 600s) has not yet elapsed, so the Deployment has not yet been marked as failed.
How to diagnose:
kubectl rollout status deployment/<name> -n <namespace>
kubectl describe deployment <name> -n <namespace> # Check Conditions and Events
kubectl get rs -n <namespace> # Compare old vs new ReplicaSet
kubectl get pods -n <namespace> -l <selector> # What state are the new pods in?
kubectl get pdb -n <namespace> # Check PodDisruptionBudgets
kubectl get resourcequota -n <namespace>
Fix:
- Fix the underlying pod issue (image, config, resources) then let the rollout continue.
- If PDB is blocking: temporarily relax the PDB or scale up first (example below).
- If stuck and unrecoverable: kubectl rollout undo deployment/<name> -n <namespace>.
- If quota exceeded: increase the quota or delete unused resources.
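If a PodDisruptionBudget is part of the picture (for example, during a simultaneous node drain), a relaxed PDB that still protects availability might look like this; the name and labels are placeholders:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                         # placeholder
spec:
  maxUnavailable: 1                     # allow one pod at a time to be disrupted
  selector:
    matchLabels:
      app: web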
FailedCreate on ReplicaSet
What it means: The ReplicaSet controller cannot create new pods.
Common causes:
- Resource quota in the namespace is fully consumed.
- An admission webhook is rejecting pod creation.
- LimitRange in the namespace is setting constraints the pod spec violates.
- The ServiceAccount referenced by the pod does not exist.
How to diagnose:
kubectl describe rs <replicaset-name> -n <namespace> # Events will show the error
kubectl get resourcequota -n <namespace> -o yaml # Compare used vs hard
kubectl get limitrange -n <namespace> -o yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
Fix:
- If quota exhausted: increase the quota or reduce resource requests on the pods.
- If webhook rejecting: check webhook logs to understand the rejection reason.
- If LimitRange violation: adjust pod resource requests/limits to comply.
- If ServiceAccount missing: create it or correct the reference.
Useful Commands Cheat Sheet
# --- Inspecting Resources ---
kubectl get pods -n <ns> -o wide # Pod status with node and IP
kubectl describe pod <pod> -n <ns> # Full pod details and events
kubectl get events -n <ns> --sort-by='.lastTimestamp' # Recent events in namespace
kubectl get events -A --field-selector type=Warning # All warnings cluster-wide
# --- Logs ---
kubectl logs <pod> -n <ns> # Current container logs
kubectl logs <pod> -n <ns> --previous # Logs from last crashed container
kubectl logs <pod> -n <ns> -c <container> # Specific container in multi-container pod
kubectl logs -l app=<label> -n <ns> --tail=100 # Logs by label selector
# --- Interactive Debugging ---
kubectl exec -it <pod> -n <ns> -- /bin/sh # Shell into a running container
kubectl debug node/<node> -it --image=busybox # Debug node-level issues
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash # Ephemeral network debug pod
# --- Networking ---
kubectl get endpoints <svc> -n <ns> # Service endpoints
kubectl port-forward svc/<svc> 8080:80 -n <ns> # Forward service port to localhost
kubectl exec <pod> -n <ns> -- nslookup <svc> # Test DNS resolution from pod
# --- Resource Usage ---
kubectl top nodes # Node CPU and memory usage
kubectl top pods -n <ns> --sort-by=memory # Pod resource consumption
kubectl top pods -n <ns> --containers # Per-container resource usage
# --- Cluster State ---
kubectl get componentstatuses # Control plane health (deprecated but useful)
kubectl cluster-info dump | grep -i error # Dump cluster state and search for errors
kubectl api-resources # All available API resources
# --- Rollouts ---
kubectl rollout status deployment/<name> -n <ns> # Watch rollout progress
kubectl rollout history deployment/<name> -n <ns> # Rollout revision history
kubectl rollout undo deployment/<name> -n <ns> # Roll back to previous revision
Back to Table of Contents