Objective: Identify and resolve issues when Pods fail to start or terminate abnormally.
Estimated time: 30 minutes

## Scope of This Guide

Covers: Pod startup failures, CrashLoopBackOff, ImagePullBackOff, Readiness failures

Does not cover: Network connectivity issues (see Network Troubleshooting), performance degradation due to resource constraints (see Resource Optimization)

## Before You Begin

Verify the following prerequisites.
### 1. Verify kubectl Installation and Version

```bash
kubectl version --client
```

Success output:

```
Client Version: v1.28.0
```

**Version compatibility:** kubectl must be within ±1 minor version of the cluster version. Example: cluster 1.27 → kubectl 1.26–1.28 supported.
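The ±1 skew rule above can be expressed as a quick shell check; this is a minimal sketch, and the version strings are illustrative:

```shell
# Sketch: check client/server minor-version skew, assuming "v1.X.Y"
# version strings. The versions passed below are examples, not live output.
skew_ok() {
  local client_minor server_minor diff
  client_minor=$(echo "$1" | cut -d. -f2)
  server_minor=$(echo "$2" | cut -d. -f2)
  diff=$((client_minor - server_minor))
  if [ "$diff" -ge -1 ] && [ "$diff" -le 1 ]; then echo supported; else echo unsupported; fi
}

skew_ok v1.26.3 v1.27.0   # → supported (one minor version apart)
skew_ok v1.24.0 v1.27.0   # → unsupported (three minor versions apart)
```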
### 2. Verify Cluster Access

```bash
kubectl cluster-info
```

Success output:

```
Kubernetes control plane is running at https://xxx.xxx.xxx.xxx
```

### 3. Verify Permissions

```bash
kubectl auth can-i get pods
kubectl auth can-i get events
```

Success output (each):

```
yes
```

## Step 1: Check Pod Status
First, check the current status of the Pod.

```bash
kubectl get pods
```

Expected output:

```
NAME             READY   STATUS             RESTARTS   AGE
my-app-xxx-yyy   1/1     Running            0          5m
my-app-xxx-zzz   0/1     CrashLoopBackOff   3          2m
```

Based on the Pod status, proceed to the appropriate step.

| Status | Meaning | Next Step |
|---|---|---|
| Pending | Waiting for scheduling | Step 2: Resolve Pending |
| ImagePullBackOff | Image pull failure | Step 3: Resolve Image Issues |
| CrashLoopBackOff | Container repeatedly restarting | Step 4: Resolve Crashes |
| Running but not Ready | Health check failure | Step 5: Resolve Readiness |
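As a quick triage aid, the pods needing attention (not Running, or Running but not fully Ready) can be picked out of `kubectl get pods` output with a short filter. This sketch uses sample text in place of live cluster output:

```shell
# Sketch: list pods that are not Running, or Running but not fully Ready.
# The sample text mimics `kubectl get pods` output; against a real cluster
# you would pipe `kubectl get pods` into the filter instead.
sample='NAME             READY   STATUS             RESTARTS   AGE
my-app-xxx-yyy   1/1     Running            0          5m
my-app-xxx-zzz   0/1     CrashLoopBackOff   3          2m'

unhealthy() {
  awk 'NR > 1 { split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) print $1 }'
}

echo "$sample" | unhealthy   # → my-app-xxx-zzz
```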
## Step 2: Resolve Pending Status

The Pod is stuck in the Pending state, waiting to be scheduled onto a node.

### Diagnose the Cause

```bash
kubectl describe pod <pod-name>
```

Check the Events section, which appears near the end of the output.

### Common Causes and Solutions

#### Cause 1: Insufficient Resources

Event message:

```
Warning  FailedScheduling  Insufficient cpu/memory
```

Solution:

Check node resource usage.

```bash
kubectl top nodes
```

Reduce the Pod's resource requests, or add more nodes.

```yaml
resources:
  requests:
    memory: "256Mi"  # reduce
    cpu: "100m"
```
#### Cause 2: Node Selector Mismatch

Event message:

```
Warning  FailedScheduling  0/3 nodes are available: node(s) didn't match node selector
```

Solution:

Check node labels.

```bash
kubectl get nodes --show-labels
```

Verify the Pod's nodeSelector matches labels on at least one node.
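The match rule itself is a plain key=value lookup against a node's label list. A minimal sketch, using sample labels in place of real `--show-labels` output:

```shell
# Sketch: does a nodeSelector key=value pair appear in a node's label list?
# The label string below is illustrative sample data, not live output.
node_labels="kubernetes.io/arch=amd64,kubernetes.io/os=linux,disktype=ssd"

has_label() {
  case ",$node_labels," in
    *",$1,"*) echo match ;;
    *) echo "no match for $1" ;;
  esac
}

has_label disktype=ssd   # → match
has_label disktype=hdd   # → no match for disktype=hdd
```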
#### Cause 3: Waiting for PVC Binding

Event message:

```
Warning  FailedScheduling  persistentvolumeclaim "my-pvc" not found
```

Solution:

Check PVC status.

```bash
kubectl get pvc
```

Verify the PVC is in Bound status.

Success check: `kubectl get pod <pod-name>` shows a status other than Pending.
## Step 3: Resolve Image Issues

The Pod is in ImagePullBackOff or ErrImagePull status.

### Diagnose the Cause

```bash
kubectl describe pod <pod-name>
```

Check the Events section for image-related messages.

### Common Causes and Solutions

#### Cause 1: Image Name/Tag Typo

Event message:

```
Failed to pull image "ngninx:latest": rpc error: code = NotFound
```

Solution:

Verify the image name.

```bash
kubectl get deployment <name> -o jsonpath='{.spec.template.spec.containers[0].image}'
```

Correct the image name and tag (here, "ngninx" should be "nginx").
#### Cause 2: Private Registry Authentication Required

Event message:

```
Failed to pull image: unauthorized: authentication required
```

Solution:

Create an imagePullSecret.

```bash
kubectl create secret docker-registry my-registry-secret \
  --docker-server=registry.example.com \
  --docker-username=myuser \
  --docker-password=mypassword
```

Add the secret to the Pod spec (in a Deployment, this sits under `spec.template.spec`).

```yaml
spec:
  imagePullSecrets:
    - name: my-registry-secret
```
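Under the hood, the secret's `.dockerconfigjson` stores base64 of `username:password` in its `auth` field, which is useful to know when debugging a malformed secret. A minimal sketch of that encoding, using the example credentials above:

```shell
# Sketch: the auth field inside a docker-registry secret's .dockerconfigjson
# is base64("username:password"). Credentials are the example values above.
auth=$(printf '%s' "myuser:mypassword" | base64)
echo "$auth"   # → bXl1c2VyOm15cGFzc3dvcmQ=
```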
Success check: Status changes to Running in `kubectl get pod <pod-name>`.
## Step 4: Resolve CrashLoopBackOff

The container terminates immediately after starting and keeps restarting.

### Diagnose the Cause

First, check the previous container's logs.

```bash
kubectl logs <pod-name> --previous
```

**When logs are empty:** if the container terminates too quickly, the logs may be empty. In this case, check the exit code.

Check the exit code:

```bash
kubectl describe pod <pod-name>
```

Find the Last State section and check the Exit Code:

```
Last State:  Terminated
  Exit Code: 137
  Reason:    OOMKilled
```

### Solutions by Exit Code
| Exit Code | Meaning | Solution |
|---|---|---|
| 0 | Normal termination | Container command completed immediately. Verify ENTRYPOINT/CMD runs a long-running process. |
| 1 | Application error | Check logs for error messages. Review environment variables and configuration files. |
| 137 | OOM Killed | Increase memory limits. |
| 143 | SIGTERM | Normal termination signal. Check livenessProbe settings. |
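Exit codes above 128 follow a simple convention: the process was killed by signal number (code − 128). A minimal sketch of that decoding:

```shell
# Sketch: decode a container exit code. Codes above 128 mean the process
# was killed by a signal: 137 = 128 + 9 (SIGKILL, typically the OOM killer),
# 143 = 128 + 15 (SIGTERM).
decode_exit() {
  if [ "$1" -gt 128 ]; then
    echo "killed by signal $(($1 - 128))"
  else
    echo "application exit status $1"
  fi
}

decode_exit 137   # → killed by signal 9
decode_exit 143   # → killed by signal 15
decode_exit 1     # → application exit status 1
```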
### Resolving OOM Killed

```yaml
resources:
  limits:
    memory: "512Mi"  # increase from previous value
```

Success check: RESTARTS stops increasing and Running status is maintained in `kubectl get pod <pod-name>`.
## Step 5: Resolve Running but not Ready

The Pod is Running but READY shows 0/1.

### Diagnose the Cause

```bash
kubectl describe pod <pod-name>
```

Check the Conditions section for Ready: False.

Look for readiness probe failure messages in the Events section:

```
Warning  Unhealthy  Readiness probe failed: Get "http://10.x.x.x:8080/health": connection refused
```

### Common Causes and Solutions

#### Cause 1: Incorrect Endpoint

Solution:

Test the endpoint directly inside the Pod.

```bash
kubectl exec <pod-name> -- curl -s localhost:8080/health
```

Correct the endpoint path and port if they do not match what the application serves.
#### Cause 2: Insufficient Application Startup Time

Solution:

Increase initialDelaySeconds.

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30  # increase
  periodSeconds: 10
```

Success check: READY shows 1/1 in `kubectl get pod <pod-name>`.
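A readiness probe's timing follows directly from its settings: the first check fires after initialDelaySeconds, then repeats every periodSeconds, and the probe is considered failed after failureThreshold consecutive misses. A quick calculation with the values above, assuming the Kubernetes default failureThreshold of 3 since the manifest does not set it:

```shell
# Sketch: probe timing for the readinessProbe above. failureThreshold=3 is
# the Kubernetes default, assumed here because the manifest doesn't set it.
initial_delay=30
period=10
failure_threshold=3

echo "first probe at: ${initial_delay}s"
echo "probe considered failed after: $((initial_delay + failure_threshold * period))s"
```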
## Common Errors

### “error: pod not found”

Cause: The Pod name is incorrect, or the Pod is in a different namespace.

Solution:

```bash
# Search for the Pod in all namespaces
kubectl get pods --all-namespaces | grep <pod-name>

# Specify the correct namespace
kubectl get pod <pod-name> -n <namespace>
```

### “error: unable to upgrade connection”
Cause: Connection issue with the API server during kubectl exec.
Solution:
- Check network connectivity
- Check proxy settings
### “OCI runtime create failed: container_linux.go”

Cause: Container runtime issue.

Solution:

- Verify the image is compatible with the node's CPU architecture (amd64 vs arm64)
- Check the node's container runtime status
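Architecture mismatches are easier to spot once machine names are normalized: `uname -m` on a node reports names like `x86_64` or `aarch64`, while image manifests use `amd64`/`arm64`. A small mapping sketch:

```shell
# Sketch: normalize `uname -m` output to the architecture names used in
# container image manifests.
to_image_arch() {
  case "$1" in
    x86_64)        echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *)             echo "$1" ;;
  esac
}

to_image_arch x86_64    # → amd64
to_image_arch aarch64   # → arm64
```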
## Environment-Specific Notes

### Minikube

```bash
# Debug directly via Minikube SSH
minikube ssh

# When using local images
eval $(minikube docker-env)
docker images
```

### Amazon EKS

```bash
# Check logs in CloudWatch Logs
aws logs filter-log-events --log-group-name /aws/eks/<cluster>/cluster

# Check node status
kubectl get nodes -o wide
```

### Google GKE

```bash
# Check logs in Cloud Logging
gcloud logging read "resource.type=k8s_container"

# Check node pool status
gcloud container node-pools list --cluster <cluster-name>
```

## Checklist
### Startup Failures (Pending)

- Verified events with `kubectl describe pod`
- Verified node resources are sufficient (`kubectl top nodes`)
- Checked nodeSelector/tolerations settings

### Image Issues

- Verified image name and tag are correct
- Verified imagePullSecrets configuration (private registry)
- Verified registry accessibility

### Repeated Crashes

- Checked logs with `kubectl logs --previous`
- Checked exit code (137 = OOM)
- Verified environment variables/ConfigMap/Secret settings

### Not Ready

- Verified readiness probe endpoint and port
- Verified initialDelaySeconds is sufficient
- Checked connectivity to external dependencies (DB, etc.)
## Next Steps
| Goal | Recommended Document |
|---|---|
| Resolve network issues | Network Troubleshooting |
| Optimize resources | Resource Optimization |
| Configure health checks | Health Checks |
| Practice deployment | Spring Boot Deployment |