Objective: Identify and resolve issues when Pods fail to start or terminate abnormally.
Estimated Time: 30 minutes

Scope of This Guide

Covers: Pod startup failures, CrashLoopBackOff, ImagePullBackOff, Readiness failures

Does not cover: Network connectivity issues (see Network Troubleshooting), performance degradation due to resource constraints (see Resource Optimization)

Before You Begin#

Verify the following prerequisites.

1. Verify kubectl Installation and Version#

kubectl version --client

Success output:

Client Version: v1.28.0
Note (Version Compatibility): kubectl must be within ±1 minor version of the cluster version. Example: cluster 1.27 → kubectl 1.26–1.28 supported.
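The skew policy can be sketched as a small shell check (the `minor` and `skew_ok` helpers are illustrative, not kubectl features):

```shell
# Compare client and server minor versions; kubectl supports clusters
# within one minor version of itself, older or newer.
minor() { printf '%s' "$1" | cut -d. -f2; }

skew_ok() {
  # $1 = client version, $2 = server version, e.g. "v1.28.0" "v1.27.3"
  c=$(minor "$1"); s=$(minor "$2")
  d=$((c - s))
  [ "$d" -ge -1 ] && [ "$d" -le 1 ]
}

skew_ok v1.28.0 v1.27.3 && echo "compatible"       # within skew policy
skew_ok v1.28.0 v1.25.0 || echo "update kubectl"   # three minors apart
```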

2. Verify Cluster Access#

kubectl cluster-info

Success output:

Kubernetes control plane is running at https://xxx.xxx.xxx.xxx

3. Verify Permissions#

kubectl auth can-i get pods
kubectl auth can-i get events

Success output (each):

yes

Step 1: Check Pod Status#

First, check the current status of the Pod.

kubectl get pods

Expected output:

NAME                     READY   STATUS             RESTARTS   AGE
my-app-xxx-yyy           1/1     Running            0          5m
my-app-xxx-zzz           0/1     CrashLoopBackOff   3          2m

Based on Pod status, proceed to the appropriate step.

| Status | Meaning | Next Step |
| --- | --- | --- |
| Pending | Waiting for scheduling | Step 2: Resolve Pending |
| ImagePullBackOff | Image pull failure | Step 3: Resolve Image Issues |
| CrashLoopBackOff | Container repeatedly restarting | Step 4: Resolve Crashes |
| Running but not Ready | Health check failure | Step 5: Resolve Readiness |
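Assuming the output format shown above, problem Pods can be filtered with a short awk one-liner (demonstrated here on a captured sample; pipe live `kubectl get pods` output through the same filter):

```shell
# Sample capture of `kubectl get pods`; replace with live output in practice.
pods='NAME                     READY   STATUS             RESTARTS   AGE
my-app-xxx-yyy           1/1     Running            0          5m
my-app-xxx-zzz           0/1     CrashLoopBackOff   3          2m'

# Skip the header row and print any Pod whose STATUS is not Running.
printf '%s\n' "$pods" | awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
```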

Step 2: Resolve Pending Status#

The Pod is stuck in Pending status.

Diagnose the Cause#

kubectl describe pod <pod-name>

Check the Events section, which appears at the end of the output.

Common Causes and Solutions#

Cause 1: Insufficient Resources#

Event message:

Warning  FailedScheduling  Insufficient cpu/memory

Solution:

  1. Check node resource status.

    kubectl top nodes
  2. Reduce resource requests or add more nodes.

    resources:
      requests:
        memory: "256Mi"  # reduce
        cpu: "100m"

Cause 2: Node Selector Mismatch#

Event message:

Warning  FailedScheduling  0/3 nodes are available: node(s) didn't match node selector

Solution:

  1. Check node labels.

    kubectl get nodes --show-labels
  2. Verify Pod’s nodeSelector matches available labels.
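As an illustration (the disktype=ssd label is a hypothetical example), the Pod's nodeSelector must match a label that actually exists on a node:

```yaml
# Hypothetical example: this Pod only schedules on nodes labeled disktype=ssd.
# Verify the label exists with: kubectl get nodes --show-labels
spec:
  nodeSelector:
    disktype: ssd
```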

Cause 3: Waiting for PVC Binding#

Event message:

Warning  FailedScheduling  persistentvolumeclaim "my-pvc" not found

Solution:

  1. Check PVC status.

    kubectl get pvc
  2. Verify PVC is in Bound status.
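If the PVC referenced in the Pod spec does not exist, create it first. A minimal sketch, with the name taken from the event message and placeholder storage settings:

```yaml
# Minimal PVC sketch; adjust accessModes, storageClassName, and size
# to match your cluster's provisioner.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```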

Success check: Status is no longer Pending in kubectl get pod <pod-name>.


Step 3: Resolve Image Issues#

The Pod is in ImagePullBackOff or ErrImagePull status.

Diagnose the Cause#

kubectl describe pod <pod-name>

Check the Events section for image-related messages.

Common Causes and Solutions#

Cause 1: Image Name/Tag Typo#

Event message:

Failed to pull image "ngninx:latest": rpc error: code = NotFound

Solution:

  1. Verify image name.

    kubectl get deployment <name> -o jsonpath='{.spec.template.spec.containers[0].image}'
  2. Correct the image name and tag.

Cause 2: Private Registry Authentication Required#

Event message:

Failed to pull image: unauthorized: authentication required

Solution:

  1. Create imagePullSecret.

    kubectl create secret docker-registry my-registry-secret \
      --docker-server=registry.example.com \
      --docker-username=myuser \
      --docker-password=mypassword
  2. Add secret to Deployment.

    spec:
      imagePullSecrets:
      - name: my-registry-secret

Success check: Status changes to Running in kubectl get pod <pod-name>.


Step 4: Resolve CrashLoopBackOff#

Container terminates immediately after starting and keeps restarting.

Diagnose the Cause#

First, check previous container logs.

kubectl logs <pod-name> --previous

Note (When Logs Are Empty): If the container terminates too quickly, logs may be empty. In this case, check the Exit Code.

Check Exit Code:

kubectl describe pod <pod-name>

Find the Last State section and check the Exit Code:

Last State:     Terminated
  Exit Code:    137
  Reason:       OOMKilled

Solutions by Exit Code#

| Exit Code | Meaning | Solution |
| --- | --- | --- |
| 0 | Normal termination | Container command completed immediately. Verify ENTRYPOINT/CMD runs a long-running process. |
| 1 | Application error | Check logs for error messages. Review environment variables and configuration files. |
| 137 | OOM Killed | Increase memory limits. |
| 143 | SIGTERM | Normal termination signal. Check livenessProbe settings. |
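The 137 and 143 entries follow a general convention: an exit code above 128 means the container was killed by a signal, and the signal number is the exit code minus 128.

```shell
# Exit code above 128 = killed by signal; signal number = code - 128.
code=137
sig=$((code - 128))   # 137 - 128 = 9 (SIGKILL, sent by the OOM killer)
echo "killed by signal $sig"
# Likewise, 143 - 128 = 15 (SIGTERM, a graceful shutdown request)
```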

Resolving OOM Killed#

resources:
  limits:
    memory: "512Mi"  # increase from previous value
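Setting a request alongside the limit also helps: the scheduler uses the request to place the Pod on a node with enough free memory. The values below are examples to tune per workload:

```yaml
resources:
  requests:
    memory: "256Mi"   # reserved at scheduling time
  limits:
    memory: "512Mi"   # container is OOM-killed if it exceeds this
```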

Success check: RESTARTS stops increasing and Running status is maintained in kubectl get pod <pod-name>.


Step 5: Resolve Running but not Ready#

Pod is Running but Ready is 0/1.

Diagnose the Cause#

kubectl describe pod <pod-name>

Check the Conditions section for Ready: False.

Look for Readiness probe failure messages in Events section:

Warning  Unhealthy  Readiness probe failed: Get "http://10.x.x.x:8080/health": connection refused

Common Causes and Solutions#

Cause 1: Incorrect Endpoint#

Solution:

  1. Test the endpoint directly inside the Pod.

    kubectl exec <pod-name> -- curl -s localhost:8080/health
  2. Correct the endpoint path and port.

Cause 2: Insufficient Application Startup Time#

Solution:

Increase initialDelaySeconds.

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30  # increase
  periodSeconds: 10
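If startup time varies widely, a startupProbe may fit better than a large initialDelaySeconds: while a startup probe is configured, the other probes are disabled until it succeeds, so a slow start does not fail readiness checks and a fast start is not penalized. A sketch using the same endpoint:

```yaml
# Allows up to failureThreshold x periodSeconds = 300s for startup.
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```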

Success check: READY shows 1/1 in kubectl get pod <pod-name>.


Common Errors#

“error: pod not found”#

Cause: Pod name is incorrect or it’s in a different namespace.

Solution:

# Search for Pod in all namespaces
kubectl get pods --all-namespaces | grep <pod-name>

# Specify correct namespace
kubectl get pod <pod-name> -n <namespace>

“error: unable to upgrade connection”#

Cause: Connection issue with API server during kubectl exec.

Solution:

  • Check network connectivity
  • Check proxy settings

“OCI runtime create failed: container_linux.go”#

Cause: Container runtime issue.

Solution:

  • Verify image is compatible with current architecture (amd64 vs arm64)
  • Check node’s container runtime status

Environment-Specific Notes#

Minikube#

# Debug directly via Minikube SSH
minikube ssh

# When using local images
eval $(minikube docker-env)
docker images

Amazon EKS#

# Check logs in CloudWatch Logs
aws logs filter-log-events --log-group-name /aws/eks/<cluster>/cluster

# Check node status
kubectl get nodes -o wide

Google GKE#

# Check logs in Cloud Logging
gcloud logging read "resource.type=k8s_container"

# Check node pool status
gcloud container node-pools list --cluster <cluster-name>

Checklist#

Startup Failures (Pending)#

  • Verified events with kubectl describe pod
  • Verified node resources are sufficient (kubectl top nodes)
  • Checked nodeSelector/tolerations settings

Image Issues#

  • Verified image name and tag are correct
  • Verified imagePullSecrets configuration (private registry)
  • Verified registry accessibility

Repeated Crashes#

  • Checked logs with kubectl logs --previous
  • Checked Exit Code (137 = OOM)
  • Verified environment variables/ConfigMap/Secret settings

Not Ready#

  • Verified Readiness Probe endpoint and port
  • Verified initialDelaySeconds is sufficient
  • Checked external dependencies (DB, etc.) connectivity

Next Steps#

| Goal | Recommended Document |
| --- | --- |
| Resolve network issues | Network Troubleshooting |
| Optimize resources | Resource Optimization |
| Configure health checks | Health Checks |
| Practice deployment | Spring Boot Deployment |