Pod Failure Troubleshooting

This runbook provides comprehensive guidance for diagnosing and resolving Kubernetes pod failures.

Quick Diagnosis

Check Pod Status

# Get all pods in namespace
kubectl get pods -n <namespace>

# Get pods with issues
kubectl get pods -n <namespace> | grep -E "Error|CrashLoop|Pending|CreateContainer|ImagePull|Evicted"

# Get detailed pod information
kubectl describe pod <pod-name> -n <namespace>

Check Recent Events

# Get recent events sorted by timestamp
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -50

# Get events for specific pod
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

Check Pod Logs

# Get current container logs
kubectl logs <pod-name> -c <container-name> -n <namespace>

# Get previous container logs (if pod restarted)
kubectl logs <pod-name> -c <container-name> -n <namespace> --previous

# Follow logs in real-time
kubectl logs -f <pod-name> -c <container-name> -n <namespace>

CrashLoopBackOff

Symptoms:
  • Pod status: CrashLoopBackOff
  • Restart count increasing
  • Pod alternates between Running, Error (or Completed), and CrashLoopBackOff
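
The growing gap between restarts comes from the kubelet's restart back-off. As a rough sketch, assuming the default behaviour described in the Kubernetes documentation (delay starts at 10 seconds, doubles after each crash, and is capped at 5 minutes), the wait before each restart can be estimated as follows. The function name and parameters are illustrative and not part of any Kubernetes API.

```python
def crashloop_backoff_delays(restarts, base_seconds=10, cap_seconds=300):
    """Approximate wait (in seconds) before each of the first `restarts` restarts.

    Assumes the default kubelet back-off: start at base_seconds,
    double after every crash, never exceed cap_seconds.
    """
    delays = []
    delay = base_seconds
    for _ in range(restarts):
        delays.append(min(delay, cap_seconds))
        delay *= 2
    return delays


print(crashloop_backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

A pod that has already restarted five or six times may therefore sit in CrashLoopBackOff for several minutes before the next attempt, which is worth remembering when verifying a fix.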

Application Errors

# Check logs for application errors
kubectl logs <pod-name> -n <namespace> --previous | tail -100
Solutions:
  • Fix application code errors
  • Update configuration
  • Check environment variables

Filesystem Permissions (readOnlyRootFilesystem)

Error: java.nio.file.ReadOnlyFileSystemException or Permission denied
Root Cause: readOnlyRootFilesystem: true without proper volume mounts
Solution:
# Add writable volume mounts
volumeMounts:
- name: tmp
  mountPath: /tmp
- name: cache
  mountPath: /var/tmp
- name: app-data
  mountPath: /app/data  # Application-specific writable path

volumes:
- name: tmp
  emptyDir: {}
- name: cache
  emptyDir: {}
- name: app-data
  emptyDir: {}
Prevention:
  1. Test readOnlyRootFilesystem in development first
  2. Run validation script: python3 scripts/validate_gke_autopilot_compliance.py
  3. Monitor for filesystem errors in logs

Missing Dependencies

Error: ModuleNotFoundError, ImportError, or similar
Solutions:
  • Verify Docker image contains all dependencies
  • Check requirements.txt or package.json is complete
  • Rebuild and push updated image

Configuration Errors

Error: Configuration file not found or invalid
Solutions:
  • Verify ConfigMaps are created and mounted correctly
  • Check Secret references are correct
  • Validate configuration syntax

CreateContainerConfigError

Symptoms:
  • Pod status: CreateContainerConfigError
  • Container never starts
  • Pod stuck in this state

Missing Secrets/ConfigMaps

Error: secret "example-secret" not found or configmap "example-config" not found
Diagnosis:
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Events:"
Solutions:
# List secrets
kubectl get secrets -n <namespace>

# List configmaps
kubectl get configmaps -n <namespace>

# Create missing secret
kubectl create secret generic <secret-name> \
  --from-literal=key=value \
  -n <namespace>
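
To compare what a pod needs against the output of the list commands above, it can help to extract every Secret and ConfigMap name a pod spec refers to. The following is a minimal sketch, assuming the spec has already been loaded into a Python dict (for example from `kubectl get pod <pod-name> -n <namespace> -o json`, under the `spec` key). The function name is hypothetical, and projected volumes are not handled.

```python
def referenced_config_objects(pod_spec):
    """Return (secret_names, configmap_names) referenced by a pod spec dict."""
    secrets, configmaps = set(), set()
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        # Single keys pulled in through env[].valueFrom
        for env in container.get("env", []):
            source = env.get("valueFrom") or {}
            if "secretKeyRef" in source:
                secrets.add(source["secretKeyRef"]["name"])
            if "configMapKeyRef" in source:
                configmaps.add(source["configMapKeyRef"]["name"])
        # Whole objects pulled in through envFrom[]
        for env_from in container.get("envFrom", []):
            if "secretRef" in env_from:
                secrets.add(env_from["secretRef"]["name"])
            if "configMapRef" in env_from:
                configmaps.add(env_from["configMapRef"]["name"])
    # Objects mounted as volumes
    for volume in pod_spec.get("volumes", []):
        if "secret" in volume:
            secrets.add(volume["secret"]["secretName"])
        if "configMap" in volume:
            configmaps.add(volume["configMap"]["name"])
    return secrets, configmaps
```

Subtracting the names that actually exist in the namespace from the returned sets leaves the objects that still need to be created.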

Invalid Volume Mounts

Error: Volume mount path conflicts or invalid
Solutions:
  • Check volume mount paths don’t overlap
  • Verify volume names match between volumes and volumeMounts
  • Ensure mount paths are absolute
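
The three checks above can be sketched as a small function over a pod spec dict. This is an illustrative helper, not an existing tool: the name `volume_mount_problems` is an assumption, and "overlap" is simplified here to the same mountPath appearing twice in one container.

```python
def volume_mount_problems(pod_spec):
    """Return human-readable problems found in a pod spec's volumeMounts."""
    problems = []
    defined = {volume["name"] for volume in pod_spec.get("volumes", [])}
    for container in pod_spec.get("containers", []):
        seen_paths = set()
        for mount in container.get("volumeMounts", []):
            name, path = mount["name"], mount["mountPath"]
            if name not in defined:
                problems.append(f"{container['name']}: volume '{name}' is not defined")
            if not path.startswith("/"):
                problems.append(f"{container['name']}: mountPath '{path}' is not absolute")
            if path in seen_paths:
                problems.append(f"{container['name']}: mountPath '{path}' is used twice")
            seen_paths.add(path)
    return problems
```

An empty list means none of these three problems were detected; it does not guarantee the spec is otherwise valid.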

ImagePullBackOff / ErrImagePull

Symptoms:
  • Pod status: ImagePullBackOff or ErrImagePull
  • Image cannot be pulled from registry

Image Does Not Exist

Solutions:
  • Verify image tag is correct
  • Check image exists in registry:
    docker pull <image>:<tag>
    
  • Update deployment with correct image tag

Authentication Failure

Solutions:
# Create image pull secret
kubectl create secret docker-registry <secret-name> \
  --docker-server=<registry> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email> \
  -n <namespace>

# Add to ServiceAccount
kubectl patch serviceaccount <sa-name> \
  -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}' \
  -n <namespace>

Pending (Insufficient Resources)

Symptoms:
  • Pod status: Pending
  • Events show: Insufficient cpu or Insufficient memory
Diagnosis:
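# Check current node usage (requires metrics-server)
kubectl top nodes

# Check what is already requested on each node (scheduling uses requests, not usage)
kubectl describe nodes | grep -A 8 "Allocated resources:"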

# Check pod resource requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Requests:"

# Check resource quotas
kubectl get resourcequota -n <namespace>
Solutions:

Reduce Resource Requests

resources:
  requests:
    cpu: 100m        # Reduce from 500m
    memory: 128Mi    # Reduce from 512Mi

Scale Up Cluster

# GKE Standard only: Autopilot provisions nodes automatically from pod
# requests and does not support manual resizing
gcloud container clusters resize <cluster-name> \
  --node-pool=<node-pool-name> \
  --num-nodes=<new-count> \
  --zone=<zone>

GKE Autopilot Specific Errors

CPU Limit/Request Ratio Violation

Error: cpu max limit to request ratio per Container is 4, but provided ratio is X
Root Cause: GKE Autopilot enforces max ratio of 4.0
Diagnosis:
# Calculate ratio
# ratio = cpu_limit / cpu_request
# Example: 1000m / 200m = 5.0 (FAILS)
Solution:
resources:
  requests:
    cpu: 250m      # Increased from 200m
  limits:
    cpu: 1000m
# New ratio: 1000/250 = 4.0 ✅
Prevention:
  • Run validation: python3 scripts/validate_gke_autopilot_compliance.py
  • Pre-commit hook automatically checks
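
The ratio arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch and not the logic of validate_gke_autopilot_compliance.py, whose implementation is not shown here. The helper names are hypothetical, and only the two common CPU quantity forms (millicores such as 250m, and plain cores such as 1 or 0.5) are handled.

```python
def cpu_millicores(quantity):
    """Convert a CPU quantity such as '250m', '1' or '0.5' to millicores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cpu_ratio(request, limit):
    """Limit-to-request ratio, the value the error message reports."""
    return cpu_millicores(limit) / cpu_millicores(request)


def min_cpu_request_millicores(limit, max_ratio=4):
    """Smallest request (millicores) that keeps limit / request <= max_ratio."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-cpu_millicores(limit) // max_ratio)


print(cpu_ratio("200m", "1000m"))           # 5.0, rejected
print(cpu_ratio("250m", "1000m"))           # 4.0, accepted
print(min_cpu_request_millicores("1000m"))  # 250
```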

Memory Limit/Request Ratio Violation

Error: memory max limit to request ratio per Container is 4, but provided ratio is X
Solution: Same as CPU: raise the memory request or lower the memory limit until limit / request ≤ 4.0
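
For memory the same check needs unit conversion first. The sketch below assumes quantities use the usual binary (Ki, Mi, Gi, Ti) or decimal (k, M, G, T) suffixes, or are plain byte counts; exponent notation is not handled. The helper names are illustrative only.

```python
_MEMORY_UNITS = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4,
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3, "T": 1000 ** 4,
}


def memory_bytes(quantity):
    """Convert a memory quantity such as '128Mi' or '1Gi' to bytes."""
    quantity = str(quantity)
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already a byte count


def memory_ratio(request, limit):
    """Limit-to-request ratio for two memory quantities."""
    return memory_bytes(limit) / memory_bytes(request)


print(memory_ratio("128Mi", "512Mi"))  # 4.0, accepted
print(memory_ratio("128Mi", "1Gi"))    # 8.0, rejected
```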

Liveness/Readiness Probe Failures

Symptoms:
  • Pod restarts frequently (liveness failures) or never becomes Ready (readiness failures)
  • Events show: Liveness probe failed or Readiness probe failed
Diagnosis:
# Check probe configuration
kubectl get deployment <deployment> -n <namespace> -o yaml | grep -A 10 "livenessProbe:"

Probe Endpoint Not Available

Error: connection refused or 404 Not Found
Solutions:
  • Verify health endpoint exists and is accessible
  • Check port numbers match
  • Ensure health endpoint doesn’t require authentication

Probe Timeout Too Short

Solutions:
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60    # Increase if app starts slowly
  periodSeconds: 30
  timeoutSeconds: 10          # Increase if endpoint is slow
  failureThreshold: 3
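
To judge whether these settings give the application enough time, it helps to estimate how long a container that never becomes healthy survives before the kubelet restarts it. The calculation below is a rough approximation under stated assumptions: the first probe fires after initialDelaySeconds, probes repeat every periodSeconds, and the restart is triggered on the failureThreshold-th consecutive failure. Kubelet jitter and the termination grace period are ignored, and the function name is hypothetical.

```python
def seconds_until_restart(initial_delay, period, failure_threshold, timeout=0):
    """Approximate seconds from container start to a liveness-triggered restart.

    Pass timeout > 0 when probes fail by timing out rather than failing
    immediately (for example with 'connection refused').
    """
    return initial_delay + (failure_threshold - 1) * period + timeout


# Values from the probe configuration above
print(seconds_until_restart(60, 30, 3))              # 120, probes fail immediately
print(seconds_until_restart(60, 30, 3, timeout=10))  # 130, every probe times out
```

If the application routinely needs longer than this to start, increase initialDelaySeconds or failureThreshold rather than timeoutSeconds alone.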

Missing Health Check Extension

Error (OTEL Collector): connection refused on port 13133
Root Cause: health_check extension not enabled
Solution:
# In OTEL Collector config
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]  # Must be enabled!
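
A common form of this mistake is defining the extension but forgetting to list it under service.extensions. The sketch below flags that case. It assumes the collector config has already been parsed into a dict (for instance with PyYAML's yaml.safe_load, which is a third-party package and not imported here), and the function name is illustrative.

```python
def disabled_extensions(config):
    """Return extensions that are configured but not listed in service.extensions."""
    configured = set(config.get("extensions") or {})
    enabled = set((config.get("service") or {}).get("extensions") or [])
    return sorted(configured - enabled)


broken = {
    "extensions": {"health_check": {"endpoint": "0.0.0.0:13133"}},
    "service": {"pipelines": {}},
}
print(disabled_extensions(broken))  # ['health_check']
```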

Step-by-Step Troubleshooting

Step 1: Identify the Problem

# Get overview of all pods
kubectl get pods -n <namespace>

# Identify problematic pods
kubectl get pods -n <namespace> | grep -vE "Running|Completed"
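
When the grep output is too coarse, the same triage can be done on structured data. The following is a sketch that assumes the output of `kubectl get pods -n <namespace> -o json` has been loaded into a dict (for example with json.load). The function name is hypothetical, and pods that are simply still starting (ContainerCreating, PodInitializing) are reported too, just as the grep above would show them.

```python
def problem_pods(pod_list):
    """Return (pod name, reason) pairs for pods that do not look healthy."""
    results = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        reason = None
        statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
        # A waiting container carries the most specific reason, e.g. CrashLoopBackOff
        for container_status in statuses:
            waiting = container_status.get("state", {}).get("waiting")
            if waiting:
                reason = waiting.get("reason", "Waiting")
                break
        # Pod-level reason, e.g. Evicted
        if reason is None and status.get("reason"):
            reason = status["reason"]
        # Fall back to the phase for anything not Running or Succeeded
        if reason is None and status.get("phase") not in ("Running", "Succeeded"):
            reason = status.get("phase")
        if reason:
            results.append((name, reason))
    return results
```

The returned reasons map directly onto the section headings of this runbook (CrashLoopBackOff, CreateContainerConfigError, ImagePullBackOff, Pending).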

Step 2: Gather Information

# Describe pod for detailed info
kubectl describe pod <pod-name> -n <namespace>

# Check logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -50

Step 3: Analyze Root Cause

Use the patterns above to match symptoms to common causes.

Step 4: Apply Fix

Follow solution steps for the identified problem.

Step 5: Verify Fix

# Watch pod status
kubectl get pods -n <namespace> -w

# Check if pod is running and ready
kubectl get pod <pod-name> -n <namespace>

# Verify no restarts
kubectl describe pod <pod-name> -n <namespace> | grep "Restart Count:"
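
For a scripted version of this check, readiness and restart counts can be read from the pod's JSON. This is a minimal sketch, assuming a single pod object as returned by `kubectl get pod <pod-name> -n <namespace> -o json` and loaded into a dict. The function name is illustrative.

```python
def restart_summary(pod):
    """Map each container name to a (ready, restart_count) tuple."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return {
        status["name"]: (status.get("ready", False), status.get("restartCount", 0))
        for status in statuses
    }


def fix_verified(before, after):
    """True if every container is ready and no restart count grew between snapshots."""
    return all(
        ready and restarts <= before.get(name, (False, 0))[1]
        for name, (ready, restarts) in after.items()
    )
```

Taking one snapshot right after the fix and another a few minutes later shows whether the restart count has stopped climbing, which matters because the count itself never resets for an existing pod.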

Step 6: Clean Up

# Delete old ReplicaSets (keep last 2-3 for rollback)
kubectl delete replicaset <old-replicaset-name> -n <namespace>

# Scale down deployments if needed
kubectl scale deployment <deployment> --replicas=<count> -n <namespace>

Prevention Checklist

Before Deployment

  • Run validation script:
    python3 scripts/validate_gke_autopilot_compliance.py deployments/overlays/<overlay>
    
  • Run regression tests:
    pytest tests/regression/test_pod_deployment_regression.py -v
    
  • Verify kustomize builds:
    kubectl kustomize deployments/overlays/<overlay> | kubectl apply --dry-run=client -f -
    
  • Check resource ratios (GKE Autopilot):
    • CPU ratio ≤ 4.0
    • Memory ratio ≤ 4.0
  • Verify no environment variable conflicts:
    • No env vars with both value and valueFrom
  • Test in development/staging first
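
The environment variable check in the list above is easy to automate. The sketch below assumes a pod spec dict (for example the spec.template.spec of a rendered Deployment) and reports every variable that sets both value and valueFrom, which the API server rejects. The function name is hypothetical, and this is not necessarily how the repository's validation script performs the check.

```python
def conflicting_env_vars(pod_spec):
    """Return (container name, env var name) pairs that set both value and valueFrom."""
    conflicts = []
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        for env in container.get("env", []):
            if "value" in env and "valueFrom" in env:
                conflicts.append((container["name"], env["name"]))
    return conflicts


spec = {
    "containers": [{
        "name": "app",
        "env": [
            {"name": "LOG_LEVEL", "value": "info"},
            {"name": "API_KEY", "value": "x",
             "valueFrom": {"secretKeyRef": {"name": "example-secret", "key": "key"}}},
        ],
    }]
}
print(conflicting_env_vars(spec))  # [('app', 'API_KEY')]
```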

After Deployment

  • Monitor pod status for 10 minutes
  • Check logs for errors
  • Verify metrics are being collected
  • Test application functionality
  • Set up alerts for pod failures

Quick Reference Commands

Essential Commands

# Get pod status
kubectl get pods -n <namespace>

# Describe pod
kubectl describe pod <pod-name> -n <namespace>

# Get logs
kubectl logs <pod-name> -c <container> -n <namespace>

# Get previous logs (after restart)
kubectl logs <pod-name> -c <container> -n <namespace> --previous

# Get events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Delete pod (its controller, e.g. a Deployment, recreates it; a bare pod stays deleted)
kubectl delete pod <pod-name> -n <namespace>

# Rollout restart deployment
kubectl rollout restart deployment/<deployment> -n <namespace>

# Rollback deployment
kubectl rollout undo deployment/<deployment> -n <namespace>

# Scale deployment
kubectl scale deployment/<deployment> --replicas=<count> -n <namespace>

Resource Commands

# Check node resources
kubectl top nodes

# Check pod resources
kubectl top pods -n <namespace>

# Get resource quotas
kubectl get resourcequota -n <namespace>

# Get limit ranges
kubectl get limitrange -n <namespace>

Debugging Commands

# Execute command in pod
kubectl exec -it <pod-name> -c <container> -n <namespace> -- /bin/sh

# Port forward to pod
kubectl port-forward <pod-name> <local-port>:<pod-port> -n <namespace>

# Copy files from pod
kubectl cp <namespace>/<pod-name>:<path> <local-path>