Pod Failure Troubleshooting
This runbook provides comprehensive guidance for diagnosing and resolving Kubernetes pod failures.
Quick Diagnosis
Check Pod Status
Check Recent Events
Check Pod Logs
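The three checks above usually start from the pod object itself. As a rough sketch, assuming the pod has been exported with `kubectl get pod <pod-name> -o json` and parsed into a dict, the per-container state, reason, and restart count can be summarized as follows. The function name `summarize_pod` is hypothetical and not part of this repository.

```python
from __future__ import annotations


def summarize_pod(pod: dict) -> list[dict]:
    """Return one summary row per container from a parsed pod object."""
    rows = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        # A container state holds a single key: waiting, running, or terminated.
        phase = next(iter(state), "unknown")
        reason = state.get(phase, {}).get("reason")
        rows.append(
            {
                "container": cs.get("name"),
                "state": phase,
                "reason": reason,  # e.g. CrashLoopBackOff, ImagePullBackOff
                "restarts": cs.get("restartCount", 0),
            }
        )
    return rows
```

A waiting container whose reason is CrashLoopBackOff, CreateContainerConfigError, or ImagePullBackOff points to the matching section below.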
CrashLoopBackOff
Symptoms:
- Pod status: CrashLoopBackOff
- Restart count increasing
- Pod alternates between Running and Terminating
Application Errors
- Fix application code errors
- Update configuration
- Check environment variables
Filesystem Permissions (readOnlyRootFilesystem)
Error: java.nio.file.ReadOnlyFileSystemException or Permission denied
Root Cause: readOnlyRootFilesystem: true without proper volume mounts
Solution:
- Test readOnlyRootFilesystem in development first
- Run validation script: python3 scripts/validate_gke_autopilot_compliance.py
- Monitor for filesystem errors in logs
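Since the root cause is a read-only root filesystem without writable mounts, one way to spot affected containers before deploying is to scan the pod spec. The sketch below is an illustration only and is not the repository's validation script; the function name is hypothetical, and it assumes that any mount not marked readOnly (for example an emptyDir at a path the application writes to) counts as writable.

```python
from __future__ import annotations


def containers_without_writable_mounts(pod_spec: dict) -> list[str]:
    """Names of containers with readOnlyRootFilesystem: true and no writable mount."""
    flagged = []
    for container in pod_spec.get("containers", []):
        security = container.get("securityContext") or {}
        if not security.get("readOnlyRootFilesystem"):
            continue  # root filesystem is writable, nothing to check
        writable = [
            m for m in container.get("volumeMounts", [])
            if not m.get("readOnly", False)
        ]
        if not writable:
            flagged.append(container["name"])
    return flagged
```

A flagged container is only a candidate: an application that never writes to disk can run without any writable mount.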
Missing Dependencies
Error: ModuleNotFoundError, ImportError, or similar
Solutions:
- Verify Docker image contains all dependencies
- Check requirements.txt or package.json is complete
- Rebuild and push updated image
Configuration Errors
Error: Configuration file not found or invalid
Solutions:
- Verify ConfigMaps are created and mounted correctly
- Check Secret references are correct
- Validate configuration syntax
CreateContainerConfigError
Symptoms:
- Pod status: CreateContainerConfigError
- Container never starts
- Pod stuck in this state
Missing Secrets/ConfigMaps
Error: secret "example-secret" not found or configmap "example-config" not found
Diagnosis:
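One possible way to narrow this down is to list every Secret and ConfigMap the pod spec refers to and compare that against what exists in the namespace. The sketch below assumes a pod spec parsed into a dict and sets of existing object names collected separately; `referenced_objects` and `missing_objects` are hypothetical helper names.

```python
from __future__ import annotations


def referenced_objects(pod_spec: dict) -> dict[str, set[str]]:
    """Collect Secret and ConfigMap names referenced by volumes, env, and envFrom."""
    refs: dict[str, set[str]] = {"secrets": set(), "configmaps": set()}
    for volume in pod_spec.get("volumes", []):
        if "secret" in volume:
            refs["secrets"].add(volume["secret"]["secretName"])
        if "configMap" in volume:
            refs["configmaps"].add(volume["configMap"]["name"])
    for container in pod_spec.get("containers", []):
        for env in container.get("env", []):
            source = env.get("valueFrom", {})
            if "secretKeyRef" in source:
                refs["secrets"].add(source["secretKeyRef"]["name"])
            if "configMapKeyRef" in source:
                refs["configmaps"].add(source["configMapKeyRef"]["name"])
        for env_from in container.get("envFrom", []):
            if "secretRef" in env_from:
                refs["secrets"].add(env_from["secretRef"]["name"])
            if "configMapRef" in env_from:
                refs["configmaps"].add(env_from["configMapRef"]["name"])
    return refs


def missing_objects(pod_spec: dict, existing_secrets: set[str],
                    existing_configmaps: set[str]) -> dict[str, set[str]]:
    """Referenced names that are absent from the namespace."""
    refs = referenced_objects(pod_spec)
    return {
        "secrets": refs["secrets"] - existing_secrets,
        "configmaps": refs["configmaps"] - existing_configmaps,
    }
```

The sketch does not distinguish references marked optional, which do not block container creation.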
Invalid Volume Mounts
Error: Volume mount path conflicts or invalid
Solutions:
- Check volume mount paths don’t overlap
- Verify volume names match between volumes and volumeMounts
- Ensure mount paths are absolute
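The three checks above can be approximated offline against a parsed pod spec. This is a minimal sketch under simplifying assumptions: it treats only identical mount paths within one container as a conflict (nested paths can be legitimate), and `volume_mount_problems` is a hypothetical name.

```python
from __future__ import annotations


def volume_mount_problems(pod_spec: dict) -> list[str]:
    """Describe undeclared volumes, relative mount paths, and duplicate mount paths."""
    problems = []
    declared = {v["name"] for v in pod_spec.get("volumes", [])}
    for container in pod_spec.get("containers", []):
        seen_paths: set[str] = set()
        for mount in container.get("volumeMounts", []):
            name, path = mount["name"], mount["mountPath"]
            if name not in declared:
                problems.append(f"{container['name']}: volume '{name}' is not declared in volumes")
            if not path.startswith("/"):
                problems.append(f"{container['name']}: mountPath '{path}' is not absolute")
            if path in seen_paths:
                problems.append(f"{container['name']}: mountPath '{path}' is used more than once")
            seen_paths.add(path)
    return problems
```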
ImagePullBackOff / ErrImagePull
Symptoms:
- Pod status: ImagePullBackOff or ErrImagePull
- Image cannot be pulled from registry
Image Does Not Exist
Solutions:
- Verify image tag is correct
- Check image exists in registry:
- Update deployment with correct image tag
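When verifying a tag, it can help to split the image reference first, because a colon may belong to a registry port rather than a tag. The following is a simplified sketch, not a full reference parser; `split_image_ref` is a hypothetical helper.

```python
from __future__ import annotations


def split_image_ref(image: str) -> tuple[str, str | None, str | None]:
    """Split an image reference into (repository, tag, digest)."""
    name, _, digest = image.partition("@")
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    # A colon counts as a tag separator only when it follows the last slash;
    # otherwise it is part of a registry host:port.
    if colon > last_slash:
        repository, tag = name[:colon], name[colon + 1:]
    else:
        repository, tag = name, None
    return repository, tag, digest or None
```

A reference with neither tag nor digest resolves to the latest tag by default, which is a common source of pulling an unexpected or missing image.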
Authentication Failure
Solutions:
Pending (Insufficient Resources)
Symptoms:
- Pod status: Pending
- Events show: Insufficient cpu or Insufficient memory
Reduce Resource Requests
Scale Up Cluster
GKE Autopilot Specific Errors
CPU Limit/Request Ratio Violation
Error: cpu max limit to request ratio per Container is 4, but provided ratio is X
Root Cause: GKE Autopilot enforces max ratio of 4.0
Diagnosis:
- Run validation: python3 scripts/validate_gke_autopilot_compliance.py
- Pre-commit hook automatically checks
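To see what ratio a given container would produce, the limit can be divided by the request after normalizing both to cores. The sketch below is an illustration and not the repository's validation script; `parse_cpu` and `cpu_ratio` are hypothetical names, and only plain numbers and the millicore suffix `m` are handled.

```python
from __future__ import annotations

from decimal import Decimal

MAX_RATIO = Decimal("4.0")  # ratio cap quoted in the error message above


def parse_cpu(quantity: str) -> Decimal:
    """Convert a CPU quantity such as "500m", "2", or "0.5" to cores."""
    quantity = str(quantity).strip()
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def cpu_ratio(resources: dict) -> Decimal:
    """Limit-to-request ratio for a container's resources block."""
    return parse_cpu(resources["limits"]["cpu"]) / parse_cpu(resources["requests"]["cpu"])
```

For example, a request of 100m with a limit of 500m gives a ratio of 5 and would be rejected, while a request of 250m with a limit of 1 gives exactly 4 and passes.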
Memory Limit/Request Ratio Violation
Error: memory max limit to request ratio per Container is 4, but provided ratio is X
Solution: Same as CPU - ensure ratio ≤ 4.0
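Memory quantities mix binary and decimal suffixes, so both values should be converted to bytes before the ratio is computed. A minimal sketch, assuming only the common suffixes listed in the table; `parse_memory` and `memory_ratio` are hypothetical names.

```python
from __future__ import annotations

from decimal import Decimal

# Common suffixes only; two-letter binary suffixes are listed first.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "128Mi", "1Gi", or "129e6" to bytes."""
    quantity = str(quantity).strip()
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # plain or exponent notation, already in bytes


def memory_ratio(resources: dict) -> float:
    """Limit-to-request ratio for a container's resources block."""
    return parse_memory(resources["limits"]["memory"]) / parse_memory(resources["requests"]["memory"])
```

A request of 128Mi with a limit of 512Mi yields 4.0 and stays within the cap; raising the limit to 1Gi yields 8.0 and would be rejected.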
Liveness/Readiness Probe Failures
Symptoms:
- Pod restarts frequently
- Events show: Liveness probe failed or Readiness probe failed
Probe Endpoint Not Available
Error: connection refused or 404 Not Found
Solutions:
- Verify health endpoint exists and is accessible
- Check port numbers match
- Ensure health endpoint doesn’t require authentication
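The port check in particular lends itself to a quick offline comparison between each probe and the container's declared ports. This is a sketch only, with a hypothetical function name. Note the assumption: declaring containerPort is informational in Kubernetes, so a mismatch here is a hint to investigate, not proof of a broken probe.

```python
from __future__ import annotations


def probe_port_mismatches(container: dict) -> list[str]:
    """List probes whose target port is not among the container's declared ports."""
    ports = container.get("ports", [])
    numbers = {p["containerPort"] for p in ports}
    names = {p["name"] for p in ports if "name" in p}
    issues = []
    for kind in ("livenessProbe", "readinessProbe", "startupProbe"):
        probe = container.get(kind)
        if not probe:
            continue
        for handler in ("httpGet", "tcpSocket", "grpc"):
            if handler not in probe:
                continue
            port = probe[handler].get("port")
            # Probes may reference a port by number or by its declared name.
            known = names if isinstance(port, str) else numbers
            if port not in known:
                issues.append(f"{kind}: port {port!r} is not declared in ports")
    return issues
```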
Probe Timeout Too Short
Solutions:
Missing Health Check Extension
Error (OTEL Collector): connection refused on port 13133
Root Cause: health_check extension not enabled
Solution:
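In an OpenTelemetry Collector configuration, an extension has to be declared under the top-level extensions key and also listed under service.extensions before it starts. As a rough check, assuming the collector YAML has already been parsed into a dict, the sketch below reports whether health_check satisfies both conditions; the function name is hypothetical.

```python
from __future__ import annotations


def _is_health_check(name: str) -> bool:
    # Matches the bare extension and named instances such as "health_check/1".
    return name == "health_check" or name.startswith("health_check/")


def health_check_enabled(config: dict) -> bool:
    """True when a health_check extension is both declared and enabled in service."""
    declared = {n for n in (config.get("extensions") or {}) if _is_health_check(n)}
    service = config.get("service") or {}
    enabled = {n for n in (service.get("extensions") or []) if _is_health_check(n)}
    return bool(declared & enabled)
```

If this returns False, the probe on port 13133 has nothing to connect to, which matches the connection refused error above.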
Step-by-Step Troubleshooting
Step 1: Identify the Problem
Step 2: Gather Information
Step 3: Analyze Root Cause
Use the patterns above to match symptoms to common causes.
Step 4: Apply Fix
Follow solution steps for the identified problem.
Step 5: Verify Fix
Step 6: Clean Up
Prevention Checklist
Before Deployment
- Run validation script: python3 scripts/validate_gke_autopilot_compliance.py
- Run regression tests
- Verify kustomize builds
- Check resource ratios (GKE Autopilot):
  - CPU ratio ≤ 4.0
  - Memory ratio ≤ 4.0
- Verify no environment variable conflicts:
  - No env vars with both value and valueFrom
- Test in development/staging first
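The environment variable check in this list is easy to automate against a parsed pod spec. A minimal sketch with a hypothetical function name; it is not the repository's validation script.

```python
from __future__ import annotations


def env_conflicts(pod_spec: dict) -> list[tuple[str, str]]:
    """Return (container, variable) pairs where an env var sets both value and valueFrom."""
    conflicts = []
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        for env in container.get("env", []):
            if "value" in env and "valueFrom" in env:
                conflicts.append((container["name"], env["name"]))
    return conflicts
```

An empty result means no variable mixes a literal value with a reference.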
After Deployment
- Monitor pod status for 10 minutes
- Check logs for errors
- Verify metrics are being collected
- Test application functionality
- Set up alerts for pod failures