Pod Failure Troubleshooting
This runbook provides comprehensive guidance for diagnosing and resolving Kubernetes pod failures.
Quick Diagnosis
Check Pod Status
# Get all pods in namespace
kubectl get pods -n <namespace>
# Get pods with issues
kubectl get pods -n <namespace> | grep -E "Error|CrashLoop|Pending|CreateContainer|ImagePull|Evicted"
# Get detailed pod information
kubectl describe pod <pod-name> -n <namespace>
Check Recent Events
# Get recent events sorted by timestamp
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -50
# Get events for specific pod
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
Check Pod Logs
# Get current container logs
kubectl logs <pod-name> -c <container-name> -n <namespace>
# Get previous container logs (if pod restarted)
kubectl logs <pod-name> -c <container-name> -n <namespace> --previous
# Follow logs in real-time
kubectl logs -f <pod-name> -c <container-name> -n <namespace>
CrashLoopBackOff
Symptoms:
- Pod status: CrashLoopBackOff
- Restart count steadily increasing
- Pod alternates between Running and CrashLoopBackOff
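A quick way to spot crash-looping pods is to filter on the STATUS and RESTARTS columns of `kubectl get pods`. A minimal sketch over hypothetical output (pod names and values are made up):

```shell
# Hypothetical `kubectl get pods -n <namespace>` output
sample='NAME                      READY   STATUS             RESTARTS   AGE
api-7d4b9c6f5d-x2kq8      0/1     CrashLoopBackOff   12         45m
worker-5f6d8b7c9-p3lm2    1/1     Running            0          2h'
# Flag pods that are crash-looping or have restarted more than 5 times
# (column 3 is STATUS, column 4 is RESTARTS)
echo "$sample" | awk 'NR > 1 && ($3 == "CrashLoopBackOff" || $4 + 0 > 5)'
```

In a live cluster, pipe the real `kubectl get pods -n <namespace>` output into the same awk filter.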
Application Errors
# Check logs for application errors
kubectl logs <pod-name> -n <namespace> --previous | tail -100
Solutions:
- Fix application code errors
- Update configuration
- Check environment variables
Filesystem Permissions (readOnlyRootFilesystem)
Error: java.nio.file.ReadOnlyFileSystemException or Permission denied
Root Cause: readOnlyRootFilesystem: true without proper volume mounts
Solution:
# Add writable volume mounts
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /var/tmp
- name: app-data
mountPath: /app/data # Application-specific writable path
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir: {}
- name: app-data
emptyDir: {}
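After adding the mounts, you can confirm each path is writable from inside the container, e.g. `kubectl exec <pod-name> -n <namespace> -- sh -c 'touch /tmp/.write-test'`. The same check, demonstrated locally with a temp directory standing in for the pod's /tmp:

```shell
# A local temp dir stands in for /tmp inside the pod
dir=$(mktemp -d)
if touch "$dir/.write-test" 2>/dev/null; then
  echo "writable: $dir"
else
  echo "read-only: $dir"
fi
rm -rf "$dir"
```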
Prevention:
- Test readOnlyRootFilesystem in development first
- Run validation script: python3 scripts/validate_gke_autopilot_compliance.py
- Monitor for filesystem errors in logs
Missing Dependencies
Error: ModuleNotFoundError, ImportError, or similar
Solutions:
- Verify Docker image contains all dependencies
- Check requirements.txt or package.json is complete
- Rebuild and push updated image
Configuration Errors
Error: Configuration file not found or invalid
Solutions:
- Verify ConfigMaps are created and mounted correctly
- Check Secret references are correct
- Validate configuration syntax
CreateContainerConfigError
Symptoms:
- Pod status: CreateContainerConfigError
- Container never starts
- Pod stuck in this state
Missing Secrets/ConfigMaps
Error: secret "example-secret" not found or configmap "example-config" not found
Diagnosis:
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Events:"
Solutions:
# List secrets
kubectl get secrets -n <namespace>
# List configmaps
kubectl get configmaps -n <namespace>
# Create missing secret
kubectl create secret generic <secret-name> \
--from-literal=key=value \
-n <namespace>
Invalid Volume Mounts
Error: Volume mount path conflicts or invalid
Solutions:
- Check volume mount paths don’t overlap
- Verify volume names match between volumes and volumeMounts
- Ensure mount paths are absolute
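The name-matching check can be scripted. A sketch with made-up volume lists (in practice you would extract the names from the pod spec, e.g. via `kubectl get pod <pod-name> -o jsonpath=...`):

```shell
# Names declared under .spec.volumes (hypothetical values)
printf '%s\n' tmp cache | sort > /tmp/volumes.txt
# Names referenced by .spec.containers[].volumeMounts
printf '%s\n' tmp cache app-data | sort > /tmp/mounts.txt
# Mount names with no matching volume -> CreateContainerConfigError
comm -13 /tmp/volumes.txt /tmp/mounts.txt
```

Here the check reports app-data: it is mounted but never declared as a volume.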
ImagePullBackOff / ErrImagePull
Symptoms:
- Pod status: ImagePullBackOff or ErrImagePull
- Image cannot be pulled from registry
Image Does Not Exist
Solutions:
- Verify image tag is correct
- Check image exists in registry:
docker pull <image>:<tag>
- Update deployment with correct image tag
Authentication Failure
Solutions:
# Create image pull secret
kubectl create secret docker-registry <secret-name> \
--docker-server=<registry> \
--docker-username=<username> \
--docker-password=<password> \
--docker-email=<email> \
-n <namespace>
# Add to ServiceAccount
kubectl patch serviceaccount <sa-name> \
-p '{"imagePullSecrets": [{"name": "<secret-name>"}]}' \
-n <namespace>
Pending (Insufficient Resources)
Symptoms:
- Pod status: Pending
- Events show: Insufficient cpu or Insufficient memory
Diagnosis:
# Check node resources
kubectl top nodes
# Check pod resource requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Requests:"
# Check resource quotas
kubectl get resourcequota -n <namespace>
Solutions:
Reduce Resource Requests
resources:
requests:
cpu: 100m # Reduce from 500m
memory: 128Mi # Reduce from 512Mi
Scale Up Cluster
# GKE Autopilot provisions nodes automatically as pods demand capacity,
# so Pending pods usually resolve on their own. On GKE Standard you can
# resize the node pool manually:
gcloud container clusters resize <cluster-name> \
--num-nodes=<new-count> \
--zone=<zone>
GKE Autopilot Specific Errors
CPU Limit/Request Ratio Violation
Error: cpu max limit to request ratio per Container is 4, but provided ratio is X
Root Cause: GKE Autopilot enforces max ratio of 4.0
Diagnosis:
# Calculate ratio
# ratio = cpu_limit / cpu_request
# Example: 1000m / 200m = 5.0 (FAILS)
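The same arithmetic can be scripted. A sketch that assumes CPU quantities are either plain cores ("1") or millicores ("200m"):

```shell
# Convert a Kubernetes CPU quantity to millicores ("500m" -> 500, "1" -> 1000)
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

request=$(to_millicores "200m")
limit=$(to_millicores "1000m")
# Keep one decimal place using integer math (ratio x 10)
ratio10=$(( limit * 10 / request ))
if [ "$ratio10" -le 40 ]; then
  echo "ratio $((ratio10 / 10)).$((ratio10 % 10)) - OK"
else
  echo "ratio $((ratio10 / 10)).$((ratio10 % 10)) - exceeds Autopilot max of 4.0"
fi
```

With the failing example above (1000m limit / 200m request), this prints a 5.0 ratio violation.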
Solution:
resources:
requests:
cpu: 250m # Increased from 200m
limits:
cpu: 1000m
# New ratio: 1000/250 = 4.0 ✅
Prevention:
- Run validation: python3 scripts/validate_gke_autopilot_compliance.py
- Pre-commit hook automatically checks
Memory Limit/Request Ratio Violation
Error: memory max limit to request ratio per Container is 4, but provided ratio is X
Solution: Same as CPU - ensure ratio ≤ 4.0
Liveness/Readiness Probe Failures
Symptoms:
- Pod restarts frequently
- Events show: Liveness probe failed or Readiness probe failed
Diagnosis:
# Check probe configuration
kubectl get deployment <deployment> -n <namespace> -o yaml | grep -A 10 "livenessProbe:"
Probe Endpoint Not Available
Error: connection refused or 404 Not Found
Solutions:
- Verify health endpoint exists and is accessible
- Check port numbers match
- Ensure health endpoint doesn’t require authentication
Probe Timeout Too Short
Solutions:
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60 # Increase if app starts slowly
periodSeconds: 30
timeoutSeconds: 10 # Increase if endpoint is slow
failureThreshold: 3
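When tuning these values, it helps to know roughly how long a slow-starting pod has before the kubelet restarts it: approximately the initial delay plus failureThreshold failed probe periods. A sketch using the example values above:

```shell
# Values from the probe example above
initialDelaySeconds=60
periodSeconds=30
failureThreshold=3
# Approximate time before a restart once the endpoint stops responding
grace=$(( initialDelaySeconds + periodSeconds * failureThreshold ))
echo "restart after ~${grace}s"
```

Here the pod has roughly 150 seconds; if your application takes longer to start, raise initialDelaySeconds rather than failureThreshold.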
Missing Health Check Extension
Error (OTEL Collector): connection refused on port 13133
Root Cause: health_check extension not enabled
Solution:
# In OTEL Collector config
extensions:
health_check:
endpoint: 0.0.0.0:13133
service:
extensions: [health_check] # Must be enabled!
Step-by-Step Troubleshooting
Step 1: Identify the Problem
# Get overview of all pods
kubectl get pods -n <namespace>
# Identify problematic pods
kubectl get pods -n <namespace> | grep -vE "Running|Completed"
# Describe pod for detailed info
kubectl describe pod <pod-name> -n <namespace>
Step 2: Gather Logs and Events
# Check logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -50
Step 3: Analyze Root Cause
Use the patterns above to match symptoms to common causes.
Step 4: Apply Fix
Follow solution steps for the identified problem.
Step 5: Verify Fix
# Watch pod status
kubectl get pods -n <namespace> -w
# Check if pod is running and ready
kubectl get pod <pod-name> -n <namespace>
# Verify no restarts
kubectl describe pod <pod-name> -n <namespace> | grep "Restart Count:"
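To confirm the fix sticks, the restart count from `kubectl describe` can be checked in a script. A sketch over a hypothetical describe excerpt:

```shell
# Hypothetical excerpt from `kubectl describe pod <pod-name>` output
sample='    Ready:          True
    Restart Count:  0'
count=$(echo "$sample" | awk -F': *' '/Restart Count/ {print $2; exit}')
if [ "$count" -eq 0 ]; then
  echo "no restarts since fix"
else
  echo "still restarting ($count)"
fi
```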
Step 6: Clean Up
# Delete old ReplicaSets (keep last 2-3 for rollback)
kubectl delete replicaset <old-replicaset-name> -n <namespace>
# Scale down deployments if needed
kubectl scale deployment <deployment> --replicas=<count> -n <namespace>
Prevention Checklist
Before Deployment
- Run validation script: python3 scripts/validate_gke_autopilot_compliance.py
- Verify referenced Secrets and ConfigMaps exist in the target namespace
- Confirm CPU and memory limit/request ratios are ≤ 4.0 (GKE Autopilot)
- Test readOnlyRootFilesystem and volume mounts in development
After Deployment
- Watch pod status until Running and Ready
- Check logs and events for errors
- Verify restart counts stay at zero
- Confirm liveness/readiness probes pass
Quick Reference Commands
Essential Commands
# Get pod status
kubectl get pods -n <namespace>
# Describe pod
kubectl describe pod <pod-name> -n <namespace>
# Get logs
kubectl logs <pod-name> -c <container> -n <namespace>
# Get previous logs (after restart)
kubectl logs <pod-name> -c <container> -n <namespace> --previous
# Get events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Delete pod (forces recreation)
kubectl delete pod <pod-name> -n <namespace>
# Rollout restart deployment
kubectl rollout restart deployment/<deployment> -n <namespace>
# Rollback deployment
kubectl rollout undo deployment/<deployment> -n <namespace>
# Scale deployment
kubectl scale deployment/<deployment> --replicas=<count> -n <namespace>
Resource Commands
# Check node resources
kubectl top nodes
# Check pod resources
kubectl top pods -n <namespace>
# Get resource quotas
kubectl get resourcequota -n <namespace>
# Get limit ranges
kubectl get limitrange -n <namespace>
Debugging Commands
# Execute command in pod
kubectl exec -it <pod-name> -c <container> -n <namespace> -- /bin/sh
# Port forward to pod
kubectl port-forward <pod-name> <local-port>:<pod-port> -n <namespace>
# Copy files from pod
kubectl cp <namespace>/<pod-name>:<path> <local-path>