EKS Operational Runbooks
Operational procedures, troubleshooting guides, and incident response playbooks for AWS EKS deployments.
Quick Reference
Cluster Health
Pod Issues
Networking
Database
Cluster Health Checks
Daily Health Check
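A daily check could look roughly like the sketch below. The function name `daily_health_check` is hypothetical; it assumes `kubectl` already points at the target cluster and that metrics-server is installed (required by `kubectl top`).

```shell
# Sketch of a daily health check; not a definitive procedure.
daily_health_check() (
  echo "== Nodes =="
  kubectl get nodes -o wide

  # Also lists Succeeded (completed) pods, which are usually harmless.
  echo "== Pods not in Running phase =="
  kubectl get pods --all-namespaces --field-selector=status.phase!=Running

  # Assumes metrics-server is installed.
  echo "== Node resource usage =="
  kubectl top nodes

  echo "== Recent warning events =="
  kubectl get events --all-namespaces --field-selector=type=Warning \
    --sort-by=.lastTimestamp
)
```

Run it as `daily_health_check` and review each section for NotReady nodes, stuck pods, and repeated warnings.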
Automated Health Checks
Pod Troubleshooting
Runbook: Pod in CrashLoopBackOff
Identify the issue
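One way to gather the evidence for this step is sketched below. `inspect_crashloop_pod` is a hypothetical helper; the namespace and pod name are supplied by the operator.

```shell
# Sketch: collect state, previous logs, and last termination reason for a pod.
inspect_crashloop_pod() (
  ns="${1:?usage: inspect_crashloop_pod NAMESPACE POD}"
  pod="${2:?usage: inspect_crashloop_pod NAMESPACE POD}"

  # Events and container state; look at "Last State" and "Reason".
  kubectl describe pod "$pod" -n "$ns"

  # Logs from the previous (crashed) container instance.
  kubectl logs "$pod" -n "$ns" --previous --tail=100

  # Termination reason of the last crash, for example OOMKilled or Error.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
)
```

A reason of `OOMKilled` points at memory limits, while an application stack trace in the previous logs usually points at configuration or dependency problems.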
Common causes
- Database connection failed
- Missing environment variables
- Config file not found
- Out of memory (OOMKilled)
- CPU throttling
- Can’t read secrets
- IRSA role misconfigured
Fix based on cause
- Database Connection
- IRSA Issues
- Resource Limits
Verify fix
Runbook: Pod Pending (Can’t Schedule)
Check why pod is pending
- Insufficient CPU
- Insufficient memory
- No nodes available matching node selector
- Taint toleration not satisfied
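The scheduler records its reason in the pod's events. A minimal sketch for pulling only those events follows; `why_pending` is a hypothetical helper name.

```shell
# Sketch: show FailedScheduling events for one pod, oldest first.
why_pending() (
  ns="${1:?usage: why_pending NAMESPACE POD}"
  pod="${2:?usage: why_pending NAMESPACE POD}"

  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,reason=FailedScheduling" \
    --sort-by=.lastTimestamp
)
```

The event message states which of the conditions listed above applied and on how many nodes.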
Check Cluster Autoscaler
Manual scaling if needed
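If the cluster uses EKS managed node groups (an assumption; the runbook does not say), the desired size can be raised with the AWS CLI. `scale_nodegroup` and its arguments are hypothetical names.

```shell
# Sketch: set the desired size of an EKS managed node group.
# The value must fall within the node group's existing min/max bounds.
scale_nodegroup() (
  cluster="${1:?usage: scale_nodegroup CLUSTER NODEGROUP DESIRED_SIZE}"
  nodegroup="${2:?usage: scale_nodegroup CLUSTER NODEGROUP DESIRED_SIZE}"
  desired="${3:?usage: scale_nodegroup CLUSTER NODEGROUP DESIRED_SIZE}"

  aws eks update-nodegroup-config \
    --cluster-name "$cluster" \
    --nodegroup-name "$nodegroup" \
    --scaling-config "desiredSize=$desired"
)
```

Cluster Autoscaler may later scale the group back down once it considers the extra nodes unneeded.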
Check for taint issues
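Comparing node taints with the pod's tolerations usually settles this step. The helper names below are hypothetical.

```shell
# Sketch: list taint keys per node.
list_node_taints() (
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
)

# Sketch: print the tolerations declared on a pod.
pod_tolerations() (
  ns="${1:?usage: pod_tolerations NAMESPACE POD}"
  pod="${2:?usage: pod_tolerations NAMESPACE POD}"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.tolerations}'
)
```

A pod can only schedule onto a tainted node if it tolerates every `NoSchedule` taint on that node.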
Runbook: Image Pull Errors
Identify image pull error
- ImagePullBackOff: Image not found or no permission
- ErrImagePull: Network issue or registry down
Verify image exists in ECR
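A quick existence check with the AWS CLI might look like this. `ecr_image_pushed_at` is a hypothetical helper, and the repository and tag are whatever the failing pod spec references.

```shell
# Sketch: print the push timestamp of a tagged image, or fail if it is absent.
ecr_image_pushed_at() (
  repo="${1:?usage: ecr_image_pushed_at REPOSITORY TAG}"
  tag="${2:?usage: ecr_image_pushed_at REPOSITORY TAG}"

  aws ecr describe-images \
    --repository-name "$repo" \
    --image-ids "imageTag=$tag" \
    --query 'imageDetails[0].imagePushedAt' \
    --output text
)
```

A non-zero exit status means the tag does not exist in that repository, or the caller lacks permission to describe it.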
Check VPC CNI IRSA permissions
Check VPC endpoints
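For nodes in private subnets without a NAT path, ECR pulls depend on the `ecr.api` and `ecr.dkr` interface endpoints plus the S3 gateway endpoint. A sketch for listing what exists in a VPC follows; `list_vpc_endpoints` is a hypothetical helper.

```shell
# Sketch: list endpoint service names and states for one VPC.
list_vpc_endpoints() (
  vpc_id="${1:?usage: list_vpc_endpoints VPC_ID}"

  aws ec2 describe-vpc-endpoints \
    --filters "Name=vpc-id,Values=$vpc_id" \
    --query 'VpcEndpoints[].[ServiceName,State]' \
    --output table
)
```

Each required endpoint should appear with state `available`.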
Test manual pull
Networking Issues
Runbook: Pods Can’t Reach Internet
Verify NAT Gateway
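The NAT gateway state can be read with the AWS CLI, as in this sketch. `list_nat_gateways` is a hypothetical helper; the VPC ID is the one hosting the cluster's private subnets.

```shell
# Sketch: show NAT gateways in a VPC with their state and subnet.
# Note: this subcommand takes --filter (singular), unlike most EC2 describe calls.
list_nat_gateways() (
  vpc_id="${1:?usage: list_nat_gateways VPC_ID}"

  aws ec2 describe-nat-gateways \
    --filter "Name=vpc-id,Values=$vpc_id" \
    --query 'NatGateways[].[NatGatewayId,State,SubnetId]' \
    --output table
)
```

A healthy gateway reports `available`; also confirm that the private subnets' route tables send `0.0.0.0/0` to it.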
Test from pod
Check security groups
Runbook: Pod-to-Pod Communication Failing
Check NetworkPolicies
Test connectivity
Check VPC CNI
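On EKS the VPC CNI runs as the `aws-node` DaemonSet in `kube-system`. A sketch for checking its rollout and recent logs follows; `check_vpc_cni` is a hypothetical helper.

```shell
# Sketch: confirm aws-node is scheduled on every node and show recent logs.
check_vpc_cni() (
  # DESIRED and READY counts should match the number of nodes.
  kubectl -n kube-system get daemonset aws-node

  # Recent logs from the CNI container across pods carrying this label.
  kubectl -n kube-system logs -l k8s-app=aws-node -c aws-node --tail=50
)
```

IP address exhaustion in a subnet is a common finding here and shows up as pods stuck in `ContainerCreating`.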
RDS Operations
Runbook: RDS Backup and Restore
Verify automated backups
Create manual snapshot
Restore from snapshot
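The backup, snapshot, and restore steps above can be sketched with the AWS CLI as follows. All function names and identifiers are hypothetical.

```shell
# Sketch: print automated backup retention in days (0 means backups are disabled).
rds_backup_retention() (
  db="${1:?usage: rds_backup_retention DB_INSTANCE_ID}"
  aws rds describe-db-instances --db-instance-identifier "$db" \
    --query 'DBInstances[0].BackupRetentionPeriod' --output text
)

# Sketch: create a manual snapshot and block until it is available.
rds_manual_snapshot() (
  db="${1:?usage: rds_manual_snapshot DB_INSTANCE_ID SNAPSHOT_ID}"
  snap="${2:?usage: rds_manual_snapshot DB_INSTANCE_ID SNAPSHOT_ID}"
  aws rds create-db-snapshot \
    --db-instance-identifier "$db" --db-snapshot-identifier "$snap" &&
    aws rds wait db-snapshot-available --db-snapshot-identifier "$snap"
)

# Sketch: restore a snapshot into a NEW instance and wait for it to come up.
rds_restore_snapshot() (
  snap="${1:?usage: rds_restore_snapshot SNAPSHOT_ID NEW_DB_INSTANCE_ID}"
  new_db="${2:?usage: rds_restore_snapshot SNAPSHOT_ID NEW_DB_INSTANCE_ID}"
  aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier "$new_db" --db-snapshot-identifier "$snap" &&
    aws rds wait db-instance-available --db-instance-identifier "$new_db"
)
```

A restore always creates a new instance with a new endpoint. In practice the restore call usually also needs `--db-subnet-group-name` and `--vpc-security-group-ids` so the instance lands in the cluster's VPC.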
Point application to restored DB
Runbook: RDS Performance Issues
Check Performance Insights
Check slow queries
Check connections
Scale up if needed
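Vertical scaling is a change of instance class. A minimal sketch, with a hypothetical helper name and operator-chosen target class:

```shell
# Sketch: change the instance class now rather than in the maintenance window.
rds_scale_up() (
  db="${1:?usage: rds_scale_up DB_INSTANCE_ID INSTANCE_CLASS}"
  class="${2:?usage: rds_scale_up DB_INSTANCE_ID INSTANCE_CLASS}"

  aws rds modify-db-instance \
    --db-instance-identifier "$db" \
    --db-instance-class "$class" \
    --apply-immediately
)
```

For example, `rds_scale_up mydb db.r6g.xlarge`. Changing the class restarts the instance, so expect a short interruption (a failover on Multi-AZ deployments).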
ElastiCache Operations
Runbook: Redis Connection Issues
Verify Redis is running
Test connection from pod
Check security group
Check application logs
Runbook: Redis Failover Testing
Trigger manual failover
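Assuming Redis runs as an ElastiCache replication group with automatic failover enabled, a failover can be triggered as sketched below. `redis_test_failover` is a hypothetical helper; the node group ID defaults to `0001`, which is the usual ID for a single-shard group but should be confirmed first.

```shell
# Sketch: trigger a test failover of one node group (shard).
redis_test_failover() (
  rg="${1:?usage: redis_test_failover REPLICATION_GROUP_ID [NODE_GROUP_ID]}"
  ng="${2:-0001}"

  aws elasticache test-failover \
    --replication-group-id "$rg" \
    --node-group-id "$ng"
)
```

Run this outside peak hours: clients see dropped connections until the replica is promoted.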
Verify application resilience
Check metrics
Cluster Autoscaler Operations
Runbook: Cluster Autoscaler Not Scaling
Check autoscaler logs
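A sketch for tailing the autoscaler logs follows. It assumes the autoscaler runs as a Deployment named `cluster-autoscaler` in `kube-system`, which is the common default but depends on how it was installed.

```shell
# Sketch: show the most recent autoscaler log lines (default 200).
# ASSUMPTION: Deployment "cluster-autoscaler" in namespace "kube-system".
autoscaler_logs() (
  lines="${1:-200}"
  kubectl -n kube-system logs deployment/cluster-autoscaler --tail="$lines"
)
```

Look for lines explaining why a scale-up was skipped, such as node groups already at maximum size or AWS API permission errors.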
Verify IRSA permissions
Check node group limits
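For EKS managed node groups (an assumption), the configured bounds can be read directly. `nodegroup_limits` is a hypothetical helper.

```shell
# Sketch: print minSize, maxSize, and desiredSize of a managed node group.
nodegroup_limits() (
  cluster="${1:?usage: nodegroup_limits CLUSTER NODEGROUP}"
  nodegroup="${2:?usage: nodegroup_limits CLUSTER NODEGROUP}"

  aws eks describe-nodegroup \
    --cluster-name "$cluster" \
    --nodegroup-name "$nodegroup" \
    --query 'nodegroup.scalingConfig' \
    --output json
)
```

If `desiredSize` already equals `maxSize`, the autoscaler has nowhere to go until the maximum is raised.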
Check for pending pods
Monitoring & Alerts
CloudWatch Alarms
Critical alarms to configure:
Incident Response
Runbook: Complete Cluster Outage
Assess impact
Check control plane
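Control plane health can be checked from both the AWS side and the Kubernetes side, as in this sketch. `check_control_plane` is a hypothetical helper.

```shell
# Sketch: report EKS cluster status, then the API server's own readiness checks.
check_control_plane() (
  cluster="${1:?usage: check_control_plane CLUSTER_NAME}"

  # Expect ACTIVE; other values include CREATING, UPDATING, and FAILED.
  aws eks describe-cluster --name "$cluster" \
    --query 'cluster.status' --output text

  # Fails or times out if the API server is unreachable.
  kubectl get --raw='/readyz?verbose'
)
```

If the AWS call reports `ACTIVE` but the `kubectl` call fails, the problem is more likely network access to the endpoint or local credentials than the control plane itself.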
Check nodes
Recovery options
- API Server Down
- All Nodes Down
- Database Down
Post-incident review
- Document timeline
- Analyze CloudWatch logs
- Review metrics during incident
- Update runbooks based on learnings
Disaster Recovery
RTO/RPO Targets
| Service | RTO | RPO | Recovery Method |
|---|---|---|---|
| EKS Cluster | 30 min | 0 | Terraform re-deploy |
| RDS Database | 2 hours | 5 min | Snapshot restore |
| ElastiCache | 1 hour | 1 hour | Snapshot restore |
| Application | 15 min | 0 | GitOps redeploy |
Disaster Recovery Test Plan
Monthly: Snapshot restore test
- Create test RDS instance from latest snapshot
- Verify data integrity
- Delete test instance
Quarterly: Full cluster rebuild
- Deploy to staging using Terraform
- Restore latest RDS backup
- Verify application functionality
- Destroy staging cluster
Annually: Multi-region failover
- Deploy infrastructure in secondary region
- Restore cross-region RDS backup
- Test application in secondary region
- Document failover procedures