EKS Operational Runbooks
Operational procedures, troubleshooting guides, and incident response playbooks for AWS EKS deployments.
Quick Reference
Cluster Health
Check cluster and node status
Pod Issues
Diagnose and fix pod problems
Networking
Resolve connectivity problems
Database
RDS backup, restore, and troubleshooting
Cluster Health Checks
Daily Health Check
Automated Health Checks
Pod Troubleshooting
Runbook: Pod in CrashLoopBackOff
Common causes
Application errors:
- Database connection failed
- Missing environment variables
- Config file not found
- Out of memory (OOMKilled)
- CPU throttling
- Can’t read secrets
- IRSA role misconfigured
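The first triage step for the causes above can be sketched in code. The example below is a minimal illustration, not a definitive tool: it assumes the pod has already been exported as JSON (for example with `kubectl get pod <name> -o json`) and parsed into a dictionary. It reads only `status.containerStatuses[].lastState.terminated`. The function name `crash_hints` and the sample pod are hypothetical.

```python
def crash_hints(pod: dict) -> list:
    """Return one hint per container that has a recorded previous termination."""
    hints = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for status in statuses:
        terminated = status.get("lastState", {}).get("terminated")
        if not terminated:
            continue  # no previous crash recorded for this container
        name = status.get("name", "<unknown>")
        reason = terminated.get("reason", "")
        exit_code = terminated.get("exitCode")
        if reason == "OOMKilled":
            hint = "out of memory: compare the memory limit with actual usage"
        elif exit_code == 137:
            # 137 = 128 + 9, i.e. the process received SIGKILL
            hint = "killed with SIGKILL: often memory pressure"
        else:
            hint = "application error: read the previous container's logs"
        hints.append(f"{name}: {hint} (exit code {exit_code})")
    return hints


# Hypothetical pod status, trimmed to the fields the function reads.
sample = {
    "status": {
        "containerStatuses": [
            {
                "name": "api",
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}
print(crash_hints(sample))
```

Anything that is not an out-of-memory kill falls through to the "application error" hint. Logs from the crashed container (`kubectl logs <pod> --previous`) are then the place to look for connection, configuration, or permission failures.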
Runbook: Pod Pending (Can’t Schedule)
Check why pod is pending
- Insufficient CPU
- Insufficient memory
- No nodes available matching node selector
- Taint toleration not satisfied
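The scheduling failure reasons above usually appear together in a single `FailedScheduling` event message. The sketch below maps such a message back to those reasons. The substrings it matches are assumptions: scheduler wording differs between Kubernetes versions, so treat the patterns as a starting point. `PENDING_PATTERNS` and `pending_causes` are hypothetical names.

```python
# Assumed substrings; verify against the messages your cluster version emits.
PENDING_PATTERNS = [
    ("insufficient cpu", "Insufficient CPU"),
    ("insufficient memory", "Insufficient memory"),
    ("node selector", "No nodes available matching node selector"),
    ("affinity/selector", "No nodes available matching node selector"),
    ("taint", "Taint toleration not satisfied"),
]


def pending_causes(message: str) -> list:
    """Return the runbook causes that a scheduler event message points to."""
    lowered = message.lower()
    causes = []
    for needle, cause in PENDING_PATTERNS:
        if needle in lowered and cause not in causes:
            causes.append(cause)
    return causes


# Hypothetical event message in the style of the default scheduler.
example = (
    "0/3 nodes are available: 2 Insufficient cpu, "
    "1 node(s) had untolerated taint {dedicated: gpu}."
)
print(pending_causes(example))
```

A message can carry several causes at once, which is why the function returns a list.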
Runbook: Image Pull Errors
Identify image pull error
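- ErrImagePull: The pull attempt failed (image not found, no permission, network issue, or registry down)
- ImagePullBackOff: Kubernetes is waiting before retrying after repeated pull failures; the underlying cause is one of the above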
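The pod's waiting reason alone rarely pins down the cause; the event message from `kubectl describe pod` usually does. The sketch below sorts such a message into the causes named above. The matched substrings are assumptions based on common container runtime and registry error text, and the names `IMAGE_PULL_PATTERNS` and `image_pull_cause` are hypothetical.

```python
# Assumed substrings; runtimes and registries phrase these errors differently.
IMAGE_PULL_PATTERNS = [
    (("not found", "manifest unknown"),
     "image not found: check the repository name and tag"),
    (("unauthorized", "denied", "no basic auth credentials"),
     "no permission: check registry credentials"),
    (("timeout", "no such host", "connection refused"),
     "network issue or registry down: check egress and registry status"),
]


def image_pull_cause(message: str) -> str:
    """Classify an image pull event message into a likely cause."""
    lowered = message.lower()
    for needles, cause in IMAGE_PULL_PATTERNS:
        if any(needle in lowered for needle in needles):
            return cause
    return "unrecognized: read the full event message"


# Hypothetical message; registry.example.com is a placeholder host.
print(image_pull_cause("dial tcp: lookup registry.example.com: no such host"))
```

The first matching pattern wins, so the order of the list sets the priority when a message mentions more than one problem.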
Networking Issues
Runbook: Pods Can’t Reach Internet
Runbook: Pod-to-Pod Communication Failing
RDS Operations
Runbook: RDS Backup and Restore
Runbook: RDS Performance Issues
ElastiCache Operations
Runbook: Redis Connection Issues
Runbook: Redis Failover Testing
Cluster Autoscaler Operations
Runbook: Cluster Autoscaler Not Scaling
Monitoring & Alerts
CloudWatch Alarms
Critical alarms to configure:
Incident Response
Runbook: Complete Cluster Outage
Recovery options
- API Server Down
- All Nodes Down
- Database Down
If the API server is down, AWS handles control plane recovery automatically.
Wait 5-10 minutes for AWS to recover the control plane.
If the outage persists for more than 15 minutes, contact AWS Support.
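The wait-then-escalate rule above can be sketched as a polling loop. This is an illustration under stated assumptions: `probe` is a hypothetical callable that returns `True` once the API server answers (for example, a wrapper around `kubectl get --raw /healthz`), and the 30-second poll interval is an arbitrary choice. Only the 15-minute escalation threshold comes from the runbook.

```python
import time
from typing import Callable

ESCALATE_AFTER_SECONDS = 15 * 60  # runbook threshold for contacting AWS Support


def wait_for_control_plane(
    probe: Callable[[], bool],
    poll_seconds: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the control plane recovers or the escalation threshold passes."""
    started = clock()
    while True:
        if probe():
            return "recovered"
        if clock() - started > ESCALATE_AFTER_SECONDS:
            return "contact AWS Support"
        sleep(poll_seconds)


# Simulated run: the probe fails twice, then succeeds; no real waiting happens.
answers = iter([False, False, True])
print(wait_for_control_plane(lambda: next(answers), sleep=lambda seconds: None))
```

The clock and sleep functions are parameters so the loop can be exercised without waiting 15 minutes.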
Disaster Recovery
RTO/RPO Targets
| Service | RTO | RPO | Recovery Method |
|---|---|---|---|
| EKS Cluster | 30 min | 0 | Terraform re-deploy |
| RDS Database | 2 hours | 5 min | Snapshot restore |
| ElastiCache | 1 hour | 1 hour | Snapshot restore |
| Application | 15 min | 0 | GitOps redeploy |
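The targets in the table can be checked against the results of a recovery drill. The sketch below encodes the table in minutes and reports which targets a drill missed. The names `Target`, `TARGETS`, and `check_drill`, and the drill figures in the example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    rto_minutes: int  # maximum acceptable time to recover
    rpo_minutes: int  # maximum acceptable window of lost data


# Values copied from the RTO/RPO table, converted to minutes.
TARGETS = {
    "EKS Cluster": Target(rto_minutes=30, rpo_minutes=0),
    "RDS Database": Target(rto_minutes=120, rpo_minutes=5),
    "ElastiCache": Target(rto_minutes=60, rpo_minutes=60),
    "Application": Target(rto_minutes=15, rpo_minutes=0),
}


def check_drill(service: str, recovery_minutes: int, data_loss_minutes: int) -> list:
    """Return a description of each target the drill missed (empty if none)."""
    target = TARGETS[service]
    misses = []
    if recovery_minutes > target.rto_minutes:
        misses.append(f"RTO missed: {recovery_minutes} min > {target.rto_minutes} min")
    if data_loss_minutes > target.rpo_minutes:
        misses.append(f"RPO missed: {data_loss_minutes} min > {target.rpo_minutes} min")
    return misses


# Hypothetical drill results.
print(check_drill("RDS Database", recovery_minutes=95, data_loss_minutes=3))
print(check_drill("Application", recovery_minutes=20, data_loss_minutes=0))
```

Recording drill results in this form makes it easier to see over time whether the stated targets remain realistic.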
Disaster Recovery Test Plan
Monthly: Snapshot restore test
- Create test RDS instance from latest snapshot
- Verify data integrity
- Delete test instance
Quarterly: Full cluster rebuild
- Deploy to staging using Terraform
- Restore latest RDS backup
- Verify application functionality
- Destroy staging cluster