
EKS Operational Runbooks

Operational procedures, troubleshooting guides, and incident response playbooks for AWS EKS deployments.


Cluster Health Checks

Daily Health Check

# 1. Check cluster status
kubectl cluster-info

# 2. Check node health
kubectl get nodes -o wide
kubectl top nodes

# 3. Check system pods
kubectl get pods -n kube-system
kubectl get pods -n amazon-cloudwatch

# 4. Check application pods
kubectl get pods -n mcp-server-langgraph

# 5. Check HPA status
kubectl get hpa -A

# 6. Check Cluster Autoscaler
kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=50

Automated Health Checks

#!/bin/bash
# scripts/eks-health-check.sh

echo "=== EKS Cluster Health Check ==="

# Check API server
if kubectl cluster-info &>/dev/null; then
  echo "✓ API server responsive"
else
  echo "✗ API server not responsive"
  exit 1
fi

# Check node health
NOT_READY=$(kubectl get nodes --no-headers | awk '$2 !~ /^Ready/' | wc -l)
if [ "$NOT_READY" -eq 0 ]; then
  echo "✓ All nodes ready"
else
  echo "✗ $NOT_READY nodes not ready"
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/'
fi

# Check critical pods
FAILING_PODS=$(kubectl get pods -A | grep -v Running | grep -v Completed | grep -v NAME | wc -l)
if [ "$FAILING_PODS" -eq 0 ]; then
  echo "✓ All pods healthy"
else
  echo "⚠ $FAILING_PODS pods not running"
  kubectl get pods -A | grep -v Running | grep -v Completed
fi

echo "=== Health Check Complete ==="

Pod Troubleshooting

Runbook: Pod in CrashLoopBackOff

1

Identify the issue

kubectl get pods -n mcp-server-langgraph
# Output: pod/mcp-server-xxxx  0/1  CrashLoopBackOff

# Check pod events
kubectl describe pod mcp-server-xxxx -n mcp-server-langgraph | tail -20

# Check logs
kubectl logs mcp-server-xxxx -n mcp-server-langgraph --previous
2

Common causes

Application errors:
  • Database connection failed
  • Missing environment variables
  • Config file not found
Resource limits:
  • Out of memory (OOMKilled)
  • CPU throttling
Permission issues:
  • Can’t read secrets
  • IRSA role misconfigured
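
A quick way to narrow down the cause is the container's last termination reason (OOMKilled vs. Error); the pod name below is a placeholder:

kubectl get pod mcp-server-xxxx -n mcp-server-langgraph \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'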
3

Fix based on cause

  • Database Connection (commands below)
  • IRSA Issues (see the sketch after the database commands)
  • Resource Limits (check for OOMKilled and raise requests/limits in the deployment)
# Verify database secret exists
kubectl get secret database-credentials -n mcp-server-langgraph

# Check RDS endpoint is reachable
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h DB_ENDPOINT -U mcp_langgraph -d mcp_langgraph

# Verify RDS security group allows EKS nodes
# Check in AWS Console: RDS > Security Groups
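
For the IRSA case, a minimal check; the service account name is assumed to match the deployment and the role name is a placeholder:

# Verify the service account is annotated with an IAM role
kubectl describe serviceaccount mcp-server-langgraph -n mcp-server-langgraph | grep role-arn

# Confirm the role's trust policy references the cluster's OIDC provider
aws iam get-role --role-name ROLE_NAME --query "Role.AssumeRolePolicyDocument"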
4

Verify fix

kubectl get pods -n mcp-server-langgraph -w
# Wait for pod to reach Running state

kubectl logs -f mcp-server-xxxx -n mcp-server-langgraph
# Check logs for successful startup

Runbook: Pod Pending (Can’t Schedule)

1

Check why pod is pending

kubectl describe pod POD_NAME -n mcp-server-langgraph | grep -A 10 Events
Common reasons:
  • Insufficient CPU
  • Insufficient memory
  • No nodes available matching node selector
  • Taint toleration not satisfied
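
To see the scheduler's reasoning for all pending pods at once, the FailedScheduling events are usually the fastest signal:

kubectl get events -n mcp-server-langgraph \
  --field-selector reason=FailedScheduling --sort-by=.lastTimestamp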
2

Check Cluster Autoscaler

# Check if autoscaler is adding nodes
kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=100

# Check whether new nodes have joined (spot node group shown as an example)
kubectl get nodes -l node.kubernetes.io/lifecycle=spot -o wide
3

Manual scaling if needed

# Scale node group via Terraform
cd terraform/environments/prod

# Edit terraform.tfvars
general_node_group_desired_size = 5  # Increase from 3

terraform apply -target=module.eks

# Or via AWS CLI (temporary)
aws eks update-nodegroup-config \
  --cluster-name mcp-langgraph-prod \
  --nodegroup-name general-nodes \
  --scaling-config minSize=2,maxSize=10,desiredSize=5
4

Check for taint issues

# If pod requires specific node group (e.g., compute-optimized)
kubectl describe node NODE_NAME | grep Taints

# Add toleration to pod
# In deployment.yaml:
tolerations:
- key: "workload"
  operator: "Equal"
  value: "llm"
  effect: "NoSchedule"
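
To confirm a matching node group actually exists, check node labels alongside the taint; this assumes the compute-optimized group is labeled workload=llm to mirror its taint:

kubectl get nodes -L workload
# Nodes in the target group should show "llm" in the WORKLOAD column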

Runbook: Image Pull Errors

1

Identify image pull error

kubectl describe pod POD_NAME -n mcp-server-langgraph | grep -A 5 "Failed to pull image"
Error types:
  • ImagePullBackOff: Image not found or no permission
  • ErrImagePull: Network issue or registry down
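
To spot pull failures across the namespace without describing pods one by one (a sketch; the event reasons are standard kubelet values):

kubectl get events -n mcp-server-langgraph \
  --field-selector reason=Failed --sort-by=.lastTimestamp | grep -i pull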
2

Verify image exists in ECR

aws ecr describe-images \
  --repository-name mcp-server-langgraph \
  --image-ids imageTag=v1.0.0

# List all tags
aws ecr list-images --repository-name mcp-server-langgraph
3

Check node IAM permissions

# Image pulls are performed by the kubelet/container runtime using the node's IAM role,
# which needs ECR read access (e.g., the AmazonEC2ContainerRegistryReadOnly managed policy)
aws eks describe-nodegroup \
  --cluster-name mcp-langgraph-prod \
  --nodegroup-name general-nodes \
  --query "nodegroup.nodeRole"

# Verify the role has ECR read permissions attached
aws iam list-attached-role-policies --role-name NODE_ROLE_NAME
4

Check VPC endpoints

# Verify ECR endpoints exist
aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.us-east-1.ecr.api,com.amazonaws.us-east-1.ecr.dkr" \
  --query "VpcEndpoints[].[VpcEndpointId,ServiceName]"

# Should return endpoint IDs for ECR API and DKR
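
ECR stores image layers in S3, so a fully private cluster also needs the S3 gateway endpoint; a quick check, with the region assumed to be us-east-1 as above:

aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.us-east-1.s3" \
  --query "VpcEndpoints[].[VpcEndpointId,VpcEndpointType]"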
5

Test manual pull

# SSH into node (via SSM)
aws ssm start-session --target i-INSTANCE_ID

# Authenticate to ECR, then try pulling the image (use crictl on containerd-based AMIs)
aws ecr get-login-password --region us-east-1 | \
  sudo docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
sudo docker pull ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph:v1.0.0

Networking Issues

Runbook: Pods Can’t Reach Internet

1

Verify NAT Gateway

# Check NAT gateway status
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query "NatGateways[].NatGatewayId"

# Check NAT gateway route
aws ec2 describe-route-tables \
  --filters "Name=tag:Name,Values=*private*" \
  --query "RouteTables[].Routes"
2

Test from pod

kubectl run -it --rm debug --image=busybox --restart=Never -- sh

# Inside pod:
wget -O- http://checkip.amazonaws.com
nslookup google.com
ping 8.8.8.8
3

Check security groups

# Get node security group
aws eks describe-cluster \
  --name mcp-langgraph-prod \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId"

# Check outbound rules allow all traffic
aws ec2 describe-security-groups \
  --group-ids sg-XXXXX \
  --query "SecurityGroups[].IpPermissionsEgress"

Runbook: Pod-to-Pod Communication Failing

1

Check NetworkPolicies

kubectl get networkpolicies -A

# Describe policy blocking traffic
kubectl describe networkpolicy POLICY_NAME -n NAMESPACE
2

Test connectivity

# Deploy test pods
kubectl run test-source --image=busybox --restart=Never -- sleep 3600
kubectl run test-dest --image=nginx --restart=Never

# Get dest pod IP
DEST_IP=$(kubectl get pod test-dest -o jsonpath='{.status.podIP}')

# Test from source
kubectl exec test-source -- wget -O- http://$DEST_IP

# Clean up
kubectl delete pod test-source test-dest
3

Check VPC CNI

# Check VPC CNI pod logs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100

# Restart VPC CNI if needed
kubectl rollout restart daemonset aws-node -n kube-system

RDS Operations

Runbook: RDS Backup and Restore

1

Verify automated backups

aws rds describe-db-instances \
  --db-instance-identifier mcp-langgraph-prod \
  --query "DBInstances[].{Backup:BackupRetentionPeriod,Window:PreferredBackupWindow}"

# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier mcp-langgraph-prod \
  --query "DBSnapshots[].[DBSnapshotIdentifier,SnapshotCreateTime,Status]"
2

Create manual snapshot

aws rds create-db-snapshot \
  --db-instance-identifier mcp-langgraph-prod \
  --db-snapshot-identifier mcp-langgraph-prod-manual-$(date +%Y%m%d-%H%M)

# Wait for snapshot to complete
aws rds wait db-snapshot-completed \
  --db-snapshot-identifier mcp-langgraph-prod-manual-YYYYMMDD-HHMM
3

Restore from snapshot

# Restore to new instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mcp-langgraph-restored \
  --db-snapshot-identifier mcp-langgraph-prod-manual-YYYYMMDD-HHMM \
  --db-instance-class db.t3.medium \
  --vpc-security-group-ids sg-XXXXX \
  --db-subnet-group-name mcp-langgraph-prod-db-subnet

# Wait for restore to complete (~10-15 min)
aws rds wait db-instance-available \
  --db-instance-identifier mcp-langgraph-restored
4

Point application to restored DB

# Get restored DB endpoint
RESTORED_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier mcp-langgraph-restored \
  --query "DBInstances[].Endpoint.Address" \
  --output text)

# Update secret
kubectl patch secret database-credentials \
  -n mcp-server-langgraph \
  -p "{\"data\":{\"host\":\"$(echo -n $RESTORED_ENDPOINT | base64)\"}}"

# Restart pods to pick up new endpoint
kubectl rollout restart deployment mcp-server-langgraph -n mcp-server-langgraph
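
Before relying on the restart, it's worth confirming the patched value decodes to the restored endpoint:

kubectl get secret database-credentials -n mcp-server-langgraph \
  -o jsonpath='{.data.host}' | base64 -d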

Runbook: RDS Performance Issues

1

Check Performance Insights

# View in AWS Console
# RDS > mcp-langgraph-prod > Monitoring > Performance Insights

# Or via CLI
aws pi get-resource-metrics \
  --service-type RDS \
  --identifier db-XXXXX \
  --metric-queries file://metrics-query.json
2

Check slow queries

# View slow query log in CloudWatch
aws logs tail /aws/rds/instance/mcp-langgraph-prod/postgresql --follow

# Or connect and check directly
kubectl run -it --rm psql --image=postgres:15 --restart=Never -- \
  psql -h DB_ENDPOINT -U mcp_langgraph -d mcp_langgraph

# Inside psql (requires the pg_stat_statements extension):
SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
3

Check connections

# Current connections (from psql)
kubectl run -it --rm psql --image=postgres:15 --restart=Never -- \
  psql -h DB_ENDPOINT -U mcp_langgraph -d mcp_langgraph \
  -c "SELECT count(*) FROM pg_stat_activity;"

# Max connections setting
aws rds describe-db-parameters \
  --db-parameter-group-name mcp-langgraph-prod \
  --query "Parameters[?ParameterName=='max_connections']"
4

Scale up if needed

# Modify instance class (requires restart)
cd terraform/environments/prod

# Edit terraform.tfvars
rds_instance_class = "db.t3.large"  # Upgrade from db.t3.medium

terraform apply -target=module.rds

# Or via CLI (immediate, causes downtime)
aws rds modify-db-instance \
  --db-instance-identifier mcp-langgraph-prod \
  --db-instance-class db.t3.large \
  --apply-immediately

ElastiCache Operations

Runbook: Redis Connection Issues

1

Verify Redis is running

aws elasticache describe-replication-groups \
  --replication-group-id mcp-langgraph-prod \
  --query "ReplicationGroups[].Status"
2

Test connection from pod

kubectl run -it --rm redis-test --image=redis:7-alpine --restart=Never -- sh

# Inside pod (ElastiCache AUTH requires in-transit encryption, so --tls is needed):
redis-cli --tls -h REDIS_ENDPOINT -p 6379 -a AUTH_TOKEN

# Test commands:
PING
INFO
CLUSTER INFO  # If cluster mode
3

Check security group

# Get the ElastiCache security groups
aws elasticache describe-cache-clusters \
  --query "CacheClusters[?ReplicationGroupId=='mcp-langgraph-prod'].SecurityGroups"

# Verify EKS nodes can reach Redis on port 6379
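
A concrete way to verify this is to inspect the ingress rules on the ElastiCache security group for port 6379; sg-XXXXX is the placeholder used above, and the source should be the EKS node or cluster security group:

aws ec2 describe-security-groups \
  --group-ids sg-XXXXX \
  --query 'SecurityGroups[].IpPermissions[?ToPort==`6379`]'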
4

Check application logs

kubectl logs -f deployment/mcp-server-langgraph -n mcp-server-langgraph | grep -i redis

Runbook: Redis Failover Testing

1

Trigger manual failover

# Test Multi-AZ failover
aws elasticache test-failover \
  --replication-group-id mcp-langgraph-prod \
  --node-group-id 0001  # Primary shard

# Monitor failover
aws elasticache describe-events \
  --source-type replication-group \
  --source-identifier mcp-langgraph-prod \
  --duration 60
2

Verify application resilience

# Monitor application logs during failover
kubectl logs -f deployment/mcp-server-langgraph -n mcp-server-langgraph

# Application should reconnect automatically
# Check for Redis connection errors
3

Check metrics

# View failover in CloudWatch
# CloudWatch > Metrics > ElastiCache > Replication Group

# Key signals:
# - ReplicationLag metric (spikes during failover)
# - Failover completion events in the describe-events output above

Cluster Autoscaler Operations

Runbook: Cluster Autoscaler Not Scaling

1

Check autoscaler logs

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=200 | grep -E "(scale|node)"

# Look for:
# - "ScaleUp: group X -> Y nodes"
# - "ScaleDown: removing node Z"
# - Errors about IAM permissions
2

Verify IRSA permissions

# Check service account annotation
kubectl describe serviceaccount cluster-autoscaler -n kube-system | grep role-arn

# Check IAM role has autoscaling permissions
aws iam get-role-policy \
  --role-name mcp-langgraph-prod-cluster-autoscaler \
  --policy-name cluster-autoscaler-policy
3

Check node group limits

# Verify max_size is not reached
aws eks describe-nodegroup \
  --cluster-name mcp-langgraph-prod \
  --nodegroup-name general-nodes \
  --query "nodegroup.scalingConfig"

# Output should show: {minSize, maxSize, desiredSize}
4

Check for pending pods

kubectl get pods -A | grep Pending

# Autoscaler only scales up if pods are pending
# Check pod events for reason
kubectl describe pod PENDING_POD -n NAMESPACE

Monitoring & Alerts

CloudWatch Alarms

Critical alarms to configure:
# EKS Node CPU > 80% (scoped to the node group's Auto Scaling Group)
aws cloudwatch put-metric-alarm \
  --alarm-name eks-node-cpu-high \
  --alarm-description "EKS node CPU > 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --dimensions Name=AutoScalingGroupName,Value=ASG_NAME \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2

# RDS CPU > 80%
aws cloudwatch put-metric-alarm \
  --alarm-name rds-cpu-high \
  --alarm-description "RDS CPU > 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=mcp-langgraph-prod \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2

# ElastiCache Memory > 90%
aws cloudwatch put-metric-alarm \
  --alarm-name redis-memory-high \
  --alarm-description "Redis memory > 90%" \
  --metric-name DatabaseMemoryUsagePercentage \
  --namespace AWS/ElastiCache \
  --dimensions Name=ReplicationGroupId,Value=mcp-langgraph-prod \
  --statistic Average \
  --period 300 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2
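
As written, these alarms only change state; to notify anyone, each put-metric-alarm also needs --alarm-actions pointing at an SNS topic. A minimal sketch, with the topic name and email address purely illustrative:

# Create a topic and subscribe the on-call address
aws sns create-topic --name eks-ops-alerts
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT_ID:eks-ops-alerts \
  --protocol email \
  --notification-endpoint oncall@example.com

# Then add to each put-metric-alarm command above:
#   --alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:eks-ops-alerts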

Incident Response

Runbook: Complete Cluster Outage

1

Assess impact

# Check if API server is reachable
kubectl cluster-info

# Check AWS service health
# https://health.aws.amazon.com/health/status

# Check CloudWatch alarms currently in ALARM state
aws cloudwatch describe-alarms --state-value ALARM
2

Check control plane

# View control plane logs
aws logs tail /aws/eks/mcp-langgraph-prod/cluster --follow

# Check control plane status
aws eks describe-cluster --name mcp-langgraph-prod \
  --query "cluster.status"
3

Check nodes

# List EC2 instances
aws ec2 describe-instances \
  --filters "Name=tag:eks:cluster-name,Values=mcp-langgraph-prod" \
  --query "Reservations[].Instances[].[InstanceId,State.Name,PrivateIpAddress]"

# Check Auto Scaling Group
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[?contains(AutoScalingGroupName,'mcp-langgraph-prod')]"
4

Recovery options

  • API Server Down: AWS handles control plane recovery automatically. Wait 5-10 minutes for the control plane to recover; if the outage persists beyond 15 minutes, contact AWS Support.
  • All Nodes Down: scale the node groups back up (see the manual scaling and Cluster Autoscaler runbooks above).
  • Database Down: restore from snapshot (see the RDS Backup and Restore runbook above).
5

Post-incident review

  • Document timeline
  • Analyze CloudWatch logs
  • Review metrics during incident
  • Update runbooks based on learnings

Disaster Recovery

RTO/RPO Targets

Service      | RTO     | RPO    | Recovery Method
-------------|---------|--------|--------------------
EKS Cluster  | 30 min  | 0      | Terraform re-deploy
RDS Database | 2 hours | 5 min  | Snapshot restore
ElastiCache  | 1 hour  | 1 hour | Snapshot restore
Application  | 15 min  | 0      | GitOps redeploy

Disaster Recovery Test Plan

1

Monthly: Snapshot restore test

  1. Create test RDS instance from latest snapshot
  2. Verify data integrity
  3. Delete test instance
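
A minimal sketch of this monthly test, assuming the instance identifiers used throughout this document; the test instance name is illustrative:

# 1. Restore the most recent snapshot into a throwaway instance
LATEST=$(aws rds describe-db-snapshots \
  --db-instance-identifier mcp-langgraph-prod \
  --query "max_by(DBSnapshots, &SnapshotCreateTime).DBSnapshotIdentifier" \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mcp-langgraph-dr-test \
  --db-snapshot-identifier "$LATEST" \
  --db-instance-class db.t3.medium

aws rds wait db-instance-available --db-instance-identifier mcp-langgraph-dr-test

# 2. Verify data integrity (connect with psql and spot-check key tables)

# 3. Delete the test instance
aws rds delete-db-instance \
  --db-instance-identifier mcp-langgraph-dr-test \
  --skip-final-snapshot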
2

Quarterly: Full cluster rebuild

  1. Deploy to staging using Terraform
  2. Restore latest RDS backup
  3. Verify application functionality
  4. Destroy staging cluster
3

Annually: Multi-region failover

  1. Deploy infrastructure in secondary region
  2. Restore cross-region RDS backup
  3. Test application in secondary region
  4. Document failover procedures