EKS Operational Runbooks

Operational procedures, troubleshooting guides, and incident response playbooks for AWS EKS deployments.

Quick Reference

  • Cluster Health: Check cluster and node status
  • Pod Issues: Diagnose and fix pod problems
  • Networking: Resolve connectivity problems
  • Database: RDS backup, restore, and troubleshooting

Cluster Health Checks

Daily Health Check

# 1. Check cluster status
kubectl cluster-info

# 2. Check node health
kubectl get nodes -o wide
kubectl top nodes

# 3. Check system pods
kubectl get pods -n kube-system
kubectl get pods -n amazon-cloudwatch

# 4. Check application pods
kubectl get pods -n mcp-server-langgraph

# 5. Check HPA status
kubectl get hpa -A

# 6. Check Cluster Autoscaler
kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=50

Automated Health Checks

#!/bin/bash
# scripts/eks-health-check.sh

echo "=== EKS Cluster Health Check ==="

# Check API server
if kubectl cluster-info &>/dev/null; then
  echo "✓ API server responsive"
else
  echo "✗ API server not responsive"
  exit 1
fi

# Check node health (count nodes whose STATUS is not Ready)
NOT_READY=$(kubectl get nodes --no-headers | awk '$2 !~ /^Ready/' | wc -l)
if [ "$NOT_READY" -eq 0 ]; then
  echo "✓ All nodes ready"
else
  echo "✗ $NOT_READY nodes not ready"
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/'
fi

# Check critical pods
FAILING_PODS=$(kubectl get pods -A | grep -v Running | grep -v Completed | grep -v NAME | wc -l)
if [ "$FAILING_PODS" -eq 0 ]; then
  echo "✓ All pods healthy"
else
  echo "⚠ $FAILING_PODS pods not running"
  kubectl get pods -A | grep -v Running | grep -v Completed
fi

echo "=== Health Check Complete ==="

Pod Troubleshooting

Runbook: Pod in CrashLoopBackOff

Step 1: Identify the issue

kubectl get pods -n mcp-server-langgraph
# Example output: mcp-server-xxxx   0/1   CrashLoopBackOff

# Check pod events
kubectl describe pod mcp-server-xxxx -n mcp-server-langgraph | tail -20

# Check logs
kubectl logs mcp-server-xxxx -n mcp-server-langgraph --previous

Step 2: Common causes

Application errors:
  • Database connection failed
  • Missing environment variables
  • Config file not found
Resource limits:
  • Out of memory (OOMKilled)
  • CPU throttling
Permission issues:
  • Can’t read secrets
  • IRSA role misconfigured
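
The checks below help narrow down which of these applies; the pod and service account names are placeholders for the failing workload:

# An OOMKilled termination reason points at memory limits; "Error" usually means an application crash
kubectl get pod mcp-server-xxxx -n mcp-server-langgraph \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Confirm the expected environment variables are defined in the pod spec
kubectl get pod mcp-server-xxxx -n mcp-server-langgraph \
  -o jsonpath='{.spec.containers[0].env[*].name}'

# Confirm the service account carries the expected IRSA role annotation
kubectl describe serviceaccount mcp-server-langgraph -n mcp-server-langgraph | grep role-arn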

Step 3: Fix based on cause

# Verify database secret exists
kubectl get secret database-credentials -n mcp-server-langgraph

# Check RDS endpoint is reachable
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h DB_ENDPOINT -U mcp_langgraph -d mcp_langgraph

# Verify RDS security group allows EKS nodes
# Check in AWS Console: RDS > Security Groups

Step 4: Verify the fix

kubectl get pods -n mcp-server-langgraph -w
# Wait for pod to reach Running state

kubectl logs -f mcp-server-xxxx -n mcp-server-langgraph
# Check logs for successful startup

Runbook: Pod Pending (Can’t Schedule)

Step 1: Check why the pod is pending

kubectl describe pod POD_NAME -n mcp-server-langgraph | grep -A 10 Events
Common reasons:
  • Insufficient CPU
  • Insufficient memory
  • No nodes available matching node selector
  • Taint toleration not satisfied
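
To confirm a resource shortfall, compare the pod's requests with what the nodes have left (POD_NAME is a placeholder):

# Requested resources on the pending pod
kubectl get pod POD_NAME -n mcp-server-langgraph \
  -o jsonpath='{.spec.containers[*].resources.requests}'

# Requests already allocated on each node
kubectl describe nodes | grep -A 8 "Allocated resources"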

Step 2: Check the Cluster Autoscaler

# Check if autoscaler is adding nodes
kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=100

# Check whether new nodes are joining (example: the spot node group)
kubectl get nodes -l node.kubernetes.io/lifecycle=spot -o wide

Step 3: Scale manually if needed

# Scale node group via Terraform
cd terraform/environments/prod

# Edit terraform.tfvars
general_node_group_desired_size = 5  # Increase from 3

terraform apply -target=module.eks

# Or via AWS CLI (temporary)
aws eks update-nodegroup-config \
  --cluster-name mcp-langgraph-prod \
  --nodegroup-name general-nodes \
  --scaling-config minSize=2,maxSize=10,desiredSize=5

Step 4: Check for taint issues

# If pod requires specific node group (e.g., compute-optimized)
kubectl describe node NODE_NAME | grep Taints

# Add toleration to pod
# In deployment.yaml:
tolerations:
- key: "workload"
  operator: "Equal"
  value: "llm"
  effect: "NoSchedule"
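
A toleration only allows the pod onto the tainted nodes; pairing it with a nodeSelector or node affinity makes it land there. A quick check that matching nodes exist (the workload=llm label is an assumption mirroring the taint above):

# List nodes carrying the label a nodeSelector would target
kubectl get nodes -l workload=llm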

Runbook: Image Pull Errors

Step 1: Identify the image pull error

kubectl describe pod POD_NAME -n mcp-server-langgraph | grep -A 5 "Failed to pull image"
Error types:
  • ErrImagePull: the pull attempt failed (image missing, no permission, or registry unreachable)
  • ImagePullBackOff: kubelet is backing off after repeated failed pulls

Step 2: Verify the image exists in ECR

aws ecr describe-images \
  --repository-name mcp-server-langgraph \
  --image-ids imageTag=v1.0.0

# List all tags
aws ecr list-images --repository-name mcp-server-langgraph

Step 3: Check node IAM permissions for ECR

# Image pulls are performed by the kubelet using the node instance role,
# which needs ECR read access (e.g., the AmazonEC2ContainerRegistryReadOnly policy)
aws eks describe-nodegroup \
  --cluster-name mcp-langgraph-prod \
  --nodegroup-name general-nodes \
  --query "nodegroup.nodeRole"

# Verify the role has an ECR read policy attached
aws iam list-attached-role-policies --role-name NODE_ROLE_NAME

Step 4: Check VPC endpoints

# Verify ECR endpoints exist (both ecr.api and ecr.dkr are needed for private pulls)
aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.us-east-1.ecr.api,com.amazonaws.us-east-1.ecr.dkr" \
  --query "VpcEndpoints[].VpcEndpointId"

# Should return endpoint IDs for both ECR API and DKR

Step 5: Test a manual pull

# SSH into node (via SSM)
aws ssm start-session --target i-INSTANCE_ID

# Authenticate to ECR, then try pulling the image
aws ecr get-login-password --region us-east-1 | \
  sudo docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
sudo docker pull ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph:v1.0.0
# On containerd-based AMIs (EKS 1.24+), use crictl/ctr instead of docker

Networking Issues

Runbook: Pods Can’t Reach Internet

Step 1: Verify the NAT Gateway

# Check NAT gateway status
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query "NatGateways[].NatGatewayId"

# Check NAT gateway route
aws ec2 describe-route-tables \
  --filters "Name=tag:Name,Values=*private*" \
  --query "RouteTables[].Routes"

Step 2: Test from a pod

kubectl run -it --rm debug --image=busybox --restart=Never -- sh

# Inside pod:
wget -O- http://checkip.amazonaws.com
nslookup google.com
ping 8.8.8.8

Step 3: Check security groups

# Get node security group
aws eks describe-cluster \
  --name mcp-langgraph-prod \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId"

# Check outbound rules allow all traffic
aws ec2 describe-security-groups \
  --group-ids sg-XXXXX \
  --query "SecurityGroups[].IpPermissionsEgress"

Runbook: Pod-to-Pod Communication Failing

Step 1: Check NetworkPolicies

kubectl get networkpolicies -A

# Describe policy blocking traffic
kubectl describe networkpolicy POLICY_NAME -n NAMESPACE

Step 2: Test connectivity

# Deploy test pods
kubectl run test-source --image=busybox --restart=Never -- sleep 3600
kubectl run test-dest --image=nginx --restart=Never

# Get dest pod IP
DEST_IP=$(kubectl get pod test-dest -o jsonpath='{.status.podIP}')

# Test from source
kubectl exec test-source -- wget -O- http://$DEST_IP

# Clean up
kubectl delete pod test-source test-dest

Step 3: Check the VPC CNI

# Check VPC CNI pod logs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100

# Restart VPC CNI if needed
kubectl rollout restart daemonset aws-node -n kube-system

RDS Operations

Runbook: RDS Backup and Restore

Step 1: Verify automated backups

aws rds describe-db-instances \
  --db-instance-identifier mcp-langgraph-prod \
  --query "DBInstances[].{Backup:BackupRetentionPeriod,Window:PreferredBackupWindow}"

# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier mcp-langgraph-prod \
  --query "DBSnapshots[].[DBSnapshotIdentifier,SnapshotCreateTime,Status]"

Step 2: Create a manual snapshot

aws rds create-db-snapshot \
  --db-instance-identifier mcp-langgraph-prod \
  --db-snapshot-identifier mcp-langgraph-prod-manual-$(date +%Y%m%d-%H%M)

# Wait for snapshot to complete
aws rds wait db-snapshot-completed \
  --db-snapshot-identifier mcp-langgraph-prod-manual-YYYYMMDD-HHMM

Step 3: Restore from a snapshot

# Restore to new instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mcp-langgraph-restored \
  --db-snapshot-identifier mcp-langgraph-prod-manual-YYYYMMDD-HHMM \
  --db-instance-class db.t3.medium \
  --vpc-security-group-ids sg-XXXXX \
  --db-subnet-group-name mcp-langgraph-prod-db-subnet

# Wait for restore to complete (~10-15 min)
aws rds wait db-instance-available \
  --db-instance-identifier mcp-langgraph-restored

Step 4: Point the application at the restored DB

# Get restored DB endpoint
RESTORED_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier mcp-langgraph-restored \
  --query "DBInstances[].Endpoint.Address" \
  --output text)

# Update secret
kubectl patch secret database-credentials \
  -n mcp-server-langgraph \
  -p "{\"data\":{\"host\":\"$(echo -n $RESTORED_ENDPOINT | base64)\"}}"

# Restart pods to pick up new endpoint
kubectl rollout restart deployment mcp-server-langgraph -n mcp-server-langgraph

Runbook: RDS Performance Issues

Step 1: Check Performance Insights

# View in AWS Console
# RDS > mcp-langgraph-prod > Monitoring > Performance Insights

# Or via CLI (a time range is required)
aws pi get-resource-metrics \
  --service-type RDS \
  --identifier db-XXXXX \
  --metric-queries file://metrics-query.json \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

Step 2: Check slow queries

# View slow query log in CloudWatch
aws logs tail /aws/rds/instance/mcp-langgraph-prod/postgresql --follow

# Or connect and check directly
kubectl run -it --rm psql --image=postgres:15 --restart=Never -- \
  psql -h DB_ENDPOINT -U mcp_langgraph -d mcp_langgraph

# Inside psql:
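-- requires the pg_stat_statements extension to be enabled on the instance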
SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

Step 3: Check connections

# Current connections (CloudWatch DatabaseConnections metric)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=mcp-langgraph-prod \
  --start-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Maximum

# Max connections setting
aws rds describe-db-parameters \
  --db-parameter-group-name mcp-langgraph-prod \
  --query "Parameters[?ParameterName=='max_connections']"

Step 4: Scale up if needed

# Modify instance class (requires restart)
cd terraform/environments/prod

# Edit terraform.tfvars
rds_instance_class = "db.t3.large"  # Upgrade from db.t3.medium

terraform apply -target=module.rds

# Or via CLI (immediate, causes downtime)
aws rds modify-db-instance \
  --db-instance-identifier mcp-langgraph-prod \
  --db-instance-class db.t3.large \
  --apply-immediately

ElastiCache Operations

Runbook: Redis Connection Issues

Step 1: Verify Redis is running

aws elasticache describe-replication-groups \
  --replication-group-id mcp-langgraph-prod \
  --query "ReplicationGroups[].Status"

Step 2: Test the connection from a pod

kubectl run -it --rm redis-test --image=redis:7-alpine --restart=Never -- sh

# Inside pod:
redis-cli -h REDIS_ENDPOINT -p 6379 -a AUTH_TOKEN
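# Add --tls if in-transit encryption is enabled on the replication group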

# Test commands:
PING
INFO
CLUSTER INFO  # If cluster mode

Step 3: Check the security group

# Get the Redis primary endpoint
aws elasticache describe-replication-groups \
  --replication-group-id mcp-langgraph-prod \
  --query "ReplicationGroups[].NodeGroups[].PrimaryEndpoint"

# Get the security groups attached to the cache nodes
aws elasticache describe-cache-clusters \
  --query "CacheClusters[?starts_with(CacheClusterId,'mcp-langgraph-prod')].SecurityGroups"

# Verify EKS nodes can reach Redis on port 6379

Step 4: Check application logs

kubectl logs -f deployment/mcp-server-langgraph -n mcp-server-langgraph | grep -i redis

Runbook: Redis Failover Testing

Step 1: Trigger a manual failover

# Test Multi-AZ failover
aws elasticache test-failover \
  --replication-group-id mcp-langgraph-prod \
  --node-group-id 0001  # Primary shard

# Monitor failover
aws elasticache describe-events \
  --source-type replication-group \
  --source-identifier mcp-langgraph-prod \
  --duration 60

Step 2: Verify application resilience

# Monitor application logs during failover
kubectl logs -f deployment/mcp-server-langgraph -n mcp-server-langgraph

# Application should reconnect automatically
# Check for Redis connection errors
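
One way to quantify the impact, assuming the application logs Redis failures with "redis" and "error" in the message:

# Count Redis-related errors logged during the failover window
kubectl logs deployment/mcp-server-langgraph -n mcp-server-langgraph --since=10m \
  | grep -i redis | grep -ci error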

Step 3: Check metrics

# View failover in CloudWatch
# CloudWatch > Metrics > ElastiCache > Replication Group

# Key metrics:
# - ReplicationLag (may spike briefly during failover)
# - EngineCPUUtilization and CurrConnections on the new primary
# The primary endpoint DNS name does not change; it is remapped to the promoted replica
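
The same data is available from the CLI; ElastiCache metrics are published per cache cluster (node), so the CacheClusterId value below is an assumption:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ElastiCache \
  --metric-name ReplicationLag \
  --dimensions Name=CacheClusterId,Value=mcp-langgraph-prod-002 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum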

Cluster Autoscaler Operations

Runbook: Cluster Autoscaler Not Scaling

Step 1: Check autoscaler logs

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=200 | grep -E "(scale|node)"

# Look for:
# - "ScaleUp: group X -> Y nodes"
# - "ScaleDown: removing node Z"
# - Errors about IAM permissions

Step 2: Verify IRSA permissions

# Check service account annotation
kubectl describe serviceaccount cluster-autoscaler -n kube-system | grep role-arn

# Check IAM role has autoscaling permissions
aws iam get-role-policy \
  --role-name mcp-langgraph-prod-cluster-autoscaler \
  --policy-name cluster-autoscaler-policy

Step 3: Check node group limits

# Verify max_size is not reached
aws eks describe-nodegroup \
  --cluster-name mcp-langgraph-prod \
  --nodegroup-name general-nodes \
  --query "nodegroup.scalingConfig"

# Output should show: {minSize, maxSize, desiredSize}

Step 4: Check for pending pods

kubectl get pods -A | grep Pending

# Autoscaler only scales up if pods are pending
# Check pod events for reason
kubectl describe pod PENDING_POD -n NAMESPACE

Monitoring & Alerts

CloudWatch Alarms

Critical alarms to configure:
# EKS node CPU > 80% (scope the alarm to the node group's Auto Scaling Group)
aws cloudwatch put-metric-alarm \
  --alarm-name eks-node-cpu-high \
  --alarm-description "EKS node CPU > 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --dimensions Name=AutoScalingGroupName,Value=ASG_NAME \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2

# RDS CPU > 80%
aws cloudwatch put-metric-alarm \
  --alarm-name rds-cpu-high \
  --alarm-description "RDS CPU > 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=mcp-langgraph-prod \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2

# ElastiCache Memory > 90%
aws cloudwatch put-metric-alarm \
  --alarm-name redis-memory-high \
  --alarm-description "Redis memory > 90%" \
  --metric-name DatabaseMemoryUsagePercentage \
  --namespace AWS/ElastiCache \
  --dimensions Name=ReplicationGroupId,Value=mcp-langgraph-prod \
  --statistic Average \
  --period 300 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2
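
These alarms do not notify anyone on their own. A minimal sketch wiring them to an SNS topic (topic name and subscription address are assumptions):

# Create a topic and subscribe the on-call address
aws sns create-topic --name ops-alerts
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT_ID:ops-alerts \
  --protocol email \
  --notification-endpoint oncall@example.com

# Re-run each put-metric-alarm command above with:
#   --alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:ops-alerts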

Incident Response

Runbook: Complete Cluster Outage

Step 1: Assess impact

# Check if API server is reachable
kubectl cluster-info

# Check AWS service health
# https://health.aws.amazon.com/health/status

# Check which CloudWatch alarms are currently firing
aws cloudwatch describe-alarms --state-value ALARM

Step 2: Check the control plane

# View control plane logs
aws logs tail /aws/eks/mcp-langgraph-prod/cluster --follow

# Check control plane status
aws eks describe-cluster --name mcp-langgraph-prod \
  --query "cluster.status"

Step 3: Check the nodes

# List EC2 instances
aws ec2 describe-instances \
  --filters "Name=tag:eks:cluster-name,Values=mcp-langgraph-prod" \
  --query "Reservations[].Instances[].[InstanceId,State.Name,PrivateIpAddress]"

# Check Auto Scaling Group
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[?contains(AutoScalingGroupName,'mcp-langgraph-prod')]"

Step 4: Recovery options

AWS manages control plane recovery automatically. Wait 5-10 minutes for the control plane to recover; if the outage persists for more than 15 minutes, open a case with AWS Support.
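
While waiting, the cluster status can be polled from the CLI (Ctrl-C to stop):

watch -n 30 "aws eks describe-cluster --name mcp-langgraph-prod --query cluster.status --output text"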

Step 5: Post-incident review

  • Document timeline
  • Analyze CloudWatch logs
  • Review metrics during incident
  • Update runbooks based on learnings

Disaster Recovery

RTO/RPO Targets

Service         RTO       RPO       Recovery Method
EKS Cluster     30 min    0         Terraform re-deploy
RDS Database    2 hours   5 min     Snapshot restore
ElastiCache     1 hour    1 hour    Snapshot restore
Application     15 min    0         GitOps redeploy
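
The 5-minute RPO for RDS relies on point-in-time recovery from automated backups; a sketch of the restore call (the target instance name is an assumption):

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mcp-langgraph-prod \
  --target-db-instance-identifier mcp-langgraph-pitr \
  --use-latest-restorable-time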

Disaster Recovery Test Plan

Step 1: Monthly snapshot restore test

  1. Create test RDS instance from latest snapshot
  2. Verify data integrity
  3. Delete test instance

Step 2: Quarterly full cluster rebuild

  1. Deploy to staging using Terraform
  2. Restore latest RDS backup
  3. Verify application functionality
  4. Destroy staging cluster

Step 3: Annual multi-region failover test

  1. Deploy infrastructure in secondary region
  2. Restore cross-region RDS backup
  3. Test application in secondary region
  4. Document failover procedures
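
The annual test assumes a snapshot is available in the secondary region; a sketch of copying one there (regions and identifiers are assumptions, reusing the manual snapshot naming above):

aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:us-east-1:ACCOUNT_ID:snapshot:mcp-langgraph-prod-manual-YYYYMMDD-HHMM \
  --target-db-snapshot-identifier mcp-langgraph-prod-dr-copy \
  --source-region us-east-1 \
  --region us-west-2
# For encrypted snapshots, also pass --kms-key-id with a key in the destination region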

Related Documentation

  • EKS Production Guide: Complete deployment guide
  • Terraform AWS: Infrastructure documentation
  • AWS Security: Security hardening guide
  • Cost Optimization: Cost optimization strategies