
EKS Operational Runbooks

Operational procedures, troubleshooting guides, and incident response playbooks for AWS EKS deployments.


Cluster Health Checks

Daily Health Check

# 1. Check cluster status
kubectl cluster-info

# 2. Check node health
kubectl get nodes -o wide
kubectl top nodes

# 3. Check system pods
kubectl get pods -n kube-system
kubectl get pods -n amazon-cloudwatch

# 4. Check application pods
kubectl get pods -n mcp-server-langgraph

# 5. Check HPA status
kubectl get hpa -A

# 6. Check Cluster Autoscaler
kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=50

Automated Health Checks

#!/bin/bash
# scripts/eks-health-check.sh

echo "=== EKS Cluster Health Check ==="

# Check API server
if kubectl cluster-info &>/dev/null; then
  echo "✓ API server responsive"
else
  echo "✗ API server not responsive"
  exit 1
fi

# Check node health
NOT_READY=$(kubectl get nodes --no-headers | awk '$2 !~ /^Ready/' | wc -l)
if [ "$NOT_READY" -eq 0 ]; then
  echo "✓ All nodes ready"
else
  echo "✗ $NOT_READY nodes not ready"
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/'
fi

# Check critical pods
FAILING_PODS=$(kubectl get pods -A | grep -v Running | grep -v Completed | grep -v NAME | wc -l)
if [ "$FAILING_PODS" -eq 0 ]; then
  echo "✓ All pods healthy"
else
  echo "⚠ $FAILING_PODS pods not running"
  kubectl get pods -A | grep -v Running | grep -v Completed
fi

echo "=== Health Check Complete ==="

Pod Troubleshooting

Runbook: Pod in CrashLoopBackOff

1

Identify the issue

kubectl get pods -n mcp-server-langgraph
# Output: pod/mcp-server-xxxx  0/1  CrashLoopBackOff

# Check pod events
kubectl describe pod mcp-server-xxxx -n mcp-server-langgraph | tail -20

# Check logs
kubectl logs mcp-server-xxxx -n mcp-server-langgraph --previous
2

Common causes

Application errors:
  • Database connection failed
  • Missing environment variables
  • Config file not found
Resource limits:
  • Out of memory (OOMKilled)
  • CPU throttling
Permission issues:
  • Can’t read secrets
  • IRSA role misconfigured
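
A quick way to narrow down the cause is the container's last termination reason (OOMKilled vs. Error); the pod name below is a placeholder:

kubectl get pod mcp-server-xxxx -n mcp-server-langgraph \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'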
3

Fix based on cause

  • Database Connection (commands below)
  • IRSA Issues (see the sketch after the database commands)
  • Resource Limits (check for OOMKilled and raise requests/limits in the deployment)
# Verify database secret exists
kubectl get secret database-credentials -n mcp-server-langgraph

# Check RDS endpoint is reachable
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h DB_ENDPOINT -U mcp_langgraph -d mcp_langgraph

# Verify RDS security group allows EKS nodes
# Check in AWS Console: RDS > Security Groups
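
For the IRSA case, a minimal check; the service account name is assumed to match the deployment and the role name is a placeholder:

# Verify the service account is annotated with an IAM role
kubectl describe serviceaccount mcp-server-langgraph -n mcp-server-langgraph | grep role-arn

# Confirm the role's trust policy references the cluster's OIDC provider
aws iam get-role --role-name ROLE_NAME --query "Role.AssumeRolePolicyDocument"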
4

Verify fix

kubectl get pods -n mcp-server-langgraph -w
# Wait for pod to reach Running state

kubectl logs -f mcp-server-xxxx -n mcp-server-langgraph
# Check logs for successful startup

Runbook: Pod Pending (Can’t Schedule)

1

Check why pod is pending

kubectl describe pod POD_NAME -n mcp-server-langgraph | grep -A 10 Events
Common reasons:
  • Insufficient CPU
  • Insufficient memory
  • No nodes available matching node selector
  • Taint toleration not satisfied
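
To see the scheduler's reasoning for all pending pods at once, the FailedScheduling events are usually the fastest signal:

kubectl get events -n mcp-server-langgraph \
  --field-selector reason=FailedScheduling --sort-by=.lastTimestamp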
2

Check Cluster Autoscaler

# Check if autoscaler is adding nodes
kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=100

# Check whether new nodes have joined (spot node group shown as an example)
kubectl get nodes -l node.kubernetes.io/lifecycle=spot -o wide
3

Manual scaling if needed

# Scale node group via Terraform
cd terraform/environments/prod

# Edit terraform.tfvars
general_node_group_desired_size = 5  # Increase from 3

terraform apply -target=module.eks

# Or via AWS CLI (temporary)
aws eks update-nodegroup-config \
  --cluster-name mcp-langgraph-prod \
  --nodegroup-name general-nodes \
  --scaling-config minSize=2,maxSize=10,desiredSize=5
4

Check for taint issues

# If pod requires specific node group (e.g., compute-optimized)
kubectl describe node NODE_NAME | grep Taints

# Add toleration to pod
# In deployment.yaml:
tolerations:
- key: "workload"
  operator: "Equal"
  value: "llm"
  effect: "NoSchedule"
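
To confirm a matching node group actually exists, check node labels alongside the taint; this assumes the compute-optimized group is labeled workload=llm to mirror its taint:

kubectl get nodes -L workload
# Nodes in the target group should show "llm" in the WORKLOAD column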

Runbook: Image Pull Errors

1

Identify image pull error

kubectl describe pod POD_NAME -n mcp-server-langgraph | grep -A 5 "Failed to pull image"
Error types:
  • ImagePullBackOff: Image not found or no permission
  • ErrImagePull: Network issue or registry down
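
To spot pull failures across the namespace without describing pods one by one (a sketch; the event reasons are standard kubelet values):

kubectl get events -n mcp-server-langgraph \
  --field-selector reason=Failed --sort-by=.lastTimestamp | grep -i pull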
2

Verify image exists in ECR

aws ecr describe-images \
  --repository-name mcp-server-langgraph \
  --image-ids imageTag=v1.0.0

# List all tags
aws ecr list-images --repository-name mcp-server-langgraph
3

Check node IAM permissions

# Image pulls are performed by the kubelet/container runtime using the node's IAM role,
# which needs ECR read access (e.g., the AmazonEC2ContainerRegistryReadOnly managed policy)
aws eks describe-nodegroup \
  --cluster-name mcp-langgraph-prod \
  --nodegroup-name general-nodes \
  --query "nodegroup.nodeRole"

# Verify the role has ECR read permissions attached
aws iam list-attached-role-policies --role-name NODE_ROLE_NAME
4

Check VPC endpoints

# Verify ECR endpoints exist
aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.us-east-1.ecr.api,com.amazonaws.us-east-1.ecr.dkr" \
  --query "VpcEndpoints[].[VpcEndpointId,ServiceName]"

# Should return endpoint IDs for ECR API and DKR
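
ECR stores image layers in S3, so a fully private cluster also needs the S3 gateway endpoint; a quick check, with the region assumed to be us-east-1 as above:

aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.us-east-1.s3" \
  --query "VpcEndpoints[].[VpcEndpointId,VpcEndpointType]"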
5

Test manual pull

# SSH into node (via SSM)
aws ssm start-session --target i-INSTANCE_ID

# Authenticate to ECR, then try pulling the image (use crictl on containerd-based AMIs)
aws ecr get-login-password --region us-east-1 | \
  sudo docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
sudo docker pull ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph:v1.0.0

Networking Issues

Runbook: Pods Can’t Reach Internet

1

Verify NAT Gateway

# Check NAT gateway status
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query "NatGateways[].NatGatewayId"

# Check NAT gateway route
aws ec2 describe-route-tables \
  --filters "Name=tag:Name,Values=*private*" \
  --query "RouteTables[].Routes"
2

Test from pod

kubectl run -it --rm debug --image=busybox --restart=Never -- sh

# Inside pod:
wget -O- http://checkip.amazonaws.com
nslookup google.com
ping 8.8.8.8
3

Check security groups

# Get node security group
aws eks describe-cluster \
  --name mcp-langgraph-prod \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId"

# Check outbound rules allow all traffic
aws ec2 describe-security-groups \
  --group-ids sg-XXXXX \
  --query "SecurityGroups[].IpPermissionsEgress"

Runbook: Pod-to-Pod Communication Failing

1

Check NetworkPolicies

kubectl get networkpolicies -A

# Describe policy blocking traffic
kubectl describe networkpolicy POLICY_NAME -n NAMESPACE
2

Test connectivity

# Deploy test pods
kubectl run test-source --image=busybox --restart=Never -- sleep 3600
kubectl run test-dest --image=nginx --restart=Never

# Get dest pod IP
DEST_IP=$(kubectl get pod test-dest -o jsonpath='{.status.podIP}')

# Test from source
kubectl exec test-source -- wget -O- http://$DEST_IP

# Clean up
kubectl delete pod test-source test-dest
3

Check VPC CNI

# Check VPC CNI pod logs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100

# Restart VPC CNI if needed
kubectl rollout restart daemonset aws-node -n kube-system

RDS Operations

Runbook: RDS Backup and Restore

1

Verify automated backups

aws rds describe-db-instances \
  --db-instance-identifier mcp-langgraph-prod \
  --query "DBInstances[].{Backup:BackupRetentionPeriod,Window:PreferredBackupWindow}"

# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier mcp-langgraph-prod \
  --query "DBSnapshots[].[DBSnapshotIdentifier,SnapshotCreateTime,Status]"
2

Create manual snapshot

aws rds create-db-snapshot \
  --db-instance-identifier mcp-langgraph-prod \
  --db-snapshot-identifier mcp-langgraph-prod-manual-$(date +%Y%m%d-%H%M)

# Wait for snapshot to complete
aws rds wait db-snapshot-completed \
  --db-snapshot-identifier mcp-langgraph-prod-manual-YYYYMMDD-HHMM
3

Restore from snapshot

# Restore to new instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mcp-langgraph-restored \
  --db-snapshot-identifier mcp-langgraph-prod-manual-YYYYMMDD-HHMM \
  --db-instance-class db.t3.medium \
  --vpc-security-group-ids sg-XXXXX \
  --db-subnet-group-name mcp-langgraph-prod-db-subnet

# Wait for restore to complete (~10-15 min)
aws rds wait db-instance-available \
  --db-instance-identifier mcp-langgraph-restored
4

Point application to restored DB

# Get restored DB endpoint
RESTORED_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier mcp-langgraph-restored \
  --query "DBInstances[].Endpoint.Address" \
  --output text)

# Update secret
kubectl patch secret database-credentials \
  -n mcp-server-langgraph \
  -p "{\"data\":{\"host\":\"$(echo -n $RESTORED_ENDPOINT | base64)\"}}"

# Restart pods to pick up new endpoint
kubectl rollout restart deployment mcp-server-langgraph -n mcp-server-langgraph
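
Before relying on the restart, it's worth confirming the patched value decodes to the restored endpoint:

kubectl get secret database-credentials -n mcp-server-langgraph \
  -o jsonpath='{.data.host}' | base64 -d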

Runbook: RDS Performance Issues

1

Check Performance Insights

# View in AWS Console
# RDS > mcp-langgraph-prod > Monitoring > Performance Insights

# Or via CLI
aws pi get-resource-metrics \
  --service-type RDS \
  --identifier db-XXXXX \
  --metric-queries file://metrics-query.json
2

Check slow queries

# View slow query log in CloudWatch
aws logs tail /aws/rds/instance/mcp-langgraph-prod/postgresql --follow

# Or connect and check directly
kubectl run -it --rm psql --image=postgres:15 --restart=Never -- \
  psql -h DB_ENDPOINT -U mcp_langgraph -d mcp_langgraph

# Inside psql (requires the pg_stat_statements extension):
SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
3

Check connections

# Current connections (from psql)
kubectl run -it --rm psql --image=postgres:15 --restart=Never -- \
  psql -h DB_ENDPOINT -U mcp_langgraph -d mcp_langgraph \
  -c "SELECT count(*) FROM pg_stat_activity;"

# Max connections setting
aws rds describe-db-parameters \
  --db-parameter-group-name mcp-langgraph-prod \
  --query "Parameters[?ParameterName=='max_connections']"
4

Scale up if needed

# Modify instance class (requires restart)
cd terraform/environments/prod

# Edit terraform.tfvars
rds_instance_class = "db.t3.large"  # Upgrade from db.t3.medium

terraform apply -target=module.rds

# Or via CLI (immediate, causes downtime)
aws rds modify-db-instance \
  --db-instance-identifier mcp-langgraph-prod \
  --db-instance-class db.t3.large \
  --apply-immediately

ElastiCache Operations

Runbook: Redis Connection Issues

1

Verify Redis is running

aws elasticache describe-replication-groups \
  --replication-group-id mcp-langgraph-prod \
  --query "ReplicationGroups[].Status"
2

Test connection from pod

kubectl run -it --rm redis-test --image=redis:7-alpine --restart=Never -- sh

# Inside pod (ElastiCache AUTH requires in-transit encryption, so --tls is needed):
redis-cli --tls -h REDIS_ENDPOINT -p 6379 -a AUTH_TOKEN

# Test commands:
PING
INFO
CLUSTER INFO  # If cluster mode
3

Check security group

# Get the ElastiCache security groups
aws elasticache describe-cache-clusters \
  --query "CacheClusters[?ReplicationGroupId=='mcp-langgraph-prod'].SecurityGroups"

# Verify EKS nodes can reach Redis on port 6379
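
A concrete way to verify this is to inspect the ingress rules on the ElastiCache security group for port 6379; sg-XXXXX is the placeholder used above, and the source should be the EKS node or cluster security group:

aws ec2 describe-security-groups \
  --group-ids sg-XXXXX \
  --query 'SecurityGroups[].IpPermissions[?ToPort==`6379`]'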
4

Check application logs

kubectl logs -f deployment/mcp-server-langgraph -n mcp-server-langgraph | grep -i redis

Runbook: Redis Failover Testing

1

Trigger manual failover

# Test Multi-AZ failover
aws elasticache test-failover \
  --replication-group-id mcp-langgraph-prod \
  --node-group-id 0001  # Primary shard

# Monitor failover
aws elasticache describe-events \
  --source-type replication-group \
  --source-identifier mcp-langgraph-prod \
  --duration 60
2

Verify application resilience

# Monitor application logs during failover
kubectl logs -f deployment/mcp-server-langgraph -n mcp-server-langgraph

# Application should reconnect automatically
# Check for Redis connection errors
3

Check metrics

# View failover in CloudWatch
# CloudWatch > Metrics > ElastiCache > Replication Group

# Key signals:
# - ReplicationLag metric (spikes during failover)
# - Failover completion events in the describe-events output above

Cluster Autoscaler Operations

Runbook: Cluster Autoscaler Not Scaling

1

Check autoscaler logs

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=200 | grep -E "(scale|node)"

# Look for:
# - "ScaleUp: group X -> Y nodes"
# - "ScaleDown: removing node Z"
# - Errors about IAM permissions
2

Verify IRSA permissions

# Check service account annotation
kubectl describe serviceaccount cluster-autoscaler -n kube-system | grep role-arn

# Check IAM role has autoscaling permissions
aws iam get-role-policy \
  --role-name mcp-langgraph-prod-cluster-autoscaler \
  --policy-name cluster-autoscaler-policy
3

Check node group limits

# Verify max_size is not reached
aws eks describe-nodegroup \
  --cluster-name mcp-langgraph-prod \
  --nodegroup-name general-nodes \
  --query "nodegroup.scalingConfig"

# Output should show: {minSize, maxSize, desiredSize}
4

Check for pending pods

kubectl get pods -A | grep Pending

# Autoscaler only scales up if pods are pending
# Check pod events for reason
kubectl describe pod PENDING_POD -n NAMESPACE

Monitoring & Alerts

CloudWatch Alarms

Critical alarms to configure:
# EKS Node CPU > 80% (scoped to the node group's Auto Scaling Group)
aws cloudwatch put-metric-alarm \
  --alarm-name eks-node-cpu-high \
  --alarm-description "EKS node CPU > 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --dimensions Name=AutoScalingGroupName,Value=ASG_NAME \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2

# RDS CPU > 80%
aws cloudwatch put-metric-alarm \
  --alarm-name rds-cpu-high \
  --alarm-description "RDS CPU > 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=mcp-langgraph-prod \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2

# ElastiCache Memory > 90%
aws cloudwatch put-metric-alarm \
  --alarm-name redis-memory-high \
  --alarm-description "Redis memory > 90%" \
  --metric-name DatabaseMemoryUsagePercentage \
  --namespace AWS/ElastiCache \
  --dimensions Name=ReplicationGroupId,Value=mcp-langgraph-prod \
  --statistic Average \
  --period 300 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2
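
As written, these alarms only change state; to notify anyone, each put-metric-alarm also needs --alarm-actions pointing at an SNS topic. A minimal sketch, with the topic name and email address purely illustrative:

# Create a topic and subscribe the on-call address
aws sns create-topic --name eks-ops-alerts
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT_ID:eks-ops-alerts \
  --protocol email \
  --notification-endpoint oncall@example.com

# Then add to each put-metric-alarm command above:
#   --alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:eks-ops-alerts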

Incident Response

Runbook: Complete Cluster Outage

1

Assess impact

# Check if API server is reachable
kubectl cluster-info

# Check AWS service health
# https://health.aws.amazon.com/health/status

# Check CloudWatch alarms currently in ALARM state
aws cloudwatch describe-alarms --state-value ALARM
2

Check control plane

# View control plane logs
aws logs tail /aws/eks/mcp-langgraph-prod/cluster --follow

# Check control plane status
aws eks describe-cluster --name mcp-langgraph-prod \
  --query "cluster.status"
3

Check nodes

# List EC2 instances
aws ec2 describe-instances \
  --filters "Name=tag:eks:cluster-name,Values=mcp-langgraph-prod" \
  --query "Reservations[].Instances[].[InstanceId,State.Name,PrivateIpAddress]"

# Check Auto Scaling Group
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[?contains(AutoScalingGroupName,'mcp-langgraph-prod')]"
4

Recovery options

  • API Server Down: AWS handles control plane recovery automatically. Wait 5-10 minutes for the control plane to recover; if the outage persists beyond 15 minutes, contact AWS Support.
  • All Nodes Down: scale the node groups back up (see the manual scaling and Cluster Autoscaler runbooks above).
  • Database Down: restore from snapshot (see the RDS Backup and Restore runbook above).
5

Post-incident review

  • Document timeline
  • Analyze CloudWatch logs
  • Review metrics during incident
  • Update runbooks based on learnings

Disaster Recovery

RTO/RPO Targets

Service      | RTO     | RPO    | Recovery Method
-------------|---------|--------|--------------------
EKS Cluster  | 30 min  | 0      | Terraform re-deploy
RDS Database | 2 hours | 5 min  | Snapshot restore
ElastiCache  | 1 hour  | 1 hour | Snapshot restore
Application  | 15 min  | 0      | GitOps redeploy

Disaster Recovery Test Plan

1

Monthly: Snapshot restore test

  1. Create test RDS instance from latest snapshot
  2. Verify data integrity
  3. Delete test instance
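
A minimal sketch of this monthly test, assuming the instance identifiers used throughout this document; the test instance name is illustrative:

# 1. Restore the most recent snapshot into a throwaway instance
LATEST=$(aws rds describe-db-snapshots \
  --db-instance-identifier mcp-langgraph-prod \
  --query "max_by(DBSnapshots, &SnapshotCreateTime).DBSnapshotIdentifier" \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mcp-langgraph-dr-test \
  --db-snapshot-identifier "$LATEST" \
  --db-instance-class db.t3.medium

aws rds wait db-instance-available --db-instance-identifier mcp-langgraph-dr-test

# 2. Verify data integrity (connect with psql and spot-check key tables)

# 3. Delete the test instance
aws rds delete-db-instance \
  --db-instance-identifier mcp-langgraph-dr-test \
  --skip-final-snapshot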
2

Quarterly: Full cluster rebuild

  1. Deploy to staging using Terraform
  2. Restore latest RDS backup
  3. Verify application functionality
  4. Destroy staging cluster
3

Annually: Multi-region failover

  1. Deploy infrastructure in secondary region
  2. Restore cross-region RDS backup
  3. Test application in secondary region
  4. Document failover procedures