EKS Production Deployment
Complete guide to deploying mcp-server-langgraph on AWS EKS with production-grade infrastructure, security, and observability.
Overview
This deployment achieves 96/100 infrastructure maturity with:
Infrastructure as Code: Terraform modules for VPC, EKS, RDS, ElastiCache
High Availability: Multi-AZ across all services with automatic failover
Security First: IRSA, encryption everywhere, network isolation
Cost Optimized: ~$824/month (roughly 60% savings vs. defaults)
What You’ll Deploy
Prerequisites
AWS Account Setup
AWS account with admin access
AWS CLI installed and configured (aws configure)
Account limits: 5 VPCs, 20 EIPs, 100 security groups per region
Local Tools
# Install required tools
brew install terraform kubectl awscli
# Verify versions
terraform version # >= 1.5.0
kubectl version --client # >= 1.27.0
aws --version # >= 2.13.0
Repository
git clone https://github.com/vishnu2kmohan/mcp-server-langgraph
cd mcp-server-langgraph
Deployment Architecture
Infrastructure Layers
Layer 1: Networking
Layer 2: Compute
Layer 3: Data
Layer 4: Security
VPC Module (terraform/modules/vpc)
3 Availability Zones (us-east-1a/b/c)
Public subnets (/20) for load balancers
Private subnets (/18) for workloads
NAT Gateways (multi-AZ)
VPC Endpoints (S3, ECR, CloudWatch)
VPC Flow Logs
Capacity: 16,384 IPs per private subnet (~300 EKS nodes per AZ)
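The capacity figure is plain CIDR arithmetic; a quick sanity check (pure shell, no AWS calls):

```shell
# A /18 subnet spans 2^(32-18) addresses. AWS reserves 5 IPs per subnet,
# so usable capacity is slightly lower, but the order of magnitude holds.
prefix=18
total=$(( 2 ** (32 - prefix) ))
echo "$total"   # 16384
```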
Step-by-Step Deployment
Phase 1: Terraform Backend Setup
Initialize backend
cd terraform/backend-setup
Edit variables.tf or create terraform.tfvars:
project_name = "mcp-langgraph"
environment  = "prod"
region       = "us-east-1"
Deploy backend
terraform init
terraform plan
terraform apply
Creates:
S3 bucket: mcp-langgraph-terraform-state-prod
DynamoDB table: mcp-langgraph-terraform-lock-prod
Access logging bucket
Note outputs
Save the S3 bucket name and DynamoDB table name for next phase.
Phase 2: Infrastructure Deployment (20-25 minutes)
Configure environment
cd ../../environments/prod
Create terraform.tfvars:
# Project configuration
project_name = "mcp-langgraph"
environment = "prod"
region = "us-east-1"
# VPC configuration
vpc_cidr = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
single_nat_gateway = false # Multi-AZ NAT for HA
# EKS configuration
cluster_name = "mcp-langgraph-prod"
kubernetes_version = "1.28"
# Node groups
enable_general_node_group = true
general_node_group_desired_size = 3
general_node_group_min_size = 2
general_node_group_max_size = 10
general_node_group_instance_types = [ "t3.xlarge" ]
enable_spot_node_group = true
spot_node_group_desired_size = 2
# RDS configuration
rds_instance_class = "db.t3.medium"
rds_allocated_storage = 100
rds_multi_az = true
rds_backup_retention_days = 30
# ElastiCache configuration
elasticache_node_type = "cache.r6g.large"
elasticache_cluster_mode_enabled = false # Standard mode
elasticache_num_cache_nodes = 2
# IRSA
create_application_irsa_role = true
application_service_account_name = "mcp-server-langgraph"
application_service_account_namespace = "mcp-server-langgraph"
tags = {
Environment = "production"
ManagedBy = "terraform"
Project = "mcp-langgraph"
}
Initialize Terraform
terraform init \
  -backend-config="bucket=mcp-langgraph-terraform-state-prod" \
  -backend-config="key=prod/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=mcp-langgraph-terraform-lock-prod"
Plan deployment
terraform plan -out=tfplan
Review:
~50 resources will be created
VPC, subnets, NAT gateways
EKS cluster and node groups
RDS instance
ElastiCache cluster
IAM roles and policies
Deploy infrastructure
terraform apply tfplan
Duration: 20-25 minutes
VPC: ~2 minutes
EKS control plane: ~10 minutes
Node groups: ~8 minutes
RDS Multi-AZ: ~12 minutes (parallel with EKS)
ElastiCache: ~5 minutes (parallel with EKS)
Save outputs
terraform output > outputs.txt
# Important outputs:
terraform output -raw cluster_endpoint
terraform output -raw cluster_certificate_authority_data
terraform output -raw application_irsa_role_arn
terraform output -raw db_instance_endpoint
terraform output -raw elasticache_configuration_endpoint
Phase 3: Kubernetes Configuration (10 minutes)
Configure kubectl
aws eks update-kubeconfig \
--region us-east-1 \
--name mcp-langgraph-prod \
--alias mcp-prod
# Verify connection
kubectl get nodes
kubectl get pods -A
Create namespace
kubectl create namespace mcp-server-langgraph
# Apply Pod Security Standards
kubectl label namespace mcp-server-langgraph \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/audit=restricted \
pod-security.kubernetes.io/warn=restricted
Create service account with IRSA
# k8s/service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mcp-server-langgraph
  namespace: mcp-server-langgraph
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/mcp-langgraph-prod-application
kubectl apply -f k8s/service-account.yaml
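The manifest carries an ACCOUNT_ID placeholder; rather than editing it by hand, one way to resolve the current account ID and splice it in before applying (a sketch, assuming the placeholder text in k8s/service-account.yaml is literally `ACCOUNT_ID`):

```shell
# Resolve the current AWS account ID and substitute it into the manifest.
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
sed "s/ACCOUNT_ID/${ACCOUNT_ID}/" k8s/service-account.yaml | kubectl apply -f -
```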
Create database secrets
# Get RDS credentials from Terraform outputs
DB_PASSWORD=$(terraform output -raw db_instance_password)
DB_ENDPOINT=$(terraform output -raw db_instance_endpoint)
kubectl create secret generic database-credentials \
  --namespace=mcp-server-langgraph \
  --from-literal=host="$DB_ENDPOINT" \
  --from-literal=port=5432 \
  --from-literal=database=mcp_langgraph \
  --from-literal=username=mcp_langgraph \
  --from-literal=password="$DB_PASSWORD"
Create Redis secrets
REDIS_ENDPOINT=$(terraform output -raw elasticache_configuration_endpoint)
REDIS_AUTH_TOKEN=$(terraform output -raw elasticache_auth_token)
kubectl create secret generic redis-credentials \
  --namespace=mcp-server-langgraph \
  --from-literal=host="$REDIS_ENDPOINT" \
  --from-literal=port=6379 \
  --from-literal=auth-token="$REDIS_AUTH_TOKEN"
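Before deploying, it is worth confirming both secrets landed with the expected keys. kubectl stores secret values base64-encoded, so decode when spot-checking a non-sensitive key:

```shell
# List the keys of each secret (values stay encoded):
kubectl get secret database-credentials -n mcp-server-langgraph -o jsonpath='{.data}'
kubectl get secret redis-credentials -n mcp-server-langgraph -o jsonpath='{.data}'
# Decode a single key to spot-check (prints the host, not the password):
kubectl get secret database-credentials -n mcp-server-langgraph \
  -o jsonpath='{.data.host}' | base64 -d
```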
Phase 4: Deploy Application (5 minutes)
Build and push container image
# Authenticate to ECR
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
# Build image
docker build -t mcp-server-langgraph:v1.0.0 .
# Tag for ECR
docker tag mcp-server-langgraph:v1.0.0 \
ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph:v1.0.0
# Push to ECR
docker push ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph:v1.0.0
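The commands above hard-code ACCOUNT_ID; the registry hostname can instead be derived at run time (a sketch, assuming an mcp-server-langgraph ECR repository already exists in the account):

```shell
# Derive the ECR registry hostname from the caller's account ID and region.
REGION=us-east-1
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
ECR_REGISTRY="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"
aws ecr get-login-password --region "$REGION" | \
  docker login --username AWS --password-stdin "$ECR_REGISTRY"
docker tag mcp-server-langgraph:v1.0.0 "${ECR_REGISTRY}/mcp-server-langgraph:v1.0.0"
docker push "${ECR_REGISTRY}/mcp-server-langgraph:v1.0.0"
```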
Deploy using Kustomize
kubectl apply -k deployments/kubernetes/overlays/production-eks
Or manually:
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/ingress.yaml
Verify deployment
# Check pod status
kubectl get pods -n mcp-server-langgraph
# Check logs
kubectl logs -f deployment/mcp-server-langgraph -n mcp-server-langgraph
# Check IRSA
kubectl describe pod POD_NAME -n mcp-server-langgraph | grep "AWS_ROLE_ARN"
Verify database connection
kubectl exec -it deployment/mcp-server-langgraph -n mcp-server-langgraph -- \
psql -h $DB_ENDPOINT -U mcp_langgraph -d mcp_langgraph -c "SELECT version();"
Phase 5: Monitoring & Auto-scaling (10 minutes)
Deploy Cluster Autoscaler
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
# Annotate service account with IRSA role
kubectl annotate serviceaccount cluster-autoscaler \
-n kube-system \
eks.amazonaws.com/role-arn=arn:aws:iam::ACCOUNT_ID:role/mcp-langgraph-prod-cluster-autoscaler
# Pin the autoscaler image to match the cluster's Kubernetes version
kubectl set image deployment cluster-autoscaler \
  -n kube-system \
  cluster-autoscaler=registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
kubectl set env deployment/cluster-autoscaler \
-n kube-system \
AWS_REGION=us-east-1
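The autodiscover manifest ships with a `<YOUR CLUSTER NAME>` placeholder in its `--node-group-auto-discovery` flag; a sketch that points it at this cluster (assuming the default manifest layout):

```shell
# Rewrite the placeholder in the running deployment and re-apply it.
kubectl -n kube-system get deployment cluster-autoscaler -o yaml \
  | sed 's/<YOUR CLUSTER NAME>/mcp-langgraph-prod/' \
  | kubectl apply -f -
```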
Deploy Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify
kubectl get deployment metrics-server -n kube-system
Configure HPA
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-langgraph
  namespace: mcp-server-langgraph
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
kubectl apply -f k8s/hpa.yaml
Configure CloudWatch Container Insights
# Deploy Fluent Bit
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml
# Verify
kubectl get pods -n amazon-cloudwatch
Production Checklist
Security
IRSA configured for all service accounts (no IAM keys)
Secrets stored in AWS Secrets Manager (not in code)
Network policies applied for pod-to-pod traffic
Pod Security Standards enforced (restricted)
RDS and ElastiCache in private subnets only
Encryption enabled for all data at rest (KMS)
TLS enforced for all in-transit data
Security groups follow least-privilege principle
CloudTrail enabled for audit logging
MFA required for AWS console access
High Availability
Multi-AZ deployment for all services
RDS Multi-AZ with automatic failover
ElastiCache Multi-AZ with automatic failover
At least 2 replicas for application pods
Pod Disruption Budgets configured
Topology spread constraints configured
Health checks configured (liveness/readiness probes)
Load balancer health checks configured
Monitoring & Observability
CloudWatch Container Insights enabled
CloudWatch alarms for RDS (CPU, memory, storage, connections)
CloudWatch alarms for ElastiCache (CPU, memory, evictions)
CloudWatch alarms for EKS (node health, pod restarts)
X-Ray tracing configured for distributed tracing
Application logs shipped to CloudWatch Logs
Metrics Server deployed for HPA
Cluster Autoscaler deployed and configured
Backup & Disaster Recovery
RDS automated backups enabled (30-day retention)
RDS final snapshot on deletion enabled
ElastiCache automated snapshots enabled (7-day retention)
Terraform state versioning enabled in S3
Terraform state encrypted with KMS
Disaster recovery runbook documented
Backup restore procedures tested
Cost Optimization
Spot instances configured for fault-tolerant workloads
Cluster Autoscaler removing idle nodes
HPA scaling pods based on utilization
VPC endpoints configured (save 70% on data transfer)
RDS storage auto-scaling enabled
CloudWatch Logs retention configured (not infinite)
Cost allocation tags applied to all resources
AWS Cost Explorer monitoring enabled
Post-Deployment Operations
Accessing the Cluster
# Configure kubectl
aws eks update-kubeconfig --region us-east-1 --name mcp-langgraph-prod
# Get cluster info
kubectl cluster-info
# Get nodes
kubectl get nodes -o wide
# Get all resources
kubectl get all -A
Viewing Logs
# Application logs
kubectl logs -f deployment/mcp-server-langgraph -n mcp-server-langgraph
# System logs (control plane)
# Available in CloudWatch Logs: /aws/eks/mcp-langgraph-prod/cluster
# Node logs
kubectl logs -n kube-system -l k8s-app=aws-node
Scaling
# Manual pod scaling
kubectl scale deployment mcp-server-langgraph -n mcp-server-langgraph --replicas=5
# Manual node scaling (via Terraform)
# Edit terraform.tfvars:
general_node_group_desired_size = 5
terraform apply
# View HPA status
kubectl get hpa -n mcp-server-langgraph
# View Cluster Autoscaler logs
kubectl logs -f deployment/cluster-autoscaler -n kube-system
Updating
# Update application
kubectl set image deployment/mcp-server-langgraph \
-n mcp-server-langgraph \
mcp-server-langgraph=ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph:v1.1.0
# Rollback
kubectl rollout undo deployment/mcp-server-langgraph -n mcp-server-langgraph
# Check rollout status
kubectl rollout status deployment/mcp-server-langgraph -n mcp-server-langgraph
Troubleshooting
See EKS Runbooks for detailed troubleshooting procedures.
Common issues:
Pods pending with 'Insufficient CPU'
Cause: not enough node capacity.
Solution: the Cluster Autoscaler will add nodes automatically. Check:
kubectl describe pods -n mcp-server-langgraph
kubectl logs -f deployment/cluster-autoscaler -n kube-system
Can't pull images from ECR
Cause: missing VPC CNI IRSA permissions.
Solution:
kubectl describe daemonset aws-node -n kube-system | grep role-arn
# Should show IRSA annotation
Can't access Secrets Manager
Cause: missing IRSA role or incorrect ARN.
Solution:
kubectl describe pod POD_NAME -n mcp-server-langgraph | grep AWS_ROLE_ARN
# Should show: arn:aws:iam::ACCOUNT:role/mcp-langgraph-prod-application
Cost Estimate
Production deployment (~$824/month):
Service              Configuration                   Monthly Cost
EKS Control Plane    1 cluster                       $73.00
EC2 Instances        3× t3.xlarge general nodes      $295.20
EC2 Instances        2× t3.large spot nodes          $14.60
RDS PostgreSQL       db.t3.medium Multi-AZ           $157.56
ElastiCache Redis    2× cache.r6g.large              $109.50
NAT Gateway          3× Multi-AZ                     $97.20
VPC Endpoints        6 endpoints                     $21.60
EBS Volumes          5× 100GB gp3                    $40.00
CloudWatch           Logs + metrics                  ~$15.00
Total                                                ~$823.66
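The line items can be re-added in integer cents to confirm the total:

```shell
# Sum the cost table's line items (cents avoids floating point in shell).
total=$(( 7300 + 29520 + 1460 + 15756 + 10950 + 9720 + 2160 + 4000 + 1500 ))
printf '$%d.%02d per month\n' $(( total / 100 )) $(( total % 100 ))   # $823.66 per month
```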
Cost savings: enable spot instances, use a single NAT gateway in dev/staging, right-size node types, and enable the Cluster Autoscaler.
Next Steps
Deploy Production
Follow this guide to deploy your production environment
Configure Monitoring
Set up CloudWatch dashboards and alarms
Enable Auto-scaling
Configure Cluster Autoscaler and HPA
Harden Security
Follow AWS Security Hardening guide
Set Up CI/CD
Configure GitHub Actions for automated deployments