
EKS Production Deployment

Complete guide to deploying mcp-server-langgraph on AWS EKS with production-grade infrastructure, security, and observability.

Overview

This deployment achieves 96/100 infrastructure maturity with:

  • Infrastructure as Code: Terraform modules for VPC, EKS, RDS, ElastiCache
  • High Availability: Multi-AZ across all services with automatic failover
  • Security First: IRSA, encryption everywhere, network isolation
  • Cost Optimized: ~$824/month (roughly 60% savings vs. unoptimized defaults)

What You’ll Deploy


Prerequisites

1. AWS Account Setup

AWS account with admin access
AWS CLI installed and configured (aws configure)
Service quotas allowing at least 5 VPCs, 20 EIPs, and 100 security groups in the target region
2. Local Tools

# Install required tools
brew install terraform kubectl awscli

# Verify versions
terraform version  # >= 1.5.0
kubectl version --client  # >= 1.27.0
aws --version  # >= 2.13.0
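If you want to fail fast on version mismatches, a small helper can compare the installed versions against the minimums above. This is a sketch: `version_ge` is a hypothetical helper name, and the Terraform check assumes `terraform` is on your PATH.

```shell
# Helper: succeeds when version $1 >= version $2 (compared with sort -V)
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example: check the installed Terraform against the documented minimum
tf_ver=$(terraform version 2>/dev/null | head -n1 | grep -o '[0-9][0-9.]*' | head -n1 || true)
version_ge "${tf_ver:-0}" 1.5.0 || echo "Terraform >= 1.5.0 required (found: ${tf_ver:-none})"
```

The same helper works for the kubectl and AWS CLI minimums.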
3. Repository

git clone https://github.com/vishnu2kmohan/mcp-server-langgraph
cd mcp-server-langgraph

Deployment Architecture

Infrastructure Layers

  • Layer 1: Networking
  • Layer 2: Compute
  • Layer 3: Data
  • Layer 4: Security
VPC Module (terraform/modules/vpc)
  • 3 Availability Zones (us-east-1a/b/c)
  • Public subnets (/20) for load balancers
  • Private subnets (/18) for workloads
  • NAT Gateways (multi-AZ)
  • VPC Endpoints (S3, ECR, CloudWatch)
  • VPC Flow Logs
Capacity: 16,384 IPs per private subnet (~300 EKS nodes per AZ)
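The subnet-capacity figures above can be sanity-checked with shell arithmetic (AWS reserves 5 addresses in every subnet: network, broadcast, and three for internal use):

```shell
# Address counts for the prefix lengths used by this VPC module
private_total=$(( 2 ** (32 - 18) ))   # /18 private subnets
public_total=$(( 2 ** (32 - 20) ))    # /20 public subnets
echo "private /18: $private_total addresses ($((private_total - 5)) usable)"
echo "public  /20: $public_total addresses ($((public_total - 5)) usable)"
```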

Step-by-Step Deployment

Phase 1: Terraform Backend (5 minutes)

1. Initialize backend

cd terraform/backend-setup
Edit variables.tf or create terraform.tfvars:
project_name = "mcp-langgraph"
environment  = "prod"
region       = "us-east-1"
2. Deploy backend

terraform init
terraform plan
terraform apply
Creates:
  • S3 bucket: mcp-langgraph-terraform-state-prod
  • DynamoDB table: mcp-langgraph-terraform-lock-prod
  • Access logging bucket
3. Note outputs

terraform output
Save the S3 bucket name and DynamoDB table name for next phase.

Phase 2: Infrastructure Deployment (20-25 minutes)

1. Configure environment

cd ../../environments/prod
Create terraform.tfvars:
# Project configuration
project_name = "mcp-langgraph"
environment  = "prod"
region       = "us-east-1"

# VPC configuration
vpc_cidr           = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
single_nat_gateway = false  # Multi-AZ NAT for HA

# EKS configuration
cluster_name       = "mcp-langgraph-prod"
kubernetes_version = "1.28"

# Node groups
enable_general_node_group       = true
general_node_group_desired_size = 3
general_node_group_min_size     = 2
general_node_group_max_size     = 10
general_node_group_instance_types = ["t3.xlarge"]

enable_spot_node_group       = true
spot_node_group_desired_size = 2

# RDS configuration
rds_instance_class         = "db.t3.medium"
rds_allocated_storage      = 100
rds_multi_az               = true
rds_backup_retention_days  = 30

# ElastiCache configuration
elasticache_node_type            = "cache.r6g.large"
elasticache_cluster_mode_enabled = false  # Standard mode
elasticache_num_cache_nodes      = 2

# IRSA
create_application_irsa_role = true
application_service_account_name = "mcp-server-langgraph"
application_service_account_namespace = "mcp-server-langgraph"

tags = {
  Environment = "production"
  ManagedBy   = "terraform"
  Project     = "mcp-langgraph"
}
2. Initialize Terraform

terraform init \
  -backend-config="bucket=mcp-langgraph-terraform-state-prod" \
  -backend-config="key=prod/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=mcp-langgraph-terraform-lock-prod"
3. Plan deployment

terraform plan -out=tfplan
Review:
  • ~50 resources will be created
  • VPC, subnets, NAT gateways
  • EKS cluster and node groups
  • RDS instance
  • ElastiCache cluster
  • IAM roles and policies
4. Deploy infrastructure

terraform apply tfplan
Duration: 20-25 minutes
  • VPC: ~2 minutes
  • EKS control plane: ~10 minutes
  • Node groups: ~8 minutes
  • RDS Multi-AZ: ~12 minutes (parallel with EKS)
  • ElastiCache: ~5 minutes (parallel with EKS)
5. Save outputs

terraform output > outputs.txt

# Important outputs:
terraform output -raw cluster_endpoint
terraform output -raw cluster_certificate_authority_data
terraform output -raw application_irsa_role_arn
terraform output -raw db_instance_endpoint
terraform output -raw elasticache_configuration_endpoint

Phase 3: Kubernetes Configuration (10 minutes)

1. Configure kubectl

aws eks update-kubeconfig \
  --region us-east-1 \
  --name mcp-langgraph-prod \
  --alias mcp-prod

# Verify connection
kubectl get nodes
kubectl get pods -A
2. Create namespace

kubectl create namespace mcp-server-langgraph

# Apply Pod Security Standards
kubectl label namespace mcp-server-langgraph \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted
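With the restricted profile enforced, the namespace rejects any pod that does not drop all capabilities, disallow privilege escalation, and run as non-root with a RuntimeDefault seccomp profile. A minimal compliant pod sketch (the pod name is illustrative; the image reuses the ECR path from this guide):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pss-restricted-example   # hypothetical name, for illustration only
  namespace: mcp-server-langgraph
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph:v1.0.0
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```

The same fields belong in the application Deployment's pod template, or its pods will be rejected at admission.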
3. Create service account with IRSA

# k8s/service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mcp-server-langgraph
  namespace: mcp-server-langgraph
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/mcp-langgraph-prod-application
kubectl apply -f k8s/service-account.yaml
4. Create database secrets

# Get RDS password from Terraform output
DB_PASSWORD=$(terraform output -raw db_instance_password)
DB_ENDPOINT=$(terraform output -raw db_instance_endpoint)

kubectl create secret generic database-credentials \
  --namespace=mcp-server-langgraph \
  --from-literal=host=$DB_ENDPOINT \
  --from-literal=port=5432 \
  --from-literal=database=mcp_langgraph \
  --from-literal=username=mcp_langgraph \
  --from-literal=password=$DB_PASSWORD
5. Create Redis secrets

REDIS_ENDPOINT=$(terraform output -raw elasticache_configuration_endpoint)
REDIS_AUTH_TOKEN=$(terraform output -raw elasticache_auth_token)

kubectl create secret generic redis-credentials \
  --namespace=mcp-server-langgraph \
  --from-literal=host=$REDIS_ENDPOINT \
  --from-literal=port=6379 \
  --from-literal=auth-token=$REDIS_AUTH_TOKEN

Phase 4: Deploy Application (5 minutes)

1. Build and push container image

# Create the ECR repository if it doesn't already exist
aws ecr describe-repositories --repository-names mcp-server-langgraph --region us-east-1 \
  || aws ecr create-repository --repository-name mcp-server-langgraph --region us-east-1

# Authenticate to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

# Build image
docker build -t mcp-server-langgraph:v1.0.0 .

# Tag for ECR
docker tag mcp-server-langgraph:v1.0.0 \
  ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph:v1.0.0

# Push to ECR
docker push ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph:v1.0.0
2. Deploy using Kustomize

kubectl apply -k deployments/kubernetes/overlays/production-eks
Or manually:
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/ingress.yaml
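The overlay path comes from the repository; its exact contents may differ, but a production Kustomize overlay of this shape typically pins the image reference and replica count. A sketch in which every field value is an assumption, not the repository's actual file:

```yaml
# deployments/kubernetes/overlays/production-eks/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: mcp-server-langgraph
resources:
  - ../../base
images:
  - name: mcp-server-langgraph
    newName: ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph
    newTag: v1.0.0
replicas:
  - name: mcp-server-langgraph
    count: 3
```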
3. Verify deployment

# Check pod status
kubectl get pods -n mcp-server-langgraph

# Check logs
kubectl logs -f deployment/mcp-server-langgraph -n mcp-server-langgraph

# Check IRSA
kubectl describe pod POD_NAME -n mcp-server-langgraph | grep "AWS_ROLE_ARN"
4. Verify database connection

kubectl exec -it deployment/mcp-server-langgraph -n mcp-server-langgraph -- \
  psql -h $DB_ENDPOINT -U mcp_langgraph -d mcp_langgraph -c "SELECT version();"
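Note that Terraform's RDS `endpoint` attribute is normally `host:port`, while `psql -h` expects a bare hostname. Shell parameter expansion splits the two (the endpoint value below is a made-up example, not a real instance):

```shell
# db_instance_endpoint is typically "host:port"; psql -h wants just the host.
DB_ENDPOINT="mcp-langgraph-prod.abc123.us-east-1.rds.amazonaws.com:5432"  # example value
DB_HOST=${DB_ENDPOINT%%:*}   # drop everything from the first ':' onward
DB_PORT=${DB_ENDPOINT##*:}   # keep only the part after the last ':'
echo "$DB_HOST $DB_PORT"
```

Then connect with `psql -h $DB_HOST -p $DB_PORT ...`.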

Phase 5: Monitoring & Auto-scaling (10 minutes)

1. Deploy Cluster Autoscaler

kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

# Annotate service account with IRSA role
kubectl annotate serviceaccount cluster-autoscaler \
  -n kube-system \
  eks.amazonaws.com/role-arn=arn:aws:iam::ACCOUNT_ID:role/mcp-langgraph-prod-cluster-autoscaler

# Restart so the pods pick up the IRSA credentials
kubectl rollout restart deployment/cluster-autoscaler -n kube-system

# Pin the image to match the cluster's Kubernetes minor version
# (registry.k8s.io replaced the deprecated k8s.gcr.io registry)
kubectl set image deployment cluster-autoscaler \
  -n kube-system \
  cluster-autoscaler=registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0

kubectl set env deployment/cluster-autoscaler \
  -n kube-system \
  AWS_REGION=us-east-1

# Set the cluster name: edit the deployment so --node-group-auto-discovery reads
#   asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/mcp-langgraph-prod
kubectl edit deployment cluster-autoscaler -n kube-system
2. Deploy Metrics Server

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl get deployment metrics-server -n kube-system
3. Configure HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-langgraph
  namespace: mcp-server-langgraph
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
kubectl apply -f k8s/hpa.yaml
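To keep the HPA's minimum of 3 replicas from being drained away during node upgrades or autoscaler scale-downs, pair it with a PodDisruptionBudget. A sketch (the `app` label is an assumption; match your Deployment's actual pod labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcp-server-langgraph
  namespace: mcp-server-langgraph
spec:
  minAvailable: 2   # always keep 2 pods serving during voluntary disruptions
  selector:
    matchLabels:
      app: mcp-server-langgraph
```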
4. Configure CloudWatch Container Insights

# Deploy the CloudWatch agent and Fluent Bit (the quickstart manifest
# templates the cluster name and region, so substitute them before applying)
ClusterName=mcp-langgraph-prod
RegionName=us-east-1
curl -s https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml \
  | sed "s/{{cluster_name}}/${ClusterName}/;s/{{region_name}}/${RegionName}/" \
  | kubectl apply -f -

# Verify
kubectl get pods -n amazon-cloudwatch

Production Checklist

- [ ] IRSA configured for all service accounts (no IAM keys)
- [ ] Secrets stored in AWS Secrets Manager (not in code)
- [ ] Network policies applied for pod-to-pod traffic
- [ ] Pod Security Standards enforced (restricted)
- [ ] RDS and ElastiCache in private subnets only
- [ ] Encryption enabled for all data at rest (KMS)
- [ ] TLS enforced for all in-transit data
- [ ] Security groups follow least-privilege principle
- [ ] CloudTrail enabled for audit logging
- [ ] MFA required for AWS console access
- [ ] Multi-AZ deployment for all services
- [ ] RDS Multi-AZ with automatic failover
- [ ] ElastiCache Multi-AZ with automatic failover
- [ ] At least 2 replicas for application pods
- [ ] Pod Disruption Budgets configured
- [ ] Topology spread constraints configured
- [ ] Health checks configured (liveness/readiness probes)
- [ ] Load balancer health checks configured
- [ ] CloudWatch Container Insights enabled
- [ ] CloudWatch alarms for RDS (CPU, memory, storage, connections)
- [ ] CloudWatch alarms for ElastiCache (CPU, memory, evictions)
- [ ] CloudWatch alarms for EKS (node health, pod restarts)
- [ ] X-Ray tracing configured for distributed tracing
- [ ] Application logs shipped to CloudWatch Logs
- [ ] Metrics Server deployed for HPA
- [ ] Cluster Autoscaler deployed and configured
- [ ] RDS automated backups enabled (30-day retention)
- [ ] RDS final snapshot on deletion enabled
- [ ] ElastiCache automated snapshots enabled (7-day retention)
- [ ] Terraform state versioning enabled in S3
- [ ] Terraform state encrypted with KMS
- [ ] Disaster recovery runbook documented
- [ ] RTO/RPO targets defined
- [ ] Backup restore procedures tested
- [ ] Spot instances configured for fault-tolerant workloads
- [ ] Cluster Autoscaler removing idle nodes
- [ ] HPA scaling pods based on utilization
- [ ] VPC endpoints configured (save 70% on data transfer)
- [ ] RDS storage auto-scaling enabled
- [ ] CloudWatch Logs retention configured (not infinite)
- [ ] Cost allocation tags applied to all resources
- [ ] AWS Cost Explorer monitoring enabled
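For the network-policies item above, a common starting point is a default-deny ingress policy in the application namespace, with explicit allow rules layered on per client (e.g. the ingress controller). A sketch:

```yaml
# Deny all ingress to every pod in the namespace; add allow rules per client.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: mcp-server-langgraph
spec:
  podSelector: {}   # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
```

Note that NetworkPolicy enforcement on EKS requires a policy engine (the VPC CNI's network policy support, or Calico/Cilium); without one, policies are accepted but not enforced.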

Post-Deployment Operations

Accessing the Cluster

# Configure kubectl
aws eks update-kubeconfig --region us-east-1 --name mcp-langgraph-prod

# Get cluster info
kubectl cluster-info

# Get nodes
kubectl get nodes -o wide

# Get all resources
kubectl get all -A

Viewing Logs

# Application logs
kubectl logs -f deployment/mcp-server-langgraph -n mcp-server-langgraph

# System logs (control plane)
# Available in CloudWatch Logs: /aws/eks/mcp-langgraph-prod/cluster

# Node logs
kubectl logs -n kube-system -l k8s-app=aws-node

Scaling

# Manual pod scaling
kubectl scale deployment mcp-server-langgraph -n mcp-server-langgraph --replicas=5

# Manual node scaling (via Terraform)
# Edit terraform.tfvars:
general_node_group_desired_size = 5
terraform apply

# View HPA status
kubectl get hpa -n mcp-server-langgraph

# View Cluster Autoscaler logs
kubectl logs -f deployment/cluster-autoscaler -n kube-system

Updating

# Update application
kubectl set image deployment/mcp-server-langgraph \
  -n mcp-server-langgraph \
  mcp-server-langgraph=ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/mcp-server-langgraph:v1.1.0

# Rollback
kubectl rollout undo deployment/mcp-server-langgraph -n mcp-server-langgraph

# Check rollout status
kubectl rollout status deployment/mcp-server-langgraph -n mcp-server-langgraph

Troubleshooting

See EKS Runbooks for detailed troubleshooting procedures. Common issues:
Pods stuck in Pending
Cause: Not enough node capacity.
Solution: The Cluster Autoscaler will add nodes automatically. Check:
kubectl describe pods -n mcp-server-langgraph
kubectl logs -f deployment/cluster-autoscaler -n kube-system

Pods stuck creating or nodes without networking
Cause: Missing VPC CNI IRSA permissions.
Solution:
kubectl describe daemonset aws-node -n kube-system | grep role-arn
# Should show IRSA annotation

Application cannot assume its AWS role
Cause: Missing IRSA role or incorrect ARN.
Solution:
kubectl describe pod POD_NAME -n mcp-server-langgraph | grep AWS_ROLE_ARN
# Should show: arn:aws:iam::ACCOUNT:role/mcp-langgraph-prod-application

Cost Estimate

Production deployment (~$824/month):

| Service | Configuration | Monthly Cost |
| --- | --- | --- |
| EKS Control Plane | 1 cluster | $73.00 |
| EC2 Instances | 3 × t3.xlarge general nodes | $295.20 |
| EC2 Instances | 2 × t3.large spot nodes | $14.60 |
| RDS PostgreSQL | db.t3.medium Multi-AZ | $157.56 |
| ElastiCache Redis | 2 × cache.r6g.large | $109.50 |
| NAT Gateway | 3 × Multi-AZ | $97.20 |
| VPC Endpoints | 6 endpoints | $21.60 |
| EBS Volumes | 5 × 100 GB gp3 | $40.00 |
| CloudWatch | Logs + metrics | ~$15.00 |
| Total | | ~$823.66 |

Cost savings: enable spot instances, use a single NAT gateway in dev/staging, right-size node types, and enable the Cluster Autoscaler.
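As a quick illustration of the single-NAT saving mentioned above, using the NAT Gateway line item from the table:

```shell
# 3 NAT gateways cost ~$97.20/month; dev/staging can often run on 1.
multi_az=97.20
single=$(awk -v t="$multi_az" 'BEGIN { printf "%.2f", t / 3 }')
saving=$(awk -v t="$multi_az" -v s="$single" 'BEGIN { printf "%.2f", t - s }')
echo "single NAT: \$$single/month, saving: \$$saving/month"
```

The trade-off is that a single NAT gateway is a single point of failure for outbound traffic, which is why this guide keeps `single_nat_gateway = false` in production.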


Next Steps

1. Deploy Production: follow this guide to deploy your production environment.
2. Configure Monitoring: set up CloudWatch dashboards and alarms.
3. Enable Auto-scaling: configure the Cluster Autoscaler and HPA.
4. Harden Security: follow the AWS Security Hardening guide.
5. Set Up CI/CD: configure GitHub Actions for automated deployments.