Overview
Scale your MCP Server horizontally (more replicas) and vertically (more resources) to handle varying loads efficiently. This guide covers Kubernetes HPA, VPA, cluster autoscaling, and performance tuning.
IMPORTANT: Redis Checkpointer Required for HPA
For production deployments with horizontal pod autoscaling (HPA), you MUST enable the Redis checkpointer:
# .env or Kubernetes ConfigMap
CHECKPOINT_BACKEND=redis
CHECKPOINT_REDIS_URL=redis://redis-service:6379/1
URL Encoding: If your Redis password contains special characters (/, +, =, @, etc.), they MUST be percent-encoded per RFC 3986:
# Example: password "pass/word+123=" must be encoded as:
CHECKPOINT_REDIS_URL=redis://:pass%2Fword%2B123%3D@redis-service:6379/1
Kubernetes deployments using External Secrets should use the | urlquery filter to automatically encode passwords.
Without Redis, conversation state is stored in pod memory and will be lost during:
Pod restarts
Scale-up events (new pods have no history)
Scale-down events (terminated pods lose all state)
Load balancer routing to different pods
See ADR-0022: Distributed Conversation Checkpointing for details.
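If you need to produce the encoded value yourself, Python's standard library handles RFC 3986 percent-encoding; a minimal sketch using the example password above:

```python
from urllib.parse import quote

# safe="" forces "/" (and every other reserved character) to be encoded
password = "pass/word+123="
encoded = quote(password, safe="")
assert encoded == "pass%2Fword%2B123%3D"
print(f"redis://:{encoded}@redis-service:6379/1")
```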
Horizontal Pod Autoscaling (HPA)
Basic Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-langgraph
  namespace: mcp-server-langgraph
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Deploy:
kubectl apply -f hpa.yaml
## Check HPA status
kubectl get hpa -n mcp-server-langgraph
## Watch scaling
kubectl get hpa -n mcp-server-langgraph --watch
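Resource-based HPA metrics come from the Kubernetes metrics-server, and the target pods must declare CPU/memory requests or utilization cannot be computed. A quick sanity check, assuming metrics-server lives in kube-system (the default in most distributions):

```bash
## Confirm metrics-server is installed and serving pod metrics
kubectl get deployment metrics-server -n kube-system
kubectl top pods -n mcp-server-langgraph
```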
Advanced HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-langgraph
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  minReplicas: 3
  maxReplicas: 20
  metrics:
    # CPU-based scaling
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory-based scaling
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metrics (requests per second)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
      policies:
        - type: Percent
          value: 50           # Scale down by max 50% of current replicas
          periodSeconds: 60
        - type: Pods
          value: 2            # Or max 2 pods per minute
          periodSeconds: 60
      selectPolicy: Min       # Use the most conservative policy
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Percent
          value: 100          # Double current replicas
          periodSeconds: 30
        - type: Pods
          value: 4            # Or add 4 pods
          periodSeconds: 30
      selectPolicy: Max       # Use the most aggressive policy
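To see how these policies combine: at 10 replicas, the scale-down Percent policy allows removing 5 pods per minute while the Pods policy allows 2, and selectPolicy: Min applies the smaller change of 2. On scale-up from 10 replicas, Percent allows adding 10 and Pods allows 4, so selectPolicy: Max adds up to 10 per 30-second period.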
Custom Metrics
Install Prometheus Adapter:
helm install prometheus-adapter prometheus-community/prometheus-adapter \
--namespace monitoring \
--set prometheus.url=http://prometheus:9090
Configure custom metrics:
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace="mcp-server-langgraph"}'
        resources:
          template: <<.Resource>>
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
        metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
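Once the adapter is running, confirm the derived metric is actually served through the custom metrics API before the HPA depends on it:

```bash
## List all metrics the adapter exposes
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
## Query the metric for the server's pods
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/mcp-server-langgraph/pods/*/http_requests_per_second" | jq .
```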
Vertical Pod Autoscaling (VPA)
Install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
VPA Configuration
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: mcp-server-langgraph-vpa
  namespace: mcp-server-langgraph
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  updatePolicy:
    updateMode: "Auto"  # Auto, Recreate, Initial, Off
  resourcePolicy:
    containerPolicies:
      - containerName: mcp-server-langgraph
        minAllowed:
          cpu: 500m
          memory: 512Mi
        maxAllowed:
          cpu: 4000m
          memory: 8Gi
        controlledResources:
          - cpu
          - memory
Check recommendations:
kubectl describe vpa mcp-server-langgraph-vpa -n mcp-server-langgraph
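Caution: do not run VPA in Auto or Recreate mode against the same CPU and memory metrics that an HPA scales on, since the two controllers will fight over the same signal. A common pattern is to set updateMode: "Off" so VPA only publishes recommendations while the HPA owns the replica count.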
Cluster Autoscaling
GKE
gcloud container clusters update langgraph-cluster \
--enable-autoscaling \
--min-nodes=3 \
--max-nodes=10 \
--zone=us-central1-a
EKS
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: langgraph-cluster
  region: us-east-1
nodeGroups:
  - name: workers
    instanceType: t3.xlarge
    minSize: 3
    maxSize: 10
    desiredCapacity: 3
    volumeSize: 100
    ssh:
      allow: false
    iam:
      withAddonPolicies:
        autoScaler: true
AKS
az aks update \
--resource-group langgraph-rg \
--name langgraph-cluster \
--enable-cluster-autoscaler \
--min-count 3 \
--max-count 10
Load Testing
Generate Load
## Install k6
brew install k6 # macOS
## or download from https://k6.io
## Load test script
cat > load-test.js << 'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
  stages: [
    { duration: '2m', target: 100 },  // Ramp up to 100 users
    { duration: '5m', target: 100 },  // Stay at 100 users
    { duration: '2m', target: 200 },  // Ramp up to 200
    { duration: '5m', target: 200 },  // Stay at 200
    { duration: '2m', target: 0 },    // Ramp down
  ],
};

export default function () {
  const url = 'https://api.yourdomain.com/message';
  const payload = JSON.stringify({
    query: 'What is the capital of France?'
  });
  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + __ENV.AUTH_TOKEN,
    },
  };
  const res = http.post(url, payload, params);
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 2s': (r) => r.timings.duration < 2000,
  });
  sleep(1);
}
EOF
## Run test
export AUTH_TOKEN="your-token"
k6 run load-test.js
Monitor Scaling
## Watch pods scaling
watch kubectl get pods -n mcp-server-langgraph
## Watch HPA
watch kubectl get hpa -n mcp-server-langgraph
## View metrics
kubectl top pods -n mcp-server-langgraph
kubectl top nodes
Resource Limits
Right-Sizing
resources:
  requests:
    cpu: 1000m    # Guaranteed CPU
    memory: 1Gi   # Guaranteed memory
  limits:
    cpu: 4000m    # Max CPU
    memory: 4Gi   # Max memory
Guidelines:
Requests: Set to average usage (p50)
Limits: Set to peak usage (p95-p99)
CPU: Start with 1 core, adjust based on load
Memory: 1-2GB for typical workloads
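To ground those percentiles in measured data, assuming Prometheus scrapes the cluster's cAdvisor metrics (the same source queried under Monitoring Scaling below), subqueries like these yield seven-day p95 figures:

```promql
# p95 CPU usage (cores) over 7 days at 5m resolution
quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{namespace="mcp-server-langgraph"}[5m])[7d:5m])

# p95 working-set memory (bytes) over 7 days
quantile_over_time(0.95, container_memory_working_set_bytes{namespace="mcp-server-langgraph"}[7d])
```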
Quality of Service (QoS)
# Guaranteed QoS (best)
resources:
  requests:
    cpu: 2000m
    memory: 2Gi
  limits:
    cpu: 2000m   # Same as request
    memory: 2Gi  # Same as request

# Burstable QoS (good)
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 4000m   # Higher than request
    memory: 4Gi

# Best Effort QoS (avoid in production)
resources: {}  # No requests or limits
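QoS class determines eviction order under node pressure: BestEffort pods are evicted first, then Burstable pods exceeding their requests, and Guaranteed pods last, so Guaranteed is the safest class for the server itself.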
Application-Level
## config.py
## LLM settings
MODEL_TIMEOUT = 60 # Seconds
MODEL_MAX_TOKENS = 4096
## Connection pooling
DATABASE_POOL_SIZE = 20
DATABASE_MAX_OVERFLOW = 10
## Caching
ENABLE_CACHE = True
CACHE_TTL = 300 # 5 minutes
## Rate limiting
RATE_LIMIT_PER_MINUTE = 1000
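To show how those settings map onto application code, here is a hedged sketch assuming SQLAlchemy for the database pool and redis-py for the cache; the DSNs and the get_answer helper are placeholders, not part of the project:

```python
import redis
from sqlalchemy import create_engine

# Pool sized to match DATABASE_POOL_SIZE / DATABASE_MAX_OVERFLOW above;
# the DSN is a placeholder, not the project's actual service name.
engine = create_engine(
    "postgresql://app:secret@postgres:5432/app",
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True,  # drop dead connections before reuse
)

cache = redis.Redis.from_url("redis://redis-service:6379/0")

def get_answer(query: str) -> bytes:
    # Placeholder for the real LLM call
    return b"Paris"

def cached_answer(query: str) -> bytes:
    # Serve from Redis when possible; otherwise compute and cache for CACHE_TTL
    key = f"answer:{query}"
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = get_answer(query)
    cache.set(key, answer, ex=300)  # CACHE_TTL = 300 seconds
    return answer
```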
Kubernetes-Level
# Deployment optimizations
spec:
  replicas: 3
  # Rolling update strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Add 1 pod before removing old
      maxUnavailable: 0  # Keep all pods available
  template:
    spec:
      # Spread pods evenly across zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: mcp-server-langgraph
      # Readiness gates
      readinessGates:
        - conditionType: "example.com/feature-1"
Database Tuning
Redis:
# redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru
save 900 1
save 300 10
save 60 10000
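Each save line means "snapshot after N seconds if at least M keys changed" (900 1, 300 10, 60 10000). Note that allkeys-lru evicts the least recently used keys once maxmemory is reached, which suits a cache; if this instance also holds conversation checkpoints, consider noeviction so state is never silently dropped.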
PostgreSQL:
# postgresql.conf
max_connections = 200
shared_buffers = 256MB
effective_cache_size = 1GB
maintenance_work_mem = 64MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
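Rules of thumb: set shared_buffers to roughly 25% of available RAM, effective_cache_size to 50-75% (it is a planner hint, not an allocation), and keep random_page_cost near 1.1 for SSD-backed storage. The values above assume roughly 1GB of RAM dedicated to PostgreSQL; scale them with the instance.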
Cost Optimization
Right-Size Instances
## Use VPA recommendations
kubectl get vpa mcp-server-langgraph-vpa -o yaml
## Adjust based on actual usage
resources:
  requests:
    cpu: 800m      # Reduced from 1000m
    memory: 1.5Gi  # Reduced from 2Gi
Spot/Preemptible Instances
GKE:
gcloud container node-pools create spot-pool \
--cluster=langgraph-cluster \
--spot \
--num-nodes=3 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=10
EKS:
nodeGroups:
  - name: spot-workers
    instancesDistribution:
      instanceTypes: ["t3.xlarge", "t3a.xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"
    minSize: 0
    maxSize: 10
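Spot capacity can be reclaimed with as little as two minutes' notice, so keep spot node groups stateless, run a termination handler (on EKS, the aws-node-termination-handler DaemonSet) to cordon and drain reclaimed nodes, and keep at least the HPA's minReplicas worth of pods schedulable on on-demand capacity.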
Scheduled Scaling
# Scale down at night
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-night
spec:
  schedule: "0 0 * * *"  # Midnight
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - /bin/sh
                - -c
                - kubectl scale deployment mcp-server-langgraph -n mcp-server-langgraph --replicas=1
          restartPolicy: OnFailure
---
# Scale up in morning
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-morning
spec:
  schedule: "0 8 * * *"  # 8am
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - /bin/sh
                - -c
                - kubectl scale deployment mcp-server-langgraph -n mcp-server-langgraph --replicas=5
          restartPolicy: OnFailure
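On an RBAC-enabled cluster these jobs will be rejected unless their pods run under a service account allowed to scale the Deployment. A minimal sketch (the scaler name is illustrative; reference it with serviceAccountName: scaler in each CronJob pod spec):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: scaler
  namespace: mcp-server-langgraph
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
  namespace: mcp-server-langgraph
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-scaler
  namespace: mcp-server-langgraph
subjects:
  - kind: ServiceAccount
    name: scaler
    namespace: mcp-server-langgraph
roleRef:
  kind: Role
  name: deployment-scaler
  apiGroup: rbac.authorization.k8s.io
```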
Monitoring Scaling
Key Metrics
## Replica count
count(kube_pod_info{namespace="mcp-server-langgraph", pod=~"mcp-server-langgraph.*"})
## CPU utilization
rate(container_cpu_usage_seconds_total{namespace="mcp-server-langgraph"}[5m])
## Memory utilization
container_memory_working_set_bytes{namespace="mcp-server-langgraph"}
## Request rate
rate(http_requests_total{namespace="mcp-server-langgraph"}[5m])
## HPA desired replicas
kube_horizontalpodautoscaler_status_desired_replicas{namespace="mcp-server-langgraph"}
Alerts
groups:
  - name: scaling
    rules:
      - alert: HPAMaxedOut
        expr: |
          kube_horizontalpodautoscaler_status_desired_replicas{namespace="mcp-server-langgraph"}
          >=
          kube_horizontalpodautoscaler_spec_max_replicas{namespace="mcp-server-langgraph"}
        for: 5m
        annotations:
          summary: "HPA at maximum replicas"
      - alert: HighCPU
        expr: |
          avg(rate(container_cpu_usage_seconds_total{namespace="mcp-server-langgraph"}[5m])) > 0.8
        for: 2m
        annotations:
          summary: "High CPU utilization"
      - alert: HighMemory
        expr: |
          avg(container_memory_working_set_bytes{namespace="mcp-server-langgraph"})
          /
          avg(kube_pod_container_resource_limits{namespace="mcp-server-langgraph", resource="memory"}) > 0.85
        for: 2m
        annotations:
          summary: "High memory utilization"
Best Practices
Begin with conservative scaling settings:
Min replicas: 3 (for HA)
Max replicas: 10 (prevent runaway scaling)
Target CPU: 70% (leave headroom)
Scale-down delay: 5 minutes (prevent flapping)
Load test before production:
## Gradual load increase
k6 run --vus 10 --duration 5m load-test.js
k6 run --vus 50 --duration 10m load-test.js
k6 run --vus 100 --duration 15m load-test.js
Track scaling behavior:
HPA events
Pod creation/deletion
Resource utilization
Request latency
Error rates
Set Pod Disruption Budget
Prevent too many pods from being terminated at once during voluntary disruptions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcp-server-langgraph-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: mcp-server-langgraph
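Note how the PDB interacts with the HPA floor: with minAvailable: 2 and minReplicas: 3, voluntary disruptions such as node drains can evict at most one pod at a time, so rolling node maintenance never drops the service below two replicas.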
Troubleshooting
# Check HPA status
kubectl describe hpa mcp-server-langgraph -n mcp-server-langgraph
# Check metrics server
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
# View current metrics
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/mcp-server-langgraph/pods
Pods stuck Pending? Reason: Out of resources. Fix:
Increase node resources
Enable cluster autoscaler
Reduce resource requests
Add more nodes
Symptom: Constant scale up/down (flapping). Fix: lengthen the scale-down stabilization window:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # Increase delay
Next Steps
Kubernetes Deployment: Deploy to Kubernetes
Monitoring: Set up monitoring
Production Checklist: Review scaling requirements
Disaster Recovery: Backup and restore
Auto-Scaling Ready: Handle any load with automatic horizontal and vertical scaling!