Overview
Scale your MCP Server horizontally (more replicas) and vertically (more resources) to handle varying loads efficiently. This guide covers Kubernetes HPA, VPA, cluster autoscaling, and performance tuning.
IMPORTANT: Redis Checkpointer Required for HPA

For production deployments with horizontal pod autoscaling (HPA), you MUST enable the Redis checkpointer:

```bash
# .env or Kubernetes ConfigMap
CHECKPOINT_BACKEND=redis
CHECKPOINT_REDIS_URL=redis://redis-service:6379/1
```

URL Encoding: If your Redis password contains special characters (/, +, =, @, etc.), they MUST be percent-encoded per RFC 3986:

```bash
# Example: password "pass/word+123=" must be encoded as:
CHECKPOINT_REDIS_URL=redis://:pass%2Fword%2B123%3D@redis-service:6379/1
```

Kubernetes deployments using External Secrets should use the | urlquery filter to automatically encode passwords.

Without Redis, conversation state is stored in pod memory and will be lost during:
- Pod restarts
- Scale-up events (new pods have no history)
- Scale-down events (terminated pods lose all state)
- Load balancer routing to different pods
See ADR-0022: Distributed Conversation Checkpointing for details.
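The percent-encoding rule above can be reproduced with Python's standard library; a minimal sketch (note `safe=""`, since `quote()` would otherwise leave `/` unencoded):

```python
from urllib.parse import quote

password = "pass/word+123="
# safe="" ensures "/" is percent-encoded too; quote() keeps it by default
encoded = quote(password, safe="")
url = f"redis://:{encoded}@redis-service:6379/1"
print(url)  # redis://:pass%2Fword%2B123%3D@redis-service:6379/1
```

The same encoding can be applied in a Helm template or External Secrets pipeline via the `urlquery` filter mentioned above.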
Horizontal Pod Autoscaling (HPA)
Basic Configuration
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-langgraph
  namespace: mcp-server-langgraph
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
Deploy:

```bash
kubectl apply -f hpa.yaml

# Check HPA status
kubectl get hpa -n mcp-server-langgraph

# Watch scaling
kubectl get hpa -n mcp-server-langgraph --watch
```
Advanced HPA
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-langgraph
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  minReplicas: 3
  maxReplicas: 20
  metrics:
    # CPU-based scaling
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory-based scaling
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metrics (requests per second)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
      policies:
        - type: Percent
          value: 50          # Scale down by max 50% of current replicas
          periodSeconds: 60
        - type: Pods
          value: 2           # Or max 2 pods per minute
          periodSeconds: 60
      selectPolicy: Min      # Use the most conservative policy
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Percent
          value: 100         # Double current replicas
          periodSeconds: 30
        - type: Pods
          value: 4           # Or add 4 pods
          periodSeconds: 30
      selectPolicy: Max      # Use the most aggressive policy
```
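To see how multiple policies interact, here is an illustrative calculation (not the actual HPA controller code) of how many pods `selectPolicy: Min` allows a scale-down to remove per period:

```python
def scale_down_limit(current_replicas: int, percent: int, pods: int) -> int:
    """Illustrative math for HPA scale-down with selectPolicy: Min.

    Each policy computes how many pods may be removed per period;
    Min picks the most conservative (smallest) allowance.
    """
    by_percent = current_replicas * percent // 100  # Percent policy
    by_pods = pods                                  # Pods policy
    return min(by_percent, by_pods)

# With 20 replicas, the 50% policy would allow removing 10 pods, but the
# Pods policy caps removal at 2 per minute, so only 2 are removed.
print(scale_down_limit(20, percent=50, pods=2))  # 2
```

With `selectPolicy: Max` (as in the scale-up block), the same calculation would take `max()` instead, picking the most aggressive allowance.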
Custom Metrics
Install Prometheus Adapter:

```bash
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus:9090
```
Configure custom metrics:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace="mcp-server-langgraph"}'
        resources:
          template: <<.Resource>>
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
        metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
```
Vertical Pod Autoscaling (VPA)
Install VPA
```bash
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
```
VPA Configuration
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: mcp-server-langgraph-vpa
  namespace: mcp-server-langgraph
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  updatePolicy:
    updateMode: "Auto"  # Auto, Recreate, Initial, Off
  resourcePolicy:
    containerPolicies:
      - containerName: mcp-server-langgraph
        minAllowed:
          cpu: 500m
          memory: 512Mi
        maxAllowed:
          cpu: 4000m
          memory: 8Gi
        controlledResources:
          - cpu
          - memory
```
Check recommendations:

```bash
kubectl describe vpa mcp-server-langgraph-vpa -n mcp-server-langgraph
```
Cluster Autoscaling
GKE
```bash
gcloud container clusters update langgraph-cluster \
  --enable-autoscaling \
  --min-nodes=3 \
  --max-nodes=10 \
  --zone=us-central1-a
```
EKS
```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: langgraph-cluster
  region: us-east-1
nodeGroups:
  - name: workers
    instanceType: t3.xlarge
    minSize: 3
    maxSize: 10
    desiredCapacity: 3
    volumeSize: 100
    ssh:
      allow: false
    iam:
      withAddonPolicies:
        autoScaler: true
```
AKS
```bash
az aks update \
  --resource-group langgraph-rg \
  --name langgraph-cluster \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 10
```
Load Testing
Generate Load
```bash
# Install k6
brew install k6  # macOS
# or download from https://k6.io

# Load test script
cat > load-test.js << 'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp up to 100 users
    { duration: '5m', target: 100 }, // Stay at 100 users
    { duration: '2m', target: 200 }, // Ramp up to 200
    { duration: '5m', target: 200 }, // Stay at 200
    { duration: '2m', target: 0 },   // Ramp down
  ],
};

export default function () {
  const url = 'https://api.yourdomain.com/message';
  const payload = JSON.stringify({
    query: 'What is the capital of France?'
  });
  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + __ENV.AUTH_TOKEN,
    },
  };
  const res = http.post(url, payload, params);
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 2s': (r) => r.timings.duration < 2000,
  });
  sleep(1);
}
EOF

# Run test
export AUTH_TOKEN="your-token"
k6 run load-test.js
```
Monitor Scaling
```bash
# Watch pods scaling
watch kubectl get pods -n mcp-server-langgraph

# Watch HPA
watch kubectl get hpa -n mcp-server-langgraph

# View metrics
kubectl top pods -n mcp-server-langgraph
kubectl top nodes
```
Resource Limits
Right-Sizing
```yaml
resources:
  requests:
    cpu: 1000m    # Guaranteed CPU
    memory: 1Gi   # Guaranteed memory
  limits:
    cpu: 4000m    # Max CPU
    memory: 4Gi   # Max memory
```
Guidelines:

- Requests: set to average usage (p50)
- Limits: set to peak usage (p95-p99)
- CPU: start with 1 core, adjust based on load
- Memory: 1-2 GB for typical workloads
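The p50/p95 guidance above can be sketched in a few lines of Python, assuming you have exported per-pod CPU samples (e.g. from `kubectl top` or Prometheus); the sample values here are hypothetical:

```python
import math

def percentile(sorted_samples: list[int], p: float) -> int:
    """Nearest-rank percentile: the value at ceil(p/100 * N), 1-indexed."""
    k = max(1, math.ceil(p / 100 * len(sorted_samples)))
    return sorted_samples[k - 1]

# Hypothetical per-pod CPU samples in millicores, already sorted
samples = [400, 500, 550, 600, 650, 700, 800, 900, 1200, 1500]
request_m = percentile(samples, 50)  # p50 -> resources.requests.cpu
limit_m = percentile(samples, 95)    # p95 -> resources.limits.cpu
print(f"requests: {request_m}m, limits: {limit_m}m")
```

Run the same calculation over a representative window (including peak hours) before committing the numbers to your manifests.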
Quality of Service (QoS)
```yaml
# Guaranteed QoS (best)
resources:
  requests:
    cpu: 2000m
    memory: 2Gi
  limits:
    cpu: 2000m   # Same as request
    memory: 2Gi  # Same as request
```

```yaml
# Burstable QoS (good)
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 4000m   # Higher than request
    memory: 4Gi
```

```yaml
# Best Effort QoS (avoid in production)
resources: {}  # No requests or limits
```
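The classification rules behind these three examples can be sketched as a small Python function (a simplified single-container version of what Kubernetes actually does; the real rules also cover multi-container pods and defaulted requests):

```python
def qos_class(requests: dict[str, str], limits: dict[str, str]) -> str:
    """Simplified Kubernetes QoS classification for a single container:
    Guaranteed  -> cpu and memory limits set and equal to requests
    BestEffort  -> no requests or limits at all
    Burstable   -> everything else
    """
    if not requests and not limits:
        return "BestEffort"
    resources = ("cpu", "memory")
    if all(r in requests and r in limits and requests[r] == limits[r]
           for r in resources):
        return "Guaranteed"
    return "Burstable"

print(qos_class({"cpu": "2000m", "memory": "2Gi"},
                {"cpu": "2000m", "memory": "2Gi"}))  # Guaranteed
print(qos_class({"cpu": "1000m", "memory": "1Gi"},
                {"cpu": "4000m", "memory": "4Gi"}))  # Burstable
print(qos_class({}, {}))                             # BestEffort
```

QoS matters under node pressure: BestEffort pods are evicted first, Guaranteed pods last, which is why Guaranteed is recommended for production.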
Application-Level
```python
# config.py

# LLM settings
MODEL_TIMEOUT = 60  # Seconds
MODEL_MAX_TOKENS = 4096

# Connection pooling
DATABASE_POOL_SIZE = 20
DATABASE_MAX_OVERFLOW = 10

# Caching
ENABLE_CACHE = True
CACHE_TTL = 300  # 5 minutes

# Rate limiting
RATE_LIMIT_PER_MINUTE = 1000
```
Kubernetes-Level
```yaml
# Deployment optimizations
spec:
  replicas: 3
  # Rolling update strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Add 1 pod before removing old
      maxUnavailable: 0  # Keep all pods available
  template:
    spec:
      # Spread pods evenly across zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: mcp-server-langgraph
      # Readiness gates
      readinessGates:
        - conditionType: "example.com/feature-1"
```
Database Tuning
Redis:

```conf
# redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru  # Evict least-recently-used keys at the limit
save 900 1      # Snapshot if >= 1 change in 900s
save 300 10     # Snapshot if >= 10 changes in 300s
save 60 10000   # Snapshot if >= 10000 changes in 60s
```
PostgreSQL:

```conf
# postgresql.conf
max_connections = 200
shared_buffers = 256MB
effective_cache_size = 1GB
maintenance_work_mem = 64MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1  # Lower value suits SSD storage
```
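The memory values above track common PostgreSQL sizing rules of thumb (shared_buffers around 25% of RAM, effective_cache_size around 50-75%); these ratios are community guidance, not official defaults, and a quick sketch makes them easy to recompute for a differently sized node:

```python
def pg_memory_settings(total_ram_mb: int) -> dict:
    """Common PostgreSQL sizing rules of thumb (community guidance,
    not official defaults): shared_buffers ~25% of RAM,
    effective_cache_size ~75% of RAM."""
    return {
        "shared_buffers_mb": total_ram_mb // 4,
        "effective_cache_size_mb": total_ram_mb * 3 // 4,
    }

# For a 1 GiB allocation this reproduces the shared_buffers value above.
print(pg_memory_settings(1024))
```

Always benchmark with your own workload; checkpoint and WAL settings in particular depend heavily on write volume.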
Cost Optimization
Right-Size Instances
```bash
# Use VPA recommendations
kubectl get vpa mcp-server-langgraph-vpa -o yaml
```

Adjust based on actual usage:

```yaml
resources:
  requests:
    cpu: 800m       # Reduced from 1000m
    memory: 1.5Gi   # Reduced from 2Gi
```
Spot/Preemptible Instances
GKE:

```bash
gcloud container node-pools create spot-pool \
  --cluster=langgraph-cluster \
  --spot \
  --num-nodes=3 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=10
```
EKS:

```yaml
nodeGroups:
  - name: spot-workers
    instancesDistribution:
      instanceTypes: ["t3.xlarge", "t3a.xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"
    minSize: 0
    maxSize: 10
```
Scheduled Scaling
```yaml
# Scale down at night
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-night
spec:
  schedule: "0 0 * * *"  # Midnight
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - /bin/sh
                - -c
                - kubectl scale deployment mcp-server-langgraph --replicas=1
          restartPolicy: OnFailure
---
# Scale up in morning
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-morning
spec:
  schedule: "0 8 * * *"  # 8am
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - /bin/sh
                - -c
                - kubectl scale deployment mcp-server-langgraph --replicas=5
          restartPolicy: OnFailure
```
Monitoring Scaling
Key Metrics
```promql
# Replica count
count(kube_pod_info{namespace="mcp-server-langgraph", pod=~"mcp-server-langgraph.*"})

# CPU utilization
rate(container_cpu_usage_seconds_total{namespace="mcp-server-langgraph"}[5m])

# Memory utilization
container_memory_working_set_bytes{namespace="mcp-server-langgraph"}

# Request rate
rate(http_requests_total{namespace="mcp-server-langgraph"}[5m])

# HPA desired replicas
kube_horizontalpodautoscaler_status_desired_replicas{namespace="mcp-server-langgraph"}
```
Alerts
```yaml
groups:
  - name: scaling
    rules:
      - alert: HPAMaxedOut
        expr: |
          kube_horizontalpodautoscaler_status_desired_replicas{namespace="mcp-server-langgraph"}
          >=
          kube_horizontalpodautoscaler_spec_max_replicas{namespace="mcp-server-langgraph"}
        for: 5m
        annotations:
          summary: "HPA at maximum replicas"
      - alert: HighCPU
        expr: |
          avg(rate(container_cpu_usage_seconds_total{namespace="mcp-server-langgraph"}[5m])) > 0.8
        for: 2m
        annotations:
          summary: "High CPU utilization"
      - alert: HighMemory
        expr: |
          avg(container_memory_working_set_bytes{namespace="mcp-server-langgraph"})
          /
          avg(kube_pod_container_resource_limits{resource="memory"}) > 0.85
        for: 2m
        annotations:
          summary: "High memory utilization"
```
Best Practices
Begin with conservative scaling settings:

- Min replicas: 3 (for HA)
- Max replicas: 10 (prevent runaway scaling)
- Target CPU: 70% (leave headroom)
- Scale-down delay: 5 minutes (prevent flapping)

Load test before production:

```bash
# Gradual load increase
k6 run --vus 10 --duration 5m load-test.js
k6 run --vus 50 --duration 10m load-test.js
k6 run --vus 100 --duration 15m load-test.js
```
Track scaling behavior:

- HPA events
- Pod creation/deletion
- Resource utilization
- Request latency
- Error rates
Set Pod Disruption Budget
Prevent too many pods terminating at once:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcp-server-langgraph-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: mcp-server-langgraph
```
Troubleshooting
```bash
# Check HPA status
kubectl describe hpa mcp-server-langgraph -n mcp-server-langgraph

# Check metrics server
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

# View current metrics
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/mcp-server-langgraph/pods
```

Reason: Out of resources

Fix:

- Increase node resources
- Enable cluster autoscaler
- Reduce resource requests
- Add more nodes

Symptom: Constant scale up/down

Fix:

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # Increase delay
```
Next Steps
Auto-Scaling Ready: Handle any load with automatic horizontal and vertical scaling!