
Overview

Scale your MCP Server horizontally (more replicas) and vertically (more resources) to handle varying loads efficiently. This guide covers Kubernetes HPA, VPA, cluster autoscaling, and performance tuning.
IMPORTANT: Redis Checkpointer Required for HPA
For production deployments with horizontal pod autoscaling (HPA), you MUST enable the Redis checkpointer:
# .env or Kubernetes ConfigMap
CHECKPOINT_BACKEND=redis
CHECKPOINT_REDIS_URL=redis://redis-service:6379/1
URL Encoding: If your Redis password contains special characters (/, +, =, @, etc.), they MUST be percent-encoded per RFC 3986:
# Example: password "pass/word+123=" must be encoded as:
CHECKPOINT_REDIS_URL=redis://:pass%2Fword%2B123%3D@redis-service:6379/1
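A quick way to produce the encoded form is Python's standard library from the shell (a sketch; substitute your real password):
# Percent-encode a Redis password; prints pass%2Fword%2B123%3D for the example above
python3 -c "import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1], safe=''))" 'pass/word+123='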
Kubernetes deployments using External Secrets should use the | urlquery filter to automatically encode passwords.
Without Redis, conversation state is stored in pod memory and will be lost during:
  • Pod restarts
  • Scale-up events (new pods have no history)
  • Scale-down events (terminated pods lose all state)
  • Load balancer routing a request to a pod that has no record of the conversation
See ADR-0022: Distributed Conversation Checkpointing for details.

Horizontal Pod Autoscaling (HPA)

Basic Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-langgraph
  namespace: mcp-server-langgraph
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
Deploy:
kubectl apply -f hpa.yaml

# Check HPA status
kubectl get hpa -n mcp-server-langgraph

# Watch scaling
kubectl get hpa -n mcp-server-langgraph --watch

Advanced HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-langgraph
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  minReplicas: 3
  maxReplicas: 20
  metrics:
  # CPU-based scaling
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

  # Memory-based scaling
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

  # Custom metrics (requests per second)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
      policies:
      - type: Percent
        value: 50  # Scale down by max 50% of current replicas
        periodSeconds: 60
      - type: Pods
        value: 2   # Or max 2 pods per minute
        periodSeconds: 60
      selectPolicy: Min  # Use most conservative

    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100  # Double current replicas
        periodSeconds: 30
      - type: Pods
        value: 4    # Or add 4 pods
        periodSeconds: 30
      selectPolicy: Max  # Use most aggressive
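To verify these policies behave as intended under load, watch the HPA's scaling events (resource names as in the manifest above):
# Recent scaling decisions and the reasons behind them
kubectl describe hpa mcp-server-langgraph -n mcp-server-langgraph
# Stream scaling events as they happen
kubectl get events -n mcp-server-langgraph \
  --field-selector involvedObject.kind=HorizontalPodAutoscaler --watch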

Custom Metrics

Install Prometheus Adapter:
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus:9090
Configure custom metrics:
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_total{namespace="mcp-server-langgraph"}'
      resources:
        template: <<.Resource>>
      name:
        matches: "^(.*)_total"
        as: "${1}_per_second"
      metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
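Before pointing the HPA at the new metric, confirm the adapter is serving it through the custom metrics API (a quick sanity check, assuming the install above):
# The APIService should report Available=True
kubectl get apiservice v1beta1.custom.metrics.k8s.io
# List exposed metrics, then query the derived one for pods in the namespace
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | python3 -m json.tool
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/mcp-server-langgraph/pods/*/http_requests_per_second"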

Vertical Pod Autoscaling (VPA)

Install VPA

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
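Verify the VPA components came up before creating VPA objects:
# Expect vpa-recommender, vpa-updater, and vpa-admission-controller to be Running
kubectl get pods -n kube-system | grep vpa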

VPA Configuration

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: mcp-server-langgraph-vpa
  namespace: mcp-server-langgraph
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server-langgraph
  updatePolicy:
    updateMode: "Auto"  # Auto, Recreate, Initial, Off
  resourcePolicy:
    containerPolicies:
    - containerName: mcp-server-langgraph
      minAllowed:
        cpu: 500m
        memory: 512Mi
      maxAllowed:
        cpu: 4000m
        memory: 8Gi
      controlledResources:
      - cpu
      - memory
Check recommendations:
kubectl describe vpa mcp-server-langgraph-vpa -n mcp-server-langgraph
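To extract just the recommended target, e.g. for scripting, a jsonpath query works:
# Recommended CPU/memory target for the first container
kubectl get vpa mcp-server-langgraph-vpa -n mcp-server-langgraph \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'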

Cluster Autoscaling

GKE

gcloud container clusters update langgraph-cluster \
  --enable-autoscaling \
  --min-nodes=3 \
  --max-nodes=10 \
  --zone=us-central1-a

EKS

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: langgraph-cluster
  region: us-east-1
nodeGroups:
  - name: workers
    instanceType: t3.xlarge
    minSize: 3
    maxSize: 10
    desiredCapacity: 3
    volumeSize: 100
    ssh:
      allow: false
    iam:
      withAddonPolicies:
        autoScaler: true
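Note that withAddonPolicies.autoScaler only grants the IAM permissions; the Cluster Autoscaler itself must still be deployed, for example via its Helm chart (a sketch; values assume the cluster above):
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=langgraph-cluster \
  --set awsRegion=us-east-1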

AKS

az aks update \
  --resource-group langgraph-rg \
  --name langgraph-cluster \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 10
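Confirm the settings took effect on the node pool:
az aks show --resource-group langgraph-rg --name langgraph-cluster \
  --query "agentPoolProfiles[].{name:name, autoscaling:enableAutoScaling, min:minCount, max:maxCount}" \
  --output table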

Load Testing

Generate Load

# Install k6
brew install k6  # macOS
# or download from https://k6.io

# Load test script
cat > load-test.js <<'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up to 100 users
    { duration: '5m', target: 100 },   // Stay at 100 users
    { duration: '2m', target: 200 },   // Ramp up to 200
    { duration: '5m', target: 200 },   // Stay at 200
    { duration: '2m', target: 0 },     // Ramp down
  ],
};

export default function () {
  const url = 'https://api.yourdomain.com/message';
  const payload = JSON.stringify({
    query: 'What is the capital of France?'
  });
  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + __ENV.AUTH_TOKEN,
    },
  };

  const res = http.post(url, payload, params);
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 2s': (r) => r.timings.duration < 2000,
  });

  sleep(1);
}
EOF

# Run test
export AUTH_TOKEN="your-token"
k6 run load-test.js

Monitor Scaling

# Watch pods scaling
watch kubectl get pods -n mcp-server-langgraph

# Watch HPA
watch kubectl get hpa -n mcp-server-langgraph

# View metrics
kubectl top pods -n mcp-server-langgraph
kubectl top nodes

Resource Limits

Right-Sizing

resources:
  requests:
    cpu: 1000m      # Guaranteed CPU
    memory: 1Gi     # Guaranteed memory
  limits:
    cpu: 4000m      # Max CPU
    memory: 4Gi     # Max memory
Guidelines:
  • Requests: Set to average usage (p50)
  • Limits: Set to peak usage (p95-p99)
  • CPU: Start with 1 core, adjust based on load
  • Memory: 1-2GB for typical workloads
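To base these numbers on observed usage rather than guesses, query Prometheus for actual percentiles (a sketch; adjust the Prometheus URL and selectors to your environment):
# p95 per-pod CPU usage (cores) over the past 7 days
curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
  'query=quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{namespace="mcp-server-langgraph"}[5m])[7d:5m])'
# p95 working-set memory over the past 7 days
curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
  'query=quantile_over_time(0.95, container_memory_working_set_bytes{namespace="mcp-server-langgraph"}[7d])'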

Quality of Service (QoS)

# Guaranteed QoS (best)
resources:
  requests:
    cpu: 2000m
    memory: 2Gi
  limits:
    cpu: 2000m    # Same as request
    memory: 2Gi   # Same as request

# Burstable QoS (good)
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 4000m    # Higher than request
    memory: 4Gi

# Best Effort QoS (avoid in production)
resources: {}  # No requests or limits
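Check which class a pod actually received:
# Prints Guaranteed, Burstable, or BestEffort
kubectl get pod <pod-name> -n mcp-server-langgraph -o jsonpath='{.status.qosClass}'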

Performance Tuning

Application-Level

# config.py

# LLM settings
MODEL_TIMEOUT = 60  # Seconds
MODEL_MAX_TOKENS = 4096

# Connection pooling
DATABASE_POOL_SIZE = 20
DATABASE_MAX_OVERFLOW = 10

# Caching
ENABLE_CACHE = True
CACHE_TTL = 300  # 5 minutes

# Rate limiting
RATE_LIMIT_PER_MINUTE = 1000

Kubernetes-Level

# Deployment optimizations
spec:
  replicas: 3

  # Rolling update strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Add 1 pod before removing old
      maxUnavailable: 0  # Keep all pods available

  template:
    spec:
      # Spread pods across zones for high availability
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: mcp-server-langgraph

      # Readiness gates
      readinessGates:
      - conditionType: "example.com/feature-1"

Database Tuning

Redis:
# redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru
save 900 1
save 300 10
save 60 10000
PostgreSQL:
# postgresql.conf
max_connections = 200
shared_buffers = 256MB
effective_cache_size = 1GB
maintenance_work_mem = 64MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1

Cost Optimization

Right-Size Instances

# Use VPA recommendations
kubectl get vpa mcp-server-langgraph-vpa -n mcp-server-langgraph -o yaml

# Adjust based on actual usage
resources:
  requests:
    cpu: 800m     # Reduced from 1000m
    memory: 1.5Gi # Reduced from 2Gi

Spot/Preemptible Instances

GKE:
gcloud container node-pools create spot-pool \
  --cluster=langgraph-cluster \
  --spot \
  --num-nodes=3 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=10
EKS:
nodeGroups:
  - name: spot-workers
    instancesDistribution:
      instanceTypes: ["t3.xlarge", "t3a.xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"
    minSize: 0
    maxSize: 10

Scheduled Scaling

# Scale down at night
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-night
spec:
  schedule: "0 0 * * *"  # Midnight
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - kubectl scale deployment mcp-server-langgraph -n mcp-server-langgraph --replicas=1
          restartPolicy: OnFailure

# Scale up in morning
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-morning
spec:
  schedule: "0 8 * * *"  # 8am
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - kubectl scale deployment mcp-server-langgraph -n mcp-server-langgraph --replicas=5
          restartPolicy: OnFailure
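For either Job to succeed, its pod must run under a ServiceAccount permitted to scale the deployment; a minimal setup (names are illustrative) is sketched below, with serviceAccountName: deployment-scaler then set in each Job template:
kubectl create serviceaccount deployment-scaler -n mcp-server-langgraph
kubectl create role deployment-scaler -n mcp-server-langgraph \
  --verb=get,patch,update --resource=deployments,deployments/scale
kubectl create rolebinding deployment-scaler -n mcp-server-langgraph \
  --role=deployment-scaler --serviceaccount=mcp-server-langgraph:deployment-scaler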

Monitoring Scaling

Key Metrics

# Replica count
count(kube_pod_info{namespace="mcp-server-langgraph", pod=~"mcp-server-langgraph.*"})

# CPU utilization
rate(container_cpu_usage_seconds_total{namespace="mcp-server-langgraph"}[5m])

# Memory utilization
container_memory_working_set_bytes{namespace="mcp-server-langgraph"}

# Request rate
rate(http_requests_total{namespace="mcp-server-langgraph"}[5m])

# HPA desired replicas
kube_horizontalpodautoscaler_status_desired_replicas{namespace="mcp-server-langgraph"}

Alerts

groups:
- name: scaling
  rules:
  - alert: HPAMaxedOut
    expr: |
      kube_horizontalpodautoscaler_status_desired_replicas{namespace="mcp-server-langgraph"}
      >=
      kube_horizontalpodautoscaler_spec_max_replicas{namespace="mcp-server-langgraph"}
    for: 5m
    annotations:
      summary: "HPA at maximum replicas"

  - alert: HighCPU
    expr: |
      avg(rate(container_cpu_usage_seconds_total{namespace="mcp-server-langgraph"}[5m])) > 0.8
    for: 2m
    annotations:
      summary: "High CPU utilization"

  - alert: HighMemory
    expr: |
      avg(container_memory_working_set_bytes{namespace="mcp-server-langgraph"})
      /
      avg(kube_pod_container_resource_limits{namespace="mcp-server-langgraph", resource="memory"}) > 0.85
    for: 2m
    annotations:
      summary: "High memory utilization"

Best Practices

Begin with conservative scaling settings:
  • Min replicas: 3 (for HA)
  • Max replicas: 10 (prevent runaway scaling)
  • Target CPU: 70% (leave headroom)
  • Scale-down delay: 5 minutes (prevent flapping)
Load test before production:
# Gradual load increase
k6 run --vus 10 --duration 5m load-test.js
k6 run --vus 50 --duration 10m load-test.js
k6 run --vus 100 --duration 15m load-test.js
Track scaling behavior:
  • HPA events
  • Pod creation/deletion
  • Resource utilization
  • Request latency
  • Error rates
Prevent too many pods from terminating at once by defining a PodDisruptionBudget:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcp-server-langgraph-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: mcp-server-langgraph
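Confirm the budget is active and leaves headroom for voluntary disruptions:
# ALLOWED DISRUPTIONS should be >= 1 when all replicas are healthy
kubectl get pdb mcp-server-langgraph-pdb -n mcp-server-langgraph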

Troubleshooting

# Check HPA status
kubectl describe hpa mcp-server-langgraph -n mcp-server-langgraph

# Check metrics server
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

# View current metrics
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/mcp-server-langgraph/pods
Symptom: New pods stuck in Pending
Reason: Out of resources
Fix:
  • Increase node resources
  • Enable cluster autoscaler
  • Reduce resource requests
  • Add more nodes
Symptom: Constant scale up/down (flapping)
Fix:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # Increase delay

Auto-Scaling Ready: Handle any load with automatic horizontal and vertical scaling!