Overview

This guide provides operational procedures for managing MCP Server LangGraph on GKE Autopilot in production, including incident response, maintenance tasks, and troubleshooting.

Daily Health Check (5 minutes)

Run this every morning to ensure system health (a combined script is sketched after step 5):
1. Check Cluster Status

gcloud container clusters describe production-mcp-server-langgraph-gke \
  --region=us-central1 \
  --format="value(status)"
Should return: RUNNING
2. Check Pod Health

kubectl get pods -n production-mcp-server-langgraph

# Check for failed pods
kubectl get pods -n production-mcp-server-langgraph \
  --field-selector=status.phase!=Running,status.phase!=Succeeded
All pods should show Running status with 0-1 restarts
3. Review Recent Events

kubectl get events -n production-mcp-server-langgraph \
  --sort-by='.lastTimestamp' \
  --field-selector=type!=Normal \
  | tail -10
No ERROR or WARNING events in last hour
4. Check Database Status

gcloud sql instances describe mcp-prod-postgres \
  --format="table(name,state,ipAddresses[0].ipAddress)"
State should be: RUNNABLE
5. View Recent Errors

gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.namespace_name="production-mcp-server-langgraph" AND severity>=ERROR' \
  --limit=10 \
  --format="table(timestamp,jsonPayload.message)"
No critical errors in last hour
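
The five checks above can be combined into a single script. This is a minimal sketch, assuming gcloud and kubectl are already authenticated against the production project and cluster; the filename is illustrative.

#!/usr/bin/env bash
# daily-health-check.sh (illustrative name): run the five morning checks in one pass.
set -euo pipefail
NAMESPACE=production-mcp-server-langgraph

echo "== Cluster status (expect RUNNING) =="
gcloud container clusters describe production-mcp-server-langgraph-gke \
  --region=us-central1 --format="value(status)"

echo "== Pods not Running/Succeeded (expect none) =="
kubectl get pods -n "$NAMESPACE" \
  --field-selector=status.phase!=Running,status.phase!=Succeeded

echo "== Recent non-Normal events =="
kubectl get events -n "$NAMESPACE" \
  --sort-by='.lastTimestamp' --field-selector=type!=Normal | tail -10

echo "== Cloud SQL state (expect RUNNABLE) =="
gcloud sql instances describe mcp-prod-postgres --format="value(state)"

echo "== Recent ERROR logs =="
gcloud logging read \
  "resource.type=\"k8s_container\" AND resource.labels.namespace_name=\"$NAMESPACE\" AND severity>=ERROR" \
  --limit=10 --format="table(timestamp,jsonPayload.message)"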

Incident Response

P0: Service Down (Complete Outage)

Symptoms: All pods crashing, health checks failing, users cannot access service
Response Time: 5-10 minutes
1. Immediate Assessment

# Check pod status
kubectl get pods -n production-mcp-server-langgraph

# Get pod logs
kubectl logs -n production-mcp-server-langgraph -l app=mcp-server-langgraph --tail=100

# Describe failing pods
kubectl describe pod POD_NAME -n production-mcp-server-langgraph
2. Quick Fixes

  • Restart Deployment (command below)
  • Rollback (sketch below)
  • Scale Up (sketch below)
kubectl rollout restart deployment/production-mcp-server-langgraph \
  -n production-mcp-server-langgraph
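
The Rollback and Scale Up options from the list above, as a sketch using the same deployment and namespace (the replica count is illustrative):

# Roll back to the previous revision
kubectl rollout undo deployment/production-mcp-server-langgraph \
  -n production-mcp-server-langgraph

# Scale up for more headroom
kubectl scale deployment production-mcp-server-langgraph \
  -n production-mcp-server-langgraph \
  --replicas=10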
3. Verify Recovery

kubectl get pods -n production-mcp-server-langgraph
kubectl exec -it -n production-mcp-server-langgraph POD_NAME -- curl http://localhost:8000/health/live
4. Post-Incident Analysis

# Export logs for analysis
gcloud logging read \
  'resource.type="k8s_container" AND severity>=ERROR' \
  --limit=50 \
  --format=json \
  > incident-logs.json
Escalation: If not resolved in 10 minutes, escalate to on-call architect

P1: Performance Degradation

Symptoms: Slow response times, high CPU/memory, increased error rates
1. Check Resource Usage

kubectl top pods -n production-mcp-server-langgraph
kubectl top nodes
2. Check HPA Status

kubectl get hpa -n production-mcp-server-langgraph -o yaml
3. Review Database Performance

gcloud sql operations list \
  --instance=mcp-prod-postgres \
  --filter="startTime>=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S).000000Z"
4. Temporary Scale Up

kubectl scale deployment production-mcp-server-langgraph \
  -n production-mcp-server-langgraph \
  --replicas=10

# Monitor improvement
watch -n 5 'kubectl top pods -n production-mcp-server-langgraph'
Permanent Fix: Adjust resource requests/limits based on actual usage
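
As a sketch of that permanent fix, requests and limits can be adjusted in place with kubectl; the values below are illustrative and should come from observed usage (kubectl top), with the same change mirrored in the deployment manifests or kustomize overlay.

kubectl set resources deployment/production-mcp-server-langgraph \
  -n production-mcp-server-langgraph \
  --requests=cpu=500m,memory=512Mi \
  --limits=cpu=1,memory=1Gi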

P2: Database Connection Issues

Check Cloud SQL Proxy:
kubectl logs -n production-mcp-server-langgraph POD_NAME -c cloud-sql-proxy --tail=50
Verify Instance Running:
gcloud sql instances describe mcp-prod-postgres --format="value(state)"
Restart Proxy:
kubectl rollout restart deployment/production-mcp-server-langgraph -n production-mcp-server-langgraph
Check Connection Count:
kubectl port-forward -n production-mcp-server-langgraph deploy/production-mcp-server-langgraph 5432:5432 &
psql "host=localhost user=postgres" -c "SELECT count(*) FROM pg_stat_activity;"
Kill Idle Connections:
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < NOW() - INTERVAL '10 minutes';"
Increase Max Connections (via Terraform):
# In the google_sql_database_instance settings block
database_flags {
  name  = "max_connections"
  value = "200"
}
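
If a one-off change outside Terraform is acceptable, the same flag can be set with gcloud; note that --database-flags replaces the instance's full set of flags and may trigger a restart, so include any existing flags in the list.

gcloud sql instances patch mcp-prod-postgres \
  --database-flags=max_connections=200 \
  --project=PROJECT_ID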

Deployment Operations

Standard Deployment (via CI/CD)

Recommended: Use the automated GitHub Actions workflow for production deployments.
1. Create Release

gh release create v1.1.0 \
  --title "Release 1.1.0" \
  --notes "Release notes here"
2. Monitor CI/CD

The workflow automatically:
  • Builds and scans image
  • Requests manual approval
  • Deploys to production
  • Runs validation tests
  • Rolls back on failure
Monitor at: https://github.com/USER/REPO/actions
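
The run can also be watched from the CLI with the GitHub CLI; a minimal sketch, assuming gh is authenticated against the repository:

# List the most recent workflow runs
gh run list --limit 5

# Follow a run live (pick the run ID from the list above)
gh run watch RUN_ID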
3. Verify Deployment

kubectl rollout status deployment/production-mcp-server-langgraph -n production-mcp-server-langgraph
kubectl get pods -n production-mcp-server-langgraph

Manual Deployment (Emergency)

Use only for emergencies. Bypasses automated testing and approval gates.
cd deployments/overlays/production-gke

## Update image tag
kustomize edit set image mcp-server-langgraph=REGISTRY/IMAGE:NEW_TAG

## Dry run
kubectl apply -k . --dry-run=server

## Apply
kubectl apply -k .

## Monitor
kubectl rollout status deployment/production-mcp-server-langgraph -n production-mcp-server-langgraph

Rollback Deployment

  • Quick Rollback (commands below)
  • Specific Revision (sketch below)
# Rollback to previous revision
kubectl rollout undo deployment/production-mcp-server-langgraph \
  -n production-mcp-server-langgraph

# Monitor
kubectl rollout status deployment/production-mcp-server-langgraph \
  -n production-mcp-server-langgraph
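
For the Specific Revision option, a sketch (the revision number is illustrative; pick one from the history output):

# List revisions
kubectl rollout history deployment/production-mcp-server-langgraph \
  -n production-mcp-server-langgraph

# Roll back to a specific revision
kubectl rollout undo deployment/production-mcp-server-langgraph \
  -n production-mcp-server-langgraph \
  --to-revision=2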

Database Operations

Manual Backup

gcloud sql backups create \
  --instance=mcp-prod-postgres \
  --description="Manual backup before major change" \
  --project=PROJECT_ID

## List backups
gcloud sql backups list --instance=mcp-prod-postgres

Restore from Backup

Caution: Restoring to the same instance is destructive. Restore to a new instance first for safety.
  • Restore to New Instance (commands below)
  • Point-in-Time Recovery (sketch below)
# Clone to new instance
gcloud sql instances clone mcp-prod-postgres mcp-prod-postgres-restored \
  --backup-id=BACKUP_ID \
  --project=PROJECT_ID
Recommended: Test on new instance, then switch connection if successful
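
For the Point-in-Time Recovery option, a sketch, assuming point-in-time recovery (transaction log retention) is enabled on the instance; the timestamp and target instance name are illustrative:

gcloud sql instances clone mcp-prod-postgres mcp-prod-postgres-pitr \
  --point-in-time="2025-01-01T03:00:00Z" \
  --project=PROJECT_ID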

Database Maintenance

Check Maintenance Window:
gcloud sql instances describe mcp-prod-postgres \
  --format="value(settings.maintenanceWindow)"
Reschedule Maintenance:
gcloud sql instances patch mcp-prod-postgres \
  --maintenance-window-day=MON \
  --maintenance-window-hour=3 \
  --project=PROJECT_ID

Scaling Operations

Manual Scaling

kubectl scale deployment production-mcp-server-langgraph \
  -n production-mcp-server-langgraph \
  --replicas=10
Update HPA:
kubectl patch hpa production-mcp-server-langgraph \
  -n production-mcp-server-langgraph \
  --patch '{"spec":{"minReplicas":5,"maxReplicas":30}}'

Cluster Resource Monitoring

## View resource usage
kubectl top pods -n production-mcp-server-langgraph
kubectl top nodes

## Cloud Monitoring query
gcloud monitoring time-series list \
  --filter='metric.type="kubernetes.io/container/cpu/core_usage_time"' \
  --project=PROJECT_ID

Security Operations

Rotate Secrets

1. Generate New Secrets

NEW_JWT_SECRET=$(openssl rand -base64 32)
NEW_API_KEY=$(openssl rand -base64 32)
2. Update Secret Manager

# Update secret version
gcloud secrets versions add mcp-production-secrets \
  --data-file=<(jq --arg jwt "$NEW_JWT_SECRET" '.jwt_secret = $jwt' secrets.json) \
  --project=PROJECT_ID
3. Pods Auto-Restart

External Secrets Operator automatically syncs secrets and Reloader restarts pods.
Monitor:
kubectl get pods -n production-mcp-server-langgraph -w
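
To confirm the rotation propagated, a sketch (assumes the External Secrets resources live in the same namespace):

# Latest versions in Secret Manager
gcloud secrets versions list mcp-production-secrets --limit=3 --project=PROJECT_ID

# ExternalSecret sync status (the SecretSynced condition should be True)
kubectl get externalsecret -n production-mcp-server-langgraph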

Audit Access Logs

## View cluster access
gcloud logging read \
  'protoPayload.serviceName="container.googleapis.com" AND protoPayload.methodName="io.k8s.core.v1.pods.exec"' \
  --limit=50

## View kubectl commands
gcloud logging read \
  'protoPayload.serviceName="container.googleapis.com"' \
  --limit=20

Review Binary Authorization Denials

gcloud logging read \
  'protoPayload.serviceName="binaryauthorization.googleapis.com" AND protoPayload.response.allow=false' \
  --limit=20
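
To see the policy that produced a denial, the current Binary Authorization policy can be exported; a sketch, assuming the policy is managed in this project:

gcloud container binauthz policy export --project=PROJECT_ID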

Common Tasks

Tail application logs:
kubectl logs -f -n production-mcp-server-langgraph \
  -l app=mcp-server-langgraph \
  --max-log-requests=10
Update configuration:
# Edit ConfigMap
kubectl edit configmap production-mcp-server-langgraph-config \
  -n production-mcp-server-langgraph

# Restart to pick up changes (Reloader does this automatically)
kubectl rollout restart deployment/production-mcp-server-langgraph \
  -n production-mcp-server-langgraph
Connect to the database:
  • Cloud SQL Proxy (commands below)
  • kubectl Port Forward (sketch below)
cloud-sql-proxy PROJECT_ID:us-central1:mcp-prod-postgres &
psql "host=localhost port=5432 user=postgres dbname=mcp_langgraph"
Export metrics:
# Export to JSON for analysis
gcloud monitoring time-series list \
  --filter='resource.type="k8s_container"' \
  --format=json \
  > metrics-$(date +%Y%m%d).json

Maintenance Windows

Cluster Upgrades

GKE Autopilot upgrades automatically based on release channel (STABLE for production).
Release Channels:
  • RAPID: Weekly (for testing)
  • REGULAR: Monthly (for general use)
  • STABLE: Quarterly (for production) ✅
Check upgrade status:
gcloud container clusters describe production-mcp-server-langgraph-gke \
  --region=us-central1 \
  --format="yaml(currentMasterVersion,releaseChannel)"
Manual upgrade (if needed):
## Check available versions
gcloud container get-server-config --region=us-central1

## Upgrade
gcloud container clusters upgrade production-mcp-server-langgraph-gke \
  --region=us-central1 \
  --cluster-version=VERSION

Database Maintenance

Scheduled maintenance: Sunday 3 AM UTC (configured in Terraform).
Reschedule:
# Move the window to Monday, 3 AM UTC
gcloud sql instances patch mcp-prod-postgres \
  --maintenance-window-day=MON \
  --maintenance-window-hour=3 \
  --project=PROJECT_ID
Defer one-time:
gcloud sql instances reschedule-maintenance mcp-prod-postgres \
  --reschedule-type=NEXT_AVAILABLE_WINDOW

Monitoring & Alerting

Active Alerts

## List enabled alert policies
gcloud alpha monitoring policies list \
  --filter="enabled=true" \
  --project=PROJECT_ID

## Get alert details
gcloud alpha monitoring policies describe POLICY_ID

Create Custom Alert

# Define the policy (display name, notification channel, threshold 0.05, duration 300s)
# in an AlertPolicy YAML/JSON file, then create it from the file.
# high-error-rate-policy.yaml is an illustrative filename.
gcloud alpha monitoring policies create \
  --policy-from-file=high-error-rate-policy.yaml \
  --project=PROJECT_ID

View Metrics

Access the Cloud Monitoring dashboards in the Google Cloud Console: https://console.cloud.google.com/monitoring

Disaster Recovery

Cloud SQL Failover

Manual Failover (for testing):
gcloud sql instances failover mcp-prod-postgres --project=PROJECT_ID

## Monitor
watch -n 5 'gcloud sql instances describe mcp-prod-postgres --format="value(state)"'
Recovery Time: 2-3 minutes for automatic failover

Full DR Procedure

For complete disaster recovery automation:
./deployments/disaster-recovery/gcp-dr-automation.sh PROJECT_ID us-east1 us-central1 full

See the Complete DR Guide for multi-region failover, backup restoration, and RTO/RPO targets.

Emergency Procedures

Emergency Stop

This stops all traffic. Use only in critical situations (security breach, data corruption).
## Scale to zero
kubectl scale deployment production-mcp-server-langgraph \
  -n production-mcp-server-langgraph \
  --replicas=0

## Resume
kubectl scale deployment production-mcp-server-langgraph \
  -n production-mcp-server-langgraph \
  --replicas=3

Emergency Maintenance Mode

# maintenance-page Service manifest (fronts pods labeled app: maintenance)
apiVersion: v1
kind: Service
metadata:
  name: maintenance-page
spec:
  selector:
    app: maintenance
  ports:
  - port: 80
    targetPort: 8080
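
Routing traffic to the maintenance Service depends on your ingress setup. A sketch, assuming a standard networking.k8s.io/v1 Ingress named production-mcp-server-langgraph (hypothetical name) and a running deployment labeled app: maintenance; adjust the path index and resource name to your actual Ingress or Gateway:

kubectl patch ingress production-mcp-server-langgraph \
  -n production-mcp-server-langgraph \
  --type=json \
  -p='[{"op":"replace","path":"/spec/rules/0/http/paths/0/backend/service/name","value":"maintenance-page"}]'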

Contacts & Escalation

On-Call Rotation

Configure in PagerDuty/Opsgenie.
Escalation Path:
  1. P0/P1: On-call engineer (immediate)
  2. P2: Engineering team lead (within 4 hours)
  3. P3: Ticket for next sprint
Communication: #production-incidents (Slack)