Overview
This guide provides operational procedures for managing MCP Server LangGraph on GKE Autopilot in production, including incident response, maintenance tasks, and troubleshooting.
Daily Health Check (5 minutes)
Run this every morning to ensure system health:
Check Cluster Status
gcloud container clusters describe production-mcp-server-langgraph-gke \
--region=us-central1 \
--format="value(status)"
Check Pod Health
kubectl get pods -n production-mcp-server-langgraph
# Check for failed pods
kubectl get pods -n production-mcp-server-langgraph \
--field-selector=status.phase!=Running,status.phase!=Succeeded
All pods should show Running status with 0-1 restarts
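To surface restart counts at a glance, one option is a custom-columns view (a sketch; the container index assumes the app container is listed first in each pod):
kubectl get pods -n production-mcp-server-langgraph \
-o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount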
Review Recent Events
kubectl get events -n production-mcp-server-langgraph \
--sort-by='.lastTimestamp' \
--field-selector=type!=Normal \
| tail -10
No ERROR or WARNING events in last hour
Check Database Status
gcloud sql instances describe mcp-prod-postgres \
--format="table(name,state,ipAddresses[0].ipAddress)"
State should be: RUNNABLE
View Recent Errors
gcloud logging read \
'resource.type="k8s_container" AND resource.labels.namespace_name="production-mcp-server-langgraph" AND severity>=ERROR' \
--limit=10 \
--format="table(timestamp,jsonPayload.message)"
No critical errors in last hour
Incident Response
P0: Service Down (Complete Outage)
Symptoms: All pods crashing, health checks failing, users cannot access the service
Response Time: 5-10 minutes
Immediate Assessment
# Check pod status
kubectl get pods -n production-mcp-server-langgraph
# Get pod logs
kubectl logs -n production-mcp-server-langgraph -l app=mcp-server-langgraph --tail=100
# Describe failing pods
kubectl describe pod POD_NAME -n production-mcp-server-langgraph
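If a pod is crash-looping, its current logs may be empty; the previous container's logs usually hold the actual error:
kubectl logs -n production-mcp-server-langgraph POD_NAME --previous --tail=100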
Quick Fixes
Restart Deployment
kubectl rollout restart deployment/production-mcp-server-langgraph \
-n production-mcp-server-langgraph
Rollback
kubectl rollout undo deployment/production-mcp-server-langgraph \
-n production-mcp-server-langgraph
Scale Up
kubectl scale deployment production-mcp-server-langgraph \
-n production-mcp-server-langgraph \
--replicas=10
Verify Recovery
kubectl get pods -n production-mcp-server-langgraph
kubectl exec -it -n production-mcp-server-langgraph POD_NAME -- curl http://localhost:8000/health/live
Post-Incident Analysis
# Export logs for analysis
gcloud logging read \
'resource.type="k8s_container" AND severity>=ERROR' \
--limit=50 \
--format=json \
> incident-logs.json
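To get a quick summary from the exported file, a jq one-liner like the following can help (a sketch; assumes jq is installed and that entries carry either jsonPayload.message or textPayload):
jq -r '.[] | [.timestamp, .severity, (.jsonPayload.message // .textPayload // "")] | @tsv' incident-logs.json | head -20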
Escalation: If not resolved in 10 minutes, escalate to the on-call architect
P1: Performance Degradation
Symptoms: Slow response times, high CPU/memory, increased error rates
Check Resource Usage
kubectl top pods -n production-mcp-server-langgraph
kubectl top nodes
Check HPA Status
kubectl get hpa -n production-mcp-server-langgraph -o yaml
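kubectl describe shows the HPA's recent scaling decisions and any warnings (for example, missing metrics), which the YAML view buries:
kubectl describe hpa production-mcp-server-langgraph -n production-mcp-server-langgraph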
Review Database Performance
gcloud sql operations list \
--instance=mcp-prod-postgres \
--filter="startTime>=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S).000000Z"
Temporary Scale Up
kubectl scale deployment production-mcp-server-langgraph \
-n production-mcp-server-langgraph \
--replicas=10
# Monitor improvement
watch -n 5 'kubectl top pods -n production-mcp-server-langgraph'
Permanent Fix: Adjust resource requests/limits based on observed usage (see the sketch below)
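One way to apply new values immediately is kubectl set resources (a sketch; the values and the container name mcp-server-langgraph are illustrative, and the change should also land in the production-gke overlay so it survives the next deploy):
kubectl set resources deployment/production-mcp-server-langgraph \
-n production-mcp-server-langgraph \
-c mcp-server-langgraph \
--requests=cpu=500m,memory=512Mi \
--limits=cpu=1,memory=1Gi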
P2: Database Connection Issues
Check Cloud SQL Proxy
kubectl logs -n production-mcp-server-langgraph POD_NAME -c cloud-sql-proxy --tail=50
Verify Instance Running
gcloud sql instances describe mcp-prod-postgres --format="value(state)"
Restart Proxy
kubectl rollout restart deployment/production-mcp-server-langgraph -n production-mcp-server-langgraph
Check Connection Count
kubectl port-forward -n production-mcp-server-langgraph svc/production-mcp-server-langgraph 5432:5432 &
psql "host=localhost user=postgres" -c "SELECT count(*) FROM pg_stat_activity;"
Kill Idle Connections
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < NOW() - INTERVAL '10 minutes';"
Increase Max Connections (via Terraform)
database_flags = {
  "max_connections" = "200"
}
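If a Terraform apply is not practical during an incident, the flag can also be set directly with gcloud; note that --database-flags replaces the full set of flags (include any existing ones) and the change may restart the instance. Reconcile the value back into Terraform afterwards.
gcloud sql instances patch mcp-prod-postgres \
--database-flags=max_connections=200 \
--project=PROJECT_ID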
Deployment Operations
Standard Deployment (via CI/CD)
Recommended: Use the automated GitHub Actions workflow for production deployments.
Create Release
gh release create v1.1.0 \
--title "Release 1.1.0" \
--notes "Release notes here"
Monitor CI/CD
The workflow automatically:
Builds and scans image
Requests manual approval
Deploys to production
Runs validation tests
Rolls back on failure
Monitor at: https://github.com/USER/REPO/actions
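The same run can also be followed from the terminal with the GitHub CLI (the workflow file name deploy.yml is an assumption; adjust to the repository's actual workflow):
gh run list --workflow=deploy.yml --limit=5
gh run watch RUN_ID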
Verify Deployment
kubectl rollout status deployment/production-mcp-server-langgraph -n production-mcp-server-langgraph
kubectl get pods -n production-mcp-server-langgraph
Manual Deployment (Emergency)
Use only for emergencies. Bypasses automated testing and approval gates.
cd deployments/overlays/production-gke
## Update image tag
kustomize edit set image mcp-server-langgraph=REGISTRY/IMAGE:NEW_TAG
## Dry run
kubectl apply -k . --dry-run=server
## Apply
kubectl apply -k .
## Monitor
kubectl rollout status deployment/production-mcp-server-langgraph -n production-mcp-server-langgraph
Rollback Deployment
Quick Rollback
# Rollback to previous revision
kubectl rollout undo deployment/production-mcp-server-langgraph \
-n production-mcp-server-langgraph
# Monitor
kubectl rollout status deployment/production-mcp-server-langgraph \
-n production-mcp-server-langgraph
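Specific Revision
To roll back to a specific revision rather than the immediately previous one, first list the rollout history, then pass --to-revision (REVISION is the number shown in the history output):
kubectl rollout history deployment/production-mcp-server-langgraph \
-n production-mcp-server-langgraph
kubectl rollout undo deployment/production-mcp-server-langgraph \
-n production-mcp-server-langgraph \
--to-revision=REVISION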
Database Operations
Manual Backup
gcloud sql backups create \
--instance=mcp-prod-postgres \
--description="Manual backup before major change" \
--project=PROJECT_ID
## List backups
gcloud sql backups list --instance=mcp-prod-postgres
Restore from Backup
Caution: Restoring to the same instance is destructive. Restore to a new instance first for safety.
Restore to New Instance
# Clone to new instance
gcloud sql instances clone mcp-prod-postgres mcp-prod-postgres-restored \
--backup-id=BACKUP_ID \
--project=PROJECT_ID
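Point-in-Time Recovery
If point-in-time recovery is enabled on the instance, a clone can instead target a timestamp (a sketch; the timestamp is illustrative and must fall within the retained transaction log window):
gcloud sql instances clone mcp-prod-postgres mcp-prod-postgres-pitr \
--point-in-time="2024-01-01T03:00:00Z" \
--project=PROJECT_ID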
Recommended: Test on the new instance, then switch the application's connection if successful
Database Maintenance
Check Maintenance Window:
gcloud sql instances describe mcp-prod-postgres \
--format="value(settings.maintenanceWindow)"
Reschedule Maintenance:
gcloud sql instances patch mcp-prod-postgres \
--maintenance-window-day=1 \
--maintenance-window-hour=3 \
--project=PROJECT_ID
Scaling Operations
Manual Scaling
Scale Deployment
kubectl scale deployment production-mcp-server-langgraph \
-n production-mcp-server-langgraph \
--replicas=10
Update HPA
kubectl patch hpa production-mcp-server-langgraph \
-n production-mcp-server-langgraph \
--patch '{"spec":{"minReplicas":5,"maxReplicas":30}}'
Cluster Resource Monitoring
## View resource usage
kubectl top pods -n production-mcp-server-langgraph
kubectl top nodes
## Cloud Monitoring query
gcloud monitoring time-series list \
--filter='metric.type="kubernetes.io/container/cpu/core_usage_time"' \
--project=PROJECT_ID
Security Operations
Rotate Secrets
Generate New Secrets
NEW_JWT_SECRET=$(openssl rand -base64 32)
NEW_API_KEY=$(openssl rand -base64 32)
Update Secret Manager
# Update secret version
gcloud secrets versions add mcp-production-secrets \
--data-file=<(jq --arg jwt "$NEW_JWT_SECRET" '.jwt_secret = $jwt' secrets.json) \
--project=PROJECT_ID
Pods Auto-Restart
External Secrets Operator automatically syncs secrets and Reloader restarts pods. Monitor: kubectl get pods -n production-mcp-server-langgraph -w
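To confirm the new version propagated, check the ExternalSecret sync status and the latest Secret Manager versions (resource names here mirror the ones used above; adjust if the ExternalSecret is named differently):
kubectl get externalsecret -n production-mcp-server-langgraph
gcloud secrets versions list mcp-production-secrets --limit=3 --project=PROJECT_ID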
Audit Access Logs
## View cluster access
gcloud logging read \
'protoPayload.serviceName="container.googleapis.com" AND protoPayload.methodName="io.k8s.core.v1.pods.exec"' \
--limit=50
## View kubectl commands
gcloud logging read \
'protoPayload.serviceName="container.googleapis.com"' \
--limit=20
Review Binary Authorization Denials
gcloud logging read \
'protoPayload.serviceName="binaryauthorization.googleapis.com" AND protoPayload.response.allow=false' \
--limit=20
Common Tasks
View Logs
Real-time
kubectl logs -f -n production-mcp-server-langgraph \
-l app=mcp-server-langgraph \
--max-log-requests=10
Cloud Logging
gcloud logging read \
'resource.type="k8s_container" AND resource.labels.namespace_name="production-mcp-server-langgraph"' \
--limit=50 \
--format="table(timestamp,severity,jsonPayload.message)"
Specific Container
kubectl logs -f -n production-mcp-server-langgraph POD_NAME -c cloud-sql-proxy
Update Configuration
# Edit ConfigMap
kubectl edit configmap production-mcp-server-langgraph-config \
-n production-mcp-server-langgraph
# Restart to pick up changes (Reloader does this automatically)
kubectl rollout restart deployment/production-mcp-server-langgraph \
-n production-mcp-server-langgraph
Connect to Database
Cloud SQL Proxy
cloud-sql-proxy PROJECT_ID:us-central1:mcp-prod-postgres &
psql "host=localhost port=5432 user=postgres dbname=mcp_langgraph"
kubectl Port Forward
kubectl port-forward -n production-mcp-server-langgraph svc/production-mcp-server-langgraph 5432:5432 &
psql "host=localhost port=5432 user=postgres dbname=mcp_langgraph"
Export Metrics
# Export to JSON for analysis
gcloud monitoring time-series list \
--filter='resource.type="k8s_container"' \
--format=json \
> metrics-$(date +%Y%m%d).json
Maintenance Windows
Cluster Upgrades
GKE Autopilot upgrades automatically based on release channel (STABLE for production).
Release Channels :
RAPID: Weekly (for testing)
REGULAR: Monthly (for general use)
STABLE: Quarterly (for production) ✅
Check upgrade status:
gcloud container clusters describe production-mcp-server-langgraph-gke \
--region=us-central1 \
--format="yaml(currentMasterVersion,releaseChannel)"
Manual upgrade (if needed):
## Check available versions
gcloud container get-server-config --region=us-central1
## Upgrade
gcloud container clusters upgrade production-mcp-server-langgraph-gke \
--region=us-central1 \
--cluster-version=VERSION
Database Maintenance
Scheduled maintenance: Sunday 3 AM UTC (configured in Terraform)
Reschedule:
gcloud sql instances patch mcp-prod-postgres \
--maintenance-window-day=1 \
--maintenance-window-hour=3 \
--project=PROJECT_ID
# Day 1 = Monday, hour 3 = 3 AM UTC
Defer one-time :
gcloud sql instances reschedule-maintenance mcp-prod-postgres \
--reschedule-type=NEXT_AVAILABLE_WINDOW
Monitoring & Alerting
Active Alerts
## List firing alerts
gcloud alpha monitoring policies list \
--filter="enabled=true" \
--project=PROJECT_ID
## Get alert details
gcloud alpha monitoring policies describe POLICY_ID
Create Custom Alert
gcloud alpha monitoring policies create \
--notification-channels=CHANNEL_ID \
--display-name="High Error Rate" \
--condition-threshold-value=0.05 \
--condition-threshold-duration=300s
View Metrics
Access the Cloud Monitoring dashboards in the Google Cloud Console.
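To find the dashboard names from the terminal (a sketch; assumes dashboards already exist in the project):
gcloud monitoring dashboards list --project=PROJECT_ID --format="table(displayName,name)"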
Disaster Recovery
Cloud SQL Failover
Manual Failover (for testing):
gcloud sql instances failover mcp-prod-postgres --project=PROJECT_ID
## Monitor
watch -n 5 'gcloud sql instances describe mcp-prod-postgres --format="value(state)"'
Recovery Time: 2-3 minutes for automatic failover
Full DR Procedure
For complete disaster recovery automation:
./deployments/disaster-recovery/gcp-dr-automation.sh PROJECT_ID us-east1 us-central1 full
Complete DR Guide: multi-region failover, backup restoration, and RTO/RPO targets
Emergency Procedures
Emergency Stop
This stops all traffic. Use only in critical situations (security breach, data corruption).
## Scale to zero
kubectl scale deployment production-mcp-server-langgraph \
-n production-mcp-server-langgraph \
--replicas=0
## Resume
kubectl scale deployment production-mcp-server-langgraph \
-n production-mcp-server-langgraph \
--replicas=3
Emergency Maintenance Mode
# Apply maintenance page
apiVersion: v1
kind: Service
metadata:
  name: maintenance-page
spec:
  selector:
    app: maintenance
  ports:
  - port: 80
    targetPort: 8080
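A minimal way to put it in place, assuming the manifest above is saved as maintenance-page.yaml and that pods labeled app: maintenance are already running:
kubectl apply -f maintenance-page.yaml -n production-mcp-server-langgraph
# Remove when the incident is over
kubectl delete -f maintenance-page.yaml -n production-mcp-server-langgraph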
On-Call Rotation
Configure in PagerDuty/Opsgenie.
Escalation Path:
P0/P1: On-call engineer (immediate)
P2: Engineering team lead (within 4 hours)
P3: Ticket for next sprint
Communication: #production-incidents (Slack)