Overview
This guide provides step-by-step instructions for deploying MCP Server LangGraph on Google Kubernetes Engine (GKE) Autopilot with enterprise-grade security, monitoring, and cost optimization.
Deployment Time
2-3 hours for complete production setup
Cost Savings
40-60% vs. traditional GKE Standard
Infrastructure Maturity
94/100 production-readiness score
Best Practices
100% GCP best practices compliance
What You’ll Deploy
- GKE Autopilot Cluster: Fully managed Kubernetes (regional, multi-zone)
- Cloud SQL PostgreSQL: High-availability database with automated backups
  - 3 databases: Keycloak (identity), OpenFGA (authorization), GDPR (compliance per ADR-0041)
- Memorystore Redis: High-availability cache with persistence
- Workload Identity: Secure pod-to-GCP service authentication
- Private Networking: VPC-native with Cloud NAT
- Observability: Cloud Monitoring, Logging, Trace, Profiler
- Security: Binary Authorization, Network Policies, Encryption
Key Benefits
40-60% Cost Savings
GKE Autopilot uses pay-per-pod pricing with no idle node costs. The production environment runs at about $970/month (see Expected Monthly Costs), compared with $1,290-1,970 on traditional GKE.
Zero Node Management
Google manages all node infrastructure, upgrades, and scaling automatically. Focus on your application, not Kubernetes operations.
99.9% Uptime
Regional deployment across 3 zones with automated failover for databases and cache provides enterprise-grade reliability.
Security by Default
Built-in Workload Identity, Binary Authorization, Shielded Nodes, Network Policies, and encryption at rest/transit.
Prerequisites
1. GCP Account & Project
Create a GCP project with billing enabled:
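A minimal sketch of this step, assuming the gcloud CLI is installed and authenticated. The project ID and billing account ID are placeholders, not values from this guide. The helper only builds the commands so they can be reviewed before anything is executed.

```python
import shlex


def project_setup_commands(project_id: str, billing_account: str) -> list[list[str]]:
    """Build the gcloud commands that create a project, link billing, and select it."""
    return [
        ["gcloud", "projects", "create", project_id],
        ["gcloud", "billing", "projects", "link", project_id,
         f"--billing-account={billing_account}"],
        ["gcloud", "config", "set", "project", project_id],
    ]


# Placeholder values (assumptions); replace with your own.
commands = project_setup_commands("my-mcp-prod", "000000-000000-000000")
for cmd in commands:
    # Print for review; pass each list to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```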
2. Install Required Tools
| Tool | Version | Installation |
|---|---|---|
| gcloud CLI | Latest | curl https://sdk.cloud.google.com \| bash |
| Terraform | ≥ 1.5.0 | terraform.io/downloads |
| kubectl | ≥ 1.28 | gcloud components install kubectl |
| kustomize | ≥ 5.0 | brew install kustomize |
3. Enable Required APIs
Enable 20+ required GCP APIs:
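The sketch below shows one way to script this step. The API list is an illustrative subset chosen to match the components named in this guide; it is an assumption, not the guide's full list of 20+ APIs. The batch size of 20 is a conservative choice for how many services to pass per call.

```python
# Illustrative subset only (assumption): the guide's full API list is not reproduced here.
APIS = [
    "compute.googleapis.com",
    "container.googleapis.com",
    "sqladmin.googleapis.com",
    "redis.googleapis.com",
    "servicenetworking.googleapis.com",
    "secretmanager.googleapis.com",
    "binaryauthorization.googleapis.com",
    "cloudkms.googleapis.com",
    "monitoring.googleapis.com",
    "logging.googleapis.com",
    "cloudtrace.googleapis.com",
]


def enable_api_commands(project_id: str, apis: list[str], batch_size: int = 20) -> list[list[str]]:
    """Build one `gcloud services enable` command per batch of APIs."""
    cmds = []
    for start in range(0, len(apis), batch_size):
        batch = apis[start:start + batch_size]
        cmds.append(["gcloud", "services", "enable", *batch, f"--project={project_id}"])
    return cmds
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the list of APIs has been checked against your Terraform modules.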
4. Configure IAM Permissions
Required roles (or roles/owner):
- roles/compute.networkAdmin
- roles/container.admin
- roles/cloudsql.admin
- roles/redis.admin
- roles/iam.securityAdmin
- roles/resourcemanager.projectIamAdmin
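As a sketch, the required roles can be granted with one `gcloud projects add-iam-policy-binding` call per role. The member string and project ID below are placeholders (assumptions); the role names are the ones listed in this section.

```python
REQUIRED_ROLES = [
    "roles/compute.networkAdmin",
    "roles/container.admin",
    "roles/cloudsql.admin",
    "roles/redis.admin",
    "roles/iam.securityAdmin",
    "roles/resourcemanager.projectIamAdmin",
]


def binding_commands(project_id: str, member: str) -> list[list[str]]:
    """Build one add-iam-policy-binding command for each required role."""
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member={member}", f"--role={role}"]
        for role in REQUIRED_ROLES
    ]


# Placeholder project and member (assumptions).
for cmd in binding_commands("my-mcp-prod", "user:deployer@example.com"):
    print(" ".join(cmd))
```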
Architecture
The production deployment creates a fully managed, highly available infrastructure.
Phase 1: Infrastructure Setup (30 minutes)
Step 1: Create Terraform State Backend
- Quick Setup
- With Options
Expected: GCS bucket created with versioning and logging enabled
Step 2: Configure Production Environment
Edit terraform.tfvars with your configuration:
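If you prefer to generate the file rather than edit it by hand, the following sketch renders simple values into tfvars syntax. The variable names (`project_id`, `region`, `enable_ha`, `redis_memory_size_gb`) are hypothetical; use the names your Terraform modules actually declare.

```python
def render_tfvars(values: dict) -> str:
    """Render flat string, boolean, and numeric values as `name = value` tfvars lines."""
    lines = []
    for name, value in values.items():
        if isinstance(value, bool):  # must be checked before int: bool is a subclass of int
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines) + "\n"


# Hypothetical variable names and values (assumptions).
print(render_tfvars({
    "project_id": "my-mcp-prod",
    "region": "us-central1",
    "enable_ha": True,
    "redis_memory_size_gb": 5,
}))
```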
Phase 2: Deploy Infrastructure (25 minutes)
Step 1: Initialize and Plan
Review the plan. It should create:
- 1 VPC network with 3 subnets
- 1 GKE Autopilot cluster (regional)
- 1 Cloud SQL instance (PostgreSQL 15, HA)
- 1 Memorystore instance (Redis 7.0, HA)
- 2 NAT IPs
- Multiple service accounts
- IAM bindings
- Firewall rules
- Monitoring alerts
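The plan review above can be partly automated. This sketch assumes you saved the plan with `terraform plan -out=tfplan` and exported it with `terraform show -json tfplan`. The resource type names assume the standard Google provider resources; if the modules in this repository wrap them differently, adjust `EXPECTED`.

```python
import json
from collections import Counter

# Expected counts taken from the list above; type names are assumptions.
EXPECTED = {
    "google_compute_network": 1,
    "google_compute_subnetwork": 3,
    "google_container_cluster": 1,
    "google_sql_database_instance": 1,
    "google_redis_instance": 1,
}


def planned_creates(plan_json: str) -> Counter:
    """Count resources the plan will create, grouped by Terraform resource type."""
    counts = Counter()
    for change in json.loads(plan_json).get("resource_changes", []):
        if "create" in change["change"]["actions"]:
            counts[change["type"]] += 1
    return counts


def mismatches(counts: Counter) -> dict:
    """Return {type: (expected, planned)} for every type whose count differs."""
    return {
        rtype: (want, counts.get(rtype, 0))
        for rtype, want in EXPECTED.items()
        if counts.get(rtype, 0) != want
    }
```

An empty result from `mismatches(planned_creates(open("plan.json").read()))` means the counted types match the list; it does not replace reading the plan.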
Step 2: Deploy
Duration: 20-25 minutes. Cloud SQL takes 10-12 minutes, GKE takes 8-10 minutes.
Step 3: Configure kubectl
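A sketch of the kubectl configuration step, assuming a regional cluster. The cluster name, region, and project ID are placeholders (assumptions); take the real values from your Terraform outputs.

```python
def kubectl_setup_commands(cluster: str, region: str, project_id: str) -> list[list[str]]:
    """Build the commands that fetch cluster credentials and confirm access."""
    return [
        ["gcloud", "container", "clusters", "get-credentials", cluster,
         f"--region={region}", f"--project={project_id}"],
        ["kubectl", "get", "nodes"],
    ]


# Placeholder values (assumptions).
for cmd in kubectl_setup_commands("mcp-prod-gke", "us-central1", "my-mcp-prod"):
    print(" ".join(cmd))
```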
Expected output:
Phase 3: Application Deployment (20 minutes)
Step 1: Create Secrets in Secret Manager
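One possible way to script secret creation is sketched below. The secret name in the usage comment is hypothetical, and the guide's own secret names are not reproduced here. The value is sent over stdin via `--data-file=-`, so it never appears in the command line or shell history.

```python
import subprocess


def store_secret(name: str, value: str, project_id: str, runner=subprocess.run) -> None:
    """Create a Secret Manager secret, then add its first version from stdin."""
    runner(
        ["gcloud", "secrets", "create", name,
         "--replication-policy=automatic", f"--project={project_id}"],
        check=True,
    )
    runner(
        ["gcloud", "secrets", "versions", "add", name,
         "--data-file=-", f"--project={project_id}"],
        input=value.encode(),
        check=True,
    )


# Usage (hypothetical secret name and project):
# store_secret("mcp-prod-example-secret", "change-me", "my-mcp-prod")
```

The `runner` parameter exists only so the function can be exercised without calling gcloud.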
Step 2: Install External Secrets Operator
Step 3: Deploy Application
Verify deployment:
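As a sketch, readiness can be checked from the JSON that `kubectl get pods -n NAMESPACE -o json` prints (the namespace is whatever your deployment uses). The function lists pods that are not Running with every container ready. Pods from completed Jobs would also be listed, so treat the result as a prompt to look closer rather than as a verdict.

```python
import json


def unready_pods(pods_json: str) -> list[str]:
    """Return names of pods that are not Running with all containers ready."""
    unready = []
    for pod in json.loads(pods_json)["items"]:
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        ready = (
            status.get("phase") == "Running"
            and bool(containers)
            and all(c.get("ready") for c in containers)
        )
        if not ready:
            unready.append(pod["metadata"]["name"])
    return unready
```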
Phase 4: Security Hardening (30 minutes)
Binary Authorization
Enable image signing to ensure only trusted container images run in your cluster:
1. Run Setup Script
The setup script creates:
- KMS key for signing
- Binary Authorization attestor
- Enforcement policy
2. Sign Images
3. Enable in Cluster
Edit terraform.tfvars, then apply the change.
Learn more about Binary Authorization for complete setup details.
Phase 5: Observability (10 minutes)
Setup Monitoring
- Custom Cloud Monitoring dashboard
- Alert policies (CPU, memory, errors, latency)
- Uptime checks
- SLO definitions
Access Dashboards
GKE Workloads
View pod status, deployments, services
Cloud Monitoring
Custom dashboards, metrics, alerts
Cloud Logging
Centralized log aggregation
Cloud Trace
Distributed tracing
Verification & Testing
Health Checks
Test all health endpoints:
All health checks should return HTTP 200
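A minimal sketch of such a check using only the standard library. The base URL and endpoint paths in the usage comment are assumptions; substitute the health paths your deployment actually exposes.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with a non-2xx status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, and similar


def failing_endpoints(base_url: str, paths: list[str], fetch=http_status) -> dict:
    """Return {path: status} for every endpoint that did not answer 200."""
    results = {path: fetch(base_url.rstrip("/") + path) for path in paths}
    return {path: status for path, status in results.items() if status != 200}


# Usage (hypothetical URL and paths):
# print(failing_endpoints("http://localhost:8000", ["/health/live", "/health/ready"]))
```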
Database Connectivity
Should return:
ok
Workload Identity
Verify service account annotation:
Should show:
mcp-prod-app-sa@PROJECT_ID.iam.gserviceaccount.com
Troubleshooting
Pods Stuck in Pending
Diagnosis:
Common causes:
- Resource requests too high (Autopilot provisions automatically but has limits)
- Image pull errors (check Workload Identity permissions)
- Binary Authorization blocking unsigned images
Solution:
- Reduce CPU/memory requests in deployment
- Verify image exists in Artifact Registry
- Sign the image if Binary Auth is enabled
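To narrow down which cause applies, the sketch below inspects the JSON from `kubectl get pod POD_NAME -n NAMESPACE -o json`. It covers only scheduling failures and image pull failures, using standard Kubernetes status fields; Binary Authorization denials are not detected by this sketch.

```python
import json


def pending_reasons(pod_json: str) -> list[str]:
    """Summarize scheduling and image-pull problems reported in a pod's status."""
    status = json.loads(pod_json).get("status", {})
    reasons = []
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            reasons.append(f"unschedulable: {cond.get('message', 'no message')}")
    for container in status.get("containerStatuses", []):
        waiting = container.get("state", {}).get("waiting", {})
        if waiting.get("reason") in ("ErrImagePull", "ImagePullBackOff"):
            reasons.append(f"image pull failure in container {container['name']}")
    return reasons
```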
Can't Access Cloud SQL
Diagnosis:
Solution:
- Verify private service connection exists
- Check Cloud SQL instance is running
- Verify Cloud SQL Proxy sidecar configuration
- Check Workload Identity IAM bindings
Workload Identity Not Working
Diagnosis:
Solution:
- Verify annotation: iam.gke.io/gcp-service-account
- Check IAM binding exists
- Wait 1-2 minutes for propagation
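The first two checks above can be sketched as follows. The inputs are the JSON from `kubectl get serviceaccount KSA_NAME -n NAMESPACE -o json` and `gcloud iam service-accounts get-iam-policy GSA_EMAIL --format=json`. The namespace and Kubernetes service account name are placeholders (assumptions); the binding checked is the standard `roles/iam.workloadIdentityUser` grant.

```python
import json


def workload_identity_problems(ksa_json: str, gsa_policy_json: str, project_id: str,
                               namespace: str, ksa_name: str, gsa_email: str) -> list[str]:
    """Report a missing annotation or a missing workloadIdentityUser binding."""
    problems = []
    annotations = json.loads(ksa_json).get("metadata", {}).get("annotations", {})
    if annotations.get("iam.gke.io/gcp-service-account") != gsa_email:
        problems.append("annotation iam.gke.io/gcp-service-account is missing or wrong")

    member = f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"
    bound = any(
        b.get("role") == "roles/iam.workloadIdentityUser" and member in b.get("members", [])
        for b in json.loads(gsa_policy_json).get("bindings", [])
    )
    if not bound:
        problems.append(f"{member} lacks roles/iam.workloadIdentityUser on {gsa_email}")
    return problems
```

An empty list means both checks passed; if the pod still cannot authenticate, allow for the propagation delay noted above.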
Cost Optimization
Expected Monthly Costs
- Production
- With Commitments
| Component | Configuration | Cost/Month |
|---|---|---|
| GKE Autopilot | ~25 pods (500m CPU, 1GB RAM avg) | $360 |
| Cloud SQL | 4 vCPU, 15GB RAM, HA + replica | $280 |
| Memorystore | 5GB Redis HA | $220 |
| Networking | NAT, egress | $60 |
| Monitoring | Standard retention | $50 |
| Total | | $970 |
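As a quick arithmetic check, the component figures in the table can be summed and broken down by share. The numbers are copied from the table; nothing else is assumed.

```python
# Monthly cost per component in USD, copied from the table above.
MONTHLY_COSTS_USD = {
    "GKE Autopilot": 360,
    "Cloud SQL": 280,
    "Memorystore": 220,
    "Networking": 60,
    "Monitoring": 50,
}

total = sum(MONTHLY_COSTS_USD.values())
# Percentage of the monthly total attributable to each component.
shares = {name: round(100 * cost / total, 1) for name, cost in MONTHLY_COSTS_USD.items()}

print(total)  # 970
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The three managed services (GKE Autopilot, Cloud SQL, Memorystore) account for most of the total, so rightsizing effort is best spent there first.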
Cost Optimization Guide
Learn how to achieve 40-60% cost savings with rightsizing, committed use discounts, and automation.
Next Steps
Operational Runbooks
Day-2 operations, incident response, maintenance procedures
Security Hardening
Enable VPC Service Controls, configure Cloud Armor, implement policies
CI/CD Pipeline
Set up automated deployments with ArgoCD and GitHub Actions
Monitoring & SLOs
Configure custom dashboards, define SLIs/SLOs, set up alerts
Related Documentation
Infrastructure as Code
Terraform modules for VPC, GKE, Cloud SQL, Redis
Multi-Environment Setup
Dev, staging, production configurations
Disaster Recovery
Multi-region failover and backup automation
Service Mesh
Anthos Service Mesh for advanced traffic management
Binary Authorization
Image signing and policy enforcement
GKE Staging
Staging environment setup
Support Resources
Complete Technical Documentation
For detailed technical documentation, see:
- GKE Deployment Guide (800+ lines, root directory)
- Terraform Module READMEs (5,000+ lines technical docs)
- GCP Best Practices Summary (root directory)