Overview
This guide covers deploying the MCP Server to a production-grade staging environment on Google Kubernetes Engine (GKE) with:
- 🔒 Security Hardening: Private nodes, Binary Authorization, Network Policies
- 🔑 Keyless Authentication: Workload Identity Federation for GitHub Actions
- 🌐 Network Isolation: Separate VPC from production
- 📊 Full Observability: Cloud Logging, Monitoring, and Trace
- 🤖 Automated Deployments: GitHub Actions with approval gates
This is a production-ready staging environment suitable for pre-production testing and validation.
Architecture
Prerequisites
- GCP Project: vishnu-sandbox-20250310 (or your project ID)
- gcloud CLI: Installed and authenticated
- kubectl: Installed
- GitHub Repository: Access to repository settings
Step 1: Infrastructure Setup
Automated Setup (Recommended)
Run the automated infrastructure setup script. It creates:
- ✅ Staging VPC network (10.1.0.0/20)
- ✅ GKE Autopilot cluster with security hardening
- ✅ Cloud SQL PostgreSQL instance
- ✅ Memorystore Redis instance
- ✅ Workload Identity for pods
- ✅ GitHub Actions Workload Identity Federation
- ✅ Artifact Registry repository
- ✅ Secret Manager secrets
Manual Infrastructure Setup (Advanced)
If you prefer manual setup, follow these steps:
1.1 Create VPC Network
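As a minimal sketch of this step, the function below creates a custom-mode VPC with one subnet on the 10.1.0.0/20 range used in this guide. The network name, subnet name, and region are assumptions chosen for illustration, not values taken from the setup script.

```shell
# Assumed values; override them via environment variables.
PROJECT_ID="${PROJECT_ID:-vishnu-sandbox-20250310}"
REGION="${REGION:-us-central1}"              # assumed region
VPC_NAME="${VPC_NAME:-staging-vpc}"          # hypothetical network name
SUBNET_NAME="${SUBNET_NAME:-staging-subnet}" # hypothetical subnet name

create_staging_vpc() {
  # Custom mode: no subnets are created automatically.
  gcloud compute networks create "$VPC_NAME" \
    --project="$PROJECT_ID" \
    --subnet-mode=custom || return 1

  # One subnet on the staging range, with Private Google Access so
  # private nodes can reach Google APIs without public IPs.
  gcloud compute networks subnets create "$SUBNET_NAME" \
    --project="$PROJECT_ID" \
    --network="$VPC_NAME" \
    --region="$REGION" \
    --range=10.1.0.0/20 \
    --enable-private-ip-google-access
}
```

Run `create_staging_vpc` once gcloud is authenticated against the target project.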
1.2 Create GKE Cluster
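A hedged sketch of creating an Autopilot cluster with private nodes on the staging network. The cluster, network, and subnet names and the region are assumptions; any further hardening (for example Binary Authorization) is left to the full setup script.

```shell
# Assumed values; override them via environment variables.
PROJECT_ID="${PROJECT_ID:-vishnu-sandbox-20250310}"
REGION="${REGION:-us-central1}"               # assumed region
CLUSTER_NAME="${CLUSTER_NAME:-mcp-staging}"   # hypothetical cluster name
VPC_NAME="${VPC_NAME:-staging-vpc}"           # hypothetical network name
SUBNET_NAME="${SUBNET_NAME:-staging-subnet}"  # hypothetical subnet name

create_staging_cluster() {
  # Autopilot cluster whose nodes get no public IP addresses.
  gcloud container clusters create-auto "$CLUSTER_NAME" \
    --project="$PROJECT_ID" \
    --region="$REGION" \
    --network="$VPC_NAME" \
    --subnetwork="$SUBNET_NAME" \
    --enable-private-nodes
}
```

Autopilot clusters have Workload Identity enabled by default, so no extra flag is needed for it here.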
1.3 Create Managed Services
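The following is a minimal sketch only, using the tiers listed in the cost table (db-custom-1-4096 and a 2 GB standard Redis). Instance names, the region, and the PostgreSQL version are assumptions, and private IP wiring for Cloud SQL is deliberately omitted.

```shell
# Assumed values; override them via environment variables.
PROJECT_ID="${PROJECT_ID:-vishnu-sandbox-20250310}"
REGION="${REGION:-us-central1}"                    # assumed region
VPC_NAME="${VPC_NAME:-staging-vpc}"                # hypothetical network name
SQL_INSTANCE="${SQL_INSTANCE:-mcp-staging-db}"     # hypothetical name
REDIS_INSTANCE="${REDIS_INSTANCE:-mcp-staging-redis}" # hypothetical name

create_managed_services() {
  # Cloud SQL for PostgreSQL; the version is an assumption.
  gcloud sql instances create "$SQL_INSTANCE" \
    --project="$PROJECT_ID" \
    --database-version=POSTGRES_15 \
    --tier=db-custom-1-4096 \
    --region="$REGION" || return 1

  # Memorystore for Redis, standard tier, 2 GB, on the staging VPC.
  gcloud redis instances create "$REDIS_INSTANCE" \
    --project="$PROJECT_ID" \
    --region="$REGION" \
    --tier=standard \
    --size=2 \
    --network="$VPC_NAME"
}
```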
See the full setup script for complete Cloud SQL and Redis setup.
Setup Output
After running the setup script, you’ll receive:
Step 2: Update API Keys
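One way to replace a placeholder is to add a new version to the existing secret in Secret Manager. The sketch below assumes the secret already exists; the secret name in the usage line is hypothetical.

```shell
PROJECT_ID="${PROJECT_ID:-vishnu-sandbox-20250310}"

update_secret() {
  # $1 = secret name, $2 = new secret value
  # The value is piped on stdin so it is not written to disk.
  printf '%s' "$2" | gcloud secrets versions add "$1" \
    --project="$PROJECT_ID" \
    --data-file=-
}
```

Example call, with a hypothetical secret name: `update_secret example-api-key "$EXAMPLE_API_KEY"`.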
Update the placeholder secrets with real API keys.
Step 3: Install External Secrets Operator
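External Secrets Operator is commonly installed from its Helm chart. This guide does not state which install method it uses, so treat the following as a sketch that assumes Helm and an `external-secrets` namespace and release name.

```shell
ESO_NAMESPACE="${ESO_NAMESPACE:-external-secrets}"  # assumed namespace

install_external_secrets() {
  # Register the chart repository and refresh the local index.
  helm repo add external-secrets https://charts.external-secrets.io || return 1
  helm repo update || return 1

  # Install the operator into its own namespace.
  helm install external-secrets external-secrets/external-secrets \
    --namespace "$ESO_NAMESPACE" \
    --create-namespace
}
```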
External Secrets Operator syncs secrets from GCP Secret Manager to Kubernetes.
Step 4: Configure GitHub Environment
4.1 Create GitHub Environment
- Go to your repository on GitHub
- Navigate to Settings → Environments
- Click New environment
- Name it staging
- Configure protection rules:
  - ✅ Required reviewers: Add 1-2 reviewers
  - ✅ Wait timer: 5 minutes (allows review before deployment)
  - ✅ Deployment branches: main, release/*
4.2 Update GitHub Workflow
Edit .github/workflows/deploy-staging-gke.yaml and replace PROJECT_NUMBER with your actual project number.
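If the project number is not at hand, gcloud can look it up from the project ID. The helper names below are illustrative, and the replacement assumes `PROJECT_NUMBER` appears in the workflow file as a literal placeholder.

```shell
get_project_number() {
  # $1 = project ID; prints the numeric project number
  gcloud projects describe "$1" --format='value(projectNumber)'
}

replace_project_number() {
  # $1 = workflow file, $2 = project number
  # Writes to a temp file first so the edit works with any sed.
  sed "s/PROJECT_NUMBER/$2/g" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

For example: `replace_project_number .github/workflows/deploy-staging-gke.yaml "$(get_project_number vishnu-sandbox-20250310)"`. Review the resulting diff before committing.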
Step 5: Deploy Application
First-Time Deployment (Manual)
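A manual deployment could look roughly like the sketch below. It assumes the manifests are a Kustomize overlay (suggested by the `deployment-patch.yaml` mentioned later in this guide); the overlay path, cluster name, namespace, and deployment name are all hypothetical.

```shell
# Assumed values; override them via environment variables.
PROJECT_ID="${PROJECT_ID:-vishnu-sandbox-20250310}"
REGION="${REGION:-us-central1}"                     # assumed region
CLUSTER_NAME="${CLUSTER_NAME:-mcp-staging}"         # hypothetical cluster name
NAMESPACE="${NAMESPACE:-mcp-staging}"               # hypothetical namespace
DEPLOYMENT="${DEPLOYMENT:-mcp-server}"              # hypothetical deployment name
OVERLAY_DIR="${OVERLAY_DIR:-path/to/staging/overlay}" # hypothetical overlay path

deploy_staging() {
  # Point kubectl at the staging cluster.
  gcloud container clusters get-credentials "$CLUSTER_NAME" \
    --project="$PROJECT_ID" --region="$REGION" || return 1

  # Apply the overlay, then wait for the rollout to finish.
  kubectl apply -k "$OVERLAY_DIR" || return 1
  kubectl rollout status "deployment/$DEPLOYMENT" \
    --namespace "$NAMESPACE" --timeout=5m
}
```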
For the first deployment, deploy manually to verify everything works.
Automated Deployments (GitHub Actions)
Once manual deployment succeeds, GitHub Actions handles future deployments.
Triggered by:
- ✅ Push to main branch
- ✅ Pre-release creation
- ✅ Manual workflow dispatch
Workflow steps:
- Build Docker image
- Push to Artifact Registry
- Deploy to GKE staging
- Run smoke tests
- Validate deployment
- Auto-rollback on failure
Step 6: Verify Deployment
Run Smoke Tests
Check Health Endpoints
View Logs
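As a sketch, logs can be read either directly from the pods or through Cloud Logging. The namespace and the `app` label value are assumptions about how the workload is labelled.

```shell
# Assumed values; override them via environment variables.
PROJECT_ID="${PROJECT_ID:-vishnu-sandbox-20250310}"
NAMESPACE="${NAMESPACE:-mcp-staging}"   # hypothetical namespace
APP_LABEL="${APP_LABEL:-mcp-server}"    # hypothetical app label value

view_staging_logs() {
  # Last 100 lines from pods matching the app label.
  kubectl logs --namespace "$NAMESPACE" \
    --selector "app=$APP_LABEL" --tail=100 || return 1

  # Container logs from the last hour via Cloud Logging.
  gcloud logging read \
    "resource.type=\"k8s_container\" AND resource.labels.namespace_name=\"$NAMESPACE\"" \
    --project="$PROJECT_ID" --limit=50 --freshness=1h
}
```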
Cloud Logging (recommended):
Security Features
Network Isolation
- Separate VPC: Staging has its own VPC (10.1.0.0/20)
- Network Policies: Restrict pod-to-pod and egress traffic
- Private GKE nodes: Nodes have no public IP addresses
Workload Identity
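The binding behind Workload Identity can be sketched as two steps: allow the Kubernetes service account to impersonate the GCP service account, then annotate the Kubernetes service account. The service account names and namespace below are hypothetical.

```shell
# Assumed values; override them via environment variables.
PROJECT_ID="${PROJECT_ID:-vishnu-sandbox-20250310}"
NAMESPACE="${NAMESPACE:-mcp-staging}"   # hypothetical namespace
KSA_NAME="${KSA_NAME:-mcp-server}"      # hypothetical Kubernetes service account
GSA_EMAIL="${GSA_EMAIL:-mcp-staging@${PROJECT_ID}.iam.gserviceaccount.com}" # hypothetical

bind_workload_identity() {
  # Let the Kubernetes service account act as the GCP service account.
  gcloud iam service-accounts add-iam-policy-binding "$GSA_EMAIL" \
    --project="$PROJECT_ID" \
    --role=roles/iam.workloadIdentityUser \
    --member="serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]" || return 1

  # Tell GKE which GCP service account this Kubernetes account maps to.
  kubectl annotate serviceaccount "$KSA_NAME" \
    --namespace "$NAMESPACE" \
    "iam.gke.io/gcp-service-account=${GSA_EMAIL}" --overwrite
}
```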
Pods authenticate as GCP service accounts without keys.
Binary Authorization
Only signed/approved images can be deployed.
Secret Management
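To illustrate the sync, the sketch below applies a minimal ExternalSecret that copies one Secret Manager entry into a Kubernetes Secret. All names are hypothetical, a ClusterSecretStore for GCP Secret Manager is assumed to exist already, and the `apiVersion` depends on the installed operator version (newer releases serve `external-secrets.io/v1`).

```shell
# Assumed values; override them via environment variables.
NAMESPACE="${NAMESPACE:-mcp-staging}"           # hypothetical namespace
SECRET_STORE="${SECRET_STORE:-gcp-secret-store}" # hypothetical ClusterSecretStore
K8S_SECRET="${K8S_SECRET:-mcp-api-keys}"        # hypothetical Kubernetes Secret name
GCP_SECRET="${GCP_SECRET:-example-api-key}"     # hypothetical Secret Manager name

apply_external_secret() {
  # The manifest is passed on stdin; variables are expanded by the shell.
  kubectl apply --namespace "$NAMESPACE" -f - <<EOF
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ${K8S_SECRET}
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: ${SECRET_STORE}
  target:
    name: ${K8S_SECRET}
  data:
    - secretKey: api-key
      remoteRef:
        key: ${GCP_SECRET}
EOF
}
```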
Secrets are stored in GCP Secret Manager and synced via External Secrets.
Monitoring & Observability
Cloud Console Dashboards
Access monitoring in Cloud Console:
Key Metrics
Monitor these metrics in Cloud Monitoring:
- kubernetes.io/container/cpu/core_usage_time - CPU usage
- kubernetes.io/container/memory/used_bytes - Memory usage
- kubernetes.io/pod/network/received_bytes_count - Network traffic
- Custom metrics: custom.googleapis.com/mcp-staging/*
Set Up Alerts
Create alert policies for:
Troubleshooting
Pods not starting
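A first pass at diagnosing pods that will not start might look like this sketch. The namespace is an assumption; pass a pod name to get its full description.

```shell
NAMESPACE="${NAMESPACE:-mcp-staging}"   # hypothetical namespace

diagnose_pods() {
  # Overall pod state, including node placement.
  kubectl get pods --namespace "$NAMESPACE" -o wide

  # Recent events, oldest first; image pull and scheduling errors show up here.
  kubectl get events --namespace "$NAMESPACE" --sort-by=.lastTimestamp

  # Optional: $1 = pod name to describe in detail.
  if [ -n "${1:-}" ]; then
    kubectl describe pod "$1" --namespace "$NAMESPACE"
  fi
}
```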
Check pod status:
Common issues:
- Image pull errors: Check Artifact Registry permissions
- Cloud SQL proxy fails: Verify service account has cloudsql.client role
- Secrets not found: Check External Secrets sync status
External Secrets not syncing
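Sync problems usually surface in the ExternalSecret's status conditions or in the operator logs. The sketch below assumes the operator runs as a deployment named `external-secrets` in an `external-secrets` namespace, which depends on how it was installed.

```shell
NAMESPACE="${NAMESPACE:-mcp-staging}"               # hypothetical namespace
ESO_NAMESPACE="${ESO_NAMESPACE:-external-secrets}"  # assumed operator namespace

check_external_secrets() {
  # $1 = name of the ExternalSecret to inspect
  kubectl get externalsecrets --namespace "$NAMESPACE"

  # Status conditions and events explain most sync failures.
  kubectl describe externalsecret "$1" --namespace "$NAMESPACE"

  # Operator logs show permission and lookup errors.
  kubectl logs --namespace "$ESO_NAMESPACE" \
    deployment/external-secrets --tail=50
}
```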
Check ExternalSecret status:
Common fixes:
- Verify Workload Identity binding
- Check secret exists in Secret Manager
- Ensure service account has secretAccessor role
Cloud SQL connection fails
Check Cloud SQL proxy logs:
Verify connection:
Network policies blocking traffic
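A sketch for inspecting policies and temporarily removing one while testing. The namespace is an assumption, and restoring relies on re-applying the policy's manifest from your repository, so keep that file at hand before deleting anything.

```shell
NAMESPACE="${NAMESPACE:-mcp-staging}"   # hypothetical namespace

list_network_policies() {
  kubectl get networkpolicy --namespace "$NAMESPACE"
}

disable_network_policy() {
  # $1 = policy name; show it first, then delete it.
  kubectl describe networkpolicy "$1" --namespace "$NAMESPACE" || return 1
  kubectl delete networkpolicy "$1" --namespace "$NAMESPACE"
}

restore_network_policy() {
  # $1 = path to the policy manifest in your repository
  kubectl apply --namespace "$NAMESPACE" -f "$1"
}
```

Restore the policy as soon as the test is done; without it, the affected pods accept traffic that staging is meant to block.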
List network policies:
Temporarily disable for testing:
Rollback Procedures
Automatic Rollback
GitHub Actions automatically rolls back on deployment failure.
Manual Rollback
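A manual rollback can be sketched with `kubectl rollout`. The namespace and deployment name are assumptions; pass a revision number to target a specific revision, or omit it to return to the previous one.

```shell
NAMESPACE="${NAMESPACE:-mcp-staging}"   # hypothetical namespace
DEPLOYMENT="${DEPLOYMENT:-mcp-server}"  # hypothetical deployment name

rollback_staging() {
  # Show the available revisions before changing anything.
  kubectl rollout history "deployment/$DEPLOYMENT" \
    --namespace "$NAMESPACE" || return 1

  # $1 (optional) = revision number to roll back to.
  if [ -n "${1:-}" ]; then
    kubectl rollout undo "deployment/$DEPLOYMENT" \
      --namespace "$NAMESPACE" --to-revision="$1" || return 1
  else
    kubectl rollout undo "deployment/$DEPLOYMENT" \
      --namespace "$NAMESPACE" || return 1
  fi

  # Wait until the rolled-back pods are ready.
  kubectl rollout status "deployment/$DEPLOYMENT" \
    --namespace "$NAMESPACE" --timeout=5m
}
```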
Cost Optimization
Current Costs (Estimated)
| Resource | Tier | Monthly Cost |
|---|---|---|
| GKE Autopilot | 2-3 pods avg | ~$100 |
| Cloud SQL | db-custom-1-4096 | ~$40 |
| Memorystore Redis | 2GB standard | ~$50 |
| Networking | VPC, egress | ~$20 |
| Total | | ~$210/month |
Optimization Tips
Use Autopilot
GKE Autopilot optimizes resource usage automatically. You only pay for pod resources, not node overhead.
Rightsize Resources
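Comparing configured requests and limits against live usage is one way to spot over-provisioning. The sketch below assumes a deployment named `mcp-server` in an `mcp-staging` namespace; both names are hypothetical.

```shell
NAMESPACE="${NAMESPACE:-mcp-staging}"   # hypothetical namespace
DEPLOYMENT="${DEPLOYMENT:-mcp-server}"  # hypothetical deployment name

review_resources() {
  # Requests and limits as currently configured on the deployment.
  kubectl get deployment "$DEPLOYMENT" --namespace "$NAMESPACE" \
    -o jsonpath='{.spec.template.spec.containers[*].resources}' || return 1
  echo

  # Live CPU and memory usage per container.
  kubectl top pods --namespace "$NAMESPACE" --containers
}
```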
Review resource requests/limits.
Adjust in deployment-patch.yaml if needed.
Use Preemptible Nodes
Not recommended for staging, but possible for dev environments.
Monitor Egress
Excessive egress to LLM APIs can increase costs. Consider:
- Caching responses
- Request batching
- Rate limiting
Next Steps
- Production Deployment: Deploy to production GKE
- Monitoring Setup: Advanced monitoring and alerting
- Disaster Recovery: Backup and recovery procedures
- CI/CD Guide: Complete CI/CD pipeline documentation
Staging Deployment Complete! Your production-grade staging environment is ready for testing.