Overview
MCP Server uses a multi-environment Terraform strategy to maintain consistent infrastructure across development, staging, and production while optimizing costs and managing risk.
- Development: Cost-optimized for testing
- Staging: Production-like validation
- Production: HA, resilient, monitored
Environment Philosophy
1. Consistent Modules
Same Terraform modules across all environments
- Reduces “works in dev, breaks in prod” issues
- Ensures feature parity
- Simplifies testing
2. Parameter-Driven Differences
Use variables to adjust sizing, HA, and features
- Dev: Single zone, smaller instances
- Staging: Regional, production-like
- Prod: Regional, HA, monitoring
3. Isolated State
Separate Terraform state per environment
- Prevents accidental cross-environment changes
- Allows independent lifecycle management
- Enables parallel deployments
4. Progressive Rollout
Test changes in dev → staging → prod
- Validate in dev
- Load test in staging
- Deploy to prod with confidence
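The "Consistent Modules" principle can be checked mechanically in CI. The following is a minimal sketch, not the project's actual tooling: it hashes `main.tf`, `variables.tf`, and `outputs.tf` in each environment directory and reports any copy that differs from the first environment. The directory names `gcp-staging` and `gcp-prod` and the helper names `find_drift` and `demo` are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

# Files that are expected to be byte-identical in every environment directory.
SHARED_FILES = ("main.tf", "variables.tf", "outputs.tf")


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(env_dirs: list[Path]) -> dict[str, list[str]]:
    """Map each shared file name to the environments whose copy is missing
    or differs from the copy in the first (reference) environment.

    Raises FileNotFoundError if the reference environment lacks a shared file.
    """
    reference, *others = env_dirs
    drift: dict[str, list[str]] = {}
    for name in SHARED_FILES:
        expected = file_digest(reference / name)
        for env in others:
            candidate = env / name
            if not candidate.is_file() or file_digest(candidate) != expected:
                drift.setdefault(name, []).append(env.name)
    return drift


def demo() -> dict[str, list[str]]:
    """Build three throwaway environment directories and report drift."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical directory names; only gcp-dev appears in this document.
        envs = [Path(tmp) / n for n in ("gcp-dev", "gcp-staging", "gcp-prod")]
        for env in envs:
            env.mkdir()
            for name in SHARED_FILES:
                (env / name).write_text(f"# shared {name}\n")
        # Simulate an accidental edit made only in staging.
        (envs[1] / "main.tf").write_text("# edited in staging only\n")
        return find_drift(envs)
```

An empty result from `find_drift` means the shared files match everywhere; a non-empty result can fail the pipeline before any plan is run.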
Environment Comparison
Configuration Matrix
| Component | Development | Staging | Production |
|---|---|---|---|
| GKE Cluster | Zonal (1 zone) | Regional (3 zones) | Regional (3 zones) |
| GKE Mode | Autopilot | Autopilot | Autopilot |
| Pod Replicas | 1-2 | 2-3 | 3-20 (HPA) |
| Cloud SQL | db-custom-1-3840 | db-custom-2-7680 | db-custom-4-15360 |
| SQL HA | Single zone | Regional HA | Regional HA + replicas |
| Redis | BASIC (no SLA) | STANDARD_HA | STANDARD_HA + persistence |
| Redis Memory | 1 GB | 3 GB | 5 GB |
| Monitoring | Basic (free tier) | Full | Full + SLI/SLO |
| Backups | Daily (7 days) | Daily (14 days) | Daily (30 days) + PITR |
| Binary Auth | ❌ Disabled | ✅ Audit mode | ✅ Enforcing |
| Network | Default VPC | Custom VPC | Custom VPC + Cloud Armor |
| Cost/month | ~$100 | ~$310 | ~$970 |
Directory Structure
Key principle: main.tf, variables.tf, outputs.tf are identical across environments. Only terraform.tfvars differs.
Development Environment
Purpose
- Rapid iteration for engineers
- Cost optimization (minimal resources)
- No SLA requirements
Configuration
- terraform.tfvars
- Estimated Cost
Tradeoffs
Acceptable for dev: Downtime doesn’t impact customers, cost savings are significant.
Staging Environment
Purpose
- Pre-production validation
- Load testing with production-like config
- Integration testing across services
- Security scanning before prod deployment
Configuration
- terraform.tfvars
- Estimated Cost
Key Differences from Production
No Read Replicas
- Production: Has Cloud SQL read replicas for read scaling.
- Staging: No read replicas (cost savings, same testing value).
Binary Auth in Audit Mode
- Production: Binary Authorization enforcing (blocks unsigned images).
- Staging: Audit mode (logs denials, doesn’t block). Allows testing image signing without risk.
Smaller Database
- Production: db-custom-4-15360 (4 vCPU).
- Staging: db-custom-2-7680 (2 vCPU). Sufficient for load testing without production traffic.
Production Environment
Purpose
- Live customer traffic
- 99.9% uptime SLA
- Security compliance (Binary Auth, audit logs)
- Full observability (SLI/SLO, alerting)
Configuration
- terraform.tfvars
- Estimated Cost
Production-Only Features
Binary Authorization Enforcing
Point-in-Time Recovery (PITR)
Recover the database to any second within the retention period. Essential for disaster recovery.
Read Replicas
Scale read traffic without impacting the primary database. Improves performance and availability.
GKE Backup
Automated cluster backups for disaster recovery. Enables cluster restoration after catastrophic failure.
SLI/SLO Monitoring
Service-level objectives with error budgets.
- 99.9% availability SLO
- P95 latency < 2s
- P99 latency < 5s
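To make the error budget concrete, the arithmetic behind a 99.9% availability SLO can be sketched as follows. The 30-day window is an assumption (the document does not state the measurement window), and the function names are illustrative. Under that assumption, the budget works out to roughly 43 minutes of unavailability per window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO.

    `slo` is a fraction, e.g. 0.999 for 99.9%. The 30-day default window
    is an assumption, not a value taken from this document.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget
```

For example, `budget_remaining(0.999, 10.8)` is about 0.75: a single 10.8-minute outage consumes a quarter of the monthly budget.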
Deployment Strategy
Progressive Rollout
1. Deploy to Development
- Terraform apply succeeds
- Cluster is accessible
- Basic smoke tests pass
2. Deploy to Staging
- Integration tests pass
- Load tests show acceptable performance
- Security scans complete (no critical issues)
3. Deploy to Production (with approval)
- Canary deployment (10% traffic)
- Monitor SLI/SLO metrics
- Full rollout after 24 hours with no issues
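The production step can be read as a small decision rule: keep the canary running while metrics stay within the SLOs, and promote only after the 24-hour soak. The sketch below is one possible encoding, not the project's deployment tooling. The thresholds come from the SLOs listed in this document; rolling back as soon as a metric breaches its threshold is an assumption, as are all names (`CanaryMetrics`, `canary_decision`, and so on).

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ROLLBACK = "rollback"
    WAIT = "wait"
    PROMOTE = "promote"


@dataclass(frozen=True)
class CanaryMetrics:
    availability: float   # fraction of successful requests, e.g. 0.9995
    p95_latency_s: float  # 95th percentile latency in seconds
    p99_latency_s: float  # 99th percentile latency in seconds


# Thresholds taken from the SLOs stated in this document.
MIN_AVAILABILITY = 0.999
MAX_P95_S = 2.0
MAX_P99_S = 5.0
SOAK_HOURS = 24.0  # "Full rollout after 24 hours with no issues"


def within_slo(m: CanaryMetrics) -> bool:
    """True when every canary metric satisfies its SLO threshold."""
    return (
        m.availability >= MIN_AVAILABILITY
        and m.p95_latency_s < MAX_P95_S
        and m.p99_latency_s < MAX_P99_S
    )


def canary_decision(m: CanaryMetrics, hours_elapsed: float) -> Decision:
    """Decide what to do with a canary that has run for `hours_elapsed`."""
    if not within_slo(m):
        # Assumption: any SLO breach triggers an immediate rollback.
        return Decision.ROLLBACK
    if hours_elapsed < SOAK_HOURS:
        return Decision.WAIT
    return Decision.PROMOTE
```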
Rollback Strategy
Infrastructure Rollback
If an infrastructure change causes issues:
State Management
Separate State Per Environment
- Development
- Staging
- Production
terraform/environments/gcp-dev/backend.tf
Best practice: Same bucket, different prefixes (cost-effective and easy to manage).
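As an illustration of the "same bucket, different prefixes" layout, the sketch below renders a `backend "gcs"` block per environment. The bucket name `example-terraform-state` and the `env/<name>` prefix scheme are hypothetical placeholders, not values from this project; only the `bucket` and `prefix` arguments of the GCS backend are standard Terraform.

```python
STATE_BUCKET = "example-terraform-state"  # hypothetical bucket name


def backend_prefix(environment: str) -> str:
    """State prefix for one environment (hypothetical naming scheme)."""
    return f"env/{environment}"


def render_backend(environment: str, bucket: str = STATE_BUCKET) -> str:
    """Render the contents of a backend.tf for the given environment."""
    return (
        "terraform {\n"
        '  backend "gcs" {\n'
        f'    bucket = "{bucket}"\n'
        f'    prefix = "{backend_prefix(environment)}"\n'
        "  }\n"
        "}\n"
    )


def prefixes_are_unique(environments: list[str]) -> bool:
    """True when no two environments would share a state prefix."""
    prefixes = [backend_prefix(e) for e in environments]
    return len(prefixes) == len(set(prefixes))
```

Because each environment writes under its own prefix, a `terraform apply` in one environment cannot overwrite the state of another, even though all state lives in one bucket.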
Variable Management
Shared Variables (variables.tf)
Variables defined in variables.tf are identical across environments:
Environment-Specific Values (terraform.tfvars)
Only terraform.tfvars differs across environments:
| Variable | Dev | Staging | Prod |
|---|---|---|---|
| project_id | mcp-dev | mcp-staging | mcp-prod |
| cluster_name | mcp-dev-gke | mcp-staging-gke | production-mcp-server-langgraph-gke |
| regional_cluster | false | true | true |
| cloudsql_tier | db-custom-1-3840 | db-custom-2-7680 | db-custom-4-15360 |
| redis_tier | BASIC | STANDARD_HA | STANDARD_HA |
| enable_binary_authorization | false | true (audit) | true (enforcing) |
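Since only the values should differ between environments, one cheap safeguard is to confirm that every environment sets the same variable names. The sketch below mirrors the table above as plain dictionaries and checks for missing keys; it is an illustrative check, not part of the project. The audit versus enforcing distinction for Binary Authorization is not modelled here, so both appear simply as `True`.

```python
# Values copied from the table above. The Binary Authorization mode
# (audit vs. enforcing) is not represented; both are simply True.
ENVIRONMENTS = {
    "dev": {
        "project_id": "mcp-dev",
        "cluster_name": "mcp-dev-gke",
        "regional_cluster": False,
        "cloudsql_tier": "db-custom-1-3840",
        "redis_tier": "BASIC",
        "enable_binary_authorization": False,
    },
    "staging": {
        "project_id": "mcp-staging",
        "cluster_name": "mcp-staging-gke",
        "regional_cluster": True,
        "cloudsql_tier": "db-custom-2-7680",
        "redis_tier": "STANDARD_HA",
        "enable_binary_authorization": True,
    },
    "prod": {
        "project_id": "mcp-prod",
        "cluster_name": "production-mcp-server-langgraph-gke",
        "regional_cluster": True,
        "cloudsql_tier": "db-custom-4-15360",
        "redis_tier": "STANDARD_HA",
        "enable_binary_authorization": True,
    },
}


def missing_variables(envs: dict[str, dict]) -> dict[str, set[str]]:
    """Map each environment to the variable names it fails to set."""
    all_keys: set[str] = set()
    for values in envs.values():
        all_keys |= set(values)
    gaps = {name: all_keys - set(values) for name, values in envs.items()}
    return {name: missing for name, missing in gaps.items() if missing}


def differing_variables(envs: dict[str, dict]) -> list[str]:
    """Sorted names of variables whose value is not the same everywhere."""
    all_keys: set[str] = set()
    for values in envs.values():
        all_keys |= set(values)
    return sorted(
        key for key in all_keys
        if len({values.get(key) for values in envs.values()}) > 1
    )
```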
Testing Strategy
1. Unit Tests (Terraform Validate)
2. Integration Tests (Terraform Plan)
- 0 = No changes
- 1 = Error
- 2 = Changes present (expected)
3. Compliance Scans
- CIS GKE Benchmark compliance
- Terraform security (Trivy, tfsec, Checkov)
- Secrets in code (Gitleaks)
4. Load Testing (Staging)
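Returning to step 2: the exit codes listed there match what `terraform plan -detailed-exitcode` returns, where 2 means the plan succeeded and found changes. A CI job therefore has to treat 2 as a pass rather than a failure. The following is a minimal sketch of such a gate, assuming `terraform` is on the PATH and the directory is already initialized; the function and enum names are illustrative.

```python
import subprocess
from enum import Enum


class PlanResult(Enum):
    NO_CHANGES = 0
    ERROR = 1
    CHANGES_PRESENT = 2


def classify_plan_exit(code: int) -> PlanResult:
    """Translate a `terraform plan -detailed-exitcode` exit status."""
    try:
        return PlanResult(code)
    except ValueError:
        # Any other status (e.g. killed by a signal) is treated as an error.
        return PlanResult.ERROR


def run_plan(env_dir: str) -> PlanResult:
    """Run the plan in `env_dir`; requires terraform on PATH and a prior init."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=env_dir,
    )
    return classify_plan_exit(proc.returncode)


def plan_gate_passes(result: PlanResult) -> bool:
    """A plan that ran cleanly passes, whether or not it found changes."""
    return result is not PlanResult.ERROR
```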
Cost Optimization by Environment
Development
Auto-shutdown dev cluster after hours
Staging
Right-size based on actual load test results