Infrastructure Deployment Checklist
Overview
This checklist ensures reproducible and idempotent infrastructure deployments following the standardized naming convention: {environment}-mcp-server-langgraph-{resource-type}.
Last Updated: 2025-11-05
Status: ✅ Validated with staging rebuild
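To make the naming convention from the overview concrete, here is a minimal Python sketch that checks a name against the {environment}-mcp-server-langgraph-{resource-type} pattern. The function name and the character rules (a lowercase environment token without hyphens, a lowercase resource type that may contain hyphens) are assumptions for illustration, not the project's actual validation logic.

```python
import re

# Pattern for {environment}-mcp-server-langgraph-{resource-type}.
# Assumption: environment is one lowercase token; resource-type may contain hyphens.
NAME_PATTERN = re.compile(
    r"^(?P<environment>[a-z][a-z0-9]*)-mcp-server-langgraph-"
    r"(?P<resource_type>[a-z][a-z0-9]*(?:-[a-z0-9]+)*)$"
)


def parse_resource_name(name: str):
    """Return (environment, resource_type) if the name follows the convention, else None."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return match.group("environment"), match.group("resource_type")
```

For example, a hypothetical name such as staging-mcp-server-langgraph-gke would parse to ("staging", "gke"), while a name without an environment prefix would be rejected.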
Pre-Deployment Validation
1. Naming Convention Compliance
Run this validation before any infrastructure changes:
2. GitHub Variables Check
2. GKE Autopilot Compliance (for GKE deployments)
- Run automated validation
- CPU limit/request ratio ≤ 4.0
  - Formula: cpu_limit / cpu_request
  - Example valid: 1000m / 250m = 4.0 ✅
  - Example invalid: 1000m / 200m = 5.0 ❌
- Memory limit/request ratio ≤ 4.0
  - Formula: memory_limit / memory_request
- CPU/Memory within allowed ranges
  - CPU: min 50m, max 4 cores
  - Memory: min 64Mi, max 8Gi
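The ratio and range rules above can be sketched in Python as follows. This is an illustrative check, not the contents of the project's validation script. The function names are hypothetical, the quantity parsing covers only the common suffixes, and applying the allowed ranges to both requests and limits is an assumption.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities.
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

MAX_RATIO = 4.0
CPU_RANGE = (50, 4000)                  # millicores: 50m to 4 cores
MEMORY_RANGE = (64 * 2**20, 8 * 2**30)  # bytes: 64Mi to 8Gi


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '1.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '64Mi' or '8Gi' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return round(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: plain byte count


def check_resources(resources: dict) -> list:
    """Return violation messages for one container's `resources` block."""
    problems = []
    rules = {"cpu": (cpu_millicores, CPU_RANGE), "memory": (memory_bytes, MEMORY_RANGE)}
    for name, (convert, (low, high)) in rules.items():
        request = convert(resources["requests"][name])
        limit = convert(resources["limits"][name])
        if request <= 0:
            problems.append(f"{name} request must be positive")
            continue
        ratio = limit / request
        if ratio > MAX_RATIO:
            problems.append(f"{name} limit/request ratio {ratio:.1f} exceeds {MAX_RATIO}")
        # Assumption: the allowed range applies to both the request and the limit.
        for label, value in (("request", request), ("limit", limit)):
            if not low <= value <= high:
                problems.append(f"{name} {label} is outside the allowed range")
    return problems
```

With the checklist's own examples, a 250m request with a 1000m limit passes, while a 200m request with a 1000m limit is reported because the ratio is 5.0.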
3. Security Configuration
- If readOnlyRootFilesystem = true
  - /tmp is mounted as writable volume
  - All application writable paths have volume mounts
  - Tested in development environment first
- No privileged containers (unless required)
- Non-root user specified
- Capabilities dropped
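The security items above lend themselves to a static check on a parsed container spec. The sketch below is one possible approach under stated assumptions: it inspects only the container-level securityContext (pod-level settings are ignored), it reads "non-root user specified" as runAsNonRoot being set, and it reads "capabilities dropped" as drop containing ALL. The function name is hypothetical.

```python
def check_container_security(container: dict) -> list:
    """Return security findings for one container spec (a parsed manifest dict)."""
    findings = []
    context = container.get("securityContext", {})

    if context.get("readOnlyRootFilesystem"):
        # A read-only root filesystem needs a writable volume mounted at /tmp.
        writable = {
            mount["mountPath"]
            for mount in container.get("volumeMounts", [])
            if not mount.get("readOnly", False)
        }
        if "/tmp" not in writable:
            findings.append("readOnlyRootFilesystem is true but /tmp is not a writable mount")

    if context.get("privileged"):
        findings.append("container is privileged")

    # Assumption: a non-root user is expressed through runAsNonRoot.
    if not context.get("runAsNonRoot"):
        findings.append("runAsNonRoot is not set")

    # Assumption: "capabilities dropped" means drop: [ALL].
    dropped = context.get("capabilities", {}).get("drop", [])
    if "ALL" not in dropped:
        findings.append("capabilities are not dropped (expected drop: [ALL])")

    return findings
```

A static check like this cannot discover which paths the application writes to at runtime, so testing in the development environment remains necessary.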
4. Resource Specifications
- All containers have resource requests and limits
- Resource requests are realistic
  - Based on actual usage patterns
  - Not over-provisioned
  - Not under-provisioned
- Ephemeral storage specified if needed
5. Environment Variables
- No env vars with both value and valueFrom
- All valueFrom sources have exactly one key
  - configMapKeyRef OR secretKeyRef OR fieldRef (not multiple)
- All referenced secrets exist
- All referenced configmaps exist
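The first two environment variable rules can be checked statically. The following is a minimal sketch with a hypothetical function name; it works on the env list of a parsed container spec and does not verify that the referenced secrets or configmaps exist, since that requires cluster access.

```python
def check_env(env: list) -> list:
    """Return problems found in a container's `env` list (parsed manifest data)."""
    problems = []
    for entry in env:
        name = entry.get("name", "<unnamed>")
        if "value" in entry and "valueFrom" in entry:
            problems.append(f"{name}: sets both value and valueFrom")
        if "valueFrom" in entry:
            # Exactly one source key (for example secretKeyRef) is expected.
            source_count = len(entry["valueFrom"] or {})
            if source_count != 1:
                problems.append(
                    f"{name}: valueFrom must have exactly one source, found {source_count}"
                )
    return problems
```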
6. Health Probes
- Liveness probe configured
- Readiness probe configured
- Probe endpoints exist and are accessible
  - Test endpoint returns 200 OK when healthy
  - Endpoint doesn’t require authentication
- initialDelaySeconds is sufficient for app startup
- Timeout and period values are appropriate
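Part of the probe checklist can be verified from the manifest alone. The sketch below assumes a hypothetical expected_startup_seconds value supplied by whoever knows the application's startup time, and it applies two heuristics that are assumptions rather than rules from this checklist: the liveness delay is only compared when no startupProbe is defined, and a timeout longer than the period is flagged. Whether the endpoint actually returns 200 OK still has to be tested against a running pod.

```python
def check_probes(container: dict, expected_startup_seconds: int) -> list:
    """Return probe findings for one container spec (a parsed manifest dict)."""
    problems = []
    for kind in ("livenessProbe", "readinessProbe"):
        probe = container.get(kind)
        if probe is None:
            problems.append(f"{kind} is not configured")
            continue
        # Kubernetes defaults: timeoutSeconds=1, periodSeconds=10.
        timeout = probe.get("timeoutSeconds", 1)
        period = probe.get("periodSeconds", 10)
        if timeout > period:
            problems.append(f"{kind}: timeoutSeconds ({timeout}) exceeds periodSeconds ({period})")

    liveness = container.get("livenessProbe")
    # Assumption: a startupProbe, if present, already covers slow startup.
    if liveness is not None and "startupProbe" not in container:
        delay = liveness.get("initialDelaySeconds", 0)
        if delay < expected_startup_seconds:
            problems.append(
                f"livenessProbe: initialDelaySeconds ({delay}) is below the "
                f"expected startup time ({expected_startup_seconds}s)"
            )
    return problems
```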
7. Volumes and Storage
- All required volumes are defined
  - All volume mounts reference existing volumes
- PersistentVolumeClaims exist and are bound
- Storage class is available
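The rule that every volume mount references a defined volume can be checked on a parsed pod spec. This is a small sketch with a hypothetical function name; PersistentVolumeClaim binding and storage class availability are cluster state and are outside its scope.

```python
def undefined_volume_mounts(pod_spec: dict) -> list:
    """Return (container name, volume name) pairs whose volume is not defined in the pod."""
    defined = {volume["name"] for volume in pod_spec.get("volumes", [])}
    missing = []
    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    for container in containers:
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in defined:
                missing.append((container["name"], mount["name"]))
    return missing
```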
8. Networking
- Service exists for deployment (if needed)
  - Service ports match container ports
- Network policies allow required traffic
- Ingress configured correctly (if needed)
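Matching Service ports against container ports can be sketched as below. The function name is hypothetical, and the check assumes that containers declare the ports they listen on. Kubernetes does not require that declaration, so a reported mismatch is a prompt to look closer rather than proof of a broken Service.

```python
def unmatched_service_ports(service: dict, pod_spec: dict) -> list:
    """Return Service target ports that no container in the pod spec declares."""
    numbers, names = set(), set()
    for container in pod_spec.get("containers", []):
        for port in container.get("ports", []):
            numbers.add(port["containerPort"])
            if "name" in port:
                names.add(port["name"])

    unmatched = []
    for port in service["spec"]["ports"]:
        # targetPort defaults to the value of port when omitted.
        target = port.get("targetPort", port["port"])
        # A string targetPort refers to a named container port.
        known = names if isinstance(target, str) else numbers
        if target not in known:
            unmatched.append(target)
    return unmatched
```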
9. IAM and Permissions
- ServiceAccount exists
- Workload Identity configured (for GCP)
- GCP service account has required IAM roles
- Workload Identity binding configured
10. Image Configuration
- Image tag is immutable (not latest)
  - Use specific version tags or commit SHAs
- Image exists in registry
  - Image pull secrets configured (if private registry)
- imagePullPolicy set appropriately
  - Always for staging/production
  - IfNotPresent for development
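The image tag and pull policy rules above can be expressed as a short static check. In this sketch the function name and the environment strings ("staging", "production", "development") are assumptions taken from the wording of the checklist, and an unset imagePullPolicy is treated as a finding so that the policy is always explicit.

```python
# Pull policy expected per environment, following the checklist above.
EXPECTED_PULL_POLICY = {
    "staging": "Always",
    "production": "Always",
    "development": "IfNotPresent",
}


def check_image(container: dict, environment: str) -> list:
    """Return image-related findings for one container spec."""
    problems = []
    image = container["image"]

    if "@sha256:" not in image:  # digest-pinned images are immutable
        # Look for the tag only in the last path segment, so that a
        # registry port such as localhost:5000 is not mistaken for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        if ":" not in last_segment:
            problems.append("image has no tag and resolves to latest")
        elif last_segment.rsplit(":", 1)[1] == "latest":
            problems.append("image uses the latest tag")

    expected = EXPECTED_PULL_POLICY.get(environment)
    policy = container.get("imagePullPolicy")
    if expected is not None and policy != expected:
        problems.append(f"imagePullPolicy is {policy}, expected {expected} for {environment}")
    return problems
```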
Post-Deployment Validation
1. Immediate Checks (0-5 minutes)
- All pods are Running
- No restarts occurring
- Check logs for startup errors
- Health checks passing
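The first, second, and fourth immediate checks can be automated by inspecting the JSON that kubectl get pods -o json prints. The following sketch uses a hypothetical function name and takes the already-parsed JSON as input; reading it from standard input or a file is left to the caller. Pods that belong to completed Jobs report the phase Succeeded and would be flagged here, which may or may not be what you want.

```python
def unhealthy_pods(pod_list: dict) -> dict:
    """Map pod name to a list of issues, given parsed `kubectl get pods -o json` output."""
    issues = {}
    for pod in pod_list.get("items", []):
        found = []
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase != "Running":
            found.append(f"phase is {phase}")
        for container in status.get("containerStatuses", []):
            restarts = container.get("restartCount", 0)
            if restarts > 0:
                found.append(f"{container['name']} restarted {restarts} time(s)")
            # ready reflects the readiness probe result for the container.
            if not container.get("ready", False):
                found.append(f"{container['name']} is not ready")
        if found:
            issues[pod["metadata"]["name"]] = found
    return issues
```

An empty result means every listed pod is Running, ready, and has not restarted.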
2. Short-term Monitoring (5-30 minutes)
- Pods remain stable (no CrashLoopBackOff)
- Memory/CPU usage is reasonable
- No errors in application logs
- Metrics are being collected
- External dependencies are reachable
  - Database connections working
  - External API calls succeeding
3. Functional Testing
- Application endpoints are accessible
- Core functionality works
  - Test critical user flows
  - Verify data persistence
  - Check integrations
- Performance is acceptable
  - Response times normal
  - No unusual latency
4. Long-term Monitoring (24-48 hours)
- Set up alerts for:
  - Pod restarts
  - OOM kills
  - Probe failures
  - Error rate increases
- Monitor resource usage trends
- Check for memory leaks
- Verify log aggregation working
Rollback Procedure
If issues are detected post-deployment:
1. Quick Rollback
2. Rollback to Specific Revision
3. Rollback with Kustomize
Tools and Scripts
Available Tools:
- scripts/validate_gke_autopilot_compliance.py - Validation script
- tests/regression/test_pod_deployment_regression.py - Regression tests
- .githooks/pre-commit - Pre-commit validation hook
- .github/workflows/validate-k8s-configs.yml - CI/CD pipeline
Usage Examples:
Common Mistakes to Avoid
- ❌ Deploying without validation
  - Always run the validation script first
- ❌ Enabling readOnlyRootFilesystem without testing
  - Test in development first
  - Ensure all writable paths are mounted
- ❌ Not checking CPU/memory ratios on GKE Autopilot
  - Ratio must be ≤ 4.0
- ❌ Using latest image tag
  - Use specific versions or commit SHAs
- ❌ Not testing rollback procedure
  - Verify you can roll back before deploying
- ❌ Insufficient health probe delays
  - Account for slow application startup
- ❌ Missing IAM permissions
  - Verify service account permissions before deploying
- ❌ Not monitoring after deployment
  - Watch pods for at least 10 minutes
- ❌ Forgetting to delete old ReplicaSets
  - Clean up after successful deployment
- ❌ Not documenting changes
  - Always document what was changed and why
Emergency Contacts
- Platform Team: @platform-team
- On-Call Engineer: Check PagerDuty
- Documentation: See Troubleshooting Guide
Related Documentation
- Troubleshooting Guide - Deployment Problems
- GKE Autopilot Best Practices
- ADR-0040: GCP GKE Autopilot Deployment
- ADR-0054: Pod Failure Prevention Framework
Last Validated: 2025-11-12 (Staging rebuild from scratch succeeded)