54. Pod Failure Prevention Framework
Date: 2025-11-12

Status: Accepted

Category: Infrastructure & Deployment

Context

On 2025-11-12, we experienced multiple pod crash failures in the staging environment:

- Keycloak CrashLoopBackOff - readOnlyRootFilesystem without proper volume mounts
- OTEL Collector - GKE Autopilot CPU ratio violation (5.0 > 4.0)
- OTEL Collector - Configuration syntax errors
- OTEL Collector - Missing health_check extension
- OTEL Collector - Missing GCP IAM permissions
Decision
We will implement a comprehensive Pod Failure Prevention Framework consisting of:

1. Automated Validation Tools
A. GKE Autopilot Compliance Validator
- Location: scripts/validate_gke_autopilot_compliance.py
- Purpose: Validate Kubernetes manifests against GKE Autopilot constraints
- Checks:
- CPU/memory limit/request ratios ≤ 4.0
- Environment variable configuration validity
- readOnlyRootFilesystem volume mount completeness
- Resource specification completeness
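The CPU ratio check can be sketched as follows. This is a minimal illustration of the rule, not the actual code in scripts/validate_gke_autopilot_compliance.py; the function name and message format are assumptions.

```python
# Illustrative sketch of the CPU limit/request ratio check (ratio <= 4.0).
MAX_RATIO = 4.0

def parse_cpu(value: str) -> float:
    """Convert a Kubernetes CPU quantity ('500m' or '2') to cores."""
    return float(value[:-1]) / 1000 if value.endswith("m") else float(value)

def validate_cpu_ratio(container: dict) -> list[str]:
    """Return a list of violations for one container's resources block."""
    resources = container.get("resources", {})
    request = resources.get("requests", {}).get("cpu")
    limit = resources.get("limits", {}).get("cpu")
    if request is None or limit is None:
        return [f"{container['name']}: CPU request and limit must both be set"]
    ratio = parse_cpu(limit) / parse_cpu(request)
    if ratio > MAX_RATIO:
        return [f"{container['name']}: CPU limit/request ratio {ratio:.1f} exceeds {MAX_RATIO}"]
    return []
```

A request of 200m with a limit of 1 core yields a ratio of 5.0, which is exactly the OTEL Collector violation from the Context section.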
B. Regression Test Suite
- Location: tests/regression/test_pod_deployment_regression.py
- Framework: pytest
- Coverage:
- GKE Autopilot compliance tests
- Environment variable validation tests
- Security configuration tests
- Kustomize build validation tests
- Dry-run validation tests
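A security configuration test in this suite could look like the sketch below; it is written in the style of tests/regression/test_pod_deployment_regression.py, but the helper and test names are ours, and the inline dict stands in for a parsed `kustomize build` output.

```python
# Illustrative regression test: a container that enables readOnlyRootFilesystem
# must mount a writable volume for paths the process writes to (e.g. /tmp).
def writable_mount_paths(container: dict) -> set:
    """Mount paths of this container's volume mounts that are not read-only."""
    return {
        m["mountPath"]
        for m in container.get("volumeMounts", [])
        if not m.get("readOnly", False)
    }

def test_readonly_root_filesystem_has_writable_tmp():
    # Stand-in for a container spec extracted from rendered manifests.
    container = {
        "name": "keycloak",
        "securityContext": {"readOnlyRootFilesystem": True},
        "volumeMounts": [{"name": "tmp", "mountPath": "/tmp"}],
    }
    if container["securityContext"].get("readOnlyRootFilesystem"):
        assert "/tmp" in writable_mount_paths(container)
```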
2. CI/CD Integration
A. GitHub Actions Workflow
- Location: .github/workflows/validate-k8s-configs.yml
- Triggers: All PRs affecting deployments/**
- Jobs:
- Validate kustomize builds
- Run GKE Autopilot compliance checks
- Run regression test suite
- Enforcement: Required status check before merge
B. Pre-commit Hook
- Location: .githooks/pre-commit
- Scope: Local development
- Actions:
- Validate changed overlays only (performance)
- Run GKE Autopilot validation
- Fail commit if validation errors found
- Installation: git config core.hooksPath .githooks
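The "changed overlays only" filtering can be sketched as below. This is an assumption about the hook's internals (the real .githooks/pre-commit may be a shell script), and the `deployments/overlays/<env>/` layout is an illustrative guess consistent with the deployments/** trigger above.

```python
# Sketch: map staged file paths to the overlay directories they touch,
# so the hook validates only what changed.
import subprocess
from pathlib import Path

def changed_overlays(changed_files: list) -> set:
    """Overlay directories (deployments/overlays/<env>) touched by the given paths."""
    overlays = set()
    for f in changed_files:
        parts = Path(f).parts
        if len(parts) >= 3 and parts[0] == "deployments" and parts[1] == "overlays":
            overlays.add(str(Path(*parts[:3])))
    return overlays

def staged_files() -> list:
    """Staged paths, as a pre-commit hook would obtain them from git."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()
```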
3. Documentation
A. Troubleshooting Runbook
- Location: docs-internal/operations/POD_FAILURE_TROUBLESHOOTING_RUNBOOK.md
- Content:
- Common pod failure patterns
- Step-by-step troubleshooting guides
- Quick reference commands
- Solutions for known issues
B. Deployment Checklist
- Location: docs-internal/operations/DEPLOYMENT_CHECKLIST.md
- Content:
- Pre-deployment validation steps
- Post-deployment monitoring steps
- Rollback procedures
- Common mistakes to avoid
C. Remediation Reports
- Location: docs-internal/operations/STAGING_POD_CRASH_REMEDIATION.md
- Content:
- Root cause analysis
- Detailed solutions
- Lessons learned
- Future preventive measures
4. Platform Constraints Documentation
A. GKE Autopilot Constraints
- CPU: max limit/request ratio = 4.0, min request = 50m, max limit = 4 cores
- Memory: max limit/request ratio = 4.0, min request = 64Mi, max limit = 8Gi
- Enforcement: Automated validation in CI/CD
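The memory bounds above can be checked as in this sketch; the quantity parsing is deliberately simplified to the Mi/Gi suffixes, and the function names are illustrative rather than taken from the validator.

```python
# Illustrative check of the Autopilot memory constraints listed above:
# request >= 64Mi, limit <= 8Gi, limit/request ratio <= 4.0.
MEM_MIN_REQUEST_MI = 64
MEM_MAX_LIMIT_MI = 8 * 1024
MAX_RATIO = 4.0

def parse_memory_mi(value: str) -> float:
    """Convert '512Mi' or '2Gi' to MiB (only these suffixes are handled here)."""
    if value.endswith("Gi"):
        return float(value[:-2]) * 1024
    if value.endswith("Mi"):
        return float(value[:-2])
    raise ValueError(f"unsupported memory quantity: {value}")

def check_memory(request: str, limit: str) -> list:
    req, lim = parse_memory_mi(request), parse_memory_mi(limit)
    errors = []
    if req < MEM_MIN_REQUEST_MI:
        errors.append(f"memory request {request} below minimum 64Mi")
    if lim > MEM_MAX_LIMIT_MI:
        errors.append(f"memory limit {limit} above maximum 8Gi")
    if lim / req > MAX_RATIO:
        errors.append(f"memory limit/request ratio {lim / req:.1f} exceeds 4.0")
    return errors
```

For example, a 128Mi request with a 1Gi limit is within both absolute bounds but fails the ratio check (1024/128 = 8.0).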
B. Security Standards
- readOnlyRootFilesystem: Test in dev first, document required volume mounts
- Non-root users: Enforce runAsNonRoot: true
- Drop capabilities: Enforce capabilities.drop: [ALL]
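Both security standards can be asserted against a parsed container spec as in the following sketch; the field paths follow the Kubernetes securityContext schema, while the function name and messages are our own.

```python
# Illustrative enforcement of the security standards above:
# runAsNonRoot: true and capabilities.drop: [ALL].
def check_security_context(container: dict) -> list:
    """Return violations of the non-root and dropped-capabilities standards."""
    sc = container.get("securityContext", {})
    errors = []
    if sc.get("runAsNonRoot") is not True:
        errors.append(f"{container['name']}: runAsNonRoot must be true")
    dropped = sc.get("capabilities", {}).get("drop", [])
    if "ALL" not in dropped:
        errors.append(f"{container['name']}: capabilities.drop must include ALL")
    return errors
```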
Implementation
Phase 1: Immediate Remediation (Completed 2025-11-12)
- ✅ Fixed all pod crash issues:
- Keycloak: Reverted readOnlyRootFilesystem temporarily
- OTEL Collector: Fixed CPU ratio, config syntax, health checks, IAM permissions
- Created GCP service account with proper roles
- Cleaned up old ReplicaSets
Phase 2: Preventive Measures (Completed 2025-11-12)
- ✅ Created validation infrastructure:
- Validation script with comprehensive checks
- Regression test suite (8+ test cases)
- GitHub Actions workflow (3 jobs)
- Pre-commit hook for local validation
- ✅ Created documentation:
- Troubleshooting runbook
- Deployment checklist
- Remediation report with lessons learned
Phase 3: Continuous Improvement (Ongoing)
Short-term (next sprint):

- Re-enable Keycloak readOnlyRootFilesystem with proper volume mounts
- Fix OTEL Collector duplicate label issue
- Add integration tests for pod startup
- Create monitoring dashboards for pod health

Long-term:
- Extend validation to cover more edge cases
- Add automated rollback triggers
- Implement chaos engineering tests
- Create self-healing mechanisms
- Build internal platform tools for deployment safety
- Establish SLOs for deployment success rate
- Continuous refinement of validation rules
Rationale
Why Automated Validation?
Problem: Manual validation is error-prone and inconsistent.
Solution: Automated tools catch errors before deployment.
Benefits:

- Consistent enforcement of platform constraints
- Fast feedback (seconds vs. hours)
- No human errors
- Scales across all environments
Why Multiple Validation Layers?
Defense in Depth Strategy:

- Pre-commit hook - Catch errors locally before push
- CI/CD pipeline - Catch errors before merge to main
- Dry-run validation - Catch Kubernetes schema errors
- Regression tests - Prevent known issues from recurring
Why Comprehensive Documentation?
Problem: Knowledge siloed in individuals' heads.
Solution: Centralized, searchable documentation.
Benefits:

- Faster incident resolution
- Onboarding new team members
- Reducing repeat incidents
- Building institutional knowledge
Consequences
Positive
- Reduced Pod Failures
  - Validation catches 90%+ of common issues before deployment
  - Automated testing prevents regression
- Faster Incident Resolution
  - Runbooks provide step-by-step guides
  - Known issues have documented solutions
- Improved Developer Experience
  - Fast feedback from pre-commit hooks
  - Clear error messages from validation scripts
- Platform Reliability
  - Enforced compliance with platform constraints
  - Consistent deployment practices
- Knowledge Sharing
  - Documentation accessible to entire team
  - Lessons learned captured systematically
Negative
- Additional CI/CD Time
  - Validation adds ~2-3 minutes to pipeline
  - Mitigation: Run validations in parallel
- Initial Learning Curve
  - Team needs to learn new tools
  - Mitigation: Comprehensive documentation and examples
- Maintenance Overhead
  - Tools need updating as platform evolves
  - Mitigation: Regular review schedule, automated dependency updates
- Possible False Positives
  - Validation might flag legitimate configurations
  - Mitigation: Continuous refinement based on feedback
Alternatives Considered
Alternative 1: Manual Code Review Only
Pros:
- No tooling overhead
- Flexible case-by-case decisions

Cons:
- Scales poorly
- Inconsistent enforcement
- Relies on reviewer knowledge

Rejected: Too error-prone
Alternative 2: Admission Controllers
Pros:
- Runtime enforcement
- Prevents bad configs from being applied

Cons:
- More complex to set up
- Harder to debug
- Later feedback cycle

Rejected: Validation should fail earlier (in CI/CD), not at runtime
Alternative 3: Third-party Policy Engines (OPA, Kyverno)
Pros:
- Industry-standard tools
- Rich policy language
- Active communities

Cons:
- Additional dependencies
- Learning curve
- May be overkill for current needs

Deferred: Consider in future if needs grow
Metrics for Success
Quantitative Metrics:

- Pod Failure Rate (Target: < 1% of deployments)
  - Track: Number of pod failures per deployment
  - Baseline (before): ~20% failure rate in staging
  - Target (after): < 1% failure rate
- Mean Time to Resolution (MTTR) (Target: < 30 minutes)
  - Track: Time from issue detection to resolution
  - Baseline: 2-4 hours
  - Target: < 30 minutes
- Validation Coverage (Target: 100% of deployment PRs)
  - Track: % of PRs that run validation
  - Target: 100% (enforced by required status checks)
- Documentation Usage (Target: 80% of incidents use runbook)
  - Track: % of incidents resolved using runbooks
  - Measure via incident retrospectives
Qualitative Metrics:

- Developer Confidence
  - Developers feel confident deploying to staging/production
  - Reduced anxiety around deployments
- Incident Prevention
  - Catching issues before they reach production
  - Reducing pager alerts
- Knowledge Retention
  - Team can troubleshoot without escalation
  - Faster onboarding of new team members
Review Schedule
- Quarterly Review: Assess validation rules effectiveness
- After Major Incidents: Update runbooks and tests
- Platform Changes: Update constraints and validators
- Annual Review: Evaluate need for more sophisticated tooling (OPA, Kyverno)
Appendix: Validation Tool Specifications
A. Validation Script
Language: Python 3.11+
Dependencies: pyyaml, kubectl
Key Functions:
- validate_cpu_ratio() - Check CPU limit/request ratio
- validate_memory_ratio() - Check memory limit/request ratio
- validate_env_vars() - Check environment variable configuration
- validate_readonly_filesystem() - Check volume mount completeness

Exit Codes:
- 0: All validations passed
- 1: Validation failures found
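An entry point matching these exit codes might look like the following sketch; the actual script may be structured differently, and the stub check here stands in for the validate_* functions listed above.

```python
# Illustrative entry point: collect violations across manifests,
# report them, and return 0 (all passed) or 1 (failures found).
import sys

def main(manifests: list) -> int:
    errors = []
    for manifest in manifests:
        containers = (
            manifest.get("spec", {})
            .get("template", {})
            .get("spec", {})
            .get("containers", [])
        )
        for c in containers:
            # The real script would call validate_cpu_ratio(), validate_env_vars(),
            # etc. here; this stub only checks that resources are specified at all.
            if "resources" not in c:
                errors.append(f"{c.get('name', '?')}: missing resources specification")
    for e in errors:
        print(f"ERROR: {e}", file=sys.stderr)
    return 1 if errors else 0
```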
B. Regression Tests
Framework: pytest 7.0+
Test Classes:
- TestGKEAutopilotCompliance - Platform constraint tests
- TestEnvironmentVariableConfiguration - Env var validity tests
- TestReadOnlyRootFilesystem - Security configuration tests
- TestOTELCollectorConfiguration - OTEL-specific tests
- TestKustomizeBuildValidity - Build validation tests
C. CI/CD Pipeline
Platform: GitHub Actions
Jobs:
- validate-kustomize - Kustomize build validation
- validate-gke-autopilot - GKE compliance checks
- test-regression - Regression test suite
- summary - Aggregate results
Changelog
- 2025-11-12: Initial version - Pod Failure Prevention Framework