EKS and AKS Deployment Guide - Lessons from GKE
This document provides comprehensive guidance for deploying to AWS EKS and Azure AKS, incorporating all lessons learned from GKE deployment troubleshooting to prevent similar issues.
Table of Contents
- Overview
- Critical Lessons from GKE
- EKS Deployment Guide
- AKS Deployment Guide
- Common Kubernetes Issues
- Prevention Checklist
Overview
This guide ensures that the 11 issues encountered during GKE deployment do not occur on EKS or AKS deployments.
GKE Issues Resolved
- ✅ Pre-commit Python version mismatch (3.11 vs 3.12)
- ✅ Docker build disk space exhaustion
- ✅ External Secrets Operator CRD API version mismatch (v1beta1 vs v1)
- ✅ RBAC permission denied errors (unused Role/RoleBinding resources)
- ✅ GKE Autopilot CPU constraints (pod anti-affinity requires 500m minimum)
- ✅ Environment variable value/valueFrom conflict (Kustomize merge issue)
- ✅ Client-side vs server-side kubectl validation (CRDs not recognized)
- ✅ Namespace creation ordering (must exist before resources)
- ✅ ConfigMap generator behavior (create vs merge)
- ✅ External Secrets Operator installation permissions
- ✅ Unused RBAC resources requiring elevated IAM permissions
Critical Lessons from GKE
1. kubectl Validation: Always Use Server-Side Dry-Run
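- ❌ WRONG (client-side validation does not see CRDs): kubectl apply --dry-run=client
- ✅ CORRECT (server-side validation checks against the actual cluster): kubectl apply --dry-run=server
Server-side validation is required for:
- Custom Resource Definitions (CRDs)
- External Secrets Operator resources
- Any custom operators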
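As a minimal sketch of the server-side check, the helper below wraps the dry-run in a function. The function name validate_overlay and the overlay path in the usage note are illustrative assumptions; the kubectl flags are the ones named in this guide.

```shell
# validate_overlay OVERLAY_DIR
# Asks the API server to validate a Kustomize overlay without persisting it.
# Because the cluster performs the validation, CRD-backed kinds are checked
# against the CRDs that are actually installed.
validate_overlay() {
  overlay_dir="$1"
  kubectl apply --dry-run=server -k "$overlay_dir"
}
```

A workflow step could call it as validate_overlay deployments/overlays/staging-eks and fail the job on a non-zero exit status.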
2. External Secrets Operator: Check API Version
- ❌ WRONG (Assumes v1beta1):
- ✅ CORRECT (Verify installed version first):
- Always check installed CRD version: kubectl api-resources | grep <resource-type>
- Match manifest API version to cluster API version
- ESO v0.9+ uses v1 (not v1beta1)
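One way to verify the version before applying manifests is to read the served versions straight from the ExternalSecret CRD. This is a sketch, assuming ESO is installed under its standard CRD name externalsecrets.external-secrets.io; the function names are hypothetical.

```shell
# eso_served_versions
# Prints the API versions the cluster currently serves for ExternalSecret,
# separated by spaces (for example "v1beta1 v1").
eso_served_versions() {
  kubectl get crd externalsecrets.external-secrets.io \
    -o jsonpath='{.spec.versions[?(@.served==true)].name}'
}

# manifest_matches_cluster VERSION
# Succeeds only if VERSION (for example v1 or v1beta1) is served by the cluster.
manifest_matches_cluster() {
  wanted="$1"
  for served in $(eso_served_versions); do
    if [ "$served" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, manifest_matches_cluster v1 || echo "update the apiVersion in the manifests" flags a mismatch before deployment.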
3. Environment Variables: Use ConfigMap Patches, Not Deployment Values
- ❌ WRONG (Creates value/valueFrom conflict):
- ✅ CORRECT (Override ConfigMap data):
Why: when the base uses valueFrom: configMapKeyRef and the overlay adds value:, Kustomize merges both, producing an invalid Kubernetes manifest (an env entry cannot have both value and valueFrom).
Applies to: ✅ EKS, ✅ AKS, ✅ All Kustomize-based deployments
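The ConfigMap-patch approach can be sketched as a small script that writes the overlay files. Everything named here is an assumption for illustration: the function name, the ConfigMap name app-config, the key LOG_LEVEL, and the relative path to the base.

```shell
# write_env_override OVERLAY_DIR
# Writes a ConfigMap patch plus a kustomization.yaml that applies it.
# The Deployment is not patched, so its existing valueFrom reference is
# never merged with a conflicting value field.
write_env_override() {
  overlay_dir="$1"
  mkdir -p "$overlay_dir"

  # Patch only the ConfigMap data; the name must match the base ConfigMap.
  cat > "$overlay_dir/configmap-patch.yaml" <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config        # hypothetical base ConfigMap name
data:
  LOG_LEVEL: "debug"      # hypothetical key read through configMapKeyRef
EOF

  cat > "$overlay_dir/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base            # assumed location of the base
patches:
  - path: configmap-patch.yaml
EOF
}
```

The container keeps reading the variable through the ConfigMap, and only the ConfigMap's data differs per overlay.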
4. RBAC Resources: Remove Unused Kubernetes RBAC
- ❌ WRONG (Include unused RBAC resources):
- Require elevated IAM permissions to deploy (container.admin, EKS Cluster Admin, AKS RBAC Admin)
- Violate least privilege principle
- Increase attack surface
- ✅ CORRECT (Only include ServiceAccount):
- Application uses External Secrets Operator (ESO handles secret access)
- Application doesn’t use Kubernetes API directly
- No need for in-cluster RBAC permissions
- Search codebase for kubernetes.client or @kubernetes/client-node
- If not found, remove Role and RoleBinding
- Keep only ServiceAccount with workload identity annotations
Minimal deployment permissions per platform:
- GCP: roles/container.developer (sufficient for workload deployment)
- AWS: AmazonEKSWorkerNodePolicy + ECR access
- Azure: Azure Kubernetes Service Cluster User Role
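The codebase search can be scripted so that it runs before anyone edits the RBAC manifests. A minimal sketch, assuming the two client libraries named above are the only ways the application could reach the Kubernetes API; the function name is hypothetical.

```shell
# uses_kubernetes_api SOURCE_DIR
# Succeeds if any file under SOURCE_DIR mentions the Python or Node.js
# Kubernetes client; fails if no match is found.
uses_kubernetes_api() {
  src_dir="$1"
  grep -rqE 'kubernetes\.client|@kubernetes/client-node' "$src_dir"
}
```

Point it at the application source rather than the repository root, since vendored dependencies (for example node_modules) can produce matches that say nothing about the application itself. If it fails, the Role and RoleBinding are candidates for removal.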
5. Namespace Creation: Always Create Before Validation
- ❌ WRONG (Namespace created during deployment):
- ✅ CORRECT (Namespace created before validation):
- Enables rollback (namespace exists even if deployment fails)
- Allows validation to succeed (resources reference the namespace)
- Idempotent (--dry-run=client -o yaml | kubectl apply won’t fail if the namespace exists)
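The idempotent pattern can be wrapped in a helper so the workflow runs it before validation. This is a sketch; the function name ensure_namespace is an assumption.

```shell
# ensure_namespace NAME
# Creates the namespace if it is missing and leaves it unchanged otherwise.
# The client-side dry-run only renders the Namespace manifest locally;
# kubectl apply then creates or reconciles it on the cluster.
ensure_namespace() {
  ns="$1"
  kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f -
}
```

Running ensure_namespace twice in a row should succeed both times, which a plain kubectl create namespace would not.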
6. Managed Kubernetes CPU/Memory Constraints
GKE Autopilot:
- Minimum 500m CPU when using pod anti-affinity
- Minimum 250m CPU otherwise
- Validates via admission webhook (GKE Warden)
EKS Fargate:
- Minimum 250m CPU
- CPU/memory must match specific combinations
- Validates at pod scheduling time
AKS:
- More flexible, but validate resource quotas
- Check node pool constraints
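A CPU request can be compared against these minimums before deployment. The sketch below is an illustrative pre-check with hypothetical function names; it does not replace the platform's own admission or scheduling validation.

```shell
# to_millicores QUANTITY
# Converts a CPU quantity such as 500m, 1 or 0.25 into whole millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v cores="$1" 'BEGIN { printf "%d\n", cores * 1000 + 0.5 }' ;;
  esac
}

# meets_minimum REQUEST MINIMUM
# Succeeds if REQUEST is at least MINIMUM, for example: meets_minimum 250m 500m
meets_minimum() {
  [ "$(to_millicores "$1")" -ge "$(to_millicores "$2")" ]
}
```

For a pod with anti-affinity on GKE Autopilot, meets_minimum "$cpu_request" 500m would flag a request that the admission webhook is likely to reject.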
EKS Deployment Guide
Prerequisites
- AWS IAM Configuration
- IRSA (IAM Roles for Service Accounts)
- External Secrets Operator
Workflow Template (.github/workflows/deploy-staging-eks.yaml)
- Authentication: configure-aws-credentials instead of google-github-actions/auth
- Kubeconfig: aws eks update-kubeconfig instead of get-gke-credentials
- Image registry: ECR instead of Artifact Registry
- ServiceAccount annotation: eks.amazonaws.com/role-arn instead of iam.gke.io/gcp-service-account
- ✅ Server-side validation (--dry-run=server)
- ✅ Namespace creation before validation
- ✅ ESO verification (not installation)
- ✅ Kustomize installation
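The EKS-specific steps might look like the following sketch. The function names and all argument values are placeholders; it assumes the AWS CLI is already authenticated (for example by configure-aws-credentials) and that the ServiceAccount already exists.

```shell
# connect_eks CLUSTER REGION
# Writes kubeconfig credentials for the named EKS cluster.
connect_eks() {
  aws eks update-kubeconfig --name "$1" --region "$2"
}

# annotate_irsa NAMESPACE SERVICE_ACCOUNT ROLE_ARN
# Binds the ServiceAccount to an IAM role through the IRSA annotation.
annotate_irsa() {
  kubectl annotate serviceaccount "$2" -n "$1" \
    "eks.amazonaws.com/role-arn=$3" --overwrite
}
```

In a Kustomize setup the annotation would normally live in the overlay's ServiceAccount manifest; the imperative form here is only for quick verification.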
AKS Deployment Guide
Prerequisites
- Azure Workload Identity
- RBAC Permissions
- External Secrets Operator
Workflow Template (.github/workflows/deploy-staging-aks.yaml)
- Authentication: azure/login@v2 instead of Google auth
- Kubeconfig: az aks get-credentials instead of get-gke-credentials
- Image registry: ACR instead of Artifact Registry
- ServiceAccount annotation: azure.workload.identity/client-id instead of GKE annotation
- ✅ Server-side validation (--dry-run=server)
- ✅ Namespace creation before validation
- ✅ ESO verification (not installation)
- ✅ Kustomize installation
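The AKS-specific steps follow the same shape. As a sketch with hypothetical function names and placeholder arguments, assuming the Azure CLI is already logged in (for example by azure/login@v2) and that the ServiceAccount already exists:

```shell
# connect_aks RESOURCE_GROUP CLUSTER
# Merges credentials for the named AKS cluster into the local kubeconfig.
connect_aks() {
  az aks get-credentials --resource-group "$1" --name "$2" --overwrite-existing
}

# annotate_workload_identity NAMESPACE SERVICE_ACCOUNT CLIENT_ID
# Binds the ServiceAccount to a managed identity's client ID.
annotate_workload_identity() {
  kubectl annotate serviceaccount "$2" -n "$1" \
    "azure.workload.identity/client-id=$3" --overwrite
}
```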
Common Kubernetes Issues
Issue 1: ConfigMap Generator Behavior
- ❌ WRONG (Merge when no base ConfigMap):
- ✅ CORRECT (Create new ConfigMap):
Issue 2: Disk Space for Docker Builds
Enhanced Cleanup (.github/workflows/ci.yaml):
Issue 3: Pre-commit Python Version
Align with GitHub Actions Runner:
Issue 4: Strategic Merge Patch for Namespaces
- ❌ WRONG (Namespace in both base and overlay resources):
- ✅ CORRECT (Namespace as patch):
Prevention Checklist
Use this checklist when creating EKS/AKS deployment workflows:
✅ Workflow Configuration
- Use server-side validation: kubectl apply --dry-run=server
- Create namespace before validation
- Install kustomize tool
- Verify External Secrets Operator is installed (if using)
- Check ESO API version matches cluster (kubectl api-resources)
- Set proper RBAC permissions (cluster admin is needed only if Role/RoleBinding resources are deployed; otherwise developer-level access is sufficient)
- Configure platform-specific authentication (IRSA for EKS, Workload Identity for AKS)
✅ Kustomize Configuration
- Use strategic merge patches for namespaces (not resources)
- Use correct ConfigMap generator behavior (create vs merge)
- Put env var overrides in ConfigMap patches, not deployment patches
- Avoid value/valueFrom conflicts in environment variables
- Set explicit resource requests/limits (minimum 500m CPU for safety)
✅ CI/CD Pipeline
- Include enhanced disk cleanup for Docker builds
- Align pre-commit Python version with runners (3.12)
- Use Trivy SARIF fallback for compliance scans
- Fix TruffleHog base reference for secret scanning
- Set actual project/account IDs (no placeholders)
✅ External Secrets Operator
- Install ESO during infrastructure setup (not in CI/CD)
- Use correct API version (v1 for ESO 0.9+, check with kubectl api-resources)
- Verify CRDs are installed before deployment
- Configure provider-specific auth (Workload Identity for GCP, IRSA for AWS, Managed Identity for Azure)
✅ Resource Specifications
- Set minimum 500m CPU for containers with pod anti-affinity
- Define resources for init containers
- Check platform-specific constraints (GKE Autopilot, EKS Fargate, AKS quotas)
- Validate with actual cluster (server-side dry-run)
Platform-Specific Quick Reference
| Aspect | GKE | EKS | AKS |
|---|---|---|---|
| Auth Action | google-github-actions/auth@v3 | aws-actions/configure-aws-credentials@v4 | azure/login@v2 |
| Kubeconfig | get-gke-credentials@v3 | aws eks update-kubeconfig | az aks get-credentials |
| Workload Identity | iam.gke.io/gcp-service-account | eks.amazonaws.com/role-arn | azure.workload.identity/client-id |
| Registry | Artifact Registry | ECR | ACR |
| Admin Role | roles/container.admin | AmazonEKSClusterAdminPolicy | AKS RBAC Cluster Admin |
| Secrets Provider | GCP Secret Manager | AWS Secrets Manager | Azure Key Vault |
| Min CPU (anti-affinity) | 500m (Autopilot) | 250m (Fargate) | Flexible |
| Validation | Server-side dry-run | Server-side dry-run | Server-side dry-run |
Testing Before Deployment
Local Validation
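A local check that needs no cluster access is to confirm that every overlay renders. The sketch below assumes the overlay paths are passed as arguments; the function name is hypothetical.

```shell
# render_overlays OVERLAY_DIR...
# Renders each Kustomize overlay and reports which ones fail to build.
# Returns non-zero if any overlay fails, so it can gate a commit or CI job.
render_overlays() {
  rc=0
  for overlay in "$@"; do
    if kubectl kustomize "$overlay" > /dev/null; then
      echo "ok: $overlay"
    else
      echo "FAILED: $overlay" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

For example: render_overlays deployments/overlays/staging-eks deployments/overlays/staging-aks. Rendering catches Kustomize errors such as missing files or bad patches, but not CRD or admission problems, which still need the server-side dry-run.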
Pre-Deployment Verification
Summary of Applicable Fixes
✅ UNIVERSAL (Apply to ALL platforms)
- Pre-commit Python 3.12 - Already applied globally
- Enhanced disk cleanup - Already in ci.yaml
- Trivy/TruffleHog fixes - Already in gcp-compliance-scan.yaml
- ConfigMap env overrides - Pattern for all Kustomize deployments
- Remove unused RBAC resources - Security best practice (Issue #11)
✅ KUBERNETES-SPECIFIC (Apply to GKE, EKS, AKS)
- Server-side validation - CRITICAL for all platforms
- Namespace pre-creation - Required for all platforms
- Kustomize installation - Required for all platforms
- ESO verification - If using External Secrets
- ConfigMap generator behavior - All Kustomize deployments
- Resource specifications - Check platform-specific constraints
- Minimal IAM permissions - No RBAC creation needed after removing unused resources
❌ GCP-ONLY (Different for EKS/AKS)
- Workload Identity - Use IRSA (EKS) or Managed Identity (AKS)
- GCP project values - Use AWS Account ID or Azure Subscription ID
- Artifact Registry - Use ECR (EKS) or ACR (AKS)
- GKE Autopilot constraints - Different for EKS Fargate and AKS
Conclusion
All 11 critical lessons from GKE troubleshooting have been documented and templates provided for EKS and AKS. When creating actual EKS/AKS deployment workflows, use the templates above and follow the prevention checklist to avoid all issues encountered with GKE.
Key Takeaways:
- Always use server-side validation for CRD support
- Create namespace before validation for proper error handling
- Use ConfigMap patches for environment variable overrides
- Check ESO API version matches cluster
- Remove unused RBAC resources - verify application needs before including
- Grant minimal IAM permissions - only what’s needed for deployment
- Set explicit resource requests meeting platform constraints
Next Steps:
- Create EKS overlay directory: deployments/overlays/staging-eks/
- Create AKS overlay directory: deployments/overlays/staging-aks/
- Copy templates from this guide
- Adapt ServiceAccount annotations for IRSA/Managed Identity
- Test with kubectl apply --dry-run=server before committing