
Kubernetes Best Practices Implementation Guide

Status: Implementation in progress
Last Updated: 2025-11-02
Priority: HIGH - GCP GKE, AWS EKS, and Azure AKS deployment best practices

Executive Summary

This document tracks the implementation of 11 high-priority Kubernetes best practices improvements across GCP GKE, AWS EKS, and Azure AKS deployments. All implementations follow TDD principles (tests first, then implementation).

✅ COMPLETED IMPLEMENTATIONS

1. Cloud-Managed PostgreSQL (HIGH AVAILABILITY)

Status: ✅ COMPLETE
What was implemented:
  • Azure Database for PostgreSQL Terraform module with zone-redundant HA
  • Helm chart support for external databases (CloudSQL, RDS, Azure Database)
  • CloudSQL Proxy sidecar integration for GKE
  • Comprehensive monitoring and alerting for all cloud providers
Files created:
  • terraform/modules/azure-database/main.tf (354 lines)
  • terraform/modules/azure-database/variables.tf (309 lines)
  • terraform/modules/azure-database/outputs.tf (98 lines)
  • terraform/modules/azure-database/versions.tf
Files modified:
  • deployments/helm/mcp-server-langgraph/values.yaml (added external DB config)
  • deployments/helm/mcp-server-langgraph/templates/deployment.yaml (added CloudSQL proxy sidecar)
Usage Examples:
# Helm values for GCP CloudSQL
postgresql:
  enabled: false
  external:
    host: ""
    cloud:
      cloudSql:
        enabled: true
        instanceConnectionName: "project:region:instance"
        credentialsSecret: "cloudsql-credentials"
# Terraform for Azure Database
module "postgres" {
  source = "./modules/azure-database"

  name_prefix         = "mcp-prod"
  resource_group_name = azurerm_resource_group.main.name
  location            = "eastus"

  enable_high_availability = true
  high_availability_mode   = "ZoneRedundant"

  backup_retention_days        = 35
  geo_redundant_backup_enabled = true
}
Test coverage: 12 tests in tests/infrastructure/test_database_ha.py
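For reference, a minimal sketch of what the CloudSQL Auth Proxy sidecar added by the Helm chart roughly renders to; the image tag and flags below are illustrative assumptions and should be taken from the chart's actual values:
# Sketch: CloudSQL Auth Proxy sidecar rendered when cloudSql.enabled=true (illustrative)
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0  # illustrative tag
  args:
    - "--structured-logs"
    - "--port=5432"
    - "project:region:instance"  # instanceConnectionName from values
  securityContext:
    runAsNonRoot: true
  resources:
    requests:
      cpu: 100m
      memory: 128Mi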

2. Topology Spread Constraints (ZONE-BASED HA)

Status: ✅ COMPLETE
What was implemented:
  • TopologySpreadConstraints for zone distribution
  • Upgraded podAntiAffinity from preferred to required for production
  • Zone-level anti-affinity to prevent single-zone failures
  • Node-level spreading for better resource utilization
Files modified:
  • deployments/base/deployment.yaml
  • deployments/helm/mcp-server-langgraph/values.yaml
  • deployments/helm/mcp-server-langgraph/templates/deployment.yaml
Configuration:
topologySpreadConstraints:
  # Required: Spread across zones
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: mcp-server-langgraph
  # Best effort: Spread across nodes
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: mcp-server-langgraph

affinity:
  podAntiAffinity:
    # REQUIRED: no two replicas scheduled in the same zone
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - mcp-server-langgraph
        topologyKey: topology.kubernetes.io/zone
Impact:
  • Supports the 99.99% multi-zone availability target
  • Prevents cascading failures from zone outages
  • Meets production SLA requirements
Test coverage: 8 tests in tests/infrastructure/test_topology_spread.py
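If the required zone anti-affinity cannot be satisfied (for example, fewer than three zones), pods stay Pending. A quick way to spot and diagnose this, assuming default scheduler event messages:
# Find pods left Pending by the scheduler and inspect the scheduling events
kubectl get pods -n mcp-server-langgraph --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n mcp-server-langgraph | tail -n 20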

🚧 IN PROGRESS

3. Velero Backup/DR

Status: 🚧 IN PROGRESS
Implementation Path:
  1. Create Velero Helm configuration:
# deployments/backup/velero-values.yaml
configuration:
  provider: aws  # or gcp, azure
  backupStorageLocation:
    bucket: mcp-server-backups
    config:
      region: us-east-1
      # GCP: project: my-project
      # Azure: resourceGroup: my-rg, storageAccount: myaccount
  volumeSnapshotLocation:
    config:
      region: us-east-1

credentials:
  useSecret: true
  existingSecret: cloud-credentials

schedules:
  daily:
    schedule: "0 2 * * *"
    template:
      includedNamespaces:
        - mcp-server-langgraph
      ttl: 720h  # 30 days
  2. Create backup schedules:
# deployments/backup/backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: mcp-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - mcp-server-langgraph
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h  # 30 days retention
  3. Installation:
# AWS
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --values deployments/backup/velero-values-aws.yaml

# GCP
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --values deployments/backup/velero-values-gcp.yaml

# Azure
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --values deployments/backup/velero-values-azure.yaml
Files to create:
  • deployments/backup/velero-values-aws.yaml
  • deployments/backup/velero-values-gcp.yaml
  • deployments/backup/velero-values-azure.yaml
  • deployments/backup/backup-schedule.yaml
  • deployments/backup/restore-procedure.md
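Once restore-procedure.md exists it should capture the restore flow; a hedged sketch of the core Velero CLI steps it would likely document:
# Sketch: restoring the application namespace from a daily backup
velero backup get
velero restore create --from-backup mcp-daily-backup-<timestamp> \
  --include-namespaces mcp-server-langgraph \
  --wait
velero restore describe <restore-name>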

📋 PENDING IMPLEMENTATIONS

4. Istio Service Mesh with mTLS STRICT

Priority: HIGH (Security)
Implementation Path:
  1. Update Helm values to enable Istio:
# deployments/helm/mcp-server-langgraph/values.yaml
serviceMesh:
  enabled: true
  istio:
    injection: enabled
    mtls:
      mode: STRICT
  2. Add Istio resources (already exist at deployments/service-mesh/istio/):
    • istio-config.yaml (Gateway, VirtualService, DestinationRule)
    • ✅ AuthorizationPolicy for RBAC
    • ✅ PeerAuthentication for mTLS
  3. Update namespace labels:
# deployments/base/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mcp-server-langgraph
  labels:
    istio-injection: enabled
Estimated time: 2 hours
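For context, a minimal PeerAuthentication that enforces STRICT mTLS for the namespace; the actual resource already lives under deployments/service-mesh/istio/, so treat this as an illustrative sketch:
# Sketch: namespace-wide STRICT mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: mcp-server-langgraph
spec:
  mtls:
    mode: STRICT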

5. Pod Security Standards

Priority: HIGH (Security)
Implementation Path:
# deployments/base/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mcp-server-langgraph
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Validation: Ensure all pods comply with the restricted Pod Security Standard (current pod security contexts are already compliant).
Estimated time: 30 minutes
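Before enforcing, the admission controller can report would-be violations without changing anything, using a server-side dry run of the label change:
# Report pods that would violate the restricted profile, without applying the label
kubectl label --dry-run=server --overwrite namespace mcp-server-langgraph \
  pod-security.kubernetes.io/enforce=restricted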

6. Network Policies for All Services

Priority: HIGH (Security)
Files to create:
# deployments/base/postgres-networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: mcp-server-langgraph
        - podSelector:
            matchLabels:
              app: openfga
        - podSelector:
            matchLabels:
              app: keycloak
      ports:
        - protocol: TCP
          port: 5432
  egress:
    - to:
        - podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
Similar policies are needed for the following (a Redis sketch follows after this list):
  • redis-networkpolicy.yaml
  • keycloak-networkpolicy.yaml
  • openfga-networkpolicy.yaml
Estimated time: 1 hour
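As a sketch of one of the policies listed above, the Redis policy would mirror the PostgreSQL one with the Redis port; the pod labels are assumed to match the in-cluster Redis deployment:
# deployments/base/redis-networkpolicy.yaml (sketch)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis
spec:
  podSelector:
    matchLabels:
      app: redis
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: mcp-server-langgraph
      ports:
        - protocol: TCP
          port: 6379
  egress:
    - to:
        - podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53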

7. Loki Log Aggregation

Priority: MEDIUM (Observability)
Implementation:
# deployments/monitoring/loki-stack-values.yaml
loki:
  enabled: true
  persistence:
    enabled: true
    size: 50Gi
  config:
    auth_enabled: false
    chunk_store_config:
      max_look_back_period: 720h  # 30 days

promtail:
  enabled: true
  config:
    clients:
      - url: http://loki:3100/loki/api/v1/push

grafana:
  enabled: false  # Using existing Grafana instance
  sidecar:
    datasources:
      enabled: true
helm install loki-stack grafana/loki-stack \
  --namespace monitoring \
  --values deployments/monitoring/loki-stack-values.yaml
Estimated time: 1.5 hours
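After installation, log flow can be verified against Loki's query_range API; the service name and the namespace label below are assumptions (they match the Promtail client URL above and Promtail's default Kubernetes labels) and may need adjusting to the installed release:
# In one terminal: expose Loki
kubectl port-forward -n monitoring svc/loki 3100:3100
# In another terminal: confirm logs are arriving for the application namespace
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="mcp-server-langgraph"}' \
  --data-urlencode 'limit=10'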

8. ResourceQuota and LimitRange

Priority: MEDIUM (Cost & Stability)
Files to create:
# deployments/base/resourcequota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: mcp-server-langgraph
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    limits.cpu: "100"
    limits.memory: 200Gi
    persistentvolumeclaims: "20"
    services.loadbalancers: "3"
# deployments/base/limitrange.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: mcp-server-langgraph
spec:
  limits:
    - default:
        memory: 512Mi
        cpu: 500m
      defaultRequest:
        memory: 256Mi
        cpu: 250m
      type: Container
Estimated time: 30 minutes
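Applying and verifying the quota and default limits is straightforward:
# Apply and inspect current quota usage and container defaults
kubectl apply -f deployments/base/resourcequota.yaml
kubectl apply -f deployments/base/limitrange.yaml
kubectl describe resourcequota compute-quota -n mcp-server-langgraph
kubectl describe limitrange default-limits -n mcp-server-langgraph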

9. Kubecost for FinOps

Priority: MEDIUM (Cost Optimization)
Implementation:
# deployments/monitoring/kubecost-values.yaml
kubecostProductConfigs:
  clusterName: mcp-production
  productKey: ""  # Free tier or enterprise key

prometheus:
  server:
    global:
      external_labels:
        cluster_id: mcp-prod

ingress:
  enabled: true
  className: nginx
  hosts:
    - kubecost.example.com
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --create-namespace \
  --values deployments/monitoring/kubecost-values.yaml
Cloud billing integration:
  • AWS: Configure CUR (Cost and Usage Report)
  • GCP: Enable BigQuery billing export
  • Azure: Configure Cost Management API
Estimated time: 2 hours
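Beyond the dashboard, per-namespace spend can be pulled from the cost-analyzer Allocation API; the path and parameters follow the documented Kubecost Allocation API and may vary by version:
# In one terminal: expose the cost-analyzer UI/API
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
# In another terminal: pull per-namespace allocation for the last 7 days
curl -s "http://localhost:9090/model/allocation?window=7d&aggregate=namespace"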

10. Karpenter for EKS

Priority: MEDIUM (Cost Optimization - AWS only)
Implementation:
# terraform/modules/karpenter/main.tf
resource "aws_iam_role" "karpenter_controller" {
  name = "${var.cluster_name}-karpenter-controller"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = var.oidc_provider_arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${var.oidc_provider}:sub" = "system:serviceaccount:karpenter:karpenter"
        }
      }
    }]
  })
}

# Attach policies for Karpenter
# ... (see full implementation in terraform/modules/karpenter/)
# Provisioner configuration
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 604800
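The providerRef above points at an AWSNodeTemplate; a minimal sketch for the same v1alpha5-era API, with the discovery tag values below as an assumed convention:
# Sketch: AWSNodeTemplate referenced by the Provisioner's providerRef
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: mcp-production   # assumed discovery tag on subnets
  securityGroupSelector:
    karpenter.sh/discovery: mcp-production   # assumed discovery tag on security groups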
Estimated time: 3 hours

11. VPA for Stateful Services

Priority: LOW (Cost Optimization)
Implementation:
# deployments/base/postgres-vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa
  namespace: mcp-server-langgraph
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: postgres
        minAllowed:
          cpu: 250m
          memory: 256Mi
        maxAllowed:
          cpu: 4000m
          memory: 8Gi
Similar VPAs for:
  • Redis
  • Keycloak
Estimated time: 1 hour
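VPA recommendations can be reviewed before relying on Auto mode; they appear under the VPA object's status:
# Inspect current VPA recommendations for the stateful services
kubectl get vpa -n mcp-server-langgraph
kubectl describe vpa postgres-vpa -n mcp-server-langgraph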

TESTING STRATEGY

All implementations follow TDD:
  1. Write tests first (RED phase)
  2. Implement minimal solution (GREEN phase)
  3. Refactor and optimize (REFACTOR phase)
Test files created:
  • tests/infrastructure/test_database_ha.py - Database HA tests
  • tests/infrastructure/test_topology_spread.py - Zone spreading tests
Additional tests needed:
  • tests/infrastructure/test_backup_restore.py - Velero backup/restore
  • tests/infrastructure/test_service_mesh.py - Istio mTLS validation
  • tests/infrastructure/test_network_policies.py - Network isolation
  • tests/infrastructure/test_observability.py - Loki, Kubecost integration
  • tests/infrastructure/test_autoscaling.py - Karpenter, VPA validation
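The existing and planned infrastructure tests are assumed to run with pytest; a typical invocation:
# Run the full infrastructure suite, or a single test module
pytest tests/infrastructure/ -v
pytest tests/infrastructure/test_topology_spread.py -v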

DEPLOYMENT CHECKLIST

Before deploying to production:
  • Cloud-managed databases configured
  • Topology spread constraints enabled
  • Velero backups tested and validated
  • Istio mTLS STRICT mode enabled
  • Pod Security Standards enforced
  • Network policies applied to all services
  • Loki log aggregation operational
  • Resource quotas configured
  • Kubecost monitoring enabled
  • Karpenter autoscaling tested (EKS)
  • VPA recommendations validated
  • All tests passing
  • Documentation updated

ESTIMATED TIMELINE

Phase   | Items                                | Estimated Time | Status
Phase 1 | Database HA, Topology Spread, Velero | 6 hours        | 70% complete
Phase 2 | Istio, PSS, Network Policies         | 4 hours        | 0% complete
Phase 3 | Loki, ResourceQuota, Kubecost        | 4 hours        | 0% complete
Phase 4 | Karpenter, VPA                       | 4 hours        | 0% complete
Testing | Comprehensive test suite             | 3 hours        | 20% complete
Docs    | README, runbooks, migration guides   | 2 hours        | 0% complete
Total   |                                      | 23 hours       | ~25% complete

ROLLBACK PROCEDURES

Cloud-Managed Databases

  1. Keep in-cluster PostgreSQL running during migration
  2. Test external database connectivity before switching
  3. Update Helm values: postgresql.enabled=false
  4. Monitor application metrics post-migration
  5. Rollback: postgresql.enabled=true
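Steps 3 and 5 above amount to a single Helm values toggle; a hedged example, with the release name and chart path assumed from this repository's layout:
# Switch to the external database (step 3) ...
helm upgrade mcp-server-langgraph deployments/helm/mcp-server-langgraph \
  --reuse-values --set postgresql.enabled=false
# ... and roll back to the in-cluster PostgreSQL (step 5)
helm upgrade mcp-server-langgraph deployments/helm/mcp-server-langgraph \
  --reuse-values --set postgresql.enabled=true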

Topology Spread Constraints

  1. Test in dev/staging first
  2. Ensure cluster has 3+ zones
  3. Monitor pod scheduling (watch for Pending pods)
  4. Rollback: Remove topologySpreadConstraints, revert to preferred anti-affinity

Istio Service Mesh

  1. Enable incrementally (namespace by namespace)
  2. Start with PERMISSIVE mTLS, then upgrade to STRICT
  3. Monitor latency and error rates
  4. Rollback: kubectl label namespace mcp-server-langgraph istio-injection-

SUPPORT & TROUBLESHOOTING

Common Issues

Issue: Pods stuck in Pending due to topology constraints
Solution: Verify the cluster has 3+ zones; temporarily reduce minReplicas.

Issue: CloudSQL proxy authentication failing
Solution: Verify the Workload Identity binding and check service account permissions.

Issue: Istio mTLS connections refused
Solution: Check the PeerAuthentication mode and verify certificates with istioctl.

Useful Commands

# Check pod distribution across zones
kubectl get pods -n mcp-server-langgraph \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' |
while read -r pod node; do
  echo -e "${pod}\t$(kubectl get node "${node}" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')"
done

# Validate Istio mTLS
istioctl x authz check pod/mcp-server-langgraph-xyz -n mcp-server-langgraph

# Check Velero backups
velero backup get
velero backup describe <backup-name>

# View Kubecost dashboard
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090

REFERENCES