Kubernetes Best Practices Implementation Guide
Status: Implementation in progress
Last Updated: 2025-11-02
Priority: HIGH - GCP GKE, AWS EKS, Azure AKS deployment best practices
Executive Summary
This document tracks the implementation of 11 high-priority Kubernetes best practices improvements across GCP GKE, AWS EKS, and Azure AKS deployments. All implementations follow TDD principles (tests first, then implementation).
✅ COMPLETED IMPLEMENTATIONS
1. Cloud-Managed PostgreSQL (HIGH AVAILABILITY)
Status: ✅ COMPLETE
What was implemented:
- Azure Database for PostgreSQL Terraform module with zone-redundant HA
- Helm chart support for external databases (CloudSQL, RDS, Azure Database)
- CloudSQL Proxy sidecar integration for GKE
- Comprehensive monitoring and alerting for all cloud providers
Files created:
terraform/modules/azure-database/main.tf (354 lines)
terraform/modules/azure-database/variables.tf (309 lines)
terraform/modules/azure-database/outputs.tf (98 lines)
terraform/modules/azure-database/versions.tf
Files modified:
deployments/helm/mcp-server-langgraph/values.yaml (added external DB config)
deployments/helm/mcp-server-langgraph/templates/deployment.yaml (added CloudSQL proxy sidecar)
Usage Examples:
```yaml
# Helm values for GCP CloudSQL
postgresql:
  enabled: false
  external:
    host: ""
  cloud:
    cloudSql:
      enabled: true
      instanceConnectionName: "project:region:instance"
      credentialsSecret: "cloudsql-credentials"
```
```hcl
# Terraform for Azure Database
module "postgres" {
  source = "./modules/azure-database"

  name_prefix         = "mcp-prod"
  resource_group_name = azurerm_resource_group.main.name
  location            = "eastus"

  enable_high_availability = true
  high_availability_mode   = "ZoneRedundant"

  backup_retention_days        = 35
  geo_redundant_backup_enabled = true
}
```
Test coverage: 12 tests in tests/infrastructure/test_database_ha.py
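On GKE, the chart also injects the CloudSQL Proxy as a sidecar next to the application container. The snippet below is a minimal sketch of what such a sidecar might look like when rendered; the image tag, flags, and resource requests are assumptions, and the actual container is produced by the Helm template listed above.
```yaml
# Hypothetical CloudSQL Proxy sidecar (image tag and flags are assumptions; see the Helm template for the real definition)
- name: cloudsql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
  args:
    - "--structured-logs"
    - "--port=5432"
    - "project:region:instance"   # matches instanceConnectionName above
  securityContext:
    runAsNonRoot: true
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
```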
2. Topology Spread Constraints (ZONE-BASED HA)
Status: ✅ COMPLETE
What was implemented:
- TopologySpreadConstraints for zone distribution
- Upgraded podAntiAffinity from preferred to required for production
- Zone-level anti-affinity to prevent single-zone failures
- Node-level spreading for better resource utilization
Files modified:
deployments/base/deployment.yaml
deployments/helm/mcp-server-langgraph/values.yaml
deployments/helm/mcp-server-langgraph/templates/deployment.yaml
Configuration:
```yaml
topologySpreadConstraints:
  # Required: Spread across zones
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: mcp-server-langgraph
  # Best effort: Spread across nodes
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: mcp-server-langgraph

affinity:
  podAntiAffinity:
    # REQUIRED: Do not schedule replicas in the same zone
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - mcp-server-langgraph
        topologyKey: topology.kubernetes.io/zone
```
Impact:
- Supports a 99.99% availability target by spreading replicas across zones
- Limits the blast radius of a single-zone outage
- Helps meet production SLA requirements
Test coverage: 8 tests in tests/infrastructure/test_topology_spread.py
🚧 IN PROGRESS
3. Velero Backup/DR
Status: 🚧 IN PROGRESS
Implementation Path:
- Create Velero Helm configuration:
```yaml
# deployments/backup/velero-values.yaml
configuration:
  provider: aws  # or gcp, azure
  backupStorageLocation:
    bucket: mcp-server-backups
    config:
      region: us-east-1
      # GCP: project: my-project
      # Azure: resourceGroup: my-rg, storageAccount: myaccount
  volumeSnapshotLocation:
    config:
      region: us-east-1

credentials:
  useSecret: true
  existingSecret: cloud-credentials

schedules:
  daily:
    schedule: "0 2 * * *"
    template:
      includedNamespaces:
        - mcp-server-langgraph
      ttl: 720h  # 30 days
```
- Create backup schedules:
```yaml
# deployments/backup/backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: mcp-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - mcp-server-langgraph
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h  # 30 days retention
```
- Installation:
```bash
# AWS
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --values deployments/backup/velero-values-aws.yaml

# GCP
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --values deployments/backup/velero-values-gcp.yaml

# Azure
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --values deployments/backup/velero-values-azure.yaml
```
Files to create:
deployments/backup/velero-values-aws.yaml
deployments/backup/velero-values-gcp.yaml
deployments/backup/velero-values-azure.yaml
deployments/backup/backup-schedule.yaml
deployments/backup/restore-procedure.md
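For the restore procedure, a restore from one of the scheduled backups can also be expressed declaratively as a Velero Restore resource. The following is a minimal sketch; the backup name is a placeholder and would come from `velero backup get`.
```yaml
# Hypothetical restore from a daily backup (backup name is a placeholder)
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: mcp-restore-20251102
  namespace: velero
spec:
  backupName: mcp-daily-backup-20251102020000
  includedNamespaces:
    - mcp-server-langgraph
  restorePVs: true
```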
📋 PENDING IMPLEMENTATIONS
4. Istio Service Mesh with mTLS STRICT
Priority: HIGH (Security)
Implementation Path:
- Update Helm values to enable Istio:
```yaml
# deployments/helm/mcp-server-langgraph/values.yaml
serviceMesh:
  enabled: true
  istio:
    injection: enabled
    mtls:
      mode: STRICT
```
- Add Istio resources (already exist at deployments/service-mesh/istio/):
  - ✅ istio-config.yaml (Gateway, VirtualService, DestinationRule)
  - ✅ AuthorizationPolicy for RBAC
  - ✅ PeerAuthentication for mTLS
- Update namespace labels:
```yaml
# deployments/base/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mcp-server-langgraph
  labels:
    istio-injection: enabled
```
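For reference, the PeerAuthentication resource mentioned above is what enforces mTLS for the namespace. A minimal sketch of such a resource follows; the existing file in deployments/service-mesh/istio/ may differ in naming and scope.
```yaml
# Sketch of a namespace-wide STRICT mTLS policy (the existing repository file may differ)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: mcp-server-langgraph
spec:
  mtls:
    mode: STRICT
```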
Estimated time: 2 hours
5. Pod Security Standards
Priority: HIGH (Security)
Implementation Path:
```yaml
# deployments/base/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mcp-server-langgraph
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
Validation: Ensure all pods comply with restricted PSS (already compliant based on current pod security contexts).
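For reference, the restricted profile expects roughly the following pod and container security context settings. This is a sketch of the required shape, not the project's actual manifests.
```yaml
# Shape expected by the restricted Pod Security Standard (sketch, not the project's manifests)
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: mcp-server-langgraph
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```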
Estimated time: 30 minutes
6. Network Policies for All Services
Priority: HIGH (Security)
Files to create:
```yaml
# deployments/base/postgres-networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: mcp-server-langgraph
        - podSelector:
            matchLabels:
              app: openfga
        - podSelector:
            matchLabels:
              app: keycloak
      ports:
        - protocol: TCP
          port: 5432
  egress:
    # Allow DNS lookups to kube-dns in kube-system (a bare podSelector only matches the local namespace)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
```
Similar policies are needed for the following (a sketch for Redis is shown after the list):
redis-networkpolicy.yaml
keycloak-networkpolicy.yaml
openfga-networkpolicy.yaml
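As an example of the shape these would take, here is a sketch for Redis. It assumes the Redis pods carry the label app: redis and listen on 6379; both are assumptions, not verified against the actual manifests.
```yaml
# deployments/base/redis-networkpolicy.yaml (sketch; label selectors and port are assumptions)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis
spec:
  podSelector:
    matchLabels:
      app: redis
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: mcp-server-langgraph
      ports:
        - protocol: TCP
          port: 6379
```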
Estimated time: 1 hour
7. Loki Log Aggregation
Priority: MEDIUM (Observability)
Implementation:
```yaml
# deployments/monitoring/loki-stack-values.yaml
loki:
  enabled: true
  persistence:
    enabled: true
    size: 50Gi
  config:
    auth_enabled: false
    chunk_store_config:
      max_look_back_period: 720h  # 30 days

promtail:
  enabled: true
  config:
    clients:
      - url: http://loki:3100/loki/api/v1/push

grafana:
  enabled: false  # Using existing Grafana instance
  sidecar:
    datasources:
      enabled: true
```
```bash
helm install loki-stack grafana/loki-stack \
  --namespace monitoring \
  --values deployments/monitoring/loki-stack-values.yaml
```
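Since the bundled Grafana is disabled, the existing Grafana instance needs Loki registered as a datasource. Assuming that instance runs the standard datasource sidecar (as in kube-prometheus-stack), a ConfigMap like the following would be picked up automatically; the label key, namespace, and Loki service URL are assumptions.
```yaml
# Hypothetical ConfigMap for the Grafana datasource sidecar (label, namespace, and URL are assumptions)
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
```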
Estimated time: 1.5 hours
8. ResourceQuota and LimitRange
Priority: MEDIUM (Cost & Stability)
Files to create:
```yaml
# deployments/base/resourcequota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: mcp-server-langgraph
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    limits.cpu: "100"
    limits.memory: 200Gi
    persistentvolumeclaims: "20"
    services.loadbalancers: "3"
```

```yaml
# deployments/base/limitrange.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: mcp-server-langgraph
spec:
  limits:
    - default:
        memory: 512Mi
        cpu: 500m
      defaultRequest:
        memory: 256Mi
        cpu: 250m
      type: Container
```
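If the base manifests are managed with Kustomize (which the deployments/base/ layout suggests, though this is an assumption), the new files would also be registered in the kustomization. The existing resource entries below are assumptions as well.
```yaml
# deployments/base/kustomization.yaml (sketch; existing entries are assumptions)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - deployment.yaml
  - resourcequota.yaml
  - limitrange.yaml
```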
Estimated time: 30 minutes
9. Kubecost for FinOps
Priority: MEDIUM (Cost Optimization)
Implementation:
```yaml
# deployments/monitoring/kubecost-values.yaml
kubecostProductConfigs:
  clusterName: mcp-production
  productKey: ""  # Free tier or enterprise key

prometheus:
  server:
    global:
      external_labels:
        cluster_id: mcp-prod

ingress:
  enabled: true
  className: nginx
  hosts:
    - kubecost.example.com
```
```bash
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --create-namespace \
  --values deployments/monitoring/kubecost-values.yaml
```
Cloud billing integration:
- AWS: Configure CUR (Cost and Usage Report)
- GCP: Enable BigQuery billing export
- Azure: Configure Cost Management API
Estimated time: 2 hours
10. Karpenter for EKS
Priority: MEDIUM (Cost Optimization - AWS only)
Implementation:
```hcl
# terraform/modules/karpenter/main.tf
resource "aws_iam_role" "karpenter_controller" {
  name = "${var.cluster_name}-karpenter-controller"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = var.oidc_provider_arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${var.oidc_provider}:sub" = "system:serviceaccount:karpenter:karpenter"
        }
      }
    }]
  })
}

# Attach policies for Karpenter
# ... (see full implementation in terraform/modules/karpenter/)
```
```yaml
# Provisioner configuration
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 604800
```
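The providerRef above points at an AWS node template. With the v1alpha5 API shown here, that is an AWSNodeTemplate; the sketch below assumes subnets and security groups are tagged with karpenter.sh/discovery, and the tag value is an assumption.
```yaml
# Sketch of the AWSNodeTemplate referenced by providerRef (discovery tag value is an assumption)
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: mcp-production
  securityGroupSelector:
    karpenter.sh/discovery: mcp-production
```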
Estimated time: 3 hours
11. VPA for Stateful Services
Priority: LOW (Cost Optimization)
Implementation:
```yaml
# deployments/base/postgres-vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa
  namespace: mcp-server-langgraph
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: postgres
        minAllowed:
          cpu: 250m
          memory: 256Mi
        maxAllowed:
          cpu: 4000m
          memory: 8Gi
```
Similar VPAs can be created for the other in-cluster stateful services (for example Redis); a sketch for Redis follows.
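The sketch below assumes Redis runs as a StatefulSet named redis in the same namespace; the workload kind and name are assumptions.
```yaml
# Hypothetical VPA for Redis (workload kind and name are assumptions)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: redis-vpa
  namespace: mcp-server-langgraph
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: redis
  updatePolicy:
    updateMode: "Auto"
```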
Estimated time: 1 hour
TESTING STRATEGY
All implementations follow TDD:
- Write tests first (RED phase)
- Implement minimal solution (GREEN phase)
- Refactor and optimize (REFACTOR phase)
Test files created:
tests/infrastructure/test_database_ha.py - Database HA tests
tests/infrastructure/test_topology_spread.py - Zone spreading tests
Additional tests needed:
tests/infrastructure/test_backup_restore.py - Velero backup/restore
tests/infrastructure/test_service_mesh.py - Istio mTLS validation
tests/infrastructure/test_network_policies.py - Network isolation
tests/infrastructure/test_observability.py - Loki, Kubecost integration
tests/infrastructure/test_autoscaling.py - Karpenter, VPA validation
DEPLOYMENT CHECKLIST
Before deploying to production:
ESTIMATED TIMELINE
| Phase | Items | Estimated Time | Status |
|---|---|---|---|
| Phase 1 | Database HA, Topology Spread, Velero | 6 hours | 70% complete |
| Phase 2 | Istio, PSS, Network Policies | 4 hours | 0% complete |
| Phase 3 | Loki, ResourceQuota, Kubecost | 4 hours | 0% complete |
| Phase 4 | Karpenter, VPA | 4 hours | 0% complete |
| Testing | Comprehensive test suite | 3 hours | 20% complete |
| Docs | README, runbooks, migration guides | 2 hours | 0% complete |
| Total | | 23 hours | ~25% complete |
ROLLBACK PROCEDURES
Cloud-Managed Databases
- Keep in-cluster PostgreSQL running during migration
- Test external database connectivity before switching
- Update Helm values: set postgresql.enabled=false
- Monitor application metrics post-migration
- Rollback: set postgresql.enabled=true
Topology Spread Constraints
- Test in dev/staging first
- Ensure cluster has 3+ zones
- Monitor pod scheduling (watch for Pending pods)
- Rollback: remove topologySpreadConstraints and revert to preferred anti-affinity
Istio Service Mesh
- Enable incrementally (namespace by namespace)
- Start with PERMISSIVE mTLS, then upgrade to STRICT
- Monitor latency and error rates
- Rollback: kubectl label namespace mcp-server-langgraph istio-injection-
SUPPORT & TROUBLESHOOTING
Common Issues
Issue: Pods stuck in Pending due to topology constraints
Solution: Verify cluster has 3+ zones, reduce minReplicas temporarily
Issue: CloudSQL proxy authentication failing
Solution: Verify Workload Identity binding, check service account permissions
Issue: Istio mTLS connection refused
Solution: Check PeerAuthentication mode, verify certificates with istioctl
Useful Commands
```bash
# Check pod distribution across zones (map each pod's node to its zone label)
kubectl get pods -n mcp-server-langgraph \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' \
  | while read -r pod node; do
      zone=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
      printf '%s\t%s\n' "$pod" "$zone"
    done

# Validate Istio mTLS
istioctl x authz check pod/mcp-server-langgraph-xyz -n mcp-server-langgraph

# Check Velero backups
velero backup get
velero backup describe <backup-name>

# View Kubecost dashboard
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
```
REFERENCES