
Overview

Implement comprehensive disaster recovery (DR) to protect against data loss, service disruption, and regional failures. This guide covers backup strategies, restore procedures, and multi-region failover.
RTO vs RPO: Recovery Time Objective (RTO) is how quickly you can restore service. Recovery Point Objective (RPO) is how much data you can afford to lose. Balance both based on business requirements.

Disaster Recovery Flow

Backup and Restore Timeline:

Backup Strategy

What to Back Up

PostgreSQL Databases

  • Keycloak user data
  • OpenFGA authorization tuples
  • Application metadata

Redis Data

  • Active sessions
  • Cache data
  • Rate limit counters

Kubernetes Resources

  • Deployments
  • ConfigMaps/Secrets
  • Ingress rules
  • PVCs

Configuration

  • Environment variables
  • Helm values
  • Infrastructure as Code

Backup Schedule

| Component                | Frequency | Retention | RPO | RTO |
|--------------------------|-----------|-----------|-----|-----|
| PostgreSQL (full)        | Daily     | 30 days   | 24h | 1h  |
| PostgreSQL (incremental) | Hourly    | 7 days    | 1h  | 30m |
| Redis                    | Hourly    | 24 hours  | 1h  | 15m |
| Kubernetes               | Daily     | 7 days    | 24h | 2h  |
| Secrets                  | Weekly    | 90 days   | 7d  | 4h  |

PostgreSQL Backup

Automated Backups

The example below runs pg_dump directly; the same backup can be scheduled in-cluster with a Kubernetes CronJob (see the sketch after the cron example below) or delegated to cloud-managed database backups.
#!/bin/bash
# backup-postgres.sh

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/postgres"

# Keycloak database
pg_dump -U keycloak -h postgres-keycloak \
  -d keycloak \
  -F c -b -v \
  -f "${BACKUP_DIR}/keycloak_${TIMESTAMP}.dump"

# OpenFGA database
pg_dump -U openfga -h postgres-openfga \
  -d openfga \
  -F c -b -v \
  -f "${BACKUP_DIR}/openfga_${TIMESTAMP}.dump"

# Compress
gzip "${BACKUP_DIR}"/*.dump

# Upload to S3
aws s3 sync ${BACKUP_DIR} s3://my-backups/postgres/ \
  --storage-class STANDARD_IA

# Clean up local files older than 7 days
find ${BACKUP_DIR} -name "*.dump.gz" -mtime +7 -delete
Schedule with cron:
# Daily full backup at 2 AM
0 2 * * * /scripts/backup-postgres.sh
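
If you prefer the in-cluster CronJob approach mentioned above, a minimal sketch looks like this; the image, credentials Secret, and backup PVC are assumptions to adapt to your environment:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: mcp-server-langgraph
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM, matching the cron entry above
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: postgres:16  # assumed image; add the aws CLI (or a sidecar) if you also upload to S3
            command: ["/bin/sh", "-c"]
            args:
            - pg_dump -U keycloak -h postgres-keycloak -d keycloak -F c -f /backups/keycloak_$(date +%Y%m%d_%H%M%S).dump
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials  # assumed Secret holding the keycloak DB password
                  key: password
            volumeMounts:
            - name: backups
              mountPath: /backups
          volumes:
          - name: backups
            persistentVolumeClaim:
              claimName: postgres-backups  # assumed PVC for backup staging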

Point-in-Time Recovery

# Decompress and restore the dump taken at the desired point in time
gunzip /backups/keycloak_20251012_020000.dump.gz
createdb -U postgres -h postgres-keycloak keycloak_recovery
pg_restore -U keycloak -h postgres-keycloak \
  -d keycloak_recovery \
  -v \
  /backups/keycloak_20251012_020000.dump

# Verify data
psql -U keycloak -h postgres-keycloak -d keycloak_recovery -c "SELECT COUNT(*) FROM users;"

# Switch to the recovered database (requires no active connections to either database)
kubectl exec -it postgres-keycloak-0 -- psql -U postgres -c "ALTER DATABASE keycloak RENAME TO keycloak_old;"
kubectl exec -it postgres-keycloak-0 -- psql -U postgres -c "ALTER DATABASE keycloak_recovery RENAME TO keycloak;"

Continuous Archiving (WAL)

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-backups/postgres-wal/%f'
archive_timeout = 300  # force a WAL switch at least every 5 minutes

# Recovery settings (in postgresql.conf on PostgreSQL 12+, together with an empty recovery.signal file)
restore_command = 'aws s3 cp s3://my-backups/postgres-wal/%f %p'
recovery_target_time = '2025-10-12 10:00:00'
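
Restoring from the WAL archive requires a base backup plus recovery mode. A rough sequence for a single-node install is sketched below (PostgreSQL 12+; the data directory and base-backup paths are assumptions, and an operator-managed cluster would do this differently):

# Stop PostgreSQL and lay down the most recent base backup
systemctl stop postgresql
rm -rf /var/lib/postgresql/data/*   # assumed data directory
tar -xzf /backups/base_backup.tar.gz -C /var/lib/postgresql/data

# Append the recovery settings and request recovery mode
cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'aws s3 cp s3://my-backups/postgres-wal/%f %p'
recovery_target_time = '2025-10-12 10:00:00'
EOF
touch /var/lib/postgresql/data/recovery.signal

# Start PostgreSQL; it replays WAL from S3 up to the target time
systemctl start postgresql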

Redis Backup

RDB Snapshots

## redis.conf
save 900 1      # Save if 1 key changed in 15 min
save 300 10     # Save if 10 keys changed in 5 min
save 60 10000   # Save if 10000 keys changed in 1 min

dir /data
dbfilename dump.rdb
Backup Script:
#!/bin/bash
# backup-redis.sh

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/redis"

# Record the last save time, then trigger a background save
LAST_SAVE=$(redis-cli -h redis-master LASTSAVE)
redis-cli -h redis-master BGSAVE

# Wait for the save to complete (LASTSAVE changes once BGSAVE finishes)
while [ "$(redis-cli -h redis-master LASTSAVE)" -eq "$LAST_SAVE" ]; do
  sleep 1
done

# Copy RDB file out of the pod
kubectl cp redis-master-0:/data/dump.rdb \
  "${BACKUP_DIR}/dump_${TIMESTAMP}.rdb"

# Upload to S3
aws s3 cp "${BACKUP_DIR}/dump_${TIMESTAMP}.rdb" \
  s3://my-backups/redis/

AOF (Append-Only File)

## redis.conf
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec  # Balance durability/performance

## Rewrite AOF when it grows 100% and is at least 64mb
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

Redis Replication

## Redis Sentinel for HA
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-sentinel
data:
  sentinel.conf: |
    sentinel monitor mymaster redis-master 6379 2
    sentinel down-after-milliseconds mymaster 5000
    sentinel parallel-syncs mymaster 1
    sentinel failover-timeout mymaster 10000
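
To verify that Sentinel is tracking the master and to see where clients will be redirected after a failover, query any Sentinel instance; the redis-sentinel service name and port 26379 are assumptions based on this ConfigMap:

# Which node does Sentinel currently consider the master?
redis-cli -h redis-sentinel -p 26379 SENTINEL get-master-addr-by-name mymaster

# Force a failover during a drill to confirm replicas promote cleanly
redis-cli -h redis-sentinel -p 26379 SENTINEL failover mymaster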

Kubernetes Resource Backup

Velero

Install Velero:
## Install CLI
brew install velero

## Install in cluster (AWS)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

## Verify installation
velero version
Create Backup Schedule:
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: langgraph-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
    - mcp-server-langgraph
    excludedResources:
    - events
    - events.events.k8s.io
    storageLocation: default
    ttl: 720h0m0s  # 30 days
Manual Backup:
## Backup entire namespace
velero backup create langgraph-backup \
  --include-namespaces mcp-server-langgraph

## Backup specific resources
velero backup create langgraph-deployments \
  --include-namespaces mcp-server-langgraph \
  --include-resources deployments,services,configmaps,secrets

## Check backup status
velero backup get
velero backup describe langgraph-backup

## Download backup
velero backup download langgraph-backup

kubectl Export

#!/bin/bash
## export-k8s-resources.sh

NAMESPACE="mcp-server-langgraph"
BACKUP_DIR="/backups/k8s/$(date +%Y%m%d)"
mkdir -p ${BACKUP_DIR}

## Export all resource types
for resource in deployment service configmap secret ingress pvc; do
  kubectl get ${resource} -n ${NAMESPACE} -o yaml > \
    "${BACKUP_DIR}/${resource}.yaml"
done

## Create archive
tar -czf "${BACKUP_DIR}.tar.gz" -C /backups/k8s "$(date +%Y%m%d)"

## Upload to S3
aws s3 cp "${BACKUP_DIR}.tar.gz" s3://my-backups/k8s/

Secrets Backup

Infisical Backup

# backup-secrets.py
import json
import os
from datetime import datetime

from infisical import InfisicalClient

client = InfisicalClient(
    client_id=os.getenv("INFISICAL_CLIENT_ID"),
    client_secret=os.getenv("INFISICAL_CLIENT_SECRET")
)

## Export all secrets
secrets = client.get_all_secrets(
    project_id=os.getenv("INFISICAL_PROJECT_ID"),
    environment="production"
)

## Save to file
backup_file = f"secrets-backup-{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(backup_file, 'w') as f:
    json.dump([{
        'name': s.secret_name,
        'value': s.secret_value
    } for s in secrets], f, indent=2)

## Encrypt backup
os.system(f"gpg --encrypt --recipient admin@company.com {backup_file}")

## Upload encrypted backup
os.system(f"aws s3 cp {backup_file}.gpg s3://my-backups/secrets/")

Kubernetes Secrets

## Export all secrets (base64 encoded)
kubectl get secrets -n mcp-server-langgraph -o yaml > secrets-backup.yaml

## Encrypt with SOPS
sops --encrypt --kms "arn:aws:kms:us-east-1:123456789:key/abc-123" \
  secrets-backup.yaml > secrets-backup.enc.yaml

## Store encrypted file in Git
git add secrets-backup.enc.yaml
git commit -m "Backup secrets $(date +%Y-%m-%d)"

Restore Procedures

Full System Restore

Step 1: Restore Infrastructure

# Deploy Kubernetes cluster (if needed)
eksctl create cluster -f cluster-config.yaml

# Install Helm charts
helm install openfga ./charts/openfga
helm install redis ./charts/redis
helm install keycloak ./charts/keycloak

Step 2: Restore Databases

# Restore PostgreSQL
pg_restore -U keycloak -h postgres-keycloak \
  -d keycloak -c -v \
  /backups/keycloak_20251012_020000.dump

pg_restore -U openfga -h postgres-openfga \
  -d openfga -c -v \
  /backups/openfga_20251012_020000.dump

Step 3: Restore Redis

# Prevent Redis from overwriting the restored file with a save on shutdown
kubectl exec redis-master-0 -- redis-cli CONFIG SET save ""

# Copy the RDB file onto the data volume
kubectl cp /backups/dump_20251012_020000.rdb \
  redis-master-0:/data/dump.rdb

# Restart Redis so it loads the restored snapshot from the persistent volume
kubectl scale statefulset redis-master --replicas=0
kubectl scale statefulset redis-master --replicas=1

Step 4: Restore Kubernetes Resources

# Using Velero
velero restore create langgraph-restore \
  --from-backup langgraph-backup-20251012

# Monitor restore
velero restore get
velero restore describe langgraph-restore --details

Step 5: Restore Secrets

# Decrypt and apply
sops --decrypt secrets-backup.enc.yaml | kubectl apply -f -

# Or restore to Infisical
python restore-secrets.py secrets-backup-20251012.json

Step 6: Verify System

# Check all pods running
kubectl get pods -n mcp-server-langgraph

# Test authentication
curl -X POST http://api.yourdomain.com/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"admin","password":"***"}'

# Test API
curl http://api.yourdomain.com/health

# Verify data integrity
psql -U keycloak -h postgres-keycloak -d keycloak -c "SELECT COUNT(*) FROM users;"

Partial Restore

Restore Single Database:
## Create new database
createdb -U postgres keycloak_restored

## Restore from backup
pg_restore -U keycloak \
  -d keycloak_restored \
  -v /backups/keycloak_20251012.dump

## Verify
psql -U keycloak -d keycloak_restored -c "\dt"
Restore Specific Kubernetes Resource:
## Restore from backup file
kubectl apply -f /backups/k8s/20251012/deployment.yaml

## Or using Velero
velero restore create deployment-restore \
  --from-backup langgraph-backup \
  --include-resources deployments \
  --selector app=mcp-server-langgraph

Disaster Recovery Testing

DR Drill Procedure

Monthly DR Test:
#!/bin/bash
# dr-drill.sh

set -e

echo "=== Starting DR Drill ==="
DRILL_START=$(date +%s)

# 1. Create test namespace
kubectl create namespace langgraph-dr-test

# 2. Restore the latest backup into the test namespace and wait for completion
LATEST_BACKUP=$(velero backup get --output json | jq -r '.items[0].metadata.name')
velero restore create dr-test-restore \
  --from-backup ${LATEST_BACKUP} \
  --namespace-mappings mcp-server-langgraph:langgraph-dr-test \
  --wait

# 3. Run smoke tests
pytest tests/smoke/ --namespace=langgraph-dr-test

# 4. Measure RTO (elapsed time from the start of the drill to a verified working system)
DRILL_END=$(date +%s)
RTO=$((DRILL_END - DRILL_START))
echo "RTO: ${RTO} seconds"

# 5. Cleanup
kubectl delete namespace langgraph-dr-test
velero restore delete dr-test-restore --confirm

echo "=== DR Drill Complete ==="
echo "RTO: ${RTO}s (Target: 3600s)"

Automated Testing

## .github/workflows/dr-test.yml
name: DR Testing
on:
  schedule:
    - cron: '0 0 1 * *'  # Monthly

jobs:
  dr-test:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3

    - name: Setup kubectl
      uses: azure/setup-kubectl@v3

    - name: Run DR drill
      run: ./scripts/dr-drill.sh

    - name: Report results
      if: failure()
      uses: 8398a7/action-slack@v3
      with:
        status: ${{ job.status }}
        text: 'DR test failed!'
        webhook_url: ${{ secrets.SLACK_WEBHOOK }}
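
As written, the workflow has no cluster credentials, so the drill would fail at the first kubectl call. For an EKS cluster, steps like the following could be added before "Run DR drill"; the action version, cluster name, and secret names are assumptions:

    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-east-1

    - name: Update kubeconfig
      run: aws eks update-kubeconfig --name my-cluster --region us-east-1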

Multi-Region Failover

Active-Passive Setup

## Primary region (us-east-1)
primary:
  region: us-east-1
  postgres:
    replication: async
    replica_regions:
      - us-west-2
  redis:
    replication: async

## Secondary region (us-west-2)
secondary:
  region: us-west-2
  postgres:
    mode: read-replica
  redis:
    mode: standby
Failover Procedure:
## 1. Promote secondary database
aws rds promote-read-replica \
  --db-instance-identifier keycloak-db-west

## 2. Update DNS
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123 \
  --change-batch file://failover-dns.json

## 3. Scale up secondary region
kubectl config use-context us-west-2
kubectl scale deployment mcp-server-langgraph --replicas=10

## 4. Verify traffic shift
watch -n 5 'kubectl top pods'
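
The change batch referenced above (failover-dns.json) might look like the following; the record name, TTL, and the standby load balancer hostname are placeholders:

{
  "Comment": "Fail over api.yourdomain.com to us-west-2",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.yourdomain.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "lb-us-west-2.example.com" }
        ]
      }
    }
  ]
}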

Active-Active Setup

## Global load balancer
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: langgraph-global
spec:
  template:
    spec:
      backend:
        serviceName: mcp-server-langgraph
        servicePort: 80
      rules:
      - http:
          paths:
          - backend:
              serviceName: mcp-server-langgraph
              servicePort: 80
Database Synchronization:
# CockroachDB for multi-region
cockroachdb:
  replicas: 9
  regions:
    - us-east-1   # 3 nodes
    - us-west-2   # 3 nodes
    - eu-west-1   # 3 nodes
  survivability: region

Monitoring & Alerting

Backup Monitoring

## Backup age
time() - backup_last_success_timestamp_seconds > 86400

## Backup failures
rate(backup_failures_total[1h]) > 0

## Backup size
backup_size_bytes{backup_type="postgres"} < 1e9
Alerts:
groups:
- name: backup_alerts
  rules:
  - alert: BackupFailed
    expr: backup_last_success_timestamp_seconds < time() - 86400
    for: 1h
    annotations:
      summary: "Backup has not succeeded in 24h"

  - alert: BackupMissing
    expr: absent(backup_last_success_timestamp_seconds)
    for: 30m
    annotations:
      summary: "Backup metrics missing"

Recovery Testing Dashboard

{
  "dashboard": {
    "title": "DR Metrics",
    "panels": [
      {
        "title": "Last Successful Backup",
        "targets": [{
          "expr": "time() - backup_last_success_timestamp_seconds"
        }]
      },
      {
        "title": "RTO Trend",
        "targets": [{
          "expr": "dr_test_rto_seconds"
        }]
      },
      {
        "title": "RPO Actual",
        "targets": [{
          "expr": "time() - backup_last_success_timestamp_seconds"
        }]
      }
    ]
  }
}

Best Practices

Never trust untested backups!
  • Monthly DR drills
  • Quarterly full restore tests
  • Document restore procedures
  • Measure actual RTO/RPO
  • Automate testing where possible
Maintain:
  • 3 copies of data
  • 2 different storage types
  • 1 off-site backup
# Example implementation
Primary: Live database
Copy 1: Daily snapshots (same region)
Copy 2: S3 backups (different region)
Copy 3: Glacier archive (off-site)
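
The Glacier copy can be automated with an S3 lifecycle rule on the backup bucket rather than a separate upload job; the bucket name, prefix, and transition/expiration windows below are examples:

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-backups \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-old-backups",
      "Status": "Enabled",
      "Filter": {"Prefix": "postgres/"},
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 2555}
    }]
  }'
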
Always encrypt sensitive data:
# Encrypt with GPG
gpg --encrypt --recipient backup@company.com backup.sql

# Or use AWS KMS
aws s3 cp backup.sql s3://backups/ \
  --sse aws:kms \
  --sse-kms-key-id arn:aws:kms:...
Maintain runbooks for:
  • Backup procedures
  • Restore procedures
  • Failover procedures
  • DR contacts and escalation
  • Test results and improvements
Infrastructure as Code for DR:
# Terraform/Pulumi
- Database clusters
- Kubernetes clusters
- Load balancers
- DNS configuration

# Ansible/Helm
- Application deployment
- Configuration management
- Secret restoration

Compliance & Audit

Backup Retention Policies

## retention-policy.yaml
policies:
  daily_backups:
    retention_days: 7
    backup_time: "02:00 UTC"

  weekly_backups:
    retention_days: 28
    backup_day: "Sunday"

  monthly_backups:
    retention_days: 365
    backup_day: "1st"

  yearly_backups:
    retention_days: 2555  # 7 years
    backup_day: "Jan 1"

Audit Trail

## Log all backup/restore operations
logger -t backup "Started PostgreSQL backup"

# Send to centralized logging (put-log-events takes structured events, not stdin)
aws logs put-log-events \
  --log-group-name /aws/backup \
  --log-stream-name postgres \
  --log-events timestamp=$(date +%s000),message="Backup completed"

Next Steps


Resilient Infrastructure: Comprehensive DR ensures business continuity and data protection!