22. Distributed Conversation Checkpointing for Auto-Scaling
Date: 2025-10-15
Status: Accepted
Category: Data & Storage

Context
The MCP Server uses LangGraph’s checkpointing feature to maintain multi-turn conversation state. Conversations were previously stored using MemorySaver (in-memory), which creates critical problems for production deployments with horizontal pod autoscaling (HPA):
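For illustration, a minimal sketch of this setup using LangGraph's MemorySaver and a thread_id-scoped config (the graph and node below are placeholders, not the actual agent graph):

```python
# Sketch: conversation state keyed by thread_id, checkpointed in-process only.
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import MessagesState, START, StateGraph


def respond(state: MessagesState) -> dict:
    # Placeholder node; the real agent calls an LLM here.
    return {"messages": [("assistant", "ack")]}


builder = StateGraph(MessagesState)
builder.add_node("respond", respond)
builder.add_edge(START, "respond")

# MemorySaver keeps checkpoints in this pod's memory only.
graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "conversation-123"}}
graph.invoke({"messages": [("user", "hello")]}, config)
# Another pod compiling the same graph gets its own empty MemorySaver,
# so it has no checkpoint for "conversation-123".
```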
Problem: Pod-Local Conversation State
Current Architecture:
- MemorySaver stores conversation state in pod memory
- Each Kubernetes pod has isolated conversation history
- thread_id identifies conversations, but state is not shared

Impact:
- Scale-Up: New pods have NO conversation history
- Scale-Down: Terminated pods lose ALL conversation state
- Pod Restart: All conversations on that pod are LOST
- Load Balancing: Users routed to different pods lose context
Attempted Solution: Session Affinity
Kubernetes Service with sessionAffinity: ClientIP was configured:
- ❌ Only works for same source IP (mobile clients change IPs)
- ❌ Doesn’t prevent state loss on pod restarts
- ❌ Doesn’t help during scale-down events
- ❌ Application-layer thread_id is not considered
- ❌ Creates “pet” pods instead of “cattle” (anti-pattern)
Requirements
Production deployments with HPA (minReplicas: 3, maxReplicas: 10) require:
- Distributed State: Conversation history accessible from ANY pod
- State Persistence: Survive pod restarts and scale events
- Session Continuity: Same thread_id works across all replicas
- No Vendor Lock-in: Pluggable backend (similar to session storage, ADR-0006)
- Performance: Fast checkpoint reads/writes (< 10ms)
- Simplicity: Reuse existing infrastructure
Decision
We will implement distributed conversation checkpointing using Redis with a pluggable architecture pattern.

Architecture
Redis Database Separation
Same Redis instance, different databases:
- Database 0: Authentication sessions (SESSION_BACKEND=redis)
- Database 1: Conversation checkpoints (CHECKPOINT_BACKEND=redis)
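A small sketch of what this separation means in practice, using redis-py and placeholder environment variable names (the real setting names are defined in the configuration described below):

```python
# Sketch: same Redis instance, logical separation via database index.
# The URL path selects the database: /0 for sessions, /1 for checkpoints.
import os

import redis

sessions = redis.Redis.from_url(
    os.environ.get("REDIS_SESSION_URL", "redis://localhost:6379/0")
)
checkpoints = redis.Redis.from_url(
    os.environ.get("REDIS_CHECKPOINT_URL", "redis://localhost:6379/1")
)

# Both clients talk to the same server but never see each other's keys.
assert sessions.connection_pool.connection_kwargs["db"] == 0
assert checkpoints.connection_pool.connection_kwargs["db"] == 1
```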
Configuration
URL Encoding Requirements (IMPORTANT)
Critical Security & Reliability Requirement: Redis connection URLs MUST have their passwords percent-encoded per RFC 3986. Background: a production incident with staging revision 758b8f744, where an unencoded Redis password containing special characters (/, +, =) caused ValueError: Port could not be cast to integer value during redis.connection.parse_url().
Example: The password Du0PmDvmqDWqDTgfGnmi6/SKyuQydi3z7cPTgEQoE+s= contains:
- / (forward slash) - treated as a path delimiter
- + (plus sign) - treated as a space in some contexts
- = (equals sign) - treated as a query parameter delimiter
- Production (Kubernetes External Secrets): Use the | urlquery filter in the secret template
- Local Development: Manually percent-encode passwords in .env
- Defense-in-Depth: Application code includes an automatic encoding safeguard (see the sketch below)
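A minimal sketch of one form such a safeguard can take, assuming the URL is assembled from discrete host/password settings; build_redis_url is an illustrative name, not the function used in the codebase:

```python
# Sketch: assemble the Redis URL with the password percent-encoded (RFC 3986).
from urllib.parse import quote


def build_redis_url(host: str, port: int, password: str, db: int) -> str:
    """Return a Redis URL whose password is safe for redis.connection.parse_url()."""
    # quote(..., safe="") encodes '/', '+', and '=' as %2F, %2B, and %3D,
    # so the password cannot be mistaken for a path, space, or query delimiter.
    return f"redis://:{quote(password, safe='')}@{host}:{port}/{db}"


# Example password from the incident above (an example value, not a live secret):
print(build_redis_url("redis", 6379, "Du0PmDvmqDWqDTgfGnmi6/SKyuQydi3z7cPTgEQoE+s=", 1))
# redis://:Du0PmDvmqDWqDTgfGnmi6%2FSKyuQydi3z7cPTgEQoE%2Bs%3D@redis:6379/1
```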
tests/unit/test_redis_url_encoding.py validates encoding for all RFC 3986 special characters.
Docker Compose
Consequences
Positive Consequences
- ✅ Auto-Scaling Works: HPA can scale 3-10 replicas without losing conversations
- ✅ State Persistence: Conversations survive pod restarts and scale events
- ✅ Zero Infrastructure Overhead: Reuses existing Redis (already used for sessions)
- ✅ Better Performance: Redis (in-memory) is faster than PostgreSQL
- ✅ Backward Compatible: Defaults to memory backend (existing behavior)
- ✅ Consistent Architecture: Both sessions AND checkpoints use Redis
- ✅ Simple Operations: No schema migrations or new databases
Negative Consequences
- ⚠️ Redis Dependency for Production: Production deployments MUST enable Redis
- ⚠️ Slight Latency: Redis network calls add ~5-10ms per checkpoint operation
- ⚠️ Memory Usage: Conversations stored in Redis (mitigated by TTL)
Neutral Consequences
- Database Separation: Uses different Redis databases (0 vs 1) for logical separation
- TTL Management: Automatic cleanup after 7 days (configurable)
Alternatives Considered
1. PostgreSQL Checkpointer
Description: Use LangGraph’s PostgresSaver to store checkpoints in a database
Pros:
- Transactional guarantees
- SQL query capabilities
- May reuse existing PostgreSQL
Cons:
- ❌ Slower than Redis for high-frequency checkpoint operations
- ❌ Requires new PostgreSQL instance (or shares with application DB)
- ❌ Schema migrations needed
- ❌ More complex connection pooling
- ❌ Higher database load
2. Sticky Sessions by thread_id (Application-Level Routing)
Description: Implement a custom load balancer to route the same thread_id to the same pod
Pros:
- Keeps conversation state local
- No external storage needed
Cons:
- ❌ Doesn’t solve pod restart problem (state still lost)
- ❌ Doesn’t work during scale-down events
- ❌ Complicates load balancing significantly
- ❌ Creates “pet” pods (stateful pods are anti-pattern)
- ❌ Requires custom ingress/service mesh logic
3. Keep MemorySaver + StatefulSet
Description: Use a Kubernetes StatefulSet instead of a Deployment for stable pod identities
Pros:
- Stable network identities
- Persistent volumes per pod
Cons:
- ❌ Still loses state on pod restart
- ❌ Doesn’t work well with HPA (HPA and StatefulSets combine poorly)
- ❌ Slower rollouts
- ❌ Violates “cattle not pets” principle
- ❌ Not designed for stateless applications
4. Distributed In-Memory Cache (Memcached, Hazelcast)
Description: Use a distributed cache instead of Redis
Pros:
- In-memory performance
- Distributed by design
Cons:
- ❌ New infrastructure dependency
- ❌ More complex than Redis
- ❌ LangGraph doesn’t have native support
- ❌ Would need custom checkpointer implementation
Implementation Details
Checkpointer Factory
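The real factory lives in src/mcp_server_langgraph/core/agent.py; below is a hedged sketch of its general shape, assuming the RedisSaver interface from langgraph-checkpoint-redis (parameter names are illustrative):

```python
# Sketch of a pluggable checkpointer factory (shape only, not the actual code).
from langgraph.checkpoint.base import BaseCheckpointSaver
from langgraph.checkpoint.memory import MemorySaver


def create_checkpointer(backend: str, redis_url: str | None = None) -> BaseCheckpointSaver:
    """Return a checkpointer for the configured backend ('memory' or 'redis')."""
    if backend == "redis":
        # Import lazily so the Redis extra is only needed when enabled.
        from langgraph.checkpoint.redis import RedisSaver

        if not redis_url:
            raise ValueError("CHECKPOINT_BACKEND=redis requires a Redis URL")
        saver = RedisSaver.from_conn_string(redis_url)  # assumed constructor shape
        saver.setup()  # assumed: idempotently creates Redis indices
        return saver
    # Default: pod-local, non-persistent checkpoints (existing behavior).
    return MemorySaver()
```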
Configuration Settings
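The actual fields sit in src/mcp_server_langgraph/core/config.py (around lines 90-93); a hedged pydantic-settings sketch with illustrative names and defaults:

```python
# Illustrative settings block; real names and defaults live in core/config.py.
from pydantic_settings import BaseSettings


class CheckpointSettings(BaseSettings):
    # "memory" keeps the current MemorySaver behavior; "redis" enables
    # distributed checkpoints shared by all replicas.
    checkpoint_backend: str = "memory"                       # CHECKPOINT_BACKEND
    checkpoint_redis_url: str = "redis://localhost:6379/1"   # database 1
    checkpoint_ttl_seconds: int = 7 * 24 * 3600              # 7-day automatic cleanup


settings = CheckpointSettings()
```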
Request Flow (Auto-Scaling Scenario)
Performance Characteristics
| Backend | Checkpoint Read | Checkpoint Write | Pod Restart Impact | Scaling Impact |
|---|---|---|---|---|
| MemorySaver | < 1ms | < 1ms | ❌ Lost | ❌ Lost |
| Redis (local) | ~2ms | ~2ms | ✅ Preserved | ✅ Preserved |
| Redis (network) | ~5-10ms | ~5-10ms | ✅ Preserved | ✅ Preserved |
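These figures can be sanity-checked against a concrete environment with a rough redis-py round-trip probe (this measures raw Redis latency, not the checkpointer itself):

```python
# Rough Redis round-trip probe; gives a lower bound for checkpoint read/write cost.
import time

import redis

r = redis.Redis.from_url("redis://localhost:6379/1")
N = 200
start = time.perf_counter()
for i in range(N):
    r.set(f"latency-probe:{i}", "x")
    r.get(f"latency-probe:{i}")
elapsed_ms = (time.perf_counter() - start) * 1000 / (2 * N)
print(f"avg round-trip: {elapsed_ms:.2f} ms")
```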
Testing Strategy
Unit Tests
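For example, the URL-encoding tests (tests/unit/test_redis_url_encoding.py, cited above) can assert that a percent-encoded password survives redis-py's parse_url; a sketch with illustrative test names:

```python
# Sketch of a unit test in the spirit of tests/unit/test_redis_url_encoding.py.
from urllib.parse import quote

import pytest
from redis.connection import parse_url


@pytest.mark.parametrize("password", [
    "Du0PmDvmqDWqDTgfGnmi6/SKyuQydi3z7cPTgEQoE+s=",  # incident-style value
    "a/b+c=d?e#f@g",                                  # every delimiter at once
])
def test_encoded_password_parses_cleanly(password: str) -> None:
    url = f"redis://:{quote(password, safe='')}@redis-host:6379/1"
    parsed = parse_url(url)
    assert parsed["port"] == 6379
    assert parsed["db"] == 1
    assert parsed["password"] == password  # parse_url unquotes the password
```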
Integration Tests
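A sketch of the key integration test: state written by one replica is visible to another. It requires a reachable Redis (skipped otherwise) and assumes the RedisSaver API from langgraph-checkpoint-redis:

```python
# Sketch: conversation continuity across two checkpointer instances ("pods").
import os
import uuid

import pytest
from langgraph.graph import MessagesState, START, StateGraph

REDIS_URL = os.environ.get("CHECKPOINT_REDIS_URL")
pytestmark = pytest.mark.skipif(not REDIS_URL, reason="CHECKPOINT_REDIS_URL not set")


def _replica():
    """Build one 'pod': a trivial graph with its own RedisSaver instance."""
    from langgraph.checkpoint.redis import RedisSaver

    saver = RedisSaver.from_conn_string(REDIS_URL)  # assumed constructor shape
    saver.setup()  # assumed: idempotently creates Redis indices

    def respond(state: MessagesState) -> dict:
        return {"messages": [("assistant", "ack")]}

    builder = StateGraph(MessagesState)
    builder.add_node("respond", respond)
    builder.add_edge(START, "respond")
    return builder.compile(checkpointer=saver)


def test_conversation_continues_on_second_replica() -> None:
    config = {"configurable": {"thread_id": f"it-{uuid.uuid4()}"}}

    # "Pod A" handles the first turn and checkpoints it to Redis.
    _replica().invoke({"messages": [("user", "hello")]}, config)

    # "Pod B" is a separate checkpointer instance but sees the same thread.
    state = _replica().get_state(config)
    assert len(state.values["messages"]) >= 2  # user turn + assistant reply
```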
Load Tests
- Verify performance with Redis checkpointer under load
- Test HPA scaling behavior with active conversations
- Chaos test: Kill pods during conversations, verify recovery
Migration Path
Development → Production
Development (default): uses the memory backend, so no Redis is required
Existing Deployments
No Breaking Changes:
- Defaults to memory backend (current behavior)
- Conversations in MemorySaver are NOT migrated (acceptable - they were temporary)
- Enable Redis backend via environment variable (opt-in)

Migration Steps:
1. Deploy new version (still uses memory by default)
2. Set CHECKPOINT_BACKEND=redis in deployment manifests
3. Restart pods (conversations reset once - acceptable)
4. Future conversations persist across all pod events
Kubernetes Configuration
HPA with Redis Checkpointer
HPA Configuration
Monitoring
Metrics
Add OpenTelemetry metrics for checkpointer operations:
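A sketch of possible instruments, using the OpenTelemetry metrics API (metric and attribute names are illustrative, not the ones actually emitted):

```python
# Illustrative OpenTelemetry instruments for checkpointer operations.
import time

from opentelemetry import metrics

meter = metrics.get_meter("mcp_server_langgraph.checkpointer")

checkpoint_ops = meter.create_counter(
    "checkpoint.operations",
    description="Checkpoint reads/writes by backend and outcome",
)
checkpoint_latency = meter.create_histogram(
    "checkpoint.duration",
    unit="ms",
    description="Checkpoint operation latency",
)


def record_checkpoint(op: str, backend: str, start: float, ok: bool) -> None:
    """Call after each checkpoint read/write to emit counter + latency."""
    attrs = {"operation": op, "backend": backend, "success": ok}
    checkpoint_ops.add(1, attrs)
    checkpoint_latency.record((time.perf_counter() - start) * 1000, attrs)
```
Alerts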
Future Enhancements
- PostgreSQL Fallback: Add PostgreSQL checkpointer option for organizations without Redis
- Checkpoint Compression: Compress large conversation histories
- Selective Checkpointing: Only checkpoint after N messages (reduce Redis writes)
- Multi-Region: Replicate checkpoints across regions for disaster recovery
References
- Implementation: src/mcp_server_langgraph/core/agent.py:74-125
- Configuration: src/mcp_server_langgraph/core/config.py:90-93
- Related ADRs: ADR-0006 (session storage)
- LangGraph Documentation: https://langchain-ai.github.io/langgraph/how-tos/persistence/
- Redis Checkpointer: pip install langgraph-checkpoint-redis>=2.0.0