28. Multi-Layer Caching Strategy
Date: 2025-10-20
Status: Accepted
Category: Performance & Resilience
Context
The MCP server performs expensive operations that could benefit from caching:
- LLM API calls: $0.003-0.015 per request, 2-10s latency
- OpenFGA authorization checks: 50-100ms per check, high volume
- Prometheus metric queries: 100-500ms per query
- Embedding generation: 200-500ms per text chunk
- Knowledge base searches: 500ms-2s per query
Current state:
- No caching layer implemented
- Every request hits external services
- Repeated identical requests (e.g., same auth check)
- High latency and cost
- P95 latency: Unknown (no baseline)
- LLM costs: Uncontrolled, potentially high
- OpenFGA load: 50+ checks per request
- Prometheus load: Repeated queries for same metrics
Goals:
- 30% latency reduction
- 20% cost savings
- P95 latency < 500ms
Decision
Implement a multi-layer caching strategy with different TTLs and invalidation policies.
Cache Layers
Layer 1: In-Memory LRU Cache
Use Case: Frequently accessed, small data (< 1MB per entry)
Implementation: functools.lru_cache or cachetools.LRUCache (see the sketch after the list below)
Cached data:
- OpenFGA authorization results (user:resource:permission)
- User profile lookups (user_id → profile)
- Feature flag evaluations
- Configuration values
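A minimal sketch of this layer, assuming `cachetools.TTLCache` for the LRU-with-TTL behavior planned in Week 1; the helper name and key format are illustrative, not the module's actual API.

```python
# Minimal L1 sketch, assuming cachetools; names and key formats are illustrative.
import threading

from cachetools import TTLCache

_l1_cache: TTLCache = TTLCache(maxsize=10_000, ttl=300)  # 5 min TTL, e.g. authz results
_l1_lock = threading.Lock()


def l1_get_or_compute(key: str, compute):
    """Return the cached value for `key`, computing and storing it on a miss."""
    with _l1_lock:
        if key in _l1_cache:
            return _l1_cache[key]
    value = compute()  # compute outside the lock to avoid blocking other readers
    with _l1_lock:
        _l1_cache[key] = value
    return value


# Example: cache an OpenFGA check keyed by user:resource:permission
# allowed = l1_get_or_compute(
#     f"authz:{user_id}:{resource}:{permission}:v1",
#     lambda: openfga_client.check(user_id, resource, permission),
# )
```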
Layer 2: Redis Distributed Cache
Use Case: Shared across instances, larger data, longer TTL
Implementation: redis with pickle serialization (see the sketch after the list below)
Cached data:
- LLM responses (with semantic similarity check)
- Embedding vectors for text chunks
- Prometheus query results
- Knowledge base search results
- Expensive computations
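A minimal sketch of this layer, assuming `redis-py` with pickle serialization as noted above; connection parameters, key names, and TTLs are illustrative, and pickle should only be used for trusted, internally produced values.

```python
# Minimal L2 sketch, assuming redis-py and pickle serialization as stated above.
# Connection parameters, key names, and TTLs are illustrative.
import pickle

import redis

r = redis.Redis(host="localhost", port=6379, db=0)


def l2_get(key: str):
    """Return the deserialized value for `key`, or None on a miss."""
    raw = r.get(key)
    return pickle.loads(raw) if raw is not None else None


def l2_set(key: str, value, ttl_seconds: int) -> None:
    """Serialize `value` and store it with an expiry."""
    r.set(key, pickle.dumps(value), ex=ttl_seconds)


# Example: cache an embedding vector for 24 hours
# l2_set(f"embedding:{text_hash}:v1", vector, ttl_seconds=86_400)
```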
Layer 3: Provider-Native Caching
Use Case: Leverage provider-specific caching (Anthropic, Gemini)
Anthropic Claude Prompt Caching:
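As a hedged illustration of provider-native caching, the Anthropic Messages API lets a stable prompt prefix (system prompt, tool definitions) be marked with `cache_control` so subsequent requests reuse the cached prefix; the model name and prompt below are placeholders, and exact parameters should be verified against the prompt-caching docs in the References.

```python
# Hedged sketch of Anthropic prompt caching: mark the stable system prompt as
# cacheable so repeated requests reuse the cached prefix. Model name and prompt
# text are placeholders; verify parameters against the Anthropic docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...multi-thousand-token system prompt shared by all requests..."

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "User question goes here"}],
)
```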
Cache Key Design
Hierarchical Key Format (see the key-builder sketch after this list):
- Include version suffix (:v1) for cache invalidation
- Use hashing for long identifiers (limit key length < 250 chars)
- Namespace by feature area to avoid collisions
- Include all parameters that affect output
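A key-builder sketch following these rules; the namespace and version values are examples, not the project's actual conventions.

```python
# Hedged sketch of the hierarchical key format above; hashing keeps keys under
# the length limit, and the version suffix allows bulk invalidation.
import hashlib

MAX_KEY_LENGTH = 250


def make_cache_key(namespace: str, *parts: str, version: str = "v1") -> str:
    """Build `namespace:part1:...:version`, hashing the parts if the key is too long."""
    key = ":".join([namespace, *parts, version])
    if len(key) > MAX_KEY_LENGTH:
        digest = hashlib.sha256(":".join(parts).encode("utf-8")).hexdigest()
        key = f"{namespace}:{digest}:{version}"
    return key


# Example: make_cache_key("authz", user_id, resource, permission)
```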
TTL Strategy
| Cache Type | TTL | Rationale |
|---|---|---|
| OpenFGA Authorization | 5 min | Permissions don’t change frequently |
| LLM Responses | 1 hour | Identical prompts can reuse responses (temperature 0 or semantic-similarity match) |
| User Profiles | 15 min | Profiles update infrequently |
| Feature Flags | 1 min | Need fast rollout of flag changes |
| Prometheus Queries | 1 min | Metrics change frequently |
| Embeddings | 24 hours | Text → embedding is deterministic |
| Knowledge Base Search | 30 min | Search index updates periodically |
Cache Invalidation
Strategies:
- Time-based (TTL): Automatic expiration after duration
- Event-based: Invalidate on data changes (sketch below)
- Version-based: Change key version to invalidate all
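Event-based invalidation could look like the following sketch, assuming the Redis layer above and an `authz:*` key namespace (both illustrative).

```python
# Hedged sketch of event-based invalidation: when a user's permissions change,
# drop that user's cached authorization results. The key pattern is illustrative.
def invalidate_user_authz(redis_client, user_id: str) -> None:
    for key in redis_client.scan_iter(match=f"authz:{user_id}:*"):
        redis_client.delete(key)
```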
Architecture
New Module: src/mcp_server_langgraph/core/cache.py
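The module's central abstraction could be a read-through helper that checks L1, then L2, then computes on a miss; the sketch below is illustrative, and its class and method names are not the module's actual API.

```python
# Hedged sketch of a read-through multi-layer cache for core/cache.py.
# Class and method names are illustrative; the real module may differ.
import pickle
from typing import Any, Callable

import redis
from cachetools import TTLCache


class MultiLayerCache:
    def __init__(self, redis_client: redis.Redis, l1_maxsize: int = 10_000,
                 l1_ttl: int = 300) -> None:
        self.l1 = TTLCache(maxsize=l1_maxsize, ttl=l1_ttl)
        self.redis = redis_client

    def get_or_compute(self, key: str, compute: Callable[[], Any],
                       l2_ttl: int = 3600) -> Any:
        # 1. L1: in-process lookup (fastest)
        if key in self.l1:
            return self.l1[key]
        # 2. L2: shared Redis lookup
        raw = self.redis.get(key)
        if raw is not None:
            value = pickle.loads(raw)
            self.l1[key] = value  # promote to L1
            return value
        # 3. Miss: compute, then populate both layers
        value = compute()
        self.redis.set(key, pickle.dumps(value), ex=l2_ttl)
        self.l1[key] = value
        return value
```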
Metrics & Observability
Cache Metrics (20+)
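Cache observability could be built on `prometheus_client` counters labeled by layer and cache type, as in this hedged sketch; the metric names are assumptions, not the project's actual metrics.

```python
# Hedged sketch of cache observability; metric names and labels are illustrative.
from prometheus_client import Counter

CACHE_HITS = Counter("cache_hits_total", "Cache hits", ["layer", "cache_type"])
CACHE_MISSES = Counter("cache_misses_total", "Cache misses", ["layer", "cache_type"])
CACHE_EVICTIONS = Counter("cache_evictions_total", "Cache evictions", ["layer", "cache_type"])

# Hit rate (used in the targets below) = hits / (hits + misses), e.g. in PromQL:
#   sum(rate(cache_hits_total[5m])) /
#   (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
CACHE_HITS.labels(layer="l1", cache_type="authz").inc()
```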
Cache Hit Rate Targets
| Cache Type | Target Hit Rate | Current | Status |
|---|---|---|---|
| OpenFGA Authorization | > 80% | TBD | 🟡 |
| LLM Responses | > 40% | TBD | 🟡 |
| Embeddings | > 70% | TBD | 🟡 |
| Prometheus Queries | > 60% | TBD | 🟡 |
| User Profiles | > 90% | TBD | 🟡 |
Configuration
Environment Variables
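Cache behavior could be controlled by environment variables along these lines; the variable names and defaults are assumptions for illustration only.

```python
# Hedged sketch of cache settings read from the environment; variable names
# and defaults are assumptions, not the project's actual configuration.
import os

CACHE_ENABLED = os.getenv("CACHE_ENABLED", "true").lower() == "true"
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
CACHE_L1_MAXSIZE = int(os.getenv("CACHE_L1_MAXSIZE", "10000"))
CACHE_AUTHZ_TTL_SECONDS = int(os.getenv("CACHE_AUTHZ_TTL_SECONDS", "300"))
CACHE_LLM_TTL_SECONDS = int(os.getenv("CACHE_LLM_TTL_SECONDS", "3600"))
```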
Consequences
Positive
- Performance Improvement
  - 30-50% latency reduction for cached operations
  - P95 latency < 500ms (target)
  - Reduced load on external services
- Cost Savings
  - 20-40% reduction in LLM API costs
  - Fewer OpenFGA API calls
  - Lower Prometheus query load
- Scalability
  - Handle higher request volume with same resources
  - Reduced database/API load enables horizontal scaling
- User Experience
  - Faster response times
  - More consistent performance
Negative
- Cache Staleness
  - Stale data for TTL duration
  - Permission changes not reflected immediately
  - May violate real-time requirements
- Memory Usage
  - L1 cache consumes application memory (~100MB)
  - Redis memory usage (~1GB)
  - Need monitoring and alerting
- Complexity
  - Cache invalidation is hard (“one of two hard problems”)
  - Debugging cache-related issues
  - More moving parts (Redis dependency)
- Cache Stampede Risk
  - Thundering herd on cache expiration
  - Need locking mechanisms (see the lock sketch after the Mitigations list)
Mitigations
- Short TTLs: Start with 1-5 min, increase based on metrics
- Tiered Rollout: Enable caching incrementally (auth → llm → metrics)
- Cache Warming: Pre-populate cache with common queries
- Monitoring: Alert on low hit rates, high evictions
- Circuit Breaker: Bypass cache if Redis is down (fail-safe)
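For the stampede risk called out above, a per-key lock can ensure only one worker recomputes an expired entry while others wait briefly and retry; this sketch uses a Redis `SET NX` lock, and all names and timeouts are illustrative.

```python
# Hedged sketch of cache stampede prevention: a per-key Redis lock (SET NX)
# ensures only one worker recomputes an expired entry; others retry briefly.
import pickle
import time


def get_with_stampede_guard(redis_client, key, compute, ttl=300,
                            lock_ttl=10, wait=0.1, max_retries=50):
    for _ in range(max_retries):
        raw = redis_client.get(key)
        if raw is not None:
            return pickle.loads(raw)
        # Try to take the recompute lock; only one worker wins.
        if redis_client.set(f"{key}:lock", "1", nx=True, ex=lock_ttl):
            try:
                value = compute()
                redis_client.set(key, pickle.dumps(value), ex=ttl)
                return value
            finally:
                redis_client.delete(f"{key}:lock")
        time.sleep(wait)  # another worker is recomputing; wait and retry
    return compute()  # fail open if the lock never clears
```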
Implementation Plan
Week 1: Foundation
- Create core/cache.py module
- Implement L1 cache (LRU with TTL)
- Implement L2 cache (Redis)
- Add cache metrics and observability
- Write 40+ unit tests
Week 2: Integration - Auth Layer
- Cache OpenFGA authorization checks (5 min TTL)
- Cache user profile lookups (15 min TTL)
- Cache session lookups (already in Redis, optimize)
- Measure hit rate, tune TTLs
Week 3: Integration - LLM Layer
- Cache LLM responses with semantic similarity
- Implement Anthropic prompt caching
- Implement Gemini context caching
- Cache embedding generation results
- Measure cost savings
Week 4: Integration - Metrics Layer
- Cache Prometheus query results (1 min TTL)
- Cache SLA metrics calculations
- Cache compliance evidence queries
- Measure latency reduction
Week 5: Optimization & Rollout
- Implement cache stampede prevention
- Add cache warming for common queries
- Performance testing: 1000 req/s load
- Deploy to production (gradual rollout)
Testing Strategy
Unit Tests
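Unit tests could cover key building, hit/miss behavior, and single-computation guarantees; the sketch below assumes the illustrative `make_cache_key` and `MultiLayerCache` helpers from earlier and uses `fakeredis` so no live Redis is needed.

```python
# Hedged unit-test sketch using pytest-style tests and fakeredis; the helpers
# under test are the illustrative sketches above, not the real module API.
import fakeredis

# from mcp_server_langgraph.core.cache import MultiLayerCache, make_cache_key  # illustrative


def test_make_cache_key_is_stable_and_bounded():
    key = make_cache_key("authz", "user:alice", "doc:123", "reader")
    assert key == make_cache_key("authz", "user:alice", "doc:123", "reader")
    assert len(key) <= 250


def test_multilayer_cache_computes_once():
    cache = MultiLayerCache(fakeredis.FakeRedis())
    calls = []

    def compute():
        calls.append(1)
        return "value"

    assert cache.get_or_compute("k:v1", compute) == "value"
    assert cache.get_or_compute("k:v1", compute) == "value"  # served from cache
    assert len(calls) == 1
```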
Performance Tests
References
- Redis Caching Best Practices: https://redis.io/docs/manual/client-side-caching/
- Anthropic Prompt Caching: https://docs.anthropic.com/claude/docs/prompt-caching
- Google Gemini Caching: https://ai.google.dev/gemini-api/docs/caching
- cachetools Library: https://github.com/tkem/cachetools
- Cache Stampede Prevention: https://en.wikipedia.org/wiki/Cache_stampede
Success Metrics
Performance
- Target: 30% P95 latency reduction
- Baseline: Measure current P95 latency
- Target: P95 < 500ms after caching
Cost Savings
- Target: 20% reduction in LLM API costs
- Measurement: Monthly LLM spend before/after
Cache Hit Rate
- Target: > 60% overall hit rate
- Measurement: cache_hits / (cache_hits + cache_misses)
User Experience
- Target: < 2% increase in stale data complaints
- Measurement: User feedback, support tickets
Last Updated: 2025-10-20
Next Review: 2025-11-20 (after 1 month in production)