Cache Management Strategy
Last Updated: 2025-11-17
Status: Active Policy
Related: CI/CD Strategy ADR, CI/CD Failure Prevention, Coverage Threshold Philosophy
Overview
GitHub Actions caching can significantly speed up CI/CD pipelines but can also cause subtle issues when caches become stale or corrupted. This document outlines strategies to prevent and resolve cache-related problems.
Problem Statement
What Happened (2025-11-17)
Symptom: uv pip check failing in CI with dependency conflicts, but passing locally:
- Cache key based on pyproject.toml and uv.lock hashes
- Cache restored from previous run with different dependency state
- uv sync --frozen installed from lockfile but cache had old packages
- Result: Frankensteined environment with mixed old/new dependencies
Prevention Strategies
1. Cache Versioning (Primary Defense)
Implementation: Use a version prefix in cache keys
When to bump the cache version:
- ✅ After major dependency updates (e.g., Python version, uv version)
- ✅ When dependency conflicts persist across multiple runs
- ✅ After significant changes to dependency resolution logic
- ✅ When cache behavior seems inconsistent between local and CI
- ❌ For every PR (defeats purpose of caching)
- ❌ Before investigating root cause
- ❌ As first resort for CI failures
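A minimal sketch of a versioned cache key, assuming actions/cache@v4 and the default uv cache directory on Linux runners (~/.cache/uv). The CACHE_VERSION variable name and the job layout are illustrative, not taken from the project's actual workflow.

```yaml
env:
  # Hypothetical variable: bump (v2 -> v3) to abandon all existing caches.
  CACHE_VERSION: v2

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Restore uv cache
        uses: actions/cache@v4
        with:
          path: ~/.cache/uv  # assumed cache location
          # The version sits in front of the hash, so a bump changes every key.
          key: ${{ runner.os }}-uv-${{ env.CACHE_VERSION }}-${{ hashFiles('pyproject.toml', 'uv.lock') }}
```

Because the version is part of the key, entries written under the old version are never matched again and age out on their own.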
2. Dependency Consistency Validation
Implementation: Validate dependencies after installation
- Fail loudly but don’t block: Show conflicts as warnings, let tests determine impact
- Capture output: Display actual conflicts for debugging
- Context matters: CI environment differences may cause false positives
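One way to express "fail loudly but don't block" is a step that captures the uv pip check output and raises a warning annotation instead of failing the job. This is a sketch under the assumption that the step runs after installation in a bash shell; the step name and annotation title are made up for illustration.

```yaml
- name: Validate dependency consistency
  run: |
    # Capture the output so the actual conflicts are visible in the log.
    if ! output=$(uv pip check 2>&1); then
      # Workflow command: shows up as a warning annotation on the run.
      echo "::warning title=Dependency conflicts::uv pip check reported conflicts, see step log"
      echo "$output"
    else
      echo "No dependency conflicts found"
    fi
```

The step always exits successfully, so the test steps that follow decide whether the conflicts actually matter.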
3. Lockfile Validation
Implementation: Verify lockfile is current before using cache
- Prevents using cache with outdated lockfile
- Catches developer errors (forgot to run uv lock)
- Ensures reproducible builds
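A possible ordering of steps, assuming the astral-sh/setup-uv action is used to install uv and that the installed uv release supports uv lock --check (which exits with an error when uv.lock does not match pyproject.toml):

```yaml
steps:
  - uses: actions/checkout@v4
  - uses: astral-sh/setup-uv@v5
  - name: Verify uv.lock is up to date
    # Runs before any cache restore so a stale lockfile fails fast.
    run: uv lock --check
  # cache restore and install steps would follow here
```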
4. Cache Scope Isolation
Implementation: Use different cache keys for different job types
- Prevents cache poisoning between job types
- Allows different dependency extras per job type
- Enables targeted cache invalidation
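Scope isolation could be sketched with a job matrix whose value is embedded in the cache key. The scope names below are hypothetical, and the cache path is the assumed default uv cache directory on Linux runners.

```yaml
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Hypothetical scopes; each one gets its own cache namespace.
        scope: [unit-tests, integration-tests]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: ~/.cache/uv
          # Same files hashed, but the scope keeps the entries separate.
          key: ${{ runner.os }}-uv-v2-${{ matrix.scope }}-${{ hashFiles('uv.lock') }}
```

With this layout a single scope can be invalidated by changing only its portion of the key.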
5. Frozen Installation
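Implementation: Always use the --frozen flag with uv sync
--frozen:
- ✅ Installs exactly what uv.lock specifies and never rewrites it (prevents drift). It does not check that uv.lock is current with pyproject.toml; that is the job of lockfile validation (or of uv sync --locked, which fails on an out-of-date lockfile)
- ✅ Never resolves dependencies (faster, reproducible)
- ✅ Guarantees exact versions from lockfile
Never run uv sync without --frozen in CI!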
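As a workflow step this is a one-liner; the step name is illustrative.

```yaml
- name: Install dependencies from lockfile
  # --frozen: install from uv.lock as-is, without re-resolving or rewriting it.
  run: uv sync --frozen
```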
Detection Strategies
1. Early Failure Signals
Indicators of cache corruption:
2. Automated Cache Health Checks
Implementation: Add periodic cache verification
Resolution Strategies
1. Cache Invalidation (Immediate Fix)
When to use: Cache corruption is confirmed
Steps:
- Increment cache version
- Document the change
- Commit and push
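The first two steps can live in the same place if the version is held in a workflow-level variable with a short history comment next to it. CACHE_VERSION is a hypothetical name, the comment format is only a suggestion, and the v2 entry reflects the changelog at the end of this document.

```yaml
env:
  # Cache version history:
  #   v1 -> v2 (2025-11-17): dependency conflicts in CI that passed locally
  CACHE_VERSION: v2
```

Cache keys that interpolate `${{ env.CACHE_VERSION }}` pick up the new value on the next run after the change is pushed.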
2. Manual Cache Clearing (Nuclear Option)
When to use: Cache versioning doesn’t work or cache is severely corrupted
Steps via GitHub UI:
- Go to repository → Actions → Caches
- Search for affected cache keys
- Delete problematic caches
- Re-run failed workflows
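As an alternative to the UI, the same deletion could be scripted with the GitHub CLI from a manually triggered workflow. This is a sketch, not an existing workflow in the repository: the workflow name and the key input are made up, and it assumes the gh version on the runner includes the gh cache subcommand.

```yaml
name: Clear caches (manual)
on:
  workflow_dispatch:
    inputs:
      key:
        description: Cache key to delete
        required: true
permissions:
  actions: write  # needed to delete caches
jobs:
  clear:
    runs-on: ubuntu-latest
    steps:
      - name: Delete cache by key
        env:
          GH_TOKEN: ${{ github.token }}
          GH_REPO: ${{ github.repository }}
          CACHE_KEY: ${{ inputs.key }}
        run: gh cache delete "$CACHE_KEY"
```

Passing the key through an environment variable rather than interpolating it into the command avoids shell injection through the input field.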
3. Downgrade to Warning (Temporary Workaround)
When to use: Suspected false positive conflicts
Implementation:
Best Practices
Cache Key Design
✅ Good cache key structure:
- $OS: Platform (linux, macos, windows)
- $TOOL: Package manager (uv, pip, poetry)
- $VERSION: Cache schema version (v1, v2, v3…)
- $SCOPE: Job type (unit-tests, integration-tests…)
- $HASH: Dependency files hash
Restore Keys Strategy
Purpose: Fallback cache lookup when the exact key is not found
Recommended pattern: list restore keys from most specific to least specific. The trade-off:
- ✅ More fallbacks = more cache hits
- ❌ More fallbacks = higher risk of stale cache
- Balance: 2-3 restore keys maximum
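A conservative restore-keys setup might look like the following, assuming actions/cache@v4; the key segments (v2, unit-tests) are example values. Each fallback is a prefix of the full key and is tried in order when the exact key misses.

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.cache/uv  # assumed cache location
    key: ${{ runner.os }}-uv-v2-unit-tests-${{ hashFiles('uv.lock') }}
    # The fallback keeps both version and scope, so a version bump still
    # invalidates everything and scopes never share entries.
    restore-keys: |
      ${{ runner.os }}-uv-v2-unit-tests-
```

Adding a broader prefix such as one without the scope would raise the hit rate, at the cost of restoring packages installed for a different job type.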
Cache Retention
Default: GitHub Actions caches expire after 7 days of no access
Custom retention (not supported by GitHub Actions directly):
Monitoring & Alerting
Metrics to Track
1. Cache Hit Rate:
Automated Alerts
Slack notification on repeated cache issues:
Troubleshooting Guide
Issue: “Dependency conflicts detected”
Diagnosis:
- Try cache version bump (quickest)
- If persists, check for actual dependency conflicts
- If still failing, investigate lockfile generation differences
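To tell a stale cache apart from a genuine conflict, it can help to dump the environment whenever the job fails. A sketch of such a step, with an illustrative name, assuming the project environment was created by uv sync:

```yaml
- name: Dump environment for cache debugging
  if: failure()
  run: |
    uv --version
    uv run --frozen python --version
    # Compare this listing against uv.lock to spot leftover packages.
    uv pip list
    uv pip check || true
```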
Issue: “ModuleNotFoundError in CI”
Diagnosis:
- Verify extras match between local and CI
- Check cache version compatibility
- Try fresh cache build (version bump)
Issue: “Tests pass locally but fail in CI”
Diagnosis:
- Ensure Python versions match exactly
- Pin uv version in CI workflow
- Check for environment variable differences
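Pinning both interpreters of the toolchain could look like this, assuming the actions/setup-python and astral-sh/setup-uv actions; the version numbers are placeholders to be replaced with whatever is used locally.

```yaml
- uses: actions/setup-python@v5
  with:
    python-version: "3.12"  # placeholder: match the local Python exactly
- uses: astral-sh/setup-uv@v5
  with:
    version: "0.5.4"  # placeholder: pin to the uv version used locally
```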
Maintenance Schedule
Weekly
- ✅ Review cache hit rates
- ✅ Check for dependency check failures
- ✅ Monitor cache size growth
Monthly
- ✅ Audit cache keys for efficiency
- ✅ Review cache versioning strategy
- ✅ Clean up unused cache scopes
Quarterly
- ✅ Test cache invalidation procedure
- ✅ Review and update this documentation
- ✅ Evaluate new caching strategies
Related Resources
Changelog
2025-11-17: Cache v2 Migration
- Issue: Dependency conflicts in CI (passing locally)
- Root Cause: GitHub Actions cache corruption
- Solution: Bumped cache version v1 → v2
- Result: All dependency checks passing
Future Improvements
- Implement automated cache age monitoring
- Add Slack alerts for cache corruption
- Create dashboard for cache metrics
- Investigate alternative caching strategies (BuildKit, etc.)
Questions? Open an issue or ask in the #engineering Slack channel.