67. Grafana LGTM Stack Migration
Date: 2025-12-07
Status: Accepted
Category: Infrastructure
Context
The test infrastructure used multiple observability tools that were complex to manage.
Previous Stack (Multi-vendor):
- Prometheus - Metrics collection and storage
- Alertmanager - Alert routing and management
- Jaeger - Distributed tracing
- Promtail - Log collection agent
- Loki - Log aggregation
Pain points:
- Multiple configuration formats (Prometheus YAML, Jaeger YAML, Promtail YAML)
- Multiple container images to maintain
- Separate alerting configuration
- No unified query interface
Decision
Migrate to the Grafana LGTM Stack - a unified observability platform:

| Component | Role | Replaces |
|---|---|---|
| Loki | Log aggregation | Promtail (collection moved to Alloy) |
| Grafana | Dashboards + Unified Alerting | Alertmanager |
| Tempo | Distributed tracing | Jaeger |
| Mimir | Metrics storage | Prometheus (storage only, scraping moved to Alloy) |
| Alloy | Unified collector | Prometheus (scraping) + Promtail + OTEL Collector |
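As a rough sketch of how one Alloy configuration can cover all three collection paths (metrics scraping, log shipping, and OTLP trace ingestion). The service names mimir-test, loki, and tempo-test, the example app target, and the ports are assumptions, not taken from the actual config:

```river
// Scrape app metrics and remote-write them to Mimir (replaces Prometheus scraping)
prometheus.scrape "apps" {
  targets    = [{ "__address__" = "app:8080" }]
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-test:9009/api/v1/push"
  }
}

// Discover containers and ship their logs to Loki (replaces Promtail)
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

// Accept OTLP traces and export them to Tempo (replaces the OTEL Collector)
otelcol.receiver.otlp "default" {
  grpc {}
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo-test:4317"
    tls {
      insecure = true
    }
  }
}
```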
Architecture
Key Changes
docker-compose.test.yml services:

| Old Service | New Service | Purpose |
|---|---|---|
| prometheus | mimir-test | Metrics storage |
| alertmanager | Removed | Now in Grafana Unified Alerting |
| jaeger | tempo-test | Distributed tracing |
| promtail | Removed | Now in Alloy |
| N/A | alloy-test | Unified telemetry collector |
Configuration files:

| Old File | New File |
|---|---|
| docker/prometheus/prometheus.yml | docker/alloy/alloy-config.alloy |
| docker/promtail/promtail-config.yml | docker/alloy/alloy-config.alloy |
| docker/alertmanager/alertmanager.yml | monitoring/grafana/alerting/ |
| docker/jaeger/jaeger-config.yml | docker/tempo/tempo-config.yaml |
| N/A | docker/mimir/mimir-config.yaml |
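A minimal sketch of how these pieces might be wired together in docker-compose.test.yml, using the config paths from the table above. Image tags, mounts, and service names other than those in the tables are illustrative assumptions:

```yaml
services:
  loki:
    image: grafana/loki:latest
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - ./docker/loki/loki-config.yaml:/etc/loki/loki-config.yaml:ro

  tempo-test:
    image: grafana/tempo:latest
    command: -config.file=/etc/tempo/tempo-config.yaml
    volumes:
      - ./docker/tempo/tempo-config.yaml:/etc/tempo/tempo-config.yaml:ro

  mimir-test:
    image: grafana/mimir:latest
    command: -config.file=/etc/mimir/mimir-config.yaml
    volumes:
      - ./docker/mimir/mimir-config.yaml:/etc/mimir/mimir-config.yaml:ro

  alloy-test:
    image: grafana/alloy:latest
    command: run /etc/alloy/alloy-config.alloy
    volumes:
      - ./docker/alloy/alloy-config.alloy:/etc/alloy/alloy-config.alloy:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./monitoring/grafana/datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro
```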
Consequences
Positive
- Unified Configuration: Single Alloy config for all telemetry collection
- Single Query Interface: All telemetry in Grafana (logs, traces, metrics)
- Simpler Alerting: Grafana Unified Alerting replaces separate Alertmanager
- Better Trace-Log Correlation: Native exemplar support between Tempo and Loki
- Reduced Container Count: 4 services instead of 5 (Alloy consolidates collection)
- GCP Native Integration: Mimir/Tempo support GCS backends for production
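For the GCS-backed production setup mentioned above, the storage sections might look roughly like this (a sketch; bucket names are hypothetical):

```yaml
# docker/mimir/mimir-config.yaml - production sketch
blocks_storage:
  backend: gcs
  gcs:
    bucket_name: example-metrics-blocks   # hypothetical bucket

# docker/tempo/tempo-config.yaml - production sketch
storage:
  trace:
    backend: gcs
    gcs:
      bucket_name: example-trace-blocks   # hypothetical bucket
```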
Negative
- Learning Curve: New Alloy configuration syntax (River DSL)
- Health Check Complexity: Minimal/distroless images require custom health checks
- Initial Setup: More upfront configuration for Mimir (vs simple Prometheus)
Lessons Learned
Distroless Container Health Checks
Several LGTM images lack common tools like wget/curl:
| Image | Has wget/curl? | Health Check Pattern |
|---|---|---|
| grafana/alloy | No | Bash TCP: bash -c '</dev/tcp/localhost/PORT' |
| grafana/mimir | No (distroless) | Disable: healthcheck: { disable: true } |
| grafana/tempo | Yes (wget) | Standard: wget --spider -q URL |
| grafana/loki | Yes (wget) | Standard: wget --spider -q URL |
See .claude/memory/distroless-container-healthchecks.md for patterns.
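In docker-compose terms, the two non-standard patterns might look like this (a sketch; 12345 is Alloy's default UI port, and intervals are arbitrary):

```yaml
  alloy-test:
    healthcheck:
      # No wget/curl in the image, but bash is present - probe the UI port via /dev/tcp
      test: ["CMD", "bash", "-c", "</dev/tcp/localhost/12345"]
      interval: 10s
      retries: 5

  mimir-test:
    healthcheck:
      # Distroless image: no shell at all, so disable the container-level check
      disable: true
```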
Loki Timestamp Rejection
Alloy batches logs before sending, which can cause Loki to reject logs with "timestamp too old" errors in test environments. Fixed by disabling reject_old_samples in docker/loki/loki-config.yaml.
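The relevant fragment of docker/loki/loki-config.yaml might look like this (a sketch of the setting named above; other limits omitted):

```yaml
limits_config:
  # Accept entries with older timestamps, since Alloy batches logs
  # before pushing them to Loki in the test environment
  reject_old_samples: false
```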
Gateway Routes
All services are accessible via the Traefik gateway at http://localhost/:
| Route | Service | Purpose |
|---|---|---|
| /dashboards | Grafana | Unified dashboards + alerting |
| /logs | Loki | Log API |
| /tempo | Tempo | Trace API |
| /mimir | Mimir | Metrics API |
| /alloy | Alloy | Collector UI |
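As an illustration of one such route (a sketch assuming Traefik's Docker provider and Grafana's default port 3000; label and variable values are illustrative, not taken from the actual compose file):

```yaml
  grafana:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=PathPrefix(`/dashboards`)"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
    environment:
      # Grafana needs to know it is served from a sub-path behind the gateway
      - GF_SERVER_ROOT_URL=http://localhost/dashboards
      - GF_SERVER_SERVE_FROM_SUB_PATH=true
```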
Related ADRs
Files Changed
- docker-compose.test.yml - New LGTM services
- docker/alloy/alloy-config.alloy - Unified collector config
- docker/tempo/tempo-config.yaml - Trace storage config
- docker/mimir/mimir-config.yaml - Metrics storage config
- docker/loki/loki-config.yaml - Log storage config (updated)
- monitoring/grafana/datasources.yml - Grafana data source config
- .claude/CLAUDE.md - Updated observability reference
- .claude/memory/distroless-container-healthchecks.md - New lessons learned
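The Grafana provisioning file listed above might look roughly like this (a sketch; service names and ports are illustrative, not taken from the actual file):

```yaml
# monitoring/grafana/datasources.yml - sketch
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus               # Mimir is queried through its Prometheus-compatible API
    url: http://mimir-test:9009/prometheus
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: Tempo
    type: tempo
    url: http://tempo-test:3200
```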