
67. Grafana LGTM Stack Migration

Date: 2025-12-07

Status

Accepted

Category

Infrastructure

Context

The test infrastructure used several separate observability tools, which made it complex to manage.
Previous Stack (Multi-vendor):
  • Prometheus - Metrics collection and storage
  • Alertmanager - Alert routing and management
  • Jaeger - Distributed tracing
  • Promtail - Log collection agent
  • Loki - Log aggregation
This created operational complexity:
  1. Multiple configuration formats (Prometheus YAML, Jaeger YAML, Promtail YAML)
  2. Multiple container images to maintain
  3. Separate alerting configuration
  4. No unified query interface

Decision

Migrate to Grafana LGTM Stack - a unified observability platform:
| Component | Role | Replaces |
| --- | --- | --- |
| Loki | Log aggregation | Promtail (collection moved to Alloy) |
| Grafana | Dashboards + Unified Alerting | Alertmanager |
| Tempo | Distributed tracing | Jaeger |
| Mimir | Metrics storage | Prometheus (storage only, scraping moved to Alloy) |
| Alloy | Unified collector | Prometheus (scraping) + Promtail + OTEL Collector |

Architecture

Applications → OpenTelemetry SDK
                   ↓
              Grafana Alloy (OTLP receiver)
                   │
         ┌─────────┼─────────┐
         ↓         ↓         ↓
       Tempo     Mimir     Loki
      (traces)  (metrics)  (logs)
         └─────────┼─────────┘
                   ↓
               Grafana
         (Unified Dashboards + Alerting)
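
As a rough illustration of the first hop in this diagram, an application container can be pointed at Alloy's OTLP receiver with the standard OpenTelemetry SDK environment variables. This is a sketch only: the app-test service, port 4317 (the conventional OTLP gRPC port), and the config mount path are assumptions, not values taken from docker-compose.test.yml.

```yaml
services:
  app-test:                          # hypothetical application service
    environment:
      OTEL_SERVICE_NAME: app-test    # name shown in Tempo and Grafana
      # Assumes the Alloy config exposes an OTLP gRPC receiver on 4317
      OTEL_EXPORTER_OTLP_ENDPOINT: http://alloy-test:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
    depends_on:
      - alloy-test

  alloy-test:
    image: grafana/alloy
    volumes:
      # Assumed mount target; adjust if the image's run command differs
      - ./docker/alloy/alloy-config.alloy:/etc/alloy/config.alloy:ro
```

Routing everything through one endpoint means applications do not need to know where Tempo, Mimir, or Loki live; only the Alloy config does.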

Key Changes

docker-compose.test.yml services:
| Old Service | New Service | Purpose |
| --- | --- | --- |
| prometheus | mimir-test | Metrics storage |
| alertmanager | Removed | Now in Grafana Unified Alerting |
| jaeger | tempo-test | Distributed tracing |
| promtail | Removed | Now in Alloy |
| N/A | alloy-test | Unified telemetry collector |
Configuration files:
| Old File | New File |
| --- | --- |
| docker/prometheus/prometheus.yml | docker/alloy/alloy-config.alloy |
| docker/promtail/promtail-config.yml | docker/alloy/alloy-config.alloy |
| docker/alertmanager/alertmanager.yml | monitoring/grafana/alerting/ |
| docker/jaeger/jaeger-config.yml | docker/tempo/tempo-config.yaml |
| N/A | docker/mimir/mimir-config.yaml |

Consequences

Positive

  1. Unified Configuration: Single Alloy config for all telemetry collection
  2. Single Query Interface: All telemetry in Grafana (logs, traces, metrics)
  3. Simpler Alerting: Grafana Unified Alerting replaces separate Alertmanager
  4. Better Trace-Log Correlation: Grafana can link Tempo traces to Loki logs natively, and exemplars can link metrics to traces
  5. Reduced Container Count: 4 services instead of 5 (Alloy consolidates collection)
  6. GCP Native Integration: Mimir/Tempo support GCS backends for production
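
The single query interface and the trace-log links mentioned above are typically wired up through Grafana data source provisioning. The fragment below is a sketch of what monitoring/grafana/datasources.yml might contain, not the actual file: the uid values, the loki-test host name, the ports, and the trace_id log pattern are all assumptions.

```yaml
apiVersion: 1
datasources:
  - name: Mimir
    uid: mimir
    type: prometheus                 # Mimir exposes a Prometheus-compatible API
    access: proxy
    url: http://mimir-test:9009/prometheus   # assumed port

  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki-test:3100       # hypothetical service name
    jsonData:
      derivedFields:
        # Turns a trace ID found in a log line into a link to Tempo.
        # The regex assumes JSON logs with a "trace_id" field.
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          datasourceUid: tempo
          url: '$${__value.raw}'

  - name: Tempo
    uid: tempo
    type: tempo
    access: proxy
    url: http://tempo-test:3200      # assumed port
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki          # jump from a span to its logs
```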

Negative

  1. Learning Curve: New Alloy configuration syntax (formerly known as River)
  2. Health Check Complexity: Minimal/distroless images require custom health checks
  3. Initial Setup: More upfront configuration for Mimir (vs simple Prometheus)
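
To give a sense of the extra setup, a single-container Mimir for tests needs roughly the settings below, whereas a plain Prometheus needs little beyond its scrape config. This is a minimal sketch modeled on Mimir's single-binary demo setup; the port and directory paths are assumptions and the real docker/mimir/mimir-config.yaml may differ.

```yaml
# Hypothetical single-instance Mimir config for a test environment.
multitenancy_enabled: false          # no tenant header required on requests

server:
  http_listen_port: 9009             # assumed port

blocks_storage:
  backend: filesystem                # local disk for tests; production would use GCS
  filesystem:
    dir: /data/mimir/blocks
  tsdb:
    dir: /data/mimir/tsdb
  bucket_store:
    sync_dir: /data/mimir/tsdb-sync

compactor:
  data_dir: /data/mimir/compactor

ingester:
  ring:
    replication_factor: 1            # only one ingester exists

store_gateway:
  sharding_ring:
    replication_factor: 1
```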

Lessons Learned

Distroless Container Health Checks

Several LGTM images lack common tools like wget/curl:
| Image | Has wget/curl? | Health Check Pattern |
| --- | --- | --- |
| grafana/alloy | No | Bash TCP: bash -c '</dev/tcp/localhost/PORT' |
| grafana/mimir | No (distroless) | Disable: healthcheck: { disable: true } |
| grafana/tempo | Yes (wget) | Standard: wget --spider -q URL |
| grafana/loki | Yes (wget) | Standard: wget --spider -q URL |
See .claude/memory/distroless-container-healthchecks.md for patterns.
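
The three patterns in the table could look like this in docker-compose.test.yml. This is a sketch under assumptions: the ports (12345 for Alloy's HTTP server, 3200 for Tempo), the /ready path, and the timing values are illustrative and not copied from the real compose file.

```yaml
services:
  alloy-test:
    image: grafana/alloy
    healthcheck:
      # No wget/curl in the image: open a TCP connection with bash instead
      test: ["CMD", "bash", "-c", "</dev/tcp/localhost/12345"]
      interval: 10s
      timeout: 5s
      retries: 5

  mimir-test:
    image: grafana/mimir
    healthcheck:
      disable: true                  # distroless: no shell to run a check with

  tempo-test:
    image: grafana/tempo
    healthcheck:
      # wget is available, so a plain HTTP readiness probe works
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3200/ready"]
      interval: 10s
      timeout: 5s
      retries: 5
```

With the Mimir check disabled, dependents cannot wait on `condition: service_healthy` for that service, so startup ordering has to rely on retries in the clients.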

Loki Timestamp Rejection

Alloy batches logs before sending, which can cause Loki to reject logs with “timestamp too old” errors in test environments. Fixed by disabling reject_old_samples in docker/loki/loki-config.yaml.
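
The relevant setting lives under limits_config in the Loki configuration. A minimal sketch of the change, assuming the rest of docker/loki/loki-config.yaml is left as it was:

```yaml
limits_config:
  # Accept log lines regardless of how old their timestamps are.
  # Suitable for test environments only; production usually keeps
  # rejection enabled and tunes reject_old_samples_max_age instead.
  reject_old_samples: false
```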

Gateway Routes

All services are accessible via the Traefik gateway at http://localhost/:
| Route | Service | Purpose |
| --- | --- | --- |
| /dashboards | Grafana | Unified dashboards + alerting |
| /logs | Loki | Log API |
| /tempo | Tempo | Trace API |
| /mimir | Mimir | Metrics API |
| /alloy | Alloy | Collector UI |
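
One common way to express such a route is with Traefik labels on the service itself. The fragment below sketches the /dashboards route only; the grafana service name, the router name, and the use of labels rather than a file provider are assumptions about this setup.

```yaml
services:
  grafana:                           # hypothetical service name
    image: grafana/grafana
    environment:
      # Grafana must know it is served from a sub-path
      GF_SERVER_ROOT_URL: http://localhost/dashboards
      GF_SERVER_SERVE_FROM_SUB_PATH: "true"
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=PathPrefix(`/dashboards`)"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
```

Services that do not support sub-path serving would instead need a prefix-stripping middleware on their router.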

Files Changed

  • docker-compose.test.yml - New LGTM services
  • docker/alloy/alloy-config.alloy - Unified collector config
  • docker/tempo/tempo-config.yaml - Trace storage config
  • docker/mimir/mimir-config.yaml - Metrics storage config
  • docker/loki/loki-config.yaml - Log storage config (updated)
  • monitoring/grafana/datasources.yml - Grafana data source config
  • .claude/CLAUDE.md - Updated observability reference
  • .claude/memory/distroless-container-healthchecks.md - New lessons learned