67. Grafana LGTM Stack Migration
Date: 2025-12-07
Status: Accepted
Category: Infrastructure
Context
The test infrastructure used multiple observability tools that were complex to manage.
Previous Stack (Multi-vendor):
- Prometheus - Metrics collection and storage
- Alertmanager - Alert routing and management
- Jaeger - Distributed tracing
- Promtail - Log collection agent
- Loki - Log aggregation
Pain points:
- Multiple configuration formats (Prometheus YAML, Jaeger YAML, Promtail YAML)
- Multiple container images to maintain
- Separate alerting configuration
- No unified query interface
Decision
Migrate to the Grafana LGTM Stack - a unified observability platform:

| Component | Role | Replaces |
|---|---|---|
| Loki | Log aggregation | Promtail (collection moved to Alloy) |
| Grafana | Dashboards + Unified Alerting | Alertmanager |
| Tempo | Distributed tracing | Jaeger |
| Mimir | Metrics storage | Prometheus (storage only, scraping moved to Alloy) |
| Alloy | Unified collector | Prometheus (scraping) + Promtail + OTEL Collector |
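As a rough sketch of how one Alloy configuration can cover all three collection paths (metrics scraping, log shipping, and OTLP trace ingestion). The service names mimir-test, loki, and tempo-test, the example app target, and the ports are assumptions, not taken from the actual config:

```river
// Scrape app metrics and remote-write them to Mimir (replaces Prometheus scraping)
prometheus.scrape "apps" {
  targets    = [{ "__address__" = "app:8080" }]
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-test:9009/api/v1/push"
  }
}

// Discover containers and ship their logs to Loki (replaces Promtail)
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

// Accept OTLP traces and export them to Tempo (replaces the OTEL Collector)
otelcol.receiver.otlp "default" {
  grpc {}
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo-test:4317"
    tls {
      insecure = true
    }
  }
}
```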
Architecture
Key Changes
docker-compose.test.yml services:

| Old Service | New Service | Purpose |
|---|---|---|
| prometheus | mimir-test | Metrics storage |
| alertmanager | Removed | Now in Grafana Unified Alerting |
| jaeger | tempo-test | Distributed tracing |
| promtail | Removed | Now in Alloy |
| N/A | alloy-test | Unified telemetry collector |
Configuration files:

| Old File | New File |
|---|---|
| docker/prometheus/prometheus.yml | docker/alloy/alloy-config.alloy |
| docker/promtail/promtail-config.yml | docker/alloy/alloy-config.alloy |
| docker/alertmanager/alertmanager.yml | monitoring/grafana/alerting/ |
| docker/jaeger/jaeger-config.yml | docker/tempo/tempo-config.yaml |
| N/A | docker/mimir/mimir-config.yaml |
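A minimal sketch of how these pieces might be wired together in docker-compose.test.yml, using the config paths from the table above. Image tags, mounts, and service names other than those in the tables are illustrative assumptions:

```yaml
services:
  loki:
    image: grafana/loki:latest
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - ./docker/loki/loki-config.yaml:/etc/loki/loki-config.yaml:ro

  tempo-test:
    image: grafana/tempo:latest
    command: -config.file=/etc/tempo/tempo-config.yaml
    volumes:
      - ./docker/tempo/tempo-config.yaml:/etc/tempo/tempo-config.yaml:ro

  mimir-test:
    image: grafana/mimir:latest
    command: -config.file=/etc/mimir/mimir-config.yaml
    volumes:
      - ./docker/mimir/mimir-config.yaml:/etc/mimir/mimir-config.yaml:ro

  alloy-test:
    image: grafana/alloy:latest
    command: run /etc/alloy/alloy-config.alloy
    volumes:
      - ./docker/alloy/alloy-config.alloy:/etc/alloy/alloy-config.alloy:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./monitoring/grafana/datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro
```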
Consequences
Positive
- Unified Configuration: Single Alloy config for all telemetry collection
- Single Query Interface: All telemetry in Grafana (logs, traces, metrics)
- Simpler Alerting: Grafana Unified Alerting replaces separate Alertmanager
- Better Trace-Log Correlation: Native exemplar support between Tempo and Loki
- Reduced Container Count: 4 services instead of 5 (Alloy consolidates collection)
- GCP Native Integration: Mimir/Tempo support GCS backends for production
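For the GCS-backed production setup mentioned above, the storage sections might look roughly like this (a sketch; bucket names are hypothetical):

```yaml
# docker/mimir/mimir-config.yaml - production sketch
blocks_storage:
  backend: gcs
  gcs:
    bucket_name: example-metrics-blocks   # hypothetical bucket

# docker/tempo/tempo-config.yaml - production sketch
storage:
  trace:
    backend: gcs
    gcs:
      bucket_name: example-trace-blocks   # hypothetical bucket
```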
Negative
- Learning Curve: New Alloy configuration syntax (River DSL)
- Health Check Complexity: Minimal/distroless images require custom health checks
- Initial Setup: More upfront configuration for Mimir (vs simple Prometheus)
Lessons Learned
Distroless Container Health Checks
Several LGTM images lack common tools like wget/curl:
| Image | Has wget/curl? | Health Check Pattern |
|---|---|---|
| grafana/alloy | No | Bash TCP: bash -c '</dev/tcp/localhost/PORT' |
| grafana/mimir | No (distroless) | Disable: healthcheck: { disable: true } |
| grafana/tempo | Yes (wget) | Standard: wget --spider -q URL |
| grafana/loki | Yes (wget) | Standard: wget --spider -q URL |
See .claude/memory/distroless-container-healthchecks.md for patterns.
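In docker-compose terms, the two non-standard patterns might look like this (a sketch; 12345 is Alloy's default UI port, and intervals are arbitrary):

```yaml
  alloy-test:
    healthcheck:
      # No wget/curl in the image, but bash is present - probe the UI port via /dev/tcp
      test: ["CMD", "bash", "-c", "</dev/tcp/localhost/12345"]
      interval: 10s
      retries: 5

  mimir-test:
    healthcheck:
      # Distroless image: no shell at all, so disable the container-level check
      disable: true
```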
Loki Timestamp Rejection
Alloy batches logs before sending, which can cause Loki to reject logs with "timestamp too old" errors in test environments. Fixed by disabling reject_old_samples in docker/loki/loki-config.yaml.
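The relevant fragment of docker/loki/loki-config.yaml might look like this (a sketch of the setting named above; other limits omitted):

```yaml
limits_config:
  # Accept entries with older timestamps, since Alloy batches logs
  # before pushing them to Loki in the test environment
  reject_old_samples: false
```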
Gateway Routes
All services are accessible via the Traefik gateway at http://localhost/:
| Route | Service | Purpose |
|---|---|---|
| /dashboards | Grafana | Unified dashboards + alerting |
| /logs | Loki | Log API |
| /tempo | Tempo | Trace API |
| /mimir | Mimir | Metrics API |
| /alloy | Alloy | Collector UI |
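As an illustration of one such route (a sketch assuming Traefik's Docker provider and Grafana's default port 3000; label and variable values are illustrative, not taken from the actual compose file):

```yaml
  grafana:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=PathPrefix(`/dashboards`)"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
    environment:
      # Grafana needs to know it is served from a sub-path behind the gateway
      - GF_SERVER_ROOT_URL=http://localhost/dashboards
      - GF_SERVER_SERVE_FROM_SUB_PATH=true
```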
Related ADRs
Files Changed
- docker-compose.test.yml - New LGTM services
- docker/alloy/alloy-config.alloy - Unified collector config
- docker/tempo/tempo-config.yaml - Trace storage config
- docker/mimir/mimir-config.yaml - Metrics storage config
- docker/loki/loki-config.yaml - Log storage config (updated)
- monitoring/grafana/datasources.yml - Grafana data source config
- .claude/CLAUDE.md - Updated observability reference
- .claude/memory/distroless-container-healthchecks.md - New lessons learned
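The Grafana provisioning file listed above might look roughly like this (a sketch; service names and ports are illustrative, not taken from the actual file):

```yaml
# monitoring/grafana/datasources.yml - sketch
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus               # Mimir is queried through its Prometheus-compatible API
    url: http://mimir-test:9009/prometheus
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: Tempo
    type: tempo
    url: http://tempo-test:3200
```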