Alerting with Alertmanager
prometheus-rules.yaml:

```yaml
groups:
  - name: langgraph_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} requests/second"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: "P95 latency is {{ $value }}s"

      # LLM failures
      - alert: LLMFailures
        expr: rate(llm_requests_total{status="error"}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LLM requests failing"
          description: "LLM error rate: {{ $value }}"

      # Redis down
      - alert: RedisDown
        expr: up{job="redis"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis is down"
          description: "Redis instance {{ $labels.instance }} is unreachable"

      # High token usage (increase() gives tokens over the window;
      # rate() would give tokens/second, not tokens/hour)
      - alert: HighTokenUsage
        expr: sum(increase(llm_tokens_total[1h])) > 1000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Unusually high token usage"
          description: "Token usage: {{ $value }} tokens/hour"

      # Low session count (possible issue)
      - alert: NoActiveSessions
        expr: active_sessions == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No active sessions"
          description: "No users have had active sessions for 30 minutes"
```
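For the alerts above to fire, the application has to export the metrics they query. As a minimal sketch using the `prometheus_client` library, the counters and gauge referenced by the `LLMFailures`, `HighTokenUsage`, and `NoActiveSessions` rules could be registered like this (the label values and increments are illustrative; wiring into the actual request path is assumed to happen elsewhere):

```python
# Sketch: registering the metrics queried by the alert rules above,
# using the official prometheus_client library.
from prometheus_client import Counter, Gauge

# Queried by LLMFailures; prometheus_client exposes this as llm_requests_total.
LLM_REQUESTS = Counter("llm_requests", "LLM requests by outcome", ["status"])

# Queried by HighTokenUsage; exposed as llm_tokens_total.
LLM_TOKENS = Counter("llm_tokens", "Total LLM tokens consumed")

# Queried by NoActiveSessions.
ACTIVE_SESSIONS = Gauge("active_sessions", "Currently active user sessions")

# Illustrative updates - in practice these happen in request handlers.
LLM_REQUESTS.labels(status="error").inc()
LLM_TOKENS.inc(512)
ACTIVE_SESSIONS.set(3)
```

Note that `prometheus_client` appends the `_total` suffix to counters automatically, which is why the counters are declared without it.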
Alertmanager Configuration
```yaml
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
```
Health Checks
Kubernetes Probes
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: agent
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 30
```
Health Check Implementation
```python
import httpx
from fastapi import FastAPI, status
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health/live")
async def liveness():
    """Liveness probe - is the application running?"""
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    """Readiness probe - can the application serve traffic?"""
    checks = {}

    # Check Redis (redis_client is created during application startup)
    try:
        await redis_client.ping()
        checks["redis"] = "healthy"
    except Exception as e:
        checks["redis"] = f"unhealthy: {e}"

    # httpx module-level get() is synchronous; use an AsyncClient for awaits
    async with httpx.AsyncClient() as client:
        # Check Keycloak
        try:
            response = await client.get(f"{settings.keycloak_url}/health")
            checks["keycloak"] = "healthy" if response.status_code == 200 else "unhealthy"
        except Exception as e:
            checks["keycloak"] = f"unhealthy: {e}"

        # Check OpenFGA
        try:
            response = await client.get(f"{settings.openfga_url}/healthz")
            checks["openfga"] = "healthy" if response.status_code == 200 else "unhealthy"
        except Exception as e:
            checks["openfga"] = f"unhealthy: {e}"

    # Determine overall status
    all_healthy = all(v == "healthy" for v in checks.values())
    return JSONResponse(
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
        status_code=status.HTTP_200_OK if all_healthy else status.HTTP_503_SERVICE_UNAVAILABLE,
    )

@app.get("/health/startup")
async def startup():
    """Startup probe - has the application finished starting?"""
    # app_initialized is set once startup tasks (connections, model loads) finish
    if not app_initialized:
        return JSONResponse(
            content={"status": "starting"},
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
        )
    return {"status": "started"}
```
Best Practices
Always use structured JSON logging for easy parsing:

```python
# Good: structured fields the log pipeline can index and filter on
logger.info(
    "User authenticated",
    user_id=user_id,
    provider="keycloak",
    duration_ms=duration,
)

# Bad: everything interpolated into one opaque string
logger.info(f"User {user_id} authenticated via keycloak in {duration}ms")
```
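If a structured-logging library is not available, the same effect can be approximated with the standard library alone. The sketch below (field names are illustrative) uses a custom `logging.Formatter` that renders each record as a JSON object:

```python
# Sketch: standard-library JSON logging via a custom Formatter.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("agent")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "User authenticated",
    extra={"fields": {"user_id": "u-123", "provider": "keycloak", "duration_ms": 42}},
)
```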
Monitor business KPIs, not just technical metrics:
Conversations per user
Average conversation length
Token cost per user
Tool usage patterns
User retention
Define Service Level Objectives:
Availability: 99.9% uptime
Latency: P95 < 2s
Error rate: < 0.1%
Token budget: < $1000/day
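An SLO only becomes actionable once it is translated into an error budget: the amount of failure the target permits over a window. A small sketch of the arithmetic for the availability SLO above:

```python
# Sketch: converting an availability SLO into an error budget.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for a given availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60

# 99.9% availability leaves roughly 43.2 minutes of downtime per 30-day window.
print(error_budget_minutes(0.999))
```

When the budget is nearly spent, teams typically pause risky deployments until reliability recovers.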
Track requests across all services:

```python
# Generate once per request
request_id = str(uuid.uuid4())

# Pass to all downstream services
headers = {"X-Request-ID": request_id}

# Log with correlation ID
logger.info("Processing request", request_id=request_id)
```
Next Steps
Monitoring Complete: Comprehensive observability with metrics, traces, logs, and alerts!