Grafana Dashboards
Install Grafana
## Add Grafana repo
helm repo add grafana https://grafana.github.io/helm-charts
## Install Grafana
helm install grafana grafana/grafana \
--namespace monitoring \
--set persistence.enabled=true \
--set persistence.size=10Gi \
--set adminPassword=admin123
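Once the pod is ready, you can reach the UI locally with a port-forward (the service name and port 80 are the chart defaults; adjust if you override them):
# Forward Grafana to http://localhost:3000
kubectl port-forward svc/grafana 3000:80 -n monitoring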
Application Dashboard
Import this JSON dashboard:
{
  "dashboard": {
    "title": "MCP Server - Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(http_requests_total[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
        }],
        "type": "graph"
      },
      {
        "title": "LLM Token Usage",
        "targets": [{
          "expr": "rate(llm_tokens_total[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Active Sessions",
        "targets": [{
          "expr": "active_sessions"
        }],
        "type": "stat"
      },
      {
        "title": "OpenFGA Check Rate",
        "targets": [{
          "expr": "rate(openfga_checks_total[5m])"
        }],
        "type": "graph"
      }
    ]
  }
}
LLM Observability Dashboard
{
  "dashboard": {
    "title": "MCP Server - LLM Metrics",
    "panels": [
      {
        "title": "LLM Requests by Provider",
        "targets": [{
          "expr": "rate(llm_requests_total[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Token Cost per Hour",
        "targets": [{
          "expr": "sum(rate(llm_tokens_total{type=\"prompt\"}[1h])) * 0.003 + sum(rate(llm_tokens_total{type=\"completion\"}[1h])) * 0.015"
        }],
        "type": "stat"
      },
      {
        "title": "LLM Error Rate",
        "targets": [{
          "expr": "rate(llm_requests_total{status=\"error\"}[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Average Response Tokens",
        "targets": [{
          "expr": "rate(llm_tokens_total{type=\"completion\"}[5m]) / rate(llm_requests_total[5m])"
        }],
        "type": "stat"
      }
    ]
  }
}
Production-Ready Dashboards (v2.1.0)
NEW in v2.1.0 - 7 production-ready Grafana dashboards covering authentication, authorization, LLM performance, and infrastructure metrics.
The repository includes pre-built Grafana dashboards optimized for production monitoring. All dashboards are located in monitoring/grafana/dashboards/.
Authentication (authentication.json)
Login activity rate (attempts, success, failures)
Login failure rate gauge with thresholds
Response time percentiles (p50, p95, p99)
Active sessions count
Token operations (create, verify, refresh)
JWKS cache performance
OpenFGA Authorization (openfga.json)
Authorization check rate (total, allowed, denied)
Denial rate gauge
Total relationship tuples
Check latency percentiles
Tuple write operations
Role sync operations and latency
LLM Performance (llm-performance.json)
Agent call rate (successful/failed)
Error rate gauge
Response time percentiles
Tool calls rate
LLM invocations by model
Fallback model usage
Keycloak SSO (keycloak.json)
Service status gauge
Response time (p50, p95, p99)
Login request rate
Error rates (login, token refresh)
Active sessions and users
Resource utilization (CPU, memory)
Redis Sessions (redis-sessions.json)
Service status and memory usage
Active sessions (key count)
Operations rate (commands/sec)
Connection pool utilization
Session evictions
Memory fragmentation ratio
Security (security.json)
Auth/AuthZ failures per second
JWT validation errors
Security status gauge
Failures by reason and resource
Failed attempts by user/IP
Top 10 violators table
Overview (mcp-server-langgraph.json)
Service status uptime gauge
Request rate by tool
Error rate percentage
Response time percentiles
Memory and CPU usage per pod
Request success/failure count
Import Dashboards
Option 1: Grafana UI (Manual)
Open Grafana at http://localhost:3000
Navigate to Dashboards → Import
Click Upload JSON file
Select dashboard file from monitoring/grafana/dashboards/
Select Prometheus datasource
Click Import
Repeat for each dashboard you want to use.
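If you prefer to script this step, the same import can be done through Grafana's HTTP API. A rough sketch, assuming Grafana is reachable at http://localhost:3000 with the admin credentials set during installation and that each file contains the raw dashboard JSON (if a file is already wrapped in a top-level "dashboard" key, post it as-is instead):
# Push every dashboard file via the Grafana API (illustrative)
for f in monitoring/grafana/dashboards/*.json; do
  curl -s -X POST http://admin:admin123@localhost:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -d "{\"dashboard\": $(cat "$f"), \"overwrite\": true}"
done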
Option 2: Kubernetes ConfigMap (Automated)
# Create ConfigMap from dashboard files
kubectl create configmap grafana-dashboards \
--from-file=monitoring/grafana/dashboards/ \
-n monitoring
Then mount the ConfigMap in your Grafana deployment by adding volumeMounts and volumes to the deployment manifest.
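A minimal sketch of the two pieces to add (the volume name and mount path are illustrative; Grafana also needs a dashboard provider pointing at that path, as in Option 3):
# Excerpt from the Grafana Deployment pod spec (illustrative names)
containers:
  - name: grafana
    volumeMounts:
      - name: dashboards
        mountPath: /var/lib/grafana/dashboards/langgraph
volumes:
  - name: dashboards
    configMap:
      name: grafana-dashboards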
Option 3: Helm Chart Configuration
Configure dashboards in values.yaml:
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'mcp-server-langgraph'
          orgId: 1
          folder: 'MCP Server with LangGraph'
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/langgraph
  dashboards:
    mcp-server-langgraph:
      authentication:
        file: monitoring/grafana/dashboards/authentication.json
      openfga:
        file: monitoring/grafana/dashboards/openfga.json
      llm-performance:
        file: monitoring/grafana/dashboards/llm-performance.json
      overview:
        file: monitoring/grafana/dashboards/mcp-server-langgraph.json
      security:
        file: monitoring/grafana/dashboards/security.json
      keycloak:
        file: monitoring/grafana/dashboards/keycloak.json
      redis-sessions:
        file: monitoring/grafana/dashboards/redis-sessions.json
Dashboard Features
All production dashboards include:
Auto-refresh - 10-second refresh rate for real-time monitoring
Time range presets - Last 5m, 15m, 1h, 6h, 24h, 7d
Thresholds - Color-coded gauges (green/yellow/red)
Cross-links - Navigate between related dashboards
Legend tables - Current, max, and mean values
Panel descriptions - Hover tooltips explaining metrics
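In the dashboard JSON, the auto-refresh rate and time range correspond to top-level fields along these lines (illustrative excerpt, not the full schema):
{
  "refresh": "10s",
  "time": { "from": "now-1h", "to": "now" },
  "timepicker": { "refresh_intervals": ["10s", "30s", "1m", "5m"] }
}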
Required Metrics
Ensure these metrics are exposed by the application:
Authentication (authentication.json):
up{job="mcp-server-langgraph"}
auth_login_attempts_total
auth_login_success_total
auth_login_failed_total
auth_login_duration_bucket
token_created_total
token_verified_total
token_refreshed_total
session_active_count
jwks_cache_hits_total
jwks_cache_misses_total
OpenFGA (openfga.json):
up{job="openfga"}
authz_checks_total
authz_successes_total
authz_failures_total
authz_check_duration_bucket
openfga_tuple_count
openfga_tuples_written_total
openfga_tuples_deleted_total
openfga_sync_operations_total
openfga_sync_duration_bucket
LLM Performance (llm-performance.json):
agent_calls_successful_total
agent_calls_failed_total
agent_response_duration_bucket
agent_tool_calls_total
## With labels: model, operation, tool
Keycloak & Redis:
up{job="keycloak"}
up{job="redis-session"}
keycloak_request_duration_bucket
keycloak_login_attempts_total
redis_memory_used_bytes
redis_db_keys
redis_commands_processed_total
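Before wiring up panels, it can help to confirm the application is actually exporting these metrics. A quick check (the deployment name, port, and /metrics path are assumptions; use whatever your deployment exposes for Prometheus scraping):
# Port-forward the application and grep its metrics endpoint (illustrative names)
kubectl port-forward deploy/mcp-server-langgraph 8000:8000 -n monitoring &
sleep 2
curl -s http://localhost:8000/metrics | grep -E "auth_login|authz_checks|agent_"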
Service Level Objectives (SLOs)
NEW in v2.1.0 - Pre-computed SLO metrics via Prometheus recording rules for efficient monitoring and alerting.
SLO Recording Rules
The monitoring/prometheus/rules/slo-recording-rules.yaml file contains 40+ recording rules that pre-compute Service Level Indicators (SLIs) for fast querying in Grafana.
Load recording rules:
## Kubernetes with Prometheus Operator
kubectl apply -f monitoring/prometheus/rules/slo-recording-rules.yaml
## Docker Compose
## Add to prometheus.yml:
rule_files:
  - /etc/prometheus/rules/slo-recording-rules.yaml
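For reference, a recording rule in that file looks roughly like this (shown in plain Prometheus rule-file form; under the Prometheus Operator the same groups sit inside a PrometheusRule resource's spec, and the exact expressions are defined in slo-recording-rules.yaml):
# Illustrative sketch of one availability recording rule
groups:
  - name: slo-availability
    interval: 30s
    rules:
      - record: job:up:avg
        expr: avg(up{job="mcp-server-langgraph"})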
Available SLO Metrics
Availability
Target: 99.9% uptime
# Overall service availability
job:up:avg
# Component availability
job:up:avg:keycloak # Target: 99.5%
job:up:avg:openfga # Target: 99.5%
job:up:avg:redis_session # Target: 99.9%
Usage in Grafana:
# Current availability
job:up:avg * 100
# Downtime minutes per month
(1 - job:up:avg) * 43200
Latency
Target: p95 < 2s, p99 < 5s
# Agent response time
job:agent_response_duration:p95 # Target: 2000ms
job:agent_response_duration:p99 # Target: 5000ms
# Authentication latency
job:auth_login_duration:p95 # Target: 500ms
# Authorization latency
job:authz_check_duration:p95 # Target: 100ms
# Keycloak latency
job:keycloak_request_duration:p95 # Target: 1000ms
Usage in Grafana:
# P95 latency vs target
job:agent_response_duration:p95 / 2000 * 100
Error Rate
Target: < 1% errors (99% success)
# Overall error rate
job:agent_calls:error_rate # Target: 0.01 (1%)
# Authentication failures
job:auth_login:error_rate # Target: 0.05 (5%)
# Authorization denials
job:authz_checks:denial_rate # Target: 0.15 (15%)
# Token verification failures
job:token_verification:error_rate # Target: 0.02 (2%)
# LLM fallback rate
job:llm_fallback:rate # Target: 0.10 (10%)
Usage in Grafana:
# Error rate as percentage
job:agent_calls:error_rate * 100
# Success rate
(1 - job:agent_calls:error_rate) * 100
Saturation
Target: < 80% CPU, < 90% memory
# Memory saturation
job:memory:saturation # Target: 0.90 (90%)
# CPU saturation
job:cpu:saturation # Target: 0.80 (80%)
# Redis memory saturation
job:redis_memory:saturation # Target: 0.90 (90%)
# Redis connection pool
job:redis_pool:saturation # Target: 0.95 (95%)
Usage in Grafana:
# Memory pressure
job:memory:saturation * 100
# Available memory
(1 - job:memory:saturation) * 100
Error Budget
Burn rate detection across multiple windows
# Fast burn (1 hour window)
job:error_budget:burn_rate_1h
# Medium burn (6 hour window)
job:error_budget:burn_rate_6h
# Slow burn (3 day window)
job:error_budget:burn_rate_3d
Interpretation :
Burn rate = 1.0: Consuming error budget at expected rate
Burn rate > 1.0: Consuming faster (alert!)
Burn rate < 1.0: Consuming slower (healthy)
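Concretely, the burn rate is the observed error rate divided by the rate the SLO allows: with a 99.9% target the error budget is 0.1%, so an hour running at a 1.44% error rate burns the budget at 0.0144 / 0.001 = 14.4x, enough to exhaust a 30-day budget in roughly two days. That is the threshold used in the alert below.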
Alert on fast burn:
- alert: FastErrorBudgetBurn
  expr: job:error_budget:burn_rate_1h > 14.4
  for: 5m
  annotations:
    summary: "Error budget burning 14.4x faster than sustainable"
Compliance
30-day rolling window SLO compliance
# Availability compliance
job:slo_compliance:availability_30d # Target: 99.9%
# Latency compliance (% within SLO)
job:slo_compliance:latency_p95_30d # Target: 95%
# Error rate compliance (success rate)
job:slo_compliance:error_rate_30d # Target: 99%
Usage in Grafana:
# Monthly SLO report
job:slo_compliance:availability_30d * 100
SLO Dashboard Example
Create an SLO summary dashboard:
{
  "dashboard": {
    "title": "SLO Compliance - MCP Server",
    "panels": [
      {
        "title": "Availability SLO (99.9% target)",
        "targets": [{
          "expr": "job:up:avg * 100"
        }],
        "thresholds": [
          { "color": "red", "value": 0 },
          { "color": "yellow", "value": 99.5 },
          { "color": "green", "value": 99.9 }
        ],
        "type": "gauge"
      },
      {
        "title": "Error Rate SLO (< 1% target)",
        "targets": [{
          "expr": "job:agent_calls:error_rate * 100"
        }],
        "thresholds": [
          { "color": "green", "value": 0 },
          { "color": "yellow", "value": 0.5 },
          { "color": "red", "value": 1.0 }
        ],
        "type": "gauge"
      },
      {
        "title": "Latency SLO (p95 < 2s)",
        "targets": [{
          "expr": "job:agent_response_duration:p95"
        }],
        "thresholds": [
          { "color": "green", "value": 0 },
          { "color": "yellow", "value": 1500 },
          { "color": "red", "value": 2000 }
        ],
        "type": "gauge",
        "unit": "ms"
      },
      {
        "title": "Error Budget Burn Rate (1h window)",
        "targets": [{
          "expr": "job:error_budget:burn_rate_1h"
        }],
        "thresholds": [
          { "color": "green", "value": 0 },
          { "color": "yellow", "value": 2 },
          { "color": "red", "value": 14.4 }
        ],
        "type": "graph"
      }
    ]
  }
}
Benefits of SLO Recording Rules
Performance - Pre-computed metrics query 10-100x faster
Consistency - Same calculation across all dashboards
Alerting - Alert on SLO violations, not raw metrics
Reporting - Historical SLO compliance tracking
Error Budgets - Multi-window burn rate detection
Next Steps