
Grafana Dashboards

Install Grafana

## Add Grafana repo
helm repo add grafana https://grafana.github.io/helm-charts

## Install Grafana
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set persistence.size=10Gi \
  --set adminPassword=admin123  # demo value; use a Kubernetes secret in production
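After installation, you can read back the admin password and expose the UI locally. This sketch assumes the chart's defaults for a release named grafana in the monitoring namespace (secret name grafana, key admin-password, service port 80):

kubectl get secret grafana -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 --decode; echo

## Forward the Grafana service to http://localhost:3000
kubectl port-forward svc/grafana 3000:80 -n monitoring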

Application Dashboard

Import this JSON dashboard:
{
  "dashboard": {
    "title": "MCP Server - Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(http_requests_total[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
        }],
        "type": "graph"
      },
      {
        "title": "LLM Token Usage",
        "targets": [{
          "expr": "rate(llm_tokens_total[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Active Sessions",
        "targets": [{
          "expr": "active_sessions"
        }],
        "type": "stat"
      },
      {
        "title": "OpenFGA Check Rate",
        "targets": [{
          "expr": "rate(openfga_checks_total[5m])"
        }],
        "type": "graph"
      }
    ]
  }
}
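Before importing, it is worth validating the JSON, since Grafana rejects malformed files with a generic error. A minimal sketch using only python3's stdlib (the /tmp path and file contents stand in for a real dashboard file):

```shell
# Stand-in for one of the dashboard JSON files you are about to import
cat > /tmp/dashboard.json <<'EOF'
{"dashboard": {"title": "MCP Server - Application Metrics", "panels": []}}
EOF

# json.tool exits non-zero (and prints the parse error) on malformed input
python3 -m json.tool /tmp/dashboard.json > /dev/null && echo "valid JSON"
```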

LLM Observability Dashboard

{
  "dashboard": {
    "title": "MCP Server - LLM Metrics",
    "panels": [
      {
        "title": "LLM Requests by Provider",
        "targets": [{
          "expr": "rate(llm_requests_total[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Token Cost per Hour",
        "targets": [{
          "expr": "sum(increase(llm_tokens_total{type=\"prompt\"}[1h])) * 0.003 + sum(increase(llm_tokens_total{type=\"completion\"}[1h])) * 0.015"
        }],
        "type": "stat"
      },
      {
        "title": "LLM Error Rate",
        "targets": [{
          "expr": "rate(llm_requests_total{status=\"error\"}[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Average Response Tokens",
        "targets": [{
          "expr": "rate(llm_tokens_total{type=\"completion\"}[5m]) / rate(llm_requests_total[5m])"
        }],
        "type": "stat"
      }
    ]
  }
}
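The cost panel's arithmetic can be sanity-checked offline. This awk sketch multiplies hourly token counts by the same constants the panel uses; the token counts are assumed figures, and note the constants read as price per token, while most providers quote per 1K tokens, so scale for your pricing:

```shell
# Hypothetical hourly token counts (illustration only)
prompt_tokens=120000
completion_tokens=30000

awk -v p="$prompt_tokens" -v c="$completion_tokens" \
  'BEGIN { printf "cost/hour: $%.2f\n", p * 0.003 + c * 0.015 }'
```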

Production-Ready Dashboards (v2.1.0)

NEW in v2.1.0 - Seven production-ready Grafana dashboards covering authentication, authorization, LLM performance, and infrastructure metrics.
The repository ships these pre-built dashboards, optimized for production monitoring, under monitoring/grafana/dashboards/.

Authentication

authentication.json
  • Login activity rate (attempts, success, failures)
  • Login failure rate gauge with thresholds
  • Response time percentiles (p50, p95, p99)
  • Active sessions count
  • Token operations (create, verify, refresh)
  • JWKS cache performance

OpenFGA Authorization

openfga.json
  • Authorization check rate (total, allowed, denied)
  • Denial rate gauge
  • Total relationship tuples
  • Check latency percentiles
  • Tuple write operations
  • Role sync operations and latency

LLM Performance

llm-performance.json
  • Agent call rate (successful/failed)
  • Error rate gauge
  • Response time percentiles
  • Tool calls rate
  • LLM invocations by model
  • Fallback model usage

Keycloak SSO

keycloak.json
  • Service status gauge
  • Response time (p50, p95, p99)
  • Login request rate
  • Error rates (login, token refresh)
  • Active sessions and users
  • Resource utilization (CPU, memory)

Redis Sessions

redis-sessions.json
  • Service status and memory usage
  • Active sessions (key count)
  • Operations rate (commands/sec)
  • Connection pool utilization
  • Session evictions
  • Memory fragmentation ratio

Security

security.json
  • Auth/AuthZ failures per second
  • JWT validation errors
  • Security status gauge
  • Failures by reason and resource
  • Failed attempts by user/IP
  • Top 10 violators table

Overview

mcp-server-langgraph.json
  • Service status uptime gauge
  • Request rate by tool
  • Error rate percentage
  • Response time percentiles
  • Memory and CPU usage per pod
  • Request success/failure count
Import Dashboards
Option 1: Grafana UI (Manual)
  1. Open Grafana at http://localhost:3000
  2. Navigate to Dashboards → Import
  3. Click Upload JSON file
  4. Select dashboard file from monitoring/grafana/dashboards/
  5. Select Prometheus datasource
  6. Click Import
Repeat for each dashboard you want to use.

Option 2: Kubernetes ConfigMap (Automated)
# Create ConfigMap from dashboard files
kubectl create configmap grafana-dashboards \
  --from-file=monitoring/grafana/dashboards/ \
  -n monitoring
Then mount the ConfigMap in your Grafana deployment by adding volumeMounts and volumes to the deployment manifest.

Option 3: Helm Chart Configuration

Configure dashboards in values.yaml:
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'mcp-server-langgraph'
        orgId: 1
        folder: 'MCP Server with LangGraph'
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/langgraph

  dashboards:
    mcp-server-langgraph:
      authentication:
        file: monitoring/grafana/dashboards/authentication.json
      openfga:
        file: monitoring/grafana/dashboards/openfga.json
      llm-performance:
        file: monitoring/grafana/dashboards/llm-performance.json
      overview:
        file: monitoring/grafana/dashboards/mcp-server-langgraph.json
      security:
        file: monitoring/grafana/dashboards/security.json
      keycloak:
        file: monitoring/grafana/dashboards/keycloak.json
      redis-sessions:
        file: monitoring/grafana/dashboards/redis-sessions.json
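A scripted alternative to the manual UI import posts each file to Grafana's HTTP API. This sketch assumes Grafana is reachable on localhost:3000 with the admin credentials set at install time, and that each file wraps its panels in a top-level "dashboard" object as shown earlier:

for f in monitoring/grafana/dashboards/*.json; do
  # Add "overwrite": true so re-runs update existing dashboards
  curl -s -X POST http://localhost:3000/api/dashboards/db \
    -u admin:admin123 \
    -H "Content-Type: application/json" \
    -d "$(python3 -c "import json,sys; d=json.load(open(sys.argv[1])); d['overwrite']=True; print(json.dumps(d))" "$f")"
  echo
done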
Dashboard Features
All production dashboards include:
  • Auto-refresh - 10-second refresh rate for real-time monitoring
  • Time range presets - Last 5m, 15m, 1h, 6h, 24h, 7d
  • Thresholds - Color-coded gauges (green/yellow/red)
  • Cross-links - Navigate between related dashboards
  • Legend tables - Current, max, and mean values
  • Panel descriptions - Hover tooltips explaining metrics
Required Metrics
Ensure these metrics are exposed by the application:

Authentication (authentication.json):
up{job="mcp-server-langgraph"}
auth_login_attempts_total
auth_login_success_total
auth_login_failed_total
auth_login_duration_bucket
token_created_total
token_verified_total
token_refreshed_total
session_active_count
jwks_cache_hits_total
jwks_cache_misses_total
OpenFGA (openfga.json):
up{job="openfga"}
authz_checks_total
authz_successes_total
authz_failures_total
authz_check_duration_bucket
openfga_tuple_count
openfga_tuples_written_total
openfga_tuples_deleted_total
openfga_sync_operations_total
openfga_sync_duration_bucket
LLM Performance (llm-performance.json):
agent_calls_successful_total
agent_calls_failed_total
agent_response_duration_bucket
agent_tool_calls_total
## With labels: model, operation, tool
Keycloak & Redis:
up{job="keycloak"}
up{job="redis-session"}
keycloak_request_duration_bucket
keycloak_login_attempts_total
redis_memory_used_bytes
redis_db_keys
redis_commands_processed_total
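A quick way to verify exposure is to grep the application's /metrics output for the required names. The heredoc below is a canned stand-in for the real exposition (replace it with something like `curl -s localhost:8000/metrics`; the port is an assumption), and the metric list is truncated for brevity:

```shell
# Stand-in for the live Prometheus exposition
metrics=$(cat <<'EOF'
auth_login_attempts_total 42
auth_login_success_total 40
session_active_count 3
EOF
)

for m in auth_login_attempts_total auth_login_success_total auth_login_failed_total; do
  if echo "$metrics" | grep -q "^$m"; then
    echo "OK      $m"
  else
    echo "MISSING $m"
  fi
done
```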

Service Level Objectives (SLOs)

NEW in v2.1.0 - Pre-computed SLO metrics via Prometheus recording rules for efficient monitoring and alerting.

SLO Recording Rules

The monitoring/prometheus/rules/slo-recording-rules.yaml file contains 40+ recording rules that pre-compute Service Level Indicators (SLIs) for fast querying in Grafana.

Load the recording rules:
## Kubernetes with Prometheus Operator
kubectl apply -f monitoring/prometheus/rules/slo-recording-rules.yaml

## Docker Compose
## Add to prometheus.yml:
rule_files:
  - /etc/prometheus/rules/slo-recording-rules.yaml
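Before loading, the rule file can be validated with promtool, which ships with Prometheus:

promtool check rules monitoring/prometheus/rules/slo-recording-rules.yaml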

Available SLO Metrics

  • Availability
  • Latency
  • Error Rate
  • Saturation
  • Error Budget
  • Compliance

Availability (target: 99.9% uptime):
# Overall service availability
job:up:avg

# Component availability
job:up:avg:keycloak      # Target: 99.5%
job:up:avg:openfga       # Target: 99.5%
job:up:avg:redis_session # Target: 99.9%
Usage in Grafana:
# Current availability
job:up:avg * 100

# Downtime minutes per month
(1 - job:up:avg) * 43200
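The 43200 constant is the number of minutes in a 30-day month (30 × 24 × 60). This sketch evaluates the downtime formula for an assumed availability figure:

```shell
# Assumed measured availability (job:up:avg) of 0.999
awk -v avail=0.999 \
  'BEGIN { printf "downtime: %.1f min/month\n", (1 - avail) * 43200 }'
```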

SLO Dashboard Example

Create an SLO summary dashboard:
{
  "dashboard": {
    "title": "SLO Compliance - MCP Server",
    "panels": [
      {
        "title": "Availability SLO (99.9% target)",
        "targets": [{
          "expr": "job:up:avg * 100"
        }],
        "thresholds": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 99.5},
          {"color": "green", "value": 99.9}
        ],
        "type": "gauge"
      },
      {
        "title": "Error Rate SLO (< 1% target)",
        "targets": [{
          "expr": "job:agent_calls:error_rate * 100"
        }],
        "thresholds": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 0.5},
          {"color": "red", "value": 1.0}
        ],
        "type": "gauge"
      },
      {
        "title": "Latency SLO (p95 < 2s)",
        "targets": [{
          "expr": "job:agent_response_duration:p95"
        }],
        "thresholds": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 1500},
          {"color": "red", "value": 2000}
        ],
        "type": "gauge",
        "unit": "ms"
      },
      {
        "title": "Error Budget Burn Rate (1h window)",
        "targets": [{
          "expr": "job:error_budget:burn_rate_1h"
        }],
        "thresholds": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 2},
          {"color": "red", "value": 14.4}
        ],
        "type": "graph"
      }
    ]
  }
}
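The 14.4 threshold comes from multi-window burn-rate alerting: consuming the 30-day error budget 14.4× faster than sustainable exhausts it in roughly two days, which is why it marks the red zone. A quick check of that arithmetic (window and threshold choices follow common SRE practice, not anything specific to this repository):

```shell
awk 'BEGIN {
  burn = 14.4                      # red threshold from the panel above
  printf "30-day budget exhausted in %.1f days\n", 30 / burn
}'
```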

Benefits of SLO Recording Rules

  1. Performance - Pre-computed metrics query 10-100x faster
  2. Consistency - Same calculation across all dashboards
  3. Alerting - Alert on SLO violations, not raw metrics
  4. Reporting - Historical SLO compliance tracking
  5. Error Budgets - Multi-window burn rate detection

Next Steps