
# Grafana Dashboards

## Install Grafana

```bash
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts

# Install Grafana with persistent storage
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set persistence.size=10Gi \
  --set adminPassword=admin123   # example only; use a strong password in production
```

## Application Dashboard

Import this JSON dashboard:
```json
{
  "dashboard": {
    "title": "MCP Server - Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(http_requests_total[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
        }],
        "type": "graph"
      },
      {
        "title": "LLM Token Usage",
        "targets": [{
          "expr": "rate(llm_tokens_total[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Active Sessions",
        "targets": [{
          "expr": "active_sessions"
        }],
        "type": "stat"
      },
      {
        "title": "OpenFGA Check Rate",
        "targets": [{
          "expr": "rate(openfga_checks_total[5m])"
        }],
        "type": "graph"
      }
    ]
  }
}
```
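To make the P95 panel's query concrete, here is a small pure-Python sketch of the linear interpolation that PromQL's `histogram_quantile()` performs over cumulative histogram buckets. The bucket values below are made up for illustration; this is an approximation of the algorithm, not Prometheus's exact implementation.

```python
def histogram_quantile(q, buckets):
    """Approximate PromQL histogram_quantile().

    buckets: list of (upper_bound, cumulative_count) pairs, sorted by
    bound, ending with (float("inf"), total_count).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Quantile falls in the open-ended bucket; best estimate
                # is the largest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that holds the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative duration buckets: 50 requests <= 0.1s, 90 <= 0.5s, all 100 <= 1.0s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.75 -> estimated p95 latency in seconds
```

Because the estimate interpolates within bucket boundaries, the accuracy of the panel depends entirely on how finely `http_request_duration_seconds_bucket` is bucketed around the latencies you care about.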

## LLM Observability Dashboard

```json
{
  "dashboard": {
    "title": "MCP Server - LLM Metrics",
    "panels": [
      {
        "title": "LLM Requests by Provider",
        "targets": [{
          "expr": "rate(llm_requests_total[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Token Cost per Hour",
        "targets": [{
          "expr": "sum(rate(llm_tokens_total{type=\"prompt\"}[1h])) * 0.003 + sum(rate(llm_tokens_total{type=\"completion\"}[1h])) * 0.015"
        }],
        "type": "stat"
      },
      {
        "title": "LLM Error Rate",
        "targets": [{
          "expr": "rate(llm_requests_total{status=\"error\"}[5m])"
        }],
        "type": "graph"
      },
      {
        "title": "Average Response Tokens",
        "targets": [{
          "expr": "rate(llm_tokens_total{type=\"completion\"}[5m]) / rate(llm_requests_total[5m])"
        }],
        "type": "stat"
      }
    ]
  }
}
```
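One caveat on the cost panel: PromQL `rate()` returns a per-second rate, so if the 0.003 and 0.015 factors are per-token prices, the expression yields dollars per second and an hourly figure needs a further ×3600. A quick sanity check of that arithmetic (the prices here are the panel's placeholder values, not real provider pricing):

```python
def hourly_token_cost(prompt_tokens_per_sec, completion_tokens_per_sec,
                      prompt_price=0.003, completion_price=0.015):
    """Dollar cost per hour, given per-second token rates (as returned
    by PromQL rate()) and assumed per-token prices."""
    cost_per_second = (prompt_tokens_per_sec * prompt_price
                       + completion_tokens_per_sec * completion_price)
    return cost_per_second * 3600  # rate() is per-second; scale to one hour


# 10 prompt tokens/s and 2 completion tokens/s:
print(hourly_token_cost(10, 2))  # -> 216.0 dollars/hour at these prices
```

If your provider prices per 1K or per 1M tokens, divide the price constants accordingly before putting them in the panel expression.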

## Production-Ready Dashboards (v2.1.0)

**NEW in v2.1.0**: 7 production-ready Grafana dashboards covering authentication, authorization, LLM performance, and infrastructure metrics.

The repository includes pre-built Grafana dashboards optimized for production monitoring. All dashboards live in `monitoring/grafana/dashboards/`.

### Authentication

`authentication.json`

- Login activity rate (attempts, successes, failures)
- Login failure rate gauge with thresholds
- Response time percentiles (p50, p95, p99)
- Active session count
- Token operations (create, verify, refresh)
- JWKS cache performance

### OpenFGA Authorization

`openfga.json`

- Authorization check rate (total, allowed, denied)
- Denial rate gauge
- Total relationship tuples
- Check latency percentiles
- Tuple write operations
- Role sync operations and latency

### LLM Performance

`llm-performance.json`

- Agent call rate (successful/failed)
- Error rate gauge
- Response time percentiles
- Tool call rate
- LLM invocations by model
- Fallback model usage

### Keycloak SSO

`keycloak.json`

- Service status gauge
- Response time (p50, p95, p99)
- Login request rate
- Error rates (login, token refresh)
- Active sessions and users
- Resource utilization (CPU, memory)

### Redis Sessions

`redis-sessions.json`

- Service status and memory usage
- Active sessions (key count)
- Operations rate (commands/sec)
- Connection pool utilization
- Session evictions
- Memory fragmentation ratio

### Security

`security.json`

- Auth/AuthZ failures per second
- JWT validation errors
- Security status gauge
- Failures by reason and resource
- Failed attempts by user/IP
- Top 10 violators table

### Overview

`mcp-server-langgraph.json`

- Service uptime status gauge
- Request rate by tool
- Error rate percentage
- Response time percentiles
- Memory and CPU usage per pod
- Request success/failure count
## Import Dashboards

### Option 1: Grafana UI (Manual)

1. Open Grafana at http://localhost:3000
2. Navigate to **Dashboards → Import**
3. Click **Upload JSON file**
4. Select a dashboard file from `monitoring/grafana/dashboards/`
5. Select the Prometheus datasource
6. Click **Import**

Repeat for each dashboard you want to use.

### Option 2: Kubernetes ConfigMap (Automated)
```bash
# Create a ConfigMap from the dashboard files
kubectl create configmap grafana-dashboards \
  --from-file=monitoring/grafana/dashboards/ \
  -n monitoring
```
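For reference, the wiring on the Grafana deployment might look like the following. This is a minimal sketch: the ConfigMap name matches the command above, but the container name and mount path are assumptions to adapt to your manifest.

```yaml
# Sketch: mount the grafana-dashboards ConfigMap into the Grafana pod
spec:
  template:
    spec:
      containers:
        - name: grafana
          volumeMounts:
            - name: dashboards
              mountPath: /var/lib/grafana/dashboards
              readOnly: true
      volumes:
        - name: dashboards
          configMap:
            name: grafana-dashboards
```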
Mount the ConfigMap in your Grafana deployment by adding matching `volumeMounts` and `volumes` entries to the deployment manifest.

### Option 3: Helm Chart Configuration

Configure dashboards in `values.yaml`:
```yaml
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'mcp-server-langgraph'
        orgId: 1
        folder: 'MCP Server with LangGraph'
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/langgraph

  dashboards:
    mcp-server-langgraph:
      authentication:
        file: monitoring/grafana/dashboards/authentication.json
      openfga:
        file: monitoring/grafana/dashboards/openfga.json
      llm-performance:
        file: monitoring/grafana/dashboards/llm-performance.json
      overview:
        file: monitoring/grafana/dashboards/mcp-server-langgraph.json
      security:
        file: monitoring/grafana/dashboards/security.json
      keycloak:
        file: monitoring/grafana/dashboards/keycloak.json
      redis-sessions:
        file: monitoring/grafana/dashboards/redis-sessions.json
```
## Dashboard Features

All production dashboards include:

- **Auto-refresh**: 10-second refresh rate for real-time monitoring
- **Time range presets**: last 5m, 15m, 1h, 6h, 24h, 7d
- **Thresholds**: color-coded gauges (green/yellow/red)
- **Cross-links**: navigate between related dashboards
- **Legend tables**: current, max, and mean values
- **Panel descriptions**: hover tooltips explaining metrics
## Required Metrics

Ensure these metrics are exposed by the application.

**Authentication** (`authentication.json`):

```text
up{job="mcp-server-langgraph"}
auth_login_attempts_total
auth_login_success_total
auth_login_failed_total
auth_login_duration_bucket
token_created_total
token_verified_total
token_refreshed_total
session_active_count
jwks_cache_hits_total
jwks_cache_misses_total
```

**OpenFGA** (`openfga.json`):

```text
up{job="openfga"}
authz_checks_total
authz_successes_total
authz_failures_total
authz_check_duration_bucket
openfga_tuple_count
openfga_tuples_written_total
openfga_tuples_deleted_total
openfga_sync_operations_total
openfga_sync_duration_bucket
```

**LLM Performance** (`llm-performance.json`):

```text
# Labeled with: model, operation, tool
agent_calls_successful_total
agent_calls_failed_total
agent_response_duration_bucket
agent_tool_calls_total
```

**Keycloak & Redis**:

```text
up{job="keycloak"}
up{job="redis-session"}
keycloak_request_duration_bucket
keycloak_login_attempts_total
redis_memory_used_bytes
redis_db_keys
redis_commands_processed_total
```
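If any of these counters are missing, the panels built on them will show no data. For orientation, Prometheus scrapes metrics in a simple text exposition format; here is a minimal stdlib-only sketch of rendering counters in that format (the real application would normally use a client library such as `prometheus_client` instead):

```python
def render_prometheus(metrics):
    """Render counters in the Prometheus text exposition format.

    metrics: dict mapping metric name -> list of (labels_dict, value).
    """
    lines = []
    for name, samples in metrics.items():
        lines.append(f"# TYPE {name} counter")
        for labels, value in samples:
            if labels:
                label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample values for two of the metrics listed above:
metrics = {
    "auth_login_attempts_total": [({}, 42)],
    "auth_login_failed_total": [({"reason": "bad_password"}, 3)],
}
print(render_prometheus(metrics))
```

Hitting the application's `/metrics` endpoint and grepping for the names above is the quickest way to confirm a dashboard's prerequisites are in place.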

## Service Level Objectives (SLOs)

**NEW in v2.1.0**: Pre-computed SLO metrics via Prometheus recording rules for efficient monitoring and alerting.

### SLO Recording Rules

The `monitoring/prometheus/rules/slo-recording-rules.yaml` file contains 40+ recording rules that pre-compute Service Level Indicators (SLIs) for fast querying in Grafana.

Load the recording rules:
```bash
# Kubernetes with Prometheus Operator
kubectl apply -f monitoring/prometheus/rules/slo-recording-rules.yaml
```

For Docker Compose, add to `prometheus.yml`:

```yaml
rule_files:
  - /etc/prometheus/rules/slo-recording-rules.yaml
```
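The rule file's contents aren't reproduced here, but each recording rule follows the standard Prometheus shape. A sketch of what `job:up:avg` might look like (the exact expression and group name are assumptions, not the file's actual contents):

```yaml
groups:
  - name: slo_availability
    rules:
      # Pre-compute overall availability; the real file defines 40+ rules
      - record: job:up:avg
        expr: avg(up{job="mcp-server-langgraph"})
```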

### Available SLO Metrics

Availability (target: 99.9% uptime):

```promql
# Overall service availability
job:up:avg

# Component availability
job:up:avg:keycloak      # Target: 99.5%
job:up:avg:openfga       # Target: 99.5%
job:up:avg:redis_session # Target: 99.9%
```

Usage in Grafana:

```promql
# Current availability
job:up:avg * 100

# Downtime minutes per month
(1 - job:up:avg) * 43200
```
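The 43200 constant in the downtime query is simply the number of minutes in a 30-day month (30 × 24 × 60). A quick check of the arithmetic:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43200, the constant used in the query above

def downtime_minutes(availability, minutes=MINUTES_PER_MONTH):
    """Expected downtime per 30-day month for a given availability ratio."""
    return (1 - availability) * minutes


print(downtime_minutes(0.999))  # ~43.2 minutes/month at the 99.9% target
print(downtime_minutes(0.995))  # ~216 minutes/month at 99.5%
```

In other words, the 99.9% target for the overall service allows roughly 43 minutes of downtime per month, while the 99.5% component targets allow about 3.6 hours.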

### SLO Dashboard Example

Create an SLO summary dashboard:

```json
{
  "dashboard": {
    "title": "SLO Compliance - MCP Server",
    "panels": [
      {
        "title": "Availability SLO (99.9% target)",
        "targets": [{
          "expr": "job:up:avg * 100"
        }],
        "thresholds": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 99.5},
          {"color": "green", "value": 99.9}
        ],
        "type": "gauge"
      },
      {
        "title": "Error Rate SLO (< 1% target)",
        "targets": [{
          "expr": "job:agent_calls:error_rate * 100"
        }],
        "thresholds": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 0.5},
          {"color": "red", "value": 1.0}
        ],
        "type": "gauge"
      },
      {
        "title": "Latency SLO (p95 < 2s)",
        "targets": [{
          "expr": "job:agent_response_duration:p95"
        }],
        "thresholds": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 1500},
          {"color": "red", "value": 2000}
        ],
        "type": "gauge",
        "unit": "ms"
      },
      {
        "title": "Error Budget Burn Rate (1h window)",
        "targets": [{
          "expr": "job:error_budget:burn_rate_1h"
        }],
        "thresholds": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 2},
          {"color": "red", "value": 14.4}
        ],
        "type": "graph"
      }
    ]
  }
}
```

## Benefits of SLO Recording Rules

1. **Performance**: pre-computed metrics query 10-100x faster
2. **Consistency**: same calculation across all dashboards
3. **Alerting**: alert on SLO violations, not raw metrics
4. **Reporting**: historical SLO compliance tracking
5. **Error budgets**: multi-window burn-rate detection
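On that last point: the 14.4 red threshold in the burn-rate panel above is the conventional fast-burn alert level, corresponding to consuming 2% of a 30-day error budget within one hour (0.02 ÷ (1h/720h) = 14.4). A sketch of the calculation, assuming the 99.9% SLO from the availability panel:

```python
def burn_rate(observed_error_rate, slo=0.999):
    """How fast the error budget is being consumed, relative to the
    steady rate that would exhaust it exactly at the window's end."""
    error_budget = 1 - slo  # at 99.9%, 0.1% of requests may fail
    return observed_error_rate / error_budget


# A sustained 1.44% error rate against a 99.9% SLO burns the budget
# 14.4x too fast, i.e. 2% of the monthly budget gone in a single hour:
print(burn_rate(0.0144))
```

A burn rate of 1 means the budget would be exhausted exactly at the end of the 30-day window; sustained values above ~2 are worth a ticket, and 14.4 warrants a page.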

## Next Steps

- **Alerting**: configure alerts based on dashboard metrics
- **Prometheus Metrics**: add more metrics to track
- **LangSmith**: enhance LLM observability
- **Back to Overview**: return to the monitoring overview