Overview

Comprehensive troubleshooting guide for the MCP Server with LangGraph. This guide covers common issues, diagnostic techniques, and step-by-step solutions.
Always check logs first: kubectl logs -f <pod-name> or docker compose logs -f

Quick Diagnostic Commands

# Check overall health
curl http://localhost:8000/health

# Check pod status
kubectl get pods -n mcp-server-langgraph

# View recent logs
kubectl logs --tail=100 -n mcp-server-langgraph deployment/mcp-server-langgraph

# Check resource usage
kubectl top pods -n mcp-server-langgraph
kubectl top nodes

# Test database connectivity
kubectl exec -it postgres-0 -- psql -U postgres -c "SELECT 1"

# Test Redis connectivity
kubectl exec -it redis-master-0 -- redis-cli ping

# Check OpenFGA
curl http://localhost:8080/healthz

# Check Keycloak
curl http://localhost:8080/realms/langgraph/.well-known/openid-configuration
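To run the whole sweep at once, here is a small Python script over the same endpoints (a sketch using httpx; adjust hosts and ports to your port-forwards, since OpenFGA and Keycloak cannot both bind localhost:8080 at the same time):

# One-shot health sweep over the endpoints above
import httpx

ENDPOINTS = {
    "mcp-server": "http://localhost:8000/health",
    "openfga": "http://localhost:8080/healthz",
    "keycloak": "http://localhost:8080/realms/langgraph/.well-known/openid-configuration",
}

for name, url in ENDPOINTS.items():
    try:
        resp = httpx.get(url, timeout=5)
        print(f"{name}: HTTP {resp.status_code}")
    except httpx.HTTPError as exc:
        print(f"{name}: unreachable ({exc})")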

Authentication Issues

Symptom:
{
  "detail": "Invalid authentication credentials"
}
Causes:
  1. Token expired
  2. Invalid token format
  3. Wrong signing key
  4. Token not in request
Solutions:
# Check token expiration
jwt decode <your-token>

# Verify token format (segments are base64url-encoded; base64 -d may
# complain about missing padding)
echo $TOKEN | cut -d'.' -f1 | base64 -d  # Header
echo $TOKEN | cut -d'.' -f2 | base64 -d  # Payload

# Get new token
curl -X POST http://localhost:8000/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"admin"}'

# Check JWKS endpoint
curl http://localhost:8080/realms/langgraph/protocol/openid-connect/certs
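If the jwt CLI is not available, the same expiry check works from Python (a sketch assuming the PyJWT package, with token holding the raw JWT string):

# Inspect expiry without verifying the signature (debugging only)
import jwt  # PyJWT
from datetime import datetime, timezone

claims = jwt.decode(token, options={"verify_signature": False})
exp = datetime.fromtimestamp(claims["exp"], tz=timezone.utc)
print("expires:", exp, "| expired:", exp < datetime.now(timezone.utc))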
Code fix:
# Always include Authorization header
headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, json=data)
Symptom:
{
  "detail": "Token has expired"
}
Solution:
# Refresh token flow
import httpx

async def refresh_access_token(refresh_token: str) -> str:
    # httpx.post() is synchronous; use AsyncClient in async code
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/auth/refresh",
            json={"refresh_token": refresh_token}
        )

    if response.status_code == 200:
        data = response.json()
        return data["access_token"]
    else:
        # Re-login required (login() defined by your client code)
        return await login(username, password)

# Auto-refresh implementation
from datetime import datetime, timedelta

class TokenRefresher:
    def __init__(self):
        self.access_token = None
        self.refresh_token = None
        self.expires_at = None

    async def get_valid_token(self):
        if not self.access_token or datetime.now() >= self.expires_at:
            await self.refresh()
        return self.access_token

    async def refresh(self):
        if self.refresh_token:
            new_token = await refresh_access_token(self.refresh_token)
            self.access_token = new_token
            # Prefer the expires_in from the token response when available
            self.expires_at = datetime.now() + timedelta(hours=1)
        else:
            await self.login()  # login() must set both tokens (not shown)
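A usage sketch for the refresher (illustrative; authed_post is a hypothetical helper):

# Fetch a fresh token transparently before each call
refresher = TokenRefresher()

async def authed_post(url: str, payload: dict):
    token = await refresher.get_valid_token()  # refreshes when expired
    async with httpx.AsyncClient() as client:
        return await client.post(
            url,
            headers={"Authorization": f"Bearer {token}"},
            json=payload,
        )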
Symptom:
ERROR: Failed to fetch JWKS from Keycloak
ConnectionRefusedError: [Errno 111] Connection refused
Debug:
# Check Keycloak is running
kubectl get pods -l app=keycloak

# Check Keycloak logs
kubectl logs -l app=keycloak --tail=50

# Test Keycloak endpoint
curl http://keycloak:8080/health

# Check network policy
kubectl get networkpolicies

# Test DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup keycloak
Fix:
# Update Keycloak URL in config
env:
- name: KEYCLOAK_SERVER_URL
  value: "http://keycloak.mcp-server-langgraph.svc.cluster.local:8080"

# Or use external URL
- name: KEYCLOAK_SERVER_URL
  value: "https://keycloak.yourdomain.com"
Symptom:
{
  "detail": "Multi-factor authentication required"
}
Solution:
# Disable MFA for development
export MFA_REQUIRED=false

# Configure MFA in Keycloak
# 1. Login to Keycloak admin console
# 2. Go to Authentication
# 3. Open the "Required Actions" tab
# 4. Enable "Configure OTP"
# 5. User must configure OTP on next login

# Test MFA flow
curl -X POST http://localhost:8000/auth/login \
  -d '{"username":"alice","password":"***","otp":"123456"}'

Authorization Issues

Symptom:
{
  "detail": "Permission denied: user:alice cannot executor tool:chat"
}
Debug:
# Check user's permissions
from mcp_server_langgraph.auth.openfga import OpenFGAClient

client = OpenFGAClient()

# List all user's tuples
tuples = await client.read_tuples(user="user:alice")
for t in tuples:
    print(f"{t['user']} {t['relation']} {t['object']}")

# Check specific permission
allowed = await client.check_permission(
    user="user:alice",
    relation="executor",
    object="tool:chat"
)
print(f"Allowed: {allowed}")

# List what tools user can execute
tools = await client.list_objects(
    user="user:alice",
    relation="executor",
    object_type="tool"
)
print(f"Executable tools: {tools}")
Fix:
# Grant permission
await client.write_tuples([{
    "user": "user:alice",
    "relation": "executor",
    "object": "tool:chat"
}])

# Or grant via organization
await client.write_tuples([
    # Make alice a member
    {
        "user": "user:alice",
        "relation": "member",
        "object": "organization:default"
    },
    # Grant org members access to tool
    {
        "user": "organization:default#member",
        "relation": "executor",
        "object": "tool:chat"
    }
])
Symptom:
ERROR: OpenFGA API request failed: Connection refused
Debug:
# Check OpenFGA is running
kubectl get pods -l app=openfga

# Check OpenFGA health
curl http://localhost:8080/healthz

# Check store ID
echo $OPENFGA_STORE_ID

# List stores
curl http://localhost:8080/stores

# Check authorization model
curl http://localhost:8080/stores/$OPENFGA_STORE_ID/authorization-models
Fix:
# Re-run setup script
python scripts/setup/setup_openfga.py

# Update configuration
export OPENFGA_API_URL=http://openfga.mcp-server-langgraph.svc.cluster.local:8080
export OPENFGA_STORE_ID=<store-id-from-setup>
export OPENFGA_MODEL_ID=<model-id-from-setup>

# Restart application
kubectl rollout restart deployment/mcp-server-langgraph
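Before restarting, you can confirm the store ID is valid from Python (a sketch; OpenFGA's HTTP API exposes GET /stores/{store_id}):

# Probe the configured OpenFGA store
import os
import httpx

api_url = os.environ["OPENFGA_API_URL"]
store_id = os.environ["OPENFGA_STORE_ID"]

r = httpx.get(f"{api_url}/stores/{store_id}", timeout=5)
print(r.status_code, r.json().get("name"))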
Symptom: User has role in Keycloak but no permissions in OpenFGA
Debug:
# Check Keycloak roles
from mcp_server_langgraph.auth.keycloak import KeycloakClient

kc = KeycloakClient()
user = await kc.get_user_by_username("alice")
roles = await kc.get_user_roles(user['id'])
print(f"Keycloak roles: {roles}")

# Check OpenFGA tuples
tuples = await openfga_client.read_tuples(user="user:alice")
print(f"OpenFGA tuples: {tuples}")
Fix:
# Manual sync
from scripts.sync_keycloak_roles import sync_user_permissions

await sync_user_permissions("alice")

# Or sync all users
await sync_all_users()

# Enable automatic sync
export KEYCLOAK_SYNC_ENABLED=true
export KEYCLOAK_SYNC_INTERVAL=300  # 5 minutes
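For orientation, roughly what such a sync does under the hood (an illustrative sketch with a hypothetical role-to-tuple mapping, reusing the kc and openfga_client objects from the Debug snippet; the real scripts/sync_keycloak_roles logic may differ):

# Hypothetical role → tuple mapping, for illustration only
ROLE_TO_TUPLE = {
    "agent-user": ("executor", "tool:chat"),
    "agent-admin": ("admin", "organization:default"),
}

async def sync_user_permissions_sketch(username: str):
    user = await kc.get_user_by_username(username)
    roles = await kc.get_user_roles(user["id"])
    tuples = []
    for role in roles:
        mapping = ROLE_TO_TUPLE.get(role["name"])
        if mapping:
            relation, obj = mapping
            tuples.append({
                "user": f"user:{username}",
                "relation": relation,
                "object": obj,
            })
    if tuples:
        await openfga_client.write_tuples(tuples)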

Database Issues

Symptom:
psycopg2.OperationalError: could not connect to server: Connection refused
Debug:
# Check PostgreSQL is running
kubectl get pods -l app=postgres

# Check logs
kubectl logs postgres-0 --tail=50

# Test connection
kubectl exec -it postgres-0 -- psql -U postgres -c "SELECT 1"

# Check credentials
kubectl get secret postgres-credentials -o yaml

# Check service
kubectl get svc postgres
Fix:
# Verify connection string
export DATABASE_URL="postgresql://user:password@postgres:5432/dbname"

# Check network policy allows connection
kubectl get networkpolicies

# Restart PostgreSQL
kubectl rollout restart statefulset/postgres

# Check PostgreSQL config
kubectl exec -it postgres-0 -- cat /var/lib/postgresql/data/postgresql.conf | grep listen_addresses
# Should be: listen_addresses = '*'
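The same check from application code (a sketch; assumes DATABASE_URL is exported as above and psycopg2 is installed):

# Minimal connectivity probe using the same DATABASE_URL
import os
import psycopg2

try:
    conn = psycopg2.connect(os.environ["DATABASE_URL"], connect_timeout=5)
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print("PostgreSQL reachable:", cur.fetchone())
    conn.close()
except psycopg2.OperationalError as exc:
    print("Connection failed:", exc)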
Symptom:
redis.exceptions.TimeoutError: Timeout reading from socket
Debug:
# Check Redis is running
kubectl get pods -l app=redis

# Test Redis
kubectl exec -it redis-master-0 -- redis-cli ping
# Should return: PONG

# Check Redis config
kubectl exec -it redis-master-0 -- redis-cli CONFIG GET timeout
kubectl exec -it redis-master-0 -- redis-cli CONFIG GET tcp-keepalive

# Check connection from app pod
kubectl exec -it mcp-server-langgraph-${POD_ID} -- redis-cli -h redis-master ping
Fix:
# Increase timeout
import redis

redis_client = redis.Redis(
    host="redis-master",
    port=6379,
    socket_timeout=10,  # Increased
    socket_connect_timeout=10,
    retry_on_timeout=True
)

# Connection pooling
pool = redis.ConnectionPool(
    host="redis-master",
    port=6379,
    max_connections=50,
    socket_timeout=10
)
redis_client = redis.Redis(connection_pool=pool)
Symptom:
psycopg2.errors.DeadlockDetected: deadlock detected
Debug:
-- Find blocking queries (simplified; matches on locktype only)
SELECT
    blocked_locks.pid AS blocked_pid,
    blocked_activity.usename AS blocked_user,
    blocking_locks.pid AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocked_activity.query AS blocked_statement,
    blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;

-- Kill the blocking query
SELECT pg_terminate_backend(<blocking_pid>);
Fix:
# Use consistent lock ordering
async def update_conversation_and_user(conv_id, user_id):
    # Always acquire locks in same order
    async with db.transaction():
        # Lock user first, then conversation
        await db.execute(
            "SELECT * FROM users WHERE id = %s FOR UPDATE",
            (user_id,)
        )
        await db.execute(
            "SELECT * FROM conversations WHERE id = %s FOR UPDATE",
            (conv_id,)
        )

        # Perform updates
        ...

# Set statement timeout
await db.execute("SET statement_timeout = '30s'")

LLM Issues

Symptom:
{
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded"
  }
}
Solution:
# Enable automatic fallback
from mcp_server_langgraph.llm.factory import LLMFactory

llm = LLMFactory(
    provider="anthropic",
    model_name="claude-sonnet-4-5-20250929",
    enable_fallback=True,
    fallback_models=[
        "gemini-2.5-flash",  # Google fallback
        "gpt-5-mini"  # OpenAI fallback
    ]
)

# Implement exponential backoff
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def call_llm_with_retry(prompt: str):
    return await llm.invoke(prompt)

# Rate limiting (slowapi passes the incoming request to key_func)
from slowapi import Limiter

limiter = Limiter(key_func=lambda request: get_current_user(request).id)

@app.post("/message")
@limiter.limit("30/minute")  # Limit requests per user
async def send_message(request: Request):
    pass
Symptom:
TimeoutError: LLM request timed out after 60 seconds
Debug:
# Check LLM settings
print(f"Timeout: {settings.llm_timeout}")
print(f"Max tokens: {settings.llm_max_tokens}")

# Test LLM directly
from anthropic import Anthropic

client = Anthropic(api_key=settings.anthropic_api_key)
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=100,
    messages=[{"role": "user", "content": "Hi"}]
)
print(response.content[0].text)
Fix:
# Increase timeout
export LLM_TIMEOUT=120  # 2 minutes

# Reduce max tokens
export LLM_MAX_TOKENS=4096  # Smaller responses

# Use streaming
async def stream_llm_response(prompt: str):
    async for chunk in llm.astream(prompt):
        yield chunk
        # Process incrementally
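To carry the stream all the way to the client, the generator above can be returned directly from FastAPI (a sketch; the /message/stream route name is illustrative):

# Expose the stream over HTTP (assumes stream_llm_response from above)
from fastapi.responses import StreamingResponse

@app.post("/message/stream")
async def send_message_stream(prompt: str):
    return StreamingResponse(
        stream_llm_response(prompt), media_type="text/plain"
    )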
Symptom:
{
  "error": {
    "type": "authentication_error",
    "message": "Invalid API key"
  }
}
Debug:
# Check API key is set
echo $ANTHROPIC_API_KEY | head -c 10

# Verify in Infisical
python -c "
from mcp_server_langgraph.core.config import settings
print(f'Has API key: {bool(settings.anthropic_api_key)}')
print(f'Key prefix: {settings.anthropic_api_key[:10]}')
"

# Test key directly
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-sonnet-4-5-20250929","max_tokens":10,"messages":[{"role":"user","content":"Hi"}]}'
Fix:
# Update API key in Infisical
# 1. Login to Infisical dashboard
# 2. Go to project > Secrets
# 3. Update ANTHROPIC_API_KEY
# 4. Restart application

# Or set temporarily
export ANTHROPIC_API_KEY=sk-ant-...

# Restart pod to pick up new secret
kubectl rollout restart deployment/mcp-server-langgraph

Performance Issues

Symptom: Requests taking > 5 seconds
Debug:
# Enable detailed logging
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Add timing middleware
import time
from fastapi import Request

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    logger.info(f"{request.method} {request.url.path} took {process_time:.2f}s")
    return response

# Profile with cProfile
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# Run request
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)
Common fixes:
# 1. Add database indexes
CREATE INDEX idx_conversations_user_id ON conversations(user_id);
CREATE INDEX idx_sessions_user_id ON sessions(user_id);

# 2. Enable Redis caching
#    (functools.lru_cache cannot cache async results: it would cache the
#    coroutine object; assumes an async Redis client, e.g. redis.asyncio)
import json

async def get_user_permissions(user_id: str):
    cached = await redis_client.get(f"perms:{user_id}")
    if cached:
        return json.loads(cached)
    perms = await openfga_client.list_objects(user=f"user:{user_id}")
    await redis_client.set(f"perms:{user_id}", json.dumps(perms), ex=300)
    return perms

# 3. Connection pooling
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,
    max_overflow=10
)

# 4. Parallel processing
import asyncio

async def process_request(query: str):
    # Run in parallel
    auth_check, user_data, context = await asyncio.gather(
        check_permissions(),
        get_user_data(),
        load_conversation_context()
    )
Symptom: Pods being OOMKilled
Debug:
# Check memory usage
kubectl top pods -n mcp-server-langgraph

# View OOMKilled events
kubectl get events --sort-by='.metadata.creationTimestamp' | grep OOMKilled

# Check resource limits
kubectl describe pod mcp-server-langgraph-${POD_ID} | grep -A 5 Limits

# Profile memory
import tracemalloc
tracemalloc.start()

# Run application
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

for stat in top_stats[:10]:
    print(stat)
Fix:
# Increase memory limits
resources:
  requests:
    memory: 2Gi
  limits:
    memory: 4Gi

# Or fix memory leaks
# 1. Clear caches periodically (schedule runs sync callables, so keep
#    the job function synchronous)
import gc
import schedule

def clear_caches():
    cache.clear()
    gc.collect()

schedule.every(1).hours.do(clear_caches)

# 2. Limit LLM response sizes
llm = LLMFactory(
    max_tokens=2048  # Smaller limit
)

# 3. Stream large responses
async for chunk in llm.astream(prompt):
    yield chunk
    # Don't accumulate in memory
Symptom: Slow performance, high CPU usage
Debug:
# Check CPU usage
kubectl top pods -n mcp-server-langgraph

# Check throttling
kubectl describe pod mcp-server-langgraph-${POD_ID} | grep -i throttl

# Profile CPU
python -m cProfile -o profile.stats main.py

# Analyze
python -c "
import pstats
p = pstats.Stats('profile.stats')
p.sort_stats('cumulative')
p.print_stats(20)
"
Fix:
# Increase CPU limits
resources:
  requests:
    cpu: 2000m
  limits:
    cpu: 4000m

# Or optimize code
# 1. Async I/O instead of sync
# Bad
result = requests.get(url)

# Good
async with httpx.AsyncClient() as client:
    result = await client.get(url)

# 2. Batch processing
# Bad
for item in items:
    await process(item)

# Good
await asyncio.gather(*[process(item) for item in items])

# 3. Reduce validation overhead
# Use Pydantic v2 (faster than v1)
pip install "pydantic>=2,<3"
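One caveat on fix 2: an unbounded asyncio.gather over thousands of items can itself spike CPU and memory. A bounded sketch (process is the same hypothetical coroutine as above):

# Cap concurrency with a semaphore so a large batch stays bounded
import asyncio

async def process_all(items, limit: int = 20):
    sem = asyncio.Semaphore(limit)

    async def bounded(item):
        async with sem:
            return await process(item)

    return await asyncio.gather(*[bounded(item) for item in items])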

Kubernetes Issues

Symptom: Pods crash-looping (CrashLoopBackOff)
Debug:
# View pod status
kubectl get pods -n mcp-server-langgraph

# View logs from crashed container
kubectl logs <pod-name> --previous

# Describe pod for events
kubectl describe pod <pod-name>

# Check init containers
kubectl logs <pod-name> -c init-migrations
Common fixes:
# 1. Increase startup time
startupProbe:
  httpGet:
    path: /health/startup
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 30  # 150 seconds total

# 2. Fix dependency order
initContainers:
- name: wait-for-postgres
  image: busybox
  command:
  - sh
  - -c
  - |
    until nc -z postgres 5432; do
      echo "Waiting for PostgreSQL..."
      sleep 2
    done

# 3. Add resource limits
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "2Gi"
    cpu: "2000m"
Symptom: ImagePullBackOff / ErrImagePull
Debug:
# Check image pull status
kubectl describe pod <pod-name> | grep -A 10 Events

# Check image exists
docker pull mcp-server-langgraph:latest

# Check registry credentials
kubectl get secret regcred -o yaml
Fix:
# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email>

# Use in deployment
spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: agent
    image: your-registry/mcp-server-langgraph:latest
Symptom: Service not routing traffic to pods
Debug:
# Check service
kubectl get svc mcp-server-langgraph

# Check endpoints
kubectl get endpoints mcp-server-langgraph

# Test from another pod
kubectl run -it --rm debug --image=busybox --restart=Never -- \
  wget -O- http://mcp-server-langgraph:8000/health

# Check network policies
kubectl get networkpolicies

# Check ingress
kubectl get ingress
kubectl describe ingress mcp-server-langgraph
Fix:
# Verify selector matches pod labels
apiVersion: v1
kind: Service
metadata:
  name: mcp-server-langgraph
spec:
  selector:
    app: mcp-server-langgraph  # Must match pod label
  ports:
  - port: 80
    targetPort: 8000

Observability & Debugging

Enable Debug Logging

# config.py
import logging
import os

# Set log level
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")

logging.basicConfig(
    level=getattr(logging, LOG_LEVEL),
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Enable debug logging
export LOG_LEVEL=DEBUG

# Structured logging
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()
logger.debug("debug_message", user_id="alice", action="login")

Distributed Tracing

# Enable OpenTelemetry tracing
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)

trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

# Add tracing to requests
@app.post("/message")
async def send_message(request: Request):
    with tracer.start_as_current_span("send_message") as span:
        span.set_attribute("user_id", user.id)
        span.set_attribute("query_length", len(request.query))

        result = await process_message(request)

        span.set_attribute("response_length", len(result))
        return result

# View traces in Jaeger at http://localhost:16686

Health Checks

# Detailed health check
from fastapi import status
from fastapi.responses import JSONResponse

@app.get("/health/detailed")
async def health_detailed():
    checks = {}

    # Database
    try:
        await db.execute("SELECT 1")
        checks["database"] = "healthy"
    except Exception as e:
        checks["database"] = f"unhealthy: {str(e)}"

    # Redis
    try:
        await redis.ping()
        checks["redis"] = "healthy"
    except Exception as e:
        checks["redis"] = f"unhealthy: {str(e)}"

    # OpenFGA
    try:
        async with httpx.AsyncClient() as client:
            r = await client.get(f"{settings.openfga_api_url}/healthz")
            checks["openfga"] = "healthy" if r.status_code == 200 else "unhealthy"
    except Exception as e:
        checks["openfga"] = f"unhealthy: {str(e)}"

    # Keycloak
    try:
        async with httpx.AsyncClient() as client:
            r = await client.get(f"{settings.keycloak_server_url}/health")
            checks["keycloak"] = "healthy" if r.status_code == 200 else "unhealthy"
    except Exception as e:
        checks["keycloak"] = f"unhealthy: {str(e)}"

    # Overall status
    all_healthy = all(v == "healthy" for v in checks.values())
    status_code = status.HTTP_200_OK if all_healthy else status.HTTP_503_SERVICE_UNAVAILABLE

    return JSONResponse(
        status_code=status_code,
        content={"status": "healthy" if all_healthy else "unhealthy", "checks": checks}
    )
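A quick way to exercise the endpoint after deploying it (assumes a local run or port-forward on 8000):

# Smoke-test the detailed health check
import httpx

r = httpx.get("http://localhost:8000/health/detailed", timeout=10)
print(r.status_code, r.json()["checks"])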

Getting Help

Collect Debug Information

#!/bin/bash
# collect-debug-info.sh

NAMESPACE="mcp-server-langgraph"
OUTPUT_DIR="debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p $OUTPUT_DIR

# Pod status
kubectl get pods -n $NAMESPACE > $OUTPUT_DIR/pods.txt

# Pod logs
for pod in $(kubectl get pods -n $NAMESPACE -o name); do
    kubectl logs -n $NAMESPACE $pod > "$OUTPUT_DIR/$(basename $pod).log" 2>&1
    kubectl logs -n $NAMESPACE $pod --previous > "$OUTPUT_DIR/$(basename $pod)-previous.log" 2>&1 || true
done

# Describe pods
for pod in $(kubectl get pods -n $NAMESPACE -o name); do
    kubectl describe -n $NAMESPACE $pod > "$OUTPUT_DIR/$(basename $pod)-describe.txt"
done

# Events
kubectl get events -n $NAMESPACE --sort-by='.metadata.creationTimestamp' > $OUTPUT_DIR/events.txt

# ConfigMaps (full) and Secrets (names only, to avoid leaking values)
kubectl get configmaps -n $NAMESPACE -o yaml > $OUTPUT_DIR/configmaps.yaml
kubectl get secrets -n $NAMESPACE > $OUTPUT_DIR/secrets.txt

# Services and Ingress
kubectl get svc,ingress -n $NAMESPACE -o yaml > $OUTPUT_DIR/network.yaml

# Create archive
tar -czf $OUTPUT_DIR.tar.gz $OUTPUT_DIR

echo "Debug information collected in $OUTPUT_DIR.tar.gz"

Community Support

Include in support requests:
  1. Debug information bundle
  2. Kubernetes/Docker version
  3. Python version
  4. Steps to reproduce
  5. Expected vs actual behavior

Next Steps

Architecture

Understand system architecture

Observability

Set up monitoring

Production Checklist

Pre-deployment verification

Security Best Practices

Secure your deployment

Debugging Made Easy: Systematic troubleshooting gets you back online quickly!