17. Error Handling Strategy
Date: 2025-10-13
Status
Accepted
Category
Core Architecture
Context
The MCP Server with LangGraph needs a consistent, production-grade error handling strategy that:
- Provides clear error messages to users
- Enables debugging with sufficient context
- Maintains security by not leaking sensitive information
- Supports observability and monitoring
- Handles failures gracefully with appropriate fallback mechanisms
- Complies with enterprise SLAs and uptime requirements
Decision
We will implement a layered error handling strategy with the following principles:
1. Error Categories
We categorize all errors into client errors (4xx) and server errors (5xx):
Client Errors (4xx)
- Authentication Errors (401): Invalid or expired credentials
- Authorization Errors (403): Insufficient permissions
- Validation Errors (400): Invalid input data
- Not Found Errors (404): Resource doesn’t exist
Server Errors (5xx)
- Internal Errors (500): Unexpected server failures
- Service Unavailable (503): Temporary service disruption
- Timeout Errors (504): Operation exceeded time limit
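For illustration only, these categories could be mirrored in an application exception hierarchy along the following lines; the class names, attributes, and default codes below are assumptions, not existing code.

```python
# Sketch of an exception hierarchy mirroring the categories above.
# Class names, attributes, and default codes are illustrative assumptions.

class AppError(Exception):
    """Base class for all application errors."""
    status_code: int = 500
    error_code: str = "INTERNAL_UNEXPECTED_ERROR"

    def __init__(self, message: str, details: dict | None = None):
        super().__init__(message)
        self.message = message
        self.details = details or {}

class AuthenticationError(AppError):      # 401
    status_code = 401
    error_code = "AUTH_INVALID_CREDENTIALS"

class AuthorizationError(AppError):       # 403
    status_code = 403
    error_code = "AUTH_INSUFFICIENT_PERMISSIONS"

class ValidationError(AppError):          # 400
    status_code = 400
    error_code = "VALIDATION_INVALID_FORMAT"

class NotFoundError(AppError):            # 404
    status_code = 404
    error_code = "RESOURCE_NOT_FOUND"

class ServiceUnavailableError(AppError):  # 503
    status_code = 503
    error_code = "INTERNAL_CIRCUIT_BREAKER_OPEN"
```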
2. Error Propagation Pattern
- Catch at appropriate layer: Catch errors where you have context to handle them
- Add context: Each layer adds relevant context before re-raising
- Transform sensitive errors: Never expose infrastructure details to clients
- Log once: Log at the layer where you handle the error, not at every layer
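A minimal sketch of these four rules across two layers (the `executor` callable, class names, and chosen error codes are assumptions):

```python
import logging

logger = logging.getLogger(__name__)

class ToolExecutionError(Exception):
    """Wraps a lower-level failure with the context added by the tool layer."""
    def __init__(self, message: str, tool_name: str):
        super().__init__(message)
        self.tool_name = tool_name

async def run_tool(executor, tool_name: str, args: dict):
    """Lower layer: add context (which tool failed) and re-raise; no logging here."""
    try:
        return await executor(tool_name, args)
    except ValueError as exc:
        raise ToolExecutionError(f"Tool '{tool_name}' rejected its arguments", tool_name) from exc

async def handle_request(executor, tool_name: str, args: dict) -> dict:
    """Top layer: log exactly once, then return a safe, client-facing payload."""
    try:
        return await run_tool(executor, tool_name, args)
    except ToolExecutionError as exc:
        logger.error("tool execution failed", extra={"tool": exc.tool_name}, exc_info=exc)
        return {"code": "VALIDATION_INVALID_FORMAT", "message": "Tool arguments failed validation"}
```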
3. Error Response Format
All API errors return a consistent JSON structure:
- code: Machine-readable error code (uppercase snake_case)
- message: Human-readable error message
- details: Additional context (optional, non-sensitive)
- trace_id: OpenTelemetry trace ID for debugging
- request_id: Request identifier
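For illustration, a payload following this structure might look as follows (all values are invented):

```python
# Example error payload; every value here is invented for illustration.
error_response = {
    "code": "RESOURCE_NOT_FOUND",
    "message": "The requested document does not exist",
    "details": {"resource_type": "document"},
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "request_id": "5d1c2a9e-0b7f-4f7a-9c3e-2c1d6f8a4b10",
}
```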
4. Retry Strategy
Automatic Retries
- LLM API Calls: 3 retries with exponential backoff (1s, 2s, 4s)
- OpenFGA Checks: 2 retries with 500ms backoff
- Redis Operations: 2 retries with 100ms backoff
No Retries
- Authentication failures (permanent failures)
- Validation errors (requires user correction)
- Resource not found (won’t exist on retry)
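The retry policy above could be implemented with any retry library or with a small helper such as the following sketch (the helper name and the commented usage are assumptions):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def retry_async(operation, *, retries: int, base_delay: float,
                      retryable: tuple[type[Exception], ...] = (Exception,)):
    """Retry an async operation with exponential backoff (base, 2x, 4x, ...)."""
    for attempt in range(retries + 1):
        try:
            return await operation()
        except retryable as exc:
            if attempt == retries:
                raise  # retries exhausted; the caller logs and handles the failure
            delay = base_delay * (2 ** attempt)
            logger.info("retrying after error",
                        extra={"attempt": attempt + 1, "delay_s": delay, "error": type(exc).__name__})
            await asyncio.sleep(delay)

# Matching the LLM policy above (call_llm is a hypothetical zero-argument coroutine factory):
# result = await retry_async(call_llm, retries=3, base_delay=1.0, retryable=(TimeoutError,))
```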
5. Fallback Mechanisms
Error Handling & Recovery Flow
Multi-Provider LLM Fallback Decision Tree
LLM Fallback
When the primary model fails, the request is routed to the next configured provider.
Authorization Fallback
When OpenFGA is unavailable:
Session Storage Fallback
When Redis is unavailable:
6. Logging Strategy
Error Logging Levels
- ERROR: Unexpected failures requiring investigation
  - Database connection failures
  - External API errors (after retries)
  - Unhandled exceptions
- WARNING: Expected failures with fallback handling
  - OpenFGA timeout (using fallback)
  - LLM primary model failure (trying fallback)
  - Rate limit approaching threshold
- INFO: Normal operational events
  - Successful fallback execution
  - Retry attempts
  - Circuit breaker state changes
Log Structure
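As an illustration of the structure, a single log entry carrying the error context discussed in this ADR might look like this (the exact field set is an assumption):

```python
# Illustrative structured log entry; the field set is an assumption, not a fixed schema.
log_entry = {
    "timestamp": "2025-10-13T14:32:07.123Z",
    "level": "WARNING",
    "logger": "app.llm.client",
    "message": "Primary LLM provider failed, trying fallback",
    "error_code": "EXT_LLM_API_ERROR",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "request_id": "5d1c2a9e-0b7f-4f7a-9c3e-2c1d6f8a4b10",
    "retry_attempt": 1,
}
```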
7. OpenTelemetry Integration
All errors are captured in distributed traces:
- error.type: Exception class name
- error.message: Error message
- error.stack: Stack trace (truncated)
- error.recoverable: Whether error is recoverable
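A minimal sketch of setting these attributes with the OpenTelemetry Python SDK; the helper name, span name, and the 2000-character truncation are assumptions:

```python
import traceback
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def record_error(span: trace.Span, exc: Exception, recoverable: bool) -> None:
    """Attach the error attributes described above to the given span."""
    span.record_exception(exc)
    span.set_status(Status(StatusCode.ERROR, str(exc)))
    span.set_attribute("error.type", type(exc).__name__)
    span.set_attribute("error.message", str(exc))
    span.set_attribute("error.stack", "".join(traceback.format_exception(exc))[:2000])
    span.set_attribute("error.recoverable", recoverable)

# Usage (span name is illustrative):
# with tracer.start_as_current_span("tool.invoke") as span:
#     try:
#         ...
#     except Exception as exc:
#         record_error(span, exc, recoverable=False)
#         raise
```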
8. Error Codes
Authentication/Authorization (AUTH_*)
- AUTH_INVALID_CREDENTIALS: Invalid username/password
- AUTH_TOKEN_EXPIRED: JWT token expired
- AUTH_TOKEN_INVALID: Malformed or invalid token
- AUTH_INSUFFICIENT_PERMISSIONS: User lacks required permissions
- AUTH_SESSION_EXPIRED: Session no longer valid
- AUTH_MFA_REQUIRED: Multi-factor authentication required
Validation (VALIDATION_*)
- VALIDATION_REQUIRED_FIELD: Required field missing
- VALIDATION_INVALID_FORMAT: Field format incorrect
- VALIDATION_OUT_OF_RANGE: Value outside allowed range
- VALIDATION_CONSTRAINT_VIOLATION: Business rule violated
Resource (RESOURCE_*)
- RESOURCE_NOT_FOUND: Requested resource doesn’t exist
- RESOURCE_ALREADY_EXISTS: Resource already exists (conflict)
- RESOURCE_LOCKED: Resource is locked by another process
- RESOURCE_QUOTA_EXCEEDED: Resource quota limit reached
Infrastructure (INFRA_*)
- INFRA_DATABASE_ERROR: Database connection/query failure
- INFRA_REDIS_ERROR: Redis connection/operation failure
- INFRA_OPENFGA_ERROR: OpenFGA service error
- INFRA_NETWORK_ERROR: Network connectivity issue
External Service (EXT_*)
- EXT_LLM_API_ERROR: LLM provider API error
- EXT_LLM_RATE_LIMIT: LLM rate limit exceeded
- EXT_KEYCLOAK_ERROR: Keycloak authentication error
- EXT_INFISICAL_ERROR: Infisical secrets retrieval error
Internal (INTERNAL_*)
- INTERNAL_UNEXPECTED_ERROR: Unhandled exception
- INTERNAL_TIMEOUT: Operation timeout
- INTERNAL_CIRCUIT_BREAKER_OPEN: Circuit breaker protecting service
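To support the centralized error-code registry mentioned in the mitigations below, the codes could be collected in a single enum; this is only a sketch showing a subset of the codes:

```python
from enum import Enum

class ErrorCode(str, Enum):
    """Central registry of machine-readable error codes (subset shown)."""
    AUTH_INVALID_CREDENTIALS = "AUTH_INVALID_CREDENTIALS"
    AUTH_TOKEN_EXPIRED = "AUTH_TOKEN_EXPIRED"
    VALIDATION_REQUIRED_FIELD = "VALIDATION_REQUIRED_FIELD"
    RESOURCE_NOT_FOUND = "RESOURCE_NOT_FOUND"
    INFRA_DATABASE_ERROR = "INFRA_DATABASE_ERROR"
    EXT_LLM_API_ERROR = "EXT_LLM_API_ERROR"
    INTERNAL_UNEXPECTED_ERROR = "INTERNAL_UNEXPECTED_ERROR"
```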
9. Error Metrics
Track error rates with Prometheus:
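A sketch of what such metrics could look like with the prometheus_client library (metric and label names are assumptions):

```python
from prometheus_client import Counter

# Total errors by machine-readable code and HTTP category (4xx/5xx).
ERRORS_TOTAL = Counter(
    "app_errors_total",
    "Total errors handled by the MCP server",
    labelnames=["error_code", "category"],
)

# Fallback executions, e.g. secondary LLM provider or authorization fallback.
FALLBACKS_TOTAL = Counter(
    "app_fallbacks_total",
    "Fallback executions by subsystem",
    labelnames=["subsystem"],
)

# Usage:
# ERRORS_TOTAL.labels(error_code="EXT_LLM_API_ERROR", category="5xx").inc()
# FALLBACKS_TOTAL.labels(subsystem="llm").inc()
```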
10. Security Considerations
Never expose:
- Database connection strings
- Internal IP addresses or service names
- Authentication secrets or tokens
- Full stack traces (only trace_id for debugging)
- SQL queries or database schema details
- API keys or credentials
- User input in error messages
- File paths (show relative, not absolute)
- Email addresses (show partial: a***@acme.com)
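The last two items imply small sanitization helpers; a sketch of what they might look like (names and rules are assumptions):

```python
def mask_email(email: str) -> str:
    """Partial masking for log and error output, e.g. 'alice@acme.com' -> 'a***@acme.com'."""
    local, _, domain = email.partition("@")
    if not domain:
        return "***"
    return f"{local[:1]}***@{domain}"

def relative_path(path: str, project_root: str) -> str:
    """Show paths relative to the project root rather than absolute paths."""
    trimmed = path.removeprefix(project_root).lstrip("/\\")
    return trimmed or "."
```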
Implementation Examples
Example 1: Authentication Error
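A sketch of how an expired token could surface as an AUTH_TOKEN_EXPIRED response in the format from section 3 (the claim handling and helper names are assumptions):

```python
import time

class AuthenticationError(Exception):
    """401-class error (see the hierarchy sketch above)."""
    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code
        self.message = message

def check_token_expiry(claims: dict, now: float | None = None) -> None:
    """Raise AUTH_TOKEN_EXPIRED if the JWT 'exp' claim is in the past."""
    if now is None:
        now = time.time()
    if claims.get("exp", 0) < now:
        # Safe to show to the client: no token contents, no secrets.
        raise AuthenticationError("AUTH_TOKEN_EXPIRED", "Your session token has expired")

def to_error_response(exc: AuthenticationError, trace_id: str, request_id: str) -> dict:
    """Shape the exception into the consistent JSON structure from section 3."""
    return {
        "code": exc.code,
        "message": exc.message,
        "details": {},
        "trace_id": trace_id,
        "request_id": request_id,
    }
```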
Example 2: LLM Fallback
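A sketch of the multi-provider fallback behaviour referenced in section 5, assuming each provider client exposes an async complete() method (the interface and class names are assumptions):

```python
import logging

logger = logging.getLogger(__name__)

class AllProvidersFailedError(Exception):
    """Raised when every configured LLM provider has failed."""

async def complete_with_fallback(providers, prompt: str) -> str:
    """Try the primary provider first, then each configured fallback in order."""
    last_exc: Exception | None = None
    for position, provider in enumerate(providers):
        try:
            return await provider.complete(prompt)
        except Exception as exc:  # provider-specific errors would be narrower in practice
            last_exc = exc
            # WARNING, per the logging levels above: expected failure with fallback handling.
            logger.warning(
                "LLM provider failed, trying fallback",
                extra={"provider_index": position, "error_type": type(exc).__name__},
            )
    # All providers exhausted; surfaces to the client as EXT_LLM_API_ERROR.
    raise AllProvidersFailedError("All configured LLM providers failed") from last_exc
```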
Example 3: Service Layer Error
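A sketch of a service-layer handler that logs once, records the failure on the active span, and exposes only a safe code and message (the session store interface and names are assumptions):

```python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

class ServiceError(Exception):
    """5xx-class error exposing only a safe code and message."""
    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code
        self.message = message

async def load_session(session_store, session_id: str) -> dict:
    """Catch the infrastructure failure here, where the session context is known."""
    try:
        return await session_store.get(session_id)
    except ConnectionError as exc:
        # Record the failure on the active trace for debugging via trace_id.
        trace.get_current_span().record_exception(exc)
        # Log once, with operator-level detail; the client never sees the Redis host.
        logger.error("session store unavailable", extra={"session_id": session_id}, exc_info=exc)
        raise ServiceError("INFRA_REDIS_ERROR", "Session storage is temporarily unavailable") from exc
```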
Consequences
Positive
- Consistency: All errors follow the same structure and patterns
- Debuggability: Trace IDs link errors to distributed traces
- Reliability: Fallback mechanisms prevent single points of failure
- Observability: Comprehensive metrics and logging
- Security: No sensitive information leaked in error messages
Negative
- Complexity: Developers must understand error categorization
- Overhead: Additional logging and tracing adds latency (minimal)
- Maintenance: Error codes must be documented and maintained
Mitigation
- Provide error handling templates and examples
- Create linting rules to enforce patterns
- Document all error codes in centralized registry
- Monitor error handling overhead in production