Overview
Comprehensive checklist to ensure your MCP Server with LangGraph deployment is production-ready. This checklist covers security, performance, reliability, observability, and operational readiness for v2.1.0.
Security
Authentication & Authorization
Keycloak SSO
Critical:
- Keycloak deployed with PostgreSQL (not H2)
- Strong admin password set (32+ characters)
- Realm mcp-server-langgraph created
- Client langgraph-client configured (confidential)
- Client secret stored in secrets manager
- SSL/TLS enabled (KEYCLOAK_VERIFY_SSL=true)
- Redirect URIs properly configured
- Test user can authenticate successfully
- 2+ Keycloak replicas for HA
- Brute force protection enabled
- Password policies configured
- Multi-factor authentication enabled
- Session timeout configured appropriately
- Admin console access restricted
JWT & Sessions
Critical:
- AUTH_MODE=session for production (not token)
- SESSION_BACKEND=redis (not memory)
- JWT secret is cryptographically random (256-bit)
- Redis password is strong and unique
- Secrets stored in a secrets manager (not .env files)
- Session TTL appropriate for your use case
- SESSION_SLIDING_WINDOW=true for better UX
- SESSION_MAX_CONCURRENT set per security policy
- Session metadata includes IP and user agent
- Token expiration ≤ 4 hours for access tokens
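The secret and session settings above can be checked mechanically before rollout. The following is a minimal sketch, not the project's own validation code: it generates a 256-bit secret with Python's standard `secrets` module and reports settings that differ from the values this checklist requires. The function names `generate_jwt_secret` and `check_session_settings` are hypothetical; only the variable names and required values come from the checklist.

```python
import secrets


def generate_jwt_secret() -> str:
    """Return a 256-bit (32-byte) random secret as 64 hex characters."""
    return secrets.token_hex(32)


# Values the checklist requires for production.
REQUIRED_SETTINGS = {
    "AUTH_MODE": "session",
    "SESSION_BACKEND": "redis",
}


def check_session_settings(env: dict) -> list:
    """Return human-readable problems; an empty list means the settings match."""
    problems = []
    for name, expected in REQUIRED_SETTINGS.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name} should be {expected!r}, found {actual!r}")
    return problems
```

Calling `check_session_settings(dict(os.environ))` in a pre-deployment script would list any mismatches. The generated secret should go straight into the secrets manager rather than into a file.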
OpenFGA Authorization
Network Security
TLS/SSL
Critical:
- HTTPS enforced for all ingress (no HTTP)
- Valid SSL certificates (not self-signed)
- Certificate auto-renewal configured
- All internal services use TLS where applicable
- Redis SSL enabled in production
- TLS 1.2+ only (disable TLS 1.0/1.1)
- Strong cipher suites configured
- HSTS headers enabled
- Certificate monitoring/alerts
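Two of the items above, the TLS 1.2+ floor and certificate monitoring, can be illustrated with Python's standard `ssl` module. This is a sketch under stated assumptions rather than the server's actual configuration: `make_client_context` and `days_until_expiry` are hypothetical helper names, and the context shown is for a client connecting to a service, not for the ingress itself.

```python
import ssl
import time


def make_client_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses TLS below 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def days_until_expiry(not_after: str, now: float = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format found in SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

A monitoring job could compare the result of `days_until_expiry` against the 7-day threshold listed under Critical Alerts.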
Network Policies
Critical:
- Network policies deployed (Kubernetes)
- Only necessary ports exposed
- No public access to databases
- Service-to-service communication restricted
- Egress filtering configured
- Zero-trust network policies
- VPC/subnet isolation
- Firewall rules documented
Secrets Management
API Keys & Secrets
Critical:
- All secrets in a secrets manager (not env files)
- API keys valid and quota sufficient
- Secrets rotated from default values
- Access to secrets restricted (RBAC)
- No secrets in Git history
- Infisical or similar secrets manager
- Secret rotation policy defined
- Secret access auditing enabled
- Encrypted backups of secrets
Infrastructure
Kubernetes Cluster
Cluster Configuration
Critical:
- Kubernetes 1.25+ running
- Multi-zone/region deployment
- Node auto-scaling configured
- Sufficient resources (8+ vCPU, 16GB+ RAM)
- Persistent volumes for stateful services
- Pod security standards enforced
- 3+ nodes minimum
- Dedicated node pools for workloads
- Node taints and tolerations configured
- Resource quotas set per namespace
High Availability
Critical:
- 3+ application replicas
- Pod anti-affinity configured
- PodDisruptionBudget set (minAvailable: 2)
- Horizontal Pod Autoscaler enabled
- Keycloak HA (2+ replicas)
- Redis replication enabled
- Database replication configured
- Cross-zone pod distribution
- Graceful shutdown configured
- Rolling update strategy tuned
- Circuit breakers configured
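The replica count and the PodDisruptionBudget above interact: with `minAvailable: 2`, the number of pods that can be drained voluntarily is the number of healthy pods minus two. The helper below is only a worked-arithmetic sketch (the function name is hypothetical), useful for sanity-checking that node upgrades can still make progress.

```python
def allowed_disruptions(healthy_replicas: int, min_available: int) -> int:
    """Pods that may be evicted voluntarily without violating minAvailable."""
    return max(0, healthy_replicas - min_available)


# With the checklist values: 3 replicas and minAvailable: 2.
print(allowed_disruptions(3, 2))
```

With 3 healthy replicas one pod can be evicted at a time; if a replica is already unhealthy, voluntary evictions are blocked until it recovers.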
Data Persistence
Databases
Critical:
- PostgreSQL for Keycloak (persistent)
- PostgreSQL for OpenFGA (persistent)
- Automated backups configured
- Backup retention policy defined
- Restore procedure tested
- Connection pooling configured
- Database replication enabled
- Point-in-time recovery available
- Monitoring and alerting set up
- Performance tuning completed
Redis Sessions
Critical:
- Redis persistence enabled (RDB + AOF)
- Redis replication configured (master + replicas)
- Redis password set
- Memory limits configured
- Eviction policy set (allkeys-lru)
- Redis Sentinel for HA
- Automated backups
- Monitoring enabled
- SSL/TLS enabled
Observability
Monitoring
Metrics Collection
Critical:
- Prometheus or equivalent deployed
- Application metrics exported
- System metrics collected
- Dashboards created for key metrics
- Basic alerts configured
- Grafana dashboards imported
- Metric retention configured
- High cardinality metrics reviewed
- Cost monitoring enabled
- Request rate, latency, errors (RED)
- CPU, memory, disk usage
- LLM token usage and costs
- Authentication success/failure rate
- Session creation/expiration rate
Tracing
Critical:
- OpenTelemetry configured
- Jaeger or equivalent deployed
- End-to-end traces visible
- Sampling rate appropriate
- Trace storage configured
- LangSmith integration enabled
- Trace retention policy set
- Performance anomaly detection
- Distributed context propagation
Logging
Critical:
- Structured logging enabled (JSON)
- Log aggregation configured
- Log levels appropriate (INFO in prod)
- Sensitive data not logged
- Log retention policy set
- Centralized logging (ELK, Loki, etc.)
- Log-based alerting
- Audit logs for security events
- Log encryption at rest
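As an illustration of the structured-logging and sensitive-data items above, the sketch below emits one JSON object per log line using only the standard library and masks values whose keys look sensitive. It is not the project's logging setup: the `JsonFormatter` class, the `context` attribute, and the `SENSITIVE_KEYS` list are all assumptions made for the example.

```python
import json
import logging

# Assumed list of key names whose values must never reach the logs.
SENSITIVE_KEYS = {"password", "token", "secret", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        # Fields passed via extra={"context": {...}} are copied with
        # sensitive values masked.
        context = getattr(record, "context", {})
        masked = {
            key: "***" if key.lower() in SENSITIVE_KEYS else value
            for key, value in context.items()
        }
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "context": masked,
        }
        return json.dumps(payload, default=str)


def build_logger(name: str = "app") -> logging.Logger:
    """Attach the JSON formatter to a logger at INFO level."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # INFO in production, per the checklist
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger().info("login ok", extra={"context": {"user": "alice", "token": "abc"}})` would log the user but mask the token. Key-based masking is a safety net only; it does not catch secrets embedded in the message text.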
Alerting
Critical Alerts
Critical Alerts (immediate response):
- Service down (all replicas unhealthy)
- Error rate > 5%
- p95 latency > 5s
- Database connection failures
- Out of memory errors
- Disk space < 10%
- Certificate expiration < 7 days
- Keycloak unavailable
- Redis unavailable
- OpenFGA unavailable
Warning Alerts:
- Error rate > 1%
- p95 latency > 2s
- CPU usage > 80%
- Memory usage > 80%
- HPA at max replicas
- Authentication failures > threshold
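The list above gives two thresholds each for error rate (5% and 1%) and p95 latency (5s and 2s). Assuming the lower value of each pair is a warning level and the higher one is critical, the tiering can be sketched as below. The metric names and the `classify` and `evaluate` helpers are hypothetical; in practice these rules would normally live in the alerting system rather than in application code.

```python
# (warning, critical) thresholds taken from the checklist; the mapping of
# the lower value to "warning" is an assumption.
THRESHOLDS = {
    "error_rate_percent": (1.0, 5.0),
    "p95_latency_seconds": (2.0, 5.0),
}


def classify(value: float, warning: float, critical: float) -> str:
    """Return the severity for one metric value."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def evaluate(metrics: dict) -> dict:
    """Classify every metric that has a configured threshold pair."""
    return {
        name: classify(value, *THRESHOLDS[name])
        for name, value in metrics.items()
        if name in THRESHOLDS
    }
```

For example, an error rate of 3% with a p95 latency of 6 seconds would produce a warning for the first metric and a critical alert for the second.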
Performance
Resource Configuration
Resource Limits
Critical:
- CPU requests/limits set for all pods
- Memory requests/limits set for all pods
- Limits based on load testing
- No pods with unbounded resources
- Quality of Service (QoS) class = Guaranteed
- Resource quotas per namespace
- LimitRanges configured
- Resource usage monitored
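Kubernetes assigns the Guaranteed QoS class only when every container in a pod has CPU and memory limits and its requests equal those limits. The sketch below applies that rule to plain dictionaries shaped like a container's `resources` field. The `qos_class` function is hypothetical, and it compares quantity strings literally, so `"1000m"` and `"1"` would be treated as different even though Kubernetes considers them equal.

```python
def qos_class(containers: list) -> str:
    """Classify a pod from its containers' `resources` dictionaries.

    Each item looks like:
        {"requests": {"cpu": "500m", "memory": "1Gi"},
         "limits":   {"cpu": "500m", "memory": "1Gi"}}
    """
    any_set = False
    guaranteed = True
    for resources in containers:
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Running such a check over rendered manifests in CI is one way to catch pods with unbounded resources before they reach the cluster.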
Auto-Scaling
Critical:
- HPA configured and tested
- Min replicas ≥ 3
- Max replicas sufficient for peak load
- Scaling metrics appropriate (CPU, memory)
- Scaling behavior tuned
- Custom metrics for scaling
- Cluster autoscaler enabled
- Vertical Pod Autoscaler considered
- Load testing performed
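When judging whether the maximum replica count is sufficient for peak load, it helps to apply the Horizontal Pod Autoscaler's core formula by hand: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The sketch below does only that. It ignores the controller's tolerance band and stabilization windows, and the default `max_replicas=10` is an assumed value, not one from this checklist.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 3,   # checklist: min replicas >= 3
    max_replicas: int = 10,  # assumed ceiling for illustration
) -> int:
    """Apply the basic HPA scaling formula and clamp to the replica bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For instance, 3 replicas at 90% average CPU against a 60% target scale to 5. If load-test figures push the unclamped result past the maximum, the "HPA at max replicas" alert is likely to fire in production.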
LLM Configuration
Provider Setup
Critical:
- Primary LLM provider configured
- API keys valid and quota sufficient
- Model name correct
- Timeouts configured (60s)
- Fallback provider enabled
- Rate limiting configured
- Retry logic with exponential backoff
- Token usage tracking
- Cost monitoring and alerts
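The retry item above can be sketched as exponential backoff with full jitter. This is an illustrative helper, not the project's retry implementation: `call_with_retries` and its parameters are hypothetical, and `func` is assumed to enforce its own timeout (the checklist suggests 60s). It retries on any exception for brevity; real code should retry only errors known to be transient, such as rate-limit or timeout responses.

```python
import random
import time


def call_with_retries(func, max_attempts=4, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call func(), retrying failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # The delay cap doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists so tests can substitute a recorder for `time.sleep`. A fallback provider could be tried once this function finally raises.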
Operational Readiness
Documentation
Runbooks
Critical:
- Deployment procedure documented
- Rollback procedure documented
- Incident response playbook
- On-call rotation defined
- Escalation path documented
- Architecture diagrams
- Troubleshooting guide
- Disaster recovery plan
- Performance tuning guide
Testing
Critical:
- Unit tests passing (100% of critical paths covered)
- Integration tests passing
- End-to-end tests passing
- Load testing completed
- Chaos engineering tests run
- Property-based tests
- Security testing (penetration tests)
- Disaster recovery drills
- Regression tests automated
Deployment
Pre-Deployment
Critical (complete before deployment):
- All secrets generated and stored
- DNS records configured
- SSL certificates issued
- Database migrations tested
- Smoke tests prepared
- Rollback plan ready
- Deploy to staging first
- Run smoke tests in staging
- Review and approve
- Deploy to production
- Run smoke tests in production
- Monitor for 1 hour
- Declare success or rollback
Post-Deployment
Immediate (within 1 hour):
- Health checks passing
- All pods running
- No error alerts firing
- End-to-end test successful
- Authentication working
- Authorization working
- Monitor error rates
- Monitor latency metrics
- Check resource usage
- Review logs for anomalies
- Verify backups running
Compliance & Governance
Data Privacy
Critical:
- PII handling reviewed
- Data retention policy enforced
- GDPR/CCPA compliance verified
- Data deletion procedure tested
- Privacy policy updated
- Data encryption at rest
- Data encryption in transit
- Data residency requirements met
- Privacy impact assessment completed
Audit & Compliance
Critical:
- Audit logging enabled
- Access logs retained
- Security events tracked
- Compliance requirements met
- SOC 2 / ISO 27001 compliance
- Regular security audits
- Penetration testing
- Vulnerability scanning
Final Validation
Pre-Launch Checklist
Smoke Tests
Next Steps
- Kubernetes Deployment: Deploy to Kubernetes
- Helm Deployment: Deploy with Helm charts
- Monitoring Setup: Configure observability
- Disaster Recovery: Backup and restore
Production Ready: Complete this checklist to ensure a successful production deployment!