27. Rate Limiting Strategy for API Protection
Date: 2025-10-20
Status: Accepted
Category: Performance & Resilience
Context
The MCP server exposes HTTP endpoints without rate limiting, making it vulnerable to:
- Denial of Service (DoS): Malicious actors overwhelming the system
- Resource exhaustion: Legitimate users consuming excessive resources
- Cost explosion: Uncontrolled LLM API usage leading to high bills
- Brute force attacks: Repeated authentication attempts
- Data scraping: Automated extraction of sensitive data
Current state:
- No application-level rate limiting
- Kong gateway integration documented but not deployed
- All users have unlimited access
- No protection against abuse
- LLM costs not controlled per user
Risk assessment:
- Likelihood: HIGH (public API, no authentication required for some endpoints)
- Impact: CRITICAL (service unavailability, financial loss)
- CVSS Score: 7.5 (High) - DoS vulnerability
Compliance requirements:
- SOC 2: Access controls and abuse prevention
- GDPR: Prevent excessive data processing
- OWASP Top 10: A05:2021 - Security Misconfiguration
Decision
Implement a hybrid rate limiting strategy with two layers:
Layer 1: Application-Level Rate Limiting (Immediate)
Implementation: FastAPI middleware using the slowapi library.
Why slowapi:
- Native FastAPI/Starlette support
- Redis-backed for distributed rate limiting
- Decorator-based API (developer-friendly)
- Customizable response codes and headers
- IP address and user-based limiting
Rate limit tiers:
| Tier | Requests/Min | Requests/Hour | Requests/Day | Use Case |
|---|---|---|---|---|
| Anonymous | 10 | 100 | 1,000 | Public endpoints (health, docs) |
| Free | 60 | 1,000 | 10,000 | Registered users (JWT required) |
| Standard | 300 | 5,000 | 50,000 | Paid tier 1 |
| Premium | 1,000 | 20,000 | 200,000 | Paid tier 2 |
| Enterprise | Unlimited | Unlimited | Unlimited | Enterprise contracts |
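The tiers above could be captured as a plain mapping that the limiter reads at request time. The names below (`TIER_LIMITS`, `limit_string`) are illustrative, not the actual module's API:

```python
from typing import Dict, Optional

# Mirrors the tier table above; None represents "unlimited" (enterprise).
TIER_LIMITS: Dict[str, Optional[Dict[str, int]]] = {
    "anonymous":  {"minute": 10,    "hour": 100,    "day": 1000},
    "free":       {"minute": 60,    "hour": 1000,   "day": 10000},
    "standard":   {"minute": 300,   "hour": 5000,   "day": 50000},
    "premium":    {"minute": 1000,  "hour": 20000,  "day": 200000},
    "enterprise": None,
}

def limit_string(tier: str) -> Optional[str]:
    """Render a tier as a limits-style string, e.g. '60/minute;1000/hour;10000/day'.

    Unknown tiers fall back to anonymous; None means no limit is applied.
    """
    limits = TIER_LIMITS.get(tier, TIER_LIMITS["anonymous"])
    if limits is None:
        return None  # enterprise: skip rate limiting entirely
    return ";".join(f"{count}/{window}" for window, count in limits.items())
```

The semicolon-separated form matches the multi-window syntax understood by the limits library underlying slowapi.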
Layer 2: Kong API Gateway (Production)
Implementation: Deploy Kong gateway in front of the MCP server.
Why Kong:
- Industry-standard API gateway
- Advanced rate limiting (sliding window, fixed window, leaky bucket)
- Per-consumer, per-route, global limits
- Rate limit sharing across cluster
- Plugin ecosystem (auth, logging, monitoring)
When to use each layer:
- Application-Level: Development, staging, single-instance deployments
- Kong Gateway: Production, multi-region, high-scale deployments
Architecture
New Module: src/mcp_server_langgraph/middleware/rate_limiter.py
FastAPI Integration
Dynamic Rate Limiting (Tier-Based)
Endpoint-Specific Limits
Metrics & Observability
New Metrics (15+)
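A few of the planned counters, sketched with prometheus_client; the metric names are suggestions rather than the final set of 15+:

```python
from prometheus_client import Counter, Histogram

# Exported as rate_limit_exceeded_total (prometheus_client appends _total).
rate_limit_exceeded = Counter(
    "rate_limit_exceeded",
    "Requests rejected with HTTP 429",
    ["tier", "endpoint"],
)
rate_limit_checks = Counter(
    "rate_limit_checks",
    "Rate limit lookups performed",
    ["tier", "result"],  # result: allowed | rejected
)
rate_limit_redis_latency = Histogram(
    "rate_limit_redis_latency_seconds",
    "Latency of Redis rate-limit lookups",
)

# Example: record a rejection on /chat for a free-tier user.
rate_limit_exceeded.labels(tier="free", endpoint="/chat").inc()
```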
Alerts
Grafana Dashboard
Panel: Rate Limit Overview
- Requests by tier (stacked area chart)
- Rate limit violations (time series)
- Top violators (table: IP, user, count)
- Redis latency (heatmap)
Configuration
Environment Variables
Feature Flag
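The variable names below are illustrative, not the project's actual settings. A minimal stdlib reader covering both the environment variables and the enable/mode feature flag might look like:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class RateLimitSettings:
    enabled: bool      # feature flag: turn limiting off without a redeploy
    mode: str          # "log_only" or "enforce" (see Migration Path)
    redis_url: str
    default_tier: str

def load_settings(env: Optional[Mapping[str, str]] = None) -> RateLimitSettings:
    env = os.environ if env is None else env
    return RateLimitSettings(
        enabled=env.get("RATE_LIMIT_ENABLED", "true").lower() == "true",
        mode=env.get("RATE_LIMIT_MODE", "log_only"),
        redis_url=env.get("RATE_LIMIT_REDIS_URL", "redis://localhost:6379/0"),
        default_tier=env.get("RATE_LIMIT_DEFAULT_TIER", "anonymous"),
    )
```

Defaulting to `log_only` keeps a fresh deployment in the observe-first posture described under Migration Path.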
Consequences
Positive
- DoS Protection
  - Prevent resource exhaustion from malicious actors
  - Limit blast radius of attacks (per-user isolation)
- Cost Control
  - Cap LLM API usage per user (prevent runaway bills)
  - Predictable infrastructure costs
- Fair Resource Allocation
  - Prevent one user from monopolizing resources
  - Ensure equitable access for all users
- Compliance
  - Meet SOC 2 access control requirements
  - GDPR excessive processing prevention
- Monetization
  - Enable tiered pricing (free, standard, premium)
  - Upsell opportunities (upgrade for higher limits)
Negative
- Legitimate Users Blocked
  - Burst traffic may hit limits (false positives)
  - Shared IP addresses (NAT, VPN) penalized
- Configuration Complexity
  - Need to tune limits per endpoint
  - Balance between security and usability
- Performance Overhead
  - Redis lookup on every request (~1-2ms)
  - Increased system complexity
- User Friction
  - 429 errors may frustrate users
  - Need clear error messages and upgrade paths
Mitigations
- Start Conservative: High initial limits, lower based on usage
- Burst Allowance: Allow short bursts above limit (leaky bucket algorithm)
- Whitelist: Exempt trusted IPs, monitoring tools
- Clear Communication: Display limits in API docs, error messages
- Graceful Degradation: Fall back to in-memory if Redis is down
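The burst-allowance mitigation is classically implemented as a token bucket (equivalent to the leaky-bucket-as-a-meter formulation): a client may burst up to the bucket capacity, then is throttled to the refill rate. A small sketch with an injectable clock for deterministic testing; class and parameter names are illustrative:

```python
import time

class TokenBucket:
    """Allow short bursts up to `capacity` requests, refilling at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._tokens = capacity  # start full: a fresh client may burst immediately
        self._now = now          # injectable clock for deterministic tests
        self._last = now()

    def allow(self) -> bool:
        """Consume one token if available; refill based on elapsed time first."""
        t = self._now()
        self._tokens = min(self._capacity, self._tokens + (t - self._last) * self._rate)
        self._last = t
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```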
Alternatives Considered
Alternative 1: NGINX Rate Limiting
- Pros: High performance, battle-tested
- Cons: Limited to IP-based, no user-tier support, requires separate config
- Decision: Use for infrastructure layer, slowapi for application logic
Alternative 2: Cloudflare Rate Limiting
- Pros: DDoS protection, global edge network
- Cons: Vendor lock-in, cost, limited customization
- Decision: Keep as option for enterprise deployments
Alternative 3: Token Bucket Algorithm (Custom)
- Pros: Full control, optimal for burst traffic
- Cons: Complex implementation, testing overhead
- Decision: Use slowapi (proven library) instead
Alternative 4: No Rate Limiting (Current State)
- Pros: Simple, no friction
- Cons: Vulnerable to abuse, uncontrolled costs
- Decision: Unacceptable for production
Implementation Plan
Week 1: Foundation
- Create ADR-0027 (this document)
- Install the slowapi library: pip install slowapi
- Create middleware/rate_limiter.py module
- Implement basic rate limiter with Redis backend
- Add tier-based limit configuration
- Write 30+ unit tests
Week 2: Integration
- Apply rate limiting to all FastAPI endpoints
- Implement custom key function (prefer user ID, fall back to IP)
- Add exception handler for 429 responses
- Configure endpoint-specific limits
- Add rate limit headers to responses
Week 3: Observability
- Implement rate limit metrics
- Create Grafana dashboard
- Add Prometheus alerts for violations
- Integrate with OpenTelemetry tracing
- Write integration tests
Week 4: Testing & Rollout
- Load test: Verify limits are enforced
- Chaos test: Kill Redis, verify fail-open
- User acceptance test: Verify error messages
- Deploy to staging (log-only mode)
- Monitor for 1 week, tune limits
Week 5: Production
- Deploy to production (10% traffic)
- Monitor metrics, watch for issues
- Gradually increase to 100% over 2 weeks
- Document troubleshooting guide
Testing Strategy
Unit Tests
Integration Tests
Chaos Tests
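The chaos case above (Redis down, limiter fails open) can be unit-tested against a fake storage without killing a real Redis. `check_limit` and its fail-open behavior are hypothetical, sketched here rather than taken from the actual module:

```python
class RedisDown(Exception):
    """Stand-in for a Redis connection error."""

class BrokenStorage:
    """Simulates a Redis outage: every operation raises."""
    def incr(self, key: str) -> int:
        raise RedisDown("connection refused")

class CountingStorage:
    """Healthy in-memory stand-in for Redis INCR."""
    def __init__(self):
        self.counts = {}
    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

def check_limit(storage, key: str, limit: int) -> bool:
    """Return True if the request is allowed; fail open when storage errors."""
    try:
        return storage.incr(key) <= limit
    except RedisDown:
        return True  # availability over strictness: never block on backend failure

def test_fail_open_when_redis_down():
    assert check_limit(BrokenStorage(), "user:alice", limit=10) is True
```

The same fake-storage pattern covers the unit tests (limit enforced once the counter passes the threshold).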
Migration Path
Phase 1: Log-Only (Week 1)
- Deploy rate limiter in log_only mode
- Collect metrics on who would be rate limited
- Tune limits based on actual usage
Phase 2: Soft Enforcement (Week 2-3)
- Switch to enforce mode for anonymous users only
- Monitor impact, adjust limits
- Communicate with users about upcoming enforcement
Phase 3: Full Enforcement (Week 4+)
- Enable for all users
- Monitor closely for regressions
- Provide upgrade paths for users hitting limits
References
- OWASP Rate Limiting: https://cheatsheetseries.owasp.org/cheatsheets/Denial_of_Service_Cheat_Sheet.html
- slowapi Library: https://github.com/laurents/slowapi
- Kong Rate Limiting: https://docs.konghq.com/hub/kong-inc/rate-limiting/
- Google Cloud Armor: https://cloud.google.com/armor/docs/rate-limiting-overview
- ADR-0026: Resilience Patterns: ./adr-0026-resilience-patterns.md
- Kong Integration Guide: ../integrations/kong.md
Success Metrics
Security
- Target: 0 successful DoS attacks
- Measurement: No service degradation from single source
Performance
- Target: < 2ms latency overhead from rate limiting
- Measurement: P95 latency with vs without rate limiting
User Experience
- Target: < 1% of legitimate requests rate limited
- Measurement: rate_limit_exceeded_total / http_requests_total < 0.01
Cost Control
- Target: LLM costs capped at $X per user per day
- Measurement: Daily cost tracking per user ID
Last Updated: 2025-10-20
Next Review: 2025-11-20 (after 1 month in production)