LLM Provider Resilience Best Practices
This guide provides production-ready strategies for managing rate limits, concurrency, and resilience across multiple LLM providers. These patterns ensure optimal throughput while preventing rate limit errors and cascading failures.
Overview
When integrating with LLM APIs, you must balance three competing concerns:
- Latency - Minimize response time for individual requests
- Throughput - Maximize total requests per unit time
- Reliability - Prevent rate limit errors and service degradation
Provider Rate Limits Reference
Anthropic Claude (Direct API)
Anthropic uses a tiered system based on usage history. Rate limits are measured in:
- RPM (Requests Per Minute)
- ITPM (Input Tokens Per Minute)
- OTPM (Output Tokens Per Minute)
| Tier | Eligibility | Claude 3.5 Sonnet RPM | Claude 3 Opus RPM | Claude 3 Haiku RPM |
|---|---|---|---|---|
| Free | New accounts | 50 | 5 | 50 |
| Build | Pay-as-you-go | 50 | 50 | 50 |
| Scale | High volume | 1,000+ | 2,000+ | 4,000+ |
Anthropic uses the token bucket algorithm for rate limiting. Capacity continuously replenishes up to your maximum limit, rather than resetting at fixed intervals.
Build tier token limits (TPM) by model:
- Claude 3.5 Sonnet: 40,000 TPM
- Claude 3 Opus: 20,000 TPM
- Claude 3 Haiku: 50,000 TPM
OpenAI (Direct API)
OpenAI organizes customers into usage tiers that unlock automatically based on spend:
| Tier | GPT-4o RPM | GPT-4o TPM | GPT-3.5 Turbo RPM | GPT-3.5 Turbo TPM |
|---|---|---|---|---|
| Tier 1 | 500 | 30,000 | 3,500 | 200,000 |
| Tier 2 | 5,000 | 300,000 | 3,500 | 2,000,000 |
| Tier 3 | 5,000+ | 450,000 | 3,500 | 4,000,000 |
| Tier 4 | 10,000 | 800,000 | 10,000 | 10,000,000 |
Google Vertex AI (Gemini)
Gemini models on Vertex AI use Dynamic Shared Quota (DSQ) - no hard per-user limits; quotas are shared across the platform based on capacity.
| Model | Recommended RPM | Notes |
|---|---|---|
| Gemini 2.5 Flash | 60+ | High throughput model |
| Gemini 2.5 Pro | 60+ | Complex reasoning |
| Gemini 3.0 Preview | 60+ | Latest capabilities |
With DSQ, focus on monitoring actual throughput rather than pre-defined limits. Use adaptive concurrency (see below) to maximize utilization.
Google Vertex AI (Anthropic Claude via MaaS)
Anthropic Claude models accessed through Vertex AI have different limits than the direct API:
| Model | RPM | Input TPM | Output TPM | Region |
|---|---|---|---|---|
| Claude Opus 4.5 | 1,200 | 12,000,000 | 1,200,000 | us-east5 |
| Claude Sonnet 4.5 | 3,000 | 3,000,000 | 300,000 | us-east5 |
| Claude Haiku 4.5 | 10,000 | 10,000,000 | 1,000,000 | us-east5 |
Azure OpenAI Service
Azure quotas are per-region and per-subscription. RPM is derived from the TPM allocation (roughly 6 RPM per 1,000 TPM for most models):
| Model | Default TPM | Default RPM | Quota Scope |
|---|---|---|---|
| GPT-4.1 Global Standard | 1,000,000 | 6,000 | Per subscription |
| GPT-4o | 240,000 | 1,440 | Per subscription |
| GPT-4o Realtime | 100,000 | 1,000 | Per subscription |
AWS Bedrock (Anthropic Claude)
AWS Bedrock quotas are region-specific and vary by account age:
| Account Type | Claude 3.5 Sonnet RPM | Claude 3 Opus RPM |
|---|---|---|
| New accounts (2024+) | 2-50 | 2-50 |
| Established accounts | 200-250 | 200-250 |
Concurrency Sizing with Little’s Law
Little's Law provides the mathematical foundation for sizing concurrent request pools:
Concurrency = Throughput × Average Latency
where:
- Concurrency = Number of parallel requests in flight
- Throughput = Requests per second (RPM ÷ 60)
- Average Latency = Mean response time in seconds
Practical Application
Example: Claude Sonnet 4.5 on Vertex AI. Given:
- RPM limit: 3,000 (throughput = 3,000 ÷ 60 = 50 requests/second)
- Average latency: 1.2 seconds
Calculated concurrency = 50 × 1.2 = 60 requests in flight; running at roughly 60% of that theoretical maximum gives the recommended value of 36 used in the table below.
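The same arithmetic as a small helper (a sketch; the 0.6 safety factor mirrors the Recommended column in the next table and is a convention of this guide, not a provider requirement):

```python
def sized_concurrency(rpm_limit: float, avg_latency_s: float,
                      safety_factor: float = 0.6) -> int:
    """Little's Law: concurrency = throughput (requests/second) x average latency (seconds)."""
    calculated = (rpm_limit / 60.0) * avg_latency_s
    # Run below the theoretical maximum to leave headroom for latency spikes.
    return max(1, round(calculated * safety_factor))

# Claude Sonnet 4.5 on Vertex AI: 3,000 RPM at 1.2 s average latency -> 36
print(sized_concurrency(3000, 1.2))
```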
Per-Provider Recommendations
| Provider | Effective RPM | Avg Latency | Calculated Concurrency | Recommended |
|---|---|---|---|---|
| Anthropic (Scale) | 1,000 | 1.5s | 25 | 15 |
| OpenAI (Tier 2) | 5,000 | 0.8s | 67 | 40 |
| Vertex AI Gemini | 60 | 0.5s | 0.5 | 10 (floor) |
| Vertex AI Claude Sonnet | 3,000 | 1.2s | 60 | 36 |
| Azure OpenAI GPT-4 | 1,440 | 1.0s | 24 | 15 |
| AWS Bedrock Claude | 250 | 1.5s | 6 | 5 |
Adaptive Concurrency with AIMD
The AIMD (Additive Increase, Multiplicative Decrease) algorithm, inspired by TCP congestion control, automatically tunes concurrency limits based on observed error rates.
How AIMD Works
After a configurable streak of successful requests, the concurrency limit increases by one (additive increase). On a rate limit or overload error, the limit is multiplied by a decrease factor such as 0.75 and the streak resets (multiplicative decrease). The limit is always clamped between a floor and a ceiling.
AIMD Algorithm Details
Parameters:
- `min_limit`: Floor value (e.g., 2)
- `max_limit`: Ceiling value (e.g., 50)
- `initial_limit`: Starting point (e.g., 10)
- `success_streak_threshold`: Successes before an increase (e.g., 10)
- `decrease_factor`: Multiplicative factor on error (e.g., 0.75)
Benefits:
- Self-healing: Automatically recovers from rate limit errors
- Adaptive: Finds optimal concurrency for current conditions
- Conservative: Fast decrease, slow recovery prevents oscillation
Implementation
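The full implementation is not reproduced here; the following is a minimal asyncio sketch of an AIMD limiter built from the parameters above (the class and method names are illustrative, not an existing library API):

```python
import asyncio

class AIMDLimiter:
    """Adaptive concurrency limiter: additive increase, multiplicative decrease (sketch)."""

    def __init__(self, initial_limit=10, min_limit=2, max_limit=50,
                 success_streak_threshold=10, decrease_factor=0.75):
        self.limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.success_streak_threshold = success_streak_threshold
        self.decrease_factor = decrease_factor
        self._streak = 0
        self._in_flight = 0
        self._cond = asyncio.Condition()

    async def acquire(self) -> None:
        """Block until a slot is free under the current limit."""
        async with self._cond:
            while self._in_flight >= self.limit:
                await self._cond.wait()
            self._in_flight += 1

    async def release(self, rate_limited: bool) -> None:
        """Report the outcome of a finished request and free its slot."""
        async with self._cond:
            self._in_flight -= 1
            if rate_limited:
                # Multiplicative decrease on 429/529; never go below the floor.
                self.limit = max(self.min_limit, int(self.limit * self.decrease_factor))
                self._streak = 0
            else:
                # Additive increase (+1) after a streak of successes, capped at the ceiling.
                self._streak += 1
                if self._streak >= self.success_streak_threshold:
                    self.limit = min(self.max_limit, self.limit + 1)
                    self._streak = 0
            self._cond.notify_all()
```

Callers wrap each request as `await limiter.acquire()` followed by `await limiter.release(rate_limited=...)` in a `try`/`finally` block.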
Token Bucket Rate Limiting
The token bucket algorithm provides pre-emptive rate limiting with burst capacity, preventing rate limit errors before they occur.
How Token Bucket Works
Tokens are added to the bucket at a steady rate (RPM ÷ 60 per second) up to a fixed capacity that allows short bursts. Each request consumes a token; when the bucket is empty, the request waits until enough tokens have accumulated.
Implementation
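A minimal asyncio sketch of the bucket, parameterized the same way as the defaults table that follows (names are illustrative):

```python
import asyncio
import time

class TokenBucket:
    """Pre-emptive rate limiter: refills continuously, allows short bursts (sketch)."""

    def __init__(self, rpm: float = 60, burst_seconds: float = 10):
        self.rate = rpm / 60.0                      # tokens added per second
        self.capacity = self.rate * burst_seconds   # maximum burst size
        self.tokens = self.capacity
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self, tokens: float = 1.0) -> float:
        """Wait until `tokens` are available; returns the total time spent waiting."""
        waited = 0.0
        while True:
            async with self._lock:
                now = time.monotonic()
                # Continuous refill, capped at capacity (no fixed-window resets).
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= tokens:
                    self.tokens -= tokens
                    return waited
                shortfall = (tokens - self.tokens) / self.rate
            await asyncio.sleep(shortfall)
            waited += shortfall
```

Before each provider call, `await bucket.acquire()`; the returned wait time can feed the `llm_token_bucket_wait_seconds` metric described under Monitoring.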
Per-Provider Defaults
| Provider | Default RPM | Burst Seconds | Tokens/Second |
|---|---|---|---|
| Anthropic (Direct) | 50 | 10 | 0.83 |
| OpenAI | 60 | 10 | 1.0 |
| Google Vertex AI (Gemini) | 60 | 10 | 1.0 |
| Vertex AI (Anthropic) | 1,000 | 10 | 16.67 |
| Azure OpenAI | 60 | 10 | 1.0 |
| AWS Bedrock | 50 | 10 | 0.83 |
Fallback Strategies
When the primary model fails (rate limit, timeout, circuit open), fallbacks maintain service availability.
Fallback Chain Pattern
Requests go to the primary model first; on failure they cascade through an ordered list of fallback models, ideally spanning multiple providers, until one succeeds or the chain is exhausted.
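A minimal sketch of the chain; the per-provider call functions are placeholders you would supply (each already wrapped in its own circuit breaker and retry policy, per the best practices below):

```python
import asyncio

class AllFallbacksFailed(Exception):
    """Raised when every model in the chain has failed."""

async def call_with_fallbacks(prompt, chain, backoffs=(1, 2, 4)):
    """Try each (model_name, async_callable) in order, backing off between attempts."""
    last_error = None
    for attempt, (model_name, call_model) in enumerate(chain):
        try:
            return await call_model(prompt)
        except Exception as exc:  # in practice: rate limit, timeout, or open circuit
            last_error = exc
            if attempt < len(chain) - 1:
                # Exponential backoff (1s, 2s, 4s) before trying the next fallback.
                await asyncio.sleep(backoffs[min(attempt, len(backoffs) - 1)])
    raise AllFallbacksFailed("all models in the fallback chain failed") from last_error
```

Chains typically mix providers (e.g., a Vertex AI primary with OpenAI and direct Anthropic fallbacks) so a single provider outage cannot exhaust them.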
Best Practices
- Protect fallbacks: Give each fallback model its own circuit breaker to prevent cascading failures.
- Exponential backoff: Wait 1s, 2s, 4s between fallback attempts to let rate limits recover.
- Cross-provider: Include models from different providers in the fallback chain.
- Graceful degradation: Fallback models may have different capabilities; handle their output appropriately.
Retry with Backoff Strategies
Standard vs Overload-Aware Retry
| Configuration | Standard Retry | Overload-Aware Retry |
|---|---|---|
| Max Attempts | 3 | 6 |
| Exponential Max | 10s | 60s |
| Jitter | Full | Decorrelated |
| Retry-After Honor | No | Yes |
Retry-After Header
When receiving a 429 response, use the `Retry-After` header to determine how long to wait before retrying.
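A sketch of honoring it (using `httpx`; this assumes the delay-seconds form of `Retry-After` rather than an HTTP date):

```python
import asyncio
import random
import httpx

RETRYABLE = {429, 500, 529}

async def post_with_retry(client: httpx.AsyncClient, url: str, payload: dict,
                          max_attempts: int = 6, max_backoff: float = 60.0) -> httpx.Response:
    response = None
    for attempt in range(max_attempts):
        response = await client.post(url, json=payload)
        if response.status_code not in RETRYABLE:
            return response
        if attempt == max_attempts - 1:
            break
        retry_after = response.headers.get("retry-after")
        if retry_after is not None:
            # The server said exactly how long to wait: honor it.
            delay = float(retry_after)
        else:
            # Otherwise fall back to exponential backoff with full jitter.
            delay = random.uniform(0, min(max_backoff, 2 ** attempt))
        await asyncio.sleep(delay)
    return response
```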
Error Classification
| Error Type | Should Retry? | Strategy |
|---|---|---|
| 429 Rate Limit | Yes | Use Retry-After or exponential backoff |
| 529 Overload | Yes | Extended retry with decorrelated jitter |
| 500 Server Error | Yes | Standard exponential backoff |
| 400 Bad Request | No | Fix the request before resubmitting |
| 401/403 Auth | No | Fix credentials |
| Timeout | Yes | With increased timeout |
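Encoded as a small predicate (a sketch; how you surface timeouts and status codes depends on your HTTP client):

```python
RETRYABLE_STATUS = {429, 529, 500}      # rate limit, overload, server error
NON_RETRYABLE_STATUS = {400, 401, 403}  # bad request, bad credentials

def should_retry(status_code: int | None, timed_out: bool = False) -> bool:
    """Mirror the error classification table: retry transient failures only."""
    if timed_out:
        return True   # retry, ideally with an increased timeout
    if status_code in NON_RETRYABLE_STATUS:
        return False  # fix the request or credentials instead
    return status_code in RETRYABLE_STATUS
```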
Quota Management by Provider
This section provides detailed step-by-step instructions for requesting and managing quota increases for each LLM provider.
Google Cloud Platform (Vertex AI)
GCP Vertex AI quotas control access to Gemini models and Anthropic Claude models (via Model as a Service). Quota increases can be requested via:
- Console (Recommended)
- gcloud CLI
- Terraform
Step-by-Step: GCP Console
1. Navigate to Quotas
   - Go to IAM & Admin → Quotas & System Limits
   - Select your project from the dropdown
2. Find the Quota
   - In the Filter box, enter: `aiplatform.googleapis.com`
   - For Gemini models, search for metrics like `GenerateContentRequestsPerMinutePerProjectPerRegion` and `TokensPerMinutePerProjectPerRegion`
   - For Claude models (MaaS), search for `anthropic_claude_requests_per_minute` and `anthropic_claude_tokens_per_minute`
3. Request Increase
   - Select the checkbox next to the quota
   - Click Edit Quotas at the top
   - Enter your desired new limit
   - Provide a justification (e.g., “Production workload requires 3,000 RPM for Claude Sonnet 4.5”)
   - Click Submit Request
4. Track Status
   - Check your email for approval (typically 24-48 hours)
   - View pending requests under Quota Increase Requests
AWS Bedrock
AWS Bedrock quotas are managed through Service Quotas and vary by region and account age. Quota increases can be requested via:
- Console
- AWS CLI
- Multi-Region Strategy
Step-by-Step: AWS Console
1. Navigate to Service Quotas
   - Go to Service Quotas → Amazon Bedrock
   - Ensure you’re in the correct region (e.g., `us-east-1`)
2. Find the Model Quota
   - Search for the model name (e.g., “Claude 3.5 Sonnet”)
   - Look for quotas like `Anthropic Claude 3.5 Sonnet Invocations per minute` and `Anthropic Claude 3.5 Sonnet Tokens per minute`
3. Request Increase
   - Click on the quota name
   - Click Request quota increase
   - Enter the desired value (e.g., 500 for RPM)
   - Click Request
4. Check Status
   - Go to Quota request history
   - Status will show: Pending → Approved/Denied
Azure OpenAI Service
Azure OpenAI quotas are managed per-subscription, per-region, and can be allocated across deployments. Quota can be managed via:
- Azure Portal
- Azure CLI
- Multi-Region Deployment
Step-by-Step: Azure Portal
1. Access Azure OpenAI Studio
   - Go to Azure OpenAI Studio
   - Select your resource
2. View Current Quotas
   - Click Quotas in the left sidebar
   - View available TPM by model and region
3. Adjust Deployment Quota
   - Go to Deployments
   - Select your deployment → Edit
   - Adjust Tokens per Minute Rate Limit
   - Click Save
4. Request Additional Quota (if you need more than your subscription limit)
   - Go to Help + support → New support request
   - Select Issue type = Service and subscription limits (quotas)
   - Select Service = Azure OpenAI Service
   - Provide justification and desired limits
Anthropic (Direct API)
Anthropic uses a tier-based system that advances automatically based on usage and spend. This section covers:
- Tier System
- Requesting Higher Limits
- API Headers
Understanding Anthropic Tiers
| Tier | Requirements | Claude 3.5 Sonnet | Claude 3 Opus | Claude 3 Haiku |
|---|---|---|---|---|
| Free | New account | 50 RPM, 40K TPM | 5 RPM, 10K TPM | 50 RPM, 25K TPM |
| Build | Credit card added | 50 RPM, 40K TPM | 50 RPM, 20K TPM | 50 RPM, 50K TPM |
| Scale | $100+ monthly spend | 1,000+ RPM | 2,000+ RPM | 4,000+ RPM |
| Enterprise | Custom agreement | Custom | Custom | Custom |
Tiers advance automatically based on successful API usage and payment history. There’s no manual upgrade process for Free → Build → Scale.
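For the API Headers topic above, a hedged sketch of reading your current limits from response headers (the `anthropic-ratelimit-*` header names should be verified against current Anthropic documentation; the request is plain HTTP via `httpx`, and the model ID is illustrative):

```python
import os
import httpx

def check_anthropic_limits() -> dict:
    """Issue a tiny request and return the rate limit headers from the response."""
    response = httpx.post(
        "https://api.anthropic.com/v1/messages",
        headers={
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        json={
            "model": "claude-3-5-haiku-latest",
            "max_tokens": 1,
            "messages": [{"role": "user", "content": "ping"}],
        },
    )
    return {
        name: response.headers.get(name)
        for name in (
            "anthropic-ratelimit-requests-limit",
            "anthropic-ratelimit-requests-remaining",
            "anthropic-ratelimit-tokens-limit",
            "anthropic-ratelimit-tokens-remaining",
            "retry-after",
        )
    }
```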
OpenAI (Direct API)
OpenAI uses automatic tier advancement based on account spend and history. This section covers:
- Tier System
- Checking & Requesting Limits
- API Headers
Understanding OpenAI Tiers
| Tier | Requirements | GPT-4o RPM | GPT-4o TPM | GPT-3.5 Turbo TPM |
|---|---|---|---|---|
| Free | New account | 3 | 200 | 40,000 |
| Tier 1 | $5+ paid | 500 | 30,000 | 200,000 |
| Tier 2 | $50+ paid, 7+ days | 5,000 | 300,000 | 2,000,000 |
| Tier 3 | $100+ paid, 7+ days | 5,000 | 450,000 | 4,000,000 |
| Tier 4 | $250+ paid, 14+ days | 10,000 | 800,000 | 10,000,000 |
| Tier 5 | $1,000+ paid, 30+ days | 10,000 | 10,000,000 | 50,000,000 |
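Similarly for OpenAI’s API Headers: the `x-ratelimit-*` response headers report your tier’s live limits. A hedged `httpx` sketch (header and model names should be verified against current OpenAI documentation):

```python
import os
import httpx

def check_openai_limits() -> dict:
    """Issue a tiny chat completion and return the rate limit headers from the response."""
    response = httpx.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o-mini",
            "max_tokens": 1,
            "messages": [{"role": "user", "content": "ping"}],
        },
    )
    return {
        name: response.headers.get(name)
        for name in (
            "x-ratelimit-limit-requests",
            "x-ratelimit-remaining-requests",
            "x-ratelimit-limit-tokens",
            "x-ratelimit-remaining-tokens",
            "x-ratelimit-reset-requests",
        )
    }
```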
Configuration Reference
Environment Variables
Provider-Specific Configuration
Monitoring and Observability
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `llm_rate_limit_errors_total` | Count of 429/529 errors | > 10/minute |
| `llm_circuit_breaker_state` | Circuit state (0=closed, 1=open) | == 1 |
| `llm_bulkhead_concurrency` | Current concurrent requests | > 80% of limit |
| `llm_retry_attempts_total` | Retry attempts by outcome | > 100/minute |
| `llm_token_bucket_wait_seconds` | Time waiting for tokens | > 5s P95 |
Grafana Dashboard
See `monitoring/grafana/dashboards/llm-resilience.json` for a pre-built dashboard.
Summary
Quick Start Checklist
- Identify your provider tier and corresponding limits
- Calculate concurrency using Little’s Law, then run at roughly 60% of the calculated value
- Configure token bucket for pre-emptive rate limiting
- Enable adaptive bulkhead for self-tuning concurrency
- Set up fallback chain with exponential backoff
- Monitor metrics and adjust based on observed behavior
References
- Anthropic Rate Limits Documentation
- OpenAI Rate Limits Guide
- Azure OpenAI Quotas
- AWS Bedrock Quotas
- Google Vertex AI Quotas
- ADR-0030: Resilience Patterns
- ADR-0027: Rate Limiting Strategy
Last Updated: December 2025