
LLM Provider Resilience Best Practices

This guide provides production-ready strategies for managing rate limits, concurrency, and resilience across multiple LLM providers. These patterns help you sustain high throughput while preventing rate limit errors and cascading failures.

Overview

When integrating with LLM APIs, you must balance three competing concerns:
  1. Latency - Minimize response time for individual requests
  2. Throughput - Maximize total requests per unit time
  3. Reliability - Prevent rate limit errors and service degradation
This guide covers the mathematical foundations, provider-specific configurations, and implementation patterns to achieve this balance.

Provider Rate Limits Reference

Anthropic Claude (Direct API)

Anthropic uses a tiered system based on usage history. Rate limits are measured in:
  • RPM (Requests Per Minute)
  • ITPM (Input Tokens Per Minute)
  • OTPM (Output Tokens Per Minute)
| Tier | Eligibility | Claude 3.5 Sonnet RPM | Claude 3 Opus RPM | Claude 3 Haiku RPM |
|------|-------------|-----------------------|-------------------|--------------------|
| Free | New accounts | 50 | 5 | 50 |
| Build | Pay-as-you-go | 50 | 50 | 50 |
| Scale | High volume | 1,000+ | 2,000+ | 4,000+ |
Anthropic uses the token bucket algorithm for rate limiting. Capacity continuously replenishes up to your maximum limit, rather than resetting at fixed intervals.
Token Limits (Build Tier):
  • Claude 3.5 Sonnet: 40,000 TPM
  • Claude 3 Opus: 20,000 TPM
  • Claude 3 Haiku: 50,000 TPM
For Scale Tier or higher limits, contact Anthropic sales.

OpenAI (Direct API)

OpenAI organizes customers into usage tiers that unlock automatically based on spend:
| Tier | GPT-4o RPM | GPT-4o TPM | GPT-3.5 Turbo RPM | GPT-3.5 Turbo TPM |
|------|------------|------------|-------------------|-------------------|
| Tier 1 | 500 | 30,000 | 3,500 | 200,000 |
| Tier 2 | 5,000 | 300,000 | 3,500 | 2,000,000 |
| Tier 3 | 5,000+ | 450,000 | 3,500 | 4,000,000 |
| Tier 4 | 10,000 | 800,000 | 10,000 | 10,000,000 |
As of September 2025, OpenAI increased GPT-5 Tier 1 limits to 500,000 TPM and 1,000 RPM.
Shared Limits: Some model families share rate limits. Check your organization’s limits page in the OpenAI Dashboard.

Google Vertex AI (Gemini)

Gemini models on Vertex AI use Dynamic Shared Quota (DSQ) - no hard per-user limits. Quotas are shared across the platform based on capacity.
| Model | Recommended RPM | Notes |
|-------|-----------------|-------|
| Gemini 2.5 Flash | 60+ | High throughput model |
| Gemini 2.5 Pro | 60+ | Complex reasoning |
| Gemini 3.0 Preview | 60+ | Latest capabilities |
With DSQ, focus on monitoring actual throughput rather than pre-defined limits. Use adaptive concurrency (see below) to maximize utilization.

Google Vertex AI (Anthropic Claude via MaaS)

Anthropic Claude models accessed through Vertex AI have different limits than the direct API:
| Model | RPM | Input TPM | Output TPM | Region |
|-------|-----|-----------|------------|--------|
| Claude Opus 4.5 | 1,200 | 12,000,000 | 1,200,000 | us-east5 |
| Claude Sonnet 4.5 | 3,000 | 3,000,000 | 300,000 | us-east5 |
| Claude Haiku 4.5 | 10,000 | 10,000,000 | 1,000,000 | us-east5 |
Claude models are only available in the us-east5 region on Vertex AI. Ensure your application accounts for this geographic constraint.

Azure OpenAI Service

Azure quotas are per-region, per-subscription. The TPM/RPM conversion is:
RPM = TPM ÷ 1000 × 6
| Model | Default TPM | Default RPM | Limit Scope |
|-------|-------------|-------------|-------------|
| GPT-4.1 Global Standard | 1,000,000 | 6,000 | Per subscription, per region |
| GPT-4o | 240,000 | 1,440 | Per subscription, per region |
| GPT-4o Realtime | 100,000 | 1,000 | Per subscription, per region |
Azure allows creating deployments across multiple regions. Your effective limit is quota × number_of_regions.
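
As a quick sanity check of the conversion and multi-region math above, here is a small illustrative helper (the function names are ours, not part of any SDK):

def azure_rpm(tpm: int) -> int:
    # RPM = TPM ÷ 1000 × 6
    return tpm // 1000 * 6

def effective_rpm(per_region_rpm: int, regions: int) -> int:
    # Spreading deployments across regions multiplies the effective limit
    return per_region_rpm * regions

assert azure_rpm(240_000) == 1_440        # GPT-4o default
assert effective_rpm(1_440, 3) == 4_320   # three regional deployments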

AWS Bedrock (Anthropic Claude)

AWS Bedrock quotas are region-specific and vary by account age:
| Account Type | Claude 3.5 Sonnet RPM | Claude 3 Opus RPM |
|--------------|-----------------------|-------------------|
| New accounts (2024+) | 2-50 | 2-50 |
| Established accounts | 200-250 | 200-250 |
New AWS accounts created since 2024 receive significantly lower default quotas. Use established accounts or request quota increases.
Multi-Region Strategy:
Total effective RPM = Quota × Number of regions
Example: 250 RPM × 3 regions = 750 RPM

Concurrency Sizing with Little’s Law

Little’s Law provides the mathematical foundation for sizing concurrent request pools:

Little's Law

Concurrency = Throughput × Average Latency
Where:
  • Concurrency = Number of parallel requests in flight
  • Throughput = Requests per second (RPM ÷ 60)
  • Average Latency = Mean response time in seconds

Practical Application

Example: Claude Sonnet 4.5 on Vertex AI

Given:
  • RPM limit: 3,000
  • Average latency: 1.2 seconds
Calculate:
throughput = 3000 / 60   # = 50 requests/second
concurrency = 50 * 1.2   # = 60 concurrent requests
Recommended: run at roughly 60% of the calculated value, leaving ~40% headroom for latency spikes and retries:
safe_concurrency = 60 * 0.6  # = 36 concurrent requests
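
The same calculation can be wrapped in a small helper. This is an illustrative sketch (the function and its defaults are ours, and rounding may differ slightly from the table below):

import math

def sized_concurrency(rpm: float, avg_latency_s: float, utilization: float = 0.6) -> int:
    """Little's Law with headroom: concurrency = (RPM / 60) × latency × utilization."""
    return math.ceil((rpm / 60.0) * avg_latency_s * utilization)

sized_concurrency(3000, 1.2)   # 36, matching the Vertex AI Claude Sonnet row below
sized_concurrency(1000, 1.5)   # 15, matching the Anthropic (Scale) row below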

Per-Provider Recommendations

| Provider | Effective RPM | Avg Latency | Calculated Concurrency | Recommended |
|----------|---------------|-------------|------------------------|-------------|
| Anthropic (Scale) | 1,000 | 1.5s | 25 | 15 |
| OpenAI (Tier 2) | 5,000 | 0.8s | 67 | 40 |
| Vertex AI Gemini | 60 | 0.5s | 0.5 | 10 (floor) |
| Vertex AI Claude Sonnet | 3,000 | 1.2s | 60 | 36 |
| Azure OpenAI GPT-4 | 1,440 | 1.0s | 24 | 15 |
| AWS Bedrock Claude | 250 | 1.5s | 6 | 5 |

Adaptive Concurrency with AIMD

The AIMD (Additive Increase, Multiplicative Decrease) algorithm, inspired by TCP congestion control, automatically tunes concurrency limits based on observed error rates.

How AIMD Works

On Success Streak: limit = min(max_limit, limit + 1)
On Rate Limit Error: limit = max(min_limit, limit × 0.75)
Parameters:
  • min_limit: Floor value (e.g., 2)
  • max_limit: Ceiling value (e.g., 50)
  • initial_limit: Starting point (e.g., 10)
  • success_streak_threshold: Successes before increase (e.g., 10)
  • decrease_factor: Multiplicative factor on error (e.g., 0.75)
Benefits:
  • Self-healing: Automatically recovers from rate limit errors
  • Adaptive: Finds optimal concurrency for current conditions
  • Conservative: Decreasing quickly and recovering slowly prevents oscillation
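
To make the update rule concrete, here is a minimal, self-contained sketch of the AIMD bookkeeping (illustrative only; the project's AdaptiveBulkhead in the next section is the actual API):

class AIMDLimit:
    """Illustrative AIMD limit: +1 after a success streak, ×decrease_factor on rate limit errors."""

    def __init__(self, initial_limit=10, min_limit=2, max_limit=50,
                 success_streak_threshold=10, decrease_factor=0.75):
        self.limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.success_streak_threshold = success_streak_threshold
        self.decrease_factor = decrease_factor
        self._streak = 0

    def record_success(self):
        self._streak += 1
        if self._streak >= self.success_streak_threshold:
            self.limit = min(self.max_limit, self.limit + 1)  # additive increase
            self._streak = 0

    def record_rate_limit_error(self):
        self._streak = 0
        self.limit = max(self.min_limit, int(self.limit * self.decrease_factor))  # multiplicative decrease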

Implementation

from mcp_server_langgraph.resilience.adaptive import AdaptiveBulkhead

# Create provider-specific adaptive bulkhead
bulkhead = AdaptiveBulkhead(
    min_limit=5,
    max_limit=50,
    initial_limit=25,
    error_threshold=0.1,  # 10% error rate triggers decrease
    decrease_factor=0.75,
    success_streak_threshold=10,
)

async def call_llm(prompt: str) -> str:
    async with bulkhead.get_semaphore():
        try:
            result = await llm.invoke(prompt)
            bulkhead.record_success()
            return result
        except RateLimitError:
            bulkhead.record_error()
            raise

Token Bucket Rate Limiting

The token bucket algorithm provides pre-emptive rate limiting with burst capacity, preventing rate limit errors before they occur.

How Token Bucket Works

  1. Bucket Capacity - The bucket holds tokens up to a maximum (burst) capacity
  2. Token Consumption - Each request consumes one token from the bucket
  3. Refill Rate - Tokens are added at a constant rate (RPM ÷ 60)
  4. Request Gating - Requests wait if the bucket is empty (no 429 errors)
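
For intuition, the core bookkeeping fits in a few lines. This is an illustrative, synchronous sketch; the TokenBucketRateLimiter used below applies the same idea behind an awaitable acquire():

import time

class SimpleTokenBucket:
    """Illustrative token bucket: refills at rpm/60 tokens per second, up to a burst capacity."""

    def __init__(self, rpm: float, burst_seconds: float = 10.0):
        self.rate = rpm / 60.0                      # tokens added per second
        self.capacity = self.rate * burst_seconds   # maximum burst size
        self.tokens = self.capacity
        self.last_refill = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should wait and retry rather than send the request and risk a 429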

Implementation

from mcp_server_langgraph.resilience.rate_limit import TokenBucketRateLimiter

# Configure for provider limits
rate_limiter = TokenBucketRateLimiter(
    rpm=50,             # Requests per minute
    burst_seconds=10,   # Burst capacity (10 seconds worth)
)

async def call_llm(prompt: str) -> str:
    await rate_limiter.acquire()  # Blocks if bucket empty
    return await llm.invoke(prompt)

Per-Provider Defaults

| Provider | Default RPM | Burst Seconds | Tokens/Second |
|----------|-------------|---------------|---------------|
| Anthropic (Direct) | 50 | 10 | 0.83 |
| OpenAI | 60 | 10 | 1.0 |
| Google Vertex AI (Gemini) | 60 | 10 | 1.0 |
| Vertex AI (Anthropic) | 1,000 | 10 | 16.67 |
| Azure OpenAI | 60 | 10 | 1.0 |
| AWS Bedrock | 50 | 10 | 0.83 |

Fallback Strategies

When the primary model fails (rate limit, timeout, circuit open), fallbacks maintain service availability.

Fallback Chain Pattern

import asyncio

FALLBACK_MODELS = [
    "claude-sonnet-4-5",   # Primary
    "claude-haiku-4-5",    # Fast fallback
    "gpt-4o",              # Cross-provider fallback
]

async def invoke_with_fallback(prompt: str) -> str:
    delay = 1.0  # Initial delay for exponential backoff between attempts

    for i, model in enumerate(FALLBACK_MODELS):
        try:
            return await invoke_model(model, prompt)
        except (RateLimitError, TimeoutError):
            if i < len(FALLBACK_MODELS) - 1:
                await asyncio.sleep(delay)
                delay *= 2  # Exponential backoff

    raise AllFallbacksExhaustedError()

Best Practices

  • Protect Fallbacks - Each fallback model should have its own circuit breaker to prevent cascading failures (see the sketch below)
  • Exponential Backoff - Wait 1s, 2s, 4s between fallback attempts to let rate limits recover
  • Cross-Provider - Include models from different providers in the fallback chain
  • Graceful Degradation - Fallbacks may have different capabilities; handle their responses appropriately
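
One way to protect each fallback individually is to keep a breaker per model. The sketch below is illustrative (the SimpleBreaker class and its thresholds are ours, not the project's resilience module), reusing FALLBACK_MODELS from the example above:

import time

class SimpleBreaker:
    """Illustrative circuit breaker: opens after N consecutive failures, rejects calls until a cooldown passes."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            self.opened_at = None   # half-open: let one probe request through
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

# One breaker per fallback model; in invoke_with_fallback(), skip models whose allow() returns False
breakers = {model: SimpleBreaker() for model in FALLBACK_MODELS}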

Retry with Backoff Strategies

Standard vs Overload-Aware Retry

| Configuration | Standard Retry | Overload-Aware Retry |
|---------------|----------------|----------------------|
| Max Attempts | 3 | 6 |
| Exponential Max | 10s | 60s |
| Jitter | Full | Decorrelated |
| Honors Retry-After | No | Yes |
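
The two jitter strategies differ in how each delay is drawn. Below is one common formulation (full jitter for standard retries, decorrelated jitter for overload-aware retries); the parameter defaults mirror the table above:

import random

def full_jitter(attempt: int, base: float = 1.0, cap: float = 10.0) -> float:
    """Standard retry: sleep a random amount up to the capped exponential backoff."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def decorrelated_jitter(previous_sleep: float, base: float = 1.0, cap: float = 60.0) -> float:
    """Overload-aware retry: each sleep is drawn relative to the previous one, spreading clients out."""
    return min(cap, random.uniform(base, previous_sleep * 3))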

Retry-After Header

When receiving a 429 response, use the Retry-After header:
try:
    response = await llm.invoke(prompt)
except RateLimitError as e:
    if e.retry_after:
        await asyncio.sleep(e.retry_after)
        response = await llm.invoke(prompt)

Error Classification

| Error Type | Should Retry? | Strategy |
|------------|---------------|----------|
| 429 Rate Limit | Yes | Use Retry-After or exponential backoff |
| 529 Overload | Yes | Extended retry with decorrelated jitter |
| 500 Server Error | Yes | Standard exponential backoff |
| 400 Bad Request | No | Fix the request before resubmitting |
| 401/403 Auth | No | Fix credentials |
| Timeout | Yes | Retry with an increased timeout |
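
A retry decision helper can mirror this table directly. This is a sketch (extracting the status code from your client's exceptions is left out):

def should_retry(status_code: int | None, timed_out: bool = False) -> bool:
    """Mirror of the classification table above."""
    if timed_out:
        return True                   # retry with an increased timeout
    if status_code in (429, 529):     # rate limited or overloaded
        return True
    if status_code is not None and 500 <= status_code < 600:
        return True                   # transient server errors
    return False                      # other 4xx errors need a request or credential fix, not a retry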

Quota Management by Provider

This section provides detailed step-by-step instructions for requesting and managing quota increases for each LLM provider.

Google Cloud Platform (Vertex AI)

GCP Vertex AI quotas control access to Gemini models and Anthropic Claude models (via Model as a Service).

AWS Bedrock

AWS Bedrock quotas are managed through Service Quotas and vary by region and account age.
Step-by-Step: AWS Console
  1. Navigate to Service Quotas
  2. Find the Model Quota
    • Search for the model name (e.g., “Claude 3.5 Sonnet”)
    • Look for quotas like:
      • Anthropic Claude 3.5 Sonnet Invocations per minute
      • Anthropic Claude 3.5 Sonnet Tokens per minute
  3. Request Increase
    • Click on the quota name
    • Click Request quota increase
    • Enter desired value (e.g., 500 for RPM)
    • Click Request
  4. Check Status
    • Go to Quota request history
    • Status will show: Pending → Approved/Denied
Account Age Matters: New AWS accounts (created after 2024) receive significantly lower default quotas (2-50 RPM). Consider using established accounts or requesting increases immediately.
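
The same request can be scripted with boto3's Service Quotas client. This is a sketch that assumes the Bedrock service code is "bedrock"; the QuotaCode shown is a placeholder you should look up with list_service_quotas:

import boto3

client = boto3.client("service-quotas", region_name="us-east-1")

# List Bedrock quotas to find the code for the model and limit you care about
for quota in client.list_service_quotas(ServiceCode="bedrock")["Quotas"]:
    print(quota["QuotaCode"], quota["QuotaName"], quota["Value"])

# Request an increase (QuotaCode below is a placeholder, not a real code)
client.request_service_quota_increase(
    ServiceCode="bedrock",
    QuotaCode="L-XXXXXXXX",
    DesiredValue=500.0,   # e.g. 500 RPM
)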

Azure OpenAI Service

Azure OpenAI quotas are managed per-subscription, per-region, and can be allocated across deployments.
Step-by-Step: Azure Portal
  1. Access Azure OpenAI Studio
  2. View Current Quotas
    • Click Quotas in the left sidebar
    • View available TPM by model and region
  3. Adjust Deployment Quota
    • Go to Deployments
    • Select your deployment → Edit
    • Adjust Tokens per Minute Rate Limit
    • Click Save
  4. Request Additional Quota
    • If you need more than your subscription limit:
    • Go to Help + support → New support request
    • Select: Issue type = Service and subscription limits (quotas)
    • Service = Azure OpenAI Service
    • Provide justification and desired limits
TPM to RPM Conversion: RPM = TPM ÷ 1000 × 6. Example: 240,000 TPM = 1,440 RPM

Anthropic (Direct API)

Anthropic uses a tier-based system that advances automatically based on usage and spend.
Understanding Anthropic Tiers
| Tier | Requirements | Claude 3.5 Sonnet | Claude 3 Opus | Claude 3 Haiku |
|------|--------------|-------------------|---------------|----------------|
| Free | New account | 50 RPM, 40K TPM | 5 RPM, 10K TPM | 50 RPM, 25K TPM |
| Build | Credit card added | 50 RPM, 40K TPM | 50 RPM, 20K TPM | 50 RPM, 50K TPM |
| Scale | $100+ monthly spend | 1,000+ RPM | 2,000+ RPM | 4,000+ RPM |
| Enterprise | Custom agreement | Custom | Custom | Custom |
Tiers advance automatically based on successful API usage and payment history. There’s no manual upgrade process for Free → Build → Scale.

OpenAI (Direct API)

OpenAI uses automatic tier advancement based on account spend and history.
Understanding OpenAI Tiers
| Tier | Requirements | GPT-4o RPM | GPT-4o TPM | GPT-3.5 Turbo TPM |
|------|--------------|------------|------------|-------------------|
| Free | New account | 3 | 200 | 40,000 |
| Tier 1 | $5+ paid | 500 | 30,000 | 200,000 |
| Tier 2 | $50+ paid, 7+ days | 5,000 | 300,000 | 2,000,000 |
| Tier 3 | $100+ paid, 7+ days | 5,000 | 450,000 | 4,000,000 |
| Tier 4 | $250+ paid, 14+ days | 10,000 | 800,000 | 10,000,000 |
| Tier 5 | $1,000+ paid, 30+ days | 10,000 | 10,000,000 | 50,000,000 |

Configuration Reference

Environment Variables

# Token Bucket Rate Limiting
RATE_LIMIT_RPM=50                    # Requests per minute
RATE_LIMIT_BURST_SECONDS=10          # Burst capacity

# Adaptive Bulkhead (AIMD)
BULKHEAD_MIN_LIMIT=5                 # Floor
BULKHEAD_MAX_LIMIT=50                # Ceiling
BULKHEAD_INITIAL_LIMIT=25            # Starting point
BULKHEAD_ERROR_THRESHOLD=0.1         # 10% triggers decrease
BULKHEAD_DECREASE_FACTOR=0.75        # Reduce by 25%
BULKHEAD_SUCCESS_STREAK=10           # Successes to increase

# Retry Configuration
RETRY_MAX_ATTEMPTS=3                 # Standard retry
RETRY_OVERLOAD_MAX_ATTEMPTS=6        # Extended for 529
RETRY_EXPONENTIAL_MAX=10.0           # Max backoff seconds
RETRY_OVERLOAD_EXPONENTIAL_MAX=60.0  # Extended max
RETRY_HONOR_RETRY_AFTER=true         # Use Retry-After header

# Timeout Configuration
LLM_TIMEOUT_SECONDS=60               # LLM call timeout
FALLBACK_TIMEOUT_SECONDS=30          # Per-fallback timeout

# Fallback Configuration
FALLBACK_DELAY_SECONDS=1.0           # Initial delay
FALLBACK_BACKOFF_MULTIPLIER=2.0      # Exponential factor

Provider-Specific Configuration

# config/resilience.yaml
providers:
  anthropic:
    rpm: 50
    burst_seconds: 10
    bulkhead_limit: 10

  vertex_ai_anthropic:
    rpm: 1000
    burst_seconds: 10
    bulkhead_limit: 25

  openai:
    rpm: 60
    burst_seconds: 10
    bulkhead_limit: 20

  azure:
    rpm: 60
    burst_seconds: 10
    bulkhead_limit: 15

  bedrock:
    rpm: 50
    burst_seconds: 10
    bulkhead_limit: 10

  google:
    rpm: 60
    burst_seconds: 10
    bulkhead_limit: 20
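
At startup, this file might be loaded into per-provider limiters along the following lines. This is a sketch that reuses the TokenBucketRateLimiter signature shown earlier; how bulkhead_limit feeds into AdaptiveBulkhead depends on your implementation:

import yaml
from mcp_server_langgraph.resilience.rate_limit import TokenBucketRateLimiter

with open("config/resilience.yaml") as f:
    config = yaml.safe_load(f)

rate_limiters = {
    provider: TokenBucketRateLimiter(rpm=settings["rpm"], burst_seconds=settings["burst_seconds"])
    for provider, settings in config["providers"].items()
}
# settings["bulkhead_limit"] would similarly seed each provider's concurrency bulkhead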

Monitoring and Observability

Key Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| llm_rate_limit_errors_total | Count of 429/529 errors | > 10/minute |
| llm_circuit_breaker_state | Circuit state (0=closed, 1=open) | == 1 |
| llm_bulkhead_concurrency | Current concurrent requests | > 80% of limit |
| llm_retry_attempts_total | Retry attempts by outcome | > 100/minute |
| llm_token_bucket_wait_seconds | Time waiting for tokens | > 5s at P95 |
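
If you instrument with prometheus_client, these metrics could be declared as follows; the label sets are assumptions and should match your actual instrumentation:

from prometheus_client import Counter, Gauge, Histogram

RATE_LIMIT_ERRORS = Counter(
    "llm_rate_limit_errors_total", "Count of 429/529 errors", ["provider", "status"]
)
CIRCUIT_STATE = Gauge(
    "llm_circuit_breaker_state", "Circuit state (0=closed, 1=open)", ["provider"]
)
BULKHEAD_CONCURRENCY = Gauge(
    "llm_bulkhead_concurrency", "Current concurrent requests", ["provider"]
)
RETRY_ATTEMPTS = Counter(
    "llm_retry_attempts_total", "Retry attempts by outcome", ["provider", "outcome"]
)
TOKEN_BUCKET_WAIT = Histogram(
    "llm_token_bucket_wait_seconds", "Time spent waiting for rate limiter tokens", ["provider"]
)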

Grafana Dashboard

See monitoring/grafana/dashboards/llm-resilience.json for a pre-built dashboard.

Summary

Quick Start Checklist

  1. Identify your provider tier and corresponding limits
  2. Calculate concurrency using Little’s Law, then run at ~60% of the result to leave headroom
  3. Configure token bucket for pre-emptive rate limiting
  4. Enable adaptive bulkhead for self-tuning concurrency
  5. Set up fallback chain with exponential backoff
  6. Monitor metrics and adjust based on observed behavior

Last Updated: December 2025