
LLM Provider Resilience Best Practices

This guide provides production-ready strategies for managing rate limits, concurrency, and resilience across multiple LLM providers. These patterns help you sustain high throughput while preventing rate limit errors and cascading failures.

Overview

When integrating with LLM APIs, you must balance three competing concerns:
  1. Latency - Minimize response time for individual requests
  2. Throughput - Maximize total requests per unit time
  3. Reliability - Prevent rate limit errors and service degradation
This guide covers the mathematical foundations, provider-specific configurations, and implementation patterns to achieve this balance.

Provider Rate Limits Reference

Anthropic Claude (Direct API)

Anthropic uses a tiered system based on usage history. Rate limits are measured in:
  • RPM (Requests Per Minute)
  • ITPM (Input Tokens Per Minute)
  • OTPM (Output Tokens Per Minute)
| Tier | Eligibility | Claude 3.5 Sonnet RPM | Claude 3 Opus RPM | Claude 3 Haiku RPM |
|------|-------------|-----------------------|-------------------|--------------------|
| Free | New accounts | 50 | 5 | 50 |
| Build | Pay-as-you-go | 50 | 50 | 50 |
| Scale | High volume | 1,000+ | 2,000+ | 4,000+ |
Anthropic uses the token bucket algorithm for rate limiting. Capacity continuously replenishes up to your maximum limit, rather than resetting at fixed intervals.
Token Limits (Build Tier):
  • Claude 3.5 Sonnet: 40,000 TPM
  • Claude 3 Opus: 20,000 TPM
  • Claude 3 Haiku: 50,000 TPM
For Scale Tier or higher limits, contact Anthropic sales.

OpenAI (Direct API)

OpenAI organizes customers into usage tiers that unlock automatically based on spend:
| Tier | GPT-4o RPM | GPT-4o TPM | GPT-3.5 Turbo RPM | GPT-3.5 Turbo TPM |
|------|------------|------------|-------------------|-------------------|
| Tier 1 | 500 | 30,000 | 3,500 | 200,000 |
| Tier 2 | 5,000 | 300,000 | 3,500 | 2,000,000 |
| Tier 3 | 5,000+ | 450,000 | 3,500 | 4,000,000 |
| Tier 4 | 10,000 | 800,000 | 10,000 | 10,000,000 |
As of September 2025, OpenAI increased GPT-5 Tier 1 limits to 500,000 TPM and 1,000 RPM.
Shared Limits: Some model families share rate limits. Check your organization’s limits page in the OpenAI Dashboard.

Google Vertex AI (Gemini)

Gemini models on Vertex AI use Dynamic Shared Quota (DSQ) - no hard per-user limits. Quotas are shared across the platform based on capacity.
| Model | Recommended RPM | Notes |
|-------|-----------------|-------|
| Gemini 2.5 Flash | 60+ | High throughput model |
| Gemini 2.5 Pro | 60+ | Complex reasoning |
| Gemini 3.0 Preview | 60+ | Latest capabilities |
With DSQ, focus on monitoring actual throughput rather than pre-defined limits. Use adaptive concurrency (see below) to maximize utilization.

Google Vertex AI (Anthropic Claude via MaaS)

Anthropic Claude models accessed through Vertex AI have different limits than the direct API:
| Model | RPM | Input TPM | Output TPM | Region |
|-------|-----|-----------|------------|--------|
| Claude Opus 4.5 | 1,200 | 12,000,000 | 1,200,000 | us-east5 |
| Claude Sonnet 4.5 | 3,000 | 3,000,000 | 300,000 | us-east5 |
| Claude Haiku 4.5 | 10,000 | 10,000,000 | 1,000,000 | us-east5 |
Claude models are only available in the us-east5 region on Vertex AI. Ensure your application accounts for this geographic constraint.

Azure OpenAI Service

Azure quotas are per-region, per-subscription. The TPM/RPM conversion is:
RPM = TPM ÷ 1000 × 6
| Model | Default TPM | Default RPM | Limit Scope |
|-------|-------------|-------------|-------------|
| GPT-4.1 Global Standard | 1,000,000 | 6,000 | Per subscription, per region |
| GPT-4o | 240,000 | 1,440 | Per subscription, per region |
| GPT-4o Realtime | 100,000 | 1,000 | Per subscription, per region |
Azure allows creating deployments across multiple regions. Your effective limit is quota × number_of_regions.
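
As a quick sanity check of the conversion and multi-region math above, here is a small illustrative helper (the function names are ours, not part of any SDK):

def azure_rpm(tpm: int) -> int:
    # RPM = TPM ÷ 1000 × 6
    return tpm // 1000 * 6

def effective_rpm(per_region_rpm: int, regions: int) -> int:
    # Spreading deployments across regions multiplies the effective limit
    return per_region_rpm * regions

assert azure_rpm(240_000) == 1_440        # GPT-4o default
assert effective_rpm(1_440, 3) == 4_320   # three regional deployments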

AWS Bedrock (Anthropic Claude)

AWS Bedrock quotas are region-specific and vary by account age:
| Account Type | Claude 3.5 Sonnet RPM | Claude 3 Opus RPM |
|--------------|-----------------------|-------------------|
| New accounts (2024+) | 2-50 | 2-50 |
| Established accounts | 200-250 | 200-250 |
New AWS accounts created since 2024 receive significantly lower default quotas. Use established accounts or request quota increases.
Multi-Region Strategy:
Total effective RPM = Quota × Number of regions
Example: 250 RPM × 3 regions = 750 RPM

Concurrency Sizing with Little’s Law

Little’s Law provides the mathematical foundation for sizing concurrent request pools:

Little's Law

Concurrency = Throughput × Average Latency
Where:
  • Concurrency = Number of parallel requests in flight
  • Throughput = Requests per second (RPM ÷ 60)
  • Average Latency = Mean response time in seconds

Practical Application

Example: Claude Sonnet 4.5 on Vertex AI

Given:
  • RPM limit: 3,000
  • Average latency: 1.2 seconds
Calculate:
throughput = 3000 / 60   # = 50 requests/second
concurrency = 50 * 1.2   # = 60 concurrent requests
Recommended: run at roughly 60% of the calculated value, leaving ~40% headroom for latency spikes and retries:
safe_concurrency = 60 * 0.6  # = 36 concurrent requests
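
The same calculation can be wrapped in a small helper. This is an illustrative sketch (the function and its defaults are ours, and rounding may differ slightly from the table below):

import math

def sized_concurrency(rpm: float, avg_latency_s: float, utilization: float = 0.6) -> int:
    """Little's Law with headroom: concurrency = (RPM / 60) × latency × utilization."""
    return math.ceil((rpm / 60.0) * avg_latency_s * utilization)

sized_concurrency(3000, 1.2)   # 36, matching the Vertex AI Claude Sonnet row below
sized_concurrency(1000, 1.5)   # 15, matching the Anthropic (Scale) row below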

Per-Provider Recommendations

| Provider | Effective RPM | Avg Latency | Calculated Concurrency | Recommended |
|----------|---------------|-------------|------------------------|-------------|
| Anthropic (Scale) | 1,000 | 1.5s | 25 | 15 |
| OpenAI (Tier 2) | 5,000 | 0.8s | 67 | 40 |
| Vertex AI Gemini | 60 | 0.5s | 0.5 | 10 (floor) |
| Vertex AI Claude Sonnet | 3,000 | 1.2s | 60 | 36 |
| Azure OpenAI GPT-4 | 1,440 | 1.0s | 24 | 15 |
| AWS Bedrock Claude | 250 | 1.5s | 6 | 5 |

Adaptive Concurrency with AIMD

The AIMD (Additive Increase, Multiplicative Decrease) algorithm, inspired by TCP congestion control, automatically tunes concurrency limits based on observed error rates.

How AIMD Works

On Success Streak: limit = min(max_limit, limit + 1)
On Rate Limit Error: limit = max(min_limit, limit × 0.75)
Parameters:
  • min_limit: Floor value (e.g., 2)
  • max_limit: Ceiling value (e.g., 50)
  • initial_limit: Starting point (e.g., 10)
  • success_streak_threshold: Successes before increase (e.g., 10)
  • decrease_factor: Multiplicative factor on error (e.g., 0.75)
Benefits:
  • Self-healing: Automatically recovers from rate limit errors
  • Adaptive: Finds optimal concurrency for current conditions
  • Conservative: Decreasing quickly and recovering slowly prevents oscillation
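
To make the update rule concrete, here is a minimal, self-contained sketch of the AIMD bookkeeping (illustrative only; the project's AdaptiveBulkhead in the next section is the actual API):

class AIMDLimit:
    """Illustrative AIMD limit: +1 after a success streak, ×decrease_factor on rate limit errors."""

    def __init__(self, initial_limit=10, min_limit=2, max_limit=50,
                 success_streak_threshold=10, decrease_factor=0.75):
        self.limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.success_streak_threshold = success_streak_threshold
        self.decrease_factor = decrease_factor
        self._streak = 0

    def record_success(self):
        self._streak += 1
        if self._streak >= self.success_streak_threshold:
            self.limit = min(self.max_limit, self.limit + 1)  # additive increase
            self._streak = 0

    def record_rate_limit_error(self):
        self._streak = 0
        self.limit = max(self.min_limit, int(self.limit * self.decrease_factor))  # multiplicative decrease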

Implementation

from mcp_server_langgraph.resilience.adaptive import AdaptiveBulkhead

# Create provider-specific adaptive bulkhead
bulkhead = AdaptiveBulkhead(
    min_limit=5,
    max_limit=50,
    initial_limit=25,
    error_threshold=0.1,  # 10% error rate triggers decrease
    decrease_factor=0.75,
    success_streak_threshold=10,
)

async def call_llm(prompt: str) -> str:
    async with bulkhead.get_semaphore():
        try:
            result = await llm.invoke(prompt)
            bulkhead.record_success()
            return result
        except RateLimitError:
            bulkhead.record_error()
            raise

Token Bucket Rate Limiting

The token bucket algorithm provides pre-emptive rate limiting with burst capacity, preventing rate limit errors before they occur.

How Token Bucket Works

  1. Bucket Capacity - The bucket holds tokens up to a maximum (burst) capacity
  2. Token Consumption - Each request consumes one token from the bucket
  3. Refill Rate - Tokens are added at a constant rate (RPM ÷ 60)
  4. Request Gating - Requests wait if the bucket is empty (no 429 errors)
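
For intuition, the core bookkeeping fits in a few lines. This is an illustrative, synchronous sketch; the TokenBucketRateLimiter used below applies the same idea behind an awaitable acquire():

import time

class SimpleTokenBucket:
    """Illustrative token bucket: refills at rpm/60 tokens per second, up to a burst capacity."""

    def __init__(self, rpm: float, burst_seconds: float = 10.0):
        self.rate = rpm / 60.0                      # tokens added per second
        self.capacity = self.rate * burst_seconds   # maximum burst size
        self.tokens = self.capacity
        self.last_refill = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should wait and retry rather than send the request and risk a 429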

Implementation

from mcp_server_langgraph.resilience.rate_limit import TokenBucketRateLimiter

# Configure for provider limits
rate_limiter = TokenBucketRateLimiter(
    rpm=50,             # Requests per minute
    burst_seconds=10,   # Burst capacity (10 seconds worth)
)

async def call_llm(prompt: str) -> str:
    await rate_limiter.acquire()  # Blocks if bucket empty
    return await llm.invoke(prompt)

Per-Provider Defaults

| Provider | Default RPM | Burst Seconds | Tokens/Second |
|----------|-------------|---------------|---------------|
| Anthropic (Direct) | 50 | 10 | 0.83 |
| OpenAI | 60 | 10 | 1.0 |
| Google Vertex AI (Gemini) | 60 | 10 | 1.0 |
| Vertex AI (Anthropic) | 1,000 | 10 | 16.67 |
| Azure OpenAI | 60 | 10 | 1.0 |
| AWS Bedrock | 50 | 10 | 0.83 |

Fallback Strategies

When the primary model fails (rate limit, timeout, circuit open), fallbacks maintain service availability.

Fallback Chain Pattern

import asyncio

FALLBACK_MODELS = [
    "claude-sonnet-4-5",   # Primary
    "claude-haiku-4-5",    # Fast fallback
    "gpt-4o",              # Cross-provider fallback
]

async def invoke_with_fallback(prompt: str) -> str:
    delay = 1.0  # Initial delay for exponential backoff between attempts

    for i, model in enumerate(FALLBACK_MODELS):
        try:
            return await invoke_model(model, prompt)
        except (RateLimitError, TimeoutError):
            if i < len(FALLBACK_MODELS) - 1:
                await asyncio.sleep(delay)
                delay *= 2  # Exponential backoff

    raise AllFallbacksExhaustedError()

Best Practices

  • Protect Fallbacks - Each fallback model should have its own circuit breaker to prevent cascading failures (see the sketch below)
  • Exponential Backoff - Wait 1s, 2s, 4s between fallback attempts to let rate limits recover
  • Cross-Provider - Include models from different providers in the fallback chain
  • Graceful Degradation - Fallbacks may have different capabilities; handle their responses appropriately
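
One way to protect each fallback individually is to keep a breaker per model. The sketch below is illustrative (the SimpleBreaker class and its thresholds are ours, not the project's resilience module), reusing FALLBACK_MODELS from the example above:

import time

class SimpleBreaker:
    """Illustrative circuit breaker: opens after N consecutive failures, rejects calls until a cooldown passes."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            self.opened_at = None   # half-open: let one probe request through
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

# One breaker per fallback model; in invoke_with_fallback(), skip models whose allow() returns False
breakers = {model: SimpleBreaker() for model in FALLBACK_MODELS}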

Retry with Backoff Strategies

Standard vs Overload-Aware Retry

| Configuration | Standard Retry | Overload-Aware Retry |
|---------------|----------------|----------------------|
| Max Attempts | 3 | 6 |
| Exponential Max | 10s | 60s |
| Jitter | Full | Decorrelated |
| Honors Retry-After | No | Yes |
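
The two jitter strategies differ in how each delay is drawn. Below is one common formulation (full jitter for standard retries, decorrelated jitter for overload-aware retries); the parameter defaults mirror the table above:

import random

def full_jitter(attempt: int, base: float = 1.0, cap: float = 10.0) -> float:
    """Standard retry: sleep a random amount up to the capped exponential backoff."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def decorrelated_jitter(previous_sleep: float, base: float = 1.0, cap: float = 60.0) -> float:
    """Overload-aware retry: each sleep is drawn relative to the previous one, spreading clients out."""
    return min(cap, random.uniform(base, previous_sleep * 3))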

Retry-After Header

When receiving a 429 response, use the Retry-After header:
try:
    response = await llm.invoke(prompt)
except RateLimitError as e:
    if e.retry_after:
        await asyncio.sleep(e.retry_after)
        response = await llm.invoke(prompt)

Error Classification

| Error Type | Should Retry? | Strategy |
|------------|---------------|----------|
| 429 Rate Limit | Yes | Use Retry-After or exponential backoff |
| 529 Overload | Yes | Extended retry with decorrelated jitter |
| 500 Server Error | Yes | Standard exponential backoff |
| 400 Bad Request | No | Fix the request before resubmitting |
| 401/403 Auth | No | Fix credentials |
| Timeout | Yes | Retry with an increased timeout |
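
A retry decision helper can mirror this table directly. This is a sketch (extracting the status code from your client's exceptions is left out):

def should_retry(status_code: int | None, timed_out: bool = False) -> bool:
    """Mirror of the classification table above."""
    if timed_out:
        return True                   # retry with an increased timeout
    if status_code in (429, 529):     # rate limited or overloaded
        return True
    if status_code is not None and 500 <= status_code < 600:
        return True                   # transient server errors
    return False                      # other 4xx errors need a request or credential fix, not a retry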

Quota Management by Provider

This section provides detailed step-by-step instructions for requesting and managing quota increases for each LLM provider.

Google Cloud Platform (Vertex AI)

GCP Vertex AI quotas control access to Gemini models and Anthropic Claude models (via Model as a Service).

AWS Bedrock

AWS Bedrock quotas are managed through Service Quotas and vary by region and account age.
Step-by-Step: AWS Console
  1. Navigate to Service Quotas
  2. Find the Model Quota
    • Search for the model name (e.g., “Claude 3.5 Sonnet”)
    • Look for quotas like:
      • Anthropic Claude 3.5 Sonnet Invocations per minute
      • Anthropic Claude 3.5 Sonnet Tokens per minute
  3. Request Increase
    • Click on the quota name
    • Click Request quota increase
    • Enter desired value (e.g., 500 for RPM)
    • Click Request
  4. Check Status
    • Go to Quota request history
    • Status will show: Pending → Approved/Denied
Account Age Matters: New AWS accounts (created after 2024) receive significantly lower default quotas (2-50 RPM). Consider using established accounts or requesting increases immediately.
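
The same request can be scripted with boto3's Service Quotas client. This is a sketch that assumes the Bedrock service code is "bedrock"; the QuotaCode shown is a placeholder you should look up with list_service_quotas:

import boto3

client = boto3.client("service-quotas", region_name="us-east-1")

# List Bedrock quotas to find the code for the model and limit you care about
for quota in client.list_service_quotas(ServiceCode="bedrock")["Quotas"]:
    print(quota["QuotaCode"], quota["QuotaName"], quota["Value"])

# Request an increase (QuotaCode below is a placeholder, not a real code)
client.request_service_quota_increase(
    ServiceCode="bedrock",
    QuotaCode="L-XXXXXXXX",
    DesiredValue=500.0,   # e.g. 500 RPM
)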

Azure OpenAI Service

Azure OpenAI quotas are managed per-subscription, per-region, and can be allocated across deployments.
Step-by-Step: Azure Portal
  1. Access Azure OpenAI Studio
  2. View Current Quotas
    • Click Quotas in the left sidebar
    • View available TPM by model and region
  3. Adjust Deployment Quota
    • Go to Deployments
    • Select your deployment → Edit
    • Adjust Tokens per Minute Rate Limit
    • Click Save
  4. Request Additional Quota
    • If you need more than your subscription limit:
    • Go to Help + support → New support request
    • Select: Issue type = Service and subscription limits (quotas)
    • Service = Azure OpenAI Service
    • Provide justification and desired limits
TPM to RPM Conversion: RPM = TPM ÷ 1000 × 6. Example: 240,000 TPM = 1,440 RPM

Anthropic (Direct API)

Anthropic uses a tier-based system that advances automatically based on usage and spend.
Understanding Anthropic Tiers
| Tier | Requirements | Claude 3.5 Sonnet | Claude 3 Opus | Claude 3 Haiku |
|------|--------------|-------------------|---------------|----------------|
| Free | New account | 50 RPM, 40K TPM | 5 RPM, 10K TPM | 50 RPM, 25K TPM |
| Build | Credit card added | 50 RPM, 40K TPM | 50 RPM, 20K TPM | 50 RPM, 50K TPM |
| Scale | $100+ monthly spend | 1,000+ RPM | 2,000+ RPM | 4,000+ RPM |
| Enterprise | Custom agreement | Custom | Custom | Custom |
Tiers advance automatically based on successful API usage and payment history. There’s no manual upgrade process for Free → Build → Scale.

OpenAI (Direct API)

OpenAI uses automatic tier advancement based on account spend and history.
Understanding OpenAI Tiers
| Tier | Requirements | GPT-4o RPM | GPT-4o TPM | GPT-3.5 Turbo TPM |
|------|--------------|------------|------------|-------------------|
| Free | New account | 3 | 200 | 40,000 |
| Tier 1 | $5+ paid | 500 | 30,000 | 200,000 |
| Tier 2 | $50+ paid, 7+ days | 5,000 | 300,000 | 2,000,000 |
| Tier 3 | $100+ paid, 7+ days | 5,000 | 450,000 | 4,000,000 |
| Tier 4 | $250+ paid, 14+ days | 10,000 | 800,000 | 10,000,000 |
| Tier 5 | $1,000+ paid, 30+ days | 10,000 | 10,000,000 | 50,000,000 |

Configuration Reference

Environment Variables

# Token Bucket Rate Limiting
RATE_LIMIT_RPM=50                    # Requests per minute
RATE_LIMIT_BURST_SECONDS=10          # Burst capacity

# Adaptive Bulkhead (AIMD)
BULKHEAD_MIN_LIMIT=5                 # Floor
BULKHEAD_MAX_LIMIT=50                # Ceiling
BULKHEAD_INITIAL_LIMIT=25            # Starting point
BULKHEAD_ERROR_THRESHOLD=0.1         # 10% triggers decrease
BULKHEAD_DECREASE_FACTOR=0.75        # Reduce by 25%
BULKHEAD_SUCCESS_STREAK=10           # Successes to increase

# Retry Configuration
RETRY_MAX_ATTEMPTS=3                 # Standard retry
RETRY_OVERLOAD_MAX_ATTEMPTS=6        # Extended for 529
RETRY_EXPONENTIAL_MAX=10.0           # Max backoff seconds
RETRY_OVERLOAD_EXPONENTIAL_MAX=60.0  # Extended max
RETRY_HONOR_RETRY_AFTER=true         # Use Retry-After header

# Timeout Configuration
LLM_TIMEOUT_SECONDS=60               # LLM call timeout
FALLBACK_TIMEOUT_SECONDS=30          # Per-fallback timeout

# Fallback Configuration
FALLBACK_DELAY_SECONDS=1.0           # Initial delay
FALLBACK_BACKOFF_MULTIPLIER=2.0      # Exponential factor

Provider-Specific Configuration

# config/resilience.yaml
providers:
  anthropic:
    rpm: 50
    burst_seconds: 10
    bulkhead_limit: 10

  vertex_ai_anthropic:
    rpm: 1000
    burst_seconds: 10
    bulkhead_limit: 25

  openai:
    rpm: 60
    burst_seconds: 10
    bulkhead_limit: 20

  azure:
    rpm: 60
    burst_seconds: 10
    bulkhead_limit: 15

  bedrock:
    rpm: 50
    burst_seconds: 10
    bulkhead_limit: 10

  google:
    rpm: 60
    burst_seconds: 10
    bulkhead_limit: 20
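
At startup, this file might be loaded into per-provider limiters along the following lines. This is a sketch that reuses the TokenBucketRateLimiter signature shown earlier; how bulkhead_limit feeds into AdaptiveBulkhead depends on your implementation:

import yaml
from mcp_server_langgraph.resilience.rate_limit import TokenBucketRateLimiter

with open("config/resilience.yaml") as f:
    config = yaml.safe_load(f)

rate_limiters = {
    provider: TokenBucketRateLimiter(rpm=settings["rpm"], burst_seconds=settings["burst_seconds"])
    for provider, settings in config["providers"].items()
}
# settings["bulkhead_limit"] would similarly seed each provider's concurrency bulkhead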

Monitoring and Observability

Key Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| llm_rate_limit_errors_total | Count of 429/529 errors | > 10/minute |
| llm_circuit_breaker_state | Circuit state (0=closed, 1=open) | == 1 |
| llm_bulkhead_concurrency | Current concurrent requests | > 80% of limit |
| llm_retry_attempts_total | Retry attempts by outcome | > 100/minute |
| llm_token_bucket_wait_seconds | Time waiting for tokens | > 5s at P95 |
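
If you instrument with prometheus_client, these metrics could be declared as follows; the label sets are assumptions and should match your actual instrumentation:

from prometheus_client import Counter, Gauge, Histogram

RATE_LIMIT_ERRORS = Counter(
    "llm_rate_limit_errors_total", "Count of 429/529 errors", ["provider", "status"]
)
CIRCUIT_STATE = Gauge(
    "llm_circuit_breaker_state", "Circuit state (0=closed, 1=open)", ["provider"]
)
BULKHEAD_CONCURRENCY = Gauge(
    "llm_bulkhead_concurrency", "Current concurrent requests", ["provider"]
)
RETRY_ATTEMPTS = Counter(
    "llm_retry_attempts_total", "Retry attempts by outcome", ["provider", "outcome"]
)
TOKEN_BUCKET_WAIT = Histogram(
    "llm_token_bucket_wait_seconds", "Time spent waiting for rate limiter tokens", ["provider"]
)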

Grafana Dashboard

See monitoring/grafana/dashboards/llm-resilience.json for a pre-built dashboard.

Summary

Quick Start Checklist

  1. Identify your provider tier and corresponding limits
  2. Calculate concurrency using Little’s Law, then run at ~60% of the result to leave headroom
  3. Configure token bucket for pre-emptive rate limiting
  4. Enable adaptive bulkhead for self-tuning concurrency
  5. Set up fallback chain with exponential backoff
  6. Monitor metrics and adjust based on observed behavior

Last Updated: December 2025