Overview

Configure 100+ LLM providers via LiteLLM, with automatic retries, fallback, load balancing, and cost optimization. Both cloud providers (Anthropic, OpenAI, Google, Azure OpenAI, AWS Bedrock) and local open-source models (Llama, Mistral, Qwen via Ollama) are supported through LiteLLM's unified interface.
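
For illustration, the call shape stays the same regardless of provider; a minimal sketch using litellm directly (assuming the relevant API keys are set in the environment):

from litellm import completion

messages = [{"role": "user", "content": "Hello!"}]

# Anthropic (reads ANTHROPIC_API_KEY)
response = completion(model="claude-sonnet-4-5", messages=messages)

# Local model via Ollama (defaults to http://localhost:11434)
response = completion(model="ollama/llama3.1:8b", messages=messages)

print(response.choices[0].message.content)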

Supported Providers

Anthropic

  • Claude Sonnet 4.5
  • Claude Opus 4.1
  • Claude Haiku 4.5

OpenAI

  • GPT-5
  • GPT-5 Pro
  • GPT-5 Mini
  • GPT-5 Nano

Google

  • Gemini 2.5 Flash
  • Gemini 2.0 Pro
  • Gemini 1.5 Pro

Azure OpenAI

  • GPT-4 (Azure)
  • GPT-3.5 (Azure)
  • Custom deployments

AWS Bedrock

  • Claude (Bedrock)
  • Llama (Bedrock)
  • Titan

Ollama

  • Llama 3.1
  • Mistral
  • Qwen 2.5
  • DeepSeek

Quick Start

Anthropic Claude

Step 1: Get API Key

  1. Sign up at https://console.anthropic.com
  2. Generate API key
  3. Note your organization ID

Step 2: Configure

# .env
LLM_PROVIDER=anthropic
MODEL_NAME=claude-sonnet-4-5
ANTHROPIC_API_KEY=sk-ant-api03-...
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=4096

Step 3: Test

from mcp_server_langgraph.llm.factory import LLMFactory

llm = LLMFactory(
    provider="anthropic",
    model_name="claude-sonnet-4-5"
)

response = llm.invoke([
    {"role": "user", "content": "Hello!"}
])
print(response.content)

OpenAI GPT

Step 1: Get API Key

  1. Sign up at https://platform.openai.com
  2. Create API key
  3. Add billing information

Step 2: Configure

# .env
LLM_PROVIDER=openai
MODEL_NAME=gpt-5.1
OPENAI_API_KEY=sk-proj-...
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=4096

Step 3: Test

llm = LLMFactory(
    provider="openai",
    model_name="gpt-5.1"
)

response = llm.invoke([
    {"role": "user", "content": "What is AI?"}
])

Google Gemini

Step 1: Get API Key

  1. Go to https://makersuite.google.com/app/apikey
  2. Create API key
  3. Enable Gemini API

Step 2: Configure

# .env
LLM_PROVIDER=google
MODEL_NAME=gemini-2.5-flash
GOOGLE_API_KEY=AIza...
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=8192

Step 3: Test

llm = LLMFactory(
    provider="google",
    model_name="gemini-2.5-flash"
)

response = llm.invoke([
    {"role": "user", "content": "Explain quantum computing"}
])

Azure OpenAI

Step 1: Set Up Azure

  1. Create Azure OpenAI resource
  2. Deploy a model (e.g., gpt-4)
  3. Get endpoint and API key

Step 2: Configure

# .env
LLM_PROVIDER=azure
MODEL_NAME=azure/gpt-4
AZURE_API_KEY=your-azure-key
AZURE_API_BASE=https://your-resource.openai.azure.com
AZURE_API_VERSION=2024-02-15-preview
AZURE_DEPLOYMENT_NAME=gpt-4

Step 3: Test

import os

llm = LLMFactory(
    provider="azure",
    model_name="azure/gpt-4",
    api_base=os.getenv("AZURE_API_BASE"),
    api_version=os.getenv("AZURE_API_VERSION")
)

AWS Bedrock

Step 1: Set Up AWS

  1. Enable Bedrock in AWS Console
  2. Request model access
  3. Configure IAM credentials

Step 2: Configure

# .env
LLM_PROVIDER=bedrock
MODEL_NAME=bedrock/anthropic.claude-sonnet-4-5-v2:0
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION_NAME=us-east-1

Step 3: Test

llm = LLMFactory(
    provider="bedrock",
    model_name="bedrock/anthropic.claude-sonnet-4-5-v2:0"
)

Ollama (Local Models)

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from https://ollama.com/download

Step 2: Pull Model

# Pull Llama 3.1
ollama pull llama3.1:8b

# Pull Mistral
ollama pull mistral:7b

# Pull Qwen
ollama pull qwen2.5:7b

# List models
ollama list

Step 3: Configure

# .env
LLM_PROVIDER=ollama
MODEL_NAME=ollama/llama3.1:8b
OLLAMA_API_BASE=http://localhost:11434
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=4096

Step 4: Test

llm = LLMFactory(
    provider="ollama",
    model_name="ollama/llama3.1:8b",
    api_base="http://localhost:11434"
)

response = llm.invoke([
    {"role": "user", "content": "Hello!"}
])

Automatic Fallback

Configure automatic fallback for when the primary model fails:
# .env
LLM_PROVIDER=anthropic
MODEL_NAME=claude-sonnet-4-5
ENABLE_FALLBACK=true
FALLBACK_MODELS=gemini-2.5-flash,gpt-5.1,ollama/llama3.1:8b

Fallback flow: the primary model is tried first; on failure, each fallback model is tried in order until one succeeds.

Programmatic configuration:
from mcp_server_langgraph.llm.factory import LLMFactory

llm = LLMFactory(
    provider="anthropic",
    model_name="claude-sonnet-4-5",
    enable_fallback=True,
    fallback_models=[
        "gemini-2.5-flash",
        "gpt-5.1",
        "ollama/llama3.1:8b"
    ]
)

# Automatically falls back on error
response = llm.invoke(messages)

Common failure scenarios that trigger fallback:
  • API quota exceeded
  • Rate limiting
  • Model unavailable
  • Timeout
  • Invalid API key
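
LiteLLM can also handle fallback directly, without the factory wrapper; a minimal sketch (assuming API keys for the fallback providers are configured):

from litellm import completion

# If the primary model errors (quota, rate limit, timeout), LiteLLM retries
# the same request against the fallback models in order
response = completion(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Hello!"}],
    fallbacks=["gemini-2.5-flash", "gpt-5.1"]
)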

Model Comparison

| Model | Speed | Quality | Context | Cost |
|-------|-------|---------|---------|------|
| Claude Sonnet 4.5 | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 200K | $$ |
| GPT-4o | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 128K | $$$ |
| Gemini 2.5 Flash | ⚡⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1M | $ |
| Llama 3.1 8B | ⚡⚡⚡⚡ | ⭐⭐⭐ | 128K | Free |

Advanced Configuration

Load Balancing

Distribute requests across multiple providers:
import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "claude-sonnet-4-5",
            "litellm_params": {
                "model": "claude-sonnet-4-5",
                "api_key": os.getenv("ANTHROPIC_API_KEY")
            }
        },
        {
            "model_name": "gpt-5.1",
            "litellm_params": {
                "model": "gpt-5.1",
                "api_key": os.getenv("OPENAI_API_KEY")
            }
        }
    ],
    routing_strategy="least-busy"  # or "simple-shuffle", "latency-based"
)

# Route a request (call this from inside an async function)
response = await router.acompletion(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Hello"}]
)

Rate Limiting

Prevent quota exhaustion:
from litellm import Router

router = Router(
    model_list=[...],
    redis_host="localhost",
    redis_port=6379,
    rpm=1000,  # Requests per minute
    tpm=100000  # Tokens per minute
)

Cost Tracking

Monitor LLM costs:
from litellm import completion_cost

response = llm.invoke(messages)

# Calculate cost
cost = completion_cost(
    model=model_name,
    prompt_tokens=response.usage.prompt_tokens,
    completion_tokens=response.usage.completion_tokens
)

print(f"Cost: ${cost:.4f}")

Caching

Cache responses to reduce costs:
import litellm
from litellm import completion

# Configure a Redis-backed response cache (entries expire after 1 hour)
litellm.cache = litellm.Cache(
    type="redis",
    host="localhost",
    port=6379,
    ttl=3600
)

# Identical requests are served from the cache
response = completion(
    model="claude-sonnet-4-5",
    messages=messages,
    caching=True
)

LLM Streaming Response Flow

Streaming responses flow from the LLM through the server to the client as follows:
  1. Client Request: Client sends request to API endpoint
  2. API Routing: API Gateway routes request to LangGraph agent
  3. Agent Processing: LangGraph agent processes request and invokes LLM
  4. LLM Generation: LLM provider generates response token by token
  5. Token Streaming: Tokens stream back through agent to API
  6. SSE Delivery: Server-Sent Events push tokens to client in real-time
  7. Client Reception: Client receives and displays streaming response
Streaming Benefits: Real-time responses, lower perceived latency, better user experience, and efficient token-by-token delivery.
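
As a rough server-side sketch, assuming a FastAPI app and LiteLLM's streaming interface (the endpoint path and payload shape are illustrative, not this project's actual API):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from litellm import completion

app = FastAPI()

@app.post("/chat/stream")
async def chat_stream(payload: dict):
    def token_stream():
        # stream=True makes LiteLLM yield OpenAI-style chunks token by token
        for chunk in completion(
            model="claude-sonnet-4-5",
            messages=payload["messages"],
            stream=True,
        ):
            token = chunk.choices[0].delta.content or ""
            if token:
                # Each token is pushed to the client as a Server-Sent Event
                yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(token_stream(), media_type="text/event-stream")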

Configuration Reference

Environment Variables

# Provider Selection
LLM_PROVIDER=anthropic|openai|google|azure|bedrock|ollama
MODEL_NAME=model-identifier

# Model Parameters
MODEL_TEMPERATURE=0.0-2.0  # Default: 0.7
MODEL_MAX_TOKENS=1-32000   # Default: 4096
MODEL_TIMEOUT=10-300       # Seconds, Default: 60
MODEL_TOP_P=0.0-1.0        # Default: 1.0

# Fallback
ENABLE_FALLBACK=true|false  # Default: false
FALLBACK_MODELS=model1,model2,model3

# Provider API Keys
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-proj-...
GOOGLE_API_KEY=AIza...
AZURE_API_KEY=...
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...

# Provider-Specific
AZURE_API_BASE=https://your-resource.openai.azure.com
AZURE_API_VERSION=2024-02-15-preview
AZURE_DEPLOYMENT_NAME=gpt-4
AWS_REGION_NAME=us-east-1
OLLAMA_API_BASE=http://localhost:11434
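
At runtime these variables can be read and passed to the factory; a minimal sketch using os.getenv (the parameter names passed to LLMFactory mirror the earlier examples and may differ from the project's actual settings module):

import os

from mcp_server_langgraph.llm.factory import LLMFactory

# Read provider selection and model parameters from the environment
provider = os.getenv("LLM_PROVIDER", "anthropic")
model_name = os.getenv("MODEL_NAME", "claude-sonnet-4-5")
temperature = float(os.getenv("MODEL_TEMPERATURE", "0.7"))
max_tokens = int(os.getenv("MODEL_MAX_TOKENS", "4096"))
enable_fallback = os.getenv("ENABLE_FALLBACK", "false").lower() == "true"
fallback_models = [m for m in os.getenv("FALLBACK_MODELS", "").split(",") if m]

llm = LLMFactory(
    provider=provider,
    model_name=model_name,
    temperature=temperature,
    max_tokens=max_tokens,
    enable_fallback=enable_fallback,
    fallback_models=fallback_models,
)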

Model IDs

Anthropic:

claude-sonnet-4-5             # Latest Sonnet
claude-opus-4-1               # Opus (extended reasoning)
claude-haiku-4-5              # Haiku (fast, cost-effective)

Troubleshooting

Invalid API key errors: test the key directly against the provider (Anthropic shown):

# Test API key
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-sonnet-4-5","max_tokens":10,"messages":[{"role":"user","content":"Hi"}]}'

# Check environment
echo $ANTHROPIC_API_KEY
env | grep API_KEY

If you hit rate limits or quota errors, try these solutions:
  • Enable fallback models
  • Implement request queuing
  • Increase rate limits (paid plans)
  • Add retry with exponential backoff (see the sketch after the fallback example below)
llm = LLMFactory(
    provider="anthropic",
    enable_fallback=True,
    fallback_models=["gemini-2.5-flash"],
    timeout=120  # Longer timeout
)
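
For the retry-with-exponential-backoff suggestion above, a minimal sketch using the tenacity library (wrapping the factory interface shown in the earlier examples):

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=2, max=60))
def invoke_with_backoff(llm, messages):
    # Retries up to 5 times, waiting exponentially longer between attempts (capped at 60s)
    return llm.invoke(messages)

response = invoke_with_backoff(llm, [{"role": "user", "content": "Hello!"}])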

Ollama connection errors:

# Check that Ollama is running
ollama list

# Start Ollama
ollama serve

# Test connection
curl http://localhost:11434/api/tags

# Set correct endpoint
OLLAMA_API_BASE=http://localhost:11434

If responses are slow, try these optimizations:
  • Use faster models (Gemini Flash, Claude Haiku)
  • Reduce max_tokens
  • Increase temperature for faster sampling
  • Enable streaming
llm = LLMFactory(
    model_name="gemini-2.5-flash",  # Faster
    max_tokens=1024,  # Lower limit
    temperature=0.9  # Faster sampling
)

Next Steps


Flexible & Resilient: Multi-LLM support with automatic fallback ensures high availability and cost optimization!