Overview
Configure 100+ LLM providers via LiteLLM with automatic fallback, load balancing, and cost optimization. Supports cloud providers (Anthropic, OpenAI, Google) and open-source models (Llama, Mistral, Qwen via Ollama).
LiteLLM provides a unified interface to all major LLM providers with automatic retries, fallback, and intelligent routing.
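For example, a direct LiteLLM call looks the same for every provider; only the model string changes. A minimal sketch, assuming the relevant API key is set in the environment and using LiteLLM's provider-prefixed model strings:
from litellm import completion

# Same call shape for every provider; swap the model string to switch, e.g.
# "gpt-5.1", "gemini/gemini-2.5-flash", or "ollama/llama3.1:8b"
response = completion(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Hello!"}],
    num_retries=3  # automatic retries on transient errors
)
print(response.choices[0].message.content)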
Supported Providers
Anthropic
- Claude Sonnet 4.5
- Claude Opus 4.1
- Claude Haiku 4.5
OpenAI
- GPT-5
- GPT-5 Pro
- GPT-5 Mini
- GPT-5 Nano
Google
- Gemini 2.5 Flash
- Gemini 2.0 Pro
- Gemini 1.5 Pro
Azure OpenAI
- GPT-4 (Azure)
- GPT-3.5 (Azure)
- Custom deployments
AWS Bedrock
- Claude (Bedrock)
- Llama (Bedrock)
- Titan
Ollama
- Llama 3.1
- Mistral
- Qwen 2.5
- DeepSeek
Quick Start
Anthropic Claude
Configure
# .env
LLM_PROVIDER=anthropic
MODEL_NAME=claude-sonnet-4-5
ANTHROPIC_API_KEY=sk-ant-api03-...
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=4096
Test
from mcp_server_langgraph.llm.factory import LLMFactory
llm = LLMFactory(
    provider="anthropic",
    model_name="claude-sonnet-4-5"
)
response = llm.invoke([
    {"role": "user", "content": "Hello!"}
])
print(response.content)
OpenAI GPT
Configure
# .env
LLM_PROVIDER=openai
MODEL_NAME=gpt-5.1
OPENAI_API_KEY=sk-proj-...
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=4096
Test
llm = LLMFactory(
    provider="openai",
    model_name="gpt-5.1"
)
response = llm.invoke([
    {"role": "user", "content": "What is AI?"}
])
Google Gemini
Configure
# .env
LLM_PROVIDER=google
MODEL_NAME=gemini-2.5-flash
GOOGLE_API_KEY=AIza...
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=8192
Test
llm = LLMFactory(
    provider="google",
    model_name="gemini-2.5-flash"
)
response = llm.invoke([
    {"role": "user", "content": "Explain quantum computing"}
])
Azure OpenAI
Setup Azure
- Create Azure OpenAI resource
- Deploy a model (e.g., gpt-4)
- Get endpoint and API key
Configure
# .env
LLM_PROVIDER=azure
MODEL_NAME=azure/gpt-4
AZURE_API_KEY=your-azure-key
AZURE_API_BASE=https://your-resource.openai.azure.com
AZURE_API_VERSION=2024-02-15-preview
AZURE_DEPLOYMENT_NAME=gpt-4
Test
import os

llm = LLMFactory(
    provider="azure",
    model_name="azure/gpt-4",
    api_base=os.getenv("AZURE_API_BASE"),
    api_version=os.getenv("AZURE_API_VERSION")
)
AWS Bedrock
Setup AWS
- Enable Bedrock in AWS Console
- Request model access
- Configure IAM credentials
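To confirm that model access has been granted, you can list the available foundation models. A minimal sketch, assuming boto3 is installed and AWS credentials are configured:
import boto3

# List Bedrock foundation models available in the region
bedrock = boto3.client("bedrock", region_name="us-east-1")
models = bedrock.list_foundation_models()["modelSummaries"]
print([m["modelId"] for m in models if "claude" in m["modelId"]])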
Configure
# .env
LLM_PROVIDER=bedrock
MODEL_NAME=bedrock/anthropic.claude-sonnet-4-5-v2:0
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION_NAME=us-east-1
Test
llm = LLMFactory(
    provider="bedrock",
    model_name="bedrock/anthropic.claude-sonnet-4-5-v2:0"
)
Ollama (Local Models)
Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com/download
Pull Model
# Pull Llama 3.1
ollama pull llama3.1:8b
# Pull Mistral
ollama pull mistral:7b
# Pull Qwen
ollama pull qwen2.5:7b
# List models
ollama list
Configure
# .env
LLM_PROVIDER=ollama
MODEL_NAME=ollama/llama3.1:8b
OLLAMA_API_BASE=http://localhost:11434
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=4096
Test
llm = LLMFactory(
    provider="ollama",
    model_name="ollama/llama3.1:8b",
    api_base="http://localhost:11434"
)
response = llm.invoke([
    {"role": "user", "content": "Hello!"}
])
Automatic Fallback
Configure automatic fallback when primary model fails:
# .env
LLM_PROVIDER=anthropic
MODEL_NAME=claude-sonnet-4-5
ENABLE_FALLBACK=true
FALLBACK_MODELS=gemini-2.5-flash,gpt-5.1,ollama/llama3.1:8b
Fallback Flow: the primary model is tried first; if it fails, each fallback model is attempted in order until one succeeds.
Configuration:
from mcp_server_langgraph.llm.factory import LLMFactory

llm = LLMFactory(
    provider="anthropic",
    model_name="claude-sonnet-4-5",
    enable_fallback=True,
    fallback_models=[
        "gemini-2.5-flash",
        "gpt-5.1",
        "ollama/llama3.1:8b"
    ]
)

# Automatically falls back on error
response = llm.invoke(messages)
Common Failure Scenarios:
- API quota exceeded
- Rate limiting
- Model unavailable
- Timeout
- Invalid API key
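The same behavior can be reproduced with LiteLLM's client-side fallbacks parameter. A minimal sketch; model strings here use LiteLLM's provider prefixes rather than the project's config names:
from litellm import completion

# On quota, rate-limit, timeout, or auth errors, LiteLLM retries the
# request against each fallback model in order
response = completion(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Hello!"}],
    fallbacks=["gemini/gemini-2.5-flash", "gpt-5.1", "ollama/llama3.1:8b"]
)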
Advanced Configuration
Load Balancing
Distribute requests across multiple providers:
import os
from litellm import Router

# Both deployments share the same model group name ("primary", an
# illustrative alias), so the router can distribute requests between them
router = Router(
    model_list=[
        {
            "model_name": "primary",
            "litellm_params": {
                "model": "claude-sonnet-4-5",
                "api_key": os.getenv("ANTHROPIC_API_KEY")
            }
        },
        {
            "model_name": "primary",
            "litellm_params": {
                "model": "gpt-5.1",
                "api_key": os.getenv("OPENAI_API_KEY")
            }
        }
    ],
    routing_strategy="least-busy"  # or "simple-shuffle", "latency-based-routing"
)

response = await router.acompletion(
    model="primary",  # routed to one of the deployments in the group
    messages=[{"role": "user", "content": "Hello"}]
)
Rate Limiting
Prevent quota exhaustion by capping each deployment's request and token rates:
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "claude-sonnet-4-5",
            "litellm_params": {
                "model": "claude-sonnet-4-5",
                "api_key": os.getenv("ANTHROPIC_API_KEY"),
                "rpm": 1000,     # Requests per minute for this deployment
                "tpm": 100000    # Tokens per minute for this deployment
            }
        }
    ],
    redis_host="localhost",  # Redis shares usage counters across workers
    redis_port=6379
)
Cost Tracking
Monitor LLM costs:
from litellm import cost_per_token

response = llm.invoke(messages)

# Calculate cost from the reported token usage
prompt_cost, completion_cost = cost_per_token(
    model=model_name,
    prompt_tokens=response.usage.prompt_tokens,
    completion_tokens=response.usage.completion_tokens
)
print(f"Cost: ${prompt_cost + completion_cost:.4f}")
Caching
Cache responses to reduce costs:
import litellm
from litellm import Cache, completion

# Register a Redis-backed cache globally (1 hour TTL)
litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    ttl=3600  # 1 hour
)

# Cached completion: identical requests are served from the cache
response = completion(
    model="claude-sonnet-4-5",
    messages=messages,
    caching=True
)
LLM Streaming Response Flow
Streaming responses flow from the LLM provider through the server to the client in real time.
Flow Description:
- Client Request: Client sends request to API endpoint
- API Routing: API Gateway routes request to LangGraph agent
- Agent Processing: LangGraph agent processes request and invokes LLM
- LLM Generation: LLM provider generates response token by token
- Token Streaming: Tokens stream back through agent to API
- SSE Delivery: Server-Sent Events push tokens to client in real-time
- Client Reception: Client receives and displays streaming response
Streaming Benefits: Real-time responses, lower perceived latency, better user experience, and efficient token-by-token delivery.
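A minimal sketch of the SSE leg, assuming a FastAPI app and a LangChain-compatible chat model; the endpoint path and model wiring here are illustrative, not the project's actual API:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain_anthropic import ChatAnthropic

app = FastAPI()
llm = ChatAnthropic(model="claude-sonnet-4-5")

@app.get("/chat/stream")
async def chat_stream(q: str):
    async def event_source():
        # Forward each generated chunk to the client as an SSE frame
        async for chunk in llm.astream(q):
            yield f"data: {chunk.content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_source(), media_type="text/event-stream")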
Configuration Reference
Environment Variables
# Provider Selection
LLM_PROVIDER=anthropic|openai|google|azure|bedrock|ollama
MODEL_NAME=model-identifier
# Model Parameters
MODEL_TEMPERATURE=0.0-2.0 # Default: 0.7
MODEL_MAX_TOKENS=1-32000 # Default: 4096
MODEL_TIMEOUT=10-300 # Seconds, Default: 60
MODEL_TOP_P=0.0-1.0 # Default: 1.0
# Fallback
ENABLE_FALLBACK=true|false # Default: false
FALLBACK_MODELS=model1,model2,model3
# Provider API Keys
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-proj-...
GOOGLE_API_KEY=AIza...
AZURE_API_KEY=...
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
# Provider-Specific
AZURE_API_BASE=https://your-resource.openai.azure.com
AZURE_API_VERSION=2024-02-15-preview
AZURE_DEPLOYMENT_NAME=gpt-4
AWS_REGION_NAME=us-east-1
OLLAMA_API_BASE=http://localhost:11434
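These variables can be loaded into typed settings. A minimal sketch using pydantic-settings; the LLMSettings class and field set are illustrative, not the project's actual config module:
from pydantic_settings import BaseSettings, SettingsConfigDict

class LLMSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    llm_provider: str = "anthropic"
    model_name: str = "claude-sonnet-4-5"
    model_temperature: float = 0.7
    model_max_tokens: int = 4096
    enable_fallback: bool = False
    fallback_models: str = ""  # comma-separated, e.g. "gemini-2.5-flash,gpt-5.1"

settings = LLMSettings()  # reads LLM_PROVIDER, MODEL_NAME, ... from .env or the environment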
Model IDs
Anthropic
claude-sonnet-4-5                          # Latest Sonnet
claude-opus-4-1                            # Opus (extended reasoning)
claude-haiku-4-5                           # Haiku (fast, cost-effective)
OpenAI
gpt-5.1                                    # As used in the OpenAI quick start
Google
gemini-2.5-flash                           # Fast, cost-effective Gemini
Azure
azure/gpt-4                                # azure/ prefix plus your deployment name
Bedrock
bedrock/anthropic.claude-sonnet-4-5-v2:0   # Claude on Bedrock
Ollama
ollama/llama3.1:8b                         # Local Llama 3.1
ollama/mistral:7b                          # Local Mistral
ollama/qwen2.5:7b                          # Local Qwen
Troubleshooting
Invalid API Key
# Test the API key directly
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-sonnet-4-5","max_tokens":10,"messages":[{"role":"user","content":"Hi"}]}'
# Check environment
echo $ANTHROPIC_API_KEY
env | grep API_KEY
Rate Limits and Quota Errors
Solutions:
- Enable fallback models
- Implement request queuing
- Increase rate limits (paid plans)
- Add retry with exponential backoff (sketched after the code below)
llm = LLMFactory(
    provider="anthropic",
    enable_fallback=True,
    fallback_models=["gemini-2.5-flash"],
    timeout=120  # Longer timeout
)
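The retry-with-backoff item can be sketched with tenacity; this is illustrative, not the project's built-in retry behavior:
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, max=30))
def invoke_with_backoff(llm, messages):
    # Waits roughly 1s, 2s, 4s, ... (capped at 30s) between attempts, up to 5 tries
    return llm.invoke(messages)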
Ollama Connection Errors
# Check that Ollama is running
ollama list
# Start Ollama
ollama serve
# Test connection
curl http://localhost:11434/api/tags
# Set correct endpoint
OLLAMA_API_BASE=http://localhost:11434
Slow Responses
Optimizations:
- Use faster models (Gemini Flash, Claude Haiku)
- Reduce max_tokens
- Enable streaming so tokens arrive as they are generated
- Cache repeated prompts (see Caching above)
llm = LLMFactory(
    model_name="gemini-2.5-flash",  # Faster model
    max_tokens=1024                 # Lower output limit
)
Next Steps
Flexible & Resilient: Multi-LLM support with automatic fallback ensures high availability and cost optimization!