## Overview

The codebase supports multiple LLM providers through LiteLLM, including:

- Google (Gemini models)
- Anthropic (Claude models)
- OpenAI (GPT models)
- Azure OpenAI
- AWS Bedrock
- Ollama (local models)
## Environment-Specific Defaults

### Development (Docker Compose, Local)

Default Configuration: `gemini-2.5-flash-002` (Gemini Flash)

Why this default:

- Fast Iterations: Gemini Flash has low latency (~1-2s response time)
- Low Cost: ~$0.30 per 1M output tokens
- Good Quality: Sufficient for development and testing
- High Quota: Generous free tier and rate limits

Use Cases:

- Local development
- Unit/integration testing
- Rapid prototyping
- CI/CD pipeline tests
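As a sketch, the development defaults can be expressed with the environment variables used elsewhere in this document (`MODEL_NAME`, `MODEL_TIMEOUT`, `ENABLE_FALLBACK`); `GOOGLE_API_KEY` and the timeout value are assumptions:

```shell
# Development defaults (illustrative -- variable names other than
# MODEL_NAME, MODEL_TIMEOUT, and ENABLE_FALLBACK are assumptions)
export MODEL_NAME=gemini-2.5-flash-002
export MODEL_TIMEOUT=60          # seconds; assumed default
export ENABLE_FALLBACK=true
export GOOGLE_API_KEY=dev-key    # placeholder; use a dev-only key
```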
### Staging (Kubernetes Staging)

Default Configuration: same model as production (`claude-sonnet-4-5-20250929`)

Why this default:

- Production Parity: Same model as production
- Quality Validation: Test with a production-grade model
- Cost Awareness: Monitor costs before production
- Behavior Validation: Ensure responses match production

Use Cases:

- Pre-production testing
- User acceptance testing (UAT)
- Performance benchmarking
- Load testing
### Production (Kubernetes Production, Helm)

Default Configuration: `claude-sonnet-4-5-20250929` (Claude Sonnet 4.5)

Why this default:

- Highest Quality: Claude Sonnet 4.5 offers superior reasoning
- Reliability: Anthropic’s enterprise SLA and uptime
- Safety: Strong Constitutional AI safety features
- Compliance: Better content moderation for production

Cost estimate:

- Input: $3.00 per 1M tokens
- Output: $15.00 per 1M tokens
- Typical request: ~1000 input tokens, ~500 output tokens
- Cost per request: ~$0.0105

Use Cases:

- Customer-facing applications
- Production workloads
- High-quality content generation
- Mission-critical tasks
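In a Helm-based production deployment these settings end up in the container environment; a minimal sketch of the resulting variables (`ANTHROPIC_API_KEY` is an assumed name, and in practice the key should come from a secrets manager, not a literal value):

```shell
# Production defaults (illustrative)
export MODEL_NAME=claude-sonnet-4-5-20250929
export ENABLE_FALLBACK=true
export ANTHROPIC_API_KEY=prod-key   # placeholder; never hardcode real keys
```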
## Overriding Default Configuration
### Method 1: Environment Variables

Override for a specific deployment by setting the model variables in the runtime environment.

### Method 2: Helm Values Override

For production Helm deployments, set the override in your values file.

### Method 3: Kustomize Patch

For Kustomize deployments, create a patch file that overrides the model environment variables.

### Method 4: .env File (Local Development)

For local development, set the override in your `.env` file.
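Methods 1 and 4 come down to setting the same variables; a minimal sketch (a Helm values override or Kustomize patch would inject the same variables into the container spec, with the exact keys depending on your chart and manifests):

```shell
# Method 1: override MODEL_NAME in the deployment environment (illustrative)
export MODEL_NAME=gemini-2.5-pro

# Method 4: the same override in a local .env file
cat > .env <<'EOF'
MODEL_NAME=gemini-2.5-pro
EOF
```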
## Fallback Configuration

### Default Fallback Chain
When `ENABLE_FALLBACK=true`, the system attempts models in this order:

1. Primary: configured `MODEL_NAME`
2. Fallback 1: `gemini-2.5-pro` (Google)
3. Fallback 2: `claude-sonnet-4-5-20250929` (Anthropic)
4. Fallback 3: `gpt-5.1` (OpenAI)
### Custom Fallback Chain

Override the chain via environment variable.

### Disabling Fallback

Set `ENABLE_FALLBACK=false` to disable the fallback chain.
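A sketch of both overrides; `FALLBACK_MODELS` is an assumed variable name for the custom chain, so check your deployment's configuration reference for the actual key:

```shell
# Custom fallback chain (FALLBACK_MODELS is an assumed name)
export ENABLE_FALLBACK=true
export FALLBACK_MODELS="gemini-2.5-pro,gpt-5.1"

# Disabling fallback entirely
export ENABLE_FALLBACK=false
```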
## Cost Comparison

### Input Tokens (per 1M tokens)
| Provider | Model | Cost | Relative |
|---|---|---|---|
| Google | gemini-2.5-flash-002 | $0.075 | 1x (baseline) |
| Google | gemini-2.5-pro | $1.25 | 17x |
| Anthropic | claude-sonnet-4-5-20250929 | $3.00 | 40x |
| OpenAI | gpt-5.1 | $1.25 | 17x |
### Output Tokens (per 1M tokens)
| Provider | Model | Cost | Relative |
|---|---|---|---|
| Google | gemini-2.5-flash-002 | $0.30 | 1x (baseline) |
| Google | gemini-2.5-pro | $5.00 | 17x |
| Anthropic | claude-sonnet-4-5-20250929 | $15.00 | 50x |
| OpenAI | gpt-5.1 | $10.00 | 33x |
### Typical Request Cost (1000 input + 500 output tokens)
| Model | Total Cost | Monthly (100K requests) |
|---|---|---|
| gemini-2.5-flash-002 | $0.00023 | $23 |
| gemini-2.5-pro | $0.00375 | $375 |
| claude-sonnet-4-5-20250929 | $0.0105 | $1,050 |
| gpt-5.1 | $0.00625 | $625 |
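The per-request figures above follow directly from the per-1M-token prices; a small calculator to reproduce them:

```python
# Per-request cost from the per-1M-token prices in the tables above.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "gemini-2.5-flash-002": (0.075, 0.30),
    "gemini-2.5-pro": (1.25, 5.00),
    "claude-sonnet-4-5-20250929": (3.00, 15.00),
    "gpt-5.1": (1.25, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# The "typical request" row: 1000 input + 500 output tokens.
print(f"{request_cost('claude-sonnet-4-5-20250929', 1000, 500):.4f}")  # 0.0105
```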
## Performance Characteristics

### Latency (p95, typical request)
| Model | Latency | Use Case |
|---|---|---|
| gemini-2.5-flash-002 | ~1.5s | Interactive applications |
| gemini-2.5-pro | ~3s | Background processing |
| claude-sonnet-4-5-20250929 | ~4s | Quality-critical tasks |
| gpt-5.1 | ~3.5s | Balanced performance |
### Quality (Subjective, 1-10 scale)
| Model | Reasoning | Creativity | Code Gen | Safety |
|---|---|---|---|---|
| gemini-2.5-flash-002 | 7 | 7 | 8 | 8 |
| gemini-2.5-pro | 9 | 8 | 9 | 9 |
| claude-sonnet-4-5-20250929 | 10 | 9 | 10 | 10 |
| gpt-5.1 | 9 | 10 | 9 | 8 |
## Model Selection Guidelines

### Choose Gemini Flash When:
- ✅ Cost is a primary concern
- ✅ Need fast response times (<2s)
- ✅ Development/testing environment
- ✅ High request volume expected
- ✅ Quality requirements are moderate
### Choose Gemini Pro When:
- ✅ Need better reasoning than Flash
- ✅ Can tolerate slightly higher cost
- ✅ Prefer Google’s ecosystem
- ✅ Need strong multilingual support
### Choose Claude Sonnet 4.5 When:
- ✅ Quality is paramount ⭐
- ✅ Complex reasoning required
- ✅ Code generation is primary use case
- ✅ Safety/content moderation critical
- ✅ Production customer-facing deployment
### Choose GPT-5.1 When:
- ✅ Need creative content generation
- ✅ Existing OpenAI integration
- ✅ Require vision capabilities
- ✅ Balance of cost and quality
### Choose Ollama (Local) When:
- ✅ Data privacy is critical (on-premise)
- ✅ No internet connectivity
- ✅ Zero API costs desired
- ✅ Can provide GPU infrastructure
- ✅ Full control over model weights
## Monitoring and Optimization

### Metrics to Track

- Cost Metrics:
  - Total API spend per day/month
  - Cost per request
  - Token usage (input/output separately)
- Performance Metrics:
  - Average response latency
  - p95/p99 latency
  - Timeout rate
  - Fallback usage rate
- Quality Metrics:
  - User satisfaction scores
  - Retry/regeneration rate
  - Error rate per model
### Cost Optimization Strategies

- Prompt Optimization:
  - Reduce unnecessary context
  - Use more concise system prompts
  - Implement prompt caching (if supported)
- Smart Routing:
  - Simple queries → Gemini Flash
  - Complex queries → Claude Sonnet
  - Code generation → Claude Sonnet or GPT-5.1
- Batching:
  - Batch non-urgent requests
  - Process during off-peak hours
  - Use asynchronous processing
- Caching:
  - Cache common responses
  - Implement semantic deduplication
  - Use Redis for response cache
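The smart-routing rule above can be sketched as a naive heuristic; the keyword list and length threshold are illustrative, not from this document:

```python
# Naive smart-routing sketch: cheap, fast model for simple queries,
# Claude Sonnet for complex or code-heavy ones.
def choose_model(prompt: str) -> str:
    code_hint = any(k in prompt.lower() for k in ("code", "function", "implement"))
    long_query = len(prompt.split()) > 200  # crude complexity proxy
    if code_hint or long_query:
        return "claude-sonnet-4-5-20250929"
    return "gemini-2.5-flash-002"
```

A real router would classify intent more robustly (e.g. with a lightweight classifier), but the cost structure is the same: send the bulk of traffic to the 1x-baseline model and reserve the 40x/50x model for requests that need it.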
## Security Considerations

### API Key Management

❌ Never:

- Commit API keys to git
- Hardcode keys in source code
- Share keys across environments
- Use the same key for dev and prod

✅ Always:

- Use environment variables or a secrets manager
- Rotate keys regularly (quarterly minimum)
- Use separate keys per environment
- Monitor for key exposure in logs
### Rate Limiting
Configure rate limits per environment.

## Troubleshooting
### Model Returns 401 Unauthorized

Cause: Invalid or expired API key.

Solution: Verify that the API key for the configured provider is set in the environment and still valid, and rotate it if it has expired.

### Fallback Chain Not Working

Cause: Missing API keys for fallback models.

Solution: Ensure ALL fallback models have API keys configured.

### High Latency / Timeouts

Cause: Model is overloaded or timeout is too short.

Solutions:

- Increase the timeout: `MODEL_TIMEOUT=120`
- Switch to a faster model: `MODEL_NAME=gemini-2.5-flash-002`
- Enable fallback: `ENABLE_FALLBACK=true`
- Implement request queuing and retry logic
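A preflight check along these lines can catch the missing-key case early; the variable names are assumptions based on the providers in the default fallback chain:

```shell
# Fail if any provider in the default fallback chain lacks its API key.
check_fallback_keys() {
  missing=0
  for var in GOOGLE_API_KEY ANTHROPIC_API_KEY OPENAI_API_KEY; do
    eval "val=\${$var}"           # indirect lookup, POSIX-compatible
    if [ -z "$val" ]; then
      echo "missing: $var"
      missing=1
    fi
  done
  return "$missing"
}
```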
## References
- LiteLLM Documentation
- Google Gemini Pricing
- Anthropic Claude Pricing
- OpenAI Pricing
- Feature Flags Documentation
- Deployment Guide
Last Updated: 2025-10-13 | Document Version: 1.0 | Maintainer: Platform Team