Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mcp-server-langgraph.mintlify.app/llms.txt

Use this file to discover all available pages before exploring further.

This document explains the LLM model configuration strategy across different deployment environments.

Overview

The codebase supports multiple LLM providers through LiteLLM, including:
  • Google (Gemini models)
  • Anthropic (Claude models)
  • OpenAI (GPT models)
  • Azure OpenAI
  • AWS Bedrock
  • Ollama (local models)
Different environments are configured with different default models based on cost, performance, and use case requirements.

Environment-Specific Defaults

Development (Docker Compose, Local)

Default Configuration:
LLM_PROVIDER=google
MODEL_NAME=gemini-2.5-flash-002
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=8192
MODEL_TIMEOUT=60
ENABLE_FALLBACK=true
Rationale:
  • Fast Iterations: Gemini Flash has low latency (~1-2s response time)
  • Low Cost: ~0.075per1Minputtokens, 0.075 per 1M input tokens, ~0.30 per 1M output tokens
  • Good Quality: Sufficient for development and testing
  • High Quota: Generous free tier and rate limits
Use Cases:
  • Local development
  • Unit/integration testing
  • Rapid prototyping
  • CI/CD pipeline tests

Staging (Kubernetes Staging)

Default Configuration:
LLM_PROVIDER=anthropic
MODEL_NAME=claude-sonnet-4-5-20250929
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=4096
MODEL_TIMEOUT=60
ENABLE_FALLBACK=true
Rationale:
  • Production Parity: Same model as production
  • Quality Validation: Test with production-grade model
  • Cost Awareness: Monitor costs before production
  • Behavior Validation: Ensure responses match production
Use Cases:
  • Pre-production testing
  • User acceptance testing (UAT)
  • Performance benchmarking
  • Load testing

Production (Kubernetes Production, Helm)

Default Configuration:
LLM_PROVIDER=anthropic
MODEL_NAME=claude-sonnet-4-5-20250929
MODEL_TEMPERATURE=0.7
MODEL_MAX_TOKENS=4096
MODEL_TIMEOUT=60
ENABLE_FALLBACK=true
Rationale:
  • Highest Quality: Claude 3.5 Sonnet offers superior reasoning
  • Reliability: Anthropic’s enterprise SLA and uptime
  • Safety: Strong Constitutional AI safety features
  • Compliance: Better content moderation for production
Cost Considerations:
  • Input: $3.00 per 1M tokens
  • Output: $15.00 per 1M tokens
  • Typical request: ~1000 input tokens, ~500 output tokens
  • Cost per request: ~$0.0105
Use Cases:
  • Customer-facing applications
  • Production workloads
  • High-quality content generation
  • Mission-critical tasks

Overriding Default Configuration

Method 1: Environment Variables

Override for a specific deployment:
## Use OpenAI GPT-5.1
export LLM_PROVIDER=openai
export MODEL_NAME=gpt-5.1
export OPENAI_API_KEY=sk-...

## Use Google Gemini Pro
export LLM_PROVIDER=google
export MODEL_NAME=gemini-2.5-pro
export GOOGLE_API_KEY=...

Method 2: Helm Values Override

For production Helm deployments:
helm upgrade --install mcp-server-langgraph ./deployments/helm/mcp-server-langgraph \
  --set config.llmProvider=openai \
  --set config.modelName=gpt-5.1 \
  --set secrets.openaiApiKey=$OPENAI_API_KEY

Method 3: Kustomize Patch

For Kustomize deployments, create a patch file:
## deployments/kustomize/overlays/custom/configmap-patch.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mcp-server-langgraph-config
data:
  llm_provider: "openai"
  model_name: "gpt-5.1"
  model_temperature: "0.8"
  model_max_tokens: "8192"

Method 4: .env File (Local Development)

## .env
LLM_PROVIDER=ollama
MODEL_NAME=llama3.1:70b
OLLAMA_BASE_URL=http://localhost:11434

Fallback Configuration

Default Fallback Chain

When ENABLE_FALLBACK=true, the system attempts models in this order:
  1. Primary: Configured MODEL_NAME
  2. Fallback 1: gemini-2.5-pro (Google)
  3. Fallback 2: claude-sonnet-4-5-20250929 (Anthropic)
  4. Fallback 3: gpt-5.1 (OpenAI)

Custom Fallback Chain

Override via environment variable:
FALLBACK_MODELS='["claude-sonnet-4-5-20250929","gpt-5.1","gemini-2.5-pro"]'

Disabling Fallback

ENABLE_FALLBACK=false
Warning: Disabling fallback means service failures if primary model is unavailable.

Cost Comparison

Input Tokens (per 1M tokens)

ProviderModelCostRelative
Googlegemini-2.5-flash-002$0.0751x (baseline)
Googlegemini-2.5-pro$1.2517x
Anthropicclaude-sonnet-4-5-20250929$3.0040x
OpenAIgpt-5.1$1.2517x

Output Tokens (per 1M tokens)

ProviderModelCostRelative
Googlegemini-2.5-flash-002$0.301x (baseline)
Googlegemini-2.5-pro$5.0017x
Anthropicclaude-sonnet-4-5-20250929$15.0050x
OpenAIgpt-5.1$10.0033x

Typical Request Cost (1000 input + 500 output tokens)

ModelTotal CostMonthly (100K requests)
gemini-2.5-flash-002$0.00023$23
gemini-2.5-pro$0.00375$375
claude-sonnet-4-5-20250929$0.0105$1,050
gpt-5.1$0.00625$625
Recommendation: Use Gemini Flash for development, Claude for production quality.

Performance Characteristics

Latency (p95, typical request)

ModelLatencyUse Case
gemini-2.5-flash-002~1.5sInteractive applications
gemini-2.5-pro~3sBackground processing
claude-sonnet-4-5-20250929~4sQuality-critical tasks
gpt-5.1~3.5sBalanced performance

Quality (Subjective, 1-10 scale)

ModelReasoningCreativityCode GenSafety
gemini-2.5-flash-0027788
gemini-2.5-pro9899
claude-sonnet-4-5-202509291091010
gpt-5.191098

Model Selection Guidelines

Choose Gemini Flash When:

  • ✅ Cost is a primary concern
  • ✅ Need fast response times (<2s)
  • ✅ Development/testing environment
  • ✅ High request volume expected
  • ✅ Quality requirements are moderate

Choose Gemini Pro When:

  • ✅ Need better reasoning than Flash
  • ✅ Can tolerate slightly higher cost
  • ✅ Prefer Google’s ecosystem
  • ✅ Need strong multilingual support

Choose Claude 3.5 Sonnet When:

  • Quality is paramount
  • ✅ Complex reasoning required
  • ✅ Code generation is primary use case
  • ✅ Safety/content moderation critical
  • ✅ Production customer-facing deployment

Choose GPT-5.1 When:

  • ✅ Need creative content generation
  • ✅ Existing OpenAI integration
  • ✅ Require vision capabilities
  • ✅ Balance of cost and quality

Choose Ollama (Local) When:

  • ✅ Data privacy is critical (on-premise)
  • ✅ No internet connectivity
  • ✅ Zero API costs desired
  • ✅ Can provide GPU infrastructure
  • ✅ Full control over model weights

Monitoring and Optimization

Metrics to Track

  1. Cost Metrics:
    • Total API spend per day/month
    • Cost per request
    • Token usage (input/output separately)
  2. Performance Metrics:
    • Average response latency
    • p95/p99 latency
    • Timeout rate
    • Fallback usage rate
  3. Quality Metrics:
    • User satisfaction scores
    • Retry/regeneration rate
    • Error rate per model

Cost Optimization Strategies

  1. Prompt Optimization:
    • Reduce unnecessary context
    • Use more concise system prompts
    • Implement prompt caching (if supported)
  2. Smart Routing:
    • Simple queries → Gemini Flash
    • Complex queries → Claude Sonnet
    • Code generation → Claude Sonnet or GPT-5.1
  3. Batching:
    • Batch non-urgent requests
    • Process during off-peak hours
    • Use asynchronous processing
  4. Caching:
    • Cache common responses
    • Implement semantic deduplication
    • Use Redis for response cache

Security Considerations

API Key Management

❌ Never:
  • Commit API keys to git
  • Hardcode in source code
  • Share keys across environments
  • Use the same key for dev and prod
✅ Always:
  • Use environment variables or secrets manager
  • Rotate keys regularly (quarterly minimum)
  • Use separate keys per environment
  • Monitor for key exposure in logs

Rate Limiting

Configure rate limits per environment:
## Development
LLM_RATE_LIMIT_RPM: 60  # 60 requests per minute

## Staging
LLM_RATE_LIMIT_RPM: 300

## Production
LLM_RATE_LIMIT_RPM: 1000

Troubleshooting

Model Returns 401 Unauthorized

Cause: Invalid or expired API key Solution:
## Verify API key is set
echo $ANTHROPIC_API_KEY | head -c 10

## Test with curl
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01"

Fallback Chain Not Working

Cause: Missing API keys for fallback models Solution: Ensure ALL fallback models have API keys configured:
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
export GOOGLE_API_KEY=...

High Latency / Timeouts

Cause: Model is overloaded or timeout too short Solutions:
  1. Increase timeout: MODEL_TIMEOUT=120
  2. Switch to faster model: MODEL_NAME=gemini-2.5-flash-002
  3. Enable fallback: ENABLE_FALLBACK=true
  4. Implement request queuing and retry logic

References


Last Updated: 2025-10-13 Document Version: 1.0 Maintainer: Platform Team