
Overview

Google Gemini provides state-of-the-art multimodal AI models through two access methods: Google AI Studio (direct API) and Vertex AI (enterprise platform). This guide covers both approaches with the MCP Server.
Gemini 3.0 Pro (Nov 2025) is the latest model with a 1M token context window and advanced reasoning. For production workloads, Gemini 2.5 Flash and Gemini 2.5 Pro are stable, production-grade models.
New: Anthropic Claude models are also available via Vertex AI! See the Vertex AI Setup Guide for unified access to both Claude and Gemini models.

Available Models

The following models are available; the Gemini 2.5 models are production-grade and suitable for enterprise use, while Gemini 3 Pro is still in preview:
Model                | Context Window | Use Case                    | Pricing (per 1M tokens)     | Status
gemini-3-pro-preview | 1M tokens      | Latest, advanced reasoning  | $2 input / $12 output       | 🆕 Preview (Nov 2025)
gemini-2.5-flash     | 1M tokens      | Fast responses, chat        | $0.30 input / $2.50 output  | ✅ Production
gemini-2.5-pro       | 2M tokens      | Complex reasoning           | $1.25 input / $10.00 output | ✅ Production
Vertex AI models (use vertex_ai/ prefix):
  • vertex_ai/gemini-3-pro-preview - Latest Gemini 3.0 Pro via Vertex AI
  • vertex_ai/gemini-2.5-flash - Gemini 2.5 Flash via Vertex AI
  • vertex_ai/gemini-2.5-pro - Gemini 2.5 Pro via Vertex AI
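
For example, an environment-based Vertex AI configuration might look like this (a sketch; the variable names follow the Kubernetes deployment shown later in this guide):
# .env (Vertex AI, sketch)
LLM_PROVIDER=vertex_ai
LLM_MODEL_NAME=vertex_ai/gemini-2.5-flash
VERTEX_AI_PROJECT=your-project-id
VERTEX_AI_LOCATION=us-central1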

Quick Start

This quick start uses Google AI Studio, which is best for development, testing, and small projects. For the Vertex AI path, see the Vertex AI Setup Guide.

Step 1: Get API Key

  1. Go to https://aistudio.google.com/app/apikey
  2. Click “Create API Key”
  3. Copy the key (starts with AIza...)

Step 2: Configure

# .env
LLM_PROVIDER=google
GOOGLE_API_KEY=AIzaSy...your-key
LLM_MODEL_NAME=gemini-2.5-flash
Or use Infisical:
# Add to Infisical
GOOGLE_API_KEY=AIzaSy...your-key

# In .env, reference Infisical
INFISICAL_PROJECT_ID=your-project-id

Step 3: Test

from mcp_server_langgraph.llm.factory import LLMFactory

llm = LLMFactory(
    provider="google",
    model_name="gemini-2.5-flash"
)

response = await llm.ainvoke("What is the capital of France?")
print(response.content)
# Output: "The capital of France is Paris."
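
ainvoke is a coroutine, so to run this snippet as a standalone script, wrap it in an async entry point. A minimal sketch:
import asyncio

from mcp_server_langgraph.llm.factory import LLMFactory

async def main() -> None:
    llm = LLMFactory(provider="google", model_name="gemini-2.5-flash")
    response = await llm.ainvoke("What is the capital of France?")
    print(response.content)

asyncio.run(main())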

Configuration Options

Basic Configuration

from mcp_server_langgraph.llm.factory import LLMFactory

## Google AI Studio
llm = LLMFactory(
    provider="google",
    model_name="gemini-2.5-flash",
    temperature=0.7,
    max_tokens=8192,
    top_p=0.95,
    top_k=40
)

## Vertex AI
llm = LLMFactory(
    provider="vertex_ai",
    model_name="gemini-2.5-flash",
    project="your-project-id",
    location="us-central1",
    temperature=0.7
)

Advanced Configuration

## Multimodal with vision
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=settings.google_api_key,
    temperature=0.7,
    max_output_tokens=8192,

    # Safety settings
    safety_settings={
        "HARM_CATEGORY_HARASSMENT": "BLOCK_NONE",
        "HARM_CATEGORY_HATE_SPEECH": "BLOCK_MEDIUM_AND_ABOVE",
        "HARM_CATEGORY_SEXUALLY_EXPLICIT": "BLOCK_MEDIUM_AND_ABOVE",
        "HARM_CATEGORY_DANGEROUS_CONTENT": "BLOCK_MEDIUM_AND_ABOVE",
    },

    # Generation config
    generation_config={
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 40,
        "candidate_count": 1,
        "max_output_tokens": 8192,
    }
)

Features

Multimodal Capabilities

Gemini models accept images, video, and audio alongside text. The example below sends an image together with a text prompt:
import base64
from langchain_core.messages import HumanMessage

# Load image
with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

# Create multimodal message
message = HumanMessage(
    content=[
        {"type": "text", "text": "What's in this image?"},
        {
            "type": "image_url",
            "image_url": f"data:image/png;base64,{image_data}"
        }
    ]
)

response = await llm.ainvoke([message])
print(response.content)

Function Calling

from langchain_core.tools import tool

@tool
def get_weather(location: str) -> str:
    """Get current weather for a location"""
    # Implementation
    return f"Weather in {location}: 72°F, Sunny"

@tool
def search_web(query: str) -> str:
    """Search the web for information"""
    # Implementation
    return f"Search results for: {query}"

## Bind tools to LLM
llm_with_tools = llm.bind_tools([get_weather, search_web])

## Use with agent
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    llm_with_tools,
    tools=[get_weather, search_web]
)

## Run agent
response = await agent.ainvoke({
    "messages": [("user", "What's the weather in Paris?")]
})
print(response["messages"][-1].content)
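
To check which tool the model picked without running the full agent loop, inspect the tool calls on the raw model response (a short sketch using the tools bound above):
## No tools are executed here; this only shows what the model requested
ai_message = await llm_with_tools.ainvoke("What's the weather in Paris?")
for call in ai_message.tool_calls:
    print(call["name"], call["args"])  # e.g. get_weather {'location': 'Paris'}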

Structured Output

from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str = Field(description="Person's full name")
    age: int = Field(description="Person's age in years")
    occupation: str = Field(description="Person's job title")

## Force structured output
structured_llm = llm.with_structured_output(Person)

response = await structured_llm.ainvoke(
    "Extract info: John Smith is a 35-year-old software engineer"
)

print(response.model_dump())
## Output: {'name': 'John Smith', 'age': 35, 'occupation': 'software engineer'}

Production Deployment

Kubernetes with Vertex AI

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server-langgraph
spec:
  template:
    spec:
      containers:
      - name: agent
        image: mcp-server-langgraph:latest
        env:
        - name: LLM_PROVIDER
          value: "vertex_ai"
        - name: LLM_MODEL_NAME
          value: "gemini-2.5-flash"
        - name: VERTEX_AI_PROJECT
          valueFrom:
            secretKeyRef:
              name: gcp-credentials
              key: project_id
        - name: VERTEX_AI_LOCATION
          value: "us-central1"
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: "/var/secrets/google/key.json"

        volumeMounts:
        - name: gcp-credentials
          mountPath: /var/secrets/google
          readOnly: true

      volumes:
      - name: gcp-credentials
        secret:
          secretName: gcp-service-account
Create secret:
kubectl create secret generic gcp-service-account \
  --from-file=key.json=vertex-key.json \
  --from-literal=project_id=YOUR_PROJECT_ID

Workload Identity (GKE)

## Enable Workload Identity on cluster
gcloud container clusters update CLUSTER_NAME \
  --workload-pool=PROJECT_ID.svc.id.goog

## Create Kubernetes service account
kubectl create serviceaccount mcp-server-langgraph

## Bind to GCP service account
gcloud iam service-accounts add-iam-policy-binding \
  vertex-ai-agent@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[mcp-server-langgraph/mcp-server-langgraph]"

## Annotate K8s service account
kubectl annotate serviceaccount mcp-server-langgraph \
  iam.gke.io/gcp-service-account=vertex-ai-agent@PROJECT_ID.iam.gserviceaccount.com

## Use in deployment
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      serviceAccountName: mcp-server-langgraph
      containers:
      - name: agent
        env:
        - name: LLM_PROVIDER
          value: "vertex_ai"

Performance Optimization

Caching

## Enable prompt caching (Gemini 2.5 Pro only)
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-pro",
    google_api_key=settings.google_api_key,
    cache_content=True  # Enable caching
)

## First call - full cost
response1 = await llm.ainvoke("Analyze this long document...")

## Second call with similar prompt - cached, cheaper
response2 = await llm.ainvoke("What are the key points in the document?")

Batching

## Process multiple requests in parallel
import asyncio

queries = [
    "What is machine learning?",
    "Explain neural networks",
    "Define deep learning"
]

## Concurrent processing
responses = await asyncio.gather(*[
    llm.ainvoke(query) for query in queries
])

for query, response in zip(queries, responses):
    print(f"Q: {query}\nA: {response.content}\n")

Streaming

## Stream responses for better UX
async def stream_response(query: str):
    async for chunk in llm.astream(query):
        yield chunk.content
        # Send to client incrementally

## Use in FastAPI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class MessageRequest(BaseModel):
    query: str

@app.post("/message/stream")
async def stream_message(request: MessageRequest):
    return StreamingResponse(
        stream_response(request.query),
        media_type="text/event-stream"
    )

Cost Optimization

Model Selection

## Use cheaper model for simple tasks
def select_model(task_complexity: str) -> str:
    if task_complexity == "simple":
        return "gemini-2.5-flash"  # $0.30 per 1M input tokens
    elif task_complexity == "complex":
        return "gemini-2.5-pro"  # $1.25 per 1M input tokens
    else:
        return "gemini-2.5-flash"  # default to the cheaper model

## Dynamic selection
llm = LLMFactory(
    provider="google",
    model_name=select_model(analyze_complexity(query))
)
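
analyze_complexity is not defined in this guide; a purely illustrative placeholder could classify queries by prompt length:
## Hypothetical helper: treat long prompts as complex, everything else as simple
def analyze_complexity(query: str) -> str:
    return "complex" if len(query) > 500 else "simple"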

Token Usage Tracking

## Track tokens via the usage metadata attached to each response
response = await llm.ainvoke(query)

usage = response.usage_metadata  # token counts reported by the Gemini integration
print(f"Prompt tokens: {usage['input_tokens']}")
print(f"Completion tokens: {usage['output_tokens']}")
print(f"Total tokens: {usage['total_tokens']}")

## Set budget limits
MAX_TOKENS_PER_REQUEST = 10000

if usage["total_tokens"] > MAX_TOKENS_PER_REQUEST:
    raise ValueError("Token limit exceeded")
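
To turn token counts into a rough dollar estimate, multiply by the per-1M-token rates from the pricing table above. A sketch using the gemini-2.5-flash rates:
## Rough cost estimate (gemini-2.5-flash rates from the pricing table)
INPUT_PRICE_PER_M = 0.30   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 2.50  # USD per 1M output tokens

cost = (usage["input_tokens"] * INPUT_PRICE_PER_M
        + usage["output_tokens"] * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"Estimated cost: ${cost:.6f}")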

Quota Management

## Check quota usage (Vertex AI)
gcloud alpha monitoring time-series list \
  --filter='metric.type="aiplatform.googleapis.com/quota/online_prediction_requests/usage"'

## Set quota alerts
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="Vertex AI Quota Alert" \
  --condition-display-name="80% quota usage" \
  --condition-threshold-value=0.8 \
  --condition-threshold-duration=300s

Safety & Content Filtering

from langchain_google_genai import HarmBlockThreshold, HarmCategory

## Configure safety settings
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    }
)

## Handle safety blocks
try:
    response = await llm.ainvoke(query)
except Exception as e:
    if "SAFETY" in str(e):
        logger.warning(f"Content blocked by safety filters: {query}")
        return "I cannot provide a response to that query due to safety concerns."
    raise

Monitoring

LangSmith Integration

## Track Gemini usage in LangSmith
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "langgraph-gemini"

## Calls automatically tracked
response = await llm.ainvoke(query)
## View in https://smith.langchain.com

Custom Metrics

from prometheus_client import Counter, Histogram

## Define metrics
gemini_requests = Counter(
    'gemini_requests_total',
    'Total Gemini API requests',
    ['model', 'status']
)

gemini_latency = Histogram(
    'gemini_request_duration_seconds',
    'Gemini request latency',
    ['model']
)

## Track usage
import time

@gemini_latency.labels(model="gemini-2.5-flash").time()
async def call_gemini(query: str):
    try:
        response = await llm.ainvoke(query)
        gemini_requests.labels(model="gemini-2.5-flash", status="success").inc()
        return response
    except Exception as e:
        gemini_requests.labels(model="gemini-2.5-flash", status="error").inc()
        raise

Troubleshooting

Error: google.api_core.exceptions.Unauthenticated: 401 API key not valid

Solutions:
# Verify API key
echo $GOOGLE_API_KEY | head -c 10
# Should start with: AIza

# Test key
curl "https://generativelanguage.googleapis.com/v1beta/models?key=$GOOGLE_API_KEY"

# Regenerate key
# 1. Go to https://aistudio.google.com/app/apikey
# 2. Delete old key
# 3. Create new key
# 4. Update in Infisical/config
Error: google.auth.exceptions.DefaultCredentialsError

Solutions:
# Check service account key
cat vertex-key.json | jq .

# Test authentication
gcloud auth activate-service-account --key-file=vertex-key.json
gcloud auth list

# Verify permissions
gcloud projects get-iam-policy YOUR_PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:vertex-ai-agent@YOUR_PROJECT_ID.iam.gserviceaccount.com"

# Should include: roles/aiplatform.user
Error: 429 Resource has been exhausted

Solutions:
# Implement exponential backoff
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def call_gemini_with_retry(query: str):
    return await llm.ainvoke(query)

# Enable fallback to another model
llm = LLMFactory(
    provider="google",
    model_name="gemini-2.5-flash",
    enable_fallback=True,
    fallback_models=["claude-sonnet-4-5-20250929"]
)
Error: Response contains safety block

Solutions:
# Adjust safety settings (use with caution)
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
        # More permissive
    }
)

# Or rephrase query
query_rephrased = rephrase_for_safety(original_query)
response = await llm.ainvoke(query_rephrased)

Next Steps


Google Gemini Ready: Leverage Google’s most advanced AI models in your MCP Server!