
Overview

Google Gemini provides state-of-the-art multimodal AI models through two access methods: Google AI Studio (direct API) and Vertex AI (enterprise platform). This guide covers both approaches with the MCP Server.
Gemini 3.0 Pro (Nov 2025) is the latest model with a 1M token context window and advanced reasoning. For production workloads, Gemini 2.5 Flash and Gemini 2.5 Pro are stable, production-grade models.
New: Anthropic Claude models are also available via Vertex AI! See the Vertex AI Setup Guide for unified access to both Claude and Gemini models.

Available Models

The following models are available; the Gemini 2.5 models are production-grade and suitable for enterprise use, while Gemini 3 Pro is still in preview:
Model                | Context Window | Use Case                    | Pricing (per 1M tokens)     | Status
gemini-3-pro-preview | 1M tokens      | Latest, advanced reasoning  | $2 input / $12 output       | 🆕 Preview (Nov 2025)
gemini-2.5-flash     | 1M tokens      | Fast responses, chat        | $0.30 input / $2.50 output  | ✅ Production
gemini-2.5-pro       | 2M tokens      | Complex reasoning           | $1.25 input / $10.00 output | ✅ Production
Vertex AI models (use vertex_ai/ prefix):
  • vertex_ai/gemini-3-pro-preview - Latest Gemini 3.0 Pro via Vertex AI
  • vertex_ai/gemini-2.5-flash - Gemini 2.5 Flash via Vertex AI
  • vertex_ai/gemini-2.5-pro - Gemini 2.5 Pro via Vertex AI
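
For example, an environment-based Vertex AI configuration might look like this (a sketch; the variable names follow the Kubernetes deployment shown later in this guide):
# .env (Vertex AI, sketch)
LLM_PROVIDER=vertex_ai
LLM_MODEL_NAME=vertex_ai/gemini-2.5-flash
VERTEX_AI_PROJECT=your-project-id
VERTEX_AI_LOCATION=us-central1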

Quick Start

This quick start uses Google AI Studio, which is best for development, testing, and small projects. For the Vertex AI path, see the Vertex AI Setup Guide.

Step 1: Get API Key

  1. Go to https://aistudio.google.com/app/apikey
  2. Click “Create API Key”
  3. Copy the key (starts with AIza...)

Step 2: Configure

# .env
LLM_PROVIDER=google
GOOGLE_API_KEY=AIzaSy...your-key
LLM_MODEL_NAME=gemini-2.5-flash
Or use Infisical:
# Add to Infisical
GOOGLE_API_KEY=AIzaSy...your-key

# In .env, reference Infisical
INFISICAL_PROJECT_ID=your-project-id

Step 3: Test

from mcp_server_langgraph.llm.factory import LLMFactory

llm = LLMFactory(
    provider="google",
    model_name="gemini-2.5-flash"
)

response = await llm.ainvoke("What is the capital of France?")
print(response.content)
# Output: "The capital of France is Paris."
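
ainvoke is a coroutine, so to run this snippet as a standalone script, wrap it in an async entry point. A minimal sketch:
import asyncio

from mcp_server_langgraph.llm.factory import LLMFactory

async def main() -> None:
    llm = LLMFactory(provider="google", model_name="gemini-2.5-flash")
    response = await llm.ainvoke("What is the capital of France?")
    print(response.content)

asyncio.run(main())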

Configuration Options

Basic Configuration

from mcp_server_langgraph.llm.factory import LLMFactory

## Google AI Studio
llm = LLMFactory(
    provider="google",
    model_name="gemini-2.5-flash",
    temperature=0.7,
    max_tokens=8192,
    top_p=0.95,
    top_k=40
)

## Vertex AI
llm = LLMFactory(
    provider="vertex_ai",
    model_name="gemini-2.5-flash",
    project="your-project-id",
    location="us-central1",
    temperature=0.7
)

Advanced Configuration

## Multimodal with vision
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=settings.google_api_key,
    temperature=0.7,
    max_output_tokens=8192,

    # Safety settings
    safety_settings={
        "HARM_CATEGORY_HARASSMENT": "BLOCK_NONE",
        "HARM_CATEGORY_HATE_SPEECH": "BLOCK_MEDIUM_AND_ABOVE",
        "HARM_CATEGORY_SEXUALLY_EXPLICIT": "BLOCK_MEDIUM_AND_ABOVE",
        "HARM_CATEGORY_DANGEROUS_CONTENT": "BLOCK_MEDIUM_AND_ABOVE",
    },

    # Generation config
    generation_config={
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 40,
        "candidate_count": 1,
        "max_output_tokens": 8192,
    }
)

Features

Multimodal Capabilities

Gemini models accept images, video, and audio alongside text. The example below sends an image together with a text prompt:
import base64
from langchain_core.messages import HumanMessage

# Load image
with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

# Create multimodal message
message = HumanMessage(
    content=[
        {"type": "text", "text": "What's in this image?"},
        {
            "type": "image_url",
            "image_url": f"data:image/png;base64,{image_data}"
        }
    ]
)

response = await llm.ainvoke([message])
print(response.content)

Function Calling

from langchain_core.tools import tool

@tool
def get_weather(location: str) -> str:
    """Get current weather for a location"""
    # Implementation
    return f"Weather in {location}: 72°F, Sunny"

@tool
def search_web(query: str) -> str:
    """Search the web for information"""
    # Implementation
    return f"Search results for: {query}"

## Bind tools to LLM
llm_with_tools = llm.bind_tools([get_weather, search_web])

## Use with agent
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    llm_with_tools,
    tools=[get_weather, search_web]
)

## Run agent
response = await agent.ainvoke({
    "messages": [("user", "What's the weather in Paris?")]
})
print(response["messages"][-1].content)
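
To check which tool the model picked without running the full agent loop, inspect the tool calls on the raw model response (a short sketch using the tools bound above):
## No tools are executed here; this only shows what the model requested
ai_message = await llm_with_tools.ainvoke("What's the weather in Paris?")
for call in ai_message.tool_calls:
    print(call["name"], call["args"])  # e.g. get_weather {'location': 'Paris'}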

Structured Output

from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str = Field(description="Person's full name")
    age: int = Field(description="Person's age in years")
    occupation: str = Field(description="Person's job title")

## Force structured output
structured_llm = llm.with_structured_output(Person)

response = await structured_llm.ainvoke(
    "Extract info: John Smith is a 35-year-old software engineer"
)

print(response.model_dump())
## Output: {'name': 'John Smith', 'age': 35, 'occupation': 'software engineer'}

Production Deployment

Kubernetes with Vertex AI

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server-langgraph
spec:
  template:
    spec:
      containers:
      - name: agent
        image: mcp-server-langgraph:latest
        env:
        - name: LLM_PROVIDER
          value: "vertex_ai"
        - name: LLM_MODEL_NAME
          value: "gemini-2.5-flash"
        - name: VERTEX_AI_PROJECT
          valueFrom:
            secretKeyRef:
              name: gcp-credentials
              key: project_id
        - name: VERTEX_AI_LOCATION
          value: "us-central1"
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: "/var/secrets/google/key.json"

        volumeMounts:
        - name: gcp-credentials
          mountPath: /var/secrets/google
          readOnly: true

      volumes:
      - name: gcp-credentials
        secret:
          secretName: gcp-service-account
Create secret:
kubectl create secret generic gcp-service-account \
  --from-file=key.json=vertex-key.json \
  --from-literal=project_id=YOUR_PROJECT_ID

Workload Identity (GKE)

## Enable Workload Identity on cluster
gcloud container clusters update CLUSTER_NAME \
  --workload-pool=PROJECT_ID.svc.id.goog

## Create Kubernetes service account
kubectl create serviceaccount mcp-server-langgraph

## Bind to GCP service account
gcloud iam service-accounts add-iam-policy-binding \
  vertex-ai-agent@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[mcp-server-langgraph/mcp-server-langgraph]"

## Annotate K8s service account
kubectl annotate serviceaccount mcp-server-langgraph \
  iam.gke.io/gcp-service-account=vertex-ai-agent@PROJECT_ID.iam.gserviceaccount.com

## Use in deployment
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      serviceAccountName: mcp-server-langgraph
      containers:
      - name: agent
        env:
        - name: LLM_PROVIDER
          value: "vertex_ai"

Performance Optimization

Caching

## Enable prompt caching (Gemini 2.5 Pro only)
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-pro",
    google_api_key=settings.google_api_key,
    cache_content=True  # Enable caching
)

## First call - full cost
response1 = await llm.ainvoke("Analyze this long document...")

## Second call with similar prompt - cached, cheaper
response2 = await llm.ainvoke("What are the key points in the document?")

Batching

## Process multiple requests in parallel
import asyncio

queries = [
    "What is machine learning?",
    "Explain neural networks",
    "Define deep learning"
]

## Concurrent processing
responses = await asyncio.gather(*[
    llm.ainvoke(query) for query in queries
])

for query, response in zip(queries, responses):
    print(f"Q: {query}\nA: {response.content}\n")

Streaming

## Stream responses for better UX
async def stream_response(query: str):
    async for chunk in llm.astream(query):
        yield chunk.content
        # Send to client incrementally

## Use in FastAPI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class MessageRequest(BaseModel):
    query: str

@app.post("/message/stream")
async def stream_message(request: MessageRequest):
    return StreamingResponse(
        stream_response(request.query),
        media_type="text/event-stream"
    )

Cost Optimization

Model Selection

## Use cheaper model for simple tasks
def select_model(task_complexity: str) -> str:
    if task_complexity == "simple":
        return "gemini-2.5-flash"  # $0.30 per 1M input tokens
    elif task_complexity == "complex":
        return "gemini-2.5-pro"  # $1.25 per 1M input tokens
    else:
        return "gemini-2.5-flash"  # default to the cheaper model

## Dynamic selection
llm = LLMFactory(
    provider="google",
    model_name=select_model(analyze_complexity(query))
)
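
analyze_complexity is not defined in this guide; a purely illustrative placeholder could classify queries by prompt length:
## Hypothetical helper: treat long prompts as complex, everything else as simple
def analyze_complexity(query: str) -> str:
    return "complex" if len(query) > 500 else "simple"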

Token Usage Tracking

## Track tokens via the usage metadata attached to each response
response = await llm.ainvoke(query)

usage = response.usage_metadata  # token counts reported by the Gemini integration
print(f"Prompt tokens: {usage['input_tokens']}")
print(f"Completion tokens: {usage['output_tokens']}")
print(f"Total tokens: {usage['total_tokens']}")

## Set budget limits
MAX_TOKENS_PER_REQUEST = 10000

if usage["total_tokens"] > MAX_TOKENS_PER_REQUEST:
    raise ValueError("Token limit exceeded")
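
To turn token counts into a rough dollar estimate, multiply by the per-1M-token rates from the pricing table above. A sketch using the gemini-2.5-flash rates:
## Rough cost estimate (gemini-2.5-flash rates from the pricing table)
INPUT_PRICE_PER_M = 0.30   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 2.50  # USD per 1M output tokens

cost = (usage["input_tokens"] * INPUT_PRICE_PER_M
        + usage["output_tokens"] * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"Estimated cost: ${cost:.6f}")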

Quota Management

## Check quota usage (Vertex AI)
gcloud alpha monitoring time-series list \
  --filter='metric.type="aiplatform.googleapis.com/quota/online_prediction_requests/usage"'

## Set quota alerts
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="Vertex AI Quota Alert" \
  --condition-display-name="80% quota usage" \
  --condition-threshold-value=0.8 \
  --condition-threshold-duration=300s

Safety & Content Filtering

from langchain_google_genai import HarmBlockThreshold, HarmCategory

## Configure safety settings
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    }
)

## Handle safety blocks
try:
    response = await llm.ainvoke(query)
except Exception as e:
    if "SAFETY" in str(e):
        logger.warning(f"Content blocked by safety filters: {query}")
        return "I cannot provide a response to that query due to safety concerns."
    raise

Monitoring

LangSmith Integration

## Track Gemini usage in LangSmith
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "langgraph-gemini"

## Calls automatically tracked
response = await llm.ainvoke(query)
## View in https://smith.langchain.com

Custom Metrics

from prometheus_client import Counter, Histogram

## Define metrics
gemini_requests = Counter(
    'gemini_requests_total',
    'Total Gemini API requests',
    ['model', 'status']
)

gemini_latency = Histogram(
    'gemini_request_duration_seconds',
    'Gemini request latency',
    ['model']
)

## Track usage
import time

@gemini_latency.labels(model="gemini-2.5-flash").time()
async def call_gemini(query: str):
    try:
        response = await llm.ainvoke(query)
        gemini_requests.labels(model="gemini-2.5-flash", status="success").inc()
        return response
    except Exception as e:
        gemini_requests.labels(model="gemini-2.5-flash", status="error").inc()
        raise

Troubleshooting

Error: google.api_core.exceptions.Unauthenticated: 401 API key not valid

Solutions:
# Verify API key
echo $GOOGLE_API_KEY | head -c 10
# Should start with: AIza

# Test key
curl "https://generativelanguage.googleapis.com/v1beta/models?key=$GOOGLE_API_KEY"

# Regenerate key
# 1. Go to https://aistudio.google.com/app/apikey
# 2. Delete old key
# 3. Create new key
# 4. Update in Infisical/config
Error: google.auth.exceptions.DefaultCredentialsError

Solutions:
# Check service account key
cat vertex-key.json | jq .

# Test authentication
gcloud auth activate-service-account --key-file=vertex-key.json
gcloud auth list

# Verify permissions
gcloud projects get-iam-policy YOUR_PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:vertex-ai-agent@YOUR_PROJECT_ID.iam.gserviceaccount.com"

# Should include: roles/aiplatform.user
Error: 429 Resource has been exhausted

Solutions:
# Implement exponential backoff
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def call_gemini_with_retry(query: str):
    return await llm.ainvoke(query)

# Enable fallback to another model
llm = LLMFactory(
    provider="google",
    model_name="gemini-2.5-flash",
    enable_fallback=True,
    fallback_models=["claude-sonnet-4-5-20250929"]
)
Error: Response contains safety block

Solutions:
# Adjust safety settings (use with caution)
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
        # More permissive
    }
)

# Or rephrase query
query_rephrased = rephrase_for_safety(original_query)
response = await llm.ainvoke(query_rephrased)

Next Steps


Google Gemini Ready: Leverage Google’s most advanced AI models in your MCP Server!