
Overview

LangGraph Platform deployments automatically integrate with LangSmith for comprehensive observability. Every request is traced with full LLM details.
Built-in Tracing: No configuration needed; all deployments are traced in LangSmith automatically.

Viewing Traces

1. Access LangSmith

Open your LangSmith workspace at smith.langchain.com.

2. Select Project

Choose your project (e.g., “mcp-server-langgraph”).

3. View Traces

See all requests with:
  • Full prompts and completions
  • Token usage and costs
  • Latency breakdown
  • Error details
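
If you prefer to work outside the UI, the same traces can be pulled with the LangSmith SDK. A minimal sketch, assuming LANGSMITH_API_KEY is set in the environment and reusing the example project name from above; the exact Run fields available may vary slightly by SDK version:

from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# List the most recent top-level runs (one per request) in the project
for run in client.list_runs(
    project_name="mcp-server-langgraph",  # example project name from above
    is_root=True,
    limit=10,
):
    print(run.name, run.start_time, run.error)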

What’s Captured

Every trace includes:

LLM Calls

  • Full prompts sent to LLM
  • Complete model responses
  • Token counts (input/output)
  • Model parameters
  • Latency per call

Agent Steps

  • Routing decisions
  • Tool invocations
  • State transitions
  • Conditional flows
  • Execution order

Metadata

  • User ID and session ID
  • Request timestamp
  • Environment (prod/staging)
  • Custom tags
  • Deployment version

Errors

  • Full stack traces
  • Input that caused error
  • Error context
  • Failure timing
  • Retry attempts

Metrics Dashboard

View key metrics in LangSmith:

Request Volume

  • Total invocations over time
  • Requests per second
  • Peak traffic periods

Latency

  • P50 Latency: Median response time
  • P95 Latency: 95th percentile
  • P99 Latency: 99th percentile
  • Max Latency: Slowest requests

Success Rate

  • Successful requests (200 OK)
  • Failed requests (4xx, 5xx)
  • Error rate percentage
  • Error types breakdown

Token Usage

  • Total tokens consumed
  • Input vs output tokens
  • Tokens per request
  • Token usage trends

Cost Tracking

  • Estimated costs by model
  • Cost per user/session
  • Daily/monthly spend
  • Cost breakdown by feature
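
The dashboard covers these out of the box, but the same numbers can be recomputed from run data if you want them in your own reporting. A rough sketch, assuming the Run objects returned by list_runs expose start_time, end_time, error, and total_tokens (verify the field names against your langsmith SDK version):

from datetime import datetime, timedelta, timezone
from langsmith import Client

client = Client()
since = datetime.now(timezone.utc) - timedelta(days=1)

# Pull the last 24 hours of top-level runs
runs = list(client.list_runs(
    project_name="mcp-server-langgraph",  # example project name
    is_root=True,
    start_time=since,
))

latencies = sorted(
    (r.end_time - r.start_time).total_seconds()
    for r in runs if r.start_time and r.end_time
)
errors = sum(1 for r in runs if r.error)
tokens = sum((r.total_tokens or 0) for r in runs)

if runs and latencies:
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"requests={len(runs)} error_rate={errors / len(runs):.1%} "
          f"p95={p95:.2f}s tokens={tokens}")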

Filtering Traces

By Status

status:error
status:success

By Latency

latency > 5s
latency < 1s

By User

metadata.user_id:"alice@example.com"

By Tags

tags:"production"
tags:"high-priority"

By Date

timestamp > 2025-10-01
timestamp < 2025-10-10

Debugging Workflow

1. Find Failing Traces

Filter by status:error and sort by timestamp descending. (A scripted version of this step is sketched after the workflow.)

2. Analyze Error

Click on a trace to see:
  • The exact input that caused the failure
  • The full Python stack trace
  • All steps before the error
  • Timing information

3. Compare with Success

Find a similar successful trace and compare the two side-by-side.

4. Fix and Redeploy

Fix the issue in code and redeploy:

langgraph deploy

5. Verify Fix

Monitor new traces to confirm the error is resolved.
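
A scripted version of step 1, useful when triaging many failures at once. This sketch assumes the error=True filter on list_runs returns only failed runs; run.inputs and run.error then give you the material for steps 2 and 3:

from langsmith import Client

client = Client()

# Recent failed top-level runs for triage
failed = client.list_runs(
    project_name="mcp-server-langgraph",  # example project name
    is_root=True,
    error=True,
    limit=5,
)

for run in failed:
    print("run:", run.id, run.start_time)
    print("  input:", run.inputs)   # the exact input that caused the failure
    print("  error:", run.error)    # error message / stack trace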

Performance Optimization

Identify Slow Traces

  1. Filter: latency > 5s
  2. Sort by latency descending
  3. Expand trace to see timing breakdown
  4. Identify bottlenecks:
    • Slow LLM calls → Try faster model
    • Slow tool calls → Add caching
    • Redundant calls → Optimize logic

Example Optimization

Before: 8.5s total latency
  • LLM call 1: 3.2s
  • Tool call: 2.1s
  • LLM call 2: 3.2s
Optimization: Cache the tool call and pass a smaller context to the second LLM call (a caching sketch follows below).

After: 4.5s total latency
  • LLM call 1: 3.2s
  • Tool call (cached): 0.1s
  • LLM call 2: 1.2s (smaller context)
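
One way to get the cached tool call shown above is simple in-process memoization. A minimal sketch, assuming the tool is a plain Python function with hashable arguments; lookup_account is a hypothetical helper used only for illustration, and a shared cache such as Redis would be the next step for multi-instance deployments:

from functools import lru_cache

@lru_cache(maxsize=1024)
def lookup_account(account_id: str) -> dict:
    """Hypothetical slow tool; results are memoized per account_id."""
    # ... expensive API or database lookup goes here ...
    return {"account_id": account_id, "tier": "premium"}

lookup_account("acct_123")   # first call pays the full cost
lookup_account("acct_123")   # repeat call is served from the in-process cache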

Alerts

Set up alerts in LangSmith:
1. Go to Project Settings

Navigate to Settings → Alerts.

2. Create Alert Rule

Configure alert conditions:
  • High Error Rate: Error rate > 5%
  • High Latency: P95 > 5 seconds
  • Budget Exceeded: Daily cost > $50

3. Configure Notifications

Choose notification channels:
  • Email
  • Slack
  • Webhook
  • PagerDuty

Custom Metadata

Add custom metadata to traces for better filtering:
from langchain_core.runnables import RunnableConfig

config = RunnableConfig(
    tags=["premium-user", "high-priority"],
    metadata={
        "user_id": "alice@example.com",
        "session_id": "sess_abc123",
        "feature": "analysis",
        "cost_center": "sales"
    }
)

result = await graph.ainvoke(input, config=config)
Now filter in LangSmith:
  • tags:"premium-user"
  • metadata.cost_center:"sales"
  • metadata.feature:"analysis"

Datasets & Evaluation

Create Dataset from Production

1. Filter Successful Traces

Filter: status:success AND tags:"production"

2. Select Examples

Choose representative traces (varied inputs/outputs).

3. Add to Dataset

Click “Add to Dataset” → Name it “prod-examples-oct-2025”. (A scripted alternative is sketched below.)
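
A scripted alternative, assuming the runs you pull already contain the inputs/outputs you want to keep as examples; it reuses the example project and dataset names from above:

from langsmith import Client

client = Client()

dataset = client.create_dataset(
    dataset_name="prod-examples-oct-2025",
    description="Representative production traces, October 2025",
)

# Copy a handful of successful production runs into the dataset
for run in client.list_runs(
    project_name="mcp-server-langgraph",  # example project name
    is_root=True,
    error=False,
    limit=20,
):
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,
        dataset_id=dataset.id,
    )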

Run Evaluation

Compare model performance:
from langsmith import Client

client = Client()

# Test on production dataset
results = client.run_on_dataset(
    dataset_name="prod-examples-oct-2025",
    llm_or_chain_factory=lambda: graph,
    project_name="eval-claude-vs-gpt4"
)
View results in LangSmith to compare:
  • Latency
  • Token usage
  • Cost
  • Quality (with custom evaluators; a sketch of one follows below)
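
Custom evaluators can be plain functions that score each example. A sketch using the evaluate helper from the langsmith package (available in recent SDK versions; verify the import for yours). The must_mention check is purely illustrative, and graph is assumed to be your compiled LangGraph graph:

from langsmith import evaluate

def must_mention(run, example) -> dict:
    """Illustrative evaluator: did the output mention the expected keyword?"""
    expected = (example.outputs or {}).get("keyword", "")
    actual = str(run.outputs or {})
    score = int(bool(expected) and expected.lower() in actual.lower())
    return {"key": "must_mention", "score": score}

results = evaluate(
    lambda inputs: graph.invoke(inputs),   # target under test
    data="prod-examples-oct-2025",         # dataset created above
    evaluators=[must_mention],
    experiment_prefix="eval-claude-vs-gpt4",
)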

Viewing Logs

Via CLI

# Stream logs in real-time
langgraph deployment logs my-agent-prod --follow

# View recent logs
langgraph deployment logs my-agent-prod --limit 100

# Filter by level
langgraph deployment logs my-agent-prod --level ERROR

Via LangSmith UI

Logs are included in each trace; expand a trace to see its full logs.

Best Practices

Use consistent tags so filters stay reliable:

# Good: Consistent tags
tags=["production", "premium-tier", "chat-feature"]

# Bad: Inconsistent tags
tags=["prod", "Premium User", "CHAT"]

Attach rich metadata to every request:

metadata={
    "user_id": "alice@example.com",
    "user_tier": "premium",
    "cost_center": "sales",
    "session_id": "sess_123",
    "request_source": "mobile_app"
}

Check daily:
  • Error rate (should be < 1%)
  • P95 latency (should be < 5s)
  • Daily cost (should be within budget)
  • User satisfaction (via feedback)
Configure alerts for:
  • Error rate > 5%
  • P95 latency > 5s
  • Daily cost > $100
  • Budget 80% consumed

Troubleshooting

Traces not appearing in LangSmith?
Solution:
  • Verify LANGSMITH_TRACING=true in the environment
  • Check that the LangSmith API key is set
  • Confirm the correct project name
  • Make a test request to generate a trace

Metadata or tags missing from traces?
Solution:
# Ensure the config is passed to invoke
from langchain_core.runnables import RunnableConfig

config = RunnableConfig(
    tags=["your-tags"],
    metadata={"user_id": "alice"}
)
result = await graph.ainvoke(input, config=config)

Traces slower than expected?
Investigation:
  1. Expand the trace to see the timing breakdown
  2. Identify the slowest step
  3. Optimize:
    • LLM calls: Try a faster model or smaller prompts
    • Tool calls: Add caching or parallel execution
    • State operations: Optimize state size

Next Steps


All set! Your LangGraph Platform deployment is automatically monitored with comprehensive LangSmith tracing.