3. Dual Observability: OpenTelemetry + LangSmith
Date: 2025-10-11
Status: Accepted
Category: Core Architecture

Context
Production systems require comprehensive observability to:
- Debug issues quickly
- Monitor performance
- Track user behavior
- Optimize LLM usage
- Meet SLA requirements
Infrastructure Observability: Traditional application metrics
- Request latency, error rates, throughput
- System resources (CPU, memory, disk)
- Distributed tracing across services
- Database query performance
LLM-Specific Observability: AI/ML-specific insights
- Prompt quality and effectiveness
- LLM response quality
- Token usage and costs
- Model performance comparison
- Chain/agent execution flow
Decision
We will implement a dual observability strategy:
OpenTelemetry for infrastructure observability
- Distributed tracing with Jaeger
- Metrics with Prometheus
- Structured logging with trace correlation
- Standard OTLP exporters
LangSmith for LLM-specific observability
- Prompt engineering and debugging
- LLM call tracing and analysis
- Token usage tracking
- Model comparison and evaluation
- Dataset management
The backend is selected via the OBSERVABILITY_BACKEND setting:
- OBSERVABILITY_BACKEND=opentelemetry (default)
- OBSERVABILITY_BACKEND=langsmith
- OBSERVABILITY_BACKEND=both (recommended for production)
Consequences
Positive Consequences
- Best of Both Worlds: Infrastructure + AI observability
- Complete Visibility: No blind spots in production
- Tool Specialization: Each tool does what it does best
- Flexibility: Can use one or both based on needs
- Industry Standard: OpenTelemetry is CNCF standard
- LLM Optimization: LangSmith enables prompt engineering
- Cost Tracking: Detailed token usage visibility
Negative Consequences
- Increased Complexity: Two systems to configure and maintain
- Higher Infrastructure Cost: Running both Jaeger and LangSmith
- Learning Curve: Team must learn both systems
- Data Duplication: Some overlap in traced data
- Configuration Overhead: Separate config for each system
Neutral Consequences
- Performance: Minimal overhead (~1-2% with both enabled)
- Storage: Increased log/trace storage requirements
- Vendor Risk: LangSmith is a commercial product (OpenTelemetry is free and open-source)
Alternatives Considered
1. OpenTelemetry Only
Description: Use only OpenTelemetry for all observability

Pros:
- Single system to maintain
- Open-source, no vendor lock-in
- Industry standard
- Great for infrastructure

Cons:
- Poor LLM-specific insights
- No prompt debugging tools
- Manual token tracking
- No model comparison features
2. LangSmith Only
Description: Use only LangSmith for all observability

Pros:
- Excellent LLM tracing
- Great prompt debugging
- Built-in evaluations
- Cost tracking

Cons:
- Vendor lock-in (LangChain product)
- Poor infrastructure metrics
- No distributed tracing
- Less flexible than OpenTelemetry
3. Datadog or New Relic
Description: Use a commercial APM solution

Pros:
- All-in-one solution
- Good infrastructure observability
- Some LLM features

Cons:
- Expensive at scale
- Vendor lock-in
- LLM features not as mature
- Less flexible than open standards
4. Prometheus + Grafana Only
Description: Use a metrics-focused stack

Pros:
- Excellent metrics
- Great visualization
- Open-source

Cons:
- No distributed tracing
- No LLM-specific features
- Manual instrumentation
5. Custom Logging Solution
Description: Build a custom observability solution

Pros:
- Full control
- Exactly what we need

Cons:
- Massive development effort
- Reinventing the wheel
- Hard to maintain
- No standard tools
Implementation Details
OpenTelemetry Stack
LangSmith Integration
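A hedged sketch of the LangSmith side. The `traceable` decorator is part of the langsmith SDK; the `summarize` function and the import fallback are illustrative. Traces are only shipped when the LangSmith tracing environment variables (e.g. an API key) are set, so the call is harmless locally:

```python
# Illustrative sketch: tracing a function with LangSmith's @traceable decorator.
# Falls back to a no-op so the snippet runs even without the langsmith package.
try:
    from langsmith import traceable
except ImportError:
    def traceable(func):
        return func  # no-op fallback when langsmith is not installed

@traceable  # recorded in LangSmith when tracing env vars / API key are configured
def summarize(text: str) -> str:
    # A real implementation would call the LLM provider here.
    return text[:40]

result = summarize("Dual observability pairs OpenTelemetry with LangSmith.")
```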
Configuration
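The actual settings live in src/mcp_server_langgraph/core/config.py; below is a stdlib-only sketch of how the OBSERVABILITY_BACKEND switch from the Decision section might be resolved. The helper name and default are assumptions:

```python
import os

def observability_backends() -> set[str]:
    """Hypothetical helper: resolve OBSERVABILITY_BACKEND into active backends."""
    value = os.getenv("OBSERVABILITY_BACKEND", "opentelemetry").lower()
    if value == "both":  # the production-recommended setting per this ADR
        return {"opentelemetry", "langsmith"}
    return {value}

backends = observability_backends()  # {"opentelemetry"} when the variable is unset
```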
Docker Compose Stack
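A sketch of what the local compose stack might contain; image tags and port mappings are assumptions, and the project's actual compose file may differ:

```yaml
# Illustrative docker-compose fragment for the infrastructure side.
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC ingest
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"     # Prometheus UI
```

LangSmith is a hosted service, so it typically needs no local container, only an API key.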
Usage
Metrics Tracked
OpenTelemetry Metrics
- agent.tool.calls - Tool invocation count
- agent.calls.successful - Success rate
- agent.calls.failed - Error rate
- agent.response.duration - Latency histogram
- auth.failures - Auth errors
- authz.failures - Authorization denials
LangSmith Metrics
- Token usage per model
- Cost per request
- Prompt templates used
- Model performance comparison
- Chain execution paths
References
- OpenTelemetry Documentation
- LangSmith Documentation
- LangSmith Tracing Guide
- Observability Overview
- Related Files: observability.py, langsmith_src/mcp_server_langgraph/core/config.py
- Related ADRs: 0001 (LLM abstraction)