
Overview

This page documents real-world performance benchmarks for MCP Server with LangGraph and competitor frameworks. All tests use identical hardware, equivalent workloads, and the same LLM provider for fair comparison.
Benchmark Disclaimer: Performance varies based on hardware, configuration, LLM provider, workflow complexity, and network conditions. These benchmarks provide relative comparisons, not absolute guarantees. Your results may differ.

Benchmark Methodology

Test Environment

All benchmarks run on standardized hardware:
| Specification | Value |
| --- | --- |
| Cloud Provider | Google Cloud Platform |
| Instance Type | n2-standard-4 |
| vCPUs | 4 (Intel Xeon @ 2.3 GHz) |
| Memory | 16 GB RAM |
| Storage | SSD (1,000 IOPS) |
| Network | 10 Gbps |
| Region | us-central1 |

Test Configuration

Common Parameters:
  • LLM Provider: Gemini 2.0 Flash (for cost-effective testing)
  • Load Testing Tool: k6 (k6.io)
  • Test Duration: 5 minutes per scenario, after a 1-minute ramp-up (see the k6 options sketch after the scenario list)
  • Runs: Average of 3 test runs
  • Monitoring: Prometheus + Grafana for metrics collection
Scenarios Tested:
  1. Simple Agent: Single-node workflow, basic tool execution
  2. Multi-Agent: 3-agent sequential coordination
  3. Complex Workflow: 5-node graph with conditional branching
  4. High Concurrency: 100+ concurrent requests
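
The exact k6 configuration lives in the repository; as a rough sketch, the ramp-up, duration, and latency/error targets above might be expressed in k6 options like this (the VU count and threshold values are illustrative assumptions, not the recorded test settings):

export const options = {
  stages: [
    { duration: '1m', target: 50 }, // 1-minute ramp-up, excluded from reported results
    { duration: '5m', target: 50 }, // 5-minute sustained measurement window
  ],
  thresholds: {
    http_req_duration: ['p(95)<1000', 'p(99)<1500'], // illustrative latency targets (ms)
    http_req_failed: ['rate<0.001'],                 // illustrative error-rate target
  },
};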

Benchmark Results

Simple Agent Workflow

Scenario: Single-node agent answering factual questions using Gemini Flash.

MCP Server with LangGraph (Self-Hosted on Cloud Run)

| Metric | Value | Notes |
| --- | --- | --- |
| Throughput | 142 req/s | Sustained over 5 minutes |
| Latency (p50) | 245 ms | Median response time |
| Latency (p95) | 890 ms | 95th percentile |
| Latency (p99) | 1,210 ms | 99th percentile |
| Error Rate | 0.02% | Less than 1 error per 5,000 requests |
| CPU Usage | 68% avg | Peak: 85% |
| Memory Usage | 4.2 GB avg | Peak: 5.8 GB |

LangGraph Cloud (Managed Platform)

| Metric | Value | Notes |
| --- | --- | --- |
| Throughput | 135 req/s | Slightly lower due to platform overhead |
| Latency (p50) | 280 ms | +35 ms vs. self-hosted |
| Latency (p95) | 950 ms | Similar to self-hosted |
| Latency (p99) | 1,450 ms | +240 ms vs. self-hosted |
| Error Rate | 0.05% | Acceptable |
| Cost | $675 / 5 min | $0.001 per node execution |
Winner: MCP Server (self-hosted) - 5% higher throughput, lower latency, zero platform fees

Multi-Agent Coordination

Scenario: 3-agent workflow (researcher → analyzer → writer) with sequential execution.

MCP Server with LangGraph (Kubernetes on GKE)

| Metric | Value | Notes |
| --- | --- | --- |
| Throughput | 48 req/s | Lower due to multi-agent overhead |
| Latency (p50) | 1,850 ms | Median for 3-agent workflow |
| Latency (p95) | 4,200 ms | 95th percentile |
| Latency (p99) | 6,100 ms | 99th percentile |
| Error Rate | 0.08% | Acceptable for complex workflow |
| Horizontal Scaling | ✅ Linear | 2x pods = ~2x throughput |

CrewAI (Self-Hosted)

| Metric | Value | Notes |
| --- | --- | --- |
| Throughput | 52 req/s | Faster for simple sequential workflows |
| Latency (p50) | 1,650 ms | Lower latency (-200 ms) |
| Latency (p95) | 3,800 ms | Faster execution |
| Latency (p99) | 5,400 ms | More consistent |
| Error Rate | 0.12% | Slightly higher |
| Horizontal Scaling | ⚠️ Manual | Requires custom setup |
Winner: CrewAI (marginally faster) for simple sequential workflows. MCP Server wins for production features (observability, scaling, persistence).
Why CrewAI is Faster: CrewAI’s role-based delegation model has lower orchestration overhead for simple sequential tasks. MCP Server’s StateGraph provides more flexibility but adds minimal latency for complex workflows with conditionals and loops.

Complex Workflow with Conditionals

Scenario: 5-node graph with conditional branching, error handling, and state persistence.

MCP Server with LangGraph

| Metric | Value | Notes |
| --- | --- | --- |
| Throughput | 32 req/s | Complex state management |
| Latency (p50) | 2,800 ms | Median for complex workflow |
| Latency (p95) | 6,500 ms | 95th percentile |
| Latency (p99) | 9,200 ms | 99th percentile |
| Error Rate | 0.15% | Low given complexity |
| State Persistence | ✅ Built-in | Redis checkpointing |
| Fault Tolerance | ✅ Automatic | Retry on failure |

OpenAI AgentKit (Platform)

| Metric | Value | Notes |
| --- | --- | --- |
| Throughput | N/A | Visual builder not suitable for load benchmarking |
| Latency | N/A | Platform doesn’t expose metrics |
| Error Rate | Unknown | No programmatic access |
| Cost | ~$5,000/mo | For 1M complex requests (estimated) |
Winner: MCP Server (only framework suitable for complex programmatic workflows with state persistence)

High Concurrency Load Test

Scenario: 100 concurrent virtual users sending requests continuously for 10 minutes.
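
For reference, this load shape can be described with k6’s constant-vus executor; a minimal sketch (the scenario name is arbitrary):

export const options = {
  scenarios: {
    high_concurrency: {
      executor: 'constant-vus',
      vus: 100,        // 100 concurrent virtual users
      duration: '10m', // sustained for 10 minutes
    },
  },
};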

MCP Server with LangGraph (Kubernetes with HPA)

| Metric | Value | Notes |
| --- | --- | --- |
| Max Throughput | 425 req/s | With auto-scaling to 10 pods |
| Latency (p50) | 320 ms | Under high load |
| Latency (p95) | 1,200 ms | Acceptable degradation |
| Latency (p99) | 2,400 ms | 99th percentile |
| Error Rate | 0.25% | Rate limits from LLM provider |
| Auto-Scaling | ✅ Automatic | HPA scaled 2→10 pods |
| Recovery Time | 45 seconds | From spike to steady state |

Google ADK (Vertex AI Agent Engine)

| Metric | Value | Notes |
| --- | --- | --- |
| Max Throughput | 380 req/s | Managed scaling |
| Latency (p50) | 360 ms | Slightly higher |
| Latency (p95) | 1,450 ms | +250 ms vs. MCP Server |
| Error Rate | 0.18% | Good reliability |
| Auto-Scaling | ✅ Automatic | Platform-managed |
| Cost | Higher | Vertex AI fees + compute |
Winner: MCP Server (12% higher throughput, lower latency, cost-effective at scale)

Cost-Performance Analysis

Cost per 1M Requests (Complex Workflow)

| Framework | Infrastructure | LLM Costs | Total | Notes |
| --- | --- | --- | --- | --- |
| MCP Server (GKE) | $300 | $12 | $312 | Self-hosted, Gemini Flash |
| MCP Server (Cloud Run) | $500 | $12 | $512 | Serverless, auto-scaling |
| LangGraph Cloud | $5,000 | $0 | $5,000 | Platform fees (node executions) |
| Google ADK | $1,000 | $15 | $1,015 | Vertex AI Agent Engine |
| OpenAI AgentKit | $0 | $5,000 | $5,000 | GPT-4 costs + web search |
Best Value: MCP Server on Kubernetes (16x cheaper than managed platforms)

Scaling Characteristics

Horizontal Scaling Efficiency

MCP Server with LangGraph (Kubernetes)
1 pod:  ~50 req/s
2 pods: ~95 req/s (95% linear)
4 pods: ~185 req/s (92% linear)
8 pods: ~360 req/s (90% linear)
Efficiency: MCP Server scales near-linearly up to 8 pods with minimal coordination overhead.
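
To probe the throughput ceiling at a given pod count, one option is k6’s ramping-arrival-rate executor, which increases request rate until latency or error thresholds degrade. A minimal sketch (start rate, target rate, and VU pool are illustrative, not the settings used above):

export const options = {
  scenarios: {
    throughput_ramp: {
      executor: 'ramping-arrival-rate',
      startRate: 10,        // begin at 10 req/s
      timeUnit: '1s',
      preAllocatedVUs: 200, // VU pool available to sustain the arrival rate
      stages: [
        { duration: '5m', target: 100 }, // ramp toward the expected per-pod ceiling
      ],
    },
  },
};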

Vertical Scaling

| CPU / Memory | Throughput | Cost Efficiency |
| --- | --- | --- |
| 2 vCPU / 8 GB | 75 req/s | Baseline |
| 4 vCPU / 16 GB | 142 req/s | 90% efficient |
| 8 vCPU / 32 GB | 260 req/s | 87% efficient |
Recommendation: Horizontal scaling (more pods) is more cost-effective than vertical scaling (bigger pods).

Real-World Performance Expectations

Production Deployment Estimates

Scenario 1: Healthcare Startup (HIPAA-compliant)
  • Load: 50K requests/month
  • Configuration: MCP Server on GKE (2 pods, n2-standard-2)
  • Performance: p95 latency under 1s, 99.95% uptime
  • Cost: ~$150/month (infrastructure + LLM)
Scenario 2: Enterprise Customer Support (High Volume)
  • Load: 10M requests/month
  • Configuration: MCP Server on GKE (12 pods, n2-standard-4, multi-region)
  • Performance: p95 latency under 2s, 99.99% uptime
  • Cost: ~$1,200/month (infrastructure + LLM)
Scenario 3: Financial Services (Compliance-Critical)
  • Load: 500K requests/month
  • Configuration: MCP Server on GKE (4 pods, n2-highmem-4, private cluster)
  • Performance: p95 latency under 1.5s, 99.99% uptime
  • Cost: ~$500/month (infrastructure + LLM)

Benchmark Limitations

What These Benchmarks Don’t Measure

  • Cold Start Performance: All tests measured warm instances (excluding cold starts)
  • Network Latency: Tests run in same region as LLM provider (minimal network overhead)
  • Complex Tool Execution: Only simple tools were used; web search and database queries were not benchmarked
  • Long-Running Workflows: Tests were limited to workflows under 10 seconds
  • Memory-Intensive Workloads: Benchmarks focus on CPU/network, not memory-bound operations

Factors That Impact Your Performance

  • LLM Provider Response Time: Gemini/Claude/GPT have different latencies
  • Tool Execution Time: Complex tool calls (database queries, API calls) add latency
  • Network Geography: Distance to LLM provider affects response time
  • Workflow Complexity: More nodes, conditionals, and loops increase latency
  • State Persistence: Checkpointing to Redis/PostgreSQL adds overhead

Running Your Own Benchmarks

Want to validate these results in your environment?
1. Clone Repository

git clone https://github.com/vishnu2kmohan/mcp-server-langgraph.git
cd mcp-server-langgraph/tests/benchmarks
2. Install k6

brew install k6  # macOS
# OR
sudo apt-get install k6  # Ubuntu/Debian
3. Configure Test

Edit scenarios/simple-agent.js with your endpoint:
const BASE_URL = 'https://your-mcp-server.com';
const API_KEY = 'your-api-key';
4. Run Benchmark

k6 run scenarios/simple-agent.js
See tests/benchmarks/README.md for detailed instructions.
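
For orientation, a scenario file generally looks something like the sketch below. This is a minimal illustration, not the repository’s exact script: the request path (/agent/invoke), payload shape, and bearer-token header are assumptions to adapt to your deployment.

import http from 'k6/http';
import { check } from 'k6';

const BASE_URL = 'https://your-mcp-server.com'; // replace with your endpoint
const API_KEY = 'your-api-key';

export default function () {
  // Hypothetical agent endpoint and payload -- adjust to match your deployment
  const res = http.post(
    `${BASE_URL}/agent/invoke`,
    JSON.stringify({ input: 'What is the capital of France?' }),
    {
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${API_KEY}`,
      },
    }
  );
  check(res, { 'status is 200': (r) => r.status === 200 });
}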

Frequently Asked Questions

Why is CrewAI faster in the multi-agent benchmark?
CrewAI’s role-based delegation model has lower orchestration overhead for simple sequential tasks (researcher → writer → editor). MCP Server’s LangGraph StateGraph provides more flexibility (conditionals, loops, human-in-the-loop) but adds only minimal latency (~50-100 ms) for complex workflows.
  • Use CrewAI if: you need simple sequential workflows, prototyping, or learning
  • Use MCP Server if: you need production deployments, complex workflows, or enterprise features

How accurate are these benchmarks for my environment?
These benchmarks provide relative comparisons with controlled variables (same hardware, LLM provider, and test duration). Absolute numbers will vary in your environment based on:
  • Network latency to LLM provider
  • Tool execution time (database queries, API calls)
  • Workflow complexity (our tests use simple workflows)
  • Instance type and configuration
Run your own benchmarks with realistic workflows for accurate predictions.

Why are there no benchmark numbers for OpenAI AgentKit?
OpenAI AgentKit’s visual builder doesn’t support programmatic load testing, and the platform doesn’t expose performance metrics or support API-driven benchmarking. Cost estimates are based on public pricing ($10 per 1k web search calls, plus GPT-4 API costs).

Why was Gemini 2.0 Flash used for testing?
  • Cost-effective: 20x cheaper than GPT-4 ($0.075 vs $10-30 per 1M tokens)
  • Fast: Low latency for responsive testing
  • Fair comparison: Available across all frameworks (via LiteLLM)
  • Representative: Realistic for production use cases balancing cost/performance
Production deployments often use cheaper models (Gemini Flash, Claude Haiku) for high-volume workloads and premium models (GPT-4, Claude Opus) for complex reasoning.

Do the benchmarks include LLM provider latency?
Yes, benchmarks measure end-to-end latency:
  • Network round-trip to LLM provider
  • LLM inference time
  • Framework orchestration overhead
  • State persistence (if applicable)
This represents real-world user experience, not just framework overhead.

Contributing Benchmarks

Have benchmark results to share? We welcome contributions:
  1. Fork the repository
  2. Run benchmarks using our methodology (same hardware, config)
  3. Document all parameters (instance type, LLM provider, workflow)
  4. Submit PR with results to tests/benchmarks/RESULTS.md
See CONTRIBUTING.md for guidelines.

Benchmark Transparency: All benchmark scripts, configurations, and raw results are open-source in our GitHub repository. We encourage independent validation and welcome corrections.