Overview
Anthos Service Mesh (managed Istio) provides secure service-to-service communication, advanced traffic management, and deep observability for microservices on GKE. The control plane is fully managed by Google, with automatic upgrades.
Mutual TLS: automatic encryption between services
Traffic Control: canary deployments, A/B testing, circuit breaking
Observability: service topology, latency, error rates
Policy Enforcement: fine-grained authorization, rate limiting
Why Service Mesh?
Challenge: by default, any pod can talk to any other pod.
Solution: the service mesh enforces mTLS plus authorization policies.
Implementation:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production-mcp-server-langgraph
spec:
  mtls:
    mode: STRICT  # All traffic must be mTLS
```
Result: encrypted, authenticated communication between services.
Use cases:
Canary releases (10% traffic to v2)
A/B testing (iOS users → v2)
Blue-green deployments
Circuit breaking (prevent cascading failures)
Without mesh: complex custom code in every service.
With mesh: declarative traffic rules (see the header-routing sketch below).
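A minimal sketch of header-based A/B routing (the x-platform header, the mcp-server-ab-test name, and the v1/v2 subsets are illustrative assumptions; subsets must be defined in a matching DestinationRule):

```yaml
# Hypothetical example: send requests tagged x-platform: ios to subset v2.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mcp-server-ab-test
  namespace: production-mcp-server-langgraph
spec:
  hosts:
    - mcp-server
  http:
    - match:
        - headers:
            x-platform:
              exact: "ios"
      route:
        - destination:
            host: mcp-server
            subset: v2
    - route:
        - destination:
            host: mcp-server
            subset: v1
```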
Service-Level Observability
Built-in metrics:
Request rate (QPS per service)
P50/P95/P99 latency
Success rate (% 2xx responses)
Service dependency graph
Without mesh: instrumentation code in every service.
With mesh: metrics collected automatically by the sidecars.
Scenario: services spread across dev, staging, and prod clusters.
Capability: a single mesh spanning all clusters.
Benefit: consistent policies and cross-cluster service discovery.
Architecture
Components:
Istiod: control plane (managed by Google, auto-upgraded)
Envoy sidecars: injected into each pod, handle all traffic
Telemetry: metrics sent to Cloud Monitoring
Quick Setup (30 minutes)
Enable APIs & Fleet Registration
```bash
./deployments/service-mesh/anthos/setup-anthos-service-mesh.sh \
  PROJECT_ID production-mcp-server-langgraph-gke us-central1
```
What it does (the equivalent manual commands are sketched after this list):
Enables Anthos Service Mesh APIs
Registers cluster with GKE Fleet
Enables managed service mesh
Waits for control plane (~10-15 min)
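If you prefer to run the steps by hand, the script roughly corresponds to the following gcloud commands (a sketch; the API list and the mcp-prod-membership name are assumptions based on the rest of this guide):

```bash
# Enable the required APIs (adjust the list to your project).
gcloud services enable mesh.googleapis.com gkehub.googleapis.com container.googleapis.com \
  --project=PROJECT_ID

# Register the cluster with the GKE Fleet.
gcloud container fleet memberships register mcp-prod-membership \
  --gke-cluster=us-central1/production-mcp-server-langgraph-gke \
  --project=PROJECT_ID

# Enable the managed control plane for that membership.
gcloud container fleet mesh update \
  --management automatic \
  --memberships=mcp-prod-membership \
  --project=PROJECT_ID
```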
Verify Installation
```bash
# Check mesh status
gcloud container fleet mesh describe --project=PROJECT_ID

# Should show:
#   state: ACTIVE
#   controlPlaneManagement: AUTOMATIC

# Verify istiod is running
kubectl get pods -n istio-system
```

The istiod pods should be in Running status.
Enable Sidecar Injection
```bash
# Label namespace for automatic injection
kubectl label namespace production-mcp-server-langgraph istio-injection=enabled

# Restart deployments to inject sidecars
kubectl rollout restart deployment/production-mcp-server-langgraph \
  -n production-mcp-server-langgraph
```
Verify Sidecars Injected
```bash
# Check pods have 2 containers (app + envoy)
kubectl get pods -n production-mcp-server-langgraph

# Should show:
#   NAME                                   READY   STATUS
#   production-mcp-server-langgraph-...    2/2     Running
#                                          ^^^ app + sidecar

# Describe the pod to see the istio-proxy container
kubectl describe pod POD_NAME -n production-mcp-server-langgraph | grep istio-proxy
```
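If a specific workload in the labeled namespace should not get a sidecar (for example a one-off job that never talks to mesh services), injection can be disabled per pod with the sidecar.istio.io/inject annotation instead of unlabeling the namespace. A sketch with a hypothetical Job:

```yaml
# Hypothetical Job that opts out of automatic sidecar injection.
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-migration
  namespace: production-mcp-server-langgraph
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"  # no Envoy sidecar for this pod
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: busybox
          command: ["sh", "-c", "echo migration placeholder"]
```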
Enable Strict mTLS
```bash
kubectl apply -f - << EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production-mcp-server-langgraph
spec:
  mtls:
    mode: STRICT
EOF
```
All traffic to workloads in this namespace is now required to use mTLS.
Verify mTLS
```bash
# Check the Kiali dashboard, or inspect workload certificates with istioctl
istioctl proxy-config secret POD_NAME -n production-mcp-server-langgraph

# Should show the TLS certificates issued to the workload
```
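To confirm plaintext traffic is actually rejected, one option is to call the service from a pod outside the mesh (a sketch, assuming the service is exposed as mcp-server on port 8000; adjust the name, port, and path to your deployment):

```bash
# Run a throwaway curl pod in a namespace WITHOUT sidecar injection.
kubectl run mtls-probe --rm -it --restart=Never --image=curlimages/curl -n default -- \
  curl -sv -m 5 http://mcp-server.production-mcp-server-langgraph.svc.cluster.local:8000/

# With STRICT mTLS the plaintext request should fail (connection reset),
# while in-mesh clients with sidecars continue to work.
```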
Traffic Management
Canary Deployment
Deploy new version to 10% of traffic:
VirtualService (traffic split):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mcp-server
  namespace: production-mcp-server-langgraph
spec:
  hosts:
    - mcp-server
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: mcp-server
            subset: v2
          weight: 100
    - route:
        - destination:
            host: mcp-server
            subset: v1
          weight: 90
        - destination:
            host: mcp-server
            subset: v2
          weight: 10  # 10% to canary
```
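DestinationRule (define subsets): the v1 and v2 subsets referenced by the VirtualService have to be declared in a DestinationRule. A sketch, assuming pods are labeled version: v1 and version: v2 as described in the workflow below:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mcp-server
  namespace: production-mcp-server-langgraph
spec:
  host: mcp-server
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```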
Workflow:
Deploy v2 with label version: v2
Apply VirtualService (10% → v2)
Monitor metrics for 30 minutes
If healthy, increase to 50%, then 100%
If unhealthy, revert to 0%
Circuit Breaking
Prevent cascading failures:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: postgres-proxy
  namespace: production-mcp-server-langgraph
spec:
  host: postgres-proxy
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
Behavior: after 5 consecutive 5xx errors, the offending pod is ejected from the load-balancing pool for 30 seconds (at most 50% of pods are ejected at once).
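One way to watch ejections happen is to read the outlier-detection counters from a client pod's sidecar (a sketch; it assumes pilot-agent's request helper, which forwards to the local Envoy admin endpoint inside the istio-proxy container):

```bash
# Dump outlier-detection stats from the sidecar of a pod that calls postgres-proxy.
kubectl exec POD_NAME -n production-mcp-server-langgraph -c istio-proxy -- \
  pilot-agent request GET stats | grep outlier_detection
```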
Retry Policy
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mcp-server
spec:
  hosts:
    - mcp-server
  http:
    - route:
        - destination:
            host: mcp-server
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,reset,connect-failure
```
Security
Strict mTLS
Cluster-wide (applies to all namespaces in the mesh):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

Namespace-specific STRICT policies were applied in the quick setup above; a PERMISSIVE variant for migration is shown under Best Practices below.
Authorization Policies
Deny-all by default:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: production-mcp-server-langgraph
spec: {}  # Empty spec = deny all
```
Allow a specific service:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-mcp-server
  namespace: production-mcp-server-langgraph
spec:
  selector:
    matchLabels:
      app: postgres-proxy
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/production-mcp-server-langgraph/sa/mcp-server"]
      to:
        - operation:
            methods: ["GET", "POST"]
            paths: ["/api/*"]
```
Result: only the mcp-server service account can call postgres-proxy, and only GET/POST requests to /api/* paths.
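A quick way to confirm the policy is to call postgres-proxy from a pod that does not run as the mcp-server service account; the sidecar should reject the request. A sketch, assuming postgres-proxy serves HTTP on port 8080 and exposes an /api/health path (adjust to your service):

```bash
# The probe runs under the namespace's default service account,
# so the request should be rejected with HTTP 403 "RBAC: access denied".
kubectl run authz-probe --rm -it --restart=Never --image=curlimages/curl \
  -n production-mcp-server-langgraph -- \
  curl -si http://postgres-proxy:8080/api/health
# Note: the probe pod gets a sidecar too; delete it manually if --rm leaves it behind.
```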
Observability
Service Topology
View in Google Cloud Console:
Navigation → Anthos → Service Mesh → Topology
Shows:
Service dependency graph
Traffic flow between services
Error rates per edge
Metrics
Request rate (QPS per service), as a PromQL query over the sidecar metrics:

```promql
rate(istio_requests_total{
  destination_service_name="mcp-server",
  destination_workload_namespace="production-mcp-server-langgraph"
}[1m])
```
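The matching P95 latency and error-rate queries would look roughly like this (a sketch using the standard Istio metric names, with the same label values as above):

```promql
# P95 latency (ms) over the last 5 minutes
histogram_quantile(0.95,
  sum by (le) (rate(istio_request_duration_milliseconds_bucket{
    destination_service_name="mcp-server",
    destination_workload_namespace="production-mcp-server-langgraph"
  }[5m])))

# Error rate: share of responses that are 5xx
sum(rate(istio_requests_total{
  destination_service_name="mcp-server",
  destination_workload_namespace="production-mcp-server-langgraph",
  response_code=~"5.."
}[5m]))
/
sum(rate(istio_requests_total{
  destination_service_name="mcp-server",
  destination_workload_namespace="production-mcp-server-langgraph"
}[5m]))
```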
Dashboards
Import pre-built dashboards:
```bash
# Install Kiali (service mesh dashboard)
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/kiali.yaml

# Port-forward
kubectl port-forward svc/kiali -n istio-system 20001:20001

# Open http://localhost:20001
```
Features:
Service graph visualization
Traffic animation
Configuration validation
Distributed tracing
Multi-Cluster Mesh
Register All Clusters
```bash
# Dev cluster
gcloud container fleet memberships register mcp-dev-membership \
  --gke-cluster=us-central1/mcp-dev-gke \
  --project=PROJECT_ID

# Staging cluster
gcloud container fleet memberships register mcp-staging-membership \
  --gke-cluster=us-central1/mcp-staging-gke \
  --project=PROJECT_ID

# Prod cluster is already registered from the setup above
```
Enable Mesh for All
```bash
gcloud container fleet mesh update \
  --management automatic \
  --memberships=mcp-dev-membership,mcp-staging-membership,mcp-prod-membership \
  --project=PROJECT_ID
```
Configure Cross-Cluster Service Discovery
```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-staging-service
  namespace: production-mcp-server-langgraph
spec:
  hosts:
    - mcp-server.mcp-staging.svc.cluster.local
  location: MESH_INTERNAL
  ports:
    - number: 8000
      name: http
      protocol: HTTP
  resolution: DNS
```
Use case: production can call staging services for integration testing.
Troubleshooting
Symptom: pod has 1/1 containers (should be 2/2).
Checks:

```bash
# Verify the namespace is labeled
kubectl get namespace production-mcp-server-langgraph --show-labels
# Should see: istio-injection=enabled

# Check the injection webhook is present
kubectl get mutatingwebhookconfigurations
```

Solution: label the namespace and restart the pods.
Symptom: Service A can't connect to Service B.
Checks:

```bash
# Check PeerAuthentication
kubectl get peerauthentication -n production-mcp-server-langgraph

# Check DestinationRule
kubectl get destinationrule -n production-mcp-server-langgraph

# Verify certificates
istioctl proxy-config secret POD_NAME -n production-mcp-server-langgraph
```

Common fix: ensure both sides have sidecars injected.
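istioctl can also summarize which routing rules, policies, and mTLS settings apply to a pod (assuming istioctl is installed locally and pointed at the cluster):

```bash
# Show effective VirtualServices, DestinationRules, and mTLS mode for a pod
istioctl experimental describe pod POD_NAME -n production-mcp-server-langgraph
```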
Symptom: mesh status shows PROVISIONING for more than 20 minutes.
Solution:

```bash
# Check fleet status
gcloud container fleet mesh describe --project=PROJECT_ID

# View control-plane logs
kubectl logs -n istio-system deployment/istiod

# If stuck, re-enable managed mesh
gcloud container fleet mesh update \
  --management automatic \
  --memberships=MEMBERSHIP_NAME \
  --project=PROJECT_ID
```
Best Practices
Start with PERMISSIVE mTLS, then move to STRICT
```yaml
# Week 1: PERMISSIVE (allows plaintext during migration)
mtls:
  mode: PERMISSIVE

# Week 2: STRICT (after all services have sidecars)
mtls:
  mode: STRICT
```
Use namespace-scoped policies for isolation
```yaml
# Production has strict mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production-mcp-server-langgraph
spec:
  mtls:
    mode: STRICT
---
# Dev can be permissive
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: mcp-dev
spec:
  mtls:
    mode: PERMISSIVE
```
Enable resource limits on sidecars
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-sidecar-injector
  namespace: istio-system
data:
  values: |
    global:
      proxy:
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
```
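Alternatively, sidecar resources can be tuned per workload with pod annotations, which avoids editing injector-wide configuration. A sketch of the pod-template annotations (values shown are just the same numbers as above):

```yaml
# Deployment pod-template snippet: per-workload sidecar resource overrides.
  template:
    metadata:
      annotations:
        sidecar.istio.io/proxyCPU: "100m"
        sidecar.istio.io/proxyMemory: "128Mi"
        sidecar.istio.io/proxyCPULimit: "500m"
        sidecar.istio.io/proxyMemoryLimit: "256Mi"
```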
Monitor mesh health with SLIs
```
# SLI: 99% of requests < 500 ms
# SLI: 99.9% success rate (non-5xx)
# Alert if the error budget is depleted
```
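The success-rate SLI can be computed directly from the sidecar metrics, for example (a sketch using the standard Istio metric names; narrow the labels per service as needed):

```promql
# Success rate over 5 minutes (fraction of non-5xx responses)
sum(rate(istio_requests_total{
  destination_workload_namespace="production-mcp-server-langgraph",
  response_code!~"5.."
}[5m]))
/
sum(rate(istio_requests_total{
  destination_workload_namespace="production-mcp-server-langgraph"
}[5m]))
```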
Next Steps
Install Anthos Service Mesh
```bash
./deployments/service-mesh/anthos/setup-anthos-service-mesh.sh PROJECT_ID
```
Enable Sidecar Injection
```bash
kubectl label namespace production-mcp-server-langgraph istio-injection=enabled
kubectl rollout restart deployment -n production-mcp-server-langgraph
```
Enable Strict mTLS
```bash
kubectl apply -f deployments/service-mesh/anthos/peer-authentication.yaml
```
Configure Traffic Rules
Set up canary deployments, circuit breaking, retries
Monitor Service Topology
Console → Anthos → Service Mesh → Topology