Overview
Anthos Service Mesh (managed Istio) provides secure service-to-service communication, advanced traffic management, and deep observability for microservices on GKE. The control plane is fully managed by Google, with automatic upgrades.
Mutual TLS: automatic encryption between services
Traffic Control: canary deployments, A/B testing, circuit breaking
Observability: service topology, latency, error rates
Policy Enforcement: fine-grained authorization, rate limiting
Why Service Mesh?
Challenge: by default, any pod can talk to any other pod.
Solution: the service mesh enforces mTLS plus authorization policies.
Implementation:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production-mcp-server-langgraph
spec:
  mtls:
    mode: STRICT  # All traffic must be mTLS
```
Result: encrypted, authenticated communication between services.
Use cases:
Canary releases (10% traffic to v2)
A/B testing (iOS users → v2)
Blue-green deployments
Circuit breaking (prevent cascading failures)
Without mesh: complex custom code in every service.
With mesh: declarative traffic rules (see the header-routing sketch below).
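A minimal sketch of header-based A/B routing (the x-platform header, the mcp-server-ab-test name, and the v1/v2 subsets are illustrative assumptions; subsets must be defined in a matching DestinationRule):

```yaml
# Hypothetical example: send requests tagged x-platform: ios to subset v2.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mcp-server-ab-test
  namespace: production-mcp-server-langgraph
spec:
  hosts:
    - mcp-server
  http:
    - match:
        - headers:
            x-platform:
              exact: "ios"
      route:
        - destination:
            host: mcp-server
            subset: v2
    - route:
        - destination:
            host: mcp-server
            subset: v1
```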
Service-Level Observability
Built-in metrics:
Request rate (QPS per service)
P50/P95/P99 latency
Success rate (% 2xx responses)
Service dependency graph
Without mesh: instrumentation code in every service.
With mesh: metrics collected automatically by the sidecars.
Scenario: services spread across dev, staging, and prod clusters.
Capability: a single mesh spanning all clusters.
Benefit: consistent policies and cross-cluster service discovery.
Architecture
Components:
Istiod: control plane (managed by Google, auto-upgraded)
Envoy sidecars: injected into each pod, handle all traffic
Telemetry: metrics sent to Cloud Monitoring
Quick Setup (30 minutes)
Enable APIs & Fleet Registration
```bash
./deployments/service-mesh/anthos/setup-anthos-service-mesh.sh \
  PROJECT_ID production-mcp-server-langgraph-gke us-central1
```
What it does (the equivalent manual commands are sketched after this list):
Enables Anthos Service Mesh APIs
Registers cluster with GKE Fleet
Enables managed service mesh
Waits for control plane (~10-15 min)
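If you prefer to run the steps by hand, the script roughly corresponds to the following gcloud commands (a sketch; the API list and the mcp-prod-membership name are assumptions based on the rest of this guide):

```bash
# Enable the required APIs (adjust the list to your project).
gcloud services enable mesh.googleapis.com gkehub.googleapis.com container.googleapis.com \
  --project=PROJECT_ID

# Register the cluster with the GKE Fleet.
gcloud container fleet memberships register mcp-prod-membership \
  --gke-cluster=us-central1/production-mcp-server-langgraph-gke \
  --project=PROJECT_ID

# Enable the managed control plane for that membership.
gcloud container fleet mesh update \
  --management automatic \
  --memberships=mcp-prod-membership \
  --project=PROJECT_ID
```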
Verify Installation
```bash
# Check mesh status
gcloud container fleet mesh describe --project=PROJECT_ID

# Should show:
#   state: ACTIVE
#   controlPlaneManagement: AUTOMATIC

# Verify istiod is running
kubectl get pods -n istio-system
```

The istiod pods should be in Running status.
Enable Sidecar Injection
```bash
# Label namespace for automatic injection
kubectl label namespace production-mcp-server-langgraph istio-injection=enabled

# Restart deployments to inject sidecars
kubectl rollout restart deployment/production-mcp-server-langgraph \
  -n production-mcp-server-langgraph
```
Verify Sidecars Injected
```bash
# Check pods have 2 containers (app + envoy)
kubectl get pods -n production-mcp-server-langgraph

# Should show:
#   NAME                                   READY   STATUS
#   production-mcp-server-langgraph-...    2/2     Running
#                                          ^^^ app + sidecar

# Describe the pod to see the istio-proxy container
kubectl describe pod POD_NAME -n production-mcp-server-langgraph | grep istio-proxy
```
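If a specific workload in the labeled namespace should not get a sidecar (for example a one-off job that never talks to mesh services), injection can be disabled per pod with the sidecar.istio.io/inject annotation instead of unlabeling the namespace. A sketch with a hypothetical Job:

```yaml
# Hypothetical Job that opts out of automatic sidecar injection.
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-migration
  namespace: production-mcp-server-langgraph
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"  # no Envoy sidecar for this pod
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: busybox
          command: ["sh", "-c", "echo migration placeholder"]
```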
Enable Strict mTLS
```bash
kubectl apply -f - << EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production-mcp-server-langgraph
spec:
  mtls:
    mode: STRICT
EOF
```
All traffic to workloads in this namespace is now required to use mTLS.
Verify mTLS
```bash
# Check the Kiali dashboard, or inspect workload certificates with istioctl
istioctl proxy-config secret POD_NAME -n production-mcp-server-langgraph

# Should show the TLS certificates issued to the workload
```
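To confirm plaintext traffic is actually rejected, one option is to call the service from a pod outside the mesh (a sketch, assuming the service is exposed as mcp-server on port 8000; adjust the name, port, and path to your deployment):

```bash
# Run a throwaway curl pod in a namespace WITHOUT sidecar injection.
kubectl run mtls-probe --rm -it --restart=Never --image=curlimages/curl -n default -- \
  curl -sv -m 5 http://mcp-server.production-mcp-server-langgraph.svc.cluster.local:8000/

# With STRICT mTLS the plaintext request should fail (connection reset),
# while in-mesh clients with sidecars continue to work.
```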
Traffic Management
Canary Deployment
Deploy new version to 10% of traffic:
VirtualService (traffic split):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mcp-server
  namespace: production-mcp-server-langgraph
spec:
  hosts:
    - mcp-server
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: mcp-server
            subset: v2
          weight: 100
    - route:
        - destination:
            host: mcp-server
            subset: v1
          weight: 90
        - destination:
            host: mcp-server
            subset: v2
          weight: 10  # 10% to canary
```
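DestinationRule (define subsets): the v1 and v2 subsets referenced by the VirtualService have to be declared in a DestinationRule. A sketch, assuming pods are labeled version: v1 and version: v2 as described in the workflow below:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mcp-server
  namespace: production-mcp-server-langgraph
spec:
  host: mcp-server
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```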
Workflow:
Deploy v2 with label version: v2
Apply VirtualService (10% → v2)
Monitor metrics for 30 minutes
If healthy, increase to 50%, then 100%
If unhealthy, revert to 0%
Circuit Breaking
Prevent cascading failures:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: postgres-proxy
  namespace: production-mcp-server-langgraph
spec:
  host: postgres-proxy
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
Behavior: after 5 consecutive 5xx errors, the offending pod is ejected from the load-balancing pool for 30 seconds (at most 50% of pods are ejected at once).
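One way to watch ejections happen is to read the outlier-detection counters from a client pod's sidecar (a sketch; it assumes pilot-agent's request helper, which forwards to the local Envoy admin endpoint inside the istio-proxy container):

```bash
# Dump outlier-detection stats from the sidecar of a pod that calls postgres-proxy.
kubectl exec POD_NAME -n production-mcp-server-langgraph -c istio-proxy -- \
  pilot-agent request GET stats | grep outlier_detection
```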
Retry Policy
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mcp-server
spec:
  hosts:
    - mcp-server
  http:
    - route:
        - destination:
            host: mcp-server
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,reset,connect-failure
```
Security
Strict mTLS
Cluster-wide (applies to all namespaces in the mesh):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

Namespace-specific STRICT policies were applied in the quick setup above; a PERMISSIVE variant for migration is shown under Best Practices below.
Authorization Policies
Deny-all by default:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: production-mcp-server-langgraph
spec: {}  # Empty spec = deny all
```
Allow a specific service:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-mcp-server
  namespace: production-mcp-server-langgraph
spec:
  selector:
    matchLabels:
      app: postgres-proxy
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/production-mcp-server-langgraph/sa/mcp-server"]
      to:
        - operation:
            methods: ["GET", "POST"]
            paths: ["/api/*"]
```
Result: only the mcp-server service account can call postgres-proxy, and only GET/POST requests to /api/* paths.
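A quick way to confirm the policy is to call postgres-proxy from a pod that does not run as the mcp-server service account; the sidecar should reject the request. A sketch, assuming postgres-proxy serves HTTP on port 8080 and exposes an /api/health path (adjust to your service):

```bash
# The probe runs under the namespace's default service account,
# so the request should be rejected with HTTP 403 "RBAC: access denied".
kubectl run authz-probe --rm -it --restart=Never --image=curlimages/curl \
  -n production-mcp-server-langgraph -- \
  curl -si http://postgres-proxy:8080/api/health
# Note: the probe pod gets a sidecar too; delete it manually if --rm leaves it behind.
```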
Observability
Service Topology
View in Google Cloud Console:
Navigation → Anthos → Service Mesh → Topology
Shows:
Service dependency graph
Traffic flow between services
Error rates per edge
Metrics
Request rate (QPS per service), as a PromQL query over the sidecar metrics:

```promql
rate(istio_requests_total{
  destination_service_name="mcp-server",
  destination_workload_namespace="production-mcp-server-langgraph"
}[1m])
```
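The matching P95 latency and error-rate queries would look roughly like this (a sketch using the standard Istio metric names, with the same label values as above):

```promql
# P95 latency (ms) over the last 5 minutes
histogram_quantile(0.95,
  sum by (le) (rate(istio_request_duration_milliseconds_bucket{
    destination_service_name="mcp-server",
    destination_workload_namespace="production-mcp-server-langgraph"
  }[5m])))

# Error rate: share of responses that are 5xx
sum(rate(istio_requests_total{
  destination_service_name="mcp-server",
  destination_workload_namespace="production-mcp-server-langgraph",
  response_code=~"5.."
}[5m]))
/
sum(rate(istio_requests_total{
  destination_service_name="mcp-server",
  destination_workload_namespace="production-mcp-server-langgraph"
}[5m]))
```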
Dashboards
Import pre-built dashboards:
```bash
# Install Kiali (service mesh dashboard)
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/kiali.yaml

# Port-forward
kubectl port-forward svc/kiali -n istio-system 20001:20001

# Open http://localhost:20001
```
Features:
Service graph visualization
Traffic animation
Configuration validation
Distributed tracing
Multi-Cluster Mesh
Register All Clusters
```bash
# Dev cluster
gcloud container fleet memberships register mcp-dev-membership \
  --gke-cluster=us-central1/mcp-dev-gke \
  --project=PROJECT_ID

# Staging cluster
gcloud container fleet memberships register mcp-staging-membership \
  --gke-cluster=us-central1/mcp-staging-gke \
  --project=PROJECT_ID

# Prod cluster is already registered from the setup above
```
Enable Mesh for All
```bash
gcloud container fleet mesh update \
  --management automatic \
  --memberships=mcp-dev-membership,mcp-staging-membership,mcp-prod-membership \
  --project=PROJECT_ID
```
Configure Cross-Cluster Service Discovery
```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-staging-service
  namespace: production-mcp-server-langgraph
spec:
  hosts:
    - mcp-server.mcp-staging.svc.cluster.local
  location: MESH_INTERNAL
  ports:
    - number: 8000
      name: http
      protocol: HTTP
  resolution: DNS
```
Use case: production can call staging services for integration testing.
Troubleshooting
Symptom: pod has 1/1 containers (should be 2/2).
Checks:

```bash
# Verify the namespace is labeled
kubectl get namespace production-mcp-server-langgraph --show-labels
# Should see: istio-injection=enabled

# Check the injection webhook is present
kubectl get mutatingwebhookconfigurations
```

Solution: label the namespace and restart the pods.
Symptom: Service A can't connect to Service B.
Checks:

```bash
# Check PeerAuthentication
kubectl get peerauthentication -n production-mcp-server-langgraph

# Check DestinationRule
kubectl get destinationrule -n production-mcp-server-langgraph

# Verify certificates
istioctl proxy-config secret POD_NAME -n production-mcp-server-langgraph
```

Common fix: ensure both sides have sidecars injected.
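istioctl can also summarize which routing rules, policies, and mTLS settings apply to a pod (assuming istioctl is installed locally and pointed at the cluster):

```bash
# Show effective VirtualServices, DestinationRules, and mTLS mode for a pod
istioctl experimental describe pod POD_NAME -n production-mcp-server-langgraph
```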
Symptom: mesh status shows PROVISIONING for more than 20 minutes.
Solution:

```bash
# Check fleet status
gcloud container fleet mesh describe --project=PROJECT_ID

# View control-plane logs
kubectl logs -n istio-system deployment/istiod

# If stuck, re-enable managed mesh
gcloud container fleet mesh update \
  --management automatic \
  --memberships=MEMBERSHIP_NAME \
  --project=PROJECT_ID
```
Best Practices
Start with PERMISSIVE mTLS, then move to STRICT
```yaml
# Week 1: PERMISSIVE (allows plaintext during migration)
mtls:
  mode: PERMISSIVE

# Week 2: STRICT (after all services have sidecars)
mtls:
  mode: STRICT
```
Use namespace-scoped policies for isolation
```yaml
# Production has strict mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production-mcp-server-langgraph
spec:
  mtls:
    mode: STRICT
---
# Dev can be permissive
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: mcp-dev
spec:
  mtls:
    mode: PERMISSIVE
```
Enable resource limits on sidecars
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-sidecar-injector
  namespace: istio-system
data:
  values: |
    global:
      proxy:
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
```
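Alternatively, sidecar resources can be tuned per workload with pod annotations, which avoids editing injector-wide configuration. A sketch of the pod-template annotations (values shown are just the same numbers as above):

```yaml
# Deployment pod-template snippet: per-workload sidecar resource overrides.
  template:
    metadata:
      annotations:
        sidecar.istio.io/proxyCPU: "100m"
        sidecar.istio.io/proxyMemory: "128Mi"
        sidecar.istio.io/proxyCPULimit: "500m"
        sidecar.istio.io/proxyMemoryLimit: "256Mi"
```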
Monitor mesh health with SLIs
```
# SLI: 99% of requests < 500 ms
# SLI: 99.9% success rate (non-5xx)
# Alert if the error budget is depleted
```
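The success-rate SLI can be computed directly from the sidecar metrics, for example (a sketch using the standard Istio metric names; narrow the labels per service as needed):

```promql
# Success rate over 5 minutes (fraction of non-5xx responses)
sum(rate(istio_requests_total{
  destination_workload_namespace="production-mcp-server-langgraph",
  response_code!~"5.."
}[5m]))
/
sum(rate(istio_requests_total{
  destination_workload_namespace="production-mcp-server-langgraph"
}[5m]))
```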
Next Steps
Install Anthos Service Mesh
```bash
./deployments/service-mesh/anthos/setup-anthos-service-mesh.sh PROJECT_ID
```
Enable Sidecar Injection
```bash
kubectl label namespace production-mcp-server-langgraph istio-injection=enabled
kubectl rollout restart deployment -n production-mcp-server-langgraph
```
Enable Strict mTLS
```bash
kubectl apply -f deployments/service-mesh/anthos/peer-authentication.yaml
```
Configure Traffic Rules
Set up canary deployments, circuit breaking, retries
Monitor Service Topology
Console → Anthos → Service Mesh → Topology