GKE Staging Deployment - MCP Server with LangGraph

Overview

This guide covers deploying the MCP Server to a production-grade staging environment on Google Kubernetes Engine (GKE) with:

🔒 Security Hardening: Private nodes, Binary Authorization, Network Policies
🔑 Keyless Authentication: Workload Identity Federation for GitHub Actions
🌐 Network Isolation: Separate VPC from production
📊 Full Observability: Cloud Logging, Monitoring, and Trace
🤖 Automated Deployments: GitHub Actions with approval gates

This is a production-ready staging environment suitable for pre-production testing and validation.

Architecture

Prerequisites

GCP Project: vishnu-sandbox-20250310 (or your project ID)
gcloud CLI: Installed and authenticated
kubectl: Installed
Helm: Installed (used in Step 3)
GitHub Repository: Access to repository settings
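Before you start, a quick preflight check can confirm that the CLIs are on your PATH and that the project ID is set. This is a minimal sketch. `require_tools` and `preflight` are hypothetical helpers, and the tool list is an assumption based on the commands in this guide. It includes `helm`, which Step 3 uses.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers (not part of the repository's scripts).

# Succeeds only if every named command is found; reports each missing one.
require_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

preflight() {
  # gcloud, kubectl and helm are the CLIs used in Steps 1-5 of this guide.
  require_tools gcloud kubectl helm || return 1
  # The automated setup in Step 1 reads the project ID from this variable.
  if [[ -z "${GCP_PROJECT_ID:-}" ]]; then
    echo "GCP_PROJECT_ID is not set" >&2
    return 1
  fi
  echo "preflight OK for project ${GCP_PROJECT_ID}"
}
```

Run `preflight` after exporting `GCP_PROJECT_ID` in Step 1.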

Step 1: Infrastructure Setup

Automated Setup (Recommended)

Run the automated infrastructure setup script:

## Set your GCP project ID
export GCP_PROJECT_ID=vishnu-sandbox-20250310

## Run the setup script
./scripts/gcp/setup-staging-infrastructure.sh

This script will create:

✅ Staging VPC network (10.1.0.0/20)
✅ GKE Autopilot cluster with security hardening
✅ Cloud SQL PostgreSQL instance
✅ Memorystore Redis instance
✅ Workload Identity for pods
✅ GitHub Actions Workload Identity Federation
✅ Artifact Registry repository
✅ Secret Manager secrets

Manual Infrastructure Setup (Advanced)

If you prefer manual setup, follow these steps:

1.1 Create VPC Network

## Create VPC
gcloud compute networks create staging-vpc \
  --subnet-mode=custom \
  --project=vishnu-sandbox-20250310

## Create subnet
gcloud compute networks subnets create staging-gke-subnet \
  --network=staging-vpc \
  --range=10.1.0.0/20 \
  --region=us-central1 \
  --secondary-range pods=10.2.0.0/16,services=10.3.0.0/16 \
  --enable-flow-logs \
  --enable-private-ip-google-access
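Each range's capacity follows from its prefix length as 2^(32 - prefix). The hypothetical `cidr_size` helper below computes that value for the three ranges used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: number of IPv4 addresses in a CIDR block, 2^(32 - prefix).
cidr_size() {
  local prefix="${1#*/}"   # "10.1.0.0/20" -> "20"
  echo $(( 1 << (32 - prefix) ))
}

# Ranges from the subnet command above: nodes (primary), pods, services.
for range in 10.1.0.0/20 10.2.0.0/16 10.3.0.0/16; do
  printf '%-12s %6d addresses\n' "$range" "$(cidr_size "$range")"
done
```

The node subnet provides 4,096 addresses, and Google Cloud reserves four of them in every primary subnet range. Each secondary range provides 65,536 addresses.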

1.2 Create GKE Cluster

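Autopilot clusters always have Workload Identity (pool vishnu-sandbox-20250310.svc.id.goog) and Shielded GKE Nodes with secure boot enabled, so those settings need no flags here. The secondary-range flags make the cluster use the pods and services ranges defined on the subnet:

gcloud container clusters create-auto staging-mcp-server-langgraph-gke \
  --project=vishnu-sandbox-20250310 \
  --region=us-central1 \
  --network=staging-vpc \
  --subnetwork=staging-gke-subnet \
  --cluster-secondary-range-name=pods \
  --services-secondary-range-name=services \
  --enable-private-nodes \
  --binauthz-evaluation-mode=PROJECT_SINGLETON_POLICY_ENFORCE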

1.3 Create Managed Services

See scripts/gcp/setup-staging-infrastructure.sh for the complete Cloud SQL and Memorystore Redis setup.

Setup Output

After running the setup script, you’ll receive:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Staging Infrastructure Setup Complete!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

GitHub Actions Configuration:
  workload_identity_provider: 'projects/PROJECT_NUMBER/locations/global/...'
  service_account: 'mcp-staging-sa@vishnu-sandbox-20250310.iam.gserviceaccount.com'

Security Features Enabled:
  ✓ Private GKE nodes
  ✓ Shielded nodes with secure boot
  ✓ Binary authorization
  ✓ Workload Identity
  ✓ Network isolation (separate VPC)

⚠️ Important: Save the Workload Identity Provider value. You’ll need it for the GitHub Actions workflow in Step 4.2.

Step 2: Update API Keys

Update the placeholder secrets with real API keys:

## Update Anthropic API key
echo -n "sk-ant-YOUR_REAL_KEY" | gcloud secrets versions add staging-anthropic-api-key --data-file=-

## Update Google API key
echo -n "YOUR_REAL_GOOGLE_KEY" | gcloud secrets versions add staging-google-api-key --data-file=-
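Typing a key directly into `echo` leaves it in your shell history. The helper below is a minimal sketch of an alternative: it prompts for the value without echoing it and then pipes it to the same `gcloud secrets versions add ... --data-file=-` command. `add_secret_version` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prompt for a secret value without echoing it or
# recording it in shell history, then add it as a new Secret Manager version.
add_secret_version() {
  local secret_name="$1" value=""
  # IFS= keeps surrounding whitespace, -r keeps backslashes, -s hides typing.
  # "|| true" tolerates input that ends without a trailing newline.
  IFS= read -rs -p "Value for ${secret_name}: " value || true
  echo >&2
  if [[ -z "$value" ]]; then
    echo "error: empty value, nothing added to ${secret_name}" >&2
    return 1
  fi
  # printf '%s' sends no trailing newline, like echo -n in the commands above.
  printf '%s' "$value" | gcloud secrets versions add "$secret_name" --data-file=-
}
```

For example, `add_secret_version staging-anthropic-api-key` prompts for the value, then adds it as a new version.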

Step 3: Install External Secrets Operator

External Secrets Operator syncs secrets from GCP Secret Manager to Kubernetes:

## Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm repo update

helm install external-secrets \
  external-secrets/external-secrets \
  --namespace external-secrets-system \
  --create-namespace \
  --set installCRDs=true

Verify installation:

kubectl get pods -n external-secrets-system

Step 4: Configure GitHub Environment

4.1 Create GitHub Environment

Go to your repository on GitHub
Navigate to Settings → Environments
Click New environment
Name it staging
Configure protection rules:

Protection Rules:

✅ Required reviewers: Add 1-2 reviewers
✅ Wait timer: 5 minutes (allows review before deployment)
✅ Deployment branches: main, release/*

4.2 Update GitHub Workflow

Edit .github/workflows/deploy-staging-gke.yaml and replace PROJECT_NUMBER with your actual project number:

## Get your project number:
gcloud projects describe vishnu-sandbox-20250310 --format="value(projectNumber)"

## Update these lines in the workflow:
workload_identity_provider: 'projects/YOUR_PROJECT_NUMBER/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider'
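You can script the substitution instead of editing by hand. The sketch below assumes the workflow contains the literal placeholder `PROJECT_NUMBER` or `YOUR_PROJECT_NUMBER` inside the provider path, and that the pool and provider IDs match the line above. `wif_provider` and `patch_workflow` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for filling in the Workload Identity provider path.

# Print the full provider resource name for a numeric project number.
wif_provider() {
  printf 'projects/%s/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider\n' "$1"
}

# Replace the placeholder in a workflow file in place, keeping a .bak copy.
patch_workflow() {
  local file="$1" project_number="$2"
  # Project numbers are digits only; reject anything else (e.g. a project ID).
  if [[ ! "$project_number" =~ ^[0-9]+$ ]]; then
    echo "error: project number must be numeric, got '$project_number'" >&2
    return 1
  fi
  # -i.bak works with both GNU and BSD sed.
  sed -i.bak -E "s#projects/(YOUR_)?PROJECT_NUMBER/#projects/${project_number}/#g" "$file"
}
```

Usage, assuming the project from this guide: patch_workflow .github/workflows/deploy-staging-gke.yaml "$(gcloud projects describe vishnu-sandbox-20250310 --format='value(projectNumber)')"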

Step 5: Deploy Application

First-Time Deployment (Manual)

For the first deployment, deploy manually to verify everything works:

## Get GKE credentials
gcloud container clusters get-credentials staging-mcp-server-langgraph-gke \
  --region=us-central1

## Verify connection
kubectl cluster-info

## Deploy application
kubectl apply -k deployments/overlays/staging-gke

## Watch deployment
kubectl rollout status deployment/staging-mcp-server-langgraph -n staging-mcp-server-langgraph

## Verify pods
kubectl get pods -n staging-mcp-server-langgraph
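Before running `kubectl apply -k` on later changes, it can help to preview what would change. `kubectl diff` exits with 0 when there are no differences, 1 when differences are found, and a value greater than 1 when kubectl or the diff program fails. `preview_overlay` is a hypothetical wrapper that interprets those codes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: preview what `kubectl apply -k` would change.
# kubectl diff exit codes: 0 = no differences, 1 = differences found,
# greater than 1 = kubectl or the diff program failed.
preview_overlay() {
  local overlay="${1:-deployments/overlays/staging-gke}" rc=0
  kubectl diff -k "$overlay" || rc=$?
  case "$rc" in
    0) echo "no changes" ;;
    1) echo "changes detected; review the diff above before applying" ;;
    *) echo "kubectl diff failed (exit code $rc)" >&2
       return "$rc" ;;
  esac
}
```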

Automated Deployments (GitHub Actions)

Once the manual deployment succeeds, GitHub Actions handles subsequent deployments. They are triggered by:

✅ Push to main branch
✅ Pre-release creation
✅ Manual workflow dispatch

Deployment Flow:

Build Docker image
Push to Artifact Registry
Deploy to GKE staging
Run smoke tests
Validate deployment
Auto-rollback on failure
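The workflow file itself is not reproduced here. The following hypothetical sketch shows the "deploy, verify, roll back on failure" steps using standard kubectl commands. The names come from Step 5. The 300s timeout is an assumption, as is the assumption that the smoke-test script exits non-zero on failure. The `SMOKE_TESTS` override exists only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical deploy-and-verify step with automatic rollback.
# Not the repository's actual workflow; names match the commands in Step 5.
NAMESPACE="staging-mcp-server-langgraph"
DEPLOYMENT="deployment/staging-mcp-server-langgraph"
OVERLAY="deployments/overlays/staging-gke"

deploy_with_rollback() {
  # Assumed: the smoke-test script exits non-zero when any test fails.
  local smoke_tests="${SMOKE_TESTS:-./scripts/gcp/staging-smoke-tests.sh}"

  kubectl apply -k "$OVERLAY" || return 1

  # rollout status waits for the rollout to finish and exits non-zero
  # if it does not complete within the timeout.
  if kubectl rollout status "$DEPLOYMENT" -n "$NAMESPACE" --timeout=300s \
     && "$smoke_tests"; then
    echo "deployment verified"
    return 0
  fi

  echo "verification failed; rolling back to the previous revision" >&2
  kubectl rollout undo "$DEPLOYMENT" -n "$NAMESPACE"
  kubectl rollout status "$DEPLOYMENT" -n "$NAMESPACE" --timeout=300s
  return 1
}
```

`kubectl rollout undo` needs an earlier revision to exist, so this pattern only helps after the first successful deployment.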

Step 6: Verify Deployment

Run Smoke Tests

./scripts/gcp/staging-smoke-tests.sh

Expected output:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Smoke Test Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Total Tests:  11
Passed:       11
Failed:       0

✓ All smoke tests passed!

Check Health Endpoints

## Port-forward to service
kubectl port-forward -n staging-mcp-server-langgraph svc/staging-mcp-server-langgraph 8080:80

## Test endpoints
curl http://localhost:8080/health
curl http://localhost:8080/health/ready
curl http://localhost:8080/health/live
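To check all three endpoints in one step, the hypothetical `check_health` helper below prints each path's HTTP status and fails if any endpoint does not return 200. It assumes the port-forward above is running and that healthy endpoints answer with HTTP 200.

```bash
#!/usr/bin/env bash
# Hypothetical helper: probe each health endpoint and report its HTTP status.
check_health() {
  local base_url="${1:-http://localhost:8080}" path code failed=0
  for path in /health /health/ready /health/live; do
    # -s: quiet, -o /dev/null: discard the body, -w: print only the status code.
    # curl prints 000 when it cannot connect at all.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${base_url}${path}" || true)
    if [[ "$code" == "200" ]]; then
      echo "OK    ${path} (${code})"
    else
      echo "FAIL  ${path} (${code:-000})"
      failed=1
    fi
  done
  return "$failed"
}
```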

View Logs

Cloud Logging (recommended):

gcloud logging read \
  "resource.type=k8s_container
   resource.labels.cluster_name=staging-mcp-server-langgraph-gke
   resource.labels.namespace_name=staging-mcp-server-langgraph" \
  --limit=50 \
  --format=json

kubectl logs:

kubectl logs -n staging-mcp-server-langgraph -l app=mcp-server-langgraph --tail=100
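When troubleshooting, you can narrow Cloud Logging to recent errors by adding a severity clause and using `--freshness`. The hypothetical `recent_errors` helper below does this. It uses the namespace that the kubectl commands in this guide use, and the one-hour default is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read only ERROR-or-worse container logs from the staging cluster.
recent_errors() {
  local freshness="${1:-1h}"   # how far back to look, e.g. 30m, 2h, 1d
  gcloud logging read \
    'resource.type="k8s_container"
     resource.labels.cluster_name="staging-mcp-server-langgraph-gke"
     resource.labels.namespace_name="staging-mcp-server-langgraph"
     severity>=ERROR' \
    --freshness="$freshness" \
    --limit=50 \
    --format=json
}
```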

Security Features

Network Isolation

Separate VPC: Staging has its own VPC (10.1.0.0/20)
Network Policies: Restrict pod-to-pod and egress traffic
Private GKE nodes: Nodes have no public IP addresses

Verify network policies:

kubectl get networkpolicies -n staging-mcp-server-langgraph

Workload Identity

Pods authenticate as GCP service accounts without keys:

## Verify Workload Identity annotation
kubectl describe sa mcp-server-langgraph -n staging-mcp-server-langgraph | grep iam.gke.io

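## Test from pod (recent kubectl releases dropped `kubectl run --serviceaccount`, so the service account is set through --overrides)
kubectl run -it --rm test \
  --image=google/cloud-sdk:slim \
  --namespace=staging-mcp-server-langgraph \
  --overrides='{"apiVersion": "v1", "spec": {"serviceAccountName": "mcp-server-langgraph"}}' \
  -- gcloud auth list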

Binary Authorization

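The cluster evaluates the project's singleton Binary Authorization policy, so only images that satisfy that policy (for example, images signed by a required attestor) can be deployed. The default project policy allows all images, so enforcement is only as strict as the policy you configure: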

## Check Binary Authorization status
gcloud container clusters describe staging-mcp-server-langgraph-gke \
  --region=us-central1 \
  --format='value(binaryAuthorization)'

Secret Management

Secrets are stored in GCP Secret Manager and synced via External Secrets:

## List secrets in Secret Manager
gcloud secrets list --filter="name:staging-*"

## View External Secrets sync status
kubectl get externalsecrets -n staging-mcp-server-langgraph
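The `kubectl get` output above is convenient to read but awkward to script. External Secrets Operator records a `Ready` condition in each ExternalSecret's `.status.conditions`. The hypothetical helper below lists every ExternalSecret whose `Ready` status is not `True`. It returns 0 when all are ready, 1 when some are not, and 2 when kubectl fails.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print ExternalSecrets whose Ready condition is not "True".
unready_externalsecrets() {
  local ns="${1:-staging-mcp-server-langgraph}" out
  out=$(kubectl get externalsecrets -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}') || return 2
  # No ExternalSecrets in the namespace: nothing to report.
  [[ -z "$out" ]] && return 0
  # Print each name whose Ready status is anything but "True"; exit 1 if any.
  printf '%s\n' "$out" | awk -F'\t' '$2 != "True" { print $1; bad = 1 } END { exit bad }'
}
```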

Monitoring & Observability

Cloud Console Dashboards

Access monitoring in Cloud Console:

Key Metrics

Monitor these metrics in Cloud Monitoring:

kubernetes.io/container/cpu/core_usage_time - CPU usage
kubernetes.io/container/memory/used_bytes - Memory usage
kubernetes.io/pod/network/received_bytes_count - Network traffic
Custom metrics: custom.googleapis.com/mcp-staging/*

Set Up Alerts

Create alert policies for:

## CPU usage > 80%
## Memory usage > 80%
## Pod restart count > 5
## Deployment replica mismatch
## Cloud SQL connection failures
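The first two alerts can be expressed as Cloud Monitoring alert policies on the `kubernetes.io/container/cpu/limit_utilization` and `kubernetes.io/container/memory/limit_utilization` metrics (a ratio, so 80% is 0.8). The hypothetical `alert_policy_json` helper below prints such a policy as JSON. The 60s alignment period and 300s duration are assumptions. It suits gauge metrics only: restart counts are cumulative and would need a delta aligner, and the Cloud SQL alert uses a different resource type.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a Cloud Monitoring alert policy (JSON) that fires
# when a container gauge metric in the staging cluster stays above a threshold.
alert_policy_json() {
  local name="$1" metric="$2" threshold="$3"
  local cluster="staging-mcp-server-langgraph-gke"
  # Unquoted heredoc: variables expand, and \" is kept literally as a JSON escape.
  cat <<EOF
{
  "displayName": "${name}",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "${metric} > ${threshold}",
      "conditionThreshold": {
        "filter": "resource.type = \"k8s_container\" AND resource.labels.cluster_name = \"${cluster}\" AND metric.type = \"${metric}\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": ${threshold},
        "duration": "300s",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN" }
        ]
      }
    }
  ]
}
EOF
}
```

For example, write the CPU policy with alert_policy_json "staging: CPU > 80% of limit" kubernetes.io/container/cpu/limit_utilization 0.8 > cpu-alert.json, then create it with gcloud alpha monitoring policies create --policy-from-file=cpu-alert.json. Check which command group your gcloud version provides.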

Troubleshooting

Pods not starting

Check pod status:

kubectl describe pod <pod-name> -n staging-mcp-server-langgraph

Common issues:

Image pull errors: Check Artifact Registry permissions
Cloud SQL proxy fails: Verify the service account has the roles/cloudsql.client role
Secrets not found: Check External Secrets sync status

External Secrets not syncing

Check ExternalSecret status:

kubectl describe externalsecret mcp-staging-secrets -n staging-mcp-server-langgraph

Common fixes:

Verify Workload Identity binding
Check secret exists in Secret Manager
Ensure the service account has the roles/secretmanager.secretAccessor role
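To verify the Workload Identity binding, check that the Google service account grants `roles/iam.workloadIdentityUser` to the member `serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]`. The `iam.gke.io/gcp-service-account` annotation shown in the Workload Identity section tells you which Google service account to check. `wi_member` and `check_wi_binding` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking a Workload Identity binding.

# Build the IAM member string for a Kubernetes service account.
wi_member() {
  local project="$1" namespace="$2" ksa="$3"
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$project" "$namespace" "$ksa"
}

# Succeed if the Google service account grants workloadIdentityUser to the member.
check_wi_binding() {
  local gsa="$1" member="$2"
  # Flatten the policy to one "role<TAB>member" line per member, then match exactly.
  gcloud iam service-accounts get-iam-policy "$gsa" \
    --flatten="bindings[].members" \
    --format="value(bindings.role,bindings.members)" \
  | grep -qxF "roles/iam.workloadIdentityUser"$'\t'"${member}"
}
```

Usage, assuming the service account from the setup output: check_wi_binding mcp-staging-sa@vishnu-sandbox-20250310.iam.gserviceaccount.com "$(wi_member vishnu-sandbox-20250310 staging-mcp-server-langgraph mcp-server-langgraph)" && echo "binding present"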

Cloud SQL connection fails

Check Cloud SQL proxy logs:

kubectl logs -n staging-mcp-server-langgraph <pod-name> -c cloud-sql-proxy

Verify connection:

# Runs nc inside the pod's main container (assumes the image includes nc)
kubectl exec -it <pod-name> -n staging-mcp-server-langgraph -- \
  nc -zv 127.0.0.1 5432

Network policies blocking traffic

List network policies:

kubectl get networkpolicies -n staging-mcp-server-langgraph

Temporarily disable for testing:

kubectl delete networkpolicy <policy-name> -n staging-mcp-server-langgraph
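# Re-apply the policies when done testing (e.g. kubectl apply -k deployments/overlays/staging-gke, if the policies are part of the overlay)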

Rollback Procedures

Automatic Rollback

GitHub Actions automatically rolls back on deployment failure.

Manual Rollback

## Rollback to previous version
kubectl rollout undo deployment/staging-mcp-server-langgraph -n staging-mcp-server-langgraph

## Rollback to specific revision
kubectl rollout history deployment/staging-mcp-server-langgraph -n staging-mcp-server-langgraph
kubectl rollout undo deployment/staging-mcp-server-langgraph -n staging-mcp-server-langgraph --to-revision=3

## Verify rollback
kubectl rollout status deployment/staging-mcp-server-langgraph -n staging-mcp-server-langgraph

Cost Optimization

Current Costs (Estimated)

Resource	Tier	Monthly Cost
GKE Autopilot	2-3 pods avg	~$100
Cloud SQL	db-custom-1-4096	~$40
Memorystore Redis	2GB standard	~$50
Networking	VPC, egress	~$20
Total		~$210/month

Optimization Tips

Use Autopilot

GKE Autopilot manages the nodes for you. You pay for the CPU, memory, and ephemeral storage your pods request, not for node overhead.

Rightsize Resources

Review resource requests/limits:

kubectl top pods -n staging-mcp-server-langgraph

Adjust in deployment-patch.yaml if needed.

Use Preemptible Nodes

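Autopilot has no node pools to make preemptible. The closest equivalent is Spot Pods, which you request with the cloud.google.com/gke-spot node selector. They are not recommended for staging but can work for dev environments.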

Monitor Egress

Excessive egress to LLM APIs can increase costs. Consider:

Caching responses
Request batching
Rate limiting

Next Steps

Production Deployment

Deploy to production GKE

Monitoring Setup

Advanced monitoring and alerting

Disaster Recovery

Backup and recovery procedures

CI/CD Guide

Complete CI/CD pipeline documentation

Staging Deployment Complete! Your production-grade staging environment is ready for testing.
