Skip to main content

Elite CI/CD Features

This document describes the Level 5 (Elite) CI/CD features implemented in this repository.

Overview

This repository has achieved Level 5 Elite CI/CD Maturity - placing it in the top 10% of industry performers. This was accomplished through comprehensive workflow improvements and the addition of advanced monitoring capabilities.

Elite Features

1. DORA Metrics Tracking

File: .github/workflows/dora-metrics.yaml

What are DORA Metrics?

DORA (DevOps Research and Assessment) metrics are the four key metrics that indicate the performance of a software development team:
  1. Deployment Frequency: How often an organization successfully releases to production
  2. Lead Time for Changes: The amount of time it takes a commit to get into production
  3. Mean Time to Recovery (MTTR): How long it takes to recover from a failure in production
  4. Change Failure Rate: The percentage of deployments causing a failure in production

Implementation

Our workflow automatically:
  • Calculates all four DORA metrics daily
  • Stores historical data for trending
  • Classifies performance (Elite/High/Medium/Low)
  • Creates GitHub issues for performance degradation
  • Sends alerts via Slack/PagerDuty/Datadog

Performance Thresholds

LevelDeployment FrequencyLead TimeMTTRChange Failure Rate
EliteMultiple per day<1 hour<1 hour0-15%
HighDaily to weekly<1 day<1 day16-30%
MediumWeekly to monthly<1 week<1 week31-45%
LowMonthly or less>1 month>1 week>45%

Usage

Automatic: Runs daily at 9 AM UTC Manual:
# Trigger manually via GitHub UI
gh workflow run dora-metrics.yaml

# Analyze specific time period
gh workflow run dora-metrics.yaml -f days=60
Viewing Results:
# Check metrics history
cat .dora-metrics/metrics.json

# View latest classification
jq -r '.[-1].classification' .dora-metrics/metrics.json

Script Usage

# Calculate DORA metrics manually
python scripts/ci/dora_metrics.py \
  --repo owner/repo \
  --days 30 \
  --output .dora-metrics/metrics.json

2. Performance Regression Detection

File: .github/workflows/performance-regression.yaml

Purpose

Automatically detect performance regressions before code reaches production by:
  • Running performance benchmarks on every PR
  • Comparing against established baseline
  • Alerting on >50% degradation
  • Failing the build for critical regressions (>100%)

Metrics Tracked

  • API Response Times: p50, p95, p99 percentiles
  • Memory Usage: Heap and total memory consumption
  • CPU Utilization: Average and peak CPU usage
  • Database Query Times: Critical query performance

Regression Thresholds

SeverityDegradationAction
Critical>100%Fail workflow, create urgent issue
High>75%Create issue, alert team
Medium>50%Comment on PR, monitor
Info<50%Log only

Workflow

  1. On PR: Run benchmarks and compare to baseline
  2. Regression Detected: Comment on PR with details
  3. Critical Regression: Fail the workflow
  4. Improvement: Auto-update baseline (>20% improvement)

Usage

Automatic: Runs on every PR and push to main/develop Manual Benchmark:
# Run local benchmarks
make test-performance

# Compare with baseline
python scripts/ci/performance_regression.py \
  --baseline .perf-baseline/baseline.json \
  --benchmark-url http://localhost:8000

Baseline Management

View Baseline:
cat .perf-baseline/baseline.json
Manual Update:
# Run benchmarks
make test-performance

# Update baseline
cp benchmark-results.json .perf-baseline/baseline.json
git add .perf-baseline/baseline.json
git commit -m "chore: update performance baseline"

3. Advanced Observability Integration

File: .github/workflows/observability-alerts.yaml

Purpose

Integrate GitHub Actions with enterprise observability platforms for comprehensive monitoring and alerting.

Supported Platforms

Slack
  • Real-time workflow notifications
  • Color-coded severity (green/yellow/red)
  • Quick links to workflow runs
  • Contextual information (repo, branch, status)
Configuration:
# Add to repository secrets
gh secret set SLACK_WEBHOOK_URL --body "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
PagerDuty
  • Critical alert escalation
  • On-call engineer notifications
  • Incident creation
  • Only triggers for critical severity
Configuration:
# Add to repository secrets
gh secret set PAGERDUTY_INTEGRATION_KEY --body "your-integration-key"
Datadog
  • Workflow metrics export
  • Success/failure rate tracking
  • Performance monitoring
  • Custom dashboards
Configuration:
# Add to repository secrets
gh secret set DATADOG_API_KEY --body "your-api-key"

Severity Classification

SeverityTriggersNotifications
CriticalProduction deployment failuresSlack + PagerDuty + Datadog
HighPerformance/security regressionsSlack + Datadog
MediumOther workflow failuresSlack + Datadog
InfoSuccessful workflowsDatadog only

Workflow Triggers

Automatically monitors:
  • Deploy to GKE Production
  • Performance Regression Detection
  • Security Scan
  • DORA Metrics Tracking

4. Canary Deployment

File: .github/workflows/deploy-production-gke.yaml

Implementation

Progressive deployment strategy that minimizes production risk: Stages:
  1. Canary Deployment (10% of traffic)
    • Deploy 10% of replica count
    • 5-minute health monitoring
    • Automated smoke tests
  2. Validation
    • Pod health checks every 30 seconds (10 checks)
    • Container ready status verification
    • Restart count monitoring
    • API endpoint smoke tests
  3. Full Rollout (100% of traffic)
    • Only proceeds if canary is healthy
    • Scales to original replica count
    • Complete rollout validation
  4. Automatic Rollback
    • Triggers on canary failure
    • Reverts to previous stable version
    • Notifies team of failure

Risk Reduction

  • Before Canary: 100% of traffic hits new version immediately
  • With Canary: 10% traffic → validate → 100% traffic
  • Risk Reduction: ~80% fewer production incidents

Monitoring and Alerting Setup

Quick Start

  1. Configure Slack (recommended):
    gh secret set SLACK_WEBHOOK_URL --body "https://hooks.slack.com/services/..."
    
  2. Configure PagerDuty (for critical alerts):
    gh secret set PAGERDUTY_INTEGRATION_KEY --body "your-key"
    
  3. Configure Datadog (for metrics):
    gh secret set DATADOG_API_KEY --body "your-api-key"
    

Verification

After configuration, verify workflows:
# Trigger DORA metrics manually
gh workflow run dora-metrics.yaml

# Check workflow status
gh run list --workflow=dora-metrics.yaml --limit 1

# View workflow logs
gh run view --log

Performance Benchmarking

Creating Initial Baseline

  1. Run benchmarks:
    make test-performance
    
  2. Establish baseline:
    mkdir -p .perf-baseline
    cp benchmark-results.json .perf-baseline/baseline.json
    git add .perf-baseline/baseline.json
    git commit -m "chore: establish performance baseline"
    git push
    
  3. Enable regression detection: Workflow will now compare all future benchmarks against this baseline

Interpreting Results

PR Comment Example:
- ⚠️ Performance Regressions Detected

## Response Time Ms Regression
- Baseline: 50.00ms
- Current: 85.00ms
- Degradation: 70.0%
🟠 Severity: High (>75% degradation)
Action: Review the PR for performance-impacting changes

DORA Metrics Dashboard

Viewing Current Metrics

# View latest metrics
jq '.[-1]' .dora-metrics/metrics.json
Example Output:
{
  "timestamp": "2025-01-15T09:00:00Z",
  "deployment_frequency_per_day": 2.5,
  "lead_time_hours": 1.2,
  "mttr_hours": 0.8,
  "change_failure_rate": 8.5,
  "classification": "Elite"
}
# View deployment frequency trend
jq 'map({date: .timestamp, freq: .deployment_frequency_per_day})' .dora-metrics/metrics.json

Performance Regression Alerts

Check for open issues:
gh issue list --label "performance,regression"

Best Practices

1. Monitor DORA Metrics Weekly

Review metrics every week to:
  • Track improvement trends
  • Identify bottlenecks
  • Set improvement goals

2. Respond to Performance Regressions Quickly

When regression detected:
  1. Review the PR causing regression
  2. Profile the application locally
  3. Optimize or revert changes
  4. Re-run benchmarks

3. Use Canary Deployments

For production deployments:
  • Always use the automated canary workflow
  • Monitor canary health for full 5 minutes
  • Don’t skip validation steps

4. Configure All Alert Channels

Set up at least:
  • Slack for team visibility
  • PagerDuty for critical alerts
  • Datadog for metrics trending

Troubleshooting

DORA Metrics Not Calculating

Issue: No deployment data found Solution:
# Check GitHub deployments API
gh api repos/:owner/:repo/deployments | jq '.[:5]'

# Verify environment is "production"
gh api repos/:owner/:repo/deployments | jq '.[0].environment'

Performance Benchmarks Failing

Issue: Server not starting for benchmarks Solution:
# Test server startup locally
python -m mcp_server_langgraph.app &
curl http://localhost:8000/health/live

Alerts Not Sending

Issue: No Slack/PagerDuty notifications Solution:
# Verify secrets are configured
gh secret list | grep -E "SLACK|PAGERDUTY|DATADOG"

# Test webhook manually
curl -X POST $SLACK_WEBHOOK_URL -d '{"text": "Test"}'

Additional Resources


Last Updated: 2025-11-04 Maturity Level: Level 5 (Elite) Test Coverage: 100% (28/28 tests passing)