CI/CD Troubleshooting Guide

Comprehensive troubleshooting guide for GitHub Actions workflows and CI/CD pipeline issues.

Common Issues
Workflow-Specific Issues
Performance Issues
Cost Optimization
Security Issues
Debugging Techniques

Common Issues

1. Tests Failing in CI But Passing Locally

Symptoms:

Tests pass on local machine
Same tests fail in GitHub Actions
Error messages mention environment differences

Root Causes:

Environment variable differences
Timing/race conditions
Different Python versions
Missing dependencies

Solutions:

# 1. Check Python version match
python --version  # Local
# vs CI workflow (check ci.yaml matrix.python-version)

# 2. Run tests with same Python version
pyenv install 3.12
pyenv local 3.12
pytest tests/

# 3. Disable OpenTelemetry (matches CI)
OTEL_SDK_DISABLED=true pytest tests/

# 4. Use parallel execution (matches CI; requires pytest-xdist)
pytest -n auto tests/

# 5. Check for timing issues (requires pytest-timeout)
pytest tests/ --timeout=30  # Fail any test that runs longer than 30 seconds
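A version mismatch can also be caught before the suite runs. The following is a minimal sketch, not part of the project: `CI_VERSIONS` is an assumed placeholder that should mirror `matrix.python-version` in ci.yaml.

```python
import sys

# Assumption: replace with the versions listed under matrix.python-version in ci.yaml.
CI_VERSIONS = {(3, 12)}


def matches_ci(version_info=sys.version_info, ci_versions=CI_VERSIONS) -> bool:
    """Return True when the interpreter's major.minor version is tested in CI."""
    return tuple(version_info[:2]) in ci_versions


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:2])
    if matches_ci():
        print(f"Python {current} matches a CI version")
    else:
        print(f"Python {current} is not in the CI matrix; results may differ")
```

Only major and minor versions are compared, since patch releases rarely explain a CI-only failure.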

Prevention:

Use .env.example to document required variables
Run pre-commit hooks before pushing
Test with multiple Python versions locally: make test-all-pythons
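To check the first point mechanically, the variables documented in .env.example can be compared with the current environment. This is a rough sketch under the assumption that the file uses simple `KEY=value` lines; the helper names are illustrative and do not exist in the repository.

```python
import os


def required_keys(example_text: str) -> list[str]:
    """Extract variable names from .env-style text, skipping blanks and comments."""
    keys = []
    for raw in example_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        keys.append(key)
    return keys


def missing_env_vars(example_text: str, environ=None) -> list[str]:
    """Return names documented in the example file but absent from the environment."""
    env = os.environ if environ is None else environ
    return [key for key in required_keys(example_text) if key not in env]
```

Reading .env.example and passing its text to `missing_env_vars` in both the local shell and a CI step should show which variables differ between the two environments.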

2. Docker Build Failures

Symptoms:

docker build fails in CI
“No space left on device” errors
Network timeout errors

Root Causes:

Disk space exhaustion
Network timeouts
Cache issues
Base image unavailability

Solutions:

# Already implemented in ci.yaml:
- name: Free disk space
  run: |
    docker system prune -af --volumes
    sudo rm -rf /usr/share/dotnet
    sudo rm -rf /usr/local/lib/android

Additional Debug Steps:

# Check available disk space
df -h

# Inspect build context size (docker build has no --dry-run flag)
du -sh .  # Upper bound; paths listed in .dockerignore are excluded from the context

# Test build locally with same base image
docker pull python:3.12-slim
docker build --no-cache -t test -f docker/Dockerfile .

Known Issue: GitHub-hosted runners have only ~14GB of free disk space. The "Free disk space" step in ci.yaml reclaims additional space before the build.
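A build script can also fail early when the runner is already short on space. The sketch below uses only the standard library; the threshold passed to `has_enough_space` is an assumption to tune for your image size.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Return free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(required_gib: float, path: str = ".") -> bool:
    """Return True when at least `required_gib` GiB are free."""
    return free_gib(path) >= required_gib
```

For example, a pre-build step could exit with a non-zero status when `has_enough_space(5.0)` is false, which surfaces the problem before "No space left on device" appears midway through a layer.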

3. Pre-commit Hooks Failing

Symptoms:

Pre-commit job fails in CI
Hooks pass locally but fail in CI
Formatting differences

Root Causes:

Different tool versions
Line ending differences (CRLF vs LF)
File not committed

Solutions:

# 1. Run pre-commit locally
pre-commit run --all-files

# 2. Check tool versions match .pre-commit-config.yaml
pre-commit autoupdate  # Rewrites each rev in the config to the latest tag; commit the result

# 3. Fix line endings
git config --global core.autocrlf input
find . -type f -name "*.py" -exec dos2unix {} \;

# 4. Install same versions as CI
pip install black==25.9.0 isort==5.13.2
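Before converting files, it can help to see which ones actually contain CRLF line endings. The following is a small sketch that scans Python files by default; the function names and the `*.py` pattern are illustrative choices.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """Return True when the byte string contains a Windows line ending."""
    return b"\r\n" in data


def find_crlf_files(root: str = ".", pattern: str = "*.py") -> list[Path]:
    """List files under `root` matching `pattern` that contain CRLF line endings."""
    return sorted(
        path
        for path in Path(root).rglob(pattern)
        if path.is_file() and has_crlf(path.read_bytes())
    )
```

An empty result suggests line endings are not the cause and the tool versions are the more likely culprit.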

Workaround: If hooks consistently fail, you can temporarily let the job pass despite hook failures:

# In .github/workflows/ci.yaml
- name: Run pre-commit hooks
  run: pre-commit run --all-files
  continue-on-error: true  # Temporary only!

4. Deployment Authentication Failures

Symptoms:

“Permission denied” during GCP auth
“Workload Identity Provider not found”
“Invalid service account”

Root Causes:

Missing or incorrect GitHub secrets
Workload Identity misconfiguration
Service account lacks permissions

Solutions:

# 1. Verify secrets are set
gh secret list

# 2. Check Workload Identity configuration
gcloud iam workload-identity-pools describe github-actions-pool \
  --location=global \
  --project=PROJECT_ID

# 3. Verify service account permissions
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:github-actions-deployer@*"

# 4. Test authentication locally
gcloud auth application-default login
gcloud container clusters get-credentials CLUSTER_NAME --region=REGION

See Also: GCP Configuration Guide

Workflow-Specific Issues

Main CI/CD Pipeline (ci.yaml)

Issue: Test Job Timeout

Symptoms:

Job exceeds 30-minute timeout
Tests hang indefinitely

Solutions:

# 1. Identify slow tests
pytest tests/ --durations=20

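# 2. Add timeout to specific tests (Python, in the test module; requires pytest-timeout)
import pytest

@pytest.mark.timeout(5)
def test_slow_function():
    ...

# 3. Skip slow tests in CI (assumes slow tests carry a "slow" marker)
pytest -m "not slow" tests/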

Issue: Docker Multi-platform Build Fails

Symptoms:

ARM64 build fails
“exec format error”

Solutions:

# Test multi-platform build locally
docker buildx create --use
docker buildx build --platform linux/amd64,linux/arm64 -f docker/Dockerfile .

# Debug specific platform
docker buildx build --platform linux/arm64 --load -t test-arm -f docker/Dockerfile .
docker run --rm test-arm python -c "import platform; print(platform.machine())"

Security Scan Workflow (security-scan.yaml)

Issue: Trivy Scan Finds Vulnerabilities

Symptoms:

Security scan fails
Critical/High vulnerabilities reported

Solutions:

# 1. Run Trivy locally
trivy image --severity CRITICAL,HIGH mcp-server-langgraph:latest

# 2. Update base image
# In docker/Dockerfile, check for a newer tag and update
# (Dockerfiles do not allow trailing comments on an instruction line):
FROM python:3.12-slim

# 3. Update dependencies
uv lock --upgrade-package PACKAGE_NAME

# 4. Accept risk (if false positive)
# Create .trivyignore file, with the reason on its own comment line:
# Reason: False positive, not exploitable in our use case
CVE-2024-XXXXX

Issue: CodeQL Analysis Fails

Symptoms:

“No code found to analyze”
Python extraction errors

Solutions:

# Ensure Python setup before CodeQL
- name: Set up Python
  uses: actions/setup-python@v6
  with:
    python-version: '3.12'

- name: Install dependencies
  run: |
    pip install -r requirements.txt

- name: Initialize CodeQL
  uses: github/codeql-action/init@v4
  with:
    languages: python

E2E Tests Workflow (e2e-tests.yaml)

Issue: Test Infrastructure Not Ready

Symptoms:

Tests fail with connection errors
“Service not healthy” messages

Solutions:

# 1. Increase health check wait time
# In e2e-tests.yaml:
sleep 30  # Increase from 15

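# 2. Add retries to health checks
healthy=false
for i in {1..5}; do
  if curl -f http://localhost:9080/healthz; then healthy=true; break; fi
  sleep 5
done
[ "$healthy" = true ] || exit 1  # Fail the step if the service never became healthy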

# 3. Check Docker Compose logs
docker compose -f docker-compose.test.yml logs postgres-test

# 4. Verify port availability
netstat -tuln | grep 9432  # Should not be in use before test
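The retry idea can also live in a test helper, so the suite itself waits for the service instead of relying on a fixed sleep. This is a minimal sketch using only the standard library; `http_ok` and `wait_until_healthy` are hypothetical names, and the attempt count and delay are assumptions.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True when a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe, attempts: int = 5, delay: float = 5.0, sleep=time.sleep) -> bool:
    """Call `probe` until it returns True, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

A caller might use `wait_until_healthy(lambda: http_ok("http://localhost:9080/healthz"))` and abort the run when it returns False. Passing the probe as a callable keeps the retry logic testable without a running service.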

Coverage Tracking Workflow (coverage-trend.yaml)

Issue: Coverage Drops >5% Without Code Changes

Symptoms:

Coverage workflow fails
No obvious code changes

Root Causes:

New files added without tests
Conditional code not executed in CI
Test files excluded incorrectly

Solutions:

# 1. Generate local coverage report
pytest --cov=src/mcp_server_langgraph --cov-report=html tests/
open htmlcov/index.html

# 2. Find uncovered files (the leading space avoids matching 10%, 100%, etc.)
pytest --cov=src/mcp_server_langgraph --cov-report=term-missing tests/ | grep " 0%"

# 3. Check .coveragerc exclusions
cat .coveragerc  # Or [tool.coverage] in pyproject.toml

# 4. Temporarily adjust threshold
# In coverage-trend.yaml:
python scripts/ci/track-coverage.py --threshold=10  # Increase from 5%
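When the terminal report is long, filtering it programmatically is more reliable than a text search. The sketch below parses the table printed by `--cov-report=term-missing`, assuming the usual layout of a file name, numeric columns, and a percentage. It is an illustration only and is unrelated to scripts/ci/track-coverage.py.

```python
import re

# A report row: file name, one or more numeric columns, then the coverage percentage.
ROW = re.compile(r"^(\S+)\s+(?:\d+\s+)+(\d+)%")


def files_at_or_below(report: str, max_percent: int = 0) -> list[str]:
    """Return files whose coverage is at or below `max_percent`, in report order."""
    hits = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match and match.group(1) != "TOTAL" and int(match.group(2)) <= max_percent:
            hits.append(match.group(1))
    return hits
```

Raising `max_percent` to a small value such as 10 also catches files that are imported but barely exercised, which is a common sign of a newly added module without tests.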

Performance Issues

Slow Workflow Runs

Symptoms:

CI takes >20 minutes
Jobs queued for a long time

Diagnosis:

# View workflow timing
gh run view RUN_ID --log

# Identify slow jobs
gh run view RUN_ID --json jobs --jq '.jobs[] | {name, conclusion, startedAt, completedAt}'
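The JSON from the second command contains timestamps rather than durations. A short script can turn it into a ranked list. This is a sketch that assumes the payload has the shape `{"jobs": [{"name", "startedAt", "completedAt"}, ...]}`; `job_durations` is a hypothetical helper.

```python
import json
from datetime import datetime


def _parse(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2025-11-02T09:00:00Z."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def job_durations(payload: str) -> list[tuple[str, float]]:
    """Return (job name, seconds) pairs sorted from slowest to fastest."""
    durations = []
    for job in json.loads(payload).get("jobs", []):
        started, completed = job.get("startedAt"), job.get("completedAt")
        if not started or not completed:
            continue  # Job is still queued or running
        seconds = (_parse(completed) - _parse(started)).total_seconds()
        if seconds < 0:
            continue  # Placeholder completion time on an unfinished job
        durations.append((job["name"], seconds))
    return sorted(durations, key=lambda item: item[1], reverse=True)
```

Saving the output of `gh run view RUN_ID --json jobs` to a file and passing its contents to `job_durations` shows which job to optimize first.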

Optimizations:

Improve Caching:

# Better cache key
- uses: actions/cache@v4
  with:
    path: ~/.cache/uv  # Assumed default uv cache directory on Linux; adjust if UV_CACHE_DIR is set
    key: ${{ runner.os }}-uv-${{ hashFiles('uv.lock', 'pyproject.toml') }}

Parallel Execution:

# Run independent jobs in parallel
jobs:
  test:
    ...
  lint:  # Runs parallel to test
    ...

Conditional Execution:

# Skip jobs when the commit message contains [skip ci]
# (head_commit is only populated on push events)
jobs:
  test:
    if: "!contains(github.event.head_commit.message, '[skip ci]')"
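The condition above depends on the commit message. To decide from the changed paths instead, a script can classify the files touched by a change. This is a minimal sketch; the `docs/` prefix and the file suffixes are assumptions about the repository layout.

```python
DOC_PREFIXES = ("docs/",)  # Assumption: documentation lives under docs/
DOC_SUFFIXES = (".md", ".mdx", ".rst")  # Assumption: documentation file types


def is_docs_only(changed_paths, prefixes=DOC_PREFIXES, suffixes=DOC_SUFFIXES) -> bool:
    """Return True when every changed path looks like documentation."""
    paths = list(changed_paths)
    if not paths:
        return False  # No information: run the full pipeline to be safe
    return all(p.startswith(prefixes) or p.endswith(suffixes) for p in paths)
```

The list of paths could come from `git diff --name-only` against the base branch. An empty list is treated as "not docs-only" so that a failed diff never silently skips the tests.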

Reduce Test Scope:

# Only run tests affected by changed files (requires the pytest-testmon plugin)
pytest --testmon tests/

High GitHub Actions Costs

Symptoms:

Monthly bill exceeds budget
Many long-running workflows

Solutions:

Monitor Costs:

# Run cost tracking workflow manually
gh workflow run cost-tracking.yaml

# View report
gh run list --workflow=cost-tracking.yaml --limit=1
gh run view RUN_ID

Optimize Workflows:

# Add concurrency limits
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true  # Cancel outdated runs

Reduce Scheduled Runs:

# Change from daily to weekly
schedule:
  - cron: '0 9 * * 1'  # Weekly instead of daily

Security Issues

Secrets Exposed in Logs

Symptoms:

Secrets visible in workflow logs
Security alerts from GitHub

Prevention:

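# Values read from secrets.* are masked automatically; add-mask is mainly
# needed for sensitive values that are derived or generated at runtime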
- name: Use secret
  env:
    MY_SECRET: ${{ secrets.MY_SECRET }}
  run: |
    echo "::add-mask::$MY_SECRET"  # Mask in logs
    echo "Using secret..."  # Won't show value

Dependency Vulnerabilities

Symptoms:

Dependabot alerts
Security scan failures

Solutions:

# 1. Update vulnerable dependency
uv lock --upgrade-package PACKAGE_NAME

# 2. Check for security advisories
pip-audit  # Add --fix to upgrade vulnerable packages automatically

# 3. Pin to secure version
# In pyproject.toml:
[project]
dependencies = [
    "package>=1.2.3",  # Version without vulnerability
]

Debugging Techniques

Enable Debug Logging

In Workflow File:

jobs:
  debug-job:
    runs-on: ubuntu-latest
    steps:
    # Note: writing ACTIONS_STEP_DEBUG to $GITHUB_ENV does not enable debug
    # logging; it must be set as a repository secret or variable.
    - name: Debug info
      run: |
        echo "::debug::Debugging information"  # Shown only when step debug logging is enabled
        env | sort  # Print all environment variables

Via GitHub UI:

Go to repository Settings → Secrets and variables → Actions
Add secret: ACTIONS_STEP_DEBUG = true
Re-run workflow

SSH Into Runner (for Emergencies)

Using tmate:

# Add this step to workflow for interactive debugging
- name: Setup tmate session
  if: failure()  # Only on failure
  uses: mxschmitt/action-tmate@v3
  timeout-minutes: 30

⚠️ Warning: Remove tmate step before merging! It exposes your runner.

View Workflow Artifacts

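# List artifacts (shown in the run summary)
gh run view RUN_ID

# Download artifacts (each is extracted into a directory named after the artifact)
gh run download RUN_ID

# View specific artifact (ARTIFACT_NAME is a placeholder)
jq . ARTIFACT_NAME/report.json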

Test Workflow Locally with Act

# Install act
brew install act  # macOS
# or
curl https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash

# Run workflow locally
act pull_request

# Run specific job
act -j test

# Use specific runner image
act -P ubuntu-latest=catthehacker/ubuntu:act-latest

Getting Help

Where to Look

Workflow Run Logs - Most detailed information
GitHub Status - Check if GitHub Actions is down: https://www.githubstatus.com/
Issue Tracker - Search existing issues: gh issue list
Documentation - Workflow comments explain logic

Reporting Issues

When reporting CI/CD issues, include:

## Issue Description
Brief description of the problem

## Workflow Details
- Workflow: `ci.yaml`
- Run ID: #12345
- Commit: abc123
- Branch: feature/new-thing

## Steps to Reproduce
1. Push to branch
2. Wait for CI
3. Observe failure in step X

## Logs

[Paste relevant log excerpt]

## Expected vs Actual
Expected: Tests pass
Actual: Tests fail with error Y

Emergency Contacts

CI/CD Issues: @cicd-team
Security Alerts: @security-team
Infrastructure: @platform-team

Useful Commands

# View recent workflow runs
gh run list --limit=10

# View specific run
gh run view RUN_ID

# Re-run failed jobs
gh run rerun RUN_ID --failed

# Cancel running workflow
gh run cancel RUN_ID

# Download workflow logs
gh run view RUN_ID --log > workflow.log

# List workflows
gh workflow list

# Disable/enable workflow
gh workflow disable WORKFLOW_NAME
gh workflow enable WORKFLOW_NAME

# View workflow file
gh workflow view WORKFLOW_NAME

# List secrets
gh secret list

# Set secret
gh secret set SECRET_NAME

# Delete secret
gh secret delete SECRET_NAME

Appendix: Workflow Quick Reference

Workflow	Triggers	Duration	Troubleshooting Priority
ci.yaml	PR, push	12 min	HIGH
e2e-tests.yaml	PR, push, nightly	15 min	MEDIUM
security-scan.yaml	PR, daily, release	15 min	HIGH
quality-tests.yaml	PR, push, weekly	20 min	LOW
deploy-preview-gke.yaml	push (main)	10 min	HIGH
deploy-production-gke.yaml	release, manual	15 min	CRITICAL

Last Updated: 2025-11-02
Maintained By: CI/CD Team
