CI/CD Troubleshooting Guide
Comprehensive troubleshooting guide for GitHub Actions workflows and CI/CD pipeline issues.
Table of Contents
- Common Issues
- Workflow-Specific Issues
- Performance Issues
- Cost Optimization
- Security Issues
- Debugging Techniques
Common Issues
1. Tests Failing in CI But Passing Locally
Symptoms:
- Tests pass on local machine
- Same tests fail in GitHub Actions
- Error messages mention environment differences
Common causes:
- Environment variable differences
- Timing/race conditions
- Different Python versions
- Missing dependencies
Solutions:
- Use .env.example to document required variables
- Run pre-commit hooks before pushing
- Test with multiple Python versions locally:
make test-all-pythons
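One way to catch missing variables before a CI run is to compare the names listed in .env.example against the current environment. The sketch below assumes .env.example uses plain KEY=value lines with # comments; the function name missing_env_vars is illustrative and not part of this repository.

```python
import os


def missing_env_vars(example_text, environ=None):
    """Return names listed in a .env.example body that are absent from environ."""
    environ = os.environ if environ is None else environ
    missing = []
    for raw in example_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional leading "export ".
        if line.startswith("export "):
            line = line[len("export "):]
        name = line.split("=", 1)[0].strip()
        if name and name not in environ:
            missing.append(name)
    return missing
```

Passing the contents of .env.example and leaving environ unset checks the real process environment, so the same call can run locally and in a CI step to show which side is missing a variable.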
2. Docker Build Failures
Symptoms:
- docker build fails in CI
- “No space left on device” errors
- Network timeout errors
Common causes:
- Disk space exhaustion
- Network timeouts
- Cache issues
- Base image unavailability
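Two of the causes above can be probed from a small script: free disk space can be read with shutil.disk_usage, and transient network timeouts can be absorbed by retrying with exponential backoff. This is a minimal sketch; free_gib and retry are hypothetical helper names, and the attempt count and delays are assumptions to tune.

```python
import shutil
import time


def free_gib(path="/"):
    """Return free space on the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on an exception wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The operation passed to retry could, for example, wrap a subprocess call that pulls the base image, so that a single network timeout does not fail the whole job.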
3. Pre-commit Hooks Failing
Symptoms:
- Pre-commit job fails in CI
- Hooks pass locally but fail in CI
Common causes:
- Formatting differences
- Different tool versions
- Line ending differences (CRLF vs LF)
- File not committed
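Line ending differences are easy to confirm directly. The sketch below scans files for CRLF byte sequences; has_crlf and files_with_crlf are illustrative names, and the list of paths to scan is left to the caller.

```python
from pathlib import Path


def has_crlf(data):
    """Return True if the byte string contains a CRLF sequence."""
    return b"\r\n" in data


def files_with_crlf(paths):
    """Return the paths (as strings) whose contents contain CRLF line endings."""
    return [str(p) for p in paths if has_crlf(Path(p).read_bytes())]
```

Running this over the files that a hook rewrote in CI shows whether the difference comes from line endings rather than from the formatter itself.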
4. Deployment Authentication Failures
Symptoms:
- “Permission denied” during GCP auth
- “Workload Identity Provider not found”
- “Invalid service account”
Common causes:
- Missing or incorrect GitHub secrets
- Workload Identity misconfiguration
- Service account lacks permissions
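Malformed secret values account for many of these errors, so a format check before the auth step can narrow things down. The sketch below assumes the provider is given as a full resource name containing the numeric project number, and that the service account is a standard email ending in .iam.gserviceaccount.com; check_auth_inputs is a hypothetical helper, and the check validates shape only, not whether the resources exist or carry the right permissions.

```python
import re

# Expected shape of a full Workload Identity Provider resource name.
PROVIDER_RE = re.compile(
    r"^projects/\d+/locations/global/workloadIdentityPools/[^/]+/providers/[^/]+$"
)
# Loose shape of a service account email.
SERVICE_ACCOUNT_RE = re.compile(r"^[^@\s]+@[^@\s]+\.iam\.gserviceaccount\.com$")


def check_auth_inputs(provider, service_account):
    """Return a list of human-readable problems; an empty list means both look well-formed."""
    problems = []
    if not provider:
        problems.append("workload identity provider is empty")
    elif not PROVIDER_RE.match(provider):
        problems.append("workload identity provider is not a full resource name")
    if not service_account:
        problems.append("service account is empty")
    elif not SERVICE_ACCOUNT_RE.match(service_account):
        problems.append("service account is not an IAM service account email")
    return problems
```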
Workflow-Specific Issues
Main CI/CD Pipeline (ci.yaml)
Issue: Test Job Timeout
Symptoms:
- Job exceeds 30-minute timeout
- Tests hang indefinitely
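A hung test run is easier to diagnose when the script itself enforces a limit shorter than the job timeout. The sketch below wraps a command with subprocess.run and a timeout; run_with_timeout is an illustrative name and the limit is an assumption to set per project.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd; return (timed_out, returncode). returncode is None on timeout."""
    try:
        completed = subprocess.run(cmd, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return True, None
    return False, completed.returncode


if __name__ == "__main__":
    # Example: a trivial command that finishes well within the limit.
    print(run_with_timeout([sys.executable, "-c", "pass"], 30))
```

Wrapping the test command with a limit below 30 minutes means a hang surfaces as an explicit timeout result instead of a cancelled job.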
Issue: Docker Multi-platform Build Fails
Symptoms:
- ARM64 build fails
- “exec format error”
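An “exec format error” typically means a binary built for one CPU architecture was executed on another without emulation. The sketch below compares the host architecture with a target platform string such as linux/arm64; the alias table and the function names are assumptions for illustration.

```python
import platform

# Map common machine names to Docker-style architecture names.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def docker_arch(machine=None):
    """Return the Docker-style architecture for a machine name (default: this host)."""
    machine = (machine or platform.machine()).lower()
    return _ARCH_ALIASES.get(machine, machine)


def needs_emulation(target_platform, machine=None):
    """Return True if a target such as 'linux/arm64' differs from the host architecture."""
    parts = target_platform.split("/")
    target_arch = parts[1] if len(parts) > 1 else parts[0]
    return target_arch != docker_arch(machine)
```

If needs_emulation returns True for a platform in the build matrix, the runner needs emulation (for example QEMU) set up before that build step.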
Security Scan Workflow (security-scan.yaml)
Issue: Trivy Scan Finds Vulnerabilities
Symptoms:
- Security scan fails
- Critical/High vulnerabilities reported
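When a scan fails, a summary by severity helps decide what to fix first. The sketch below assumes the scanner was run with JSON output and that the report has a top-level "Results" list whose entries may carry a "Vulnerabilities" list with a "Severity" field; verify this against the report your Trivy version produces. The function names are illustrative.

```python
import json
from collections import Counter


def severity_counts(report_json):
    """Count vulnerabilities per severity in a JSON report string."""
    report = json.loads(report_json)
    counts = Counter()
    for result in report.get("Results") or []:
        # Targets without findings may omit the list or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return dict(counts)


def should_fail(counts, blocking=("CRITICAL", "HIGH")):
    """Return True if any blocking severity has at least one finding."""
    return any(counts.get(level, 0) > 0 for level in blocking)
```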
Issue: CodeQL Analysis Fails
Symptoms:
- “No code found to analyze”
- Python extraction errors
E2E Tests Workflow (e2e-tests.yaml)
Issue: Test Infrastructure Not Ready
Symptoms:
- Tests fail with connection errors
- “Service not healthy” messages
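Connection errors at the start of a run often mean the tests began before the services accepted connections. The sketch below polls a TCP port until it opens or a deadline passes; wait_for_port is a hypothetical helper, and an open port is only a rough proxy for service health.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not reachable yet; give up once the deadline has passed.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling this for each dependency before the test command turns an opaque connection error into a clear "service never came up" result.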
Coverage Tracking Workflow (coverage-trend.yaml)
Issue: Coverage Drops >5% Without Code Changes
Symptoms:
- Coverage workflow fails
- No obvious code changes
Common causes:
- New files added without tests
- Conditional code not executed in CI
- Test files excluded incorrectly
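The 5% rule and the first cause above can both be checked with a few lines. The sketch below treats the threshold as percentage points (an assumption; the workflow may define it differently) and assumes tests follow a test_<module>.py naming convention. All function names are illustrative.

```python
from pathlib import PurePosixPath


def coverage_drop(previous, current):
    """Return the drop in coverage, in percentage points (negative if coverage rose)."""
    return round(previous - current, 2)


def exceeds_threshold(previous, current, threshold=5.0):
    """Return True if coverage fell by more than threshold percentage points."""
    return coverage_drop(previous, current) > threshold


def untested_modules(source_files, test_files):
    """Return source files with no test file named test_<module file name>."""
    test_names = {PurePosixPath(t).name for t in test_files}
    return [
        s for s in source_files
        if f"test_{PurePosixPath(s).name}" not in test_names
    ]
```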
Performance Issues
Slow Workflow Runs
Symptoms:
- CI takes >20 minutes
- Jobs queued for long time
Solutions:
- Improve Caching
- Parallel Execution
- Conditional Execution
- Reduce Test Scope
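Caching works when the cache key changes exactly when the dependencies change. The sketch below illustrates that idea by hashing the contents of dependency files into a key; it mirrors the general approach of hashing lock files and is not the key format used by any particular cache action. cache_key and the file names in the usage are assumptions.

```python
import hashlib


def cache_key(prefix, file_contents):
    """Build a cache key from a prefix and a mapping of file name to file bytes."""
    digest = hashlib.sha256()
    # Sort names so the key does not depend on dictionary order.
    for name in sorted(file_contents):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(file_contents[name])
    return f"{prefix}-{digest.hexdigest()[:16]}"
```

Two runs with identical dependency files get the same key and can reuse the cache; editing any listed file yields a new key and a fresh install.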
High GitHub Actions Costs
Symptoms:
- Monthly bill exceeds budget
- Many long-running workflows
Solutions:
- Monitor Costs
- Optimize Workflows
- Reduce Scheduled Runs
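A rough estimate of where the minutes go helps prioritize which workflows to optimize. The sketch below multiplies per-run duration by run count; the durations can come from the quick-reference table in the appendix, while the run counts and the per-minute rate are inputs you supply (the values in the usage are placeholders, not measured data).

```python
def monthly_minutes(workflows):
    """Map workflow name to total minutes, given (name, minutes_per_run, runs_per_month)."""
    return {name: minutes * runs for name, minutes, runs in workflows}


def monthly_cost(workflows, rate_per_minute):
    """Return total estimated cost for the given per-minute rate."""
    return sum(monthly_minutes(workflows).values()) * rate_per_minute


if __name__ == "__main__":
    # Durations from the appendix table; run counts are hypothetical.
    usage = [("ci.yaml", 12, 100), ("e2e-tests.yaml", 15, 40)]
    print(monthly_minutes(usage))
```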
Security Issues
Secrets Exposed in Logs
Symptoms:
- Secrets visible in workflow logs
- Security alerts from GitHub
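Values that are not registered as secrets (for example, tokens derived at runtime) are not masked automatically. The sketch below shows two complementary measures: redacting known values from text before printing it, and building the ::add-mask:: workflow command that asks the runner to mask a value in later log output. The helper names are illustrative.

```python
def redact(text, secrets, mask="***"):
    """Replace every occurrence of each non-empty secret in text with mask."""
    # Replace longer secrets first so a shorter secret that is a prefix
    # of a longer one does not leave part of the longer one visible.
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        text = text.replace(secret, mask)
    return text


def add_mask_command(value):
    """Return the workflow command that masks value; print it to stdout in a step."""
    return f"::add-mask::{value}"
```

A secret that has already appeared in a log should still be rotated; masking only affects output produced afterwards.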
Dependency Vulnerabilities
Symptoms:
- Dependabot alerts
- Security scan failures
Debugging Techniques
Enable Debug Logging
In Workflow File:
In Repository Settings:
- Go to repository Settings → Secrets
- Add secret: ACTIONS_STEP_DEBUG=true
- Re-run workflow
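Scripts run by a workflow can emit their own debug lines, which appear in the log only when step debug logging is enabled. The sketch below builds a ::debug:: workflow command and escapes characters that would otherwise break the command; debug_command is an illustrative name.

```python
def debug_command(message):
    """Return a ::debug:: workflow command for message, with special characters escaped."""
    # Escape % first so the escapes added for CR and LF are not escaped again.
    escaped = (
        message.replace("%", "%25")
        .replace("\r", "%0D")
        .replace("\n", "%0A")
    )
    return f"::debug::{escaped}"


if __name__ == "__main__":
    print(debug_command("resolved config path: ./settings"))
```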
SSH Into Runner (for Emergencies)
Using tmate:
View Workflow Artifacts
Test Workflow Locally with Act
Getting Help
Where to Look
- Workflow Run Logs - Most detailed information
- GitHub Status - Check if GitHub Actions is down: https://www.githubstatus.com/
- Issue Tracker - Search existing issues: gh issue list
- Documentation - Workflow comments explain logic
Reporting Issues
When reporting CI/CD issues, include:
Emergency Contacts
- CI/CD Issues: @cicd-team
- Security Alerts: @security-team
- Infrastructure: @platform-team
Useful Commands
Appendix: Workflow Quick Reference
| Workflow | Triggers | Duration | Troubleshooting Priority |
|---|---|---|---|
| ci.yaml | PR, push | 12 min | HIGH |
| e2e-tests.yaml | PR, push, nightly | 15 min | MEDIUM |
| security-scan.yaml | PR, daily, release | 15 min | HIGH |
| quality-tests.yaml | PR, push, weekly | 20 min | LOW |
| deploy-staging-gke.yaml | push (main) | 10 min | HIGH |
| deploy-production-gke.yaml | release, manual | 15 min | CRITICAL |
Last Updated: 2025-11-02 Maintained By: CI/CD Team