Coverage Threshold Philosophy
Last Updated: 2025-11-17
Status: Active Policy
Related: Testing Guide, CI/CD Strategy ADR

Overview
This document defines the coverage threshold strategy for the MCP Server LangGraph project. Different test types have different coverage requirements based on their purpose and focus.

Quick Reference
| Test Type | Threshold | Workflow | Rationale |
|---|---|---|---|
| Unit Tests | 66% | ci.yaml | Comprehensive code coverage required |
| Coverage Trend | 80% | coverage-trend.yaml | Higher bar for main branch quality |
| Contract Tests | 0% | quality-tests.yaml | Focus: MCP protocol compliance |
| Property Tests | 0% | quality-tests.yaml | Focus: Edge cases with Hypothesis |
| Regression Tests | 0% | quality-tests.yaml | Focus: Performance baselines (split 4x) |
| E2E Tests | 0% | e2e-tests.yaml | Focus: Complete user journeys |
| Integration Tests | 0% | integration-tests.yaml | Focus: Component interactions (split 4x) |
Philosophy
Principle 1: Coverage ≠ Test Quality
Code coverage is a metric, not a goal. High coverage doesn’t guarantee good tests, and low coverage doesn’t mean tests are ineffective. Example: a contract test validating JSON-RPC 2.0 compliance may only touch 5% of the codebase but is critically important for protocol correctness.

Principle 2: Purpose-Driven Thresholds
Each test category has a specific purpose. Thresholds should align with that purpose:
- Unit tests: Validate business logic → High coverage expected
- Contract tests: Validate protocol compliance → Coverage irrelevant
- Property tests: Find edge cases → Coverage secondary to scenarios
- E2E tests: Validate user workflows → Coverage reflects complexity
Principle 3: Split Test Isolation
When tests are split across parallel jobs (using pytest-split), each job only runs a subset of tests and therefore only covers a subset of the code. Individual splits should not enforce coverage thresholds.
Example: Integration tests split 4 ways
- Total: 248 tests covering ~80% of code
- Per split: ~62 tests covering ~25% of code
- Threshold: 0% per split, evaluate merged coverage
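The arithmetic above can be sketched with a toy model (the line counts and overlap are illustrative, not measured from this project):

```python
# Toy model: why per-split coverage understates merged coverage.
# Each split exercises a different, partially overlapping slice of the code.
total_lines = 1000

# Hypothetical covered-line sets for 4 parallel splits.
splits = [
    set(range(0, 250)),      # split 1
    set(range(200, 450)),    # split 2
    set(range(400, 650)),    # split 3
    set(range(600, 850)),    # split 4
]

for i, covered in enumerate(splits, start=1):
    print(f"split {i}: {len(covered) / total_lines:.0%} coverage")

merged = set().union(*splits)
print(f"merged:  {len(merged) / total_lines:.0%} coverage")
# Each split alone sits at 25%, while the merged set reaches 85% --
# which is why a 66% threshold applied per split would always fail.
```

This is why the policy evaluates only the union of the splits, never any single split.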
Test Type Details
Unit Tests (66% threshold)
Purpose: Validate individual functions and classes in isolation.
Why 66%?
- Balances comprehensive coverage with practical development
- Allows for defensive error handling (uncovered in unit tests but tested in integration)
- Prevents “coverage theater” (writing tests just to hit 100%)
Coverage Trend (80% threshold)
Purpose: Maintain a high quality bar for the main branch.
Why 80%?
- Higher standard for production-ready code
- Prevents gradual coverage erosion
- Encourages thorough testing before merge
Contract Tests (0% threshold)
Purpose: Validate MCP (Model Context Protocol) compliance.
Why 0%?
- Tests focus on protocol schema validation (JSON-RPC 2.0, MCP spec)
- Coverage measures protocol code paths, not business logic
- 23 contract tests may only touch 25% of the codebase but validate critical protocol behavior
Property-Based Tests (0% threshold)
Purpose: Find edge cases using Hypothesis.
Why 0%?
- Tests generate random inputs to find edge cases
- Focus is on scenario diversity, not code coverage
- 100 examples per test (CI profile) may hit same code paths repeatedly
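The per-test example budget mentioned above is normally set through Hypothesis settings profiles. A minimal sketch, assuming a profile named "ci" with the 100-example cap (the profile name and the example property are illustrative, not taken from this repository):

```python
from hypothesis import given, settings, strategies as st

# Register and activate a CI profile capped at 100 generated examples
# per test (assumed value; tune per workflow).
settings.register_profile("ci", max_examples=100)
settings.load_profile("ci")

@given(st.lists(st.integers()))
def test_sort_is_idempotent(xs):
    # Property: sorting twice gives the same result as sorting once.
    assert sorted(sorted(xs)) == sorted(xs)

# Calling the decorated function runs it against the generated examples.
test_sort_is_idempotent()
print("property held for all generated examples")
```

Note that all 100 inputs may exercise the same few code paths, which is exactly why coverage is a poor quality signal for this suite.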
Performance Regression Tests (0% threshold)
Purpose: Detect performance degradation over time.
Why 0%?
- Tests measure execution time, not code coverage
- Split across 4 parallel jobs (~25% coverage per split)
- Coverage would need to be merged to evaluate properly
End-to-End Tests (0% threshold)
Purpose: Validate complete user journeys.
Why 0%?
- Tests focus on user workflows, not code paths
- 14 E2E tests may only exercise specific features
- Coverage reflects workflow complexity, not test quality
Integration Tests (0% threshold)
Purpose: Test component interactions with infrastructure.
Why 0%?
- Tests focus on integration points (DB, Redis, Keycloak, OpenFGA)
- 248 tests split across 4 parallel jobs (~62 tests per split)
- Coverage per split ~25%, merged coverage ~80%
Implementation Details
How Thresholds Are Overridden
The global threshold comes from pyproject.toml (via pytest addopts) and applies to every pytest run by default. Workflows that should not enforce it override the value on the pytest command line with an explicit --cov-fail-under flag.
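For reference, the two mechanisms look roughly like this (the exact option lists and workflow step are assumptions, not copied from the repository). The global default lives in pyproject.toml:

```toml
# pyproject.toml -- default applied to every plain `pytest` run
[tool.pytest.ini_options]
addopts = "--cov=src --cov-report=term-missing --cov-fail-under=66"
```

Specialized workflows then pass the flag again on the command line, where the later value wins:

```yaml
# quality-tests.yaml (sketch) -- contract tests disable the threshold
- name: Run contract tests
  run: pytest tests/contract --cov-fail-under=0
```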
Coverage Collection vs Enforcement
All test types collect coverage for reporting and trend analysis, but only unit tests (plus the main-branch coverage trend check) enforce thresholds.

Merged Coverage Analysis
For split test suites (integration, regression), coverage is:
- Collected per split (4 jobs)
- Uploaded as separate artifacts
- Merged in a dedicated job
- Evaluated against aggregate threshold (if applicable)
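With coverage.py this typically means a final job that downloads the per-split data files and combines them. A sketch of such a merge job, assuming GitHub Actions and artifact names of the form coverage-split-* (both are assumptions, as is applying the 80% bar here):

```yaml
# coverage-merge job (sketch), runs after all 4 splits finish
- name: Download split coverage artifacts
  uses: actions/download-artifact@v4
  with:
    pattern: coverage-split-*
    merge-multiple: true

- name: Merge and evaluate aggregate coverage
  run: |
    coverage combine              # merges the .coverage.* data files
    coverage report --fail-under=80
    coverage xml                  # for upload / trend tooling
```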
Historical Context
Why This Philosophy Was Adopted
Date: 2025-11-17
Problem: Quality Tests workflow was failing in CI despite all tests passing:
- Contract Tests: 25% coverage < 66% required ❌
- Property Tests: 29% coverage < 66% required ❌
- Regression Tests (4 splits): ~24% coverage per split < 66% required ❌
Root Cause: The global 66% threshold in pyproject.toml applied to all pytest runs, including isolated test suites focused on specific quality aspects.
Solution: Override threshold to 0% for isolated test suites while maintaining strict enforcement for unit tests.
Impact:
- ✅ Quality tests now pass (focus on their actual purpose)
- ✅ Unit tests still maintain 66% coverage requirement
- ✅ Coverage trend tracking still enforces 80% for main branch
- ✅ Coverage data still collected for all test types (reporting)
Related Issues
- PR #101: CI failures due to coverage thresholds
- Codex Finding (2025-11-17): Contract tests missing, needed coverage override
- Quality Tests Audit: Identified need for coverage philosophy documentation
Best Practices
✅ DO
- Collect coverage for all test types (even if threshold = 0)
- Use coverage trends to identify under-tested modules
- Focus on test quality over coverage percentage
- Document uncovered edge cases in test comments
- Merge split coverage before evaluation
❌ DON’T
- Write tests just to hit coverage targets (coverage theater)
- Apply unit test thresholds to specialized test suites
- Ignore low coverage without understanding why
- Remove coverage collection from any test type
- Evaluate split coverage without merging first
Changing Thresholds
Process
- Propose change with justification (GitHub issue or ADR)
- Document rationale for new threshold
- Update workflows to reflect new threshold
- Update this document with change history
- Communicate to team via PR/Slack
Threshold Change Checklist
- Rationale documented (why is this change needed?)
- Impact analyzed (how many tests affected?)
- Workflows updated (YAML files modified)
- Documentation updated (this file)
- Team notified (PR description + Slack)
- ADR created (if architectural significance)
Troubleshooting
“Coverage threshold not met” Error
Symptom: CI fails with coverage below the required threshold.
Diagnosis:
- Unit tests: Add tests to cover missing lines (appropriate)
- Specialized tests: Override threshold if focus is non-coverage (e.g., contract, property)
- Split tests: Check whether the failure is on an individual split or on merged coverage
Split Coverage Lower Than Expected
Symptom: An individual split shows ~25% coverage when merged coverage should be ~80%.
Diagnosis: This is expected. Run each split with --cov-fail-under=0 and only evaluate the merged coverage.
Coverage Trend Failing on PR
Symptom: Coverage Trend workflow fails with “Coverage below 80%”.
Diagnosis: The change drops overall coverage below the 80% main-branch bar; add tests for new or modified code before merging.

Related Documentation
- Testing Guide - Comprehensive testing strategies
- CI/CD Strategy ADR - Workflow architecture
- Quality Tests Workflow - Contract, property, regression tests
- E2E Tests Workflow - End-to-end user journeys
- Integration Tests Workflow - Component integration testing
Appendix: Coverage by Test Type
Typical Coverage Ranges
Based on historical data from this project:

| Test Type | Typical Coverage | Test Count | Focus |
|---|---|---|---|
| Unit | 70-85% | ~800 | Business logic |
| Integration (merged) | 75-85% | 248 | Infrastructure |
| E2E | 30-50% | 14 | User journeys |
| Contract | 20-30% | 23 | Protocol compliance |
| Property | 25-35% | ~50 | Edge cases |
| Regression | 20-30% | ~100 | Performance |
Questions? Open an issue or ask in the #engineering Slack channel.