Test Flakiness Budget
A test flakiness budget is an explicit, team-agreed maximum flake rate for the test suite — a quality SLO. When the budget is exceeded, new feature work pauses until flake-rate drops back under the threshold, the same way SRE error budgets gate deployment.
In one sentence
A flakiness budget treats test reliability as a measurable SLO and enforces it the same way SRE practice enforces error budgets: when the budget is burnt, the team stops shipping new tests or features until reliability returns under the threshold.
Why a budget instead of a "fix the flakes" policy
"Fix the flakes" is everyone's intent and no one's priority. A flakiness budget makes it the team's priority by tying it to merge velocity: if the suite is flakier than agreed, no new code ships. This forces flake reduction to compete with — and beat — feature work for the duration of the burn.
Typical thresholds
| Suite tier | Recommended budget |
|---|---|
| Pre-merge smoke (golden path) | <0.5% flake rate |
| Post-merge comprehensive | <2% flake rate |
| Scheduled full regression | <5% flake rate |
Lower-tier suites tolerate more flake because they don't gate merges. The pre-merge tier must be near-perfect or it gets ignored, and ignored signals shipping bugs.
Enforcement patterns
- Hard gate: CI refuses to run new feature PRs when burn exceeds budget. Most aggressive; works best on small senior teams.
- Soft gate: PR template prompts the author to acknowledge and link a flake-fix ticket if budget is burnt. Cultural pressure rather than CI block.
- Exec dashboard: weekly leadership review that includes flakiness as a green/yellow/red metric alongside uptime and incident counts.
How budgets interact with quarantine
Quarantining a test reduces visible flake rate but doesn't reduce real flake debt. To prevent gaming the budget, count quarantined tests against the budget at a discount (e.g. 0.5×) — the team sees that quarantining helps but doesn't make the problem disappear.
What flakiness budget is not
- Not a replacement for fixing root causes — it's a forcing function for fixing them.
- Not the same as a flake-rate SLA — an SLA is the floor; the budget is the headroom above the floor before action triggers.