Test Flakiness Budget

A test flakiness budget is an explicit, team-agreed maximum flake rate for the test suite: a quality SLO. When the budget is exceeded, new feature work pauses until the flake rate drops back under the threshold, the same way SRE error budgets gate deployments.

In one sentence

A flakiness budget treats test reliability as a measurable SLO and enforces it the same way SRE practice enforces error budgets: when the budget is burnt, the team stops shipping new tests or features until reliability returns under the threshold.

Why a budget instead of a "fix the flakes" policy

"Fix the flakes" is everyone's intent and no one's priority. A flakiness budget makes it the team's priority by tying it to merge velocity: if the suite is flakier than agreed, no new code ships. This forces flake reduction to compete with — and beat — feature work for the duration of the burn.

Typical thresholds

Suite tier                       Recommended budget
Pre-merge smoke (golden path)    <0.5% flake rate
Post-merge comprehensive         <2% flake rate
Scheduled full regression        <5% flake rate

Lower-tier suites tolerate more flakiness because they don't gate merges. The pre-merge tier must be near-perfect or developers learn to ignore it, and an ignored gate lets bugs ship.
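
The thresholds above can be encoded as a small lookup and checked per tier. The following is a minimal sketch, not a prescribed implementation. It assumes flake rate is measured as flaky runs divided by total runs over some window, where a "flaky run" is one that failed and then passed on a retry of the same commit. That definition, the tier keys, and the names TierStats, flake_rate, and over_budget are all illustrative assumptions.

```python
from dataclasses import dataclass

# Budgets from the table above, as fractions (0.005 == 0.5%).
# The tier keys are hypothetical identifiers chosen for this sketch.
BUDGETS = {
    "pre_merge_smoke": 0.005,
    "post_merge_comprehensive": 0.02,
    "scheduled_full_regression": 0.05,
}


@dataclass(frozen=True)
class TierStats:
    total_runs: int  # test executions in the measurement window
    flaky_runs: int  # assumption: failed, then passed on retry of the same commit


def flake_rate(stats: TierStats) -> float:
    """Fraction of runs in the window that were flaky (0.0 if nothing ran)."""
    if stats.total_runs == 0:
        return 0.0
    return stats.flaky_runs / stats.total_runs


def over_budget(tier: str, stats: TierStats) -> bool:
    """True when the tier's flake rate is not strictly under its budget.

    The table states budgets as strict upper bounds ("<"), so a rate
    equal to the threshold is treated as over budget.
    """
    return flake_rate(stats) >= BUDGETS[tier]
```

For example, 4 flaky runs out of 1,000 in the pre-merge tier is a 0.4% rate and stays within budget, while 30 flaky runs out of 1,000 in the post-merge tier is 3% and exceeds the 2% budget.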

Enforcement patterns

  • Hard gate: CI refuses to run new feature PRs when burn exceeds budget. Most aggressive; works best on small senior teams.
  • Soft gate: PR template prompts the author to acknowledge and link a flake-fix ticket if budget is burnt. Cultural pressure rather than CI block.
  • Exec dashboard: weekly leadership review that includes flakiness as a green/yellow/red metric alongside uptime and incident counts.
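
The hard and soft gates above differ only in what happens once the budget is exceeded. The sketch below illustrates that decision under stated assumptions: the flake rate and budget are already computed elsewhere, the mode strings "hard" and "soft" are made up for this example, and PRs carrying a hypothetical "flake-fix" label are exempt so that the fixes themselves can still merge. The exemption is an assumption of this sketch and is not stated in the patterns above.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"  # budget is healthy, or the PR is a flake fix
    WARN = "warn"    # soft gate: ask the author to link a flake-fix ticket
    BLOCK = "block"  # hard gate: CI refuses to run the PR


def gate_decision(rate: float, budget: float, mode: str, pr_labels: set) -> Decision:
    """Decide what CI should do with a PR given the current flake rate.

    mode is "hard" or "soft" (hypothetical values for this sketch).
    """
    if mode not in ("hard", "soft"):
        raise ValueError(f"unknown gate mode: {mode!r}")
    if rate < budget:
        return Decision.ALLOW
    # Assumption: flake-fix PRs bypass the gate so the burn can be paid down.
    if "flake-fix" in pr_labels:
        return Decision.ALLOW
    return Decision.BLOCK if mode == "hard" else Decision.WARN
```

A CI step built on this could exit non-zero on BLOCK and post a PR comment on WARN. The exec dashboard pattern needs no gate logic, since it only reports the same rate and budget.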

How budgets interact with quarantine

Quarantining a test reduces the visible flake rate but doesn't reduce the real flake debt. To prevent gaming the budget, count quarantined tests against the budget at a discount (e.g. 0.5×), so the team sees that quarantining helps but doesn't make the problem disappear.
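
One way to read the discount is as a weight on flaky runs from quarantined tests. The sketch below assumes quarantined tests keep running in a non-blocking mode, so their flaky runs are still observed, and that flake rate is measured per run. Both are assumptions of this example, as is the function name adjusted_flake_rate.

```python
def adjusted_flake_rate(
    total_runs: int,
    active_flaky_runs: int,
    quarantined_flaky_runs: int,
    discount: float = 0.5,
) -> float:
    """Flake rate with quarantined flakes counted at a discount.

    total_runs includes runs of both active and quarantined tests.
    discount=0.5 follows the "0.5x" example in the text; discount=0.0
    reproduces the visible rate that ignores quarantined tests.
    """
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    weighted_flakes = active_flaky_runs + discount * quarantined_flaky_runs
    return weighted_flakes / total_runs


# Worked example: 10,000 runs, 30 flaky runs from active tests,
# 40 flaky runs from quarantined tests.
visible = adjusted_flake_rate(10_000, 30, 40, discount=0.0)  # 30 / 10,000
adjusted = adjusted_flake_rate(10_000, 30, 40)               # (30 + 20) / 10,000
```

In this example the visible rate of 0.3% looks comfortably inside a 0.5% pre-merge budget, while the adjusted rate of 0.5% is no longer under it, which is the signal the discount is meant to preserve.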

What a flakiness budget is not

  • Not a replacement for fixing root causes — it's a forcing function for fixing them.
  • Not the same as a flake-rate SLA: an SLA is the reliability floor the suite commits to; the budget is the headroom above that floor before action triggers.

Related terms