Flaky Test

A flaky test is an automated test that produces inconsistent results — passing and failing intermittently against the same code — without any change to the system under test. Flakiness is a signal-quality problem, not a code-correctness problem.

In one sentence

A flaky test is a test that gives different results — pass or fail — for the same code, depending on factors unrelated to whether the code is correct: timing, network conditions, test ordering, shared state, or non-deterministic selector resolution.

Why flakiness is worse than no test

A test that always fails is a clear signal: fix it. A test that always passes is also clear: trust it. A flaky test is the worst case — it sometimes catches real bugs, sometimes fires false alarms, and engineers learn to ignore it. The signal-to-noise ratio of the entire suite collapses, and even valid failures get re-run away.

Eight common root causes

  1. Timing assumptions — fixed sleeps, missing waits for async work to settle (see the sketch after this list).
  2. Test ordering — tests sharing state that's only clean in a specific order.
  3. External dependencies — network calls, third-party APIs, real email or auth services.
  4. Concurrency — parallel test workers fighting over the same database row or DOM state.
  5. Non-deterministic UI resolution — selectors that match different elements depending on render order.
  6. Environment drift — staging data changes between runs.
  7. Resource exhaustion — memory or connection limits hit under load.
  8. Improper resource management — leaked browser contexts, unclosed connections, lingering processes.
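The first cause is the most common and the easiest to fix mechanically. Below is a minimal pytest-style sketch of replacing a fixed sleep with a bounded polling wait; the client fixture and its submit_report_job / report_status methods are hypothetical stand-ins for whatever asynchronous work the test is waiting on.

    import time

    def wait_for(condition, timeout=10.0, interval=0.1):
        """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            result = condition()
            if result:
                return result
            time.sleep(interval)
        raise TimeoutError(f"condition not met within {timeout}s")

    # Flaky version: assumes the background job always finishes within 2 seconds.
    def test_report_is_generated_flaky(client):
        client.submit_report_job()
        time.sleep(2)                       # passes or fails depending on machine load
        assert client.report_status() == "done"

    # Steadier version: waits for the actual condition, up to a bounded timeout.
    def test_report_is_generated(client):
        client.submit_report_job()
        wait_for(lambda: client.report_status() == "done", timeout=10.0)

The steadier version still fails when the job genuinely never finishes, but it no longer depends on how fast the machine happens to be that day.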

How flakiness compounds in AI-velocity teams

When AI coding agents open 40+ pull requests per week, even a 1% flake rate produces multiple false failures every day. Engineers develop a "rerun reflex" instead of debugging — and real regressions slip through.
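To make the compounding concrete, here is a back-of-the-envelope sketch. The suite size, per-test flake probability, and runs per pull request below are hypothetical numbers chosen for illustration, not measured figures; the point is that a run shows at least one false failure with probability 1 - (1 - q)^N, so even a small per-test rate q dominates once the suite (N tests) is large and CI runs are frequent.

    # Hypothetical figures for illustration only.
    suite_size = 500              # tests executed per CI run
    per_test_flake_rate = 0.002   # 0.2% chance any single test fails falsely
    prs_per_week = 40
    ci_runs_per_pr = 3            # pushes, rebases, retries

    # Probability that one CI run contains at least one false failure,
    # assuming tests flake independently of each other.
    p_false_red_run = 1 - (1 - per_test_flake_rate) ** suite_size

    expected_false_red_runs = prs_per_week * ci_runs_per_pr * p_false_red_run
    print(f"P(false-red run) = {p_false_red_run:.1%}")                       # ~63%
    print(f"Expected false-red runs per week = {expected_false_red_runs:.0f}")  # ~76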

What to do

  • Quarantine, don't delete — see quarantine test.
  • Measure flakiness as a first-class metric — flake rate by test age (newer tests are flakier; older tests should not regress); see the sketch after this list.
  • Track a test flakiness budget — like an SRE error budget, but for the testing layer.
  • Fix the root cause — re-running until green hides the problem.
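A minimal sketch of the second and third points, assuming per-test CI results are already exported as records somewhere; the record shape, the 30-day age cutoff, and the 0.5% budget are illustrative assumptions rather than settled conventions.

    from collections import defaultdict
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class TestResult:
        test_id: str
        first_seen: date     # when the test entered the suite
        run_date: date
        flaked: bool         # failed, then passed on retry against the same commit

    def flake_rate_by_age(results: list[TestResult]) -> dict[str, float]:
        """Group results into age buckets and return the flake rate per bucket."""
        runs = defaultdict(int)
        flakes = defaultdict(int)
        for r in results:
            age_days = (r.run_date - r.first_seen).days
            bucket = "new (<30d)" if age_days < 30 else "established (>=30d)"
            runs[bucket] += 1
            flakes[bucket] += r.flaked
        return {bucket: flakes[bucket] / runs[bucket] for bucket in runs}

    FLAKE_BUDGET = 0.005   # 0.5% of runs may flake before the team stops and fixes

    def budget_exceeded(rates: dict[str, float]) -> bool:
        # Established tests get no slack; new tests are expected to be noisier.
        return rates.get("established (>=30d)", 0.0) > FLAKE_BUDGET

A budget check like this can gate merges the same way an SRE error budget gates feature work: once established tests exceed the budget, fixing flakes takes priority over shipping new changes.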

Related terms