Flaky Test

A flaky test is an automated test that produces inconsistent results — passing and failing intermittently against the same code — without any change to the system under test. Flakiness is a signal-quality problem, not a code-correctness problem.

In one sentence

A flaky test is a test that gives different results — pass or fail — for the same code, depending on factors unrelated to whether the code is correct: timing, network conditions, test ordering, shared state, or non-deterministic selector resolution.

Why flakiness is worse than no test

A test that always fails is a clear signal: fix it. A test that always passes is also clear: trust it. A flaky test is the worst case — it sometimes catches real bugs, sometimes fires false alarms, and engineers learn to ignore it. The signal-to-noise ratio of the entire suite collapses, and even valid failures get re-run away.

Eight common root causes

  1. Timing assumptions — fixed sleeps, missing waits for async work to settle (see the sketch after this list).
  2. Test ordering — tests sharing state that's only clean in a specific order.
  3. External dependencies — network calls, third-party APIs, real email or auth services.
  4. Concurrency — parallel test workers fighting over the same database row or DOM state.
  5. Non-deterministic UI resolution — selectors that match different elements depending on render order.
  6. Environment drift — staging data changes between runs.
  7. Resource exhaustion — memory or connection limits hit under load.
  8. Improper resource management — leaked browser contexts, unclosed connections, lingering processes.
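The first cause is the most common and the easiest to fix mechanically. Below is a minimal pytest-style sketch of replacing a fixed sleep with a bounded polling wait; the client fixture and its submit_report_job / report_status methods are hypothetical stand-ins for whatever asynchronous work the test is waiting on.

    import time

    def wait_for(condition, timeout=10.0, interval=0.1):
        """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            result = condition()
            if result:
                return result
            time.sleep(interval)
        raise TimeoutError(f"condition not met within {timeout}s")

    # Flaky version: assumes the background job always finishes within 2 seconds.
    def test_report_is_generated_flaky(client):
        client.submit_report_job()
        time.sleep(2)                       # passes or fails depending on machine load
        assert client.report_status() == "done"

    # Steadier version: waits for the actual condition, up to a bounded timeout.
    def test_report_is_generated(client):
        client.submit_report_job()
        wait_for(lambda: client.report_status() == "done", timeout=10.0)

The steadier version still fails when the job genuinely never finishes, but it no longer depends on how fast the machine happens to be that day.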

How flakiness compounds in AI-velocity teams

When AI coding agents open 40+ pull requests per week, even a 1% flake rate produces multiple false failures every day. Engineers develop a "rerun reflex" instead of debugging — and real regressions slip through.
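To make the compounding concrete, here is a back-of-the-envelope sketch. The suite size, per-test flake probability, and runs per pull request below are hypothetical numbers chosen for illustration, not measured figures; the point is that a run shows at least one false failure with probability 1 - (1 - q)^N, so even a small per-test rate q dominates once the suite (N tests) is large and CI runs are frequent.

    # Hypothetical figures for illustration only.
    suite_size = 500              # tests executed per CI run
    per_test_flake_rate = 0.002   # 0.2% chance any single test fails falsely
    prs_per_week = 40
    ci_runs_per_pr = 3            # pushes, rebases, retries

    # Probability that one CI run contains at least one false failure,
    # assuming tests flake independently of each other.
    p_false_red_run = 1 - (1 - per_test_flake_rate) ** suite_size

    expected_false_red_runs = prs_per_week * ci_runs_per_pr * p_false_red_run
    print(f"P(false-red run) = {p_false_red_run:.1%}")                       # ~63%
    print(f"Expected false-red runs per week = {expected_false_red_runs:.0f}")  # ~76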

What to do

  • Quarantine, don't delete — see quarantine test.
  • Measure flakiness as a first-class metric — flake rate by test age (newer tests are flakier; older tests should not regress); see the sketch after this list.
  • Track a test flakiness budget — like an SRE error budget, but for the testing layer.
  • Fix the root cause — re-running until green hides the problem.
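A minimal sketch of the second and third points, assuming per-test CI results are already exported as records somewhere; the record shape, the 30-day age cutoff, and the 0.5% budget are illustrative assumptions rather than settled conventions.

    from collections import defaultdict
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class TestResult:
        test_id: str
        first_seen: date     # when the test entered the suite
        run_date: date
        flaked: bool         # failed, then passed on retry against the same commit

    def flake_rate_by_age(results: list[TestResult]) -> dict[str, float]:
        """Group results into age buckets and return the flake rate per bucket."""
        runs = defaultdict(int)
        flakes = defaultdict(int)
        for r in results:
            age_days = (r.run_date - r.first_seen).days
            bucket = "new (<30d)" if age_days < 30 else "established (>=30d)"
            runs[bucket] += 1
            flakes[bucket] += r.flaked
        return {bucket: flakes[bucket] / runs[bucket] for bucket in runs}

    FLAKE_BUDGET = 0.005   # 0.5% of runs may flake before the team stops and fixes

    def budget_exceeded(rates: dict[str, float]) -> bool:
        # Established tests get no slack; new tests are expected to be noisier.
        return rates.get("established (>=30d)", 0.0) > FLAKE_BUDGET

A budget check like this can gate merges the same way an SRE error budget gates feature work: once established tests exceed the budget, fixing flakes takes priority over shipping new changes.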

Related terms