Flaky Test
A flaky test is an automated test that produces inconsistent results — passing and failing intermittently against the same code — without any change to the system under test. Flakiness is a signal-quality problem, not a code-correctness problem.
In one sentence
A flaky test is a test that gives different results — pass or fail — for the same code, depending on factors unrelated to whether the code is correct: timing, network conditions, test ordering, shared state, or non-deterministic selector resolution.
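A minimal sketch of what this looks like in practice, assuming a pytest-style test and a hypothetical `fetch_result` operation whose latency varies between runs:

```python
import random
import time

def fetch_result(delay=None):
    # Hypothetical operation whose latency varies run to run.
    time.sleep(delay if delay is not None else random.uniform(0.0, 0.2))
    return "done"

def test_fetch_result():
    # The correctness assertion is fine; the timing assertion is the
    # flake: it passes or fails depending on scheduler luck, not code.
    start = time.monotonic()
    result = fetch_result()
    elapsed = time.monotonic() - start
    assert result == "done"
    assert elapsed < 0.1  # intermittently fails for the same code
```

The code under test never changes; only an environmental factor (latency) does — which is exactly the definition above.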
Why flakiness is worse than no test
A test that always fails is a clear signal: fix it. A test that always passes is also clear: trust it. A flaky test is the worst case — it sometimes catches real bugs, sometimes fires false alarms, and engineers learn to ignore it. The signal-to-noise ratio of the entire suite collapses, and even valid failures get re-run until they disappear.
Eight common root causes
- Timing assumptions — fixed sleeps, missing waits for async work to settle.
- Test ordering — tests sharing state that's only clean in a specific order.
- External dependencies — network calls, third-party APIs, real email or auth services.
- Concurrency — parallel test workers fighting over the same database row or DOM state.
- Non-deterministic UI resolution — selectors that match different elements depending on render order.
- Environment drift — staging data changes between runs.
- Resource exhaustion — memory or connection limits hit under load.
- Improper resource management — leaked browser contexts, unclosed connections, lingering processes.
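The first root cause above — timing assumptions — is the most common, and the standard remedy is to replace fixed sleeps with a bounded condition wait. A minimal sketch (the `wait_until` helper is illustrative, not from any particular library):

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate() until it returns True or the timeout elapses.

    Replaces the flaky pattern `time.sleep(2); assert job.done`, which
    bakes in an assumption about how long async work takes to settle.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Flaky:  time.sleep(2); assert job.done   -- fails on a slow runner
# Robust: assert wait_until(lambda: job.done, timeout=5.0)
```

The test now waits exactly as long as needed and fails only when the condition genuinely never holds, rather than whenever the runner is slow.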
How flakiness compounds in AI-velocity teams
When AI coding agents open 40+ pull requests per week, even a 1% flake rate produces multiple false failures every day. Engineers develop "rerun reflex" instead of debugging — and real regressions slip through.
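The compounding effect is easy to quantify. With illustrative numbers (40 CI runs per week, a suite containing 50 flake-prone tests, each failing spuriously 1% of the time, and flakes assumed independent):

```python
def prob_at_least_one_flake(num_flaky_tests, per_test_flake_rate):
    # Probability that a single CI run hits at least one false failure,
    # assuming flakes are independent across tests.
    return 1 - (1 - per_test_flake_rate) ** num_flaky_tests

def expected_false_failures(runs_per_week, per_run_flake_prob):
    # Expected number of spuriously red CI runs per week.
    return runs_per_week * per_run_flake_prob

p_run = prob_at_least_one_flake(50, 0.01)        # ~0.395 per run
per_week = expected_false_failures(40, p_run)    # ~16 red builds/week
```

Roughly 16 false red builds a week is two to three per working day — enough to train the "rerun reflex" described above, even though each individual test flakes only 1% of the time.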
What to do
- Quarantine, don't delete — see quarantine test.
- Measure flakiness as a first-class metric — flake rate by test age (newer tests are flakier; older tests should not regress).
- Track a test flakiness budget — like an SRE error budget, but for the testing layer.
- Fix the root cause — re-running until green hides the problem.
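Measuring flakiness as a first-class metric and checking it against a budget can be sketched as follows; `FlakeTracker` and its thresholds are hypothetical names, not a specific tool's API:

```python
from collections import defaultdict

class FlakeTracker:
    """Record per-test outcomes across runs of the same code and
    compute a flake rate, in the spirit of an SRE error budget."""

    def __init__(self):
        self.outcomes = defaultdict(list)  # test name -> [True/False, ...]

    def record(self, test_name, passed):
        self.outcomes[test_name].append(passed)

    def flake_rate(self, test_name):
        runs = self.outcomes[test_name]
        # A test is flaky only if it both passed and failed for the
        # same code; consistently red or green tests score 0.0.
        if not runs or all(runs) or not any(runs):
            return 0.0
        return runs.count(False) / len(runs)

    def over_budget(self, test_name, budget=0.01):
        # Budget exceeded -> candidate for quarantine and root-cause work.
        return self.flake_rate(test_name) > budget
```

Tracking outcomes this way also makes the by-age breakdown above possible: group `outcomes` by when each test was introduced and compare rates between cohorts.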