How to Fix Flaky Tests: Causes and Permanent Fixes
Shiplight AI Team
Updated on April 7, 2026

A flaky test (sometimes called an intermittent or non-deterministic test) is a test that passes sometimes and fails sometimes on the same code, with no changes. Flaky tests are the most corrosive problem in a test suite because they turn your CI from a quality signal into noise.
Teams respond to flaky tests in predictable ways: first they rerun them, then they add retries, then they quarantine them, then they just stop looking at red CI. By the time a real regression ships, no one trusts the tests enough to catch it.
This guide covers the 7 root causes of flaky E2E tests and how to fix each one permanently — not with retries that hide the problem, but with changes that make the test reliable. For teams where cause #7 (UI changes breaking locators) is the dominant source of flakiness, Shiplight's self-healing layer eliminates it automatically.
| # | Root Cause | Primary Symptom | Fix |
|---|---|---|---|
| 1 | Timing / race conditions | "element not found" on CI | Replace waitForTimeout with condition-based waits |
| 2 | Brittle selectors | Breaks on CSS rename | Use getByRole, getByTestId, getByLabel |
| 3 | Shared test state | Fails in parallel, passes solo | Isolate data per test, reset state in afterEach |
| 4 | Environment instability | CI fails, local passes | Health checks, mock external APIs, raise timeouts |
| 5 | Animation interference | Random assertion failures | reducedMotion: 'reduce' in Playwright config |
| 6 | Parallelism conflicts | Fails with --workers > 1 | Scope data to workerIndex |
| 7 | UI changes / locator drift | Breaks after refactors | Shiplight self-healing or semantic selectors |
---
Why flakiness compounds

A test suite with 20% flakiness is worse than a smaller, reliable suite. The Google Testing Blog has documented that even 1% flakiness in a large suite creates enough noise to meaningfully slow down development. At 10% or more, teams functionally stop relying on CI.
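Seen per-run rather than per-test, small flake rates compound quickly: assuming tests fail independently with probability p, a run of N tests is fully green with probability (1 - p)^N. A quick sketch (numbers are illustrative):

```typescript
// Probability that a CI run with `tests` independent tests is fully green,
// given a per-test flake rate (probability of a false failure on working code).
function greenRunProbability(tests: number, flakeRate: number): number {
  return Math.pow(1 - flakeRate, tests);
}

// Even at a 0.1% per-test flake rate, a 1000-test suite is green only
// about 37% of the time without retries.
console.log(greenRunProbability(1000, 0.001).toFixed(2)); // "0.37"
console.log(greenRunProbability(1000, 0.0001).toFixed(2)); // "0.90"
```

This is why flakiness has to be driven down per test: at suite scale, even tiny per-test rates make red builds routine.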
Cause 1: Timing / race conditions

Symptom: Test fails with "element not found" or "timeout" — sometimes. Usually on CI, rarely locally.
Root cause: The test clicks or asserts before the page, network request, or animation has finished.
What not to do:

```ts
// Don't add arbitrary sleeps — they're fragile and slow
await page.waitForTimeout(2000);
await page.click('#submit-btn');
```

Fix: Use explicit waits that respond to actual application state:
```ts
// Wait for the element to be visible and enabled
await page.waitForSelector('#submit-btn', { state: 'visible' });
await page.click('#submit-btn');

// Wait for network to settle after an action
await page.click('#submit-btn');
await page.waitForLoadState('networkidle');

// Wait for a specific response
const [response] = await Promise.all([
  page.waitForResponse(r => r.url().includes('/api/submit') && r.status() === 200),
  page.click('#submit-btn'),
]);

// Wait for navigation
await Promise.all([
  page.waitForURL('**/dashboard'),
  page.click('#login-btn'),
]);
```

CI runners are slower than developer machines — timeouts that work locally fail in CI. Set explicit timeouts in your Playwright config:
```ts
// playwright.config.ts
export default {
  timeout: 30000, // per-test timeout
  expect: { timeout: 10000 }, // per-assertion timeout
  use: {
    actionTimeout: 10000, // per-action timeout
  },
};
```

---
Cause 2: Brittle selectors

Symptom: Test breaks after a UI change that didn't change behavior — a CSS class rename, DOM restructure, or component migration.
Root cause: The test is coupled to implementation details (CSS classes, IDs, DOM structure) rather than user-visible behavior.
Fragile selectors:

```ts
// ❌ Breaks when the class name changes
await page.click('.btn-primary-v2-active');

// ❌ Breaks when the DOM restructures
await page.click('div > div:nth-child(3) > button');

// ❌ Breaks when an internal ID changes
await page.click('#internal-submit-14');
```

Resilient selectors (in order of preference):
```ts
// ✅ User-visible text — stable across refactors
await page.click('button:has-text("Sign In")');

// ✅ ARIA role + name — semantic and accessible
await page.getByRole('button', { name: 'Sign In' }).click();

// ✅ Test ID — explicit contract between test and dev
await page.getByTestId('submit-button').click();

// ✅ Label association — works for form inputs
await page.getByLabel('Email address').fill('user@example.com');

// ✅ Placeholder — for unlabeled inputs
await page.getByPlaceholder('Search...').fill('query');
```

Add data-testid attributes to key interactive elements as a team convention. This creates an explicit contract: devs know which elements tests depend on, and changes are deliberate.
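If a team wants to codify this preference order in a shared helper, the selection logic can be sketched generically. This is our illustration, not a Playwright API; `firstMatch` and the strategy names are hypothetical, and in a real suite each resolver would query the page:

```typescript
// Try locator strategies in order of preference and return the first match.
// The caller supplies the resolvers.
type Strategy<T> = { name: string; resolve: () => T | null };

function firstMatch<T>(strategies: Strategy<T>[]): { name: string; value: T } {
  for (const s of strategies) {
    const value = s.resolve();
    if (value !== null) return { name: s.name, value };
  }
  throw new Error('No locator strategy matched');
}

// Stubbed example: the role lookup finds nothing, the test ID succeeds.
const result = firstMatch<string>([
  { name: 'role', resolve: () => null },
  { name: 'testId', resolve: () => 'submit-button' },
  { name: 'text', resolve: () => 'Sign In' },
]);
console.log(result.name); // "testId"
```

The point is the ordering, not the mechanism: resolve by semantics first and fall back to explicit contracts, never to DOM structure.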
The deeper fix is to treat locators as a cache of user intent, not as the source of truth. Shiplight's intent-cache-heal pattern implements this systematically — when a locator breaks, the test resolves the correct element from its intent description rather than failing.
---
Cause 3: Shared test state

Symptom: Tests pass in isolation but fail when run together. Order-dependent failures. "Works on my machine" with a specific test order.
Root cause: Tests share state — database records, cookies, localStorage, or server-side session data — that bleeds between runs.
Fix: Make every test self-contained:
```ts
// ✅ Create isolated test data per test
test.beforeEach(async ({ page }) => {
  // Create a fresh user for this test
  const user = await createTestUser({ role: 'admin' });
  await loginAs(page, user);
});

test.afterEach(async () => {
  // Clean up test data
  await cleanupTestUsers();
});
```

For browser state (cookies, localStorage):
```ts
// playwright.config.ts
export default {
  use: {
    // Start every test in a fresh browser context
    storageState: undefined,
  },
};
```

For auth state, use Playwright's storageState to save a logged-in session once and reuse it — avoiding repeated login steps while still isolating test data:
```ts
// global-setup.ts
import { chromium } from '@playwright/test';

async function globalSetup() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('/login');
  await page.fill('[name=email]', process.env.TEST_USER_EMAIL!);
  await page.fill('[name=password]', process.env.TEST_USER_PASSWORD!);
  await page.click('button[type=submit]');
  await page.waitForURL('/dashboard');
  await page.context().storageState({ path: 'auth.json' });
  await browser.close();
}

export default globalSetup;
```

See stable auth and email E2E tests for handling authentication flows specifically.
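The createTestUser and loginAs helpers above are assumed project utilities; what matters is that every test gets data no other test (or rerun) touches. A minimal sketch of a unique-data factory (the names are ours):

```typescript
// Generate collision-free test data so parallel tests and reruns never
// share records: timestamp + counter + random suffix keeps values unique.
let userCounter = 0;

function uniqueTestUser(prefix = 'e2e') {
  const id = `${prefix}-${Date.now()}-${++userCounter}-${Math.random().toString(36).slice(2, 8)}`;
  return {
    id,
    // .test is a reserved TLD, so this address can never route real mail
    email: `${id}@example.test`,
    role: 'admin' as const,
  };
}

const a = uniqueTestUser();
const b = uniqueTestUser();
console.log(a.email === b.email); // false: every test gets its own user
```

Because uniqueness comes from the generated value rather than a shared fixture, tests stay isolated regardless of execution order or parallelism.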
---
Cause 4: Environment instability

Symptom: Tests fail on CI but not locally. Errors involve timeouts, connection refused, or service unavailability.
Root cause: CI environment differs from local — different network latency, services not fully started, environment variables missing, or third-party API rate limits.
Fix:
Health check before tests:
```ts
// global-setup.ts
async function globalSetup() {
  const maxRetries = 10;
  for (let i = 0; i < maxRetries; i++) {
    try {
      const res = await fetch(process.env.BASE_URL + '/health');
      if (res.ok) return;
    } catch {
      // service not reachable yet — fall through and retry
    }
    await new Promise(r => setTimeout(r, 2000));
  }
  throw new Error('App did not start');
}

export default globalSetup;
```

Mock external services that are unreliable or rate-limited in CI:
```ts
// Mock Stripe, SendGrid, or other third-party APIs in tests
await page.route('**/api.stripe.com/**', route =>
  route.fulfill({ status: 200, body: JSON.stringify({ status: 'succeeded' }) })
);
```

Increase timeouts for CI while keeping local tests fast:
```ts
// playwright.config.ts
export default {
  timeout: process.env.CI ? 45000 : 15000,
};
```

---
Cause 5: Animation interference

Symptom: Test clicks an element that's animating in or out and gets the wrong behavior. An assertion fails because the element is mid-transition.
Root cause: CSS transitions and animations run asynchronously and can interfere with element interaction timing.
Fix: Disable animations in test environments:

```ts
// playwright.config.ts
export default {
  use: {
    // Emulate prefers-reduced-motion; apps that respect it skip animations
    reducedMotion: 'reduce',
  },
};
```

Or inject a global CSS override in test setup:
```ts
test.beforeEach(async ({ page }) => {
  // Note: inject after navigation — injected styles don't survive page.goto()
  await page.addStyleTag({
    content: `*, *::before, *::after {
      animation-duration: 0ms !important;
      transition-duration: 0ms !important;
    }`,
  });
});
```

---
Cause 6: Parallelism conflicts

Symptom: Tests pass when run sequentially (--workers=1) but fail with parallel execution.
Root cause: Parallel tests competing for the same resource — same test user account, same database record, same port.
Fix:
Use unique data per parallel worker:
```ts
// Use the worker index to isolate data
test('create item', async ({ page }, testInfo) => {
  const userId = `test-user-${testInfo.workerIndex}`;
  // Each worker uses its own user, so there are no conflicts
});
```

Limit concurrency for tests that genuinely can't parallelize:
```ts
// playwright.config.ts
export default {
  projects: [
    {
      name: 'sequential-tests',
      testMatch: /serial\.spec\.ts/,
      // Run tests within each file in order rather than in parallel.
      // For strict ordering, use test.describe.configure({ mode: 'serial' })
      // in the spec file itself.
      fullyParallel: false,
    },
  ],
};
```

---
Cause 7: UI changes / locator drift

Symptom: Tests break after normal product development — a component refactor, CSS rename, or layout change — with no behavior change. This is the single largest driver of "tests as a maintenance burden."
Root cause: Tests are coupled to implementation details rather than user intent. Every locator-based test (#submit-btn, .btn-primary, div:nth-child(3)) is a bet that the DOM won't change. That bet loses constantly in teams shipping fast.
Short-term fix: Migrate to semantic selectors (see Cause #2). Add data-testid attributes to critical elements.
Systematic fix with Shiplight: Shiplight's intent-cache-heal pattern eliminates this entire class of flakiness. Instead of maintaining a list of fallback selectors, Shiplight stores the semantic intent of each test step — for example, "click the primary submit button on the checkout form." When a locator breaks, Shiplight's AI resolves the correct element from the live DOM using that intent, not a cached CSS selector.
The result: tests survive CSS renames, component refactors, and layout changes that would break traditional locator-based healers — without any manual selector updates.
```yaml
# Shiplight YAML test — intent survives UI changes
goal: Verify checkout flow
statements:
  - intent: Add item to cart
  - intent: Proceed to checkout
  - intent: Fill in shipping details
  - VERIFY: order confirmation is displayed
```

When a button moves or gets renamed, Shiplight heals the step automatically. The developer who renamed the button doesn't need to update a single test file.
See self-healing test automation and self-healing vs manual maintenance for how this works in production suites.
---
Triaging an existing flaky suite

If you have an existing suite with widespread flakiness, don't try to fix everything at once. Use this triage approach:
Quarantine known offenders:

```ts
// Mark known-flaky tests with skip + a tracking issue
test.skip('checkout flow — flaky, tracked in TICKET-123', async ({ page }) => {
  // ...
});
```

Deleting flaky tests removes coverage. Quarantine them while you fix the root cause.
Add temporary retries in CI:

```ts
// playwright.config.ts
export default {
  retries: process.env.CI ? 2 : 0,
};
```

Retries are a symptom-management tool, not a fix. Use them to keep CI green while you identify root causes, then remove them once the underlying issue is fixed.
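The reason retries plateau is simple probability: if a test falsely fails with probability f per attempt and attempts are independent, it still fails its whole run with probability f^(r+1) after r retries. A quick sketch (illustrative numbers):

```typescript
// Probability a flaky test still fails after all retries, given the
// per-attempt false-failure rate f (attempts assumed independent).
function failureAfterRetries(f: number, retries: number): number {
  return Math.pow(f, retries + 1);
}

// A 30% flaky test with retries: 2 still blocks roughly 2.7% of runs,
// and every retry adds a full test duration to CI time.
console.log(failureAfterRetries(0.3, 2).toFixed(3)); // "0.027"
console.log(failureAfterRetries(0.05, 2).toFixed(6)); // "0.000125"
```

Retries make a very flaky test tolerable, never reliable, which is why they buy time for root-cause fixes rather than replacing them.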
Measure flakiness. Track which tests are most flaky. Playwright's built-in retry mechanism marks tests as flaky when they pass on retry — use this data to prioritize:

```sh
# Generate a JSON report to analyze flakiness
npx playwright test --reporter=json > results.json
```

For CI-specific reporter setup — including the github reporter that surfaces flaky test annotations directly in the PR diff — see E2E testing in GitHub Actions.
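To turn that report into a ranked list, a short script can walk it and count flaky outcomes. This is a sketch under assumptions: we assume the JSON reporter's nested suites/specs shape with a per-test status field that can be "flaky"; check the field names against your Playwright version:

```typescript
// Minimal shape of the report fields this sketch reads (assumed).
interface Spec { title: string; tests: { status: string }[] }
interface Suite { title?: string; suites?: Suite[]; specs?: Spec[] }

// Recursively collect specs whose outcome was "flaky" (passed only on retry).
function collectFlaky(suite: Suite, out: Map<string, number> = new Map()): Map<string, number> {
  for (const spec of suite.specs ?? []) {
    if (spec.tests.some(t => t.status === 'flaky')) {
      out.set(spec.title, (out.get(spec.title) ?? 0) + 1);
    }
  }
  for (const child of suite.suites ?? []) collectFlaky(child, out);
  return out;
}

// In practice you'd load the real report:
//   const report = JSON.parse(readFileSync('results.json', 'utf8'));
// Here a hand-written fragment stands in for it:
const report: Suite = {
  suites: [{
    specs: [
      { title: 'checkout flow', tests: [{ status: 'flaky' }] },
      { title: 'login', tests: [{ status: 'expected' }] },
    ],
  }],
};
console.log([...collectFlaky(report).entries()]); // [ [ 'checkout flow', 1 ] ]
```

Aggregating these counts across many archived CI runs gives the frequency ranking you need for the next step.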
Then fix the 20% of tests causing 80% of flakiness. Common culprits: auth flows, tests hitting external APIs, and tests with waitForTimeout.
---
Build these habits into test authoring:

- Use condition-based waits, never waitForTimeout
- Prefer semantic selectors (getByRole, getByTestId, getByLabel) over CSS classes
- Create isolated data per test and clean it up in afterEach
- Disable animations in the test environment
- Scope test data to workerIndex when running in parallel
---
FAQ

How do I find which tests in my suite are flaky?

Three signals: (1) tests that pass on manual rerun after failing in CI — Playwright marks these as flaky in its JSON report; (2) tests that consistently appear in your retry log; (3) tests that pass with --workers=1 but fail with parallelism. Run npx playwright test --reporter=json > results.json and filter for "status": "flaky" entries to get a ranked list by frequency.
What's the difference between a flaky test and a broken test?

A broken test fails consistently on broken code — it's doing its job. A flaky test fails intermittently on working code — it's a reliability problem in the test itself. The fix for a broken test is to fix the code or update the test to match new behavior. The fix for a flaky test is to address the instability in the test.
Should I just use retries?

Only as a temporary measure. Retries mask the root cause and slow down your CI pipeline. If a test needs 3 retries to pass, it's not a reliable test — it's a slow coin flip. Fix the underlying cause and remove the retries.
How much flakiness is acceptable?

The Google Testing Blog recommends a target of 0.1% or lower flakiness per test run. In practice, teams tolerate up to 1–2% before it meaningfully impacts developer trust. Above 5%, teams stop relying on CI results.
Why do my tests fail in CI but pass locally?

Most common causes: slower CI runners (increase timeouts), missing environment variables, services not fully started (add a health check), or external API rate limits (add mocks). Run your tests with CI=true locally to replicate the environment.
To stabilize an existing suite, start with three moves:

- Set retries: 2 in CI to stop the bleeding
- Replace waitForTimeout calls with proper waits
- Migrate selectors to getByRole / getByTestId

For teams with chronic flakiness from UI changes (cause #7), Shiplight eliminates the entire category automatically. Its intent-based self-healing means tests survive CSS renames, refactors, and component migrations without manual updates — no selector maintenance required. See what is self-healing test automation?
---
Related: turning flaky tests into actionable signal · E2E testing in GitHub Actions · self-healing vs manual maintenance · intent-cache-heal pattern
Stop fixing broken selectors. Shiplight Plugin adds intent-based self-healing on top of your existing Playwright tests — free, no account required. · Book a demo
References: Playwright documentation, Google Testing Blog, GitHub Actions documentation