
How to Fix Flaky Tests: Causes and Permanent Fixes

Shiplight AI Team

Updated on April 7, 2026


A flaky test (sometimes called an intermittent or non-deterministic test) is a test that passes on some runs and fails on others, against the same code with no changes. Flaky tests are the most corrosive problem in a test suite because they turn your CI from a quality signal into noise.

Teams respond to flaky tests in predictable ways: first they rerun them, then they add retries, then they quarantine them, then they just stop looking at red CI. By the time a real regression ships, no one trusts the tests enough to catch it.

This guide covers the 7 root causes of flaky E2E tests and how to fix each one permanently — not with retries that hide the problem, but with changes that make the test reliable. For teams where cause #7 (UI changes breaking locators) is the dominant source of flakiness, Shiplight's self-healing layer eliminates it automatically.

Quick Reference: 7 Causes of Flaky Tests

| # | Root Cause | Primary Symptom | Fix |
|---|-----------|-----------------|-----|
| 1 | Timing / race conditions | "element not found" on CI | Replace waitForTimeout with condition-based waits |
| 2 | Brittle selectors | Breaks on CSS rename | Use getByRole, getByTestId, getByLabel |
| 3 | Shared test state | Fails in parallel, passes solo | Isolate data per test, reset state in afterEach |
| 4 | Environment instability | CI fails, local passes | Health checks, mock external APIs, raise timeouts |
| 5 | Animation interference | Random assertion failures | reducedMotion: 'reduce' in Playwright config |
| 6 | Parallelism conflicts | Fails with --workers > 1 | Scope data to workerIndex |
| 7 | UI changes / locator drift | Breaks after refactors | Shiplight self-healing or semantic selectors |

---

Why Flaky Tests Are Worse Than No Tests

A test suite with 20% flakiness is worse than a smaller, reliable suite. Here's why:

  • False positives: CI fails on green code — developers learn to ignore it
  • Investigation overhead: every failure requires triage to determine if it's real
  • Trust erosion: once trust breaks, it doesn't come back without deliberate effort
  • Coverage rot: flaky tests get disabled, leaving real gaps behind

The Google Testing Blog has documented that even 1% flakiness in a large suite creates enough noise to meaningfully slow down development. At 10%+, teams functionally stop relying on CI.
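To make that concrete, here's an illustrative back-of-envelope calculation. The 0.5% per-test flake rate and 200-test suite size are assumed numbers chosen for the example, not figures from the Google post:

```typescript
// If each test independently flakes 0.5% of the time, a 200-test run
// is green only when every single test avoids flaking.
const perTestFlakeRate = 0.005; // assumed: 0.5% per test
const suiteSize = 200;          // assumed: 200 tests
const greenRunProbability = Math.pow(1 - perTestFlakeRate, suiteSize);
console.log(greenRunProbability.toFixed(2)); // ≈ 0.37
```

Under those assumptions, roughly two out of three full runs contain at least one false failure, which is why per-test flakiness has to be driven well below 1% before CI stays trustworthy.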

The 7 Root Causes of Flaky E2E Tests

1. Timing and Race Conditions

Symptom: Test fails with "element not found" or "timeout" — sometimes. Usually on CI, rarely locally.

Root cause: The test clicks or asserts before the page, network request, or animation has finished.

What not to do:

// Don't add arbitrary sleeps — they're fragile and slow
await page.waitForTimeout(2000);
await page.click('#submit-btn');

Fix: Use explicit waits that respond to actual application state:

// Wait for the element to be visible and enabled
await page.waitForSelector('#submit-btn', { state: 'visible' });
await page.click('#submit-btn');

// Wait for the network to settle after an action
// (note: 'networkidle' is discouraged in the Playwright docs for app testing;
// prefer waiting for a specific response or element when you can)
await page.click('#submit-btn');
await page.waitForLoadState('networkidle');

// Wait for a specific response
const [response] = await Promise.all([
  page.waitForResponse(r => r.url().includes('/api/submit') && r.status() === 200),
  page.click('#submit-btn'),
]);

// Wait for navigation
await Promise.all([
  page.waitForURL('**/dashboard'),
  page.click('#login-btn'),
]);

CI runners are slower than developer machines — timeouts that work locally fail in CI. Set explicit timeouts in your Playwright config:

// playwright.config.ts
export default {
  timeout: 30000,           // per test timeout
  expect: { timeout: 10000 }, // per assertion timeout
  use: {
    actionTimeout: 10000,   // per action timeout
  },
};

---

2. Brittle Selectors

Symptom: Test breaks after a UI change that didn't change behavior — a CSS class rename, DOM restructure, or component migration.

Root cause: The test is coupled to implementation details (CSS classes, IDs, DOM structure) rather than user-visible behavior.

Fragile selectors:

// ❌ Breaks when class name changes
await page.click('.btn-primary-v2-active');

// ❌ Breaks when DOM restructures
await page.click('div > div:nth-child(3) > button');

// ❌ Breaks when internal ID changes
await page.click('#internal-submit-14');

Resilient selectors (in order of preference):

// ✅ User-visible text — stable across refactors
await page.click('button:has-text("Sign In")');

// ✅ ARIA role + name — semantic and accessible
await page.getByRole('button', { name: 'Sign In' }).click();

// ✅ Test ID — explicit contract between test and dev
await page.getByTestId('submit-button').click();

// ✅ Label association — works for form inputs
await page.getByLabel('Email address').fill('user@example.com');

// ✅ Placeholder — for unlabeled inputs
await page.getByPlaceholder('Search...').fill('query');

Add data-testid attributes to key interactive elements as a team convention. This creates an explicit contract: devs know which elements tests depend on, and changes are deliberate.
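If your codebase already standardizes on a different attribute (say data-qa), Playwright lets you point getByTestId at it instead of the default data-testid. A minimal config sketch; the 'data-qa' value is an assumed example:

```typescript
// playwright.config.ts
// Make getByTestId() match your team's attribute convention.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    testIdAttribute: 'data-qa', // assumed example; default is 'data-testid'
  },
});
```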

The deeper fix is to treat locators as a cache of user intent, not as the source of truth. Shiplight's intent-cache-heal pattern implements this systematically — when a locator breaks, the test resolves the correct element from its intent description rather than failing.

---

3. Shared or Leaked Test State

Symptom: Tests pass in isolation but fail when run together. Order-dependent failures. "Works on my machine" with a specific test order.

Root cause: Tests share state — database records, cookies, localStorage, or server-side session data — that bleeds between runs.

Fix: Make every test self-contained:

// ✅ Create isolated test data per test
test.beforeEach(async ({ page }) => {
  // Create a fresh user for this test
  const user = await createTestUser({ role: 'admin' });
  await loginAs(page, user);
});

test.afterEach(async () => {
  // Clean up test data
  await cleanupTestUsers();
});

For browser state (cookies, localStorage):

// playwright.config.ts
export default {
  use: {
    // Start every test in a fresh browser context
    storageState: undefined,
  },
};

For auth state, use Playwright's storageState to save a logged-in session once and reuse it — avoiding repeated login steps while still isolating test data:

// global-setup.ts
import { chromium } from '@playwright/test';

async function globalSetup() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('/login');
  await page.fill('[name=email]', process.env.TEST_USER_EMAIL!);
  await page.fill('[name=password]', process.env.TEST_USER_PASSWORD!);
  await page.click('button[type=submit]');
  await page.waitForURL('/dashboard');
  await page.context().storageState({ path: 'auth.json' });
  await browser.close();
}

See stable auth and email E2E tests for handling authentication flows specifically.

---

4. Environment and Network Instability

Symptom: Tests fail on CI but not locally. Errors involve timeouts, connection refused, or service unavailability.

Root cause: CI environment differs from local — different network latency, services not fully started, environment variables missing, or third-party API rate limits.

Fix:

Health check before tests:

// global-setup.ts
async function globalSetup() {
  const maxRetries = 10;
  for (let i = 0; i < maxRetries; i++) {
    try {
      const res = await fetch(process.env.BASE_URL + '/health');
      if (res.ok) return; // app is up
    } catch {
      // app not reachable yet; fall through and retry
    }
    if (i === maxRetries - 1) throw new Error('App did not start');
    await new Promise(r => setTimeout(r, 2000));
  }
}

Mock external services that are unreliable or rate-limited in CI:

// Mock Stripe, SendGrid, or other third-party APIs in tests
await page.route('**/api.stripe.com/**', route =>
  route.fulfill({
    status: 200,
    contentType: 'application/json', // without this, the body isn't served as JSON
    body: JSON.stringify({ status: 'succeeded' }),
  })
);

Increase timeouts for CI while keeping local tests fast:

// playwright.config.ts
export default {
  timeout: process.env.CI ? 45000 : 15000,
};

---

5. Animation and Transition Interference

Symptom: Test clicks an element that's animating in/out and gets wrong behavior. Assertion fails because element is mid-transition.

Root cause: CSS transitions and animations run asynchronously and can interfere with element interaction timing.

Fix: Disable animations in test environments:

// playwright.config.ts
export default {
  use: {
    // Disable CSS animations
    reducedMotion: 'reduce',
  },
};

Or inject a global CSS override in test setup:

test.beforeEach(async ({ page }) => {
  await page.addStyleTag({
    content: `*, *::before, *::after { 
      animation-duration: 0ms !important; 
      transition-duration: 0ms !important; 
    }`,
  });
});

---

6. Test Runner Parallelism Conflicts

Symptom: Tests pass when run sequentially (--workers=1) but fail with parallel execution.

Root cause: Parallel tests competing for the same resource — same test user account, same database record, same port.

Fix:

Use unique data per parallel worker:

// Use worker index to isolate data
test('create item', async ({ page }, testInfo) => {
  const userId = `test-user-${testInfo.workerIndex}`;
  // Each worker uses its own user, no conflicts
});

Limit concurrency for tests that genuinely can't parallelize:

// playwright.config.ts
// Note: workers is a top-level config option, not a per-project "use" option.
// fullyParallel: false runs each matching file's tests in order within one worker.
export default {
  projects: [
    {
      name: 'sequential-tests',
      testMatch: /serial\.spec\.ts/,
      fullyParallel: false,
    },
  ],
};

// Or mark a group serial inside the spec file itself:
// test.describe.configure({ mode: 'serial' });

---

7. UI Changes Breaking Locators (The Self-Healing Problem)

Symptom: Tests break after normal product development — a component refactor, CSS rename, or layout change — with no behavior change. This is the single largest driver of "tests as a maintenance burden."

Root cause: Tests are coupled to implementation details rather than user intent. Every locator-based test (#submit-btn, .btn-primary, div:nth-child(3)) is a bet that the DOM won't change. That bet loses constantly in teams shipping fast.

Short-term fix: Migrate to semantic selectors (see Cause #2). Add data-testid attributes to critical elements.

Systematic fix with Shiplight: Shiplight's intent-cache-heal pattern eliminates this entire class of flakiness. Instead of maintaining a list of fallback selectors, Shiplight stores the semantic intent of each test step — for example, "click the primary submit button on the checkout form." When a locator breaks, Shiplight's AI resolves the correct element from the live DOM using that intent, not a cached CSS selector.

The result: tests survive CSS renames, component refactors, and layout changes that would break traditional locator-based healers — without any manual selector updates.

# Shiplight YAML test — intent survives UI changes
goal: Verify checkout flow
statements:
  - intent: Add item to cart
  - intent: Proceed to checkout
  - intent: Fill in shipping details
  - VERIFY: order confirmation is displayed

When a button moves or gets renamed, Shiplight heals the step automatically. The developer who renamed the button doesn't need to update a single test file.

See self-healing test automation and self-healing vs manual maintenance for how this works in production suites.

---

How to Triage Flaky Tests at Scale

If you have an existing suite with widespread flakiness, don't try to fix everything at once. Use this triage approach:

Step 1: Quarantine, don't delete

// Mark known-flaky tests with skip + tracking issue
test.skip('checkout flow — flaky, tracked in TICKET-123', async ({ page }) => {
  // ...
});

Deleting flaky tests removes coverage. Quarantine them while you fix the root cause.

Step 2: Add retries temporarily

// playwright.config.ts
export default {
  retries: process.env.CI ? 2 : 0,
};

Retries are a symptom management tool, not a fix. Use them to keep CI green while you identify root causes, then remove them once the underlying issue is fixed.

Step 3: Measure flakiness rate per test

Track which tests are most flaky. Playwright's built-in retry mechanism marks tests as flaky when they pass on retry — use this data to prioritize:

# Generate a JSON report to analyze flakiness
npx playwright test --reporter=json > results.json
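Once you have results.json, a small script can list the flaky specs. This is a sketch assuming the nested suites/specs/tests shape that Playwright's JSON reporter emits; field names may vary slightly across versions:

```typescript
// Recursively walk the report's nested suites and collect specs
// containing a test whose status is 'flaky' (i.e. it passed only on retry).
type PWTest = { status: string };
type PWSpec = { title: string; tests: PWTest[] };
type PWSuite = { title: string; specs?: PWSpec[]; suites?: PWSuite[] };

function collectFlaky(suites: PWSuite[], prefix = ''): string[] {
  const flaky: string[] = [];
  for (const suite of suites) {
    const path = prefix ? `${prefix} > ${suite.title}` : suite.title;
    for (const spec of suite.specs ?? []) {
      if (spec.tests.some(t => t.status === 'flaky')) {
        flaky.push(`${path} > ${spec.title}`);
      }
    }
    flaky.push(...collectFlaky(suite.suites ?? [], path));
  }
  return flaky;
}

// Usage (Node):
// const report = JSON.parse(require('fs').readFileSync('results.json', 'utf8'));
// console.log(collectFlaky(report.suites));
```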

For CI-specific reporter setup — including the github reporter that surfaces flaky test annotations directly in the PR diff — see E2E testing in GitHub Actions.

Step 4: Fix in order of frequency

Fix the 20% of tests causing 80% of flakiness. Common culprits: auth flows, tests hitting external APIs, tests with waitForTimeout.

---

Preventing Flakiness in New Tests

Build these habits into test authoring:

  • Never use `waitForTimeout` — always wait for a condition, not a duration
  • Always use semantic selectors — role, label, testid, text — never CSS classes or nth-child
  • Create isolated test data per test, clean up after
  • Test one thing per test — smaller tests are easier to debug when they fail
  • Run tests locally with `--headed` before committing — see what the test actually does
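Some of these habits can be enforced by tooling rather than review. If you use ESLint, the eslint-plugin-playwright package includes a no-wait-for-timeout rule; a minimal config sketch (assumes the plugin is installed, and the tests/** glob is an example path):

```javascript
// .eslintrc.cjs
module.exports = {
  overrides: [
    {
      files: ['tests/**'], // adjust to wherever your specs live
      extends: ['plugin:playwright/recommended'],
      rules: {
        // Fail the lint run on any page.waitForTimeout() call
        'playwright/no-wait-for-timeout': 'error',
      },
    },
  ],
};
```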

---

FAQ

How do I identify which tests are flaky?

Three signals: (1) tests that pass on manual rerun after failing in CI — Playwright marks these as flaky in its JSON report; (2) tests that consistently appear in your retry log; (3) tests that pass with --workers=1 but fail with parallelism. Run npx playwright test --reporter=json > results.json and filter for "status": "flaky" entries to get a ranked list by frequency.

What's the difference between a flaky test and a broken test?

A broken test fails consistently on broken code — it's doing its job. A flaky test fails intermittently on working code — it's a reliability problem in the test itself. The fix for a broken test is to fix the code or update the test to match new behavior. The fix for a flaky test is to address the instability in the test.

Should I use retries to fix flaky tests?

Only as a temporary measure. Retries mask the root cause and slow down your CI pipeline. If a test needs 3 retries to pass, it's not a reliable test — it's a slow coin flip. Fix the underlying cause and remove the retries.

How many flaky tests are acceptable?

The Google Testing Blog recommends a target of 0.1% or lower flakiness per test run. In practice, teams tolerate up to 1–2% before it meaningfully impacts developer trust. Above 5%, teams stop relying on CI results.

My tests pass locally but fail in CI — why?

Most common causes: slower CI runners (increase timeouts), missing environment variables, services not fully started (add health check), or external API rate limits (add mocks). Run CI tests with CI=true locally to replicate the environment.

What's the fastest way to reduce flakiness today?

  1. Add retries: 2 in CI to stop the bleeding
  2. Replace all waitForTimeout calls with proper waits
  3. Migrate selectors from CSS classes to getByRole / getByTestId
  4. Isolate test data so tests don't share state

For teams with chronic flakiness from UI changes (cause #7), Shiplight eliminates the entire category automatically. Its intent-based self-healing means tests survive CSS renames, refactors, and component migrations without manual updates — no selector maintenance required. See what is self-healing test automation?

---

Key Takeaways

  • Retries hide flakiness, they don't fix it — treat them as a temporary measure, track root cause
  • Timing issues are the #1 cause — replace waitForTimeout with condition-based waits
  • Selectors should reflect user intent — role, label, testid; never CSS class or DOM position
  • Test isolation is non-negotiable — shared state between tests is a reliability time bomb
  • UI changes cause chronic flakiness — Shiplight's self-healing resolves elements by intent, not cached selectors, eliminating this entire category

Related: turning flaky tests into actionable signal · E2E testing in GitHub Actions · self-healing vs manual maintenance · intent-cache-heal pattern

Stop fixing broken selectors. Shiplight Plugin adds intent-based self-healing on top of your existing Playwright tests — free, no account required. · Book a demo

References: Playwright documentation, Google Testing Blog, GitHub Actions documentation