
How to Fix Flaky E2E Tests: Root Causes and Permanent Fixes

Shiplight AI Team

Updated on May 20, 2026



To fix flaky tests permanently, address the 8 root causes of non-determinism: timing/race conditions, brittle selectors, shared test state, environment instability, animation interference, parallelism conflicts, UI changes breaking locators, and improper resource management. Each has a specific fix — retries mask the symptom, not the cause.

---

A flaky test, sometimes called an intermittent or non-deterministic test, passes on some runs and fails on others against the same code, with no changes in between. Flaky tests are the most corrosive problem in a test suite because they turn your CI from a quality signal into noise.

Teams respond to flaky tests in predictable ways: first they rerun them, then they add retries, then they quarantine them, then they just stop looking at red CI. By the time a real regression ships, no one trusts the tests enough to catch it.

This guide covers the 8 root causes of flaky E2E tests and how to fix each one permanently — not with retries that hide the problem, but with changes that make the test reliable. For teams where cause #7 (UI changes breaking locators) is the dominant source of flakiness, Shiplight's self-healing layer eliminates it automatically.

Quick Reference: 8 Causes of Flaky Tests

| # | Root Cause | Primary Symptom | Fix |
|---|------------|-----------------|-----|
| 1 | Timing / race conditions | "element not found" on CI | Replace `waitForTimeout` with condition-based waits |
| 2 | Brittle selectors | Breaks on CSS rename | Use `getByRole`, `getByTestId`, `getByLabel` |
| 3 | Shared test state | Fails in parallel, passes solo | Isolate data per test, reset state in `afterEach` |
| 4 | Environment instability | CI fails, local passes | Health checks, mock external APIs, raise timeouts |
| 5 | Animation interference | Random assertion failures | `reducedMotion: 'reduce'` in Playwright config |
| 6 | Parallelism conflicts | Fails with `--workers > 1` | Scope data to `workerIndex` |
| 7 | UI changes / locator drift | Breaks after refactors | Shiplight self-healing or semantic selectors |
| 8 | Resource leaks | Flakiness increases over time / across suites | Add teardown for files, DB rows, containers, external resources |

---

Why Flaky Tests Are Worse Than No Tests

Flaky tests are a silent tax on every automated testing program. A test suite with 20% flakiness is worse than a smaller, reliable suite. The obvious cost is time spent re-running CI and investigating false failures — but that's not the deepest problem.

The deeper problem is signal degradation. When CI is red often enough, engineers stop treating red as meaningful. They learn to retry first, investigate later, and merge if the retry passes. A culture of "it's probably flaky" means real failures get ignored too. Regressions merge. Bugs ship.

Specific consequences:

False positives: CI fails on green code — developers learn to ignore it
Investigation overhead: every failure requires triage to determine if it's real
Trust erosion: once trust breaks, it doesn't come back without deliberate effort
Coverage rot: flaky tests get disabled, leaving real gaps behind
Regressions ship: real bugs get dismissed as "probably flaky" and merge anyway

Once a team has learned to distrust their test suite, restoring that trust takes significantly more work than fixing the underlying flakiness would have. This is why flakiness should be treated as a defect, not a nuisance.

The Google Testing Blog has documented that even 1% flakiness in a large suite creates enough noise to meaningfully slow down development. At 10%+, teams functionally stop relying on CI.
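The way small per-test flake rates compound across a suite can be checked with a little arithmetic. The following is a minimal sketch, assuming every test flakes independently with the same probability (a simplification, since real flakes are often correlated). The function names are hypothetical and not part of any library.

```typescript
// Probability that a whole suite goes green when the code is correct
// but each test fails spuriously with probability `flakeRate`.
// Assumes independent flakes, which is a simplification.
function suiteGreenProbability(flakeRate: number, testCount: number): number {
  return Math.pow(1 - flakeRate, testCount);
}

// With automatic retries, a test only fails the run if every attempt flakes.
function flakeRateWithRetries(flakeRate: number, retries: number): number {
  return Math.pow(flakeRate, retries + 1);
}

// 100 tests, each flaking 1% of the time, no retries.
const noRetries = suiteGreenProbability(0.01, 100);

// Same suite with 2 retries per test.
const withRetries = suiteGreenProbability(flakeRateWithRetries(0.01, 2), 100);

console.log(noRetries.toFixed(3));   // 0.366
console.log(withRetries.toFixed(4)); // 0.9999
```

Under these assumptions a 100-test suite at 1% per-test flakiness goes green only about 37% of the time. Retries push that figure back to nearly 100%, which is also why they hide the underlying problem so effectively.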

The 8 Root Causes of Flaky E2E Tests

1. Timing and Race Conditions

Symptoms:

Test fails with "element not found" or "timeout" — sometimes
Passes on slower machines, fails on faster ones (or vice versa)
Passes locally, fails in CI
Starts failing after a performance optimization or infrastructure change

Root cause: The test clicks or asserts before the page, network request, or animation has finished.

What not to do:

// Don't add arbitrary sleeps — they're fragile and slow
await page.waitForTimeout(2000);
await page.click('#submit-btn');

Fix: Use explicit waits that respond to actual application state:

// Wait for the element to be visible (click() also auto-waits for it to be enabled)
await page.waitForSelector('#submit-btn', { state: 'visible' });
await page.click('#submit-btn');

// Wait for network to settle after an action
// (Playwright's docs discourage 'networkidle'; prefer waiting for a specific response or element)
await page.click('#submit-btn');
await page.waitForLoadState('networkidle');

// Wait for a specific response
const [response] = await Promise.all([
  page.waitForResponse(r => r.url().includes('/api/submit') && r.status() === 200),
  page.click('#submit-btn'),
]);

// Wait for navigation
await Promise.all([
  page.waitForURL('**/dashboard'),
  page.click('#login-btn'),
]);

CI runners are often slower than developer machines, so timeouts that work locally can fail in CI. Set explicit timeouts in your Playwright config:

// playwright.config.ts
export default {
  timeout: 30000,           // per test timeout
  expect: { timeout: 10000 }, // per assertion timeout
  use: {
    actionTimeout: 10000,   // per action timeout
  },
};

---

2. Brittle Selectors

Symptoms:

Tests break after UI refactors that don't change user-facing behavior
Failures cluster around the same components repeatedly
Locator errors after design system updates or framework version bumps
Tests fail after developers rename a CSS class

Root cause: The test is coupled to implementation details (CSS classes, IDs, DOM structure) rather than user-visible behavior.

Fragile selectors:

// ❌ Breaks when class name changes
await page.click('.btn-primary-v2-active');

// ❌ Breaks when DOM restructures
await page.click('div > div:nth-child(3) > button');

// ❌ Breaks when internal ID changes
await page.click('#internal-submit-14');

Resilient selectors (prefer role-based locators where possible, as Playwright's documentation recommends):

// ✅ User-visible text — stable across refactors
await page.click('button:has-text("Sign In")');

// ✅ ARIA role + name — semantic and accessible
await page.getByRole('button', { name: 'Sign In' }).click();

// ✅ Test ID — explicit contract between test and dev
await page.getByTestId('submit-button').click();

// ✅ Label association — works for form inputs
await page.getByLabel('Email address').fill('user@example.com');

// ✅ Placeholder — for unlabeled inputs
await page.getByPlaceholder('Search...').fill('query');

Add data-testid attributes to key interactive elements as a team convention. This creates an explicit contract: devs know which elements tests depend on, and changes are deliberate.

The deeper fix is to treat locators as a cache of user intent, not as the source of truth. Shiplight's intent-cache-heal pattern implements this systematically — when a locator breaks, the test resolves the correct element from its intent description rather than failing.

---

3. Shared or Leaked Test State

Symptoms:

Tests pass in isolation but fail when run as part of the full suite
Failures change based on which other tests ran before
"Works on my machine" with a specific test order
Order-dependent failures that appear or disappear when --shard or parallel execution changes which tests run together

Root cause: Tests share state — database records, cookies, localStorage, or server-side session data — that bleeds between runs.

Fix: Make every test self-contained:

// ✅ Create isolated test data per test
test.beforeEach(async ({ page }) => {
  // Create a fresh user for this test
  const user = await createTestUser({ role: 'admin' });
  await loginAs(page, user);
});

test.afterEach(async () => {
  // Clean up test data
  await cleanupTestUsers();
});

For browser state (cookies, localStorage):

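// playwright.config.ts
export default {
  use: {
    // Playwright already gives each test a fresh browser context;
    // leaving storageState unset ensures no saved cookies or localStorage are preloaded
    storageState: undefined,
  },
};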

For auth state, use Playwright's storageState to save a logged-in session once and reuse it — avoiding repeated login steps while still isolating test data:

// global-setup.ts
import { chromium, type FullConfig } from '@playwright/test';

async function globalSetup(config: FullConfig) {
  // A manually launched browser does not pick up baseURL from the config, so pass it in
  const { baseURL } = config.projects[0].use;
  const browser = await chromium.launch();
  const page = await browser.newPage({ baseURL });
  await page.goto('/login');
  await page.fill('[name=email]', process.env.TEST_USER_EMAIL!);
  await page.fill('[name=password]', process.env.TEST_USER_PASSWORD!);
  await page.click('button[type=submit]');
  await page.waitForURL('/dashboard');
  await page.context().storageState({ path: 'auth.json' });
  await browser.close();
}

export default globalSetup;

See stable auth and email E2E tests for handling authentication flows specifically.

---

4. Environment and Network Instability

Symptom: Tests fail on CI but not locally. Errors involve timeouts, connection refused, or service unavailability.

Root cause: CI environment differs from local — different network latency, services not fully started, environment variables missing, or third-party API rate limits.

Fix:

Health check before tests:

// global-setup.ts
async function globalSetup() {
  const maxRetries = 10;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetch(process.env.BASE_URL + '/health');
      if (res.ok) return; // app is up
    } catch {
      // Connection refused: the app is not listening yet
    }
    // Back off after failed connections and after non-2xx responses alike
    await new Promise(r => setTimeout(r, 2000));
  }
  throw new Error('App did not start');
}

export default globalSetup;

Mock external services that are unreliable or rate-limited in CI:

// Mock Stripe, SendGrid, or other third-party APIs in tests
// (page.route only intercepts requests made by the browser, not calls from your backend)
await page.route('**/api.stripe.com/**', route =>
  route.fulfill({ status: 200, body: JSON.stringify({ status: 'succeeded' }) })
);

Increase timeouts for CI while keeping local tests fast:

// playwright.config.ts
export default {
  timeout: process.env.CI ? 45000 : 15000,
};
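Missing environment variables, listed in the root cause above, tend to surface as confusing timeouts deep inside a test. One option is to fail fast before any test runs. The sketch below is a hypothetical helper, not a Playwright API. The variable names come from the examples in this guide, and the `env` parameter is injected so the check can be unit-tested.

```typescript
// Returns the names of required variables that are unset or blank.
// `env` is passed in so the check does not depend on process.env directly.
function findMissingEnvVars(
  required: readonly string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === '';
  });
}

// Throws a single error listing every missing variable,
// rather than failing on the first one found.
function assertEnv(
  required: readonly string[],
  env: Record<string, string | undefined>,
): void {
  const missing = findMissingEnvVars(required, env);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
}

// Possible use at the top of global setup:
// assertEnv(['BASE_URL', 'TEST_USER_EMAIL', 'TEST_USER_PASSWORD'], process.env);
```

Reporting all missing variables at once saves a round trip through CI for each one.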

---

5. Animation and Transition Interference

Symptom: Test clicks an element that's animating in/out and gets wrong behavior. Assertion fails because element is mid-transition.

Root cause: CSS transitions and animations run asynchronously and can interfere with element interaction timing.

Fix: Disable animations in test environments:

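// playwright.config.ts
export default {
  use: {
    // Emulates prefers-reduced-motion: reduce. This only helps if the app's
    // CSS or JS honors that media query.
    contextOptions: { reducedMotion: 'reduce' },
  },
};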

Or inject a global CSS override in test setup:

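test.beforeEach(async ({ page }) => {
  // addStyleTag only affects the currently loaded document, so navigate first
  // (assumes baseURL is configured) and re-apply after any full page navigation
  await page.goto('/');
  await page.addStyleTag({
    content: `*, *::before, *::after { 
      animation-duration: 0ms !important; 
      transition-duration: 0ms !important; 
    }`,
  });
});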

---

6. Test Runner Parallelism Conflicts

Symptom: Tests pass when run sequentially (--workers=1) but fail with parallel execution.

Root cause: Parallel tests competing for the same resource — same test user account, same database record, same port.

Fix:

Use unique data per parallel worker:

// Use worker index to isolate data
test('create item', async ({ page }, testInfo) => {
  const userId = `test-user-${testInfo.workerIndex}`;
  // Each worker uses its own user, no conflicts
});
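The same idea extends to any resource parallel workers might fight over, such as the accounts and ports named in the root cause. The sketch below uses hypothetical helper names and assumes a run ID supplied by CI. It takes Playwright's `testInfo.parallelIndex`, which stays within 0..workers-1, whereas `workerIndex` increases each time a worker process restarts.

```typescript
// Identifies one parallel slot within one CI run.
interface DataScope {
  runId: string;         // unique per CI run, e.g. a build number (assumption)
  parallelIndex: number; // 0..workers-1, from testInfo.parallelIndex
}

// Builds an email address that no other worker or run will reuse.
function scopedEmail(scope: DataScope, label: string): string {
  const safeLabel = label
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything unsafe into a dash
    .replace(/^-+|-+$/g, '');    // trim leading and trailing dashes
  return `test-${scope.runId}-w${scope.parallelIndex}-${safeLabel}@example.com`;
}

// Gives each parallel slot its own port, offset from a base port.
function scopedPort(basePort: number, parallelIndex: number): number {
  return basePort + parallelIndex;
}

const scope: DataScope = { runId: 'build42', parallelIndex: 1 };
console.log(scopedEmail(scope, 'Admin User')); // test-build42-w1-admin-user@example.com
console.log(scopedPort(5432, scope.parallelIndex)); // 5433
```

Including the run ID as well as the worker slot keeps two CI runs against a shared environment from colliding with each other.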

Limit concurrency for tests that genuinely can't parallelize:

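// playwright.config.ts
export default {
  projects: [
    {
      name: 'sequential-tests',
      testMatch: /serial\.spec\.ts/,
      // `workers` is not a valid `use` option. Turn off in-file parallelism here,
      // add test.describe.configure({ mode: 'serial' }) in specs whose tests depend on order,
      // and run the project with: npx playwright test --project=sequential-tests --workers=1
      fullyParallel: false,
    },
  ],
};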

---

7. UI Changes Breaking Locators (The Self-Healing Problem)

Symptom: Tests break after normal product development — a component refactor, CSS rename, or layout change — with no behavior change. This is the single largest driver of "tests as a maintenance burden."

Root cause: Tests are coupled to implementation details rather than user intent. Every locator-based test (#submit-btn, .btn-primary, div:nth-child(3)) is a bet that the DOM won't change. That bet loses constantly in teams shipping fast.

Short-term fix: Migrate to semantic selectors (see Cause #2). Add data-testid attributes to critical elements.

Systematic fix with Shiplight: Shiplight's intent-cache-heal pattern eliminates this entire class of flakiness. Instead of maintaining a list of fallback selectors, Shiplight stores the semantic intent of each test step — for example, "click the primary submit button on the checkout form." When a locator breaks, Shiplight's AI resolves the correct element from the live DOM using that intent, not a cached CSS selector.

The result: tests survive CSS renames, component refactors, and layout changes that would break traditional locator-based healers — without any manual selector updates.

# Shiplight YAML test — intent survives UI changes
goal: Verify checkout flow
statements:
  - intent: Add item to cart
  - intent: Proceed to checkout
  - intent: Fill in shipping details
  - VERIFY: order confirmation is displayed

When a button moves or gets renamed, Shiplight heals the step automatically. The developer who renamed the button doesn't need to update a single test file.

See self-healing test automation and self-healing vs manual maintenance for how this works in production suites.

8. Improper Resource Management

Symptom: Flakiness increases over time or across the full suite. A test that passes alone fails when run after 50 others. Disk fills up, database connections exhaust, or containers leak between runs.

Root cause: Tests create external resources — temporary files, database rows, uploaded blobs, sandboxed containers, message queue entries, seed users — but don't clean them up. Unlike shared test state (cause #3, typically in-memory or session), resource leaks accumulate in persistent storage and external systems. Later tests then hit quota limits, port collisions, duplicate-key errors, or stale data from earlier runs.

Examples:

Test creates a user named test@example.com, next run fails on "email already exists"
Test uploads a file to S3, never deletes it — bucket fills up, later uploads fail
Test spawns a Docker container, doesn't tear it down — port 5432 in use on next run
Test writes a fixture to /tmp, disk fills across long-running CI jobs

Fix: Add explicit teardown for every external resource the test creates. Scope resources to the test worker or run ID so parallel runs don't collide.

import { randomUUID } from 'node:crypto';

// Scope resources to this test run
const runId = process.env.TEST_RUN_ID || randomUUID();
const testEmail = `test-${runId}@example.com`;

// db, s3, and docker below stand in for your project's own client wrappers;
// the exact method names and arguments depend on the libraries you use
test.afterEach(async () => {
  // afterEach runs even when the test fails, which prevents leak accumulation
  await db.user.deleteMany({ where: { email: { contains: runId } } });
  await s3.deleteObjects({ Prefix: `test-artifacts/${runId}/` });
  await docker.container.remove({ force: true });
});

For tests that modify shared state (database, external APIs), prefer transactional wrappers that roll back after each test:

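// Wrap each test in a transaction that rolls back afterwards.
// This only works when every query in the test runs on the same database connection,
// so it does not undo writes made by an app server that holds its own connections.
test.beforeEach(async () => { await db.$executeRaw`BEGIN`; });
test.afterEach(async () => { await db.$executeRaw`ROLLBACK`; });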

Systematic fix: Use ephemeral environments for each test run: a fresh database, a clean file system, disposable containers. CI systems like GitHub Actions make this cheap with service containers. When the entire environment is disposable, leaked resources are discarded with it and cannot accumulate across runs.

---

How AI Self-Healing Eliminates Flaky E2E Tests Permanently

Causes #1–6 and #8 require code changes to fix. Cause #7 — UI changes breaking locators — is the one AI can eliminate automatically.

Shiplight's intent-cache-heal pattern works by storing the semantic intent of each test step rather than a brittle CSS selector. When a locator breaks after a refactor, the AI resolves the correct element from the live DOM using the intent, updates the locator cache, and the test continues — no human intervention required.

This is especially valuable for teams shipping AI-generated code, where UI changes are constant and locator maintenance quickly becomes unsustainable. Instead of a flaky Playwright test that breaks every time a component is renamed, you get a test that describes what the user wants to do and adapts automatically when the implementation changes.

The result: cause #7 drops from your flakiness report entirely, and your team's attention stays on the seven causes that actually require debugging.
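The general shape of the flow described above (try the cached locator, fall back to the intent, then refresh the cache) can be sketched in a few lines. This illustrates the pattern only and is not Shiplight's implementation, which this guide does not show. Every name here is hypothetical, and the `exists` and `resolveFromIntent` callbacks stand in for a real page and a real AI resolver.

```typescript
// A step pairs a human-readable intent with the last selector known to satisfy it.
interface CachedStep {
  intent: string;
  cachedSelector: string;
}

// Stand-ins for the live page and the intent resolver (both hypothetical).
type ElementExists = (selector: string) => boolean;
type ResolveFromIntent = (intent: string) => string | undefined;

interface StepResolution {
  selector: string;
  healed: boolean; // true when the cache had to be refreshed
}

function resolveStep(
  step: CachedStep,
  exists: ElementExists,
  resolveFromIntent: ResolveFromIntent,
): StepResolution {
  // Fast path: the cached selector still matches an element.
  if (exists(step.cachedSelector)) {
    return { selector: step.cachedSelector, healed: false };
  }
  // Slow path: derive a new selector from the intent and verify it.
  const fresh = resolveFromIntent(step.intent);
  if (fresh === undefined || !exists(fresh)) {
    throw new Error(`Could not resolve intent: ${step.intent}`);
  }
  // Refresh the cache so the next run takes the fast path again.
  step.cachedSelector = fresh;
  return { selector: fresh, healed: true };
}

// Simulate a refactor that renamed #submit-btn to #checkout-submit.
const liveSelectors = new Set<string>(['#checkout-submit']);
const checkoutStep: CachedStep = {
  intent: 'click the primary submit button on the checkout form',
  cachedSelector: '#submit-btn',
};
const firstRun = resolveStep(checkoutStep, (s) => liveSelectors.has(s), () => '#checkout-submit');
console.log(firstRun.healed, checkoutStep.cachedSelector); // true #checkout-submit
```

The design point is that the intent is the source of truth and the selector is only a cache entry: a genuine failure is reported when the intent itself can no longer be satisfied, not when a selector goes stale.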

What is self-healing test automation? · Self-healing vs manual maintenance

---

How to Triage Flaky Tests at Scale

If you have an existing suite with widespread flakiness, don't try to fix everything at once. Use this triage approach:

Step 1: Quarantine, don't delete

// Mark known-flaky tests with skip + tracking issue
test.skip('checkout flow — flaky, tracked in TICKET-123', async ({ page }) => {
  // ...
});

Deleting flaky tests removes coverage. Quarantine them while you fix the root cause.

Step 2: Add retries temporarily

// playwright.config.ts
export default {
  retries: process.env.CI ? 2 : 0,
};

Retries are a symptom management tool, not a fix. Use them to keep CI green while you identify root causes, then remove them once the underlying issue is fixed.

Step 3: Measure flakiness rate per test

Track which tests are most flaky. Playwright's built-in retry mechanism marks tests as flaky when they pass on retry — use this data to prioritize:

# Generate a JSON report to analyze flakiness
npx playwright test --reporter=json > results.json

For CI-specific reporter setup — including the github reporter that surfaces flaky test annotations directly in the PR diff — see E2E testing in GitHub Actions.
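Once `results.json` exists, the flaky entries can be pulled out with a short script. The sketch below assumes the report nests suites, specs, and tests the way Playwright's JSON reporter does, with a per-test `status` of `flaky`. The interfaces cover only the fields used here, so check them against the report your Playwright version produces. The `Pw`-prefixed type names and the `collectFlakySpecs` function are hypothetical.

```typescript
// Minimal subset of the JSON report shape (assumed; verify against your own report).
interface PwTest { status: 'expected' | 'unexpected' | 'flaky' | 'skipped'; }
interface PwSpec { title: string; tests: PwTest[]; }
interface PwSuite { title: string; specs?: PwSpec[]; suites?: PwSuite[]; }
interface PwJsonReport { suites: PwSuite[]; }

// Walks the nested suites and returns the full title path of every spec
// that has at least one test marked flaky.
function collectFlakySpecs(report: PwJsonReport): string[] {
  const flaky: string[] = [];
  const visit = (suite: PwSuite, path: string[]): void => {
    const here = [...path, suite.title];
    for (const spec of suite.specs ?? []) {
      if (spec.tests.some((t) => t.status === 'flaky')) {
        flaky.push([...here, spec.title].join(' > '));
      }
    }
    for (const child of suite.suites ?? []) {
      visit(child, here);
    }
  };
  for (const root of report.suites) {
    visit(root, []);
  }
  return flaky;
}

// A tiny hand-written report for illustration.
const sampleReport: PwJsonReport = {
  suites: [{
    title: 'checkout.spec.ts',
    specs: [
      { title: 'pays with card', tests: [{ status: 'flaky' }] },
      { title: 'shows cart', tests: [{ status: 'expected' }] },
    ],
    suites: [{
      title: 'guest',
      specs: [{ title: 'applies coupon', tests: [{ status: 'expected' }, { status: 'flaky' }] }],
    }],
  }],
};

console.log(collectFlakySpecs(sampleReport));
// [ 'checkout.spec.ts > pays with card', 'checkout.spec.ts > guest > applies coupon' ]
```

In practice the report would come from `JSON.parse` on the contents of `results.json`. Storing the output of each CI run gives you the per-test history needed to compute a flakiness rate.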

Step 4: Fix in order of frequency

Fix the 20% of tests causing 80% of flakiness. Common culprits: auth flows, tests hitting external APIs, tests with waitForTimeout.
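Picking that short list can be automated once you have flaky-run counts per test. The sketch below uses hypothetical names and made-up counts. It returns the smallest set of tests, taken from the top of the list, that accounts for a chosen share of all flaky runs.

```typescript
interface FlakeCount {
  test: string;
  flakyRuns: number; // how many recorded runs this test flaked in
}

// Sorts tests by flaky runs (descending) and keeps adding them until the
// selected tests cover at least `share` of all flaky runs.
function topOffenders(counts: FlakeCount[], share = 0.8): FlakeCount[] {
  const sorted = counts
    .filter((c) => c.flakyRuns > 0)
    .sort((a, b) => b.flakyRuns - a.flakyRuns);
  const total = sorted.reduce((sum, c) => sum + c.flakyRuns, 0);

  const picked: FlakeCount[] = [];
  let covered = 0;
  for (const c of sorted) {
    if (covered >= share * total) break;
    picked.push(c);
    covered += c.flakyRuns;
  }
  return picked;
}

// Illustrative counts only, not measured data.
const history: FlakeCount[] = [
  { test: 'search', flakyRuns: 12 },
  { test: 'login', flakyRuns: 40 },
  { test: 'about', flakyRuns: 5 },
  { test: 'checkout', flakyRuns: 30 },
  { test: 'profile', flakyRuns: 8 },
  { test: 'settings', flakyRuns: 5 },
];

console.log(topOffenders(history).map((c) => c.test));
// [ 'login', 'checkout', 'search' ]
```

In this made-up history, three of six tests account for 82 of the 100 flaky runs, so fixing those three first removes most of the noise.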

---

Preventing Flakiness in New Tests

Build these habits into test authoring:

Never use waitForTimeout — always wait for a condition, not a duration
Always use semantic selectors — role, label, testid, text — never CSS classes or nth-child
Create isolated test data per test, clean up after
Test one thing per test — smaller tests are easier to debug when they fail
Run tests locally with --headed before committing — see what the test actually does

---

Building a Flake-Free Culture

Tooling alone won't create a reliable test suite. The most important cultural shift is treating flaky tests as real defects, not acceptable nuisances. A flaky test is a bug in your test suite. It deserves the same attention as a production bug: triage, root cause analysis, and a permanent fix — not a retry loop that hides the problem.

Concrete practices that distinguish teams with trustworthy CI from teams without:

Fix the category, not the instance. Adding a single waitForTimeout to one flaky test is the wrong move. Fix the underlying pattern — switch to condition-based waits systematically, isolate state systematically, mock external dependencies systematically. One fix per category is worth dozens of per-test patches.
Track flakiness rate per test, not just pass/fail. A test that passes 95/100 runs is flaky. Measure and surface this data so the team can prioritize which flakes matter most. See turning flaky tests into actionable signal.
Own the fix, don't pass it on. The person whose change introduced the flake owns the fix. No "it's probably the test framework's fault" deflection. If the test is legitimately wrong, fix it. If the application is wrong, fix that.
No retries in CI for merging. Retries can mask real bugs. If you must retry, do it in a separate monitoring lane — not in the PR gate.
Celebrate red → green. When an engineer fixes a category of flakiness, surface it in team updates. Teams optimize for what leadership notices.

Teams that maintain this standard consistently have test suites engineers trust — and test suites engineers trust actually catch regressions. This is the only durable way to preserve the quality signal CI is supposed to provide.

---

FAQ: Fixing Flaky E2E Tests

What causes test flakiness in UI automation suites?

Test flakiness in UI automation suites comes from eight root causes of non-determinism: (1) timing / race conditions — asserting before the page, network, or animation finishes; (2) brittle selectors bound to CSS classes or DOM structure that change; (3) shared test state bleeding between runs; (4) environment instability (CI differs from local); (5) animation interference; (6) parallelism conflicts when workers share data; (7) UI changes / locator drift after refactors; (8) resource leaks that accumulate across the suite. UI suites are especially flake-prone because they sit at the top of the stack — every layer beneath (network, render, animation, third-party widget) can introduce timing variance. Retries mask these symptoms; only addressing the specific root cause makes a test reliable. For cause #7 (the dominant source in fast-changing UIs), Shiplight's intent-based self-healing resolves the element semantically instead of breaking.

Why are UI automation tests more flaky than unit or API tests?

UI tests are the most flake-prone layer because they depend on the most moving parts: real browser rendering timing, network latency, animations, third-party widgets, and DOM structure that AI-driven and human refactors change frequently. Unit tests run in-process with no I/O; API tests have a stable contract; UI tests must wait for asynchronous rendering and bind to a visual structure that is unstable by nature. This is why the test pyramid keeps UI/E2E tests fewest — and why self-healing and intent-based resolution matter most at this layer. See what is software testing for the pyramid context.

How do I identify which tests are flaky?

Three signals: (1) tests that fail and then pass on an automatic retry in CI, which Playwright marks as flaky in its JSON report when retries are enabled; (2) tests that consistently appear in your retry log; (3) tests that pass with --workers=1 but fail with parallelism. Run npx playwright test --reporter=json > results.json and filter for "status": "flaky" entries, then aggregate those entries across several CI runs to rank tests by how often they flake.

What's the difference between a flaky test and a broken test?

A broken test fails consistently on broken code — it's doing its job. A flaky test fails intermittently on working code — it's a reliability problem in the test itself. The fix for a broken test is to fix the code or update the test to match new behavior. The fix for a flaky test is to address the instability in the test.

Should I use retries to fix flaky tests?

Only as a temporary measure. Retries mask the root cause and slow down your CI pipeline. If a test needs 3 retries to pass, it's not a reliable test — it's a slow coin flip. Fix the underlying cause and remove the retries.

How many flaky tests are acceptable?

The Google Testing Blog recommends a target of 0.1% or lower flakiness per test run. In practice, teams tolerate up to 1–2% before it meaningfully impacts developer trust. Above 5%, teams stop relying on CI results.

My tests pass locally but fail in CI — why?

Most common causes: slower CI runners (increase timeouts), missing environment variables, services not fully started (add health check), or external API rate limits (add mocks). Running the suite locally with CI=true exercises the same config branches (timeouts, retries) that CI uses, though it does not reproduce the runner's hardware or network.

What's the fastest way to reduce flakiness today?

Add retries: 2 in CI to stop the bleeding
Replace all waitForTimeout calls with proper waits
Migrate selectors from CSS classes to getByRole / getByTestId
Isolate test data so tests don't share state

For teams with chronic flakiness from UI changes (cause #7), Shiplight eliminates the entire category automatically. Its intent-based self-healing means tests survive CSS renames, refactors, and component migrations without manual updates — no selector maintenance required. See what is self-healing test automation?

---

Related Reading

Maintainable E2E playbook — prevent the flakiness problem upstream, not just fix it downstream
Postmortem-driven E2E testing — turn every production incident into a permanent regression test
Flaky tests to actionable signal — measure and prioritize flakiness systematically
Mitigate test flakiness: strategies for fast-paced teams — the flake-budget, quarantine, and ownership strategy this root-cause work sits inside
Best tools to fight flaky tests in CI/CD pipelines — the tools-by-category landscape (Harness, Trunk, Datadog CI Visibility, self-healing)
Self-healing vs manual maintenance — why intent-based healing eliminates the locator-drift cause of flakiness

Key Takeaways

Retries hide flakiness, they don't fix it — treat them as a temporary measure, track root cause
Timing issues are the #1 cause — replace waitForTimeout with condition-based waits
Selectors should reflect user intent — role, label, testid; never CSS class or DOM position
Test isolation is non-negotiable — shared state between tests is a reliability time bomb
UI changes cause chronic flakiness — Shiplight's self-healing resolves elements by intent, not cached selectors, eliminating this entire category

Stop fixing broken selectors. Shiplight Plugin adds intent-based self-healing on top of your existing Playwright tests — free, no account required. · Book a demo

References: Playwright documentation, Google Testing Blog, GitHub Actions documentation
