---
title: "How to Fix Flaky Tests: Causes and Permanent Fixes"
excerpt: "Flaky tests erode trust in your entire test suite. Teams start ignoring red CI, skipping tests, or disabling them entirely — until regressions reach production. This guide covers the root causes of test flakiness and how to fix each one permanently."
metaDescription: "Fix flaky tests permanently. Covers the 7 root causes — timing, selectors, state, environment — with specific code fixes for each. Stop ignoring red CI."
publishedAt: 2026-04-07
updatedAt: 2026-04-07
author: Shiplight AI Team
categories:
 - Guides
 - Engineering
tags:
 - flaky-tests
 - e2e-testing
 - test-automation
 - test-maintenance
 - playwright
 - self-healing-tests
 - ci-cd
metaTitle: "How to Fix Flaky Tests: Root Causes and Permanent Fixes"
featuredImage: ./cover.png
featuredImageAlt: "Developer fixing flaky Playwright tests — red CI failures resolved into stable green test results"
---

A flaky test — sometimes called an intermittent or non-deterministic test — is one that sometimes passes and sometimes fails on the same code, with no changes. Flaky tests are the most corrosive problem in a test suite because they turn your CI from a quality signal into noise.

Teams respond to flaky tests in predictable ways: first they rerun them, then they add retries, then they quarantine them, then they just stop looking at red CI. By the time a real regression ships, no one trusts the tests enough to catch it.

This guide covers the 7 root causes of flaky E2E tests and how to fix each one permanently — not with retries that hide the problem, but with changes that make the test reliable. For teams where cause #7 (UI changes breaking locators) is the dominant source of flakiness, [Shiplight's self-healing layer](/blog/what-is-self-healing-test-automation) eliminates it automatically.

## Quick Reference: 7 Causes of Flaky Tests

| # | Root Cause | Primary Symptom | Fix |
|---|-----------|----------------|-----|
| 1 | Timing / race conditions | "element not found" on CI | Replace `waitForTimeout` with condition-based waits |
| 2 | Brittle selectors | Breaks on CSS rename | Use `getByRole`, `getByTestId`, `getByLabel` |
| 3 | Shared test state | Fails in parallel, passes solo | Isolate data per test, reset state in `afterEach` |
| 4 | Environment instability | CI fails, local passes | Health checks, mock external APIs, raise timeouts |
| 5 | Animation interference | Random assertion failures | `reducedMotion: 'reduce'` in Playwright config |
| 6 | Parallelism conflicts | Fails with `--workers > 1` | Scope data to `workerIndex` |
| 7 | UI changes / locator drift | Breaks after refactors | [Shiplight self-healing](/plugins) or semantic selectors |

---

## Why Flaky Tests Are Worse Than No Tests

A test suite with 20% flakiness is worse than a smaller, reliable suite. Here's why:

- **False positives**: CI fails on green code — developers learn to ignore it
- **Investigation overhead**: every failure requires triage to determine if it's real
- **Trust erosion**: once trust breaks, it doesn't come back without deliberate effort
- **Coverage rot**: flaky tests get disabled, leaving real gaps behind

The [Google Testing Blog](https://testing.googleblog.com) has documented that even 1% flakiness in a large suite creates enough noise to meaningfully slow down development. At 10%+, teams functionally stop relying on CI.

## The 7 Root Causes of Flaky E2E Tests

### 1. Timing and Race Conditions

**Symptom:** Test fails with "element not found" or "timeout" — sometimes. Usually on CI, rarely locally.

**Root cause:** The test clicks or asserts before the page, network request, or animation has finished.

**What not to do:**
```js
// Don't add arbitrary sleeps — they're fragile and slow
await page.waitForTimeout(2000);
await page.click('#submit-btn');
```

**Fix:** Use explicit waits that respond to actual application state:

```js
// Wait for the element to be visible and enabled
await page.waitForSelector('#submit-btn', { state: 'visible' });
await page.click('#submit-btn');

// Wait for network to settle after an action
// (coarse — prefer waiting for a specific response where possible)
await page.click('#submit-btn');
await page.waitForLoadState('networkidle');

// Wait for a specific response
const [response] = await Promise.all([
  page.waitForResponse(r => r.url().includes('/api/submit') && r.status() === 200),
  page.click('#submit-btn'),
]);

// Wait for navigation
await Promise.all([
  page.waitForURL('**/dashboard'),
  page.click('#login-btn'),
]);
```

CI runners are slower than developer machines — timeouts that work locally fail in CI. Set explicit timeouts in your Playwright config:

```js
// playwright.config.ts
export default {
  timeout: 30000,           // per test timeout
  expect: { timeout: 10000 }, // per assertion timeout
  use: {
    actionTimeout: 10000,   // per action timeout
  },
};
```

---

### 2. Brittle Selectors

**Symptom:** Test breaks after a UI change that didn't change behavior — a CSS class rename, DOM restructure, or component migration.

**Root cause:** The test is coupled to implementation details (CSS classes, IDs, DOM structure) rather than user-visible behavior.

**Fragile selectors:**
```js
// ❌ Breaks when class name changes
await page.click('.btn-primary-v2-active');

// ❌ Breaks when DOM restructures
await page.click('div > div:nth-child(3) > button');

// ❌ Breaks when internal ID changes
await page.click('#internal-submit-14');
```

**Resilient selectors (in order of preference):**
```js
// ✅ User-visible text — stable across refactors
await page.click('button:has-text("Sign In")');

// ✅ ARIA role + name — semantic and accessible
await page.getByRole('button', { name: 'Sign In' }).click();

// ✅ Test ID — explicit contract between test and dev
await page.getByTestId('submit-button').click();

// ✅ Label association — works for form inputs
await page.getByLabel('Email address').fill('user@example.com');

// ✅ Placeholder — for unlabeled inputs
await page.getByPlaceholder('Search...').fill('query');
```

Add `data-testid` attributes to key interactive elements as a team convention. This creates an explicit contract: devs know which elements tests depend on, and changes are deliberate.
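By default `getByTestId` looks for `data-testid`. If your codebase already standardized on a different attribute, Playwright can be pointed at it via the `testIdAttribute` option — the `data-qa` name below is just an example:

```js
// playwright.config.ts
export default {
  use: {
    // Make getByTestId match your existing attribute convention
    testIdAttribute: 'data-qa',
  },
};
```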

The deeper fix is to treat locators as a cache of user intent, not as the source of truth. Shiplight's [intent-cache-heal pattern](/blog/intent-cache-heal-pattern) implements this systematically — when a locator breaks, the test resolves the correct element from its intent description rather than failing.

---

### 3. Shared or Leaked Test State

**Symptom:** Tests pass in isolation but fail when run together. Order-dependent failures. "Works on my machine" with a specific test order.

**Root cause:** Tests share state — database records, cookies, localStorage, or server-side session data — that bleeds between runs.

**Fix:** Make every test self-contained:

```js
// ✅ Create isolated test data per test
test.beforeEach(async ({ page }) => {
  // Create a fresh user for this test
  const user = await createTestUser({ role: 'admin' });
  await loginAs(page, user);
});

test.afterEach(async () => {
  // Clean up test data
  await cleanupTestUsers();
});
```

For browser state (cookies, localStorage):
```js
// playwright.config.ts
export default {
  use: {
    // Playwright already gives each test a fresh context — leave storageState
    // unset so no saved cookies or localStorage leak between tests
    storageState: undefined,
  },
};
```

For auth state, use Playwright's `storageState` to save a logged-in session once and reuse it — avoiding repeated login steps while still isolating test data:

```js
// global-setup.ts
import { chromium, type FullConfig } from '@playwright/test';

async function globalSetup(config: FullConfig) {
  // Global setup doesn't inherit baseURL from `use` — read it from the resolved config
  const { baseURL } = config.projects[0].use;
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(`${baseURL}/login`);
  await page.fill('[name=email]', process.env.TEST_USER_EMAIL!);
  await page.fill('[name=password]', process.env.TEST_USER_PASSWORD!);
  await page.click('button[type=submit]');
  await page.waitForURL('**/dashboard');
  await page.context().storageState({ path: 'auth.json' });
  await browser.close();
}

export default globalSetup;
```
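The saved `auth.json` then gets wired back into the config so every test starts logged in, each in its own fresh context. A minimal sketch — the file paths here match the setup script above; adjust them to your project:

```js
// playwright.config.ts
export default {
  globalSetup: './global-setup',
  use: {
    // Reuse the saved session instead of logging in per test
    storageState: 'auth.json',
  },
};
```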

See [stable auth and email E2E tests](/blog/stable-auth-email-e2e-tests) for handling authentication flows specifically.

---

### 4. Environment and Network Instability

**Symptom:** Tests fail on CI but not locally. Errors involve timeouts, connection refused, or service unavailability.

**Root cause:** CI environment differs from local — different network latency, services not fully started, environment variables missing, or third-party API rate limits.

**Fix:**

**Health check before tests:**
```js
// global-setup.ts
async function globalSetup() {
  const maxRetries = 10;
  for (let i = 0; i < maxRetries; i++) {
    try {
      const res = await fetch(process.env.BASE_URL + '/health');
      if (res.ok) return; // app is up
    } catch {
      // app not accepting connections yet
    }
    await new Promise(r => setTimeout(r, 2000)); // wait before retrying
  }
  throw new Error('App did not start');
}

export default globalSetup;
```

**Mock external services** that are unreliable or rate-limited in CI:
```js
// Mock Stripe, SendGrid, or other third-party APIs in tests
await page.route('**/api.stripe.com/**', route =>
  route.fulfill({ status: 200, body: JSON.stringify({ status: 'succeeded' }) })
);
```

**Increase timeouts for CI** while keeping local tests fast:
```js
// playwright.config.ts
export default {
  timeout: process.env.CI ? 45000 : 15000,
};
```

---

### 5. Animation and Transition Interference

**Symptom:** The test clicks an element that's animating in or out, so the click misses or hits the wrong target. Assertions fail because the element is mid-transition.

**Root cause:** CSS transitions and animations run asynchronously and can interfere with element interaction timing.

**Fix:** Disable animations in test environments:

```js
// playwright.config.ts
export default {
  use: {
    // Emulate prefers-reduced-motion — disables animations your CSS
    // gates behind that media query
    reducedMotion: 'reduce',
  },
};
```

Or inject a global CSS override in test setup:
```js
test.beforeEach(async ({ page }) => {
  await page.addStyleTag({
    content: `*, *::before, *::after { 
      animation-duration: 0ms !important; 
      transition-duration: 0ms !important; 
    }`,
  });
});
```

---

### 6. Test Runner Parallelism Conflicts

**Symptom:** Tests pass when run sequentially (`--workers=1`) but fail with parallel execution.

**Root cause:** Parallel tests competing for the same resource — same test user account, same database record, same port.

**Fix:**

Use unique data per parallel worker:
```js
// Use worker index to isolate data
test('create item', async ({ page }, testInfo) => {
  const userId = `test-user-${testInfo.workerIndex}`;
  // Each worker uses its own user, no conflicts
});
```
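One lightweight way to apply this everywhere is a small helper that bakes the worker index into every resource name. This helper and the `RUN_ID` variable are illustrative, not part of Playwright — the run id keeps parallel CI jobs that share a database from colliding too:

```js
// Build a collision-free identifier for test data owned by one worker.
function workerResource(base, workerIndex, runId = process.env.RUN_ID ?? 'local') {
  return `${base}-${runId}-w${workerIndex}`;
}

// In a Playwright test it might be used like:
//   test('create item', async ({ page }, testInfo) => {
//     const email = `${workerResource('user', testInfo.workerIndex)}@example.test`;
//   });
```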

Limit concurrency for tests that genuinely can't parallelize. Note that `workers` is a top-level config option, not something `use` accepts — for a handful of order-sensitive tests, mark the spec itself as serial:

```js
// serial.spec.ts — run these tests in order, in a single worker
import { test } from '@playwright/test';

test.describe.configure({ mode: 'serial' });
```

---

### 7. UI Changes Breaking Locators (The Self-Healing Problem)

**Symptom:** Tests break after normal product development — a component refactor, CSS rename, or layout change — with no behavior change. This is the single largest driver of "tests as a maintenance burden."

**Root cause:** Tests are coupled to implementation details rather than user intent. Every locator-based test (`#submit-btn`, `.btn-primary`, `div:nth-child(3)`) is a bet that the DOM won't change. That bet loses constantly in teams shipping fast.

**Short-term fix:** Migrate to semantic selectors (see Cause #2). Add `data-testid` attributes to critical elements.

**Systematic fix with Shiplight:** Shiplight's [intent-cache-heal pattern](/blog/intent-cache-heal-pattern) eliminates this entire class of flakiness. Instead of maintaining a list of fallback selectors, Shiplight stores the *semantic intent* of each test step — for example, "click the primary submit button on the checkout form." When a locator breaks, Shiplight's AI resolves the correct element from the live DOM using that intent, not a cached CSS selector.

The result: tests survive CSS renames, component refactors, and layout changes that would break traditional locator-based healers — without any manual selector updates.

```yaml
# Shiplight YAML test — intent survives UI changes
goal: Verify checkout flow
statements:
  - intent: Add item to cart
  - intent: Proceed to checkout
  - intent: Fill in shipping details
  - VERIFY: order confirmation is displayed
```

When a button moves or gets renamed, Shiplight heals the step automatically. The developer who renamed the button doesn't need to update a single test file.

See [self-healing test automation](/blog/what-is-self-healing-test-automation) and [self-healing vs manual maintenance](/blog/self-healing-vs-manual-maintenance) for how this works in production suites.

---

## How to Triage Flaky Tests at Scale

If you have an existing suite with widespread flakiness, don't try to fix everything at once. Use this triage approach:

### Step 1: Quarantine, don't delete

```js
// Mark known-flaky tests with skip + tracking issue
test.skip('checkout flow — flaky, tracked in TICKET-123', async ({ page }) => {
  // ...
});
```

Deleting flaky tests removes coverage. Quarantine them while you fix the root cause.

### Step 2: Add retries temporarily

```js
// playwright.config.ts
export default {
  retries: process.env.CI ? 2 : 0,
};
```

Retries are a symptom management tool, not a fix. Use them to keep CI green while you identify root causes, then remove them once the underlying issue is fixed.
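Rather than enabling retries suite-wide, you can also scope them to just the files still under investigation using `test.describe.configure` — the filename below is illustrative:

```js
// flaky-checkout.spec.ts — retries only for this file while the fix is in progress
import { test } from '@playwright/test';

test.describe.configure({ retries: 2 });
```

This keeps the rest of the suite honest: a genuinely stable test still fails loudly on the first red run.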

### Step 3: Measure flakiness rate per test

Track which tests are most flaky. Playwright's built-in retry mechanism marks tests as `flaky` when they pass on retry — use this data to prioritize:

```bash
# Generate a JSON report to analyze flakiness
npx playwright test --reporter=json > results.json
```
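A small Node helper can then rank the flaky entries. This is a sketch — the `suites`/`specs`/`tests` nesting and the `"flaky"` status value follow the JSON reporter's output, but verify the shape against your Playwright version before relying on it:

```js
// Walk a Playwright JSON report and collect the titles of flaky specs.
function collectFlaky(node, out = []) {
  for (const suite of node.suites ?? []) collectFlaky(suite, out);
  for (const spec of node.specs ?? []) {
    if ((spec.tests ?? []).some(t => t.status === 'flaky')) out.push(spec.title);
  }
  return out;
}

// Usage:
//   const report = JSON.parse(require('fs').readFileSync('results.json', 'utf8'));
//   console.log(collectFlaky(report));
```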

For CI-specific reporter setup — including the `github` reporter that surfaces flaky test annotations directly in the PR diff — see [E2E testing in GitHub Actions](/blog/github-actions-e2e-testing).

### Step 4: Fix in order of frequency

Fix the 20% of tests causing 80% of flakiness. Common culprits: auth flows, tests hitting external APIs, tests with `waitForTimeout`.

---

## Preventing Flakiness in New Tests

Build these habits into test authoring:

- **Never use `waitForTimeout`** — always wait for a condition, not a duration
- **Always use semantic selectors** — role, label, testid, text — never CSS classes or nth-child
- **Create isolated test data** per test, clean up after
- **Test one thing per test** — smaller tests are easier to debug when they fail
- **Run tests locally with `--headed`** before committing — see what the test actually does

---

## FAQ

### How do I identify which tests are flaky?

Three signals: (1) tests that pass on manual rerun after failing in CI — Playwright marks these as `flaky` in its JSON report; (2) tests that consistently appear in your retry log; (3) tests that pass with `--workers=1` but fail with parallelism. Run `npx playwright test --reporter=json > results.json` and filter for `"status": "flaky"` entries to get a ranked list by frequency.

### What's the difference between a flaky test and a broken test?

A broken test fails consistently on broken code — it's doing its job. A flaky test fails intermittently on working code — it's a reliability problem in the test itself. The fix for a broken test is to fix the code or update the test to match new behavior. The fix for a flaky test is to address the instability in the test.

### Should I use retries to fix flaky tests?

Only as a temporary measure. Retries mask the root cause and slow down your CI pipeline. If a test needs 3 retries to pass, it's not a reliable test — it's a slow coin flip. Fix the underlying cause and remove the retries.

### How many flaky tests are acceptable?

The [Google Testing Blog](https://testing.googleblog.com) recommends a target of 0.1% or lower flakiness per test run. In practice, teams tolerate up to 1–2% before it meaningfully impacts developer trust. Above 5%, teams stop relying on CI results.

### My tests pass locally but fail in CI — why?

Most common causes: slower CI runners (increase timeouts), missing environment variables, services not fully started (add health check), or external API rate limits (add mocks). Run CI tests with `CI=true` locally to replicate the environment.

### What's the fastest way to reduce flakiness today?

1. Add `retries: 2` in CI to stop the bleeding
2. Replace all `waitForTimeout` calls with proper waits
3. Migrate selectors from CSS classes to `getByRole` / `getByTestId`
4. Isolate test data so tests don't share state

For teams with chronic flakiness from UI changes (cause #7), [Shiplight](/plugins) eliminates the entire category automatically. Its intent-based self-healing means tests survive CSS renames, refactors, and component migrations without manual updates — no selector maintenance required. See [what is self-healing test automation?](/blog/what-is-self-healing-test-automation)

---

## Key Takeaways

- **Retries hide flakiness, they don't fix it** — treat them as a temporary measure, track root cause
- **Timing issues are the #1 cause** — replace `waitForTimeout` with condition-based waits
- **Selectors should reflect user intent** — role, label, testid; never CSS class or DOM position
- **Test isolation is non-negotiable** — shared state between tests is a reliability time bomb
- **UI changes cause chronic flakiness** — Shiplight's self-healing resolves elements by intent, not cached selectors, eliminating this entire category

Related: [turning flaky tests into actionable signal](/blog/flaky-tests-to-actionable-signal) · [E2E testing in GitHub Actions](/blog/github-actions-e2e-testing) · [self-healing vs manual maintenance](/blog/self-healing-vs-manual-maintenance) · [intent-cache-heal pattern](/blog/intent-cache-heal-pattern)

**Stop fixing broken selectors.** [Shiplight Plugin](/plugins) adds intent-based self-healing on top of your existing Playwright tests — free, no account required. · [Book a demo](/demo)

References: [Playwright documentation](https://playwright.dev/docs/test-timeouts), [Google Testing Blog](https://testing.googleblog.com), [GitHub Actions documentation](https://docs.github.com/en/actions)
