How to Fix Flaky E2E Tests: Root Causes and Permanent Fixes
Shiplight AI Team
Updated on May 20, 2026

To fix flaky tests permanently, address the 8 root causes of non-determinism: timing/race conditions, brittle selectors, shared test state, environment instability, animation interference, parallelism conflicts, UI changes breaking locators, and improper resource management. Each has a specific fix — retries mask the symptom, not the cause.
---
A flaky test, sometimes called an intermittent or non-deterministic test, is one that passes on some runs and fails on others against the same code, with no changes. Flaky tests are the most corrosive problem in a test suite because they turn CI from a quality signal into noise.
Teams respond to flaky tests in predictable ways: first they rerun them, then they add retries, then they quarantine them, then they just stop looking at red CI. By the time a real regression ships, no one trusts the tests enough to catch it.
This guide covers the 8 root causes of flaky E2E tests and how to fix each one permanently — not with retries that hide the problem, but with changes that make the test reliable. For teams where cause #7 (UI changes breaking locators) is the dominant source of flakiness, Shiplight's self-healing layer eliminates it automatically.
| # | Root Cause | Primary Symptom | Fix |
|---|---|---|---|
| 1 | Timing / race conditions | "element not found" on CI | Replace waitForTimeout with condition-based waits |
| 2 | Brittle selectors | Breaks on CSS rename | Use getByRole, getByTestId, getByLabel |
| 3 | Shared test state | Fails in parallel, passes solo | Isolate data per test, reset state in afterEach |
| 4 | Environment instability | CI fails, local passes | Health checks, mock external APIs, raise timeouts |
| 5 | Animation interference | Random assertion failures | reducedMotion: 'reduce' in Playwright config |
| 6 | Parallelism conflicts | Fails with --workers > 1 | Scope data to workerIndex |
| 7 | UI changes / locator drift | Breaks after refactors | Shiplight self-healing or semantic selectors |
| 8 | Resource leaks | Flakiness increases over time / across suites | Add teardown for files, DB rows, containers, external resources |
---
Flaky tests are a silent tax on every automated testing program. A test suite with 20% flakiness is worse than a smaller, reliable suite. The obvious cost is time spent re-running CI and investigating false failures — but that's not the deepest problem.
The deeper problem is signal degradation. When CI is red often enough, engineers stop treating red as meaningful. They learn to retry first, investigate later, and merge if the retry passes. A culture of "it's probably flaky" means real failures get ignored too. Regressions merge. Bugs ship.
Specific consequences:
Once a team has learned to distrust their test suite, restoring that trust takes significantly more work than fixing the underlying flakiness would have. This is why flakiness should be treated as a defect, not a nuisance.
The Google Testing Blog has documented that even 1% flakiness in a large suite creates enough noise to meaningfully slow down development. At 10%+, teams functionally stop relying on CI.
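A rough calculation shows why small per-test flake rates add up in a large suite. The sketch below assumes every test flakes independently with the same probability, which real suites only approximate, and the function name is purely illustrative.

```typescript
// Probability that an entire suite run is green when each test passes
// independently with probability (1 - flakeRate).
// Independence and a uniform flake rate are simplifying assumptions.
function greenRunProbability(testCount: number, flakeRate: number): number {
  return Math.pow(1 - flakeRate, testCount);
}

// Compare a few per-test flake rates for a 100-test suite.
for (const rate of [0.001, 0.01, 0.1]) {
  const p = greenRunProbability(100, rate);
  console.log(`flake rate ${rate}: ${(p * 100).toFixed(1)}% of runs fully green`);
}
```

Under these assumptions, 100 tests that each flake 1% of the time produce a fully green run only about 37% of the time, which is why a rate that sounds small per test still dominates CI results.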
Symptoms:
Root cause: The test clicks or asserts before the page, network request, or animation has finished.
What not to do:
// Don't add arbitrary sleeps — they're fragile and slow
await page.waitForTimeout(2000);
await page.click('#submit-btn');
Fix: Use explicit waits that respond to actual application state:
// Wait for the element to be visible and enabled
await page.waitForSelector('#submit-btn', { state: 'visible' });
await page.click('#submit-btn');
// Wait for network to settle after an action (Playwright's docs discourage networkidle; prefer waiting on a specific response or visible state)
await page.click('#submit-btn');
await page.waitForLoadState('networkidle');
// Wait for a specific response
const [response] = await Promise.all([
page.waitForResponse(r => r.url().includes('/api/submit') && r.status() === 200),
page.click('#submit-btn'),
]);
// Wait for navigation
await Promise.all([
page.waitForURL('**/dashboard'),
page.click('#login-btn'),
]);
CI runners are slower than developer machines, so timeouts that work locally can fail in CI. Set explicit timeouts in your Playwright config:
// playwright.config.ts
export default {
timeout: 30000, // per test timeout
expect: { timeout: 10000 }, // per assertion timeout
use: {
actionTimeout: 10000, // per action timeout
},
};
---
Symptoms:
Root cause: The test is coupled to implementation details (CSS classes, IDs, DOM structure) rather than user-visible behavior.
Fragile selectors:
// ❌ Breaks when class name changes
await page.click('.btn-primary-v2-active');
// ❌ Breaks when DOM restructures
await page.click('div > div:nth-child(3) > button');
// ❌ Breaks when internal ID changes
await page.click('#internal-submit-14');
Resilient selectors (in order of preference):
// ✅ User-visible text — stable across refactors
await page.click('button:has-text("Sign In")');
// ✅ ARIA role + name — semantic and accessible
await page.getByRole('button', { name: 'Sign In' }).click();
// ✅ Test ID — explicit contract between test and dev
await page.getByTestId('submit-button').click();
// ✅ Label association — works for form inputs
await page.getByLabel('Email address').fill('user@example.com');
// ✅ Placeholder — for unlabeled inputs
await page.getByPlaceholder('Search...').fill('query');
Add data-testid attributes to key interactive elements as a team convention. This creates an explicit contract: devs know which elements tests depend on, and changes are deliberate.
The deeper fix is to treat locators as a cache of user intent, not as the source of truth. Shiplight's intent-cache-heal pattern implements this systematically — when a locator breaks, the test resolves the correct element from its intent description rather than failing.
---
Symptoms:
- Tests fail under --shard or parallel execution
Root cause: Tests share state (database records, cookies, localStorage, or server-side session data) that bleeds between runs.
Fix: Make every test self-contained:
// ✅ Create isolated test data per test
test.beforeEach(async ({ page }) => {
// Create a fresh user for this test
const user = await createTestUser({ role: 'admin' });
await loginAs(page, user);
});
test.afterEach(async () => {
// Clean up test data
await cleanupTestUsers();
});
For browser state (cookies, localStorage):
// playwright.config.ts
export default {
use: {
// Start every test in a fresh browser context
storageState: undefined,
},
};
For auth state, use Playwright's storageState to save a logged-in session once and reuse it, avoiding repeated login steps while still isolating test data:
// global-setup.ts
import { chromium } from '@playwright/test';
async function globalSetup() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // A page created outside the test runner has no baseURL, so build absolute URLs
  const baseURL = process.env.BASE_URL!;
  await page.goto(`${baseURL}/login`);
  await page.fill('[name=email]', process.env.TEST_USER_EMAIL!);
  await page.fill('[name=password]', process.env.TEST_USER_PASSWORD!);
  await page.click('button[type=submit]');
  await page.waitForURL('**/dashboard');
  // Persist cookies and localStorage for reuse by the tests
  await page.context().storageState({ path: 'auth.json' });
  await browser.close();
}
export default globalSetup;
See stable auth and email E2E tests for handling authentication flows specifically.
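Saving auth.json is only half of the pattern; the tests also have to load it. A minimal config sketch, assuming the setup file sits next to the config as global-setup.ts and writes auth.json to the working directory (both paths are assumptions to adapt to your project):

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Runs once before the suite and is expected to write auth.json
  globalSetup: './global-setup',
  use: {
    // Each test's browser context starts with the saved cookies and localStorage
    storageState: 'auth.json',
  },
});
```

Each test still gets its own browser context, so the saved session is copied in rather than shared; server-side data created under that account is still shared and needs the isolation described above.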
---
Symptom: Tests fail on CI but not locally. Errors involve timeouts, connection refused, or service unavailability.
Root cause: CI environment differs from local — different network latency, services not fully started, environment variables missing, or third-party API rate limits.
Fix:
Health check before tests:
// global-setup.ts
async function globalSetup() {
  const maxRetries = 10;
  for (let i = 0; i < maxRetries; i++) {
    try {
      const res = await fetch(process.env.BASE_URL + '/health');
      if (res.ok) return; // app is up
    } catch {
      // Connection refused: the app is still starting
    }
    // Back off after both network errors and non-2xx responses
    await new Promise(r => setTimeout(r, 2000));
  }
  throw new Error('App did not start');
}
export default globalSetup;
Mock external services that are unreliable or rate-limited in CI:
// Mock Stripe, SendGrid, or other third-party APIs in tests
await page.route('**/api.stripe.com/**', route =>
route.fulfill({ status: 200, body: JSON.stringify({ status: 'succeeded' }) })
);
Increase timeouts for CI while keeping local tests fast:
// playwright.config.ts
export default {
timeout: process.env.CI ? 45000 : 15000,
};
---
Symptom: Test clicks an element that's animating in/out and gets wrong behavior. Assertion fails because element is mid-transition.
Root cause: CSS transitions and animations run asynchronously and can interfere with element interaction timing.
Fix: Disable animations in test environments:
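// playwright.config.ts
export default {
  use: {
    // Emulate prefers-reduced-motion via the browser context options.
    // This only stops animations that the app's CSS gates on that media query.
    contextOptions: { reducedMotion: 'reduce' },
  },
};
Or inject a global CSS override in test setup: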
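test.beforeEach(async ({ page }) => {
  // addStyleTag only affects the current document and is lost on navigation,
  // so register an init script that re-injects the override on every page load
  await page.addInitScript(() => {
    const style = document.createElement('style');
    style.textContent = `*, *::before, *::after {
      animation-duration: 0ms !important;
      transition-duration: 0ms !important;
    }`;
    document.addEventListener('DOMContentLoaded', () => document.head.appendChild(style));
  });
});
---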
Symptom: Tests pass when run sequentially (--workers=1) but fail with parallel execution.
Root cause: Parallel tests competing for the same resource — same test user account, same database record, same port.
Fix:
Use unique data per parallel worker:
// Use worker index to isolate data
test('create item', async ({ page }, testInfo) => {
const userId = `test-user-${testInfo.workerIndex}`;
// Each worker uses its own user, no conflicts
});
Limit concurrency for tests that genuinely can't parallelize:
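// playwright.config.ts
export default {
  projects: [
    {
      name: 'sequential-tests',
      testMatch: /serial\.spec\.ts/,
      // workers is not a `use` option; recent Playwright versions accept it per project.
      // On older versions, run this project on its own with --workers=1.
      workers: 1,
    },
  ],
};
---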
Symptom: Tests break after normal product development — a component refactor, CSS rename, or layout change — with no behavior change. This is the single largest driver of "tests as a maintenance burden."
Root cause: Tests are coupled to implementation details rather than user intent. Every locator-based test (#submit-btn, .btn-primary, div:nth-child(3)) is a bet that the DOM won't change. That bet loses constantly in teams shipping fast.
Short-term fix: Migrate to semantic selectors (see Cause #2). Add data-testid attributes to critical elements.
Systematic fix with Shiplight: Shiplight's intent-cache-heal pattern eliminates this entire class of flakiness. Instead of maintaining a list of fallback selectors, Shiplight stores the semantic intent of each test step — for example, "click the primary submit button on the checkout form." When a locator breaks, Shiplight's AI resolves the correct element from the live DOM using that intent, not a cached CSS selector.
The result: tests survive CSS renames, component refactors, and layout changes that would break traditional locator-based healers — without any manual selector updates.
# Shiplight YAML test — intent survives UI changes
goal: Verify checkout flow
statements:
- intent: Add item to cart
- intent: Proceed to checkout
- intent: Fill in shipping details
- VERIFY: order confirmation is displayed
When a button moves or gets renamed, Shiplight heals the step automatically. The developer who renamed the button doesn't need to update a single test file.
See self-healing test automation and self-healing vs manual maintenance for how this works in production suites.
Symptom: Flakiness increases over time or across the full suite. A test that passes alone fails when run after 50 others. Disk fills up, database connections exhaust, or containers leak between runs.
Root cause: Tests create external resources — temporary files, database rows, uploaded blobs, sandboxed containers, message queue entries, seed users — but don't clean them up. Unlike shared test state (cause #3, typically in-memory or session), resource leaks accumulate in persistent storage and external systems. Later tests then hit quota limits, port collisions, duplicate-key errors, or stale data from earlier runs.
Examples:
- A test creates the user test@example.com, and the next run fails on "email already exists"
- Tests write to /tmp, and disk fills across long-running CI jobs
Fix: Add explicit teardown for every external resource the test creates. Scope resources to the test worker or run ID so parallel runs don't collide.
import { randomUUID } from 'node:crypto';
// db, s3, and docker below stand in for your own clients; adapt the calls to their real APIs
// Scope resources to this test run
const runId = process.env.TEST_RUN_ID || randomUUID();
const testEmail = `test-${runId}@example.com`;
test.afterEach(async () => {
  // afterEach also runs when the test fails, so leaks do not accumulate
  await db.user.deleteMany({ where: { email: { contains: runId } } });
  await s3.deleteObjects({ Prefix: `test-artifacts/${runId}/` });
  await docker.container.remove({ force: true });
});
For tests that modify shared state (database, external APIs), prefer transactional wrappers that roll back after each test:
// Wrap each test in a transaction that auto-rolls back.
// This only works if BEGIN, the test's queries, and ROLLBACK share one database connection;
// changes the app under test makes through its own connection are not rolled back.
test.beforeEach(async () => { await db.$executeRaw`BEGIN`; });
test.afterEach(async () => { await db.$executeRaw`ROLLBACK`; });
Systematic fix: Use ephemeral environments for each test run: a fresh database, a clean file system, disposable containers. CI systems like GitHub Actions make this cheap with service containers. When the entire environment is disposable, leaked resources are discarded with it and cannot accumulate across runs.
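When a disposable environment is not available, teardown has to be reliable on its own. One way to keep it from being skipped is to register a cleanup callback at the moment each resource is created. The sketch below is a hypothetical helper (the class and method names are not part of Playwright) that runs callbacks in reverse creation order and keeps going when one of them fails.

```typescript
// Hypothetical helper: collects teardown callbacks and runs them in reverse
// order, continuing past failures so one broken cleanup cannot strand the rest.
type Cleanup = () => Promise<void> | void;

class CleanupRegistry {
  private readonly tasks: Cleanup[] = [];

  // Call right after creating a resource, passing the code that removes it.
  register(task: Cleanup): void {
    this.tasks.push(task);
  }

  // Runs every registered task, newest first, and returns any errors raised
  // instead of stopping at the first one.
  async runAll(): Promise<unknown[]> {
    const errors: unknown[] = [];
    while (this.tasks.length > 0) {
      const task = this.tasks.pop()!;
      try {
        await task();
      } catch (err) {
        errors.push(err);
      }
    }
    return errors;
  }
}
```

In a Playwright suite you might create one registry per test, register a cleanup immediately after each user, file, or container is created, and call runAll() from afterEach, failing the test if the returned error list is non-empty.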
---
Causes #1–6 and #8 require code changes to fix. Cause #7 — UI changes breaking locators — is the one AI can eliminate automatically.
Shiplight's intent-cache-heal pattern works by storing the semantic intent of each test step rather than a brittle CSS selector. When a locator breaks after a refactor, the AI resolves the correct element from the live DOM using the intent, updates the locator cache, and the test continues — no human intervention required.
This is especially valuable for teams shipping AI-generated code, where UI changes are constant and locator maintenance quickly becomes unsustainable. Instead of a flaky Playwright test that breaks every time a component is renamed, you get a test that describes what the user wants to do and adapts automatically when the implementation changes.
The result: cause #7 drops from your flakiness report entirely, and your team's attention stays on the seven causes that actually require debugging.
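To make the idea of a locator as a cache of intent concrete, here is a toy model. It is not Shiplight's implementation: the "DOM" is a plain array, the class and field names are invented for illustration, and the intent matcher is a simple string comparison standing in for whatever semantic matching a real tool performs.

```typescript
// Toy model only. Each fake element has a selector and an accessible name.
interface FakeElement {
  selector: string;
  name: string;
}

class IntentLocatorCache {
  // intent description -> last selector that resolved it
  private readonly cache = new Map<string, string>();

  resolve(intent: string, dom: FakeElement[]): FakeElement | undefined {
    // Fast path: the cached selector still points at an element.
    const cached = this.cache.get(intent);
    const hit = dom.find(el => el.selector === cached);
    if (hit) return hit;

    // Cache miss or stale selector: re-resolve from the intent itself,
    // then refresh the cache so the next lookup is fast again.
    const healed = dom.find(el => el.name.toLowerCase() === intent.toLowerCase());
    if (healed) this.cache.set(intent, healed.selector);
    return healed;
  }

  cachedSelector(intent: string): string | undefined {
    return this.cache.get(intent);
  }
}
```

The design point is that the selector is treated as disposable: when it stops matching, the step falls back to the description of what the user wants and records whichever selector satisfies it now.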
What is self-healing test automation? · Self-healing vs manual maintenance
---
If you have an existing suite with widespread flakiness, don't try to fix everything at once. Use this triage approach:
// Mark known-flaky tests with skip + tracking issue
test.skip('checkout flow — flaky, tracked in TICKET-123', async ({ page }) => {
// ...
});
Deleting flaky tests removes coverage. Quarantine them while you fix the root cause.
// playwright.config.ts
export default {
retries: process.env.CI ? 2 : 0,
};
Retries are a symptom management tool, not a fix. Use them to keep CI green while you identify root causes, then remove them once the underlying issue is fixed.
Track which tests are most flaky. Playwright's built-in retry mechanism marks tests as flaky when they pass on retry — use this data to prioritize:
# Generate a JSON report to analyze flakiness
npx playwright test --reporter=json > results.json
For CI-specific reporter setup, including the github reporter that surfaces flaky test annotations directly in the PR diff, see E2E testing in GitHub Actions.
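A single report tells you which tests were flaky in one run; prioritizing needs counts across runs. The sketch below ranks flaky tests over several parsed reports. It assumes the JSON report nests suites, that each spec lists its tests, and that each test carries a status field where "flaky" means it passed on retry. The interfaces cover only the fields used here, and the function names are illustrative.

```typescript
// Minimal subset of the JSON report shape that this sketch relies on (assumed).
interface ReportTest { status: string }
interface ReportSpec { title: string; tests: ReportTest[] }
interface ReportSuite { title: string; specs?: ReportSpec[]; suites?: ReportSuite[] }
interface Report { suites: ReportSuite[] }

// Walks nested suites and returns the full title path of every spec
// that has at least one test marked flaky.
function collectFlaky(suites: ReportSuite[], path: string[] = []): string[] {
  const found: string[] = [];
  for (const suite of suites) {
    const here = [...path, suite.title];
    for (const spec of suite.specs ?? []) {
      if (spec.tests.some(t => t.status === 'flaky')) {
        found.push([...here, spec.title].join(' > '));
      }
    }
    found.push(...collectFlaky(suite.suites ?? [], here));
  }
  return found;
}

// Counts how many runs each test was flaky in, most frequent first.
function rankFlaky(reports: Report[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const report of reports) {
    for (const name of collectFlaky(report.suites)) {
      counts.set(name, (counts.get(name) ?? 0) + 1);
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

To use it, parse each stored results.json with JSON.parse and pass the array of reports to rankFlaky; the first entries are the tests worth fixing first.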
Fix the 20% of tests causing 80% of flakiness. Common culprits: auth flows, tests hitting external APIs, tests with waitForTimeout.
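Fixed sleeps are the easiest of these culprits to find mechanically. A small sketch, assuming you feed it the text of each spec file; the function name is illustrative, and the match is a plain regular expression rather than a real parser, so it will miss calls split across lines.

```typescript
// Returns the 1-based line numbers that call waitForTimeout,
// skipping lines that are entirely a // comment.
function findFixedSleeps(source: string): number[] {
  const hits: number[] = [];
  source.split('\n').forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed.startsWith('//')) return;
    if (/\.waitForTimeout\s*\(/.test(trimmed)) hits.push(index + 1);
  });
  return hits;
}
```

Running this over the suite gives a concrete worklist of lines to convert to condition-based waits.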
---
Build these habits into test authoring:
- Never use waitForTimeout: always wait for a condition, not a duration
- Run with --headed before committing to see what the test actually does
---
Tooling alone won't create a reliable test suite. The most important cultural shift is treating flaky tests as real defects, not acceptable nuisances. A flaky test is a bug in your test suite. It deserves the same attention as a production bug: triage, root cause analysis, and a permanent fix — not a retry loop that hides the problem.
Concrete practices that distinguish teams with trustworthy CI from teams without:
- Adding waitForTimeout to one flaky test is the wrong move. Fix the underlying pattern: switch to condition-based waits systematically, isolate state systematically, mock external dependencies systematically. One fix per category is worth dozens of per-test patches.
Teams that maintain this standard consistently have test suites engineers trust, and test suites engineers trust actually catch regressions. This is the only durable way to preserve the quality signal CI is supposed to provide.
---
Test flakiness in UI automation suites comes from eight root causes of non-determinism: (1) timing / race conditions — asserting before the page, network, or animation finishes; (2) brittle selectors bound to CSS classes or DOM structure that change; (3) shared test state bleeding between runs; (4) environment instability (CI differs from local); (5) animation interference; (6) parallelism conflicts when workers share data; (7) UI changes / locator drift after refactors; (8) resource leaks that accumulate across the suite. UI suites are especially flake-prone because they sit at the top of the stack — every layer beneath (network, render, animation, third-party widget) can introduce timing variance. Retries mask these symptoms; only addressing the specific root cause makes a test reliable. For cause #7 (the dominant source in fast-changing UIs), Shiplight's intent-based self-healing resolves the element semantically instead of breaking.
UI tests are the most flake-prone layer because they depend on the most moving parts: real browser rendering timing, network latency, animations, third-party widgets, and DOM structure that AI-driven and human refactors change frequently. Unit tests run in-process with no I/O; API tests have a stable contract; UI tests must wait for asynchronous rendering and bind to a visual structure that is unstable by nature. This is why the test pyramid keeps UI/E2E tests fewest — and why self-healing and intent-based resolution matter most at this layer. See what is software testing for the pyramid context.
Three signals: (1) tests that fail and then pass on an automatic retry, which Playwright marks as flaky in its JSON report when retries are enabled; (2) tests that consistently appear in your retry log; (3) tests that pass with --workers=1 but fail with parallelism. Run npx playwright test --reporter=json > results.json and filter for "status": "flaky" entries; aggregating those entries across several CI runs gives a list ranked by frequency.
A broken test fails consistently on broken code — it's doing its job. A flaky test fails intermittently on working code — it's a reliability problem in the test itself. The fix for a broken test is to fix the code or update the test to match new behavior. The fix for a flaky test is to address the instability in the test.
Only as a temporary measure. Retries mask the root cause and slow down your CI pipeline. If a test needs 3 retries to pass, it's not a reliable test — it's a slow coin flip. Fix the underlying cause and remove the retries.
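The arithmetic behind that coin flip is worth seeing once. This sketch assumes each attempt fails independently with the same probability, which is a simplification, and the function names are illustrative.

```typescript
// Chance that a test is eventually reported as passing when each attempt
// fails independently with probability failRate and CI allows `retries` reruns.
function passWithRetries(failRate: number, retries: number): number {
  return 1 - Math.pow(failRate, retries + 1);
}

// Expected number of attempts actually executed, given that the runner
// stops at the first passing attempt.
function expectedAttempts(failRate: number, retries: number): number {
  let total = 0;
  for (let k = 0; k <= retries; k++) {
    total += Math.pow(failRate, k); // attempt k+1 runs only if the first k failed
  }
  return total;
}

console.log(passWithRetries(0.3, 2).toFixed(3));
console.log(expectedAttempts(0.3, 2).toFixed(2));
```

Under these assumptions, a test that fails 30% of its attempts is still reported green about 97% of the time with two retries, while costing roughly 1.39 attempts per run. The defect is hidden and the pipeline is slower.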
The Google Testing Blog recommends a target of 0.1% or lower flakiness per test run. In practice, teams tolerate up to 1–2% before it meaningfully impacts developer trust. Above 5%, teams stop relying on CI results.
Most common causes: slower CI runners (increase timeouts), missing environment variables, services not fully started (add health check), or external API rate limits (add mocks). Run CI tests with CI=true locally to replicate the environment.
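Missing environment variables are cheap to rule out before any browser starts. A minimal pre-flight sketch follows; the helper names are hypothetical, and the variable names in the usage note are the ones used in the setup snippets earlier in this guide.

```typescript
// Returns the names of required variables that are unset or empty.
function missingEnv(
  required: string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter(name => !env[name]);
}

// Throws one error that lists every missing variable, so a misconfigured
// CI job fails immediately with a clear message instead of timing out later.
function assertEnv(
  required: string[],
  env: Record<string, string | undefined>,
): void {
  const missing = missingEnv(required, env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
}
```

Calling assertEnv(['BASE_URL', 'TEST_USER_EMAIL', 'TEST_USER_PASSWORD'], process.env) at the top of global setup turns a confusing mid-suite timeout into a single readable failure.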
- Set retries: 2 in CI to stop the bleeding
- Replace waitForTimeout calls with proper waits
- Migrate selectors to getByRole / getByTestId
For teams with chronic flakiness from UI changes (cause #7), Shiplight eliminates the entire category automatically. Its intent-based self-healing means tests survive CSS renames, refactors, and component migrations without manual updates or selector maintenance. See what is self-healing test automation?
---
- Replace waitForTimeout with condition-based waits
Related: turning flaky tests into actionable signal · E2E testing in GitHub Actions · self-healing vs manual maintenance · intent-cache-heal pattern
Stop fixing broken selectors. Shiplight Plugin adds intent-based self-healing on top of your existing Playwright tests: free, no account required.
References: Playwright documentation, Google Testing Blog, GitHub Actions documentation