AI-Generated Code Has 1.7x More Bugs — Here's the Fix
Shiplight AI Team
Updated on April 7, 2026
The data is in, and it's not what AI optimists hoped for.
CodeRabbit's "State of AI vs Human Code Generation" report, analyzing 470 real-world GitHub pull requests, found that AI-generated code produces approximately 1.7x more issues than human-written code. Not in toy benchmarks — in production repositories.
That's the headline, and it isn't an isolated finding. Uplevel's study of 800 developers found a 41% increase in bug rates for teams with GitHub Copilot access. GitClear's analysis of 211 million lines of code found that code churn (code rewritten or deleted within two weeks of being committed) nearly doubled, from 3.1% in 2020 to 5.7% in 2024, with AI-assisted coding identified as a key driver.
The pattern is consistent across every major study: AI makes developers faster, but the code it produces breaks more often.
So why are some teams shipping AI-generated code with fewer bugs than before?
When a human developer writes code, they typically:

1. Write the code
2. Run the application locally
3. Click through the affected flow
4. Verify the UI actually works
5. Open a pull request

When an AI coding agent writes code, most teams:

1. Prompt the agent
2. Skim the generated diff
3. Open a pull request
Steps 2-4 just vanished. The developer didn't run the app. Didn't click through the flow. Didn't verify the UI actually works. The AI generated plausible-looking code, the developer skimmed it, and it went straight to review.
This is where the 1.7x bug multiplier comes from. Not because AI writes worse code in absolute terms — but because the human verification step that catches bugs disappears when AI writes code fast enough that reviewing feels like enough.
Let's look at what types of bugs increase most in AI-generated code:
| Issue Category | AI vs Human Rate | Why It Happens |
|---|---|---|
| Logic & correctness | +75% | AI generates statistically likely code, not contextually correct code |
| Readability | +3x | AI doesn't follow team conventions or naming patterns |
| Error handling | +2x | AI handles the happy path well; misses edge cases |
| Security | +2.74x | AI reproduces known vulnerability patterns from training data |
Source: CodeRabbit, Dec 2025
Notice what's at the top: logic and correctness. Not syntax errors. Not type mismatches. The kind of bugs that only show up when you actually run the application and verify the UI behaves as expected.
Unit tests don't catch these. Linters don't catch these. Code review often doesn't catch these either — because the code looks correct. It compiles, the types check, the logic reads plausibly. You have to click through the flow to discover the bug. That's what end-to-end testing is for — and it's exactly the step that disappears in AI-assisted workflows.
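The kind of bug this describes is easy to sketch. The hypothetical checkout function below compiles, type-checks, and reads plausibly in a diff, yet it applies a discount after tax instead of before it. Nothing here is from a real codebase; it only illustrates why code that "looks correct" still needs its flow exercised.

```python
# Plausible-looking AI output: types check, logic reads fine in review,
# but the discount is applied AFTER tax instead of before it.
def total_with_discount(price: float, tax_rate: float, discount: float) -> float:
    taxed = price * (1 + tax_rate)  # tax computed on the full price
    return taxed - discount         # discount should have come first

# Correct order: discount first, then tax on the reduced price.
def total_correct(price: float, tax_rate: float, discount: float) -> float:
    return (price - discount) * (1 + tax_rate)

# Only running the flow exposes the difference:
print(round(total_with_discount(100, 0.10, 20), 2))  # 90.0
print(round(total_correct(100, 0.10, 20), 2))        # 88.0
```

A unit test written against the buggy function's own behavior would pass; only checking the checkout total end to end, against what the business actually expects, surfaces the two-dollar discrepancy.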
GitClear's 2025 research reveals a deeper structural problem:
AI tools generate new code instead of reusing existing abstractions. The result: repositories that grow faster but become harder to maintain. Each duplicated block is a future bug — when you fix one copy, the others remain broken.
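A minimal sketch of that failure mode, with invented function names: an assistant regenerates the same validation logic in two modules instead of reusing one helper, and a later bug fix only reaches one copy.

```python
# Hypothetical duplication: the same email check regenerated in two
# modules instead of being extracted into a shared helper.

# signup module (copy 1): later patched to require a dot in the domain
def valid_email_signup(addr: str) -> bool:
    return "@" in addr and "." in addr.split("@")[-1]

# billing module (copy 2): the original buggy version, never patched
def valid_email_billing(addr: str) -> bool:
    return "@" in addr

# Fixing copy 1 leaves copy 2 accepting invalid input:
print(valid_email_signup("user@host"))   # False (patched)
print(valid_email_billing("user@host"))  # True (the bug survives)
```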
The teams shipping AI-generated code without the 1.7x bug penalty all share one practice: they verify AI output in a real browser before it reaches main.
Not with unit tests. Not with code review alone. With actual end-to-end verification — the same kind of "click through the app" checking that human developers do naturally, but automated so it scales with AI's speed.
Here's what that looks like at three companies using Shiplight:
> "I used to spend 60% of my time authoring and maintaining Playwright tests for our entire web application. I spent 0% of the time doing that in the past month. I'm able to spend more time on other impactful/more technical work. Awesome work!"
— Jeffery King, Head of QA, Warmly
The 60% number is staggering but common. Industry data shows that test maintenance is one of the largest hidden costs in software development, often consuming more time than writing the tests in the first place. When tests break every time the UI changes, teams either burn cycles fixing them or stop running them entirely — leaving AI-generated code unverified.
Warmly eliminated this by switching to self-healing test automation — intent-based tests that adapt when the UI changes. The time freed up went to higher-impact engineering work, not more test maintenance.
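The self-healing idea can be sketched in a few lines: a step records the developer's intent plus candidate locators, and when the first locator stops matching the rendered page, later candidates keep the test alive. This is an illustrative model only, with invented names, not Shiplight's actual API; a real DOM is modeled here as a set of selector strings.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    intent: str                      # what the step is trying to do
    candidates: list[str] = field(default_factory=list)  # locator fallbacks

def resolve(step: Step, dom: set[str]) -> str:
    """Return the first candidate locator present in the rendered DOM."""
    for locator in step.candidates:
        if locator in dom:
            return locator
    raise LookupError(f"no locator satisfies intent: {step.intent!r}")

step = Step("Add first product to cart",
            ["button#add-to-cart", "button[data-test=add]"])

old_dom = {"button#add-to-cart"}
new_dom = {"button[data-test=add]"}   # UI changed; the old id is gone

print(resolve(step, old_dom))  # button#add-to-cart
print(resolve(step, new_dom))  # falls back: button[data-test=add]
```

A brittle test stores only the first selector and breaks on the second DOM; an intent-based step survives the UI change without anyone editing the test.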
> "Within just a few days, we achieved reliable end-to-end coverage across our most critical flows, even with complex integrations and data-driven logic. QA no longer slows the team down as we ship fast."
— Binil Thomas, Head of Engineering, Jobright
The key phrase: "within just a few days." Traditional E2E test suites take weeks or months to build. By the time they're ready, the AI-assisted codebase has already moved on. Jobright closed that gap by generating tests directly from their AI coding workflow — the same agent that writes code also verifies it.
> "We automated over 80% of our core regression flows within the first few weeks. Most manual checks are gone, ongoing maintenance is minimal, and shipping changes feels significantly safer now."
— Ethan Zheng, Co-founder & CTO, Daffodil
80% coverage of core regression flows means 80% fewer places for AI-generated bugs to hide. When every PR triggers automated verification of the most critical user paths, the 1.7x bug multiplier gets absorbed before it reaches production.
The solution isn't to stop using AI coding tools. The productivity gains are real — teams using AI assistants ship features significantly faster. The solution is to close the verification gap with agentic QA testing — letting the AI agent verify its own output.
With MCP (Model Context Protocol), AI coding agents can now verify their own output in a real browser: run the app, click through the affected flow, and check that the result matches the intent. The agent that generates the code also proves it works. The verification step that humans skip when AI writes code fast enough becomes automated.
```yaml
goal: Verify checkout flow after AI-generated payment update
base_url: http://localhost:3000
statements:
  - navigate: /products
  - intent: Add first product to cart
    action: click
    locator: "getByRole('button', { name: 'Add to cart' })"
  - navigate: /checkout
  - VERIFY: Cart shows correct item and price
  - intent: Fill payment details
    action: fill
    locator: "getByLabel('Card number')"
    value: "4242424242424242"
  - intent: Submit payment
    action: click
    locator: "getByRole('button', { name: 'Pay now' })"
  - VERIFY: Order confirmation page appears with order number
```

This test is readable by anyone on the team. It lives in your repo. When the UI changes, intent-based steps self-heal automatically, the same pattern described in AI-generated tests vs hand-written tests. And it catches exactly the type of bugs that multiply 1.7x in AI-generated code: logic errors, flow breakages, and UI regressions that unit tests miss.
| Metric | Without E2E Verification | With Automated Verification |
|---|---|---|
| AI code bug rate | 1.7x more issues (CodeRabbit) | Caught before merge |
| Logic errors | +75% vs human code | Verified in real browser |
| Security gaps | +2.74x vs human code | Flagged during review |
| Test maintenance time | 40-60% of QA effort | Near-zero (self-healing) |
| Time to full E2E coverage | Weeks to months | Days (Jobright) |
| Regression flow coverage | Manual spot-checks | 80%+ automated (Daffodil) |
AI coding tools are here to stay. The 1.7x bug multiplier doesn't have to be.
The teams that will win are the ones that treat AI-generated code the same way they'd treat code from a very fast junior developer: verify everything, automate the verification, and never ship without testing.
The tools to do this exist today. Get started with Shiplight Plugin — it takes one command to add automated verification to your AI coding workflow. The question is whether your team adopts it before the technical debt compounds — or after the production incident.