Vibe Coding Quality Issues: A Triage Playbook for Engineering Leaders
Shiplight AI Team
Updated on May 8, 2026

Vibe coding quality issues are the predictable second-quarter outcome of velocity-first AI adoption: output doubles, the vibe coding bug rate climbs 2–3x in 4–12 weeks, and the review process never adjusts. The fix is not slowing the agents down. It is moving verification responsibility from a person reading a diff to a system that exercises the running app on every change.
---
The teams hitting this wall did not do anything wrong. They installed Claude Code, Cursor, Codex, or GitHub Copilot, watched feature throughput climb, and scaled the rest of the engineering process at the same pace it had run for years. Six weeks later, incidents started landing in clusters — usually in code that "looked fine" in review.
This is the playbook for the week you realize that pattern is yours.
The cause is not the model. It is the loop. Vibe coding compresses the write → run → review → ship cycle by collapsing the human-driven middle steps. The bugs that survive concentrate in four predictable shapes, distinct in failure mode but identical in root cause.
Each of these is invisible to a line-by-line review at AI velocity. None are invisible to a test that actually runs the user flow. That is the lever this playbook pulls.
For the underlying data and four-failure-mode model, see AI-generated code has 1.7x more bugs — here's the fix.
If your bug rate has already spiked, the first job is stabilization, not strategy: a short containment sequence run in the first week.
This is the floor. It buys time. It does not solve the underlying review-loop mismatch.

The mistake most teams make next is requiring two reviewers on every AI-generated PR. That collapses velocity for low-risk surface area and burns out reviewers on the parts that don't need scrutiny. Tier by code risk, not by authorship:
| Risk Tier | Surface Area | Review Requirement | Test Requirement |
|---|---|---|---|
| High | Auth, payments, data writes, API integrations, anything PII-adjacent | Two reviewers, security review on first touch | Mandatory E2E coverage before merge |
| Medium | Business logic, state management, request handlers, non-trivial UI flows | Standard review focused on intent, not lines | E2E for primary path; unit tests for branching logic |
| Low | UI scaffolding, copy changes, config, boilerplate, theme tokens | Single reviewer | Optional |
Risk tiering does two things at once: it lets the high-velocity tier stay high-velocity, and it concentrates human attention where AI failure modes have the worst blast radius. The teams that stick with vibe coding long-term build this matrix into their PR template the same week they install the first agent.
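To make the matrix operational rather than advisory, some teams encode it as a policy check. Here is a minimal sketch in TypeScript; the path patterns, module name, and helper functions are illustrative assumptions about repository layout, not a prescribed structure.

```typescript
// riskPolicy.ts: the risk-tier matrix as code. Path patterns are illustrative;
// adapt them to your repository layout.

type RiskTier = "high" | "medium" | "low";

interface TierPolicy {
  tier: RiskTier;
  paths: RegExp[];      // which changed files fall into this tier
  reviewers: number;    // minimum approvals before merge
  requiresE2E: boolean; // must the PR include E2E coverage for the change?
}

const POLICIES: TierPolicy[] = [
  {
    tier: "high", // auth, payments, data writes, API integrations, PII-adjacent code
    paths: [/^src\/auth\//, /^src\/payments\//, /^src\/integrations\//, /^db\/migrations\//],
    reviewers: 2,
    requiresE2E: true,
  },
  {
    tier: "medium", // business logic, state management, request handlers, non-trivial UI flows
    paths: [/^src\/(state|handlers|flows)\//],
    reviewers: 1,
    requiresE2E: true,
  },
  {
    tier: "low", // scaffolding, copy, config, boilerplate, theme tokens
    paths: [/./], // catch-all
    reviewers: 1,
    requiresE2E: false,
  },
];

// First matching tier wins; POLICIES is ordered from highest to lowest risk.
export function tierFor(file: string): TierPolicy {
  return POLICIES.find((p) => p.paths.some((re) => re.test(file)))!;
}

// A PR inherits the strictest tier among the files it touches.
export function policyForPR(changedFiles: string[]): TierPolicy {
  const rank: Record<RiskTier, number> = { high: 0, medium: 1, low: 2 };
  return changedFiles
    .map(tierFor)
    .sort((a, b) => rank[a.tier] - rank[b.tier])[0];
}
```

Wiring something like this into the PR template or a required status check is what turns the table above from guidance into an enforced gate.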
Once the immediate fire is out, the goal is a governance model where every iteration raises the floor and never lowers it. Four ratchet steps:
First, move review to the intent level. At AI throughput, a reviewer cannot meaningfully read every line of a 400-line agent diff. They can meaningfully ask: "What is this PR claiming the user can now do? Is that demonstrated?" That is intent-level review. The artifact that makes intent reviewable is a passing E2E test that exercises the new flow.
Second, make coverage a merge gate. Every quarter someone runs a coverage report, finds it dropped 6%, and writes a doc nobody reads. Make it a gate instead: PRs touching tier-1 flows do not merge without an E2E test that exercises the change. Coverage stops being a number on a dashboard and becomes a property the codebase enforces.
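A minimal sketch of that gate as a CI step, assuming the policyForPR helper from the tiering sketch above and a CI job that passes the PR's changed files as arguments; the test-detection heuristic is deliberately crude and illustrative.

```typescript
// coverageGate.ts: block merge when a high-risk (tier-1) change ships without a test change.
// Invocation is an assumption, e.g.: node coverageGate.js $(git diff --name-only origin/main...HEAD)

import { policyForPR } from "./riskPolicy"; // hypothetical module from the tiering sketch

// Crude heuristic: any changed file under e2e/ or named *.test.ts counts as a test change.
const isTestChange = (f: string) => f.startsWith("e2e/") || f.endsWith(".test.ts");

function main(): void {
  const changedFiles = process.argv.slice(2);
  const policy = policyForPR(changedFiles);

  if (policy.requiresE2E && !changedFiles.some(isTestChange)) {
    console.error(`"${policy.tier}"-tier change with no accompanying E2E test change; blocking merge.`);
    process.exit(1);
  }
  console.log("Coverage gate passed.");
}

main();
```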
Third, gate on risk, not authorship. Whether the code came from a human, an agent, or an agent supervised by a human is the wrong axis. The right axis is: how much damage does this code do if it's wrong? Apply the matrix above to all code. The same review rules cover human-written authentication code and AI-generated authentication code, because the failure cost is identical.
Fourth, generate tests at the same velocity as the code. The hand-written test suite cannot keep up with AI output. That is the structural problem. Either tests are generated and maintained at AI velocity, or they decay until coverage is theatre. The next section is how that floor gets automated.

For deeper governance patterns, see a practical quality gate for AI pull requests.
The shape of a test suite that survives vibe coding velocity has three properties: it is generated inside the same agent loop that writes the code, it is expressed as user intent rather than implementation selectors, and it heals itself when the UI changes. A test bound to the selector #submit-btn breaks the next time an agent renames the element. A test bound to "submit the order" survives every UI refactor that preserves the user-visible behavior. This is the intent-cache-heal pattern.

Shiplight is built on exactly this shape. The Shiplight Plugin exposes test generation and execution as Model Context Protocol (MCP) tools that Claude Code, Cursor, Codex, and GitHub Copilot can call directly. The agent that just wrote the feature calls /verify to run it in a real browser and /create_e2e_tests to save the verification as a self-healing YAML test in the repo. Tests are authored as structured intent steps, so when an agent restructures the UI next sprint, the test re-resolves rather than breaks.
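To make the intent-cache-heal idea concrete, here is a minimal sketch of the resolution loop in TypeScript. It is not Shiplight's implementation; the Page interface, cache shape, and function names are hypothetical and exist only to show why an intent-bound step survives a selector rename.

```typescript
// intentCacheHeal.ts: illustrative sketch. An intent step ("submit the order") is resolved
// to a concrete selector once, cached for speed, and re-resolved ("healed") when the cached
// selector no longer matches the current page.

interface Page {
  exists(selector: string): Promise<boolean>; // stand-in for a real browser-automation API
  click(selector: string): Promise<void>;
}

// The resolver maps natural-language intent to a selector, e.g. via an LLM or heuristics.
type Resolver = (page: Page, intent: string) => Promise<string>;

const selectorCache = new Map<string, string>(); // intent -> last known working selector

export async function performIntent(page: Page, intent: string, resolve: Resolver): Promise<void> {
  const cached = selectorCache.get(intent);

  // Fast path: the cached selector still exists, so the step runs without re-resolution.
  if (cached && (await page.exists(cached))) {
    return page.click(cached);
  }

  // Heal path: the UI changed (or this is the first run); resolve the intent against the
  // current page and refresh the cache instead of failing on a stale selector.
  const selector = await resolve(page, intent);
  selectorCache.set(intent, selector);
  return page.click(selector);
}
```

The durable artifact is the intent; the selector is disposable, which is what lets an agent rename #submit-btn next sprint without breaking the suite.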
The customer pattern is consistent. HeyGen's Head of QA reported moving from spending 60% of his time authoring and maintaining Playwright tests to spending 0% — same coverage, freed velocity for higher-leverage work. Read the full story in the HeyGen case study.
The point is not that this is the only solution. It is that some solution with these three properties is now load-bearing if AI is in the development loop.
The hardest part of this transition is not technical. It is admitting that the review process the team trusted last quarter is not the review process the team needs this quarter. The signal that the conversation has gone well: engineers stop framing "AI made a mistake" as a story about the model and start framing it as a story about the gate that should have caught it. That reframe is what makes the governance ratchet stick.
For a deeper read on the loop itself, see QA for the AI coding era and how to add automated testing to Cursor, Copilot, and Codex.
Most teams see the bug-rate spike 4–12 weeks after broad adoption. The lag is the time it takes for AI-generated code to accumulate enough surface area that the previously adequate review process starts missing things. The earlier signal is an uptick in production incidents traced to recently shipped code that "passed review."
Pulling the agents out is not the answer. The productivity gain is real, and the bug rate is fixable without giving it up. The fix is closing the verification loop with automated end-to-end testing on every diff, not removing the agents that closed the velocity gap.
Vibe coding is describing intent in natural language and letting an agent write the implementation. Vibe testing is verifying the implementation actually does what was described — by exercising the running app, not by reading the diff. See vibe coding testing: how to add QA without slowing down.
Require two reviewers only on tier-1 surface area: authentication, payments, data writes, integrations. Apply the risk matrix above. Requiring two reviewers on every PR collapses velocity on low-risk changes and trains the team to rubber-stamp.
The only sustainable way to keep the suite current is generating tests inside the same agent loop that generates the code, in a format that self-heals when the UI changes. Hand-authored Playwright suites cannot keep up. See self-healing tests vs manual maintenance: the ROI case.
Three leading indicators precede a measurable spike: (1) PRs merged without an associated test change rising past 60% of the AI-generated PR volume, (2) production incidents traced back to PRs that "passed review" climbing month-over-month, and (3) engineers using "I'll trust the agent" as PR-review shorthand. Lagging indicator: weekly incident count or hotfix frequency. Track the leading three; the lagging metric arrives too late to triage cleanly.
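Tracking the first indicator can be a few lines of script if merged PRs can be exported with their labels and changed files; the PR shape, the "ai-generated" label, and the test-file heuristic below are illustrative assumptions about your tooling.

```typescript
// leadingIndicator.ts: share of AI-generated PRs merged without any test change.
// The data shape and the "ai-generated" label are assumptions about your PR tooling.

interface MergedPR {
  labels: string[];       // e.g. ["ai-generated", "feature"]
  changedFiles: string[]; // paths touched by the PR
}

const isTestFile = (f: string) => f.startsWith("e2e/") || f.endsWith(".test.ts");

export function untestedAIShare(prs: MergedPR[]): number {
  const aiPRs = prs.filter((pr) => pr.labels.includes("ai-generated"));
  if (aiPRs.length === 0) return 0;
  const untested = aiPRs.filter((pr) => !pr.changedFiles.some(isTestFile));
  return untested.length / aiPRs.length; // the article's alarm line for this ratio is 0.6
}
```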
Traditional code review policy gates on authorship and line count — every PR over N lines gets two reviewers regardless of risk. Vibe coding governance gates on risk tier and behavioral coverage — a 400-line UI scaffolding PR can ship with one reviewer and no E2E test, while a 50-line auth change requires two reviewers and a passing E2E. The shift is from "scrutinize all change" to "scrutinize change proportional to blast radius."
If your team is six weeks into AI adoption and the incident graph is bending the wrong way, the gap is not in the model. It is in the verification loop that used to be a human and is now nobody.
Install Shiplight Plugin into your coding agent and the next AI-generated feature you ship will close that loop on the first commit.