How to Set Up a Vibe Coding QA Process: A Practical 9-Stage Workflow for 2026
Shiplight AI Team
Updated on May 15, 2026

A solid vibe coding QA process treats AI like a very fast junior engineer: high throughput, inconsistent judgment, and prone to confident mistakes. The goal is not "more testing." It's creating enough structure that AI-generated changes become predictable, reviewable, and reversible. A practical workflow usually looks like this: Intent → Contracts → AI implementation → AI-generated tests → Human review → CI gates → Staging exploration → Production monitoring → Regression learning. Each stage solves a specific failure mode AI coding agents introduce — drift from intent, hallucinated tests, silent regressions, undiscovered flows in production. This guide walks through all nine stages, what each requires, and the Shiplight features that implement them.
Traditional QA assumes a relatively stable codebase, deliberate authorship, and slow change. Vibe coding inverts all three:

- Code is generated in seconds from prompts rather than written deliberately.
- The same feature is regenerated multiple times, each time with different internal structure.
- Throughput is bounded by prompt iteration speed, not human typing speed.
A QA process designed for the traditional shape collapses under vibe-coding throughput. Reviewing every AI-generated PR by hand becomes a bottleneck within a sprint. Selector-bound test suites break weekly. Manual regression cycles can't keep pace with the deploy cadence. The 9-stage workflow below replaces the manual choke points with structural gates that scale.
For the broader category, see what is vibe testing and vibe coding quality issues: a triage playbook.
The QA process starts before the first prompt to the AI. Every feature has a user intent — "the user can sign up with email, verify, and reach the dashboard within 60 seconds" — and that intent is the source of truth. Without it, the AI generates something, the engineer accepts it, and there's no contract to test against.
Capture intent in plain language as part of the ticket or PR description. Three rules:
Shiplight surface: intent statements feed directly into Shiplight YAML Test Format. The same plain-English intent that goes in the ticket becomes the YAML test step.
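As a sketch, the signup intent above might become a test like this (the field names are illustrative, not necessarily Shiplight's exact schema):

```yaml
# Hypothetical intent-based test. Field names are illustrative,
# not necessarily Shiplight's exact YAML Test Format.
test: signup-to-dashboard
intent: >
  The user can sign up with email, verify,
  and reach the dashboard within 60 seconds.
steps:
  - visit the signup page
  - enter a valid email and password, submit
  - open the verification email and click the link
  - assert the dashboard is visible
timeout: 60s
```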
Before the AI writes code, define the contracts the implementation must honor:

- API contracts: endpoint shapes, status codes, error formats.
- Data contracts: schemas, required fields, uniqueness and integrity rules.
- Behavioral contracts: what the user can and cannot do in the flow.
- Security contracts: authentication and authorization boundaries.
Contracts are the difference between "AI made something" and "AI made something that fits." Without contracts, AI fills the ambiguity with its training-data priors — which sometimes match your system and often don't. See requirements to E2E coverage.
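A sketch of how contracts might be captured in the ticket, using the four categories above (the structure is illustrative; any format the team can review and reference in the PR works):

```yaml
# Hypothetical contracts block attached to the ticket.
# The structure is illustrative, not a Shiplight format.
contracts:
  api:
    - "POST /signup returns 201 with { userId } on success, 409 on duplicate email"
  data:
    - "users.email is unique and stored lowercase"
  behavioral:
    - "unverified users cannot reach /dashboard"
  security:
    - "verification tokens expire after 24 hours and are single-use"
```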
With intent and contracts in hand, the AI coding agent (Cursor, Claude Code, OpenAI Codex, Copilot) generates the implementation. Three discipline points:
The output of this stage is code in a feature branch — not merged, not deployed. The next stages decide whether it ships.
This is the stage most teams skip and pay for later. Immediately after the AI generates the implementation, have it generate tests for the same intent — preferably in the same session, before the PR opens. The agent has full context, can produce relevant edge cases, and remembers what it just wrote.
The output of this stage is a set of intent-based YAML tests committed in the same PR as the implementation, covering the happy path plus the edge cases the agent identified, including negative cases (e.g., an unauthorized user cannot reach /admin). This is the largest single multiplier of coverage. Authoring throughput now tracks code generation, not human typing. See boost test coverage with agentic AI and testing layer for AI coding agents.
Shiplight surface: Shiplight MCP Server and AI SDK let coding agents call the testing tool as a callable resource — generate, run, and pass tests inside the same session they wrote the feature. See MCP for testing.
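As a sketch, an edge-case test the agent commits alongside the feature might look like this (same hypothetical schema as the Stage 1 example):

```yaml
# Hypothetical edge-case test generated by the agent in the
# same session as the feature. Schema is illustrative, as above.
test: signup-duplicate-email
intent: signing up with an already-registered email shows a clear error
steps:
  - visit the signup page
  - enter an email that already has an account, submit
  - assert an "email already registered" error is shown
```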
The AI-generated implementation and AI-generated tests both go to a human reviewer in the same pull request. The reviewer is not catching every bug — they're verifying three things:
Self-healing tools should emit unhealed-step patches as PR diffs, not silent rewrites — preserving the audit trail for this review. See self-healing vs manual maintenance.
The PR runs an automated CI gate before merge. Three categories of check:

- The intent-based test suite from Stage 4 passes.
- Lint and static-analysis checks pass.
- Contract checks pass: the API shapes, schemas, and security rules from Stage 2 still hold.
The gate is blocking — failure prevents merge. Nightly regression doesn't replace this; it supplements. The 16-hour gap between "PR merged" and "nightly catches the bug" is incompatible with vibe-coding deploy cadence. See a practical quality gate for AI pull requests and E2E testing in GitHub Actions: setup guide.
Shiplight surface: Shiplight Cloud runners + CI integration produce structured failure output (replay video, DOM snapshot, diff per step) — not stack traces a reviewer has to dig through.
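A minimal sketch of the blocking gate as a GitHub Actions workflow. It assumes a Node project, and the `shiplight` CLI invocation is hypothetical, standing in for however your test runner is actually triggered:

```yaml
# Sketch of a blocking PR gate in GitHub Actions.
# Assumes a Node project; the shiplight command is hypothetical --
# substitute your actual runner invocation.
name: pr-gate
on: pull_request
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint              # lint / static analysis
      - run: npx shiplight run tests/  # intent tests (hypothetical CLI)
```

To make the gate actually blocking, mark the job as a required status check in the repository's branch protection rules; otherwise a red check is advisory only.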
After merge, the change deploys to staging. A human (or autonomous explorer) runs exploratory testing on the staging environment — looking for things the contract and tests didn't cover:
Findings from staging exploration become new tests, fed back into the regression suite (Stage 9). The bug class AI is worst at catching is the one that exists in flows nobody documented; staging exploration is where you find those flows. See how to test vibe-coded applications for reliability.
Once in production, the QA process continues. Three observability layers:

- Error tracking (e.g., Sentry) for exceptions users actually hit.
- Performance and usage dashboards (e.g., Datadog) for slow degradation.
- Alerts on the contracts from Stage 2, so a silent contract break pages someone.
When monitoring catches an issue users hit, the fix workflow re-enters at Stage 1: intent → contracts → AI implementation → ... The loop closes.
The final stage is the discipline that makes the loop self-improving: every production bug becomes a permanent regression test. Three steps:

- Capture the bug as a plain-language intent, the same way a feature starts in Stage 1.
- Have the coding agent translate the bug into an intent-based test, ideally in the same session as the fix.
- Commit the test to the regression suite in the fix PR, so it runs on every future PR.
Without Stage 9, the same bug ships twice. With it, the suite gets smarter with every incident. See tribal knowledge to executable specs and postmortem-driven E2E testing.
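As a sketch, a production incident in the verification flow might become a permanent test like this (hypothetical scenario, same illustrative schema as above):

```yaml
# Hypothetical regression test derived from a production bug.
# Scenario and schema are illustrative, as above.
test: regression-expired-verification-link
intent: >
  An expired verification link shows a "link expired" page
  with a resend option instead of a blank error.
steps:
  - request a verification email
  - open a verification link whose token has expired
  - assert the "link expired" page is shown
  - assert a "resend verification" button is present
```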
| Stage | Owner | Artifact | Gate |
|---|---|---|---|
| 1. Intent | PM or feature owner | One-sentence user-goal statement in the ticket | Ticket cannot enter sprint without intent |
| 2. Contracts | Tech lead or domain owner | API / data / behavioral / security contracts | PR cannot open without contracts referenced |
| 3. AI implementation | Engineer (with coding agent) | Code in a feature branch | None — output of this stage is the input to Stage 4 |
| 4. AI-generated tests | Coding agent via MCP / SDK | YAML intent tests committed in same PR | PR cannot open without test files |
| 5. Human review | Reviewer (engineer or QA) | Approved or change-requested PR | Approval required to proceed |
| 6. CI gates | Automated | Test + lint + contract check results | Blocking — fail = no merge |
| 7. Staging exploration | QA engineer / product | Exploration session notes + new bug tickets | Soft gate — recommended before release tag |
| 8. Production monitoring | On-call / SRE | Dashboards + alerts on contracts | Continuous — feeds back into Stage 1 |
| 9. Regression learning | QA engineer | New intent-based test for the production bug | Required for every shipped bug fix |
Each row is its own discipline. Skipping any row pushes work into one of the later rows — usually the most expensive one (production-incident response).
Most teams adopting vibe coding QA don't get to all 9 stages at once. The realistic maturity curve:

- Level 1 (Reactive): bugs found by users.
- Level 2 (Manual gate): humans review PRs, but there are no automated tests.
- Level 3: AI-generated tests added in the same PR.
- Level 4: PR-time CI gates blocking merge.
- Level 5: the full 9-stage loop, including production monitoring and regression learning.
Each level is the floor above the previous one. Skipping levels (e.g., trying to go straight from Level 2 to Level 5) usually fails — the missing stages produce gaps that defeat the gates above them.
| Stage | Shiplight surface |
|---|---|
| 1. Intent | YAML Test Format — intent statements double as test steps |
| 2. Contracts | Used as inputs to test generation; not Shiplight-specific |
| 3. AI implementation | External — your coding agent (Cursor, Claude Code, Codex) |
| 4. AI-generated tests | Shiplight AI SDK + MCP Server |
| 5. Human review | Tests as plain YAML in git, reviewable in standard PR flow |
| 6. CI gates | Shiplight Cloud runners + GitHub Actions / GitLab CI / CircleCI integration |
| 7. Staging exploration | Shiplight autonomous flow discovery (Plugin) |
| 8. Production monitoring | External (Sentry / Datadog) — outside Shiplight scope |
| 9. Regression learning | Coding agent + Shiplight MCP — bug-to-test translation in the same session |
Six of the nine stages map directly to Shiplight surfaces. The other three (implementation, monitoring, contracts) are handled by adjacent tools — Shiplight integrates with them rather than replacing them.
A realistic schedule for a team adopting the full 9 stages from scratch:

- Week 1: ticket and PR templates with intent + contracts.
- Week 2: intent-based test authoring for new features.
- Week 3: PR-time CI gates.
- Week 4: AI-coding-agent test authoring via MCP.
- Month 2: staging exploration discipline.
- Month 3: regression-learning discipline.
By month 3 you're at Level 5 maturity. See the 30-day agentic E2E playbook for the condensed timeline.
A solid vibe coding QA process is a 9-stage workflow: Intent → Contracts → AI implementation → AI-generated tests → Human review → CI gates → Staging exploration → Production monitoring → Regression learning. Each stage has an explicit owner, an artifact, and (where appropriate) a blocking gate. The goal is structure — enough constraint that AI-coding-agent throughput produces predictable, reviewable, reversible changes. Most teams converge on this 9-stage shape over 4–8 weeks of refinement.
Vibe coding inverts three assumptions traditional QA depends on: code is generated in seconds from prompts rather than written deliberately; the same feature is regenerated multiple times with different internal structure; throughput is bounded by prompt iteration speed, not human typing speed. A traditional QA process — manual regression, selector-bound tests, single-stage gate at deploy time — collapses under that throughput. The 9-stage process replaces manual choke points with structural gates that scale.
Stage 4 — AI-generated tests in the same session as the feature. Most teams skip this and rely on "we'll write tests next sprint," which never happens because the coding agent has moved on and the test-writing context is lost. Adding Stage 4 is the largest single multiplier of test coverage because authoring throughput now tracks code generation throughput instead of human authoring speed. See boost test coverage with agentic AI.
Traditional QA assumes deliberate authorship, stable codebases, and slow change. The QA team writes tests after the feature ships; the test suite is selector-bound; regression runs nightly. Vibe coding QA assumes AI-generated authorship, churning code, and rapid change. Tests are AI-generated in the same session as the feature; the test suite is intent-based and self-healing; regression runs on every PR. The 9-stage process is built specifically for the second assumption set.
The coding agent (Cursor, Claude Code, OpenAI Codex, GitHub Copilot) needs a callable testing tool — typically a Model Context Protocol (MCP) server or an SDK — to author and run tests during the build session. Install the Shiplight MCP Server and configure the agent's MCP tools list to include Shiplight; then, when you prompt the agent to add a feature, it can generate the test as well. See MCP for testing and agent-first testing.
A realistic schedule: ticket and PR templates with intent + contracts (Week 1), intent-based test authoring for new features (Week 2), PR-time CI gates (Week 3), AI-coding-agent test authoring via MCP (Week 4), staging exploration discipline (Month 2), regression-learning discipline (Month 3). By month 3 you're operating at Level 5 maturity. See the 30-day agentic E2E playbook.
Five levels: (L1) Reactive — bugs found by users; (L2) Manual gate — human reviews PRs but no automated tests; (L3) AI-generated tests added in the same PR; (L4) PR-time CI gates blocking merge; (L5) Full 9-stage loop including production monitoring and regression learning. Each level is the floor above the previous. Most teams that try to skip from L2 to L5 fail; incremental adoption — adding one stage at a time — works.
Stage 1 (Intent): PM or feature owner. Stage 2 (Contracts): tech lead or domain owner. Stage 3 (AI implementation): engineer with their coding agent. Stage 4 (AI-generated tests): coding agent via MCP/SDK. Stage 5 (Human review): reviewer (engineer or QA). Stage 6 (CI gates): automated; QA engineer maintains the gate config. Stage 7 (Staging): QA engineer + product. Stage 8 (Production monitoring): on-call / SRE. Stage 9 (Regression learning): QA engineer. Each stage having a named owner is the difference between a disciplined process and a good-intentions wishlist.
The V-model and testing pyramid describe what to test (which levels, which dimensions). The 9-stage process describes when and with what discipline to test in a vibe-coding workflow. They're complementary, not substitutes. Your team uses the pyramid to decide unit vs integration vs E2E coverage; you use the 9-stage process to organize how AI-generated changes flow from prompt to production. See what is software testing for the foundational levels.
The two stages most teams under-invest in are Stage 1 (intent capture) and Stage 4 (AI-generated tests in the same session). The first sets the contract everything else verifies against; the second is what makes coverage scale with agent throughput rather than human authoring. Skip either, and the later gates become reactive cleanup rather than proactive constraint. Everything else — CI gates, monitoring, regression learning — depends on those two being in place.
---
Vibe coding moves the bottleneck from typing speed to prompt iteration speed. The QA process has to move with it — not by adding "more tests," but by inserting structural gates at the right points in the workflow. The 9-stage process above is what most teams converge on after 1–2 quarters of iteration. Setting it up from scratch is a 4–8 week project, not a 6-month transformation. The teams that ship this faster than their competitors are the ones whose engineering velocity actually compounds rather than oscillating between "we shipped a lot" and "we spent the next sprint cleaning up."
For teams ready to operationalize the 9-stage process, Shiplight AI implements six of the nine stages: YAML Test Format for intent-based authoring, Plugin for self-healing execution and autonomous staging exploration, AI SDK and MCP Server for agent-callable test generation, and Cloud runners for PR-time CI gates. Book a 30-minute walkthrough and we'll map your current QA workflow to the 9 stages and identify the highest-leverage gaps.