How to Set Up a Vibe Coding QA Process: A Practical 9-Stage Workflow for 2026
Shiplight AI Team
Updated on May 15, 2026

A solid vibe coding QA process treats AI like a very fast junior engineer: high throughput, inconsistent judgment, and prone to confident mistakes. The goal is not "more testing." It's creating enough structure that AI-generated changes become predictable, reviewable, and reversible. A practical workflow usually looks like this: Intent → Contracts → AI implementation → AI-generated tests → Human review → CI gates → Staging exploration → Production monitoring → Regression learning. Each stage solves a specific failure mode AI coding agents introduce — drift from intent, hallucinated tests, silent regressions, undiscovered flows in production. This guide walks through all nine stages, what each requires, and the Shiplight features that implement them.
Traditional QA assumes a relatively stable codebase, deliberate authorship, and slow change. Vibe coding inverts all three:

- Code is generated in seconds from prompts rather than written deliberately.
- The same feature is regenerated multiple times, each time with different internal structure.
- Throughput is bounded by prompt iteration speed, not human typing speed.
A QA process designed for the traditional shape collapses under vibe-coding throughput. Reviewing every AI-generated PR by hand becomes a bottleneck within a sprint. Selector-bound test suites break weekly. Manual regression cycles can't keep pace with the deploy cadence. The 9-stage workflow below replaces the manual choke points with structural gates that scale.
For the broader category, see what is vibe testing and vibe coding quality issues: a triage playbook.
The QA process starts before the first prompt to the AI. Every feature has a user intent — "the user can sign up with email, verify, and reach the dashboard within 60 seconds" — and that intent is the source of truth. Without it, the AI generates something, the engineer accepts it, and there's no contract to test against.
Capture intent in plain language as part of the ticket or PR description. Three rules:
Shiplight surface: intent statements feed directly into Shiplight YAML Test Format. The same plain-English intent that goes in the ticket becomes the YAML test step.
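As a sketch, the signup intent above might become a test like this (the field names are illustrative, not necessarily Shiplight's exact schema):

```yaml
# Hypothetical intent-based test. Field names are illustrative,
# not necessarily Shiplight's exact YAML Test Format.
test: signup-to-dashboard
intent: >
  The user can sign up with email, verify,
  and reach the dashboard within 60 seconds.
steps:
  - visit the signup page
  - enter a valid email and password, submit
  - open the verification email and click the link
  - assert the dashboard is visible
timeout: 60s
```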
Before the AI writes code, define the contracts the implementation must honor:

- API contracts: endpoint shapes, status codes, error formats.
- Data contracts: schemas, required fields, uniqueness and integrity rules.
- Behavioral contracts: what the user can and cannot do in the flow.
- Security contracts: authentication and authorization boundaries.
Contracts are the difference between "AI made something" and "AI made something that fits." Without contracts, AI fills the ambiguity with its training-data priors — which sometimes match your system and often don't. See requirements to E2E coverage.
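A sketch of how contracts might be captured in the ticket, using the four categories above (the structure is illustrative; any format the team can review and reference in the PR works):

```yaml
# Hypothetical contracts block attached to the ticket.
# The structure is illustrative, not a Shiplight format.
contracts:
  api:
    - "POST /signup returns 201 with { userId } on success, 409 on duplicate email"
  data:
    - "users.email is unique and stored lowercase"
  behavioral:
    - "unverified users cannot reach /dashboard"
  security:
    - "verification tokens expire after 24 hours and are single-use"
```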
With intent and contracts in hand, the AI coding agent (Cursor, Claude Code, OpenAI Codex, Copilot) generates the implementation. Three discipline points:
The output of this stage is code in a feature branch — not merged, not deployed. The next stages decide whether it ships.
This is the stage most teams skip and pay for later. Immediately after the AI generates the implementation, have it generate tests for the same intent — preferably in the same session, before the PR opens. The agent has full context, can produce relevant edge cases, and remembers what it just wrote.
The output of this stage is a set of intent-based YAML tests committed in the same PR as the implementation, covering the happy path plus the edge cases the agent identified, including negative cases (e.g., an unauthorized user cannot reach /admin). This is the largest single multiplier of coverage. Authoring throughput now tracks code generation, not human typing. See boost test coverage with agentic AI and testing layer for AI coding agents.
Shiplight surface: Shiplight MCP Server and AI SDK let coding agents call the testing tool as a callable resource — generate, run, and pass tests inside the same session they wrote the feature. See MCP for testing.
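As a sketch, an edge-case test the agent commits alongside the feature might look like this (same hypothetical schema as the Stage 1 example):

```yaml
# Hypothetical edge-case test generated by the agent in the
# same session as the feature. Schema is illustrative, as above.
test: signup-duplicate-email
intent: signing up with an already-registered email shows a clear error
steps:
  - visit the signup page
  - enter an email that already has an account, submit
  - assert an "email already registered" error is shown
```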
The AI-generated implementation and AI-generated tests both go to a human reviewer in the same pull request. The reviewer is not catching every bug — they're verifying three things:
Self-healing tools should emit unhealed-step patches as PR diffs, not silent rewrites — preserving the audit trail for this review. See self-healing vs manual maintenance.
The PR runs an automated CI gate before merge. Three categories of check:

- The intent-based test suite from Stage 4 passes.
- Lint and static-analysis checks pass.
- Contract checks pass: the API shapes, schemas, and security rules from Stage 2 still hold.
The gate is blocking — failure prevents merge. Nightly regression doesn't replace this; it supplements. The 16-hour gap between "PR merged" and "nightly catches the bug" is incompatible with vibe-coding deploy cadence. See a practical quality gate for AI pull requests and E2E testing in GitHub Actions: setup guide.
Shiplight surface: Shiplight Cloud runners + CI integration produce structured failure output (replay video, DOM snapshot, diff per step) — not stack traces a reviewer has to dig through.
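A minimal sketch of the blocking gate as a GitHub Actions workflow. It assumes a Node project, and the `shiplight` CLI invocation is hypothetical, standing in for however your test runner is actually triggered:

```yaml
# Sketch of a blocking PR gate in GitHub Actions.
# Assumes a Node project; the shiplight command is hypothetical --
# substitute your actual runner invocation.
name: pr-gate
on: pull_request
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint              # lint / static analysis
      - run: npx shiplight run tests/  # intent tests (hypothetical CLI)
```

To make the gate actually blocking, mark the job as a required status check in the repository's branch protection rules; otherwise a red check is advisory only.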
After merge, the change deploys to staging. A human (or autonomous explorer) runs exploratory testing on the staging environment — looking for things the contract and tests didn't cover:
Findings from staging exploration become new tests, fed back into the regression suite (Stage 9). The bug class AI is worst at catching is the one that exists in flows nobody documented; staging exploration is where you find those flows. See how to test vibe-coded applications for reliability.
Once in production, the QA process continues. Three observability layers:

- Error tracking (e.g., Sentry) for exceptions users actually hit.
- Performance and usage dashboards (e.g., Datadog) for slow degradation.
- Alerts on the contracts from Stage 2, so a silent contract break pages someone.
When monitoring catches an issue users hit, the fix workflow re-enters at Stage 1: intent → contracts → AI implementation → ... The loop closes.
The final stage is the discipline that makes the loop self-improving: every production bug becomes a permanent regression test. Three steps:

- Capture the bug as a plain-language intent, the same way a feature starts in Stage 1.
- Have the coding agent translate the bug into an intent-based test, ideally in the same session as the fix.
- Commit the test to the regression suite in the fix PR, so it runs on every future PR.
Without Stage 9, the same bug ships twice. With it, the suite gets smarter with every incident. See tribal knowledge to executable specs and postmortem-driven E2E testing.
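As a sketch, a production incident in the verification flow might become a permanent test like this (hypothetical scenario, same illustrative schema as above):

```yaml
# Hypothetical regression test derived from a production bug.
# Scenario and schema are illustrative, as above.
test: regression-expired-verification-link
intent: >
  An expired verification link shows a "link expired" page
  with a resend option instead of a blank error.
steps:
  - request a verification email
  - open a verification link whose token has expired
  - assert the "link expired" page is shown
  - assert a "resend verification" button is present
```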
| Stage | Owner | Artifact | Gate |
|---|---|---|---|
| 1. Intent | PM or feature owner | One-sentence user-goal statement in the ticket | Ticket cannot enter sprint without intent |
| 2. Contracts | Tech lead or domain owner | API / data / behavioral / security contracts | PR cannot open without contracts referenced |
| 3. AI implementation | Engineer (with coding agent) | Code in a feature branch | None — output of this stage is the input to Stage 4 |
| 4. AI-generated tests | Coding agent via MCP / SDK | YAML intent tests committed in same PR | PR cannot open without test files |
| 5. Human review | Reviewer (engineer or QA) | Approved or change-requested PR | Approval required to proceed |
| 6. CI gates | Automated | Test + lint + contract check results | Blocking — fail = no merge |
| 7. Staging exploration | QA engineer / product | Exploration session notes + new bug tickets | Soft gate — recommended before release tag |
| 8. Production monitoring | On-call / SRE | Dashboards + alerts on contracts | Continuous — feeds back into Stage 1 |
| 9. Regression learning | QA engineer | New intent-based test for the production bug | Required for every shipped bug fix |
Each row is its own discipline. Skipping any row pushes work into one of the later rows — usually the most expensive one (production-incident response).
Most teams adopting vibe coding QA don't get to all 9 stages at once. The realistic maturity curve:

- Level 1 (Reactive): bugs found by users.
- Level 2 (Manual gate): humans review PRs, but there are no automated tests.
- Level 3: AI-generated tests added in the same PR.
- Level 4: PR-time CI gates blocking merge.
- Level 5: the full 9-stage loop, including production monitoring and regression learning.
Each level is the floor above the previous one. Skipping levels (e.g., trying to go straight from Level 2 to Level 5) usually fails — the missing stages produce gaps that defeat the gates above them.
| Stage | Shiplight surface |
|---|---|
| 1. Intent | YAML Test Format — intent statements double as test steps |
| 2. Contracts | Used as inputs to test generation; not Shiplight-specific |
| 3. AI implementation | External — your coding agent (Cursor, Claude Code, Codex) |
| 4. AI-generated tests | Shiplight AI SDK + MCP Server |
| 5. Human review | Tests as plain YAML in git, reviewable in standard PR flow |
| 6. CI gates | Shiplight Cloud runners + GitHub Actions / GitLab CI / CircleCI integration |
| 7. Staging exploration | Shiplight autonomous flow discovery (Plugin) |
| 8. Production monitoring | External (Sentry / Datadog) — outside Shiplight scope |
| 9. Regression learning | Coding agent + Shiplight MCP — bug-to-test translation in the same session |
Six of the nine stages map directly to Shiplight surfaces. The other three (implementation, monitoring, contracts) are handled by adjacent tools — Shiplight integrates with them rather than replacing them.
A realistic schedule for a team adopting the full 9 stages from scratch:

- Week 1: ticket and PR templates with intent + contracts.
- Week 2: intent-based test authoring for new features.
- Week 3: PR-time CI gates.
- Week 4: AI-coding-agent test authoring via MCP.
- Month 2: staging exploration discipline.
- Month 3: regression-learning discipline.
By month 3 you're at Level 5 maturity. See the 30-day agentic E2E playbook for the condensed timeline.
A solid vibe coding QA process is a 9-stage workflow: Intent → Contracts → AI implementation → AI-generated tests → Human review → CI gates → Staging exploration → Production monitoring → Regression learning. Each stage has an explicit owner, an artifact, and (where appropriate) a blocking gate. The goal is structure — enough constraint that AI-coding-agent throughput produces predictable, reviewable, reversible changes. Most teams converge on this 9-stage shape over 4–8 weeks of refinement.
Vibe coding inverts three assumptions traditional QA depends on: code is generated in seconds from prompts rather than written deliberately; the same feature is regenerated multiple times with different internal structure; throughput is bounded by prompt iteration speed, not human typing speed. A traditional QA process — manual regression, selector-bound tests, single-stage gate at deploy time — collapses under that throughput. The 9-stage process replaces manual choke points with structural gates that scale.
Stage 4 — AI-generated tests in the same session as the feature. Most teams skip this and rely on "we'll write tests next sprint," which never happens because the coding agent has moved on and the test-writing context is lost. Adding Stage 4 is the largest single multiplier of test coverage because authoring throughput now tracks code generation throughput instead of human authoring speed. See boost test coverage with agentic AI.
Traditional QA assumes deliberate authorship, stable codebases, and slow change. The QA team writes tests after the feature ships; the test suite is selector-bound; regression runs nightly. Vibe coding QA assumes AI-generated authorship, churning code, and rapid change. Tests are AI-generated in the same session as the feature; the test suite is intent-based and self-healing; regression runs on every PR. The 9-stage process is built specifically for the second assumption set.
The coding agent (Cursor, Claude Code, OpenAI Codex, GitHub Copilot) needs a callable testing tool — typically a Model Context Protocol (MCP) server or an SDK — to author and run tests during the build session. Install the Shiplight MCP Server and configure the agent's MCP tools list to include Shiplight; then, when you prompt the agent to add a feature, it can generate the test as well. See MCP for testing and agent-first testing.
A realistic schedule: ticket and PR templates with intent + contracts (Week 1), intent-based test authoring for new features (Week 2), PR-time CI gates (Week 3), AI-coding-agent test authoring via MCP (Week 4), staging exploration discipline (Month 2), regression-learning discipline (Month 3). By month 3 you're operating at Level 5 maturity. See the 30-day agentic E2E playbook.
Five levels: (L1) Reactive — bugs found by users; (L2) Manual gate — human reviews PRs but no automated tests; (L3) AI-generated tests added in the same PR; (L4) PR-time CI gates blocking merge; (L5) Full 9-stage loop including production monitoring and regression learning. Each level is the floor above the previous. Most teams that try to skip from L2 to L5 fail; incremental adoption — adding one stage at a time — works.
Stage 1 (Intent): PM or feature owner. Stage 2 (Contracts): tech lead or domain owner. Stage 3 (AI implementation): engineer with their coding agent. Stage 4 (AI-generated tests): coding agent via MCP/SDK. Stage 5 (Human review): reviewer (engineer or QA). Stage 6 (CI gates): automated; QA engineer maintains the gate config. Stage 7 (Staging): QA engineer + product. Stage 8 (Production monitoring): on-call / SRE. Stage 9 (Regression learning): QA engineer. Each stage having a named owner is the difference between a disciplined process and a good-intentions wishlist.
The V-model and testing pyramid describe what to test (which levels, which dimensions). The 9-stage process describes when and with what discipline to test in a vibe-coding workflow. They're complementary, not substitutes. Your team uses the pyramid to decide unit vs integration vs E2E coverage; you use the 9-stage process to organize how AI-generated changes flow from prompt to production. See what is software testing for the foundational levels.
The two stages most teams under-invest in are Stage 1 (intent capture) and Stage 4 (AI-generated tests in the same session). The first sets the contract everything else verifies against; the second is what makes coverage scale with agent throughput rather than human authoring. Skip either, and the later gates become reactive cleanup rather than proactive constraint. Everything else — CI gates, monitoring, regression learning — depends on those two being in place.
---
Vibe coding moves the bottleneck from typing speed to prompt iteration speed. The QA process has to move with it — not by adding "more tests," but by inserting structural gates at the right points in the workflow. The 9-stage process above is what most teams converge on after 1–2 quarters of iteration. Setting it up from scratch is a 4–8 week project, not a 6-month transformation. The teams that ship this faster than their competitors are the ones whose engineering velocity actually compounds rather than oscillating between "we shipped a lot" and "we spent the next sprint cleaning up."
For teams ready to operationalize the 9-stage process, Shiplight AI implements six of the nine stages: YAML Test Format for intent-based authoring, Plugin for self-healing execution and autonomous staging exploration, AI SDK and MCP Server for agent-callable test generation, and Cloud runners for PR-time CI gates. Book a 30-minute walkthrough and we'll map your current QA workflow to the 9 stages and identify the highest-leverage gaps.