Nick Baynham

Flagship project · Phases 0–13 shipped · Project complete

Test Commander

AI-assisted software testing from exploration to automation.

A practical workflow for testers: explore software, identify user flows, design scenarios, generate BDD specs, produce deterministic Playwright automation, and ship quality reports — all from the terminal. Not a replacement for testers. An assistant for them.

Autonomous where safe. Human-governed where it matters.

~/test-commander · workflow demo
$ ./bootstrap.sh && make install
$ /tc:init
$ /tc:review-requirements
$ /tc:learn-from-docs
$ /tc:create-charter --target "Sign-in flow"
$ /tc:explore --charter CH-001
$ /tc:test-ideas --session SESS-20260528-600
$ /tc:generate-bdd
$ /tc:automation-plan
$ /tc:automate
$ /tc:traceability-map

// These commands take a tester from a cold local app through requirements, knowledge, exploration, BDD, and a generated Playwright suite. The agent does the heavy lifting; the tester stays in charge of every decision in between.

Test Commander turns human testing insight into structured automation and quality evidence.

Human judgement
Testers stay in charge of scope, risk, and what “good enough” means.
Agent assistance
Exploration, documentation, drafting, generation, summarization.
Deterministic evidence
Playwright tests, BDD specs, and reports a CI/CD pipeline can rely on.

What it is

A terminal-first testing assistant.

Test Commander is a workflow and framework concept for modern quality engineering. It treats AI as a structured assistant, not a magic test generator. The tester stays in control; the agent does the heavy lifting.

  1. Exploratory notes
  2. Structured ideas
  3. BDD specifications
  4. Playwright tests
  5. Quality evidence

Why it exists

Most testing teams hit the same wall.

Manual testing knowledge stays trapped in people’s heads. Exploration produces insight but inconsistent documentation. Automation lags behind delivery. Reports are manual, inconsistent, or missing. Teams want to use AI but worry about reliability. Test Commander connects exploration, design, automation, and reporting so the work compounds instead of dispersing.

  • Manual testing knowledge becomes reusable, not lost.
  • Exploration produces documentation, not anecdotes.
  • Automation starts from intent, not screen-recorded clicks.
  • Reports communicate evidence, not opinions.
  • AI accelerates the boring parts; humans own the calls.
  • The same workflow runs locally and in CI/CD.

For manual testers

Your thinking becomes the input.

Think of Test Commander as an assistant that turns your testing instincts into organized artifacts. You decide what matters. The agent helps document, structure, and automate. Here is what that conversion looks like on a real workflow.

You observe

Exploring a shopping-cart flow, you notice friction the team has not catalogued.

  • The cart count does not always update.
  • Invalid search terms are handled inconsistently.
  • Required form fields do not show clear errors.
  • Login behavior changes after a failed attempt.
  • Some buttons are hard to locate reliably.

Test Commander produces

Those observations become reusable artifacts in minutes, ready for review.

  • User-flow documentation for the cart and checkout.
  • Page-object candidates with stable locators.
  • Defects and risks logged with severity hints.
  • BDD scenarios you can review and approve.
  • Playwright tests and quality-report findings.

The workflow

A loop, not a one-way street.

Seven steps, designed to feed each other. Insight gathered in one cycle seeds the exploration in the next.

  1. 01
    Explore
    Charter-based sessions with structured anomaly capture.
  2. 02
    Model
    Ingest documents, specs, code, recordings, tests.
  3. 03
    Specify
    Reviewed requirements and traceable BDD specs.
  4. 04
    Automate
    Playwright suite generated from scored candidates.
  5. 05
    Execute
    Local + CI runs with evidence captured per run.
  6. 06
    Report
    Quality report with history; release-readiness scoring.
  7. 07
    Improve
    Governed lessons; nothing promoted silently.

01. Explore

Phase 4 (shipped 2026-05-28). /tc:create-charter scopes a session against the project knowledge; /tc:explore classifies every recorded Playwright event into six universal observation types and six universal anomaly categories with a Charter-Coverage matrix; an internal exploration-review sub-mode routes gap signals to requirements/open-questions.md.

02. Model

Phase 3 (shipped 2026-05-27). Five /tc:learn-from-* commands extract entities, terms, journeys, endpoints, modules, recorded responses, and test coverage into ten structured product-knowledge artifacts under .test-commander/product-knowledge/ with full path:line provenance. A shared synthesizer rebuilds system-model.md byte-deterministically at the end of every run.
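A minimal sketch of what a byte-deterministic rebuild can look like. It assumes a hypothetical `render_system_model` function and a simplified input shape (section name mapped to entries); the real synthesizer's inputs and layout are not described on this page. The idea is to sort everything, emit no timestamps, and fix the newline and encoding so unchanged input always yields the same bytes.

```python
import hashlib


def render_system_model(artifacts: dict[str, list[str]]) -> bytes:
    """Render sections in sorted order so output never depends on input order.

    Hypothetical helper: section names and layout are illustrative only.
    """
    lines = ["# System model", ""]
    for section in sorted(artifacts):
        lines.append(f"## {section}")
        # Deduplicate and sort entries; no timestamps or random IDs are emitted.
        for entry in sorted(set(artifacts[section])):
            lines.append(f"- {entry}")
        lines.append("")
    # Fixed "\n" newlines and UTF-8 keep the bytes identical across platforms.
    return "\n".join(lines).encode("utf-8")


def digest(data: bytes) -> str:
    """Hash the rendered bytes so two runs can be compared cheaply."""
    return hashlib.sha256(data).hexdigest()


# Same content, different ordering and a duplicate entry.
first = render_system_model(
    {"entities": ["User", "Workspace"], "endpoints": ["POST /sessions"]}
)
second = render_system_model(
    {"endpoints": ["POST /sessions"], "entities": ["Workspace", "User", "User"]}
)
```

Comparing `digest(first)` with `digest(second)` is one way a test suite could assert that a re-run against unchanged input produces no diff.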

03. Specify

Phase 2 (shipped 2026-05-27) ships the requirements layer; Phase 5 (shipped 2026-05-29) turns enriched test ideas into Gherkin. /tc:generate-bdd renders one scenario per enrichment candidate with machine-readable @req:/@cs: provenance, /tc:review-bdd runs a six-category universal rubric, and /tc:traceability-map rebuilds the requirement and scenario-level maps tying each requirement forward to the scenarios that exercise it.

04. Automate

Phase 6 (shipped 2026-05-29) produces the project’s first executable artifacts. /tc:build-framework lazily scaffolds a Playwright + TypeScript framework; /tc:automation-plan scores every scenario against a seven-factor suitability rubric; /tc:automate generates page objects, fixtures, and specs with @req:/@cs: provenance; /tc:review-automation enforces quality; and /tc:generate-test-data keeps data in .test-commander/test-data/ rather than inline in code.
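To make the scoring step concrete, here is a rough sketch of a suitability rubric that ranks a scenario as automate / consider / manual. It is not the shipped rubric: only five of the seven factors are named anywhere on this page (business criticality, repeatability, determinism, UI stability, maintenance cost), so the sketch uses just those five. The 0-5 scale, equal weights, and thresholds are all assumptions.

```python
from dataclasses import dataclass

# Five of the seven factors are named on this page; the other two are not,
# so this sketch leaves them out rather than guessing.
FACTORS = (
    "business_criticality",
    "repeatability",
    "determinism",
    "ui_stability",
    "maintenance_cost",  # scored so that higher means cheaper to maintain
)


@dataclass(frozen=True)
class Candidate:
    scenario_id: str
    scores: dict[str, int]  # each factor scored 0-5 (assumed scale)


def suitability(candidate: Candidate) -> float:
    """Average the factor scores into a 0.0-1.0 value (equal weights assumed)."""
    total = sum(candidate.scores[name] for name in FACTORS)
    return total / (5 * len(FACTORS))


def rank(candidate: Candidate, automate_at: float = 0.7, consider_at: float = 0.4) -> str:
    """Map the score onto automate / consider / manual. Thresholds are hypothetical."""
    score = suitability(candidate)
    if score >= automate_at:
        return "automate"
    if score >= consider_at:
        return "consider"
    return "manual"


happy_path = Candidate("CS-600-010", dict.fromkeys(FACTORS, 4))
flaky_flow = Candidate(
    "CS-600-001", {**dict.fromkeys(FACTORS, 1), "business_criticality": 5}
)
```

Under these assumed thresholds, `rank(happy_path)` lands in the automate bucket while `rank(flaky_flow)` stays manual even though it is business-critical, which is the kind of trade-off a scored plan makes visible for human review.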

05. Execute

Phase 7 (shipped). /tc:run orchestrates suite execution and /tc:analyze-results triages failures; per-run records land in .test-commander/runs/; screenshots, traces, and logs route to .test-commander/evidence/ with the policy defined in config.yaml. The same workflow runs locally and in CI.

06. Report

Phase 7 (shipped). /tc:report writes .test-commander/quality-report/current-quality-report.md and snapshots a copy to history/YYYY-MM-DD-HHmm.md. /tc:quality-gate evaluates release-readiness against project-defined thresholds. Facts, interpretation, and human-review items stay clearly separated.
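The current-plus-history layout can be sketched in a few lines. The function name `snapshot_report` is hypothetical, and the sketch assumes the history/ folder sits beside the current report; only the file names and the YYYY-MM-DD-HHmm pattern come from the description above.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot_report(workspace: Path, now: datetime) -> Path:
    """Copy the current quality report into history/ under a timestamped name.

    Hypothetical helper; assumes history/ lives next to the current report.
    """
    report_dir = workspace / "quality-report"
    current = report_dir / "current-quality-report.md"
    history = report_dir / "history"
    history.mkdir(parents=True, exist_ok=True)
    # YYYY-MM-DD-HHmm, matching the naming pattern quoted above.
    target = history / f"{now:%Y-%m-%d-%H%M}.md"
    shutil.copyfile(current, target)
    return target
```

Passing the timestamp in as an argument, rather than reading the clock inside the function, keeps the helper easy to test and leaves the current report untouched.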

07. Improve

Phase 8 (shipped). /tc:learn, /tc:learn-from-failures, /tc:learn-from-exploration, /tc:learn-from-feedback, /tc:review-lessons, and /tc:promote-lessons turn the workspace into a learning loop. Every promotion is visible in git diff — Test Commander never silently rewrites methodology.

The full roadmap

Every phase shipped, foundation first.

Test Commander was built phase by phase, each landing under strict test-driven discipline with its own annotated git tag (phase-0 through phase-13). The foundation (Phases 0–6) established the workspace and the exploration-to-automation pipeline; the later phases (7–13) added execution, reporting, learning, visuals, a governed web console, an API and MCP server, sandboxes, and a continuous quality agent. The whole roadmap is now complete.

Foundation · Phases 0–6

  • Phase 0 — Repository foundation, plugin scaffold, marketplace registration
  • Phase 1 — Workspace and artifact model (/tc:init, /tc:status, /tc:journal, /tc:next)
  • Phase 2 — Requirements quality (16-dimension rubric, INVEST review, acceptance-criteria review, coverage map, seeded test-ideas)
  • Phase 3 — Project knowledge ingestion (five /tc:learn-from-* helpers, shared synthesizer, ten product-knowledge artifacts)
  • Phase 4 — Charter-based exploratory testing (charters, recorded-session replay, session summaries, Phase-2 seed enrichment)
  • Phase 5 — BDD generation and traceability (Gherkin features with @req:/@cs: linkage, six-category review, requirement + scenario maps)
  • Phase 6 — Lazy Playwright/TypeScript framework, seven-factor automation plan, generated suite + review, test-data discipline (D6)

Now also shipped · Phases 7–13

  • Phase 7 — Execution, evidence policy, and the quality report with committed history
  • Phase 8 — Governed continuous learning loop (nothing promoted silently)
  • Phase 9 — Mermaid diagrams + infographics (eight /tc:diagram-* commands)
  • Phase 10 — Read-only web console (dashboard, journal, BDD viewer, run history, evidence)
  • Phase 10.5 — Controlled agent execution: the single governed-execution pipeline
  • Phase 11 — Runtime API + schema-first MCP server (front-ends to the same pipeline)
  • Phase 12 — Sandboxed environments launched from GitHub Actions, safe-by-default
  • Phase 13 — Continuous quality agent with five autonomy modes

What ships

Twenty skills, 64 commands, one workspace per project.

Each tc-* skill is owned in-repo (Decision D1: no community-skill dependencies). The commands route to bundled Python helpers; the workspace lives at .test-commander/ inside the consuming project and is committed to git like any other source artifact. Three skills (tc-evidence, tc-governance, tc-mcp) ship runtime components rather than /tc:* commands. The full per-command reference lives on the documentation page.

  1. tc-core

    Phase 1 · shipped

    Workspace orchestration. Initialize, inspect, journal, recommend.

    • /tc:init
    • /tc:status
    • /tc:journal
    • /tc:next
  2. tc-requirements

    Phase 2 · shipped

    Requirements quality. 16-dimension rubric, INVEST review, AC review, coverage, seed test-ideas.

    • /tc:review-requirements
    • /tc:review-user-stories
    • /tc:review-acceptance-criteria
    • /tc:requirements-coverage
    • /tc:requirements-to-tests
  3. tc-knowledge

    Phase 3 · shipped

    Project knowledge ingestion. Five helpers extract structured artifacts from documents, specs, code, recorded API traffic, and existing tests.

    • /tc:learn-from-docs
    • /tc:learn-from-specs
    • /tc:learn-from-code
    • /tc:learn-from-api
    • /tc:learn-from-tests
  4. tc-explore

    Phase 4 · shipped

    Charter-based exploratory testing. Scope a session, replay a recorded Playwright run, synthesize the summary, enrich the Phase-2 test-idea seeds.

    • /tc:create-charter
    • /tc:explore
    • /tc:session-summary
    • /tc:test-ideas
  5. tc-bdd

    Phase 5 · shipped

    BDD generation and review. Render Gherkin from enriched test ideas with @req:/@cs: provenance; run a six-category universal rubric.

    • /tc:generate-bdd
    • /tc:review-bdd
  6. tc-traceability

    Phase 5 · shipped

    The cross-cutting map. Rebuild the requirement and scenario-level traceability chains; downstream links resolve as phases populate them.

    • /tc:traceability-map
  7. tc-build-framework

    Phase 6 · shipped

    The lazy framework. Scaffold the project-root tests/ tree, playwright.config.ts, and package.json only when automation first needs them (D8).

    • /tc:build-framework
  8. tc-automation-plan

    Phase 6 · shipped

    The strategic gate. Score every scenario against a seven-factor suitability rubric and rank each automate / consider / manual.

    • /tc:automation-plan
  9. tc-automate

    Phase 6 · shipped

    Generation and review. Render page objects, fixtures, and specs with provenance and fixture-mediated data; mechanically review the result.

    • /tc:automate
    • /tc:review-automation
  10. tc-test-data

    Phase 6 · shipped

    The data discipline. Populate test-data/ seed JSON and a per-area spec so nothing is inlined in test code (D6).

    • /tc:generate-test-data
  11. tc-run

    Phase 7 · shipped

    Execution and triage. Orchestrate the suite, capture per-run records, and classify failures without weakening assertions.

    • /tc:run
    • /tc:analyze-results
  12. tc-quality-report

    Phase 7 · shipped

    The quality report. Write the current report with committed history and evaluate release-readiness against project thresholds.

    • /tc:report
    • /tc:quality-gate
  13. tc-evidence

    Phase 7 · shipped

    The evidence indexer. Route screenshots, traces, and logs into .test-commander/evidence/ per the config policy. Runtime; no /tc:* commands.

    • evidence indexer
  14. tc-learning

    Phase 8 · shipped

    The governed learning loop. Capture lessons from failures, exploration, and feedback; review and promote them in visible git diffs.

    • /tc:learn
    • /tc:learn-from-failures
    • /tc:learn-from-exploration
    • /tc:learn-from-feedback
    • /tc:review-lessons
    • /tc:promote-lessons
  15. tc-visualize

    Phase 9 · shipped

    Visual documentation. Eight diagram types, infographics, and a deterministic renderer turn the workspace into Mermaid sources and rendered assets.

    • /tc:visualize
    • /tc:diagram-*
    • /tc:generate-infographic
    • /tc:render-visuals
  16. tc-web

    Phase 10 · shipped

    The read-only web console. A team-facing viewer over the committed workspace — dashboard, journal, BDD, runs, evidence — that never invents data.

    • /tc:web-init
    • /tc:web-start
    • /tc:web-sync
    • /tc:web-index-artifacts
    • /tc:web-export
  17. tc-governance

    Phase 10.5 · shipped

    The controlled-execution pipeline. Intent → plan → policy → approval → bounded execution → validation → audit. Default deny; the single path every action takes. Runtime; no /tc:* commands.

    • governance pipeline
  18. tc-mcp

    Phase 11 · shipped

    Runtime API + MCP server. Alternative front-ends that drive Test Commander over HTTP and the Model Context Protocol — through the same pipeline. Runtime; no /tc:* commands.

    • Runtime API
    • MCP server
  19. tc-sandbox

    Phase 12 · shipped

    Sandboxed environments. Launch an on-demand, team-accessible Test Commander environment from GitHub Actions, governed and safe-by-default.

    • /tc:sandbox-init
    • /tc:sandbox-launch
    • /tc:sandbox-status
    • /tc:sandbox-sync
    • /tc:sandbox-stop
    • /tc:sandbox-export
  20. tc-continuous-quality

    Phase 13 · shipped

    The continuous quality agent. Watch changes, map impact, find coverage gaps, propose tests, and open labeled PRs — gated by five autonomy modes.

    • /tc:watch-changes
    • /tc:impact-analysis
    • /tc:coverage-gap-analysis
    • /tc:propose-tests
    • /tc:create-test-pr
    • /tc:continuous-quality-check

User-guide walkthroughs

One reproducible walkthrough per shipped phase.

Every shipped phase has its own walkthrough under docs/user-guide/ in the test-commander repo. Each one drives the seeded fixture end to end with verbatim sample output so a reader can reproduce the result in a tmp workspace.

  • Phase 1

    First workflow walkthrough

    From clone to /tc:next: init the workspace, edit project metadata, append a journal entry, ask what to do next.

  • Phase 2

    Reviewing requirements

    Upload requirements.md, run the rubric pass, surface mutually-exclusive open questions, seed tc-test-idea/v1 files for every REQ.

  • Phase 3

    Building project knowledge

    Drive five /tc:learn-from-* helpers against the seeded sample-project fixture; produce ten product-knowledge artifacts with file:line provenance.

  • Phase 4

    Exploring an app

    Charter -> explore -> session-summary -> test-ideas: scope an exploration, classify every recorded event into universal observation and anomaly cores, enrich the Phase-2 seeds.

  • Phase 5

    Generating BDD

    generate-bdd -> review-bdd -> traceability-map: render Gherkin from enriched test ideas with @req:/@cs: linkage, run the six-category rubric, and rebuild the requirement and scenario-level maps.

  • Phase 6

    Automating a suite

    build-framework -> automation-plan -> automate -> review-automation -> generate-test-data: score scenarios, generate a traceable Playwright/TypeScript suite, and keep test data out of the code.

  • Phase 7

    Running tests

    run -> analyze-results -> report -> quality-gate: execute the suite, triage failures, write the quality report with committed history, and score release-readiness.

  • Phase 8

    The learning loop

    Capture lessons from failures, exploration, and feedback, then review and promote them — every promotion visible in git diff, nothing rewritten silently.

  • Phase 9

    Visuals and infographics

    visualize -> the eight diagram-* commands -> generate-infographic -> render-visuals: turn the workspace into Mermaid sources and deterministically rendered assets.

  • Phase 10

    The web console

    web-init -> web-start -> web-sync: bring up a read-only, team-facing viewer over the committed workspace. Renders the artifacts; never invents data or runs a command.

  • Phase 10.5

    Governance

    The controlled-execution pipeline: how a user request becomes a planned, permission-checked, approved, validated, and audited action — with default deny and no backdoor.

  • Phase 11

    Integrating (API + MCP)

    Drive Test Commander from another tool or agent over the Runtime API or the schema-first MCP server — both front-ends to the same governed pipeline.

  • Phase 12

    Sandboxes

    sandbox-init -> sandbox-launch -> sandbox-status -> sandbox-stop: launch an on-demand environment from GitHub Actions, allow-listed and private-range-blocked by default.

  • Phase 13

    Continuous quality

    watch-changes -> impact-analysis -> coverage-gap-analysis -> propose-tests -> create-test-pr: the watch -> analyze -> propose -> PR loop, gated by the configured autonomy mode.

For a single hands-on tour that walks Phases 1 through 4 against one tmp project, read the blog post Test Commander after Phase 4: a hands-on tour.

What it produces

Artifacts you can review, automate, and ship.

The output is not just “a test ran.” It is structured artifacts your team can read, edit, version-control, and learn from.

Charter · YAML frontmatter (Phase 4 · shipped)

.test-commander/charters/CH-001.md
---
id: CH-001
mission: Discover whether the Sign-in flow plus workspace-detail asset upload
  behaves correctly under the documented risk conditions.
target: Sign-in flow plus workspace-detail asset upload (POST /workspaces/{id}/assets).
time-box: 60min
risk-areas:
  - Authentication / authorization boundaries
  - Session lifecycle and token leakage
  - Performance under documented load thresholds
acceptance-criteria:
  - Every flow under '...' completes the happy path with documented status codes.
  - Authentication is correctly enforced for every endpoint that should require it.
  - At least one anomaly per universal category is documented or explained away.
created_at: 2026-05-28T18:47:33Z
phase_3_sources:
  - product-knowledge/entities.md
  - product-knowledge/user-journeys.md
  - requirements/open-questions.md
---

Exploration note · table excerpt (Phase 4 · shipped)

.test-commander/exploration-notes/SESS-20260528-600.md
# SESS-20260528-600 - exploration note for CH-001

## Observations

| # | event_type      | Page             | Result |
| - | --------------- | ---------------- | ------ |
| 0 | page_load       | /sign-in         | ok     |
| 4 | click           | /sign-in         |        |
| 5 | network_request | /sign-in         | 201    |
| 8 | network_request | /dashboard       | 200    |

## Anomalies

| Category         | Severity | Page             | Evidence |
| ---------------- | -------- | ---------------- | -------- |
| auth-mismatch    | high     | /workspaces/ws-1 | S-005    |
| broken-link      | medium   | /account/profile | S-004    |
| slow-response    | high     | /dashboard       | S-002    |

Enriched test-idea · Phase-2 seed + Phase-4 enrichment

.test-commander/test-ideas/REQ-005.md
---
schema: tc-test-idea/v1
requirement_id: REQ-005
requirement_title: All API access requires an authenticated user account
status: enriched              # was: status: seed (Phase 2)
phase_4_sessions: [SESS-20260528-600]
phase_2_findings: [completeness, consistency, testability]
candidates:                   # Phase 2 seeded; preserved byte-for-byte
  - id: REQ-005-happy-01
    title: Happy path
    type: positive
generated_by: /tc:requirements-to-tests
---

## Phase 4 enrichment

### SESS-20260528-600

- **CS-600-001** (negative) - Reproduce auth-mismatch on /workspaces/ws-1
- **CS-600-010** (happy)    - Happy path: POST /sessions returns 201

BDD feature · @req:/@cs: linkage (Phase 5 · shipped)

.test-commander/bdd/features/sign-in.feature
@area:sign-in
Feature: Sign-in flow

  @req:REQ-005 @cs:CS-600-010 @smoke
  Scenario: Happy path - authenticated session is created
    Given a registered user on the sign-in page
    When they submit valid credentials
    Then the session is created and the dashboard loads

  @req:REQ-005 @cs:CS-600-001 @regression @anomaly:auth-mismatch
  Scenario: Authorization boundary is enforced
    Given an authenticated user without workspace access
    When they request a protected workspace asset
    Then the request is rejected
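Because the tags are machine-readable, a requirement-to-scenario map can be rebuilt from the feature text alone. The sketch below is an illustration of that join, not the /tc:traceability-map implementation; `requirement_map` is a hypothetical name, and it handles only the simple layout shown above (a tag line directly followed by its Scenario line).

```python
import re
from collections import defaultdict

REQ_TAG = re.compile(r"@req:([A-Za-z0-9-]+)")


def requirement_map(feature_text: str) -> dict[str, list[str]]:
    """Map each @req: id to the titles of the scenarios tagged with it."""
    mapping: dict[str, list[str]] = defaultdict(list)
    pending_reqs: list[str] = []
    for raw in feature_text.splitlines():
        line = raw.strip()
        if line.startswith("@"):
            # A tag line applies to the Scenario (or Feature) line that follows it.
            pending_reqs = REQ_TAG.findall(line)
        elif line.startswith("Scenario:"):
            title = line.removeprefix("Scenario:").strip()
            for req in pending_reqs:
                mapping[req].append(title)
            pending_reqs = []
        elif line.startswith("Feature:"):
            # Feature-level tags are not attributed to any single scenario here.
            pending_reqs = []
    return dict(mapping)


FEATURE = """\
@area:sign-in
Feature: Sign-in flow

  @req:REQ-005 @cs:CS-600-010 @smoke
  Scenario: Happy path - authenticated session is created
    Given a registered user on the sign-in page

  @req:REQ-005 @cs:CS-600-001 @regression @anomaly:auth-mismatch
  Scenario: Authorization boundary is enforced
    Given an authenticated user without workspace access
"""

links = requirement_map(FEATURE)
```

Here `links` ties REQ-005 to both scenarios; the same loop could collect the @cs: ids to build the scenario-level map.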

Generated spec · provenance + fixture data (Phase 6 · shipped)

tests/e2e/sign-in.spec.ts
import { test, expect } from "../fixtures/sign-in";

// @req:REQ-005 @cs:CS-600-010
test("Happy path - authenticated session is created", async ({
  signInPage,
  data,
}) => {
  await signInPage.goto();
  await signInPage.signIn(data.validUser);
  await expect(signInPage.dashboard).toBeVisible();
});

// Generated by /tc:automate · refine steps inside the preserved region.
// Data flows from .test-commander/test-data/seed/sign-in.json (D6).

Terminal workflow

Seven commands, end to end.

Test Commander is terminal-first by design. A tester who is comfortable with the command line never has to open an IDE to drive the workflow.

| Command                 | Purpose |
| ----------------------- | ------- |
| ./bootstrap.sh          | Verify prereqs (Python 3.12, PDM, Docker, git, make); auto-install the safe ones. |
| make install            | Validate plugin manifests, register the local Claude Code marketplace, install the test-commander plugin, verify the twenty shipped skills. |
| /tc:init                | Inside a consuming project, copy the 63-file workspace template into .test-commander/. Idempotent: existing files are preserved. |
| /tc:status              | Print a snapshot: per-bucket file counts, populated counts (bytes differ from template), per-phase status. Read-only. |
| /tc:journal append      | Append a timestamped narrative entry to today's journal/YYYY-MM-DD.md. Append-only; never edited in place. |
| /tc:next                | Read the workspace state and recommend the next /tc:* command for this project. |
| /tc:review-requirements | Run the 16-dimension rubric on uploaded requirements.md; emit requirements-review.md plus [<kind>] open-questions. |
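The idempotent behaviour described for /tc:init can be sketched as a copy that never overwrites. This is an assumption about the approach, not the bundled helper: `init_workspace` and its arguments are hypothetical names.

```python
import shutil
from pathlib import Path


def init_workspace(template: Path, workspace: Path) -> list[Path]:
    """Copy template files into the workspace, skipping files that already exist.

    Returns the files created on this run; a second run returns an empty list.
    """
    created: list[Path] = []
    # Sorted traversal keeps the result order stable between runs.
    for source in sorted(template.rglob("*")):
        if not source.is_file():
            continue
        target = workspace / source.relative_to(template)
        if target.exists():
            continue  # preserve whatever the tester already has
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
        created.append(target)
    return created
```

Calling it twice against the same project is safe: the first call creates the files, and the second leaves any edited file exactly as the tester left it.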

Who it’s for

Five different audiences, one workflow.

The same artifacts serve manual testers, automation engineers, QA managers, recruiters, and clients — each gets value at a different point in the lifecycle.

  • Manual testers

    Turn exploratory knowledge into reusable evidence.

    • Organize exploration into user flows and risks.
    • Convert observations into BDD scenarios you can review.
    • Participate in AI-assisted work without becoming a developer overnight.
  • Automation engineers

    Start from a model, not a pile of vague tickets.

    • Structured flows, locators, and page objects as inputs.
    • Spec-first generation that maps cleanly to Playwright.
    • Less rework because tests start from real testing intent.
  • QA managers

    A repeatable quality process with explicit guardrails.

    • Faster design, clearer coverage, stronger reporting.
    • Easier onboarding for new testers.
    • A practical way to adopt AI without losing human oversight.
  • Recruiters and hiring managers

    Evidence of modern QA judgement, not just tooling.

    • AI-assisted exploration, BDD, Playwright, CI/CD, reporting — connected.
    • Demonstrates human-in-the-loop quality systems.
    • Pairs automation engineering with strategic communication.
  • Clients

    A practical path to AI-enhanced quality engineering.

    • Modernize exploratory testing into a structured automation pipeline.
    • Generate readable specs your team can review and approve.
    • Stand up regression coverage and quality reports without starting from scratch.

Design principles

The opinions baked into the workflow.

  1. Principle 01

    Human-guided, not fully autonomous

    The agent helps; the tester owns the quality decision. AI output is never treated as automatically correct.

  2. Principle 02

    Universal cores, project-specific tuning

    Per Decision D19, every shipped detector uses universal English and software-engineering vocabulary only. Domain awareness enters additively through <workspace>/config.yaml — extensions union with the universal core; you cannot remove a default. The same rubric runs against a banking app, a hospital system, or an internal dashboard.

  3. Principle 03

    file:line provenance for every claim

    Every entity, business rule, endpoint, anomaly, candidate scenario, or open question Test Commander surfaces is paired with the path:line where it came from. The structured artifacts are indexes, not summaries. You can always answer 'where did this come from' without leaving the workspace.

  4. Principle 04

    Byte-deterministic re-runs

    Every shipped helper is idempotent. Re-running against unchanged input produces byte-identical output. The workspace is safe to commit to git like any other source artifact: reviews show up as real diffs, and nothing flickers on re-run.

  5. Principle 05

    Exploration before automation

    Automation starts from understanding. Identify what matters first, then encode it.

  6. Principle 06

    BDD as the bridge (Phase 5, shipped)

    Readable specs connect manual testers, automation engineers, and product stakeholders. Phase 2 authors the tc-test-idea/v1 schema and Phase 4 enriches it; that schema is the input contract Phase 5 reads. Every generated scenario carries @req:/@cs: tags, which serve as the mechanical join key the traceability map parses.

  7. Principle 07

    Deterministic tests for CI/CD (Phase 6, shipped)

    AI may help generate tests, but CI/CD needs reliable checks. Phase 6 generates and structurally validates a Playwright/TypeScript suite, but never invokes the runner — execution is Phase 7's job. Playwright stays the source of truth.

  8. Principle 08

    Separate facts from interpretation

    Reports distinguish observed, tested, passed, failed, inferred, and items needing human review. The Phase-7 quality report (shipped) enforces this separation.

  9. Principle 09

    One governed execution path

    From Phase 10.5 on, every action above read-only flows through a single controlled-execution pipeline — intent, plan, permission policy, approval gate, bounded execution, validation, audit. The web console, the Runtime API, the MCP server, sandboxes, and the continuous agent are all front-ends to it. Default deny; nothing bypasses the gates.
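A compact sketch of the default-deny idea behind the last principle. It is illustrative only: `run_governed`, the policy values ("allow", "needs-approval", "deny"), and the action names are hypothetical, and the plan and bounded-execution stages are collapsed into a single `execute` callable for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Outcome:
    allowed: bool
    audit: list[str] = field(default_factory=list)


def run_governed(
    action: str,
    policy: dict[str, str],
    approve: Callable[[str], bool],
    execute: Callable[[], str],
    validate: Callable[[str], bool],
) -> Outcome:
    """Walk one action through policy, approval, execution, validation, and audit."""
    outcome = Outcome(allowed=False)
    outcome.audit.append(f"intent:{action}")
    # Default deny: an action missing from the policy is treated as denied.
    decision = policy.get(action, "deny")
    outcome.audit.append(f"policy:{decision}")
    if decision == "deny":
        return outcome
    if decision == "needs-approval" and not approve(action):
        outcome.audit.append("approval:rejected")
        return outcome
    result = execute()
    valid = validate(result)
    outcome.audit.append(f"validation:{'ok' if valid else 'failed'}")
    outcome.allowed = valid
    return outcome


policy = {"run-suite": "needs-approval", "read-report": "allow"}
denied = run_governed(
    "delete-workspace", policy, lambda a: True, lambda: "done", lambda r: True
)
approved = run_governed(
    "run-suite", policy, lambda a: True, lambda: "3 passed", lambda r: "passed" in r
)
```

Every path, including the denied one, leaves an audit trail, and the unlisted action never reaches `execute` even though the approver would have said yes.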

Capability roadmap

All fourteen phases shipped.

What the system can do, phase by phase — every one now complete, each landed under test-driven discipline with its own annotated git tag. Pair this with the team-adoption maturity model below — the two roadmaps answer different questions.

  1. Phase 0

    Shipped

    Repo foundation

    Bootstrap script, plugin manifest, marketplace registration, skill verifier, link checker, CI scaffold.

  2. Phase 1

    Shipped

    Workspace + artifact model

    tc-core: /tc:init, /tc:status, /tc:journal, /tc:next. 63-file workspace template; per-phase recommendation engine.

  3. Phase 2

    Shipped

    Requirements + user-story intelligence

    tc-requirements: 16-dimension rubric, INVEST review, AC review, coverage map, tc-test-idea/v1 seeds.

  4. Phase 3

    Shipped

    Project knowledge ingestion

    tc-knowledge: five /tc:learn-from-* helpers (docs, specs, code, api, tests) + shared synthesizer; ten product-knowledge artifacts with file:line provenance.

  5. Phase 4

    Shipped

    Exploratory testing

    tc-explore: /tc:create-charter, /tc:explore + internal review sub-mode, /tc:session-summary, /tc:test-ideas enriching Phase-2 seeds.

  6. Phase 5

    Shipped

    BDD generation + traceability

    tc-bdd + tc-traceability: /tc:generate-bdd, /tc:review-bdd, /tc:traceability-map. Reads enriched test-ideas; emits Gherkin features tied to REQ-IDs with @req:/@cs: linkage.

  7. Phase 6

    Shipped

    Playwright framework + strategic automation

    tc-build-framework, tc-automation-plan, tc-automate, tc-test-data: lazy Playwright + TypeScript scaffolding, seven-factor automation scoring, generated suite + review, test data outside test code. The first executable artifacts.

  8. Phase 7

    Shipped

    Execution + evidence + quality report

    tc-run + tc-quality-report + tc-evidence: /tc:run, /tc:analyze-results, /tc:report, /tc:quality-gate. Per-run records; committed quality-report history.

  9. Phase 8

    Shipped

    Continuous learning

    tc-learning: governed lessons inbox; /tc:learn-from-failures, /tc:learn-from-exploration, /tc:learn-from-feedback, /tc:review-lessons, /tc:promote-lessons. Nothing promoted silently.

  10. Phase 9

    Shipped

    Visual documentation

    tc-visualize: eight /tc:diagram-* types, /tc:generate-infographic, and /tc:render-visuals — Mermaid sources plus deterministically rendered SVG/PNG.

  11. Phase 10

    Shipped

    Web console MVP

    tc-web: a read-only, team-facing viewer over the committed workspace — dashboard, journal, BDD viewer, run history, evidence. Renders the artifacts; never invents data.

  12. Phase 10.5

    Shipped

    Controlled agent execution

    tc-governance: the single governed-execution pipeline — intent, plan, permission policy, approval gate, bounded execution, output validation, audit. Default deny; no backdoor.

  13. Phase 11

    Shipped

    Runtime API + MCP server

    tc-mcp: an expanded Runtime API and a schema-first MCP server — alternative front-ends that drive Test Commander through the same governed pipeline. Seven permission levels enforced server-side.

  14. Phase 12

    Shipped

    Sandboxed testing environment

    tc-sandbox: on-demand, team-accessible environments launched from GitHub Actions via a provider abstraction (docker-compose-local MVP). Governed and safe-by-default targeting.

  15. Phase 13 · Current

    Shipped

    Continuous quality agent

    tc-continuous-quality: watch changes, map impact, find coverage gaps, propose tests, and open labeled PRs — gated by five autonomy modes (0 advisor → 4 governed-autonomy). The final phase.

Implementation roadmap

How teams adopt Test Commander.

Adoption does not have to be all-or-nothing. Start with visibility, layer in requirements review, exploration, BDD, and automation, then graduate to a team console and finally to continuous, governed autonomy.

  1. Stage 1

    Quality visibility

    A shared quality baseline.

    Existing requirements, tests, defects, risks, and reports become a single living dashboard. No major process change — just the picture, made visible.

  2. Stage 2

    Requirements review

    Stories become clearer and more testable.

    Test Commander reviews stories before implementation for clarity, missing acceptance criteria, edge cases, data rules, and automation suitability. Quality shifts left through better questions, not more meetings.

  3. Stage 3

    Guided exploration

    Exploratory testing becomes durable.

    A tester points Test Commander at a target environment and explores. Observations, screenshots, risks, bugs, locator candidates, and test data needs join the quality knowledge base instead of someone's notebook.

  4. Stage 4

    BDD and test design

    Test design becomes traceable to business intent.

    Approved test ideas become BDD scenarios tied to requirements and risks. The team can see which stories have coverage and which edge cases are still missing.

  5. Stage 5

    Strategic automation

    Playwright tests with rationale, not guesswork.

    Automation candidates are scored on business criticality, repeatability, determinism, UI stability, and maintenance cost. Only the candidates worth automating become Playwright tests, page objects, fixtures, and test data.

  6. Stage 6

    Team web console

    A quality command center the whole team sees.

    Live dashboard, journal, BDD viewer, run history, evidence gallery, risk register. Testers explore, developers inspect traces, product owners answer questions, managers read the summary — one shared quality story.

  7. Stage 7

    Sandboxed workspaces

    No-code testing environments on demand.

    A pull request spins up a temporary Test Commander workspace with UI, runtime, uploaded docs, target URL, and artifact storage. Open a link, start testing. No local setup. No Playwright install on a tester's laptop.

  8. Stage 8

    Continuous self-improvement

    The system gets better at helping the team test.

    Lessons accumulate from requirements, code, tests, failures, and production defects. Candidate lessons are reviewed, accepted, rejected, or flagged for human review — then promoted into project guidance. The loop is governed, not silent.

  9. Stage 9

    Governed autonomy

    Continuous monitoring with human-approved change.

    Test Commander watches code, requirements, and pipelines. It analyzes impact, proposes coverage, runs approved suites, captures evidence, opens pull requests, and explains itself. Humans still approve the changes that matter.
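The Stage 5 scoring step can be illustrated with a small sketch. The five criteria come from the description above; the weights, the 0-to-1 rating scale, the threshold, and every name in the code are assumptions made for illustration, not Test Commander's actual scoring model.

```python
from dataclasses import dataclass

# Hypothetical weights: the source names the five criteria but not how they combine.
WEIGHTS = {
    "business_criticality": 0.30,
    "repeatability": 0.20,
    "determinism": 0.20,
    "ui_stability": 0.15,
    "maintenance_cost": 0.15,
}
THRESHOLD = 0.6  # hypothetical cut-off for "worth automating"


@dataclass(frozen=True)
class Candidate:
    name: str
    business_criticality: float  # each rating is on an assumed 0..1 scale
    repeatability: float
    determinism: float
    ui_stability: float
    maintenance_cost: float  # higher means more expensive to maintain


def score(candidate: Candidate) -> float:
    """Weighted sum of the five criteria; maintenance cost counts against a candidate."""
    ratings = {
        "business_criticality": candidate.business_criticality,
        "repeatability": candidate.repeatability,
        "determinism": candidate.determinism,
        "ui_stability": candidate.ui_stability,
        "maintenance_cost": 1.0 - candidate.maintenance_cost,  # invert the cost
    }
    return sum(WEIGHTS[key] * ratings[key] for key in WEIGHTS)


def worth_automating(candidates, threshold=THRESHOLD):
    """Return candidates at or above the threshold, best first."""
    ranked = sorted(candidates, key=score, reverse=True)
    return [c for c in ranked if score(c) >= threshold]
```

With these assumed weights, a critical, stable sign-in flow rated `(0.9, 0.9, 0.9, 0.8, 0.2)` scores about 0.87 and is kept, while a volatile promo banner rated `(0.2, 0.3, 0.4, 0.2, 0.8)` scores about 0.26 and stays a manual check. Keeping the score explicit is what lets the rationale travel with the generated test.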

Autonomy modes

How much should the agent be allowed to do on its own?

Phase 13 ships this as a concrete control: the configured autonomy mode is a ceiling on which permission levels the continuous agent may auto-approve in the governed pipeline. There are five modes, and they are cumulative: each mode keeps everything the previous one auto-approves and adds one more permission level. Destructive and admin actions are never auto-approved in any mode.

  1. Mode 0

    Read-only advisor

    Reads artifacts, maps impact, finds coverage gaps, and proposes tests. Auto-approves nothing — a pure advisor. The right place to start.

  2. Mode 1

    Assisted testing

    Auto-approves safe-write work — analysis and proposed artifacts. Anything that writes code, runs tests, or opens a PR still waits for a human.

  3. Mode 2

    Approved execution

    Adds execute-tests to what auto-approves: the agent may run designated suites in safe environments. It still cannot open a pull request.

  4. Mode 3 · Recommended default

    Pull-request automation

    Adds code-write: the agent may open clearly labeled pull requests containing new BDD scenarios, generated tests, or refreshed traceability. Humans review and merge.

  5. Mode 4

    Governed autonomy

    Adds external-network targets. The broadest auto-approval — but destructive and admin actions are never auto-approved at any mode, and nothing auto-merges.

A mature workflow tends to settle at Mode 3. Test Commander runs continuously, but every change to test assets arrives as a clearly-labeled pull request a human can read, accept, or reject — and the agent never auto-merges.
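The ceiling rule described above is easy to express in code. What follows is a minimal sketch, not the shipped implementation: the permission-level names (`safe-write`, `execute-tests`, `code-write`, `external-network`, `destructive`, `admin`) are taken from the mode descriptions, while the enum, the function names, and the `"auto-approve"` / `"needs-human"` return values are hypothetical.

```python
from enum import IntEnum


class AutonomyMode(IntEnum):
    ADVISOR = 0
    ASSISTED_TESTING = 1
    APPROVED_EXECUTION = 2
    PULL_REQUEST_AUTOMATION = 3
    GOVERNED_AUTONOMY = 4


# The permission level each mode adds on top of the previous one.
ADDED_BY_MODE = {
    AutonomyMode.ASSISTED_TESTING: "safe-write",
    AutonomyMode.APPROVED_EXECUTION: "execute-tests",
    AutonomyMode.PULL_REQUEST_AUTOMATION: "code-write",
    AutonomyMode.GOVERNED_AUTONOMY: "external-network",
}

# These never auto-approve, whatever the configured mode.
NEVER_AUTO_APPROVED = frozenset({"destructive", "admin"})


def auto_approvable(mode: AutonomyMode) -> frozenset:
    """Modes are cumulative: collect everything added at or below the ceiling."""
    return frozenset(p for m, p in ADDED_BY_MODE.items() if m <= mode)


def decide(mode: AutonomyMode, permission: str) -> str:
    """Return 'auto-approve' or 'needs-human' for one requested permission level."""
    if permission in NEVER_AUTO_APPROVED:
        return "needs-human"
    return "auto-approve" if permission in auto_approvable(mode) else "needs-human"
```

For example, `decide(AutonomyMode.APPROVED_EXECUTION, "code-write")` returns `"needs-human"`, while the same request at Mode 3 returns `"auto-approve"`. Modelling the never-auto-approved set as a separate check, rather than as a mode above 4, means no configuration value can unlock it.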

Continuous quality agent mode

A living quality system, not a one-shot script.

Phase 13 ships the continuous quality agent: Test Commander watches the application and the delivery pipeline, responds to change, and produces evidence — continuously, transparently, and under the autonomy-mode approval rules above.

Continuously improving, human-governed quality automation.

When requirements change, code ships, or tests fail, the agent reacts. It analyzes impact, reviews updated stories, identifies coverage gaps, generates candidate scenarios, runs impacted suites, captures evidence, updates reports, and records lessons learned. Automatic observation, automatic analysis, automatic reporting. Human-approved implementation.

  1. Code change detected
  2. Impact analysis
  3. Story and risk review
  4. Coverage gap analysis
  5. Generate candidates
  6. Run impacted suite
  7. Capture evidence
  8. Open PR · learn

The same loop runs on pull requests, pushes, nightly schedules, release candidates, and manual dispatches. Read-only analysis happens automatically. Report updates happen automatically. Test execution happens automatically in safe environments. Generated changes arrive as pull requests. Core methodology improvements are proposed, reviewed, and promoted deliberately.
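One way to picture the loop is as an ordered plan in which each step is tagged with how it proceeds. The sketch below is illustrative only: the eight step names come from the loop above, but the `kind` tags, the trigger strings, and the disposition labels are assumptions, and the sketch deliberately ignores the autonomy-mode ceiling to keep the example small.

```python
# Step names follow the loop above; the "kind" tags are assumptions for illustration.
LOOP_STEPS = [
    ("Code change detected", "observe"),
    ("Impact analysis", "analyze"),
    ("Story and risk review", "analyze"),
    ("Coverage gap analysis", "analyze"),
    ("Generate candidates", "draft"),
    ("Run impacted suite", "execute"),
    ("Capture evidence", "report"),
    ("Open PR · learn", "change"),
]

# Observation, analysis, drafting, and reporting are treated as automatic.
AUTOMATIC_KINDS = {"observe", "analyze", "draft", "report"}


def disposition(kind: str, safe_environment: bool) -> str:
    """Decide how a step proceeds, following the rules stated in the text."""
    if kind in AUTOMATIC_KINDS:
        return "automatic"
    if kind == "execute":
        # Execution is automatic only in safe environments (assumed fallback: wait).
        return "automatic" if safe_environment else "waits for a human"
    # Anything that changes test assets arrives as a pull request.
    return "pull request for human review"


def plan_run(trigger: str, safe_environment: bool):
    """Return (trigger, step, disposition) for every step, in loop order."""
    return [
        (trigger, name, disposition(kind, safe_environment))
        for name, kind in LOOP_STEPS
    ]
```

Calling `plan_run("nightly", safe_environment=False)` yields the same eight steps as a pull-request run, except that "Run impacted suite" waits for a human. The trigger changes when the loop starts, not what it is allowed to do.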

Sample PR comment · continuous-agent output

PR #428 · test-commander comment
Test Commander Analysis

Changed areas:
  - Checkout
  - Saved addresses
  - Payment error handling

Detected risks:
  - Saved address validation behavior changed
  - Payment failure scenario lacks automated coverage

Existing tests:
  - 12 checkout tests passing
  - 2 impacted tests failed
  - 1 flaky test detected

Recommended actions:
  - Clarify expected behavior for expired saved addresses
  - Add BDD scenario for payment timeout
  - Approve generated Playwright test candidate

Artifacts:
  quality report · screenshots · trace · coverage map

The agent says what it changed, what it analyzed, what it found, and what a human should look at next. No surprises, no silent edits.
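A comment like the sample above could be rendered from a structured analysis result. The sketch below assumes a hypothetical `Analysis` record whose fields mirror the five sections of the sample; neither the class nor `render_comment` is part of Test Commander's documented API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Analysis:
    # Field names are hypothetical; they mirror the sections of the sample comment.
    changed_areas: List[str]
    detected_risks: List[str]
    existing_tests: List[str]
    recommended_actions: List[str]
    artifacts: List[str]


def render_comment(analysis: Analysis) -> str:
    """Render the analysis as plain text in the same layout as the sample comment."""
    sections = [
        ("Changed areas", analysis.changed_areas),
        ("Detected risks", analysis.detected_risks),
        ("Existing tests", analysis.existing_tests),
        ("Recommended actions", analysis.recommended_actions),
    ]
    lines = ["Test Commander Analysis", ""]
    for title, items in sections:
        lines.append(f"{title}:")
        lines.extend(f"  - {item}" for item in items)
        lines.append("")  # blank line between sections
    lines.append("Artifacts:")
    lines.append("  " + " · ".join(analysis.artifacts))
    return "\n".join(lines)
```

Keeping the comment a pure function of the analysis record means the same data can feed the PR comment, the quality report, and the traceability map without the three drifting apart.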

What’s next

Hiring for AI-augmented QA, or building it yourself?

Test Commander is part of an ongoing body of work in AI-augmented software quality, Playwright automation, and human-guided agentic testing. If you are hiring — or your team is exploring practical AI-assisted testing — I would like to talk.