Software testing for the agentic era

By Nick Baynham · 3 min read

I want to write this down before the phrase "agentic testing" gets reduced to a marketing slogan. The agentic era is not arriving as one big release. It is a slow change in who is at the keyboard - and the most useful framing for engineers is to ask what stays the same, what changes, and what to build first.

What stays the same

The fundamentals do not move.

  • The reason tests exist. A test is a piece of evidence about quality, written for a human reader. The reader is usually the next engineer who has to change the code. Agents do not change the reader.
  • The cost of flakiness. A flaky test does not become less expensive when an agent wrote it. If anything it gets worse, because the agent can generate flakes faster than humans can triage them.
  • The boundary of responsibility. A defect that slips past the tests and ships to production is on the team, not on the agent. Treat the agent as a contributor on a probationary period: useful, supervised, never the last reviewer.

What changes

Three shifts are real.

  1. The unit cost of writing a test drops sharply. Drafting a Playwright case from an exploratory walk used to take a few minutes; with a good agent it takes seconds. The constraint moves to which tests are worth keeping.
  2. Maintenance becomes the headline cost. When generation is cheap, fleets of tests pile up. Selectors drift, environments change, expectations rot. The biggest leverage point is making the maintenance loop boring instead of dramatic.
  3. The reviewer's job changes shape. Instead of writing the test, the reviewer reads a proposal and decides whether it captures real intent. This is the right place for human judgment - it is exactly where automation tends to over-generalize.

What to build first

If I had a quarter to invest in agentic testing inside one engineering org, I would not start with generation. I would start with three smaller things in this order:

  1. A clean abstraction the agent writes against. A domain-specific representation (or a stable API contract) that the agent can target without having to know which runtime is in use. This bounds the blast radius of bad output.
  2. A maintenance loop the agent participates in. Selector drift, broken fixtures, flaky timing - these are the tedious cases an agent can actually help with, and where the value is easiest to defend.
  3. A failure summarizer. Most automated suites produce more failure data than a team can read. Letting an agent write a one-paragraph summary of an overnight run is a near-instant return on investment.
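To make the first item concrete, here is a minimal sketch of what a runtime-agnostic representation could look like. Everything in it is an assumption for illustration: the Step and TestProposal shapes, the KNOWN_TARGETS list, and the validateProposal function are hypothetical names, not part of Playwright or any other tool. The point is only that the agent emits data against a small vocabulary, and bad output is rejected before it reaches a browser.

```typescript
// Hypothetical, runtime-agnostic test description. None of these names come
// from Playwright or any other library; they are illustrative assumptions.
type Step =
  | { kind: "visit"; path: string }
  | { kind: "fill"; field: string; value: string }
  | { kind: "click"; control: string }
  | { kind: "expectText"; region: string; text: string };

interface TestProposal {
  name: string;
  intent: string; // one sentence a human reviewer can check the steps against
  steps: Step[];
}

// Logical target names the team has agreed to support (assumed examples).
const KNOWN_TARGETS = new Set(["email", "password", "signIn", "banner"]);

// Returns a list of problems; an empty list means the proposal may go to review.
function validateProposal(p: TestProposal): string[] {
  const problems: string[] = [];
  if (p.intent.trim() === "") problems.push("missing intent");
  if (!p.steps.some((s) => s.kind === "expectText")) {
    problems.push("no assertion step");
  }
  for (const s of p.steps) {
    // Pick out the logical target for each step kind; "visit" has none.
    const target =
      s.kind === "fill" ? s.field :
      s.kind === "click" ? s.control :
      s.kind === "expectText" ? s.region : undefined;
    if (target !== undefined && !KNOWN_TARGETS.has(target)) {
      problems.push(`unknown target: ${target}`);
    }
  }
  return problems;
}

// A proposal that stays inside the agreed vocabulary.
const loginProposal: TestProposal = {
  name: "login shows welcome banner",
  intent: "A registered user can sign in and sees a welcome message.",
  steps: [
    { kind: "visit", path: "/login" },
    { kind: "fill", field: "email", value: "user@example.com" },
    { kind: "fill", field: "password", value: "not-a-real-password" },
    { kind: "click", control: "signIn" },
    { kind: "expectText", region: "banner", text: "Welcome back" },
  ],
};
```

A proposal that reaches for a raw selector such as "#btn-42" fails validation with "unknown target: #btn-42". A separate adapter, not shown here, would translate accepted steps into whichever runtime the team actually uses.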

Generation is the headline feature, but it is the wrong place to start. The agent should be earning trust on the boring work first.
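One way to prepare input for the failure summarizer described above is to group an overnight run's failures by a normalized message before any agent sees them. This is my own sketch, not something the list prescribes: the Failure shape, the signature function, and the normalization rules (quoted strings and numbers collapsed to placeholders) are all assumptions.

```typescript
// Hypothetical shape of one failed test from an overnight run.
interface Failure {
  test: string;
  message: string;
}

interface FailureGroup {
  signature: string;
  count: number;
  tests: string[];
}

// Collapse volatile details so similar failures share one key.
// Quoted values are replaced first, then any remaining digit runs.
function signature(message: string): string {
  return message
    .replace(/"[^"]*"/g, '"<str>"')
    .replace(/\d+/g, "<n>")
    .trim();
}

// Group failures by signature, largest group first.
function groupFailures(failures: Failure[]): FailureGroup[] {
  const groups = new Map<string, string[]>();
  for (const f of failures) {
    const key = signature(f.message);
    const tests = groups.get(key) ?? [];
    tests.push(f.test);
    groups.set(key, tests);
  }
  return Array.from(groups.entries())
    .map(([sig, tests]) => ({ signature: sig, count: tests.length, tests }))
    .sort((a, b) => b.count - a.count);
}
```

The agent then summarizes a handful of groups instead of every raw failure, and the reviewer can check its paragraph against the counts.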

Why "agentic", not just "AI"

I keep the word agentic because it captures a specific posture: the agent has a goal, takes actions, and reports back. That posture is what changes the workflow. The interesting question is not "can the model write a test?" - it is "what is the loop, who is in it, and what does the agent have authority to do without asking?"
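The authority question can be written down as data rather than left implicit. The sketch below is one hypothetical way to do that; the action names, the three authority levels, and the specific assignments are assumptions chosen for illustration, and a real team would pick its own.

```typescript
// Hypothetical actions an agent might attempt in a testing workflow.
type Action =
  | "proposeTest"
  | "updateSelector"
  | "quarantineFlaky"
  | "deleteTest"
  | "mergeToMain";

// What the agent may do without asking.
type Authority = "autonomous" | "needsReview" | "forbidden";

// Example policy (assumed values): proposals are free, changes to existing
// tests need a human, destructive or final actions are never the agent's call.
const POLICY: Record<Action, Authority> = {
  proposeTest: "autonomous",
  updateSelector: "needsReview",
  quarantineFlaky: "needsReview",
  deleteTest: "forbidden",
  mergeToMain: "forbidden",
};

interface Decision {
  action: Action;
  allowed: boolean;
  requiresHuman: boolean;
}

// Resolve an attempted action against the policy.
function decide(action: Action): Decision {
  const authority = POLICY[action];
  return {
    action,
    allowed: authority !== "forbidden",
    requiresHuman: authority === "needsReview",
  };
}
```

Because the policy is a typed record, adding a new action without assigning it an authority level is a compile error, which keeps the "without asking" list explicit.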

That is the question this site is going to keep returning to.

  • Why AI agents still need human testers

    The cases where current agents miss, why they miss, and the kind of judgment that does not transfer.

  • Agentic engineering antipatterns

    Five failure shapes I keep seeing when teams adopt AI agents in testing and tooling work.

  • Generating a UI test suite for a live app: eight failures that were really findings

    Asked to generate a UI test suite for a live Angular app, the honest answer was that the auto-generated scaffolds were a non-runnable skeleton. The real suite had to be hand-authored and debugged into green against the running system. It went from 30 to 41 passing tests across eight rounds of failure, and every failure was the application teaching the test something true about itself: a blank edit form, unlabelled number inputs, an anchored URL validator, a list that does not render reliably, and a backend that mishandles concurrent writes.