A practical 2026 guide to AI agents for SDETs: where they fit in the test loop, real workflows, frameworks, common failure modes, and a 90-day adoption plan.

Prince Dewani
Author
June 22, 2026
AI agents for SDETs are autonomous systems that plan, generate, run, and debug tests on their own, with a human setting the goal and verifying the output. Capgemini's World Quality Report 2025 found 89% of organizations are piloting or deploying generative AI in quality engineering, yet only 15% have scaled it enterprise-wide.[1]
This guide covers how SDETs use agents day to day, the difference between an agent, an assistant, and a workflow, real workflows, the tools and frameworks SDETs use, where agents break, how to test AI agents, and a 90-day adoption plan.
Key Takeaways
SDETs use AI agents for four daily jobs: generating test cases from acceptance criteria, drafting API tests against an existing collection, triaging failed builds, and exploring an app to reproduce bugs. Across teams, the same process applies: the agent drafts, the SDET reviews and edits, and only the edited output enters the suite.

Each job has a clear pattern that works in practice:
I use it to generate test scenarios based on acceptance criteria and then I have it generate a test case csv file with all the steps based on the scenarios that I selected... I hate writing up test cases, so I have the GitHub Copilot do most of the work for me. I also have it create API tests for me. You can point it to the Bruno collection files, tell it which endpoint you are creating a test for, tell it what you want to test, and it will quickly generate the API tests for you.
- u/JustDesserts29, "Guys please tell how you are best using Ai as SDET", r/QualityAssurance
Agents are strong on API and scenario work and weak on UI locators. Check the UI tests an agent generates before adding them to the suite, because they are the most likely to be flaky.
This is why the SDET role is changing rather than disappearing. SDETs add value by defining what to test, choosing edge cases, anticipating where an agent will get it wrong, and understanding business impact the agent misses. For more on that broader shift, see AI in software testing.
An AI assistant answers a prompt inside a chat and stops. A workflow runs an LLM through predefined code paths you control. An agent lets the LLM direct its own steps and tool calls to finish a task. Risk rises in the same order: an assistant cannot break your pipeline, a workflow fails predictably, and an agent can take an action you did not foresee.
The difference decides how much guardrail each one needs. Anthropic's engineering team defines it directly: "Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."[2]
| Attribute | AI Assistant | Workflow | AI Agent |
|---|---|---|---|
| What it does | Answers a prompt inside a chat and stops | Runs an LLM through predefined code paths | Directs its own steps and tool calls to finish a task |
| Who directs the steps | The human, one prompt at a time | Your code | The agent itself |
| Typical SDET use | Draft a test snippet or explain a failure in chat | A fixed test suite with one model-driven assertion | Read a failed build, find a root cause, open a ticket |
| Failure behavior | Cannot break the pipeline | Fails predictably along the coded path | Can take an action you did not foresee |
| Guardrail needed | None | Low | Retry caps, token budgets, and approval gates |
A workflow runs the steps your code defines in the order your code defines them. An agent decides the order itself. That control makes an agent more capable and harder to predict.
For an SDET, this maps to a concrete choice. A test suite that always runs the same steps with one model-driven assertion is a workflow, and a workflow engine is the right home for it. A system that reads a failed build, decides which logs to pull, forms a root-cause hypothesis, and chooses whether to open a ticket is an agent, and it needs caps and approval gates the workflow never did.
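The choice above can be made concrete with a small sketch. It assumes a hypothetical `stub_model` function standing in for the LLM (no real model is called), and the tool names `pull_logs`, `form_hypothesis`, and `open_ticket` are illustrative rather than taken from any framework.

```python
from typing import Callable, Dict, List, Optional

Tool = Callable[[str], str]

# Hypothetical tools; real ones would call CI, a log store, or a tracker.
TOOLS: Dict[str, Tool] = {
    "pull_logs": lambda ctx: ctx + " | logs pulled",
    "form_hypothesis": lambda ctx: ctx + " | hypothesis formed",
    "open_ticket": lambda ctx: ctx + " | ticket drafted",
}


def run_workflow(ctx: str, steps: List[str]) -> str:
    """Workflow: your code fixes the steps and their order."""
    for name in steps:
        ctx = TOOLS[name](ctx)
    return ctx


def run_agent(ctx: str, choose_next: Callable[[str], Optional[str]],
              max_steps: int = 5) -> str:
    """Agent: the model (here a stub) picks each next tool, under a hard step cap."""
    for _ in range(max_steps):
        name = choose_next(ctx)
        if name is None:  # the model decides it is done
            break
        if name not in TOOLS:  # refuse tools that were never registered
            raise ValueError(f"unknown tool: {name}")
        ctx = TOOLS[name](ctx)
    return ctx


def stub_model(ctx: str) -> Optional[str]:
    """Stand-in for an LLM call: picks a tool based on what the context contains."""
    if "logs pulled" not in ctx:
        return "pull_logs"
    if "hypothesis formed" not in ctx:
        return "form_hypothesis"
    return None  # chooses not to open a ticket


workflow_result = run_workflow(
    "build failed", ["pull_logs", "form_hypothesis", "open_ticket"]
)
agent_result = run_agent("build failed", stub_model)
```

The workflow always ends with a drafted ticket because the step list says so. The agent stops wherever the model decides, which is why `run_agent` carries a `max_steps` cap and a tool allowlist that `run_workflow` does not need.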
Anthropic's own advice is to "find the simplest solution possible, and only increasing complexity when needed," because agentic systems "trade latency and cost for better task performance."[2]
Most testing tasks are workflows with an AI step, not full agents, so that advice applies directly: start with the simplest setup and add agent autonomy only when the task needs it.[2]
If the process is deterministic with AI at the decision points, a workflow engine is the safer default. Reach for a full agent only when the task genuinely needs the model to choose its own path. To go deeper on the agent-versus-autonomy spectrum, see multi-agent AI systems.
Four workflows show up repeatedly in SDET practice: test generation from criteria, API test drafting against a collection, build-failure triage, and agent-assisted onboarding to an unfamiliar app. Each works because it is a bounded task with a clear input and a checkable output, which is exactly the shape an agent handles well.
1. Test generation from acceptance criteria. The agent reads the criteria or a Jira ticket, drafts scenarios, and emits structured steps you can import. This works because acceptance criteria are already close to a specification, so the agent translates them rather than inventing them.
It still needs a human review because the agent produces plausible-but-wrong assertions, so the edit pass is required.
2. API test drafting. Point the agent at an existing collection, name the endpoint, state what to verify, and it drafts the request-and-assert tests. API tests are the agent's strongest area because the contract is explicit and the output is easy to check against a known schema, which is why this workflow usually needs only small edits.
3. Build-failure triage. This is the workflow with the clearest payoff. An agent watches failed builds, groups similar failures, finds a likely root cause, and opens a ticket automatically. Industry write-ups describe teams cutting triage time with exactly this pattern, because grouping failures and proposing a first-pass root cause is pattern work: the repetitive part of triage that does not need an engineer's judgment.[3]
4. Onboarding the agent to an app, then automating. Before asking an agent to write tests, walk it through the app to build a knowledge base it can reason against, then write automation together and iterate until a test is stable enough for the CI suite.
An agent that has context on your app produces better tests than an agent working from a single prompt, and building that context up front saves time on every later test.
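The grouping step in the build-failure triage workflow above can be sketched without any model at all. This is a minimal illustration, assuming failures arrive as a mapping of test name to error message; the test names and messages are made up for the example, and a real pipeline would read them from CI reports.

```python
import re
from collections import defaultdict
from typing import Dict, List


def signature(message: str) -> str:
    """Normalize a failure message so runs that differ only in volatile details match."""
    sig = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)  # memory addresses
    sig = re.sub(r"\d+", "<n>", sig)                    # ids, ports, durations
    return re.sub(r"\s+", " ", sig).strip().lower()


def group_failures(failures: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each normalized signature to the sorted list of tests that failed with it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for test_name, message in failures.items():
        groups[signature(message)].append(test_name)
    return {sig: sorted(names) for sig, names in groups.items()}


# Illustrative data only.
failures = {
    "test_checkout": "TimeoutError: waited 30000 ms for /api/cart/918",
    "test_cart_badge": "TimeoutError: waited 30000 ms for /api/cart/17",
    "test_login": "AssertionError: expected status 200, got 503",
}
groups = group_failures(failures)
```

Here the two timeout failures collapse into one group and the login failure stays separate. Replacing every number is a deliberately blunt choice: it also merges a 503 with a 500, so a team that cares about that distinction would narrow the pattern. A deterministic pre-grouping like this gives the agent fewer, cleaner inputs for the root-cause step.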
Decide who owns unit-test coverage early. Agents lower the cost of writing unit tests enough that an SDET can own coverage directly, handing the repetitive work to the agent instead of waiting on developers, which also strengthens the regression testing suite over time.
High coverage from agent-written tests is only worth it if each test is reviewed. An unreviewed coverage number does not tell you the tests are correct.
SDETs use three layers of tooling: coding assistants for day-to-day test code, agent frameworks for multi-step autonomy, and the Model Context Protocol (MCP) to connect either to real tools like a browser. The single most important tool for UI work is the Playwright MCP server, because it fixes the locator-guessing problem directly.
The locator problem is where agent-written UI tests most often fail. When an agent generates UI tests from a prompt alone, it guesses selectors it never saw, and those guesses default to absolute XPaths that break on the next layout change.
The Playwright MCP server fixes this. Because the agent reads the accessibility tree, it prefers Playwright's recommended getByRole and getByTestId locators over brittle XPaths, which is the same locator-rules discipline experienced SDETs already enforce by hand with instruction files.[4]
An agent that inspects the browser reads the real selectors, while an agent that skips that step invents them.
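A locator rule is easier to enforce when a deterministic check backs it up. The sketch below scans generated test source for absolute XPaths and positional steps such as `div[3]`. The two regular expressions are an assumed, intentionally narrow definition of "brittle", and the sample uses Playwright's Python method names (`get_by_role`, `get_by_test_id`), which correspond to `getByRole` and `getByTestId`. The sample is only scanned as text, never executed.

```python
import re
from typing import List, Tuple

# Assumed definition of "brittle": absolute XPaths and positional steps like div[3].
ABSOLUTE_XPATH = re.compile(r"""["'](?:xpath=)?/html\b[^"']*["']""")
POSITIONAL_STEP = re.compile(r"""["'](?:xpath=)?//?[^"']*\[\d+\][^"']*["']""")


def find_brittle_locators(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines that contain a brittle locator."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if ABSOLUTE_XPATH.search(line) or POSITIONAL_STEP.search(line):
            hits.append((number, line.strip()))
    return hits


# Sample agent output, treated purely as text.
generated_test = '''
page.get_by_role("button", name="Submit").click()
page.locator("/html/body/div[2]/form/button").click()
page.get_by_test_id("cart-count").is_visible()
page.locator("xpath=//div[3]/span").text_content()
'''
brittle = find_brittle_locators(generated_test)
```

Run as a pre-merge check, this flags the two XPath lines and leaves the role and test-id locators alone. It does not replace reading the live page. It only keeps guessed selectors from reaching the suite unnoticed.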
Frameworks require upfront engineering work. A framework gives you persistence, checkpointing, and human approval gates, but you write the code to wire them up. For a side-by-side of these and the no-code options, see this roundup of AI agent builders.
Writing and maintaining every test by hand is slow, and turning acceptance criteria, PRDs, and Jira tickets into reviewable tests is repetitive manual work that takes time away from a sprint. KaneAI by TestMu AI is a purpose-built test-authoring agent for that work: it plans, authors, and runs end-to-end web and mobile tests from natural-language prompts, and converts PRDs, Jira tickets, and other formats into structured test cases.
It targets the exact problem SDETs script by hand: turning acceptance criteria into reviewable tests without you writing every selector. Relevant capabilities for SDET work:
You can read the getting started with KaneAI documentation to see the authoring and export flow in detail.
AI agents break in production in four ways, and none are model-quality problems you can prompt away. They are system-design problems an SDET has to engineer against:
The reason a small failure becomes a large one is structural. Anthropic's engineering team is direct: agents "run for long periods of time, maintaining state across many tool calls," so "minor system failures can be catastrophic for agents," and "one step failing can cause agents to explore entirely different trajectories."[5]
A scripted test fails once, but an agent that takes a wrong step can keep acting on it for dozens of steps. To contain this, set a hard retry cap, a per-run token budget, and a decision log that records each step so a failed run can be debugged.
Coordination makes this worse: when agents hand off to each other, one bad step propagates, so every added agent adds more ways the system can fail.
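The retry cap, per-run token budget, and decision log described above fit in one small object. This is a minimal sketch with hypothetical names (`RunGuard`, `BudgetExceeded`); it assumes the caller supplies token counts, which in practice would come from the usage data your model provider returns.

```python
from dataclasses import dataclass, field
from typing import List


class BudgetExceeded(RuntimeError):
    """Raised when a run hits its retry cap or token budget."""


@dataclass
class RunGuard:
    max_retries: int
    token_budget: int
    retries: int = 0
    tokens_used: int = 0
    decision_log: List[str] = field(default_factory=list)

    def record_step(self, description: str, tokens: int) -> None:
        """Log the step, then stop the run if the token budget is spent."""
        self.tokens_used += tokens
        self.decision_log.append(f"step: {description} ({tokens} tokens)")
        if self.tokens_used > self.token_budget:
            self.decision_log.append("halt: token budget exceeded")
            raise BudgetExceeded("token budget exceeded")

    def record_retry(self, reason: str) -> None:
        """Log the retry, then stop the run if the retry cap is exceeded."""
        self.retries += 1
        self.decision_log.append(f"retry {self.retries}: {reason}")
        if self.retries > self.max_retries:
            self.decision_log.append("halt: retry cap exceeded")
            raise BudgetExceeded("retry cap exceeded")


# Illustrative run: the second step pushes usage past the budget.
guard = RunGuard(max_retries=2, token_budget=1000)
guard.record_step("pull build logs", tokens=400)
guard.record_retry("log fetch timed out")
try:
    guard.record_step("summarize logs", tokens=700)
    halted = False
except BudgetExceeded:
    halted = True
```

Each step is logged before the limit is checked, so the log always shows the step that tripped the cap. That is the entry you want when debugging a halted run.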
For an SDET, most of these failures are not bad LLM answers but bad specification, coordination, and verification, which are the parts you design rather than the parts the model owns. The safest default is one agent owning a complete, checkable unit of work, with a handoff only where the task genuinely splits.
Industry coverage of agent-driven testing adds three more field pitfalls: poor training data leading to unreliable predictions, blind trust in agents without human validation, and a lack of explainability that makes a failed run hard to justify to stakeholders.[3] To prevent all three, treat agent output as unverified until a human or a deterministic check confirms it.
When I paired Claude Code with a Playwright suite on an Angular Material app, the agent kept generating selectors against the runtime-only attributes Angular Material assigns. Roughly half the selectors broke between releases. I pinned the project prompt to use data-testid only and routed the agent through the Playwright MCP server to read the live DOM.
After that change, selector stability returned to above 90% and the breakage between releases mostly stopped. The agent could not produce stable locators from a prompt alone; it needed to read the page.
No. AI coding assistants like Claude Code, Codex, and GitHub Copilot are strong at authoring test code, but on their own they do not cover real testing needs, because authoring, execution, visual checks, and validating AI features are different problems at different levels of testing.
While testing across a full test cycle with an AI coding assistant, the following bottlenecks showed up:
These are strong authoring co-pilots, not a full quality layer.
SDET and engineering teams need a coordinated ecosystem of purpose-built AI agents that together form an end-to-end quality layer, which is how TestMu AI's agent suite is built for enterprise scale.
Each agent owns a level of the test cycle:
An SDET needs coverage at every level, not one tool applied to all of them. An ecosystem that hands off between authoring, execution, visual, and AI-feature validation is what an end-to-end quality layer means in practice.
A 90-day AI agent adoption plan for SDETs is a staged rollout that moves a team from one coding assistant on one bounded task to a reviewed, capped, agent-assisted workflow inside CI. The sequence matters: baseline your current effort first, automate the safest task next, then add the browser and guardrails, and only move an agent into CI once a single agent is reliable.
Record how long your team currently spends authoring tests and triaging failures, so you can prove the change later. Install one coding assistant such as Claude Code, GitHub Copilot, or Codex, and point it at the single safest task: drafting API tests against an existing collection, reviewed and edited before merge. Write a locator-rules instruction file now (prefer ID, role, and data-testid; ban absolute XPaths) so every later UI test inherits it.
Add the Playwright MCP server so the agent reads the live DOM instead of guessing selectors, and start generating reviewed UI tests. Put hard caps in place before you scale: a retry limit, a per-run token budget, and a decision log. In this phase, run your agent-written tests across a release to confirm they stay stable, because the silent-pass failure appears here.
Move a bounded agent into CI, with build-failure triage as the highest-payoff first candidate, and gate every consequential action behind a human approval or a hard cap. Keep the agent's scope narrow: it reads logs, groups failures, and suggests a root cause, but a human still approves the ticket it opens. Add a second agent only when one task clearly hands off to another, because every added agent adds coordination failure modes.
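The approval gate for this phase can be sketched as an allowlist plus a callback. The action kinds, the `AUTO_ALLOWED` policy, and the `approve` callback are all hypothetical. In CI the callback might be backed by a manual approval step, and the sketch only records which actions would run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ProposedAction:
    kind: str      # for example "suggest_root_cause" or "open_ticket"
    summary: str


# Assumed policy: read-only suggestions run freely; anything else needs sign-off.
AUTO_ALLOWED = {"group_failures", "suggest_root_cause"}


def execute_with_gate(
    actions: List[ProposedAction],
    approve: Callable[[ProposedAction], bool],
) -> Tuple[List[str], List[str]]:
    """Split proposed actions into those executed and those blocked pending approval."""
    executed: List[str] = []
    blocked: List[str] = []
    for action in actions:
        if action.kind in AUTO_ALLOWED or approve(action):
            executed.append(action.kind)
        else:
            blocked.append(action.kind)
    return executed, blocked


proposed = [
    ProposedAction("group_failures", "3 failures share one timeout signature"),
    ProposedAction("suggest_root_cause", "cart service is slow to respond"),
    ProposedAction("open_ticket", "file a bug against the cart service"),
]
# No human has approved anything yet, so the callback rejects everything.
executed, blocked = execute_with_gate(proposed, approve=lambda action: False)
```

With no approval given, the agent can still group failures and suggest a cause, but the ticket stays blocked until a person signs off. Keeping the allowlist explicit means a new action kind is gated by default.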
The skills to build alongside the tooling are prompt and context engineering, a working understanding of how LLMs fail, MCP and tool integration, and the verification judgment that turns agent output into trustworthy tests.
The coding and framework foundations do not go away; reading and fixing agent-generated Playwright and Selenium is now a daily task. For the language layer underneath, Selenium remains a core skill, and sharpening prompt engineering is the fastest single lever on agent output quality.
No. AI agents shift the SDET role toward orchestration, oversight, and verification rather than eliminating it. The job moves from writing every line of test code to deciding what to test, setting guardrails, reviewing agent output, and testing the agents themselves, work that grows in importance as more of the routine authoring is handed off.
In practice, the SDET role shifts along five lines:
The data backs the shift rather than a replacement. Capgemini's World Quality Report 2025 found 89% of organizations are piloting or deploying generative AI in quality engineering, yet only 15% have scaled it enterprise-wide, and non-adoption actually rose to 11%.[1] The report frames the shift as collaborative intelligence, where AI augments core testing work instead of replacing human judgment.
Anthropic's engineering team reaches the same conclusion from the agent side: "Human evaluation catches what automation misses," and "even in a world of automated evaluations, manual testing remains essential."[5] Human oversight is not a temporary phase. It is the part of the role that grows in importance.
Agents take over the repetitive authoring. The remaining human work, defining intent, judging edge cases, and verifying output, is the harder part of the job. The SDETs who do well direct the agent, check its output, and review every test it writes before trusting it.
AI agents do not replace SDETs; they shift the role to orchestration, guardrails, and verification. A single coding assistant authors tests well but cannot cover execution, visual checks, or AI-feature validation, so a coordinated suite of purpose-built agents, like KaneAI for authoring alongside the rest of TestMu AI's agents, is what covers the full test cycle.
Start with one bounded task and the 90-day plan: generate, review, cap, and verify, then add a second agent only when one task clearly hands off to another. If you are choosing the layer underneath, this roundup of LLM agent frameworks is a good next read.