Why do bugs slip through when thousands of tests pass? Rohit Mehta breaks it down at Spartans Summit 2026 with insights on AI-powered testing.

Rohit Mehta
April 22, 2026
AI-powered testing doesn't get better just by piling on more test cases. It improves when it learns from real data, picks up patterns, and adjusts based on feedback.
So the role of a test engineer is changing. Instead of mainly running tests, the focus shifts to helping the system learn better.
In this session of Spartans Summit 2026, Rohit Mehta, Practice Head of QA at Pratam Software, kicked things off with a single question that most QA teams have been asked at least once.
If we have 20,000 test cases all passing in the pipeline, why is a bug still reaching production?
His answer runs through this entire session, covering AI-powered testing, how LLMs learn from your data, the mathematics behind confidence scoring, and how to build guardrails that actually hold.
Missed the session? No worries, we've got you covered. Watch the video below.
Rohit sets the scene. A payment system goes down at 3 a.m. The team gets a flood of notifications and mobile alerts. Six hours later, they find the cause: a classic break in a payment loop, a piece of code that was missed.
What makes this painful is the question that follows from the manager, leads, and stakeholders: What was your regression test suite doing? The team had 20,000-plus test cases. All of them were green in the pipeline. The bug was still there.
You have a lot of coverage, but you don't have intelligence. That is the major pain area.
This is where most large product teams get stuck. Everyone is writing test cases, covering positive, negative, edge, and boundary scenarios. Coverage numbers look strong. But somewhere the intelligence is missing, and that is the gap this session is built to close.
The problem most teams face starts small. When a product is new, regression runs in a few minutes. As the product grows, it moves to a few hours, then to many hours. Flaky test cases start creating noise and eroding trust in the automation layer itself.
The bottleneck is no longer writing test cases. It is managing them. Once a team has thousands of test cases across sanity, production, and regression suites, the challenge shifts to maintaining them. Traditional approaches are struggling to keep pace with modern development velocity.
The old thinking was straightforward: higher test case count equals higher confidence. Some organizations still measure this as a KPI, tracking how many test cases a tester wrote in the last quarter or the last sprint, and how many bugs were found. That thinking is now working against teams.
More tests are no longer increasing confidence. They are increasing the overhead of maintaining them.
Several shifts are happening at once.
Most engineering solutions now ship with AI built into them. When AI is in the product, there is no fixed response. Every response changes with context. Testing has moved from deterministic outputs to non-deterministic output models. It is very difficult to test a solution that already has AI in place because the behavior is no longer fixed.
Sprint cycles that used to be three weeks, then two, then one, are now sometimes overnight. Continuous deployments mean some products push to production multiple times a day. Stakeholders now expect feedback in minutes, not hours or days. The prioritization of testing has changed to match this.
Multi-tenant solutions with multiple clients create a new problem. A feature available for one client should not be visible for another. Gradual rollouts, feature flags, and A/B tests mean code behavior is changing in production after the tests have already run.
Pre-deployment testing cannot catch configuration-dependent issues when you have, for example, 90 clients each running the same system with different setups.
Logs, metrics, traces, and user telemetry. Most teams have massive amounts of this data available, but lack the tools to extract useful signal from it. The question now is what data to consume when building test strategies. Production logs are available. Are they being used to write better tests?
The old response to a drop in confidence was always to add more test cases. More unit tests, more integration tests, more end-to-end scenarios. The assumption was that higher written test case count equals more confidence.
The new reality is different. Quality teams now treat testing as a data problem.
Instead of writing more test suites, the work is to analyze production signals, understand what the major business flows are, understand what happens during specific periods like Christmas or New Year when activity patterns change, and understand what type of seasonality the system supports.
From there, you decide what to improve in terms of functional and non-functional testing.
| Old Approach | New Approach |
|---|---|
| Write more test cases when confidence drops | Analyze production signals and learn from them |
| Higher test count equals higher confidence | Testing is a data problem, not a coverage problem |
| Add unit, integration, and E2E tests | Focus on what the data is telling you |
| KPI: how many test cases written per sprint | Prioritize risk based on real signals |
Most teams using LLMs today just open a chat tool, write a prompt, and start working. They never think about whether to train the model, set the right context, or do anything more than write a prompt and wait for an answer.
Training the model for testing is not about abstract AI theory. It is about using your historical data to make the LLM smarter about your specific product.
Key Point from Rohit: You already have the data. You are just not using it systematically.
Rohit identifies five major sources of data that, when ingested into your LLM, help train it on your specific system.
Run the entire suite of 20,000 test cases. Execution time is 4 to 6 hours. Results are 95 percent pass, 3 percent flaky and 2 percent failure. Most of the tests are irrelevant to the actual changes made, but there is no way to know which ones matter. A developer changes a block of code in one module, but the whole regression suite runs anyway.
Run a risk-prioritized subset of 2,000 test cases out of 20,000. Execution time drops to 20 to 30 minutes. Results show a higher failure detection rate with 90 percent less execution time.
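To make the selection step concrete, here is a minimal sketch of risk-prioritized test selection. The scoring weights, field names, and sample suite are illustrative assumptions, not details from the session; the idea is simply to rank tests by overlap with the change set and their defect history, then run only the top of the ranking.

```python
# Hypothetical sketch: rank tests by relevance to the change set plus
# historical failure rate, then run only the highest-risk subset.
# Weights (0.7 / 0.3) and data shapes are assumptions for illustration.

def risk_score(test, changed_files):
    """Score a test by overlap with modified files and its defect history."""
    # Does this test exercise any of the modules that actually changed?
    overlap = len(set(test["covers"]) & set(changed_files)) / max(len(test["covers"]), 1)
    # Tests that caught bugs before are more likely to catch them again.
    return 0.7 * overlap + 0.3 * test["historical_failure_rate"]

def select_subset(tests, changed_files, budget=2000):
    """Pick the highest-risk tests instead of running the full suite."""
    ranked = sorted(tests, key=lambda t: risk_score(t, changed_files), reverse=True)
    return ranked[:budget]

suite = [
    {"name": "test_payment_retry", "covers": ["payment-gateway.js"], "historical_failure_rate": 0.4},
    {"name": "test_profile_css", "covers": ["user-profile.css"], "historical_failure_rate": 0.01},
]
subset = select_subset(suite, changed_files=["payment-gateway.js"], budget=1)
print([t["name"] for t in subset])  # → ['test_payment_retry']
```

In a real pipeline the `covers` mapping would come from code-coverage data and the failure rates from CI history, which is exactly the data Rohit says most teams already have.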
The model receives these inputs to make that selection:
Two files are compared. payment-gateway.js modified by a new hire, with a historical defect rate of five bugs per sprint, complexity of 50, and a critical blast radius. Risk score: 0.95. user-profile.css with low complexity and a low blast radius. Risk score: 0.15.
The model has the Jira integration. It knows who the developer is, what type of story was picked up, and how much code was changed. For the high-risk file, it flags the change for human review and runs a larger portion of the test suite. For the low-risk file, it can go straight to a merge or a smoke test.
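The routing logic above can be sketched as a weighted score over a few normalized factors. The weights, caps, and threshold below are assumptions for illustration and will not reproduce the exact 0.95 / 0.15 scores from the talk, but they show the shape of the decision.

```python
# Illustrative change-level risk scoring; weights, normalization caps,
# and the 0.7 review threshold are assumptions, not values from the talk.

def change_risk(author_defect_rate, complexity, blast_radius):
    """Combine normalized factors into a 0..1 risk score."""
    # author_defect_rate: bugs per sprint, capped at 5 for normalization
    # complexity: cyclomatic complexity, capped at 50
    # blast_radius: 0.0 (isolated) .. 1.0 (critical path)
    a = min(author_defect_rate, 5) / 5
    c = min(complexity, 50) / 50
    return round(0.4 * a + 0.3 * c + 0.3 * blast_radius, 2)

def route(score, threshold=0.7):
    """High-risk changes go to human review plus a larger test run."""
    return "human review + extended suite" if score >= threshold else "smoke test + merge"

payment = change_risk(author_defect_rate=5, complexity=50, blast_radius=1.0)
profile = change_risk(author_defect_rate=0, complexity=5, blast_radius=0.1)
print(payment, route(payment))  # → 1.0 human review + extended suite
print(profile, route(profile))  # → 0.06 smoke test + merge
```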
AI-powered test generation goes beyond simple templates. By analyzing code changes, defect patterns, and existing test coverage, the model suggests relevant test scenarios that a developer might overlook.
When a developer modifies a function or existing validation logic, the model works through the code path, including all boundary conditions. It suggests suitable test cases for those changes and suitable test data to go with them.
The model has historical defect data, so it can see that a given feature has historically produced a particular type of defect. When a line of code is modified, it suggests test cases based on both the historical bugs and the new changes.
If null pointer errors have clustered around certain patterns, the model recommends null handling tests proactively. If a login function has been changed 45 times and 40 of those times produced null pointer exceptions, the model will push that scenario to the highest priority.
For API changes, the model generates test cases for formats, post body, headers, and authentication. It also covers the non-functional parts. It asks whether the JWT is being tested and whether OWASP Top 10 validations are being run against the API. It creates a full set of contract variation test cases.
Best practice: start with AI-generated test suggestions, then apply human validation. Review for correctness, add assertions, and verify business logic alignment before committing.
Everything going to an LLM travels as vectors, as embeddings. When an MCP connector syncs your Jira or when a code wrapper integrates with the model, everything is converted to vector format before it reaches the model.
If 50 bugs are labeled as null pointer exceptions in the login module, the model learns to always generate null checks for any login-related PR. It is not a guess. It is pattern matching against your own history.
The process: take the bug summary, create embeddings, push those embeddings to the embedding store, and start fine-tuning the LLM trainer. From that point on, whenever a PR change is pushed for the login module, the model generates test suggestions for null checks and login edge cases automatically.
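The embed-and-retrieve step can be sketched as follows. This is a toy stand-in: it uses word-count vectors and cosine similarity in place of real embeddings, and the bug IDs and summaries are invented; a production setup would use an actual embedding model and vector store.

```python
# Toy sketch of the embed-and-retrieve pipeline: bag-of-words vectors
# stand in for real embeddings, and the bug records are invented.
from collections import Counter
import math

def embed(text):
    """Stand-in for an embedding model: a word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Historical bug summaries pushed to the "embedding store"
store = [
    ("BUG-101", "null pointer exception in login module on empty password"),
    ("BUG-202", "layout shift in checkout page css"),
]
vectors = [(bug_id, embed(summary)) for bug_id, summary in store]

# A new PR touching login: retrieve the most similar past bug,
# then generate null-check test suggestions from it
pr = embed("refactor login module password validation")
best = max(vectors, key=lambda item: cosine(item[1], pr))
print(best[0])  # → BUG-101
```

The retrieval step is what makes the suggestions pattern matching against your own history rather than a guess: the login PR pulls back the login-module null pointer bug, not the CSS one.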
Machine learning models analyze test execution history to identify flaky tests before they erode trust in the CI. The model looks for patterns like inconsistent timing and environmental dependencies.
When a build enters the CI pipeline, the model assesses its risk profile based on the files changed, the author's history, and time-of-day patterns. High-risk builds trigger enhanced scrutiny, additional reviews, or expanded test coverage.
Before deploying to production, the model scores the release candidate based on test results, code churn, dependency updates, and historical stability of the included changes.
With six months of CI logs fed into the model, it builds a picture of what type of runner is being used, what test suites run on those runners, the history of passes and fails for each test case, and which test cases are flaky versus stable. It creates an environment fingerprint.
Once that fingerprint exists, when a test fails, the model can diagnose it immediately. CI test diagnostic output: 85 percent of failing tests are due to flaky behavior, not a genuine regression. It says skip debugging, report as flaky.
There is no change on the development side, no change on elements or request/response bodies. The failure is purely historical flakiness. The model already has that history, so it tells you directly. This saves significant time.
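A minimal sketch of the flaky-versus-regression diagnosis might look like this. The flip-rate heuristic, the 0.3 threshold, and the pass/fail encoding are assumptions for illustration; the point is that a test's own CI history is enough to classify a new failure.

```python
# Minimal sketch of flaky-test fingerprinting from CI history; the
# flip-rate heuristic and threshold are assumptions for illustration.

def flakiness(history):
    """Fraction of consecutive runs where the outcome flipped."""
    flips = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return flips / max(len(history) - 1, 1)

def diagnose(test_name, history, threshold=0.3):
    """Classify a new failure as likely-flaky or a genuine regression."""
    f = flakiness(history)
    if f >= threshold:
        return f"{test_name}: likely flaky ({f:.0%} flip rate) - report, skip debugging"
    return f"{test_name}: stable history - treat as a genuine regression"

# 'P' = pass, 'F' = fail, over the last CI runs
print(diagnose("test_login_redirect", list("PFPFPPFP")))   # likely flaky
print(diagnose("test_checkout_total", list("PPPPPPPF")))   # genuine regression
```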
AI is not a magic test-writing tool. It is an intelligent amplifier of human testing decisions.
Rohit is direct about this. AI fails in predictable ways, and teams need to know them.
AI output is a recommendation, not a verdict. It supports human decision-making; it is not a replacement for expert judgment.
Rohit covers four practices for keeping AI recommendations trustworthy.
Here are the guardrails that evolve over time as the team learns which constraints prevent real problems versus which ones create unnecessary friction.
Prevent the AI from recommending tests that delete production data, modify user accounts, or perform irreversible operations.
This is especially important when MCP servers are configured in agent mode with broad permissions. Those destructive areas can cause irreversible damage.
Whatever your organization's policies are, coding guidelines, review guidelines, PR guidelines, automation writing standards, security-critical rules, these need to be given to the model as strict rules inside whatever coding tool you are using with LLM integration. Security-critical changes in particular must be governed by these policies.
Give the model proper domain information. Create rules inside the LLM or the coding engine. Examples: account balance must be greater than zero, order total must equal the sum of line items.
It is possible that a customer has internal logic that differs from the industry standard. That internal logic needs to be explicitly stated. The model cannot infer it.
Define what the acceptable risk level is for different deployment stages. Tell the model clearly that staging tolerates a higher risk than production. Whatever pipeline or automation suite is running, the instructions should be clear.
Feature flag changes allow more experimental behavior than generally available features. These distinctions need to be given explicitly.
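The guardrails described above can be enforced as explicit checks that run before any AI-recommended action executes. The rule names, verb list, and per-environment risk budgets below are hypothetical, purely to show the shape of such a gate.

```python
# Hypothetical guardrail gate for AI-suggested test actions; rule names,
# the destructive-verb list, and risk budgets are illustrative assumptions.

DESTRUCTIVE = {"delete", "drop", "truncate", "purge"}

GUARDRAILS = {
    # Block irreversible operations outright (important for agent-mode MCP setups).
    "no_destructive_ops": lambda action: not (DESTRUCTIVE & set(action["verb"].split())),
    # Staging tolerates more risk than production.
    "stage_risk_budget": lambda action: action["risk"] <= {"staging": 0.8, "production": 0.3}[action["env"]],
}

def allowed(action):
    """An AI recommendation runs only if every guardrail passes."""
    failed = [name for name, rule in GUARDRAILS.items() if not rule(action)]
    return (len(failed) == 0, failed)

print(allowed({"verb": "delete user accounts", "risk": 0.2, "env": "production"}))
# → (False, ['no_destructive_ops'])
print(allowed({"verb": "run smoke suite", "risk": 0.5, "env": "staging"}))
# → (True, [])
```

Encoding policies this way also gives the team a single place to loosen or tighten constraints as they learn which ones prevent real problems and which create friction.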
Rohit walks through the mathematics behind how an LLM calculates confidence for its test recommendations.
Confidence Score = (Model Score × Historical Precision × Signal Strength × Context Match) / Risk Multiplier
Rohit uses a restaurant scenario to make the formula tangible. You want to go out for dinner and write to an AI that you want to eat pizza. The AI reviews everything and suggests Yugis as the best pizza place. Model score: 0.9.
You went there, ate the pizza and loved it. Your track record: 60 percent of your feedback matches what other people say. Historical precision: 0.6.
You also uploaded a photo of the pizza you wanted as evidence. Signal strength: 1.0.
But Yugis is a noisy place with a lot of people, and your context was that you needed a silent place to discuss a sales pitch. Context match: 0.5.
You are sitting down to discuss a client deal in a noisy venue. High-stake situation. Risk multiplier: 2.0.
Final confidence: (0.9 x 0.6 x 1.0 x 0.5) divided by 2 = 0.135.
The system output: do your own research. This place may have good pizza, but the context does not fit. If you state the full context up front next time, that you want pizza at Yugis and also need a quiet place to discuss a client deal, the model gives you that 0.135 confidence score and tells you to do your own research before going there for something that important.
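The restaurant walkthrough maps directly onto the formula as code. The function below is just the session's formula written out; the input values are the ones Rohit uses in the example.

```python
# The confidence formula from the talk, written as a small function.

def confidence(model_score, historical_precision, signal_strength,
               context_match, risk_multiplier):
    """(model x precision x signal x context) / risk, per the formula."""
    return (model_score * historical_precision * signal_strength
            * context_match) / risk_multiplier

# Restaurant example: strong pick (0.9), 60% track record, photo evidence
# (1.0), poor fit for a quiet sales pitch (0.5), high-stakes dinner (2.0)
score = confidence(0.9, 0.6, 1.0, 0.5, 2.0)
print(round(score, 3))  # → 0.135
```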
The same formula applies to a payment gateway change. Historical precision of 0.7 because there are already two or three known bugs in the module. Signal strength of 1.0 because those bugs are in the defect history.
Context match covers what needs to be tested. Risk multiplier of 2 because it is a critical blast area at the core of the application.
After the calculation, the confidence score comes out at 0.25. The model output: route to human review. Whatever changes are being tested here, a human needs to review them. Do not depend on the model alone.
Selenium works. Playwright works. API automation works. Everything in the existing stack continues as it is. There is no need to replace any of it.
What changes is that an intelligence layer is added on top. Signal analysis, test generation, and feedback learning sit above the existing frameworks.
The traditional execution layer stays. The AI layer provides the thinking about what to run, when to run it, and what the results mean.