Why do bugs slip through when thousands of tests pass? Rohit Mehta breaks it down at Spartans Summit 2026 with insights on AI-powered testing.

Rohit Mehta
April 22, 2026
AI-powered testing doesn't get better just by piling on more test cases. It improves when it learns from real data, picks up patterns, and adjusts based on feedback.
So the role of a test engineer is changing. Instead of mainly running tests, the focus shifts to helping the system learn better.
In this session of Spartans Summit 2026, Rohit Mehta, Practice Head of QA at Pratam Software, kicked things off with a single question that most QA teams have been asked at least once.
If we have 20,000 test cases all passing in the pipeline, why is a bug still reaching production?
His answer runs through this entire session, covering AI-powered testing, how LLMs learn from your data, the mathematics behind confidence scoring, and how to build guardrails that actually hold.
Missed the session? No worries, we've got you covered. Watch the video below.
Rohit sets the scene. A payment system goes down at 3 a.m. The team gets a flood of notifications and mobile alerts. Six hours later, they find the cause: a classic break in a payment loop, a piece of code that was missed.
What makes this painful is the question that follows from the manager, leads, and stakeholders: What was your regression test suite doing? The team had 20,000-plus test cases. All of them were green in the pipeline. The bug was still there.
You have a lot of coverage, but you don't have intelligence. That is the major pain area.
This is where most large product teams get stuck. Everyone is writing test cases, covering positive, negative, edge, and boundary scenarios. Coverage numbers look strong. But somewhere the intelligence is missing, and that is the gap this session is built to close.
The problem most teams face starts small. When a product is new, regression runs in a few minutes. As the product grows, it moves to a few hours, then to many hours. Flaky test cases start creating noise and eroding trust in the automation layer itself.
The bottleneck is no longer writing test cases. It is managing them. Once a team has thousands of test cases across sanity, production, and regression suites, the challenge shifts to maintaining them. Traditional approaches are struggling to keep pace with modern development velocity.
The old thinking was straightforward: higher test case count equals higher confidence. Some organizations still measure this as a KPI, tracking how many test cases a tester wrote in the last quarter or the last sprint, and how many bugs were found. That thinking is now working against teams.
More tests are no longer increasing confidence. They are increasing the overhead of maintaining them.
Several shifts are happening at once.
Most engineering solutions now ship with AI built into them. When AI is in the product, there is no fixed response. Every response changes with context. Testing has moved from deterministic outputs to non-deterministic output models. It is very difficult to test a solution that already has AI in place because the behavior is no longer fixed.
Sprint cycles that used to be three weeks, then two, then one, are now sometimes overnight. Continuous deployments mean some products push to production multiple times a day. Stakeholders now expect feedback in minutes, not hours or days. The prioritization of testing has changed to match this.
Multi-tenant solutions with multiple clients create a new problem. A feature available for one client should not be visible for another. Gradual rollouts, feature flags, and A/B tests mean code behavior is changing in production after the tests have already run.
Pre-deployment testing cannot catch configuration-dependent issues when you have, for example, 90 clients each running the same system with different setups.
Logs, metrics, traces, and user telemetry. Most teams have massive amounts of this data available, but lack the tools to extract useful signal from it. The question now is what data to consume when building test strategies. Production logs are available. Are they being used to write better tests?
The old response to a drop in confidence was always to add more test cases. More unit tests, more integration tests, more end-to-end scenarios. The assumption was that higher written test case count equals more confidence.
The new reality is different. Quality teams now treat testing as a data problem.
Instead of writing more test suites, the work is to analyze production signals, understand what the major business flows are, understand what happens during specific periods like Christmas or New Year when activity patterns change, and understand what type of seasonality the system supports.
From there, you decide what to improve in terms of functional and non-functional testing.
| Old Approach | New Approach |
|---|---|
| Write more test cases when confidence drops | Analyze production signals and learn from them |
| Higher test count equals higher confidence | Testing is a data problem, not a coverage problem |
| Add unit, integration, and E2E tests | Focus on what the data is telling you |
| KPI: how many test cases written per sprint | Prioritize risk based on real signals |
Most teams using LLMs today just open a chat tool, write a prompt, and start working. They never think about whether to train the model, set the right context, or do anything more than write a prompt and wait for an answer.
Training the model for testing is not about abstract AI theory. It is about using your historical data to make the LLM smarter about your specific product.
Key Point from Rohit: You already have the data. You are just not using it systematically.
Rohit identifies five major sources of data that, when ingested into your LLM, help train it on your specific system.
Run the entire suite of 20,000 test cases. Execution time is 4 to 6 hours. Results are 95 percent pass, 3 percent flaky and 2 percent failure. Most of the tests are irrelevant to the actual changes made, but there is no way to know which ones matter. A developer changes a block of code in one module, but the whole regression suite runs anyway.
Run a risk-prioritized subset of 2,000 test cases out of 20,000. Execution time drops to 20 to 30 minutes. Results show a higher failure detection rate with 90 percent less execution time.
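To make the selection step concrete, here is a minimal sketch of risk-prioritized test selection. The scoring weights, field names, and sample suite are illustrative assumptions, not details from the session; the idea is simply to rank tests by overlap with the change set and their defect history, then run only the top of the ranking.

```python
# Hypothetical sketch: rank tests by relevance to the change set plus
# historical failure rate, then run only the highest-risk subset.
# Weights (0.7 / 0.3) and data shapes are assumptions for illustration.

def risk_score(test, changed_files):
    """Score a test by overlap with modified files and its defect history."""
    # Does this test exercise any of the modules that actually changed?
    overlap = len(set(test["covers"]) & set(changed_files)) / max(len(test["covers"]), 1)
    # Tests that caught bugs before are more likely to catch them again.
    return 0.7 * overlap + 0.3 * test["historical_failure_rate"]

def select_subset(tests, changed_files, budget=2000):
    """Pick the highest-risk tests instead of running the full suite."""
    ranked = sorted(tests, key=lambda t: risk_score(t, changed_files), reverse=True)
    return ranked[:budget]

suite = [
    {"name": "test_payment_retry", "covers": ["payment-gateway.js"], "historical_failure_rate": 0.4},
    {"name": "test_profile_css", "covers": ["user-profile.css"], "historical_failure_rate": 0.01},
]
subset = select_subset(suite, changed_files=["payment-gateway.js"], budget=1)
print([t["name"] for t in subset])  # → ['test_payment_retry']
```

In a real pipeline the `covers` mapping would come from code-coverage data and the failure rates from CI history, which is exactly the data Rohit says most teams already have.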
The model receives these inputs to make that selection:
Two files are compared. payment-gateway.js modified by a new hire, with a historical defect rate of five bugs per sprint, complexity of 50, and a critical blast radius. Risk score: 0.95. user-profile.css with low complexity and a low blast radius. Risk score: 0.15.
The model has the Jira integration. It knows who the developer is, what type of story was picked up, and how much code was changed. For the high-risk file, it flags the change for human review and runs a larger portion of the test suite. For the low-risk file, it can go straight to a merge or a smoke test.
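The routing logic above can be sketched as a weighted score over a few normalized factors. The weights, caps, and threshold below are assumptions for illustration and will not reproduce the exact 0.95 / 0.15 scores from the talk, but they show the shape of the decision.

```python
# Illustrative change-level risk scoring; weights, normalization caps,
# and the 0.7 review threshold are assumptions, not values from the talk.

def change_risk(author_defect_rate, complexity, blast_radius):
    """Combine normalized factors into a 0..1 risk score."""
    # author_defect_rate: bugs per sprint, capped at 5 for normalization
    # complexity: cyclomatic complexity, capped at 50
    # blast_radius: 0.0 (isolated) .. 1.0 (critical path)
    a = min(author_defect_rate, 5) / 5
    c = min(complexity, 50) / 50
    return round(0.4 * a + 0.3 * c + 0.3 * blast_radius, 2)

def route(score, threshold=0.7):
    """High-risk changes go to human review plus a larger test run."""
    return "human review + extended suite" if score >= threshold else "smoke test + merge"

payment = change_risk(author_defect_rate=5, complexity=50, blast_radius=1.0)
profile = change_risk(author_defect_rate=0, complexity=5, blast_radius=0.1)
print(payment, route(payment))  # → 1.0 human review + extended suite
print(profile, route(profile))  # → 0.06 smoke test + merge
```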
AI-powered test generation goes beyond simple templates. By analyzing code changes, defect patterns, and existing test coverage, the model suggests relevant test scenarios that a developer might overlook.
When a developer modifies a function or existing validation logic, the model works through the code path, including all boundary conditions. It suggests suitable test cases for those changes and suitable test data to go with them.
The model has historical defect data, so it can see that a given feature has historically produced a particular type of defect. When a line of code is modified, it suggests test cases based on both the historical bugs and the new changes.
If null pointer errors have clustered around certain patterns, the model recommends null handling tests proactively. If a login function has been changed 45 times and 40 of those times produced null pointer exceptions, the model will push that scenario to the highest priority.
For API changes, the model generates test cases for formats, post body, headers, and authentication. It also covers the non-functional parts. It asks whether the JWT is being tested and whether OWASP Top 10 validations are being run against the API. It creates a full set of contract variation test cases.
Best practice: start with AI-generated test suggestions, then apply human validation. Review for correctness, add assertions, and verify business logic alignment before committing.
Everything going to an LLM travels as vectors, as embeddings. When an MCP connector syncs your Jira or when a code wrapper integrates with the model, everything is converted to vector format before it reaches the model.
If 50 bugs are labeled as null pointer exceptions in the login module, the model learns to always generate null checks for any login-related PR. It is not a guess. It is pattern matching against your own history.
The process: take the bug summary, create embeddings, push those embeddings to the embedding store, and start fine-tuning the LLM trainer. From that point on, whenever a PR change is pushed for the login module, the model generates test suggestions for null checks and login edge cases automatically.
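The embed-and-retrieve step can be sketched as follows. This is a toy stand-in: it uses word-count vectors and cosine similarity in place of real embeddings, and the bug IDs and summaries are invented; a production setup would use an actual embedding model and vector store.

```python
# Toy sketch of the embed-and-retrieve pipeline: bag-of-words vectors
# stand in for real embeddings, and the bug records are invented.
from collections import Counter
import math

def embed(text):
    """Stand-in for an embedding model: a word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Historical bug summaries pushed to the "embedding store"
store = [
    ("BUG-101", "null pointer exception in login module on empty password"),
    ("BUG-202", "layout shift in checkout page css"),
]
vectors = [(bug_id, embed(summary)) for bug_id, summary in store]

# A new PR touching login: retrieve the most similar past bug,
# then generate null-check test suggestions from it
pr = embed("refactor login module password validation")
best = max(vectors, key=lambda item: cosine(item[1], pr))
print(best[0])  # → BUG-101
```

The retrieval step is what makes the suggestions pattern matching against your own history rather than a guess: the login PR pulls back the login-module null pointer bug, not the CSS one.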
Machine learning models analyze test execution history to identify flaky tests before they erode trust in the CI. The model looks for patterns like inconsistent timing and environmental dependencies.
When a build enters the CI pipeline, the model assesses its risk profile based on the files changed, the author's history, and time-of-day patterns. High-risk builds trigger enhanced scrutiny, additional reviews, or expanded test coverage.
Before deploying to production, the model scores the release candidate based on test results, code churn, dependency updates, and historical stability of the included changes.
With six months of CI logs fed into the model, it builds a picture of what type of runner is being used, what test suites run on those runners, the history of passes and fails for each test case, and which test cases are flaky versus stable. It creates an environment fingerprint.
Once that fingerprint exists, when a test fails, the model can diagnose it immediately. CI test diagnostic output: 85 percent of failing tests are due to flaky behavior, not a genuine regression. It says skip debugging, report as flaky.
There is no change on the development side, no change on elements or request/response bodies. The failure is purely historical flakiness. The model already has that history, so it tells you directly. This saves significant time.
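A minimal sketch of the flaky-versus-regression diagnosis might look like this. The flip-rate heuristic, the 0.3 threshold, and the pass/fail encoding are assumptions for illustration; the point is that a test's own CI history is enough to classify a new failure.

```python
# Minimal sketch of flaky-test fingerprinting from CI history; the
# flip-rate heuristic and threshold are assumptions for illustration.

def flakiness(history):
    """Fraction of consecutive runs where the outcome flipped."""
    flips = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return flips / max(len(history) - 1, 1)

def diagnose(test_name, history, threshold=0.3):
    """Classify a new failure as likely-flaky or a genuine regression."""
    f = flakiness(history)
    if f >= threshold:
        return f"{test_name}: likely flaky ({f:.0%} flip rate) - report, skip debugging"
    return f"{test_name}: stable history - treat as a genuine regression"

# 'P' = pass, 'F' = fail, over the last CI runs
print(diagnose("test_login_redirect", list("PFPFPPFP")))   # likely flaky
print(diagnose("test_checkout_total", list("PPPPPPPF")))   # genuine regression
```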
AI is not a magic test-writing tool. It is an intelligent amplifier of human testing decisions.
Rohit is direct about this. AI fails in predictable ways, and teams need to know them.
AI output is a recommendation, not a verdict. It supports human decision-making; it is not a replacement for expert judgment.
Rohit covers four practices for keeping AI recommendations trustworthy.
Here are the guardrails that evolve over time as the team learns which constraints prevent real problems versus which ones create unnecessary friction.
Prevent the AI from recommending tests that delete production data, modify user accounts, or perform irreversible operations.
This is especially important when MCP servers are configured in agent mode with broad permissions. Those destructive areas can cause irreversible damage.
Whatever your organization's policies are, coding guidelines, review guidelines, PR guidelines, automation writing standards, security-critical rules, these need to be given to the model as strict rules inside whatever coding tool you are using with LLM integration. Security-critical changes in particular must be governed by these policies.
Give the model proper domain information. Create rules inside the LLM or the coding engine. Examples: account balance must be greater than zero, order total must equal the sum of line items.
It is possible that a customer has internal logic that differs from the industry standard. That internal logic needs to be explicitly stated. The model cannot infer it.
Define what the acceptable risk level is for different deployment stages. Tell the model clearly that staging tolerates a higher risk than production. Whatever pipeline or automation suite is running, the instructions should be clear.
Feature flag changes allow more experimental behavior than generally available features. These distinctions need to be given explicitly.
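The guardrails described above can be enforced as explicit checks that run before any AI-recommended action executes. The rule names, verb list, and per-environment risk budgets below are hypothetical, purely to show the shape of such a gate.

```python
# Hypothetical guardrail gate for AI-suggested test actions; rule names,
# the destructive-verb list, and risk budgets are illustrative assumptions.

DESTRUCTIVE = {"delete", "drop", "truncate", "purge"}

GUARDRAILS = {
    # Block irreversible operations outright (important for agent-mode MCP setups).
    "no_destructive_ops": lambda action: not (DESTRUCTIVE & set(action["verb"].split())),
    # Staging tolerates more risk than production.
    "stage_risk_budget": lambda action: action["risk"] <= {"staging": 0.8, "production": 0.3}[action["env"]],
}

def allowed(action):
    """An AI recommendation runs only if every guardrail passes."""
    failed = [name for name, rule in GUARDRAILS.items() if not rule(action)]
    return (len(failed) == 0, failed)

print(allowed({"verb": "delete user accounts", "risk": 0.2, "env": "production"}))
# → (False, ['no_destructive_ops'])
print(allowed({"verb": "run smoke suite", "risk": 0.5, "env": "staging"}))
# → (True, [])
```

Encoding policies this way also gives the team a single place to loosen or tighten constraints as they learn which ones prevent real problems and which create friction.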
Rohit walks through the mathematics behind how an LLM calculates confidence for its test recommendations.
Confidence Score = (Model Score × Historical Precision × Signal Strength × Context Match) / Risk Multiplier
Rohit uses a restaurant scenario to make the formula tangible. You want to go out for dinner and write to an AI that you want to eat pizza. The AI reviews everything and suggests Yugis as the best pizza place. Model score: 0.9.
You went there, ate the pizza and loved it. Your track record: 60 percent of your feedback matches what other people say. Historical precision: 0.6.
You also uploaded a photo of the pizza you wanted as evidence. Signal strength: 1.0.
But Yugis is a noisy place with a lot of people, and your context was that you needed a silent place to discuss a sales pitch. Context match: 0.5.
You are sitting down to discuss a client deal in a noisy venue. High-stake situation. Risk multiplier: 2.0.
Final confidence: (0.9 x 0.6 x 1.0 x 0.5) divided by 2 = 0.135.
The system output: do your own research. This place may have good pizza, but the context does not fit. If you state the full context up front next time, that you want pizza at Yugis and also need a quiet place to discuss a client deal, the model gives you that 0.135 confidence score and tells you to do your own research before going there for something that important.
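The restaurant walkthrough maps directly onto the formula as code. The function below is just the session's formula written out; the input values are the ones Rohit uses in the example.

```python
# The confidence formula from the talk, written as a small function.

def confidence(model_score, historical_precision, signal_strength,
               context_match, risk_multiplier):
    """(model x precision x signal x context) / risk, per the formula."""
    return (model_score * historical_precision * signal_strength
            * context_match) / risk_multiplier

# Restaurant example: strong pick (0.9), 60% track record, photo evidence
# (1.0), poor fit for a quiet sales pitch (0.5), high-stakes dinner (2.0)
score = confidence(0.9, 0.6, 1.0, 0.5, 2.0)
print(round(score, 3))  # → 0.135
```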
The same formula applies to a payment gateway change. Historical precision of 0.7 because there are already two or three known bugs in the module. Signal strength of 1.0 because those bugs are in the defect history.
Context match covers what needs to be tested. Risk multiplier of 2 because it is a critical blast area at the core of the application.
After the calculation, the confidence score comes out at 0.25. The model output: route to human review. Whatever changes are being tested here, a human needs to review them. Do not depend on the model alone.
Selenium works. Playwright works. API automation works. Everything in the existing stack continues as it is. There is no need to replace any of it.
What changes is that an intelligence layer is added on top. Signal analysis, test generation, and feedback learning sit above the existing frameworks.
The traditional execution layer stays. The AI layer provides the thinking about what to run, when to run it, and what the results mean.