Next-Gen App & Browser Testing Cloud
Trusted by 2 Mn+ QAs & Devs to accelerate their release cycles

Spartans Summit 2026 by TestMu AI covered AI agent evaluation, MCP security, hallucination testing, smart regression, and agentic quality systems.

TestMu AI
March 19, 2026
We recently hosted Spartans Summit 2026, a six-hour virtual event featuring seven sessions, a live workshop, and a panel discussion.
Speakers from Thoughtworks, EPAM, Microsoft, Paramount, and other organizations joined to discuss how teams are evaluating AI agents, securing MCP servers, and moving AI pilots into actual engineering pipelines.
Here are the major highlights of the Spartans Summit 2026 by TestMu AI (Formerly LambdaTest).
Ioannis Papadakis, Head of QA at Snappi, opened the session with a walk through QA history, going back to Grace Hopper's actual bug in 1945 and moving forward through waterfall, Agile, DevOps, and into the current wave of agentic systems.
The historical framing was his way of making a point: QA has survived every major shift before. This one is not different in kind, just in pace. His core argument was that the QA role is shifting from gatekeeper to enabler.
Ioannis drew a distinction that came up repeatedly throughout the day:
The reason this matters for QA: the two require completely different testing approaches. A static model can be tested with fixed assertions. Agentic AI testing demands an entirely new evaluation mindset.
GenAI completes tasks. Agentic AI sets goals, plans long-term, and adapts in real time. That difference changes everything for QA automation. pic.twitter.com/5QtrUqdENO
— TestMu AI (@testmuai) March 11, 2026
He expects QA engineers to develop AI prompt engineering skills because querying and testing agentic systems requires knowing how to construct input that actually exercises the system. Techniques worth learning include zero-shot, few-shot, and chain-of-thought prompting techniques.
He also built his own agentic testing system on WebdriverIO as a personal experiment for mobile native app testing. He shared it to make the point that you can start building without waiting for someone to hand you a framework.
Ioannis cleared up something he said causes ongoing confusion: MCP (Model Context Protocol) is about connecting models to tools and enabling action execution. It is not the same as instructions or prompt templates. People mix them up, and it matters because the security implications are different.
His practical advice: start in high-risk, high-impact areas. Measure what you get. Then expand.
Note: Automate QA using natural language with KaneAI. Try TestMu AI Now!
Asmita Parab, QE Manager at EPAM Systems, started with a stat worth sitting with: the AI agent market is projected to hit $83 billion by 2033, but only about 5% of AI pilots are currently extracting real business value.
Traditional testing is like grading a math exam. There is a right answer. You check against it. Agent evaluation is like grading an essay. The same prompt can produce different outputs that are all technically acceptable. "Expected equals actual" breaks down entirely.
Asmita Parab explores how evaluating AI agents differs from traditional software testing, where outcomes are no longer purely deterministic.
— TestMu AI (@testmuai) March 11, 2026
She highlights the need for new approaches that assess reasoning, behavior, and reliability across complex agent workflows.… pic.twitter.com/q7HpvE78wM
Asmita structured the evaluation problem across several dimensions:
She introduced EDD as a direct parallel to BDD: define your evals before or alongside building the agent, not after. Evals are test cases. They should be binary where possible, and they become regression testing benchmarks once the system goes to production. Every time the model, prompt, or any component changes, the evals run.
For RAG-based components, she recommended tracking context recall, context precision, factual consistency, and hallucination as the key metrics. For routing and tool-calling components, test them in isolation before testing end-to-end. Her closing point: most teams spend energy building agents and almost none evaluating them. Understanding the benefits of CI/CD makes it clear why evals belong in the pipeline.
Gaurav Khurana, Senior Test Consultant at Microsoft, took a hands-on approach. He started by running the same question through Copilot and ChatGPT live, showing that you get different answers every time.
He walked through precision, recall, and BLEU score and explained the gap. BLEU measures word overlap. If two sentences say the same thing using different words, BLEU scores them as dissimilar. That breaks down when your model gives a valid answer in different phrasing every run. Understanding NLP testing fundamentals helps contextualize why these traditional metrics fall short for modern AI systems.
His framework for AI evaluation uses four parameters: query, response, context, and ground truth. Not all four are needed for every metric:
Gaurav matched what Asmita covered but added one constraint: the judge model should be the same capability level or higher than the model being tested. A weaker model judging a stronger one gets you unreliable scores.
He uploaded a JSONL dataset with intentionally wrong answers, ran it through an evaluation pipeline with GPT-4.1 as the judge, and showed how per-test-case failure reasons come back. He also showed the same done via code, which matters for teams that want evaluation in a CI pipeline rather than a UI.
Worth noting: testing AI agent LLM applications costs money in a way traditional testing does not. Every evaluation invocation costs tokens. Plan for it. Tools mentioned include LangSmith, PromptFlow, TruLens, and Azure AI Foundry.
Great AI isn't just about generating answers it's about generating the right answers.
— TestMu AI (@testmuai) March 11, 2026
Gaurav Khurana breaking down Relevance in LLM evaluation at #SpartansSummit pic.twitter.com/ulXTkypVkM
Srinivasan Sekar and Sai Krishna, Directors of Engineering at TestMu AI (Formerly TestMu AI), hosted a live workshop across three parts.
If you have three AI models and five tools, you normally need 15 custom integrations to connect them all. MCP makes that a standard protocol, so any model can talk to any tool without a custom bridge for each pairing. The USB-C analogy Srinivasan used is accurate: one standard, interoperable across devices.
The architecture breaks down into three components:
Servers can be local (running as executables) or remote (accessed over HTTP). The spec is open, evolving roughly quarterly, and maintained by an Anthropic-led council.
Sai built a performance metrics MCP server in TypeScript on top of Playwright MCP. The motivation: the default Playwright MCP does not expose client-side performance metrics like First Contentful Paint, Time to First Byte, or DOM content loaded.
Key implementation details include registering tools with descriptive names (the agent uses the description to decide which tool to call), returning results in the correct contract format, and adding a resource document so the agent can interpret raw numbers against reference values.
Two demos were shown:
Playwright MCP. GitHub MCP. Appium MCP. The ecosystem is building fast, QA engineers who understand MCP architecture will have a serious edge. pic.twitter.com/keJbnx4o2X
— TestMu AI (@testmuai) March 11, 2026
This section was the most direct. The core problem: MCP servers are trusted by the agent. If a server's tool description contains malicious instructions, the agent follows them without questioning.
Attack types covered include:
In the live demo, Sai added one hidden instruction to a tool description telling the agent to return system information (username, location, IP) alongside performance metrics. The agent did it. The user saw only the performance metrics in the output.
Defense tools mentioned include Secure Hulk (open source, built by Srinivasan), MCP Scan (Invariant Labs), and ETDI (Enhanced Tool Definition Interface). Proper security testing of MCP servers should be a prerequisite before integrating them into your workflows.
The panel featured Harinee Muralinath (Director, Thoughtworks), Jaydeep Chakrabarty (Director of AI in Tech, Piramal Finance), Pricilla Bilavendran (Team Leader, Billennum), Rahul Parwal (Specialist, ifm engineering), and Siddhant Wadhwani (Engineering Manager, Newfold Digital).
The panel question: 2025 was the year of AI experiments. How do you actually operationalize any of this?
Several panelists landed on the same framing: experiments do not scale, systems do. Harinee talked about treating AI workflows as code, with prompts and agent steps in version-controlled repositories with review pipelines. Siddhant described a shift from "which LLM should we use" to "where in our workflow is there friction that AI could reduce."
"The teams winning with AI stopped asking 'which LLM?' and started asking 'where are our workflow frictions?' That shift changes everything."#SpartansSummit pic.twitter.com/PdHwKFELJr
— TestMu AI (@testmuai) March 11, 2026
Rahul introduced a framing worth keeping. "Slop" is what AI produces by default when there is no engineering around it. "Kino" (the German word for cinema, edited and structured) is what you get when you apply guardrails, iteration, and judgment. Moving from one to the other is the actual engineering work.
The panel agreed the pyramid's values have not changed (fast feedback, stable tests closer to the core, reducing noise) even if its structure has shifted:
The consensus list from the panel includes:
Harinee flagged the limit: AI handles obvious test scenarios well, but domain complexity and regulatory requirements layer in things models currently miss.
Jaydeep's framing: stop asking "did AI generate more?" Start asking, "Did engineering get safer?" His team tags AI-generated code in their repository and tracks what percentage of AI MRs developers accept without modification. Currently: about 14% pass through untouched, 70-80% need iteration, and about 10% are beyond what the model can handle.
Other metrics raised by the panel:
Siddhant listed four challenges:
His team built an internal AI portal called Atlas to consolidate use cases rather than leaving them scattered across disconnected tools.
Rohit Mehta, Practice Head of QA at Pratham Software, opened with a scenario most people in the room had lived through: a payment system goes down at 3 am. Six hours later, you find a bug in a loop. Your regression suite has 20,000 green tests. Management asks what the tests were doing.
The problem he named: coverage without intelligence. You can have a large test suite and still miss what matters. The bottleneck in modern QA is not writing tests. It is managing them, trusting them, and making them tell you something useful.
Rohit's central argument: testing is now a data problem, not a coverage problem. The question is not how many tests you have. It is whether you are using your existing signals to make the testing you do more targeted.
Five training signals he recommended ingesting:
Two files in the same PR can get very different scores based on:
A payment gateway file with high risk history gets a score of 0.95 and triggers a human review flag. A CSS file with low history and low blast radius gets 0.15 and can go through with minimal scrutiny.
AI-powered test generation goes beyond simple templates. It identifies test scenarios and detects patterns that developers would otherwise overlook. pic.twitter.com/zvaP6Z5cLR
— TestMu AI (@testmuai) March 11, 2026
The model builds environment fingerprints: which runner, which tests, which pass/fail history. After enough data, it can classify a failing test as "85% likely a flaky test, no product change correlates with this failure," which saves significant investigation time.
Rohit covered this directly:
His conclusion: AI output is a recommendation, not a verdict. Guardrails (policy filters, domain invariance rules, risk thresholds) need to be defined explicitly.
Partha Sarathi Samal, QE Manager at Paramount, addressed a specific problem: before a release, a VP of engineering asks, "Is this safe to ship?" Most test systems cannot give a useful answer. They give you a pass/fail count. That is not the same thing.
His framing: current AI tools are good at isolated tasks. They write a test, suggest a fix, summarize a failure. None of those answers the release question: what changed, what risk did that create, what validation covered it, and what evidence supports the go/no-go decision?
One change (a PR, a story, an API spec update) enters the system and triggers a closed loop:
Every run produces an evidence pack: logs, screenshots, traces, network summaries, checksums, and execution metadata. Partha's principle: a test result without evidence is an opinion. If you cannot replay what happened and explain why, you cannot defend a release decision.
A test result without evidence is just an opinion. Every validation should produce logs, traces, screenshots, and checksums, or it's not proof, it's hope."#SpartansSummit pic.twitter.com/Br6Py2rJTU
The output shifts from "test 481 failed" to explanations like:
That distinction tells the engineer what to do next without manual investigation.
Coverage percentage is not a useful number. "80% coverage" does not tell you which services, workflows, or failure modes are covered. A graph does. Connect services, tests, requirements, incidents, and production paths, then ask:
When a production incident hits a path with no strong validation edge, the graph records it, and the next cycle targets it.
Self-healing sounds useful, but it can hide real change. A selector updates automatically, the test goes green, but the product behavior may have changed in a risky way, and nobody noticed.
Bounded repair is stricter: limited, auditable, reversible updates with guardrails. Log every repair. Require repeatability. Escalate risky cases. Never silently rewrite reality.
Do not replace your pipeline. Progress in three stages:
Instead of pass rate (which can look healthy while untested risk accumulates), Partha recommended tracking:
Partha's one-line summary: a modern quality system should explain release risk with evidence. Teams focused on AI agent testing should prioritize these metrics over simple pass rates.
Seven sessions, different speakers, different roles. A few things came up repeatedly.
Did you find this page helpful?
More Related Hubs
TestMu AI forEnterprise
Get access to solutions built on Enterprise
grade security, privacy, & compliance