Next-Gen App & Browser Testing Cloud

Trusted by 2 Mn+ QAs & Devs to accelerate their release cycles

On This Page

Session 1: Symbiosys Effect
Session 2: Evaluating AI Agents
Session 3: Hallucination Hunters
Session 4: Pilot to Pipeline (Live Workshop)
Session 5: Panel Discussion
Session 6: Start Training Models
Session 7: Agentic AI for Testing

Home
/
Blog
/
Spartans Summit 2026: A Quick Recap

Spartans Summit 2026: A Quick Recap

Spartans Summit 2026 by TestMu AI covered AI agent evaluation, MCP security, hallucination testing, smart regression, and agentic quality systems.

TestMu AI

March 19, 2026

We recently hosted Spartans Summit 2026, a six-hour virtual event featuring seven sessions, a live workshop, and a panel discussion.

Speakers from Thoughtworks, EPAM, Microsoft, Paramount, and other organizations joined to discuss how teams are evaluating AI agents, securing MCP servers, and moving AI pilots into actual engineering pipelines.

Here are the major highlights of the Spartans Summit 2026 by TestMu AI (Formerly LambdaTest).

Session 1: The Symbiosys Effect: AI & I

Ioannis Papadakis, Head of QA at Snappi, opened the session with a walk through QA history, going back to Grace Hopper's actual bug in 1945 and moving forward through waterfall, Agile, DevOps, and into the current wave of agentic systems.

The historical framing was his way of making a point: QA has survived every major shift before. This one is not different in kind, just in pace. His core argument was that the QA role is shifting from gatekeeper to enabler.

GenAI vs Agentic AI

Ioannis drew a distinction that came up repeatedly throughout the day:

GenAI is task-oriented and static. You give it a prompt, and it gives you output.
Agentic AI is goal-oriented, plans across steps, and adapts based on what it learns.

The reason this matters for QA: the two require completely different testing approaches. A static model can be tested with fixed assertions. Agentic AI testing demands an entirely new evaluation mindset.

GenAI completes tasks. Agentic AI sets goals, plans long-term, and adapts in real time. That difference changes everything for QA automation. pic.twitter.com/5QtrUqdENO
— TestMu AI (@testmuai) March 11, 2026

Prompt Engineering as a QA Skill

He expects QA engineers to develop AI prompt engineering skills because querying and testing agentic systems requires knowing how to construct input that actually exercises the system. Techniques worth learning include zero-shot, few-shot, and chain-of-thought prompting techniques.

He also built his own agentic testing system on WebdriverIO as a personal experiment for mobile native app testing. He shared it to make the point that you can start building without waiting for someone to hand you a framework.

On MCP Confusion

Ioannis cleared up something he said causes ongoing confusion: MCP (Model Context Protocol) is about connecting models to tools and enabling action execution. It is not the same as instructions or prompt templates. People mix them up, and it matters because the security implications are different.

His practical advice: start in high-risk, high-impact areas. Measure what you get. Then expand.

Note: Automate QA using natural language with KaneAI. Try TestMu AI Now!

Session 2: Evaluating AI Agents - Testing and Tooling for Reliable Outcomes

Asmita Parab, QE Manager at EPAM Systems, started with a stat worth sitting with: the AI agent market is projected to hit $83 billion by 2033, but only about 5% of AI pilots are currently extracting real business value.

The Core Analogy

Traditional testing is like grading a math exam. There is a right answer. You check against it. Agent evaluation is like grading an essay. The same prompt can produce different outputs that are all technically acceptable. "Expected equals actual" breaks down entirely.

Asmita Parab explores how evaluating AI agents differs from traditional software testing, where outcomes are no longer purely deterministic.
She highlights the need for new approaches that assess reasoning, behavior, and reliability across complex agent workflows.… pic.twitter.com/q7HpvE78wM
— TestMu AI (@testmuai) March 11, 2026

What to Evaluate

Asmita structured the evaluation problem across several dimensions:

Task Completion: Did the agent do what you asked?
Path Validation: Two runs of the same agent can take completely different routes to the same answer, and both paths may need to be valid.
Non-Functional: Token usage, latency, loop detection (is the agent stuck?)
Safety: Adversarial resistance, PII leakage, bias, graceful error handling.

Evaluation-Driven Development

She introduced EDD as a direct parallel to BDD: define your evals before or alongside building the agent, not after. Evals are test cases. They should be binary where possible, and they become regression testing benchmarks once the system goes to production. Every time the model, prompt, or any component changes, the evals run.

Three Evaluation Techniques

Code-Based Evals: Fast and cheap, good for objective/quantifiable outputs, brittle for subjective tasks.
LLM-as-Judge: A second model evaluates the first. Use binary/categorical scoring (relevant/not relevant) rather than numeric scales, which introduce bias.
Human-in-the-Loop: Most thorough. Used for full session trajectory evaluation and to calibrate LLM judges.

For RAG-based components, she recommended tracking context recall, context precision, factual consistency, and hallucination as the key metrics. For routing and tool-calling components, test them in isolation before testing end-to-end. Her closing point: most teams spend energy building agents and almost none evaluating them. Understanding the benefits of CI/CD makes it clear why evals belong in the pipeline.

Session 3: Hallucination Hunters: A QA Engineer's Guide to AI Evaluation

Gaurav Khurana, Senior Test Consultant at Microsoft, took a hands-on approach. He started by running the same question through Copilot and ChatGPT live, showing that you get different answers every time.

Why Traditional NLP Metrics Fall Short

He walked through precision, recall, and BLEU score and explained the gap. BLEU measures word overlap. If two sentences say the same thing using different words, BLEU scores them as dissimilar. That breaks down when your model gives a valid answer in different phrasing every run. Understanding NLP testing fundamentals helps contextualize why these traditional metrics fall short for modern AI systems.

The Four-Parameter Test Case Format

His framework for AI evaluation uses four parameters: query, response, context, and ground truth. Not all four are needed for every metric:

Groundedness (does the answer come from the provided context?) needs context.
Relevance (does the answer stay on topic?) only needs the query.
Coherence: Logical flow of ideas.
Fluency: Grammatical correctness and natural language quality.

Hallucination Types

Intrinsic: Contradicts existing facts.
Extrinsic: Fabricates information that does not exist. His example was lawyers citing non-existent court cases, making the stakes concrete.

LLM-as-Judge

Gaurav matched what Asmita covered but added one constraint: the judge model should be the same capability level or higher than the model being tested. A weaker model judging a stronger one gets you unreliable scores.

Live Demo on Azure AI Foundry

He uploaded a JSONL dataset with intentionally wrong answers, ran it through an evaluation pipeline with GPT-4.1 as the judge, and showed how per-test-case failure reasons come back. He also showed the same done via code, which matters for teams that want evaluation in a CI pipeline rather than a UI.

Worth noting: testing AI agent LLM applications costs money in a way traditional testing does not. Every evaluation invocation costs tokens. Plan for it. Tools mentioned include LangSmith, PromptFlow, TruLens, and Azure AI Foundry.

Great AI isn't just about generating answers it's about generating the right answers.

Gaurav Khurana breaking down Relevance in LLM evaluation at #SpartansSummit pic.twitter.com/ulXTkypVkM
— TestMu AI (@testmuai) March 11, 2026

Session 4: From Pilot to Pipeline - How Teams Are Actually Scaling AI in Quality Engineering

Srinivasan Sekar and Sai Krishna, Directors of Engineering at TestMu AI (Formerly TestMu AI), hosted a live workshop across three parts.

Part 1: What Is MCP?

If you have three AI models and five tools, you normally need 15 custom integrations to connect them all. MCP makes that a standard protocol, so any model can talk to any tool without a custom bridge for each pairing. The USB-C analogy Srinivasan used is accurate: one standard, interoperable across devices.

The architecture breaks down into three components:

Host: The user-facing application (an IDE, a chat interface)
Client: Handles communication between host and server using JSON-RPC
Server: Exposes tools, resources, and prompt templates

Servers can be local (running as executables) or remote (accessed over HTTP). The spec is open, evolving roughly quarterly, and maintained by an Anthropic-led council.

Part 2: Building a Custom MCP Server

Sai built a performance metrics MCP server in TypeScript on top of Playwright MCP. The motivation: the default Playwright MCP does not expose client-side performance metrics like First Contentful Paint, Time to First Byte, or DOM content loaded.

Key implementation details include registering tools with descriptive names (the agent uses the description to decide which tool to call), returning results in the correct contract format, and adding a resource document so the agent can interpret raw numbers against reference values.

Two demos were shown:

A Cursor agent pointed at a website, collecting performance metrics and interpreting results against benchmarks in the resource doc.
An Appium MCP connected to a physical Android device. The agent opened the Contacts app, created a new contact, and saved it. It also generated a WebdriverIO test file with locators from the DOM it traversed.

Playwright MCP. GitHub MCP. Appium MCP. The ecosystem is building fast, QA engineers who understand MCP architecture will have a serious edge. pic.twitter.com/keJbnx4o2X
— TestMu AI (@testmuai) March 11, 2026

Part 3: MCP Security

This section was the most direct. The core problem: MCP servers are trusted by the agent. If a server's tool description contains malicious instructions, the agent follows them without questioning.

Attack types covered include:

Tool Poisoning: Hidden instructions in the tool description, executed silently.
Advanced Tool Poisoning: Malicious instructions in error return values, which the LLM treats as content to act on.
Schema Poisoning: Any node in the input/output schema can carry instructions.
Prompt Injection: Malicious instructions hidden in Markdown resource files that the agent reads.
Rug Pull Attacks: The tool behaves normally for several uses, then the server updates the description to add malicious behavior.
DNS Rebinding: An attacker takes control of localhost and accesses local MCP servers through open browser tabs.

In the live demo, Sai added one hidden instruction to a tool description telling the agent to return system information (username, location, IP) alongside performance metrics. The agent did it. The user saw only the performance metrics in the output.

Defense tools mentioned include Secure Hulk (open source, built by Srinivasan), MCP Scan (Invariant Labs), and ETDI (Enhanced Tool Definition Interface). Proper security testing of MCP servers should be a prerequisite before integrating them into your workflows.

Session 5: From Pilot to Pipeline - How Teams Are Actually Scaling AI in Quality Engineering

The panel featured Harinee Muralinath (Director, Thoughtworks), Jaydeep Chakrabarty (Director of AI in Tech, Piramal Finance), Pricilla Bilavendran (Team Leader, Billennum), Rahul Parwal (Specialist, ifm engineering), and Siddhant Wadhwani (Engineering Manager, Newfold Digital).

The panel question: 2025 was the year of AI experiments. How do you actually operationalize any of this?

From Experiments to Systems

Several panelists landed on the same framing: experiments do not scale, systems do. Harinee talked about treating AI workflows as code, with prompts and agent steps in version-controlled repositories with review pipelines. Siddhant described a shift from "which LLM should we use" to "where in our workflow is there friction that AI could reduce."

"The teams winning with AI stopped asking 'which LLM?' and started asking 'where are our workflow frictions?' That shift changes everything."#SpartansSummit pic.twitter.com/PdHwKFELJr
— TestMu AI (@testmuai) March 11, 2026

Rahul introduced a framing worth keeping. "Slop" is what AI produces by default when there is no engineering around it. "Kino" (the German word for cinema, edited and structured) is what you get when you apply guardrails, iteration, and judgment. Moving from one to the other is the actual engineering work.

The Testing Pyramid

The panel agreed the pyramid's values have not changed (fast feedback, stable tests closer to the core, reducing noise) even if its structure has shifted:

Siddhant described the future as more of a mesh, with AI generating tests at multiple layers simultaneously and selecting which to run based on code changes.
Harinee added a new layer that did not exist before: agent-level tests for workflow and tool-use validation.
Jaydeep added a useful check: just because AI can write 5,000 tests does not mean you should have 5,000 tests. The pyramid needs periodic evaluation, not just expansion.

AI Use Cases That Work in Production

The consensus list from the panel includes:

Test case generation with AI from specifications
Failure triage and root cause analysis
PR review for test gaps and security issues
Self-healing and flaky test detection

Harinee flagged the limit: AI handles obvious test scenarios well, but domain complexity and regulatory requirements layer in things models currently miss.

Metrics Worth Tracking

Jaydeep's framing: stop asking "did AI generate more?" Start asking, "Did engineering get safer?" His team tags AI-generated code in their repository and tracks what percentage of AI MRs developers accept without modification. Currently: about 14% pass through untouched, 70-80% need iteration, and about 10% are beyond what the model can handle.

Other metrics raised by the panel:

Signal-to-noise ratio in test output
Defect detection effectiveness (how early are bugs caught?)
Time for root cause analysis
Regression cycle time
Engineering cognitive load reduction

Challenges at Scale

Siddhant listed four challenges:

Data Quality: AI is only as reliable as the data it trains on.
Missing Eval Frameworks: Teams deploy without measuring accuracy or bias.
Tool Sprawl: Experimentation without integration.
Adoption Resistance: Engineers do not trust systems that behave unpredictably.

His team built an internal AI portal called Atlas to consolidate use cases rather than leaving them scattered across disconnected tools.

Final Advice From the Panel

Rahul: Execute, do not just attend talks.
Siddhant: Identify workflow friction first, then introduce AI at those exact points.
Harinee: Start with evals, practice risk-based thinking.
Jaydeep: Learn how the systems you are testing actually work, including the bias implications. He gave the example of a resume shortlisting system that recommended "James" but rejected "Jane" for the same resume. QA engineers focused on AI in software testing need enough model knowledge to catch that kind of failure.

Session 6: Stop Writing Tests. Start Training Models

Rohit Mehta, Practice Head of QA at Pratham Software, opened with a scenario most people in the room had lived through: a payment system goes down at 3 am. Six hours later, you find a bug in a loop. Your regression suite has 20,000 green tests. Management asks what the tests were doing.

The problem he named: coverage without intelligence. You can have a large test suite and still miss what matters. The bottleneck in modern QA is not writing tests. It is managing them, trusting them, and making them tell you something useful.

Testing as a Data Problem

Rohit's central argument: testing is now a data problem, not a coverage problem. The question is not how many tests you have. It is whether you are using your existing signals to make the testing you do more targeted.

Five training signals he recommended ingesting:

Jira defect history
CI failure logs
Production telemetry (logs, traces, metrics)
User behavior patterns
Incident postmortems

Smart Regression: Risk Scoring

Two files in the same PR can get very different scores based on:

Developer History: New hire vs. senior contributor.
Module Defect Rate: How often has this area broken before?
Blast Radius: How critical is this component?

A payment gateway file with high risk history gets a score of 0.95 and triggers a human review flag. A CSS file with low history and low blast radius gets 0.15 and can go through with minimal scrutiny.

Test Generation Types

Diff-Based Suggestions: The model reads the code change and generates tests for the new code paths.
Defect Cluster Analysis: If a login module has 45 historical null pointer exceptions and that module just changed, the model flags null-handling tests as high priority.
API Contract Variation: Generating tests across the full contract surface, including auth, schema, and OWASP checks.

AI-powered test generation goes beyond simple templates. It identifies test scenarios and detects patterns that developers would otherwise overlook. pic.twitter.com/zvaP6Z5cLR
— TestMu AI (@testmuai) March 11, 2026

Flaky Test Prediction

The model builds environment fingerprints: which runner, which tests, which pass/fail history. After enough data, it can classify a failing test as "85% likely a flaky test, no product change correlates with this failure," which saves significant investigation time.

Where AI Fails

Rohit covered this directly:

Hallucinated Scenarios: Models generate test cases that make no sense for your product.
Confident But Wrong: High confidence scores built on low-quality training data.
Biased by Bad Defect Data: If your Jira tickets are badly written, the model learns from badly written tickets.
Missing Domain Context: Without context about what your product does, the model guesses.

His conclusion: AI output is a recommendation, not a verdict. Guardrails (policy filters, domain invariance rules, risk thresholds) need to be defined explicitly.

Session 7: Agentic AI for Testing - A Self-Updating Quality System That Writes, Runs, Explains, and Governs Tests

Partha Sarathi Samal, QE Manager at Paramount, addressed a specific problem: before a release, a VP of engineering asks, "Is this safe to ship?" Most test systems cannot give a useful answer. They give you a pass/fail count. That is not the same thing.

His framing: current AI tools are good at isolated tasks. They write a test, suggest a fix, summarize a failure. None of those answers the release question: what changed, what risk did that create, what validation covered it, and what evidence supports the go/no-go decision?

The Five-Agent Loop

One change (a PR, a story, an API spec update) enters the system and triggers a closed loop:

Change Intelligence: Reads the diff and produces risk-scored change events.
Coverage Orchestration: Maps events against a living dependency graph and selects tests within the available time budget.
Verification: Runs the validations.
Failure Narrative: Explains what happened when something breaks, including the class of failure (product defect, environment instability, or flakiness) with evidence.
Quality Gate: Applies deterministic policies to produce a pass, warn, or fail verdict.

Immutable Evidence Packs

Every run produces an evidence pack: logs, screenshots, traces, network summaries, checksums, and execution metadata. Partha's principle: a test result without evidence is an opinion. If you cannot replay what happened and explain why, you cannot defend a release decision.

A test result without evidence is just an opinion. Every validation should produce logs, traces, screenshots, and checksums, or it's not proof, it's hope."#SpartansSummit pic.twitter.com/Br6Py2rJTU

Causal Attribution

The output shifts from "test 481 failed" to explanations like:

"This failure is most consistent with a product defect in the payment service, supported by a trace mismatch and a network error following a contract shift."
"This looks like environmental instability seen across unrelated suites, not a product regression."
"This is a flaky test with no correlated code change."

That distinction tells the engineer what to do next without manual investigation.

The Living Coverage Graph

Coverage percentage is not a useful number. "80% coverage" does not tell you which services, workflows, or failure modes are covered. A graph does. Connect services, tests, requirements, incidents, and production paths, then ask:

Which validations defend this API?
Which tests cover this service dependency?
Which incidents have no prevention edge yet?

When a production incident hits a path with no strong validation edge, the graph records it, and the next cycle targets it.

Bounded Repair vs Self-Healing

Self-healing sounds useful, but it can hide real change. A selector updates automatically, the test goes green, but the product behavior may have changed in a risky way, and nobody noticed.

Bounded repair is stricter: limited, auditable, reversible updates with guardrails. Log every repair. Require repeatability. Escalate risky cases. Never silently rewrite reality.

Adoption Model

Do not replace your pipeline. Progress in three stages:

Observation Mode: The system runs silently alongside your existing tests.
Recommendation Mode: It suggests what to run and what the policy would say, but humans still decide.
Enforcement Mode: Deterministic gates actively shape release outcomes.

Metrics That Matter

Instead of pass rate (which can look healthy while untested risk accumulates), Partha recommended tracking:

Mean time to detect
Mean time to explain
Flake rate
Selection efficiency (is the risk model choosing valuable validations?)
Reproducibility confidence
Escape defect proxy

Partha's one-line summary: a modern quality system should explain release risk with evidence. Teams focused on AI agent testing should prioritize these metrics over simple pass rates.

Common Threads Across All Sessions

Seven sessions, different speakers, different roles. A few things came up repeatedly.

"Run Everything" Is No Longer a Strategy: Whether framed as risk-based selection, smart regression, or budget-constrained orchestration, the goal is tests that tell you something true, not tests that produce large numbers.
Evals Are Not Optional for AI Systems: You cannot test non-deterministic output with deterministic assertions. A different evaluation framework is required, and it needs to be in your CI/CD pipeline.
Human Judgment Is Not Going Away, but Its Focus Is Shifting: The consensus across sessions was to let machines handle algorithmic and procedural work. Use humans for creativity, domain judgment, ethical assessment, and validating that the machines are doing what you think they are.
Security Around AI Tooling Needs More Attention: The MCP session was the most direct about this, but the thread ran through several other talks. When you give an AI system access to tools, you are trusting everything in the execution path. That trust needs to be earned, audited, and limited.

Author

TestMu AI

Blogs: 201

TestMu AI is World's First Full Stack AI Agentic Quality Engineering platform that empowers teams to test intelligently, smarter, and ship faster. Built for scale, it offers a full-stack testing cloud with 10K+ real devices and 3,000+ browsers. With AI-native test management, MCP servers, and agent-based automation, TestMu AI supports Selenium, Appium, Playwright, and all major frameworks. AI Agents like HyperExecute and KaneAI bring the power of AI and cloud into your software testing workflow, enabling seamless automation testing with 120+ integrations. TestMu AI Agents accelerate your testing throughout the entire SDLC, from test planning and authoring to automation, infrastructure, execution, RCA, and reporting.