SESSION

Backwards Scoring: Ranking Test Suites by Which Real Incidents They Would Have Caught

Test coverage is the metric everyone has and nobody trusts. A 90%-covered codebase still ships incidents. The question worth answering is harder: of the tests we run, which ones have ever caught anything that mattered?

This paper proposes a backwards effectiveness score. For each historical incident in our streaming QE database, we replay candidate test suites against the pre-incident state of the system and record which tests would have failed. Tests that fail on real historical incidents score high. Tests that have never failed on anything real score low, regardless of code coverage or assertion count.
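One way to picture the scoring step is the minimal sketch below. It assumes the replay has already been run and its results collected into a mapping from incident ID to the set of tests that failed against that incident's pre-incident state. The names `backwards_scores` and `replay_failures`, the incident IDs, and the choice to normalise by the number of incidents are all illustrative assumptions, not the speaker's implementation.

```python
from collections import Counter


def backwards_scores(all_tests, replay_failures):
    """Score each test by the fraction of historical incidents it would have caught.

    all_tests: list of test IDs under evaluation (hypothetical IDs).
    replay_failures: dict mapping incident ID -> set of test IDs that failed
        when replayed against that incident's pre-incident system state.
    """
    known = set(all_tests)
    caught = Counter()
    for failed in replay_failures.values():
        # Only count tests that are part of the evaluated suite.
        caught.update(set(failed) & known)
    total = len(replay_failures)
    # A test that never failed on any replay scores 0.0, whatever its coverage.
    return {t: (caught[t] / total if total else 0.0) for t in all_tests}


# Hypothetical replay results for two incidents and three tests.
example_replay = {
    "INC-1": {"test_playback_start", "test_manifest_parse"},
    "INC-2": {"test_playback_start"},
}
example_tests = ["test_playback_start", "test_manifest_parse", "test_footer_links"]
example_scores = backwards_scores(example_tests, example_replay)
# example_scores == {"test_playback_start": 1.0,
#                    "test_manifest_parse": 0.5,
#                    "test_footer_links": 0.0}
```

Normalising by incident count is only one option. A raw catch count, or a weighting by incident severity, would fit the same structure.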

We applied the method to roughly 4,200 automated test cases over an 18-month incident window. About 12% had no historical catch evidence. Another 8% caught incidents already caught earlier in the pipeline, contributing duplicate coverage. We did not auto-delete those tests. We flagged them for owner review, and the conversation that followed surfaced cases where the test was preventive rather than reactive (caught problems before they entered the incident database) and others where the test was simply dead weight.
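The triage described above could be sketched as follows. This is an illustration under stated assumptions, not the speaker's tooling: each test is assumed to carry a pipeline stage index (lower runs earlier), a test counts as "duplicate" only when every incident it catches was already caught at a strictly earlier stage, and the labels and function names are hypothetical. The output is a review queue for owners, with nothing deleted.

```python
def classify_tests(test_stage, replay_failures):
    """Label each test as 'first_catch', 'duplicate', or 'no_evidence'.

    test_stage: dict mapping test ID -> pipeline stage index (assumed: lower = earlier).
    replay_failures: dict mapping incident ID -> set of test IDs that failed on replay.
    """
    # For every test, record per caught incident whether it was at the earliest stage.
    catches = {t: [] for t in test_stage}
    for failed in replay_failures.values():
        known = [t for t in failed if t in test_stage]
        if not known:
            continue
        earliest = min(test_stage[t] for t in known)
        for t in known:
            catches[t].append(test_stage[t] == earliest)

    labels = {}
    for t, was_earliest in catches.items():
        if not was_earliest:
            labels[t] = "no_evidence"   # never failed on any replayed incident
        elif not any(was_earliest):
            labels[t] = "duplicate"     # always preceded by an earlier-stage catch
        else:
            labels[t] = "first_catch"
    return labels


def review_queue(labels):
    """Return tests to send to their owners for review (never auto-deleted)."""
    return sorted(t for t, label in labels.items() if label != "first_catch")


# Hypothetical suite: stage 0 = unit, 1 = integration, 2 = end-to-end.
example_stages = {"unit_a": 0, "integ_b": 1, "e2e_c": 2, "e2e_d": 2}
example_replay = {"INC-1": {"unit_a", "integ_b"}, "INC-2": {"e2e_c"}}
example_labels = classify_tests(example_stages, example_replay)
example_queue = review_queue(example_labels)
# example_queue == ["e2e_d", "integ_b"]
```

A "no_evidence" label is a prompt for a conversation, not a verdict: a preventive test that stops problems before they reach the incident database would land in the same bucket as genuinely dead weight.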

The contribution is threefold: the scoring method, its limits (it punishes preventive tests and depends on incident data quality), and a sober view of what "test effectiveness" can and cannot tell us.

Key Takeaways:

  • The Flaw in Traditional Metrics: Understand why high code coverage doesn't equate to incident prevention and why we need a "backwards" approach to validation.

  • The Backwards Scoring Framework: Learn a repeatable methodology for replaying historical incidents against existing test suites to identify which tests actually provide value.

  • Identifying "Dead Weight" Tests: How to use incident data to pinpoint tests that provide duplicate coverage or no historical evidence of catching real-world issues.

  • Balancing Reactive vs. Preventive Testing: Insights into the limits of effectiveness scoring, specifically how to distinguish between "useless" tests and those that are vital for prevention but don't show up in incident logs.

  • Data-Driven Test Maintenance: A practical workflow for flagging low-scoring tests for owner review rather than relying on blunt auto-deletion.

About the speaker

Partha Sarathi Samal:

Partha Samal is a Quality Engineering Leader with more than 21 years of experience in software testing. He has led test teams for media streaming platforms that serve millions of viewers across regions. His work spans test automation, performance engineering, and integration of QA practices into DevOps pipelines. Partha focuses on stable releases, strong test coverage, and fast feedback for developers. He uses AI and cloud platforms to scale test runs and spot problems earlier in the cycle. He mentors engineers on modern testing skills and builds practices that raise product quality and team ownership. He also speaks at internal forums and community events on practical ways to grow QA maturity.

TESTMU-CONF 2026

About TestMu Conf

Testμ (TestMu) is the world’s largest virtual conference on agentic engineering and quality, built by the community, for the community. As AI reshapes how we build, test, and ship software, Testμ Conf is where you connect, grow, and lead: agentic workflows, autonomous quality, battle-tested AI playbooks, hands-on workshops, and the engineering culture driving it all.