SESSION

Eval-First QA Agents: Testing Streaming Platforms at Fox Networks

Everyone's building AI agents for QA. Most ship demos. A few break production. Almost none have evals.

I'm building one at Fox right now. It runs against our streaming platforms (Apple TV, Roku, Fire TV), and I'm refusing to ship it until it can pass an eval suite as rigorous as the one I use to gate the test code it generates.

This talk is the field report. The architecture: a 38-tool MCP server, a 43-API helper library with a 3-tier resolution pipeline, and a Promptfoo eval suite currently holding the line at 88 assertions and a 100% pass rate. The hard parts: hallucinated test steps, false-pass syndrome on visual checkpoints, brittle grounding on TV UIs that don't behave like web pages, and the recurring problem of an agent that's confident when it should be silent.

I'll walk through what actually works, what broke embarrassingly in production, and the design pattern that's emerging from the wreckage: eval-first agent development. Don't build the agent and bolt tests on afterward. Build the eval suite first. Use it to constrain what capabilities the agent is allowed to exercise. Promote new tools into the agent's hands only after the evals catch up to cover them.
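As a rough illustration of the gating idea, here is a minimal Python sketch in which a tool is exposed to the agent only once enough eval assertions cover it and all of them pass. Everything in it is an assumption made for illustration: the `ToolGate` class, the `min_assertions` coverage floor, and the tool names are hypothetical and are not taken from the Fox MCP server or from Promptfoo.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EvalResult:
    """Outcome of one eval assertion, tagged with the tool it exercises."""
    tool: str
    passed: bool


@dataclass
class ToolGate:
    """Hypothetical gate: a tool is promoted only when its evals exist and all pass."""
    min_assertions: int = 3  # assumed coverage floor per tool
    results: list[EvalResult] = field(default_factory=list)

    def record(self, tool: str, passed: bool) -> None:
        """Store the outcome of one eval assertion for a tool."""
        self.results.append(EvalResult(tool, passed))

    def is_promoted(self, tool: str) -> bool:
        relevant = [r for r in self.results if r.tool == tool]
        # Too few evals (or none) keeps the tool out of the agent's hands.
        if len(relevant) < self.min_assertions:
            return False
        # A single failing assertion blocks promotion.
        return all(r.passed for r in relevant)

    def allowed_tools(self, candidates: list[str]) -> list[str]:
        """Filter a candidate tool list down to the promoted ones."""
        return [t for t in candidates if self.is_promoted(t)]


# Illustrative tool names only.
gate = ToolGate(min_assertions=2)
gate.record("launch_app", True)
gate.record("launch_app", True)
gate.record("capture_screenshot", True)  # covered by one assertion only
gate.record("press_remote_key", True)
gate.record("press_remote_key", False)   # failing eval

candidates = ["launch_app", "capture_screenshot", "press_remote_key", "seek_playback"]
print(gate.allowed_tools(candidates))  # prints ['launch_app']
```

In this sketch the default is denial: a tool with no evals at all (`seek_playback`) is treated the same as one with a failing eval, which is the ordering the eval-first loop implies.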

If you're tired of "AI agent for QA" talks that are 80% architecture diagrams and 20% wishful thinking, this is the opposite: concrete tools, real failure modes, and a workflow you can adopt Monday morning.

Key Takeaways:

  • The eval-first development loop for QA agents and why it changes the order you build things.

  • A practical agent architecture: MCP server design, helper libraries, and resolution pipelines for streaming-platform QA.

  • The five failure modes that keep biting me (including the ones I'm still hitting).

  • How to use Promptfoo to gate agent capabilities, not just score outputs.

  • A realistic path from "experimental agent" to "agent your team trusts to triage UHD VPF spikes at 2am".

About the speaker

Gregory Goldshteyn:

Gregory Goldshteyn is a QA leader with substantial experience across the IT industry. He has held roles at Salesforce and Sony and is currently at FOX, where he leads quality strategy within Video Engineering and Quality Assurance for high-concurrency streaming environments in which a single second of latency is unacceptable. He specializes in bridging the gap between complex backend engineering and seamless user experiences. His current focus areas include the AI revolution in QA, moving beyond traditional automation into agentic testing and GenAI to build self-healing, low-maintenance test suites, and influencing senior leadership to adopt disruptive technologies that reduce regression cycles from days to minutes. As a mabl Ambassador and frequent conference speaker, including at Testmu Conf, Gregory shares how QA teams can leverage low-code tools to outpace traditional SDLC bottlenecks. Beyond his technical work, he leads diverse technology product teams and mentors the next generation of software developers, and he has been included in the Bristol Who's Who international registry.

TESTMU-CONF 2026

About TestMu Conf

Testμ (TestMu) is the world’s largest virtual conference on agentic engineering and quality, built by the community, for the community. As AI reshapes how we build, test, and ship software, Testμ Conf is where you connect, grow, and lead: agentic workflows, autonomous quality, battle-tested AI playbooks, hands-on workshops, and the engineering culture driving it all.