Can You Trust Your ChatBot: Techniques for Testing LLM Responses
Like everyone and their mother, you now have a chatbot on your site. But can you trust it to give the right answers? Not insult the users? Not give them ingredients for a homemade exploding salad?
Testing something that gives you a whole lot of text, and never the same way twice? That's tough. But not impossible.
The never - What you never, ever want to see
The must - The beginning of a good answer
Golden Datasets - What good answers look like
Tone and bias detection - What proper answers look like
Scorecards - What the new "pass/fail" looks like
The AI Judge - Ask a smarter bot to settle the argument between you and your bot. What modern delegation looks like. Unfortunately.
All with live examples. If you're testing chatbots, AI agents, or just want to know if somebody's prompt will crash your system - this one's for you.
Yes, you can run a couple of examples and see if your chatbot behaves. But trust? That we need to build. So, let's make sure that bot doesn't get us on the news.
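To make the "never", "must", and scorecard ideas a bit more concrete, here is a minimal sketch in Python. It is not taken from the talk: the phrase lists, the function names `check_response` and `scorecard`, and the 0.9 threshold are all assumptions chosen for illustration. The idea is that since the bot never answers the same way twice, you run the same prompt several times and score the batch instead of asserting on a single string.

```python
# Hypothetical rule lists (assumptions for this sketch); real ones
# would come from your own risk analysis of the chatbot.
NEVER_PHRASES = ["explosive", "you idiot"]
MUST_PHRASES = ["refund"]


def check_response(text: str) -> dict:
    """Run the 'never' and 'must' checks on a single chatbot response."""
    lowered = text.lower()
    return {
        # Forbidden phrases that showed up in the response.
        "never_violations": [p for p in NEVER_PHRASES if p in lowered],
        # Required phrases that did not show up.
        "must_missing": [p for p in MUST_PHRASES if p not in lowered],
    }


def scorecard(responses: list[str], threshold: float = 0.9) -> dict:
    """Aggregate the checks over several runs of the same prompt.

    Any 'never' violation fails the whole batch. The 'must' checks pass
    when the share of compliant responses reaches the threshold.
    """
    results = [check_response(r) for r in responses]
    never_hits = sum(1 for r in results if r["never_violations"])
    must_ok = sum(1 for r in results if not r["must_missing"])
    must_rate = must_ok / len(results) if results else 0.0
    return {
        "runs": len(results),
        "never_hits": never_hits,
        "must_rate": must_rate,
        "passed": never_hits == 0 and must_rate >= threshold,
    }


if __name__ == "__main__":
    # Stand-ins for three runs of the same prompt against a real bot.
    sample = [
        "You can request a refund within 30 days.",
        "Refund requests are handled by support.",
        "Please contact support for help.",
    ]
    print(scorecard(sample, threshold=0.6))
```

Plain substring matching is the crudest possible check. It only covers the "never" and "must" end of the list; tone, bias, and judge-style evaluation need more than string comparison.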
Key Takeaways:
Because of non-deterministic results, we need better testing.
New techniques for automation and CI.
We need to cover risks we're not used to dealing with (e.g. safety, security).
About the speaker
Gil Zilberfeld:
Gil Zilberfeld has been writing code and building software for over 25 years, starting with his trusty Sinclair ZX81. His career has taken him from developer to team leader and consultant, giving him a deep, holistic understanding of the software lifecycle. Today, he operates as an independent consultant and trainer under the brand TestinGil, where he helps R&D and QA professionals build better software from the inside out. Gil's core philosophy is that quality is a team sport. He is a benevolent skeptic of industry dogma, consistently challenging ineffective "best practices" in favor of pragmatic, context-driven solutions that work in the real world. He is a frequent speaker at international conferences, where he shares his expertise on topics ranging from TDD and clean code to the modern complexities of web automation. He is currently focused on creating actionable, engineering-led frameworks for ensuring the quality and long-term resilience of AI-powered systems. You can find his articles, videos, and courses at testingil.com. In his spare time, he shoots zombies for fun.
About TestMu Conf
Testμ (TestMu) is the world’s largest virtual conference on agentic engineering and quality, built by the community, for the community. As AI reshapes how we build, test, and ship software, Testμ Conf is where you connect, grow, and lead: agentic workflows, autonomous quality, battle-tested AI playbooks, hands-on workshops, and the engineering culture driving it all.