
Exploring ambiguity in software development. Discover AI's potential, challenges, & bronze-bullet solutions for enhanced testing.
Matt Heusser
March 15, 2026
About ten years ago, a certain large technology company hung giant banners announcing their "futurists." I remember a big billboard in Toronto, Canada, detailing a vision of a future where you could speak software into existence, and perhaps even think it into being. Given that English, like other non-scientific human languages, is vague and ambiguous, I doubted it. AI struggles to translate words into symbols, but so do humans: even when teams of people collaborate on written descriptions, they often have wildly different understandings of what should be created.
Then the Large Language Models arrived, promising to take ambiguous natural language as input and generate code. There are, however, classic engineering problems with this. Fred Brooks, the lead manager on OS/360, one of the world's first structured software engineering efforts, involving several hundred people, argued that it was logically impossible for a single silver bullet to slay the software development beast. Brooks proposed that software is built through different activities: requirements (what), design (how), code (build it), and test. Since no single activity takes more than about 25% of the time, a silver bullet that made one activity instant and free would yield at most a 25% reduction. Brooks's solution was a series of bronze bullets, each one making things a little better.
Despite the rhetoric, Brooks's logic seems to apply today: simple English descriptions of software sent into AI do fall short. More than that, we can't even "AI" our way out of the test process. What we might be able to achieve, though, is a series of small, rapid victories in using AI for testing. As we discussed previously, it's worthwhile to be at the forefront of this endeavor.
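Brooks's arithmetic is easy to check with a toy model. The effort split below is an illustrative one we made up, not his data:

```ruby
# Toy illustration of Brooks's point, with a made-up effort split in
# which no activity exceeds 25% of the total.
effort = { requirements: 25, design: 25, code: 25, test: 25 }
total  = effort.values.sum

# Even a "silver bullet" that makes coding instant and free...
remaining = total - effort[:code]

# ...shrinks the overall effort by only 25%.
savings = 1.0 - remaining.to_f / total
puts savings # => 0.25
```

Hence the appeal of bronze bullets: many small reductions across every activity compound, where one dramatic fix cannot.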
As the landscape of software testing continues to evolve, the integration of Generative AI introduces innovative possibilities and challenges. We conducted a poll on TestMu AI's social media platform, asking the community about their primary concerns regarding AI in test automation. The majority of respondents voiced apprehension about "data privacy," shedding light on a critical aspect that demands careful consideration in the era of advanced testing methodologies.

In this article, we'll explore some AI use cases in testing. We've tried these techniques, either at work or on sample code from the public domain, and have some commentary to share. The ideas below look for bronze bullets in applying AI across software development, with a focus on test and quality.
Using generative AI tools can streamline and enhance the accuracy of unit test creation from complex requirements or existing code. KaneAI stands out by addressing the limitations of traditional AI tools, which often struggle with intricate dependencies or ambiguous requirements.
KaneAI is a GenAI native QA Agent-as-a-Service platform that excels in generating precise tests directly from plain English descriptions and specifications, translating natural language instructions into automated tests. This not only ensures clarity but also reduces manual effort.
Additionally, KaneAI effectively handles inconsistent requirements and complex logic, offering actionable insights and improving test accuracy. With KaneAI, you benefit from intelligent, context-aware testing tailored to your needs, all while upholding privacy and compliance standards.
With the rise of AI in testing, it's crucial to stay competitive by upskilling or polishing your skill set. The KaneAI Certification proves your hands-on AI testing skills and positions you as a future-ready, high-value QA professional.
Perhaps, at best, the generated tests could find crashes and unhandled exceptions. Still, we wouldn't expect the AI to recognize the correct behavior of the code without being told. Thus, we also would not expect it to generate unit tests that expose a failure when the software deviates. We asked ChatGPT to create tests for a Roman numeral to decimal calculator, supplying the requirements and the code and asking for generated tests. The initial tests ChatGPT 3.5 produced covered single digits (I to III), additive combinations, and examples that involve subtraction, such as IX or XL. It was a pleasant surprise when, after the first round of unit tests, we asked for more tests to "break" or "stress" the software. These can find interesting use cases and unexpected combinations. When we asked for a stress test of the Roman numeral to decimal evaluator, here's what ChatGPT 3.5 produced:
```ruby
def test_stress_test
  romannumeral = Romannumeral.new
  min_numeral = 'I' * 1000
  max_numeral = 'M' * 1000
  assert_equal 1000, romannumeral.rn_to_decimal(min_numeral)
  assert_equal 1_000_000, romannumeral.rn_to_decimal(max_numeral)
end
```
Reviewing this article, Harry Robinson pointed out that the assertion and values are an oracle, a programmed way of finding a problem. He raised an intriguing question: did humans initially create the original tests that ChatGPT then expanded upon? It's conceivable that ChatGPT aligned with examples present on GitHub, even though we hadn't provided any such examples to begin with. Notably, despite a thorough search of both GitHub and the wider internet, we were unable to uncover any parallels to this particular strategy. For that matter, more than three "I"s in a row is arguably an invalid Roman numeral; we need requirements. The same applies to Roman numerals with invalid letters, special characters, and decimals; we need to know what the software should do.
Acme auto insurance charges based on age. Clients under 16 cannot purchase insurance. Clients aged 16-21 pay $600 per month for insurance. 21 to 30 pay $500 per month. 30 to 40 pay $400 per month. 40 to 60 pay $350 per month. 61 to 99 pay $550 per month. In addition, Acme insurance has a good student discount of -10% per month for students who get at least a 3.0 in school and maintain it. In addition, Acme has a safe driver discount of -5% for drivers who have had no accidents in 5 years. There is a bad driver penalty of +5% for drivers who have had one at-fault accident in the past 24 months and +10% for two or more at-fault accidents. Coverage is not offered if someone has had five at-fault accidents. Students who have the bad driver label are not eligible for the good student discount.
We hoped the tool would find that the 21 and 30 age bands overlap. It did, along with the overlap at 40. However, the tool falls short in providing guidance regarding a 100-year-old driver. Similarly, when faced with a scenario involving a responsible 22-year-old student driver, the tool does not offer a definitive answer. Should we subtract both discounts at once, taking 15% off the $500, or apply them sequentially, multiplying $500 by 90% and then by 95%? The generative AI tool lacks the capability to advise us on this matter. Remember, LLM tools don't have insight the way a human does; they find mathematical models for how words could be strung together. With enough training and a consistent domain, such as tax law, it might be more helpful. As of now, that would be under a great deal of human supervision.
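The discount-stacking question can be made concrete in a few lines of arithmetic. This is a sketch of our own, not output from any tool, and it shows that the two readings of "10% good student plus 5% safe driver" genuinely disagree, so a requirements decision is needed:

```ruby
# Two readings of combining a -10% and a -5% discount on a $500 premium.
base = 500.0

additive   = base * (1 - 0.10 - 0.05)        # subtract both at once
sequential = base * (1 - 0.10) * (1 - 0.05)  # apply one after the other

puts format('additive: $%.2f', additive)     # $425.00
puts format('sequential: $%.2f', sequential) # $427.50
```

A $2.50 monthly difference per policy is exactly the kind of discrepancy that surfaces as a "bug" months later, when it was really an ambiguity all along.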

I want to generate a .csv file. The fields are first name, last name, middle initial, social security, date of birth, sex, and home address with street number, street, city, state and zip. These should be people who live in Michigan with Allegan zip codes - that is 49010. They should have a rough bell curve of ages from 18-50. They should all have unique first and last names. 50 names should do.
With ChatGPT, now you can! Though ChatGPT kept trying to generate small samples; it needed to be coaxed to make more, maxing out at about 50 rows. Bard did even worse, creating ten rows and then producing blank lines. There's a bit more on test data generation in another article for the LambdaTest blog. And, of course, there are plenty of open source test data generators; they just don't process natural English the way generative AI can.
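For comparison, here is roughly what a hand-rolled generator for the prompt above looks like using only Ruby's standard library. The name pools, SSN range, and street are placeholder values of our own, not output from any tool:

```ruby
require 'csv'
require 'date'

# Placeholder name pools; a real generator would use a larger corpus.
FIRST = %w[Ava Ben Cara Dev Elle Finn Gia Hugo Iris Jude]
LAST  = %w[Anders Brook Cole Diaz Ely Ford Gray Holt Ives Jett]

# Rough bell curve over ages 18-50: average three uniform draws.
def bell_age
  (18 + Array.new(3) { rand(0..32) }.sum / 3.0).round
end

# 50 unique first/last combinations drawn from the pools above.
rows = FIRST.product(LAST).sample(50).map.with_index(1) do |(first, last), i|
  [first, last, ('A'..'Z').to_a.sample,
   format('900-00-%04d', i),                 # obviously fake SSN
   (Date.today - bell_age * 365).to_s,       # date of birth
   %w[M F].sample,
   "#{rand(100..999)} Main St, Allegan, MI, 49010"]
end

csv = CSV.generate do |out|
  out << %w[first_name last_name middle_initial ssn dob sex address]
  rows.each { |r| out << r }
end
```

The point is not that this is hard to write; it is that the LLM got there from a paragraph of English, while this version took a programmer and a handful of deliberate design choices.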
These examples highlight how generative AI is becoming a practical ally for testers across various activities. For a deeper look at the evolving role of AI in software testing, check out how teams are applying intelligent automation to improve coverage and efficiency.

Andy Hird, a software tester in Cleveland, Ohio (who is, as of this writing, running for the board of the Association for Software Testing), pointed out that there is a difference between using AI to generate test ideas and data and relying on it to generate expected results. In other words, if you feed these tools your code, get tests back, cut, paste, save, and run, you deserve the result you'll likely get. Wouter Lagerweij, another friend from consulting, believes that few technical staff are comfortable working in the tight test-code-refactor feedback loops that Test Driven Development (TDD) and Behavior Driven Development (BDD) require. Generative AI, especially AI in the development environment such as GitHub's Copilot, can work to fill that gap, but you'll have to double-check it.
The great benefit of using these tools for stub code is that they can get us past that "stuck" point where we just can't figure out why a few lines of code aren't working, for example, to launch a browser or click a button. That "stuck" point might involve only two lines of code yet cost an hour, or four. Ironically, when I tried to have the tools generate simple code for the Roman numeral kata, the code contained an off-by-one index error, accessing an array outside its bounds and causing a crash. Debugging and fixing it took about two hours, and two lines of code: about half as long as it would have taken to write from scratch. Blake Link, a developer with Excelon Development, reports a 90% reduction in "boilerplate" code generation.
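To illustrate the class of bug involved, here is our own minimal converter, not the generated code. A subtractive-notation parser has to look one character ahead, and that lookahead is exactly where an off-by-one error creeps in:

```ruby
# Minimal Roman-numeral converter (our own sketch, not the generated
# code). The lookahead chars[i + 1] is nil on the final character;
# without the nil guard below, VALUES.fetch(nil) raises and crashes.
VALUES = { 'I' => 1, 'V' => 5, 'X' => 10, 'L' => 50,
           'C' => 100, 'D' => 500, 'M' => 1000 }.freeze

def rn_to_decimal(numeral)
  chars = numeral.chars
  total = 0
  chars.each_with_index do |ch, i|
    value = VALUES.fetch(ch)
    nxt = chars[i + 1] # nil past the end of the array
    if nxt && VALUES.fetch(nxt) > value
      total -= value   # subtractive pair such as IV or XL
    else
      total += value
    end
  end
  total
end

rn_to_decimal('MCMXCIV') # => 1994
```

The fix really is about two lines (the guard on the lookahead), which matches the experience above: trivial in size, expensive to locate.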
We realize the two data points above undercut each other. And yet, here we are. Our goal was neither to tell you AI is an amazing solution nor to "debunk" anything. Instead, we wanted to do a broad, detailed analysis of generative AI for testing, to push the conversation, research, and application in the field a half-inch forward.
For now, it's probably best to look to these tools as sources of test ideas and as generators of template code to drive events. Tools already exist to generate synthetic test data, to support accessibility testing, and to produce combinatorial ideas. The power of LLMs is less in doing that work and more in making programmatic sense of unstructured natural language. Using the right prompting techniques can significantly improve the quality of output you get from these generative AI tools.
That's just the opinions of a few of us, plus a few experiments and a little more experience. What do you think? Where can you use generative AI for testing? More importantly, where are you using it right now?
Letâs keep talking.