
Learn to create diverse test cases, evaluate with both intrinsic and extrinsic metrics, and balance performance with resource management for reliable LLMs.

TestMu AI
January 30, 2026
Scaling LLM testing involves balancing effective model performance with resource management. The main challenge is handling the computational cost and complexity of diverse test scenarios while maintaining efficiency.
In this session with Anand Kannappan, Co-founder and CEO of Patronus AI, you’ll learn to overcome these challenges by focusing on creating diverse test cases, avoiding reliance on weak intrinsic metrics, and exploring new evaluation methods beyond traditional benchmarks.
If you couldn’t catch all the sessions live, don’t worry! You can access the recordings at your convenience by visiting the TestMu AI YouTube Channel.
As Anand walked through the session’s agenda, he emphasized key points from his abstract. He conveyed how companies are excited about the possibilities of generative AI but equally concerned about potential risks, such as hallucinations, unexpected behavior, and unsafe outputs from large language models (LLMs).

He highlighted that testing LLMs is significantly different from testing traditional software due to the unpredictability and wide range of possible behaviors. The focus of his talk was on reliable and automated methods for effectively testing LLMs at scale.
After providing a brief overview and introducing his team, Anand began by explaining the basics of LLM testing in detail.
LLM testing basics involve evaluating large language models (LLMs) to ensure their accuracy, reliability, and effectiveness. This includes assessing their performance using both intrinsic metrics, which measure the model’s output quality in isolation, and extrinsic metrics, which evaluate how well the model performs in specific real-world applications. Effective LLM testing requires a combination of these metrics to comprehensively understand the model’s strengths and limitations.
After giving this overview, he moved on to the challenges of evaluating LLMs.
Below are some of the core challenges he explained during the session:
Anand discussed the distinction between intrinsic and extrinsic evaluation of large language models (LLMs).

He explained that intrinsic evaluation measures a language model’s performance in isolation and offers a few advantages, such as being faster and easier to perform than extrinsic evaluation. It can also provide insights into the model’s internal workings and identify areas for improvement.
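As a concrete illustration of an intrinsic metric (not taken from the session itself), perplexity can be computed directly from a model’s per-token log-probabilities, with no downstream task involved. A minimal sketch, where `token_logprobs` is assumed to be the natural-log probabilities your model or API reports for each generated token:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Intrinsic metric: perplexity from per-token natural-log probabilities.

    Lower is better. It scores the model's output in isolation,
    without reference to any downstream task.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# Toy example: three tokens with fairly high probabilities -> low perplexity.
print(perplexity([-0.105, -0.223, -0.051]))  # ~1.13
```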
Extrinsic evaluation involves assessing a language model’s performance based on its effectiveness in specific real-world tasks or applications. This approach requires access to these tasks, which may not always be readily available, making it potentially time-consuming and resource-intensive. Despite these challenges, extrinsic evaluation has advantages as well, such as providing a more accurate measure of the model’s performance in practical scenarios and ensuring that the model meets the needs of its intended users.
@anandnk24 explains how extrinsic evaluation, such as the GLUE benchmark, provides practical insights into LLM performance by assessing real-world task handling. Join this session to learn more!
— LambdaTest (@testmuai) August 22, 2024
He concluded this challenge by stating that both intrinsic and extrinsic evaluations have their respective strengths and limitations; combining both approaches can offer a more comprehensive assessment of a model’s overall performance.
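To complement the intrinsic sketch above, extrinsic evaluation can be as simple as scoring the model’s answers on a downstream task against reference answers. A minimal sketch (the toy task data below is an assumption for illustration):

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Extrinsic metric: normalized exact-match accuracy on a downstream task."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)

# Toy downstream task: question answering against known references.
predictions = ["Paris", "1991", "Blue whale"]
references = ["paris", "1991", "blue whale"]
print(exact_match_accuracy(predictions, references))  # 1.0
```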
It is crucial to consider these challenges and limitations when evaluating large language models to obtain a thorough understanding of their capabilities and effectiveness.
To address these challenges effectively, leveraging GenAI native tools like KaneAI, offered by TestMu AI, can be highly beneficial. It is designed to automate test creation, debugging, and execution using natural language. KaneAI supports multi-language test generation and integrates with various frameworks. It also enhances efficiency by enabling intelligent test planning and automation.
With the rise of AI in testing, it’s crucial to stay competitive by upskilling or polishing your skill set. The KaneAI Certification proves your hands-on AI testing skills and positions you as a future-ready, high-value QA professional.
He further explained the difficulties associated with using open-source benchmarks in evaluating large language models (LLMs).
He discussed the use of open-source benchmarks for evaluating the performance of language models, highlighting several challenges and limitations associated with them. He pointed out that the lack of standardization in the creation and use of these benchmarks can make it difficult to compare the performance of different models. Additionally, he noted that the quality of the data used to create open-source benchmarks can vary, which may impact the reliability and validity of the evaluation results.
He also addressed the issue of domain specificity, explaining that many open-source benchmarks are designed for specific domains, potentially limiting their generalizability to other areas. Furthermore, he mentioned that these benchmarks might not cover all the tasks and scenarios a language model is expected to handle in real-world applications. Ethical concerns were also highlighted, including the use of benchmarks that might contain biased or sensitive data.

To address these challenges, he emphasized the importance of considering several factors when using open-source benchmarks. He recommended using benchmarks that are transparently created and shared, with clear documentation of data sources and methods.
Ensuring data diversity was another key point, as it helps in obtaining more generalizable evaluation results. He also stressed the need for comprehensive coverage of tasks and scenarios to gain a full understanding of a language model’s performance. Finally, he advised being mindful of ethical considerations and avoiding benchmarks that contain biased or sensitive data.
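Parts of these recommendations can be automated before a benchmark is trusted. Below is a minimal sketch (the file path and JSONL field names are assumptions) that flags exact-duplicate prompts and reports label balance, two quick signals of data quality and diversity:

```python
import json
from collections import Counter

def audit_benchmark(path: str) -> None:
    """Light-weight audit of an open-source benchmark stored as JSONL.

    Assumes each line has 'prompt' and 'label' fields; adjust to your schema.
    Reports exact-duplicate prompts and label distribution so you can judge
    data quality and diversity before trusting evaluation results.
    """
    prompts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            prompts.append(record["prompt"].strip().lower())
            labels.append(record.get("label", "unlabeled"))

    duplicates = sum(count - 1 for count in Counter(prompts).values() if count > 1)
    print(f"examples: {len(prompts)}, exact-duplicate prompts: {duplicates}")
    for label, count in Counter(labels).most_common():
        print(f"label {label!r}: {count} ({count / len(labels):.1%})")

# audit_benchmark("benchmark.jsonl")  # hypothetical file
```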

While explaining the challenges, Anand also addressed a common question he encounters in his work with large language models (LLMs): “How to constrain LLM outputs on your own?”

He outlined several methods for guiding LLMs:
He concluded that these strategies can effectively constrain LLM outputs to better meet specific needs and contexts.
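One common pattern for constraining outputs (a general technique, not necessarily one of the exact methods covered in the session) is to request structured JSON, validate it against a small schema, and retry on failure. A minimal sketch, where `call_llm` is a hypothetical stand-in for your model client:

```python
import json

REQUIRED_KEYS = {"verdict", "explanation"}

def call_llm(prompt: str) -> str:
    """Hypothetical model client; replace with your actual API call."""
    raise NotImplementedError

def constrained_judgement(question: str, max_retries: int = 3) -> dict:
    """Ask for JSON with a fixed schema, validate it, and retry on bad output."""
    prompt = (
        "Answer the question below. Respond ONLY with JSON of the form "
        '{"verdict": "yes" | "no", "explanation": "<one sentence>"}.\n\n'
        f"Question: {question}"
    )
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON, ask again
        if REQUIRED_KEYS <= parsed.keys() and parsed["verdict"] in {"yes", "no"}:
            return parsed
    raise ValueError("model never produced a valid constrained output")
```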
As he continued explaining the LLM challenges, he further emphasized the critical role of high-quality data before delving into long-term scalability and oversight issues. He expressed concern that high-quality training data for large language models (LLMs) might run out by 2026. He highlighted that high-quality evaluation data is a major bottleneck for improving foundation models, both in the short term and in the long term.
Anand highlights a crucial stat: high-quality language training data is projected to run out by 2026! 🌐 Join this session to learn how this impacts LLMs and discover strategies to navigate this challenge.
— LambdaTest (@testmuai) August 22, 2024
He noted that LLMs are generally trained using large volumes of high-quality data, such as content from Wikipedia. However, he identified two main challenges: the prevalence of low-quality data from sources like comments on Reddit or Instagram and the necessity for companies and developers to create high-quality synthetic data.
This synthetic data is crucial for LLMs to continue growing and improving as the supply of high-quality natural data decreases. Proceeding further, he discussed a major challenge in AI development: scalability in large language models (LLMs).
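As one illustration of the synthetic-data point (a generic pattern, not a method shown in the session), evaluation examples can be bootstrapped from templates over a small set of curated facts rather than scraped from low-signal sources:

```python
import random

# Hypothetical seed facts; in practice these come from a curated,
# high-quality source rather than low-quality web comments.
FACTS = [
    {"entity": "Python", "attribute": "first released", "value": "1991"},
    {"entity": "HTTP/2", "attribute": "standardized", "value": "2015"},
]

TEMPLATES = [
    "In what year was {entity} {attribute}?",
    "When was {entity} {attribute}?",
]

def synthesize_eval_set(n_examples: int, seed: int = 0) -> list[dict]:
    """Generate synthetic question/answer pairs from templates and seed facts."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        fact = rng.choice(FACTS)
        template = rng.choice(TEMPLATES)
        examples.append({
            "question": template.format(**fact),
            "answer": fact["value"],
        })
    return examples

print(synthesize_eval_set(3))
```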
Anand discussed the scalability challenges associated with large language models (LLMs) in AI development. He highlighted the potential future of scalable oversight, where AI systems might be employed to evaluate other AI systems to maintain reliability and performance at scale, and he addressed the transparency, accountability, and fairness questions this raises for AI development and deployment.
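The “AI evaluating AI” idea behind scalable oversight is often implemented as an LLM-as-judge loop: a second model grades the first model’s answers so evaluation can keep pace with scale. A minimal sketch, where `call_judge_model` is a hypothetical client for whichever evaluator model you choose:

```python
def call_judge_model(prompt: str) -> str:
    """Hypothetical client for the evaluator model; replace with a real API call."""
    raise NotImplementedError

def judge_response(question: str, candidate_answer: str) -> bool:
    """Ask a second model to grade another model's answer as PASS or FAIL."""
    prompt = (
        "You are grading another AI system's answer.\n"
        f"Question: {question}\n"
        f"Answer: {candidate_answer}\n"
        "Reply with exactly one word: PASS if the answer is correct, "
        "relevant, and safe; otherwise FAIL."
    )
    verdict = call_judge_model(prompt).strip().upper()
    return verdict.startswith("PASS")

def pass_rate(qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of candidate answers the judge model accepts."""
    results = [judge_response(q, a) for q, a in qa_pairs]
    return sum(results) / len(results)
```

Because the judge is itself an LLM, its verdicts should be spot-checked against human review to keep the oversight loop transparent and accountable.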

To provide a better understanding of the concepts related to large language models (LLMs) and Retrieval-Augmented Generation (RAG) systems, which are integral to AI product development, Anand offered a detailed discussion by explaining the workflow of the RAG evaluation framework.
Anand provided an in-depth analysis of evaluating Retrieval-Augmented Generation (RAG) systems, focusing on strategies to enhance their performance and scalability. He underscored the importance of comprehensive evaluation to ensure that RAG systems operate effectively in real-world scenarios.
He discussed the complexities associated with building effective RAG systems, which are vital for AI product development. Anand highlighted the distinct challenges these systems face, particularly regarding evaluation and performance. He noted that choosing the appropriate evaluation metrics—whether intrinsic or extrinsic—is crucial and should be tailored to the specific use case and application of the RAG system.
Anand discusses the Retrieval-Augmented Generation (RAG) framework, crucial for evaluating models that blend retrieval and generation. This framework assesses retrieval accuracy, generation quality, and overall task performance.
— LambdaTest (@testmuai) August 22, 2024
He detailed the RAG evaluation process, which involves assessing the performance of two main components: the retriever, judged on how accurately it surfaces relevant context, and the generator, judged on the quality and faithfulness of the answers it produces.
He stated that when selecting models and frameworks for RAG systems, understanding the types of evaluation—intrinsic and extrinsic—is crucial.
He stated that choosing the appropriate evaluation metrics depends on the specific application or task of the RAG system. Intrinsic metrics are useful for assessing the general quality of a model’s output, while extrinsic metrics provide insights into how effectively the model performs within its intended application.
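A concrete way to apply this to RAG is to score the two components separately: a retrieval metric such as recall@k for the retriever, and an answer-quality metric for the generator. A minimal sketch (the example’s field names are assumptions):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Retriever-side metric: share of relevant documents found in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

def answer_exact_match(generated: str, reference: str) -> bool:
    """Generator-side metric: normalized exact match against a reference answer."""
    return generated.strip().lower() == reference.strip().lower()

def evaluate_rag_example(example: dict, k: int = 5) -> dict:
    """Score one RAG test case on both components.

    `example` is assumed to contain the retrieved document ids, the ids judged
    relevant by annotators, the generated answer, and the reference answer.
    """
    return {
        "recall@k": recall_at_k(example["retrieved_ids"], set(example["relevant_ids"]), k),
        "exact_match": answer_exact_match(example["generated_answer"], example["reference_answer"]),
    }
```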
After covering how to select the right tools, he shared a few tips on implementation strategies for RAG systems below.
RAG systems are an essential component of AI product development. However, they come with their own set of challenges, especially when it comes to performance and scalability. Here are some strategies to improve the performance and scalability of RAG systems:
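One widely used strategy (a general pattern, not necessarily one Anand listed) is to cache embeddings for repeated queries so the retriever does not recompute them on every request. A minimal sketch, where `embed_text` is a hypothetical stand-in for your embedding model:

```python
from functools import lru_cache

def embed_text(text: str) -> tuple[float, ...]:
    """Hypothetical embedding call; replace with your actual embedding model."""
    raise NotImplementedError

@lru_cache(maxsize=10_000)
def cached_embedding(text: str) -> tuple[float, ...]:
    """Cache embeddings for repeated queries to cut latency and compute cost.

    Returning an immutable tuple avoids accidental mutation of cached values.
    """
    return embed_text(text)
```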

Anand: An Agentic RAG system may produce better results than a Modular RAG system, depending on the specific application and use case. Agentic systems are designed to adapt and make decisions dynamically, which can enhance performance in complex scenarios. However, modular systems offer flexibility and a clean separation of components, which can be advantageous for certain tasks. The choice between them should be based on the specific requirements and goals of the application.
Anand: To reliably assess LLM performance, use a mix of metrics:
Anand: To ensure LLMs maintain ethical standards and avoid biases, it’s crucial to:
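One way to operationalize a bias check (a generic probe, not a method named in the session) is to compare the model’s responses to counterfactual prompt pairs that differ only in a demographic attribute. `call_model` is a hypothetical client:

```python
def call_model(prompt: str) -> str:
    """Hypothetical model client; replace with your actual API call."""
    raise NotImplementedError

# Counterfactual prompt pairs: identical except for a demographic attribute.
COUNTERFACTUAL_PAIRS = [
    ("Should we hire John, a 30-year-old software engineer?",
     "Should we hire Maria, a 30-year-old software engineer?"),
    ("Describe a typical nurse named David.",
     "Describe a typical nurse named Sarah."),
]

def mismatch_rate() -> float:
    """Fraction of counterfactual pairs whose answers differ after normalization.

    A high rate suggests the model treats demographically swapped inputs
    differently and warrants closer review. In practice, a judge model or a
    sentiment/toxicity scorer gives a more robust comparison than raw strings.
    """
    differing = sum(
        call_model(a).strip().lower() != call_model(b).strip().lower()
        for a, b in COUNTERFACTUAL_PAIRS
    )
    return differing / len(COUNTERFACTUAL_PAIRS)
```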
Anand: Scaling LLMs is costly due to the need for extensive server space and heavy GPU usage for training, though these costs are expected to decrease over time with technological advancements. The role of QA professionals is to ensure that cost reductions do not impact performance, by focusing on efficient testing and staying updated on new technologies.
Anand: There isn’t a one-size-fits-all model for Retrieval-Augmented Generation (RAG) systems. However, frameworks like LangChain and LlamaIndex are well-suited for RAG applications, as they offer robust tools for integrating retrieval and generation components effectively. The choice of model or framework should align with the specific requirements and goals of the RAG system.
Anand: To improve the accuracy of large language models (LLMs) and minimize false positives and negatives, he highlighted several key measures:
Anand: To improve LLM accuracy, he recommends:
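Whatever specific measures are adopted, tracking false positives and false negatives explicitly is a practical first step. A minimal sketch (not from the session) for an LLM-backed classifier or judge:

```python
def precision_recall(predictions: list[bool], labels: list[bool]) -> tuple[float, float]:
    """Track false positives and false negatives explicitly.

    Precision falls when false positives rise; recall falls when false
    negatives rise. Monitoring both over time shows whether accuracy
    improvements are real or just trading one error type for the other.
    """
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```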
Please don’t hesitate to ask questions or seek clarification within the TestMu AI Community.