Voice quality testing measures call clarity using MOS, POLQA, and PESQ. Learn the metrics, methods, parameters, and how to test VoIP, IVR, and AI voice agents.

Naima Nasrullah
June 10, 2026
A call can connect perfectly and still be unusable. If speech arrives delayed, clipped, or buried under distortion, the customer hangs up, and no uptime dashboard will tell you why. Voice quality testing exists to measure that gap between "the call went through" and "the caller could actually understand it," scored on the 1 to 5 Mean Opinion Score (MOS) scale defined in ITU-T Recommendation P.800.
This guide covers the metrics that matter (MOS, PESQ, and POLQA), the network parameters that move them, and how to actually run the tests, from a VoIP line to an IVR menu to a modern AI voice agent. The goal is simple: leave knowing which score to track, how to measure it, and what to automate first.
Overview
Voice quality testing measures the clarity, naturalness, and intelligibility of speech across a voice channel, then reports it on the 1 to 5 MOS scale so teams can catch degradation before customers do.
What you need to measure and control:
Voice quality testing is the practice of measuring how clearly and naturally speech is carried across a voice channel, then scoring it so degradation can be caught and fixed. It evaluates intelligibility, distortion, delay, and dropouts on a VoIP call, an IVR menu, a contact center route, or an AI voice agent, and reports the result on the 1 to 5 Mean Opinion Score scale from ITU-T P.800.
The key idea is that connectivity and quality are different things. A call can stay connected end to end while the audio is choppy, robotic, or a half-second behind, and only a quality measurement, not a connection check, will surface that.
What voice quality testing typically evaluates:
For voice-driven menus specifically, this sits right next to IVR automation testing, where prompt audio and recognition accuracy matter as much as routing.
Three metrics anchor almost every voice quality conversation. MOS is the score, while PESQ and POLQA are the standardized algorithms that produce a MOS-like score without a room full of human listeners.
| Metric | Standard | What It Measures | Best For |
|---|---|---|---|
| MOS | ITU-T P.800 | Perceived quality on a 1 to 5 scale, the reference every other score maps to. | Reporting a single, human-meaningful quality number. |
| PESQ | ITU-T P.862 | Objective MOS estimate by comparing degraded audio to a clean reference. | Narrowband and early VoIP automated scoring. |
| POLQA | ITU-T P.863 | Successor to PESQ for wideband and HD voice, higher sampling rates, modern codecs. | Today's HD voice, mobile, and AI voice channels. |
How to read a MOS score:
PESQ and POLQA are both full-reference algorithms, meaning they need the original clean audio to compare against. POLQA, defined in ITU-T P.863, replaced PESQ to handle the wideband and HD voice that narrowband-era PESQ was never built for, which is why new projects standardize on POLQA.
There are two ways to land on a MOS score: ask humans, or run an algorithm. Subjective testing is the original method, in which a listening panel rates audio under controlled conditions per ITU-T P.800. Objective testing estimates that same score automatically with PESQ or POLQA, which is what makes continuous, repeatable testing possible.
| Aspect | Subjective Testing | Objective Testing |
|---|---|---|
| How it works | Human listeners rate audio samples on the 1 to 5 scale. | PESQ or POLQA compares degraded audio to a clean reference. |
| Repeatability | Varies by listener, mood, and session. | Deterministic; the same input yields the same score. |
| Speed and scale | Slow; needs panels and controlled rooms. | Fast; runs unattended on every build. |
| Best for | Calibration, new codecs, edge cases. | Regression, monitoring, CI pipelines. |
The practical answer is not one or the other. Use objective scoring to test continuously at scale, and keep a small subjective panel to calibrate thresholds and judge anything the algorithms were not designed for, such as a brand-new synthetic voice.
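To make the subjective side concrete, the sketch below shows how a panel's ratings could be rolled up into a MOS, since the score is simply the mean of the individual opinion scores. The function name `summarizePanel`, the returned field names, and the use of a normal-approximation interval are assumptions for illustration, not part of P.800 or any particular tool.

```javascript
// Summarize a listening panel's ratings into a MOS with a rough 95% interval.
// Ratings use the 1 to 5 absolute category scale: 5 Excellent, 4 Good, 3 Fair, 2 Poor, 1 Bad.
// summarizePanel and its return shape are hypothetical names for this sketch.
function summarizePanel(ratings) {
  if (ratings.length < 2) {
    throw new Error("Need at least two ratings to estimate spread");
  }
  for (const r of ratings) {
    if (!Number.isInteger(r) || r < 1 || r > 5) {
      throw new Error(`Rating out of range: ${r}`);
    }
  }
  const n = ratings.length;
  const mos = ratings.reduce((sum, r) => sum + r, 0) / n;
  // Sample variance (n - 1 denominator).
  const variance = ratings.reduce((sum, r) => sum + (r - mos) ** 2, 0) / (n - 1);
  // 1.96 is the normal-approximation multiplier; a small panel would call for a t-value instead.
  const halfWidth = 1.96 * Math.sqrt(variance / n);
  return { mos, low: mos - halfWidth, high: mos + halfWidth, listeners: n };
}

const panel = summarizePanel([4, 4, 5, 3, 4]);
console.log("MOS:", panel.mos.toFixed(2), "interval:", panel.low.toFixed(2), "to", panel.high.toFixed(2));
```

A wide interval is a hint that the panel is too small to calibrate a threshold against, which is one reason to keep objective scoring as the day-to-day measure.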
Note: Validate the web and mobile self-service channels around your voice systems across 10,000+ real devices and every major browser with TestMu AI.
A MOS score is an effect; network and audio parameters are the cause. To fix a low score, you test the conditions that produced it, and most degradation traces back to a handful of measurable parameters.
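Two of those parameters, jitter and packet loss, can be derived directly from per-packet timing. The sketch below uses the running interarrival jitter estimate from RFC 3550 and a simple sequence-gap count for loss. The record shape (`seq`, `sentMs`, `receivedMs`) and the function name `streamStats` are assumptions for illustration, and the loss count ignores sequence-number wraparound and packets lost before the first or after the last one captured.

```javascript
// Estimate interarrival jitter and packet loss from per-packet records.
// Each packet is { seq, sentMs, receivedMs }, listed in arrival order.
// Field names and streamStats are hypothetical; adapt them to your capture tool.
function streamStats(packets) {
  if (packets.length === 0) throw new Error("No packets captured");
  let jitter = 0;
  for (let i = 1; i < packets.length; i++) {
    const prev = packets[i - 1];
    const curr = packets[i];
    // D is the change in transit time between two consecutive arrivals.
    const d = (curr.receivedMs - prev.receivedMs) - (curr.sentMs - prev.sentMs);
    // RFC 3550 running estimate: move 1/16 of the way toward |D|.
    jitter += (Math.abs(d) - jitter) / 16;
  }
  const seqs = packets.map((p) => p.seq);
  const expected = Math.max(...seqs) - Math.min(...seqs) + 1;
  const lost = expected - new Set(seqs).size;
  return { jitterMs: jitter, lossRatio: lost / expected };
}

const stats = streamStats([
  { seq: 1, sentMs: 0, receivedMs: 100 },
  { seq: 2, sentMs: 20, receivedMs: 125 },
  { seq: 4, sentMs: 60, receivedMs: 160 }, // seq 3 never arrived
  { seq: 5, sentMs: 80, receivedMs: 180 },
]);
console.log("jitter(ms):", stats.jitterMs.toFixed(3), "loss:", stats.lossRatio);
```

Logging these values next to each MOS result makes it easier to tell whether a low score came from the network path or from something else, such as a codec or transcoding step.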
A useful testing habit is to reproduce these conditions deliberately. By injecting controlled latency, jitter, and packet loss on the network path, you can watch the MOS score move and set realistic pass or fail thresholds for the conditions your real users face.
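When planning which impaired conditions to inject, a rough model can predict where the score is likely to cross your floor before you run real calls. The sketch below is not POLQA or PESQ. It swaps in a simplified form of the ITU-T G.107 E-model: the R-to-MOS mapping is the standard one, while the delay and loss coefficients follow a commonly cited approximation for a G.711-style call and should be treated as assumptions. The names `rToMos`, `estimateMos`, and `planImpairments` are hypothetical, and the 3.8 floor is only an example value.

```javascript
// Convert an E-model transmission rating R into an estimated MOS (ITU-T G.107 mapping).
function rToMos(r) {
  if (r <= 0) return 1;
  if (r >= 100) return 4.5;
  return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6;
}

// Simplified R from one-way delay (ms) and packet loss (0..1).
// ASSUMPTION: coefficients are an approximation for a G.711-style codec, not a measurement.
function estimateMos(oneWayDelayMs, lossRatio) {
  const d = oneWayDelayMs;
  const delayImpairment = 0.024 * d + (d > 177.3 ? 0.11 * (d - 177.3) : 0);
  const lossImpairment = 30 * Math.log(1 + 15 * lossRatio);
  return rToMos(94.2 - delayImpairment - lossImpairment);
}

// Sweep a small impairment matrix and mark which conditions are predicted to miss the floor.
function planImpairments(delaysMs, lossRatios, mosFloor) {
  const rows = [];
  for (const delayMs of delaysMs) {
    for (const loss of lossRatios) {
      const mos = estimateMos(delayMs, loss);
      rows.push({ delayMs, loss, mos: Number(mos.toFixed(2)), pass: mos >= mosFloor });
    }
  }
  return rows;
}

for (const row of planImpairments([50, 150, 300], [0, 0.01, 0.05], 3.8)) {
  console.log(row);
}
```

Jitter does not appear as its own term here because, in practice, a jitter buffer turns it into either extra delay or late-packet loss. Use the prediction to choose test conditions, then rely on the measured score to decide pass or fail.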
Objective voice quality testing follows a consistent shape regardless of the channel: send a known clean sample, capture what comes out the other end, and let an algorithm score the difference. The detail is in controlling the conditions in between.
A practical objective-testing workflow:
Conceptually, a single automated voice quality check reads like the snippet below. It scores a captured call against a reference and asserts a MOS floor:
// Conceptual objective voice quality check (POLQA/PESQ scoring).
// loadAudio, captureCall, and voiceQuality are placeholders for your own tooling, not a real library.
test("call audio meets the MOS floor", async () => {
  const reference = loadAudio("reference-speech.wav"); // known clean sample
  const degraded = await captureCall(reference); // the same sample, recorded after crossing the network
  const result = voiceQuality.score(reference, degraded, { mode: "polqa" });
  console.log("MOS:", result.mos, "Latency(ms):", result.latency);
  // Fail the build if quality drops below the agreed floor
  expect(result.mos).toBeGreaterThanOrEqual(3.8);
});

The voice path is only half the system. The web portals and mobile apps that share data with an IVR or voice agent need the same rigor, and you can validate that channel on the TestMu AI cloud grid with a standard Selenium setup:
const { Builder } = require("selenium-webdriver");

const capabilities = {
  browserName: "Chrome",
  browserVersion: "latest",
  "LT:Options": {
    platform: "Windows 11",
    build: "Voice Quality - Self-Service Channel",
    name: "Validate web self-service flow",
    user: process.env.LT_USERNAME,
    accessKey: process.env.LT_ACCESS_KEY,
  },
};

(async () => {
  const driver = await new Builder()
    .usingServer("https://hub.lambdatest.com/wd/hub")
    .withCapabilities(capabilities)
    .build();
  try {
    // Stand-in for the self-service portal that mirrors your voice channel
    await driver.get("https://www.testmuai.com/selenium-playground/");
    // assert the same data a caller hears is correct in the web channel
  } finally {
    await driver.quit();
  }
})();

Run that suite on a real device cloud so the self-service experience is verified on the same devices and browsers your customers actually use.
AI voice agents change what "quality" means. A classic VoIP test asks whether the audio is clear; an AI voice agent test asks that, and then whether the agent listened correctly, answered fast enough, and said the right thing. Audio scoring with MOS, PESQ, and POLQA still applies, but it is now the floor, not the whole test.
What AI voice agent quality testing adds on top of audio scoring:
This is where audio testing meets conversational testing. The design trade-offs behind reliable agents are covered in the voice agent's core dilemma, and keeping them healthy after launch is the job of voice observability, which monitors live agents in production.
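Two of those extra checks lend themselves to simple, repeatable numbers: whether the agent heard the caller correctly, and whether it answered fast enough. The sketch below measures the first as word error rate (word-level edit distance divided by the reference length) and the second as a nearest-rank percentile over response times. The function names, the text normalization rules, and the sample data are assumptions for illustration. Any thresholds you assert on are yours to set.

```javascript
// Word error rate: word-level edit distance divided by the number of reference words.
// The normalization (lowercase, punctuation stripped) is an assumption; match it to your transcripts.
function wordErrorRate(reference, hypothesis) {
  const normalize = (text) =>
    text.toLowerCase().replace(/[^\p{L}\p{N}\s']/gu, "").split(/\s+/).filter(Boolean);
  const ref = normalize(reference);
  const hyp = normalize(hypothesis);
  if (ref.length === 0) throw new Error("Reference transcript is empty");
  // prev[j] holds the edit distance between the reference words seen so far
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1; // reference word missing from the hypothesis
      const insertion = curr[j - 1] + 1; // extra word in the hypothesis
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Nearest-rank percentile, used here for agent response latency in milliseconds.
function percentile(values, p) {
  if (values.length === 0) throw new Error("No samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

const wer = wordErrorRate(
  "transfer fifty dollars to savings",
  "Transfer fifteen dollars to savings."
);
const p95 = percentile([900, 300, 650, 420, 510], 95);
console.log("WER:", wer, "p95 response latency (ms):", p95);
```

A percentile is used rather than an average because a few slow turns are what callers notice, and an average hides them.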
Because an AI voice agent is a conversational system, an autonomous testing agent is a natural fit for validating it. TestMu AI's Agentic Testing platform plans, authors, runs, and proves tests across web, UI, API, and network layers, and its KaneAI test agent lets you author those checks in plain English. To wire it into your pipeline, follow the KaneAI documentation.
There is no single best voice quality testing tool, only the best fit for what you are measuring. Tooling in this space falls into a few categories, and most teams combine them rather than relying on one.
A selection checklist, whatever the category:
For the digital channels around the voice path, a unified automation testing platform avoids stitching together a separate toolchain per channel, and an AI testing approach handles the conversational layer that classic audio tools never touch. The same logic that drives how to test a chatbot applies to voice: test the audio and the conversation, not one or the other.
Strong voice quality programs measure continuously, test under realistic conditions, and assert on meaning rather than connection. The practices below keep results trustworthy and actionable.
Start by picking one channel and one number: choose POLQA, set a MOS floor your users would accept, and score a recorded test call across the real network path. Once that baseline runs in CI, add impaired-condition tests for latency, jitter, and packet loss, then extend the same discipline to your AI voice agents with transcript and response-latency checks.
For the web, mobile, and API channels that share data with your voice systems, validate them on the TestMu AI real device cloud, author conversational checks with the Agentic Testing platform, and follow the KaneAI documentation to wire it into your release pipeline. Reliable voice experiences come from tests that run on every change, not from one listen-through before launch.