Voice Quality Testing: Metrics, Methods, and AI Voice Agents

Voice quality testing measures call clarity using MOS, POLQA, and PESQ. Learn the metrics, methods, parameters, and how to test VoIP, IVR, and AI voice agents.

Author

Naima Nasrullah

June 10, 2026

A call can connect perfectly and still be unusable. If speech arrives delayed, clipped, or buried under distortion, the customer hangs up, and no uptime dashboard will tell you why. Voice quality testing exists to measure that gap between "the call went through" and "the caller could actually understand it," scored on the 1 to 5 Mean Opinion Score (MOS) scale defined in ITU-T Recommendation P.800.

This guide covers the metrics that matter (MOS, PESQ, and POLQA), the network parameters that move them, and how to actually run the tests, from a VoIP line to an IVR menu to a modern AI voice agent. The goal is simple: leave knowing which score to track, how to measure it, and what to automate first.

Overview

Voice quality testing measures the clarity, naturalness, and intelligibility of speech across a voice channel, then reports it on the 1 to 5 MOS scale so teams can catch degradation before customers do.

What you need to measure and control:

  • MOS: The 1 to 5 perceived-quality score every other metric maps back to.
  • PESQ and POLQA: Objective algorithms that estimate MOS automatically by comparing audio to a clean reference.
  • Network parameters: Latency, jitter, and packet loss, the conditions that usually drag quality down.
  • Codecs: The compression scheme that sets the ceiling on achievable quality.
  • Conversation checks: For AI voice agents, response latency, interruption handling, and speech accuracy on top of audio.

What Is Voice Quality Testing?

Voice quality testing is the practice of measuring how clearly and naturally speech is carried across a voice channel, then scoring it so degradation can be caught and fixed. It evaluates intelligibility, distortion, delay, and dropouts on a VoIP call, an IVR menu, a contact center route, or an AI voice agent, and reports the result on the 1 to 5 Mean Opinion Score scale from ITU-T P.800.

The key idea is that connectivity and quality are different things. A call can stay connected end to end while the audio is choppy, robotic, or a half-second behind, and only a quality measurement, not a connection check, will surface that.

What voice quality testing typically evaluates:

  • Clarity and intelligibility: Whether words are understandable without strain or replays.
  • Distortion and artifacts: Robotic tones, metallic edges, clipping, or compression noise.
  • Delay and timing: One-way latency that makes natural back-and-forth feel awkward.
  • Dropouts and choppiness: Missing syllables caused by lost or late audio packets.
  • Echo and background noise: Reflections of the speaker's own voice and ambient interference.

For voice-driven menus specifically, this sits right next to IVR automation testing, where prompt audio and recognition accuracy matter as much as routing.

Voice Quality Metrics: MOS, PESQ, and POLQA

Three metrics anchor almost every voice quality conversation. MOS is the score, while PESQ and POLQA are the standardized algorithms that produce a MOS-like score without a room full of human listeners.

Metric | Standard | What It Measures | Best For
MOS | ITU-T P.800 | Perceived quality on a 1 to 5 scale, the reference every other score maps to. | Reporting a single, human-meaningful quality number.
PESQ | ITU-T P.862 | Objective MOS estimate by comparing degraded audio to a clean reference. | Narrowband and early VoIP automated scoring.
POLQA | ITU-T P.863 | Successor to PESQ for wideband and HD voice, higher sampling rates, modern codecs. | Today's HD voice, mobile, and AI voice channels.

How to read a MOS score:

  • 4.0 to 4.5: Toll quality, the clear, business-grade audio users expect from a good call.
  • 3.5 to 4.0: Acceptable, with minor imperfections most callers tolerate.
  • 3.0 to 3.5: Noticeable distortion or choppiness; users start to strain.
  • Below 3.0: Poor, the range where callers complain and abandon.
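
The bands above are easy to encode so that reports and dashboards label scores consistently. The following is a minimal sketch, not part of any scoring library: the function name `classifyMos` is hypothetical, a score that sits exactly on a boundary is assumed to belong to the higher band, and scores above 4.5 are assumed to count as toll quality.

```javascript
// Map a MOS value (1 to 5) to the rule-of-thumb bands listed above.
// Assumption: a boundary value such as 4.0 falls into the higher band.
function classifyMos(mos) {
  if (typeof mos !== "number" || Number.isNaN(mos) || mos < 1 || mos > 5) {
    throw new RangeError(`MOS must be a number between 1 and 5, got ${mos}`);
  }
  if (mos >= 4.0) return "toll quality";
  if (mos >= 3.5) return "acceptable";
  if (mos >= 3.0) return "noticeable degradation";
  return "poor";
}

console.log(classifyMos(3.8)); // prints "acceptable"
```

Keeping the mapping in one function means a change to the bands only has to be made in one place.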

PESQ and POLQA are both full-reference algorithms, meaning they need the original clean audio to compare against. POLQA, defined in ITU-T P.863, succeeded PESQ to handle the wideband and HD voice that PESQ, designed in the narrowband era, was not built for, which is why new projects generally standardize on POLQA.

Subjective vs Objective Voice Quality Testing

There are two ways to land on a MOS score: ask humans, or run an algorithm. Subjective testing is the original, where a listening panel rates audio under controlled conditions per ITU-T P.800. Objective testing estimates that same score automatically with PESQ or POLQA, which is what makes continuous, repeatable testing possible.

Aspect | Subjective Testing | Objective Testing
How it works | Human listeners rate audio samples on the 1 to 5 scale. | PESQ or POLQA compares degraded audio to a clean reference.
Repeatability | Varies by listener, mood, and session. | Deterministic; the same input yields the same score.
Speed and scale | Slow; needs panels and controlled rooms. | Fast; runs unattended on every build.
Best for | Calibration, new codecs, edge cases. | Regression, monitoring, CI pipelines.

The practical answer is not one or the other. Use objective scoring to test continuously at scale, and keep a small subjective panel to calibrate thresholds and judge anything the algorithms were not designed for, such as a brand-new synthetic voice.
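
On the subjective side, the arithmetic is simple: a panel MOS is the mean of the listeners' ratings. The sketch below computes that mean along with a rough 95% confidence interval, which helps when deciding whether a small calibration panel is large enough to trust. The function name `panelMos` is hypothetical, and the interval uses a normal approximation (1.96 standard errors), which is an assumption that gets weaker for very small panels.

```javascript
// Average a listening panel's 1-to-5 ratings into a MOS with a rough 95% CI.
// Assumption: normal approximation (mean +/- 1.96 * standard error).
function panelMos(ratings) {
  if (!Array.isArray(ratings) || ratings.length < 2) {
    throw new Error("need at least two ratings to estimate spread");
  }
  for (const r of ratings) {
    if (typeof r !== "number" || r < 1 || r > 5) {
      throw new RangeError(`each rating must be between 1 and 5, got ${r}`);
    }
  }
  const n = ratings.length;
  const mean = ratings.reduce((sum, r) => sum + r, 0) / n;
  // Sample variance (divides by n - 1).
  const variance = ratings.reduce((acc, r) => acc + (r - mean) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n);
  return { mos: mean, ci95: [mean - halfWidth, mean + halfWidth], listeners: n };
}

console.log(panelMos([4, 4, 5, 3, 4]).mos); // prints 4
```

A wide interval is a sign that the panel is too small or too inconsistent to calibrate a pass or fail threshold against.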

Note: Validate the web and mobile self-service channels around your voice systems across 10,000+ real devices and every major browser with TestMu AI.

What Affects Voice Quality? Key Parameters

A MOS score is an effect; network and audio parameters are the cause. To fix a low score, you test the conditions that produced it, and most degradation traces back to a handful of measurable parameters.

  • Latency (one-way delay): The mouth-to-ear time. ITU-T G.114 recommends keeping it at or below 150 ms for most applications, beyond which conversation starts to feel like a walkie-talkie.
  • Jitter: Variation in packet arrival times. High jitter forces the jitter buffer to drop or delay packets, which shows up as choppiness.
  • Packet loss: Audio packets that never arrive. Each lost packet is a missing slice of sound, so even a few percent loss can produce audible gaps, and how badly it hurts depends on the codec and its packet loss concealment.
  • Codec: The compression scheme sets a hard ceiling on quality. Wideband codecs preserve more of the original voice than older narrowband ones.
  • Echo: The speaker hearing their own voice reflected back, which is jarring and lowers perceived quality even when intelligibility is fine.
  • Background noise and distortion: Ambient interference and clipping that the algorithms penalize directly.
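
Jitter and packet loss can be computed directly from a packet trace, which is useful for logging them next to each MOS score. The sketch below is illustrative rather than a production RTP parser: the function name `summarizeStream` and the packet fields `seq`, `sentMs`, and `receivedMs` are assumptions, timestamps are taken to be in milliseconds, and sequence-number wraparound is ignored. The jitter estimate follows the running interarrival formula from RFC 3550, and loss is inferred from gaps in the sequence numbers, so packets lost after the last one received are not counted.

```javascript
// Summarize jitter and packet loss for one received audio stream.
// Each packet is { seq, sentMs, receivedMs }; the field names are illustrative.
function summarizeStream(packets) {
  if (!Array.isArray(packets) || packets.length === 0) {
    throw new Error("packets must be a non-empty array");
  }
  // Process packets in arrival order, as a receiver would.
  const ordered = [...packets].sort((a, b) => a.receivedMs - b.receivedMs);

  let jitterMs = 0;
  for (let i = 1; i < ordered.length; i++) {
    const prev = ordered[i - 1];
    const cur = ordered[i];
    // Difference between arrival spacing and send spacing for this pair.
    const d = (cur.receivedMs - prev.receivedMs) - (cur.sentMs - prev.sentMs);
    // RFC 3550 running estimate: move 1/16 of the way toward |d|.
    jitterMs += (Math.abs(d) - jitterMs) / 16;
  }

  // Loss is inferred from gaps between the lowest and highest sequence seen.
  const seqs = new Set(ordered.map((p) => p.seq));
  const expected = Math.max(...seqs) - Math.min(...seqs) + 1;
  const lost = expected - seqs.size;
  return { jitterMs, expected, received: seqs.size, lossPercent: (lost / expected) * 100 };
}
```

Because the jitter calculation only uses differences between consecutive packets, it does not require the sender and receiver clocks to be synchronized. Measuring one-way latency, by contrast, does.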

A useful testing habit is to reproduce these conditions deliberately. By injecting controlled latency, jitter, and packet loss on the network path, you can watch the MOS score move and set realistic pass or fail thresholds for the conditions your real users face.

How to Test Voice Quality, Step by Step

Objective voice quality testing follows a consistent shape regardless of the channel: send a known clean sample, capture what comes out the other end, and let an algorithm score the difference. The detail is in controlling the conditions in between.

A practical objective-testing workflow:

  • Pick a reference sample: Use a standardized, clean speech clip as the known-good baseline.
  • Play it across the real path: Send the sample through the network, codec, and device combination you want to test.
  • Capture the degraded audio: Record exactly what arrives at the far end, distortions and all.
  • Score with PESQ or POLQA: Compare the captured audio to the reference to get a MOS-equivalent score.
  • Log the network conditions: Record latency, jitter, and packet loss alongside the score so you know why it moved.
  • Run it in CI with a threshold: Fail the build when the score falls below your floor, so regressions surface before release.

Conceptually, a single automated voice quality check reads like the snippet below. It scores a captured call against a reference and asserts a MOS floor:

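// Conceptual objective voice quality check (POLQA/PESQ scoring).
// loadAudio, captureCall, and voiceQuality are placeholders for your own
// audio loader, call-capture harness, and scoring engine, not a real library.
// test and expect follow Jest-style conventions.
test("call audio stays above the MOS floor", async () => {
  const reference = loadAudio("reference-speech.wav"); // known clean sample
  // Play the reference over the network path and record what arrives
  const degraded = await captureCall(reference);

  const result = voiceQuality.score(reference, degraded, { mode: "polqa" });
  console.log("MOS:", result.mos, "Latency(ms):", result.latency);

  // Fail the build if quality drops below the agreed floor
  expect(result.mos).toBeGreaterThanOrEqual(3.8);
});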

The voice path is only half the system. The web portals and mobile apps that share data with an IVR or voice agent need the same rigor, and you can validate that channel on the TestMu AI cloud grid with a standard Selenium setup:

const { Builder } = require("selenium-webdriver");

const capabilities = {
  browserName: "Chrome",
  browserVersion: "latest",
  "LT:Options": {
    platform: "Windows 11",
    build: "Voice Quality - Self-Service Channel",
    name: "Validate web self-service flow",
    user: process.env.LT_USERNAME,
    accessKey: process.env.LT_ACCESS_KEY,
  },
};

(async () => {
  const driver = await new Builder()
    .usingServer("https://hub.lambdatest.com/wd/hub")
    .withCapabilities(capabilities)
    .build();

  try {
    // Stand-in for the self-service portal that mirrors your voice channel
    await driver.get("https://www.testmuai.com/selenium-playground/");
    // assert the same data a caller hears is correct in the web channel
  } finally {
    await driver.quit();
  }
})();

Run that suite on a real device cloud so the self-service experience is verified on the same devices and browsers your customers actually use.

...

Testing Voice Quality for AI Voice Agents

AI voice agents change what "quality" means. A classic VoIP test asks whether the audio is clear; an AI voice agent test asks that, and then whether the agent listened correctly, answered fast enough, and said the right thing. Audio scoring with MOS, PESQ, and POLQA still applies, but it is now the floor, not the whole test.

What AI voice agent quality testing adds on top of audio scoring:

  • Response latency: The pause before the agent replies. Sub-second responses feel natural; multi-second gaps make turn-taking feel broken.
  • Barge-in and interruption: Whether the agent stops talking and listens when the caller interrupts, instead of speaking over them.
  • Speech-to-text accuracy: Whether the agent transcribed the caller correctly across accents, names, and noisy lines.
  • Text-to-speech quality: Whether the synthesized voice is clear and natural, scored with the same objective algorithms as human audio.
  • Response correctness: Whether the agent actually answered the question, the conversational equivalent of asserting on content, not just connection.
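
Speech-to-text accuracy is commonly reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the agent's transcript into the expected one, divided by the number of words in the expected text. The sketch below is one minimal way to compute it, using a word-level edit distance. The function name `wordErrorRate` is hypothetical, and the normalization (lowercasing and stripping punctuation) is an assumption that you may need to adjust for numbers, names, or other languages.

```javascript
// Word error rate between an expected utterance and the agent's transcript.
// Assumption: lowercase and strip punctuation before comparing words.
function wordErrorRate(reference, hypothesis) {
  const toWords = (text) =>
    text
      .toLowerCase()
      .replace(/[^\p{L}\p{N}\s']/gu, " ")
      .split(/\s+/)
      .filter(Boolean);

  const ref = toWords(reference);
  const hyp = toWords(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // Word-level Levenshtein distance, keeping only the previous row.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const cur = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, substitution);
    }
    prev = cur;
  }
  return prev[hyp.length] / ref.length;
}

// One wrong word out of five
console.log(wordErrorRate(
  "transfer fifty dollars to savings",
  "transfer fifteen dollars to savings"
)); // prints 0.2
```

The example also shows why WER alone is not enough: "fifty" heard as "fifteen" is a single word error, yet it changes the meaning of the request, so response correctness still needs its own assertion.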

This is where audio testing meets conversational testing. The design trade-offs behind reliable agents are covered in the voice agent's core dilemma, and keeping them healthy after launch is the job of voice observability, which monitors live agents in production.

Because an AI voice agent is a conversational system, an autonomous testing agent is a natural fit for validating it. TestMu AI's Agentic Testing platform plans, authors, runs, and proves tests across web, UI, API, and network layers, and its KaneAI test agent lets you author those checks in plain English. To wire it into your pipeline, follow the KaneAI documentation.

...

Voice Quality Testing Tools and Software

There is no single best voice quality testing tool, only the best fit for what you are measuring. Tooling in this space falls into a few categories, and most teams combine them rather than relying on one.

  • Objective scoring engines: Implement PESQ or POLQA to turn captured audio into a MOS score. These are the core of any automated voice quality suite.
  • Network impairment and emulation tools: Inject controlled latency, jitter, and packet loss so you can measure quality under realistic conditions, not just on a perfect line.
  • Telephony and load test harnesses: Place real calls at volume to validate quality under concurrency, which overlaps heavily with IVR and contact center testing.
  • AI voice agent testing platforms: Add conversational checks, response latency, interruption handling, and transcript accuracy on top of audio scoring.
  • Cloud testing platforms for surrounding channels: Validate the web and mobile self-service apps and the APIs your voice systems depend on.

A selection checklist, whatever the category:

  • Standards support: PESQ and, more importantly, POLQA for modern wideband and HD voice.
  • Condition control: The ability to reproduce specific latency, jitter, and packet-loss profiles.
  • CI/CD integration: A CLI or API so scoring runs on every build with a pass or fail threshold.
  • Conversational coverage: For AI agents, transcript and latency assertions, not just audio scoring.
  • Channel breadth: Coverage of the web, mobile, and API layers around the voice path.

For the digital channels around the voice path, a unified automation testing platform avoids stitching together a separate toolchain per channel, and an AI testing approach handles the conversational layer that classic audio tools never touch. The same logic that drives how to test a chatbot applies to voice: test the audio and the conversation, not one or the other.

Voice Quality Testing Best Practices

Strong voice quality programs measure continuously, test under realistic conditions, and assert on meaning rather than connection. The practices below keep results trustworthy and actionable.

  • Standardize on POLQA for new work: It scores the wideband and HD voice that PESQ was never designed for, so it tracks what users actually hear today.
  • Test under impaired conditions, not just ideal ones: Inject latency, jitter, and packet loss that mirror your real users, since a perfect lab line hides the failures customers meet.
  • Set MOS thresholds per channel: The floor for premium support may be stricter than the floor for a self-service line; pick a number for each channel and fail builds below it.
  • Score on every release: Treat voice quality like any regression suite in CI so a codec or config change cannot quietly degrade audio.
  • For AI agents, assert audio and conversation: A clear voice that answers the wrong question still fails the caller, so validate transcript accuracy and latency too.
  • Validate the surrounding channels: Keep the web, mobile, and API self-service paths in sync with the voice channel so a caller hears the same data they see, much like broader IVR automation testing does.
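
Per-channel thresholds and the combined audio-plus-conversation check can live in one small gate that CI calls after each run. The sketch below is an illustration under stated assumptions: the channel names, the `evaluateRun` function, and every threshold value are hypothetical placeholders to be replaced with numbers your own users would accept. A missing metric is treated as a failure so that a skipped measurement cannot pass silently.

```javascript
// Illustrative per-channel limits; the names and numbers are placeholders.
const CHANNEL_THRESHOLDS = {
  "premium-support": { minMos: 4.0 },
  "self-service-ivr": { minMos: 3.5 },
  "ai-voice-agent": { minMos: 3.8, maxWer: 0.1, maxResponseLatencyMs: 1000 },
};

// Compare one run's metrics against the limits configured for its channel.
function evaluateRun(channel, metrics, thresholds = CHANNEL_THRESHOLDS) {
  const limits = thresholds[channel];
  if (!limits) {
    throw new Error(`No thresholds configured for channel "${channel}"`);
  }
  const failures = [];
  // A metric that is missing or not a number counts as a failure.
  const check = (name, value, isOk, rule) => {
    if (typeof value !== "number" || Number.isNaN(value) || !isOk(value)) {
      failures.push(`${name}=${value} violates ${rule}`);
    }
  };

  check("mos", metrics.mos, (v) => v >= limits.minMos, `>= ${limits.minMos}`);
  if (limits.maxWer !== undefined) {
    check("wer", metrics.wer, (v) => v <= limits.maxWer, `<= ${limits.maxWer}`);
  }
  if (limits.maxResponseLatencyMs !== undefined) {
    check(
      "responseLatencyMs",
      metrics.responseLatencyMs,
      (v) => v <= limits.maxResponseLatencyMs,
      `<= ${limits.maxResponseLatencyMs}`
    );
  }
  return { channel, pass: failures.length === 0, failures };
}
```

In a pipeline, a failing result would typically set a non-zero exit code, for example `if (!result.pass) process.exitCode = 1;`, and print `result.failures` so the log shows which limit was missed.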

Conclusion

Start by picking one channel and one number: choose POLQA, set a MOS floor your users would accept, and score a recorded test call across the real network path. Once that baseline runs in CI, add impaired-condition tests for latency, jitter, and packet loss, then extend the same discipline to your AI voice agents with transcript and response-latency checks.

For the web, mobile, and API channels that share data with your voice systems, validate them on the TestMu AI real device cloud, author conversational checks with the Agentic Testing platform, and follow the KaneAI documentation to wire it into your release pipeline. Reliable voice experiences come from tests that run on every change, not from one listen-through before launch.

Author

Naima Nasrullah is a Community Contributor at TestMu AI, holding certifications in Appium, KaneAI, Playwright, Cypress, and Automation Testing.
