Agent-to-Agent Testing CLI: Validate Your AI Agents From Your Terminal

TestMu AI's A2A Testing CLI lets you run AI agent evaluations, red team tests, and voice agent checks directly from your terminal. Works in CI/CD pipelines.

Author

Devansh Bhardwaj

April 6, 2026

TestMu AI's Agent-to-Agent Testing platform now has a CLI. Here's what that means for your workflow.

AI agents are in production. Chatbots handle customer queries at scale. Voice assistants route support tickets. Calling agents close loops without a human involved. Yet most QA teams are still testing these systems by hand, one conversation at a time.

That bottleneck is exactly what TestMu AI built Agent-to-Agent Testing to fix. Today, we're extending it to the command line.

Here's what a full evaluation run looks like from your terminal:

# Install
pip install testmu-a2a-cli

# Authenticate
testmu-a2a auth --username YOUR_USERNAME --access-key YOUR_KEY

# Run a quick evaluation against your agent
testmu-a2a test \
  --agent https://your-chatbot-endpoint.com \
  --spec "E-commerce customer support chatbot" \
  --count 200 \
  --format json \
  --output results.json

What Is testmu-a2a-cli?

testmu-a2a-cli is a Python-based CLI tool (requires Python 3.10+) that lets you trigger, configure, and run Agent-to-Agent test evaluations directly from your terminal.

It connects your agent under test to TestMu AI's evaluation infrastructure, which simulates real users, generates adversarial inputs, and scores responses across the quality dimensions that matter.

Chat agent metrics covered out of the box: Bias Detection, Hallucination Detection, Completeness, Context Awareness, Response Quality, Conversation Flow, User Satisfaction, File Handling Quality, and File Generation Accuracy.
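The exact schema of the CLI's JSON output isn't documented here, so purely as a sketch, assuming `results.json` exposes a `metrics` map of metric names to numeric scores, a small helper can surface the weakest quality dimensions after a run:

```python
# Illustrative helper, not part of testmu-a2a-cli itself.
# Assumes results.json looks like {"metrics": {"<name>": <score>, ...}};
# the real schema produced by `testmu-a2a test --format json` may differ.

def weakest_metrics(results: dict, n: int = 3) -> list:
    """Return the n lowest-scoring metrics as (name, score) pairs."""
    scores = results.get("metrics", {})
    return sorted(scores.items(), key=lambda kv: kv[1])[:n]

if __name__ == "__main__":
    # Inline sample for illustration; in practice you would
    # json.load(open("results.json")) after a CLI run.
    sample = {"metrics": {
        "Hallucination Detection": 72,
        "Bias Detection": 91,
        "Response Quality": 88,
        "Conversation Flow": 95,
    }}
    for name, score in weakest_metrics(sample, n=2):
        print(f"{name}: {score}")
```

Sorting ascending means the first entries are the dimensions most worth triaging before a release.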

Evaluate Your Agents at Scale

Let's put this into practice.

Use case: Your team has built a customer support chatbot for an e-commerce platform. It handles order queries, refund requests, and product FAQs.

You're three days from shipping it to production and you need to know: does it hallucinate? Does it stay on topic when users go adversarial? Does it handle edge cases around refund policy correctly?

The old way: A QA engineer writes 30-50 test scripts manually, runs them one by one, and files bugs based on what they notice. It takes days. It misses edge cases because humans don't think adversarially at scale.

With testmu-a2a-cli: Point the CLI at your agent's endpoint, define your spec, and let autonomous evaluators run hundreds of realistic and adversarial scenarios against it in parallel. In minutes, you have structured quality scores across Hallucination Detection, Bias Detection, Response Quality, Conversation Flow, and more. No scripts, no manual review.

The customer support example is just one case. The same pattern applies anywhere you're shipping a conversational AI system:

  • Chatbots that need validation across hundreds of user intent paths before going live
  • Voice assistants that handle phone calls and need to stay coherent across multi-turn conversations
  • Calling agents that book appointments or handle escalations, where one wrong response has real consequences
  • Internal AI tools that interact with employees and need to stay within compliance boundaries
  • LLM-powered features embedded in products, where hallucinations and off-topic responses are silent failures

Spin up your Agent-to-Agent CLI in just minutes with this detailed documentation.

Security Test Your Agent Before Production Does

One of the most important capabilities in testmu-a2a-cli, and the one most teams don't think to use until it's too late, is redteam.

testmu-a2a redteam \
  --agent https://your-chatbot-endpoint.com \
  --output redteam-results.json

This runs your agent through 9 dedicated attack categories: prompt injection, jailbreak attempts, data exfiltration, PII leakage, and more.

It doesn't just check whether your agent gives good answers; it checks whether it can be broken by a motivated user.

If your agent handles sensitive data, makes decisions with real-world consequences, or is customer-facing, red teaming before launch is not optional. The CLI makes it a single command.
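The structure of `redteam-results.json` is an assumption on my part, but supposing it reports per-category pass counts, a short triage script could flag which of the attack categories your agent fails most often:

```python
# Illustrative triage script; the redteam-results.json schema shown
# here ({"categories": {"<name>": {"passed": N, "total": N}}}) is an
# assumption, not the documented format.

def failed_categories(report: dict, threshold: float = 0.8) -> list:
    """Return (category, pass_rate) pairs below the given pass-rate bar."""
    failing = []
    for name, stats in report.get("categories", {}).items():
        rate = stats["passed"] / stats["total"] if stats["total"] else 0.0
        if rate < threshold:
            failing.append((name, rate))
    return sorted(failing, key=lambda kv: kv[1])

if __name__ == "__main__":
    sample = {"categories": {
        "prompt_injection": {"passed": 18, "total": 20},
        "jailbreak": {"passed": 12, "total": 20},
        "pii_leakage": {"passed": 20, "total": 20},
    }}
    for name, rate in failed_categories(sample):
        print(f"{name}: {rate:.0%} pass rate")  # prints: jailbreak: 60% pass rate
```

Worst categories come first, which is usually the order you want to fix them in.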

Test Voice and Phone Agents Too

testmu-a2a-cli isn't just for chat agents. The call command brings the same evaluation depth to phone agents, inbound and outbound, with capabilities built specifically for voice:

testmu-a2a call \
  --agent https://your-phone-agent-endpoint.com \
  --type inbound \
  --output call-results.json

The CLI supports 30+ phone-agent quality metrics, background sound simulation, and DTMF detection, covering the real conditions your voice agent will face in production.

Multi-turn coherence, intent handling under noise, escalation behavior: all testable from the terminal.

Config-Driven Workflows With init and run

For teams who want repeatable, version-controlled evaluation runs, testmu-a2a init generates a testmu-a2a.yaml config file that you can commit alongside your codebase and run consistently across environments:

# Generate your config file
testmu-a2a init

# Run from config - works in CI/CD without any flags
testmu-a2a run --config testmu-a2a.yaml

This is how most teams use the CLI in practice: testmu-a2a test for fast ad-hoc checks during development, testmu-a2a run with a committed config for pipeline-controlled evaluation gates.
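The authoritative config is whatever `testmu-a2a init` generates; the keys below are hypothetical, inferred from the flags shown earlier, and meant only to convey what a committed config could look like:

```yaml
# testmu-a2a.yaml -- illustrative only; run `testmu-a2a init` to
# generate the real file. All keys shown here are assumptions
# mirroring the `testmu-a2a test` flags above.
agent: https://your-chatbot-endpoint.com
spec: "E-commerce customer support chatbot"
count: 200
format: json
output: results.json
```

Committing a file like this next to your code is what makes runs reproducible across laptops and pipelines.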

Why This Matters

The Agent-to-Agent Testing platform has always had a clear thesis: you can't use deterministic, script-based QA to validate non-deterministic AI systems. Static test cases don't adapt, and they miss edge cases.

The platform addresses this by deploying autonomous evaluators that emulate real users and intelligent adversarial interactions. Until now, accessing that capability required going through the TestMu AI browser-based console. The CLI changes that: it's built for teams who live in the terminal, run tests in CI/CD pipelines, and want evaluation results feeding directly into deployment gates.

Drop testmu-a2a test into your GitHub Actions or Jenkins step. Set thresholds. Fail the build if Hallucination Detection or Bias scores don't meet your bar. Results come back as structured JSON, parseable, alertable, and ready to feed into any dashboard.
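As a sketch of such a gate, assuming the `results.json` from `testmu-a2a test` exposes a `metrics` map of scores (the shape is an assumption, though the metric names come from the platform's chat-agent metrics), a pipeline step could fail the build on low scores:

```python
# Hypothetical CI gate script. The results.json shape assumed here is
# {"metrics": {"<name>": <score>, ...}}; adjust to the real output.
THRESHOLDS = {
    "Hallucination Detection": 85,
    "Bias Detection": 90,
}

def gate(results: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return (metric, score, minimum) tuples for every threshold violation."""
    scores = results.get("metrics", {})
    return [
        (name, scores.get(name, 0), minimum)
        for name, minimum in thresholds.items()
        if scores.get(name, 0) < minimum
    ]

if __name__ == "__main__":
    # Inline sample for illustration; in CI you would load the real
    # run output with json.load(open("results.json")).
    sample = {"metrics": {"Hallucination Detection": 80, "Bias Detection": 95}}
    violations = gate(sample)
    for name, score, minimum in violations:
        print(f"FAIL {name}: {score} < {minimum}")
    # In a real pipeline step: sys.exit(1 if violations else 0)
    # so the build fails whenever any metric misses its bar.
```

A non-zero exit code is all GitHub Actions or Jenkins needs to block the deploy.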

Author

Devansh Bhardwaj is a Community Evangelist at TestMu AI with 4+ years of experience in the tech industry. He has authored 30+ technical blogs on web development and automation testing and holds certifications in Automation Testing, KaneAI, Selenium, Appium, Playwright, and Cypress. Devansh has contributed to end-to-end testing of a major banking application, spanning UI, API, mobile, visual, and cross-browser testing, demonstrating hands-on expertise across modern testing workflows.
