CODING JAG - Issue 297

News
AI
Automation
Tools
Video & Podcast
Events

Welcome to the 297th edition of Coding Jag brought to you by TestMu AI!👐

There is a quiet lie at the center of AI-assisted development: the assistant writes the code, assures you it "should work," and signs off. Getting to a first draft is nearly instant now; getting to proof that it actually works is not, and that gap is exactly where releases quietly go sideways.

This week's edition is about closing it.

See how a trio of MCP servers lets Claude Code verify its own work against a real database and live logs instead of just claiming success, why the sharpest teams now treat the model as the reasoning engine and their own code as the safety harness. And how NVIDIA is making you prove a skill file is safe before an agent ever runs it. Moreover, the staggering results from Project Glasswing, self-healing Playwright pipelines, and so much more.

📬 Found something worth sharing this week? Hit Reply to this email; your ideas could shape a future edition.

News

Project Glasswing: How Anthropic's AI Is Transforming Cybersecurity Bug Detection

09 minanthropic.com

🛡️ Anthropic shares the first results of Project Glasswing, its coalition that points AI at critical software before attackers can. In one month, its ~50 partners used the unreleased Claude Mythos Preview model to find more than 10,000 high- and critical-severity vulnerabilities across the world's most important software. The twist: progress is no longer limited by how fast you can find bugs, but by how fast you can patch them. Of 530 reported so far, only 75 are fixed, since each one still takes about two weeks of human effort.

NVIDIA-Verified Agent Skills: Capability Governance for AI Agents

08 mindeveloper.nvidia.com

🛡️ NVIDIA tackles a different proving problem: how do you trust a skill before you let an agent run it? Verified agent skills are portable SKILL.md instruction sets that pass through a publishing pipeline first, SkillSpector scanning for prompt injection and tool poisoning, cryptographic signing for provenance, and a skill card documenting ownership, dependencies, and known limits. It follows the open agentskills.io spec and works across Claude Code, Codex, and Cursor, so teams can verify and inspect a skill before it touches their environment.

Claude Is Quietly Winning the AI Race: How Anthropic Lapped OpenAI in 2026

07 minlivenewschat.eu

📈 Fresh Enterprise Technology Research data reported by The Wall Street Journal shows enterprise adoption of Claude up roughly 128% over twelve months, while OpenAI slipped around 8% and Grok stayed a rounding error. The Ramp AI Index, which tracks actual business spend, told the same story. The engine behind the surge is not the chatbot. It is Claude Code, which reached general availability in May 2025 and reported ~$2.5 billion annualized run rate by February 2026.

Everything from the Google I/O 2026 developer keynote

09 mindevelopers.googleblog.com

🚀 Google's developer keynote leans hard into agents. Gemini 3.5 Flash outperforms 3.1 Pro across almost all benchmarks while running about four times faster. Managed Agents arrive in the Gemini API via a single call, and Antigravity 2.0 lands as a desktop app. The standout for the web is WebMCP, a proposed open standard that lets sites expose structured tools so browser-based agents can act with more speed and precision, with an experimental origin trial starting in Chrome 149.

Claude Code MCP: Your AI Says the Code Works. Can It Prove It?

08 minmedium.com

🔍 Rick Hightower tackles the core dishonesty of AI coding: assistants stop at "this should work." He shows how the Model Context Protocol gives Claude Code real senses, the ability to query a database, screenshot a rendered page, and read production logs, so it can verify its own work instead of asserting it. The piece walks through the specific trio of MCP servers that closes the verification loop for any real web application.

The AI Agent Reliability Gap in 2026: Why the Tooling Is Finally Catching Up

07 mindev.to

🧱 A clear-eyed look at why agents that nail the demo still hallucinate a file path or quietly take a wrong action in production, and what the 2026 tooling does about it. The core argument: don't let the LLM decide whether an action is allowed. Build an explicit orchestration layer, a state machine or workflow engine that constrains what the model can do. The LLM is the reasoning engine; your code is the safety harness.

How to Use AI With Playwright Tests in 2026: A Complete Guide

10 minautify.com

🧪 The Autify team breaks down how AI layers onto Playwright's solid foundation to prevent and fix flaky tests faster. "Fix with AI" features detect broken selectors and auto-suggest reliable replacements, eliminating manual locator updates. The guide also looks at moving beyond augmented Playwright toward autonomous agents that run tests in natural language while still emitting real, editable framework code.

AI Evaluation Metrics 2026: Tested by Conversation Experts

09 minmasterofcode.com

🔬 Master of Code's Tetiana Tsymbal lays out a practical framework for judging whether an AI agent actually does its job once it's live, not just in training. The argument: evaluation has to be metrics-driven, because an agent can look accurate in testing yet hit only a 45% containment rate after launch. The piece walks through measuring KPIs like containment and completion rate across each stage and tying them to real business outcomes, so evals drive ROI instead of staying a technical box-tick.

Automation

Playwright AI Test Generation: Complete 2026 Guide

09 minblog.buildbetter.ai

🎯 BuildBetter's Spencer Shulem provides a hands-on walkthrough of AI-powered test generation, now standard practice for a reported 76% of QA leaders, up from 31% two years ago. The guide covers setup, code examples, and CI/CD integration, and makes the case for grounding generated specs in real customer evidence rather than engineering guesses, so the suite mirrors actual user pain instead of assumptions.

AI-Powered Playwright: Building a Self-Healing CI/CD Testing Pipeline

10 minstackabuse.com

⚙️ Kanika Vatsyayan offers a detailed build-along for a Playwright pipeline that repairs itself. Since Playwright 1.56 added intelligent agents that can plan, write, and fix tests with minimal human help, the guide shows how to pair MCP with browser automation and a hand-written "seed test" that teaches the agent your coding style, auth methods, and navigation patterns, so self-healing stays consistent with your conventions instead of drifting.

Playwright Test Runner: Fixtures & Parallelism (Official Docs)

08 minplaywright.dev

📘 Straight from Microsoft's official Playwright docs: a clear reference on the two features that keep large suites fast and stable. Fixtures let you centralize setup like auth, API clients, and shared state so tests stay declarative and isolated, while the test runner's worker model runs files in parallel, each in its own OS process and browser, with workers shut down after a failure to guarantee a pristine environment for the next run. The canonical starting point before layering AI on top.

Tools

10 Best AI Evaluation Tools for Testing & Improving AI Applications in 2026

11 minconfident-ai.com

🧰 Confident AI's Jeffrey Ip frames the exact problem behind this week's issue: traditional software has unit tests and clear pass/fail criteria, but an LLM can return a 200 response in under a second and still hallucinate, contradict its context, or leak PII, with no compiler to catch it. His guide compares 10 evaluation tools, platforms, open-source frameworks, and hybrid solutions, ranked by metric depth and use-case coverage. And how well each closes the loop between testing and production, so you pick by how you actually need to prove quality.

Best AI Agent Evaluation Tools in 2026

10 minlyzr.ai

⚖️ A comparison of the platforms teams use to keep agents reliable in production. The guide covers simulation-driven testing across personas and edge cases, structured evaluation pipelines, and observability for tracing and debugging live behavior, making the case that static, predefined scenarios are no longer enough for multi-turn agentic systems.

Video & Podcast

Claude Code Cast: Codex on Windows, billing splits, and agent safety

08 minpodcasts.apple.com

🎙️ Hosts Alex and Sam run a weekly breakdown of everything Claude Code, from the latest Anthropic updates to practical prompt and workflow tips. Recent episodes dig into Codex on Windows, remote cloud coding agents, Claude's new billing splits, and why a Raspberry Pi running rm -rf is the warning label every agent workflow needs. A sharp, regular listen for anyone living in the terminal with an AI pair.

AI & I: Why We Switched From Claude Code to Codex

07 minpodcasts.apple.com

📺 Dan Shipper sits down with Austin Tedesco to explore why the agent-management interface – a desktop app built on top of a coding agent is becoming the new operating system for knowledge work. They walk through setting up Codex with folders, keys, and reviewer agents, managing the human review step when an agent drafts communications, brainstorming automations across Gmail, Slack, and Notion, and building a live KPI tracker in Notion that agents can read.

Events

EuroSTAR 2026 - June 15–18, Europe's leading QA conference

08 mineurostarsoftwaretesting.com

🌍 EuroSTAR returns June 15–18, 2026, bringing Europe's testing community together for 60-plus sessions led by the region's QA trailblazers. Expect keynotes, tutorials, and track talks heavy on AI in testing, alongside the Huddle-area networking and community conversations that have helped make EuroSTAR Europe's leading software testing conference.

Issue 296

An important update: Transitioning Gemini CLI to Antigravity CLI
Everything Google Cloud customers need to know coming out of Google I/O
Learn TestMu AI - Agentic AI Quality Engineering Platform