Welcome to the 297th edition of Coding Jag, brought to you by TestMu AI!
There is a quiet lie at the center of AI-assisted development: the assistant writes the code, assures you it "should work," and signs off. Getting to a first draft is nearly instant now; getting to proof that it actually works is not, and that gap is exactly where releases quietly go sideways.
This week's edition is about closing it.
See how a trio of MCP servers lets Claude Code verify its own work against a real database and live logs instead of just claiming success, why the sharpest teams now treat the model as the reasoning engine and their own code as the safety harness, and how NVIDIA is making you prove a skill file is safe before an agent ever runs it. Plus the staggering results from Project Glasswing, self-healing Playwright pipelines, and much more.
Found something worth sharing this week? Hit Reply to this email; your ideas could shape a future edition.
News
09 min
anthropic.com
Anthropic shares the first results of Project Glasswing, its coalition that points AI at critical software before attackers can. In one month, its ~50 partners used the unreleased Claude Mythos Preview model to find more than 10,000 high- and critical-severity vulnerabilities across the world's most important software. The twist: progress is no longer limited by how fast you can find bugs, but by how fast you can patch them. Of 530 reported so far, only 75 are fixed, since each one still takes about two weeks of human effort.
08 min
developer.nvidia.com
NVIDIA tackles a different proving problem: how do you trust a skill before you let an agent run it? Verified agent skills are portable SKILL.md instruction sets that pass through a publishing pipeline first: SkillSpector scanning for prompt injection and tool poisoning, cryptographic signing for provenance, and a skill card documenting ownership, dependencies, and known limits. It follows the open agentskills.io spec and works across Claude Code, Codex, and Cursor, so teams can verify and inspect a skill before it touches their environment.
07 min
livenewschat.eu
Fresh Enterprise Technology Research data reported by The Wall Street Journal shows enterprise adoption of Claude up roughly 128% over twelve months, while OpenAI slipped around 8% and Grok stayed a rounding error. The Ramp AI Index, which tracks actual business spend, told the same story. The engine behind the surge is not the chatbot. It is Claude Code, which reached general availability in May 2025 and reported a ~$2.5 billion annualized run rate by February 2026.
09 min
developers.googleblog.com
Google's developer keynote leans hard into agents. Gemini 3.5 Flash outperforms 3.1 Pro across almost all benchmarks while running about four times faster. Managed Agents arrive in the Gemini API via a single call, and Antigravity 2.0 lands as a desktop app. The standout for the web is WebMCP, a proposed open standard that lets sites expose structured tools so browser-based agents can act with more speed and precision, with an experimental origin trial starting in Chrome 149.
AI
08 min
medium.com
Rick Hightower tackles the core dishonesty of AI coding: assistants stop at "this should work." He shows how the Model Context Protocol gives Claude Code real senses: the ability to query a database, screenshot a rendered page, and read production logs, so it can verify its own work instead of asserting it. The piece walks through the specific trio of MCP servers that closes the verification loop for any real web application.
07 min
dev.to
A clear-eyed look at why agents that nail the demo still hallucinate a file path or quietly take a wrong action in production, and what the 2026 tooling does about it. The core argument: don't let the LLM decide whether an action is allowed. Build an explicit orchestration layer, such as a state machine or workflow engine, that constrains what the model can do. The LLM is the reasoning engine; your code is the safety harness.
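To make the "code as safety harness" idea concrete, here is a minimal Python sketch of a state machine that decides which actions are permitted, so the model only proposes and never authorizes. The states, action names, and the `execute` helper are hypothetical and chosen for illustration; they are not taken from the linked article.

```python
from enum import Enum
from typing import Callable, Dict, Set


class State(Enum):
    PLANNING = "planning"
    EDITING = "editing"
    REVIEW = "review"


# Hypothetical allowlist: the actions each workflow state permits.
ALLOWED_ACTIONS: Dict[State, Set[str]] = {
    State.PLANNING: {"read_file", "search_code"},
    State.EDITING: {"read_file", "write_file", "run_tests"},
    State.REVIEW: {"read_file", "run_tests"},
}


class ActionNotAllowed(Exception):
    """Raised when the model proposes an action the current state forbids."""


def execute(state: State, action: str, handlers: Dict[str, Callable[[], str]]) -> str:
    # The check lives in ordinary code, outside the model's control.
    if action not in ALLOWED_ACTIONS[state]:
        raise ActionNotAllowed(f"{action!r} is not permitted in state {state.value!r}")
    return handlers[action]()
```

In this sketch, a `write_file` request proposed by the model during `State.PLANNING` raises `ActionNotAllowed` before any handler runs, no matter how confident the model sounded.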
10 min
autify.com
The Autify team breaks down how AI layers onto Playwright's solid foundation to prevent and fix flaky tests faster. "Fix with AI" features detect broken selectors and auto-suggest reliable replacements, eliminating manual locator updates. The guide also looks at moving beyond augmented Playwright toward autonomous agents that run tests in natural language while still emitting real, editable framework code.
09 min
masterofcode.com
Master of Code's Tetiana Tsymbal lays out a practical framework for judging whether an AI agent actually does its job once it's live, not just in training. The argument: evaluation has to be metrics-driven, because an agent can look accurate in testing yet hit only a 45% containment rate after launch. The piece walks through measuring KPIs like containment and completion rate across each stage and tying them to real business outcomes, so evals drive ROI instead of staying a technical box-tick.
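As a rough illustration of the two KPIs named above, the Python sketch below computes them from conversation records. It assumes containment rate means the share of conversations that finished without escalation to a human, and completion rate means the share where the user's task was finished; the record fields and the sample data are hypothetical, not drawn from the article.

```python
from typing import Dict, List


def containment_rate(conversations: List[Dict[str, bool]]) -> float:
    """Share of conversations that never escalated to a human (assumed definition)."""
    if not conversations:
        raise ValueError("need at least one conversation")
    contained = sum(1 for c in conversations if not c["escalated"])
    return contained / len(conversations)


def completion_rate(conversations: List[Dict[str, bool]]) -> float:
    """Share of conversations where the user's task was completed (assumed definition)."""
    if not conversations:
        raise ValueError("need at least one conversation")
    completed = sum(1 for c in conversations if c["completed"])
    return completed / len(conversations)


# Made-up sample: 20 conversations, 9 contained, 12 completed.
sample = [{"escalated": i >= 9, "completed": i < 12} for i in range(20)]
print(f"containment: {containment_rate(sample):.0%}")
print(f"completion:  {completion_rate(sample):.0%}")
```

Tracking the two numbers separately matters because they can diverge: in the sample, some tasks were completed only after a human took over.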
Automation
09 min
blog.buildbetter.ai
BuildBetter's Spencer Shulem provides a hands-on walkthrough of AI-powered test generation, now standard practice for a reported 76% of QA leaders, up from 31% two years ago. The guide covers setup, code examples, and CI/CD integration, and makes the case for grounding generated specs in real customer evidence rather than engineering guesses, so the suite mirrors actual user pain instead of assumptions.
10 min
stackabuse.com
Kanika Vatsyayan offers a detailed build-along for a Playwright pipeline that repairs itself. Since Playwright 1.56 added intelligent agents that can plan, write, and fix tests with minimal human help, the guide shows how to pair MCP with browser automation and a hand-written "seed test" that teaches the agent your coding style, auth methods, and navigation patterns, so self-healing stays consistent with your conventions instead of drifting.
08 min
playwright.dev
Straight from Microsoft's official Playwright docs: a clear reference on the two features that keep large suites fast and stable. Fixtures let you centralize setup like auth, API clients, and shared state so tests stay declarative and isolated, while the test runner's worker model runs files in parallel, each in its own OS process and browser, with workers shut down after a failure to guarantee a pristine environment for the next run. The canonical starting point before layering AI on top.
Tools
11 min
confident-ai.com
Confident AI's Jeffrey Ip frames the exact problem behind this week's issue: traditional software has unit tests and clear pass/fail criteria, but an LLM can return a 200 response in under a second and still hallucinate, contradict its context, or leak PII, with no compiler to catch it. His guide compares 10 evaluation tools, platforms, open-source frameworks, and hybrid solutions, ranked by metric depth, use-case coverage, and how well each closes the loop between testing and production, so you can pick by how you actually need to prove quality.
10 min
lyzr.ai
A comparison of the platforms teams use to keep agents reliable in production. The guide covers simulation-driven testing across personas and edge cases, structured evaluation pipelines, and observability for tracing and debugging live behavior, making the case that static, predefined scenarios are no longer enough for multi-turn agentic systems.
Video & Podcast
08 min
podcasts.apple.com
Hosts Alex and Sam run a weekly breakdown of everything Claude Code, from the latest Anthropic updates to practical prompt and workflow tips. Recent episodes dig into Codex on Windows, remote cloud coding agents, Claude's new billing splits, and why a Raspberry Pi running rm -rf is the warning label every agent workflow needs. A sharp, regular listen for anyone living in the terminal with an AI pair.
07 min
podcasts.apple.com
Dan Shipper sits down with Austin Tedesco to explore why the agent-management interface, a desktop app built on top of a coding agent, is becoming the new operating system for knowledge work. They walk through setting up Codex with folders, keys, and reviewer agents, managing the human review step when an agent drafts communications, brainstorming automations across Gmail, Slack, and Notion, and building a live KPI tracker in Notion that agents can read.
Events
08 min
eurostarsoftwaretesting.com
EuroSTAR returns June 15–18, 2026, bringing Europe's testing community together for 60-plus sessions led by the region's QA trailblazers. Expect keynotes, tutorials, and track talks heavy on AI in testing, alongside the Huddle-area networking and community conversations that have helped make EuroSTAR Europe's leading software testing conference.