CODING JAG - Issue 299

Welcome to the 299th edition of Coding Jag brought to you by TestMu AI!👐

AI spent a year in the back seat, suggesting, drafting, autocompleting. This week, it climbed into the driver's seat: Apple shipped an agentic Siri at Tim Cook's final WWDC, MetaMask launched an Agent Wallet that lets AI trade on-chain without your private keys, and OpenCode overtook the paid tools to become the most-used coding agent on GitHub. The Miasma worm showed the dark side: a poisoned commit that fires a credential stealer the moment you open the repo in an AI coding tool, which led GitHub's abuse systems to auto-disable 73 Microsoft repos in 105 seconds.

Which is why this edition lands where it does: an agent that acts is easy to build; one you can trust is not. Reliable agents aren't built, they're evaluated. Inside #299: NVIDIA's Cosmos 3, open-weight coding models closing the gap, and TestMu AI's own Kane CLI now asserting on network, console, cookies, and page speed, because the green badge was never proof.

📬 Come across something useful or interesting? Just reply and let's exchange ideas.

News

Apple Makes Its Huge Siri AI Reveal at WWDC 2026

10 min · business-standard.com

🍏 At WWDC 2026, Apple unveiled "Siri AI" — a more conversational, context-aware assistant that can draw on personal context, read what's on your screen, and take multi-step actions across apps. It runs on a redesigned Apple Intelligence built with Google's Gemini models (on-device plus Private Cloud Compute). It ships alongside iOS 27, macOS 27 "Golden Gate," and Liquid Glass refinements. Apple even extended agentic actions into system apps like Passwords, which can now navigate sites, sign you in, and rotate credentials on its own.

The Miasma Worm Turned AI Coding Tools Into an Attack Surface

08 min · stepsecurity.io

🪲 The StepSecurity team breaks down the Miasma worm: on June 5, a poisoned commit to Azure/durabletask fired a credential stealer the moment a developer opened the repo in Claude Code, Cursor, Gemini CLI, or VS Code. GitHub's abuse systems auto-disabled 73 Microsoft repos in 105 seconds, with CI/CD pipelines caught downstream.

MetaMask Agent Wallet Lets AI Trade Your Crypto — Inside Your Rules

08 min · coindesk.com

🦊 The MetaMask team at Consensys launched Agent Wallet (early access) — a self-custodial wallet that lets AI agents trade on-chain without your private keys. You set the guardrails up front: spending caps, allowlists, a risk profile, and 2FA, with simulation, threat-scanning, and MEV protection on every transaction.

OpenCode Overtakes the Paid Tools as the Most-Used Coding Agent

09 min · byteiota.com

⌨️ OpenCode, the open-source, terminal-native coding agent, just topped LogRocket's June rankings and crossed 160,000 GitHub stars. Its edge fits this issue's thesis: it feeds live LSP compiler diagnostics back into the model mid-task, so the agent self-corrects before reporting a green result. That is the explanation offered for DataCamp's head-to-head, where it generated 21 more tests on average than Claude Code on the same model.

AI

Reliable Agents Aren't Built. They're Evaluated.

10 min · digitalapplied.com

🧪 The DigitalApplied team makes the issue's thesis explicit: reliable agents aren't built, they're evaluated, with eval work eating 60–80% of dev time in teams that actually ship. The reframe that matters is pass^k over pass@k: a 70%-per-trial agent looks production-ready at ~97% on pass@3 (at least one success in three tries) but lands all three runs only ~34% of the time (pass^3). The playbook is concrete: golden datasets from real failures, graders that verify the actual end state (did the booking land in the database, not just whether the agent said so), calibrated LLM-judges, and CI gates that fail the build on regressions.
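For anyone who wants to check the arithmetic behind those two numbers, here is a minimal Python sketch. It assumes each trial is independent with the same success rate; the function names pass_at_k and pass_hat_k are ours for illustration, not taken from the article.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Chance that all k independent trials succeed (pass^k)."""
    return p ** k


if __name__ == "__main__":
    p, k = 0.70, 3  # the 70%-per-trial agent, run three times
    print(f"pass@{k} = {pass_at_k(p, k):.1%}")   # pass@3 = 97.3%
    print(f"pass^{k} = {pass_hat_k(p, k):.1%}")  # pass^3 = 34.3%
```

Same agent, same trials: the only thing that changes is whether you ask "did it ever work?" or "does it work every time?".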

NVIDIA Cosmos 3 Cuts Physical-AI Evaluation From Months to Days

08 min · blogs.nvidia.com

🌐 The NVIDIA team launched Cosmos 3 at GTC Taipei, an open-world model for physical AI that fuses vision, world generation, and action prediction. It shrinks robot and AV training and evaluation from months to days, with agent skills for synthetic data, scene reconstruction, and defect-image generation.

The Best Open-Source Coding Models of 2026

09 min · kilo.ai

🌍 The Kilo team rounds up 2026's open-source coding models, showing open weights now run in real pipelines, not just benchmarks. The new MiniMax M3 leads the agentic-coding field alongside GLM-5.1, Kimi K2.6, DeepSeek V4, and Qwen3-Coder, a global field from China to Europe's Mistral Devstral and India's Sarvam.

AI Benchmarks 2026: Why Leaderboards Don't Predict Production

09 min · kili-technology.com

📊 The Kili Technology team argues what QA teams keep relearning the hard way: enterprise agents show a ~37% gap between lab benchmark scores and real-world performance, with up to 50x cost variation for similar accuracy. Human review still catches the edge cases that automated pipelines miss.

Automation

Playwright's Planner–Generator–Healer Loop, Road-Tested

09 min · scrolltest.com

🎭 The ScrollTest team road-tests Playwright's Test Agents: the Planner, Generator, and Healer that explore an app, write specs, and repair locators. With Playwright at 90k+ GitHub stars, the 1.60.0 release leaned harder into agentic workflows, trace analysis, and MCP, and the review gives an honest take on where it still fails.

Agentic AI Is Reshaping Enterprise Testing — In Three Levels

08 min · epam.com

⚙️ Adam Auerbach of EPAM Systems breaks down how agentic AI moves testing beyond traditional automation toward intelligent, agent-driven systems, and lays out the shift in three levels, from assisted to fully autonomous QA. The payoff: streamlined workflows, better efficiency, and wider access to testing, without losing the human judgment that keeps quality honest.

The Rise of AI QA Engineers: Testing Skills Developers Cannot Ignore

07 min · blog.newtum.com

🚦 The Newtum team describes AI QA engineering inside the CI/CD pipeline, where AI sizes up each commit (risk, complexity, services touched, failure odds) and picks the test strategy to match. Light runs for small changes, deep validation for risky ones: speed and safety at once.

Tools

Kane CLI Now Reads the Browser: Introducing DevTools Assertions

10 min · testmuai.com

🔍 Bhawana, Community Evangelist and Product Marketing Lead at TestMu AI, shows how Kane CLI now reads the browser, asserting on network calls, console errors, cookies, localStorage, and page-load budgets in plain English, no selectors. It ties back to this edition's theme: stop trusting the green badge and verify the hidden layers the UI never shows.

The AI Agent Evaluation Tools That Keep Agents Honest (2026)

09 min · augmentcode.com

🧰 The Augment Code team rounds up the AI agent evaluation tools production teams actually reach for — Promptfoo's YAML-first assertions, Arize for tracing and online evals, and more. The throughline ties back to this whole issue: a single misstep mid-workflow can sail past a final-output check, so you need tooling that scores the full trajectory, not just the answer.

Video & Podcast

Why Traditional Testing Fails for AI Systems

08 min · richard-seidl.com

🎧 On Software Testing Unleashed, host Richard Seidl and guest Dusanka Lecic dig into why traditional pass/fail testing collapses on AI systems: the same input yields many valid outputs, and the bugs are "invisible" because they live in prompts, retrieval, and generation, not the code. Her fix is exactly this issue's reflex: trace the hidden layers (chunks, retrieval results, queries), not just the final answer. She structures it with a CHAT checklist: Context retention, Hallucination control, Accuracy/relevance, and a Trace-fix-retest workflow.

Apple WWDC 2026 Keynote: Introducing Siri AI

12 min · developer.apple.com

📺 Worth watching the primary source. Apple's full WWDC 2026 keynote walks through the live agentic Siri demo, the new Gemini-built Apple Foundation Models, and the wider "27" software reset. It is also Tim Cook's final keynote as CEO before he hands off to John Ternus on September 1.

Trajectory Evals: Why Watching the Output Isn't Enough

11 min · youtube.com

🎥 Arize's Dat Ngo lays out how to evaluate non-deterministic agents in production, arguing observability is only table stakes. The real leverage is trajectory evals: scoring the entire sequence of calls against a business process, not just a single input–output pair. He breaks evaluation signal into five categories and warns teams against defaulting to LLM-as-judge when cheaper, more trusted checks exist. The sharpest line: your agent called tool B before tool A, your code never caught it, but the telemetry did.
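To make that last line concrete, here is a rough Python sketch of an ordering check over a recorded trajectory. Everything in it is hypothetical: the tool names, the rule format, and the function first_order_violation are ours for illustration, not Arize's API or anything shown in the talk.

```python
from typing import Optional


def first_order_violation(
    calls: list[str],
    must_precede: list[tuple[str, str]],
) -> Optional[tuple[str, str]]:
    """Return the first (before, after) rule the trajectory breaks, else None.

    A rule (before, after) is broken when `after` was called without
    `before` appearing anywhere earlier in the trajectory.
    """
    for before, after in must_precede:
        if after in calls:
            first_after = calls.index(after)
            if before not in calls[:first_after]:
                return (before, after)
    return None


if __name__ == "__main__":
    # Hypothetical telemetry: the agent called tool_b before tool_a.
    trajectory = ["tool_b", "tool_a", "respond"]
    rules = [("tool_a", "tool_b")]  # tool_a must come before tool_b
    print(first_order_violation(trajectory, rules))  # ('tool_a', 'tool_b')
```

A final-output check would pass this run if the response looked right; the ordering check fails it because of how the agent got there. It is also a plain deterministic check, the kind of cheaper signal the talk suggests reaching for before LLM-as-judge.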

Events

EuroSTAR 2026 — Europe's Biggest Testing Conference Lands in Oslo

08 min · eurostarsoftwaretesting.com

🎟️ EuroSTAR 2026 runs June 15–18 in Oslo — the longest-running and largest European software testing and quality assurance conference, and this year it's centered on AI, with 60+ sessions from global testing experts. It's in-person, so treat it as the week's signal to follow even if you're not in the room.

AI Testing Conference 2026 | Testμ Conf by TestMu AI

07 min · testmuai.com

🎤 TestMu AI is hosting TestMu Conference 2026, a free three-day virtual event running August 19 to 21, 2026. With 75,000+ testers expected and 100+ speakers and sessions, it is the world's largest software testing conference, covering AI in testing, test automation, quality engineering, and everything in between. Registration is free and open now.