
Learn advanced techniques for production AI, including layering, compression, retrieval, and validation to improve performance, scalability, and reliability.
Srinivasan Sekar
January 11, 2026
In Part 1 of Context Engineering, we looked at why AI agents forget, the four ways they can fail when context isn’t handled properly, and the first two pillars of Context Engineering.
Now, in Part 2, let’s get into the more advanced methods that set good AI agents apart from those that are ready for production.
Context Engineering for using AI in production involves structuring, managing, and optimizing the information you provide an AI so it performs reliably and efficiently in real-world applications.
Why Does COMPRESS Matter for Using AI in Production?
Compression keeps AI agents efficient and focused by reducing token usage without losing important details. It helps models remain accurate, relevant, and responsive even as conversations or datasets grow in size and complexity.
Why Does ISOLATE Matter for Using AI in Production?
Isolation prevents context from overlapping between tasks, keeping AI focused, organized, and efficient. It ensures models handle complex workflows without confusion or interference from unrelated information.
What Are Advanced Context Engineering Patterns?
These patterns help AI handle complex, multi-turn tasks efficiently while maintaining focus, continuity, and relevance without exceeding context limits.
The main idea is to keep the most important information and get rid of or summarize the rest.
The Zoom Lens Approach: Consider you are describing your summer vacation:
Zoom Level 1 – Ultra Wide (5 words):
Zoom Level 2 – Wide (50 words):
Zoom Level 3 – Medium (500 words):
Zoom Level 4 – Full Detail (5000 words):
The Smart Part:
Full Technical Specification (5000 words):
"Our company was founded in 2010 with the mission to revolutionize
cloud testing. Over the years, we've grown from a team of 5 to 500+
employees across 12 countries..."
Medium Summary (500 words):
"Testing platform founded 2010. Team of 500+ across 12 countries.
Processes 10M+ tests daily for 10K+ customers..."
Short Summary (50 words):
"Cloud testing platform. 500+ employees, 10K+ customers, 10M+ daily tests."
Ultra Short (5 words):
"Cloud testing platform, global scale"
Load the amount of detail you need for each job!
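Here is a minimal sketch of the zoom lens in code. The zoom levels, the word-count-as-token approximation, and the load_context helper are illustrative assumptions, not a specific library API:

# A minimal sketch of the "zoom lens": store the same document at several zoom
# levels and load the most detailed version that still fits the budget.
# Tokens are approximated by word counts; a real system would use the model's tokenizer.

ZOOM_LEVELS = [
    # (name, text) ordered from most to least detailed; the full text is
    # simulated here as 5,000 placeholder words.
    ("full", "lorem " * 5000),
    ("medium", "Testing platform founded 2010. Team of 500+ across 12 countries. "
               "Processes 10M+ tests daily for 10K+ customers."),
    ("wide", "Cloud testing platform. 500+ employees, 10K+ customers, 10M+ daily tests."),
    ("ultra_wide", "Cloud testing platform, global scale"),
]

def load_context(budget_tokens: int) -> str:
    """Return the most detailed zoom level that fits the token budget."""
    for _, text in ZOOM_LEVELS:
        if len(text.split()) <= budget_tokens:
            return text
    return ZOOM_LEVELS[-1][1]  # fall back to the smallest summary

print(load_context(budget_tokens=12))    # -> the 10-word "wide" summary
print(load_context(budget_tokens=6000))  # -> the full 5,000-word text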
For long conversations, keep track of the details of recent messages and summarize the older ones.
The Conversation Memory Trick: Consider you are having a 2-hour phone call with your friend:
What You Remember:
Minutes 110-120 (Just Now) – Crystal Clear:
Minutes 1-109 (Earlier) – Fuzzy Summary:
You DON’T Remember:
What Happens:
Claude Code’s auto-compact feature implements this brilliantly, triggering once the context window reaches 95% capacity.
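Claude Code’s internal implementation isn’t public, but the keep-recent, summarize-older pattern it follows can be sketched roughly like this (the threshold, window size, and summarize placeholder are assumptions):

# Rough sketch of the keep-recent / summarize-old pattern (not Claude Code's
# actual implementation). `summarize` stands in for an LLM call that condenses
# older turns; token counts are approximated by word counts.

RECENT_TURNS = 10          # keep the last 10 messages verbatim
COMPACT_THRESHOLD = 0.95   # compact once 95% of the window is used
WINDOW_TOKENS = 8_000      # hypothetical context window size

def summarize(messages: list[dict]) -> dict:
    """Placeholder for an LLM summarization call over the older messages."""
    return {"role": "system", "content": f"Summary of {len(messages)} earlier messages: ..."}

def token_count(messages: list[dict]) -> int:
    return sum(len(m["content"].split()) for m in messages)

def maybe_compact(history: list[dict]) -> list[dict]:
    """Summarize everything except the most recent turns once the window is nearly full."""
    if token_count(history) < COMPACT_THRESHOLD * WINDOW_TOKENS:
        return history
    older, recent = history[:-RECENT_TURNS], history[-RECENT_TURNS:]
    return [summarize(older)] + recent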
Some tools give back HUGE answers. Before adding to the context, compress:
The “Report Card Summary” Approach:
Think about how your teacher grades 10,000 students on a spreadsheet:
Without Compression (The Overwhelming Way):
Show me all 10,000 students:
Row 1: John Smith, Math: 92, English: 88, Science: 91..
Row 2: Sarah Jones, Math: 85, English: 93, Science: 87..
Row 3: Mike Brown, Math: 78, English: 82, Science: 85..
[... 9,997 more rows ...]
AI Context: EXPLODED! Can't fit!
With Compression (The Smart Summary):
Query returned 10,000 student records.
Key Statistics:
- Average Math score: 84.5
- Average English score: 86.2
- Top 5 students: Sarah (94.3 avg), Mike (93.1 avg)...
- Bottom 5 students: Need tutoring support
- Grade distribution: 15% A's, 35% B's, 40% C's, 10% D's
Sample records:
Row 1: John Smith (90.3 avg) - Excellent
Row 2: Sarah Jones (88.3 avg) - Very Good
Full data saved to: student_grades.xlsx
Result: AI gets the important insights (200 tokens) instead of a lot of raw data (20,000 tokens).
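In code, the same report-card compression might look like the sketch below. The record layout, field names, and file path are hypothetical:

# Compress a huge tool result into key statistics plus a few sample rows,
# keeping only a pointer to the full data on disk.

import statistics

def compress_result(rows: list[dict], sample_size: int = 2,
                    saved_path: str = "student_grades.xlsx") -> str:
    """Turn thousands of raw records into a short summary for the context."""
    def avg(row):
        return statistics.mean(row["scores"].values())

    lines = [
        f"Query returned {len(rows)} student records.",
        f"Average score across all subjects: {statistics.mean(avg(r) for r in rows):.1f}",
        f"Top student: {max(rows, key=avg)['name']} ({avg(max(rows, key=avg)):.1f} avg)",
        "Sample records:",
    ]
    lines += [f"  {r['name']} ({avg(r):.1f} avg)" for r in rows[:sample_size]]
    lines.append(f"Full data saved to: {saved_path}")
    return "\n".join(lines)

rows = [
    {"name": "John Smith", "scores": {"Math": 92, "English": 88, "Science": 91}},
    {"name": "Sarah Jones", "scores": {"Math": 85, "English": 93, "Science": 87}},
    # ... 9,998 more rows in a real run ...
]
print(compress_result(rows))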
Compression by Tool Type:
Code Search Results:
Database Query:
Log Files:
Lossless Compression: Get rid of extra data without losing any information.
Original: "The user wants to book a flight. The user prefers direct flights.
The user's budget is $500. The user is traveling next week."
Lossless: "User wants direct flight, $500 budget, traveling next week."
Information preserved: 100%
Token reduction: 40%
Lossy Compression: Accept some loss of information to get a big reduction.
Original: 50-page technical specification with exact implementation details
Lossy: "System processes payments via Stripe. Supports credit cards,
PayPal, and Apple Pay. Handles refunds within 30 days."
Information preserved: ~60%
Token reduction: 98%
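A simple way to wire this into an agent is to let the token budget pick the strategy. The 2x cut-off and the helper functions below are assumptions for illustration, not a standard rule:

def lossless_compress(text: str) -> str:
    """Squeeze out redundancy without dropping facts (here: collapse the repeated 'The user' phrasing)."""
    return text.replace("The user's ", "").replace("The user ", "").replace(". ", "; ").strip()

def lossy_compress(text: str, keep_words: int) -> str:
    """Keep only the first `keep_words` words; a placeholder for an LLM-written summary."""
    return " ".join(text.split()[:keep_words]) + " ..."

def compress(text: str, budget_words: int) -> str:
    words = len(text.split())
    if words <= budget_words:
        return text                                 # already fits: keep everything
    if words <= 2 * budget_words:
        return lossless_compress(text)              # mildly over budget: lossless is enough
    return lossy_compress(text, budget_words)       # far over budget: accept information loss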
When to Use Each:
Note: Test your AI agents across real-world scenarios. Try Agent to Agent Testing Today!
The main idea is to keep contexts from getting in each other’s way by breaking concerns up into focused units.
Anthropic’s multi-agent research system shows that specialized agents with separate contexts work much better than single-agent systems. Their internal tests showed that “a multi-agent system with Claude Opus 4 as the main agent and Claude Sonnet 4 as subagents did 90.2% better than a single-agent Claude Opus 4.”
The main point is that “subagents make compression easier by working in parallel with their own context windows and looking at different parts of the question at the same time.” You can assign a narrow sub-task to each subagent’s context without having to worry about unrelated information getting in the way.
Architecture Pattern: You can consider it like a group project at school:
The Teacher (Orchestrator Agent):
The Students (Specialist Agents):
Every student has their small, focused backpack! The teacher gathers everyone’s work at the end and puts it all together to make the final project. Each student only had to remember what they were supposed to do, not the whole project!
Real-World Diagram:
Task: "Write a comprehensive market analysis report"
┌─────────────────────────────────────────────────────────────┐
│                     Orchestrator Agent                      │
│        Context: Task description, plan, coordination        │
└──────┬───────────────┬───────────────┬───────────────┬──────┘
       │               │               │               │
       ▼               ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Research    │ │ Financial   │ │ Competitor  │ │ Synthesis   │
│ Agent       │ │ Agent       │ │ Agent       │ │ Agent       │
│             │ │             │ │             │ │             │
│ Context:    │ │ Context:    │ │ Context:    │ │ Context:    │
│ - Search    │ │ - Finance   │ │ - Competitor│ │ - All       │
│   tools     │ │   data      │ │   data      │ │   summaries │
│ - Market    │ │ - Metrics   │ │ - Frameworks│ │ - Report    │
│   sources   │ │   formulas  │ │             │ │   template  │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
Each agent has isolated, focused context – no interference, no confusion!
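A stripped-down version of this architecture might look like the sketch below. call_llm, the sub-task names, and the tool lists are placeholders; the point is that each subagent gets a fresh, isolated message list, and only short summaries flow back to the orchestrator:

def call_llm(messages: list[dict], tools: tuple = ()) -> str:
    """Placeholder for a model call made against an isolated message list."""
    return f"[summary of work on: {messages[-1]['content']}]"

def run_subagent(task: str, tools: tuple) -> str:
    context = [  # isolated context: this agent never sees the other agents' work
        {"role": "system", "content": f"You are a specialist. Available tools: {', '.join(tools)}"},
        {"role": "user", "content": task},
    ]
    return call_llm(context, tools)

def orchestrate(goal: str) -> str:
    subtasks = {
        "research":   (f"Gather market sources for: {goal}", ("web_search",)),
        "financial":  (f"Compute key financial metrics for: {goal}", ("spreadsheet",)),
        "competitor": (f"Analyze competitors for: {goal}", ("web_search",)),
    }
    summaries = {name: run_subagent(task, tools) for name, (task, tools) in subtasks.items()}
    # The synthesis agent only ever sees the summaries, never the raw research.
    return run_subagent(f"Write the report from these summaries: {summaries}", ("report_template",))

print(orchestrate("comprehensive market analysis"))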
Trade-offs of Multi-Agent Systems:
According to Anthropic’s research, multi-agent systems have significant benefits and costs:
Benefits:
Costs:
When to Use Multi-Agent Systems:
Anthropic found that multi-agent systems excel at:
When Not to Use Multi-Agent Systems:
Key Finding: In Anthropic’s BrowseComp evaluation, they found that token usage by itself explains 80% of performance variance. Multi-agent systems work primarily because they “help spend enough tokens to solve the problem” through parallel context windows.
HuggingFace’s CodeAgent approach shows how to isolate data-heavy operations.
The Sandbox is Like a Workshop:
Consider you’re building a huge LEGO castle:
Without Sandbox (Everything in Your Bedroom):
With Sandbox (Using a Separate Workshop):
Your Bedroom (AI’s Context) Only Sees:
The Garage (Sandbox) Holds:
Benefits:
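Here is one hedged way to express the workshop idea in code. This is not HuggingFace’s actual CodeAgent API, and a plain subprocess is not a real security sandbox; it only illustrates keeping the heavy data outside the model’s context:

import subprocess, sys, textwrap

def run_in_sandbox(code: str) -> str:
    """Execute generated code in a separate interpreter; only stdout comes back."""
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, timeout=30)
    return result.stdout.strip() or result.stderr.strip()

# The model writes code that crunches the big dataset; the 10,000 rows stay in
# the "garage" (the subprocess) and only the final answer enters the context.
generated_code = textwrap.dedent("""
    rows = list(range(10_000))          # stand-in for loading a huge dataset
    print(f"Average value: {sum(rows) / len(rows)}")
""")

context_addition = run_in_sandbox(generated_code)
print(context_addition)                 # -> "Average value: 4999.5"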
The Three-Drawer System: Consider your desk has three drawers with different rules:
Top Drawer (ALWAYS Open):
This drawer is always visible. The AI sees this every time.
Middle Drawer (Open ONLY When Needed):
This drawer opens only when specifically asked. Most of the time it stays closed to keep your desk uncluttered.
Bottom Drawer (NEVER Show to AI):
This drawer is locked. The AI never sees what’s inside.
Why This Works:
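A tiny sketch of the three-drawer rule, with made-up content in each drawer:

TOP_DRAWER = ["System instructions", "User's name and preferences"]   # always loaded
MIDDLE_DRAWER = {                                                     # loaded only on demand
    "billing": "Detailed billing history ...",
    "deployment": "Runbook for production deployments ...",
}
BOTTOM_DRAWER = {"API keys", "Other users' data", "Internal credentials"}  # never loaded

def build_context(query: str) -> list[str]:
    context = list(TOP_DRAWER)                       # top drawer: always visible
    for topic, doc in MIDDLE_DRAWER.items():
        if topic in query.lower():                   # middle drawer: opened only when asked for
            context.append(doc)
    assert not any(secret in context for secret in BOTTOM_DRAWER)  # bottom drawer stays locked
    return context

print(build_context("Why did my billing amount change?"))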
Now that you know what the four pillars are, let’s look at some more advanced patterns that are used in production systems:
Following best practices, as outlined in Daffodil Software Engineering Insights, organize information according to levels of importance:
The Five-Level Information Tower: Think of information like floors in a building – higher floors are more important:
Tier 0 – The Foundation (NEVER expires):
Tier 1 – The Ground Floor (Lasts 30 days):
Tier 2 – Second Floor (Lasts 7 days):
Tier 3 – Third Floor (Lasts 24 hours):
Tier 4 – The Rooftop (lasts 5 minutes):
How It Works:
The AI packs its backpack from most important to least important, stopping when the backpack is full!
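As a sketch, the tower can be implemented as tiered items with an age limit per tier; the packer walks from Tier 0 upward and stops when the budget is spent. The TTL values mirror the floors above, and the word-count token estimate is a simplification:

import time

TTL_SECONDS = {0: float("inf"), 1: 30 * 86400, 2: 7 * 86400, 3: 86400, 4: 300}

def pack_context(items: list[dict], budget_tokens: int) -> list[str]:
    """items: dicts with 'tier', 'created_at', and 'text' keys."""
    now = time.time()
    packed, used = [], 0
    for item in sorted(items, key=lambda i: i["tier"]):        # most important first
        if now - item["created_at"] > TTL_SECONDS[item["tier"]]:
            continue                                           # expired: leave it out
        cost = len(item["text"].split())                       # crude token estimate
        if used + cost > budget_tokens:
            break                                              # backpack is full
        packed.append(item["text"])
        used += cost
    return packed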
Anthropic’s production experience provides critical insights for managing extended conversations:
The Relay Race Strategy for Super Long Conversations: Consider you’re running a marathon (26 miles), but you can only run 5 miles before getting tired:
Runner 1 (Miles 1-5):
Runner 2 (Miles 6-10):
Runners 3, 4, 5… Continue the pattern.
What Happens:
Anthropic’s Three-Part Strategy:
Building multi-agent systems that work in production requires solving challenges beyond basic Context Engineering. Anthropic’s engineering team shares critical lessons from deploying their research system.
The Problem:
Unlike traditional software where you can restart on error, agents can’t restart from the beginning – it’s “expensive and frustrating for users”.
The Solution – The Video Game Save Point Strategy:
Consider you are playing a video game with 20 levels:
Without Checkpoints (The Nightmare):
With Checkpoints (The Smart Way):
Here are scenarios when things go wrong:
Scenario 1 – Tool Breaks:
Scenario 2 – System Crashes:
Key Insight From Anthropic: “Letting the agent know when a tool is failing and letting it adapt works surprisingly well.” The AI is smart enough to find another way – just tell it what’s broken!
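A rough sketch of the save-point plus tell-the-agent-what-broke pattern is below. The checkpoint file format, step names, and tools are hypothetical:

import json, pathlib

CHECKPOINT = pathlib.Path("agent_checkpoint.json")

def save_checkpoint(state: dict) -> None:
    CHECKPOINT.write_text(json.dumps(state))

def load_checkpoint() -> dict:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {"steps_done": []}

def run_step(state: dict, step: str, tool) -> dict:
    try:
        state["steps_done"].append({"step": step, "result": tool()})
    except Exception as err:
        # Don't restart from level 1: record the failure so the agent can adapt.
        state["steps_done"].append({"step": step, "result": f"TOOL FAILED: {err}. Try an alternative."})
    save_checkpoint(state)                  # save point after every step
    return state

def broken_booking_tool():
    raise TimeoutError("booking API is down")

state = load_checkpoint()
state = run_step(state, "search_flights", lambda: "3 direct flights under $500")
state = run_step(state, "book_flight", broken_booking_tool)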
The Problem:
Users say, “The AI didn’t find obvious information” but when you try, it works fine. What happened?
The Solution: The Detective’s Notebook (Without Reading Private Diaries)
Here’s the problem: imagine a robot toy that sometimes goes left and sometimes goes right, even with the same button press. How do you fix it if you can’t predict what it’ll do?
The Solution – Track Patterns, Not Content:
Instead of reading every private conversation (creepy!), track the patterns:
Here are aspects we track:
Decisions Made:
Interaction Patterns:
Performance Stats:
Privacy Protected:
Anthropic emphasizes: “We monitor agent decision patterns and interaction structures, all without monitoring the contents of individual conversations, to maintain user privacy.”
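In practice this can be as simple as emitting structured events that carry decisions, tools, and timings, with no conversation text in them. The event names and fields below are illustrative, not a specific telemetry schema:

import json, time, uuid

def log_event(session_id: str, event_type: str, **fields) -> None:
    """Emit a structured event; note there is no message content in here."""
    print(json.dumps({"session": session_id, "ts": time.time(),
                      "event": event_type, **fields}))

session = str(uuid.uuid4())
log_event(session, "tool_selected", tool="web_search", reason_category="fact_lookup")
log_event(session, "tool_result", tool="web_search", latency_ms=412, success=True, result_tokens=180)
log_event(session, "agent_decision", decision="spawn_subagent", subagent_count=3)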
The Detective Work:
The Problem:
You can’t update all agents simultaneously without breaking running tasks.
The Solution: The Two-Playground Strategy
Here is the problem: Consider a theme park where 100 people are on different rides:
Now you want to upgrade all the rides with new features. But you can’t:
Rainbow Deployment (The Smart Way):
Step 1: Build a second, upgraded theme park next door.
Step 2: Make a simple rule:
Step 3: Wait patiently.
Step 4: When the old park is empty:
Nobody’s ride was interrupted. This is exactly how Anthropic deploys updates: “Gradually shifting traffic from old to new versions while keeping both running simultaneously” so no one’s work gets interrupted.
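A toy version of that routing rule is shown below; the 25% share is a placeholder that a real rollout would raise gradually from 0 to 100%:

import random

session_version = {}        # session_id -> "v1" or "v2", set once and never changed
NEW_VERSION_SHARE = 0.25    # gradually raised from 0.0 to 1.0 during the rollout

def route(session_id: str) -> str:
    if session_id in session_version:
        return session_version[session_id]          # in-flight work is never interrupted
    version = "v2" if random.random() < NEW_VERSION_SHARE else "v1"
    session_version[session_id] = version
    return version

# Once no sessions remain pinned to "v1", the old version can be shut down.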
The Current State: Anthropic notes that currently their “lead agents execute subagents synchronously, waiting for each set of subagents to complete before proceeding.”
The Problem:
The Future:
Anthropic’s research demonstrates that using multiple specialized agents with separate contexts significantly improves performance. By isolating responsibilities, parallelizing tasks, and managing context individually, multi-agent systems handle complex, large-scale workflows more efficiently and reliably than single-agent setups.
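To make the synchronous-versus-asynchronous point concrete, here is a sketch (not Anthropic’s code) of subagents running concurrently with asyncio instead of one after another:

import asyncio

async def call_subagent(task: str) -> str:
    await asyncio.sleep(1)                      # stands in for a slow model call plus tool use
    return f"summary for: {task}"

async def lead_agent(tasks: list[str]) -> list[str]:
    # All subagents run concurrently in their own contexts; results arrive together.
    return await asyncio.gather(*(call_subagent(t) for t in tasks))

results = asyncio.run(lead_agent(["market research", "financial metrics", "competitor scan"]))
print(results)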
1. Think Like Your Agents
Build simulations with exact prompts and tools, and watch agents work step-by-step. This “immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools.”
2. Teach the Orchestrator How to Delegate
Vague instructions like “research the semiconductor shortage” led to duplicated work and gaps. Instead, each subagent needs:
3. Scale Effort to Query Complexity
Embed scaling rules in prompts:
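The exact rules aren’t reproduced here, but embedded scaling guidance might look something like the snippet below; the thresholds are placeholders, not Anthropic’s published numbers:

ORCHESTRATOR_SCALING_RULES = """
Scale your effort to the query:
- Simple fact lookup: answer directly, at most a couple of tool calls, no subagents.
- Comparison across a few sources: spawn 2-3 subagents with narrow, non-overlapping tasks.
- Broad research report: spawn one subagent per major section, each with its own tools,
  and stop spawning once new subagents would duplicate existing work.
"""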
4. Tool Design is Critical
“Agent-tool interfaces are as critical as human-computer interfaces.” The right tool doesn’t just make a task efficient; often, it’s strictly necessary to complete the task at all.
5. The Last Mile is Most of the Journey
Even well-designed AI agents fail when context handling goes wrong. These are the most frequent mistakes teams make when managing context at scale, and how to fix them with practical, production-tested methods.
The Backpack Analogy:
The Analogy:
The Analogy:
The Analogy:
The Analogy:
Track these metrics to know if you’re on the right track:
These metrics show how effectively your AI is using and managing its context window for optimal performance.
Context Utilization:
Information Density:
Retrieval Precision:
Context Freshness:
Redundancy Rate:
These metrics measure the accuracy, relevance, and consistency of the AI’s responses based on the loaded context.
Relevance Score:
Sufficiency Score:
Consistency Score:
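Here is a small sketch of how a few of these health metrics could be computed over whatever was actually loaded into the window; the formulas are reasonable defaults, not a standard:

import time

def context_utilization(used_tokens: int, window_tokens: int) -> float:
    return used_tokens / window_tokens                     # how full the backpack is

def redundancy_rate(chunks: list[str]) -> float:
    """Share of loaded chunks that are exact duplicates of an earlier chunk."""
    seen, dupes = set(), 0
    for c in chunks:
        dupes += c in seen
        seen.add(c)
    return dupes / len(chunks) if chunks else 0.0

def context_freshness(timestamps: list[float], max_age_s: float = 86_400) -> float:
    """Share of loaded items younger than `max_age_s` (default: 24 hours)."""
    now = time.time()
    return sum(now - t <= max_age_s for t in timestamps) / len(timestamps) if timestamps else 1.0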
At TestMu AI, we’ve embraced Context Engineering as a core principle across our AI agents. Here’s our high-level approach:
The Results:
Testing AI agents ensures reliability across workflows. It validates isolated and integrated contexts, measures metrics like bias, hallucination, and tone consistency, and detects subtle issues before production deployment.
Platforms such as TestMu AI Agent to Agent Testing allow teams to simulate multiple personas, chat, voice, and multimodal interactions, confirming smooth handoffs between agents and consistent, context-aware performance.
To get started, refer to this TestMu AI Agent to Agent Testing guide.
Context Engineering is where the art of AI system design meets the science of optimization.
The Art:
The Science:
The evidence is clear: as Drew Breunig’s research compilation shows, even frontier models with million-token context windows suffer from context poisoning, distraction, confusion, and clash. Simply having a large context window doesn’t solve the problem – you need thoughtful Context Engineering.
Remember: An AI’s context window is like a backpack. Pack smart, not heavy. At TestMu AI, we’re committed to applying these principles across our AI-native products, continuously pushing the boundaries of what’s possible when context is engineered thoughtfully.
Essential Resources
Co-Author: Sai Krishna
Sai Krishna is a Director of Engineering at TestMu AI. As an active contributor to Appium and a member of the Appium organization, he is deeply involved in the open-source community. He is passionate about innovative thinking and loves to contribute to open-source technologies. Additionally, he is a blogger, community builder, mentor, international speaker, and conference organizer.