Scaling Multi-Agent Systems: Lessons from Intuit on Orchestration and Predictability

How to get multiple agents to play nice at scale

Chase Roossin and Steven Kulesza from Intuit address the engineering challenge of orchestrating multiple AI agents within complex systems. They highlight that automated evaluations are critical for making agent behaviors predictable at scale. This approach allows developers to manage the inherent volatility of LLM-based interactions in production.

Why This Matters

In theory, multiple AI agents collaborate seamlessly; in production, teams have to manage unpredictable interactions between them. Scaling these systems means moving away from manual testing toward automated evaluation frameworks that keep reliability measurable. Engineering teams also have to weigh the trade-offs between deploying agent swarms and relying on a single, highly skilled agent. That decision is heavily influenced by customer behavior and by the need for reusable AI components that diverse development teams can share for consistency and speed.

Key Insights

As of 2026, Intuit uses automated evaluations to keep agent behavior predictable as system complexity increases.
Agent swarms are a decentralized architectural alternative to a single, highly skilled agent for executing complex tasks.
Technical architecture at Intuit is shaped by customer behavior data so that AI agents meet specific user requirements.
According to Intuit engineering leadership, reusable components help democratize AI development across teams.
Automated eval pipelines are treated as a prerequisite for predictability in agent-based systems, not an optional add-on.
Scaling multi-agent systems is currently considered one of the hardest problems in engineering.
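
The source does not describe how Intuit's evaluation pipelines are built, so the following is only a minimal sketch of the general idea, under stated assumptions: each eval case is run several times against an agent, and a release gate checks the resulting pass rates. All names here (EvalCase, run_evals, gate, fake_agent) and the 0.9 threshold are hypothetical, and fake_agent is a deterministic stand-in for a real LLM-backed agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One automated check: a prompt plus a predicate over the agent's output."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True when the output is acceptable


def run_evals(agent: Callable[[str], str],
              cases: List[EvalCase],
              runs_per_case: int = 5) -> Dict[str, float]:
    """Run every case several times and return the pass rate per case.

    Repeating each case matters because LLM-backed agents can give
    different answers to the same prompt.
    """
    report: Dict[str, float] = {}
    for case in cases:
        passes = sum(
            1 for _ in range(runs_per_case) if case.check(agent(case.prompt))
        )
        report[case.name] = passes / runs_per_case
    return report


def gate(report: Dict[str, float], threshold: float = 0.9) -> bool:
    """Hypothetical release gate: every case must meet the pass-rate threshold."""
    return all(rate >= threshold for rate in report.values())


def fake_agent(prompt: str) -> str:
    """Stand-in for a real agent (assumption: agents map a prompt to text)."""
    return "CATEGORY: travel" if "flight" in prompt else "CATEGORY: unknown"


cases = [
    EvalCase("categorize_flight", "Categorize: flight to Denver $320",
             lambda out: out == "CATEGORY: travel"),
    EvalCase("always_labeled", "Categorize: coffee $4",
             lambda out: out.startswith("CATEGORY:")),
]

report = run_evals(fake_agent, cases)
print(report, "gate passed:", gate(report))
```

In a setup like this, the gate would typically run in continuous integration, so a prompt or model change that lowers a pass rate is caught before it reaches production.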

Practical Applications

Use Case: Intuit integrates automated evals to stabilize agent interactions in production environments.
Pitfall: Scaling agent systems without automated evaluation metrics leaves non-deterministic behavior unmeasured, so regressions surface unpredictably in production.
Use Case: Deploying agent swarms to distribute specialized tasks across multiple smaller models for better performance.
Pitfall: Designing agent architectures in isolation from customer behavior data results in misaligned system outputs.
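
One simplified way to picture the trade-off between several specialized agents and a single generalist is a dispatcher that sends each task to a matching specialist and falls back to the generalist otherwise. This is an illustrative sketch only, not Intuit's architecture: the agent functions, the keyword-based routing, and the "billing" and "scheduling" specialties are all hypothetical, and a real swarm would usually coordinate with less central control than this.

```python
from typing import Callable, Dict

Agent = Callable[[str], str]


def billing_agent(task: str) -> str:
    """Hypothetical specialist; a real one would call a model or tool."""
    return f"[billing] handled: {task}"


def scheduling_agent(task: str) -> str:
    """Hypothetical specialist for calendar-style tasks."""
    return f"[scheduling] handled: {task}"


def generalist_agent(task: str) -> str:
    """Hypothetical single 'highly skilled' agent used as the fallback."""
    return f"[generalist] handled: {task}"


# Assumption: a keyword match is enough to pick a specialist.
SPECIALISTS: Dict[str, Agent] = {
    "billing": billing_agent,
    "scheduling": scheduling_agent,
}


def route(task: str,
          specialists: Dict[str, Agent] = SPECIALISTS,
          fallback: Agent = generalist_agent) -> str:
    """Send the task to the first specialist whose keyword appears in it."""
    lowered = task.lower()
    for keyword, agent in specialists.items():
        if keyword in lowered:
            return agent(task)
    return fallback(task)


print(route("Send the billing reminder"))
print(route("Summarize this document"))
```

Because every agent shares the same call signature, each specialist can be evaluated on its own and reused by other teams, which is the kind of reusability the summary points to.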

References:

https://stackoverflow.blog/2026/04/22/how-to-get-multiple-agents-to-play-nice-at-scale/
