The Three-Agent Model: How We Standardize and Accelerate Agentic Engineering

There's a version of AI-assisted development that most people picture: a developer typing a prompt, reading the output, and deciding whether to keep it. One human, one model, back and forth. It's better than writing everything from scratch. It's not how we work.

At Foxbox we use what's become known in the engineering community as the three-agent model. It's a multi-agent architecture pattern, a way of structuring AI agents in a hierarchy so that complex software work can be decomposed, executed in parallel, and verified before it ships. The terminology around this pattern is still settling; you'll see it called orchestrator-subagent, agentic loop, or simply multi-agent workflow depending on who's writing about it and when. The underlying structure is consistent regardless of what it's called, and that structure is what's worth understanding.

This post explains the model, where it comes from, how we apply it in practice, and what it looks like on a real engagement.

What Is the Three-Agent Model?

The three-agent model didn't originate with any single company or tool. It emerged from multi-agent systems research and has been formalized over the past couple of years by engineering teams at Anthropic, Google, Microsoft, and OpenAI as AI coding agents became capable enough to make it practical.

The core idea is older than any of those companies: in any well-designed system, generation and verification should be separate concerns. You don't have the same person write the code and sign off on its correctness. You don't have the same process that builds the artifact also audit it. This separation of concerns is foundational to software engineering. It's why code review exists, why QA is a distinct function, why tests are written independently of the code they test.

The three-agent model applies that principle to AI agents. Instead of one agent trying to do everything — plan, implement, check, refine — you give each role to a specialized agent with its own context, instructions, and scope.

Microsoft's Azure Architecture Center, Google's Agent Development Kit documentation, and Anthropic's guidance on multi-agent workflows all describe variations of the same pattern. The names differ: orchestrator-subagent-evaluator, generator-critic loop, maker-checker. The structure holds.

The Three Roles

The Orchestrator

The orchestrator is the agent at the top of the hierarchy. It receives the task — a feature, a bug fix, a module to build — and handles the planning work: decomposing the problem, deciding what needs to happen in what order, assigning work to subagents, and synthesizing results.

The orchestrator does not write code directly. Its job is coordination and judgment. It maps the work, specifies what each subagent should produce, and determines when output meets the standard to move forward.
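To make the division of labor concrete, here is a minimal Python sketch of an orchestrator that plans and delegates but never produces code itself. Everything in it is an assumption for illustration: the TaskSpec shape, the Orchestrator class, and the stub_plan and stub_build functions stand in for model-backed agents and are not taken from any specific framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class TaskSpec:
    """What the orchestrator hands to one builder (hypothetical shape)."""
    task_id: str
    goal: str
    acceptance_criteria: tuple[str, ...] = ()


@dataclass
class Orchestrator:
    # Both callables are stand-ins for model-backed agents.
    plan: Callable[[str], list[TaskSpec]]
    build: Callable[[TaskSpec], str]
    results: dict[str, str] = field(default_factory=dict)

    def run(self, feature_request: str) -> str:
        # The orchestrator decomposes and delegates; it never writes code itself.
        for spec in self.plan(feature_request):
            self.results[spec.task_id] = self.build(spec)
        # Synthesis is reduced here to a one-line status per task.
        return "; ".join(f"{task_id}: done" for task_id in self.results)


def stub_plan(request: str) -> list[TaskSpec]:
    # A real planner would be a model call; this one is hard-coded.
    return [
        TaskSpec("schema", f"Define the data model for: {request}"),
        TaskSpec("api", f"Expose endpoints for: {request}",
                 ("returns 404 for unknown ids",)),
    ]


def stub_build(spec: TaskSpec) -> str:
    # A real builder would return generated code for the spec.
    return f"# output for {spec.task_id}"


orchestrator = Orchestrator(plan=stub_plan, build=stub_build)
summary = orchestrator.run("saved searches")
```

The point of the shape is that the orchestrator only ever holds specs and results, which is what keeps its own context small.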

This mirrors how a senior engineer actually operates. When a complex feature lands in a sprint, a senior engineer doesn't immediately start coding. They break it down. They think about dependencies. They decide what the interface should look like before anyone touches an implementation. That planning work is what the orchestrator does.

The Builder

The builders are the hands. The generators. Each one receives a bounded, well-defined task from the orchestrator and executes it in its own isolated context window.

The isolation matters. One subagent working on the data access layer doesn't need to know about the UI. Another subagent writing the API schema doesn't need to know about the deployment configuration. Narrower context produces more focused output — the agent isn't holding irrelevant information while trying to do a specific job.

Subagents can run in parallel. If the orchestrator determines that the database schema, the API endpoints, and the authentication middleware are independent of each other, it can assign all three to separate builder subagents working simultaneously. This is where significant time savings come from in practice: tasks that would be sequential in a traditional development workflow can run concurrently.
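As a rough sketch of what running independent builders concurrently can look like, the snippet below uses Python's asyncio.gather. The run_builder coroutine is a hypothetical stand-in for a model call in a fresh context; the short sleep only simulates latency.

```python
import asyncio


async def run_builder(task_name: str) -> str:
    # Stand-in for a model call made in a fresh, isolated context.
    await asyncio.sleep(0.01)
    return f"{task_name}: draft ready"


async def run_independent_tasks(task_names: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(run_builder(name) for name in task_names))


results = asyncio.run(run_independent_tasks(
    ["database schema", "API endpoints", "authentication middleware"]
))
```

Because the three calls overlap, total wall-clock time is close to the slowest single task rather than the sum of all three.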

The Critic

The critic is the quality gate. After a builder produces output, the critic reviews it against a defined set of criteria before the result goes back to the orchestrator.

The critic's job is not to rewrite the output. It's to evaluate it: Does this do what was asked? Does it meet the acceptance criteria? Are there edge cases missing? Is there a security issue? Does it conflict with anything in the broader codebase?

If the output passes, the critic signs off and the orchestrator proceeds. If it fails, specific feedback goes back to the subagent for revision. This loop continues until the output meets the standard, or until a maximum iteration count triggers escalation to a human reviewer.
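The revise-until-accepted loop described above can be sketched in a few lines of Python. This is a simplified illustration, not a production implementation: review_loop, the Review type, the default of three iterations, and the two stub agents are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Review:
    passed: bool
    feedback: str = ""


def review_loop(build, critique, task, max_iterations=3):
    """Return (output, status), where status is 'accepted' or 'escalated'."""
    feedback = ""
    output = None
    for _ in range(max_iterations):
        output = build(task, feedback)   # builder revises using the last feedback
        review = critique(output)        # critic evaluates, never rewrites
        if review.passed:
            return output, "accepted"
        feedback = review.feedback
    # Iteration budget exhausted: hand the latest output to a human reviewer.
    return output, "escalated"


def demo_build(task, feedback):
    # Hypothetical builder: fixes the query once it has been told about the problem.
    return "batched query" if feedback else "query per row"


def demo_critique(output):
    # Hypothetical critic with a single criterion.
    if "batched" in output:
        return Review(passed=True)
    return Review(passed=False, feedback="N+1 risk: batch the lookups")


accepted = review_loop(demo_build, demo_critique, "load orders")
escalated = review_loop(demo_build, lambda out: Review(False, "still failing"), "load orders")
```

The iteration cap is the important design choice: without it, a builder and critic that disagree can loop indefinitely instead of surfacing the problem to a person.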

What this creates is a quality gate that catches problems before they compound. A builder writing a database query that would cause an N+1 problem gets corrected before that query is wired into the API layer. A subagent generating an API endpoint that exposes a sensitive field without proper access controls gets flagged before it ships.

Why Three Roles and Not One

The practical reason is context management. A single AI agent working on a large feature has to hold the entire task in a single context window. As that window fills, quality degrades — the model loses track of earlier constraints, makes decisions that contradict earlier ones, and produces output that works in isolation but conflicts with the rest of the system.

Distributing work across multiple agents, each with a focused context, prevents this. The orchestrator's context stays lean because it only manages plans and summaries, not implementation details. Each builder's context is scoped to its specific task. The critic's context is structured around evaluation criteria, not generation.
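One simple way to picture context scoping is as an allowlist per role. The Python sketch below is illustrative only: the ARTIFACTS entries, the role names, and the context_for helper are hypothetical, and real systems assemble agent context in more elaborate ways.

```python
# Everything the project has produced so far (hypothetical contents).
ARTIFACTS = {
    "plan": "1. schema 2. api 3. ui",
    "schema_task": "Add an orders table",
    "schema_output": "CREATE TABLE orders (...)",
    "api_task": "Expose GET /orders",
    "review_criteria": "No unbounded queries",
}

# Which artifacts each role is allowed to see.
CONTEXT_SCOPES = {
    "orchestrator": ["plan"],
    "schema_builder": ["schema_task"],
    "critic": ["review_criteria", "schema_output"],
}


def context_for(role: str) -> dict[str, str]:
    # Build the role's context from its allowlist only; nothing else leaks in.
    return {key: ARTIFACTS[key] for key in CONTEXT_SCOPES[role]}


schema_builder_context = context_for("schema_builder")
critic_context = context_for("critic")
```

Here the schema builder never sees the API task, and the critic sees the output and the criteria but none of the builder's working notes.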

There's also a quality argument independent of context. Generation and evaluation are different modes of work. An agent that wrote the code has already committed to a set of assumptions. A separate agent reviewing that code — with different instructions and no prior context — will catch things the generator missed. This is why pair programming works. It's why code review finds bugs the original author didn't. Separating the roles builds that dynamic into the process.

How We Apply It at Foxbox

We don't apply this pattern mechanically to every task. A simple bug fix doesn't need an orchestrator. A single-endpoint change doesn't need parallel builders. The pattern earns its overhead when the work is complex enough that a single agent would hit its limits.

Where we use it consistently:

Feature builds across multiple layers.

When a feature touches the database, the API, and the frontend — or requires integration with a third-party system alongside product logic — the orchestrator decomposes this into separate workstreams that can be specified and verified independently.

Codebase-wide changes.

Refactoring a data model, migrating an API version, updating an authentication system — these are changes that affect many files and many layers. Parallel subagents, each responsible for a specific domain, move through these changes faster and with less risk of cross-contamination.

High-stakes domains.

Any work where correctness is non-negotiable gets a critic configured with explicit acceptance criteria, not just a general quality review. In regulated industries — healthcare, fintech, anything with real compliance exposure — the critic's checklist includes the specific requirements that can't be compromised.

A Representative Example: Building a Care Coordination Feature

The following is a composite based on the kinds of engagements we run for digital health clients. The specifics are illustrative, but the pattern and the catches are real.

The scenario: a care coordination platform managing care plans for patients across multiple providers. The feature: a notification system that alerts care team members when a patient's status changes, routed correctly based on the member's role, their relationship to the patient, and the type of status change.

This is a non-trivial feature. It has a rules engine (who gets notified, under what conditions), a delivery layer (push, SMS, in-app), a preference system (notification settings per user), a data model change (new notification event types with audit logging), and it touches HIPAA-covered data throughout.

Here's how the orchestrator decomposed it:

1. Define the event schema for status change notifications and the audit log format
2. Build the rules engine mapping patient status changes to notification recipients
3. Build the delivery service (separate subagent per channel: push, SMS, in-app)
4. Build the user preferences API
5. Wire the components together and validate the end-to-end flow

Tasks 1 through 4 were assigned to parallel builder subagents. Each had a scoped context: the event schema agent saw only the existing data model and the requirements doc. The rules engine agent saw the event schema output and the business logic spec. The delivery agents each saw only their own channel's integration requirements.
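A decomposition like this can be expressed as a dependency graph and grouped into waves of work that are safe to run together. The sketch below uses Python's standard graphlib module. The task names and the dependency edges are assumptions for illustration: the only ordering stated above is that the rules engine agent consumed the event schema output and that wiring came last, so a real plan may order things differently.

```python
from graphlib import TopologicalSorter

# Map each task to the tasks it depends on (edges are illustrative assumptions).
DEPENDENCIES = {
    "event_schema": set(),
    "rules_engine": {"event_schema"},
    "delivery_push": set(),
    "delivery_sms": set(),
    "delivery_in_app": set(),
    "preferences_api": set(),
    "integration": {
        "rules_engine", "delivery_push", "delivery_sms",
        "delivery_in_app", "preferences_api",
    },
}


def parallel_waves(dependencies):
    """Group tasks into waves; every task in a wave can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = parallel_waves(DEPENDENCIES)
```

Under these assumed edges the schema, the three delivery channels, and the preferences API start together, the rules engine follows once the schema exists, and integration runs last.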

The critic for this engagement was configured with specific criteria: every function touching PHI must pass through the existing access control middleware, no patient identifiers in log output, notification content must not include clinical notes.

That last requirement caught something on the first pass. The rules engine subagent had included a brief status summary in the notification payload, the kind of detail a developer building quickly might add as helpful context without recognizing it as protected health information in transit. The critic flagged it. The builder revised the payload to pass only identifiers and a status code, with readable content fetched client-side after authentication.
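A criterion this specific can also be backed by a deterministic check that runs alongside the critic agent. The Python sketch below shows the idea with an allowlist of payload fields. The field names and the check_notification_payload helper are hypothetical; the source only says the revised payload carried identifiers and a status code.

```python
# Fields a notification payload may carry (hypothetical names).
ALLOWED_PAYLOAD_FIELDS = {"patient_id", "event_id", "status_code"}


def check_notification_payload(payload: dict) -> list[str]:
    """Return a list of findings; an empty list means the payload passes."""
    extra = sorted(set(payload) - ALLOWED_PAYLOAD_FIELDS)
    return [f"unexpected field in payload: {name}" for name in extra]


first_draft = {
    "patient_id": "p-123",
    "event_id": "e-456",
    "status_code": "DISCHARGED",
    "status_summary": "free-text summary written by the builder",
}
revised = {"patient_id": "p-123", "event_id": "e-456", "status_code": "DISCHARGED"}

first_findings = check_notification_payload(first_draft)
revised_findings = check_notification_payload(revised)
```

An allowlist fails closed: any field nobody explicitly approved is flagged, which is a safer default for PHI than trying to enumerate everything that is forbidden.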

That's a meaningful catch. It's the kind of thing that slips through code review when reviewers are moving fast and aren't specialized in HIPAA requirements. Having an agent whose entire job is to evaluate output against a defined compliance checklist — before anything leaves the development workflow — is a different category of protection.

What This Means for How We Staff Engagements

The three-agent model changes what senior engineers spend their time on, not whether you need them.

Engineers at Foxbox are not writing boilerplate. They're not typing out CRUD endpoints or writing the same authentication middleware they've written a dozen times. What they're doing is the work the model can't replace: designing the orchestrator's decomposition, defining the critic's acceptance criteria, reviewing synthesized output before it merges, and making the architectural decisions that determine whether a system holds up at scale.

This is what senior engineering capacity should be used for. The judgment work. The design work. The work that requires understanding the whole system, not just the current task.

What AI has changed is the ratio. More of a senior engineer's day is judgment. Less of it is execution.

The Honest Caveats

The model has real costs. Each agent invocation adds latency and token usage. Orchestrating five parallel subagents costs more than running one. For small tasks, the overhead isn't worth it.

Poorly specified acceptance criteria undermine the critic. A vague mandate ("make sure this is good") won't catch the specific things you care about. The value of the critic is proportional to the precision of its instructions.

And the model requires engineers who understand it. Setting up effective orchestration, writing good specifications, configuring critics with meaningful criteria — these are skills. The tools are increasingly accessible. The judgment required to use them well is not.

Using the Three-Agent Model

The three-agent model comes down to three roles: an orchestrator that plans and coordinates, specialized builder subagents that execute in parallel in isolated contexts, and a critic that evaluates output before it's accepted.

It's a pattern grounded in software engineering fundamentals and multi-agent systems research, now formalized by the major AI labs and increasingly standard in how serious engineering teams work with these tools.

We use it because it produces better software faster than any single-agent approach at the scale we operate. In the domains our clients work in — healthcare, fintech, enterprise software — speed without correctness isn't a benefit. It's debt accumulated quickly.

June 1, 2026

Rob Volk

Rob Volk is Foxbox Digital's founder and CEO. Prior to starting Foxbox, Rob helped Fortune 500 clients, including Pfizer, USPS, and Morgan Stanley, build and scale enterprise apps. He was the CTO of Beyond Diet, where he implemented technology that scaled to 350k+ customers, and was the CTO and Co-Founder of Detective (detective.io), a venture-backed intelligence platform that amassed 200k+ users in a short time frame.