Last week I asked Claude Code to build a complete feature: a new API endpoint with an Entity Framework Core migration, unit tests, and integration tests. Halfway through, it started repeating itself. The context window was filling up, the model became cautious, and eventually it wrapped up early with an “I think that covers the main changes” — even though the integration tests were missing.

That’s not a Claude problem. That’s a single-agent problem.


The context window wall

Every AI coding assistant hits the same limit. The longer a session runs, the more context accumulates. Token by token, the model’s working memory fills up. At some point, the model notices it’s approaching the limit and starts wrapping up prematurely. Or worse — it loses track of decisions made earlier in the session and contradicts itself.

Anthropic’s engineering team measured this directly. On extended coding tasks, single agents tend to overestimate their progress and terminate before the work is actually done. The longer the task, the worse the problem gets.

For .NET developers working on real features — not toy examples — this is a practical limitation. A proper feature branch might need changes across your controller, service layer, repository, EF Core migration, unit tests, and integration tests. That’s a lot of context to hold in one session.


Three agents instead of one

Anthropic’s answer is a three-agent harness that splits autonomous coding into three specialized roles:

The Planner decomposes your request into a structured plan. It understands your architecture, identifies which files need to change, and defines the order of operations. Think of it as the tech lead who writes the implementation plan before anyone touches the keyboard.

The Generator writes code. It takes the plan and produces output — file edits, new files, configuration changes. It doesn’t need to remember the full conversation history because the plan gives it everything it needs for the current step.

The Evaluator reviews the output. It checks whether the generated code actually satisfies the plan, looks for bugs, runs tests, and provides structured feedback. If something isn’t right, the work cycles back to the generator for another iteration.

The key insight: instead of one agent holding everything in its head, each agent gets a fresh context window with structured artifacts from the previous step. Every handoff is a checkpoint. Context doesn’t degrade — it resets.
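Anthropic hasn’t published the schema of these handoff artifacts, but you can picture them as small, serializable records. The following C# shapes are purely illustrative — the names and fields are my assumptions, not the harness’s actual format:

```csharp
// Hypothetical handoff artifacts -- illustrative names only,
// not Anthropic's actual schema.
public record PlanStep(
    int Order,                     // position in the plan
    string Description,            // e.g. "Create OrderCreateDto in Models/"
    IReadOnlyList<string> Files,   // files this step touches
    string AcceptanceCriteria);    // what the evaluator checks against

public record EvaluationFeedback(
    int StepOrder,      // which plan step the critique applies to
    bool Passed,        // did the output satisfy the criteria?
    string? Critique);  // structured feedback for the next generator pass

// Each agent starts fresh: the generator receives only the PlanStep,
// the evaluator only the PlanStep plus the generated changes.
```

The point isn’t the exact shape — it’s that every handoff is a small, complete artifact rather than a shared conversation history.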


Why this is different from subagents

If you’ve used Claude Code’s existing subagent feature (the Agent tool), you might think this sounds familiar. It’s not the same thing.

Subagents are short-lived helpers. The main agent spawns a subagent, gives it a specific task (“search for all files that implement IOrderRepository”), gets the result back, and continues. The main agent still holds all the context and makes all the decisions.

The three-agent harness is fundamentally different. Each agent operates independently with its own context window. The planner doesn’t wait around for the generator to finish — it creates a complete plan and hands it off. The evaluator doesn’t need to know the conversation history — it just needs the plan and the output.

This means the system can run for hours without context degradation. The planner’s context is clean when it plans. The generator’s context is clean when it generates. The evaluator’s context is clean when it evaluates. No accumulated noise.


What this looks like for a .NET feature

Let me make this concrete. Say you need to add a POST /api/orders endpoint to your ASP.NET Core application. Here’s how the three agents divide the work:

The Planner analyzes your solution structure and produces something like:

1. Create OrderCreateDto in Models/
2. Create OrderService with CreateAsync method
3. Register OrderService in DI container
4. Add CreateOrder action to OrdersController
5. Create EF Core migration for Orders table
6. Write unit tests for OrderService
7. Write integration test for POST /api/orders

Each step includes the specific files to change, the patterns to follow (based on your existing code), and the acceptance criteria.
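A single step in such a plan might look like this. The format is illustrative (Anthropic hasn’t published one), and the paths and type names are placeholders:

```markdown
## Step 2: Create OrderService with CreateAsync method

Files:
- src/MyApp/Services/IOrderService.cs (new)
- src/MyApp/Services/OrderService.cs (new)

Patterns:
- Follow the async service pattern used by the existing services
- Persist via the repository abstraction, not DbContext directly

Acceptance criteria:
- CreateAsync validates the DTO and persists via IOrderRepository
- A unit test covers the empty-order-lines case
```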

The Generator picks up step 1 and writes the DTO:

public class OrderCreateDto
{
    public required string CustomerId { get; init; }
    public required List<OrderLineDto> Lines { get; init; }
}

public class OrderLineDto
{
    public required string ProductId { get; init; }
    public required int Quantity { get; init; }
}

Then step 2, the service. Then step 3, the DI registration. Each step is a focused unit of work with a clean context.
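Steps 2 and 3 might come out like this — a minimal sketch that assumes an `Order` entity and an `IOrderRepository` abstraction already exist in the project; validation and error handling are omitted:

```csharp
// Step 2: the service (minimal sketch)
public interface IOrderService
{
    Task<Order> CreateAsync(OrderCreateDto dto, CancellationToken ct = default);
}

public class OrderService(IOrderRepository repository) : IOrderService
{
    public async Task<Order> CreateAsync(OrderCreateDto dto, CancellationToken ct = default)
    {
        // Map the DTO onto the domain entity
        var order = new Order
        {
            CustomerId = dto.CustomerId,
            Lines = dto.Lines.Select(l => new OrderLine
            {
                ProductId = l.ProductId,
                Quantity = l.Quantity
            }).ToList()
        };
        return await repository.AddAsync(order, ct);
    }
}

// Step 3: DI registration in Program.cs
builder.Services.AddScoped<IOrderService, OrderService>();
```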

The Evaluator checks the output against the plan:

  • Does OrderCreateDto follow the same patterns as other DTOs in the project?
  • Does OrderService handle validation the same way as existing services?
  • Are the unit tests following xUnit conventions and the test patterns in the solution?
  • Does the integration test actually start the test server and call the endpoint?

If the evaluator finds an issue — say, the service doesn’t use the Result<T> pattern that every other service uses — it sends structured feedback back to the generator. The generator gets a fresh context with just the plan, the current code, and the specific critique. No accumulated baggage from three hours of prior work.
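If your codebase does use a `Result<T>` convention (a common pattern, not something the harness prescribes), the next generator pass might produce something like this — `Result<T>` and `MapToOrder` are hypothetical members of that convention:

```csharp
// After the evaluator's critique: return Result<T> instead of throwing,
// matching the (hypothetical) convention used by the other services.
public async Task<Result<Order>> CreateAsync(OrderCreateDto dto, CancellationToken ct = default)
{
    if (dto.Lines.Count == 0)
        return Result<Order>.Failure("An order must contain at least one line.");

    var order = await repository.AddAsync(MapToOrder(dto), ct);
    return Result<Order>.Success(order);
}
```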


The evaluator is the real breakthrough

You might look at this and think the planner is the important part. It’s not. The evaluator is what makes this work.

Anthropic’s Prithvi Rajasekaran put it clearly: “Separating the agent doing the work from the agent judging it proves a strong lever.” When one agent both generates and evaluates its own output, it tends to be generous with itself. It’s the same reason we do code reviews — a fresh pair of eyes catches things the author misses.

The evaluator can run your actual test suite. It can navigate a live page with Playwright. It can check whether the generated code compiles. It’s not just reading the output and guessing — it’s verifying.
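Concretely, an evaluator pass over a .NET solution might run checks like these — standard `dotnet` CLI commands; the exact set depends on your project:

```shell
# Does the generated code compile?
dotnet build --no-incremental

# Do the unit and integration tests pass?
dotnet test --logger "console;verbosity=normal"

# For UI work: run only the Playwright-tagged tests (if the project has them)
dotnet test --filter "Category=Playwright"
```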

For frontend work, Anthropic found that iterative cycles between the generator and evaluator typically run 5 to 15 refinements. For a UI component, that means the evaluator keeps pushing back (“the spacing is off”, “this doesn’t match the design system”, “the loading state is missing”) until the output meets the criteria. Sometimes these cycles run for four hours.

That’s four hours of productive iteration that a single agent simply cannot sustain.


What you can do today

The full three-agent harness is Anthropic’s internal framework. You don’t get a “three-agent mode” button in Claude Code today. But the patterns are available, and you can apply them right now.

Use subagents for planning. Before a big change, ask Claude Code to create a detailed plan in a separate agent. Have it analyze your solution structure, identify all files that need to change, and write out the steps. Save that plan to a file.

Use the Agent tool to analyze the solution structure and create a 
detailed implementation plan for adding a POST /api/orders endpoint. 
Include every file that needs to change, the order of changes, and 
the patterns to follow based on existing code. Save the plan to 
PLAN.md.

Use separate sessions for implementation. Instead of one marathon session, break the work into focused sessions. Each session gets the plan file as context and implements one or two steps. Fresh context, focused work.
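In practice that can be as simple as launching Claude Code fresh for each chunk of the plan, for example with the non-interactive print flag — adjust the scoping to your own workflow:

```shell
# Session 1: fresh context, steps 1-2 only
claude -p "Read PLAN.md and implement steps 1 and 2 only. Stop when done."

# Session 2: another fresh context for steps 3-4
claude -p "Read PLAN.md and implement steps 3 and 4 only. Stop when done."
```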

Use Claude Code as the evaluator. After implementation, start a new session and ask Claude to review the changes against the plan:

Read PLAN.md and review the changes in the current branch. For each 
step in the plan, verify the implementation is complete and follows 
the patterns described. Run dotnet test and report any failures.

This isn’t as seamless as an integrated three-agent system, but it captures the core insight: specialized roles with clean context outperform a single agent that tries to do everything.


The bigger picture

Multi-agent development isn’t just a performance optimization. It changes what’s possible.

A single agent can handle a task that fits in one context window — roughly 30-60 minutes of focused work. Three specialized agents with structured handoffs can sustain coherent work for hours. That’s the difference between “help me write this method” and “build this entire feature while I’m in a meeting.”

Anthropic’s agentic coding trends report highlights this as one of eight trends reshaping software development. The shift isn’t from “developer writes code” to “AI writes code.” It’s from “AI assists for small tasks” to “AI handles complete workflows with human oversight at key checkpoints.”

For .NET developers, the practical implication is clear: the features you build with AI assistance are going to get larger. The refactoring sessions longer. The test coverage more thorough. Not because you’re working harder, but because AI agents are getting better at sustained, coherent work.


Try the pattern

You don’t need to wait for an official three-agent mode. Start applying the pattern today:

  1. Plan in isolation. Ask Claude to analyze and plan before writing any code. Save the plan.
  2. Implement in focused sessions. Give each session the plan and a specific scope. Don’t let context accumulate unnecessarily.
  3. Review with fresh eyes. Start a new session to evaluate the work. Let Claude compare the output against the plan.

The next time you have a feature that spans six files and three projects, try this approach instead of one long session. You’ll notice the difference immediately — the last file gets the same quality of attention as the first.

Three agents, three context windows, one coherent feature. That’s the future of AI-assisted development, and it’s already here.