
The hidden cost in multi-agent LLM systems

By Andrew Mackay · 15 June 2026 · 6 min read

The cost nobody talks about

There is a lot of writing about how to make LLM agents more capable. There is not much writing about how much they cost to coordinate.

When people benchmark multi-agent systems, they tend to measure task performance: did the agent complete the job, how long did it take, what was the output quality. The cost of the coordination layer itself mostly goes unmeasured.

That is a problem, because the coordination layer is not free.

In a standard LangChain or CrewAI implementation, every time one agent needs to instruct another, the orchestrator writes a message in natural language. Something like:

"Please retrieve the employee salary records for the period ending March 2026. I need all active employees, their departments, cost centres, base salary, any changes made this month, and pension contribution rates. Return as JSON array."

That message passes through the LLM. The receiving agent processes it. A response comes back. Then the next instruction goes out, through the LLM again.

Each of those instruction hops is a separate LLM call.

What 59 coordination hops look like

To make this concrete, I modelled a standard department day across five common workflows: JML onboarding for a new hire, monthly payroll, sales qualification, customer service resolution, and month-end financial close.

Across those five workflows, there are 59 agent-to-agent coordination hops.

In a default LangChain implementation, all 59 go through the LLM. Before a single unit of real task work has been done, 59 LLM calls have already happened. At current pricing, that is not catastrophic for a single run. But it compounds across runs, across teams, and across time. Run that department day every day for a year and it comes to 59 × 365 = 21,535 coordination calls. None of them does reasoning work. All of them are just routing.
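The annual figure is simply the daily hop count multiplied out. A minimal sketch of that arithmetic follows; the `cost_per_call` parameter is a placeholder for your own measured average from billing data, not a figure from this post.

```python
HOPS_PER_DAY = 59  # coordination hops across the five modelled workflows


def annual_coordination_calls(hops_per_day: int, days: int = 365) -> int:
    """Coordination LLM calls per year if every hop goes through the LLM."""
    return hops_per_day * days


def annual_coordination_cost(
    hops_per_day: int, cost_per_call: float, days: int = 365
) -> float:
    """Annual coordination spend; cost_per_call is your own measured average."""
    return annual_coordination_calls(hops_per_day, days) * cost_per_call


print(annual_coordination_calls(HOPS_PER_DAY))  # 21535
```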

Why this happens

The reason is structural, not a mistake.

LLM agent frameworks are designed to be general. They make the LLM responsible for generating coordination messages because that is the simplest approach and it works for any workflow, including ones nobody has seen before.

The cost of that generality is that even known, repetitive workflows pay the same LLM call price on every run. A payroll workflow that runs monthly and follows the same steps every time generates the same coordination calls in January as it did in December. The LLM writes slightly different natural language each time. The instruction means the same thing. The agent on the receiving end interprets it correctly.

But the workflow paid for a new LLM call to produce a message that could have been generated deterministically.
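To show what "generated deterministically" can mean in practice, here is a rough sketch that builds the salary-records instruction from a template instead of an LLM call. The field names, agent name, and packet shape are invented for illustration and are not the AACP wire format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class CoordinationPacket:
    # Hypothetical fields for illustration only; not the AACP specification.
    workflow: str
    action: str
    target_agent: str
    params: dict = field(default_factory=dict)


def build_payroll_fetch(period_end: str) -> CoordinationPacket:
    """Build the 'fetch salary records' instruction without calling an LLM."""
    return CoordinationPacket(
        workflow="monthly_payroll",
        action="fetch_salary_records",
        target_agent="hr_data_agent",  # assumed agent name
        params={
            "period_end": period_end,
            "employee_status": "active",
            "fields": [
                "department",
                "cost_centre",
                "base_salary",
                "changes_this_month",
                "pension_contribution_rate",
            ],
            "format": "json_array",
        },
    )


def serialize(packet: CoordinationPacket) -> str:
    """Stable serialisation: the same inputs always give identical text."""
    return json.dumps(asdict(packet), sort_keys=True, separators=(",", ":"))


print(serialize(build_payroll_fetch("2026-03-31")))
```

The only thing that varies between runs here is `period_end`, which comes from the workflow definition rather than from anything a model has to reason about.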

The coordination versus task distinction

This matters because not all LLM calls are equivalent.

Task calls, where the agent is doing the actual work (reading a contract, drafting a summary, qualifying a sales lead), involve genuine reasoning. The quality of the model matters. You want the LLM doing that work.

Coordination calls, where one agent tells another what to retrieve or where to route a result, are structural. The instruction follows a pattern. The content is determined by the workflow definition, not by reasoning in the moment. The LLM is not adding value here. It is just generating a format.

Once you draw that line, the coordination layer starts to look like an optimisation target.

What the numbers look like

In the same five-workflow scenario, replacing LLM-generated coordination with deterministic AACP packets produced the following results, measured from live API usage_metadata:

LangChain: 59 coordination calls eliminated. 18% total workflow cost reduction.
CrewAI: 59 coordination calls eliminated. 30% total workflow cost reduction.

The CrewAI saving is larger because CrewAI's default natural language task descriptions are more verbose per hop, so the per-hop reduction is higher.

Neither saving comes from the task calls. Those stay exactly the same. The agents are doing the same reasoning work, accessing the same data, producing the same outputs. The only thing that changed is the format of the instructions that coordinate them.

One more number: the token reduction on coordination messages themselves is approximately 23% versus equivalent English, measured across a four-hop payroll workflow on Claude Sonnet 4.5 and GPT-4o. That is a secondary effect. The elimination of the LLM call entirely is the primary one. The full benchmark table is on the case page.

What this does not fix

To be direct about the limits:

This only applies to the coordination layer. Task tokens are not affected. If your agents spend most of their token budget on task work and very little on coordination, the saving is proportionally small.
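The proportionality can be written as a one-line model. This is a simplification that assumes task cost is unchanged and that the saving scales linearly with the share of spend going to coordination; the shares in the example are illustrative, not measurements.

```python
def total_saving(coordination_share: float, coordination_reduction: float = 1.0) -> float:
    """Fraction of total workflow cost saved.

    coordination_share: fraction of total cost spent on coordination (0..1).
    coordination_reduction: fraction of that coordination cost removed (0..1).
    """
    for name, value in (
        ("coordination_share", coordination_share),
        ("coordination_reduction", coordination_reduction),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return coordination_share * coordination_reduction


# Illustrative shares only: a task-heavy pipeline versus a coordination-heavy one.
print(total_saving(0.05))  # 0.05
print(total_saving(0.30))  # 0.3
```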

It also only applies to repetitive, known workflows. Novel instructions, workflows the system has not seen before, still need to route through an LLM the first time. The saving comes from recognising that most enterprise workflows run the same steps repeatedly and do not need to generate new coordination messages from scratch every time.
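One way to picture that split is a small router that serves known steps from templates and falls back to an LLM for anything it has not seen. This is a sketch under assumptions: the class, the step names, and the explicit `register` call are hypothetical, and `fake_llm` is a stand-in for a real model call so the example runs offline.

```python
from typing import Callable, Dict

Packet = dict
TemplateFn = Callable[[dict], Packet]
LLMFallback = Callable[[str, dict], Packet]


class CoordinationRouter:
    """Serve known steps from templates; send unknown steps to an LLM."""

    def __init__(self, llm_fallback: LLMFallback) -> None:
        self._templates: Dict[str, TemplateFn] = {}
        self._llm_fallback = llm_fallback
        self.llm_calls = 0  # how many instructions needed the LLM

    def register(self, step: str, template: TemplateFn) -> None:
        self._templates[step] = template

    def instruct(self, step: str, params: dict) -> Packet:
        template = self._templates.get(step)
        if template is not None:
            return template(params)  # deterministic path, no LLM call
        self.llm_calls += 1
        return self._llm_fallback(step, params)


def fake_llm(step: str, params: dict) -> Packet:
    # Stand-in for a real model call, so the sketch runs offline.
    return {"action": step, "params": params, "source": "llm"}


router = CoordinationRouter(llm_fallback=fake_llm)
router.register(
    "fetch_salary_records",
    lambda p: {"action": "fetch_salary_records", "params": p, "source": "template"},
)

known = router.instruct("fetch_salary_records", {"period_end": "2026-03-31"})
novel = router.instruct("reconcile_bonus_pool", {"quarter": "2026-Q1"})
print(known["source"], novel["source"], router.llm_calls)  # template llm 1
```

How a novel step gets promoted to a template after its first run is left out here; in this sketch it is a manual `register` call.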

And it does not change what the agents do. The same payroll data gets fetched, the same finance agent runs the calculation, the same result goes to the audit log. The coordination layer becomes cheaper and more structured. The task layer is untouched.

The auditable side effect

There is a second benefit that turns out to matter more than the cost saving in some contexts.

When coordination messages are natural language, the audit trail is a log of slightly different English sentences. You can read them, but you cannot reliably parse them, query them, or diff them across runs. Suppose something changed between last month's payroll run and this month's. Which agent received a different instruction? You cannot easily tell.

When coordination messages are typed packets, the audit trail is structured data. Machine-readable. Queryable. Diffable. For the same workflow and the same inputs, every packet is identical across runs. If something changes, the change is visible in the packet diff, not buried in a paragraph of generated prose.
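As a rough illustration of what a packet diff could look like, the sketch below compares two runs field by field. The packet contents are made up for the example and the dict layout is an assumption, not the AACP format.

```python
def diff_packets(previous: dict, current: dict, prefix: str = "") -> list[str]:
    """Return human-readable field-level changes between two packets."""
    changes: list[str] = []
    for key in sorted(set(previous) | set(current)):
        path = f"{prefix}{key}"
        if key not in previous:
            changes.append(f"+ {path} = {current[key]!r}")
        elif key not in current:
            changes.append(f"- {path} = {previous[key]!r}")
        elif isinstance(previous[key], dict) and isinstance(current[key], dict):
            # Recurse into nested parameter blocks.
            changes.extend(diff_packets(previous[key], current[key], f"{path}."))
        elif previous[key] != current[key]:
            changes.append(f"~ {path}: {previous[key]!r} -> {current[key]!r}")
    return changes


# Hypothetical packets from two consecutive payroll runs.
last_month = {
    "action": "fetch_salary_records",
    "params": {"period_end": "2026-02-28", "employee_status": "active"},
}
this_month = {
    "action": "fetch_salary_records",
    "params": {"period_end": "2026-03-31", "employee_status": "all"},
}

for line in diff_packets(last_month, this_month):
    print(line)
# ~ params.employee_status: 'active' -> 'all'
# ~ params.period_end: '2026-02-28' -> '2026-03-31'
```

The expected change (the period) and the unexpected one (the employee filter) both show up as single lines, which is the kind of query that is hard to run against free-form prose.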

That property is not relevant for all systems. For compliance-sensitive pipelines, it is the more important benefit.
