I tested AACP against four agent frameworks. Here is what I found.
The setup
A few weeks ago I published an article about building AACP, a typed coordination protocol for multi-agent LLM systems. The premise was simple: agents currently coordinate in natural language, which is verbose, non-deterministic, and produces no reliable audit trail. AACP replaces that with typed, pipe-delimited packets.
The first article measured token reduction on a single payroll workflow. The finding was a 22.9% reduction on Claude and 23.7% on GPT-4o. Interesting, but limited to one framework and one workflow type.
So I ran a broader benchmark. Same workflows, same data, same model, four frameworks. 59 coordination hops each. Here is what I found.
The benchmark
The test case is a five-workflow department day: JML onboarding, payroll, sales lead qualification, customer service resolution, and month-end close. Three iterations of most workflows. 59 total coordination hops.
For each framework I ran two versions:
- Without AACP: the orchestrator writes natural language instructions to specialist agents in whatever style that framework uses by default.
- With AACP: the orchestrator sends typed packets produced by a rule-based encoder, so coordination costs no LLM calls.
Same agents. Same mock data. Same gpt-4o-mini model throughout.
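To make the "with AACP" path concrete, here is a minimal sketch of what a rule-based encoder could look like. The field layout, the `RULES` table, the `Packet` class and the `encode` function are all assumptions made for illustration; they are not the AACP wire format or the SDK's API, which are defined in the spec.

```python
from dataclasses import dataclass

# Hypothetical rule table: (workflow, step) -> (target agent, action, required params).
# The real RuleRegistry format is not shown in this article.
RULES = {
    ("payroll", "fetch_salaries"): ("HR", "GET_SALARY_RECORDS", ("period",)),
    ("payroll", "run_calc"): ("FINANCE", "CALC_PAYROLL", ("period", "currency")),
}


@dataclass(frozen=True)
class Packet:
    target: str
    action: str
    params: tuple  # (key, value) pairs, sorted by key so output is stable

    def to_wire(self) -> str:
        # Illustrative pipe-delimited layout, not the official one.
        fields = [self.target, self.action] + [f"{k}={v}" for k, v in self.params]
        return "|".join(fields)


def encode(workflow: str, step: str, **params: str) -> Packet:
    """Build a packet for a known workflow step using a table lookup, no LLM call."""
    target, action, required = RULES[(workflow, step)]
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError(f"missing params: {missing}")
    return Packet(target, action, tuple(sorted(params.items())))


packet = encode("payroll", "fetch_salaries", period="2026-03")
print(packet.to_wire())  # HR|GET_SALARY_RECORDS|period=2026-03
```

Because the encoder is a lookup plus string formatting, the same inputs always produce the same packet, which is why the coordination step needs no model call for workflows that already have a rule.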
The results
Framework      Hops   Coord. calls   Total saving
────────────────────────────────────────────────────
LangChain      59     59 → 0         18%
CrewAI         59     59 → 0         30%
AutoGen        59     59 → 0         55%
Pydantic AI    59     59 → 0         85%
In every case, all 59 coordination LLM calls were eliminated for known workflows. The difference is in how much the agent cost dropped alongside the coordination cost.
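One way to read the "Total saving" column is as coordination cost removed plus agent cost reduced, divided by the baseline total. The sketch below assumes that decomposition; the function name and the token counts are hypothetical and are not the benchmark's raw data.

```python
def total_saving(coord_before: float, agent_before: float, agent_after: float) -> float:
    """Fractional saving when coordination LLM calls drop to zero.

    Assumes total cost = coordination cost + agent cost, and that the
    coordination term disappears entirely with rule-based encoding.
    """
    before = coord_before + agent_before
    after = agent_after  # coordination term is zero with AACP
    return (before - after) / before


# Hypothetical token counts, for illustration only.
print(round(total_saving(coord_before=10.0, agent_before=90.0, agent_after=72.0), 2))  # 0.28
```

Under this reading, two frameworks can both go from 59 coordination calls to 0 and still show very different totals, because the agent-side term moves by different amounts.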
Why the numbers are so different
This is the part I found genuinely interesting.
LangChain at 18%. LangChain's default coordination is task-based. The orchestrator writes a relatively concise description of what to do. AACP packets are more compact still, so there is a token saving on the coordination side, and the agents process the packets slightly more efficiently. The 18% reflects a modest improvement in a framework that was already reasonably tight.
CrewAI at 30%. CrewAI assigns tasks to agents with explicit role and goal context baked into each message. The natural language instructions are more verbose than LangChain's. More verbosity in the baseline means a larger proportional saving when you replace it with a compact packet.
AutoGen at 55%. AutoGen uses a GroupChat model where agents exchange conversational messages. Without AACP, the orchestrator writes something like:
"Hi HR Agent, I need you to retrieve all active employee salary records for the period ending March 2026. Please include each employee's department, cost centre, base salary, any changes applied this month, and their pension contribution rate. Return the data as a JSON array. Let me know if you need any clarification."
That is the coordination overhead AACP eliminates. The conversational preamble, the clarification offer, the verbose field list, all of it goes away. The agent receives a packet and acts on it.
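As a rough illustration of the size difference, the sketch below compares the message above with a packet that carries the same request. The packet string is a hypothetical layout, not the official AACP encoding, and character count is only a crude proxy for tokens, so the ratio it prints is not the benchmark figure.

```python
natural = (
    "Hi HR Agent, I need you to retrieve all active employee salary records "
    "for the period ending March 2026. Please include each employee's department, "
    "cost centre, base salary, any changes applied this month, and their pension "
    "contribution rate. Return the data as a JSON array. Let me know if you need "
    "any clarification."
)

# Hypothetical packet carrying the same intent (illustrative field names).
packet = (
    "HR|GET_SALARY_RECORDS|period=2026-03"
    "|fields=dept,cost_centre,base,changes,pension_rate|fmt=json"
)


def reduction(before: str, after: str) -> float:
    """Fractional reduction in length, using characters as a stand-in for tokens."""
    return 1 - len(after) / len(before)


print(f"{len(natural)} chars -> {len(packet)} chars, reduction {reduction(natural, packet):.0%}")
```

A real measurement would count tokens with the model's own tokenizer and include the agent's response, which is what the benchmark totals capture.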
Pydantic AI at 85%. This one required a different framing to understand. Pydantic AI already types the result layer. Agents return validated Pydantic models, not free-form text. That is genuinely good; the output of every agent call is structured and reliable.
But the instruction layer is still natural language. The orchestrator still writes English to tell agents what to do. The result is typed; the instruction is not.
AACP completes the picture. With AACP, the orchestrator sends a typed packet. The agent processes it and returns a typed Pydantic model. Both layers are now deterministic and validated. Part of the 85% saving comes from verbosity: Pydantic AI's natural language coordination is not conversational like AutoGen's, but it is still verbose enough that compact AACP packets cut agent cost significantly. The more important point is that the full stack, instructions and results, is now typed end to end.
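A minimal sketch of "typed end to end", using stdlib dataclasses as a stand-in for Pydantic models so that it runs without dependencies. The class names, fields and mock data are hypothetical; in a real Pydantic AI setup the result type would be a Pydantic model and the request would arrive as an AACP packet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalaryRequest:
    """Typed instruction: the layer that is usually free-form English."""
    period: str  # expected as "YYYY-MM"

    def __post_init__(self):
        year, sep, month = self.period.partition("-")
        valid = (
            sep == "-"
            and len(year) == 4 and year.isdigit()
            and len(month) == 2 and month.isdigit()
            and 1 <= int(month) <= 12
        )
        if not valid:
            raise ValueError(f"bad period: {self.period!r}")


@dataclass(frozen=True)
class SalaryRecord:
    """Typed result: the layer Pydantic AI already validates."""
    employee_id: str
    base_salary: float

    def __post_init__(self):
        if self.base_salary < 0:
            raise ValueError("base_salary must be non-negative")


# Mock data, standing in for whatever the HR agent would look up.
MOCK_DB = {"2026-03": [("E001", 4200.0), ("E002", 5100.0)]}


def hr_agent(request: SalaryRequest) -> list[SalaryRecord]:
    rows = MOCK_DB.get(request.period, [])
    return [SalaryRecord(emp, salary) for emp, salary in rows]


records = hr_agent(SalaryRequest(period="2026-03"))
print(len(records))  # 2
```

A malformed instruction such as `SalaryRequest(period="March 2026")` fails at construction time, before any agent runs, which is the property the instruction layer lacks when it is plain English.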
The pattern
The saving is not random. It scales with how verbose the framework's default coordination layer is:
More conversational default → larger AACP saving
Less conversational default → smaller AACP saving
This makes sense. AACP is most valuable exactly where coordination overhead is highest. The frameworks that encourage rich, natural language agent-to-agent communication benefit most from a typed replacement.
What the cost numbers actually mean
At gpt-4o-mini scale the absolute dollar differences are small. A 59-hop department day costs fractions of a cent either way. The cost argument becomes significant at volume: workflows running hundreds or thousands of times per day across enterprise pipelines, where coordination is a meaningful fraction of total cost.
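The volume argument is simple arithmetic. The sketch below scales a per-run cost to a year; the per-run cost and run count are hypothetical placeholders, and only the 30% saving fraction is taken from the results above (the CrewAI row).

```python
def annual_saving(cost_per_run_usd: float, runs_per_day: int,
                  saving_fraction: float, days: int = 365) -> float:
    """Yearly saving in USD for a workflow that runs a fixed number of times per day."""
    return cost_per_run_usd * runs_per_day * days * saving_fraction


# Hypothetical inputs: a fifth of a cent per run, 10,000 runs per day.
saving = annual_saving(cost_per_run_usd=0.002, runs_per_day=10_000, saving_fraction=0.30)
print(f"${saving:,.0f} per year")
```

The same function with `runs_per_day=1` shows why a single department day is not a compelling cost case on its own.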
The more immediately useful properties are determinism and auditability. Natural language coordination varies on every run. The same intent produces different messages. Validation before transmission is impossible. Audit trails require natural language processing to extract meaning.
AACP packets are identical across runs. They validate against a schema before transmission. Every packet is a machine-readable audit record at zero additional cost.
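A minimal sketch of validate-then-log, assuming a simplified packet grammar of `TARGET|ACTION|key=value|...`. The regular expression, the `validate` and `audit_record` functions, and the record fields are hypothetical; the actual schema is defined in the AACP spec.

```python
import hashlib
import json
import re

# Hypothetical grammar: uppercase target, uppercase action, then key=value fields.
PACKET_RE = re.compile(r"[A-Z]+\|[A-Z_]+(\|[a-z_]+=[^|]+)*")


def validate(wire: str) -> None:
    """Reject anything that does not match the assumed packet grammar."""
    if not PACKET_RE.fullmatch(wire):
        raise ValueError(f"invalid packet: {wire!r}")


def audit_record(hop: int, wire: str) -> str:
    """Validate before transmission, then return a JSON audit line for the packet."""
    validate(wire)
    digest = hashlib.sha256(wire.encode("utf-8")).hexdigest()
    return json.dumps({"hop": hop, "packet": wire, "sha256": digest}, sort_keys=True)


first = audit_record(1, "HR|GET_SALARY_RECORDS|period=2026-03")
second = audit_record(1, "HR|GET_SALARY_RECORDS|period=2026-03")
print(first == second)  # True
```

Because the packet is the audit record, two runs of the same workflow can be compared by string equality or by hash, with no language processing step. A natural language message passed to `validate` raises `ValueError` instead of being sent.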
The typed-framework case
The Pydantic AI result points at something broader. As agent frameworks mature, more of them will move toward typed results. Pydantic AI is ahead of the curve here, but structured outputs are increasingly common across all frameworks.
The instruction side has not followed at the same pace. Orchestrators still write natural language to coordinate agents, even when those agents return typed results. That is a gap AACP is designed to fill.
What is available now
Four framework integrations, all benchmarked at 59 hops:
pip install aacp-langchain      # 18% saving
pip install aacp-crewai         # 30% saving
pip install aacp-autogen        # 55% saving (Python 3.11)
pip install aacp-pydantic-ai    # 85% saving
pip install aacp                # core SDK, encoders, RuleRegistry
241 pre-validated community rules for common business workflows at registry.aacp.dev. IETF Internet-Draft draft-mackay-aacp-03. Full spec on this site.
What I would like to know
These benchmarks used mock data on a single model. I am interested in whether the results hold in production environments with real data, longer workflows, and different model combinations.
Specifically:
- Does the determinism property matter in practice, or is variation in coordination messages genuinely harmless?
- Where does the typed packet format break down? Which coordination patterns cannot be expressed in the AACP vocabulary?
- Does the Pydantic AI result, completing the typed stack, reflect something you have felt as a gap in your own systems?
If you are running multi-agent workflows in production, I would genuinely like to hear what you find.
Related reading
- First article: I built a coordination protocol for multi-agent LLM systems
- The cost layer: The hidden cost in multi-agent LLM systems