The Lab
Working multi-agent workflows validated across four LLMs. Real data. Measured results. Open source.
What the lab does
The lab runs five real business workflows using AACP v1.4 packets as the coordination layer between specialist agents. All data is read from CSV files. All output is written to a formatted Excel workbook. All agent communication is via AACP packets — none of the coordination is natural language.
Workflows: Payroll · Month-End Close · Sales Qualification · CS Resolution · JML Onboarding
Validated comparison — Q1 FY2026 close
All four models ran all five workflows. All 76 AACP packets validated per model. Cheapest model first.
| Model | Cost | Tokens in | Latency | Pass |
|---|---|---|---|---|
| gpt-4.1-mini | $0.0190 | 40,585 | 160s | ✓ 5/5 |
| gpt-4.1 | $0.0984 | 41,005 | 124s | ✓ 5/5 |
| claude-sonnet-4-5 | $0.2237 | 53,127 | 423s | ✓ 5/5 |
| gpt-4o | $0.2408 | 40,459 | 101s | ✓ 5/5 |
Q1 FY2026 close. 5 workflows. 76 hops per model. All packets validated against AACP v1.4 schema. June 2026.
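To make the spread in the table easier to read, the figures can be turned into simple ratios. The sketch below is illustrative only: the numbers are copied from the table above, while the names `RESULTS`, `cost_multiples`, and `fastest` are hypothetical and not part of the lab's code.

```python
# Figures copied from the comparison table above (Q1 FY2026 close).
# The dictionary layout and function names are assumptions for illustration.
RESULTS = {
    "gpt-4.1-mini": {"cost_usd": 0.0190, "tokens_in": 40_585, "latency_s": 160},
    "gpt-4.1": {"cost_usd": 0.0984, "tokens_in": 41_005, "latency_s": 124},
    "claude-sonnet-4-5": {"cost_usd": 0.2237, "tokens_in": 53_127, "latency_s": 423},
    "gpt-4o": {"cost_usd": 0.2408, "tokens_in": 40_459, "latency_s": 101},
}


def cost_multiples(results):
    """Return each model's run cost as a multiple of the cheapest run."""
    cheapest = min(r["cost_usd"] for r in results.values())
    return {name: round(r["cost_usd"] / cheapest, 1) for name, r in results.items()}


def fastest(results):
    """Return the name of the model with the lowest total latency."""
    return min(results, key=lambda name: results[name]["latency_s"])


if __name__ == "__main__":
    for name, multiple in cost_multiples(RESULTS).items():
        print(f"{name}: {multiple}x the cheapest run")
    print("fastest:", fastest(RESULTS))
```

On these figures the cheapest model is not the fastest one, so cost and latency have to be weighed separately.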
Which model for which workflow
Based on measured output quality, not just cost.
| Workflow | Best model | Reason |
|---|---|---|
| JML Onboarding | gpt-4.1-mini | Deterministic tasks. Any model works. Lowest cost. |
| Sales Qualification | gpt-4.1 | Best scoring judgment on borderline leads. |
| CS Resolution | gpt-4.1 | Best goodwill decisions by LTV. Conservative where appropriate. |
| Month-End Close | gpt-4.1 | Caught 3 material variances. gpt-4o caught only 1. |
| Payroll | gpt-4.1 | Reliable anomaly detection. 5/5 hops every run. |
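One way to act on the table above is a per-workflow routing map. This is a minimal sketch, assuming workflows are identified by the names used on this page; `BEST_MODEL` and `model_for` are hypothetical names, not the lab's actual configuration.

```python
# Workflow -> recommended model, copied from the table above.
# The mapping structure and function name are assumptions for illustration.
BEST_MODEL = {
    "JML Onboarding": "gpt-4.1-mini",
    "Sales Qualification": "gpt-4.1",
    "CS Resolution": "gpt-4.1",
    "Month-End Close": "gpt-4.1",
    "Payroll": "gpt-4.1",
}


def model_for(workflow: str) -> str:
    """Look up the recommended model, failing loudly on unknown workflows."""
    try:
        return BEST_MODEL[workflow]
    except KeyError:
        raise KeyError(f"no recommendation recorded for workflow {workflow!r}") from None


if __name__ == "__main__":
    for workflow in BEST_MODEL:
        print(f"{workflow}: {model_for(workflow)}")
```

Raising on an unknown workflow, rather than falling back to a default, avoids silently running an unmeasured combination.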
Workflows
Payroll
- data: 8 employees, 4 cost centres
- agents: HR-Agent, Finance-Agent, Audit-Agent
- basis: HMRC PAYE, UK payroll practice
All models flagged Engineering CC-10 at 90%+ utilisation.
Month-End Close
- data: GL trial balance, bank data
- agents: Finance-Agent, Audit-Agent
- basis: NetSuite Autonomous Close 2026
gpt-4.1 and claude-sonnet-4-5 each caught 3 material variances; gpt-4o caught only 1.
Sales Qualification
- data: 5 leads with BANT scoring data
- agents: Sales-Agent, Audit-Agent
- basis: Salesforce Agentforce 2026
CoreTech (CEO, £120k budget) scored 97.5 on all models. Delta (not engaged) correctly rejected on all models.
CS Resolution
- data: 5 tickets with LTV and sentiment data
- agents: CS-Agent, Audit-Agent
- basis: Zendesk Resolution Platform 2026
gpt-4.1 made the most commercially sensible goodwill decisions: generous with high-LTV customers, conservative with low-LTV ones.
JML Onboarding
- data: 3 new hires with roles and system requirements
- agents: HR-Agent, IT-Agent, Audit-Agent
- basis: ConductorOne, Lumos, CloudEagle 2025-2026
All 3 hires provisioned on all 4 models. Perfect pass rate. gpt-4.1-mini sufficient.
Honest framing
These results measure agent task execution, not protocol overhead. Every coordination message was an AACP v1.4 packet. Token counts cover the full agent conversation, including system prompts and data payloads, not just coordination tokens.
The lab is open source. Results are reproducible. Comparison JSON and full workflow output available on GitHub.
What the lab produces
Each run generates a formatted Excel workbook with six sheets:
| Sheet | Contents |
|---|---|
| Summary | All workflows, costs, success rate, model |
| Payroll | Employee pay breakdown, anomalies, CC status |
| Sales Pipeline | Lead scores, BANT breakdown, routing decisions |
| CS Resolution | Ticket outcomes, goodwill offers, amounts |
| JML Onboarding | New hire provisioning status, systems granted |
| Month-End Close | Reconciliation results, material variances |
The workbook is colour-coded: red for breaches, amber for warnings, green for pass. Produced by agent coordination. No manual formatting required.
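The colour-coding can be reproduced with openpyxl's `PatternFill`. The sketch below is not the lab's implementation: the status labels, hex colours, sheet layout, and the names `STATUS_COLOURS`, `colour_for`, and `write_status_sheet` are all assumptions chosen for illustration.

```python
# Hypothetical status labels and hex colours; the lab's actual values may differ.
STATUS_COLOURS = {
    "breach": "FFC7CE",   # red
    "warning": "FFEB9C",  # amber
    "pass": "C6EFCE",     # green
}


def colour_for(status: str) -> str:
    """Map a status label (case-insensitive) to its fill colour."""
    return STATUS_COLOURS[status.lower()]


def write_status_sheet(path, rows):
    """Write (item, status) rows to a workbook and shade each status cell."""
    # Imported here so the colour mapping above works without openpyxl installed.
    from openpyxl import Workbook
    from openpyxl.styles import PatternFill

    wb = Workbook()
    ws = wb.active
    ws.title = "Summary"
    ws.append(["Item", "Status"])
    for item, status in rows:
        ws.append([item, status])
        hex_colour = colour_for(status)
        # Shade the status cell of the row that was just appended.
        ws.cell(row=ws.max_row, column=2).fill = PatternFill(
            start_color=hex_colour, end_color=hex_colour, fill_type="solid"
        )
    wb.save(path)
```

Calling `write_status_sheet("demo.xlsx", [("Example check", "pass"), ("Example limit", "warning")])` would write a two-row sheet with green and amber status cells.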
Run it yourself
Requires: pip install anthropic openai openpyxl
Estimated cost per full run: $0.02 to $0.25, depending on model.