Multi-Agent Orchestration: The Complete Implementation Guide
Build a production-ready multi-agent system from scratch using the OpenAI Agents SDK, with real patterns for handoffs, tool management, and failure recovery.
TL;DR
Single-agent systems hit complexity walls fast. When you ask one agent to handle research, code generation, data analysis, and workflow orchestration, you get confused outputs, token bloat, and unreliable results. Multi-agent orchestration solves this by delegating work to specialized agents, each with focused tools and clear responsibilities, coordinated by an orchestrator that routes tasks and manages handoffs.
This guide walks through building a production-ready multi-agent system using the OpenAI Agents SDK, with real patterns from our implementation at Athenic, where we orchestrate research, development, analysis, and partnership agents handling thousands of tasks monthly.
Key takeaways
- Orchestrators route work based on task classification; specialized agents execute with domain tools.
- Handoffs require explicit state serialization; don't rely on implicit context sharing.
- Build failure isolation: one agent's crash shouldn't cascade to others.
- Start simple with 2-3 agents; add complexity only when you hit proven bottlenecks.
Single-agent architectures struggle with three core problems:
Tool overload: When one agent has access to 50+ tools (database queries, API calls, code execution, web search), it spends more tokens reasoning about which tool to use than how to solve the problem. Research from Anthropic's alignment team shows that tool-selection accuracy drops 23% when agents have more than 15 tools available simultaneously (Anthropic, 2024).
Competing instruction sets: Agents optimized for creative tasks (like content generation) need different system prompts than analytical tasks (like SQL query construction). Trying to make one agent "good at everything" produces mediocre results across the board.
Cascading failures: If a single agent crashes or hallucinates mid-task, the entire workflow fails. Multi-agent systems isolate failures to individual agents, allowing graceful degradation.
According to The State of AI Engineering 2024 report, 68% of production AI systems now use multi-agent orchestration, up from 31% in 2023 (AI Engineering Summit, 2024).
Two dominant patterns exist for multi-agent orchestration: hub-and-spoke and mesh.
A central orchestrator agent routes tasks to specialized agents. Agents never communicate directly; all coordination flows through the orchestrator.
Pros:
Cons:
Use when: You have 3-10 specialized agents with distinct responsibilities.
Agents communicate peer-to-peer using a shared message bus. Each agent subscribes to relevant event types.
Pros:
Cons:
Use when: You need high throughput (>1,000 concurrent tasks) or resilient distributed systems.
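To make the mesh pattern concrete, here's a minimal in-process event-bus sketch. It's illustrative only: a real deployment would sit on something like Redis Streams, NATS, or Kafka, and the event names and agent references here are placeholders, not part of the SDK.

```typescript
// Minimal in-process event bus: each agent subscribes to the event types it cares about.
type AgentEvent = { type: string; payload: unknown; sourceAgent: string };
type Handler = (event: AgentEvent) => Promise<void>;

class MessageBus {
  private subscribers = new Map<string, Handler[]>();

  subscribe(eventType: string, handler: Handler) {
    const handlers = this.subscribers.get(eventType) ?? [];
    this.subscribers.set(eventType, [...handlers, handler]);
  }

  async publish(event: AgentEvent) {
    const handlers = this.subscribers.get(event.type) ?? [];
    await Promise.all(handlers.map((h) => h(event)));
  }
}

// Example: a qualification agent reacts whenever discovery emits candidates.
const bus = new MessageBus();
bus.subscribe('candidates.found', async (event) => {
  console.log('Qualifying candidates published by', event.sourceAgent);
});
```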
At Athenic, we use hub-and-spoke for 90% of workflows. Only our real-time partnership discovery system uses mesh architecture due to volume constraints.
[Figure: two topologies side by side. Hub-and-spoke: an Orchestrator node connected to Research, Developer, Analysis, and Partnership agents. Mesh: the same four agents connected peer-to-peer, with no central node.]
We'll build a 3-agent system: an orchestrator, a research agent, and a developer agent. The orchestrator classifies incoming requests and routes them appropriately.
Create a clear responsibility matrix before writing code.
| Agent | Responsibility | Tools | Handoff triggers |
|---|---|---|---|
| Orchestrator | Task classification, routing, result aggregation | None (routing only) | Always starts workflow |
| Research Agent | Web search, document analysis, competitor intel | Web search, PDF parser, Apollo API | When task contains "research", "find", "analyze market" |
| Developer Agent | Code generation, technical implementation, debugging | Code interpreter, GitHub API, terminal | When task contains "build", "implement", "fix bug" |
Write this down first. Fuzzy boundaries between agents create handoff thrashing.
The orchestrator needs a classification function and handoff logic.
import { Agent } from '@openai/agents';
import { researchAgent } from './agents/research';
import { developerAgent } from './agents/developer';
const orchestrator = new Agent({
name: 'orchestrator',
instructions: `You are a task router. Analyze user requests and delegate to specialized agents:
- Research Agent: market analysis, competitor research, data gathering
- Developer Agent: code generation, debugging, technical implementation
Respond with JSON: { "agent": "research" | "developer", "context": "task details" }`,
model: 'gpt-4o',
});
async function handleRequest(userMessage: string) {
const classification = await orchestrator.run({
messages: [{ role: 'user', content: userMessage }],
});
const { agent, context } = JSON.parse(classification.content);
// Route to appropriate agent
if (agent === 'research') {
return await researchAgent.run({
messages: [{ role: 'user', content: context }],
});
} else if (agent === 'developer') {
return await developerAgent.run({
messages: [{ role: 'user', content: context }],
});
  }

  // Neither label matched: fail loudly instead of silently returning undefined
  throw new Error(`Unknown agent classification: ${agent}`);
}
Key principle: The orchestrator doesn't solve problems; it routes. Keep instructions minimal.
Each specialized agent gets a focused instruction set and specific tools.
import { Agent } from '@openai/agents';
import { webSearchTool } from './tools/web-search';
import { apolloTool } from './tools/apollo';
export const researchAgent = new Agent({
name: 'research',
instructions: `You are a research analyst. Use web search and Apollo to:
1. Find relevant data sources
2. Extract key insights
3. Summarize findings with citations
Always cite sources. If data is unavailable, say so explicitly.`,
model: 'gpt-4o',
tools: [webSearchTool, apolloTool],
});
Tool design tip: Limit each agent to 5-8 tools max. If you need more, split into sub-agents.
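For reference, here's roughly what one of those tool modules might look like. This is a sketch, assuming the SDK's `tool()` helper with a Zod parameter schema (check the current Agents SDK docs for the exact signature); the search endpoint is a placeholder.

```typescript
import { tool } from '@openai/agents';
import { z } from 'zod';

// Hypothetical ./tools/web-search module: one narrow capability, clearly described.
export const webSearchTool = tool({
  name: 'web_search',
  description: 'Search the web and return the top results as plain text.',
  parameters: z.object({
    query: z.string().describe('The search query'),
  }),
  execute: async ({ query }) => {
    // Placeholder endpoint: swap in your actual search backend.
    const res = await fetch(`https://search.example.com/api?q=${encodeURIComponent(query)}`);
    return await res.text();
  },
});
```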
Handoffs require explicit state serialization. Don't rely on agents "knowing" what happened previously.
interface HandoffContext {
fromAgent: string;
toAgent: string;
userRequest: string;
previousResults: Record<string, any>;
metadata: {
timestamp: string;
sessionId: string;
};
}
async function handoff(context: HandoffContext) {
const handoffMessage = `
Task: ${context.userRequest}
Previous work by ${context.fromAgent}:
${JSON.stringify(context.previousResults, null, 2)}
Your role: Complete the next phase of this workflow.
`;
return await getAgent(context.toAgent).run({
messages: [{ role: 'user', content: handoffMessage }],
});
}
We learned this the hard way: early versions of our system assumed agents could "see" previous agent outputs via shared context. They couldn't. Explicit handoff messages increased success rates from 64% to 91%.
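To make that concrete, here's roughly how a research-to-developer handoff gets wired up. The task strings and session ID generation are illustrative, not lifted from our production code.

```typescript
// The research agent's output is serialized into the HandoffContext,
// so the developer agent sees exactly what was produced.
const researchResult = await researchAgent.run({
  messages: [{ role: 'user', content: 'Compare vector databases for our retrieval workload' }],
});

const developerResult = await handoff({
  fromAgent: 'research',
  toAgent: 'developer',
  userRequest: 'Prototype an ingestion pipeline using the recommended vector database',
  previousResults: { research: researchResult.content },
  metadata: {
    timestamp: new Date().toISOString(),
    sessionId: crypto.randomUUID(),
  },
});
```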
Agents fail. Network timeouts, API rate limits, hallucinations. Build defensive handoffs.
async function robustHandoff(context: HandoffContext, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const result = await handoff(context);
// Validate result quality
if (isValidResult(result)) {
return result;
}
console.warn(`Attempt ${attempt}: Invalid result from ${context.toAgent}`);
} catch (error) {
console.error(`Attempt ${attempt} failed:`, error);
if (attempt === maxRetries) {
// Fallback to simpler agent or human escalation
return await fallbackHandler(context);
}
// Exponential backoff
await sleep(Math.pow(2, attempt) * 1000);
}
  }

  // Every attempt returned an invalid (but non-throwing) result, so fall back explicitly
  return await fallbackHandler(context);
}
function isValidResult(result: any): boolean {
// Check for hallucination markers, incomplete outputs, etc.
return result.content.length > 50 && !result.content.includes('[INSERT]');
}
Circuit breaker pattern: If an agent fails 3 times in a row, disable it temporarily and route to a fallback.
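Here's a minimal sketch of that circuit breaker, layered on top of `robustHandoff` and `fallbackHandler` from above. The threshold and cool-down values are illustrative.

```typescript
// After 3 consecutive failures an agent is "open" for a cool-down window
// and traffic routes to the fallback instead.
const FAILURE_THRESHOLD = 3;
const COOLDOWN_MS = 5 * 60 * 1000;

const breakerState = new Map<string, { failures: number; openedAt?: number }>();

function isOpen(agentName: string): boolean {
  const state = breakerState.get(agentName);
  if (!state?.openedAt) return false;
  if (Date.now() - state.openedAt > COOLDOWN_MS) {
    breakerState.delete(agentName); // cool-down elapsed, allow traffic again
    return false;
  }
  return true;
}

function recordResult(agentName: string, ok: boolean) {
  const state = breakerState.get(agentName) ?? { failures: 0 };
  state.failures = ok ? 0 : state.failures + 1;
  if (state.failures >= FAILURE_THRESHOLD) state.openedAt = Date.now();
  breakerState.set(agentName, state);
}

async function handoffWithBreaker(context: HandoffContext) {
  if (isOpen(context.toAgent)) return fallbackHandler(context);
  try {
    const result = await robustHandoff(context);
    recordResult(context.toAgent, true);
    return result;
  } catch (error) {
    recordResult(context.toAgent, false);
    throw error;
  }
}
```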
Getting multi-agent systems production-ready requires attention to observability, cost control, and latency.
You need to reconstruct the full execution path for debugging. Log every agent transition.
interface AgentTrace {
sessionId: string;
timestamp: string;
agent: string;
action: 'start' | 'handoff' | 'complete' | 'error';
input: string;
output?: string;
metadata: Record<string, any>;
}
function logTrace(trace: AgentTrace) {
// Store in database or observability platform
db.agentTraces.insert(trace);
// Real-time monitoring
metrics.increment('agent.execution', { agent: trace.agent, action: trace.action });
}
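A hypothetical wrapper shows how this fits around the handoff function from earlier: every transition emits start, complete, or error traces, so the full path can be reconstructed later.

```typescript
// Wrap each handoff with trace records; fields follow the AgentTrace interface above.
async function tracedHandoff(context: HandoffContext) {
  const base = {
    sessionId: context.metadata.sessionId,
    agent: context.toAgent,
    input: context.userRequest,
    metadata: {},
  };

  logTrace({ ...base, timestamp: new Date().toISOString(), action: 'start' });
  try {
    const result = await handoff(context);
    logTrace({ ...base, timestamp: new Date().toISOString(), action: 'complete', output: result.content });
    return result;
  } catch (error) {
    logTrace({ ...base, timestamp: new Date().toISOString(), action: 'error', metadata: { error: String(error) } });
    throw error;
  }
}
```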
At Athenic, we store traces in Supabase and visualize them in a custom dashboard. This lets us see which agents are bottlenecks and where handoffs fail.
Multi-agent systems multiply API costs. A 3-agent workflow might use 5× the tokens of a single-agent approach.
The biggest optimization lever is matching model size to agent complexity: routing is a simple classification task, so the orchestrator runs on GPT-4o-mini while the reasoning-heavy agents stay on GPT-4o. Here's how that breaks down per agent:
| Agent | Model | Avg tokens/run | Cost/1K runs |
|---|---|---|---|
| Orchestrator | GPT-4o-mini | 450 | $0.12 |
| Research | GPT-4o | 3,200 | $6.40 |
| Developer | GPT-4o | 4,800 | $9.60 |
| Total | - | 8,450 | $16.12 |
For comparison, a single GPT-4o agent handling all three tasks averages 6,200 tokens but produces lower-quality results, requiring 40% more retries. The multi-agent approach is 18% more expensive upfront but 25% cheaper after accounting for rework.
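The table math reduces to a simple back-of-the-envelope formula. The blended per-million-token rates below are reverse-engineered from our own numbers, not official pricing, and they gloss over the input/output token split.

```typescript
// Rough cost model: cost per 1K runs = avg tokens per run × 1,000 runs × blended rate.
const BLENDED_RATE_PER_M_TOKENS: Record<string, number> = {
  'gpt-4o-mini': 0.27, // implied by 450 tokens/run → $0.12 per 1K runs
  'gpt-4o': 2.0,       // implied by 3,200 tokens/run → $6.40 per 1K runs
};

function costPer1kRuns(model: string, avgTokensPerRun: number): number {
  const rate = BLENDED_RATE_PER_M_TOKENS[model] ?? 0;
  return (avgTokensPerRun * 1000 * rate) / 1_000_000;
}

// costPer1kRuns('gpt-4o', 4800) → 9.6, matching the developer agent row above.
```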
Sequential handoffs add latency. A 3-agent workflow with 2 handoffs might take 12-18 seconds end-to-end.
Parallel execution: If agents don't depend on each other's outputs, run them in parallel.
async function parallelExecution(userRequest: string) {
const [researchResult, marketResult] = await Promise.all([
researchAgent.run({ messages: [{ role: 'user', content: userRequest }] }),
marketAgent.run({ messages: [{ role: 'user', content: userRequest }] }),
]);
// Orchestrator aggregates results
return await orchestrator.run({
messages: [{
role: 'user',
content: `Synthesize these results:\n\nResearch: ${researchResult.content}\n\nMarket: ${marketResult.content}`,
}],
});
}
This pattern reduced our average workflow latency from 14.2s to 7.8s for multi-faceted requests.
For sensitive operations (deleting data, sending emails, making purchases), insert approval gates.
async function executeWithApproval(agent: Agent, task: string) {
const plan = await agent.run({
messages: [{ role: 'user', content: `Plan how to: ${task}. Do not execute yet.` }],
});
// Send plan to approval queue
const approvalId = await approvalQueue.create({
agentName: agent.name,
plan: plan.content,
requestedAt: new Date(),
});
// Wait for approval (webhook, polling, or timeout)
const approval = await waitForApproval(approvalId, { timeout: 300000 });
if (approval.status === 'approved') {
return await agent.run({
messages: [{ role: 'user', content: `Execute: ${plan.content}` }],
});
} else {
throw new Error(`Approval denied: ${approval.reason}`);
}
}
We use this for our partnership agent when it wants to send outreach emails. Human approval prevents embarrassing automated mistakes.
Our partnership orchestration system coordinates three specialized agents to discover and qualify potential partners.
Architecture: hub-and-spoke, with the orchestrator routing between a Discovery Agent, a Qualification Agent, and an Outreach Agent (the last sitting behind a human approval gate).
Workflow:
User: "Find 50 B2B SaaS companies in fintech that use Stripe"
↓
Orchestrator → Discovery Agent
→ Finds 120 candidates via Apollo
↓
Orchestrator → Qualification Agent
→ Scores and ranks, returns top 50
↓
Orchestrator → Outreach Agent (with approval gate)
→ Generates email drafts
→ Waits for human approval
→ Sends via SendGrid
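Condensed into code, that workflow composes the pieces from earlier sections. The agent names mirror the case study but are shown here as assumptions, not as exports from the SDK.

```typescript
// Sketch of the three-stage pipeline: discover → qualify (via handoff) → outreach
// behind the human approval gate from the previous section.
async function partnershipPipeline(request: string, sessionId: string) {
  const candidates = await discoveryAgent.run({
    messages: [{ role: 'user', content: request }],
  });

  const qualified = await handoff({
    fromAgent: 'discovery',
    toAgent: 'qualification',
    userRequest: request,
    previousResults: { candidates: candidates.content },
    metadata: { timestamp: new Date().toISOString(), sessionId },
  });

  // Outreach drafts are only sent after explicit human approval.
  return await executeWithApproval(
    outreachAgent,
    `Draft partner outreach based on: ${qualified.content}`,
  );
}
```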
Results over 3 months:
The multi-agent approach improved response rates because each agent specialized: discovery found better fits, qualification filtered out poor matches, and outreach personalized messages based on specific company signals.
Mistake: Creating 15 hyper-specialized agents for marginally different tasks.
Fix: Start with 2-3 agents. Split agents only when you observe clear bottlenecks or conflicting instructions.
Mistake: Assuming agents inherit context automatically.
Fix: Always serialize and pass state explicitly in handoff messages.
Mistake: No fallback path, so when the primary agent fails, the entire workflow crashes.
Fix: Implement fallback agents (simpler models or rule-based systems) for critical paths.
Mistake: Over-constraining agents with rigid scripts.
Fix: Give agents clear goals and constraints, but let them choose execution paths. Micromanaging eliminates the benefits of LLM-based agents.
Ready to try it? Clone our multi-agent orchestration starter template and deploy your first orchestrated workflow in under 30 minutes.
How many agents should I start with?
Start with 2-3 agents: one orchestrator and 1-2 specialized agents. Add more only when you hit proven performance or quality bottlenecks. Over-engineering with 10+ agents upfront creates complexity without clear benefits.
Can I mix models from different providers?
Yes. You might use GPT-4o for complex reasoning agents, Claude for long-context analysis, and Llama for simple classification. Just ensure your SDK supports multi-provider setups or write thin adapter layers.
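One way to keep providers swappable is a thin adapter interface, sketched below as our own convention rather than an SDK feature; only the OpenAI side is shown.

```typescript
import OpenAI from 'openai';

// Every provider hides behind one interface so agent code doesn't care which model answers.
interface ModelAdapter {
  complete(systemPrompt: string, userMessage: string): Promise<string>;
}

class OpenAIAdapter implements ModelAdapter {
  constructor(private client: OpenAI, private model: string) {}

  async complete(systemPrompt: string, userMessage: string): Promise<string> {
    const res = await this.client.chat.completions.create({
      model: this.model,
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: userMessage },
      ],
    });
    return res.choices[0]?.message?.content ?? '';
  }
}

// An AnthropicAdapter or local Llama adapter implements the same interface,
// so swapping providers is a constructor change rather than an agent rewrite.
```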
What happens when two agents disagree?
Implement a tie-breaker agent or escalate to humans. In our system, if two agents produce conflicting recommendations, the orchestrator sends both to a senior human reviewer rather than trying to auto-resolve.
How do you test a multi-agent system?
Write integration tests that assert on final outputs, not intermediate agent steps. Mock external tool calls (APIs, databases) but let agents run real LLM inference to catch prompt regressions.
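A sketch of that kind of test, assuming Vitest, that the `handleRequest` function from earlier is exported, and that the web-search tool module can be mocked along these lines (the module paths are hypothetical):

```typescript
import { describe, it, expect, vi } from 'vitest';

// Replace the real web-search tool with a fixture; the tool() shape follows the
// earlier sketches and may differ in your project.
vi.mock('./tools/web-search', async () => {
  const { tool } = await import('@openai/agents');
  const { z } = await import('zod');
  return {
    webSearchTool: tool({
      name: 'web_search',
      description: 'Mocked search returning a fixed result.',
      parameters: z.object({ query: z.string() }),
      execute: async () => 'Fixture: 3 fintech SaaS companies that use Stripe',
    }),
  };
});

import { handleRequest } from './orchestrator';

describe('research workflow', () => {
  it('routes research requests and returns a grounded summary', async () => {
    // Real LLM inference runs here, so prompt regressions surface in CI.
    const result = await handleRequest('Research fintech companies that use Stripe');

    // Assert on the final output, not on intermediate agent steps.
    expect(result.content.length).toBeGreaterThan(50);
    expect(result.content.toLowerCase()).toContain('stripe');
  }, 60_000);
});
```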
Multi-agent orchestration splits complex work across specialized agents, improving quality and maintainability. Use hub-and-spoke architecture for most cases, implement explicit handoff protocols, and build robust failure handling from day one.