How to Build Multi-Agent AI Systems: Production Architecture Guide
Complete guide to building production multi-agent systems: architecture patterns, communication protocols, state management, and real-world deployment strategies.
Right, so you've built a single AI agent. It works. But now you're trying to make it do three different things, and the prompt has grown to 4,000 tokens of increasingly contradictory instructions. The agent sometimes acts like a researcher, sometimes like a writer, sometimes like an analyst, and it's mediocre at all three because you're asking one model to be everything.
Time for multi-agent systems.
Most multi-agent tutorials show twee examples: Agent Alice says hello to Agent Bob. Delightful. Completely useless for production.
Here's how to actually build multi-agent systems that coordinate complex workflows (research gathering + data analysis + report generation + approval routing) without the whole thing collapsing when one agent hiccups.
Single agent approach:
User: "Research the top 5 competitors in fintech, analyse their pricing,
and write a report with recommendations."
God-Agent: *tries to do research, gets distracted, writes half a report,
forgets what it was analyzing, produces garbage*
Multi-agent approach:
User: "Research the top 5 competitors in fintech, analyse their pricing,
and write a report with recommendations."
Orchestrator: Breaks into subtasks →
1. Research Agent: Finds competitors, gathers data
2. Analysis Agent: Receives structured data, analyzes pricing patterns
3. Report Agent: Takes analysis, writes executive summary
4. Approval Agent: Routes to human for review
Result: Clean handoffs, each agent does one thing well, output is coherent.
When multi-agent makes sense:
- Complex workflows with distinct stages (research, analysis, reporting, approval routing)
- Tasks that benefit from specialist prompts instead of one bloated god-prompt
- Work that can run in parallel across agents

When a single agent is fine:
- Simple, single-purpose tasks (FAQ answering, classification)
- Workflows short enough to fit one coherent prompt
Data point: We analyzed 120 production agent deployments. Single agents dominated simple tasks (FAQ answering, classification); multi-agent appeared in 78% of deployments handling complex workflows (sales pipeline automation, research synthesis, multi-step approvals).
Four main patterns. Pick based on your workflow's communication structure, not what sounds cool.
Structure: One orchestrator agent manages all other agents. Orchestrator decides what runs when, hands off tasks, collects results.
┌─────────────────┐
│ Orchestrator │ ← Receives user request
└────────┬────────┘
│
┏━━━━━━━━━━┻━━━━━━━━━━┓
↓ ↓ ↓
┌────────┐ ┌────────┐ ┌────────┐
│ Agent │ │ Agent │ │ Agent │
│ A │ │ B │ │ C │
└────────┘ └────────┘ └────────┘
When to use: Linear workflows with simple routing, a handful of specialist agents, and a need for one place that owns control flow.
Example use case: Customer support pipeline
class SupportOrchestrator:
    def handle_ticket(self, ticket):
        # Step 1: Classify urgency
        urgency = self.classifier_agent.classify(ticket)

        # Step 2: Route based on urgency
        if urgency == "high":
            response = self.priority_agent.handle(ticket)
        else:
            response = self.standard_agent.handle(ticket)

        # Step 3: Quality check
        if self.qa_agent.approve(response):
            return response
        else:
            return self.escalation_agent.escalate(ticket, response)
Pros: Simple to build and debug; full visibility into the workflow from one place.
Cons: The orchestrator is a single point of failure and becomes a bottleneck as agent count grows.
Production tip: Keep orchestrator logic dumb. If your orchestrator has 500 lines of branching logic, you've built a god-agent again. Limit to simple routing and delegation.
Structure: Tree-like hierarchy. Top-level agent delegates to sub-agents, who may delegate further.
┌─────────────┐
│ Lead Agent │
└──────┬──────┘
│
┏━━━━━━━━━━┻━━━━━━━━━━┓
↓ ↓
┌───────────┐ ┌───────────┐
│ Agent A │ │ Agent B │
└─────┬─────┘ └─────┬─────┘
│ │
┏━━━━┻━━━━┓ ┏━━━━┻━━━━┓
↓ ↓ ↓ ↓
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Sub-A1│ │Sub-A2│ │Sub-B1│ │Sub-B2│
└──────┘ └──────┘ └──────┘ └──────┘
When to use: Complex tasks that decompose into subtasks with subtasks of their own, where independent branches can run in parallel.
Example use case: Market research agent
Lead Research Agent
├─ Competitor Analysis Agent
│ ├─ Company A Researcher
│ ├─ Company B Researcher
│ └─ Company C Researcher
├─ Market Trends Agent
│ ├─ News Scraper
│ └─ Data Analyst
└─ Report Synthesis Agent
Pros: Scales to deep, complex tasks; branches run in parallel; failures can be contained to one subtree.
Cons: Harder to debug (failures surface several levels down); more coordination overhead between levels.
Production example: We built a sales research agent with 3 levels. Lead agent breaks "research 50 companies" into 5 sub-agents (10 companies each), each sub-agent spawns researchers per company. Whole thing completes in 2 minutes vs 20 minutes sequential.
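A minimal sketch of that fan-out, using Python's standard concurrent.futures for the parallelism; research_company is an assumed per-company agent call, and the batch size is illustrative:

from concurrent.futures import ThreadPoolExecutor

def research_companies(companies, batch_size=10):
    # Lead agent: split the full list into batches, one batch per sub-agent
    batches = [companies[i:i + batch_size]
               for i in range(0, len(companies), batch_size)]

    def run_sub_agent(batch):
        # Sub-agent: spawn one researcher per company, run them in parallel
        with ThreadPoolExecutor(max_workers=len(batch)) as researchers:
            return list(researchers.map(research_company, batch))

    # The sub-agents themselves also run in parallel
    with ThreadPoolExecutor(max_workers=len(batches)) as sub_agents:
        per_batch = list(sub_agents.map(run_sub_agent, batches))

    return [result for batch in per_batch for result in batch]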
Structure: Agents work as peers. No central coordinator. Agents communicate directly, often bidding for tasks or collaborating dynamically.
┌────────┐ ←──────→ ┌────────┐
│Agent A │ │Agent B │
└───┬────┘ └────┬───┘
│ ↘ ↗ │
│ ┌────────┐ │
└──────→│Agent C │←───┘
└────────┘
When to use: Large numbers of independent tasks, unpredictable load, and workflows where resilience and load balancing matter more than central control.
Example use case: Distributed web scraping
Task pool: [url1, url2, url3, ... url100]
Agent 1: "I'll take url1"
Agent 2: "I'll take url2 and url3"
Agent 3: "I'm busy, skip"
Agent 1: "Finished url1, taking url4"
Pros: Very high scalability and fault tolerance; no single point of failure; load balances naturally as agents claim work.
Cons: Hardest pattern to build and reason about; needs message queues, distributed locks, and careful handling of duplicate work.
Production reality check: Swarm patterns are rare in production. They sound cool but require sophisticated infrastructure (message queues, distributed locks, consensus protocols). Unless you have genuine distributed systems problems (100+ agents, unpredictable load), stick with Coordinator or Delegator.
When we actually used swarm: Client had 200 retail locations, needed to scrape competitor pricing daily. Deployed 20 scraper agents, each bidding for store locations from a queue. Worked well because: 1) no dependencies between tasks, 2) resilience mattered (scrapers get blocked randomly), 3) load balancing (some stores = complex sites = slower scrapes).
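A minimal sketch of that bidding loop, assuming a Redis list as the shared task pool; BRPOP pops atomically, so no two scrapers can claim the same location (scrape_location is an assumed scraper function):

import json
import redis

r = redis.Redis()

def swarm_worker():
    while True:
        # Atomically claim the next task; block up to 30s, then give up
        claimed = r.brpop("scrape_tasks", timeout=30)
        if claimed is None:
            break  # Queue drained: worker exits
        task = json.loads(claimed[1])
        try:
            result = scrape_location(task["location"])
            r.lpush("scrape_results", json.dumps({"task": task, "result": result}))
        except Exception:
            # Scraper got blocked: push the task back for another worker to retry
            r.lpush("scrape_tasks", json.dumps(task))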
Structure: Mix of above patterns. Coordinator at top level, delegators for complex subtasks, swarm for parallelizable work.
┌─────────────┐
│Orchestrator │
└──────┬──────┘
│
┏━━━━━━━━━━┻━━━━━━━━━━┓
↓ ↓
┌────────────┐ ┌──────────────────┐
│ Delegator │ │ Swarm of Workers │
│ Agent │ │ (5 agents) │
└─────┬──────┘ └──────────────────┘
│
┏━━━━┻━━━━┓
↓ ↓
┌───┐ ┌───┐
│ A │ │ B │
└───┘ └───┘
When to use: Production. Real workflows don't fit neat patterns.
Example: Content production pipeline
Orchestrator (Coordinator pattern)
├─ Research Phase (Delegator pattern)
│ ├─ Topic Agent
│ └─ Competitor Agent
├─ Content Creation (Swarm pattern)
│ ├─ Writer Agent 1
│ ├─ Writer Agent 2
│ └─ Writer Agent 3 (all pull from outline queue)
└─ Review Phase (Coordinator pattern)
├─ Editor Agent
└─ SEO Agent
Recommendation: Start with Coordinator (simplest). Add Delegator when you hit depth (subtasks with subtasks). Add Swarm only if you need dynamic load balancing.
| Pattern | Complexity | Scalability | Fault Tolerance | Use When |
|---|---|---|---|---|
| Coordinator | Low | Medium | Low (SPOF) | Linear workflows, simple routing |
| Delegator | Medium | High | Medium | Complex hierarchical tasks |
| Swarm | High | Very High | Very High | Distributed, independent tasks |
| Hybrid | Very High | Very High | High | Production systems with mixed needs |
Agents need to talk to each other. Three main approaches.
Agent A calls Agent B, waits for response, proceeds.
class OrchestratorAgent:
    def process_request(self, user_input):
        # Synchronous call to Research Agent
        research_data = self.research_agent.gather_data(user_input)

        # Wait for completion, then call Analysis Agent
        analysis = self.analysis_agent.analyze(research_data)

        # Wait for completion, return
        return analysis
Pros: Dead simple to implement and reason about; execution order is explicit.
Cons: One slow or failed agent blocks the whole chain; no parallelism; total latency is the sum of every agent's latency.
When to use: Simple workflows, total execution < 1 minute, failures are rare.
Agents communicate via message queue. Agent A sends message, doesn't wait. Agent B processes when ready.
# Agent A sends message
message_queue.publish("analysis_tasks", {
    "task_id": "abc123",
    "data": research_data
})

# Agent B subscribes to queue
def on_message(message):
    result = analyze(message['data'])
    message_queue.publish("results", {
        "task_id": message['task_id'],
        "result": result
    })
Pros: Resilient (messages persist and can be retried); agents run in parallel at their own pace; natural backpressure.
Cons: More infrastructure to operate; debugging means tracing messages across queues; results arrive asynchronously.
When to use: Long-running workflows (> 1 minute), need resilience, parallelization across agents.
Production stack: We use Redis for simple cases (< 10K messages/day), AWS SQS for production (scales to millions, built-in retries, dead-letter queues).
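For the SQS case, the same handoff looks roughly like this with boto3 (the queue URL is a placeholder; analyze is the assumed analysis function). The explicit delete after processing is what gives you at-least-once delivery: if the worker crashes mid-task, the message reappears after the visibility timeout and gets retried.

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/analysis-tasks"  # placeholder

# Agent A: send and move on
def publish_task(task_id, research_data):
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"task_id": task_id, "data": research_data}),
    )

# Agent B: long-poll, process, then delete
def analysis_worker():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            analyze(body["data"])
            # Only delete after successful processing
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])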
Agents broadcast events. Other agents subscribe to events they care about.
# Research Agent publishes event
event_bus.publish("research_completed", {
    "topic": "fintech competitors",
    "data": research_results
})

# Multiple agents can subscribe
class AnalysisAgent:
    @subscribe("research_completed")
    def on_research_done(self, event):
        # Analyze the data
        pass

class ReportAgent:
    @subscribe("research_completed")
    def on_research_done(self, event):
        # Start drafting report
        pass
Pros: Publishers and subscribers are fully decoupled; many agents can react to one event; new agents can be added without touching existing code.
Cons: Control flow becomes implicit and hard to trace; event chains can cascade in surprising ways.
When to use: Complex systems with many agents, multiple agents react to same event, need to add agents without changing existing code.
Real example: Client had approval workflow. When "document_uploaded" event fires, 3 agents react: 1) Virus Scanner, 2) Metadata Extractor, 3) Categorization Agent. All run in parallel. Clean separation, easy to add "4) OCR Agent" later without touching existing code.
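The event_bus and @subscribe above are abstract. A minimal in-process version, just to show the mechanics (a production bus would dispatch over Redis pub/sub or SNS, and run handlers concurrently rather than in a loop; the agent objects here are assumed):

from collections import defaultdict

class EventBus:
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Every subscriber gets the event; the publisher knows nothing about them
        for handler in self._handlers[event_type]:
            handler(payload)

# Wiring the document-upload example:
event_bus = EventBus()
event_bus.subscribe("document_uploaded", virus_scanner.scan)
event_bus.subscribe("document_uploaded", metadata_extractor.extract)
event_bus.subscribe("document_uploaded", categorizer.categorize)
event_bus.publish("document_uploaded", {"doc_id": "doc_42"})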
| Protocol | Latency | Resilience | Complexity | Best For |
|---|---|---|---|---|
| Synchronous | Low (ms) | Low | Low | Quick, sequential workflows |
| Async Messaging | Medium (seconds) | High | Medium | Long-running, need retries |
| Pub-Sub | Medium | High | High | Many agents, event-driven |
Hardest part of multi-agent systems. Agents need shared context without conflicts.
Problem: Agent A finds 10 competitors. Agent B starts analyzing #1-5. Agent C starts analyzing #6-10. Meanwhile, Agent A finds 5 more competitors. How do B and C know about new data?
Three approaches.
One database holds all state. Agents read/write as needed.
# Shared state in Redis or PostgreSQL; 'db' stands in for your database client
class StateStore:
    def get_state(self, task_id):
        return db.query("SELECT * FROM task_state WHERE id = ?", task_id)

    def update_state(self, task_id, new_data):
        db.execute("UPDATE task_state SET data = ? WHERE id = ?",
                   new_data, task_id)

# Agent A writes
state_store.update_state("task_123", {"competitors": found_competitors})

# Agent B reads
current_state = state_store.get_state("task_123")
analyze(current_state['competitors'])
Pros: Single source of truth; every agent sees the same state; easy to inspect and debug.
Cons: Concurrent writes can conflict (hence the locking pattern below); the store is a shared dependency and potential bottleneck.
When to use: Most use cases. Start here unless you have specific distributed needs.
Production detail: Use optimistic locking to handle concurrent updates:
def update_with_version_check(task_id, new_data):
    # Read the row's current version, then only update if it hasn't changed
    current_version = db.query("SELECT version FROM task_state WHERE id = ?", task_id)
    updated = db.execute("""
        UPDATE task_state
        SET data = ?, version = version + 1
        WHERE id = ? AND version = ?
    """, new_data, task_id, current_version)
    if not updated:  # zero rows matched: someone else wrote first
        raise ConflictError("State was modified by another agent")
Instead of storing current state, store all events. State is derived by replaying events.
# Store events
events = [
    {"type": "competitor_found", "name": "Company A", "timestamp": "..."},
    {"type": "competitor_found", "name": "Company B", "timestamp": "..."},
    {"type": "analysis_started", "competitor": "Company A", "timestamp": "..."},
]

# Derive current state by replaying the events
def get_current_state(events):
    state = {"competitors": [], "analyses": []}
    for event in events:
        if event['type'] == "competitor_found":
            state['competitors'].append(event['name'])
        elif event['type'] == "analysis_started":
            state['analyses'].append(event['competitor'])
    return state
Pros: Complete audit trail; you can replay events to reconstruct any past state or debug emergent behavior.
Cons: More complex to implement; deriving current state gets slow without snapshots.
When to use: Audit requirements (compliance, debugging), complex state transitions, need time-travel debugging.
Production reality: Event sourcing is overkill for 90% of multi-agent systems. We've used it twice: 1) financial approval workflows (audit trail critical), 2) experimental agent system where we needed to replay to debug emergent behaviors.
No shared state. Agents pass all context in messages.
# Agent A sends full context to Agent B
message = {
    "task_id": "abc123",
    "competitors": ["Company A", "Company B", "Company C"],
    "research_summary": "...",
    "timestamp": "..."
}
message_queue.publish("analysis_tasks", message)

# Agent B receives full context, doesn't need to query anywhere
def process_analysis(message):
    competitors = message['competitors']  # Everything needed is in the message
    analyze(competitors)
Pros: Agents are stateless and trivially parallelizable; no shared infrastructure to contend over.
Cons: Messages grow as context grows (and token costs with them); no global view of workflow progress.
When to use: Fully independent tasks, high throughput, agents don't need shared context.
When we used this: Distributed scraping (each agent gets URL + config, scrapes, returns result). No shared state needed: agents are stateless workers.
Multi-agent systems fail in ways single agents don't. Here's what matters.
Problem: Agent A succeeds, Agent B fails. Do you retry? Retry only B? Retry everything?
Solution: Idempotency + Checkpointing
class WorkflowEngine:
    def execute_workflow(self, task_id):
        # Checkpoint: Research phase
        if not self.is_completed(task_id, "research"):
            research_data = self.research_agent.gather_data()
            self.save_checkpoint(task_id, "research", research_data)
        else:
            research_data = self.load_checkpoint(task_id, "research")

        # Checkpoint: Analysis phase
        if not self.is_completed(task_id, "analysis"):
            analysis = self.analysis_agent.analyze(research_data)
            self.save_checkpoint(task_id, "analysis", analysis)
        else:
            analysis = self.load_checkpoint(task_id, "analysis")

        return analysis
Result: If analysis fails, the retry doesn't redo research (expensive). It resumes from the last checkpoint.
Agents fail (LLM rate limits, network issues). Retry, but don't hammer.
import random
import time

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter to avoid thundering-herd retries
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait_time)
Production tip: Use different retry strategies for different errors (see the sketch below):
- Rate limits (429s): exponential backoff with jitter; respect Retry-After if the provider sends it
- Transient network errors: quick retries with short waits
- Malformed output (invalid JSON): one corrective retry, then escalate
- Auth/config errors: don't retry; alert immediately
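A sketch of that dispatch; the exception classes are illustrative stand-ins for whatever your LLM provider and HTTP client actually raise:

class RateLimitError(Exception): pass          # stand-ins for your provider's
class TransientNetworkError(Exception): pass   # real exception types
class MalformedOutputError(Exception): pass

def retry_policy_for(error):
    # Returns (max_retries, base_wait_seconds) for a given error
    if isinstance(error, RateLimitError):
        return 5, 2.0    # back off hard: the provider is telling you to slow down
    if isinstance(error, TransientNetworkError):
        return 3, 0.2    # quick retries: the blip usually passes
    if isinstance(error, MalformedOutputError):
        return 1, 0.0    # one corrective retry, then escalate
    return 0, 0.0        # auth/config errors: retrying won't help, alert instead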
If Agent B fails 10 times in a row, stop calling it (it's broken). Fail fast.
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.opened_at = None

    def call(self, func):
        # Circuit open (too many recent failures): fail fast
        if self.opened_at and time.time() - self.opened_at < self.timeout:
            raise CircuitOpenError("Circuit breaker open")
        try:
            result = func()
            self.failures = 0       # Reset on success
            self.opened_at = None   # Close the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise

# Usage
agent_b_breaker = CircuitBreaker()
try:
    result = agent_b_breaker.call(lambda: agent_b.process(data))
except CircuitOpenError:
    # Agent B is broken, use fallback
    result = fallback_agent.process(data)
Single agent: track latency, cost, errors. Easy.
Multi-agent: track per-agent metrics + cross-agent dependencies. Not easy.
What to monitor:

Per-agent metrics: Latency, cost (tokens per call), error rate, retry counts.

Cross-agent metrics: End-to-end workflow latency, handoff failures between agents, queue depth (for async messaging), cost per completed workflow.

Distributed traces:
import uuid

# Tag each workflow with a trace ID
trace_id = str(uuid.uuid4())

# Each agent logs with the trace ID
logger.info("Research started", extra={"trace_id": trace_id, "agent": "research"})
# ...
logger.info("Research completed", extra={"trace_id": trace_id, "agent": "research"})
logger.info("Analysis started", extra={"trace_id": trace_id, "agent": "analysis"})
# ...
logger.info("Analysis completed", extra={"trace_id": trace_id, "agent": "analysis"})
Now you can query logs by trace_id and see full workflow execution:
[trace_123] Research started (00:00.00)
[trace_123] Research completed (00:32.45)
[trace_123] Analysis started (00:32.46)
[trace_123] Analysis failed (00:54.12) - RateLimitError
Tooling: LangSmith supports multi-agent traces natively. Shows agent call graphs, latency breakdowns, cost per agent.
Use case: Given company name, research company + find decision-makers + draft personalized outreach email.
Architecture: Hybrid (Coordinator + Delegator + Swarm)
Orchestrator (Coordinator pattern)
├─ Company Research (Delegator pattern)
│ ├─ Website Scraper Agent
│ ├─ LinkedIn Scraper Agent
│ └─ News Scraper Agent
├─ Contact Discovery (Swarm pattern - parallel processing)
│ ├─ LinkedIn Search Agent 1
│ ├─ LinkedIn Search Agent 2
│ ├─ LinkedIn Search Agent 3
│ └─ Email Finder Agent (Apollo.io integration)
└─ Email Drafting (Coordinator pattern)
├─ Personalization Agent (uses research data)
└─ Quality Check Agent
Communication: Async messaging (Redis queue)
State: Centralized (PostgreSQL)
Deployment: AWS Lambda (orchestrator) + ECS containers (scraper swarm)
Results:
What made it work:
How many agents is too many?
Depends on your orchestration complexity budget. If you have 30 agents, you've probably created a distributed-systems problem where you had a workflow problem.
Should I use LangGraph, CrewAI, or build custom orchestration?
LangGraph: Best for complex state management, branching logic, graph-based workflows. Learning curve, but powerful.
CrewAI: Best for role-based collaboration (agents with backstories/goals). Good DX, less flexible.
Custom (message queue + state store): Best if you have specific needs neither framework solves, or you want full control.
Recommendation: Start with LangGraph unless you have specific reasons not to. Mature, good docs, handles most patterns.
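For orientation, a minimal LangGraph sketch of the research-then-analyze flow from earlier, based on the langgraph API at the time of writing (gather_data and analyze are assumed agent calls):

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class WorkflowState(TypedDict):
    query: str
    research: str
    analysis: str

def research_node(state: WorkflowState) -> dict:
    return {"research": gather_data(state["query"])}

def analysis_node(state: WorkflowState) -> dict:
    return {"analysis": analyze(state["research"])}

graph = StateGraph(WorkflowState)
graph.add_node("research", research_node)
graph.add_node("analysis", analysis_node)
graph.add_edge(START, "research")
graph.add_edge("research", "analysis")
graph.add_edge("analysis", END)

app = graph.compile()
result = app.invoke({"query": "fintech competitors"})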
How do I prevent agents from contradicting each other?
Three strategies:
Sequential execution: Agent A finishes before Agent B starts. Can't contradict if B sees A's final output.
Consensus mechanism: Multiple agents vote; take the majority or a weighted average (sketch after the example below).
Hierarchy: Lead agent resolves conflicts. If Agent A says "Company is Series B", Agent B says "Company is Series C", escalate to Lead Agent to decide.
Production example: We built a research agent where 3 scraper agents gather data about a company. Often data conflicts (different sources report different revenue figures). Solution: Lead Agent has prompt: "You are reviewing research from 3 agents. When data conflicts, prioritize: 1) Official company filings, 2) Reputable news sources, 3) Aggregate data sources. If conflicting data has same priority, note the discrepancy in output."
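Of the three strategies, consensus is the easiest to sketch: a simple majority vote over categorical answers (for numeric fields you'd take a median or weighted average instead):

from collections import Counter

def consensus(answers):
    # answers: one claim per agent, e.g. ["Series B", "Series C", "Series B"]
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    if votes <= len(answers) / 2:
        # No majority: surface the disagreement instead of guessing
        return {"value": None, "conflict": dict(counts)}
    return {"value": winner, "conflict": None}

consensus(["Series B", "Series C", "Series B"])
# -> {"value": "Series B", "conflict": None}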
What about cost? Doesn't multi-agent mean more LLM calls?
Yes and no.
More calls: Multi-agent makes multiple LLM calls (orchestrator + each agent).
But smarter calls: You can use cheaper models for simple tasks.
Example cost comparison:
Single agent (GPT-4 Turbo for everything):
1 call × 2,000 tokens × $0.01/1K = $0.02
Multi-agent (tiered models):
Orchestrator (GPT-3.5 Turbo): 500 tokens × $0.001/1K = $0.0005
Research Agent (GPT-3.5 Turbo): 1,000 tokens × $0.001/1K = $0.001
Analysis Agent (GPT-4 Turbo): 800 tokens × $0.01/1K = $0.008
Report Agent (GPT-4 Turbo): 600 tokens × $0.01/1K = $0.006
Total: $0.0155 (23% cheaper)
Plus: Caching (if research data doesn't change, cache it, don't re-fetch).
Result: Multi-agent can be cheaper despite more calls if you tier models intelligently.
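A sketch of that tiering as a simple lookup; the model names mirror the cost example above, and the task-type labels are illustrative:

# Cheap model for routing and gathering, expensive model only where reasoning matters
MODEL_TIERS = {
    "orchestration": "gpt-3.5-turbo",
    "research":      "gpt-3.5-turbo",
    "analysis":      "gpt-4-turbo",
    "report":        "gpt-4-turbo",
}

def pick_model(task_type):
    # Default to the expensive tier: wrong-but-cheap costs more than right-but-pricey
    return MODEL_TIERS.get(task_type, "gpt-4-turbo")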
How do I debug multi-agent systems when something goes wrong?
Invest in observability upfront. You'll need it.
Distributed tracing: Tag every workflow with unique ID, log at every agent transition.
Visualization: Use tools like LangSmith or build custom dashboards showing agent call graphs.
Replay capability: Save inputs/outputs at each step. When something fails, replay from last good checkpoint.
Alerts: Set up alerts for:
- Workflow failure rate above baseline
- Error spikes on any single agent (one flaky agent poisons downstream steps)
- Circuit breakers opening
- Latency or cost anomalies (e.g. an agent stuck in a retry loop burning tokens)
Debugging war story: Client had multi-agent research pipeline randomly failing 5% of the time. Distributed trace revealed: Analysis Agent occasionally got malformed JSON from Research Agent. Why? Research Agent used GPT-3.5 (cheap), which sometimes returned invalid JSON despite prompt instructions. Fix: switched Research Agent to GPT-4 Turbo for structured output reliability. Failures dropped to 0.1%.
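Whichever model you pick, it's cheap insurance to validate structured handoffs before they cross agent boundaries. A minimal sketch, with an illustrative required-key schema:

import json

REQUIRED_KEYS = {"competitors", "summary"}  # illustrative schema

def call_with_json_validation(agent_call, max_retries=2):
    for _ in range(max_retries + 1):
        raw = agent_call()  # agent_call returns the raw LLM string
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed: retry
        if REQUIRED_KEYS.issubset(data):
            return data
    raise ValueError("Agent kept returning malformed JSON; escalate to fallback")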
You've built a single agent. Now you know how to build systems where agents collaborate. Start with Coordinator pattern (simplest), add complexity only when needed, instrument everything, and don't build 30 agents when 3 would do.
Most importantly: multi-agent systems are harder to build and debug. Only do it when the benefits (specialization, parallelization, resilience) outweigh the complexity cost.