Academy · 20 Nov 2024 · 12 min read

How to Build Multi-Agent AI Systems: Production Architecture Guide

Complete guide to building production multi-agent systems: architecture patterns, communication protocols, state management, and real-world deployment strategies.

Max Beech
Head of Content

TL;DR

  • Multi-agent systems split complex workflows across specialized agents rather than building one god-agent that does everything poorly.
  • Four core architecture patterns: Coordinator (central orchestrator), Delegator (hierarchical delegation), Swarm (peer-to-peer), and Hybrid (mix of patterns).
  • State management is the hardest problem: agents need to share context without stepping on each other's work (distributed state, message queues, or a centralized store).
  • Communication protocols: synchronous handoffs for simple cases, asynchronous messaging for resilience, publish-subscribe for broadcasting.
  • Production deployments need retry logic, circuit breakers, and careful handling of partial failures (unlike single agents that just retry the whole thing).
  • Monitoring gets complex fast: track agent-level metrics plus cross-agent dependencies (traces that span multiple agents).

Jump to architecture patterns · Jump to communication protocols · Jump to state management · Jump to production concerns · Jump to FAQs


Right, so you've built a single AI agent. It works. But now you're trying to make it do three different things, and the prompt has grown to 4,000 tokens of increasingly contradictory instructions. The agent sometimes acts like a researcher, sometimes like a writer, sometimes like an analyst, and it's mediocre at all three because you're asking one model to be everything.

Time for multi-agent systems.

Most multi-agent tutorials show twee examples: Agent Alice says hello to Agent Bob. Delightful. Completely useless for production.

Here's how to actually build multi-agent systems that coordinate complex workflows (research gathering, data analysis, report generation, approval routing) without the whole thing collapsing when one agent hiccups.

Why Multi-Agent vs Single Agent?

Single agent approach:

User: "Research the top 5 competitors in fintech, analyse their pricing,
      and write a report with recommendations."

God-Agent: *tries to do research, gets distracted, writes half a report,
           forgets what it was analyzing, produces garbage*

Multi-agent approach:

User: "Research the top 5 competitors in fintech, analyse their pricing,
      and write a report with recommendations."

Orchestrator: Breaks into subtasks →
  1. Research Agent: Finds competitors, gathers data
  2. Analysis Agent: Receives structured data, analyses pricing patterns
  3. Report Agent: Takes analysis, writes executive summary
  4. Approval Agent: Routes to human for review

Result: Clean handoffs, each agent does one thing well, output is coherent.

When multi-agent makes sense:

  • Workflow has 3+ distinct phases (research → analyze → present)
  • Different steps need different models (fast model for classification, slow model for reasoning)
  • Parallel execution saves time (scrape 10 websites simultaneously)
  • Separation of concerns improves reliability (one agent failing doesn't crash everything)

When single agent is fine:

  • Simple workflow (< 3 steps)
  • No parallelization benefit
  • Total execution < 30 seconds
  • You don't want orchestration complexity

Data point: we analysed 120 production agent deployments. Single agents dominated simple tasks (FAQ answering, classification). Multi-agent architectures appeared in 78% of the deployments handling complex workflows (sales pipeline automation, research synthesis, multi-step approvals).

Architecture Patterns

Four main patterns. Pick based on your workflow's communication structure, not what sounds cool.

Pattern 1: Coordinator (Central Orchestrator)

Structure: One orchestrator agent manages all other agents. Orchestrator decides what runs when, hands off tasks, collects results.

         ┌─────────────────┐
         │  Orchestrator   │ ← Receives user request
         └────────┬────────┘
                  │
       ┏━━━━━━━━━━┻━━━━━━━━━━┓
       ↓          ↓           ↓
  ┌────────┐ ┌────────┐ ┌────────┐
  │ Agent  │ │ Agent  │ │ Agent  │
  │   A    │ │   B    │ │   C    │
  └────────┘ └────────┘ └────────┘

When to use:

  • Linear workflows (step 1 → step 2 → step 3)
  • Orchestrator has simple routing logic
  • Failures need centralized handling
  • Clear single source of truth for workflow state

Example use case: Customer support pipeline

class SupportOrchestrator:
    def handle_ticket(self, ticket):
        # Step 1: Classify urgency
        urgency = self.classifier_agent.classify(ticket)

        # Step 2: Route based on urgency
        if urgency == "high":
            response = self.priority_agent.handle(ticket)
        else:
            response = self.standard_agent.handle(ticket)

        # Step 3: Quality check
        if self.qa_agent.approve(response):
            return response
        else:
            return self.escalation_agent.escalate(ticket, response)

Pros:

  • Simple mental model (one brain coordinating)
  • Easy to debug (orchestrator logs show full workflow)
  • State management straightforward (orchestrator holds state)

Cons:

  • Orchestrator is single point of failure
  • Doesn't scale to complex dependencies (if Agent B needs outputs from both A and C, orchestrator logic gets messy)
  • Orchestrator can become a god-agent itself (defeats the purpose)

Production tip: Keep orchestrator logic dumb. If your orchestrator has 500 lines of branching logic, you've built a god-agent again. Limit to simple routing and delegation.

Pattern 2: Delegator (Hierarchical)

Structure: Tree-like hierarchy. Top-level agent delegates to sub-agents, who may delegate further.

                ┌─────────────┐
                │ Lead Agent  │
                └──────┬──────┘
                       │
            ┏━━━━━━━━━━┻━━━━━━━━━━┓
            ↓                      ↓
      ┌───────────┐          ┌───────────┐
      │  Agent A  │          │  Agent B  │
      └─────┬─────┘          └─────┬─────┘
            │                      │
       ┏━━━━┻━━━━┓            ┏━━━━┻━━━━┓
       ↓         ↓            ↓         ↓
   ┌──────┐  ┌──────┐    ┌──────┐  ┌──────┐
   │Sub-A1│  │Sub-A2│    │Sub-B1│  │Sub-B2│
   └──────┘  └──────┘    └──────┘  └──────┘

When to use:

  • Complex workflows with sub-workflows
  • Natural hierarchical structure (e.g., research → sub-topics → deeper dives)
  • Different levels need different capabilities
  • Parallelization at multiple levels

Example use case: Market research agent

Lead Research Agent
  ├─ Competitor Analysis Agent
  │   ├─ Company A Researcher
  │   ├─ Company B Researcher
  │   └─ Company C Researcher
  ├─ Market Trends Agent
  │   ├─ News Scraper
  │   └─ Data Analyst
  └─ Report Synthesis Agent

Pros:

  • Scales to complex tasks (break down arbitrarily)
  • Parallelization at each level
  • Each level can specialize (top level strategic, bottom level tactical)

Cons:

  • Complex to orchestrate (who waits for whom?)
  • Harder to debug (failures can be several layers deep)
  • State management across levels gets messy

Production example: We built a sales research agent with 3 levels. Lead agent breaks "research 50 companies" into 5 sub-agents (10 companies each), each sub-agent spawns researchers per company. Whole thing completes in 2 minutes vs 20 minutes sequential.
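
A minimal sketch of that two-level fan-out with asyncio; the agent names, batch sizes, and research stub are illustrative, not the client implementation:

import asyncio

# Hypothetical sub-agent: researches one batch of companies.
# In a real system this would wrap LLM calls and scrapers.
async def research_batch(batch):
    await asyncio.sleep(0.1)  # stand-in for real work
    return {company: f"notes on {company}" for company in batch}

async def lead_agent(companies, batch_size=10):
    # Lead agent splits the task into batches and delegates in parallel
    batches = [companies[i:i + batch_size] for i in range(0, len(companies), batch_size)]
    results = await asyncio.gather(*(research_batch(b) for b in batches))

    # Merge sub-agent outputs back into one result
    merged = {}
    for partial in results:
        merged.update(partial)
    return merged

if __name__ == "__main__":
    companies = [f"Company {i}" for i in range(50)]
    print(len(asyncio.run(lead_agent(companies))))  # 50

The same shape nests: each sub-agent can run its own gather over per-company researchers, which is exactly where the debugging pain at depth comes from.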

Pattern 3: Swarm (Peer-to-Peer)

Structure: Agents work as peers. No central coordinator. Agents communicate directly, often bidding for tasks or collaborating dynamically.

     ┌────────┐ ←──────→ ┌────────┐
     │Agent A │           │Agent B │
     └───┬────┘           └────┬───┘
         │     ↘      ↗        │
         │       ┌────────┐    │
         └──────→│Agent C │←───┘
                 └────────┘

When to use:

  • No clear workflow structure upfront
  • Agents need to adapt dynamically
  • Fault tolerance critical (if one agent dies, others continue)
  • Decentralized decision-making

Example use case: Distributed web scraping

Task pool: [url1, url2, url3, ... url100]

Agent 1: "I'll take url1"
Agent 2: "I'll take url2 and url3"
Agent 3: "I'm busy, skip"
Agent 1: "Finished url1, taking url4"
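
A minimal sketch of that pull-based pattern, using an in-process queue and threads as a stand-in for a real shared task queue (Redis, SQS, or similar):

import queue
import threading

# Shared task pool; in production this would be an external queue
task_pool = queue.Queue()
for i in range(1, 101):
    task_pool.put(f"url{i}")

results = []
results_lock = threading.Lock()

def scraper_worker(worker_id):
    # Each peer pulls tasks whenever it is free - no central coordinator
    while True:
        try:
            url = task_pool.get_nowait()
        except queue.Empty:
            return  # pool drained, worker exits
        scraped = f"worker {worker_id} scraped {url}"  # stand-in for real scraping
        with results_lock:
            results.append(scraped)
        task_pool.task_done()

workers = [threading.Thread(target=scraper_worker, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(results))  # 100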

Pros:

  • Highly resilient (no single point of failure)
  • Scales horizontally (add more agents → more throughput)
  • Adapts to load dynamically

Cons:

  • Complex to implement (need task queue, coordination protocol)
  • Hard to predict behavior (emergent properties)
  • Debugging is a nightmare (no central log of what happened)

Production reality check: Swarm patterns are rare in production. They sound cool but require sophisticated infrastructure (message queues, distributed locks, consensus protocols). Unless you have genuine distributed systems problems (100+ agents, unpredictable load), stick with Coordinator or Delegator.

When we actually used swarm: Client had 200 retail locations, needed to scrape competitor pricing daily. Deployed 20 scraper agents, each bidding for store locations from a queue. Worked well because: 1) no dependencies between tasks, 2) resilience mattered (scrapers get blocked randomly), 3) load balancing (some stores = complex sites = slower scrapes).

Pattern 4: Hybrid

Structure: Mix of above patterns. Coordinator at top level, delegators for complex subtasks, swarm for parallelizable work.

              ┌─────────────┐
              │Orchestrator │
              └──────┬──────┘
                     │
          ┏━━━━━━━━━━┻━━━━━━━━━━┓
          ↓                      ↓
   ┌────────────┐      ┌──────────────────┐
   │ Delegator  │      │ Swarm of Workers │
   │  Agent     │      │  (5 agents)      │
   └─────┬──────┘      └──────────────────┘
         │
    ┏━━━━┻━━━━┓
    ↓         ↓
  ┌───┐     ┌───┐
  │ A │     │ B │
  └───┘     └───┘

When to use: Production. Real workflows don't fit neat patterns.

Example: Content production pipeline

Orchestrator (Coordinator pattern)
  ├─ Research Phase (Delegator pattern)
  │   ├─ Topic Agent
  │   └─ Competitor Agent
  ├─ Content Creation (Swarm pattern)
  │   ├─ Writer Agent 1
  │   ├─ Writer Agent 2
  │   └─ Writer Agent 3 (all pull from outline queue)
  └─ Review Phase (Coordinator pattern)
      ├─ Editor Agent
      └─ SEO Agent

Recommendation: Start with Coordinator (simplest). Add Delegator when you hit depth (subtasks with subtasks). Add Swarm only if you need dynamic load balancing.

Architecture Pattern Decision Matrix

Pattern      Complexity   Scalability   Fault Tolerance   Use When
Coordinator  Low          Medium        Low (SPOF)        Linear workflows, simple routing
Delegator    Medium       High          Medium            Complex hierarchical tasks
Swarm        High         Very High     Very High         Distributed, independent tasks
Hybrid       Very High    Very High     High              Production systems with mixed needs

Communication Protocols

Agents need to talk to each other. Three main approaches.

Approach 1: Synchronous Handoffs

Agent A calls Agent B, waits for response, proceeds.

class OrchestratorAgent:
    def process_request(self, user_input):
        # Synchronous call to Research Agent
        research_data = self.research_agent.gather_data(user_input)

        # Wait for completion, then call Analysis Agent
        analysis = self.analysis_agent.analyze(research_data)

        # Wait for completion, return
        return analysis

Pros:

  • Simple to implement (just function calls)
  • Easy to debug (linear execution)
  • Natural for sequential workflows

Cons:

  • Blocks on slow agents (if research takes 2 minutes, everything waits)
  • No parallelization
  • If any agent fails, whole chain fails

When to use: Simple workflows, total execution < 1 minute, failures are rare.

Approach 2: Asynchronous Messaging

Agents communicate via message queue. Agent A sends message, doesn't wait. Agent B processes when ready.

# Agent A sends message
message_queue.publish("analysis_tasks", {
    "task_id": "abc123",
    "data": research_data
})

# Agent B subscribes to queue
def on_message(message):
    result = analyze(message['data'])
    message_queue.publish("results", {
        "task_id": message['task_id'],
        "result": result
    })

Pros:

  • Resilient (if Agent B crashes, message stays in queue)
  • Scalable (add more Agent B instances → higher throughput)
  • Decouples agents (A doesn't care how long B takes)

Cons:

  • Complex (need message broker: RabbitMQ, Redis, AWS SQS)
  • Harder to debug (distributed traces)
  • Eventual consistency (results arrive later)

When to use: Long-running workflows (> 1 minute), need resilience, parallelization across agents.

Production stack: We use Redis for simple cases (< 10K messages/day), AWS SQS for production (scales to millions, built-in retries, dead-letter queues).
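
For the Redis case, a minimal sketch using redis-py with a list as the queue; the queue name and payload fields are illustrative, and it assumes a Redis instance on localhost rather than our actual production setup:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Producer side: Agent A pushes a task onto a Redis list used as a queue
def publish_analysis_task(research_data):
    r.lpush("analysis_tasks", json.dumps({
        "task_id": "abc123",
        "data": research_data,
    }))

# Consumer side: Agent B blocks until a task arrives, then processes it
def consume_analysis_task(timeout=5):
    item = r.brpop("analysis_tasks", timeout=timeout)
    if item is None:
        return None  # nothing arrived within the timeout
    _, payload = item
    task = json.loads(payload)
    # ... run the analysis on task["data"], then publish the result
    return task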

Approach 3: Publish-Subscribe (Event-Driven)

Agents broadcast events. Other agents subscribe to events they care about.

# Research Agent publishes event
event_bus.publish("research_completed", {
    "topic": "fintech competitors",
    "data": research_results
})

# Multiple agents can subscribe
class AnalysisAgent:
    @subscribe("research_completed")
    def on_research_done(self, event):
        # Analyze the data
        pass

class ReportAgent:
    @subscribe("research_completed")
    def on_research_done(self, event):
        # Start drafting report
        pass

Pros:

  • Highly decoupled (agents don't know about each other)
  • Enables parallel processing (multiple subscribers react simultaneously)
  • Easy to add new agents (just subscribe to events)

Cons:

  • Hardest to debug (event cascade can be non-obvious)
  • Risk of event storms (one event triggers 10 more, each triggers 10...)
  • Need careful event schema management

When to use: Complex systems with many agents, multiple agents react to same event, need to add agents without changing existing code.

Real example: Client had approval workflow. When "document_uploaded" event fires, 3 agents react: 1) Virus Scanner, 2) Metadata Extractor, 3) Categorization Agent. All run in parallel. Clean separation, easy to add "4) OCR Agent" later without touching existing code.
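
The event_bus object and @subscribe decorator above are shorthand. A minimal in-process version might look like the sketch below; a production system would put this behind a real broker:

from collections import defaultdict

class EventBus:
    """Minimal in-process pub-sub: handlers register per event type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Every subscriber to this event type receives the payload
        for handler in self._handlers[event_type]:
            handler(payload)

bus = EventBus()
bus.subscribe("document_uploaded", lambda e: print("virus scan:", e["doc_id"]))
bus.subscribe("document_uploaded", lambda e: print("extract metadata:", e["doc_id"]))
bus.subscribe("document_uploaded", lambda e: print("categorize:", e["doc_id"]))

# Adding an OCR agent later is one more subscribe call - existing code untouched
bus.publish("document_uploaded", {"doc_id": "doc-42"})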

Communication Protocol Decision Matrix

Protocol         Latency           Resilience   Complexity   Best For
Synchronous      Low (ms)          Low          Low          Quick, sequential workflows
Async Messaging  Medium (seconds)  High         Medium       Long-running, need retries
Pub-Sub          Medium            High         High         Many agents, event-driven

State Management

Hardest part of multi-agent systems. Agents need shared context without conflicts.

Problem: Agent A finds 10 competitors. Agent B starts analyzing #1-5. Agent C starts analyzing #6-10. Meanwhile, Agent A finds 5 more competitors. How do B and C know about new data?

Three approaches.

Approach 1: Centralized State Store

One database holds all state. Agents read/write as needed.

# Shared state in Redis or PostgreSQL
class StateStore:
    def get_state(self, task_id):
        return db.query("SELECT * FROM task_state WHERE id = ?", task_id)

    def update_state(self, task_id, new_data):
        db.execute("UPDATE task_state SET data = ? WHERE id = ?",
                   new_data, task_id)

# Agent A writes
state_store.update_state("task_123", {"competitors": found_competitors})

# Agent B reads
current_state = state_store.get_state("task_123")
analyze(current_state['competitors'])

Pros:

  • Simple mental model (single source of truth)
  • Easy to debug (inspect database)
  • Consistency (all agents see same state)

Cons:

  • Database is bottleneck (every agent read/write hits DB)
  • Concurrency issues (what if 2 agents update simultaneously?)
  • Coupling (all agents depend on shared schema)

When to use: Most use cases. Start here unless you have specific distributed needs.

Production detail: Use optimistic locking to handle concurrent updates:

def update_with_version_check(task_id, new_data):
    current_version = db.query("SELECT version FROM task_state WHERE id = ?", task_id)

    updated = db.execute("""
        UPDATE task_state
        SET data = ?, version = version + 1
        WHERE id = ? AND version = ?
    """, new_data, task_id, current_version)

    if not updated:
        raise ConflictError("State was modified by another agent")

Approach 2: Event Sourcing

Instead of storing current state, store all events. State is derived by replaying events.

# Store events
events = [
    {"type": "competitor_found", "name": "Company A", "timestamp": "..."},
    {"type": "competitor_found", "name": "Company B", "timestamp": "..."},
    {"type": "analysis_started", "competitor": "Company A", "timestamp": "..."},
]

# Derive current state
def get_current_state(events):
    state = {"competitors": [], "analyses": []}
    for event in events:
        if event['type'] == "competitor_found":
            state['competitors'].append(event['name'])
        elif event['type'] == "analysis_started":
            state['analyses'].append(event['competitor'])
    return state

Pros:

  • Full audit trail (see what happened when)
  • Time travel (replay to any point)
  • Decouples agents (they emit events, don't care who consumes)

Cons:

  • Complex to implement (event replay logic)
  • Performance (rebuilding state from 10K events is slow)
  • Storage (events accumulate forever unless purged)

When to use: Audit requirements (compliance, debugging), complex state transitions, need time-travel debugging.

Production reality: Event sourcing is overkill for 90% of multi-agent systems. We've used it twice: 1) financial approval workflows (audit trail critical), 2) experimental agent system where we needed to replay to debug emergent behaviors.

Approach 3: Agent-Local State (Pass Messages)

No shared state. Agents pass all context in messages.

# Agent A sends full context to Agent B
message = {
    "task_id": "abc123",
    "competitors": ["Company A", "Company B", "Company C"],
    "research_summary": "...",
    "timestamp": "..."
}

message_queue.publish("analysis_tasks", message)

# Agent B receives full context, doesn't need to query anywhere
def process_analysis(message):
    competitors = message['competitors']  # Everything needed is in message
    analyze(competitors)

Pros:

  • No shared state bottleneck
  • Agents fully independent
  • Scales horizontally (no database to overwhelm)

Cons:

  • Message size grows (passing full context every time)
  • Inconsistency (if Agent A finds new data, Agent B won't see it unless A sends new message)
  • Duplication (same data in multiple messages)

When to use: Fully independent tasks, high throughput, agents don't need shared context.

When we used this: Distributed scraping (each agent gets URL + config, scrapes, returns result). No shared state needed: agents are stateless workers.

Production Deployment Considerations

Multi-agent systems fail in ways single agents don't. Here's what matters.

Partial Failures

Problem: Agent A succeeds, Agent B fails. Do you retry? Retry only B? Retry everything?

Solution: Idempotency + Checkpointing

class WorkflowEngine:
    def execute_workflow(self, task_id):
        # Checkpoint: Research phase
        if not self.is_completed(task_id, "research"):
            research_data = self.research_agent.gather_data()
            self.save_checkpoint(task_id, "research", research_data)
        else:
            research_data = self.load_checkpoint(task_id, "research")

        # Checkpoint: Analysis phase
        if not self.is_completed(task_id, "analysis"):
            analysis = self.analysis_agent.analyze(research_data)
            self.save_checkpoint(task_id, "analysis", analysis)
        else:
            analysis = self.load_checkpoint(task_id, "analysis")

        return analysis

Result: If analysis fails, retry doesn't re-do research (expensive). Just retries from last checkpoint.

Retry Logic with Exponential Backoff

Agents fail (LLM rate limits, network issues). Retry, but don't hammer.

import random
import time

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            # Exponential backoff with jitter so agents don't retry in lockstep
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait_time)

Production tip: use different retry strategies for different error types (a sketch follows this list):

  • Rate limit (429): Exponential backoff (wait longer each time)
  • Server error (500): Immediate retry (transient issue)
  • Bad request (400): No retry (agent prompt is wrong, retrying won't help)
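
A minimal sketch of that dispatch, assuming an illustrative AgentCallError that carries an HTTP-style status code (real provider SDKs raise their own exception types):

import random
import time

class AgentCallError(Exception):
    """Illustrative error type carrying an HTTP-style status code."""
    def __init__(self, status_code, message=""):
        super().__init__(message)
        self.status_code = status_code

def call_with_error_aware_retry(func, max_retries=3):
    last_error = None
    for attempt in range(max_retries):
        try:
            return func()
        except AgentCallError as e:
            last_error = e
            if e.status_code == 429:
                # Rate limit: exponential backoff with jitter
                time.sleep((2 ** attempt) + random.uniform(0, 1))
            elif 500 <= e.status_code < 600:
                # Transient server error: retry immediately
                continue
            else:
                # Bad request (4xx): the prompt/payload is wrong, don't retry
                raise
    raise last_error  # retries exhausted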

Circuit Breakers

If Agent B fails 10 times in a row, stop calling it (it's broken). Fail fast.

import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.opened_at = None

    def call(self, func):
        # Circuit open (too many recent failures): fail fast instead of calling
        if self.opened_at and time.time() - self.opened_at < self.timeout:
            raise CircuitOpenError("Circuit breaker open")

        try:
            result = func()
            self.failures = 0       # Reset on success
            self.opened_at = None   # Close the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise

# Usage
agent_b_breaker = CircuitBreaker()

try:
    result = agent_b_breaker.call(lambda: agent_b.process(data))
except CircuitOpenError:
    # Agent B is broken, use fallback
    result = fallback_agent.process(data)

Monitoring Multi-Agent Systems

Single agent: track latency, cost, errors. Easy.

Multi-agent: track per-agent metrics + cross-agent dependencies. Not easy.

What to monitor:

  1. Per-agent metrics:

    • Latency (avg, p50, p95, p99)
    • Error rate (by error type)
    • Cost (API calls, tokens)
    • Throughput (requests/minute)
  2. Cross-agent metrics:

    • End-to-end latency (user request → final result)
    • Handoff success rate (Agent A → Agent B, did handoff work?)
    • Workflow completion rate (% of workflows that finish vs. fail partway)
  3. Distributed traces:

import logging
import uuid

logger = logging.getLogger("workflow")

# Tag each workflow with a trace ID
trace_id = str(uuid.uuid4())

# Each agent logs with the trace ID
logger.info("Research started", extra={"trace_id": trace_id, "agent": "research"})
# ...
logger.info("Research completed", extra={"trace_id": trace_id, "agent": "research"})

logger.info("Analysis started", extra={"trace_id": trace_id, "agent": "analysis"})
# ...
logger.info("Analysis completed", extra={"trace_id": trace_id, "agent": "analysis"})

Now you can query logs by trace_id and see full workflow execution:

[trace_123] Research started (00:00.00)
[trace_123] Research completed (00:32.45)
[trace_123] Analysis started (00:32.46)
[trace_123] Analysis failed (00:54.12) - RateLimitError

Tooling: LangSmith supports multi-agent traces natively. Shows agent call graphs, latency breakdowns, cost per agent.

Real Production Example: Sales Research Pipeline

Use case: Given company name, research company + find decision-makers + draft personalized outreach email.

Architecture: Hybrid (Coordinator + Delegator + Swarm)

Orchestrator (Coordinator pattern)
  ├─ Company Research (Delegator pattern)
  │   ├─ Website Scraper Agent
  │   ├─ LinkedIn Scraper Agent
  │   └─ News Scraper Agent
  ├─ Contact Discovery (Swarm pattern - parallel processing)
  │   ├─ LinkedIn Search Agent 1
  │   ├─ LinkedIn Search Agent 2
  │   ├─ LinkedIn Search Agent 3
  │   └─ Email Finder Agent (Apollo.io integration)
  └─ Email Drafting (Coordinator pattern)
      ├─ Personalization Agent (uses research data)
      └─ Quality Check Agent

Communication: Async messaging (Redis queue)

State: Centralized (PostgreSQL)

Deployment: AWS Lambda (orchestrator) + ECS containers (scraper swarm)

Results:

  • Processes 50 companies/hour (vs 5/hour with single agent)
  • 92% success rate (checkpointing handles partial failures)
  • £2.30 cost per company (LLM calls + scraping APIs)

What made it work:

  1. Parallelization: Scraper swarm hits 3 data sources simultaneously (website, LinkedIn, news).
  2. Specialization: Each agent does one thing well (website scraper doesn't try to analyze news).
  3. Resilience: If LinkedIn scraper fails (gets blocked), workflow continues with website + news data.
  4. Checkpointing: Orchestrator saves progress after each phase. If email drafting fails, don't re-scrape everything.

Frequently Asked Questions

How many agents is too many?

Depends on your orchestration complexity budget. We've seen:

  • 3-5 agents: Common, manageable, most workflows fit here
  • 6-10 agents: Complex but fine if architecture is clean
  • 11-20 agents: Rare, better have good reason (parallelization, strict separation of concerns)
  • 20+ agents: Red flag. Likely over-engineering. Revisit architecture.

If you have 30 agents, you've probably created a distributed system problem where you had a workflow problem.

Should I use LangGraph, CrewAI, or build custom orchestration?

LangGraph: Best for complex state management, branching logic, graph-based workflows. Learning curve, but powerful.

CrewAI: Best for role-based collaboration (agents with backstories/goals). Good DX, less flexible.

Custom (message queue + state store): Best if you have specific needs neither framework solves, or you want full control.

Recommendation: Start with LangGraph unless you have specific reasons not to. Mature, good docs, handles most patterns.

How do I prevent agents from contradicting each other?

Three strategies:

  1. Sequential execution: Agent A finishes before Agent B starts. Can't contradict if B sees A's final output.

  2. Consensus mechanism: Multiple agents vote; take the majority or a weighted average (sketched below).

  3. Hierarchy: Lead agent resolves conflicts. If Agent A says "Company is Series B", Agent B says "Company is Series C", escalate to Lead Agent to decide.

Production example: We built a research agent where 3 scraper agents gather data about a company. Often data conflicts (different sources report different revenue figures). Solution: Lead Agent has prompt: "You are reviewing research from 3 agents. When data conflicts, prioritize: 1) Official company filings, 2) Reputable news sources, 3) Aggregate data sources. If conflicting data has same priority, note the discrepancy in output."
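
A minimal sketch of strategy 2 (consensus), assuming each agent returns a simple labelled answer; weighting and tie-breaking are deliberately simplified:

from collections import Counter

def resolve_by_consensus(answers):
    """Pick the value most agents agree on; flag a tie as a discrepancy."""
    counts = Counter(answers.values())
    (top_value, top_count), *rest = counts.most_common()
    if rest and rest[0][1] == top_count:
        # No majority: surface the conflict instead of guessing
        return f"DISCREPANCY: {dict(counts)}"
    return top_value

# Three research agents disagree on funding stage
votes = {
    "website_agent": "Series B",
    "linkedin_agent": "Series C",
    "news_agent": "Series B",
}
print(resolve_by_consensus(votes))  # Series B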

What about cost? Doesn't multi-agent mean more LLM calls?

Yes and no.

More calls: Multi-agent makes multiple LLM calls (orchestrator + each agent).

But smarter calls: You can use cheaper models for simple tasks.

Example cost comparison:

Single agent (GPT-4 Turbo for everything):

1 call × 2,000 tokens × $0.01/1K = $0.02

Multi-agent (tiered models):

Orchestrator (GPT-3.5 Turbo): 500 tokens × $0.001/1K = $0.0005
Research Agent (GPT-3.5 Turbo): 1,000 tokens × $0.001/1K = $0.001
Analysis Agent (GPT-4 Turbo): 800 tokens × $0.01/1K = $0.008
Report Agent (GPT-4 Turbo): 600 tokens × $0.01/1K = $0.006

Total: $0.0155 (23% cheaper)

Plus: Caching (if research data doesn't change, cache it, don't re-fetch).
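
A minimal sketch of that caching idea with an in-process cache keyed on topic; run_research_pipeline is a hypothetical stand-in, and a production system would more likely cache in Redis with a TTL:

import functools

def run_research_pipeline(topic):
    # Stand-in for the expensive part: LLM calls, scraping, synthesis
    return f"research notes on {topic}"

@functools.lru_cache(maxsize=256)
def fetch_research(topic):
    # First call per topic does the expensive work; repeats hit the cache
    return run_research_pipeline(topic)

fetch_research("fintech competitors")  # expensive
fetch_research("fintech competitors")  # served from cache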

Result: Multi-agent can be cheaper despite more calls if you tier models intelligently.

How do I debug multi-agent systems when something goes wrong?

Invest in observability upfront. You'll need it.

  1. Distributed tracing: Tag every workflow with unique ID, log at every agent transition.

  2. Visualization: Use tools like LangSmith or build custom dashboards showing agent call graphs.

  3. Replay capability: Save inputs/outputs at each step. When something fails, replay from last good checkpoint.

  4. Alerts: Set up alerts for:

    • Workflows taking > 2x normal time
    • Error rate spikes on any agent
    • Circuit breakers opening

Debugging war story: Client had multi-agent research pipeline randomly failing 5% of the time. Distributed trace revealed: Analysis Agent occasionally got malformed JSON from Research Agent. Why? Research Agent used GPT-3.5 (cheap), which sometimes returned invalid JSON despite prompt instructions. Fix: switched Research Agent to GPT-4 Turbo for structured output reliability. Failures dropped to 0.1%.


You've built a single agent. Now you know how to build systems where agents collaborate. Start with Coordinator pattern (simplest), add complexity only when needed, instrument everything, and don't build 30 agents when 3 would do.

Most importantly: multi-agent systems are harder to build and debug. Only do it when the benefits (specialization, parallelization, resilience) outweigh the complexity cost.