Athenic · 5 Aug 2025 · 16 min read

Inside Athenic: How We Built a Multi-Agent Research System

Deep dive into Athenic's multi-agent architecture: how we orchestrate research agents across data sources to deliver comprehensive market intelligence in minutes.

Max Beech
Head of Content

TL;DR

  • Athenic orchestrates multiple specialised AI agents (web research, database query, document analysis, synthesis) to deliver comprehensive startup intelligence.
  • Our architecture: Orchestrator agent routes tasks → Specialist agents execute in parallel → Synthesis agent aggregates findings → Quality agent validates output.
  • Real performance: 92% research accuracy, 15-minute average completion for multi-source queries that previously took analysts 4–6 hours.



When we started building Athenic, we knew single-agent LLMs couldn't deliver the research quality startups need. A single GPT-4 call can't simultaneously:

  • Search the web for competitive intelligence
  • Query your CRM for customer data
  • Analyse uploaded PDFs
  • Synthesise findings into strategic insights

Multi-agent systems solve this by deploying specialist agents, each optimised for one task, that collaborate to deliver comprehensive results. Think of it like a research team: one person handles web search, another analyses documents, a third synthesises findings, and a coordinator ensures everyone stays aligned.

Here's how we architected Athenic's multi-agent research system, the technical challenges we faced, and the design decisions that let us deliver startup intelligence in minutes instead of hours.

Key takeaways

  • Multi-agent systems outperform monolithic LLMs for complex tasks requiring diverse skills (web search, structured data analysis, document parsing).
  • Our architecture: Orchestrator routes tasks → Specialists execute in parallel → Synthesiser aggregates → Quality validator ensures accuracy.
  • Key challenge: Agent coordination overhead. Solution: Shared context layer + asynchronous execution with dependency tracking.

Why multi-agent architecture

The single-agent limitation

Traditional approach (single LLM):

User: "Research competitor X's pricing, recent funding, and customer sentiment. Compare to our product."

Single GPT-4 call: Tries to search the web, hallucinates data, and provides shallow analysis. Accuracy: ~60–70%.

Why it fails:

  1. Tool use bottleneck: LLM can only call one tool at a time (web search or database query, not both).
  2. Context window limits: Trying to fit web results + database results + analysis in one prompt hits token limits.
  3. Jack-of-all-trades problem: Single agent optimised for nothing specific → mediocre at everything.

The multi-agent advantage

Athenic approach:

User: "Research competitor X's pricing, recent funding, and customer sentiment. Compare to our product."

Orchestrator agent: Breaks into sub-tasks:

  1. Web research agent: Find competitor pricing page, scrape tiers.
  2. Funding agent: Query Crunchbase API for latest funding round.
  3. Sentiment agent: Scrape Twitter, Reddit, G2 for customer feedback.
  4. Internal agent: Pull our pricing from database.
  5. Synthesis agent: Aggregate findings, generate comparison report.

Result: Comprehensive report with citations, delivered in 12 minutes. Accuracy: 92%.

Why it works:

  1. Parallel execution: Agents work simultaneously → 5× faster.
  2. Specialisation: Each agent optimised for its domain (web scraping agent uses Playwright, sentiment agent uses fine-tuned classifier).
  3. Scalability: Add new agent types (email analysis, video transcript analysis) without rebuilding core system.
[Figure: Single-Agent vs Multi-Agent: Research Task. Single-agent processes tasks sequentially (20+ min); multi-agent executes in parallel (5 min), roughly 4× faster with higher accuracy.]

System design and architecture

High-level components

1. Orchestrator Agent

  • Receives user query.
  • Plans: decomposes into sub-tasks.
  • Routes: assigns sub-tasks to specialist agents.
  • Monitors: tracks agent progress, handles failures.

2. Specialist Agents

Each agent has a narrow domain:

  • Web Research Agent: Searches Google, scrapes pages, extracts structured data.
  • Database Agent: Queries internal databases (CRM, analytics, knowledge base).
  • Document Agent: Parses PDFs, DOCX, spreadsheets.
  • API Agent: Calls external APIs (Crunchbase, LinkedIn, Twitter).
  • Sentiment Agent: Analyses text for sentiment, extracts themes.

3. Synthesis Agent

  • Aggregates outputs from specialist agents.
  • Generates cohesive narrative.
  • Cites sources.

4. Quality Agent

  • Validates synthesised output.
  • Flags hallucinations, missing citations, logical inconsistencies.
  • Requests re-work if quality below threshold.

5. Shared Context Layer

  • Stores conversation history, intermediate results, metadata.
  • All agents read/write to shared context (Supabase Postgres + pgvector).
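
To make the division of labour concrete, here's a minimal sketch of the interface these components share. The class and field names (Agent, AgentResult, run_specialists) are illustrative, not Athenic's actual types:

import asyncio
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentResult:
    agent_id: str                      # which specialist produced this output
    data: dict[str, Any]               # structured findings, written to the shared context layer
    sources: list[str] = field(default_factory=list)

class Agent:
    """Interface every specialist agent implements."""
    agent_id: str = "base"

    async def execute(self, task: str, context: dict[str, Any]) -> AgentResult:
        raise NotImplementedError

async def run_specialists(agents: list[Agent], task: str, context: dict[str, Any]) -> list[AgentResult]:
    # The orchestrator fans sub-tasks out to specialists and awaits them in parallel.
    return await asyncio.gather(*(agent.execute(task, context) for agent in agents))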
[Figure: Athenic Multi-Agent Architecture. Orchestrator routes tasks → specialist agents (Web, DB, Doc, API, Sentiment) execute → shared context layer (Supabase) coordinates → Synthesis agent aggregates results.]

Agent types and responsibilities

Orchestrator Agent

Role: Task planner and coordinator.

Inputs: User query, conversation history.

Outputs: Task decomposition plan, agent assignments.

Example:

User query: "Research Notion's pricing strategy and compare to ours."

Orchestrator plan:

Tasks:
1. Web Agent: Scrape Notion pricing page → extract tiers, features, prices.
2. Database Agent: Query our pricing table → get our tiers.
3. Synthesis Agent: Compare Notion vs us → generate markdown table + analysis.

Tech stack:

  • LLM: GPT-4 Turbo (strong planning capabilities).
  • Framework: OpenAI Agents SDK.
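
A plan like the one above can be produced by asking the planner model for structured output. Here's a minimal sketch using the OpenAI Python SDK's structured-output parsing; the schema and model name are assumptions for illustration, not Athenic's exact planner:

from openai import OpenAI
from pydantic import BaseModel

class SubTask(BaseModel):
    agent: str                  # e.g. "web_research", "database", "synthesis"
    instruction: str            # what the specialist should do
    depends_on: list[int] = []  # indices of prerequisite sub-tasks

class Plan(BaseModel):
    tasks: list[SubTask]

client = OpenAI()

def plan_research(query: str) -> Plan:
    # Ask the planner model to decompose the query into routed sub-tasks.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",  # illustrative; any structured-output-capable model works
        messages=[
            {"role": "system", "content": (
                "Decompose the research query into sub-tasks, assigning each to one of: "
                "web_research, database, document, api, sentiment, synthesis."
            )},
            {"role": "user", "content": query},
        ],
        response_format=Plan,
    )
    return completion.choices[0].message.parsed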

Web Research Agent

Role: Search the web, scrape pages, extract structured data.

Tools:

  • Search: Google Custom Search API.
  • Scraping: Playwright (handles JavaScript-heavy sites).
  • Extraction: BeautifulSoup + GPT-4 (structured output).

Example task:

"Find Stripe's latest funding round amount and date."

Execution:

  1. Google search: "Stripe funding Series X."
  2. Scrape top 3 results (Crunchbase, TechCrunch, Bloomberg).
  3. Extract: amount, date, investors.
  4. Return structured JSON.

Output:

{
  "company": "Stripe",
  "funding_round": "Series I",
  "amount_usd": 6500000000,
  "date": "2023-03-14",
  "investors": ["Thrive Capital", "General Catalyst"],
  "sources": ["https://crunchbase.com/...", "https://techcrunch.com/..."]
}
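
A stripped-down version of the fetch step might look like this (Playwright for rendering, BeautifulSoup for text extraction; the function name is ours for illustration):

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

async def fetch_page_text(url: str) -> str:
    # Render with a headless browser so JavaScript-heavy pages still produce content.
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
    # Reduce the markup to visible text before handing it to the extraction model.
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

The extracted text is then passed to GPT-4 along with the target JSON schema (company, funding_round, amount_usd, date, investors, sources) to produce structured output like the example above.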

Database Agent

Role: Query internal databases (CRM, analytics, knowledge base).

Tools:

  • Database: Supabase (Postgres + pgvector).
  • Query builder: Natural language → SQL (GPT-4 with schema context).

Example task:

"How many customers signed up last month?"

Execution:

  1. Convert to SQL: SELECT COUNT(*) FROM customers WHERE created_at >= '2025-07-01' AND created_at < '2025-08-01';
  2. Execute query.
  3. Return result: {"count": 127}.

Safety: Queries are sandboxed (read-only access, row-level security).
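
A hedged sketch of the natural-language-to-SQL step, using a plain read-only Postgres connection for clarity (in production the queries run through Supabase with row-level security, as noted above; schema string and model name are illustrative):

import psycopg2
from openai import OpenAI

client = OpenAI()
SCHEMA = "customers(id uuid, created_at timestamptz, plan text)"  # schema context given to the model

def answer_with_sql(question: str, dsn: str) -> list[tuple]:
    # 1. Natural language -> SQL, grounded in the table schema.
    sql = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Write a single read-only Postgres query. Schema: {SCHEMA}"},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    # 2. Execute against a read-only session so the agent can never mutate data.
    conn = psycopg2.connect(dsn)
    conn.set_session(readonly=True)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()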

Document Agent

Role: Parse and analyse uploaded documents (PDFs, DOCX, spreadsheets).

Tools:

  • PDF parsing: PyMuPDF.
  • OCR: Tesseract (for scanned docs).
  • Analysis: GPT-4 (summarisation, Q&A).

Example task:

"Extract key metrics from this investor deck PDF."

Execution:

  1. Parse PDF → extract text.
  2. Prompt GPT-4: "Extract all numerical metrics (ARR, growth rate, customer count, etc.) from this text: [text]."
  3. Return structured data.
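
A minimal sketch of steps 1–2 with PyMuPDF (the prompt is abbreviated; scanned documents would go through Tesseract first):

import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()

def extract_metrics(pdf_path: str) -> str:
    # Pull raw text out of every page.
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()

    # Ask the model to pull out the numerical metrics as JSON.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Extract all numerical metrics (ARR, growth rate, customer count, etc.) "
                       f"as JSON from this text:\n\n{text}",
        }],
    )
    return response.choices[0].message.content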

API Agent

Role: Call external APIs (Crunchbase, LinkedIn, Twitter, PubMed).

Integration approach:

  • MCP (Model Context Protocol): Standardised way to connect LLMs to external tools.
  • We've integrated 100+ MCP servers (Crunchbase, GitHub, Google Scholar, etc.).

Example task:

"Get company profile for startup X from Crunchbase."

Execution:

  1. Call Crunchbase MCP server: get_company(name="Startup X").
  2. Return: funding, team size, industry, etc.
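
Because MCP standardises the tool interface, the API Agent doesn't need bespoke client code per provider. A rough sketch using the MCP Python SDK; the server command and tool name here are placeholders, not a specific published Crunchbase server:

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def get_company_profile(name: str):
    # Spawn the MCP server locally and call one of its tools.
    params = StdioServerParameters(command="crunchbase-mcp-server", args=[])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("get_company", {"name": name})
            return result.content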

Synthesis Agent

Role: Aggregate findings from specialist agents into cohesive narrative.

Inputs: Outputs from specialist agents (JSON, text, tables).

Output: Markdown report with citations.

Example:

Inputs:

  • Web Agent: Notion pricing tiers.
  • Database Agent: Our pricing tiers.

Synthesis output:

# Notion vs Our Product: Pricing Comparison

Notion offers 4 tiers: Free, Plus ($8/user/mo), Business ($15), Enterprise (custom).
Our product offers 3 tiers: Starter (free), Pro ($12), Enterprise ($25).

**Key differences:**
- Notion's Plus tier is 33% cheaper than our Pro.
- We offer more integrations at Pro tier (50+ vs Notion's 20).
- Notion targets broader market (individuals + teams); we focus on B2B.

**Recommendation:** Consider lowering Pro tier to $10 to match Notion's positioning.

Sources: [1] Notion pricing page, [2] Internal database
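
Under the hood, synthesis is essentially a well-constrained prompt over the specialists' structured outputs. A simplified sketch, assuming findings shaped like the AgentResult sketch earlier (the system prompt is illustrative):

import json
from openai import OpenAI

client = OpenAI()

def synthesise(findings: list[dict]) -> str:
    # Each finding is one specialist's structured output plus its sources.
    evidence = "\n\n".join(
        f"[{i + 1}] {f['agent_id']}: {json.dumps(f['data'])} (sources: {', '.join(f['sources'])})"
        for i, f in enumerate(findings)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Write a markdown research report. Cite evidence as [n]; "
                "never state a figure that is not in the evidence."
            )},
            {"role": "user", "content": evidence},
        ],
    )
    return response.choices[0].message.content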

Orchestration and coordination

Challenge: Agent dependencies

Some tasks depend on others. Example:

"Research competitor pricing, then recommend our pricing changes."

Dependency graph:

  1. Web Agent: Get competitor pricing.
  2. Database Agent: Get our pricing.
  3. Synthesis Agent: Compare (depends on 1 + 2).
  4. Recommendation Agent: Suggest changes (depends on 3).

Solution: Task graph with dependency tracking.

Implementation (simplified):

import asyncio


class TaskGraph:
    def __init__(self):
        self.tasks = {}
        self.dependencies = {}

    def add_task(self, task_id, agent, depends_on=None):
        self.tasks[task_id] = {"agent": agent, "status": "pending"}
        self.dependencies[task_id] = depends_on or []

    async def execute(self):
        """Execute tasks respecting dependencies."""
        completed = set()

        while len(completed) < len(self.tasks):
            # Find tasks ready to run (all dependencies met)
            ready = [
                tid for tid in self.tasks
                if self.tasks[tid]["status"] == "pending"
                and all(dep in completed for dep in self.dependencies[tid])
            ]
            if not ready:
                # Nothing can run but work remains: circular or unsatisfiable dependencies.
                raise RuntimeError("Task graph deadlocked: check for dependency cycles")

            # Run ready tasks in parallel
            await asyncio.gather(*[
                self.run_task(tid) for tid in ready
            ])

            completed.update(ready)

        return self.get_final_output()

    async def run_task(self, task_id):
        agent = self.tasks[task_id]["agent"]
        result = await agent.execute()
        self.tasks[task_id]["status"] = "completed"
        self.tasks[task_id]["result"] = result
        return result

Challenge: Shared context

Agents need to share information. Example:

  • Web Agent finds competitor raised $50M.
  • Synthesis Agent needs this data to generate report.

Solution: Shared context layer (Supabase).

Implementation:

  • Each agent reads/writes to research_context table.
  • Includes: task_id, agent_id, data (JSONB), timestamp.
CREATE TABLE research_context (
  id UUID PRIMARY KEY,
  research_job_id UUID,
  agent_id TEXT,
  task_id TEXT,
  data JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Agents query context:

def get_context(research_job_id, task_id):
    """Retrieve context for a task."""
    return (
        supabase.table("research_context")
        .select("*")
        .eq("research_job_id", research_job_id)
        .eq("task_id", task_id)
        .execute()
    )
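
The write side is symmetrical. A minimal sketch using the supabase-py client (column names match the table above):

def write_context(research_job_id, agent_id, task_id, data):
    """Persist an agent's intermediate result so downstream agents can read it."""
    return supabase.table("research_context").insert({
        "research_job_id": research_job_id,
        "agent_id": agent_id,
        "task_id": task_id,
        "data": data,  # JSONB payload, e.g. {"funding_usd": 50000000}
    }).execute()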

Challenges and lessons learned

Challenge 1: Coordination overhead

Problem: Orchestrating 5+ agents adds latency (planning, routing, waiting for dependencies).

Initial approach: Sequential execution → 20+ min per query.

Solution: Parallel execution with dependency tracking → 5–7 min.

Lesson: Optimise for parallelism. Only enforce dependencies where truly necessary.

Challenge 2: Error propagation

Problem: If Web Agent fails (rate limit, timeout), entire research job fails.

Initial approach: Hard failures → poor user experience.

Solution: Graceful degradation.

  • Web Agent fails? → Synthesise with available data, note missing sources.
  • Orchestrator retries failed agents (exponential backoff).

Example output:

"We found competitor pricing on 3 of 5 sites. Unable to access Site X (timeout) and Site Y (rate limit). Recommendations based on available data."

Challenge 3: Quality control

Problem: Agents sometimes hallucinate or return low-confidence answers.

Initial approach: No validation → 78% accuracy.

Solution: Quality Agent validates output.

  • Checks citations (do sources actually contain claimed data?).
  • Flags low-confidence statements (e.g., "probably," "might be").
  • Requests re-work if quality score <0.85.

Result: Accuracy improved to 92%.
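
A simplified sketch of the validation pass. In practice the citation check is itself an LLM call that compares each claim against its cited source; here it's reduced to a pre-computed fraction, and the hedge-word list and penalty weights are illustrative:

HEDGE_WORDS = ("probably", "might be", "possibly", "appears to")

def quality_score(report: str, citation_accuracy: float) -> tuple[float, list[str]]:
    """Score a synthesised report; below 0.85 the orchestrator requests re-work."""
    issues = []
    score = citation_accuracy  # fraction of claims whose cited source actually contains the data

    for phrase in HEDGE_WORDS:
        if phrase in report.lower():
            issues.append(f"low-confidence phrasing: '{phrase}'")
            score -= 0.05

    return max(score, 0.0), issues

Reports scoring below the 0.85 threshold go back to the Synthesis Agent with the flagged issues appended to the prompt.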

Challenge 4: Cost management

Problem: 5+ LLM calls per research job → $0.50–$2 per query.

Solution:

  • Use cheaper models for non-critical agents (GPT-4o-mini for Web Agent extraction).
  • Cache common queries (e.g., "Stripe pricing" cached for 7 days).
  • Implement smart routing (simple queries skip specialist agents and go straight to a single LLM call).

Result: Average cost: $0.30/query (down from $1.20).
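
A bare-bones sketch of the caching idea (see the second bullet above); the key scheme, TTL, and in-memory store are illustrative, and a production cache would live in a shared store rather than process memory:

import hashlib
import time

CACHE_TTL_SECONDS = 7 * 24 * 3600   # e.g. "Stripe pricing" stays fresh for 7 days
_cache: dict[str, tuple[float, dict]] = {}

def cached_lookup(query: str, fetch):
    """Return a cached result for repeat queries; otherwise run the fetch and store it."""
    key = hashlib.sha256(query.lower().encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    result = fetch(query)
    _cache[key] = (time.time(), result)
    return result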

Performance and metrics

Speed

  • Single-source queries (e.g., "What's our MRR?"): 10–30 seconds.
  • Multi-source queries (e.g., "Compare top 5 competitors"): 5–15 minutes.
  • Complex research (e.g., "Full market landscape analysis"): 15–30 minutes.

Accuracy

  • Factual claims: 92% accuracy (validated against ground truth dataset of 500 queries).
  • Citation accuracy: 97% (sources actually contain claimed data).
  • Hallucination rate: 3% (down from 18% pre-Quality Agent).

User satisfaction

  • CSAT: 4.6/5 (based on 1,200+ research jobs, Aug 2024–Jul 2025).
  • Top praise: Speed, comprehensiveness, citations.
  • Top complaint: Occasional missing data when sources unavailable.

Next steps: What we're building

Multi-modal research

Currently text-only. Adding:

  • Image analysis: Extract charts/tables from screenshots, PDFs.
  • Video transcripts: Analyse YouTube videos, webinars.

Proactive research

Instead of reactive (user asks → we research), build proactive agents:

  • "Monitor competitor X, alert me when they launch new features."
  • "Track funding news in AI space, weekly digest."

Collaborative research

Multi-user research projects:

  • Teams can assign sub-tasks to different agents.
  • Real-time collaboration on synthesised reports.

Building Athenic's multi-agent research system taught us that specialisation beats generalisation. By deploying narrow, expert agents that collaborate through a shared context layer, we deliver research quality that matches human analysts in 1/20th of the time. If you're building multi-agent systems, start simple (2–3 agents), optimise for parallelism, and invest in quality validation from day one.