Athenic · 5 Aug 2025 · 16 min read

Inside Athenic: How We Built a Multi-Agent Research System

Deep dive into Athenic's multi-agent architecture: how we orchestrate research agents across data sources to deliver comprehensive market intelligence in minutes.

Max Beech
Head of Content

TL;DR

  • Athenic orchestrates multiple specialised AI agents (web research, database query, document analysis, synthesis) to deliver comprehensive startup intelligence.
  • Our architecture: Orchestrator agent routes tasks → Specialist agents execute in parallel → Synthesis agent aggregates findings → Quality agent validates output.
  • Real performance: 92% research accuracy, 15-minute average completion for multi-source queries that previously took analysts 4–6 hours.



When we started building Athenic, we knew single-agent LLMs couldn't deliver the research quality startups need. A single GPT-4 call can't simultaneously:

  • Search the web for competitive intelligence
  • Query your CRM for customer data
  • Analyse uploaded PDFs
  • Synthesise findings into strategic insights

Multi-agent systems solve this by deploying specialist agents, each optimised for one task, that collaborate to deliver comprehensive results. Think of it like a research team: one person handles web search, another analyses documents, a third synthesises findings, and a coordinator ensures everyone stays aligned.

Here's how we architected Athenic's multi-agent research system, the technical challenges we faced, and the design decisions that let us deliver startup intelligence in minutes instead of hours.

Key takeaways

  • Multi-agent systems outperform monolithic LLMs for complex tasks requiring diverse skills (web search, structured data analysis, document parsing).
  • Our architecture: Orchestrator routes tasks → Specialists execute in parallel → Synthesiser aggregates → Quality validator ensures accuracy.
  • Key challenge: Agent coordination overhead. Solution: Shared context layer + asynchronous execution with dependency tracking.

Why multi-agent architecture

The single-agent limitation

Traditional approach (single LLM):

User: "Research competitor X's pricing, recent funding, and customer sentiment. Compare to our product."

Single GPT-4 call: Tries to search the web, hallucinates data, and provides shallow analysis. Accuracy: ~60–70%.

Why it fails:

  1. Tool use bottleneck: LLM can only call one tool at a time (web search or database query, not both).
  2. Context window limits: Trying to fit web results + database results + analysis in one prompt hits token limits.
  3. Jack-of-all-trades problem: Single agent optimised for nothing specific → mediocre at everything.

The multi-agent advantage

Athenic approach:

User: "Research competitor X's pricing, recent funding, and customer sentiment. Compare to our product."

Orchestrator agent: Breaks into sub-tasks:

  1. Web research agent: Find competitor pricing page, scrape tiers.
  2. Funding agent: Query Crunchbase API for latest funding round.
  3. Sentiment agent: Scrape Twitter, Reddit, G2 for customer feedback.
  4. Internal agent: Pull our pricing from database.
  5. Synthesis agent: Aggregate findings, generate comparison report.

Result: Comprehensive report with citations, delivered in 12 minutes. Accuracy: 92%.

Why it works:

  1. Parallel execution: Agents work simultaneously → 5× faster.
  2. Specialisation: Each agent optimised for its domain (web scraping agent uses Playwright, sentiment agent uses fine-tuned classifier).
  3. Scalability: Add new agent types (email analysis, video transcript analysis) without rebuilding core system.
[Figure: Single-Agent vs Multi-Agent: Research Task. Single-agent processes tasks sequentially (20+ min); multi-agent executes in parallel (5 min), roughly 4× faster with higher accuracy.]

System design and architecture

High-level components

1. Orchestrator Agent

  • Receives user query.
  • Plans: decomposes into sub-tasks.
  • Routes: assigns sub-tasks to specialist agents.
  • Monitors: tracks agent progress, handles failures.

2. Specialist Agents

Each agent has a narrow domain:

  • Web Research Agent: Searches Google, scrapes pages, extracts structured data.
  • Database Agent: Queries internal databases (CRM, analytics, knowledge base).
  • Document Agent: Parses PDFs, DOCX, spreadsheets.
  • API Agent: Calls external APIs (Crunchbase, LinkedIn, Twitter).
  • Sentiment Agent: Analyses text for sentiment, extracts themes.

3. Synthesis Agent

  • Aggregates outputs from specialist agents.
  • Generates cohesive narrative.
  • Cites sources.

4. Quality Agent

  • Validates synthesised output.
  • Flags hallucinations, missing citations, logical inconsistencies.
  • Requests re-work if quality below threshold.

5. Shared Context Layer

  • Stores conversation history, intermediate results, metadata.
  • All agents read/write to shared context (Supabase Postgres + pgvector).
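
To make the division of labour concrete, here's a minimal sketch of the interface these components share. The class and field names (Agent, AgentResult, run_specialists) are illustrative, not Athenic's actual types:

import asyncio
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentResult:
    agent_id: str                      # which specialist produced this output
    data: dict[str, Any]               # structured findings, written to the shared context layer
    sources: list[str] = field(default_factory=list)

class Agent:
    """Interface every specialist agent implements."""
    agent_id: str = "base"

    async def execute(self, task: str, context: dict[str, Any]) -> AgentResult:
        raise NotImplementedError

async def run_specialists(agents: list[Agent], task: str, context: dict[str, Any]) -> list[AgentResult]:
    # The orchestrator fans sub-tasks out to specialists and awaits them in parallel.
    return await asyncio.gather(*(agent.execute(task, context) for agent in agents))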
[Figure: Athenic Multi-Agent Architecture. Orchestrator routes tasks → specialist agents (Web, DB, Doc, API, Sentiment) execute → shared context layer (Supabase) coordinates → Synthesis agent aggregates results.]

Agent types and responsibilities

Orchestrator Agent

Role: Task planner and coordinator.

Inputs: User query, conversation history.

Outputs: Task decomposition plan, agent assignments.

Example:

User query: "Research Notion's pricing strategy and compare to ours."

Orchestrator plan:

Tasks:
1. Web Agent: Scrape Notion pricing page → extract tiers, features, prices.
2. Database Agent: Query our pricing table → get our tiers.
3. Synthesis Agent: Compare Notion vs us → generate markdown table + analysis.

Tech stack:

  • LLM: GPT-4 Turbo (strong planning capabilities).
  • Framework: OpenAI Agents SDK.
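
A plan like the one above can be produced by asking the planner model for structured output. Here's a minimal sketch using the OpenAI Python SDK's structured-output parsing; the schema and model name are assumptions for illustration, not Athenic's exact planner:

from openai import OpenAI
from pydantic import BaseModel

class SubTask(BaseModel):
    agent: str                  # e.g. "web_research", "database", "synthesis"
    instruction: str            # what the specialist should do
    depends_on: list[int] = []  # indices of prerequisite sub-tasks

class Plan(BaseModel):
    tasks: list[SubTask]

client = OpenAI()

def plan_research(query: str) -> Plan:
    # Ask the planner model to decompose the query into routed sub-tasks.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",  # illustrative; any structured-output-capable model works
        messages=[
            {"role": "system", "content": (
                "Decompose the research query into sub-tasks, assigning each to one of: "
                "web_research, database, document, api, sentiment, synthesis."
            )},
            {"role": "user", "content": query},
        ],
        response_format=Plan,
    )
    return completion.choices[0].message.parsed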

Web Research Agent

Role: Search the web, scrape pages, extract structured data.

Tools:

  • Search: Google Custom Search API.
  • Scraping: Playwright (handles JavaScript-heavy sites).
  • Extraction: BeautifulSoup + GPT-4 (structured output).

Example task:

"Find Stripe's latest funding round amount and date."

Execution:

  1. Google search: "Stripe funding Series X."
  2. Scrape top 3 results (Crunchbase, TechCrunch, Bloomberg).
  3. Extract: amount, date, investors.
  4. Return structured JSON.

Output:

{
  "company": "Stripe",
  "funding_round": "Series I",
  "amount_usd": 6500000000,
  "date": "2023-03-14",
  "investors": ["Thrive Capital", "General Catalyst"],
  "sources": ["https://crunchbase.com/...", "https://techcrunch.com/..."]
}
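
A stripped-down version of the fetch step might look like this (Playwright for rendering, BeautifulSoup for text extraction; the function name is ours for illustration):

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

async def fetch_page_text(url: str) -> str:
    # Render with a headless browser so JavaScript-heavy pages still produce content.
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
    # Reduce the markup to visible text before handing it to the extraction model.
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

The extracted text is then passed to GPT-4 along with the target JSON schema (company, funding_round, amount_usd, date, investors, sources) to produce structured output like the example above.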

Database Agent

Role: Query internal databases (CRM, analytics, knowledge base).

Tools:

  • Database: Supabase (Postgres + pgvector).
  • Query builder: Natural language → SQL (GPT-4 with schema context).

Example task:

"How many customers signed up last month?"

Execution:

  1. Convert to SQL: SELECT COUNT(*) FROM customers WHERE created_at >= '2025-07-01' AND created_at < '2025-08-01';
  2. Execute query.
  3. Return result: {"count": 127}.

Safety: Queries are sandboxed (read-only access, row-level security).
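
A hedged sketch of the natural-language-to-SQL step, using a plain read-only Postgres connection for clarity (in production the queries run through Supabase with row-level security, as noted above; schema string and model name are illustrative):

import psycopg2
from openai import OpenAI

client = OpenAI()
SCHEMA = "customers(id uuid, created_at timestamptz, plan text)"  # schema context given to the model

def answer_with_sql(question: str, dsn: str) -> list[tuple]:
    # 1. Natural language -> SQL, grounded in the table schema.
    sql = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Write a single read-only Postgres query. Schema: {SCHEMA}"},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    # 2. Execute against a read-only session so the agent can never mutate data.
    conn = psycopg2.connect(dsn)
    conn.set_session(readonly=True)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()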

Document Agent

Role: Parse and analyse uploaded documents (PDFs, DOCX, spreadsheets).

Tools:

  • PDF parsing: PyMuPDF.
  • OCR: Tesseract (for scanned docs).
  • Analysis: GPT-4 (summarisation, Q&A).

Example task:

"Extract key metrics from this investor deck PDF."

Execution:

  1. Parse PDF → extract text.
  2. Prompt GPT-4: "Extract all numerical metrics (ARR, growth rate, customer count, etc.) from this text: [text]."
  3. Return structured data.
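
A minimal sketch of steps 1–2 with PyMuPDF (the prompt is abbreviated; scanned documents would go through Tesseract first):

import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()

def extract_metrics(pdf_path: str) -> str:
    # Pull raw text out of every page.
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()

    # Ask the model to pull out the numerical metrics as JSON.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Extract all numerical metrics (ARR, growth rate, customer count, etc.) "
                       f"as JSON from this text:\n\n{text}",
        }],
    )
    return response.choices[0].message.content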

API Agent

Role: Call external APIs (Crunchbase, LinkedIn, Twitter, PubMed).

Integration approach:

  • MCP (Model Context Protocol): Standardised way to connect LLMs to external tools.
  • We've integrated 100+ MCP servers (Crunchbase, GitHub, Google Scholar, etc.).

Example task:

"Get company profile for startup X from Crunchbase."

Execution:

  1. Call Crunchbase MCP server: get_company(name="Startup X").
  2. Return: funding, team size, industry, etc.
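
Because MCP standardises the tool interface, the API Agent doesn't need bespoke client code per provider. A rough sketch using the MCP Python SDK; the server command and tool name here are placeholders, not a specific published Crunchbase server:

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def get_company_profile(name: str):
    # Spawn the MCP server locally and call one of its tools.
    params = StdioServerParameters(command="crunchbase-mcp-server", args=[])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("get_company", {"name": name})
            return result.content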

Synthesis Agent

Role: Aggregate findings from specialist agents into cohesive narrative.

Inputs: Outputs from specialist agents (JSON, text, tables).

Output: Markdown report with citations.

Example:

Inputs:

  • Web Agent: Notion pricing tiers.
  • Database Agent: Our pricing tiers.

Synthesis output:

# Notion vs Our Product: Pricing Comparison

Notion offers 4 tiers: Free, Plus ($8/user/mo), Business ($15), Enterprise (custom).
Our product offers 3 tiers: Starter (free), Pro ($12), Enterprise ($25).

**Key differences:**
- Notion's Plus tier is 33% cheaper than our Pro.
- We offer more integrations at Pro tier (50+ vs Notion's 20).
- Notion targets broader market (individuals + teams); we focus on B2B.

**Recommendation:** Consider lowering Pro tier to $10 to match Notion's positioning.

Sources: [1] Notion pricing page, [2] Internal database
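
Under the hood, synthesis is essentially a well-constrained prompt over the specialists' structured outputs. A simplified sketch, assuming findings shaped like the AgentResult sketch earlier (the system prompt is illustrative):

import json
from openai import OpenAI

client = OpenAI()

def synthesise(findings: list[dict]) -> str:
    # Each finding is one specialist's structured output plus its sources.
    evidence = "\n\n".join(
        f"[{i + 1}] {f['agent_id']}: {json.dumps(f['data'])} (sources: {', '.join(f['sources'])})"
        for i, f in enumerate(findings)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Write a markdown research report. Cite evidence as [n]; "
                "never state a figure that is not in the evidence."
            )},
            {"role": "user", "content": evidence},
        ],
    )
    return response.choices[0].message.content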

Orchestration and coordination

Challenge: Agent dependencies

Some tasks depend on others. Example:

"Research competitor pricing, then recommend our pricing changes."

Dependency graph:

  1. Web Agent: Get competitor pricing.
  2. Database Agent: Get our pricing.
  3. Synthesis Agent: Compare (depends on 1 + 2).
  4. Recommendation Agent: Suggest changes (depends on 3).

Solution: Task graph with dependency tracking.

Implementation (simplified):

import asyncio


class TaskGraph:
    def __init__(self):
        self.tasks = {}
        self.dependencies = {}

    def add_task(self, task_id, agent, depends_on=None):
        self.tasks[task_id] = {"agent": agent, "status": "pending"}
        self.dependencies[task_id] = depends_on or []

    async def execute(self):
        """Execute tasks respecting dependencies."""
        completed = set()

        while len(completed) < len(self.tasks):
            # Find tasks ready to run (all dependencies met)
            ready = [
                tid for tid in self.tasks
                if self.tasks[tid]["status"] == "pending"
                and all(dep in completed for dep in self.dependencies[tid])
            ]
            if not ready:
                # Nothing can run but work remains: circular or unsatisfiable dependencies.
                raise RuntimeError("Task graph deadlocked: check for dependency cycles")

            # Run ready tasks in parallel
            await asyncio.gather(*[
                self.run_task(tid) for tid in ready
            ])

            completed.update(ready)

        return self.get_final_output()

    async def run_task(self, task_id):
        agent = self.tasks[task_id]["agent"]
        result = await agent.execute()
        self.tasks[task_id]["status"] = "completed"
        self.tasks[task_id]["result"] = result
        return result

Challenge: Shared context

Agents need to share information. Example:

  • Web Agent finds competitor raised $50M.
  • Synthesis Agent needs this data to generate report.

Solution: Shared context layer (Supabase).

Implementation:

  • Each agent reads/writes to research_context table.
  • Includes: task_id, agent_id, data (JSONB), timestamp.
CREATE TABLE research_context (
  id UUID PRIMARY KEY,
  research_job_id UUID,
  agent_id TEXT,
  task_id TEXT,
  data JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Agents query context:

def get_context(research_job_id, task_id):
    """Retrieve context for a task."""
    return (
        supabase.table("research_context")
        .select("*")
        .eq("research_job_id", research_job_id)
        .eq("task_id", task_id)
        .execute()
    )
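
The write side is symmetrical. A minimal sketch using the supabase-py client (column names match the table above):

def write_context(research_job_id, agent_id, task_id, data):
    """Persist an agent's intermediate result so downstream agents can read it."""
    return supabase.table("research_context").insert({
        "research_job_id": research_job_id,
        "agent_id": agent_id,
        "task_id": task_id,
        "data": data,  # JSONB payload, e.g. {"funding_usd": 50000000}
    }).execute()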

Challenges and lessons learned

Challenge 1: Coordination overhead

Problem: Orchestrating 5+ agents adds latency (planning, routing, waiting for dependencies).

Initial approach: Sequential execution → 20+ min per query.

Solution: Parallel execution with dependency tracking → 5–7 min.

Lesson: Optimise for parallelism. Only enforce dependencies where truly necessary.

Challenge 2: Error propagation

Problem: If Web Agent fails (rate limit, timeout), entire research job fails.

Initial approach: Hard failures → poor user experience.

Solution: Graceful degradation.

  • Web Agent fails? → Synthesise with available data, note missing sources.
  • Orchestrator retries failed agents (exponential backoff).

Example output:

"We found competitor pricing on 3 of 5 sites. Unable to access Site X (timeout) and Site Y (rate limit). Recommendations based on available data."

Challenge 3: Quality control

Problem: Agents sometimes hallucinate or return low-confidence answers.

Initial approach: No validation → 78% accuracy.

Solution: Quality Agent validates output.

  • Checks citations (do sources actually contain claimed data?).
  • Flags low-confidence statements (e.g., "probably," "might be").
  • Requests re-work if quality score <0.85.

Result: Accuracy improved to 92%.
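
A simplified sketch of the validation pass. In practice the citation check is itself an LLM call that compares each claim against its cited source; here it's reduced to a pre-computed fraction, and the hedge-word list and penalty weights are illustrative:

HEDGE_WORDS = ("probably", "might be", "possibly", "appears to")

def quality_score(report: str, citation_accuracy: float) -> tuple[float, list[str]]:
    """Score a synthesised report; below 0.85 the orchestrator requests re-work."""
    issues = []
    score = citation_accuracy  # fraction of claims whose cited source actually contains the data

    for phrase in HEDGE_WORDS:
        if phrase in report.lower():
            issues.append(f"low-confidence phrasing: '{phrase}'")
            score -= 0.05

    return max(score, 0.0), issues

Reports scoring below the 0.85 threshold go back to the Synthesis Agent with the flagged issues appended to the prompt.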

Challenge 4: Cost management

Problem: 5+ LLM calls per research job → $0.50–$2 per query.

Solution:

  • Use cheaper models for non-critical agents (GPT-4o-mini for Web Agent extraction).
  • Cache common queries (e.g., "Stripe pricing" cached for 7 days).
  • Implement smart routing (simple queries skip specialist agents and go straight to a single LLM call).

Result: Average cost: $0.30/query (down from $1.20).
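
A bare-bones sketch of the caching idea (see the second bullet above); the key scheme, TTL, and in-memory store are illustrative, and a production cache would live in a shared store rather than process memory:

import hashlib
import time

CACHE_TTL_SECONDS = 7 * 24 * 3600   # e.g. "Stripe pricing" stays fresh for 7 days
_cache: dict[str, tuple[float, dict]] = {}

def cached_lookup(query: str, fetch):
    """Return a cached result for repeat queries; otherwise run the fetch and store it."""
    key = hashlib.sha256(query.lower().encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    result = fetch(query)
    _cache[key] = (time.time(), result)
    return result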

Performance and metrics

Speed

  • Single-source queries (e.g., "What's our MRR?"): 10–30 seconds.
  • Multi-source queries (e.g., "Compare top 5 competitors"): 5–15 minutes.
  • Complex research (e.g., "Full market landscape analysis"): 15–30 minutes.

Accuracy

  • Factual claims: 92% accuracy (validated against ground truth dataset of 500 queries).
  • Citation accuracy: 97% (sources actually contain claimed data).
  • Hallucination rate: 3% (down from 18% pre-Quality Agent).

User satisfaction

  • CSAT: 4.6/5 (based on 1,200+ research jobs, Aug 2024–Jul 2025).
  • Top praise: Speed, comprehensiveness, citations.
  • Top complaint: Occasional missing data when sources unavailable.

Next steps: What we're building

Multi-modal research

Currently text-only. Adding:

  • Image analysis: Extract charts/tables from screenshots, PDFs.
  • Video transcripts: Analyse YouTube videos, webinars.

Proactive research

Instead of reactive (user asks → we research), build proactive agents:

  • "Monitor competitor X, alert me when they launch new features."
  • "Track funding news in AI space, weekly digest."

Collaborative research

Multi-user research projects:

  • Teams can assign sub-tasks to different agents.
  • Real-time collaboration on synthesised reports.

Building Athenic's multi-agent research system taught us that specialisation beats generalisation. By deploying narrow, expert agents that collaborate through a shared context layer, we deliver research quality that matches human analysts in 1/20th of the time. If you're building multi-agent systems, start simple (2–3 agents), optimise for parallelism, and invest in quality validation from day one.