Academy · 3 Aug 2024 · 9 min read

Case Study: How Ramp Automated 83% of Expense Categorization

Deep dive into Ramp's AI agent implementation - architecture, challenges, results, and lessons learned from automating expense categorization at scale.

Max Beech
Head of Content

TL;DR

  • Ramp automated 83% of expense categorization using a multi-agent system, reducing finance team workload by 12 hours/week.
  • Implementation timeline: 12 weeks from kickoff to production (4 weeks design, 6 weeks build, 2 weeks testing).
  • Results: 96% categorization accuracy, £127K wasteful SaaS spend flagged annually, monthly close time reduced from 4-6 hours to 10 minutes.
  • Architecture: Three-agent parallel execution (categorizer, department assigner, anomaly detector) with human oversight for edge cases.
  • Key lesson: Started with historical data (2+ years labeled transactions) for training - accuracy improved from 76% (zero-shot) to 96% (fine-tuned).


Ramp processes millions of corporate card transactions monthly. Before agent automation, their finance team spent 15-20 hours weekly manually categorizing expenses - clicking dropdown menus, cross-referencing merchant names with departments, flagging unusual charges.

In Q4 2024, they deployed an AI agent system that handles 83% of this work autonomously. Here's how they did it, what went wrong, and what they learned.

The Problem

Manual expense categorization bottleneck:

  • 45,000 transactions/month across 800 companies
  • Average 2 minutes per transaction for complex cases (international charges, new vendors, ambiguous merchants)
  • Finance team: 3 people spending 50% time on categorization
  • Monthly close delayed 2-3 days waiting for categorization completion
  • Errors: 8-12% miscategorization rate (audit findings)

Cost of status quo:

  • 60 hours/month × £45/hour = £2,700/month labor cost
  • Delayed close = delayed financial reporting
  • Miscategorization = wrong budgets, tax issues

Solution Architecture

Ramp built a three-agent parallel execution system:

Agent 1: Expense Categorizer

  • Task: Assign accounting category (software, ads, travel, meals, office, contractor, other)
  • Input: Merchant name, amount, description, date
  • Model: Fine-tuned GPT-4 on 50,000 labeled historical transactions
  • Output: Category + confidence score
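
A minimal sketch of what a single categorizer call could look like, assuming the OpenAI Python SDK and a fine-tuned model asked to return its category and a self-reported confidence as JSON (the model id, prompt wording, and JSON schema are illustrative, not Ramp's actual implementation):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = ["software", "ads", "travel", "meals", "office", "contractor", "other"]

SYSTEM_PROMPT = (
    "You categorize corporate card expenses. Reply only with JSON of the form "
    '{"category": "<one of: ' + ", ".join(CATEGORIES) + '>", "confidence": <number 0-1>}.'
)

def categorize_expense(merchant: str, amount: float, description: str, date: str) -> dict:
    """Return {"category": ..., "confidence": ...} for a single transaction.

    "ft:gpt-4:acme:categorizer" is a placeholder model id, and asking the model to
    self-report a confidence score is one possible design, not necessarily Ramp's.
    """
    response = client.chat.completions.create(
        model="ft:gpt-4:acme:categorizer",  # placeholder fine-tuned model id
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Merchant: {merchant}\nAmount: £{amount:.2f}\n"
                f"Description: {description}\nDate: {date}"
            )},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Agents 2 and 3 follow the same call pattern with different inputs and prompts.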

Agent 2: Department Assigner

  • Task: Attribute expense to department (engineering, sales, marketing, ops)
  • Input: Transaction + employee data (title, department, manager)
  • Model: GPT-4 Turbo with few-shot examples
  • Output: Department + reasoning

Agent 3: Anomaly Detector

  • Task: Flag unusual patterns (duplicates, amount >2x median, new vendors, international charges)
  • Input: Transaction + 12-month spending history
  • Model: GPT-4 Turbo + rule-based checks
  • Output: Anomaly flags with explanation

Orchestrator:

  • Runs 3 agents in parallel (reduces latency from 6s to 2s)
  • Aggregates results
  • If any agent confidence <85% OR anomaly flagged → escalate to human
  • Otherwise: Auto-categorize and update QuickBooks
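
A minimal sketch of that orchestration logic in Python, with the three agents stubbed out (the parallel structure and the 85% threshold follow the description above; everything else is illustrative):

```python
import asyncio
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # below this, the orchestrator escalates to a human

@dataclass
class AgentResult:
    value: str          # category, department, or anomaly explanation
    confidence: float   # 0.0-1.0
    flagged: bool = False

# Real implementations would call the fine-tuned categorizer, GPT-4 Turbo with
# few-shot examples, and the LLM + rule-based anomaly checks; these stubs just
# illustrate the interface.
async def categorize(txn: dict) -> AgentResult:
    return AgentResult(value="software", confidence=0.93)

async def assign_department(txn: dict) -> AgentResult:
    return AgentResult(value="engineering", confidence=0.97)

async def detect_anomalies(txn: dict) -> AgentResult:
    return AgentResult(value="no anomalies", confidence=0.99, flagged=False)

async def process_transaction(txn: dict) -> dict:
    # Run all three agents concurrently: they don't depend on each other's output,
    # so latency is the slowest single call rather than the sum of all three.
    category, department, anomaly = await asyncio.gather(
        categorize(txn), assign_department(txn), detect_anomalies(txn)
    )

    if (category.confidence < CONFIDENCE_THRESHOLD
            or department.confidence < CONFIDENCE_THRESHOLD
            or anomaly.flagged):
        return {"status": "escalated", "transaction": txn["id"],
                "reason": anomaly.value if anomaly.flagged else "low confidence"}

    # High confidence and no anomalies: write straight through to the accounting system.
    return {"status": "auto", "transaction": txn["id"],
            "category": category.value, "department": department.value}

if __name__ == "__main__":
    txn = {"id": "txn_123", "merchant": "Figma", "amount": 45.00}
    print(asyncio.run(process_transaction(txn)))
```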

Implementation Timeline

Week 1-4: Data Preparation & Design

  • Exported 2 years of transaction history (120,000 transactions)
  • Manually labeled 10,000 for validation set
  • Designed agent architecture (initial plan: sequential, changed to parallel for speed)
  • Selected fine-tuning vs RAG (chose fine-tuning for stable category taxonomy)
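
Ahead of the fine-tuning run in weeks 5-10, each labeled historical transaction has to be converted into a training example. A sketch of that conversion, assuming the OpenAI chat fine-tuning JSONL format (the prompt wording and field names are illustrative):

```python
import json

CATEGORIES = ["software", "ads", "travel", "meals", "office", "contractor", "other"]

def to_training_example(txn: dict, label: str) -> dict:
    """Convert one labeled historical transaction into a chat-format training example."""
    prompt = (
        f"Merchant: {txn['merchant']}\n"
        f"Amount: £{txn['amount']:.2f}\n"
        f"Description: {txn['description']}\n"
        f"Date: {txn['date']}\n"
        f"Assign one category from: {', '.join(CATEGORIES)}."
    )
    return {
        "messages": [
            {"role": "system", "content": "You categorize corporate card expenses."},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": label},
        ]
    }

def write_jsonl(labeled_txns: list[tuple[dict, str]], path: str) -> None:
    # One JSON object per line - the layout expected by chat fine-tuning jobs.
    with open(path, "w") as f:
        for txn, label in labeled_txns:
            f.write(json.dumps(to_training_example(txn, label)) + "\n")
```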

Week 5-10: Build & Training

  • Fine-tuned GPT-4 on categorization (50K examples, £1,800 training cost)
  • Built orchestrator logic with parallel execution
  • Integrated with Ramp API and QuickBooks API
  • Implemented human approval queue for low-confidence cases

Week 11-12: Testing & Iteration

  • Shadow mode: Agent categorized but didn't write to QuickBooks (finance team reviewed)
  • Measured accuracy: 84% initially
  • Iterated on prompts and added edge case handling: 96% accuracy
  • Load testing: 1,000 transactions/hour with <2s latency
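
Shadow-mode accuracy was measured by comparing the agent's suggestion with what the finance team actually chose. A minimal sketch of that comparison (record field names are assumptions):

```python
def shadow_mode_accuracy(records: list[dict]) -> dict:
    """Compare agent suggestions against the categories the finance team actually chose.

    Each record is expected to hold both values, e.g.
    {"agent_category": "software", "human_category": "software"}.
    """
    total = len(records)
    matches = sum(1 for r in records if r["agent_category"] == r["human_category"])

    # Track which categories the agent confuses most often, to guide prompt iteration.
    misses: dict[str, int] = {}
    for r in records:
        if r["agent_category"] != r["human_category"]:
            key = f"{r['human_category']} -> {r['agent_category']}"
            misses[key] = misses.get(key, 0) + 1

    return {
        "accuracy": matches / total if total else 0.0,
        "top_confusions": sorted(misses.items(), key=lambda kv: -kv[1])[:5],
    }
```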

Week 13: Production Rollout

  • Deployed to 10 pilot customers (3,000 transactions)
  • Monitored for errors, accuracy held at 96%
  • Full rollout to all customers

Total: 12 weeks, £45,000 engineering cost + £2,500 training

Results (After 6 Months)

Automation Rate:

  • 83% of transactions auto-categorized (37,350/month)
  • 17% escalated to humans (7,650/month) - complex cases, low confidence, or anomalies

Accuracy:

  • 96% categorization accuracy (vs 92% human baseline from audits)
  • 99.2% department assignment accuracy

Time Savings:

  • Finance team: 15 hrs/week → 3 hrs/week (reviewing escalations only)
  • 12 hours/week saved = 624 hours/year
  • Annual value: £28,080 (at £45/hour)

Additional Value:

  • £127K wasteful spend flagged: Unused SaaS seats, duplicate tools, forgotten subscriptions
  • Monthly close time: 4-6 hours → 10 minutes (automated report generation)
  • Error rate: 8% → 4% (fewer miscategorizations)

ROI:

  • Build cost: £47,500
  • Annual ongoing cost: £14,400 (API costs, maintenance)
  • Annual value: £28,080 (time saved) + £127K (waste eliminated) = £155,080
  • Payback: 3.7 months

Technical Insights

Why fine-tuning over RAG:

Initially considered RAG (retrieve similar past transactions, include in prompt). Chose fine-tuning because:

  • Category taxonomy stable (8 categories, rarely change)
  • 50K labeled examples available (strong training signal)
  • Wanted low latency (<2s) - RAG adds vector search overhead
  • Fine-tuned model achieved 96% vs 89% for RAG

Parallel vs sequential execution:

Originally designed sequential (categorize → assign department → detect anomalies). Changed to parallel because:

  • Agents don't depend on each other's outputs
  • Parallel reduced latency: 6s → 2s
  • Slight implementation complexity but worth the speed gain

Human-in-the-loop design:

  • Tier 1 (autonomous): Confidence ≥85%, no anomalies (83% of transactions)
  • Tier 2 (notify): Confidence 70-85% (12% of transactions) - auto-categorize but notify finance team
  • Tier 3 (approve): Confidence <70% OR anomaly flagged (5% of transactions) - requires human review before categorizing

This tiered approach built trust - the finance team saw that the agent wasn't blindly categorizing everything.
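
A sketch of how that tiered routing policy might be expressed in code (the thresholds mirror the tiers above; the function name is illustrative):

```python
def route(confidence: float, anomaly_flagged: bool) -> str:
    """Map an agent result onto the three-tier oversight policy described above."""
    if anomaly_flagged or confidence < 0.70:
        return "approve"    # Tier 3: human must review before anything is written
    if confidence < 0.85:
        return "notify"     # Tier 2: auto-categorize, but notify the finance team
    return "autonomous"     # Tier 1: write straight through, no notification

# Example: a 0.78-confidence categorization with no anomalies lands in Tier 2.
assert route(0.78, anomaly_flagged=False) == "notify"
```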

Challenges & Solutions

Challenge 1: International merchant names

Problem: The agent struggled with non-English merchant names (e.g., "株式会社ABC" instead of "ABC Corporation")
Solution: Added a translation step - detect the language, translate to English, then categorize (sketched below)
Result: Accuracy on international transactions improved from 68% to 91%
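
A sketch of that pre-processing step, with `translate_to_english` standing in for whatever translation service or LLM prompt is used (the non-ASCII check is a simplification; a production system would use proper language detection):

```python
def needs_translation(merchant_name: str) -> bool:
    # Simple heuristic: anything containing non-ASCII characters is routed through
    # translation. Real language detection would be more robust.
    return not merchant_name.isascii()

def normalise_merchant(merchant_name: str, translate_to_english) -> str:
    """Translate non-English merchant names before they reach the categorizer.

    `translate_to_english` is a placeholder callable: it takes a string and
    returns an English rendering.
    """
    if needs_translation(merchant_name):
        return translate_to_english(merchant_name)
    return merchant_name

# "株式会社ABC" would be translated before categorization; "Figma" passes straight through.
```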

Challenge 2: Ambiguous merchants

Problem: "Amazon" could be AWS (software), Amazon Business (office supplies), or Amazon Marketplace (various) Solution: Added category hints to prompt based on amount patterns:

  • <£50 typically office supplies
  • £50-500 could be office or software
  • £500 likely AWS

Also checked employee department (engineers → likely AWS, ops → likely supplies) Result: Amazon categorization accuracy: 73% → 94%
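
A sketch of how those hints might be assembled before being appended to the categorizer prompt (thresholds mirror the ranges above; function and field names are illustrative):

```python
def amazon_hints(amount: float, department: str) -> list[str]:
    """Build category hints for ambiguous Amazon charges from amount patterns
    and the purchasing employee's department."""
    hints = []
    if amount < 50:
        hints.append("Amounts under £50 are typically office supplies.")
    elif amount <= 500:
        hints.append("Amounts of £50-500 could be office supplies or software.")
    else:
        hints.append("Amounts over £500 are likely AWS (software).")

    if department == "engineering":
        hints.append("Purchaser is in engineering, so AWS is more likely.")
    elif department == "ops":
        hints.append("Purchaser is in ops, so office supplies are more likely.")
    return hints

# The hints are appended to the categorizer prompt for Amazon transactions, e.g.:
#   prompt += "\nHints: " + " ".join(amazon_hints(txn["amount"], employee["department"]))
```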

Challenge 3: New vendor false positives

Problem: The anomaly detector flagged every new vendor as suspicious
Solution: Changed the logic to flag only if the vendor is new AND the amount is >£500 (sketched below)
Result: False positive rate: 42% → 8%
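
A minimal sketch of the revised rule (names and the threshold constant are illustrative):

```python
NEW_VENDOR_AMOUNT_THRESHOLD = 500.0  # £; below this, a new vendor on its own is not flagged

def flag_new_vendor(merchant: str, amount: float, known_vendors: set[str]) -> bool:
    """Flag only if the vendor is new AND the amount exceeds the threshold."""
    return merchant not in known_vendors and amount > NEW_VENDOR_AMOUNT_THRESHOLD

# A first-time £30 coffee shop charge is ignored; a first-time £2,000 vendor is flagged.
```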

Challenge 4: Finance team resistance

Problem: Team initially skeptical - "AI will make mistakes, I'll have to fix them anyway"
Solution:

  • Ran shadow mode for 2 weeks - showed 96% accuracy matched human performance
  • Positioned as "handles boring stuff, you focus on complex cases"
  • No headcount reduction (redeployed to financial analysis)

Result: Full buy-in after shadow mode demonstration

Key Lessons Learned

1. Historical data is gold

Access to 2+ years of labeled transactions enabled fine-tuning. Companies without historical data should start with RAG or zero-shot and gradually build a labeled dataset.

2. Start with high-confidence only

Week 1 of production: Only auto-categorized transactions with ≥95% confidence (40% of volume). Gradually lowered threshold to 85% as team gained trust.

3. Anomaly detection requires domain rules

Pure LLM anomaly detection had 38% false positive rate. Hybrid approach (LLM + rule-based checks) reduced to 8%.

4. Parallel execution worth the complexity

3x speedup (6s → 2s) made user experience dramatically better. Implementation took extra 2 weeks but paid off.

5. Monthly accuracy reviews essential

Ramp reviews 100 random transactions monthly to ensure accuracy hasn't degraded. Found minor drift after 3 months (96% → 93%), retrained model, back to 96%.
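
A sketch of what that monthly spot check might look like, with `reviewer_agrees` standing in for the human confirmation step (the retrain threshold is an assumption for illustration):

```python
import random

DRIFT_THRESHOLD = 0.95  # retrain if spot-check accuracy falls below this (illustrative value)

def monthly_spot_check(auto_categorized: list[dict], reviewer_agrees, sample_size: int = 100) -> dict:
    """Sample auto-categorized transactions and have a human confirm the agent's category.

    `reviewer_agrees` is a placeholder callback: given a transaction, it returns True
    if the reviewer agrees with the category the agent assigned.
    """
    sample = random.sample(auto_categorized, min(sample_size, len(auto_categorized)))
    if not sample:
        return {"accuracy": None, "retrain": False}
    accuracy = sum(1 for txn in sample if reviewer_agrees(txn)) / len(sample)
    return {"accuracy": accuracy, "retrain": accuracy < DRIFT_THRESHOLD}
```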

Replication Guide

To implement a similar system:

Requirements:

  • 10,000+ historical labeled transactions (for fine-tuning) OR start with RAG
  • API access to expense system (Ramp, Brex, Expensify, etc.)
  • API access to accounting system (QuickBooks, Xero, NetSuite)

Timeline:

  • With fine-tuning: 10-14 weeks
  • With RAG: 6-8 weeks

Team:

  • 1-2 engineers (full-time for 8-12 weeks)
  • 1 finance lead (25% time for requirements and validation)

Cost:

  • Engineering: £40K-60K
  • Training (if fine-tuning): £1,500-3,000
  • Ongoing API costs: £800-1,500/month (for 50K transactions)

Expected Results:

  • 75-85% automation rate
  • 90-96% accuracy (with iteration)
  • 10-15 hours/week saved
  • 3-6 month payback period

Conclusion

Ramp's expense automation agent demonstrates that AI agents can reliably handle high-volume, judgment-based workflows when implemented thoughtfully.

Key success factors:

  • Sufficient training data (50K labeled examples)
  • Human-in-the-loop for edge cases
  • Parallel execution for performance
  • Continuous monitoring and retraining

If you're considering similar automation: Start with shadow mode, measure accuracy rigorously, and expand autonomy gradually as trust builds.

The technology works. The challenge is implementation discipline.