Case Study: How Ramp Automated 83% of Expense Categorization
Deep dive into Ramp's AI agent implementation: architecture, challenges, results, and lessons learned from automating expense categorization at scale.


TL;DR
Ramp processes millions of corporate card transactions monthly. Before agent automation, their finance team spent 15-20 hours weekly manually categorizing expenses: clicking dropdown menus, cross-referencing merchant names with departments, and flagging unusual charges.
In Q4 2024, they deployed an AI agent system that handles 83% of this work autonomously. Here's how they did it, what went wrong, and what they learned.
Manual expense categorization bottleneck:
Cost of status quo:
"Agent orchestration is where the real value lives. Individual AI capabilities matter less than how well you coordinate them into coherent workflows." - James Park, Founder of AI Infrastructure Labs
Ramp built a three-agent parallel execution system:
Agent 1: Expense Categorizer
Agent 2: Department Assigner
Agent 3: Anomaly Detector
Orchestrator:
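The three-agent layout above can be sketched as a simple parallel orchestrator. This is an illustrative reconstruction, not Ramp's actual code: the function names, transaction fields, and placeholder logic are assumptions, with each agent standing in for what would really be an LLM or lookup call.

```python
import asyncio

async def categorize_expense(txn: dict) -> dict:
    # Placeholder for an LLM call mapping merchant name -> expense category.
    return {"category": "Software" if "AWS" in txn["merchant"] else "Other"}

async def assign_department(txn: dict) -> dict:
    # Placeholder for a department lookup based on employee metadata.
    return {"department": txn.get("employee_dept", "Unassigned")}

async def detect_anomaly(txn: dict) -> dict:
    # Placeholder rule: large charges from new vendors get flagged.
    return {"anomaly": txn.get("new_vendor", False) and txn["amount"] > 500}

async def orchestrate(txn: dict) -> dict:
    # Run all three agents concurrently; total latency is the slowest
    # agent, not the sum of all three (the source of the 6s -> 2s speedup).
    parts = await asyncio.gather(
        categorize_expense(txn), assign_department(txn), detect_anomaly(txn)
    )
    merged = dict(txn)
    for part in parts:
        merged.update(part)
    return merged

txn = {"merchant": "AWS", "amount": 1200, "employee_dept": "Engineering", "new_vendor": False}
result = asyncio.run(orchestrate(txn))
```

The key design point is that the orchestrator merges independent outputs rather than threading one agent's result into the next, which is what makes parallel execution possible.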
Week 1-4: Data Preparation & Design
Week 5-10: Build & Training
Week 11-12: Testing & Iteration
Week 13: Production Rollout
Total: 12 weeks, £45,000 engineering cost + £2,500 training
Automation Rate:
Accuracy:
Time Savings:
Additional Value:
ROI:
Why fine-tuning over RAG:
Initially considered RAG (retrieve similar past transactions, include in prompt). Chose fine-tuning because:
Parallel vs sequential execution:
Originally designed sequential (categorize → assign department → detect anomalies). Changed to parallel because:
Human-in-the-loop design:
Tier 1 (autonomous): Confidence ≥85%, no anomalies (83% of transactions)
Tier 2 (notify): Confidence 70-85% (12% of transactions) - auto-categorize but notify finance team
Tier 3 (approve): Confidence <70% OR anomaly flagged (5% of transactions) - requires human review before categorizing
This tiered approach built trust: the finance team saw the agent wasn't blindly categorizing everything.
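The tier routing reduces to a small decision function. The thresholds (85% and 70%) come from the case study; the function name and signature are illustrative, not Ramp's actual implementation.

```python
def route_transaction(confidence: float, anomaly_flagged: bool) -> str:
    """Map a categorization result to one of the three review tiers.

    Anomalies and low confidence always escalate to human review;
    mid confidence auto-categorizes but notifies the finance team.
    """
    if anomaly_flagged or confidence < 0.70:
        return "tier3_approve"     # human review before categorizing
    if confidence < 0.85:
        return "tier2_notify"      # auto-categorize, notify finance team
    return "tier1_autonomous"      # fully autonomous
```

Note that the anomaly check is evaluated first, so even a 99%-confidence categorization escalates when flagged.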
Challenge 1: International merchant names
Problem: Agent struggled with non-English merchant names (e.g., "株式会社ABC" instead of "ABC Corporation")
Solution: Added a translation step: detect language, translate to English, then categorize
Result: Accuracy on international transactions improved from 68% to 91%
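The translation pre-step can be sketched as a small pipeline. Everything here is an assumption for illustration: `translate` and `categorize` stand in for real translation and LLM API calls, and the ASCII check is a deliberately crude proxy for language detection.

```python
def categorize_international(merchant: str, translate, categorize) -> str:
    """Translate non-English merchant names before categorizing.

    `translate` and `categorize` are injected callables standing in
    for external services (hypothetical, not Ramp's real interfaces).
    """
    if not merchant.isascii():          # crude non-English detection
        merchant = translate(merchant)  # e.g. "株式会社ABC" -> "ABC Corporation"
    return categorize(merchant)

# Demo with stand-in functions:
translate = lambda s: "ABC Corporation"
categorize = lambda s: "Professional Services"
label = categorize_international("株式会社ABC", translate, categorize)
```

A production version would use a real language-detection library rather than an ASCII check, since many non-English names are ASCII-representable.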
Challenge 2: Ambiguous merchants
Problem: "Amazon" could be AWS (software), Amazon Business (office supplies), or Amazon Marketplace (various) Solution: Added category hints to prompt based on amount patterns:
£500 likely AWS
Also checked employee department (engineers → likely AWS, ops → likely supplies) Result: Amazon categorization accuracy: 73% → 94%
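A hint builder like the following could assemble that prompt context. The £500 threshold is from the case study; the department mapping and return format are illustrative assumptions.

```python
def amazon_category_hint(amount: float, department: str) -> str:
    """Build a hint string to append to the categorization prompt
    when the merchant is ambiguous (e.g. "Amazon")."""
    hints = []
    if amount > 500:
        hints.append("large charge: likely AWS (software)")
    if department == "Engineering":
        hints.append("engineer spend: likely AWS")
    elif department == "Operations":
        hints.append("ops spend: likely office supplies")
    return "; ".join(hints) or "no hint"
```

The hint is advisory context for the model rather than a hard override, so the categorizer can still pick a different category when other signals disagree.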
Challenge 3: New vendor false positives
Problem: Anomaly detector flagged every new vendor as suspicious
Solution: Changed the logic: flag only if new vendor AND amount >£500
Result: False positive rate: 42% → 8%
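The revised rule is a one-line predicate (function name is illustrative):

```python
def flag_new_vendor(is_new_vendor: bool, amount: float) -> bool:
    # A new vendor alone is not suspicious; flag only when the vendor
    # is new AND the charge exceeds £500 (the rule from the case study).
    return is_new_vendor and amount > 500
```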
Challenge 4: Finance team resistance
Problem: Team initially skeptical: "AI will make mistakes, I'll have to fix them anyway"
Solution:
Result: Full buy-in after shadow mode demonstration
1. Historical data is gold
Access to 2+ years of labeled transactions enabled fine-tuning. Companies without historical data should start with RAG or zero-shot and gradually build a labeled dataset.
2. Start with high-confidence only
Week 1 of production: Only auto-categorized transactions with ≥95% confidence (40% of volume). Gradually lowered threshold to 85% as team gained trust.
3. Anomaly detection requires domain rules
Pure LLM anomaly detection had 38% false positive rate. Hybrid approach (LLM + rule-based checks) reduced to 8%.
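One way to structure that hybrid check is to require a confident LLM signal and a corroborating domain rule before flagging. The 0.8 threshold and the specific rules below are illustrative assumptions, not Ramp's actual logic.

```python
def hybrid_anomaly_check(llm_score: float, txn: dict) -> bool:
    """Flag a transaction only when the LLM's anomaly score and at
    least one rule-based check agree; either signal alone over-flags."""
    rule_hit = (
        (txn.get("new_vendor", False) and txn["amount"] > 500)
        or txn.get("duplicate_charge", False)
    )
    # AND-ing the signals trades recall for a much lower false positive
    # rate, matching the 38% -> 8% improvement described above.
    return llm_score >= 0.8 and rule_hit
```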
4. Parallel execution worth the complexity
The 3x speedup (6s → 2s) made the user experience dramatically better. Implementation took an extra 2 weeks but paid off.
5. Monthly accuracy reviews essential
Ramp reviews 100 random transactions monthly to ensure accuracy hasn't degraded. Found minor drift after 3 months (96% → 93%), retrained model, back to 96%.
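The monthly sampling step is straightforward to make reproducible. This is a generic sketch of drawing the 100-transaction review set; the seeding and function name are illustrative.

```python
import random

def monthly_accuracy_sample(transactions, k=100, seed=None):
    """Draw the monthly review sample: k random transactions whose
    agent-assigned categories get re-checked by hand. A fixed seed
    makes the sample reproducible for audit purposes."""
    rng = random.Random(seed)
    return rng.sample(transactions, min(k, len(transactions)))
```

Comparing the hand-checked labels against the agent's labels on this sample gives the monthly accuracy figure that surfaced the 96% → 93% drift.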
To implement similar system:
Requirements:
Timeline:
Team:
Cost:
Expected Results:
Ramp's expense automation agent demonstrates that AI agents can reliably handle high-volume, judgment-based workflows when implemented thoughtfully.
Key success factors:
If you're considering similar automation: Start with shadow mode, measure accuracy rigorously, and expand autonomy gradually as trust builds.
The technology works. The challenge is implementation discipline.
Q: What's the typical ROI timeline for AI agent implementations?
Most organizations see positive ROI within 3-6 months of deployment. Initial productivity gains of 20-40% are common, with improvements compounding as teams optimize prompts and workflows based on production experience.
Q: How do AI agents handle errors and edge cases?
Well-designed agent systems include fallback mechanisms, human-in-the-loop escalation, and retry logic. The key is defining clear boundaries for autonomous action versus requiring human approval for sensitive or unusual situations.
Q: How long does it take to implement an AI agent workflow?
Implementation timelines vary based on complexity, but most teams see initial results within 2-4 weeks for simple workflows. More sophisticated multi-agent systems typically require 6-12 weeks for full deployment with proper testing and governance.