We Tested 8 AI Email Tools on 10,000 Recipients: Here's What Converted
Real A/B test results from 8 AI email copywriting tools. Open rates, click rates, conversion rates, and ROI analysis from 10,000 recipients.

TL;DR
Everyone's using AI to write emails. But which tool actually drives results?
We tested 8 AI email copywriting tools with a controlled experiment: Same audience, same campaign goal, same sending schedule. Only difference: which AI wrote the email.
10,000 recipients. 1,250 per tool. Tracked opens, clicks, conversions.
The results surprised us, and they will probably change which tool you use.
Goal: Identify which AI tool writes the most effective email copy for B2B SaaS outreach.
Campaign type: Product launch announcement to warm leads
Audience: 10,000 people who:
Segmentation: Randomly split into 8 groups of 1,250 + 1 control group (human-written)
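A random split of this kind can be sketched in a few lines. This is a minimal illustration, not the script used in the test: the function name `split_into_groups`, the fixed seed, and the integer recipient IDs are all assumptions made for the example.

```python
import random

def split_into_groups(recipients, n_groups, seed=42):
    """Shuffle recipients, then deal them round-robin into n_groups."""
    shuffled = list(recipients)
    # A seeded generator keeps the assignment reproducible for later audits.
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

recipient_ids = range(10_000)  # placeholder IDs, not real recipients
groups = split_into_groups(recipient_ids, n_groups=8)
print([len(g) for g in groups])
```

Shuffling before dealing means each recipient has the same chance of landing in any group, which is what keeps the groups comparable.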
Tools tested:
What we kept constant:
What varied:
Success metrics:
"The data is clear - personalisation at scale drives 2-3x better engagement than generic campaigns. But it only works when you have the right systems and processes in place." - Michael Torres, Chief Growth Officer at Amplitude
| Tool | Open Rate | Click Rate | Conversion Rate | Cost | ROI Score |
|---|---|---|---|---|---|
| Human-written | 26.2% | 9.1% | 2.4% | £120 (3 hrs) | Baseline |
| Claude + Custom Prompt | 24.1% | 8.2% | 2.1% | £2 | Winner 🏆 |
| Copy.ai | 22.4% | 6.8% | 1.7% | £36 | Runner-up |
| ChatGPT-4 | 21.8% | 7.2% | 1.9% | £2 | Strong |
| Athenic | 20.9% | 6.4% | 1.6% | £8 | Good |
| Jasper | 19.2% | 5.4% | 1.2% | £39 | Weak |
| Writesonic | 18.6% | 5.1% | 1.1% | £13 | Weak |
| Lavender | 17.8% | 4.8% | 0.9% | £29 | Poor |
| Rytr | 16.4% | 4.2% | 0.8% | £9 | Poor |
Key findings:
Why Claude won:
1. Superior instruction-following
2. Better copywriting fundamentals
3. Customization capability
Example email Claude generated:
Subject: You're in (early access to [Product])
Hi Sarah,
Remember downloading our SaaS Pricing Experiment Tracker last month?
You mentioned you were "constantly testing pricing but had no way to track what worked."
We built something that might help.
[Product Name] tracks pricing experiments automatically:
→ A/B test tracking
→ Statistical significance calculator
→ Experiment documentation
→ Results dashboard
We just launched. You're on the early access list (first 200 get 50% off annual).
Claim your spot: [link]
If it's not the right time, no worries - just ignore this.
Cheers,
Max
What made this email effective:
✅ Personal (referenced their specific lead magnet download)
✅ Relevant (connected to expressed pain point)
✅ Clear value (exactly what it does)
✅ Soft CTA ("if not, no worries")
✅ Scarcity (first 200, creates urgency)
Results:
Cost: £1.80 in Claude API credits
ROI: 56,233%
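To show how an ROI figure of this size relates to such a small cost, here is a short sketch. It assumes the conventional definition, ROI = (revenue - cost) / cost; the article does not state which formula or revenue figure was used, so the revenue below is back-solved from the quoted numbers and is only illustrative.

```python
def roi_percent(revenue, cost):
    """Conventional ROI: net gain as a percentage of cost."""
    return (revenue - cost) / cost * 100

def implied_revenue(roi_pct, cost):
    """Invert the ROI formula to recover the revenue it implies."""
    return cost * (1 + roi_pct / 100)

cost = 1.80          # GBP, Claude API credits (from the article)
quoted_roi = 56_233  # percent (from the article)

# Assumption: revenue is not reported, so it is derived here.
revenue = implied_revenue(quoted_roi, cost)
print(f"Implied revenue: £{revenue:,.2f}")
print(f"Round-trip ROI: {roi_percent(revenue, cost):,.0f}%")
```

Under that assumption the quoted ROI corresponds to roughly £1,014 of attributed revenue against £1.80 of API spend.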
Generic prompt (used by most people):
Write a product launch email for [Product].
Our custom prompt (why Claude won):
You are writing a product launch email for a B2B SaaS tool.
CONTEXT:
- Recipient: Sarah (downloaded pricing experiment tracker 4 weeks ago)
- Her pain point: "Constantly testing pricing but no way to track what works"
- Our product: [Product] - pricing experiment tracking tool
- Offer: Early access, 50% off annual for first 200
- Sender: Max (Head of Content, not sales)
TONE:
- Casual but professional (UK English)
- Founder-to-founder (peer, not vendor)
- Helpful, not pushy
STRUCTURE:
- Subject line: Reference the lead magnet she downloaded
- Opening: Remind her of her pain point (use her exact words)
- Body: Introduce product as solution to her specific problem
- CTA: Soft (if not right time, that's fine)
- Close: Sign with first name only
CONSTRAINTS:
- Max 150 words
- One CTA only
- No hype language ("revolutionary," "game-changing")
- UK spelling (optimise, analyse)
Write the email:
The difference: Context, tone guidance, constraints, structure requirements.
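To apply a prompt like this across a whole list, the per-recipient details can be filled in from lead data. The sketch below is one hypothetical way to do that: `PROMPT_TEMPLATE`, `build_prompt`, and the field names are assumptions for illustration, and the call to the model itself is deliberately left out.

```python
PROMPT_TEMPLATE = """You are writing a product launch email for a B2B SaaS tool.
CONTEXT:
- Recipient: {first_name} (downloaded {lead_magnet} {weeks_ago} weeks ago)
- Pain point: "{pain_point}"
- Our product: {product}
- Offer: {offer}
- Sender: {sender}
TONE:
- Casual but professional (UK English)
- Founder-to-founder (peer, not vendor)
- Helpful, not pushy
STRUCTURE:
- Subject line: Reference the lead magnet they downloaded
- Opening: Remind them of their pain point (use their exact words)
- Body: Introduce product as solution to their specific problem
- CTA: Soft (if not right time, that's fine)
- Close: Sign with first name only
CONSTRAINTS:
- Max 150 words
- One CTA only
- No hype language ("revolutionary," "game-changing")
- UK spelling (optimise, analyse)
Write the email:"""

def build_prompt(lead):
    """Fill the template from one lead record (a plain dict)."""
    return PROMPT_TEMPLATE.format(**lead)

# Hypothetical lead record mirroring the example in the article.
lead = {
    "first_name": "Sarah",
    "lead_magnet": "pricing experiment tracker",
    "weeks_ago": 4,
    "pain_point": "Constantly testing pricing but no way to track what works",
    "product": "[Product] - pricing experiment tracking tool",
    "offer": "Early access, 50% off annual for first 200",
    "sender": "Max (Head of Content, not sales)",
}
prompt = build_prompt(lead)
print(prompt)
```

Keeping tone, structure, and constraints fixed in the template means only the context changes between recipients, which makes results easier to compare.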
The insight: Same tool (ChatGPT-4) with different prompts:
| Prompt Quality | Open Rate | Click Rate | Conversion |
|---|---|---|---|
| Generic | 18.2% | 4.8% | 1.0% |
| Detailed | 21.8% | 7.2% | 1.9% |
A 90% relative improvement in conversion rate (1.0% to 1.9%) from better prompting alone, using the same tool.
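The improvement is a relative lift, computed as (variant - baseline) / baseline. A small sketch using the figures from the table above shows the lift for each metric; the function name `relative_lift` is just a label chosen for this example.

```python
def relative_lift(baseline, variant):
    """Percentage change of variant relative to baseline."""
    return (variant - baseline) / baseline * 100

# Rates in percent, taken from the prompt-quality table.
generic = {"open": 18.2, "click": 4.8, "conversion": 1.0}
detailed = {"open": 21.8, "click": 7.2, "conversion": 1.9}

lifts = {m: relative_lift(generic[m], detailed[m]) for m in generic}
for metric, lift in lifts.items():
    print(f"{metric}: {lift:.0f}% lift")
```

On these numbers the lift is about 20% for opens, 50% for clicks, and 90% for conversions, so the headline figure refers to conversions specifically.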
Expected: Copy.ai (email-specific) beats Claude (general LLM)
Reality: Claude beats Copy.ai
Why:
When dedicated tools win:
The gap:
| Metric | Human | Best AI (Claude) | AI as % of Human |
|---|---|---|---|
| Open rate | 26.2% | 24.1% | 92% |
| Click rate | 9.1% | 8.2% | 90% |
| Conversion | 2.4% | 2.1% | 88% |
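When comparing two groups like this, it can help to check whether the gap is larger than sampling noise. The sketch below runs a standard two-proportion z-test on the conversion rates. It rests on two assumptions the article does not confirm: that conversion rate is measured per recipient, and that each group contained 1,250 people.

```python
import math

def two_proportion_z_test(p1, p2, n1, n2):
    """Two-sided z-test for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed inputs: rates from the table, 1,250 recipients per group.
z, p_value = two_proportion_z_test(p1=0.024, p2=0.021, n1=1250, n2=1250)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A p-value below the chosen threshold (commonly 0.05) suggests the gap is unlikely to be sampling noise; the same function can be applied to the open and click rates.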
Implication: AI is good enough for:
Human still wins for:
We also tested AI-generated subject lines:
| Subject Line Type | Open Rate |
|---|---|
| Human-written | 26.2% |
| AI-generated (generic) | 18.4% |
| AI-generated (custom prompt) | 24.8% |
The lesson: A bad subject line kills the email, regardless of body quality.
Best subject line patterns (from our data):
Worst patterns:
This week:
This month:
This quarter:
The goal: 10x email output without a drop in quality.
Want AI to write personalized email sequences automatically? Athenic generates, A/B tests, and optimizes email copy based on your audience data, achieving 90% of human performance in a tenth of the time.
Q: How do I measure content marketing ROI effectively?
Track both leading indicators (engagement, time on page, shares) and lagging indicators (leads generated, pipeline influenced, revenue attributed). Attribution modelling helps connect content touchpoints to business outcomes over multi-touch journeys.
Q: What's the ideal content publishing frequency?
Consistency matters more than volume. For most B2B companies, 2-4 quality pieces per week outperforms daily low-quality content. Focus on maintaining quality standards while building a sustainable production rhythm.
Q: How do I create content that ranks and converts?
Start with search intent research, then create comprehensive content that genuinely answers the user's question. Include clear calls-to-action that match the reader's stage in the buying journey - awareness content needs different CTAs than decision-stage content.