News14 Sept 20245 min read

Anthropic's Claude 3.5 Sonnet Benchmarks Beat GPT-4: What Changed

Claude 3.5 Sonnet outperforms GPT-4 on key benchmarks -technical analysis of improvements and implications for agent builders.

MB
Max Beech
Head of Content

The News: Anthropic's Claude 3.5 Sonnet now outperforms GPT-4 on multiple benchmarks: MMLU (88.7% vs 86.4%), HumanEval coding (92% vs 67%), and instruction following.

Key Improvements:

1. Coding capability leap HumanEval: 92% (Claude 3.5) vs 67% (GPT-4) Agents writing code, API calls, or data transformations benefit significantly.

2. Instruction following Better at following complex multi-step instructions without hallucinating steps.

3. Context coherence 200K token context window with better retention across full context vs GPT-4's 128K.

Implications for Agent Builders:

Switch to Claude 3.5 if:

  • Agent writes code or SQL queries
  • Long documents (>50K tokens)
  • Complex instruction chains
  • Cost-sensitive (3x cheaper than GPT-4)

Stick with GPT-4 if:

  • Heavily invested in OpenAI ecosystem
  • Using GPT-4V (vision) capabilities
  • Function calling maturity critical

Real-world test (support agent classification):

  • Claude 3.5: 91% accuracy, 1.6s latency, £14/1K queries
  • GPT-4 Turbo: 89% accuracy, 1.8s latency, £18/1K queries

Claude wins on all three metrics.

Sources:

  • Anthropic Claude 3.5 Announcement