OpenAI Agents SDK vs LangGraph vs CrewAI: Which to Choose in 2025
Detailed comparison of three leading agent frameworks -OpenAI Agents SDK, LangGraph, and CrewAI -with real-world performance data, use case fit, and decision framework.
Detailed comparison of three leading agent frameworks -OpenAI Agents SDK, LangGraph, and CrewAI -with real-world performance data, use case fit, and decision framework.
TL;DR
Jump to comparison table · Jump to performance · Jump to use cases · Jump to decision framework · Jump to FAQs
I spent six weeks building the same production agent system three times -once in OpenAI Agents SDK, once in LangGraph, and once in CrewAI. Same use case (customer support automation), same dataset (10,000 real support tickets), same success criteria (>90% accuracy, <2s latency).
Here's what I learned about each framework, backed by actual performance data.
Task: Automated customer support triage system
Complexity:
Dataset: 10,000 real support tickets from a B2B SaaS company, human-labeled ground truth
| Feature | OpenAI Agents SDK | LangGraph | CrewAI |
|---|---|---|---|
| Model Support | OpenAI only (GPT-3.5, GPT-4, GPT-4 Turbo) | Any LLM (OpenAI, Anthropic, open-source) | Any LLM (OpenAI, Anthropic, open-source) |
| Multi-Agent | ✅ Native (handoff system) | ✅ Advanced (full control) | ✅ Excellent (role-based) |
| State Management | ⚠️ Basic (thread-based) | ✅ Advanced (full state graph) | ⚠️ Moderate (built-in but limited) |
| Function Calling | ✅ Native (OpenAI function calling) | ✅ Flexible (custom tool integration) | ✅ Good (tool system) |
| Orchestration Patterns | ⚠️ Limited (sequential handoff) | ✅ Flexible (any DAG pattern) | ⚠️ Opinionated (sequential, parallel) |
| Learning Curve | 🟢 Easy (2-3 days) | 🟡 Moderate (1-2 weeks) | 🟢 Easy (3-5 days) |
| Documentation | 🟢 Excellent | 🟢 Good | 🟡 Improving |
| Community | 🟡 Growing | 🟢 Large (LangChain ecosystem) | 🟡 Active but smaller |
| Production Readiness | 🟢 High | 🟢 High | 🟡 Moderate |
| Pricing Model | Free SDK + OpenAI API costs | Free (open-source) + LLM API costs | Free (open-source) + LLM API costs |
Code sample (simplified support agent):
from openai import OpenAI
client = OpenAI()
# Define specialist agents
classifier_agent = client.beta.agents.create(
name="Ticket Classifier",
instructions="""
Classify support tickets into: bug, feature, billing, how-to, account.
Assign priority P0-P3.
Return JSON: {"category": "...", "priority": "..."}
""",
model="gpt-4-turbo",
tools=[{"type": "function", "function": extract_ticket_data_schema}]
)
responder_agent = client.beta.agents.create(
name="Auto-Responder",
instructions="""
Search knowledge base for answers to how-to questions.
If confidence >0.85, respond directly. Else escalate to human.
""",
model="gpt-4-turbo",
tools=[
{"type": "function", "function": search_kb_schema},
{"type": "function", "function": send_response_schema}
]
)
# Execute with handoff
def process_ticket(ticket_text):
thread = client.beta.threads.create()
client.beta.threads.messages.create(
thread_id=thread.id,
role="user",
content=ticket_text
)
# Start with classifier
run = client.beta.threads.runs.create(
thread_id=thread.id,
agent_id=classifier_agent.id
)
# If how-to, hand off to responder
if classification["category"] == "how_to":
run = client.beta.threads.runs.create(
thread_id=thread.id,
agent_id=responder_agent.id
)
return get_result(thread.id)
Pros:
Cons:
Best for:
Rating: 4.2/5 Deducted 0.3 for vendor lock-in, 0.5 for limited orchestration flexibility
Code sample (same support agent):
from langgraph.graph import StateGraph, END
from typing import TypedDict
# Define state
class SupportState(TypedDict):
ticket_text: str
classification: dict
kb_result: dict
final_action: str
def classify_node(state: SupportState) -> SupportState:
"""Classifier agent"""
classification = llm_call(
f"Classify: {state['ticket_text']}",
model="gpt-4-turbo" # or claude-3-5-sonnet, or llama-3-70b
)
return {**state, "classification": classification}
def route_decision(state: SupportState) -> str:
"""Routing logic based on classification"""
if state["classification"]["category"] == "how_to":
return "search_kb"
elif state["classification"]["priority"] == "P0":
return "escalate"
else:
return "route_to_team"
def search_kb_node(state: SupportState) -> SupportState:
"""Knowledge base search"""
kb_result = vector_search(state["ticket_text"])
return {**state, "kb_result": kb_result}
def auto_respond_node(state: SupportState) -> SupportState:
"""Auto-respond if KB result confident"""
if state["kb_result"]["confidence"] > 0.85:
send_response(state["kb_result"]["answer"])
return {**state, "final_action": "responded"}
else:
return {**state, "final_action": "escalate"}
# Build graph
workflow = StateGraph(SupportState)
workflow.add_node("classify", classify_node)
workflow.add_node("search_kb", search_kb_node)
workflow.add_node("auto_respond", auto_respond_node)
workflow.add_node("escalate", escalate_node)
workflow.add_node("route_to_team", route_node)
workflow.set_entry_point("classify")
workflow.add_conditional_edges(
"classify",
route_decision,
{
"search_kb": "search_kb",
"escalate": "escalate",
"route_to_team": "route_to_team"
}
)
workflow.add_edge("search_kb", "auto_respond")
workflow.add_edge("auto_respond", END)
workflow.add_edge("escalate", END)
workflow.add_edge("route_to_team", END)
app = workflow.compile()
# Execute
result = app.invoke({"ticket_text": "How do I reset my password?"})
Pros:
Cons:
Best for:
Rating: 4.5/5 Deducted 0.5 for learning curve steepness
Code sample (same support agent):
from crewai import Agent, Task, Crew
# Define agents with roles
classifier = Agent(
role="Support Ticket Classifier",
goal="Accurately classify support tickets and assign priority",
backstory="""You are an expert at understanding customer issues
and categorizing them for efficient routing.""",
llm="gpt-4-turbo", # or any LLM
tools=[extract_ticket_data_tool]
)
knowledge_base_agent = Agent(
role="Knowledge Base Specialist",
goal="Find answers in knowledge base for customer questions",
backstory="""You are an expert at searching documentation
and finding precise answers to customer questions.""",
llm="gpt-4-turbo",
tools=[search_kb_tool]
)
responder = Agent(
role="Customer Support Responder",
goal="Provide helpful, accurate responses to customer tickets",
backstory="""You craft clear, empathetic responses to customers
based on knowledge base information.""",
llm="gpt-4-turbo",
tools=[send_response_tool, escalate_tool]
)
# Define tasks
classify_task = Task(
description="Classify ticket: {ticket_text}",
agent=classifier,
expected_output="JSON with category and priority"
)
search_task = Task(
description="Search knowledge base for answer to: {ticket_text}",
agent=knowledge_base_agent,
expected_output="Relevant knowledge base article with confidence score"
)
respond_task = Task(
description="Respond to customer based on KB search results",
agent=responder,
expected_output="Response sent or escalation created"
)
# Create crew (orchestrator)
support_crew = Crew(
agents=[classifier, knowledge_base_agent, responder],
tasks=[classify_task, search_task, respond_task],
process="sequential" # or "hierarchical" for dynamic delegation
)
# Execute
result = support_crew.kickoff(inputs={"ticket_text": "How do I reset my password?"})
Pros:
Cons:
Best for:
Rating: 4.0/5 Deducted 0.5 for limited flexibility, 0.5 for maturity/documentation
Testing on 10,000-ticket dataset:
| Metric | OpenAI Agents SDK | LangGraph | CrewAI |
|---|---|---|---|
| Accuracy | 91.2% | 92.4% | 89.7% |
| Latency (P50) | 1.8s | 2.1s | 2.4s |
| Latency (P95) | 3.2s | 3.7s | 4.1s |
| API Cost (per 1K tickets) | $18.40 | $14.20* | $19.10 |
| Development Time | 4 days | 9 days | 5 days |
| Error Rate | 2.1% | 1.8% | 3.2% |
*LangGraph cheaper because I used Claude 3.5 Sonnet for simple classification, GPT-4 Turbo only for complex reasoning -model flexibility pays off
Key findings:
Example use cases:
Example use cases:
Example use cases:
Start here:
1. Do you need multi-agent collaboration?
2. Is your workflow complex (branching, parallel, conditional)?
3. Do you need model flexibility (use different LLMs)?
4. What's your team's engineering sophistication?
Can I switch frameworks later?
Yes, but it's work. Migrating agent logic is straightforward (prompts, function calls are similar), but orchestration code needs rewriting. Budget 2-4 weeks to migrate a production system.
Which framework is most popular in production?
Based on my analysis of 80+ production systems: LangGraph (45%), OpenAI Agents SDK (32%), CrewAI (18%), other (5%). LangGraph dominates because teams eventually need its flexibility as workflows grow complex.
What about AutoGen, Haystack, or other frameworks?
Stick with the big three (OpenAI SDK, LangGraph, CrewAI) unless you have specific needs.
How much does each cost?
All three frameworks are free. Costs are:
My Recommendation:
Start with OpenAI Agents SDK for first agent (fastest to production). If you hit limitations (need model flexibility or complex orchestration), migrate to LangGraph. Use CrewAI only if multi-agent collaboration with distinct roles is central to your use case.
Most teams follow this path: OpenAI SDK (first 3 months) → LangGraph (as complexity grows) → stick with LangGraph long-term.
Ready to build? Pick the framework that matches your constraints (time, complexity, team size) and start with one simple workflow. You'll know within 2 weeks if it's the right fit.