Prompt Engineering for AI Agents: Production-Grade Techniques
Advanced prompt engineering techniques for reliable AI agents: specific patterns, testing frameworks, and real examples from production systems achieving 90%+ accuracy.
TL;DR
The difference between a mediocre agent (72% accuracy) and a reliable one (92% accuracy) usually isn't the model; it's the prompt.
I've analyzed 80+ production agent prompts from companies running at scale. The ones that work share specific patterns. The ones that fail make predictable mistakes.
Here's what separates reliable agents from unreliable ones.
Bad (vague):
You are a helpful assistant that processes support tickets.
Good (specific):
You are a customer support classification agent for [Company Name], a B2B SaaS product.
Your role:
- Classify incoming support tickets into exactly ONE category
- Extract key information (account ID, urgency, affected feature)
- Determine if human escalation is required
Constraints:
- NEVER respond directly to customers (you classify only)
- NEVER guess account information (extract from ticket or mark unknown)
- NEVER classify as "bug" without clear error message or unexpected behavior description
- If confidence <85%, escalate to human review
Why it works: Specific constraints prevent common failure modes. "Never respond directly to customers" stops the agent from hallucinating customer responses. "Never classify as bug without error message" reduces false positives.
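These constraints also translate directly into guardrails in code. A minimal sketch of enforcing the confidence rule downstream, assuming hypothetical classify_ticket and escalate_to_human helpers:

CONFIDENCE_THRESHOLD = 0.85  # mirrors the "If confidence <85%, escalate to human review" constraint

def route_ticket(ticket_text: str) -> dict:
    result = classify_ticket(ticket_text)  # hypothetical: returns the agent's classification as a dict
    if result["confidence"] < CONFIDENCE_THRESHOLD:
        return escalate_to_human(ticket_text, reason="low_confidence")  # hypothetical escalation hook
    return result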
Bad (freeform):
Classify this support ticket and tell me what to do with it.
Good (JSON schema):
Classify this support ticket. Return valid JSON matching this exact schema:
{
  "category": "bug" | "feature_request" | "billing" | "how_to" | "account_issue",
  "priority": "P0" | "P1" | "P2" | "P3",
  "confidence": <float 0.0-1.0>,
  "reasoning": "<brief explanation of classification>",
  "escalate_to_human": <boolean>,
  "extracted_data": {
    "account_id": "<string or null>",
    "affected_feature": "<string or null>",
    "error_message": "<string or null>"
  }
}
Example:
{
  "category": "bug",
  "priority": "P1",
  "confidence": 0.92,
  "reasoning": "User reports 500 error when uploading files, includes error message",
  "escalate_to_human": false,
  "extracted_data": {
    "account_id": "ACC_12345",
    "affected_feature": "file_upload",
    "error_message": "Internal Server Error (500)"
  }
}
Why it works: Structured output is parseable and validatable, and it prevents hallucinated fields: the agent can't invent new categories or return malformed data.
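To actually get the "validatable" benefit, enforce the schema in code before anything downstream consumes the output. A minimal sketch using Pydantic v2, with field names mirroring the schema above:

import json
from typing import Literal, Optional
from pydantic import BaseModel, Field

class ExtractedData(BaseModel):
    account_id: Optional[str] = None
    affected_feature: Optional[str] = None
    error_message: Optional[str] = None

class TicketClassification(BaseModel):
    category: Literal["bug", "feature_request", "billing", "how_to", "account_issue"]
    priority: Literal["P0", "P1", "P2", "P3"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str
    escalate_to_human: bool
    extracted_data: ExtractedData

def parse_agent_output(raw: str) -> TicketClassification:
    # Raises if the model returned malformed JSON or drifted from the schema
    return TicketClassification.model_validate(json.loads(raw))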
Bad (zero-shot):
Classify this ticket into bug|feature|billing|how-to.
Ticket: "I can't log in, getting 'invalid credentials' error but I know my password is correct."
Good (few-shot with edge cases):
Classify tickets into bug|feature_request|billing|how_to based on these examples:
Example 1:
Ticket: "I can't log in, getting 'invalid credentials' error but I know my password is correct."
Classification: bug
Reasoning: Login failure despite correct credentials indicates system error.
Example 2:
Ticket: "How do I reset my password?"
Classification: how_to
Reasoning: Standard procedural question, not a system error.
Example 3:
Ticket: "I can't log in, forgot my password."
Classification: how_to
Reasoning: User error, not system error (contrast with Example 1).
Example 4:
Ticket: "Can you add SSO support?"
Classification: feature_request
Reasoning: Requesting new functionality.
Example 5:
Ticket: "Why was I charged $99? I thought this was free."
Classification: billing
Reasoning: Question about charges or invoicing.
Now classify this ticket:
[USER TICKET HERE]
Why it works: Examples teach edge-case handling. Example 1 vs. Example 3 shows the distinction between "can't log in because of a bug" and "can't log in because the user forgot their password." Without examples, agents misclassify user errors as bugs.
How many examples? From my analysis, the sweet spot is 8-12 examples covering common cases plus deliberate edge cases.
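It also helps to keep the examples as data and assemble the prompt programmatically, so adding an edge case doesn't mean hand-editing a wall of text. A minimal sketch (the example list and helper are illustrative):

FEW_SHOT_EXAMPLES = [
    {"ticket": "I can't log in, getting 'invalid credentials' error but I know my password is correct.",
     "classification": "bug",
     "reasoning": "Login failure despite correct credentials indicates system error."},
    {"ticket": "I can't log in, forgot my password.",
     "classification": "how_to",
     "reasoning": "User error, not system error."},
    # ... 8-12 total, mixing common cases and edge cases
]

def build_few_shot_prompt(ticket_text: str) -> str:
    blocks = [
        f"Example {i}:\nTicket: \"{ex['ticket']}\"\nClassification: {ex['classification']}\nReasoning: {ex['reasoning']}"
        for i, ex in enumerate(FEW_SHOT_EXAMPLES, start=1)
    ]
    return (
        "Classify tickets into bug|feature_request|billing|how_to based on these examples:\n\n"
        + "\n\n".join(blocks)
        + f"\n\nNow classify this ticket:\n{ticket_text}"
    )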
Bad (direct answer):
Classify this ticket: "..."
Return: bug
Good (reasoning first):
Classify this ticket. Think step-by-step:
1. What is the user trying to do?
2. What went wrong?
3. Is this a system error or user error?
4. Does it match our criteria for each category?
5. What category best fits?
Then return your classification.
Why it works: Chain-of-thought improves accuracy by 7-15 percentage points on complex classification tasks (Google Research, 2023). It forces the model to reason through ambiguous cases instead of pattern-matching superficially.
When to use: ambiguous or multi-criteria decisions, edge cases your examples don't cover, and tasks where you want the reasoning trace for debugging.
When NOT to use: simple, high-volume classifications where the extra reasoning tokens only add latency and cost.
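One common way to combine chain-of-thought with structured output is to let the model reason first, then require the final answer as a single JSON object on the last line and parse only that. A sketch, with call_llm standing in for whatever model client you use:

import json

COT_PROMPT = """Classify this ticket. Think step-by-step:
1. What is the user trying to do?
2. What went wrong?
3. Is this a system error or user error?
4. Does it match our criteria for each category?
5. What category best fits?

After your reasoning, output the final classification as a single JSON object on the last line.

Ticket: {ticket_text}"""

def classify_with_reasoning(ticket_text: str) -> dict:
    response = call_llm(COT_PROMPT.format(ticket_text=ticket_text))  # hypothetical model call
    final_line = response.strip().splitlines()[-1]
    return json.loads(final_line)  # only the final JSON goes downstream; keep the reasoning for logs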
Production-grade prompts aren't written once; they're refined through systematic testing.
Don't use synthetic data. Use real examples from your workflow with human-labeled ground truth.
Example evaluation set for support ticket classification:
[
  {
    "ticket": "I can't log in, getting 500 error",
    "ground_truth": "bug",
    "priority": "P1"
  },
  {
    "ticket": "How do I export my data?",
    "ground_truth": "how_to",
    "priority": "P3"
  },
  // ... 98 more real examples
]
Evaluation set composition: mostly common cases pulled from real tickets, plus deliberate edge cases and past failures, each with a human-labeled ground truth.
def evaluate_agent(agent_fn, eval_set):
    """Test agent against evaluation set."""
    correct = 0
    errors = []
    for example in eval_set:
        agent_output = agent_fn(example["ticket"])
        if agent_output["category"] == example["ground_truth"]:
            correct += 1
        else:
            errors.append({
                "ticket": example["ticket"],
                "expected": example["ground_truth"],
                "actual": agent_output["category"],
                "reasoning": agent_output.get("reasoning", ""),
            })
    accuracy = correct / len(eval_set)
    return {
        "accuracy": accuracy,
        "errors": errors,
    }

# Run evaluation
results = evaluate_agent(classify_ticket_agent, evaluation_set)
print(f"Accuracy: {results['accuracy']*100:.1f}%")

# Analyze errors
for error in results["errors"]:
    print(f"\nMisclassified: {error['ticket']}")
    print(f"Expected: {error['expected']}, Got: {error['actual']}")
    print(f"Agent reasoning: {error['reasoning']}")
Common patterns in errors:
Pattern 1: Misclassifying user errors as bugs
Pattern 2: Over-classifying as "how_to"
Pattern 3: Ambiguous priority assignment
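To surface patterns like these, aggregate the error list from evaluate_agent into (expected, actual) counts instead of reading errors one at a time:

from collections import Counter

def confusion_pairs(errors):
    """Count (expected, actual) pairs to reveal systematic misclassifications."""
    pairs = Counter((e["expected"], e["actual"]) for e in errors)
    for (expected, actual), count in pairs.most_common():
        print(f"{expected} -> {actual}: {count} errors")

confusion_pairs(results["errors"])
# e.g. a large "how_to -> bug" count points at user errors being misclassified as bugs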
Iteration 1: baseline prompt
Iteration 2: add few-shot examples
Iteration 3: add priority criteria
Iteration 4: refine categories
Typical iteration cycles: 4-6 rounds over 3-6 weeks to reach 90%+ accuracy.
Challenge: Classify support tickets into 5 categories with 90%+ accuracy
Initial prompt (accuracy: 76%):
You are a support agent. Classify tickets into bug, feature request, billing, how-to, or other.
Ticket: {ticket_text}
Final prompt (accuracy: 92%, after 6 weeks of iteration):
You are a support ticket classification agent for Mercury (banking for startups).
Classify tickets into exactly ONE category:
- bug: System error, unexpected behavior, or feature not working as designed
- feature_request: Request for new functionality or enhancement
- billing: Questions about charges, invoices, or account balance
- how_to: Procedural questions about using existing features
- account_issue: Login problems, access requests, or account settings
Priority levels:
- P0: Service completely down for all users
- P1: Critical feature broken for multiple users
- P2: Issue affecting single user or account
- P3: Question or minor inconvenience
Examples:
[10 examples covering edge cases...]
Return valid JSON:
{
  "category": "<category>",
  "priority": "<priority>",
  "confidence": <0.0-1.0>,
  "reasoning": "<brief explanation>"
}
If confidence <0.85, return "escalate_to_human": true
Ticket: {ticket_text}
Key improvements: one-line definitions for every category, explicit P0-P3 priority criteria, 10 few-shot examples covering edge cases, a strict JSON output schema, and a confidence threshold (<0.85) that routes tickets to human review.
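Here's roughly how a prompt like this gets wired into application code. The file path, call_llm wrapper, and escalate_to_human hook are all illustrative placeholders:

import json

# Illustrative path; store the final prompt above with a {ticket_text} placeholder.
PROMPT_TEMPLATE = open("prompts/ticket_classification.txt").read()

def classify_ticket(ticket_text: str) -> dict:
    # The template contains literal JSON braces, so replace() is safer than str.format() here.
    prompt = PROMPT_TEMPLATE.replace("{ticket_text}", ticket_text)
    raw = call_llm(prompt)    # hypothetical wrapper around your model client
    result = json.loads(raw)  # validate against the schema (e.g. Pydantic) before trusting it
    if result.get("escalate_to_human"):
        return escalate_to_human(ticket_text, result)  # hypothetical escalation hook
    return result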
Challenge: Categorize expenses into accounting categories with 85%+ accuracy
Final prompt (accuracy: 89%):
You are an expense categorization agent for Ramp corporate cards.
Categorize expenses into these accounting categories:
- software_saas: Software subscriptions, APIs, cloud services
- advertising: Google Ads, Facebook Ads, LinkedIn Ads, sponsored content
- travel: Flights, hotels, rental cars, Uber/Lyft for business travel
- meals_entertainment: Client dinners, team meals, conference catering
- office_supplies: Equipment, furniture, supplies for physical office
- contractor_payments: Payments to freelancers, agencies, or contractors
- other: Anything not clearly fitting above categories
Decision criteria:
- AWS, Google Cloud, Azure → software_saas (not other)
- Uber/Lyft → travel only if >$50 or business hours, else meals_entertainment
- Restaurants → meals_entertainment (unless explicitly catering invoice)
- Domain registrations, SSL certificates → software_saas
Examples:
[12 examples including edge cases like "Uber $12 on Saturday" → meals_entertainment vs "Uber $87 to airport on Monday" → travel...]
Return JSON:
{
  "category": "<category>",
  "confidence": <0.0-1.0>,
  "merchant_type": "<detected type>",
  "amount_flag": <boolean if amount unusual for this merchant>
}
Transaction:
Merchant: {merchant}
Amount: {amount}
Description: {description}
Date: {date}
Employee: {employee_title}
Key improvements: explicit decision criteria for ambiguous merchants (cloud providers, rideshares, restaurants, domains), amount and timing heuristics for Uber/Lyft, 12 examples that pair near-identical transactions with different correct categories, and an amount_flag for spend that looks unusual for the merchant.
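Because the decision criteria are this concrete, you can also cross-check the model deterministically and flag disagreements for review. A sketch of the Uber/Lyft rule (the business-hours window and ISO date format are assumptions):

from datetime import datetime
from typing import Optional

RIDESHARE_MERCHANTS = {"uber", "lyft"}

def expected_rideshare_category(merchant: str, amount: float, date: str) -> Optional[str]:
    """Apply the prompt's Uber/Lyft rule; returns None for other merchants."""
    if merchant.lower() not in RIDESHARE_MERCHANTS:
        return None
    ride_time = datetime.fromisoformat(date)  # assumes an ISO-8601 timestamp
    business_hours = ride_time.weekday() < 5 and 8 <= ride_time.hour <= 18  # assumed window
    return "travel" if amount > 50 or business_hours else "meals_entertainment"

def disagrees_with_rule(txn: dict, agent_category: str) -> bool:
    expected = expected_rideshare_category(txn["merchant"], txn["amount"], txn["date"])
    return expected is not None and expected != agent_category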
Challenge: Score inbound leads 0-10 with 88%+ inter-rater reliability vs human scorers
Final prompt (accuracy: 91% agreement with human scores ±1 point):
You are a lead qualification agent for Glean (enterprise search).
Score leads 0-10 based on ICP fit:
Scoring criteria:
+3: Company size 50-500 employees (our sweet spot)
+2: Job title indicates buying authority (VP, Director, Head of, CTO, COO)
+2: Company uses tech stack we integrate with (Slack, Notion, Google Workspace)
+1: Company is funded (Series A or later)
+1: Job function matches our use cases (eng, product, ops, sales)
+1: High-intent message (mentions specific pain point, timeline, or competitor)
-2: Company <10 employees or >5000 employees (outside our ICP)
-1: Job title = IC/junior role (no buying authority)
-1: Generic "just checking out options" message (low intent)
Examples:
[15 examples with scoring breakdown...]
Return JSON:
{
  "score": <0-10>,
  "classification": "hot" (≥7) | "warm" (4-6) | "cold" (<4),
  "reasoning": "<point-by-point breakdown>",
  "recommended_action": "book_meeting" | "nurture_sequence" | "archive"
}
Lead data:
Name: {name}
Company: {company}
Company size: {size}
Job title: {title}
Tech stack: {tech_stack}
Funding: {funding}
Message: {message}
Key improvements: an explicit additive scoring rubric with point values, negative criteria that penalize out-of-ICP company sizes and low-intent messages, 15 examples with point-by-point scoring breakdowns, and a recommended_action field that maps the score band to a concrete next step.
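The score bands themselves are fixed, so don't trust the model to apply them; enforce the mapping in code. A small consistency check (the band-to-action mapping shown is an assumption for illustration):

ACTION_BY_BAND = {"hot": "book_meeting", "warm": "nurture_sequence", "cold": "archive"}  # assumed mapping

def expected_classification(score: int) -> str:
    if score >= 7:
        return "hot"
    if score >= 4:
        return "warm"
    return "cold"

def check_lead_output(output: dict) -> dict:
    """Force the classification (and a default action) to agree with the model's own score."""
    band = expected_classification(output["score"])
    output["classification"] = band
    output.setdefault("recommended_action", ACTION_BY_BAND[band])
    return output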
Problem: Trying to make one agent do too many things in a single prompt (classify + prioritize + extract + suggest a solution)
Result: Accuracy drops and the agent gets confused
Fix: Break the work into multiple specialized agents or sequential steps, as in the sketch below
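A minimal sketch of the sequential version, where each stage gets its own focused prompt (the three *_agent functions are hypothetical single-purpose LLM calls):

def process_ticket(ticket_text: str) -> dict:
    # Stage 1: classification only, with its own focused prompt
    classification = classify_agent(ticket_text)                          # hypothetical LLM call
    # Stage 2: priority, given the category
    priority = prioritize_agent(ticket_text, classification["category"])  # hypothetical LLM call
    # Stage 3: structured extraction, given the category
    extracted = extract_agent(ticket_text, classification["category"])    # hypothetical LLM call
    return {**classification, "priority": priority, "extracted_data": extracted}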
Problem: The agent returns freeform text instead of structured JSON, breaking downstream systems
Fix: Validate the output against a schema and retry if invalid
import logging

logger = logging.getLogger(__name__)

def classify_with_validation(ticket):
    for attempt in range(3):
        output = agent_classify(ticket)
        if validate_json_schema(output):
            return output
        logger.warning(f"Invalid output, retry {attempt+1}")
    # After 3 failures, escalate to human
    return escalate_to_human(ticket, reason="invalid_agent_output")
Problem: Assuming the prompt works without systematic evaluation
Result: You deploy at 68% accuracy and users lose trust
Fix: Test on 100+ real examples, hit ≥85% accuracy before production
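One way to make that bar non-negotiable is to gate deployment on the evaluation from earlier, for example as a test that runs before every prompt change:

def test_prompt_meets_accuracy_bar():
    results = evaluate_agent(classify_ticket_agent, evaluation_set)
    # Block any prompt change that regresses below the production bar
    assert results["accuracy"] >= 0.85, (
        f"Accuracy {results['accuracy']:.1%} is below the 0.85 threshold; "
        f"{len(results['errors'])} examples misclassified"
    )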
How long should prompts be?
No strict limit, but typical production prompts: 300-800 words. Longer prompts (>1,000 words) risk model attention issues. If your prompt is >1,500 words, consider breaking into multiple agents.
Should I use temperature=0 for determinism?
For classification and structured tasks: Yes, temperature=0 or 0.1 (near-deterministic). For creative tasks (draft emails, generate content): temperature=0.7-0.9.
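For reference, this is roughly what the deterministic setting looks like with the OpenAI Python SDK; the model name is a placeholder and the same idea applies to any provider:

from openai import OpenAI

client = OpenAI()

def classify_deterministic(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # near-deterministic: appropriate for classification and structured output
    )
    return response.choices[0].message.content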
What accuracy is good enough for production?
Depends on stakes. For low-risk internal workflows, roughly 85% with low-confidence cases escalated to a human can be acceptable. For customer-facing or financial decisions, aim for 90%+ and keep a human in the loop on anything the agent flags.
How often should I update prompts?
Review monthly: analyze errors from past month, add examples for new failure patterns, refine criteria. Major iterations every quarter as workflows evolve.
Can I use prompt optimization tools (DSPy, PromptPerfect)?
Yes, but manually review outputs. Automated tools help generate variations, but human review ensures prompts match your specific use case and edge cases.
Final word: Reliable agents aren't built with one clever prompt; they're refined through systematic testing, error analysis, and iteration. Budget 4-6 weeks to go from a 70% baseline to 90%+ production-ready.
Start with structured output, add few-shot examples, test on 100+ real cases, iterate weekly. You'll get there.