AI Agent Retry Strategies: Exponential Backoff and Graceful Degradation
Build resilient AI agents that handle failures gracefully - covering retry patterns, exponential backoff, circuit breakers, and fallback strategies for production reliability.
At 2am, OpenAI's API starts returning 503 errors. Your agent retries immediately, fails, retries again, fails - hammering the already-struggling service while your users see timeout errors. By morning, you've burned through your rate limit budget, your monitoring is screaming, and users have given up.
Retry strategies determine whether your agents weather transient failures gracefully or amplify them into cascading outages. The difference between "worked fine during the incident" and "made everything worse" is often just a few lines of retry logic.
This guide covers production-tested patterns for handling LLM API failures, from basic exponential backoff to sophisticated multi-model fallback chains.
Key takeaways
- Classify errors first. Retrying authentication failures wastes time; not retrying rate limits loses requests unnecessarily.
- Exponential backoff + jitter is the baseline. Start at 1s, cap at 60s, add random jitter.
- Circuit breakers protect both you and the service. When something is clearly broken, stop trying.
- Fallbacks should be pre-planned and tested, not improvised during incidents.
Different errors demand different responses. Before implementing retry logic, classify what you're handling.
| Error Type | HTTP Codes | Retry? | Strategy |
|---|---|---|---|
| Rate limit | 429 | Yes | Backoff per Retry-After header |
| Server error | 500, 502, 503, 504 | Yes | Exponential backoff |
| Timeout | - | Maybe | Shorter timeout, limited retries |
| Client error | 400, 401, 403 | No | Fix request, don't retry |
| Model overload | 529 (Anthropic) | Yes | Longer backoff |
| Context length | 400 (specific) | No | Reduce input, don't retry same |
interface ErrorClassification {
retryable: boolean;
strategy: 'none' | 'immediate' | 'backoff' | 'long_backoff';
retryAfter?: number; // Milliseconds
switchModel?: boolean;
}
function classifyError(error: any): ErrorClassification {
// Check for rate limit with Retry-After header
if (error.status === 429) {
const retryAfter = error.headers?.['retry-after'];
return {
retryable: true,
strategy: 'backoff',
retryAfter: retryAfter ? parseInt(retryAfter) * 1000 : 60000
};
}
// Server errors - temporary, retry with backoff
if ([500, 502, 503, 504].includes(error.status)) {
return {
retryable: true,
strategy: 'backoff'
};
}
// Anthropic overloaded
if (error.status === 529) {
return {
retryable: true,
strategy: 'long_backoff',
retryAfter: 120000 // 2 minutes
};
}
// Timeout - retry but not indefinitely
if (error.code === 'ETIMEDOUT' || error.code === 'ESOCKETTIMEDOUT') {
return {
retryable: true,
strategy: 'immediate', // Try again quickly
};
}
// Context length exceeded - don't retry same request
if (error.status === 400 && error.message?.includes('context_length')) {
return {
retryable: false,
strategy: 'none',
switchModel: true // Try model with larger context
};
}
// Auth errors - fix the problem, don't retry
if ([401, 403].includes(error.status)) {
return {
retryable: false,
strategy: 'none'
};
}
// Unknown errors - conservative retry
return {
retryable: true,
strategy: 'backoff'
};
}
Exponential backoff with jitter is the gold standard for retry logic: wait times grow exponentially between attempts, and random jitter prevents clients from retrying in lockstep.
interface BackoffConfig {
initialDelayMs: number;
maxDelayMs: number;
maxRetries: number;
multiplier: number;
jitterFactor: number; // 0 to 1, percentage of delay to randomise
}
const defaultConfig: BackoffConfig = {
initialDelayMs: 1000,
maxDelayMs: 60000,
maxRetries: 5,
multiplier: 2,
jitterFactor: 0.3
};
function calculateDelay(
attempt: number,
config: BackoffConfig = defaultConfig
): number {
// Base delay: initial * multiplier^attempt
const baseDelay = Math.min(
config.initialDelayMs * Math.pow(config.multiplier, attempt),
config.maxDelayMs
);
// Add jitter: ±jitterFactor of base delay
const jitter = baseDelay * config.jitterFactor * (Math.random() * 2 - 1);
return Math.max(0, Math.floor(baseDelay + jitter));
}
// Small helper used by the retry functions below
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function withRetry<T>(
operation: () => Promise<T>,
config: BackoffConfig = defaultConfig
): Promise<T> {
let lastError: Error;
for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
try {
return await operation();
} catch (error) {
lastError = error as Error;
const classification = classifyError(error);
if (!classification.retryable) {
throw error; // Don't retry non-retryable errors
}
if (attempt === config.maxRetries) {
break; // No more retries
}
// Calculate delay
const delay = classification.retryAfter ??
calculateDelay(attempt, config);
console.log(`Attempt ${attempt + 1} failed, retrying in ${delay}ms`);
await sleep(delay);
}
}
throw lastError!;
}
// Usage
const response = await withRetry(
() => openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Hello' }]
}),
{ ...defaultConfig, maxRetries: 3 }
);
Without jitter, if 100 clients hit a rate limit at the same time, they all retry at the same time, creating another spike. Jitter spreads retries across time.
Without jitter (thundering herd):
Time 0s: [100 requests] → Rate limited
Time 2s: [100 retries] → Rate limited again
Time 4s: [100 retries] → Rate limited again
With jitter (distributed):
Time 0s: [100 requests] → Rate limited
Time 1.5-2.5s: [~30 retries] → Some succeed
Time 3-5s: [~50 retries] → More succeed
Time 5-8s: [~20 retries] → All succeed
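You can see the spread directly by sampling the calculateDelay helper above for a batch of clients:

// Throwaway sketch: sample calculateDelay for a batch of clients to see how
// jitter spreads their retries, versus the single spike you get without it.
function simulateRetrySpread(clients: number, attempt: number): void {
  const noJitter = calculateDelay(attempt, { ...defaultConfig, jitterFactor: 0 });
  const jittered = Array.from({ length: clients }, () => calculateDelay(attempt, defaultConfig));

  console.log(`No jitter: all ${clients} clients retry at ${noJitter}ms`);
  console.log(`With jitter: retries spread between ${Math.min(...jittered)}ms and ${Math.max(...jittered)}ms`);
}

simulateRetrySpread(100, 1); // attempt 1: ~2000ms base delay, spread roughly 1400-2600ms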
When a user is waiting, absolute time matters more than attempt count - budget retries against a deadline instead of a fixed number of attempts.
async function withDeadline<T>(
operation: () => Promise<T>,
deadlineMs: number,
minRetryDelayMs: number = 500
): Promise<T> {
const deadline = Date.now() + deadlineMs;
let attempt = 0;
let lastError: Error;
while (Date.now() < deadline) {
try {
return await operation();
} catch (error) {
lastError = error as Error;
const classification = classifyError(error);
if (!classification.retryable) {
throw error;
}
// Calculate how much time we have left
const remaining = deadline - Date.now();
if (remaining < minRetryDelayMs) {
break; // Not enough time for another attempt
}
// Use shorter delays as deadline approaches
const delay = Math.min(
calculateDelay(attempt),
remaining - minRetryDelayMs
);
await sleep(delay);
attempt++;
}
}
throw new DeadlineExceededError(
`Operation failed after ${attempt} attempts`,
lastError!
);
}
// Usage: Must complete within 10 seconds
const response = await withDeadline(
() => callLLM(prompt),
10000
);
Some operations shouldn't be retried blindly - a retried email send or payment call can create duplicates. Check whether the operation already completed before each new attempt.
interface IdempotentOperation<T> {
execute: () => Promise<T>;
checkIfCompleted: () => Promise<T | null>;
idempotencyKey: string;
}
async function withIdempotentRetry<T>(
operation: IdempotentOperation<T>,
config: BackoffConfig
): Promise<T> {
// First, check if operation already completed
const existing = await operation.checkIfCompleted();
if (existing !== null) {
return existing;
}
return withRetry(async () => {
// Re-check before each attempt (operation might have completed during wait)
const completed = await operation.checkIfCompleted();
if (completed !== null) {
return completed;
}
return operation.execute();
}, config);
}
// Usage: Sending email (don't want duplicates)
await withIdempotentRetry({
execute: () => sendEmail(emailContent),
checkIfCompleted: () => checkEmailSent(emailId),
idempotencyKey: emailId
}, defaultConfig);
When a service is down, continuing to send requests makes things worse. Circuit breakers detect failure patterns and "open" to reject requests immediately.
CLOSED → Requests flow normally
↓ (failure threshold exceeded)
OPEN → Requests rejected immediately
↓ (timeout expires)
HALF-OPEN → Allow one test request
↓ (success) ↓ (failure)
CLOSED OPEN
interface CircuitBreakerConfig {
failureThreshold: number; // Failures before opening
successThreshold: number; // Successes to close from half-open
timeout: number; // Time in open state before half-open
monitoringWindow: number; // Window for counting failures
}
class CircuitBreaker {
private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
private failures: number[] = []; // Timestamps of failures
private successCount = 0;
private openedAt?: number;
constructor(
private name: string,
private config: CircuitBreakerConfig = {
failureThreshold: 5,
successThreshold: 2,
timeout: 30000,
monitoringWindow: 60000
}
) {}
async execute<T>(operation: () => Promise<T>): Promise<T> {
// Check if we should allow this request
if (!this.allowRequest()) {
throw new CircuitOpenError(
`Circuit ${this.name} is open, rejecting request`
);
}
try {
const result = await operation();
this.recordSuccess();
return result;
} catch (error) {
this.recordFailure();
throw error;
}
}
private allowRequest(): boolean {
switch (this.state) {
case 'CLOSED':
return true;
case 'OPEN':
// Check if timeout has passed
if (Date.now() - this.openedAt! > this.config.timeout) {
this.state = 'HALF_OPEN';
this.successCount = 0;
console.log(`Circuit ${this.name} entering half-open state`);
return true;
}
return false;
case 'HALF_OPEN':
return true; // Allow test requests
}
}
private recordSuccess(): void {
if (this.state === 'HALF_OPEN') {
this.successCount++;
if (this.successCount >= this.config.successThreshold) {
this.state = 'CLOSED';
this.failures = [];
console.log(`Circuit ${this.name} closed after recovery`);
}
}
}
private recordFailure(): void {
const now = Date.now();
// Clean old failures outside monitoring window
this.failures = this.failures.filter(
t => now - t < this.config.monitoringWindow
);
this.failures.push(now);
if (this.state === 'HALF_OPEN') {
// Single failure in half-open reopens circuit
this.state = 'OPEN';
this.openedAt = now;
console.log(`Circuit ${this.name} reopened after half-open failure`);
} else if (this.failures.length >= this.config.failureThreshold) {
this.state = 'OPEN';
this.openedAt = now;
console.log(`Circuit ${this.name} opened after ${this.failures.length} failures`);
}
}
getState(): string {
return this.state;
}
}
// Usage
const openaiCircuit = new CircuitBreaker('openai', {
failureThreshold: 3,
successThreshold: 2,
timeout: 60000,
monitoringWindow: 30000
});
async function callOpenAI(prompt: string) {
return openaiCircuit.execute(() =>
openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: prompt }]
})
);
}
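Circuit breakers compose naturally with the withRetry helper from earlier. One reasonable arrangement - retries inside, breaker outside - is sketched below; the opposite nesting is also defensible, so treat this as a design choice rather than the one true wiring.

// Sketch: withRetry absorbs brief blips inside a single breaker call; repeated
// exhausted retries then trip the breaker, which fails fast until recovery.
async function callOpenAIResilient(prompt: string) {
  return openaiCircuit.execute(() =>
    withRetry(
      () => openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }]
      }),
      { ...defaultConfig, maxRetries: 2 }
    )
  );
}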
When retries fail and circuits open, you need alternatives - a pre-planned fallback chain of models ordered by preference and cost.
interface ModelConfig {
provider: 'openai' | 'anthropic' | 'google';
model: string;
maxTokens: number;
costPer1kTokens: number;
}
const modelChain: ModelConfig[] = [
{ provider: 'openai', model: 'gpt-4o', maxTokens: 128000, costPer1kTokens: 0.005 },
{ provider: 'anthropic', model: 'claude-sonnet-4-20250514', maxTokens: 200000, costPer1kTokens: 0.003 },
{ provider: 'openai', model: 'gpt-4o-mini', maxTokens: 128000, costPer1kTokens: 0.00015 },
{ provider: 'google', model: 'gemini-1.5-flash', maxTokens: 1000000, costPer1kTokens: 0.000075 }
];
async function callWithFallback(
messages: Message[],
config: { maxAttempts?: number } = {}
): Promise<LLMResponse> {
const maxAttempts = config.maxAttempts ?? modelChain.length;
for (let i = 0; i < Math.min(maxAttempts, modelChain.length); i++) {
const model = modelChain[i];
try {
const response = await callModel(model, messages);
return response;
    } catch (error: any) {
      console.warn(`Model ${model.model} failed:`, error.message);
      // Don't try other models for errors that switching won't fix (auth, malformed request)
      const classification = classifyError(error);
      if (!classification.retryable && !classification.switchModel) {
        throw error;
      }
      if (i === Math.min(maxAttempts, modelChain.length) - 1) {
        throw new AllModelsFailed('All models in fallback chain failed', error);
      }
    }
}
throw new AllModelsFailed('Exhausted fallback chain');
}
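callWithFallback assumes a callModel helper that dispatches to the right provider SDK and normalises the response. A rough sketch of one way to write it - the anthropic client, the Message type, and the { text, model } response shape are assumptions, and the Google branch is omitted for brevity:

// Hypothetical dispatcher for the fallback chain. Normalises each provider's
// response into a simple { text, model } shape - adapt to your own LLMResponse.
async function callModel(model: ModelConfig, messages: Message[]): Promise<LLMResponse> {
  if (model.provider === 'openai') {
    const res = await openai.chat.completions.create({
      model: model.model,
      messages
    });
    return { text: res.choices[0].message.content ?? '', model: model.model };
  }

  if (model.provider === 'anthropic') {
    const res = await anthropic.messages.create({
      model: model.model,
      max_tokens: 1024,
      messages
    });
    const first = res.content[0];
    return { text: first.type === 'text' ? first.text : '', model: model.model };
  }

  throw new Error(`No client configured for provider: ${model.provider}`);
}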
For common queries, serve cached responses when live calls fail.
import { createHash } from 'crypto';

class CachedFallback {
  private cache: Map<string, { response: string; cachedAt: Date }> = new Map();
  private maxAge = 3600000; // 1 hour

  constructor(
    // The live LLM call to wrap - e.g. a withRetry-wrapped provider call
    private callLLM: (prompt: string) => Promise<string>
  ) {}
async callWithCache(
prompt: string,
options: { allowStale?: boolean } = {}
): Promise<string> {
const cacheKey = this.hashPrompt(prompt);
const cached = this.cache.get(cacheKey);
try {
// Try live call
const response = await this.callLLM(prompt);
// Update cache on success
this.cache.set(cacheKey, {
response,
cachedAt: new Date()
});
return response;
} catch (error) {
// On failure, try cache
if (cached && (options.allowStale || !this.isStale(cached))) {
console.log('Serving cached response due to LLM failure');
return cached.response;
}
throw error;
}
}
private isStale(entry: { cachedAt: Date }): boolean {
return Date.now() - entry.cachedAt.getTime() > this.maxAge;
}
private hashPrompt(prompt: string): string {
    // Simple content hash for the cache key
    return createHash('md5').update(prompt).digest('hex');
}
}
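Usage might look like this - fetchCompletion and userPrompt are placeholders for your own live call and input, with the withRetry helper from earlier handling transient failures before the cache is consulted:

// Sketch: live call first, cached response only when the live call fails
const cached = new CachedFallback(
  (prompt) => withRetry(() => fetchCompletion(prompt))
);

const answer = await cached.callWithCache(userPrompt, { allowStale: true });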
When all else fails, communicate clearly.
async function handleWithGracefulDegradation(
operation: () => Promise<string>,
context: { userId: string; operationType: string }
): Promise<{ result: string; degraded: boolean }> {
try {
const result = await operation();
return { result, degraded: false };
} catch (error) {
// Log for investigation
console.error('Operation failed after all retries:', {
...context,
error: error.message
});
// Return graceful message
const degradedResponse = getDegradedResponse(context.operationType);
return { result: degradedResponse, degraded: true };
}
}
function getDegradedResponse(operationType: string): string {
const responses: Record<string, string> = {
'search': "I'm having trouble searching right now. Please try again in a moment, or rephrase your question.",
'analysis': "I can't perform that analysis at the moment. I've noted your request - would you like to try a simpler question?",
'generation': "Content generation is temporarily unavailable. Please try again shortly.",
'default': "I encountered an issue processing your request. Please try again in a few moments."
};
return responses[operationType] ?? responses.default;
}
Track retry patterns to identify systemic issues before they become incidents.
const retryMetrics = {
recordAttempt(
operation: string,
attempt: number,
success: boolean,
durationMs: number,
error?: string
) {
metrics.histogram('retry.duration_ms', durationMs, { operation, attempt: String(attempt) });
metrics.increment('retry.attempts', { operation, success: String(success) });
if (!success && error) {
metrics.increment('retry.errors', { operation, error_type: error });
}
// Alert on high retry rates
if (attempt > 3) {
metrics.increment('retry.high_attempt_count', { operation });
}
},
recordCircuitState(name: string, state: string) {
metrics.gauge('circuit.state', state === 'OPEN' ? 1 : 0, { circuit: name });
},
recordFallback(operation: string, fallbackLevel: number) {
metrics.increment('fallback.triggered', { operation, level: String(fallbackLevel) });
}
};
How many retries should I use? For user-facing requests, 2-3 retries with aggressive timeouts. For background jobs, up to 5-7 retries with longer delays. Never retry indefinitely.
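In terms of the BackoffConfig defined earlier, that might translate into two presets (numbers are illustrative, not prescriptive):

// Illustrative presets for the BackoffConfig defined earlier
const userFacingRetry: BackoffConfig = {
  initialDelayMs: 500,
  maxDelayMs: 5000,    // keep per-attempt waits short while a user is watching
  maxRetries: 2,
  multiplier: 2,
  jitterFactor: 0.3
};

const backgroundJobRetry: BackoffConfig = {
  initialDelayMs: 2000,
  maxDelayMs: 120000,  // background work can afford long waits
  maxRetries: 6,
  multiplier: 2,
  jitterFactor: 0.3
};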
Should I retry 500 errors? Yes, but with exponential backoff. 500 errors usually indicate temporary issues (overload, deployment). However, if you see consistent 500s for the same request, investigate rather than retrying forever.
How do I handle partial failures? For streaming responses, checkpoint progress and resume. For batch operations, track which items succeeded and retry only the failures.
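For the batch case, one approach is to keep a results map keyed by item id and re-run only the missing entries between passes. A minimal sketch, reusing the withRetry helper from earlier:

// Sketch: between passes, retry only the items that haven't succeeded yet.
async function processBatch<I, O>(
  items: { id: string; input: I }[],
  process: (input: I) => Promise<O>,
  passes: number = 3
): Promise<Map<string, O>> {
  const results = new Map<string, O>();

  for (let pass = 0; pass < passes && results.size < items.length; pass++) {
    const pending = items.filter(item => !results.has(item.id));
    for (const item of pending) {
      try {
        results.set(item.id, await withRetry(() => process(item.input)));
      } catch {
        // Leave the item for the next pass (or a dead-letter queue after the last)
      }
    }
  }

  return results;
}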
Should connection and request timeouts be handled the same way? Treat them differently. Connection timeouts suggest network issues - retry immediately. Request timeouts might mean slow processing - wait longer before retrying.
Should fallback models also get retry logic? Generally yes, but consider reducing retries for fallbacks since you're already in a degraded state and users are waiting.
Retry strategies are the difference between resilient agents and fragile ones. Classify errors appropriately, use exponential backoff with jitter, implement circuit breakers for cascade protection, and always have fallbacks ready.
Implementation checklist:
- Classify errors before retrying: 429s, 5xx, and timeouts are retryable; 400/401/403 are not.
- Use exponential backoff with jitter and respect Retry-After headers.
- Enforce deadlines on user-facing requests rather than fixed attempt counts.
- Add a circuit breaker per provider to stop retry storms during outages.
- Pre-plan and test fallbacks: alternative models, cached responses, graceful degradation messages.
- Instrument retries, circuit state, and fallback usage so systemic issues surface before they become incidents.