AI Agent Retry Strategies: Exponential Backoff and Graceful Degradation
Build resilient AI agents that handle failures gracefully - covering retry patterns, exponential backoff, circuit breakers, and fallback strategies for production reliability.
At 2am, OpenAI's API starts returning 503 errors. Your agent retries immediately, fails, retries again, fails - hammering the already-struggling service while your users see timeout errors. By morning, you've burned through your rate limit budget, your monitoring is screaming, and users have given up.
Retry strategies determine whether your agents weather transient failures gracefully or amplify them into cascading outages. The difference between "worked fine during the incident" and "made everything worse" is often just a few lines of retry logic.
This guide covers production-tested patterns for handling LLM API failures, from basic exponential backoff to sophisticated multi-model fallback chains.
Key takeaways
- Classify errors first. Retrying authentication failures wastes time; not retrying rate limits loses requests unnecessarily.
- Exponential backoff + jitter is the baseline. Start at 1s, cap at 60s, add random jitter.
- Circuit breakers protect both you and the service. When something is clearly broken, stop trying.
- Fallbacks should be pre-planned and tested, not improvised during incidents.
Different errors demand different responses. Before implementing retry logic, classify what you're handling.
| Error Type | HTTP Codes | Retry? | Strategy |
|---|---|---|---|
| Rate limit | 429 | Yes | Backoff per Retry-After header |
| Server error | 500, 502, 503, 504 | Yes | Exponential backoff |
| Timeout | - | Maybe | Shorter timeout, limited retries |
| Client error | 400, 401, 403 | No | Fix request, don't retry |
| Model overload | 529 (Anthropic) | Yes | Longer backoff |
| Context length | 400 (specific) | No | Reduce input, don't retry same |
interface ErrorClassification {
retryable: boolean;
strategy: 'none' | 'immediate' | 'backoff' | 'long_backoff';
retryAfter?: number; // Milliseconds
switchModel?: boolean;
}
function classifyError(error: any): ErrorClassification {
// Check for rate limit with Retry-After header
if (error.status === 429) {
const retryAfter = error.headers?.['retry-after'];
return {
retryable: true,
strategy: 'backoff',
retryAfter: retryAfter ? parseInt(retryAfter) * 1000 : 60000
};
}
// Server errors - temporary, retry with backoff
if ([500, 502, 503, 504].includes(error.status)) {
return {
retryable: true,
strategy: 'backoff'
};
}
// Anthropic overloaded
if (error.status === 529) {
return {
retryable: true,
strategy: 'long_backoff',
retryAfter: 120000 // 2 minutes
};
}
// Timeout - retry but not indefinitely
if (error.code === 'ETIMEDOUT' || error.code === 'ESOCKETTIMEDOUT') {
return {
retryable: true,
strategy: 'immediate', // Try again quickly
};
}
// Context length exceeded - don't retry same request
if (error.status === 400 && error.message?.includes('context_length')) {
return {
retryable: false,
strategy: 'none',
switchModel: true // Try model with larger context
};
}
// Auth errors - fix the problem, don't retry
if ([401, 403].includes(error.status)) {
return {
retryable: false,
strategy: 'none'
};
}
// Unknown errors - conservative retry
return {
retryable: true,
strategy: 'backoff'
};
}
Exponential backoff with jitter is the gold standard for retry logic: wait times grow exponentially between attempts, and random jitter prevents clients from retrying in lockstep.
interface BackoffConfig {
initialDelayMs: number;
maxDelayMs: number;
maxRetries: number;
multiplier: number;
jitterFactor: number; // 0 to 1, percentage of delay to randomise
}
const defaultConfig: BackoffConfig = {
initialDelayMs: 1000,
maxDelayMs: 60000,
maxRetries: 5,
multiplier: 2,
jitterFactor: 0.3
};
function calculateDelay(
attempt: number,
config: BackoffConfig = defaultConfig
): number {
// Base delay: initial * multiplier^attempt
const baseDelay = Math.min(
config.initialDelayMs * Math.pow(config.multiplier, attempt),
config.maxDelayMs
);
// Add jitter: ±jitterFactor of base delay
const jitter = baseDelay * config.jitterFactor * (Math.random() * 2 - 1);
return Math.max(0, Math.floor(baseDelay + jitter));
}
// Small helper used by the retry functions below
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function withRetry<T>(
operation: () => Promise<T>,
config: BackoffConfig = defaultConfig
): Promise<T> {
let lastError: Error;
for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
try {
return await operation();
} catch (error) {
lastError = error as Error;
const classification = classifyError(error);
if (!classification.retryable) {
throw error; // Don't retry non-retryable errors
}
if (attempt === config.maxRetries) {
break; // No more retries
}
// Calculate delay
const delay = classification.retryAfter ??
calculateDelay(attempt, config);
console.log(`Attempt ${attempt + 1} failed, retrying in ${delay}ms`);
await sleep(delay);
}
}
throw lastError!;
}
// Usage
const response = await withRetry(
() => openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Hello' }]
}),
{ ...defaultConfig, maxRetries: 3 }
);
Without jitter, if 100 clients hit a rate limit at the same time, they all retry at the same time, creating another spike. Jitter spreads retries across time.
Without jitter (thundering herd):
Time 0s: [100 requests] → Rate limited
Time 2s: [100 retries] → Rate limited again
Time 4s: [100 retries] → Rate limited again
With jitter (distributed):
Time 0s: [100 requests] → Rate limited
Time 1.5-2.5s: [~30 retries] → Some succeed
Time 3-5s: [~50 retries] → More succeed
Time 5-8s: [~20 retries] → All succeed
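You can see the spread directly by sampling the calculateDelay helper above for a batch of clients:

// Throwaway sketch: sample calculateDelay for a batch of clients to see how
// jitter spreads their retries, versus the single spike you get without it.
function simulateRetrySpread(clients: number, attempt: number): void {
  const noJitter = calculateDelay(attempt, { ...defaultConfig, jitterFactor: 0 });
  const jittered = Array.from({ length: clients }, () => calculateDelay(attempt, defaultConfig));

  console.log(`No jitter: all ${clients} clients retry at ${noJitter}ms`);
  console.log(`With jitter: retries spread between ${Math.min(...jittered)}ms and ${Math.max(...jittered)}ms`);
}

simulateRetrySpread(100, 1); // attempt 1: ~2000ms base delay, spread roughly 1400-2600ms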
When a user is waiting, absolute time matters more than attempt count - budget retries against a deadline instead of a fixed number of attempts.
async function withDeadline<T>(
operation: () => Promise<T>,
deadlineMs: number,
minRetryDelayMs: number = 500
): Promise<T> {
const deadline = Date.now() + deadlineMs;
let attempt = 0;
let lastError: Error;
while (Date.now() < deadline) {
try {
return await operation();
} catch (error) {
lastError = error as Error;
const classification = classifyError(error);
if (!classification.retryable) {
throw error;
}
// Calculate how much time we have left
const remaining = deadline - Date.now();
if (remaining < minRetryDelayMs) {
break; // Not enough time for another attempt
}
// Use shorter delays as deadline approaches
const delay = Math.min(
calculateDelay(attempt),
remaining - minRetryDelayMs
);
await sleep(delay);
attempt++;
}
}
throw new DeadlineExceededError(
`Operation failed after ${attempt} attempts`,
lastError!
);
}
// Usage: Must complete within 10 seconds
const response = await withDeadline(
() => callLLM(prompt),
10000
);
Some operations shouldn't be retried blindly - a retried email send or payment call can create duplicates. Check whether the operation already completed before each new attempt.
interface IdempotentOperation<T> {
execute: () => Promise<T>;
checkIfCompleted: () => Promise<T | null>;
idempotencyKey: string;
}
async function withIdempotentRetry<T>(
operation: IdempotentOperation<T>,
config: BackoffConfig
): Promise<T> {
// First, check if operation already completed
const existing = await operation.checkIfCompleted();
if (existing !== null) {
return existing;
}
return withRetry(async () => {
// Re-check before each attempt (operation might have completed during wait)
const completed = await operation.checkIfCompleted();
if (completed !== null) {
return completed;
}
return operation.execute();
}, config);
}
// Usage: Sending email (don't want duplicates)
await withIdempotentRetry({
execute: () => sendEmail(emailContent),
checkIfCompleted: () => checkEmailSent(emailId),
idempotencyKey: emailId
}, defaultConfig);
When a service is down, continuing to send requests makes things worse. Circuit breakers detect failure patterns and "open" to reject requests immediately.
CLOSED → Requests flow normally
↓ (failure threshold exceeded)
OPEN → Requests rejected immediately
↓ (timeout expires)
HALF-OPEN → Allow one test request
↓ (success) ↓ (failure)
CLOSED OPEN
interface CircuitBreakerConfig {
failureThreshold: number; // Failures before opening
successThreshold: number; // Successes to close from half-open
timeout: number; // Time in open state before half-open
monitoringWindow: number; // Window for counting failures
}
class CircuitBreaker {
private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
private failures: number[] = []; // Timestamps of failures
private successCount = 0;
private openedAt?: number;
constructor(
private name: string,
private config: CircuitBreakerConfig = {
failureThreshold: 5,
successThreshold: 2,
timeout: 30000,
monitoringWindow: 60000
}
) {}
async execute<T>(operation: () => Promise<T>): Promise<T> {
// Check if we should allow this request
if (!this.allowRequest()) {
throw new CircuitOpenError(
`Circuit ${this.name} is open, rejecting request`
);
}
try {
const result = await operation();
this.recordSuccess();
return result;
} catch (error) {
this.recordFailure();
throw error;
}
}
private allowRequest(): boolean {
switch (this.state) {
case 'CLOSED':
return true;
case 'OPEN':
// Check if timeout has passed
if (Date.now() - this.openedAt! > this.config.timeout) {
this.state = 'HALF_OPEN';
this.successCount = 0;
console.log(`Circuit ${this.name} entering half-open state`);
return true;
}
return false;
case 'HALF_OPEN':
return true; // Allow test requests
}
}
private recordSuccess(): void {
if (this.state === 'HALF_OPEN') {
this.successCount++;
if (this.successCount >= this.config.successThreshold) {
this.state = 'CLOSED';
this.failures = [];
console.log(`Circuit ${this.name} closed after recovery`);
}
}
}
private recordFailure(): void {
const now = Date.now();
// Clean old failures outside monitoring window
this.failures = this.failures.filter(
t => now - t < this.config.monitoringWindow
);
this.failures.push(now);
if (this.state === 'HALF_OPEN') {
// Single failure in half-open reopens circuit
this.state = 'OPEN';
this.openedAt = now;
console.log(`Circuit ${this.name} reopened after half-open failure`);
} else if (this.failures.length >= this.config.failureThreshold) {
this.state = 'OPEN';
this.openedAt = now;
console.log(`Circuit ${this.name} opened after ${this.failures.length} failures`);
}
}
getState(): string {
return this.state;
}
}
// Usage
const openaiCircuit = new CircuitBreaker('openai', {
failureThreshold: 3,
successThreshold: 2,
timeout: 60000,
monitoringWindow: 30000
});
async function callOpenAI(prompt: string) {
return openaiCircuit.execute(() =>
openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: prompt }]
})
);
}
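Circuit breakers compose naturally with the withRetry helper from earlier. One reasonable arrangement - retries inside, breaker outside - is sketched below; the opposite nesting is also defensible, so treat this as a design choice rather than the one true wiring.

// Sketch: withRetry absorbs brief blips inside a single breaker call; repeated
// exhausted retries then trip the breaker, which fails fast until recovery.
async function callOpenAIResilient(prompt: string) {
  return openaiCircuit.execute(() =>
    withRetry(
      () => openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }]
      }),
      { ...defaultConfig, maxRetries: 2 }
    )
  );
}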
When retries fail and circuits open, you need alternatives - a pre-planned fallback chain of models ordered by preference and cost.
interface ModelConfig {
provider: 'openai' | 'anthropic' | 'google';
model: string;
maxTokens: number;
costPer1kTokens: number;
}
const modelChain: ModelConfig[] = [
{ provider: 'openai', model: 'gpt-4o', maxTokens: 128000, costPer1kTokens: 0.005 },
{ provider: 'anthropic', model: 'claude-sonnet-4-20250514', maxTokens: 200000, costPer1kTokens: 0.003 },
{ provider: 'openai', model: 'gpt-4o-mini', maxTokens: 128000, costPer1kTokens: 0.00015 },
{ provider: 'google', model: 'gemini-1.5-flash', maxTokens: 1000000, costPer1kTokens: 0.000075 }
];
async function callWithFallback(
messages: Message[],
config: { maxAttempts?: number } = {}
): Promise<LLMResponse> {
const maxAttempts = config.maxAttempts ?? modelChain.length;
for (let i = 0; i < Math.min(maxAttempts, modelChain.length); i++) {
const model = modelChain[i];
try {
const response = await callModel(model, messages);
return response;
    } catch (error: any) {
      console.warn(`Model ${model.model} failed:`, error.message);
      // Don't try other models for errors that switching won't fix (auth, malformed request)
      const classification = classifyError(error);
      if (!classification.retryable && !classification.switchModel) {
        throw error;
      }
      if (i === Math.min(maxAttempts, modelChain.length) - 1) {
        throw new AllModelsFailed('All models in fallback chain failed', error);
      }
    }
}
throw new AllModelsFailed('Exhausted fallback chain');
}
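callWithFallback assumes a callModel helper that dispatches to the right provider SDK and normalises the response. A rough sketch of one way to write it - the anthropic client, the Message type, and the { text, model } response shape are assumptions, and the Google branch is omitted for brevity:

// Hypothetical dispatcher for the fallback chain. Normalises each provider's
// response into a simple { text, model } shape - adapt to your own LLMResponse.
async function callModel(model: ModelConfig, messages: Message[]): Promise<LLMResponse> {
  if (model.provider === 'openai') {
    const res = await openai.chat.completions.create({
      model: model.model,
      messages
    });
    return { text: res.choices[0].message.content ?? '', model: model.model };
  }

  if (model.provider === 'anthropic') {
    const res = await anthropic.messages.create({
      model: model.model,
      max_tokens: 1024,
      messages
    });
    const first = res.content[0];
    return { text: first.type === 'text' ? first.text : '', model: model.model };
  }

  throw new Error(`No client configured for provider: ${model.provider}`);
}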
For common queries, serve cached responses when live calls fail.
import { createHash } from 'crypto';

class CachedFallback {
  private cache: Map<string, { response: string; cachedAt: Date }> = new Map();
  private maxAge = 3600000; // 1 hour

  constructor(
    // The live LLM call to wrap - e.g. a withRetry-wrapped provider call
    private callLLM: (prompt: string) => Promise<string>
  ) {}
async callWithCache(
prompt: string,
options: { allowStale?: boolean } = {}
): Promise<string> {
const cacheKey = this.hashPrompt(prompt);
const cached = this.cache.get(cacheKey);
try {
// Try live call
const response = await this.callLLM(prompt);
// Update cache on success
this.cache.set(cacheKey, {
response,
cachedAt: new Date()
});
return response;
} catch (error) {
// On failure, try cache
if (cached && (options.allowStale || !this.isStale(cached))) {
console.log('Serving cached response due to LLM failure');
return cached.response;
}
throw error;
}
}
private isStale(entry: { cachedAt: Date }): boolean {
return Date.now() - entry.cachedAt.getTime() > this.maxAge;
}
private hashPrompt(prompt: string): string {
    // Simple content hash for the cache key
    return createHash('md5').update(prompt).digest('hex');
}
}
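Usage might look like this - fetchCompletion and userPrompt are placeholders for your own live call and input, with the withRetry helper from earlier handling transient failures before the cache is consulted:

// Sketch: live call first, cached response only when the live call fails
const cached = new CachedFallback(
  (prompt) => withRetry(() => fetchCompletion(prompt))
);

const answer = await cached.callWithCache(userPrompt, { allowStale: true });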
When all else fails, communicate clearly.
async function handleWithGracefulDegradation(
operation: () => Promise<string>,
context: { userId: string; operationType: string }
): Promise<{ result: string; degraded: boolean }> {
try {
const result = await operation();
return { result, degraded: false };
} catch (error) {
// Log for investigation
console.error('Operation failed after all retries:', {
...context,
error: error.message
});
// Return graceful message
const degradedResponse = getDegradedResponse(context.operationType);
return { result: degradedResponse, degraded: true };
}
}
function getDegradedResponse(operationType: string): string {
const responses: Record<string, string> = {
'search': "I'm having trouble searching right now. Please try again in a moment, or rephrase your question.",
'analysis': "I can't perform that analysis at the moment. I've noted your request - would you like to try a simpler question?",
'generation': "Content generation is temporarily unavailable. Please try again shortly.",
'default': "I encountered an issue processing your request. Please try again in a few moments."
};
return responses[operationType] ?? responses.default;
}
Track retry patterns to identify systemic issues before they become incidents.
const retryMetrics = {
recordAttempt(
operation: string,
attempt: number,
success: boolean,
durationMs: number,
error?: string
) {
metrics.histogram('retry.duration_ms', durationMs, { operation, attempt: String(attempt) });
metrics.increment('retry.attempts', { operation, success: String(success) });
if (!success && error) {
metrics.increment('retry.errors', { operation, error_type: error });
}
// Alert on high retry rates
if (attempt > 3) {
metrics.increment('retry.high_attempt_count', { operation });
}
},
recordCircuitState(name: string, state: string) {
metrics.gauge('circuit.state', state === 'OPEN' ? 1 : 0, { circuit: name });
},
recordFallback(operation: string, fallbackLevel: number) {
metrics.increment('fallback.triggered', { operation, level: String(fallbackLevel) });
}
};
How many retries should I use? For user-facing requests, 2-3 retries with aggressive timeouts. For background jobs, up to 5-7 retries with longer delays. Never retry indefinitely.
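In terms of the BackoffConfig defined earlier, that might translate into two presets (numbers are illustrative, not prescriptive):

// Illustrative presets for the BackoffConfig defined earlier
const userFacingRetry: BackoffConfig = {
  initialDelayMs: 500,
  maxDelayMs: 5000,    // keep per-attempt waits short while a user is watching
  maxRetries: 2,
  multiplier: 2,
  jitterFactor: 0.3
};

const backgroundJobRetry: BackoffConfig = {
  initialDelayMs: 2000,
  maxDelayMs: 120000,  // background work can afford long waits
  maxRetries: 6,
  multiplier: 2,
  jitterFactor: 0.3
};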
Should I retry 500 errors? Yes, but with exponential backoff. 500 errors usually indicate temporary issues (overload, deployment). However, if you see consistent 500s for the same request, investigate rather than retrying forever.
How do I handle partial failures? For streaming responses, checkpoint progress and resume. For batch operations, track which items succeeded and retry only the failures.
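For the batch case, one approach is to keep a results map keyed by item id and re-run only the missing entries between passes. A minimal sketch, reusing the withRetry helper from earlier:

// Sketch: between passes, retry only the items that haven't succeeded yet.
async function processBatch<I, O>(
  items: { id: string; input: I }[],
  process: (input: I) => Promise<O>,
  passes: number = 3
): Promise<Map<string, O>> {
  const results = new Map<string, O>();

  for (let pass = 0; pass < passes && results.size < items.length; pass++) {
    const pending = items.filter(item => !results.has(item.id));
    for (const item of pending) {
      try {
        results.set(item.id, await withRetry(() => process(item.input)));
      } catch {
        // Leave the item for the next pass (or a dead-letter queue after the last)
      }
    }
  }

  return results;
}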
Should connection and request timeouts be handled the same way? Treat them differently. Connection timeouts suggest network issues - retry immediately. Request timeouts might mean slow processing - wait longer before retrying.
Should fallback models also get retry logic? Generally yes, but consider reducing retries for fallbacks since you're already in a degraded state and users are waiting.
Retry strategies are the difference between resilient agents and fragile ones. Classify errors appropriately, use exponential backoff with jitter, implement circuit breakers for cascade protection, and always have fallbacks ready.
Implementation checklist:
- Classify errors before retrying: 429s, 5xx, and timeouts are retryable; 400/401/403 are not.
- Use exponential backoff with jitter and respect Retry-After headers.
- Enforce deadlines on user-facing requests rather than fixed attempt counts.
- Add a circuit breaker per provider to stop retry storms during outages.
- Pre-plan and test fallbacks: alternative models, cached responses, graceful degradation messages.
- Instrument retries, circuit state, and fallback usage so systemic issues surface before they become incidents.