Academy · 2 Dec 2025 · 14 min read

AI Agent Rate Limiting: Implementing Token Budgets and Usage Quotas

Build rate limiting systems that prevent runaway costs, ensure fair usage, and handle provider limits gracefully - covering token budgets, user quotas, and adaptive throttling.

Max Beech
Head of Content

TL;DR

  • Rate limiting for AI is primarily about cost control. A single runaway agent can burn through your monthly budget in hours.
  • Implement limits at multiple levels: per-request, per-user, per-org, and global.
  • Use token sliding windows, not just request counts. 10 small requests ≠ 1 massive request.
  • Pre-flight estimation catches expensive requests before they execute.

Jump to: Why rate limiting matters · Multi-level limiting · Implementation · Graceful degradation


It's 3am. An agent enters an infinite loop, calling GPT-4 with maximum context every 2 seconds. By morning, you've burned £8,000 in API costs - your entire monthly budget gone in 6 hours. No alerts fired because you only monitored request counts, not token usage.

Rate limiting for AI agents isn't just about fairness or DDoS protection - it's existential cost control. LLM API costs scale with usage, and usage can explode without warning when agents behave unexpectedly.

This guide covers building multi-layer rate limiting that prevents cost disasters while maintaining service quality for legitimate use.

Key takeaways

  • Token-based limits matter more than request-based limits. One 100K token request costs 100x more than one 1K token request.
  • Implement limits at request, user, org, and global levels. Each catches different failure modes.
  • Pre-flight estimation lets you reject expensive requests before incurring costs.
  • Graceful degradation (smaller models, shorter outputs) is better than hard rejections.

Why rate limiting matters

AI cost structure differs fundamentally from traditional SaaS. A database query costs fractions of a penny. An LLM call can cost pounds.

The cost explosion problem

| Scenario | Traditional API | LLM API |
| --- | --- | --- |
| Single request | £0.0001 | £0.01-£0.50 |
| Runaway loop (1,000 req/min) | £6/hour | £600-£30,000/hour |
| Single bad actor | Annoying | Bankrupting |

Real incident patterns

Pattern 1: Infinite retry loops

The agent hits an error, retries with the same massive context, and fails again. At £0.15 per retry and roughly 100 retries a minute, a loop that runs for 4 hours before detection burns £3,600.
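A cheap structural defence against this pattern is a retry guard that caps both the attempt count and the cumulative spend of a retry loop. This is a minimal sketch under assumed limits - the class name and the pence-denominated values are illustrative, not from any specific library:

```typescript
// Hypothetical guard: caps both retry count and cumulative retry spend.
// Costs are tracked in integer pence to avoid floating-point drift.
interface RetryGuardOptions {
  maxAttempts: number;
  maxSpendPence: number;  // cumulative cost ceiling for the whole loop
}

class RetryGuard {
  private attempts = 0;
  private spentPence = 0;

  constructor(private options: RetryGuardOptions) {}

  // Returns true if another attempt is allowed, and records its cost.
  tryAttempt(costPence: number): boolean {
    if (this.attempts >= this.options.maxAttempts) return false;
    if (this.spentPence + costPence > this.options.maxSpendPence) return false;
    this.attempts += 1;
    this.spentPence += costPence;
    return true;
  }
}
```

Wrapping every retry loop in a guard like this turns a four-hour incident into a handful of attempts and a bounded bill.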

Pattern 2: Context accumulation

The conversation grows without trimming. By turn 50, each message carries 80K tokens of history; a user asking 20 questions that afternoon runs up £60 for a single session.
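The fix is trimming history to a token budget before every call, keeping the system prompt and the most recent turns. A minimal sketch, using a rough 4-characters-per-token heuristic as a stand-in for a real tokenizer:

```typescript
interface Msg { role: string; content: string; }

// Rough heuristic: ~4 characters per token. Swap in a real tokenizer
// (e.g. tiktoken) for production accuracy.
function estimateTokens(msg: Msg): number {
  return Math.ceil(msg.content.length / 4);
}

// Keep the system prompt plus the newest turns that fit under maxTokens.
function trimHistory(messages: Msg[], maxTokens: number): Msg[] {
  const [system, ...rest] = messages;
  let budget = maxTokens - estimateTokens(system);
  const kept: Msg[] = [];
  // Walk backwards so the newest turns survive
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i]);
    if (cost > budget) break;
    kept.unshift(rest[i]);
    budget -= cost;
  }
  return [system, ...kept];
}
```

With trimming in place, per-message cost plateaus instead of growing with every turn.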

Pattern 3: Abuse by bad actors

A free tier user discovers they can trigger expensive operations and scripts 1,000 requests with complex prompts: £500 in usage you can't recover.


Multi-level rate limiting

Effective rate limiting operates at multiple levels, each catching different failure modes.

Level 1: Per-request limits

Prevent single requests from being unreasonably expensive.

interface RequestLimits {
  maxInputTokens: number;
  maxOutputTokens: number;
  maxToolCalls: number;
  maxExecutionTime: number;
}

const defaultRequestLimits: RequestLimits = {
  maxInputTokens: 32000,
  maxOutputTokens: 4000,
  maxToolCalls: 10,
  maxExecutionTime: 60000  // 60 seconds
};

Level 2: Per-user limits

Prevent individual users from excessive usage.

interface UserLimits {
  tokensPerMinute: number;
  tokensPerHour: number;
  tokensPerDay: number;
  requestsPerMinute: number;
  costPerDay: number;
}

const userLimitsByTier = {
  free: {
    tokensPerMinute: 10000,
    tokensPerHour: 100000,
    tokensPerDay: 500000,
    requestsPerMinute: 10,
    costPerDay: 0.50
  },
  pro: {
    tokensPerMinute: 50000,
    tokensPerHour: 500000,
    tokensPerDay: 5000000,
    requestsPerMinute: 60,
    costPerDay: 10
  },
  enterprise: {
    tokensPerMinute: 200000,
    tokensPerHour: 2000000,
    tokensPerDay: 20000000,
    requestsPerMinute: 200,
    costPerDay: 100
  }
};

Level 3: Per-organisation limits

Prevent one org from consuming disproportionate resources.

interface OrgLimits {
  tokensPerDay: number;
  costPerMonth: number;
  concurrentRequests: number;
}

// Based on subscription tier
const orgLimits = {
  starter: {
    tokensPerDay: 1000000,
    costPerMonth: 50,
    concurrentRequests: 5
  },
  growth: {
    tokensPerDay: 10000000,
    costPerMonth: 500,
    concurrentRequests: 20
  },
  enterprise: {
    tokensPerDay: 100000000,
    costPerMonth: 5000,
    concurrentRequests: 100
  }
};

Level 4: Global limits

Protect your overall budget regardless of individual limits.

interface GlobalLimits {
  totalCostPerHour: number;
  totalCostPerDay: number;
  totalTokensPerMinute: number;
  emergencyShutoffCost: number;
}

const globalLimits: GlobalLimits = {
  totalCostPerHour: 500,
  totalCostPerDay: 3000,
  totalTokensPerMinute: 5000000,
  emergencyShutoffCost: 10000  // Auto-pause all if exceeded
};

Implementation guide

Token budget manager

interface UsageRecord {
  userId: string;
  orgId: string;
  tokens: number;
  cost: number;
  timestamp: Date;
}

class TokenBudgetManager {
  private redis: Redis;

  // Sliding window token counting
  async checkUserLimit(
    userId: string,
    estimatedTokens: number,
    tier: string
  ): Promise<{ allowed: boolean; remaining: number; resetIn: number }> {
    const limits = userLimitsByTier[tier];
    const now = Date.now();

    // Check multiple windows
    const [perMinute, perHour, perDay] = await Promise.all([
      this.getWindowUsage(userId, 60000),
      this.getWindowUsage(userId, 3600000),
      this.getWindowUsage(userId, 86400000)
    ]);

    // Check against limits
    if (perMinute + estimatedTokens > limits.tokensPerMinute) {
      return {
        allowed: false,
        remaining: limits.tokensPerMinute - perMinute,
        resetIn: this.getResetTime(userId, 60000)
      };
    }

    if (perHour + estimatedTokens > limits.tokensPerHour) {
      return {
        allowed: false,
        remaining: limits.tokensPerHour - perHour,
        resetIn: this.getResetTime(userId, 3600000)
      };
    }

    if (perDay + estimatedTokens > limits.tokensPerDay) {
      return {
        allowed: false,
        remaining: limits.tokensPerDay - perDay,
        resetIn: this.getResetTime(userId, 86400000)
      };
    }

    return {
      allowed: true,
      remaining: Math.min(
        limits.tokensPerMinute - perMinute,
        limits.tokensPerHour - perHour,
        limits.tokensPerDay - perDay
      ),
      resetIn: 0
    };
  }

  async recordUsage(record: UsageRecord): Promise<void> {
    const key = `usage:${record.userId}`;
    const timestamp = record.timestamp.getTime();

    // Add to sorted set with timestamp as score
    await this.redis.zadd(key, timestamp, JSON.stringify({
      tokens: record.tokens,
      cost: record.cost,
      timestamp
    }));

    // Trim old entries (keep last 24 hours)
    const cutoff = Date.now() - 86400000;
    await this.redis.zremrangebyscore(key, 0, cutoff);
  }

  private async getWindowUsage(userId: string, windowMs: number): Promise<number> {
    const key = `usage:${userId}`;
    const cutoff = Date.now() - windowMs;

    const entries = await this.redis.zrangebyscore(key, cutoff, '+inf');

    return entries.reduce((sum, entry) => {
      const data = JSON.parse(entry);
      return sum + data.tokens;
    }, 0);
  }

  private getResetTime(userId: string, windowMs: number): number {
    // Placeholder: returns the full window length in seconds.
    // A real implementation would read the oldest entry's timestamp in the
    // window from the sorted set and return when it ages out.
    return Math.ceil(windowMs / 1000);
  }
}
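The Redis sorted set above does one job: sum the tokens of entries newer than a cutoff. Here is the same sliding-window logic in memory, purely for illustration - production should keep the Redis version so counts survive restarts and are shared across instances:

```typescript
// In-memory sliding window: same semantics as the sorted-set approach.
class InMemoryWindow {
  private entries: { timestamp: number; tokens: number }[] = [];

  record(tokens: number, now: number): void {
    this.entries.push({ timestamp: now, tokens });
  }

  // Sum tokens inside the window, trimming expired entries as we go
  // (the equivalent of ZREMRANGEBYSCORE + summing ZRANGEBYSCORE results).
  usage(windowMs: number, now: number): number {
    const cutoff = now - windowMs;
    this.entries = this.entries.filter(e => e.timestamp > cutoff);
    return this.entries.reduce((sum, e) => sum + e.tokens, 0);
  }
}
```

The same instance answers per-minute, per-hour, and per-day queries - only the window length changes.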

Pre-flight cost estimation

Estimate costs before making expensive calls:

interface CostEstimate {
  inputTokens: number;
  estimatedOutputTokens: number;
  totalTokens: number;
  estimatedCost: number;
  wouldExceedLimit: boolean;
  limitType?: string;
}

class CostEstimator {
  private tokenizer: Tokenizer;

  async estimate(
    messages: Message[],
    model: string,
    context: { userId: string; orgId: string; tier: string }
  ): Promise<CostEstimate> {
    // Count input tokens
    const inputTokens = this.countTokens(messages);

    // Estimate output (use historical average or model-specific estimate)
    const avgOutputRatio = 0.5;  // Output typically ~50% of input
    const estimatedOutputTokens = Math.min(
      Math.ceil(inputTokens * avgOutputRatio),
      4000  // Cap at max output
    );

    const totalTokens = inputTokens + estimatedOutputTokens;

    // Calculate cost
    const pricing = MODEL_PRICING[model];
    const estimatedCost =
      (inputTokens / 1000) * pricing.input +
      (estimatedOutputTokens / 1000) * pricing.output;

    // Check against limits
    const budgetManager = new TokenBudgetManager();

    const userCheck = await budgetManager.checkUserLimit(
      context.userId,
      totalTokens,
      context.tier
    );

    const orgCheck = await this.checkOrgLimit(context.orgId, estimatedCost);

    const wouldExceedLimit = !userCheck.allowed || !orgCheck.allowed;
    const limitType = !userCheck.allowed ? 'user' : !orgCheck.allowed ? 'org' : undefined;

    return {
      inputTokens,
      estimatedOutputTokens,
      totalTokens,
      estimatedCost,
      wouldExceedLimit,
      limitType
    };
  }

  private countTokens(messages: Message[]): number {
    return messages.reduce((sum, msg) => {
      return sum + this.tokenizer.encode(msg.content).length;
    }, 0);
  }
}

Rate limiter middleware

class RateLimitMiddleware {
  private budgetManager: TokenBudgetManager;
  private estimator: CostEstimator;

  async process(
    request: AgentRequest,
    context: ExecutionContext
  ): Promise<void> {
    // Step 1: Estimate cost
    const estimate = await this.estimator.estimate(
      request.messages,
      request.model,
      context
    );

    // Step 2: Check if would exceed limits
    if (estimate.wouldExceedLimit) {
      throw new RateLimitError({
        type: estimate.limitType,
        estimatedTokens: estimate.totalTokens,
        estimatedCost: estimate.estimatedCost,
        message: this.getLimitMessage(estimate.limitType, context.tier)
      });
    }

    // Step 3: Check request-level limits
    if (estimate.inputTokens > defaultRequestLimits.maxInputTokens) {
      throw new RequestTooLargeError({
        inputTokens: estimate.inputTokens,
        maxAllowed: defaultRequestLimits.maxInputTokens
      });
    }

    // Step 4: Reserve budget (to prevent race conditions)
    await this.budgetManager.reserve(context.userId, estimate.totalTokens);

    // Step 5: Execute (in calling code)
    // ...

    // Step 6: Record actual usage (after execution)
    // await this.recordUsage(actualTokens, actualCost);
  }

  private getLimitMessage(limitType: string, tier: string): string {
    const messages = {
      user: `You've reached your ${tier} tier usage limit. Please wait or upgrade your plan.`,
      org: `Your organisation has reached its usage limit. Please contact your admin.`,
      global: `Our service is experiencing high demand. Please try again shortly.`
    };
    return messages[limitType] || 'Usage limit reached.';
  }
}
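Step 4 above calls a `reserve` method without showing it. The idea is a reserve/commit cycle so two concurrent requests can't both pass the limit check and then both execute. A hedged in-memory sketch - production needs a Redis script or transaction to get the same atomicity across instances:

```typescript
// Reserve the estimated tokens up front; reconcile to actual usage after
// execution; release entirely if the call fails before spending anything.
class BudgetReserver {
  private reserved = 0;

  constructor(private limit: number) {}

  reserve(tokens: number): boolean {
    if (this.reserved + tokens > this.limit) return false;
    this.reserved += tokens;
    return true;
  }

  // After execution: replace the estimate with what was actually used.
  commit(estimated: number, actual: number): void {
    this.reserved += actual - estimated;
  }

  release(tokens: number): void {
    this.reserved -= tokens;
  }
}
```

Without the reservation, two requests arriving in the same millisecond each see the old usage total, each pass the check, and together blow through the limit.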

Adaptive throttling

When approaching provider rate limits, slow down proactively:

class AdaptiveThrottler {
  private currentDelay = 0;
  private recentErrors: number[] = [];
  private maxDelay = 30000;

  async throttle(): Promise<void> {
    if (this.currentDelay > 0) {
      await sleep(this.currentDelay);
    }
  }

  recordSuccess(): void {
    // Decrease delay on success
    this.currentDelay = Math.max(0, this.currentDelay - 100);

    // Clean old errors
    const cutoff = Date.now() - 60000;
    this.recentErrors = this.recentErrors.filter(t => t > cutoff);
  }

  recordRateLimit(retryAfter?: number): void {
    this.recentErrors.push(Date.now());

    if (retryAfter) {
      this.currentDelay = retryAfter * 1000;
    } else {
      // Exponential backoff based on error frequency
      const errorCount = this.recentErrors.length;
      this.currentDelay = Math.min(
        this.maxDelay,
        Math.pow(2, errorCount) * 100
      );
    }
  }

  getStatus(): { delay: number; recentErrors: number } {
    return {
      delay: this.currentDelay,
      recentErrors: this.recentErrors.length
    };
  }
}
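The backoff arithmetic in `recordRateLimit` is worth seeing on its own: the delay doubles with each rate-limit error seen in the last minute, capped at the maximum.

```typescript
// Same formula as AdaptiveThrottler.recordRateLimit: 2^errors * 100ms, capped.
function backoffDelay(errorCount: number, maxDelay = 30000): number {
  return Math.min(maxDelay, Math.pow(2, errorCount) * 100);
}
```

One recent error gives a 200ms delay, five give 3.2s, and by ten the 30s cap kicks in - gentle at first, aggressive under sustained pressure.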

Graceful degradation

When limits are approached, degrade gracefully instead of hard failing.

Strategy 1: Model downgrade

async function executeWithDegradation(
  request: AgentRequest,
  context: ExecutionContext
): Promise<AgentResponse> {
  const modelChain = ['gpt-4o', 'gpt-4o-mini', 'gpt-3.5-turbo'];

  for (const model of modelChain) {
    const estimate = await estimator.estimate(request.messages, model, context);

    if (!estimate.wouldExceedLimit) {
      // Use this model
      return execute({ ...request, model });
    }
  }

  // All models would exceed - show degraded response
  return {
    content: "I'm currently limited in how I can help. Please try a simpler question or wait a few minutes.",
    degraded: true
  };
}

Strategy 2: Output limiting

async function executeWithOutputLimit(
  request: AgentRequest,
  remainingBudget: number
): Promise<AgentResponse> {
  // Calculate safe output tokens
  const inputTokens = countTokens(request.messages);
  const safeOutputTokens = Math.max(
    100,  // Floor: always allow a minimum useful response, even if the
          // budget is nearly exhausted
    remainingBudget - inputTokens
  );

  return execute({
    ...request,
    maxTokens: safeOutputTokens
  });
}

Strategy 3: Queue and batch

class RequestQueue {
  private queue: QueuedRequest[] = [];
  private processing = false;

  async enqueue(request: AgentRequest): Promise<AgentResponse> {
    return new Promise((resolve, reject) => {
      this.queue.push({ request, resolve, reject });
      this.processQueue();
    });
  }

  private async processQueue(): Promise<void> {
    if (this.processing) return;
    this.processing = true;

    while (this.queue.length > 0) {
      // Check if we have budget
      const budgetAvailable = await this.checkBudget();

      if (!budgetAvailable) {
        // Wait before trying again
        await sleep(5000);
        continue;
      }

      const item = this.queue.shift();
      try {
        const result = await execute(item.request);
        item.resolve(result);
      } catch (error) {
        item.reject(error);
      }
    }

    this.processing = false;
  }
}

Cost alerts and monitoring

Alert configuration

interface AlertConfig {
  userId?: string;
  orgId?: string;
  thresholdPercent: number;  // Alert at X% of limit
  channel: 'email' | 'slack' | 'webhook';
  destination: string;
}

const defaultAlerts: AlertConfig[] = [
  // User approaching limit
  { thresholdPercent: 80, channel: 'email', destination: '{{user.email}}' },

  // Org approaching limit
  { orgId: '*', thresholdPercent: 90, channel: 'slack', destination: '#billing-alerts' },

  // Global emergency
  { thresholdPercent: 95, channel: 'webhook', destination: 'https://api.internal/emergency' }
];

async function checkAndAlert(usage: UsageStats): Promise<void> {
  for (const alert of defaultAlerts) {
    const limit = getLimit(alert);
    const percent = (usage.current / limit) * 100;

    if (percent >= alert.thresholdPercent) {
      await sendAlert(alert, {
        currentUsage: usage.current,
        limit,
        percent: percent.toFixed(1)
      });
    }
  }
}

Emergency shutoff

class EmergencyShutoff {
  private active = false;

  async check(globalUsage: number): Promise<void> {
    if (globalUsage >= globalLimits.emergencyShutoffCost) {
      this.activate();
    }
  }

  activate(): void {
    this.active = true;

    // Notify all channels
    sendAlert('emergency', {
      message: 'Emergency shutoff activated - all AI requests paused',
      timestamp: new Date()
    });

    // Log for investigation
    console.error('EMERGENCY SHUTOFF ACTIVATED');
  }

  deactivate(): void {
    this.active = false;
    console.log('Emergency shutoff deactivated');
  }

  isActive(): boolean {
    return this.active;
  }
}

FAQs

Should I limit by tokens or by cost?

Both. Tokens for immediate throttling, cost for billing alignment. A token limit prevents large requests; a cost limit accounts for model pricing differences.
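Combining both checks is a one-function gate: a request must clear the token window and the cost ceiling. A minimal sketch, reusing the free-tier numbers from the table above:

```typescript
// A request passes only if it fits BOTH the token window and the cost cap.
function allowRequest(
  tokensUsedThisMinute: number,
  costToday: number,
  estTokens: number,
  estCost: number,
  limits: { tokensPerMinute: number; costPerDay: number }
): boolean {
  return (
    tokensUsedThisMinute + estTokens <= limits.tokensPerMinute &&
    costToday + estCost <= limits.costPerDay
  );
}
```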

How do I handle legitimate high-usage users?

Offer higher tier plans with increased limits. Monitor usage patterns to identify power users for outreach. Consider custom enterprise plans.

What about provider rate limits?

Track your OpenAI/Anthropic rate limit headers and adjust accordingly. Implement adaptive throttling that backs off when approaching provider limits.
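In practice that means reading the rate-limit headers off each response. The header names below match OpenAI's documented `x-ratelimit-*` response headers; treat them as an assumption and check your provider's docs (Anthropic uses a different `anthropic-ratelimit-*` scheme):

```typescript
interface ProviderLimitStatus {
  remainingTokens: number;
  remainingRequests: number;
}

// Parse the remaining-quota headers from a provider response.
function parseRateLimitHeaders(headers: Record<string, string>): ProviderLimitStatus {
  return {
    remainingTokens: parseInt(headers['x-ratelimit-remaining-tokens'] ?? '0', 10),
    remainingRequests: parseInt(headers['x-ratelimit-remaining-requests'] ?? '0', 10)
  };
}

// Slow down proactively once below 10% of the advertised token limit,
// rather than waiting for a 429.
function shouldThrottle(status: ProviderLimitStatus, limitTokens: number): boolean {
  return status.remainingTokens < limitTokens * 0.1;
}
```

Feed `shouldThrottle` into the AdaptiveThrottler above so the delay grows before the provider starts rejecting you.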

How granular should limits be?

Start with user and org level. Add per-agent or per-feature limits if you notice specific areas driving costs. Too granular becomes hard to manage.

What's a fair free tier limit?

Enough for meaningful trial use without enabling abuse. We use £0.50/day (roughly 50 GPT-4o requests or 500 GPT-4o-mini). Adjust based on your cost tolerance.

Summary and next steps

Rate limiting for AI agents is fundamentally about cost control. Multi-level limits (request, user, org, global) catch different failure modes. Pre-flight estimation prevents expensive mistakes. Graceful degradation maintains service quality.

Implementation checklist:

  1. Implement token counting and cost estimation
  2. Build sliding window usage tracking
  3. Add per-user and per-org limits based on tiers
  4. Create pre-flight checks before expensive calls
  5. Implement graceful degradation strategies
  6. Set up cost alerts and emergency shutoff

Quick wins:

  • Add basic request-level limits (max tokens, max tools)
  • Track usage per user even before enforcing limits
  • Set up daily cost alerts
