Academy · 2 Dec 2025 · 14 min read

AI Agent Rate Limiting: Implementing Token Budgets and Usage Quotas

Build rate limiting systems that prevent runaway costs, ensure fair usage, and handle provider limits gracefully - covering token budgets, user quotas, and adaptive throttling.

Max Beech
Head of Content

TL;DR

  • Rate limiting for AI is primarily about cost control. A single runaway agent can burn through your monthly budget in hours.
  • Implement limits at multiple levels: per-request, per-user, per-org, and global.
  • Use token sliding windows, not just request counts. 10 small requests ≠ 1 massive request.
  • Pre-flight estimation catches expensive requests before they execute.

Jump to: Why rate limiting matters · Multi-level limiting · Implementation · Graceful degradation


It's 3am. An agent enters an infinite loop, calling GPT-4 with maximum context every 2 seconds. By morning, you've burned £8,000 in API costs - your entire monthly budget gone in 6 hours. No alerts fired because you only monitored request counts, not token usage.

Rate limiting for AI agents isn't just about fairness or DDoS protection - it's existential cost control. LLM API costs scale with usage, and usage can explode without warning when agents behave unexpectedly.

This guide covers building multi-layer rate limiting that prevents cost disasters while maintaining service quality for legitimate use.

Key takeaways

  • Token-based limits matter more than request-based limits. One 100K token request costs 100x more than one 1K token request.
  • Implement limits at request, user, org, and global levels. Each catches different failure modes.
  • Pre-flight estimation lets you reject expensive requests before incurring costs.
  • Graceful degradation (smaller models, shorter outputs) is better than hard rejections.

Why rate limiting matters

AI cost structure differs fundamentally from traditional SaaS. A database query costs fractions of a penny. An LLM call can cost pounds.

The cost explosion problem

| Scenario | Traditional API | LLM API |
| --- | --- | --- |
| Single request | £0.0001 | £0.01-£0.50 |
| Runaway loop (1,000 req/min) | £6/hour | £600-£30,000/hour |
| Single bad actor | Annoying | Bankrupting |

Real incident patterns

Pattern 1: Infinite retry loops

The agent hits an error, retries with the same massive context, and fails again. At £0.15 per retry and roughly 100 retries a minute, a loop that runs for 4 hours before detection burns £3,600.
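A cheap structural defence against this pattern is a retry guard that caps both the attempt count and the cumulative spend of a retry loop. This is a minimal sketch under assumed limits - the class name and the pence-denominated values are illustrative, not from any specific library:

```typescript
// Hypothetical guard: caps both retry count and cumulative retry spend.
// Costs are tracked in integer pence to avoid floating-point drift.
interface RetryGuardOptions {
  maxAttempts: number;
  maxSpendPence: number;  // cumulative cost ceiling for the whole loop
}

class RetryGuard {
  private attempts = 0;
  private spentPence = 0;

  constructor(private options: RetryGuardOptions) {}

  // Returns true if another attempt is allowed, and records its cost.
  tryAttempt(costPence: number): boolean {
    if (this.attempts >= this.options.maxAttempts) return false;
    if (this.spentPence + costPence > this.options.maxSpendPence) return false;
    this.attempts += 1;
    this.spentPence += costPence;
    return true;
  }
}
```

Wrapping every retry loop in a guard like this turns a four-hour incident into a handful of attempts and a bounded bill.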

Pattern 2: Context accumulation

The conversation grows without trimming. By turn 50, each message carries 80K tokens of history; a user asking 20 questions that afternoon runs up £60 for a single session.
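The fix is trimming history to a token budget before every call, keeping the system prompt and the most recent turns. A minimal sketch, using a rough 4-characters-per-token heuristic as a stand-in for a real tokenizer:

```typescript
interface Msg { role: string; content: string; }

// Rough heuristic: ~4 characters per token. Swap in a real tokenizer
// (e.g. tiktoken) for production accuracy.
function estimateTokens(msg: Msg): number {
  return Math.ceil(msg.content.length / 4);
}

// Keep the system prompt plus the newest turns that fit under maxTokens.
function trimHistory(messages: Msg[], maxTokens: number): Msg[] {
  const [system, ...rest] = messages;
  let budget = maxTokens - estimateTokens(system);
  const kept: Msg[] = [];
  // Walk backwards so the newest turns survive
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i]);
    if (cost > budget) break;
    kept.unshift(rest[i]);
    budget -= cost;
  }
  return [system, ...kept];
}
```

With trimming in place, per-message cost plateaus instead of growing with every turn.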

Pattern 3: Abuse by bad actors

A free tier user discovers they can trigger expensive operations and scripts 1,000 requests with complex prompts: £500 in usage you can't recover.


Multi-level rate limiting

Effective rate limiting operates at multiple levels, each catching different failure modes.

Level 1: Per-request limits

Prevent single requests from being unreasonably expensive.

interface RequestLimits {
  maxInputTokens: number;
  maxOutputTokens: number;
  maxToolCalls: number;
  maxExecutionTime: number;
}

const defaultRequestLimits: RequestLimits = {
  maxInputTokens: 32000,
  maxOutputTokens: 4000,
  maxToolCalls: 10,
  maxExecutionTime: 60000  // 60 seconds
};

Level 2: Per-user limits

Prevent individual users from excessive usage.

interface UserLimits {
  tokensPerMinute: number;
  tokensPerHour: number;
  tokensPerDay: number;
  requestsPerMinute: number;
  costPerDay: number;
}

const userLimitsByTier = {
  free: {
    tokensPerMinute: 10000,
    tokensPerHour: 100000,
    tokensPerDay: 500000,
    requestsPerMinute: 10,
    costPerDay: 0.50
  },
  pro: {
    tokensPerMinute: 50000,
    tokensPerHour: 500000,
    tokensPerDay: 5000000,
    requestsPerMinute: 60,
    costPerDay: 10
  },
  enterprise: {
    tokensPerMinute: 200000,
    tokensPerHour: 2000000,
    tokensPerDay: 20000000,
    requestsPerMinute: 200,
    costPerDay: 100
  }
};

Level 3: Per-organisation limits

Prevent one org from consuming disproportionate resources.

interface OrgLimits {
  tokensPerDay: number;
  costPerMonth: number;
  concurrentRequests: number;
}

// Based on subscription tier
const orgLimits = {
  starter: {
    tokensPerDay: 1000000,
    costPerMonth: 50,
    concurrentRequests: 5
  },
  growth: {
    tokensPerDay: 10000000,
    costPerMonth: 500,
    concurrentRequests: 20
  },
  enterprise: {
    tokensPerDay: 100000000,
    costPerMonth: 5000,
    concurrentRequests: 100
  }
};

Level 4: Global limits

Protect your overall budget regardless of individual limits.

interface GlobalLimits {
  totalCostPerHour: number;
  totalCostPerDay: number;
  totalTokensPerMinute: number;
  emergencyShutoffCost: number;
}

const globalLimits: GlobalLimits = {
  totalCostPerHour: 500,
  totalCostPerDay: 3000,
  totalTokensPerMinute: 5000000,
  emergencyShutoffCost: 10000  // Auto-pause all if exceeded
};

Implementation guide

Token budget manager

interface UsageRecord {
  userId: string;
  orgId: string;
  tokens: number;
  cost: number;
  timestamp: Date;
}

class TokenBudgetManager {
  private redis: Redis;

  // Sliding window token counting
  async checkUserLimit(
    userId: string,
    estimatedTokens: number,
    tier: string
  ): Promise<{ allowed: boolean; remaining: number; resetIn: number }> {
    const limits = userLimitsByTier[tier];
    const now = Date.now();

    // Check multiple windows
    const [perMinute, perHour, perDay] = await Promise.all([
      this.getWindowUsage(userId, 60000),
      this.getWindowUsage(userId, 3600000),
      this.getWindowUsage(userId, 86400000)
    ]);

    // Check against limits
    if (perMinute + estimatedTokens > limits.tokensPerMinute) {
      return {
        allowed: false,
        remaining: limits.tokensPerMinute - perMinute,
        resetIn: this.getResetTime(userId, 60000)
      };
    }

    if (perHour + estimatedTokens > limits.tokensPerHour) {
      return {
        allowed: false,
        remaining: limits.tokensPerHour - perHour,
        resetIn: this.getResetTime(userId, 3600000)
      };
    }

    if (perDay + estimatedTokens > limits.tokensPerDay) {
      return {
        allowed: false,
        remaining: limits.tokensPerDay - perDay,
        resetIn: this.getResetTime(userId, 86400000)
      };
    }

    return {
      allowed: true,
      remaining: Math.min(
        limits.tokensPerMinute - perMinute,
        limits.tokensPerHour - perHour,
        limits.tokensPerDay - perDay
      ),
      resetIn: 0
    };
  }

  async recordUsage(record: UsageRecord): Promise<void> {
    const key = `usage:${record.userId}`;
    const timestamp = record.timestamp.getTime();

    // Add to sorted set with timestamp as score
    await this.redis.zadd(key, timestamp, JSON.stringify({
      tokens: record.tokens,
      cost: record.cost,
      timestamp
    }));

    // Trim old entries (keep last 24 hours)
    const cutoff = Date.now() - 86400000;
    await this.redis.zremrangebyscore(key, 0, cutoff);
  }

  private async getWindowUsage(userId: string, windowMs: number): Promise<number> {
    const key = `usage:${userId}`;
    const cutoff = Date.now() - windowMs;

    const entries = await this.redis.zrangebyscore(key, cutoff, '+inf');

    return entries.reduce((sum, entry) => {
      const data = JSON.parse(entry);
      return sum + data.tokens;
    }, 0);
  }

  private getResetTime(userId: string, windowMs: number): number {
    // Placeholder: returns the full window length in seconds.
    // A real implementation would read the oldest entry's timestamp in the
    // window from the sorted set and return when it ages out.
    return Math.ceil(windowMs / 1000);
  }
}
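The Redis sorted set above does one job: sum the tokens of entries newer than a cutoff. Here is the same sliding-window logic in memory, purely for illustration - production should keep the Redis version so counts survive restarts and are shared across instances:

```typescript
// In-memory sliding window: same semantics as the sorted-set approach.
class InMemoryWindow {
  private entries: { timestamp: number; tokens: number }[] = [];

  record(tokens: number, now: number): void {
    this.entries.push({ timestamp: now, tokens });
  }

  // Sum tokens inside the window, trimming expired entries as we go
  // (the equivalent of ZREMRANGEBYSCORE + summing ZRANGEBYSCORE results).
  usage(windowMs: number, now: number): number {
    const cutoff = now - windowMs;
    this.entries = this.entries.filter(e => e.timestamp > cutoff);
    return this.entries.reduce((sum, e) => sum + e.tokens, 0);
  }
}
```

The same instance answers per-minute, per-hour, and per-day queries - only the window length changes.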

Pre-flight cost estimation

Estimate costs before making expensive calls:

interface CostEstimate {
  inputTokens: number;
  estimatedOutputTokens: number;
  totalTokens: number;
  estimatedCost: number;
  wouldExceedLimit: boolean;
  limitType?: string;
}

class CostEstimator {
  private tokenizer: Tokenizer;

  async estimate(
    messages: Message[],
    model: string,
    context: { userId: string; orgId: string; tier: string }
  ): Promise<CostEstimate> {
    // Count input tokens
    const inputTokens = this.countTokens(messages);

    // Estimate output (use historical average or model-specific estimate)
    const avgOutputRatio = 0.5;  // Output typically ~50% of input
    const estimatedOutputTokens = Math.min(
      Math.ceil(inputTokens * avgOutputRatio),
      4000  // Cap at max output
    );

    const totalTokens = inputTokens + estimatedOutputTokens;

    // Calculate cost
    const pricing = MODEL_PRICING[model];
    const estimatedCost =
      (inputTokens / 1000) * pricing.input +
      (estimatedOutputTokens / 1000) * pricing.output;

    // Check against limits
    const budgetManager = new TokenBudgetManager();

    const userCheck = await budgetManager.checkUserLimit(
      context.userId,
      totalTokens,
      context.tier
    );

    const orgCheck = await this.checkOrgLimit(context.orgId, estimatedCost);

    const wouldExceedLimit = !userCheck.allowed || !orgCheck.allowed;
    const limitType = !userCheck.allowed ? 'user' : !orgCheck.allowed ? 'org' : undefined;

    return {
      inputTokens,
      estimatedOutputTokens,
      totalTokens,
      estimatedCost,
      wouldExceedLimit,
      limitType
    };
  }

  private countTokens(messages: Message[]): number {
    return messages.reduce((sum, msg) => {
      return sum + this.tokenizer.encode(msg.content).length;
    }, 0);
  }
}

Rate limiter middleware

class RateLimitMiddleware {
  private budgetManager: TokenBudgetManager;
  private estimator: CostEstimator;

  async process(
    request: AgentRequest,
    context: ExecutionContext
  ): Promise<void> {
    // Step 1: Estimate cost
    const estimate = await this.estimator.estimate(
      request.messages,
      request.model,
      context
    );

    // Step 2: Check if would exceed limits
    if (estimate.wouldExceedLimit) {
      throw new RateLimitError({
        type: estimate.limitType,
        estimatedTokens: estimate.totalTokens,
        estimatedCost: estimate.estimatedCost,
        message: this.getLimitMessage(estimate.limitType, context.tier)
      });
    }

    // Step 3: Check request-level limits
    if (estimate.inputTokens > defaultRequestLimits.maxInputTokens) {
      throw new RequestTooLargeError({
        inputTokens: estimate.inputTokens,
        maxAllowed: defaultRequestLimits.maxInputTokens
      });
    }

    // Step 4: Reserve budget (to prevent race conditions)
    await this.budgetManager.reserve(context.userId, estimate.totalTokens);

    // Step 5: Execute (in calling code)
    // ...

    // Step 6: Record actual usage (after execution)
    // await this.recordUsage(actualTokens, actualCost);
  }

  private getLimitMessage(limitType: string, tier: string): string {
    const messages = {
      user: `You've reached your ${tier} tier usage limit. Please wait or upgrade your plan.`,
      org: `Your organisation has reached its usage limit. Please contact your admin.`,
      global: `Our service is experiencing high demand. Please try again shortly.`
    };
    return messages[limitType] || 'Usage limit reached.';
  }
}
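Step 4 above calls a `reserve` method without showing it. The idea is a reserve/commit cycle so two concurrent requests can't both pass the limit check and then both execute. A hedged in-memory sketch - production needs a Redis script or transaction to get the same atomicity across instances:

```typescript
// Reserve the estimated tokens up front; reconcile to actual usage after
// execution; release entirely if the call fails before spending anything.
class BudgetReserver {
  private reserved = 0;

  constructor(private limit: number) {}

  reserve(tokens: number): boolean {
    if (this.reserved + tokens > this.limit) return false;
    this.reserved += tokens;
    return true;
  }

  // After execution: replace the estimate with what was actually used.
  commit(estimated: number, actual: number): void {
    this.reserved += actual - estimated;
  }

  release(tokens: number): void {
    this.reserved -= tokens;
  }
}
```

Without the reservation, two requests arriving in the same millisecond each see the old usage total, each pass the check, and together blow through the limit.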

Adaptive throttling

When approaching provider rate limits, slow down proactively:

class AdaptiveThrottler {
  private currentDelay = 0;
  private recentErrors: number[] = [];
  private maxDelay = 30000;

  async throttle(): Promise<void> {
    if (this.currentDelay > 0) {
      await sleep(this.currentDelay);
    }
  }

  recordSuccess(): void {
    // Decrease delay on success
    this.currentDelay = Math.max(0, this.currentDelay - 100);

    // Clean old errors
    const cutoff = Date.now() - 60000;
    this.recentErrors = this.recentErrors.filter(t => t > cutoff);
  }

  recordRateLimit(retryAfter?: number): void {
    this.recentErrors.push(Date.now());

    if (retryAfter) {
      this.currentDelay = retryAfter * 1000;
    } else {
      // Exponential backoff based on error frequency
      const errorCount = this.recentErrors.length;
      this.currentDelay = Math.min(
        this.maxDelay,
        Math.pow(2, errorCount) * 100
      );
    }
  }

  getStatus(): { delay: number; recentErrors: number } {
    return {
      delay: this.currentDelay,
      recentErrors: this.recentErrors.length
    };
  }
}
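The backoff arithmetic in `recordRateLimit` is worth seeing on its own: the delay doubles with each rate-limit error seen in the last minute, capped at the maximum.

```typescript
// Same formula as AdaptiveThrottler.recordRateLimit: 2^errors * 100ms, capped.
function backoffDelay(errorCount: number, maxDelay = 30000): number {
  return Math.min(maxDelay, Math.pow(2, errorCount) * 100);
}
```

One recent error gives a 200ms delay, five give 3.2s, and by ten the 30s cap kicks in - gentle at first, aggressive under sustained pressure.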

Graceful degradation

When limits are approached, degrade gracefully instead of hard failing.

Strategy 1: Model downgrade

async function executeWithDegradation(
  request: AgentRequest,
  context: ExecutionContext
): Promise<AgentResponse> {
  const modelChain = ['gpt-4o', 'gpt-4o-mini', 'gpt-3.5-turbo'];

  for (const model of modelChain) {
    const estimate = await estimator.estimate(request.messages, model, context);

    if (!estimate.wouldExceedLimit) {
      // Use this model
      return execute({ ...request, model });
    }
  }

  // All models would exceed - show degraded response
  return {
    content: "I'm currently limited in how I can help. Please try a simpler question or wait a few minutes.",
    degraded: true
  };
}

Strategy 2: Output limiting

async function executeWithOutputLimit(
  request: AgentRequest,
  remainingBudget: number
): Promise<AgentResponse> {
  // Calculate safe output tokens
  const inputTokens = countTokens(request.messages);
  const safeOutputTokens = Math.max(
    100,  // Floor: always allow a minimum useful response, even if the
          // budget is nearly exhausted
    remainingBudget - inputTokens
  );

  return execute({
    ...request,
    maxTokens: safeOutputTokens
  });
}

Strategy 3: Queue and batch

class RequestQueue {
  private queue: QueuedRequest[] = [];
  private processing = false;

  async enqueue(request: AgentRequest): Promise<AgentResponse> {
    return new Promise((resolve, reject) => {
      this.queue.push({ request, resolve, reject });
      this.processQueue();
    });
  }

  private async processQueue(): Promise<void> {
    if (this.processing) return;
    this.processing = true;

    while (this.queue.length > 0) {
      // Check if we have budget
      const budgetAvailable = await this.checkBudget();

      if (!budgetAvailable) {
        // Wait before trying again
        await sleep(5000);
        continue;
      }

      const item = this.queue.shift();
      try {
        const result = await execute(item.request);
        item.resolve(result);
      } catch (error) {
        item.reject(error);
      }
    }

    this.processing = false;
  }
}

Cost alerts and monitoring

Alert configuration

interface AlertConfig {
  userId?: string;
  orgId?: string;
  thresholdPercent: number;  // Alert at X% of limit
  channel: 'email' | 'slack' | 'webhook';
  destination: string;
}

const defaultAlerts: AlertConfig[] = [
  // User approaching limit
  { thresholdPercent: 80, channel: 'email', destination: '{{user.email}}' },

  // Org approaching limit
  { orgId: '*', thresholdPercent: 90, channel: 'slack', destination: '#billing-alerts' },

  // Global emergency
  { thresholdPercent: 95, channel: 'webhook', destination: 'https://api.internal/emergency' }
];

async function checkAndAlert(usage: UsageStats): Promise<void> {
  for (const alert of defaultAlerts) {
    const limit = getLimit(alert);
    const percent = (usage.current / limit) * 100;

    if (percent >= alert.thresholdPercent) {
      await sendAlert(alert, {
        currentUsage: usage.current,
        limit,
        percent: percent.toFixed(1)
      });
    }
  }
}

Emergency shutoff

class EmergencyShutoff {
  private active = false;

  async check(globalUsage: number): Promise<void> {
    if (globalUsage >= globalLimits.emergencyShutoffCost) {
      this.activate();
    }
  }

  activate(): void {
    this.active = true;

    // Notify all channels
    sendAlert('emergency', {
      message: 'Emergency shutoff activated - all AI requests paused',
      timestamp: new Date()
    });

    // Log for investigation
    console.error('EMERGENCY SHUTOFF ACTIVATED');
  }

  deactivate(): void {
    this.active = false;
    console.log('Emergency shutoff deactivated');
  }

  isActive(): boolean {
    return this.active;
  }
}

FAQs

Should I limit by tokens or by cost?

Both. Tokens for immediate throttling, cost for billing alignment. A token limit prevents large requests; a cost limit accounts for model pricing differences.
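Combining both checks is a one-function gate: a request must clear the token window and the cost ceiling. A minimal sketch, reusing the free-tier numbers from the table above:

```typescript
// A request passes only if it fits BOTH the token window and the cost cap.
function allowRequest(
  tokensUsedThisMinute: number,
  costToday: number,
  estTokens: number,
  estCost: number,
  limits: { tokensPerMinute: number; costPerDay: number }
): boolean {
  return (
    tokensUsedThisMinute + estTokens <= limits.tokensPerMinute &&
    costToday + estCost <= limits.costPerDay
  );
}
```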

How do I handle legitimate high-usage users?

Offer higher tier plans with increased limits. Monitor usage patterns to identify power users for outreach. Consider custom enterprise plans.

What about provider rate limits?

Track your OpenAI/Anthropic rate limit headers and adjust accordingly. Implement adaptive throttling that backs off when approaching provider limits.
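In practice that means reading the rate-limit headers off each response. The header names below match OpenAI's documented `x-ratelimit-*` response headers; treat them as an assumption and check your provider's docs (Anthropic uses a different `anthropic-ratelimit-*` scheme):

```typescript
interface ProviderLimitStatus {
  remainingTokens: number;
  remainingRequests: number;
}

// Parse the remaining-quota headers from a provider response.
function parseRateLimitHeaders(headers: Record<string, string>): ProviderLimitStatus {
  return {
    remainingTokens: parseInt(headers['x-ratelimit-remaining-tokens'] ?? '0', 10),
    remainingRequests: parseInt(headers['x-ratelimit-remaining-requests'] ?? '0', 10)
  };
}

// Slow down proactively once below 10% of the advertised token limit,
// rather than waiting for a 429.
function shouldThrottle(status: ProviderLimitStatus, limitTokens: number): boolean {
  return status.remainingTokens < limitTokens * 0.1;
}
```

Feed `shouldThrottle` into the AdaptiveThrottler above so the delay grows before the provider starts rejecting you.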

How granular should limits be?

Start with user and org level. Add per-agent or per-feature limits if you notice specific areas driving costs. Too granular becomes hard to manage.

What's a fair free tier limit?

Enough for meaningful trial use without enabling abuse. We use £0.50/day (roughly 50 GPT-4o requests or 500 GPT-4o-mini). Adjust based on your cost tolerance.

Summary and next steps

Rate limiting for AI agents is fundamentally about cost control. Multi-level limits (request, user, org, global) catch different failure modes. Pre-flight estimation prevents expensive mistakes. Graceful degradation maintains service quality.

Implementation checklist:

  1. Implement token counting and cost estimation
  2. Build sliding window usage tracking
  3. Add per-user and per-org limits based on tiers
  4. Create pre-flight checks before expensive calls
  5. Implement graceful degradation strategies
  6. Set up cost alerts and emergency shutoff

Quick wins:

  • Add basic request-level limits (max tokens, max tools)
  • Track usage per user even before enforcing limits
  • Set up daily cost alerts
