AI Agent Rate Limiting: Implementing Token Budgets and Usage Quotas
Build rate limiting systems that prevent runaway costs, ensure fair usage, and handle provider limits gracefully - covering token budgets, user quotas, and adaptive throttling.

TL;DR
It's 3am. An agent enters an infinite loop, calling GPT-4 with maximum context every 2 seconds. By morning, you've burned £8,000 in API costs - your entire monthly budget gone in 6 hours. No alerts fired because you only monitored request counts, not token usage.
Rate limiting for AI agents isn't just about fairness or DDoS protection - it's existential cost control. LLM API costs scale with usage, and usage can explode without warning when agents behave unexpectedly.
This guide covers building multi-layer rate limiting that prevents cost disasters while maintaining service quality for legitimate use.
Key takeaways
- Token-based limits matter more than request-based limits. One 100K token request costs 100x more than one 1K token request.
- Implement limits at request, user, org, and global levels. Each catches different failure modes.
- Pre-flight estimation lets you reject expensive requests before incurring costs.
- Graceful degradation (smaller models, shorter outputs) is better than hard rejections.
AI cost structure differs fundamentally from traditional SaaS. A database query costs fractions of a penny. An LLM call can cost pounds.
| Scenario | Traditional API | LLM API |
|---|---|---|
| Single request | £0.0001 | £0.01-£0.50 |
| Runaway loop (1000 req/min) | £6/hour | £600-£30,000/hour |
| Single bad actor | Annoying | Bankrupting |
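To put numbers on the runaway row, here is a back-of-the-envelope helper (hypothetical; the per-request prices are the table's illustrative ranges, not real quotes):

```typescript
// Hypothetical helper: projected hourly spend for a runaway loop,
// given a per-request cost in pounds and the loop's request rate.
function runawayCostPerHour(costPerRequest: number, requestsPerMinute: number): number {
  return costPerRequest * requestsPerMinute * 60;
}

// Traditional API at 1,000 req/min: about £6/hour
const traditional = runawayCostPerHour(0.0001, 1000);

// LLM API at the same rate: £600 to £30,000/hour
const llmLow = runawayCostPerHour(0.01, 1000);
const llmHigh = runawayCostPerHour(0.50, 1000);
```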
Pattern 1: Infinite retry loops
Agent hits an error, retries with the same massive context, fails again. Each retry costs £0.15. At 100 retries per minute, a loop that runs 4 hours before detection burns £3,600.
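The cheapest defence against this pattern is a bounded retry helper that caps both attempt count and cumulative spend. A sketch (names and defaults are illustrative, not a prescribed API):

```typescript
// Sketch: bounded retries with a cumulative spend cap, so a failing
// call can never loop for hours. Defaults are illustrative policy values.
async function retryWithBudget<T>(
  call: () => Promise<T>,
  costPerAttempt: number,  // estimated £ per attempt
  maxAttempts = 3,
  maxSpend = 1.0           // hard £ cap across all attempts
): Promise<T> {
  let spent = 0;
  let lastError: unknown;
  for (let i = 0; i < maxAttempts && spent + costPerAttempt <= maxSpend; i++) {
    spent += costPerAttempt;
    try {
      return await call();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError ?? new Error("retry budget exhausted");
}
```

With £0.15 attempts and a £0.50 cap, the loop stops after three tries instead of running all night.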
Pattern 2: Context accumulation
Conversation grows without trimming. By turn 50, each message includes 80K tokens of history. User asks 20 questions that afternoon = £60 for one session.
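A sliding-window trim keeps per-turn cost roughly flat. This sketch assumes the first message is the system prompt and uses a naive word count where a real system would use the model's tokenizer:

```typescript
interface ChatMessage { role: string; content: string }

// Naive token estimate (~1 token per word); swap in a real tokenizer.
const countTokens = (text: string) => text.split(/\s+/).length;

// Keep the system prompt plus the most recent turns that fit the budget,
// so turn 50 costs roughly the same as turn 5.
function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const [system, ...rest] = messages;
  const kept: ChatMessage[] = [];
  let used = countTokens(system.content);
  for (let i = rest.length - 1; i >= 0; i--) {
    const t = countTokens(rest[i].content);
    if (used + t > maxTokens) break;
    used += t;
    kept.unshift(rest[i]);
  }
  return [system, ...kept];
}
```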
Pattern 3: Abuse by bad actors
Free tier user discovers they can trigger expensive operations. Scripts 1000 requests with complex prompts = £500 in usage you can't recover.
Effective rate limiting operates at multiple levels, each catching different failure modes.
Request-level limits prevent a single request from being unreasonably expensive.
```typescript
interface RequestLimits {
  maxInputTokens: number;
  maxOutputTokens: number;
  maxToolCalls: number;
  maxExecutionTime: number;
}

const defaultRequestLimits: RequestLimits = {
  maxInputTokens: 32000,
  maxOutputTokens: 4000,
  maxToolCalls: 10,
  maxExecutionTime: 60000 // 60 seconds
};
```
User-level limits prevent individual users from racking up excessive usage.
```typescript
interface UserLimits {
  tokensPerMinute: number;
  tokensPerHour: number;
  tokensPerDay: number;
  requestsPerMinute: number;
  costPerDay: number;
}

const userLimitsByTier: Record<string, UserLimits> = {
  free: {
    tokensPerMinute: 10000,
    tokensPerHour: 100000,
    tokensPerDay: 500000,
    requestsPerMinute: 10,
    costPerDay: 0.50
  },
  pro: {
    tokensPerMinute: 50000,
    tokensPerHour: 500000,
    tokensPerDay: 5000000,
    requestsPerMinute: 60,
    costPerDay: 10
  },
  enterprise: {
    tokensPerMinute: 200000,
    tokensPerHour: 2000000,
    tokensPerDay: 20000000,
    requestsPerMinute: 200,
    costPerDay: 100
  }
};
```
Organisation-level limits prevent one org from consuming a disproportionate share of resources.
```typescript
interface OrgLimits {
  tokensPerDay: number;
  costPerMonth: number;
  concurrentRequests: number;
}

// Based on subscription tier
const orgLimits = {
  starter: {
    tokensPerDay: 1000000,
    costPerMonth: 50,
    concurrentRequests: 5
  },
  growth: {
    tokensPerDay: 10000000,
    costPerMonth: 500,
    concurrentRequests: 20
  },
  enterprise: {
    tokensPerDay: 100000000,
    costPerMonth: 5000,
    concurrentRequests: 100
  }
};
```
Global limits protect your overall budget regardless of individual allowances.
```typescript
interface GlobalLimits {
  totalCostPerHour: number;
  totalCostPerDay: number;
  totalTokensPerMinute: number;
  emergencyShutoffCost: number;
}

const globalLimits: GlobalLimits = {
  totalCostPerHour: 500,
  totalCostPerDay: 3000,
  totalTokensPerMinute: 5000000,
  emergencyShutoffCost: 10000 // Auto-pause all if exceeded
};
```
Track per-user usage in Redis sorted sets so sliding windows stay cheap to query:

```typescript
import Redis from "ioredis";

interface UsageRecord {
  userId: string;
  orgId: string;
  tokens: number;
  cost: number;
  timestamp: Date;
}

class TokenBudgetManager {
  constructor(private redis: Redis) {}

  // Sliding-window token counting across minute/hour/day windows
  async checkUserLimit(
    userId: string,
    estimatedTokens: number,
    tier: string
  ): Promise<{ allowed: boolean; remaining: number; resetIn: number }> {
    const limits = (userLimitsByTier as Record<string, UserLimits>)[tier];

    // Read usage for each window in parallel
    const [perMinute, perHour, perDay] = await Promise.all([
      this.getWindowUsage(userId, 60000),
      this.getWindowUsage(userId, 3600000),
      this.getWindowUsage(userId, 86400000)
    ]);

    // Check against limits, shortest window first
    if (perMinute + estimatedTokens > limits.tokensPerMinute) {
      return {
        allowed: false,
        remaining: limits.tokensPerMinute - perMinute,
        resetIn: this.getResetTime(userId, 60000)
      };
    }
    if (perHour + estimatedTokens > limits.tokensPerHour) {
      return {
        allowed: false,
        remaining: limits.tokensPerHour - perHour,
        resetIn: this.getResetTime(userId, 3600000)
      };
    }
    if (perDay + estimatedTokens > limits.tokensPerDay) {
      return {
        allowed: false,
        remaining: limits.tokensPerDay - perDay,
        resetIn: this.getResetTime(userId, 86400000)
      };
    }

    return {
      allowed: true,
      remaining: Math.min(
        limits.tokensPerMinute - perMinute,
        limits.tokensPerHour - perHour,
        limits.tokensPerDay - perDay
      ),
      resetIn: 0
    };
  }

  async recordUsage(record: UsageRecord): Promise<void> {
    const key = `usage:${record.userId}`;
    const timestamp = record.timestamp.getTime();

    // Add to a sorted set with the timestamp as score
    await this.redis.zadd(key, timestamp, JSON.stringify({
      tokens: record.tokens,
      cost: record.cost,
      timestamp
    }));

    // Trim entries older than the largest window (24 hours)
    const cutoff = Date.now() - 86400000;
    await this.redis.zremrangebyscore(key, 0, cutoff);
  }

  private async getWindowUsage(userId: string, windowMs: number): Promise<number> {
    const key = `usage:${userId}`;
    const cutoff = Date.now() - windowMs;
    const entries = await this.redis.zrangebyscore(key, cutoff, '+inf');
    return entries.reduce((sum, entry) => sum + JSON.parse(entry).tokens, 0);
  }

  private getResetTime(userId: string, windowMs: number): number {
    // Conservative placeholder: report the full window length. A real
    // implementation would return the seconds until the oldest entry
    // in the window expires.
    return Math.ceil(windowMs / 1000);
  }
}
```
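The middleware later in this guide calls a `reserve` step before execution. Here is an in-memory sketch of that idea; a production version would use an atomic Redis `INCRBY` with a TTL so concurrent requests can't both pass on the same remaining budget:

```typescript
// In-memory sketch of budget reservation. Names are illustrative;
// swap the Map for an atomic Redis counter in production.
class ReservationLedger {
  private reserved = new Map<string, number>();

  // Returns true and holds the tokens if the reservation fits the limit.
  reserve(userId: string, tokens: number, limit: number): boolean {
    const current = this.reserved.get(userId) ?? 0;
    if (current + tokens > limit) return false;
    this.reserved.set(userId, current + tokens);
    return true;
  }

  // Release after execution; actual usage is recorded separately.
  release(userId: string, tokens: number): void {
    const current = this.reserved.get(userId) ?? 0;
    this.reserved.set(userId, Math.max(0, current - tokens));
  }
}
```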
Estimate costs before making expensive calls:
```typescript
interface CostEstimate {
  inputTokens: number;
  estimatedOutputTokens: number;
  totalTokens: number;
  estimatedCost: number;
  wouldExceedLimit: boolean;
  limitType?: string;
}

class CostEstimator {
  constructor(
    private tokenizer: Tokenizer, // e.g. a tiktoken-style encoder
    private budgetManager: TokenBudgetManager
  ) {}

  async estimate(
    messages: Message[],
    model: string,
    context: { userId: string; orgId: string; tier: string }
  ): Promise<CostEstimate> {
    // Count input tokens
    const inputTokens = this.countTokens(messages);

    // Estimate output (use a historical average or model-specific figure)
    const avgOutputRatio = 0.5; // Output typically ~50% of input
    const estimatedOutputTokens = Math.min(
      Math.ceil(inputTokens * avgOutputRatio),
      4000 // Cap at max output
    );
    const totalTokens = inputTokens + estimatedOutputTokens;

    // Calculate cost from a per-model pricing table (per 1K tokens)
    const pricing = MODEL_PRICING[model];
    const estimatedCost =
      (inputTokens / 1000) * pricing.input +
      (estimatedOutputTokens / 1000) * pricing.output;

    // Check against user and org limits. checkOrgLimit (not shown)
    // follows the same pattern as checkUserLimit, against org budgets.
    const userCheck = await this.budgetManager.checkUserLimit(
      context.userId,
      totalTokens,
      context.tier
    );
    const orgCheck = await this.checkOrgLimit(context.orgId, estimatedCost);

    const wouldExceedLimit = !userCheck.allowed || !orgCheck.allowed;
    const limitType = !userCheck.allowed ? 'user' : !orgCheck.allowed ? 'org' : undefined;

    return {
      inputTokens,
      estimatedOutputTokens,
      totalTokens,
      estimatedCost,
      wouldExceedLimit,
      limitType
    };
  }

  private countTokens(messages: Message[]): number {
    // Note: ignores per-message formatting overhead; add a small
    // constant per message if you need exact counts
    return messages.reduce(
      (sum, msg) => sum + this.tokenizer.encode(msg.content).length,
      0
    );
  }
}
```
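The estimator above assumes a `MODEL_PRICING` table. A minimal illustrative version follows; the prices are placeholders, not current provider rates, and should come from your provider's pricing page:

```typescript
// Illustrative per-1K-token prices in pounds. Placeholders only:
// real prices vary by provider and change over time.
const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.002, output: 0.008 },
  'gpt-4o-mini': { input: 0.0001, output: 0.0005 }
};

// The same token count maps to very different spend per model, which
// is why cost limits complement pure token limits.
function costOf(model: string, inputTokens: number, outputTokens: number): number {
  const p = MODEL_PRICING[model];
  return (inputTokens / 1000) * p.input + (outputTokens / 1000) * p.output;
}
```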
Tying it together, a middleware that runs before every agent call:

```typescript
class RateLimitMiddleware {
  constructor(
    private budgetManager: TokenBudgetManager,
    private estimator: CostEstimator
  ) {}

  async process(
    request: AgentRequest,
    context: ExecutionContext
  ): Promise<void> {
    // Step 1: Estimate cost
    const estimate = await this.estimator.estimate(
      request.messages,
      request.model,
      context
    );

    // Step 2: Reject if the request would exceed user or org limits
    if (estimate.wouldExceedLimit) {
      throw new RateLimitError({
        type: estimate.limitType,
        estimatedTokens: estimate.totalTokens,
        estimatedCost: estimate.estimatedCost,
        message: this.getLimitMessage(estimate.limitType, context.tier)
      });
    }

    // Step 3: Check request-level limits
    if (estimate.inputTokens > defaultRequestLimits.maxInputTokens) {
      throw new RequestTooLargeError({
        inputTokens: estimate.inputTokens,
        maxAllowed: defaultRequestLimits.maxInputTokens
      });
    }

    // Step 4: Reserve budget so concurrent requests can't all pass
    // against the same remaining allowance (prevents race conditions)
    await this.budgetManager.reserve(context.userId, estimate.totalTokens);

    // Step 5: Execute (in calling code)
    // Step 6: Record actual usage after execution:
    // await this.budgetManager.recordUsage({ ... });
  }

  private getLimitMessage(limitType: string | undefined, tier: string): string {
    const messages: Record<string, string> = {
      user: `You've reached your ${tier} tier usage limit. Please wait or upgrade your plan.`,
      org: `Your organisation has reached its usage limit. Please contact your admin.`,
      global: `Our service is experiencing high demand. Please try again shortly.`
    };
    return (limitType && messages[limitType]) || 'Usage limit reached.';
  }
}
```
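To close the loop on steps 5 and 6, the calling code can wrap execution so actual usage is recorded and the reservation is always released. A sketch, where every name is a placeholder for your own wiring:

```typescript
interface Usage { tokens: number; cost: number }

// Sketch of the execute-and-settle flow: reserve the estimate, run the
// call, record what was actually used, then free the reservation.
async function withUsageAccounting<T>(
  estimateTokens: number,
  reserve: (tokens: number) => boolean,
  run: () => Promise<{ result: T; usage: Usage }>,
  record: (usage: Usage) => void,
  release: (tokens: number) => void
): Promise<T> {
  if (!reserve(estimateTokens)) {
    throw new Error("rate limit: reservation denied");
  }
  try {
    const { result, usage } = await run();
    record(usage); // settle with actual usage, not the estimate
    return result;
  } finally {
    release(estimateTokens); // always free the reservation
  }
}
```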
When approaching provider rate limits, slow down proactively:
```typescript
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

class AdaptiveThrottler {
  private currentDelay = 0;
  private recentErrors: number[] = [];
  private maxDelay = 30000;

  async throttle(): Promise<void> {
    if (this.currentDelay > 0) {
      await sleep(this.currentDelay);
    }
  }

  recordSuccess(): void {
    // Decrease delay on success
    this.currentDelay = Math.max(0, this.currentDelay - 100);
    // Drop errors older than a minute
    const cutoff = Date.now() - 60000;
    this.recentErrors = this.recentErrors.filter(t => t > cutoff);
  }

  recordRateLimit(retryAfter?: number): void {
    this.recentErrors.push(Date.now());
    if (retryAfter) {
      // The provider told us how long to wait (seconds)
      this.currentDelay = retryAfter * 1000;
    } else {
      // Exponential backoff based on recent error frequency
      const errorCount = this.recentErrors.length;
      this.currentDelay = Math.min(
        this.maxDelay,
        Math.pow(2, errorCount) * 100
      );
    }
  }

  getStatus(): { delay: number; recentErrors: number } {
    return {
      delay: this.currentDelay,
      recentErrors: this.recentErrors.length
    };
  }
}
```
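Wrapping a provider call with the throttler might look like this. A sketch: the `Throttler` interface mirrors the class above, `call` is a placeholder, and the 429 handling assumes your client surfaces the provider's Retry-After value:

```typescript
interface Throttler {
  throttle(): Promise<void>;
  recordSuccess(): void;
  recordRateLimit(retryAfter?: number): void;
}

// Wait out any current backoff, make the call, and feed the outcome
// back into the throttler so the delay adapts.
async function callWithThrottle<T>(
  throttler: Throttler,
  call: () => Promise<T>
): Promise<T> {
  await throttler.throttle();
  try {
    const result = await call();
    throttler.recordSuccess();
    return result;
  } catch (err: any) {
    if (err?.status === 429) {
      // Providers typically send Retry-After (seconds) on 429 responses
      throttler.recordRateLimit(err.retryAfter);
    }
    throw err;
  }
}
```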
When limits are approached, degrade gracefully instead of hard failing.
```typescript
async function executeWithDegradation(
  request: AgentRequest,
  context: ExecutionContext
): Promise<AgentResponse> {
  // Try progressively cheaper models before giving up
  // (estimator and execute are assumed to be in scope)
  const modelChain = ['gpt-4o', 'gpt-4o-mini', 'gpt-3.5-turbo'];

  for (const model of modelChain) {
    const estimate = await estimator.estimate(request.messages, model, context);
    if (!estimate.wouldExceedLimit) {
      return execute({ ...request, model });
    }
  }

  // Every model would exceed the budget - return a degraded response
  return {
    content: "I'm currently limited in how I can help. Please try a simpler question or wait a few minutes.",
    degraded: true
  };
}
```
When budget is nearly exhausted, cap output length instead of rejecting outright:

```typescript
async function executeWithOutputLimit(
  request: AgentRequest,
  remainingBudget: number
): Promise<AgentResponse> {
  // Spend whatever budget is left on output, but always allow at
  // least a minimum useful response
  const inputTokens = countTokens(request.messages);
  const safeOutputTokens = Math.max(
    100, // Minimum useful response
    remainingBudget - inputTokens
  );

  return execute({
    ...request,
    maxTokens: safeOutputTokens
  });
}
```
When limits bite, queue requests instead of dropping them:

```typescript
class RequestQueue {
  private queue: QueuedRequest[] = [];
  private processing = false;

  async enqueue(request: AgentRequest): Promise<AgentResponse> {
    return new Promise((resolve, reject) => {
      this.queue.push({ request, resolve, reject });
      this.processQueue();
    });
  }

  private async processQueue(): Promise<void> {
    if (this.processing) return;
    this.processing = true;

    while (this.queue.length > 0) {
      // Wait until budget frees up before taking the next item
      // (checkBudget, execute, and sleep are assumed to be in scope)
      const budgetAvailable = await this.checkBudget();
      if (!budgetAvailable) {
        await sleep(5000);
        continue;
      }

      const item = this.queue.shift()!;
      try {
        const result = await execute(item.request);
        item.resolve(result);
      } catch (error) {
        item.reject(error);
      }
    }

    this.processing = false;
  }
}
```
Alert before limits are hit, not after:

```typescript
interface AlertConfig {
  userId?: string;
  orgId?: string;
  thresholdPercent: number; // Alert at X% of the limit
  channel: 'email' | 'slack' | 'webhook';
  destination: string;
}

const defaultAlerts: AlertConfig[] = [
  // User approaching their limit
  { thresholdPercent: 80, channel: 'email', destination: '{{user.email}}' },
  // Org approaching its limit
  { orgId: '*', thresholdPercent: 90, channel: 'slack', destination: '#billing-alerts' },
  // Global emergency
  { thresholdPercent: 95, channel: 'webhook', destination: 'https://api.internal/emergency' }
];

async function checkAndAlert(usage: UsageStats): Promise<void> {
  // getLimit and sendAlert are assumed to be in scope
  for (const alert of defaultAlerts) {
    const limit = getLimit(alert);
    const percent = (usage.current / limit) * 100;

    if (percent >= alert.thresholdPercent) {
      await sendAlert(alert, {
        currentUsage: usage.current,
        limit,
        percent: percent.toFixed(1)
      });
    }
  }
}
```
As a last resort, pause everything when global spend crosses the emergency threshold:

```typescript
class EmergencyShutoff {
  private active = false;

  async check(globalUsage: number): Promise<void> {
    if (!this.active && globalUsage >= globalLimits.emergencyShutoffCost) {
      this.activate();
    }
  }

  activate(): void {
    this.active = true;

    // Notify all channels
    sendAlert('emergency', {
      message: 'Emergency shutoff activated - all AI requests paused',
      timestamp: new Date()
    });

    // Log for investigation
    console.error('EMERGENCY SHUTOFF ACTIVATED');
  }

  deactivate(): void {
    this.active = false;
    console.log('Emergency shutoff deactivated');
  }

  isActive(): boolean {
    return this.active;
  }
}
```
Should we limit by tokens or by cost?
Both. Tokens for immediate throttling, cost for billing alignment. A token limit blocks oversized requests; a cost limit accounts for pricing differences between models.
What happens when legitimate users hit their limits?
Offer higher-tier plans with increased limits. Monitor usage patterns to identify power users for outreach, and consider custom enterprise plans.
How do we stay inside the providers' own rate limits?
Track the rate limit headers OpenAI and Anthropic return and adjust accordingly. Implement adaptive throttling that backs off as you approach provider limits.
How granular should limits be?
Start with user and org levels. Add per-agent or per-feature limits if specific areas drive costs; anything more granular becomes hard to manage.
How big should the free tier be?
Enough for meaningful trial use without enabling abuse. We use £0.50/day (roughly 50 GPT-4o requests, or 500 with GPT-4o-mini). Adjust based on your cost tolerance.
Rate limiting for AI agents is fundamentally about cost control. Multi-level limits (request, user, org, global) catch different failure modes. Pre-flight estimation prevents expensive mistakes. Graceful degradation maintains service quality.
Implementation checklist:
- Track token usage and cost, not just request counts
- Enforce request-level caps (input tokens, output tokens, tool calls, execution time)
- Define per-user and per-org limits by subscription tier
- Set a global emergency shutoff threshold
- Estimate cost pre-flight and reserve budget before executing
- Record actual usage after every call and alert at 80/90/95% of limits
Quick wins:
- Cap maxOutputTokens on every request
- Alert on daily spend, not only on errors
- Fall back to a cheaper model before rejecting a request outright