How to Set Rate Limits on an AI Agent
Rate limiting a web server is a solved problem. Rate limiting an AI agent is messier — because agents don’t make predictable, uniform requests. They make bursts of calls when reasoning, then silence, then another burst. Standard rate limiting patterns don’t map cleanly.
Here’s what actually works.
The Three Layers of Agent Rate Limiting
Rate limits for AI agents come from three places, and you need to think about all three:
Layer 1: Provider limits — enforced by OpenAI, Anthropic, or whichever LLM provider you’re using. These apply to your account as a whole. Hitting them means 429 errors for all your agents simultaneously.
Layer 2: Application limits — controls you build into your agent code. Max iterations, cool-down periods, circuit breakers. These apply per agent and per task.
Layer 3: Infrastructure budget limits — spending ceilings that stop the agent when its allocated budget runs out. These are coarser than time-based limits but often more robust for preventing runaway agents.
Layer 1: Working Within Provider Limits
Most providers give you two limits:
- Requests per minute (RPM) — how many API calls you can make
- Tokens per minute (TPM) — how many tokens you can consume
When you hit either limit, you get a 429 response. The right reaction: exponential backoff with jitter.
```python
import random
import time

from anthropic import RateLimitError


def call_with_backoff(client, messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff plus random jitter, so concurrent
            # clients don't all retry at the same instant
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
```
If you’re consistently hitting provider rate limits, you have three options: upgrade your tier, optimize your agents to use fewer tokens per task, or spread load across multiple API keys (which requires careful account management).
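A complementary option is to throttle proactively on the client side so you rarely see 429s in the first place. Below is a minimal token-bucket sketch; the class name and defaults are illustrative, not from any library:

```python
import threading
import time


class TokenBucket:
    """Client-side throttle: allows up to `rate` requests per `per` seconds."""

    def __init__(self, rate: int, per: float = 60.0):
        self.capacity = rate
        self.tokens = float(rate)
        self.refill_rate = rate / per  # Tokens replenished per second
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                elapsed = now - self.last_refill
                self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
                self.last_refill = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                needed = (1 - self.tokens) / self.refill_rate
            time.sleep(needed)


# Cap at 50 requests per minute, regardless of how fast the agent loops
bucket = TokenBucket(rate=50, per=60.0)
```

Call `bucket.acquire()` immediately before each API request; bursts drain the bucket, then calls block until tokens refill. Keep backoff as well, since the provider's accounting won't exactly match yours.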
Layer 2: Application-Level Controls
These are the controls you implement in your agent logic.
Max Steps Limit
Every production agent needs a max steps limit. Without it, a looping agent will keep calling the LLM indefinitely.
```python
# LangChain
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=15,                 # Hard stop after 15 steps
    early_stopping_method="generate",  # Generate a final answer if max is hit
)

# Pydantic AI
result = await agent.run(prompt, max_steps=10)
```
Tool Call Frequency Tracking
For agents with expensive external tool calls, track how many times each tool is called per task:
```python
from collections import defaultdict


class RateLimitedToolExecutor:
    def __init__(self, tools: dict, max_calls_per_tool: dict):
        self.tools = tools  # Mapping of tool name -> callable
        self.max_calls = max_calls_per_tool
        self.call_counts = defaultdict(int)

    def execute(self, tool_name: str, *args, **kwargs):
        self.call_counts[tool_name] += 1
        limit = self.max_calls.get(tool_name, 100)  # Default cap for untracked tools
        if self.call_counts[tool_name] > limit:
            raise RuntimeError(
                f"Tool '{tool_name}' call limit ({limit}) exceeded. "
                "Stopping to prevent runaway costs."
            )
        return self.tools[tool_name](*args, **kwargs)
```
Cool-Down Between Tasks
For agents running batch jobs, pacing between tasks prevents burst-then-wait patterns:
```python
import asyncio


async def process_batch(items: list, agent, delay_seconds: float = 1.0):
    results = []
    for item in items:
        result = await agent.run(item)
        results.append(result)
        await asyncio.sleep(delay_seconds)  # Pace requests
    return results
```
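If strictly serial processing is too slow, a semaphore can cap how many tasks run at once while still smoothing bursts. A sketch, assuming the same `agent.run` interface; the function name and defaults are illustrative:

```python
import asyncio


async def process_batch_concurrent(items, agent, max_concurrent=3, delay_seconds=1.0):
    """Run up to `max_concurrent` tasks at a time instead of one."""
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(item):
        async with sem:
            result = await agent.run(item)
            await asyncio.sleep(delay_seconds)  # Hold the slot briefly to pace bursts
            return result

    return await asyncio.gather(*(run_one(i) for i in items))
```

The semaphore bounds concurrency, so the worst-case burst is `max_concurrent` simultaneous requests rather than one per item.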
Layer 3: Budget Limits as Rate Control
A per-agent spending limit is a form of rate limiting — but coarser. Instead of controlling requests per minute, it controls total resources consumed.
For many use cases, this is more appropriate than time-based limits:
- An agent doing a single expensive task should be able to burst
- An agent that loops should be stopped by hitting its budget ceiling
With ATXP, when an agent’s balance hits zero, all subsequent calls return a 402. This is the ultimate backstop — the agent literally cannot call anything further.
```python
import httpx

# Create an agent with a $2 budget — that's the rate limit
response = httpx.post(
    "https://api.atxp.ai/v1/agents",
    headers={"Authorization": f"Bearer {ATXP_API_KEY}"},
    json={
        "name": "batch-processor",
        "budget": 2.00,  # Hard ceiling — can't exceed this no matter what
        "currency": "usd",
    },
)
```
The Right Combination
For production agents:
- Implement max_steps — prevents infinite loops regardless of cost
- Add tool call tracking for expensive external tools
- Set per-agent spending limits via ATXP — hard ceiling that can’t be reasoned around
- Handle 429 and 402 errors gracefully — different causes, different responses
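That last point matters because the two errors call for opposite responses: a 429 is transient, so back off and retry; a 402 means the budget is spent, so retrying is pointless and the task should stop. A minimal dispatcher sketch, assuming responses expose `status_code` and `raise_for_status()` as httpx and requests do (the function name is illustrative):

```python
import random
import time


def call_agent_api(request_fn, max_retries=5):
    """Route 429 vs 402: retry the former with backoff, stop hard on the latter.

    `request_fn` is any zero-arg callable returning an HTTP response object.
    """
    for attempt in range(max_retries):
        response = request_fn()
        if response.status_code == 429:
            # Transient: provider rate limit. Back off and retry.
            time.sleep((2 ** attempt) + random.uniform(0, 1))
            continue
        if response.status_code == 402:
            # Permanent for this agent: budget exhausted. Don't retry.
            raise RuntimeError("Agent budget exhausted; halting task.")
        response.raise_for_status()
        return response
    raise RuntimeError(f"Still rate-limited after {max_retries} retries.")
```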
The spending limit is the one control that can’t be bypassed by the agent’s reasoning. Application code can potentially be circumvented by a clever prompt or unexpected behavior. Infrastructure-level limits cannot.
For the broader set of controls that make agents safe in production, see our guide on how to ramp agent autonomy.
ATXP enforces budget limits at the infrastructure layer — the rate control that can’t be reasoned around.