How to Set Rate Limits on an AI Agent
Rate limiting a web server is a solved problem. Rate limiting an AI agent is messier — because agents don’t make predictable, uniform requests. They make bursts of calls when reasoning, then silence, then another burst. Standard rate limiting patterns don’t map cleanly.
Here’s what actually works.
The Three Layers of Agent Rate Limiting
Rate limits for AI agents come from three places, and you need to think about all three:
Layer 1: Provider limits — enforced by OpenAI, Anthropic, or whichever LLM provider you’re using. These apply to your account as a whole. Hitting them means 429 errors for all your agents simultaneously.
Layer 2: Application limits — controls you build into your agent code. Max iterations, cool-down periods, circuit breakers. These apply per agent and per task.
Layer 3: Infrastructure budget limits — spending ceilings that stop the agent when its allocated budget runs out. These are coarser than time-based limits but often more robust for preventing runaway agents.
Layer 1: Working Within Provider Limits
Most providers give you two limits:
- Requests per minute (RPM) — how many API calls you can make
- Tokens per minute (TPM) — how many tokens you can consume
When you hit either limit, you get a 429 response. The right reaction: exponential backoff with jitter.
```python
import random
import time

from anthropic import RateLimitError


def call_with_backoff(client, messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff plus random jitter, so concurrent
            # clients don't all retry at the same instant
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
```
If you’re consistently hitting provider rate limits, you have three options: upgrade your tier, optimize your agents to use fewer tokens per task, or spread load across multiple API keys (which requires careful account management).
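A complementary option is to throttle proactively on the client side so you rarely see 429s in the first place. Below is a minimal token-bucket sketch; the class name and defaults are illustrative, not from any library:

```python
import threading
import time


class TokenBucket:
    """Client-side throttle: allows up to `rate` requests per `per` seconds."""

    def __init__(self, rate: int, per: float = 60.0):
        self.capacity = rate
        self.tokens = float(rate)
        self.refill_rate = rate / per  # Tokens replenished per second
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                elapsed = now - self.last_refill
                self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
                self.last_refill = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                needed = (1 - self.tokens) / self.refill_rate
            time.sleep(needed)


# Cap at 50 requests per minute, regardless of how fast the agent loops
bucket = TokenBucket(rate=50, per=60.0)
```

Call `bucket.acquire()` immediately before each API request; bursts drain the bucket, then calls block until tokens refill. Keep backoff as well, since the provider's accounting won't exactly match yours.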
Layer 2: Application-Level Controls
These are the controls you implement in your agent logic.
Max Steps Limit
Every production agent needs a max steps limit. Without it, a looping agent will keep calling the LLM indefinitely.
```python
# LangChain
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=15,                 # Hard stop after 15 steps
    early_stopping_method="generate",  # Generate a final answer if max is hit
)

# Pydantic AI
result = await agent.run(prompt, max_steps=10)
```
Tool Call Frequency Tracking
For agents with expensive external tool calls, track how many times each tool is called per task:
```python
from collections import defaultdict


class RateLimitedToolExecutor:
    def __init__(self, tools: dict, max_calls_per_tool: dict):
        self.tools = tools  # Mapping of tool name -> callable
        self.max_calls = max_calls_per_tool
        self.call_counts = defaultdict(int)

    def execute(self, tool_name: str, *args, **kwargs):
        self.call_counts[tool_name] += 1
        limit = self.max_calls.get(tool_name, 100)  # Default cap for untracked tools
        if self.call_counts[tool_name] > limit:
            raise RuntimeError(
                f"Tool '{tool_name}' call limit ({limit}) exceeded. "
                "Stopping to prevent runaway costs."
            )
        return self.tools[tool_name](*args, **kwargs)
```
Cool-Down Between Tasks
For agents running batch jobs, pacing between tasks prevents burst-then-wait patterns:
```python
import asyncio


async def process_batch(items: list, agent, delay_seconds: float = 1.0):
    results = []
    for item in items:
        result = await agent.run(item)
        results.append(result)
        await asyncio.sleep(delay_seconds)  # Pace requests
    return results
```
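If strictly serial processing is too slow, a semaphore can cap how many tasks run at once while still smoothing bursts. A sketch, assuming the same `agent.run` interface; the function name and defaults are illustrative:

```python
import asyncio


async def process_batch_concurrent(items, agent, max_concurrent=3, delay_seconds=1.0):
    """Run up to `max_concurrent` tasks at a time instead of one."""
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(item):
        async with sem:
            result = await agent.run(item)
            await asyncio.sleep(delay_seconds)  # Hold the slot briefly to pace bursts
            return result

    return await asyncio.gather(*(run_one(i) for i in items))
```

The semaphore bounds concurrency, so the worst-case burst is `max_concurrent` simultaneous requests rather than one per item.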
Layer 3: Budget Limits as Rate Control
A per-agent spending limit is a form of rate limiting — but coarser. Instead of controlling requests per minute, it controls total resources consumed.
For many use cases, this is more appropriate than time-based limits:
- An agent doing a single expensive task should be able to burst
- An agent that loops should be stopped by hitting its budget ceiling
With ATXP, when an agent’s balance hits zero, all subsequent calls return a 402. This is the ultimate backstop — the agent literally cannot call anything further.
```python
import httpx

# Create an agent with a $2 budget — that's the rate limit
response = httpx.post(
    "https://api.atxp.ai/v1/agents",
    headers={"Authorization": f"Bearer {ATXP_API_KEY}"},
    json={
        "name": "batch-processor",
        "budget": 2.00,  # Hard ceiling — can't exceed this no matter what
        "currency": "usd",
    },
)
```
The Right Combination
For production agents:
- Implement max_steps — prevents infinite loops regardless of cost
- Add tool call tracking for expensive external tools
- Set per-agent spending limits via ATXP — hard ceiling that can’t be reasoned around
- Handle 429 and 402 errors gracefully — different causes, different responses
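That last point matters because the two errors call for opposite responses: a 429 is transient, so back off and retry; a 402 means the budget is spent, so retrying is pointless and the task should stop. A minimal dispatcher sketch, assuming responses expose `status_code` and `raise_for_status()` as httpx and requests do (the function name is illustrative):

```python
import random
import time


def call_agent_api(request_fn, max_retries=5):
    """Route 429 vs 402: retry the former with backoff, stop hard on the latter.

    `request_fn` is any zero-arg callable returning an HTTP response object.
    """
    for attempt in range(max_retries):
        response = request_fn()
        if response.status_code == 429:
            # Transient: provider rate limit. Back off and retry.
            time.sleep((2 ** attempt) + random.uniform(0, 1))
            continue
        if response.status_code == 402:
            # Permanent for this agent: budget exhausted. Don't retry.
            raise RuntimeError("Agent budget exhausted; halting task.")
        response.raise_for_status()
        return response
    raise RuntimeError(f"Still rate-limited after {max_retries} retries.")
```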
The spending limit is the one control that can’t be bypassed by the agent’s reasoning. Application code can potentially be circumvented by a clever prompt or unexpected behavior. Infrastructure-level limits cannot.
For the broader set of controls that make agents safe in production, see our guide on how to ramp agent autonomy.
ATXP enforces budget limits at the infrastructure layer — the rate control that can’t be reasoned around.