What Happens to Your AI Agent When the API Goes Down?
LLM APIs go down. External service APIs go down. They don’t do it often, but often enough that production agents need to handle it.
Most tutorial-grade agent code handles the happy path. Here’s what you need for the unhappy path.
The Failure Modes
LLM provider outage — OpenAI, Anthropic, or your model provider is unavailable. Your agent can’t reason or generate text. This is the most total failure mode.
Tool API failure — your agent can still think, but one of its tools (web search, database, external API) is returning errors. The agent can reason about the failure but can’t complete tasks requiring that tool.
Intermittent errors — occasional 500 errors or timeouts from any API. Not a full outage, but enough to break naive retry logic.
Rate limiting — 429 errors from exceeding your tier’s limits. Temporary, resolvable with backoff.
Payment/auth failure — 402 (budget exhausted) or 401 (invalid credentials). Your agent’s infrastructure layer is blocking calls.
Each requires a different response.
Handling LLM Provider Outages
Retry With Exponential Backoff
For transient failures (network hiccups, brief 503s):
```python
import time
from anthropic import APIStatusError, APIConnectionError

def call_with_retry(client, **kwargs):
    max_retries = 4
    base_delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except (APIStatusError, APIConnectionError) as e:
            # Don't retry client errors — those are your fault
            if hasattr(e, "status_code") and 400 <= e.status_code < 500:
                if e.status_code not in (429, 408):  # Do retry rate limits and timeouts
                    raise
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            time.sleep(delay)
```
Fallback to Secondary Provider
For extended outages, switch to a backup provider:
```python
import anthropic
from openai import OpenAI, APIStatusError

primary = OpenAI()
fallback = anthropic.Anthropic()

def call_with_fallback(prompt: str) -> str:
    try:
        response = primary.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    # Catch APIStatusError, not the APIError base class: connection
    # errors carry no status_code and would crash the check below.
    except APIStatusError as e:
        if e.status_code >= 500:  # Server-side error — try fallback
            response = fallback.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.content[0].text
        raise
```
Note: fallback outputs will differ. If your downstream system depends on consistent output format, test the fallback model’s outputs against your schema.
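A minimal version of that check, assuming a hypothetical downstream contract of a JSON object with `summary` and `confidence` fields (substitute your real schema):

```python
import json

# Hypothetical contract: downstream expects {"summary": str, "confidence": float}
REQUIRED_FIELDS = {"summary": str, "confidence": float}

def validate_output(raw: str) -> dict:
    """Parse a model response and verify it matches the expected shape.

    Raises ValueError with a specific message so format drift in the
    fallback path shows up in logs instead of corrupting downstream state.
    """
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for {field}: {type(data[field]).__name__}")
    return data
```

Run both the primary and fallback models' outputs through the same validator in CI, so a format mismatch surfaces in testing rather than mid-outage.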
Handling Tool API Failures
When a specific tool fails, the agent should degrade gracefully — skip the tool and proceed with what it has, rather than failing the entire task:
```python
from langchain_core.tools import tool

@tool
def web_search(query: str) -> str:
    """Search the web for current information."""
    try:
        # search_api / format_results stand in for your search backend
        results = search_api.search(query)
        return format_results(results)
    except Exception as e:
        # Don't propagate the error — return a signal the agent can reason about
        return (
            f"Web search unavailable (error: {str(e)[:100]}). "
            "Proceed with information from your training data "
            "and note the limitation in your response."
        )
```
The agent receives the error as text and can decide what to do: try a different tool, complete the task with caveats, or explicitly surface that it couldn’t verify current information.
Checkpointing for Long-Running Agents
For agents running multi-step workflows, an outage mid-task shouldn’t lose all completed work:
```python
import json
from pathlib import Path

class CheckpointedAgent:
    def __init__(self, task_id: str, agent):
        self.task_id = task_id
        self.agent = agent
        self.checkpoint_path = Path(f"checkpoints/{task_id}.json")
        # Without this, the first write fails if the directory doesn't exist
        self.checkpoint_path.parent.mkdir(parents=True, exist_ok=True)

    def save_checkpoint(self, step: int, completed_results: list):
        self.checkpoint_path.write_text(json.dumps({
            "step": step,
            "results": completed_results,
        }))

    def load_checkpoint(self) -> dict | None:
        if self.checkpoint_path.exists():
            return json.loads(self.checkpoint_path.read_text())
        return None

    async def run(self, tasks: list):
        checkpoint = self.load_checkpoint()
        start_step = checkpoint["step"] + 1 if checkpoint else 0
        results = checkpoint["results"] if checkpoint else []
        for i, task in enumerate(tasks[start_step:], start=start_step):
            result = await self.agent.run(task)
            results.append(result)
            self.save_checkpoint(i, results)
        return results
```
When an outage hits step 7 of a 20-step workflow, resume from step 8 — not from scratch.
What Not to Do During Outages
Don’t take irreversible actions in degraded state. If your LLM is unavailable and your agent falls back to cached decisions, don’t let it execute purchases, send emails, or delete data on stale reasoning. Require fresh LLM confirmation for irreversible actions.
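One way to enforce that rule is a staleness guard in front of irreversible tools. This sketch assumes you can record a timestamp after every live (non-cached) LLM response; the class and threshold are illustrative:

```python
import time

class IrreversibleActionGuard:
    """Refuses irreversible actions when the last live LLM response is stale."""

    def __init__(self, max_age_seconds: float = 60.0):
        self.max_age_seconds = max_age_seconds
        self.last_llm_response_at: float | None = None

    def record_llm_response(self) -> None:
        # Call after every *live* (non-cached) LLM call succeeds
        self.last_llm_response_at = time.monotonic()

    def check(self, action: str) -> None:
        """Raise unless fresh LLM reasoning backs this action."""
        if self.last_llm_response_at is None:
            raise PermissionError(f"refusing '{action}': no live LLM reasoning yet")
        age = time.monotonic() - self.last_llm_response_at
        if age > self.max_age_seconds:
            raise PermissionError(
                f"refusing '{action}': last LLM response is {age:.0f}s old"
            )
```

Call `guard.check("send_email")` immediately before the side effect, so a fallback-to-cache code path can never reach it silently.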
Don’t silently time out. An agent that just stops responding during an outage is worse than one that surfaces a clear error. Users and downstream systems need to know why work stopped.
Don’t retry forever. Exponential backoff with a maximum retry count. After the maximum, surface the failure and stop. Open-ended retry loops waste resources and make outage diagnosis harder.
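Both rules fit in one small wrapper: bounded attempts, a hard per-attempt deadline, and a descriptive error instead of silence. Function and parameter names here are illustrative:

```python
import concurrent.futures

def call_with_deadline(fn, *args, timeout_s: float = 30.0, max_attempts: int = 3):
    """Run fn with a per-attempt deadline and a bounded retry count.

    Always ends in a result or a descriptive exception, never a silent hang.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_attempts)
    last_error = "no attempts made"
    try:
        for attempt in range(1, max_attempts + 1):
            future = pool.submit(fn, *args)
            try:
                return future.result(timeout=timeout_s)
            except concurrent.futures.TimeoutError:
                # A timed-out thread can't be killed; it is abandoned here
                last_error = f"attempt {attempt} timed out after {timeout_s}s"
            except Exception as e:
                last_error = f"attempt {attempt} failed: {e}"
        raise RuntimeError(
            f"giving up after {max_attempts} attempts; last error: {last_error}"
        )
    finally:
        pool.shutdown(wait=False)
```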
The Status Page Pattern
For production deployments, check provider status pages before starting long-running jobs:
```python
import httpx

def check_provider_status() -> dict:
    providers = {
        "openai": "https://status.openai.com/api/v2/status.json",
        "anthropic": "https://status.anthropic.com/api/v2/status.json",
    }
    status = {}
    for name, url in providers.items():
        try:
            r = httpx.get(url, timeout=5)
            data = r.json()
            status[name] = data.get("status", {}).get("indicator", "unknown")
        except Exception:
            status[name] = "unreachable"
    return status
```
If a provider is already showing as degraded, don’t start a job that depends on it.
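Statuspage-style endpoints report an indicator of "none", "minor", "major", or "critical", so the gate itself is a one-liner (`safe_to_start` is a made-up helper on top of the status dict above):

```python
def safe_to_start(job_providers: list[str], status: dict) -> bool:
    """Start only when every provider the job depends on reports no incident."""
    return all(status.get(p) == "none" for p in job_providers)
```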
The Underlying Principle
Agents that handle failures gracefully are more useful than agents that perform better under ideal conditions. The outage you plan for is one you recover from in minutes. The outage you don’t plan for is the one that wakes you up at 3am.
ATXP’s infrastructure handles routing and billing reliability. For the control patterns that keep a bad state from becoming worse, see how to revoke agent access without breaking your pipeline.