What Happens to Your AI Agent When the API Goes Down?

LLM APIs go down. External service APIs go down. They don’t do it often, but often enough that production agents need to handle it.

Most tutorial-grade agent code handles the happy path. Here’s what you need for the unhappy path.

The Failure Modes

LLM provider outage — OpenAI, Anthropic, or your model provider is unavailable. Your agent can’t reason or generate text. This is the most severe failure mode: without the model, nothing else works.

Tool API failure — your agent can still think, but one of its tools (web search, database, external API) is returning errors. The agent can reason about the failure but can’t complete tasks requiring that tool.

Intermittent errors — occasional 500 errors or timeouts from any API. Not a full outage, but enough to break naive retry logic.

Rate limiting — 429 errors from exceeding your tier’s limits. Temporary, resolvable with backoff.

Payment/auth failure — 402 (budget exhausted) or 401 (invalid credentials). Your agent’s infrastructure layer is blocking calls.

Each requires a different response.

Handling LLM Provider Outages

Retry With Exponential Backoff

For transient failures (network hiccups, brief 503s):

import time
import anthropic
from anthropic import APIStatusError, APIConnectionError

def call_with_retry(client, **kwargs):
    max_retries = 4
    base_delay = 1.0

    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except (APIStatusError, APIConnectionError) as e:
            # Don't retry client errors — those are your fault
            if hasattr(e, 'status_code') and 400 <= e.status_code < 500:
                if e.status_code not in (429, 408):  # Do retry rate limits and timeouts
                    raise

            if attempt == max_retries - 1:
                raise

            delay = base_delay * (2 ** attempt)
            time.sleep(delay)

Fallback to Secondary Provider

For extended outages, switch to a backup provider:

from openai import OpenAI, APIError
import anthropic

primary = OpenAI()
fallback = anthropic.Anthropic()

def call_with_fallback(prompt: str) -> str:
    try:
        response = primary.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    except APIError as e:
        # Connection errors carry no status_code, so read it defensively
        # and treat them like server-side failures: try the fallback.
        status = getattr(e, "status_code", None)
        if status is None or status >= 500:
            response = fallback.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text
        raise

Note: fallback outputs will differ. If your downstream system depends on consistent output format, test the fallback model’s outputs against your schema.
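One lightweight way to catch format drift is to run both providers’ outputs through the same validation gate before trusting them. A minimal sketch, assuming your downstream contract is JSON with required keys — the `REQUIRED_KEYS` schema here is a made-up example:

```python
import json

REQUIRED_KEYS = {"summary", "confidence"}  # hypothetical downstream contract

def validate_output(raw: str) -> dict:
    """Parse a model response and reject it if required keys are missing."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"model output missing keys: {sorted(missing)}")
    return data

# Primary and fallback outputs both pass through the same gate:
ok = validate_output('{"summary": "all good", "confidence": 0.9}')
```

Running the fallback model’s outputs through this check in a test suite, before an outage forces the switch, is cheaper than discovering the mismatch in production.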

Handling Tool API Failures

When a specific tool fails, the agent should degrade gracefully — skip the tool and proceed with what it has, rather than failing the entire task:

from langchain_core.tools import tool

@tool
def web_search(query: str) -> str:
    """Search the web for current information."""
    try:
        results = search_api.search(query)
        return format_results(results)
    except Exception as e:
        # Don't propagate the error — return a signal the agent can reason about
        return (
            f"Web search unavailable (error: {str(e)[:100]}). "
            "Proceed with information from your training data and note the limitation in your response."
        )

The agent receives the error as text and can decide what to do: try a different tool, complete the task with caveats, or explicitly surface that it couldn’t verify current information.

Checkpointing for Long-Running Agents

For agents running multi-step workflows, an outage mid-task shouldn’t lose all completed work:

import json
from pathlib import Path

class CheckpointedAgent:
    def __init__(self, task_id: str, agent):
        self.task_id = task_id
        self.agent = agent
        self.checkpoint_path = Path(f"checkpoints/{task_id}.json")
        # Ensure the checkpoint directory exists before the first save
        self.checkpoint_path.parent.mkdir(parents=True, exist_ok=True)

    def save_checkpoint(self, step: int, completed_results: list):
        self.checkpoint_path.write_text(json.dumps({
            "step": step,
            "results": completed_results
        }))

    def load_checkpoint(self) -> dict | None:
        if self.checkpoint_path.exists():
            return json.loads(self.checkpoint_path.read_text())
        return None

    async def run(self, tasks: list):
        checkpoint = self.load_checkpoint()
        start_step = checkpoint["step"] + 1 if checkpoint else 0
        results = checkpoint["results"] if checkpoint else []

        for i, task in enumerate(tasks[start_step:], start=start_step):
            result = await self.agent.run(task)
            results.append(result)
            self.save_checkpoint(i, results)

        return results

When an outage hits step 7 of a 20-step workflow, resume from step 8 — not from scratch.
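The resume arithmetic is worth sanity-checking: a checkpoint recording completed step 6 (zero-indexed) should make the next run start at step 7 with seven saved results. A minimal simulation of that load path, with no real agent involved:

```python
import json
import tempfile
from pathlib import Path

# Simulate a crash after step 6 (zero-indexed) of a 20-step workflow.
ckpt = Path(tempfile.mkdtemp()) / "task-42.json"
ckpt.write_text(json.dumps({
    "step": 6,
    "results": [f"r{i}" for i in range(7)],  # steps 0..6 completed
}))

# The same load logic CheckpointedAgent.run uses:
checkpoint = json.loads(ckpt.read_text()) if ckpt.exists() else None
start_step = checkpoint["step"] + 1 if checkpoint else 0
results = checkpoint["results"] if checkpoint else []
```

On restart, `tasks[start_step:]` skips everything already done, and the new results append to the seven recovered ones.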

What Not to Do During Outages

Don’t take irreversible actions in degraded state. If your LLM is unavailable and your agent falls back to cached decisions, don’t let it execute purchases, send emails, or delete data on stale reasoning. Require fresh LLM confirmation for irreversible actions.
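A simple guard is to tag tools by reversibility and refuse irreversible ones whenever the system is degraded. A sketch under assumptions — the tool names and the `degraded` flag are hypothetical, and in practice the flag would come from your health checks:

```python
# Tools that cannot be undone once executed (hypothetical names)
IRREVERSIBLE_TOOLS = {"send_email", "execute_purchase", "delete_records"}

def guard_tool_call(tool_name: str, degraded: bool) -> bool:
    """Return True if the call may proceed.

    In a degraded state (LLM unavailable, running on cached or stale
    reasoning), block irreversible actions; read-only tools stay allowed.
    """
    if degraded and tool_name in IRREVERSIBLE_TOOLS:
        return False
    return True
```

Anything this guard blocks should be queued for fresh LLM confirmation once the provider recovers, not silently dropped.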

Don’t silently time out. An agent that just stops responding during an outage is worse than one that surfaces a clear error. Users and downstream systems need to know why work stopped.

Don’t retry forever. Exponential backoff with a maximum retry count. After the maximum, surface the failure and stop. Open-ended retry loops waste resources and make outage diagnosis harder.

The Status Page Pattern

For production deployments, check provider status pages before starting long-running jobs:

import httpx

def check_provider_status() -> dict:
    providers = {
        "openai": "https://status.openai.com/api/v2/status.json",
        "anthropic": "https://status.anthropic.com/api/v2/status.json",
    }
    status = {}
    for name, url in providers.items():
        try:
            r = httpx.get(url, timeout=5)
            data = r.json()
            status[name] = data.get("status", {}).get("indicator", "unknown")
        except Exception:
            status[name] = "unreachable"
    return status

If a provider is already showing as degraded, don’t start a job that depends on it.
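Gating a job on that check is then one function. This sketch assumes Statuspage-style indicators, where "none" means all systems operational and anything else ("minor", "major", "critical", or our own "unreachable") counts as degraded:

```python
def safe_to_start(status: dict, required: list[str]) -> bool:
    """Only start a long-running job if every required provider reports healthy."""
    return all(status.get(provider) == "none" for provider in required)
```

A job that only uses one provider only needs to gate on that provider — listing both in `required` would block work unnecessarily during a single-provider outage.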

The Underlying Principle

Agents that handle failures gracefully are more useful than agents that perform better under ideal conditions. The outage you plan for is one you recover from in minutes. The outage you don’t plan for is the one that wakes you up at 3am.

ATXP’s infrastructure handles routing and billing reliability. For the control patterns that keep a bad state from becoming worse, see: how to revoke agent access without breaking your pipeline.