How to Add ATXP to LlamaIndex

LlamaIndex agents have no native payment guardrails. A ReActAgent with three tools can silently call paid APIs dozens of times per session — and you won’t know the cost until the invoice arrives. This guide covers exactly how to wire ATXP into LlamaIndex agents, so every tool call is authorized, capped, and auditable before it touches a paid service.

LlamaIndex is one of the most-deployed frameworks for RAG and agent pipelines in 2026. Its tool system is excellent at calling external APIs. It has no opinion on whether it should, how much it should spend, or when to stop.

That gap is what ATXP closes.

Why Do LlamaIndex Agents Need a Payment Layer?

LlamaIndex’s architecture is optimized for tool richness. You define tools, pass them to an AgentRunner or ReActAgent, and the agent decides when to call them. That’s powerful. It’s also how you end up with a query that embeds a document, calls a web search API, invokes a third-party enrichment service, and then reruns the retrieval loop — spending $4.20 on a pipeline you expected to cost $0.12.

Three patterns cause the most uncontrolled spend in LlamaIndex:

  1. Retrieval loops without a ceiling — an agent that re-retrieves until it’s “confident,” with no cap on how many calls that involves
  2. Tool chains — a tool calls one API, which triggers another, with no session-level budget enforcement
  3. QueryEngine nested inside ReActAgent — when you pass a QueryEngineTool to an agent, retrieval costs are invisible to the reasoning loop. The agent treats it like any other tool call.

ATXP wraps those tool calls with pre-authorization. Before a tool executes, it reserves credits and checks available balance. If the session cap is reached, the tool fails with a structured error instead of continuing to spend. Your agent can handle that error explicitly — degrading gracefully rather than crashing.

What ATXP Adds to a LlamaIndex Stack

| Without ATXP | With ATXP |
| --- | --- |
| No visibility into per-tool spend | Per-tool cost attribution on every call |
| No per-session or per-query budget caps | Hard caps at session, query, or tool level |
| API credentials shared across agents in .env | Credential isolation per agent identity |
| Unknown total cost until billing cycle | Real-time credit balance and structured receipts |
| Tool failures return generic exceptions | InsufficientCreditsError is explicit and handleable |
| No audit trail for regulatory or team review | Full spend log with task ID and tool name |

This is the difference between a prototype and a pipeline you’d trust to run overnight.

Installing ATXP

pip install atxp-sdk llama-index

Initialize a client and create a wallet for your agent:

from atxp import ATXP

client = ATXP(api_key="your-atxp-key")
agent_wallet = client.wallets.create_or_get(
    agent_id="llamaindex-rag-agent",
    budget=5.00  # $5 session cap
)

This follows the IOU token model: credits are scoped to an agent identity, bounded by the cap you set, and burn down as the agent spends. The agent never touches your underlying payment credentials.
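As a mental model for that burn-down behavior, the core mechanics can be sketched in plain Python. This is an illustrative stand-in, not the ATXP implementation; `SessionWallet` and its `spend` method are hypothetical names:

```python
class InsufficientCreditsError(Exception):
    """Raised when a spend would exceed the wallet's remaining credits."""

class SessionWallet:
    """Illustrative burn-down wallet: credits are scoped to one agent
    identity, bounded by a cap, and only ever decrease toward zero."""

    def __init__(self, agent_id: str, budget: float):
        self.agent_id = agent_id
        self.remaining = budget

    def spend(self, amount: float, description: str = "") -> float:
        if amount > self.remaining:
            raise InsufficientCreditsError(
                f"{self.agent_id}: need ${amount:.2f}, have ${self.remaining:.2f}"
            )
        self.remaining -= amount
        return self.remaining

wallet = SessionWallet("llamaindex-rag-agent", budget=5.00)
wallet.spend(0.02, "web_search")
wallet.spend(0.05, "enrichment")
```

The key property is that the cap is enforced before the spend happens, so an exhausted wallet fails loudly rather than overdrawing.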

How to Wire ATXP Into a LlamaIndex QueryEngine

LlamaIndex QueryEngine is where most RAG spend accumulates. Each .query() call typically triggers embedding, vector search, and one or more LLM inference calls. If your QueryEngine is backed by a paid vector store or calls an external API during retrieval, those costs are invisible to any budget system by default.

Wrap your QueryEngine with ATXP authorization:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from atxp import ATXP, InsufficientCreditsError

client = ATXP(api_key="your-atxp-key")
wallet = client.wallets.get("llamaindex-rag-agent")

def authorized_query(query_engine, query_text, max_cost=0.10):
    """Authorize spend before executing a RAG query."""
    reservation = wallet.reserve(
        amount=max_cost,
        description=f"rag_query: {query_text[:50]}"
    )
    try:
        response = query_engine.query(query_text)
        actual_cost = estimate_query_cost(response)  # your own cost estimator
        reservation.confirm(amount=actual_cost)
        return response
    except Exception:
        reservation.cancel()  # return held credits if the query fails
        raise

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

result = authorized_query(query_engine, "What are the main cost drivers for agentic pipelines?")

The reservation pattern matters here: it pre-authorizes the maximum expected spend before the query runs, then confirms the actual amount afterward. If the query errors or the pipeline aborts, the reservation cancels and credits return to the pool. You’re not charged for work that didn’t complete.
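If you want to see the reserve / confirm / cancel lifecycle in isolation, it can be sketched with a couple of stub classes (illustrative only, not the ATXP SDK):

```python
class Reservation:
    """Holds credits aside until the work either completes or aborts."""

    def __init__(self, wallet, amount):
        self.wallet = wallet
        self.amount = amount
        wallet.available -= amount  # pre-authorize the maximum expected cost

    def confirm(self, amount):
        # Charge the actual cost; return the unused remainder to the pool.
        self.wallet.available += self.amount - amount
        self.wallet.spent += amount

    def cancel(self):
        # Work didn't complete: return the full reservation.
        self.wallet.available += self.amount

class Wallet:
    def __init__(self, budget):
        self.available = budget
        self.spent = 0.0

    def reserve(self, amount):
        if amount > self.available:
            raise RuntimeError("insufficient credits")
        return Reservation(self, amount)

wallet = Wallet(budget=1.00)
res = wallet.reserve(0.10)  # hold the max expected cost up front
res.confirm(0.04)           # actual cost was lower; the rest returns to the pool
```

The over-reservation is temporary by design: the pool only loses what the work actually cost.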

How to Wire ATXP Into a LlamaIndex AgentRunner

AgentRunner and ReActAgent are where LlamaIndex spend becomes less predictable. The agent can call any tool any number of times as it reasons toward an answer. A complex task can cascade through 10–15 tool calls, each defensible in isolation, with a cumulative cost that exceeds the task’s value.

Define payment-gated tools and pass them to your agent:

from llama_index.core.tools import FunctionTool
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI
from atxp import ATXP

client = ATXP(api_key="your-atxp-key")
wallet = client.wallets.get("llamaindex-rag-agent")

def web_search(query: str) -> str:
    """Search the web for current information."""
    with wallet.authorize(amount=0.02, description=f"web_search: {query[:40]}"):
        return call_search_api(query)

def data_enrichment(entity: str) -> dict:
    """Enrich entity data from external API."""
    with wallet.authorize(amount=0.05, description=f"enrichment: {entity}"):
        return call_enrichment_api(entity)

tools = [
    FunctionTool.from_defaults(fn=web_search),
    FunctionTool.from_defaults(fn=data_enrichment),
]

llm = OpenAI(model="gpt-4o")
agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)

Each tool call is now independently authorized. The with wallet.authorize() context manager reserves credits before the external call and confirms or cancels based on whether the call completes. If the wallet runs low mid-session, tools start returning InsufficientCreditsError instead of continuing to spend. The agent gets a structured, handleable failure rather than a generic exception.
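The confirm-or-cancel behavior of that context manager can be approximated in a few lines. This is a hedged stand-in for illustration, not ATXP's implementation:

```python
from contextlib import contextmanager

class InsufficientCreditsError(Exception):
    pass

class Wallet:
    def __init__(self, budget):
        self.available = budget

    @contextmanager
    def authorize(self, amount, description=""):
        if amount > self.available:
            raise InsufficientCreditsError(description)
        self.available -= amount  # reserve before the external call
        try:
            yield
        except Exception:
            self.available += amount  # call failed: return the credits
            raise

wallet = Wallet(budget=0.05)

def web_search(query: str) -> str:
    with wallet.authorize(0.02, f"web_search: {query}"):
        return f"results for {query}"

web_search("agent spend controls")  # succeeds; credits burn down
```

After two successful calls this wallet can no longer cover a third 2-cent search, so the third call raises instead of spending.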

Per-Query Budget Caps: The Pattern That Prevents Cascade Failures

The highest spend risk in LlamaIndex isn’t a single expensive tool call — it’s a retrieval loop with no stopping condition. Set a hard cap at the query level using an ephemeral wallet:

from atxp import ATXP, BudgetExceededError

def run_agent_with_budget(query: str, budget_usd: float = 0.50):
    """Run a LlamaIndex agent with a hard per-query budget."""
    client = ATXP(api_key="your-atxp-key")

    query_wallet = client.wallets.create_ephemeral(
        budget=budget_usd,
        ttl_seconds=300,
        description=f"query: {query[:60]}"
    )

    tools = build_tools_with_wallet(query_wallet)
    agent = ReActAgent.from_tools(tools, llm=llm, verbose=False)

    try:
        response = agent.chat(query)
        receipt = query_wallet.close()
        return response, receipt
    except BudgetExceededError as e:
        return agent.get_partial_response(), {
            "error": "budget_exceeded",
            "spent": e.spent
        }

Each query gets its own ephemeral wallet. When the task completes — success or failure — the wallet closes and issues a receipt. This is the per-task budget isolation pattern applied to LlamaIndex: one cap per task, not one cap per day or per billing cycle.
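The per-task isolation property can be illustrated with a minimal stand-in: one budget, one receipt, then the wallet is gone. The class and method names here are hypothetical, not the ATXP API:

```python
class BudgetExceededError(Exception):
    def __init__(self, spent):
        super().__init__(f"budget exceeded after ${spent:.2f}")
        self.spent = spent

class EphemeralWallet:
    """Illustrative per-task wallet: charges accumulate against one cap,
    and closing it yields a receipt for exactly this task's spend."""

    def __init__(self, budget: float, description: str = ""):
        self.budget = budget
        self.spent = 0.0
        self.description = description
        self.closed = False

    def charge(self, amount: float, tool_name: str) -> None:
        if self.closed:
            raise RuntimeError("wallet already closed")
        if self.spent + amount > self.budget:
            raise BudgetExceededError(self.spent)
        self.spent += amount

    def close(self) -> dict:
        self.closed = True
        return {
            "description": self.description,
            "budget": self.budget,
            "spent": self.spent,
        }

w = EphemeralWallet(budget=0.50, description="query: cost drivers")
w.charge(0.02, "web_search")
w.charge(0.10, "rag_query")
receipt = w.close()
```

Because the wallet dies with the task, one runaway query can never borrow budget from the next one.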


Running LlamaIndex agents in production? ATXP is the payment layer. Connect your agents at atxp.ai — spend controls, audit trails, and agent identity in under 30 minutes.


How to Track Spend Across a Full RAG Pipeline

LlamaIndex RAG pipelines have multiple cost layers that are easy to lose track of individually:

| Pipeline Stage | Cost Driver | ATXP Attribution Method |
| --- | --- | --- |
| Document ingestion | Embedding API (bulk) | Per-batch reservation |
| Retrieval | Vector store query + re-ranking | Per-query authorization |
| Synthesis | LLM inference over retrieved context | Per-completion authorization |
| External tool calls | Third-party APIs within the agent loop | Per-tool authorization |
| QueryEngineTool inside ReActAgent | Full retrieval cost per agent invocation | Per-QueryEngineTool call authorization |

Pull a spend breakdown after any session:

session_receipts = client.receipts.list(
    wallet_id=wallet.id,
    group_by="tool_name"
)

for item in session_receipts:
    print(f"{item.tool_name}: ${item.total_spend:.4f} ({item.call_count} calls)")

# Example output:
# web_search:       $0.0840  (42 calls)
# data_enrichment:  $0.1250  (25 calls)
# rag_query:        $0.3120  (31 calls)

This is what you need to know which tools are expensive, whether retrieval loops are triggering more calls than expected, and whether your audit trail satisfies team or compliance requirements. Per-call spend logs are substantially easier to debug and audit than aggregate billing statements, because failure attribution is immediate rather than post-hoc.

What Happens When Credits Run Out Mid-Pipeline

Without ATXP, a LlamaIndex agent that encounters a rate limit or API error usually surfaces a generic exception. Retry logic kicks in, or the pipeline crashes, or — worst case — the call silently drops and the agent continues reasoning on incomplete data.

With ATXP, InsufficientCreditsError is a distinct, structured failure state. Handle it explicitly:

from atxp import InsufficientCreditsError
import logging

logger = logging.getLogger(__name__)

def safe_tool_call(tool_fn, *args, **kwargs):
    try:
        return tool_fn(*args, **kwargs)
    except InsufficientCreditsError as e:
        logger.warning(
            f"Budget exhausted on wallet {e.wallet_id}. "
            f"Spent: ${e.spent:.4f}. Returning degraded response."
        )
        return {"status": "budget_exhausted", "spent": e.spent}
    # Other exceptions propagate unchanged.

This is financial zero-trust at the code level: assume spend authorization can fail, handle it as a first-class case, and don’t let a payment ceiling cause a pipeline crash. The agent receives a typed failure it can route around — falling back to cached data, skipping an optional enrichment step, or surfacing the limitation to the caller.

Pre-Production Checklist

Before deploying ATXP-gated LlamaIndex agents:

  • Every tool that calls a paid external API is wrapped with wallet.authorize() or wallet.reserve()
  • Per-query ephemeral wallets are used for task-level isolation (not just session-level caps)
  • InsufficientCreditsError is handled explicitly in every payment-gated tool
  • A top-level session wallet exists as a backstop above per-tool caps
  • Receipts are being pulled and logged per session — not just total balance
  • agent_id is specific enough to attribute spend to a task type, not just a service
  • QueryEngineTool instances used inside ReActAgent are wrapped at the tool level, not the engine level

FAQ

Does ATXP work with LlamaIndex’s cloud-managed QueryEngines?

Yes. ATXP wraps tool execution at the Python function level, not the framework level. Whether your QueryEngine calls a local vector store, a managed Pinecone instance, or a cloud-hosted retrieval service, ATXP authorizes spend before the call goes out. LlamaIndex doesn’t need modification.

Can I use ATXP with LlamaIndex’s async agent patterns?

Yes. Use async with wallet.authorize_async(amount=0.05) for tools running inside async agent loops. The reservation and confirmation pattern works identically — the async with block cancels the reservation automatically if the tool raises an exception, so you’re never charged for work that errored out.
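A minimal sketch of that async shape, using a stub wallet rather than the ATXP SDK (`authorize_async` is the method named above; everything else here is illustrative):

```python
import asyncio
from contextlib import asynccontextmanager

class InsufficientCreditsError(Exception):
    pass

class Wallet:
    def __init__(self, budget):
        self.available = budget

    @asynccontextmanager
    async def authorize_async(self, amount):
        if amount > self.available:
            raise InsufficientCreditsError()
        self.available -= amount  # reserve before the awaited call
        try:
            yield
        except Exception:
            self.available += amount  # tool raised: cancel the reservation
            raise

wallet = Wallet(budget=0.10)

async def enrich(entity: str) -> dict:
    async with wallet.authorize_async(0.05):
        await asyncio.sleep(0)  # stand-in for the real API await
        return {"entity": entity, "enriched": True}

result = asyncio.run(enrich("acme-corp"))
```

The `async with` block gives the same guarantee as the sync version: a tool that raises mid-await returns its reserved credits.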

How does ATXP handle a QueryEngine embedded as a tool inside a ReActAgent?

This is the most common LlamaIndex spend trap. When you wrap a QueryEngine as a QueryEngineTool and pass it to a ReActAgent, each agent invocation triggers a full retrieval pipeline. Authorize at the QueryEngineTool level by wrapping its _query method. Set the per-call cap lower than your session cap — so the retrieval tool can run multiple times within a session without a single expensive retrieval draining the whole budget.
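One way to sketch that method-level wrapping, with a stub wallet and a stand-in retrieval tool (the wallet API, cost figures, and `RetrievalTool` class are all illustrative, not ATXP or LlamaIndex code):

```python
from contextlib import contextmanager

class Wallet:
    def __init__(self, budget):
        self.available = budget

    @contextmanager
    def authorize(self, amount):
        if amount > self.available:
            raise RuntimeError("insufficient credits")
        self.available -= amount
        try:
            yield
        except Exception:
            self.available += amount
            raise

def gate_method(obj, method_name, wallet, max_cost):
    """Replace obj.<method_name> with a version that authorizes
    spend against the wallet before every call."""
    original = getattr(obj, method_name)

    def gated(*args, **kwargs):
        with wallet.authorize(max_cost):
            return original(*args, **kwargs)

    setattr(obj, method_name, gated)

# Stand-in for a retrieval tool whose query method is expensive.
class RetrievalTool:
    def _query(self, text):
        return f"retrieved: {text}"

wallet = Wallet(budget=0.25)
tool = RetrievalTool()
gate_method(tool, "_query", wallet, max_cost=0.10)  # per-call cap < session cap
```

With the per-call cap below the session budget, the retrieval tool can run a couple of times per session, but a run of repeated retrievals hits the ceiling instead of draining the whole budget.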

Does ATXP replace my LlamaIndex observability setup?

No — and it isn’t trying to. LlamaIndex has its own callback system and integrations with LangSmith, Arize, and others. Those cover latency, retrieval quality, and trace visualization. ATXP covers payment attribution specifically: what was spent, on which tool call, at what timestamp, attributed to which agent identity. Wire ATXP receipts into your existing observability pipeline as the cost dimension alongside latency and errors.

What’s the minimum ATXP setup for a simple LlamaIndex agent?

For a single-tool agent doing occasional queries: one wallet, one authorize() call per tool, and a session cap. That’s three lines of setup and one wrapper function. The ATXP connection guide covers the full quickstart. For production RAG pipelines with multiple tools and retrieval loops, the per-query ephemeral wallet pattern above is worth the extra setup — especially for frameworks that support autonomous payments at scale.


LlamaIndex gives agents the ability to retrieve, reason, and act across any data source. ATXP gives those actions a financial ceiling. The frameworks that ship native spend controls are still rare. Until LlamaIndex does, a five-line ATXP integration is the difference between a pipeline you’d trust to run overnight and one you have to babysit.