How LLM Token Pricing Works: A Developer's Guide (2026)

Every time your agent thinks, it spends tokens. Every time it writes a response, it spends more. Understanding how that billing actually works — what a token is, why input and output cost different amounts, and how costs compound across a multi-step agent run — is the difference between building something economically viable and being surprised by a bill.

These are the mechanics, cleanly explained.


[Figure: input tokens in teal and output tokens in orange flow through a processor node, with stream thickness showing the cost differential]

What a token is

A token is the basic unit of text that a large language model processes. Not a word. Not a character. A token.

An LLM token is a chunk of text — typically 3–4 characters or about 0.75 English words — that a language model treats as a single unit during both reading and generation. Every LLM API meters usage in tokens, billing separately for input tokens (the text sent to the model) and output tokens (the text the model generates back). Because output requires sequential, autoregressive generation while input can be processed in parallel, output tokens cost 3–5x more than input tokens across all major providers.

The tokenization isn’t arbitrary — it’s learned from the training data. Common words are usually single tokens. Rare words, technical terms, and non-English text often break into multiple tokens. The string npx atxp is 4 tokens. The string agent is 1 token. The string antidisestablishmentarianism is 6 tokens. Whitespace and punctuation count too.

Why does this matter practically? Because what you’re paying for isn’t the length of your prompt in words — it’s the length in tokens, and those aren’t the same thing. Code is token-dense. JSON is token-dense. Structured output is token-dense. Plain prose is relatively token-light. If you’re sending a lot of structured data to a model, you’re spending more tokens than you might expect.
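As a rough check, you can estimate token counts from character length alone. Here is a sketch of that heuristic (real counts require the provider's tokenizer, such as OpenAI's tiktoken, and code or JSON usually runs denser than the estimate suggests):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token heuristic.

    This is an approximation only; the actual count depends on the
    model's learned vocabulary.
    """
    return max(1, round(len(text) / chars_per_token))

prose = "The agent summarized the report in plain language."
payload = '{"task": "summarize", "depth": 3, "format": "json"}'
print(estimate_tokens(prose), estimate_tokens(payload))
```

The heuristic is good enough for back-of-the-envelope cost planning; for billing-accurate numbers, use the tokenizer your provider actually uses.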


Why input and output are priced separately — and differently

Definition — LLM Token
An LLM token is the basic unit of text that a large language model processes — roughly 3–4 characters or about 0.75 English words. Every LLM API charges based on token count, billing input tokens (what you send) and output tokens (what the model generates) at different rates. Output tokens cost 3–5x more than input tokens because output must be generated sequentially — one token at a time — while input is processed in a single parallel pass.
— ATXP

Every major LLM provider charges more per output token than per input token. The ratio is typically 3:1 to 5:1. This isn’t arbitrary pricing — it reflects the actual computational difference between the two operations.

Input processing (reading your prompt): The model processes the entire input in a single parallel forward pass. The cost scales with length, but the architecture allows it to happen efficiently. The compute required grows roughly linearly with context length.

Output generation (writing the response): The model generates one token at a time, and each token depends on every token that came before it. This autoregressive process cannot be parallelized — token 47 has to wait for token 46. Each generation step requires a full forward pass through the model. The compute required grows with both the length of the output and the length of everything that came before it.

The practical implication: tasks that generate long outputs are disproportionately expensive relative to tasks that read long inputs. A summarization task (long input, short output) is cheaper than a drafting task (short input, long output) of comparable total token count.
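That asymmetry is easy to see with a quick cost calculation. The rates below are the Claude 3.5 Sonnet figures from the comparison table ($3 in / $15 out per million tokens); swap in your provider's numbers:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one call, given per-million-token rates."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

# Same total token count (10,500), very different bills:
summarize = request_cost(10_000, 500, 3.00, 15.00)   # long input, short output
draft = request_cost(500, 10_000, 3.00, 15.00)       # short input, long output
print(f"summarize: ${summarize:.4f}, draft: ${draft:.4f}")
```

The drafting call costs roughly four times the summarization call despite identical total token counts, purely because of where the tokens sit.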


[Figure: input tokens processed in parallel vs output tokens generated sequentially one at a time, showing why output costs more]

The model comparison you actually need

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Output/Input ratio | Best for |
|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 5:1 | Highest-capability reasoning, complex agents |
| GPT-4o | $2.50 | $10.00 | 4:1 | Complex reasoning, code, nuanced judgment |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 5:1 | Complex reasoning, long-form generation, code |
| Gemini 1.5 Pro | $1.25–$2.50 | $5.00–$10.00 | 4:1 | Long context tasks; Google ecosystem |
| GPT-4o mini | $0.15 | $0.60 | 4:1 | Classification, routing, simple extraction |
| Claude 3 Haiku | $0.25 | $1.25 | 5:1 | Fast, cheap inference; summarization; triage |
| Gemini 1.5 Flash | $0.075 | $0.30 | 4:1 | High-volume, latency-sensitive tasks |
| Llama 3 70B (hosted) | ~$0.59–$0.90 | ~$0.59–$0.90 | 1:1 (flat rate) | Open-weight workloads; no input/output split |

To make this concrete: Claude Opus 4.6 at $5/$25 per million input/output tokens. A bot built to maximize output — short prompts, long generated responses — burns roughly $0.10 per request. Five dollars buys that bot about 50 requests. A normal user in a chat conversation spends closer to $0.03 per message; five dollars gets them around 170 messages. At a $0.50/hour spend cap, an aggressive bot would need roughly 10 hours to drain a $5 balance. A real person chatting at a normal pace — $0.30 to $0.60 per hour — would rarely hit the cap at all.

This math matters for two reasons. First, it shows why a small pre-funded balance can safely power a lot of real usage before needing a top-up. Second, it shows how bot-like behavior (max output, minimal input) is identifiable through the cost profile alone — a useful signal for distinguishing real agent workloads from adversarial ones.
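The budget math above is easy to verify. The per-request token profiles here are illustrative assumptions chosen to match the dollar figures in the text; the rates are the Opus 4.6 numbers:

```python
RATES = {"claude-opus-4.6": (5.00, 25.00)}  # ($/1M input, $/1M output)

def cost(in_tok: int, out_tok: int, model: str) -> float:
    """Dollar cost of one request at the model's per-million rates."""
    in_rate, out_rate = RATES[model]
    return in_tok / 1e6 * in_rate + out_tok / 1e6 * out_rate

bot_msg = cost(500, 3_900, "claude-opus-4.6")    # max-output profile, ~$0.10
user_msg = cost(2_000, 800, "claude-opus-4.6")   # normal chat profile, ~$0.03
print(f"$5 buys the bot about {round(5 / bot_msg)} requests")
print(f"$5 buys the user about {round(5 / user_msg)} messages")
```

Note how the two profiles differ: the bot's cost is dominated by output tokens, the chat user's is more balanced, and that difference alone separates the two populations.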

"It's still early, but the most savvy users are spinning up several agents using different models, with some sitting one level higher up comparing the outputs. They're figuring out which model should be used in which use case and constantly updating their workflow."

Louis Amira — Co-founder, ATXP

A few more observations worth sitting with:

The cost difference between the cheapest and most expensive option in this table is roughly two orders of magnitude: about 80× comparing output rates, and over 300× comparing the priciest output rate to the cheapest input rate. That ratio matters a lot when you’re running hundreds of agent calls per task.

Llama-class open-weight models hosted via inference APIs (Groq, Together, Fireworks) often don’t distinguish input from output in their pricing — they charge a flat rate per token total. This makes cost modeling simpler but can be more expensive than it looks for output-heavy workloads.

The frontier models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) are genuinely better at complex reasoning tasks. The cost premium is often worth it. The mistake is using them for tasks that don’t require that capability — routing decisions, simple classification, triage — where a smaller model is indistinguishable in quality and 10–100× cheaper.


[Figure: the LLM cost spectrum from the cheapest small models to the most expensive frontier models]

How token costs compound for agents

A single LLM call is cheap. An agent is not a single LLM call.

An agent planning a multi-step task makes a new LLM call at each step: one to plan the next action, one to process the tool result, one to evaluate whether the goal is met, one to decide what to do next. Each call costs tokens. The full context — the original goal, all prior steps, all tool outputs — typically grows with each iteration, which means later calls in a run are more expensive than earlier ones.

Here’s what that looks like with a concrete example. An agent running a 10-step research task:

| Call | Input tokens | Output tokens | Cumulative cost (Claude 3.5 Sonnet) |
|---|---|---|---|
| Step 1 — plan | 500 | 200 | ~$0.005 |
| Step 3 — after 2 tool results | 2,100 | 300 | ~$0.011 |
| Step 5 — mid-run context | 4,800 | 400 | ~$0.020 |
| Step 10 — full context | 9,200 | 600 | ~$0.037 |
| Full 10-step run | ~35,000 total | ~3,000 total | ~$0.15 |

That’s $0.15 for a meaningful research task on a frontier model — reasonable. Now multiply by 1,000 agent runs per month: $150. By 10,000 runs: $1,500. The per-run cost is small; the volume math is where agents require attention.
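The compounding pattern is easy to simulate. The growth parameters below are illustrative assumptions tuned to match the 10-step run above (~35,000 input tokens, ~3,000 output tokens), at Claude 3.5 Sonnet rates:

```python
def agent_run_cost(steps: int, base_ctx: int = 500, ctx_growth: int = 670,
                   out_per_step: int = 300,
                   in_rate: float = 3.00, out_rate: float = 15.00) -> float:
    """Total cost of a run where each step re-sends a context that grows
    by ctx_growth tokens per step (rates are $/1M tokens)."""
    total = 0.0
    for step in range(steps):
        input_tokens = base_ctx + step * ctx_growth
        total += input_tokens / 1e6 * in_rate + out_per_step / 1e6 * out_rate
    return total

run = agent_run_cost(10)
print(f"per run: ${run:.3f}, per 1,000 runs: ${run * 1_000:.0f}")
```

The linear context growth makes the total input cost quadratic in the number of steps, which is why unmanaged context is the dominant cost term in long runs.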

The key levers that change this math significantly:

Model routing: Use Claude 3.5 Sonnet for complex reasoning steps; use Claude Haiku or GPT-4o mini for simple decision steps within the same run. A hybrid strategy can cut total cost by 60–80% with no perceptible quality loss on the full task.
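In code, that routing decision can be as simple as a lookup. This is a hypothetical sketch: the model names and step categories are illustrative, not a real routing API.

```python
CHEAP, FRONTIER = "claude-3-haiku", "claude-3-5-sonnet"

def pick_model(step_kind: str) -> str:
    """Send simple decision steps to a small model, reasoning steps to a large one."""
    simple_steps = {"route", "classify", "extract", "should_continue"}
    return CHEAP if step_kind in simple_steps else FRONTIER

print(pick_model("classify"), pick_model("plan_research"))
```

Production routers add confidence thresholds and fallbacks, but the core idea is exactly this: classify the step before choosing who answers it.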

Prompt caching: If your system prompt is long and consistent across runs, cache it. On Claude, prompt caching reduces cached input costs by ~90%. On OpenAI, by ~50%. For agents with long system prompts that run repeatedly, this is the largest available optimization.

Context management: Don’t carry every prior tool result forward forever. Summarize or truncate older context. A step-5 tool result that’s no longer relevant to the current step doesn’t need to be in the context window at step 10.
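A minimal sketch of that trimming policy (the message shape and the limits are assumptions, not a specific SDK's format): keep the goal and recent steps intact, and stub out older tool results.

```python
def trim_context(messages: list[dict], keep_recent: int = 4,
                 max_chars: int = 400) -> list[dict]:
    """Keep the original goal (first message) and the most recent steps
    intact; truncate older tool results to a short stub."""
    if len(messages) <= keep_recent + 1:
        return messages
    head, old, recent = messages[0], messages[1:-keep_recent], messages[-keep_recent:]
    trimmed = [
        {**m, "content": m["content"][:max_chars] + " ...[truncated]"}
        if len(m["content"]) > max_chars else m
        for m in old
    ]
    return [head] + trimmed + recent
```

Summarizing with a cheap model instead of hard truncation preserves more signal, but even this blunt version stops the quadratic context growth described above.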

Output length control: Instruct the model to be concise. Output tokens cost 3–5× more than input tokens. “Think step by step in 3 sentences” is cheaper than “think step by step” with no length guidance.


ATXP’s unified LLM gateway handles the model routing piece automatically. Your agent gets access to Claude, GPT-4o, Gemini, and Llama — all through a single endpoint, billed from the same IOU account. Switch models with a parameter change, not an account change. When a cheaper model ships that matches your quality bar, you’re already connected to it.

npx atxp

Your agent gets the full model catalog on registration. See which models are available →


[Figure: context block growing larger across each agent step, showing how token costs compound across a multi-step run]

Prompt caching in practice

Prompt caching deserves its own section because the savings are large enough to change the economics of a workflow.

The pattern: many agent runs use the same system prompt — a long document describing the agent’s persona, tools, and instructions. Without caching, you pay to process that system prompt on every single call. With caching, you pay once (at the “cache write” rate, which is typically 1.25× the normal input rate), and every subsequent call that hits the cached content pays roughly 10% of the normal rate.

For an agent with a 4,000-token system prompt making 10 calls per run, across 1,000 runs per month:

  • Without caching: 4,000 tokens × 10 calls × 1,000 runs = 40 million input tokens per month
  • With caching: pay cache-write rate once per run; cache-read rate on the 9 subsequent calls
  • Typical savings: 75–85% on the system prompt portion of input costs
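Plugging those numbers into a quick calculator shows where the savings land. The multipliers follow the ~1.25× write / ~10% read pattern described above (Anthropic-style; check your provider's current rates):

```python
def monthly_prompt_cost(sys_tokens: int, calls_per_run: int, runs: int,
                        in_rate: float, cached: bool,
                        write_mult: float = 1.25,
                        read_mult: float = 0.10) -> float:
    """Monthly input cost for the system-prompt portion, with or without caching."""
    per_call = sys_tokens / 1e6 * in_rate  # uncached cost of one pass over the prompt
    if not cached:
        return per_call * calls_per_run * runs
    # One cache write per run, then cache reads on the remaining calls.
    return (per_call * write_mult + per_call * read_mult * (calls_per_run - 1)) * runs

base = monthly_prompt_cost(4_000, 10, 1_000, 3.00, cached=False)
opt = monthly_prompt_cost(4_000, 10, 1_000, 3.00, cached=True)
print(f"${base:.2f} -> ${opt:.2f} ({1 - opt / base:.0%} saved)")
```

At Claude 3.5 Sonnet input rates, the 4,000-token prompt drops from $120/month to about $26/month: right in the 75–85% savings band quoted above.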

The implementation is minimal — a single flag in your API call. The savings are significant.


Choosing the right model for the task

"People are shocked at how cheap it is to run these models for small tasks. You don't necessarily need to ask your smartest friend what the capital of a state is. If you're not using the latest models, you can get a lot more usage for a lot cheaper. But for those that want the latest models, we typically have them live within an hour of their release."

Louis Amira — Co-founder, ATXP

The single most impactful cost decision in any LLM application is model selection. Here’s the practical framework:

Use a frontier model when:

  • The task requires genuine multi-step reasoning (not just retrieval or classification)
  • The output will be read by a human and quality matters
  • You’re writing or editing complex code
  • The task involves nuanced judgment or edge case handling

Use a smaller model when:

  • The task is a binary decision or classification
  • You’re routing between tools or categories
  • You’re extracting structured data from a known format
  • You’re doing the same task repeatedly with predictable inputs
  • Latency matters more than output quality

Use an open-weight model when:

  • You need to run inference on-premises or in a specific cloud environment
  • The task doesn’t require frontier capability and you want the lowest possible per-token cost
  • You’re building on top of a model you want to inspect, fine-tune, or modify

Most production agent workflows contain all three types of tasks. The architecture that minimizes cost while maintaining quality uses a large model for the high-stakes reasoning steps and a small model for everything else. The problem with a single-model stack isn’t just cost — it’s that you’re either overpaying on easy tasks or underperforming on hard ones.

On r/LocalLLaMA and r/MachineLearning, the recurring production insight is that most developers start with a frontier model for everything and then work backward as they understand which tasks actually need it. Starting with a cost model in mind tends to produce better architecture decisions earlier.

"Agents that are price-conscious can determine which model to route a request through. If it's not particularly complicated, they can take it to the extremely fast and cheap smaller models. If it's much more complex or needs to be a deep thought partner, they take it to the state-of-the-art models. Some people know a specific model is better at being creative, another better at writing code — the power users have their agents set up to make those decisions and route accordingly."

Louis Amira — Co-founder, ATXP

"We're constantly testing all of the models against each other for different purposes — write code first, come up with crazy ideas first. Generally, each model is known for specific things, but that could change overnight with a new model release after a research lab spends billions beefing up its skill set in a specific direction."

Louis Amira — Co-founder, ATXP

Frequently asked questions

What is a token in an LLM?

A token is the basic unit of text that a large language model processes — roughly 3–4 characters or 0.75 English words. Every LLM API charges based on token count, separately for input (what you send) and output (what the model generates back).

Why does output cost more than input for LLMs?

Input tokens can be processed in parallel in a single forward pass. Output tokens must be generated one at a time, each token dependent on the last — a sequential process that requires a full forward pass per token. That sequential computation is why output costs 3–5× more than input across all major providers.

How many tokens is a typical LLM API call?

It varies enormously by task. A simple routing decision: 200–500 tokens total. A research summary with a long document: 5,000–15,000 tokens. A full multi-step agent run: 20,000–100,000+ tokens accumulated across all calls. The key variable is how much context accumulates across steps, and whether that context is managed or left to grow unchecked.

What is prompt caching and how much does it save?

Prompt caching lets you reuse a previously processed prompt segment — typically a long system prompt — without re-paying full input rates on every call. Anthropic’s prompt caching reduces cached input costs by approximately 90%. OpenAI’s reduces by 50%. For agents with consistent system prompts running at volume, caching is typically the single largest available cost optimization.

How do I choose the right LLM model for cost?

Match model capability to task complexity. Use frontier models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) for complex reasoning and judgment. Use smaller models (GPT-4o mini, Claude 3 Haiku, Gemini 1.5 Flash) for classification, routing, and simple extraction — tasks where quality is indistinguishable and cost is 10–100× lower.

How do token costs compound for AI agents?

Each step in an agent run requires an LLM call. Context grows with each step, making later calls more expensive than earlier ones. A 10-step run with growing context costs significantly more than 10 independent 1-step calls. The mitigations: model routing (cheaper model for simpler steps), prompt caching, and explicit context management (summarize rather than carry every prior result forward).


Token pricing is mechanical once you understand it. The practical moves — matching model to task, caching repeated prompts, managing context growth, controlling output length — are each individually simple. Together they can reduce agent inference costs by 70–90% compared to a naive single-model, no-caching approach, with no meaningful quality difference on the final output.

The model you should be on is the one that’s right for each specific call in your agent’s workflow. That requires having access to the full model catalog without the overhead of separate API keys and accounts for each provider. ATXP’s unified LLM gateway gives your agent that access in one registered account — Claude, GPT-4o, Gemini, and Llama, switching with a parameter, paying from one IOU balance.

npx atxp

Ten free tokens on registration. Pay per token only. Full model catalog at docs.atxp.ai →