June 23, 2026

How Image Tokens Work: Why One Image Can Cost More Than a Page of Text

Vision models bill images as tokens but count them by size and detail, not characters. Here's how OpenAI, Claude, and Gemini differ, and how to keep image costs down.

Send a model 1,000 words of text and you can predict the cost almost exactly. Send it one screenshot and the bill can be larger, and wildly different from one provider to the next. That's because vision models don't read images as characters. They convert them into tokens based on the image's dimensions and the requested detail, and every provider uses its own rules.

Images become tokens by area, not by content

A model "sees" an image as a grid of patches or tiles. The more pixels you send, the more tiles there are, and the more input tokens you pay for. A blank white square and a dense infographic of the same size cost the same, because only the resolution matters, not what the image contains. This is the opposite of text, where cost tracks how much you actually write.

Three providers, three formulas

The major vision providers each count images differently:

OpenAI (GPT-5 family) scales the image to fit a 2048-pixel box, shrinks the shortest side to 768 pixels, then splits it into 512-pixel tiles. High detail costs an 85-token base plus 170 per tile; low detail is a flat 85 tokens no matter the size, a cheap escape hatch when you don't need fine resolution.
Claude (Anthropic) approximates tokens as roughly width × height ÷ 750, resizing very large images first. Cost grows smoothly with pixel area, with no flat-rate option.
Gemini (Google) counts about 258 tokens per 768-pixel tile, with small images (384px or smaller on both sides) charged a flat 258.
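The three rules above can be sketched in a few lines of Python. This is a rough estimator built only from the descriptions in this post, not any provider's official code. The function names are made up for illustration, and the rounding choices (rounding scaled dimensions to whole pixels, rounding partial tiles up, never upscaling small images) are assumptions. The Claude sketch also skips the resize step for very large images, since the exact limits aren't given here.

```python
import math

def openai_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate for the OpenAI tile rule as described in this post."""
    if detail == "low":
        return 85  # flat rate, regardless of size
    # Step 1: scale down to fit a 2048 x 2048 box (assumption: never upscale).
    fit = min(1.0, 2048 / max(width, height))
    w, h = round(width * fit), round(height * fit)
    # Step 2: shrink so the shortest side is at most 768 px.
    shrink = min(1.0, 768 / min(w, h))
    w, h = round(w * shrink), round(h * shrink)
    # Step 3: 85 base tokens plus 170 per 512 px tile (partial tiles count).
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

def claude_tokens(width: int, height: int) -> int:
    """Area-based estimate: width x height / 750, rounded up (assumption)."""
    return math.ceil(width * height / 750)

def gemini_tokens(width: int, height: int) -> int:
    """258 tokens per 768 px tile; small images are a flat 258."""
    if width <= 384 and height <= 384:
        return 258
    return 258 * math.ceil(width / 768) * math.ceil(height / 768)

print(openai_tokens(1024, 1024))         # 765
print(openai_tokens(1024, 1024, "low"))  # 85
print(claude_tokens(1024, 1024))         # 1399
print(gemini_tokens(1024, 1024))         # 1032
```

Note that none of the functions takes pixel data as input. Width and height are the only things that drive the estimate.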

Run the same 1024×1024 image through each and the spread is real: roughly 765 tokens on GPT-5.4, 1,399 on Claude Sonnet 4.6, and 1,032 on Gemini 3.1 Pro. Same picture, roughly a 1.8× difference between the cheapest and the most expensive, before the model writes a single word back. You can reproduce these numbers in the Image Token Calculator.

Why this wrecks naive budgets

Teams that estimate vision cost from "number of requests" get surprised, because a single high-resolution upload (a 4K screenshot, a phone photo) can tile into thousands of tokens. Multiply that across a support queue or a document-processing pipeline and image input quietly becomes the dominant line item.
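To make the scale concrete, here is a back-of-the-envelope sketch using the 768-pixel tile rule described above (258 tokens per tile). The 4K dimensions are a standard 3840×2160 screenshot. The daily volume is a made-up number for illustration, not a measurement.

```python
import math

def tile_tokens(width: int, height: int,
                tile: int = 768, tokens_per_tile: int = 258) -> int:
    """Tokens under a simple tile rule; partial tiles are rounded up (assumption)."""
    return math.ceil(width / tile) * math.ceil(height / tile) * tokens_per_tile

# One 4K screenshot: 5 tiles across x 3 tiles down = 15 tiles.
per_image = tile_tokens(3840, 2160)

# Hypothetical pipeline volume, purely illustrative.
images_per_day = 5_000
daily_tokens = per_image * images_per_day

print(per_image)     # 3870
print(daily_tokens)  # 19350000
```

Counting requests alone would record 5,000 calls. The token view shows roughly 19 million input tokens a day before any prompt text is added.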

How to keep image tokens down

Downscale before you upload. If your task works at 768px, don't send 4K. Fewer pixels means fewer tiles.
Use low detail where it's offered. For "is there a receipt in this photo?" you rarely need full resolution, and OpenAI's flat 85-token low-detail mode is dramatically cheaper.
Pick the provider that fits your image sizes. Many small thumbnails favor flat-rate behavior; a few large images change the tiling math and the ranking.
Estimate before you ship. Put your real dimensions into the Image Token Calculator, then sanity-check the dollar side in the Token Cost Calculator.
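The "downscale before you upload" tip can be checked with a little arithmetic before touching any image files. The sketch below computes the target dimensions for a longest-side cap while preserving aspect ratio, then compares estimates under the 768-pixel tile rule described earlier. Both helper names are hypothetical, and the rounding is an assumption.

```python
import math

def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Return dimensions scaled so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    # Scale both sides by max_side / longest, keeping at least 1 px.
    new_w = max(1, round(width * max_side / longest))
    new_h = max(1, round(height * max_side / longest))
    return new_w, new_h

def tile_tokens(width: int, height: int,
                tile: int = 768, tokens_per_tile: int = 258) -> int:
    """Tokens under a simple tile rule; partial tiles are rounded up (assumption)."""
    return math.ceil(width / tile) * math.ceil(height / tile) * tokens_per_tile

original = (3840, 2160)
resized = fit_within(*original, max_side=768)

print(resized)                 # (768, 432)
print(tile_tokens(*original))  # 3870
print(tile_tokens(*resized))   # 258
```

The actual resize would be done with an imaging library before upload. The point of the sketch is only that the estimate drops from 15 tiles to 1 once the longest side fits in a single tile.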

The formulas here are approximations, and providers change their image tokenization rules over time, so treat any estimate as a planning baseline and confirm it against your provider's usage dashboard. But the headline lesson holds: with images, resolution is cost.