memkeeper

What your AI coding subscription actually buys

A flat-rate AI coding subscription is a black box. You pay $100 a month, tokens go in, code comes out, and you have no idea whether you are getting a deal or leaving value on the table. The data to answer that is already sitting on your disk: coding agents such as Claude Code and Codex write a transcript of each session locally. We built a small tool that reads those transcripts and turns them into a receipt.

It is called token-burn: a single zero-dependency Python script. Nothing leaves your machine; it just sums the token usage your agents already recorded. Here is what 30 days on a $100 plan looked like when we pointed it at our own transcripts.

The receipt

30 days, one $100 plan

What you pay                  $100
API-list-equivalent value     ~$4,077
Leverage                      ~41×
Tokens                        5.03B
Billed turns                  30,216

The same usage, billed at public API list prices, would run about $4,077. That is roughly 41× the subscription price. One precise note on what that number is: it is list-price-equivalent value, not money saved. You were never going to buy 5 billion tokens at list, and the subscription is not a refund. It is a measure of how much compute the flat rate actually hands you.
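The arithmetic behind the receipt is small enough to sketch. The version below is an illustration, not token-burn's actual code: the price table is a hypothetical placeholder in USD per million tokens (not a real provider price list), and the function names are ours. Only the last line uses figures from the receipt.

```python
# Hypothetical list prices in USD per million tokens.
# Placeholders for illustration only, not any provider's real rates.
PRICES_PER_MTOK = {
    "input": 3.00,
    "output": 15.00,
    "cache_read": 0.30,
    "cache_write": 3.75,
}


def list_equivalent_usd(tokens_by_kind, prices=PRICES_PER_MTOK):
    """Price each token category at list and sum the result."""
    return sum(
        count / 1_000_000 * prices[kind]
        for kind, count in tokens_by_kind.items()
    )


def leverage(value_usd, plan_usd):
    """List-price-equivalent value divided by what the plan costs."""
    return value_usd / plan_usd


# Figures from the receipt: ~$4,077 of list-equivalent value on a $100 plan.
print(round(leverage(4077, 100)))  # 41
```

Leverage here is a plain ratio, which is why it should be read as list-price-equivalent value and not as savings.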

Where the tokens actually go

The leverage is the headline. The breakdown is the surprise. Of those 5 billion tokens, about 96% were cache reads: the model re-reading context it had already been shown. Freshly generated output, the actual new code and prose, was about 0.6%.

Read that again. Almost the entire bill is the model re-reading. The work you think you are paying for, generation, is a rounding error. The cost lives in context, not output.

This is why the subscription economics work at all: cache reads are cheap per token, so a provider can let you re-read enormous amounts of context under a flat rate. It is also where any efficiency lever has to operate. If you want to spend fewer tokens, you reduce what gets re-read, not what gets generated.
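A minimal sketch of the share calculation, assuming usage has already been summed into a dictionary keyed by token category. The category names and the example counts are illustrative: the counts are per 1,000 tokens and were chosen only to echo the 96% and 0.6% split described above, with everything else lumped into "other".

```python
def token_shares(tokens_by_kind):
    """Return each category's fraction of the total token count."""
    total = sum(tokens_by_kind.values())
    if total == 0:
        return {kind: 0.0 for kind in tokens_by_kind}
    return {kind: count / total for kind, count in tokens_by_kind.items()}


# Illustrative counts per 1,000 tokens, not measured data.
example = {"cache_read": 960, "output": 6, "other": 34}

for kind, share in token_shares(example).items():
    print(f"{kind:<12}{share:6.1%}")
```

Seen this way, halving output would move the total by a fraction of a percent, while trimming re-read context moves almost all of it.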

What the agent actually does

Tokens are one cut of the transcript. The commands are another. The same local files record every shell command the agent runs, and the distribution is its own kind of receipt. Out of roughly 21,500 shell commands in the same 30 days, here is where they went.

Top shell programs    Share of commands
echo                  19%
grep                  14%
git                   11%
python3               9%
find                  5%
cargo                 4%
ls                    4%
sed                   4%

The shape matches the token story. Lump the commands by what they do and almost a third, about 32%, is searching and reading the codebase: grep, find, ls, cat, sed. Another 16% is running code, python and cargo invoking tests and builds. 11% is git. There is nothing exotic in the long tail. The agent spends its time gathering and re-reading context, then committing, which is the same thing the tokens said from a different angle.

One honest note on the table. The single largest entry, echo, is mostly the agent printing its own section markers as it works. It is scaffolding, not work, and it is a useful reminder that a raw histogram always has a layer of the model narrating itself before you reach the real activity underneath.
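A histogram like the one above can be sketched in a few lines, assuming the shell commands have already been pulled out of the transcripts as plain strings (that extraction step depends on the harness and is not shown). The function names are ours, and the sketch deliberately counts only the first program of each command, so pipelines and `&&` chains are attributed to whatever runs first.

```python
import os
import shlex
from collections import Counter


def program_name(command):
    """Return the program a shell command starts with, without its directory."""
    try:
        words = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: fall back to plain whitespace splitting.
        words = command.split()
    return os.path.basename(words[0]) if words else ""


def program_shares(commands):
    """Return (program, share) pairs, most frequent first."""
    counts = Counter(name for name in map(program_name, commands) if name)
    total = sum(counts.values())
    return [(name, n / total) for name, n in counts.most_common()]


sample = ["grep -rn foo src", "git status", "grep bar", "/usr/bin/python3 x.py"]
for name, share in program_shares(sample):
    print(f"{name:<10}{share:.0%}")
```

Counting only the leading program is what lets scaffolding such as echo rise to the top of a raw histogram, so it is worth filtering before drawing conclusions.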

We measured it wrong the first time

The first run reported 12.3 billion tokens and 106× leverage. That was wrong. Resumed and forked agent sessions copy earlier turns into new transcript files, so a naive sum counts the same billed turn more than once. Deduplicating on the message and request IDs cut the token total by a factor of about 2.4, down to the 5.03 billion and 41× above. Those are the corrected numbers, and they are the ones in the table. The tool ships with the dedup built in so you do not inherit the same mistake.
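The dedup idea can be sketched as follows. This is an illustration of the technique, not token-burn's implementation, and the record layout is an assumption: JSON Lines transcripts where each record carries a `requestId` and a `message` object with `id` and `usage` fields. Real harnesses may name these differently.

```python
import json


def iter_unique_usage(lines):
    """Yield usage dicts once per (message id, request id) pair.

    Field names ("message", "id", "requestId", "usage") are assumed,
    not taken from any harness's documented format.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        if not isinstance(record, dict):
            continue
        message = record.get("message")
        if not isinstance(message, dict) or not message.get("usage"):
            continue
        key = (message.get("id"), record.get("requestId"))
        if key != (None, None):
            if key in seen:
                continue  # same billed turn copied into another file
            seen.add(key)
        yield message["usage"]


def total_tokens(lines):
    """Sum every integer counter in the deduplicated usage records."""
    return sum(
        value
        for usage in iter_unique_usage(lines)
        for value in usage.values()
        if isinstance(value, int)
    )
```

Because resumed and forked sessions land in different files, the `seen` set has to span all of them: pass one iterable that chains the lines of every transcript (for example with `itertools.chain`) instead of calling the function once per file.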

Run it on your own plan

The whole point is that you can check this against your own usage in about ten seconds. token-burn is standard-library Python, reads only the transcript files your tools already write, and auto-detects whichever harnesses it finds (Claude Code and Codex today).

git clone https://github.com/teflon07/token-burn
cd token-burn
python3 burn.py --plan 100

It prints your last 30 days broken down by harness and model, with the list-price-equivalent next to what you pay. Nothing is uploaded.

The fine print, and a moving target

Two caveats on the numbers. They come from our real day-to-day workflow, which already uses memkeeper for retrieval rather than re-reading everything by hand, so this is not a memkeeper-free baseline. And they are a point-in-time snapshot from one setup; as the tooling and our habits change, the figures will move with them.

What the 96% tells us holds regardless: context, not generation, is where an AI coding budget goes. How much of that context is genuinely needed versus re-read out of habit is the open question, and the one we keep poking at, including the obvious test of the same task with retrieval on versus off. Treat these as a running tally rather than a final word. We will keep sharing what we measure as it shifts, so check back.