Memory per Token Optimizations. Where are we, and how much more room do we have?
Hey everyone,
In the AI compute buildout phase, especially for inference, which is where the industry is shifting most of its capex and where the actual revenue gets generated, the binding constraint is increasingly not compute. It’s memory. Capacity and bandwidth.
I covered the supply side of this story in my memory cycle article “Every Memory Cycle Ends the Same. Until It Doesn’t,” where I argued that HBM has turned memory from a gadget component into a raw input for intelligence. In that article, I also wrote that the real risk for the memory cycle is a technical breakthrough that would require orders of magnitude less memory. Today’s article is about the other side of that exact coin: what the labs and inference providers are doing in software and model architecture to need less memory per token, how much of that optimization potential is already captured, and what it means for the hardware stack, including some unconventional setups like using depreciated H100s and A100s as dedicated decode machines.
The reason this matters now and not in two years is simple: companies are starting to hit their token spend limits. Agentic workloads (coding agents, research agents, computer-use agents) consume tokens at a rate that makes the chatbot era look like a rounding error. A single coding agent session can chew through millions of tokens of context, and companies are growing increasingly frustrated with the resulting spend.
In this article, I cover the software optimization techniques for lowering HBM usage and give my estimates of how much of that optimization is already captured. I then give my view on the best solution and how much reduction it could offer. Finally, I cover the hardware angles: SRAM accelerators like Groq and Cerebras, and the splitting of prefill and decode onto separate hardware and what that opens up.
Let’s start.
Prefill and decode: the two jobs inside every AI request
To understand why memory is the bottleneck, you first need to understand that every LLM request is actually two completely different workloads stapled together: prefill and decode.
Prefill is what happens when you send the model your input: the prompt, the document, the codebase, the conversation history. The model reads all of it and processes every input token in parallel, in one big pass. Think of it as an analyst who gets handed a 300-page data room before a meeting. He reads the whole thing in one sitting, takes structured notes on every page, and files those notes away. This is brute-force work; the limiting factor is how fast his brain works, not how fast he can pull pages out of the binder. In GPU terms, prefill is compute-bound: the chip’s arithmetic units (the FLOPs) are the bottleneck, and the memory system can keep up.
Decode is what happens when the model generates its answer, one token at a time. To generate each single token, the model has to read essentially all of its weights from memory, plus all of the notes it took during prefill (the so-called KV cache, more on this later in the article). Then it does a comparatively tiny amount of math, produces one token, and does the entire memory read again for the next token. Our analyst is now in the meeting, and before he speaks each individual word, he has to re-skim his entire stack of notes. The bottleneck is no longer his brainpower. It’s how fast he can flip pages. Decode is memory-bandwidth-bound.
You can put numbers on this. An Nvidia H100 delivers roughly 989 TFLOPS of dense BF16 compute against 3.35 TB/s of HBM3 bandwidth. That ratio means the chip needs to perform roughly 295 floating point operations for every byte it pulls from memory just to keep its compute units fed. Prefill, processing thousands of tokens in parallel, easily clears that bar. Decode doesn’t come close: roofline analyses of autoregressive generation put its arithmetic intensity at roughly 1 FLOP per byte, about two orders of magnitude below the compute-bound ridge point. In plain English: during decode, the most expensive compute engines on the planet sit idle 95%+ of the time, waiting for memory.
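For readers who want to check the ratio themselves, here is a minimal Python sketch of the roofline arithmetic. The two hardware figures are the ones quoted above; the function names are my own, and the 1 FLOP per byte figure for decode is the rough unbatched estimate from the text, not a measurement.

```python
# Roofline ridge point: FLOPs the chip must do per byte read to stay compute-bound.
H100_DENSE_BF16_FLOPS = 989e12  # FLOP/s, figure quoted in the text
H100_HBM_BANDWIDTH = 3.35e12    # bytes/s, figure quoted in the text


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs per byte at which compute and memory are equally limiting."""
    return peak_flops / bandwidth_bytes_per_s


def compute_utilization_ceiling(arithmetic_intensity: float, ridge: float) -> float:
    """Upper bound on the fraction of peak FLOPs a workload can actually use."""
    return min(1.0, arithmetic_intensity / ridge)


ridge = ridge_point(H100_DENSE_BF16_FLOPS, H100_HBM_BANDWIDTH)

# Assumption: unbatched decode at roughly 1 FLOP per byte, as in the text.
decode_util = compute_utilization_ceiling(1.0, ridge)

print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"decode compute utilization ceiling: {decode_util:.2%}")
```

Under these assumptions the ridge point comes out at about 295 FLOPs per byte, and unbatched decode can use well under 1% of peak compute. Batching raises the arithmetic intensity, which is why real deployments do better than this floor.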
The simplest illustration: a 70B parameter model in FP16 is ~140GB of weights. On an H100 with 3.35 TB/s of bandwidth, the theoretical single-user decode ceiling is about 3,350/140 = 24 tokens per second. On an A100 with 2 TB/s, it’s about 14 tokens per second. Notice what’s not in that equation: FLOPs. You could double the H100’s compute and single-stream decode speed wouldn’t move at all. The only lever is bandwidth, or reading fewer bytes.
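The same back-of-envelope math as a small Python sketch. It assumes every decode step reads all the weights exactly once and ignores KV cache reads, kernel overhead, and the fact that 140GB would have to be sharded across several GPUs. The function name is my own, and the FP8 line is a hypothetical illustration of the "reading fewer bytes" lever, not a claim about any specific deployment.

```python
def decode_ceiling_tokens_per_s(
    params_billions: float, bytes_per_param: float, bandwidth_tb_s: float
) -> float:
    """Theoretical single-stream decode ceiling: bandwidth / bytes read per token."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


# 70B parameters at 2 bytes each (FP16) = 140 GB of weights.
h100_fp16 = decode_ceiling_tokens_per_s(70, 2, 3.35)
a100_fp16 = decode_ceiling_tokens_per_s(70, 2, 2.0)

# Hypothetical: the same model at 1 byte per parameter (e.g. FP8) halves the read.
h100_fp8 = decode_ceiling_tokens_per_s(70, 1, 3.35)

print(f"H100, FP16: {h100_fp16:.1f} tokens/s")
print(f"A100, FP16: {a100_fp16:.1f} tokens/s")
print(f"H100, FP8 (hypothetical): {h100_fp8:.1f} tokens/s")
```

Note that no FLOPs figure appears anywhere in the function: in this simplified model, the ceiling moves only when bandwidth or bytes per token change.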
This is the “memory wall,” and it’s also why a lot of inference revolves around batching. If reading 140GB of weights produces one token for one user, that’s terrible economics. If the same 140GB read produces one token each for 200 users simultaneously, your cost per token just dropped ~200x. The weights are read once, amortized across the batch. So the game every inference provider plays is: cram as many concurrent users as possible onto each GPU. And what limits how many users you can cram on? Memory capacity. Because every user brings their own luggage.
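A rough sketch of the amortization argument in Python. This is a deliberately idealized model: it assumes one full weight read per decode step regardless of batch size and ignores each user's own memory traffic, so it shows the best case. The names and constants are mine, taken from the 70B FP16 example above.

```python
WEIGHT_BYTES = 140e9       # 70B parameters in FP16, as in the example above
HBM_BANDWIDTH = 3.35e12    # bytes/s, H100 figure quoted earlier


def aggregate_tokens_per_s(batch_size: int) -> float:
    """Tokens per second across the whole batch, weights read once per step."""
    step_time_s = WEIGHT_BYTES / HBM_BANDWIDTH  # one full weight read per step
    return batch_size / step_time_s             # each step yields one token per user


def relative_cost_per_token(batch_size: int) -> float:
    """Cost per token relative to serving a single user on the same hardware."""
    return aggregate_tokens_per_s(1) / aggregate_tokens_per_s(batch_size)


for batch in (1, 8, 64, 200):
    print(
        f"batch {batch:>3}: {aggregate_tokens_per_s(batch):8.0f} tokens/s total, "
        f"relative cost per token {relative_cost_per_token(batch):.4f}"
    )
```

In this idealized model the cost per token falls as 1/batch, which is where the ~200x figure comes from. Real systems fall short of that because every user in the batch adds memory of their own.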
The luggage: KV cache, and why agents made it explode
That luggage is the KV cache. During prefill, the model stores intermediate “key” and “value” representations of every token in the context, so that during decode it doesn’t have to re-process the whole prompt for every new token. Those are the analyst’s notes. The catch: the notes grow linearly with context length, and they have to sit in the same precious HBM as the model weights.
In a standard transformer, Llama 3.1 405B needs 516 KB of KV cache per token of context; Qwen-2.5 72B needs 327 KB per token. Run that forward: a single user with a 128K-token context on a 70B-class model is carrying roughly 40GB of KV cache. At one million tokens of context, a Llama-70B-scale model would need ~330GB in BF16 for the KV cache alone, which doesn’t fit in any single GPU. For reference, the H100 has 80GB total. The B300 has 288GB.
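Here is a short Python sketch of where those per-token figures come from. The formula is the usual one for a transformer KV cache: one key vector and one value vector per KV head per layer. The layer counts, KV head counts, and head dimensions are assumptions taken from the publicly released model configs, not numbers stated in this article, and 2 bytes per value assumes BF16.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """KV cache bytes stored per token of context (keys plus values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed configs (from the published model cards, not from this article):
llama_405b = kv_bytes_per_token(num_layers=126, num_kv_heads=8, head_dim=128)
qwen_72b = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

print(f"Llama 3.1 405B: {llama_405b / 1e3:.0f} KB per token")
print(f"Qwen-2.5 72B:   {qwen_72b / 1e3:.0f} KB per token")

# Per-user KV cache for a 70B-class model at two context lengths.
for context_tokens in (128 * 1024, 1_000_000):
    total_gb = qwen_72b * context_tokens / 1e9
    print(f"{context_tokens:>9,} tokens: {total_gb:.0f} GB of KV cache")
```

With these assumed configs the formula reproduces the 516 KB and 327 KB figures, and the 70B-class numbers land at roughly 43GB for a 128K context and roughly 328GB for one million tokens.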
In the chatbot era, this was manageable because most conversations were a few thousand tokens. Agentic workloads change that. An agent doing a long coding task holds the repo, the tool outputs, the execution traces, the full plan, all of it, in context, for hours. Context lengths of 100K-1M tokens went from research demo to daily production workload. And every one of those tokens occupies HBM for the entire duration of the session. The KV cache, not the model weights, becomes the dominant consumer of memory. Which means batch sizes collapse, which means the weight-read amortization collapses, which means cost per token explodes.
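To make the batch-size collapse concrete, here is a rough Python sketch. The setup is hypothetical: a node with eight 80GB H100s treated as one 640GB pool, 140GB of FP16 weights, and about 327 KB of KV cache per token for a 70B-class model. It ignores activations, fragmentation, and parallelism overhead, so real numbers would be lower.

```python
HBM_TOTAL_BYTES = 8 * 80 * 10**9   # hypothetical node: 8 x H100 at 80 GB each
WEIGHT_BYTES = 140 * 10**9         # 70B parameters in FP16
KV_BYTES_PER_TOKEN = 327_680       # 70B-class model in BF16, per token of context


def max_concurrent_users(context_tokens: int) -> int:
    """Users that fit in the HBM left over after the weights are loaded."""
    free_bytes = HBM_TOTAL_BYTES - WEIGHT_BYTES
    kv_bytes_per_user = KV_BYTES_PER_TOKEN * context_tokens
    return free_bytes // kv_bytes_per_user


for context_tokens in (4 * 1024, 128 * 1024, 1_000_000):
    users = max_concurrent_users(context_tokens)
    print(f"{context_tokens:>9,} tokens of context: {users:>3} concurrent users")
```

Under these assumptions the same node goes from a few hundred concurrent chatbot-length sessions to about eleven at 128K tokens and a single session at one million tokens. Since the weight read is amortized across the batch, that drop in batch size feeds straight into cost per token.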
So what is the industry doing about it? A lot, actually. Let’s go through the software optimization stack, and, importantly for investors trying to model how much efficiency is still on the table, my estimate of how much of each technique’s potential has already been captured.
The optimization stack: where we are on each curve
A quick framing note: the percentages below are my own estimates of “captured potential” at the frontier (the major labs and serious inference providers), based on what’s publicly documented. The long tail of enterprise deployments is far behind the frontier on all of these, which is itself an investment-relevant point: there is a lot of “free” efficiency still sitting unused in corporate AI deployments.
1. Continuous batching


