Why Token Optimization Is a Gift to the Hyperscalers
Hey everyone,
A few weeks ago I wrote a piece called Most of the Economy Won’t Run on the Best Model, where I argued that the AI market will bifurcate: the frontier model goes to the small slice of work where intelligence is unbounded in economic value (drug discovery, novel math, the hardest agentic reasoning), and the middle of the economy — classification, extraction, summarization, routine code, support — runs on the cheapest model that clears the quality bar.
This article is the sequel to that one. Who captures the value when the world switches from token maxing to token optimization?
The shift away from always-buy-the-best-model is, on its surface, bearish for the AI labs and looks like it should compress the whole stack. But it is quietly one of the most bullish structural setups for the three hyperscalers — Microsoft, Amazon, and Google.
Let me explain why.
Think about a highway toll road. There are two businesses operating on it. The first is the company that manufactures the cars — they make the actual machine that does the work of getting you somewhere, and there’s a real margin in a car. The second is the company that owns the tollbooth. The tollbooth owner doesn’t care what car you drive. Ferrari, Toyota, or a 12-year-old used Honda. Every one of them pays the same toll to cross the bridge. Now imagine a world where, suddenly, everyone realizes they were commuting to work in a Ferrari for no reason, and they all downgrade to the cheaper Honda to save money. The carmaker’s revenue per vehicle collapses, but people don’t drive less because they switched to a cheaper car. They drive more, because now it’s cheap enough to justify trips they’d never have taken before. And every one of those extra trips still crosses the bridge. The tollbooth’s revenue goes up.
In the AI stack, the AI labs are the carmakers. The hyperscalers are the tollbooths. And in the coming months, we are entering a period where most of the economy trades down from the Ferrari to the Honda while driving 10x more miles.
From token maxing to token optimization
For the last 18 months, the dominant behavior in enterprise AI was token maxing. You found the single best model on the leaderboard, you pointed every workload at it, and you didn’t think too hard about cost, because the whole thing was a pilot and the bill was small relative to the perceived upside of “does this even work?”
That era is ending fast. Companies are essentially blowing through their AI budgets in a quarter. Altman said recently that the “my company spent my entire 2026 budget in Q1, can you make this more efficient?” complaint went from something that “never came up” to “all of a sudden a huge issue.” And it’s not just him; you can see it across the industry, from companies like Salesforce and Meta to many other smaller ones blowing their planned yearly budget in a matter of days.
The natural behavior change from this is that instead of having one model for everything, companies start routing: a small, cheap, often open-weight model handles the 80% of requests that are less complex, and only the genuinely hard requests escalate to the frontier. This is token optimization, and it doesn’t reduce total token consumption — it accelerates it. The moment inference gets cheap enough, you stop rationing it. You run the agent in a loop. You let it read the whole codebase. You re-run it five times and vote on the answer. A single coding-agent session now chews through millions of tokens of context where a chatbot query used a few thousand.
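To make the routing idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two model names are placeholders, the $1 and $50 per million tokens are the round numbers used elsewhere in this piece rather than real quotes, and the keyword check is a toy stand-in for whatever complexity classifier a real router would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_million_tokens: float  # illustrative price, not a real quote


# Hypothetical two-tier catalog; names and prices are placeholders.
CHEAP = Model("small-open-weight", 1.0)
FRONTIER = Model("frontier", 50.0)

# Toy stand-in for a real complexity classifier (plain substring match).
HARD_HINTS = ("prove", "refactor", "architecture", "multi-step")


def route(prompt: str, max_cheap_words: int = 200) -> Model:
    """Send long or hard-looking prompts to the frontier model, the rest to the cheap one."""
    text = prompt.lower()
    looks_hard = any(hint in text for hint in HARD_HINTS)
    too_long = len(text.split()) > max_cheap_words
    return FRONTIER if (looks_hard or too_long) else CHEAP


def blended_price(prompts: list[str]) -> float:
    """Average per-million-token price across a batch, assuming equal tokens per prompt."""
    models = [route(p) for p in prompts]
    return sum(m.usd_per_million_tokens for m in models) / len(models)
```

With four routine prompts and one hard one, the blended price in this toy setup lands far below the frontier price, which is exactly why the rationing stops.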
You can see this in the hard numbers the hyperscalers themselves disclose. Microsoft said it processed over 100 trillion tokens in a single quarter in 2025, up 5x year-over-year, with a record 50 trillion in one month alone. By its fiscal Q3 2026 call, Microsoft said over 300 customers were on track to process more than a trillion tokens each on Foundry this year — and that this was accelerating 30% quarter-over-quarter. Google went from 480 trillion tokens per month at I/O in May 2025, to 980 trillion by July, to 1.3 quadrillion by October — and in its Q1 2026 filing disclosed that its first-party models alone were processing more than 16 billion tokens per minute via direct API, up 60% in a single quarter.
At the same time, the price per unit of capability is falling substantially: the Stanford HAI AI Index found inference cost for GPT-3.5-level performance fell more than 280-fold in two years, and a16z pegs the decline at roughly 10x per year for any fixed capability level. And yet, total tokens processed are growing several-fold per year.
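For readers who want to put those figures on a common footing, here is a rough back-of-the-envelope sketch. It uses only numbers quoted above, plus two assumptions of mine: that May to October counts as five months, and that a five-month growth rate can be naively compounded to a full year. The two series also measure different things (the price of a fixed capability level versus one provider's total volume), so they should not be multiplied together.

```python
# Figures quoted in the text above.
cost_decline_two_years = 280.0   # GPT-3.5-level inference cost, fold decline over two years
tokens_may_2025 = 480e12         # Google monthly tokens, May 2025
tokens_oct_2025 = 1.3e15         # Google monthly tokens, October 2025
months_between = 5               # assumption: May to October counted as five months

# Constant annual rate that compounds to 280x over two years.
annual_cost_decline = cost_decline_two_years ** 0.5

# Volume multiple over the window, then a naive annualization of it.
volume_multiple = tokens_oct_2025 / tokens_may_2025
annualized_volume_growth = volume_multiple ** (12 / months_between)

print(f"cost decline per year: {annual_cost_decline:.1f}x")
print(f"volume growth over {months_between} months: {volume_multiple:.2f}x")
print(f"volume growth, naively annualized: {annualized_volume_growth:.1f}x")
```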
Where the margin lives in a routed token economy
When a company calls the SOTA model directly through, say, an AI lab’s first-party API, the lab captures the full economic rent of the token. Many industry reports have surfaced on the margins that providers like Anthropic earn right now, and several claim gross margins rose substantially this year, to as high as 70%. That price embeds the lab’s R&D, its brand, its leaderboard position: call it the “model-provider margin.”
And to be fair, token optimization might not hit margins at these frontier labs directly in the short term (there will always be use cases for the best AI model), but it does mean growth will be slower than it would be if everyone stayed on the frontier path for every use case.
Now suppose the company routes that same workload to an open-weight model (GLM 5.2, DeepSeek V3.2, Qwen3 Coder, Kimi K2.5, Llama, MiniMax M2.5) running on a hyperscaler’s managed inference. The “model-provider margin” essentially goes to zero, because the model is open-weight and nobody is charging a brand premium for it. But the token still has to run on somebody’s GPUs, inside somebody’s data center, behind somebody’s managed API with its security, compliance, logging, and SLAs. And that somebody is a hyperscaler. It still charges its full infrastructure margin on the token. AWS has historically run a roughly 35–38% operating margin. Google Cloud, which lost money for years, posted operating margins above 17% in Q1 2026 and is still climbing. That margin doesn’t care whether the token came from a $50/million frontier model or a $1/million open-weight model. To return to the analogy from before, the tollbooth charges the same toll regardless of the car.
So here is the structural beauty of it for the hyperscalers, and the structural danger for the labs:
Per-token economics compress, but the compression lands almost entirely on the model layer, not the infrastructure layer. The AI lab’s margins on simple workloads get squeezed. The hyperscaler’s infrastructure margin is sticky.
Total token volume explodes, and almost every single token crosses the hyperscaler’s tollbooth. More usage, on a partly-depreciated, increasingly efficient installed base, means absolute infrastructure revenue and gross profit dollars go up even as the price of any individual token falls.
This is Jevons’ paradox pointed directly at the cloud P&L. The hyperscalers squeeze more revenue out of the infrastructure they already own, and they don’t need the model-provider margin to do it. Outside of Google, they were never in that business in the first place.
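The two points above can be put into a toy model. This is a sketch under loud assumptions, not a forecast: I treat every token as paying a flat infrastructure fee of $1 per million (the open-weight price used earlier), treat the remaining $49 of the $50 frontier price as model-provider premium, and compare an all-frontier world with one where volume is 10x higher and only 20% of tokens still go to the frontier. The function name and parameters are hypothetical.

```python
def split_revenue(total_tokens_m, frontier_share, infra_fee=1.0, frontier_premium=49.0):
    """Return (infra_revenue, model_premium_revenue) in USD.

    total_tokens_m is volume in millions of tokens; fees are USD per million.
    Assumption: every token pays the same infra fee, and only frontier tokens pay a premium.
    """
    infra = total_tokens_m * infra_fee
    premium = total_tokens_m * frontier_share * frontier_premium
    return infra, premium


# Before: 100M tokens, all on the frontier model.
before_infra, before_premium = split_revenue(100, 1.0)

# After: 10x the volume, 80% routed to open-weight models.
after_infra, after_premium = split_revenue(1000, 0.2)

# Average premium per million tokens is where the compression shows up.
before_premium_per_m = before_premium / 100
after_premium_per_m = after_premium / 1000
```

In this setup the infrastructure line grows in step with volume, while the average premium earned per token falls by the share of traffic that left the frontier. Real infrastructure fees are unlikely to be identical across model sizes, so treat the flat fee as the tollbooth analogy taken literally.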
The orchestration layer brings real value
If the future is multi-model, and it clearly is, then someone has to own the orchestration: the layer that decides which model handles which request, holds the fine-tunes, runs the agent loop, manages memory and tool-calling, and handles fallback when a model is down. This layer can bring immense value to whoever owns it, and all three hyperscalers are naturally positioned to be that owner.
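As one small piece of what that layer does, here is a minimal sketch of fallback across models. It is not how any particular hyperscaler implements it; the function and the stub backends are hypothetical, and a production harness would add retries, timeouts, and narrower error handling.

```python
from typing import Callable, Sequence

ModelCall = Callable[[str], str]


def call_with_fallback(prompt: str, backends: Sequence[tuple[str, ModelCall]]) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, reply) from the first that succeeds."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real harness would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Stub backends standing in for real model endpoints.
def primary_down(prompt: str) -> str:
    raise TimeoutError("primary model unavailable")


def secondary_ok(prompt: str) -> str:
    return f"echo: {prompt}"
```

Calling `call_with_fallback("hi", [("primary", primary_down), ("secondary", secondary_ok)])` skips the failing stub and answers from the second one. The caller never needs to know which model replied, which is the sense in which the model becomes an interchangeable component.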
Amazon’s Bedrock is the clearest example. It’s no longer “a place to call Claude.” As of 2026, the catalog spans 18 providers and 110+ individually addressable model variants, and AWS bolted on Intelligent Prompt Routing, which automatically routes each request to the cheapest model in a family that can handle it — AWS claims up to 30% cost savings with no accuracy loss. On top of that sits Bedrock AgentCore, a production agent harness with built-in Runtime, Memory, Gateway, Browser, Identity, and Observability. Bedrock reached a multi-billion-dollar annualized run rate with customer spend growing 60% quarter-over-quarter across 100,000+ customers.
Microsoft’s Azure AI Foundry and Google Vertex are the same play, although Google’s preferred scenario is where the Gemini model family dominates the workloads.
The hyperscalers are naturally positioned to be the orchestration layer because companies already keep their data, security perimeter, billing, and compliance in the cloud. From there, they can pick from a menu of models. The lab moves from the front end, the thing the customer chooses and bonds with, to the back end, an interchangeable component sitting behind the hyperscaler’s harness. Customers also start building the fine-tunes, the RAG pipeline, the agent scaffolding, and the eval suite at the cloud orchestration layer, not at the model layer; switching models becomes routine rather than a migration project. One of my readers, who runs his own harness, put it perfectly in the comments of my last piece: the harness is worth as much as the model itself, and you’re paying for it either way.
For the labs, the moat becomes shallow in most use cases but remains important in the most complex ones (those where the economic upside is not capped). The hyperscalers, by contrast, benefit across all the use cases that run the economy.
One might ask why companies couldn’t move outside the cloud altogether. Beyond the usual reasons a big part of workloads migrated from on-prem to cloud in the first place (easier to scale, use, collaborate, manage), there is now a compute shortage: infrastructure outside the cloud is even harder to get, and managing it is a much more difficult job. Another important aspect that I think is forming is cybersecurity. Given the recent developments here, it is clear that only a handful of companies will have first-row access to the most capable cyber models, so they can shield their systems before the model becomes available to everyone else. If you are an enterprise, you want safety, and it seems the only way to get it is to have your infrastructure at one of those preferred partners. The hyperscalers are those partners.
The hyperscalers’ job, then, is almost simple to state: enable as many models as possible, make routing and fine-tuning and observability frictionless, and let customers optimize tokens to their heart’s content. Every model they add makes their tollbooth more valuable.
The government approval window just made this worse for the labs
Now layer on a development from June that I don’t think the market has connected to this thesis at all: the June 2, 2026 executive order, Promoting Advanced Artificial Intelligence Innovation and Security.
The headline version is that it sets up a voluntary framework where developers of “covered frontier models” — designation determined by the NSA via a classified benchmarking process focused on cyber capabilities — give the federal government access to the model for up to 30 days before release to “other trusted partners.” (The earlier draft had a 90-day window; it was cut to 30 to avoid blunting U.S. competitiveness.) The order pointedly does not create mandatory licensing — but it establishes a structured pre-release evaluation pathway, an AI cybersecurity clearinghouse run out of Treasury, and a government role in selecting which “trusted partners” get early access. The catalyst, per CFR’s reporting, was rising concern about models like Anthropic’s Claude Mythos being able to autonomously find and exploit software vulnerabilities. Commerce ordered Anthropic to cut off non-U.S. access to its Mythos 5 and Fable 5 models on export-control grounds, and those models were pulled from Bedrock days after launch. Now there is growing concern that, even when models are released to the public, U.S. citizens may get access before people from other countries. Hyperscalers have clients worldwide, so it becomes even more important for them to offer multiple models to companies and for companies to have an orchestration layer that isn’t just one AI lab, since they will have to juggle geopolitical compliance as well.
It strengthens the case


