Hi all,
In recent weeks, after the launch of OpenAI's o1 model, I have researched the new model and its repercussions for the whole industry. I want to share some findings because I find them very significant for the entire AI investment landscape. I also had the privilege to talk to one of the best industry experts on AI inference, Sunny Madra, General Manager at GroqCloud (Groq), the $2.8B AI Inference startup. For those who haven't seen my talk with Sunny, I encourage you to do so, as it is truly very insightful for any AI & semiconductor investor. You can listen to it HERE.
AI workloads
We already know that we can divide AI workloads into two categories: training workloads, used for training new LLMs, and inference workloads, used for running and using the models. In my article from August, GenAI breakthroughs, I already mentioned that I believe inference workloads will be a far bigger market than training and that Nvidia will not have the monopolistic position it has in the training market. It is also fair to say that a lot of the current GPU CapEx cycle so far has been driven by training demand, as five key companies are racing to make the newest and best model: OpenAI, Google, Meta, Anthropic, and xAI.
However, OpenAI's o1 model is changing the landscape and the forces in the industry to a considerable extent. Before we start, I also want to emphasize that the way I am going to articulate this is not going to be strictly scientific, as I want to focus on the business aspects of these new technology changes. It will be more for the tech investor community than the scientific community (so for all the AI scientists out there reading this: bear with me).
OpenAI's o1 model
I will not break down the o1 model here, as you can read OpenAI's technical report for that. Instead, I want to focus on one key piece of the model that I find really important: the inference part.
I find these two visualizations very effective as a starting point for better understanding. The first visual here is the “Standard” LLM model. By standard, I mean every model before the new o1 model.
You have three main parts to making an LLM: the pre-training part, the post-training part, and the inference.
Pre-training is the initial phase, where the model learns to understand and generate human language by being exposed to vast amounts of data.
Then there is the post-training phase. Post-training is used to further refine, specialize, and optimize the model for specific tasks or to align it more closely with human preferences and practical applications.
The post-training part is then broken down into different steps, like Supervised Fine-Tuning on Instructional Tasks (SFT-IT) and Reinforcement Learning. Reinforcement Learning can be from Human Feedback (RLHF) or AI Feedback (RLAIF). Without going too much into the details, the point of reinforcement learning is to fine-tune the model to make it more useful and accurate and, often, to integrate ethical and safety considerations into its responses.
The last part is »Inference«. This is the »production« stage, where the model, based on prompts from users (think of them as queries), generates predictions (answers).
Now that we know how the »standard« LLM is built, let's look at how OpenAI's o1 appears to be built:
There are two key differences. One is the »Self Play/Tree/RL« part in the post-training phase, and the second is a much longer Inference stage made up of Reasoning Traces/Trees.
The »Self Play« part is a technique where an AI agent (or several of them) plays against itself to learn and improve over time. Through this method, the model generates its own training data by interacting with itself.
Tree Search is a decision-making process that involves systematically exploring a "tree" of possible actions and outcomes. The model simulates and evaluates various potential future actions and uses those results to inform the current decision. This is very useful for solving complex problems that require multiple steps.
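To make the concept more tangible, here is a minimal, purely illustrative sketch of a search over a tree of candidate reasoning steps. The `propose_steps` and `score` helpers are hypothetical stand-ins for model calls; OpenAI has not published o1's actual search procedure, so treat this only as an intuition pump.

```python
import heapq

def tree_search(problem, propose_steps, score, max_expansions=50):
    """Best-first search over partial reasoning paths; returns the best complete one."""
    # Each frontier entry: (negative score, tie-breaker id, steps taken so far)
    frontier = [(-score(problem, []), 0, [])]
    best_path, best_score, next_id = None, float("-inf"), 1

    while frontier and max_expansions > 0:
        _, _, steps = heapq.heappop(frontier)       # expand the most promising path
        max_expansions -= 1
        for step in propose_steps(problem, steps):  # simulate possible next actions
            path = steps + [step]
            s = score(problem, path)                # evaluate the hypothetical outcome
            if step == "DONE":                      # a complete candidate answer
                if s > best_score:
                    best_path, best_score = path, s
            else:
                heapq.heappush(frontier, (-s, next_id, path))
                next_id += 1
    return best_path, best_score
```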
Then there is the significant change that we already mentioned: a much longer inference time and, with it, Reasoning Traces/Trees. This means that the model keeps a record of the steps it took to arrive at the final answer (prediction). It also means that the model comes up with different answers (predictions), evaluates them, and decides on the best answer for that question. This is similar to the Chain-of-Thought prompting that could already be used in pre-o1 models, where users encourage the model to explain its reasoning and steps rather than give a direct answer.
To simplify, the o1 model has a backtracking ability. The model predicts something, realizes it did something wrong, goes back, erases that step, and predicts again from that point.
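Purely as an illustration again, a backtracking loop over a reasoning trace could look something like the sketch below. The `generate_step` and `looks_wrong` helpers are hypothetical stand-ins for a model call and a verifier; this is not OpenAI's published method.

```python
def solve_with_backtracking(question, generate_step, looks_wrong, max_steps=20):
    """Build a reasoning trace step by step, erasing steps that turn out to be wrong."""
    trace = []                               # the record of steps taken so far
    for _ in range(max_steps):
        step = generate_step(question, trace)
        trace.append(step)                   # tentatively commit the new step
        if looks_wrong(question, trace):
            trace.pop()                      # realize it's wrong, erase it, try again
            continue
        if step.startswith("FINAL:"):        # convention: the answer is marked FINAL
            return step, trace
    return None, trace                       # ran out of budget without a final answer
```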
The most significant implication of this kind of model is that inference workloads should grow substantially more than we were expecting in the pre-o1 period.
The calculation for inference is now not just the number of users multiplied by the number of times they use the model. The model can now spend 10x or even more inference compute to come up with an answer, so inference also becomes part of the accuracy process.
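To put some entirely made-up numbers on it (none of these figures come from OpenAI; they just illustrate the multiplication):

```python
# Back-of-the-envelope illustration with hypothetical numbers.
daily_users          = 10_000_000   # assumed active users
queries_per_user     = 10           # assumed queries per user per day
tokens_per_answer    = 1_000        # tokens a "standard" model might generate per answer
reasoning_multiplier = 10           # assume an o1-style model spends ~10x more tokens "thinking"

standard_tokens = daily_users * queries_per_user * tokens_per_answer
o1_style_tokens = standard_tokens * reasoning_multiplier

print(f"standard model: {standard_tokens:,} tokens/day")   # 100,000,000,000
print(f"o1-style model: {o1_style_tokens:,} tokens/day")   # 1,000,000,000,000
```

The extra order of magnitude is inference compute that now buys accuracy, not just more users served.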
The second big implication for investors is that inference compute is now becoming a new scaling paradigm. You not only scale the model with data and training compute, as before, but you can also scale it with more inference.
Noam Brown, an OpenAI researcher, has said that an AI study on the board game Hex found that having 15x the inference compute is equivalent to having 10x the training compute.
The fact that you can now scale LLMs via inference means that:
A. Smaller models that are given more inference compute can be as good as models with more parameters that use less inference compute.
B. Inference compute is much cheaper than training compute, but the market for inference will be vastly bigger. In my discussion with Sunny, I asked him how big he, as an industry insider, thinks the inference market will be. Sunny revealed that he had the chance to preview an interview with Jensen Huang, the CEO of Nvidia, in which Jensen said that inference will be 1 billion times larger than training. Sunny added that it makes sense when you consider that a model is going to be used billions of times before it is updated (trained) again.
It is also important to note that the Inference chip market has much more competition than the training market, where Nvidia dominates. From an industry expert:
»Training also is notoriously hard because you need special architectures and special cards and interconnects between the cluster and RDUs and stuff like that. It's mostly dominated by NVIDIA because they've done the best work there. Inference is interesting because inference can be done anywhere. Inference is very, very easy to do on any hardware. Training is harder.«
source: AlphaSense
This means that other companies besides Nvidia will be able to reap the benefits of inference chips. It also means margins on inference chips are not going to come close to Nvidia's margins on its training GPUs, where it basically has a monopoly.
It also opens a path for some companies to lower their costs: instead of going heavy on training GPUs and scaling there, they can shift some of that spending to inference chips and still scale their models. Inference is vastly cheaper for customers than training.
It seems LLM companies will also have fewer data problems. We know by now that most of them already use synthetic data to train new LLMs, but there could also be a potential new shift here, as Lukasz Kaiser, another OpenAI researcher on the o1 launch, hinted:
The multiple Chains of Thought (CoTs) that these new models produce in the process (referred to as hidden CoTs) can also be a good source of training data for future models. In my recent discussion with Sunny, he mentioned that the newest models have already used most of the internet data available and now have at their disposal internet data that is created in real time, data siloed in private databases like ERP/CRM systems, and synthetic data that previous versions of LLMs generate. In a world where LLMs have already been trained on all the internet data available, having them train on synthetic data is important, and having CoTs as potential high-quality synthetic data can be even more valuable.
There is a reason why OpenAI is hiding the chain of thought from the user with its o1 model. It makes it much harder for competitors to distill and replicate these capabilities, or to use them as training data for their own future models.
So, after breaking down some technical topics, what does this mean from a business perspective for companies in this industry?
The companies best positioned for the inference market
Because of this change in how new models use inference, we as investors are underestimating the size of the inference market. As we acknowledged before, AI inference will be much bigger than training in terms of usage, but it will also be much cheaper, with lower margins than training, since Nvidia doesn't have a monopoly there.
For easier visualization, I made this diagram of some of the leading companies in the AI inference market:
Looking at the »Design Layer« and the »Foundry« layer of the diagram, what comes to my mind is that we might be facing another bottleneck when it comes to chip manufacturing (foundries). For a chip designer, Samsung and especially Taiwan Semiconductor are the only real options for manufacturing leading-edge new chips. And even Samsung recently had some issues, admitting it had fallen slightly behind. So one thing we might have to keep in mind is that if supply at the foundry level tightens up, a company like Taiwan Semi might start to shift orders toward bigger customers with which it has longer-standing partnerships. If that were to happen because of supply constraints, short-term margins on inference chips might still be higher than where they will eventually end up in the long run, once there is enough supply. Here, the companies that have the best relationship with Taiwan Semi would be in a better position. The way Taiwan Semi decides to allocate capacity is well described by a former employee of GlobalFoundries (the 4th largest semiconductor foundry in the world):
» Again, there are multiple scenarios. When I do not have enough capacity, let's say, to fulfill all my customer needs, then it's going back to the earlier contract. If the price is fixed or depending on if everybody already having exactly the same terms and conditions in their contracts, then at this point of time, when a company want to choose which company to favor on, we are looking for the future contracts.
Now, if I'm TSMC, I could go back and tell Apple, NVIDIA, and AMD and say that "Hey, I do not have sufficient capacity and I'm going to drop, decommit some of the requests because not enough capacity. It depends on who is going to give me a better price for a future contract or higher volume for my future contract that I prioritize to that company, so we look forward.«
source: AlphaSense
With this in mind, a very interesting thing that Groq's Sunny mentioned in our conversation is that Groq currently uses older technology, 14nm chips produced in the U.S. Although they are now in the process of making newer, more cutting-edge chips on 4nm technology, they will continue to use the 14nm chips, as the supply chains there are not constrained.
That is great news for the hyperscalers, as they, because of their cloud offerings, are probably the ones that can't afford to run out of inference compute for their clients.
It is also good news for foundries like Taiwan Semi and others, as their newest high-tech chips are already filled with orders from training workloads, so being able to use older-technology nodes for inference chips to fill in that supply might be a great way to squeeze some more profit from older fabs.
What about edge computing?
While the implication that you can get better accuracy with more inference suggests that smaller models can become more capable, there are still significant hurdles for on-the-edge (on-device) computing. Battery life, heating, and how long a user is willing to wait for an answer remain big questions. Sunny shares my view, as he thinks full on-device compute for these AI models is more dream than reality:
source: Webinar Rihard-Sunny
Some easy queries can be processed on the device via an SLM (small language model), while more complex ones go to a cloud service. This is a setup that Apple hinted at with Apple Intelligence and what some other industry experts believe will be the case.
However, this practice presents challenges, as you have to have a system that accurately detects which queries are complex and which are simple. Developing this kind of system might be an opportunity for some companies, as we might need it not just to decide what should run on the edge versus in the cloud, but also what should go to a »pre-o1« AI model and what should go to o1-type models, which have a multi-step process baked in by default. When the question asked is very simple, a multi-step process might not be needed, as it is more costly, uses more inference compute, and takes the model longer to answer. This need is already evident in OpenAI's pricing, with the o1 models being 6-20x (base or mini) more costly for developers to use compared to other OpenAI models.
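To make the routing idea concrete, here is a toy sketch of such a system. The complexity heuristic and the three model tiers are hypothetical placeholders, not anyone's production setup; a real router would likely use a small classifier model rather than keyword rules.

```python
def estimate_complexity(query: str) -> int:
    """Crude proxy: count hints that a query needs multi-step reasoning."""
    multi_step_hints = ("prove", "step by step", "plan", "debug", "compare", "why")
    score = sum(hint in query.lower() for hint in multi_step_hints)
    score += len(query) // 200              # very long prompts tend to be harder
    return score

def route(query: str) -> str:
    score = estimate_complexity(query)
    if score == 0:
        return "on_device_slm"              # fast, cheap, battery-friendly
    elif score == 1:
        return "cloud_standard_llm"         # a "pre-o1" style model
    else:
        return "cloud_reasoning_model"      # o1-style, multi-step, most expensive

print(route("What's the weather like today?"))                              # -> on_device_slm
print(route("Why does my recursive parser fail? Debug it step by step."))   # -> cloud_reasoning_model
```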
Summary
Inference has become a new scaling paradigm. Models will scale and become more accurate not only via training compute but also via inference compute.
One thing I didn't mention yet: because of the o1 model release, and with Big Tech about to start reporting earnings, I believe there is a high chance that the hyperscalers and companies like Meta, which are building these LLMs, will raise their CapEx expectations even more in the short term than they did before, and much higher than analysts expect. The reason is that they now also have to account for spending on inference compute to improve these models. Inference used to be a cost they could introduce gradually and control, for example by giving users limited access to AI features. This has changed now that you can use inference to scale the model. Investors might not like this in the short term. Still, in the long run it brings us even more capable models, the possibility of easier agentic AI use cases, and SLMs with good enough accuracy to be used more often instead of bigger LLMs. There are already estimates of how much more expensive inference is with an o1 model than with »pre-o1« models. This industry expert quantifies it:
» Analyst: Strawberry o1, I've been told it's 4X-5X more expensive than ChatGPT?
Industry Expert: Yeah. That's the right level. That adds up given that it will essentially use 4X-5X more tokens on average. In the worst case, it will be 10X possibly. The 4X-5X is an average number of how much more expensive it is.«
source: AlphaSense
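Translating that into a rough per-answer cost comparison (the prices below are hypothetical; only the ~4x-5x token overhead comes from the quote above):

```python
# Hypothetical per-token price; real OpenAI prices differ and also vary by model.
price_per_1k_output_tokens = 0.01    # assumed $/1k output tokens
standard_answer_tokens     = 800     # assumed answer length for a "standard" model
reasoning_overhead         = 4.5     # ~4x-5x more tokens on average, per the expert above

standard_cost = standard_answer_tokens / 1000 * price_per_1k_output_tokens
o1_style_cost = standard_answer_tokens * reasoning_overhead / 1000 * price_per_1k_output_tokens

print(f"standard answer: ${standard_cost:.4f}")                                          # $0.0080
print(f"o1-style answer: ${o1_style_cost:.4f} (~{o1_style_cost / standard_cost:.1f}x)")  # $0.0360 (~4.5x)
```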
I am going to use Sunny's comments as my closing thought:
»The wonderful thing about the pace of AI development right now: less than every 90 days we have like a Christmas of something new.«
Until next time,
PS: if you are interested in the LLM and semiconductor space, make sure to check out my discussion with Sunny Madra, General Manager of GroqCloud:
Disclaimer:
I own Meta (META), Taiwan Semiconductor (TSM), Microsoft (MSFT), and Amazon (AMZN) stock.
Nothing contained in this website and newsletter should be understood as investment or financial advice. All investment strategies and investments involve the risk of loss. Past performance does not guarantee future results. Everything written and expressed in this newsletter is only the writer's opinion and should not be considered investment advice. Before investing in anything, know your risk profile and if needed, consult a professional. Nothing on this site should ever be considered advice, research, or an invitation to buy or sell any securities.
Thanks.