Chances are when you've interacted with a chatbot, you've seen it pause and say something like "thinking." Well, what's going on there? Well, we've

talked a lot on this channel about how LLMs actually get trained. There's a transformer architecture and it's fed a massive corpus of data. So there's

an absolute ton of data going in here. Through next token prediction, it learns language, it learns

reasoning, it learns facts, probably learns how to code. And all of this gets compressed into the weights of the model. And for years, the playbook

for making that model smarter has been to scale up. That means more parameters, more training data, more FLOPs during pre-training, and the

scaling laws have shown that this holds up. Now, this approach is called train time compute, and it is a fixed cost. You spend months of compute time and probably millions of dollars training a model, and then those weights are frozen. From

that point on, whether somebody asks the model to summarize an email or to solve a graduate-level physics problem, it does the same thing: one forward pass through the network per token, token by token. And every token is something of a commitment because the

model picks the statistically most likely next token, it emits it, and at that point it's kind of locked in because there's no going back to

reconsider. It's a forward pass, which is, well, always moving forward. So if that first token sends the response down the wrong path, the model just

keeps going with it, and that's actually one of the reasons LLMs can hallucinate so convincingly. But what if we give the model a compute budget? So this time we're going to give the model a budget, and specifically, this budget is not spent at training time, but

instead it's spent at inference time when the model runs. And it gets to decide how to spend that budget. Well, that is test time compute. That's what

that thinking message is all about. And the research coming out over the last couple of years shows

this might be just as important a scaling axis as model size. So what's actually going on during that thinking time? Well, there are a few distinct

mechanisms, and actually they can be combined. The most visible one of these is just chain of thought. Now anybody can invoke chain of thought just

by prompting it, by telling the model to think step by step. But there's a newer class of models and

those models are called reasoning models. And these reasoning models have been trained through reinforcement learning to do this automatically. So the

RL process bakes chain of thought into the model itself. Now, during RL training, the model learns that producing intermediate reasoning tokens, which

is breaking down a problem and working through the logic step-by-step, that tends to get a higher

reward. So it does it more often. Essentially, we can say that it is generating what we'll call thinking tokens as it goes through this process.

Thinking tokens are generated before the actual response. Now what's special about thinking tokens? On the face of it, not much. They're still real

output tokens, they cost real compute, but they change what that forward pass is being used for. So

remember in a standard response the model is committing to final answer tokens from the very first word, but with thinking tokens, those kind of early

commitments are really just like a scratch pad of work. The model can explore an approach. It can realize, ah, this is not working, and then try a

different angle. All before it commits to a single word of the actual answer that's returned to the

user. So instead of going straight from a query straight to the answer that we give back to the users, we now have an intermediate step. We go from

query, to reasoning, which is where the thinking tokens are generated, and then to the answer. So that's chain of thought. Now the second mechanism, number two, is search. So in standard inference the model does greedy or near-greedy decoding, and that one-and-done forward pass is what I'm talking about there. So it picks the most likely next token and then it just moves on. But with test time compute, you can do something more like a tree search. Now in a tree search, the model starts a reasoning chain and then it branches. So it tries different

branches kind of off the tree. And at some point it needs to pick one of these branches to go down. So

to do that, it uses a verifier to score which branch is most promising before continuing down it. Then the third mechanism is self-consistency. This

is to run the same problem n times at a really high temperature, so you get n different reasoning paths. Then you take a majority vote on the final

answer. So if, let's say, 7 out of 10 independent chains have landed on the same answer, you've

got some pretty decent confidence. Now there's no verifier model needed in this case because it's using the statistical distribution of the model's

own outputs as the signal. Now to be clear all three of these mechanisms are trading something. They are trading compute for accuracy. There are more

FLOPs per query, but there's also a higher probability of getting a good response.
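
To make the search idea a bit more concrete, here's a minimal sketch of verifier-guided search, written as a small beam search in Python. Everything in it is a toy assumption: the "reasoning steps" are just numbers to add up, `propose_steps` stands in for the model proposing next steps, and `verifier_score` stands in for a trained verifier model. It is not how any particular reasoning model does it, just the shape of the loop: branch, score the branches, keep the most promising ones, continue.

```python
from typing import Callable

TARGET = 9  # toy goal: pick steps whose sum hits this number


def propose_steps(chain: list[int]) -> list[int]:
    """Hypothetical stand-in for the model proposing next reasoning steps."""
    return [1, 3, 5]


def verifier_score(chain: list[int]) -> float:
    """Hypothetical stand-in for a verifier: closer to TARGET scores higher."""
    return -abs(TARGET - sum(chain))


def beam_search(
    start: list[int],
    propose: Callable[[list[int]], list[int]],
    score: Callable[[list[int]], float],
    beam_width: int = 2,
    depth: int = 3,
) -> list[int]:
    """Keep the `beam_width` best-scoring partial chains at each depth."""
    beam = [start]
    for _ in range(depth):
        # Branch: extend every surviving chain with every proposed step.
        candidates = [chain + [step] for chain in beam for step in propose(chain)]
        if not candidates:
            break
        # Verify: rank the branches and keep only the most promising ones.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return max(beam, key=score)


best = beam_search([], propose_steps, verifier_score, beam_width=2, depth=3)
print(best, verifier_score(best))
```

A wider beam or a deeper search means more branches scored per query, which is exactly the compute-for-accuracy trade described above.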

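The self-consistency vote can be sketched in a few lines of Python as well. The `sample_answer` function here is a made-up stand-in for one sampled reasoning chain (a real system would call a model with sampling turned on and parse out the final answer), and the 60% hit rate is an arbitrary assumption chosen just to make the simulation noisy.

```python
import random
from collections import Counter


def sample_answer(question: str, temperature: float, rng: random.Random) -> str:
    """Hypothetical stand-in for one sampled reasoning chain.

    Simulates a model that lands on "42" about 60% of the time and on a
    scattered wrong answer otherwise. `temperature` is ignored by this stub.
    """
    if rng.random() < 0.6:
        return "42"
    return rng.choice(["41", "43", "24"])


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of chains that agree."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


def self_consistency(question: str, n: int = 10, temperature: float = 1.0,
                     seed: int = 0) -> tuple[str, float]:
    """Sample n independent chains, then vote on their final answers."""
    rng = random.Random(seed)
    answers = [sample_answer(question, temperature, rng) for _ in range(n)]
    return majority_vote(answers)


answer, agreement = self_consistency("What is 6 * 7?", n=10)
print(f"answer={answer} agreement={agreement:.0%}")
```

Notice there's no verifier anywhere in there. The agreement fraction is the only signal, and it comes from the model's own outputs.
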
Does this actually work? Well, a 2024 paper out of Google DeepMind on scaling test time compute found that test time compute follows its own scaling

law. In fact, we can draw it. So if we think of the axes here being performance on reasoning benchmarks, well, the performance actually goes up slowly

and smoothly and predictably as you increase inference compute. In fact, researchers showed that if

you take a fairly tiny model, let's say a 3 billion parameter model, and you use test time search strategies, it would outperform a much bigger model, a 70 billion parameter model, on hard math problems. So that's a model that is over 20 times smaller that's beating the big one, and it's doing

it just by thinking longer. But there are definitely some trade-offs here. More thinking time, well, the obvious thing that means is more latency. If every single chatbot query takes 45 seconds while the model works through a search tree, well, users

might have a bit of a bad time. And those thinking tokens are billed as regular old output tokens. So a query that burns through 10,000 of them for a single response is more expensive to run. But it's not just latency and expense. Another problem

is just basically overthinking. So forcing a reasoning model to deliberate on simple questions can actually degrade performance. The model kind of

second-guesses itself into the wrong answer. And if you've ever talked yourself out of the right answer on an exam, yeah, it's the same thing. And

that is an analogy I have some firsthand experience of. Now from an economics perspective, training

compute is considered a capital expense, or CAPEX. It's paid once, regardless of the query volume or inference time. But test time compute is OPEX, an operational expense. It's paid per query, and you can choose how much per query as well.
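
To put some rough numbers on that per-query cost, here's a small Python sketch. The price (10 dollars per million output tokens) and the 500-token answer are made-up assumptions for illustration, not any provider's real pricing. The 10,000 thinking tokens is the figure mentioned earlier, and input tokens are ignored to keep it simple.

```python
def query_cost(answer_tokens: int, thinking_tokens: int,
               price_per_million_output: float) -> float:
    """Cost of one response when thinking tokens bill as output tokens."""
    total_output = answer_tokens + thinking_tokens
    return total_output * price_per_million_output / 1_000_000


PRICE = 10.0  # hypothetical dollars per million output tokens

fast = query_cost(answer_tokens=500, thinking_tokens=0,
                  price_per_million_output=PRICE)
reasoned = query_cost(answer_tokens=500, thinking_tokens=10_000,
                      price_per_million_output=PRICE)

print(f"fast: ${fast:.3f}  reasoning: ${reasoned:.3f}  ratio: {reasoned / fast:.0f}x")
```

Under those assumptions the reasoning response costs about 21 times as much as the fast one, and that multiplier is paid again on every query.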

With these trade-offs in mind, the best approach is actually one that is considered adaptive, so it changes based upon the request that comes in. So

we can route easy queries to the fast single pass inference. And then we can route harder queries through the full reasoning pipeline. And that's how many chatbots work today. ChatGPT, for example, uses a picker to route queries between reasoning and

non-reasoning models. So we've been scaling AI by making models bigger and bigger at training time and test time compute is the second axis. It's

letting the model spend more compute on the problems that need it. AI models, they're getting bigger, they're getting faster, but they're also learning

when to slow down and think.
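
As a rough illustration of that adaptive routing idea, here's a minimal Python sketch. The difficulty heuristic, the keyword list, the threshold, and the 8,000-token thinking budget are all assumptions invented for this example. A production router would more plausibly use a learned classifier, and this is not a description of how any specific chatbot routes its queries.

```python
# Hypothetical markers that hint a query needs multi-step reasoning.
HARD_MARKERS = ("prove", "derive", "step by step", "debug", "optimize")


def estimate_difficulty(query: str) -> float:
    """Toy difficulty score from keywords and length."""
    lowered = query.lower()
    score = 0.0
    if any(marker in lowered for marker in HARD_MARKERS):
        score += 0.5
    # Longer queries get a small bump, capped at 0.5.
    score += min(len(lowered.split()) / 100, 0.5)
    return score


def route(query: str, threshold: float = 0.4) -> dict:
    """Send easy queries to the fast path, hard ones to the reasoning path."""
    if estimate_difficulty(query) >= threshold:
        return {"mode": "reasoning", "thinking_budget": 8000}
    return {"mode": "fast", "thinking_budget": 0}


print(route("Summarize this email"))
print(route("Prove that the sum of two even numbers is even"))
```

The point of the design is that the thinking budget becomes a per-query decision, so the easy questions stay fast and cheap and only the hard ones pay for the extra compute.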