What Happens After A 1,000,000x AI Compute Leap?

There used to be a chat group internally called Data Centers on Fire that would have, like, exciting events happening. >> A distant

supernova goes off, a cosmic ray hits a memory cell and a zero flips to a one. Does that really happen? >> Oh yeah. >> So my question is do you enjoy

these Chuck Norris style jokes about you? >> It could be true. >> One problem that you tried to solve

many times but have never been able to crack. I cannot believe that this is happening but I got to talk to a legendary engineer the chief scientist of

Google, Jeff Dean. He led Google Brain, one of the most legendary AI labs in history. He co-created MapReduce, which taught thousands of computers to

work together as one. He co-built TensorFlow, the engine behind a huge chunk of AI research. And

for all this, they call him the Chuck Norris of computer science. Yes, I will tell him a joke about that too. Now, when I see interviews with these

executives, everyone is asking about China and taxes and all that. Look, I know nothing about that. I am just a student who loves to talk about

research. So, my goal was to try to go a bit deeper and ask him questions that maybe only he knows the

answer to, which is incredible. I'll also ask him about problems that even he couldn't solve yet. And I will ask him about some of the secret sauce at

Google and see if we get something and more. And I am so happy to share it with you fellow scholars so we can learn together. I am not sure if I saw

Jeff smile and laugh this much before. So, I hope he enjoyed it too. And once again, this is an

incredible honor. I cannot believe that I was sitting there. There were some production issues with the video part. I apologize for those. Also, I was

super nervous. I could barely hold on to my papers. Now, fellow scholars, let's learn together with Jeff Dean. Thank you so much for doing this, Jeff.

We talked a bit last year and I learned so much from you. It was incredible. And then I got a

message that we get to do this and I was so happy. >> So thank you so much for this and we get to share your knowledge. >> A small part of your

knowledge with the fellow scholars. So that's that's absolutely >> it was great chatting with you last year. I'm looking forward to this. >> Thank

you. Thank you. So everyone says that we are running out of training data for LLMs, but you said that there

is still plenty of data out there. >> What did you mean? >> Yeah, I mean I think everyone has this view that uh we're running out of training data and

um it's true we've like used quite a lot of the public text data in the world. Um but I think there's lots of interesting video data that we're not

really training on yet. uh there's lots of interesting kind of um ways to generate synthetic data and

then use that for training >> and then I also think we can start doing things like uh making more passes over the data that we do have to make more

and more capable models and also come up with algorithmic techniques that enable us to get a lot more information from every piece of data that we do

have. So I'm not too worried about that as like an impediment to making progress. It seems like

there's lots and lots of things we can do. >> People also say that with so much simulation data, as you mentioned, sooner or later most of the data will be

AI generated, which is then used to train a different AI, and then suddenly everyone starts to, you know, learn on the same thing. But you said, wait, it

still helps. I think the argument was that if you have enough compute, you can crunch through a lot

of data, and if there is just a little needle in the haystack that's useful, the system is able to learn from it. Is that true? Because my previous

crappy little experiment uh it was not true at all. So you had to be very careful with the data. >> Yeah. I mean I think it is true in general. I mean

there's a lot of details to get right to make this a reality. Think about for example doing RL

training and rollouts to, you know, figure out how to solve some coding question phrased at a fairly high level, right? So you might explore a hundred or a

thousand different ways of generating solutions to these problems and you might have some, you know, some filters that you apply to these things like

does the code even compile? Well, you can throw out 800 of them right off the bat. >> Uh does it

pass the unit tests? Does it like perform well? And so you can really start to hone in on like which of these you know potentially many solutions to

the problem is the one that actually scores highest on the characteristics you're looking for, the reward in some sense. >> And

that I think is definitely true like more compute will generate you more interesting solutions and then

those can then be put into the training data they can be enriched with like data augmentation techniques you know I generated the solution in Python

now I could generate a solution in Go and have more Go programming language training data. >> That's like an incredible kind of augmentation like

augmentation before with convolutional neural networks, you know, it was just shift the image by a

couple pixels and whatnot and here the augmentation can be like completely different programming language and whatnot. >> Yeah, I mean I think you

know a lot of times we think about coding based problems as you go from natural language which is >> often very underspecified. It's like you know

make me a cool space invader game or something. Um, but actually if you have a program that already works

that does what you want and you want to translate it, that's awesome because in effect your prompt is the fully specified behavior of the system you

want and you just want it in a different language for whatever reason. Maybe better performance or better safety characteristics or whatever. So that

we've seen internally with some tools that have been written in Python and people have been able to

sort of just say >> please use all the tests for this code and the actual Python codebase and make different versions of it and found you know much

faster solutions. >> So you can you can suddenly get so much more out of the same amount of data basically. >> Yeah. So that's that's why you're not

worried about the data. Okay. Nice. Now, Bill Dally has said that something like 90% of what happens in

modern data centers is not training anymore, which I found really surprising. It's inference. There's less training and more using,

relatively speaking. >> Um, how does that shift the way you design hardware at Google? >> Yeah, I mean, first, there's a lot of other things that are not

either inference or training that happen in data centers, like all the applications we run, and search and

Gmail and so on. But of the machine learning workloads, you know, it is the case that training is becoming a smaller proportion of the

overall compute that we want to do because there's so much you know inference workload you want to do and the inference workload includes both like

offline inference, sort of RL rollouts during RL training, and then also online inference for

handling user requests or agent-based behavior. Because of that shift and the different characteristics of those two kinds of computations, it makes a

ton more sense to now specialize much more for inference workloads in hardware for example. Um because the characteristics are quite different. You

need lower precision. You >> you know are handling a very large volume of requests on this particular

model. The model weights don't necessarily change uh at inference time. Um all these things lead to very different solutions for hardware and much

more energy efficiency can be gained by specializing and so I think you'll see a lot more in this area uh you know now and in the future. We've

already done this with our TPU 8i and 8T chips that we announced, um, maybe a month ago. >> Um, but

you'll see even more specialization I think. >> And that's pretty crazy that you said that even FP4 kind of works. And I when I first heard it I was

like, it cannot possibly work, it cannot do anything useful, and it does. >> Yeah. If you told that to a computer scientist from 15 years ago, they'd be like,

that's that's not enough numbers. >> Yeah. Yeah. Exactly. >> And I look at every now and then at

these papers and you you have these different transforms that are the the distance preserving transforms, rotations between the points and all kinds

of compression. But still FP4, that's unbelievable. It's not many bits for exponent or mantissa or sign. >> And it's high quality, you know,

intelligence that comes out of it. So, it's just >> it's a good sign that it works. >> Yeah. Yeah.
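
To make "not many bits" concrete, here is a minimal sketch of a 4-bit float paired with one shared scale per block of weights. The bit layout (1 sign, 2 exponent, 1 mantissa, often written E2M1), the block size of 32, and all function names are assumptions chosen for illustration; the conversation does not say which FP4 variant or block size any particular chip uses.

```python
# Illustrative sketch only. The E2M1 layout and the block size of 32 are
# assumptions, not details taken from the conversation.

def fp4_e2m1_values():
    """Decode all 16 bit patterns of a 1-sign, 2-exponent, 1-mantissa float."""
    values = []
    for code in range(16):
        sign = -1.0 if code & 0b1000 else 1.0
        exp = (code >> 1) & 0b11
        man = code & 0b1
        if exp == 0:
            mag = man * 0.5                           # subnormal: 0 or 0.5
        else:
            mag = (1 + man * 0.5) * 2.0 ** (exp - 1)  # exponent bias of 1
        values.append(sign * mag)
    return values

# +0 and -0 compare equal, so only 15 distinct values remain.
FP4_GRID = sorted(set(fp4_e2m1_values()))
FP4_MAX = max(FP4_GRID)

def quantize_block(block):
    """Pick one scale for the block, then snap each weight to the FP4 grid."""
    absmax = max(abs(w) for w in block)
    scale = absmax / FP4_MAX if absmax > 0 else 1.0
    codes = [min(FP4_GRID, key=lambda g: abs(w / scale - g)) for w in block]
    return scale, codes

def quantize(weights, block_size=32):
    """Split the weights into blocks and quantize each block independently."""
    return [quantize_block(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]

def dequantize(blocks):
    """Multiply each stored grid value by its block's shared scale."""
    return [scale * g for scale, codes in blocks for g in codes]

if __name__ == "__main__":
    weights = [0.01 * i for i in range(-32, 32)]
    restored = dequantize(quantize(weights))
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("representable values:", FP4_GRID)
    print(f"worst-case rounding error: {worst:.4f}")
```

The shared scale is stored in higher precision once per block, so its cost is spread over every weight in that block. A smaller block tracks local weight ranges more closely but spends more bits on scales.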

But I I I don't know if we can get lower. Uh what do you think like even lower? >> Possible. I mean I think um you know people are seeing and

experimenting with things where you have some even lower precision, and then every so many weights of that, you know, lower precision you have a

scaling factor, and that seems like you get a little bit of a higher-precision thing that's kind of shared across

all the other lower-bit-precision formats, whatever they might be: two-bit integer, one-bit integer. You know, I haven't heard anyone say two-bit float,

because I'm not sure what that would mean. >> But, um, yeah, that plus a scaling factor seems to be able to get you pretty far. >> And the question is

like how often do you need the scaling factor? Is it every 64 or 128 or 256 weights? >> Pre and post

training are typically separate steps today. Do you see that split holding or do you expect the two to merge as capabilities increase? >> Yeah, I mean

I feel like it's a little intellectually dissatisfying that they are these distinct phases and you do one and then you do the other. Like,

conceptually the right thing to do is to have interleaved periods where you're sort of observing data and

then periods where you're trying to use that new knowledge you've gotten from the data. >> Like with DQN, this experience replay kind of thing? >>

yeah and then you want to now take actions in some environment maybe it's a simulated environment maybe it's the world with a robot or whatever it is

and then you know learn from those actions because I think you get a lot more benefit from actually um

taking actions and observing the consequences or trying to write code and seeing does the code work >> than you do from just passively sitting there

and seeing tokens stream by you, which is really what most of pre-training is these days. >> It's really interesting that you say that in an interleaved

manner, because when I hear merging the two, what's in my mind is continuous, like continuous

learning >> but at the same time people have to test models you cannot just chuck it out there you know you finish training you finish the post and

then maybe the red teaming steps and and you know safety and everything and then you package it up and you say okay this is good to go but if there's

continuous learning then there are new challenges, because how do you know that this intermediate state

is actually safe. Maybe some more research there too. >> Yeah, I mean I think uh first like a bunch of discrete steps where maybe you do this a 100

times or a thousand times starts to look more like an integral than a summation. >> Um, and so >> I do think interleaving in that way will make sense. >>

but you're right >> like you have a bunch of things you need to do for a live model that is serving

user requests. You need to make sure that it's safe. Um so it may be that the continual learning happens and then there's some uh application of uh

you know safety protocols and red teaming as you say uh and then you release a new version of that but then that model still continues to learn kind

of behind the scenes and then before the newest version of it is provided to users you redo the sort of

final safety testing and red teaming. >> Jensen likes to say that compute capabilities advanced 1 millionx over the last 10 years. So if in the next 10

years, assuming we get another 1 millionx, what would we be able to do that we cannot do now? >> Yeah. I mean it's like imagining the future is always

a hard thing because this field is moving quickly. >> I mean I think if you think back, you know, 10, it

was 10 years. >> 10 years. >> 10 years. If you think back 10 years, you know, we were kind of just starting to have language models that were the

sequence to sequence paper had appeared. You know, it was just before the transformer. >> LSTMs, maybe >> LSTMs were popular. >> Um, and now those

models sort of look ancient and not nearly as capable as the models we have today. So,

I think if you project forward that level of advancement, you're going to see >> huge investments in both like new kinds of hardware um you know new

kinds of research techniques uh there's just a lot more attention being paid to the field. So I I see that progress rate not slowing down um over the

next 10 years. And so that's going to be incredible, like the multi-agent workflows we're now able to

start to >> kind of get to work on very complicated tasks, like you saw in the I/O keynote, >> being able to write an operating system >> autonomously

with a relatively simple prompt. >> Crazy. uh you know obviously there's a lot of operating systemy like things in the training data so it's not

completely out of distribution but you know the fact that it's able to build an OS that can run Doom uh

successfully is pretty amazing. >> I couldn't believe it. I mean, last year I heard a talk from Stephen Balaban, the Lambda CEO, >> and he had this

neural OS like hey you know it does more and more like forget the UI forget the maybe the drivers I don't know but just let's let's have a neural OS

and I was like, "Yeah, that sounds like an amazing science fiction idea. I would love to see it, but

I don't know. I mean, it sounds far off." A year later and we got it, you know, not exactly like that, I know, but if you look at the derivatives over

time >> I mean, I would say one thing I'm particularly excited about is, you know, can we with these tools accomplish so much more in, you know, science, as

Demis was mentioning in the keynote, or in, you know, complicated engineering tasks that often would

take you know lots and lots of people multiple years to accomplish. Could you actually have a system that with the correct access to the right kinds

of simulation environments and a learning set of agents that are trying to accomplish the task and break it down into smaller tasks, >> could you

design an airplane in, you know, five days instead of, you know, many years? That would be amazing. >> 1

millionx and we can we can try again. >> Yeah. I mean, we're not there yet, but that would be a pretty amazing capability. Or designing new computer

chips or computer systems, new hardware. Um, you know, I'm pretty excited about that. >> Yeah, incredible times. Are open models standing on the

shoulders of giants? And by that I mean, if frontier models suddenly stopped being released, would open

models improve as quickly as they do now or is their progress mostly driven by distillation? >> Yeah, I mean I think certainly a bunch of the progress

is driven by distillation. For example, our own Gemma models are definitely distilled from higher quality larger scale models. Um and I think a lot of

other open source models are getting benefit from distillation data. Uh distillation has always

been a, you know, amazing way to get really capable models into a smaller footprint, and, you know, that's how our Flash models are quite capable

for their size relative to the Pro models: we're able to use the Pro model >> to teach the Flash models. So I mean, I think really the question is

uh not so much one of closed versus open. It's you know if we want small incredibly capable models

we have to keep building larger scale models that are maybe less inference efficient but are more capable and then use distillation >> to uh you know

transfer the knowledge into the smaller models whether they are open or closed. Now I'm I'm wondering you might be the only one who can answer that.

So I I really want to ask this. Everyone has their flagship models and yes the distilled models like

pretty much every company does this tiered level thing. The quicker, faster models always were well below the frontier models, and at some point, I

think 3.1, there was one version where the quick one was suddenly so close to the frontier one, there was like a 3% difference >> in tough

benchmarks, and I just heard someone saying, I don't even know who that was, that, yeah, it's not just

distillation, there is some magic sauce in there that's been in the works for years. So, can I hear a bit about that? >> Sure. Well, not too much. I

mean, there is always some magic sauce that we don't reveal, but distillation is definitely one of the key things that makes those, you know, much

smaller models much cheaper, much faster, much more affordable um models be, you know, nearly as good as

those frontier models. And then we push ahead and build an even better frontier model. And then we have to then do the process again where we now

transfer the knowledge in the really capable frontier model back into a lighter-weight one. And I think, you know, this is really

important because the Flash models are really the workhorse of what people generally want to use, because

they're, you know, they're almost as capable. We saw it. Yeah. >> Yeah. And, uh, >> and they're quite good. >> Yeah. It's unbelievable how close

they can get. It didn't used to be like that at all. All right. What trends in machine learning are you most excited about right now? You

you have a separate talk about like exciting trends in machine learning or something like that. >>

Yeah. I mean >> what's what's the newer version of that? >> Yeah, the newer version I guess I mean there's a few different trends that I think are

really exciting. So first, I think continual learning is still a little bit nascent, but I think looking at ways to make models that are

more interleaved in the way they use, sort of, seeing data passively and taking action and learning from

that seems like a really important thing. Uh, you know, agents and multi-agent use of these systems is really important. Um, as one trend of that

though, I think as you see, uh, you know, we're going to need a lot more inference hardware and capability for that because those systems that are

working autonomously in the background actually consume lots of tokens in order to sort of >> do the kind

of important work they've been asked to do. Um, you know, I think, uh, being able to build really efficient inference hardware will enable a lot of

things. So looking at you know co-design of model architectures and hardware architectures to make sort of the best use of um things and have really

good properties in terms of very low latency you know much higher performance per watt performance per

dollar are things we really care about. Um, you know, I think the context window of these models is an important

characteristic but uh I think there's a lot we could do if we come up with mechanisms that are sort of cascaded series of things that kind of give you

the illusion that you have all information in the context window >> like you'd like to have the whole internet

at your model's fingertips >> or on a personal level if you've opted in you know all of your email and your photos and your the videos you've watched

and things like that. Um, but you can't really do it with the sort of quadratic attention mechanism. But I think you can build a series of kind of

retrieval and lighter weight mechanisms and then ways of cascading from you know here are the 30,000

documents out of 10 billion that seem most relevant and then you know have a lighter weight model that looks at those and decides these 117 things

seem really relevant to what you're trying to do and puts those in the sort of more expensive context window of a a bigger model perhaps. Uh that's

going to be kind of exciting. And how do you orchestrate and interleave all that stuff so it gives you the

illusion, uh, without you having to even think about it? >> Interesting. So it's very advanced games to be played with the context window, because

obviously it's very expensive. So with the attention mechanism you get big O of n squared. >> Uh, are we still there, or do we have some, I mean, I've heard

some n log n things. Can we go lower? There's like a whole series. >> Obviously we can go lower but the

question is what the trade-offs are right like what do you have to pay for that? Yep. um where are we in that? >> Yeah, I mean I think there's

actually quite a large body of work there, probably, you know, a hundred papers on more efficient context algorithms than the n squared one. >> I

mean, the n squared one works really well, so it has a pretty high bar, but I do think there is traction

in finding things that are, you know, much lower cost, whether it's, you know, reducing algorithmic factors or very large constant factors on the base n

squared algorithm. I think all of these are pretty exciting. You can actually combine many of these approaches >> um, and get, you know, much cheaper

attention over many more tokens >> yeah I think that's one of the most important things because if it was

cheaper in some sense and you could still find the needles in the haystack over very long contexts, then you could have some

sort of lifetime AI thing. >> Yeah, totally. Like I'd like my whole life of all the digital things I've seen uh in there. Uh as a say internal Google

developer, I'd love for the entire Google codebase to be in there, which is, you know, probably 10

billion lines of code, probably, you know, 100 billion tokens. >> I just want my wine list. >> I just want 100 billion. All I want is a

100 billion tokens of attention. It's all I need. >> Amazing. I think we got to do this one. So, Google's data centers run an enormous number of

machines. And at that scale, anything that can go wrong will go wrong. Like I hear that wires wear down, >>

hard drives fall apart, motherboards overheat. Um, is that something that actually happens day by day? And do you have any good stories? >>

Absolutely. I mean, I don't have that many personal stories, but there used to be a chat group internally called Data Centers on Fire that would have

like exciting uh exciting events happening and sometimes exciting videos. Um yeah, I mean I think >> at scale

lots of things that are very unexpected happen, and usually those are the combination of one thing failing and something else failing simultaneously or in

cascade. Yeah, you have a cascaded failure of some sort. You know, sometimes that means some software system stops working. Sometimes it

means, like, the bus bar overheats and you get too much power to the rack and, like, it catches on

fire. I mean that's a much rarer thing. But um you know you have to be prepared for this and I think one of the things even from the very earliest

days of Google is we have really focused on how do you build reliable systems out of unreliable parts. Yes. >> Right. Like in the earliest Google

days, we were buying consumer machines without ECC memory, not only no ECC, not even parity. >> We

were buying consumer motherboards that didn't have like redundant power supplies and you can do that if you can handle things at a higher level and

that's generally what we try to do in all cases is >> I actually wanted to ask you about that the ECC thing because here's one of my favorite failure

modes, if that's true, but you tell me: a distant supernova goes off, a cosmic ray hits a memory cell

and a zero flips to a one. Does that really happen? >> Oh yeah. Yeah, absolutely. I mean, alpha particles definitely can flip uh you know DRAM state.
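
As a rough illustration of how ECC memory copes with a flipped bit like this, here is a minimal sketch of an extended Hamming code that protects 4 data bits with 4 check bits. The function names (`encode`, `decode`) and the tiny word size are assumptions made for the example; real ECC DIMMs commonly protect 64-bit words with 8 check bits, and nothing here reflects Google's actual hardware.

```python
# Illustrative sketch only: extended Hamming(8,4), a toy single-error-correcting,
# double-error-detecting code. Names and word size are assumptions.

def encode(nibble):
    """Turn 4 data bits into an 8-bit codeword (list of 0/1).

    Index 0 holds an overall parity bit; indexes 1..7 follow the classic
    Hamming layout with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2          # makes the total number of ones even
    return code

def decode(code):
    """Return (nibble, status); nibble is None when the word is unrecoverable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions holding a 1
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1              # single flip: syndrome is its position
        status = "corrected"
    else:
        return None, "double-bit error detected"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status

if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                         # simulate one flipped memory cell
    print(decode(word))
    word[2] ^= 1                         # a second flip in the same word
    print(decode(word))
```

With one flipped bit the decoder recovers the original value; with two it can only report that the word is bad, which is why a second, software-level check is still useful at large scale.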

We've actually observed this because we have monitoring data of how many ECC uh errors and like single bit errors that are corrected and two-bit

errors that are not corrected are happening in all of our machines. And you can actually see this where

some clusters that are pointing in a particular direction on the Earth have a much higher rate for, you know, a brief period, like a 10-minute period or

something, and then the other ones on the other side of the Earth do not have that. So it's definitely something that happens. >> How worried should I

be? Because MacBook Pros don't have ECC memory as far as I know like for one machine is it so

vanishingly, you know, unlikely that you shouldn't care, but for a data center, or >> I mean, for one machine it's generally not too bad. I mean, I think

they have parity, so at least they detect it, typically, if it's a single-bit error. >> So detection but not fixing? >> Right, but ECC usually gives you single-bit

error correction and double-bit error detection. Yeah. >> So with that you don't have to

worry about it too much >> um, at a single machine level, but even at, you know, tens of thousands of machines you'd have to start thinking about that. So

you know, one of the things we did when we were using machines without even parity is we built an entire software-based checksumming system for large

amounts of our data. So >> doing it by hand >> doing it by hand essentially and like we would you

know for crawling web pages and putting them in the index >> you know if you detect that this particular record is corrupted it's usually generally

okay to just you know ignore that record. >> Now I have something interesting for you. I call it lightning round. So, please try to answer in one

sentence. One word is okay. One one sentence. >> Can I make run-on sentences? >> We'll see. We'll see. So,

I read that Jeff Dean's PIN is the last four digits of pi. I give this one an eight out of 10. So, my question is, do you enjoy these Chuck

Norris style jokes about you? >> It could be true. Um, I do enjoy them. I mean, it's an April Fool's joke gone awry by my colleagues in 2009, but

it's both very flattering and kind of embarrassing. >> I think he felt the same way about them,

too. But he enjoyed them, too. Legend. All right. One big thing that you were wrong about and came around. >> I think AI is going to influence health

care quite dramatically, but I think it is harder not necessarily for technical reasons, but for you know, how do you actually get things in regulated

industries that are super important and have all kinds of privacy constraints and safety concerns,

but I think ultimately that will happen. It's just taking longer than I I hoped. Yes. Because I think there's tremendous world benefit to do it.

Um, but we need to do it carefully and safely. >> Vim or Emacs or something else? Hint, there's only one good answer. >> Emacs. Was that it? Oh, no.

Look, I I'm a Vim person, but I'm I'm not >> Maybe I'm I'm an embarrassment of a Vim person because

I I I looked at Emacs, too, and I was like, that's pretty cool, too, but I I don't want to learn both. It's it's just so much time. So, >> yeah, it's

true. One can spend a lot of time customizing Emacs. >> The vimrc I wrote up, and then it never ends. Yeah. One problem that you tried to

solve many times but have never been able to crack. >> I mean I think in some sense we still don't

have an answer to how do you do continual learning appropriately? That's something I've thought about a little. I've dabbled a little bit with some

techniques along with colleagues. >> But I think uh you know if we're able to crack that it's going to be amazing. Um, but it's not there yet. >> Last

one. Favorite Two-Minute Papers episode. >> Oh, yeah. I mean, I assume the Transformer one was a

good one. >> All right. All right. Well, that's a good one. Okay, Jeff, I learned a lot today. Thank you so much. This chatting with you

again. >> Thank you so much. >> Thank you. >> Here you see me running the full DeepSeek AI model through Lambda GPU cloud. 671 billion parameters

running super fast and super reliably. This is insane. I love it and I use it on a regular basis. Lambda

provides you with powerful NVIDIA GPUs to run your own chatbots and experiments. Seriously, try it out now at lambda.ai/papers or click the link in the description.
