Inference, Diffusion, World Models, and More | YC Paper Club Transcript

All right. Hello everyone. How are you guys doing? Welcome to the first ever YC Paper Club. This is a very exciting thing. Absolutely thrilled with

the response. We had over a thousand folks that applied to come in. It was a very hard selection. If you guys have friends that didn't make the cut,

I'm very sorry. We kind of need to keep it to about a hundred. Um and so we selected a very

cool group. Um the mission is to create this kind of community of great founders and great researchers and try to pull them together. I guess just for

you guys to get a sense for how cool the people in this room are. Um, raise your hand if you have at least five citations, 10 citations, a 100

citations, a thousand citations. Wow, this is insane. Okay, 10,000 citations. Oh my god. Okay. All right.

This is awesome. I would go up to 300,000, but I think it's like Chris Manning and that's about it. Um, so, uh, raise your hand if you've raised at

least a million dollars. Raise your hand if you've raised at least $5 million. At least $10 million, at least $50 million. We still got one. We

still got two over here. All right. Okay. Awesome. The hidden mission that I'll also kind of add on

this is, Harj and I had this awesome breakfast in Woodside, and this place is so unique and special and we kind of just don't use

it enough at YC. So the hidden mission is to make Pioneer great again. And so I went through winter 16 here. Um it was an unbelievable time. I think

140 companies went through that batch. 10 or 15 of them are unicorns. It's an insane number. um WPY,

uh Astranis, um Deepgram, all these companies were in the batch and during that time uh Sam was still running the show and basically sitting right

there would be me, Andrej Karpathy, Wojciech Zaremba and Greg Brockman because they were starting this thing called OpenAI and it was like the very early

stages and there was like not that many AI companies. So they would ask me and Steve from Debb like

what are you guys what are you working on? What are the problems you're working on? and they're looking for problems because they didn't even know

what to research. And so it was such a special time. This place is so special to me in particular, and to Harj as well. And we just, we

don't really use it enough. So I wanted um to kind of make this community down here. And I also think

that of 100% of the AI talent or AI people in the Bay Area, probably about half of them are in the city, maybe, is a good number. There's Anthropic, uh

there's OpenAI, there's Cursor, there's all this stuff in the city. Then there's a lot that are down here that are not making the trek up to the city

to join YC. And so he's like, "Yes, emphatically, yes." Um, and so you have Google DeepMind right on

the corner. You have um Tesla, you have xAI, you have Thinking Machines, you have all these other people in Palo Alto, you have a lot of startups. And

so uh I wanted to kind of like solve six birds with one stone and kind of pull together this community down here as well. And Harj uh is super excited

about it as well. And so thank you very much Harj for letting us do this. We got uh five great

papers here coming up. The first one is Tanishq, Speculative Speculative Decoding. You want to come up? All right. Do you want me to pull it on? Yeah, I got you.

Cool. I know it uh looks like maybe I was sloppy and I added an extra word in the title, but uh it is intentional um and it'll make sense in uh good

time. Um my name is Tanishq. I'm a grad student at Stanford. Um, this is a project I worked on with

Tri Dao and Avner May. I'm going to be evangelizing inference for people today. Hopefully, you'll be inference enjoyers by the end. So, I'm not sure how

much I have to motivate inference. I worked on training before inference. And the sort of mental model I had in mind for how inference works

was you know you do this beautiful craftsmanship during the training process and you get these like

you know very intricate weights and then you kind of just hand it off and use them to generate tokens. In my mind it's sort of like you have the

weights, just multiply the matrices. Why do you need a team for it? Um I was very confused but there is in fact a lot of subtlety involved. Um it's

a lot of fun the algorithms and systems behind inference at scale. I'm not sure I need to spend too long

talking about why inference is important. Um there is one point I want to make that I don't hear people talk about enough. So things you may have

heard are that inference costs are high. They dominate training costs when you're serving a model for billions of users or you know 10 Claude Code

power users. That's trillions of tokens. Um, not only are inference costs dominating training costs, but

even within training, RL is starting to exceed the compute requirements of pre-training. And what is RL but a wrapper on inference, right? So, these

are two things you've probably heard before. The third is one I fear isn't really talked about, but it's the reason that I started working on

inference, and I use the phrase working on inference lightly. This was the only inference project I've ever

done. Um, but the reason I got interested in making inference fast was not because of cost or for convenience. It was entirely because of capability.

So the claim I'm going to make and maybe this is the one thing to take away from the message I'm trying to send in this talk is that inference today

is seen as a sort of like cost or convenience lever. But uh in one two or 3 years inference is going

to be seen as a capability. And what I mean by that is that if you have a method, an algorithm, a system where its performance scales with the amount

of thinking it does, then fundamentally the speed at which you can do inference, the tokens per second is exactly the peak intelligence that you can

deliver. So inference should be thought of as not so much as a a cost or convenience factor, but as a

capability. Um, and that's why I got interested in it. I I wanted to work towards the future where we have an entire data center of 20,000 B200s just

working on the Riemann hypothesis. Um okay, yes, that's the future that uh I had in mind. Perhaps this meme is a little outdated because it has an A100

on it, but uh yeah. Okay. So to motivate things, here is an example of fast inference. So I'm going

to give you a little demo of uh three algorithms side by side. We're going to sample, you know, a code prompt from vLLM with just normal

autoregressive decoding. We're going to use their speculative decoding. And then I'm going to put next to it the sort of janky hand-rolled inference

engine I wrote over a summer for this project. Um, whose main strength is just that it implements a new algorithm

and so you can see them side by side. SSD is on the right and you can see it is quite a bit faster than what you can get if you try to use an open

source engine. Um, and it's not the systems, it's the algorithm. Um so yeah that's what we want to work towards understanding both how

speculative decoding works as well as the algorithm on the right. Okay. Um I'll start by introducing what

speculative decoding is, how it works, and then we'll move into what speculative speculative decoding is. I hope that if you have like a reasonably strong

understanding of how speculative decoding works the problem that SSD is trying to solve will feel very motivated and the algorithm should just become

clear in good time. Okay, so this is the schematic I'm going to use to explain how vanilla speculative

decoding works. Um, it has a small model, the tiny llama up top, as well as a big model, the big llama. And our goal is simply to sample fast from the

big llama. We want tokens generated from the big model. And we're going to use a small model as a sort of proxy or an instrument to be able to sample

quickly from the big model. Okay. So, what the draft is going to be responsible for is basically

generating a bunch of tokens one by one. One by one is important. It's autoregressive. So you need to do three forward passes on the draft or you

know however many some constant number. Um and these are going to be guesses for what the draft believes that the big model is going to output next.

It wants to sort of predict ahead of time. The job that the big model has, I'm going to call it the

target model, is verifying these guesses. What does verification mean? Verification means doing one forward pass over these generated tokens to see

how likely it is that the big model would have generated them. The sort of key asymmetry here, the reason that speculation works is that it is easier

to verify than to generate. This is a feature of the transformer architecture where you can get the

probabilities for many tokens in a sequence in parallel in one forward pass. Um but you can't generate them in parallel. Autoregressive decoding is

one at a time. Um so we're leaving the autoregressive decoding, which is slow, to a very quick and small model and then we're doing just one

forward pass on these tokens. And the way you verify tokens is basically by having the big model look

at the probabilities of each of the generated tokens and see how plausible it is that it would have generated those tokens. And sort of the intuition

here is that we will accept precisely those tokens that the big model could plausibly have generated. Its probabilities were reasonably high. There are

subtleties in exactly what the algorithm is um that I'm going to gloss over, but that's the way to

think about it. Um and then we're going to find a point perhaps where we don't think it's plausible the big model would have generated those tokens

and we're going to reject those tokens. So in the little schematic on the right uh there the draft samples three and the big model verifies them and

concludes that only the first token was something it would plausibly have generated. It will reject the

second token onwards and importantly this is a sort of critical but subtle detail of vanilla speculative decoding: because you have the probabilities at

each of the sequence positions. You can sample an extra token at the point at which you rejected a token for free as in without doing any more forward

passes. And so that yellow token is what I'm going to call a bonus token that you sample for free.

This is going to be important in SSD. Um, so yeah, that's uh that's an important conceptual point. And this sort of sets the stage for how SSD works.
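
A minimal sketch of the draft-then-verify round described above, using toy stand-in models. The talk glosses over the exact acceptance rule, so this sketch assumes the standard rejection-sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the leftover target mass). The names `make_toy_model`, `speculative_round`, and `VOCAB_SIZE` are hypothetical and not taken from the paper or any inference engine.

```python
import random

VOCAB_SIZE = 4  # toy vocabulary size (assumption for illustration)


def make_toy_model(sharpness):
    """Return a toy next-token distribution that favors (last token + 1)."""
    def next_token_probs(prefix):
        last = prefix[-1] if prefix else 0
        favored = (last + 1) % VOCAB_SIZE
        weights = [1.0 + (sharpness if tok == favored else 0.0)
                   for tok in range(VOCAB_SIZE)]
        total = sum(weights)
        return [w / total for w in weights]
    return next_token_probs


def sample(dist, rng):
    """Draw one token index from a (possibly unnormalized) distribution."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_round(prefix, draft, target, k, rng):
    """One round: draft k tokens, verify them, return accepted tokens + bonus."""
    # 1) Draft k tokens one by one (k sequential passes on the small model).
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(k):
        q = draft(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2) Verify. A real transformer scores all k+1 positions in ONE forward
    #    pass; the toy target is simply called once per position here.
    p_dists = [target(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3) Accept left to right; stop at the first rejection.
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: the bonus token comes from the leftover target mass,
        # using probabilities we already have (no extra forward pass).
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # Everything accepted: the bonus token comes from the next position.
    out.append(sample(p_dists[k], rng))
    return out
```

Calling `speculative_round([0], draft, target, 3, random.Random(0))` returns between one and four tokens: the accepted prefix of the draft plus exactly one bonus token.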

Okay, we have our schematic. And the way we've set up speculative decoding is that it's a way to exchange flops for latency. So speculation in general

is not actually something that uh only LLMs do. It's like a a deep idea in computer science. It's

used in CPUs as well where the general philosophy is that you precompute something ahead of time. Some of what you precompute may be useless because it

may be an incorrect prediction of the future, but if you're right, you get to fast forward in time um and you get lower latency as a result. So the

sort of like moral philosophy of speculative decoding is that it's currency exchange. The difficulty

with normal speculative decoding is that you can't push this arbitrarily far. You cannot keep sampling more and more tokens on the draft and keep

getting speed ups because at some point you're going to get to a point where you're spending a lot of time drafting and you're not accepting all that

many tokens. And in particular, like a big bottleneck in vanilla speculative decoding is the sequential

dependence between the small llama and the big llama. Um the drafting in round t has to take place before the verification of those tokens. um and the

drafting in round t+1 can't take place before you know the outcome of verification of the previous round because you need that as a prefix to draft on

top of. So there's a logical dependency here. The goal of SSD is very simple. There's a lot of

gnarly and subtle details but the high-level idea is incredibly simple. It is simply to parallelize this sequential operation. We want drafting and

verification to be happening at the same time. Normally in speculation they happen on the same hardware and that's fine because there's only one of

them happening at a time. In our setup they're going to be happening at the same time. So we're not going to

be co-locating them. And the main question basically becomes how do you parallelize this inherently sequential algorithm that has a logical

dependency. Um and the way we're going to do that is we are going to have the draft model send back its draft tokens in a certain round. So we've sent

back a bunch of blue tokens. That's now the job of the verifier to do a forward pass over and verify. And this

is going to take a while because the verifier is a big model. What we on the draft are going to do is basically start anticipating the most likely

verification outcomes immediately. As soon as we send back like a certain round of speculation and once we have in mind some of the most likely

verification outcomes, we are going to start drafting the next round on top of those immediately while

verification is taking place. If we're right, the next time the verifier asks for a draft, we'll have it ready immediately. We're entirely hiding the

latency of drafting. If we're wrong, well, we'll have to figure out a backup strategy. And there are some subtleties on what

you do and how you do it there. Um so yeah, speculative decoding looks like this. And

perhaps unsurprisingly, the analog for SSD is this diagram on the right, where now drafting and verification happen in parallel. Um the principal

difficulty or algorithmic design space in SSD is how do you predict verification outcomes ahead of time. I thought verification is where you are

leveraging the intelligence of the big model that should by construction be difficult to predict. Um and the

intuition for why it's plausible at all is that you can make many guesses on the draft for what a verification outcome is. And a verification outcome

here is just you know a plausible number of accepted tokens and then a bonus token on top of that. Now this is hard to predict because a bonus token

comes from a vocabulary which has size you know tens to hundreds of thousands. Um so it's a large

space to cover um but it turns out you can do it well um reasonably well. You can get it right about 80 to 90% of the time which is more than enough

to get big speed ups. And the way we do that, the short of it is basically we use information on the draft to predict what the verification outcome is

likely to be. When we generated the blue tokens on the draft, we had other tokens that we chose not

to sample. Those other tokens are plausible verification bonus token candidates. And so you basically use information from the token distributions of

the draft model to predict what likely outcomes on the target are. And then once you have all of these predictions, you can decode them in parallel as

just different sequences that you're decoding on top of a shared prefix. And voila, it

gives you speedups because you get to hide the latency of drafting altogether. Um there's also an additional bonus that since verification actually

kind of takes a while, you get more time to draft uh in the first place. So you can draft more tokens which increases the expected tokens per round

and sort of gives you further speed ups. There's a bunch of stuff that we work through in the paper

that's sort of reckoning with the implementation details of this. One of them is how you handle cache misses. One plausible thing you could do

perhaps naively is to just fall back to ordinary speculation just in time. Turns out that actually this is not always optimal. Um there's trade-offs.
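
The anticipate-then-look-up idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (`candidate_outcomes`, `build_speculation_cache`, `next_draft`, `draft_next_round`) and the fixed `top_n` budget per prefix length are hypothetical, the loop over guesses stands in for a batched parallel decode on the draft hardware, and the miss path shown is the naive just-in-time fallback that the talk says is not always optimal.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Outcome = Tuple[int, int]  # (number of accepted draft tokens, bonus token)


def candidate_outcomes(drafted: Sequence[int],
                       draft_dists: Sequence[Sequence[float]],
                       top_n: int) -> List[Outcome]:
    """Guess likely verification outcomes from the draft's own distributions.

    draft_dists has len(drafted) + 1 entries: one per drafted position plus
    the draft's distribution after the last drafted token.
    """
    k = len(drafted)
    outcomes: List[Outcome] = []
    for n_acc in range(k + 1):
        dist = draft_dists[n_acc]
        ranked = sorted(range(len(dist)), key=lambda tok: dist[tok], reverse=True)
        if n_acc < k:
            # A rejection at this position means the bonus token is one of
            # the tokens the draft chose NOT to sample here.
            ranked = [tok for tok in ranked if tok != drafted[n_acc]]
        outcomes.extend((n_acc, tok) for tok in ranked[:top_n])
    return outcomes


def build_speculation_cache(prefix: Sequence[int],
                            drafted: Sequence[int],
                            draft_dists: Sequence[Sequence[float]],
                            draft_next_round: Callable[[List[int]], List[int]],
                            top_n: int) -> Dict[Outcome, List[int]]:
    """Pre-draft the next round for every guessed outcome.

    In a real system this would run as one batch on the draft hardware
    WHILE the target model is still verifying the current round.
    """
    cache: Dict[Outcome, List[int]] = {}
    for n_acc, bonus in candidate_outcomes(drafted, draft_dists, top_n):
        next_prefix = list(prefix) + list(drafted[:n_acc]) + [bonus]
        cache[(n_acc, bonus)] = draft_next_round(next_prefix)
    return cache


def next_draft(cache: Dict[Outcome, List[int]],
               actual: Outcome,
               prefix_after_verify: List[int],
               draft_next_round: Callable[[List[int]], List[int]]):
    """Return (draft tokens, cache_hit) once the real outcome is known."""
    if actual in cache:
        return cache[actual], True  # hit: drafting latency fully hidden
    # Miss: naive fallback, draft just in time as in ordinary speculation.
    return draft_next_round(prefix_after_verify), False
```

The cache key is the pair (accepted count, bonus token), so a hit requires guessing both correctly.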

You know, as batch size increases, you're going to fail to predict some of the sequences'

verification outcomes. Um and so you need different ways to predict and handle cache misses. Should you be allocating your compute on the draft

equally amongst plausible prefix lengths? Uh the short answer is no. You can be clever about it. And all of this trickery just helps you increase your

cache hit rate, so to speak, the fraction of the time you're able to correctly predict verification outcomes. And

there are some trade-offs between cache hit rate and the actual quality of the drafting you're doing. Um and this is totally non-obvious. Um, and

we go into why that exists and how you can navigate it in the paper. Um, I'm happy to talk about it in Q&A as well. Um, okay. So, what do you

get for the price of this uh mind-numbing complexity and uh pain wrangling an inference engine?

Well, you get the privilege of watching a number go up, which I guess is the north star of all AI research. And so here we have uh a bunch of

inference algorithms and inference engines. The blue ones are sort of uh my inference engine and uh the light blue is just the baseline implementation

of speculative decoding. The red is SGLang which is you know of all the inference engines we tried the

fastest with speculative decoding and the dark blue is SSD. Um and normally speculative decoding is a win for latency but it's sort of unclear

whether it's useful for throughput. Um for us, in this setting, it's actually a win for both. Um and so you get numbers going up and you also

get the ability next time you are at a San Francisco house party um to see other people dancing and

knowing in the corner that uh you know what it takes to sample at 300 tokens per second uh for Llama 3 70B on 4 H100s. So this is uh sensitive information um but yeah that's about it. Thank you.

All right, that was awesome. Okay, so for this next paper, this is um my first experience being scooped. The only issue is that he didn't talk to me

and he did it six months before me. Um but uh Isaac can vouch for me on this and maybe Robert as well. I basically fell in love with the diffusion

policy paper. I was like this is definitely like you know a full uh predicting like T horizon steps for

your robotic control. Um we have these amazing video models. Why don't we just use the video model to like run this like at test time to like play out

the movie and where do I end up? And then you have your classic Push-T. And then I started like looking around uh and then DeepMind of course already

did it. So I wasted like a month and I was not happy. But anyway, thank you very much. Please

welcome Stannis.

Hi everyone. I'm Stannis. I'm a staff research scientist at Google DeepMind. Uh currently I'm co-leading a new project on world modeling for

robotics, uh where we try to build general purpose policies on top of video and world models. But uh this is an early work that I did about two years

ago. Uh so this is before I switched to working on hardcore robotics and uh going into hardware really

scaling up the data but uh you can probably see a lot of very similar ideas early version of ideas demonstrated on toy problems. Okay. So uh first to

give some background, what is model predictive control? So model predictive control, also called receding horizon control, uses a dynamics model,

or some people also call it a world model, and an action selection mechanism, which is a planner, to

construct agents that can solve a wide variety of tasks by means of maximizing a known objective. So the main advantages of model predictive control are,

uh it can adapt to novel reward functions at test time. Also, the dynamics models are easier to learn and generalize better than just policies, and

the action proposal and dynamics model factorization also allows easy adaptation to novel dynamics. So

we're going to uh demonstrate some of these in later experiments but basically here we are showing the overall idea which is extremely simple. We have

an action proposal which proposes a sequence of actions. We have a dynamics model which can evolve these actions and give you the future states. And uh

finally we have some objective functions that we are trying to optimize. We basically use a

planner to optimize that and uh pick the actions and execute them in the environment. So what is diffusion model predictive control? So the motivation

mainly is uh there are a couple of problems we need to address in order to make MPC effective in practice. One, the dynamics model needs to be accurate

to avoid the problem of compounding errors and uh two the planning algorithm also needs to be powerful

enough to select a good sequence of actions. So with DMPC what we did is to use diffusion models to learn both multi-step action proposals and

multi-step uh dynamics models. So the advantages are mainly to reduce compounding errors and we also found that uh it can simplify the planning

algorithm. Essentially we can just use a very simple uh sampling based planner and we can already outperform a

lot of the previous uh approaches. So uh before we dive into the details, I also want to give a hierarchical view of some related works we organized. So

there are a lot of related works in the literature and uh we organize it uh in this way where we basically look at how different approaches um so

basically all approaches essentially try to build a joint uh distribution of the states and the actions

but they do it in different ways and also use the different components in different ways. So for example, you can build it in a factorized way where

you have rho(a), which is your policy predicting the actions, and then, conditioned on the actions, predict the state, which is a dynamics model. And uh for

this you have the Dyna paradigm where you basically learn a model and use the model to also generate

data in imagination and then learn a policy. But uh you can also do MPC uh where you uh essentially use a planner to select the actions and uh we

also have uh some uh there are also approaches where you build a joint model of the state and actions and you're essentially also doing MPC and there

are also model free approaches where you directly learn a policy. uh I won't dive into the full details

but uh there are basically different trade-offs in terms of whether we can do runtime planning, adapting to novel rewards and

adapting to novel dynamics, leveraging non-expert data, and also the general speed at runtime, and there is also the distinction between whether

you're doing singlestep modeling or multi-step modeling. Okay. So coming to diffusion model, diffusion

model has enjoyed a lot of successes uh in uh generative AI especially for generating images and videos. But uh in recent years they also found a lot

of successes in robotics. So currently uh so here I'm also showing a slide where uh this is a kind of the exploration space for uh diffusion based uh

what I would call diffusion based agents. So we of course start with the diffusion policy where we

condition on the observation and generate future actions. But then we also have this work called Diffuser, which uh you can think of as a

way to jointly model uh observations and states but in toy space. There are of course these ideas are explored in tons of different papers but

this is just a very simple and uh conceptual way to describe it. And uh then there's also decision

diffuser where, conditioned on the observations, we directly generate future, uh, we condition on the history, directly generate future observations, and

then train a separate inverse dynamics model to derive the actions. And uh finally we have the diffusion model predictive control where we first have an

action proposal to propose future actions and use a dynamics model to evolve it and uh then use

planner to select the actions. There are different uh trade-offs among these. So for example, diffusion policy is SOTA on complex

control like day-to-day we still rely on it a lot. But this requires expert demonstrations. So essentially you can't move out of the behavior cloning

paradigm. Uh for Diffuser, it's jointly modeling state and action. So it has implicit world modeling and

also model based planning. And this is actually something that we are trying to explore at scale similar ideas. But uh and then there's also uh

decision diffuser where you do observation only learning. The main benefit of this is it allows you to leverage uh video only data to learn from video

only data because for robotics uh the data is a main bottleneck. And then finally there's diffusion MPC

which allows us to do runtime adaptation to novel rewards and novel dynamics. So what does the algorithm look like? It actually is extremely simple.

We have uh an offline dataset and uh we have uh some hyperparameters. Essentially we are learning a couple of models all from

the offline dataset. We're learning a policy which, given the current observation, predicts the

actions. We're learning a dynamics model which, given the actions, evolves the observations to predict the future states. And uh

basically after learning all this, at inference time when we actually deploy it as a policy, we sample action proposals, score them,

rank them and uh pick the best. But uh the main difference uh compared to previous approaches is uh we

adopted a multi-step action proposal which uh is uh essentially very similar to a diffusion policy but if you train on more diverse data it can give

you uh more coverage in terms of the action space and uh we are also using a multi-step um uh dynamics model which uh allows you to uh evolve for a

long time horizon without a lot of compounding error. And also there's the

fact that we leverage diffusion models, which are a really powerful way to model data, especially multimodal data, and what we observed empirically is

that the stronger modeling capabilities also allow us to simplify the planning algorithm so that we can just use such a simple planner to

solve the tasks. Yeah. Um also contrasting with a few of the representative uh past works

uh including uh model based offline control offline planning and this diffuser work which I mentioned it learns a joint model and uses a classifier

free guidance for planning. Okay. Uh so yeah next to dive into some uh results uh there are lots of numbers but the short answer is uh we obtain very

competitive results in fixed reward single task setups. This is just to demonstrate that uh the

approach uh when you deploy it in uh single reward uh fixed reward single task setup it can perform competitively to the current state-of-the-art uh

previous state-of-the-art approaches. But uh I think uh there are a couple of uh more interesting uh properties of DMPC. One is it can adapt to novel

rewards at runtime. Here we are showing some uh examples where uh essentially we train the model on,

these are very simple MuJoCo tasks, but we train the model on just locomotion tasks, run forward and jump etc. But uh at inference time we can, just

by changing the reward function, make it exhibit novel behaviors like jumping etc. So uh here's another example where we show that uh

DMPC can adapt to novel dynamics while these kinds of joint modeling approaches struggle. This is

really the benefit of the factorization of the action proposal and the dynamics model. So here the idea is uh we can keep the action proposal the

same but uh we uh we have uh scenarios where the dynamics of the environment changed. So for example the walker has a broken left ankle and as a

result when it starts to execute actions the consequence of the actions change. So in such cases because

of the factorized representation in DMPC we can uh simply just adapt the dynamics model on some play data collected in the new environment and uh we

observe that we can recover a lot of the performance lost because of the changing dynamics. Finally, we dug into the various components of uh the DMPC

design and we demonstrated that uh the different components in DMPC basically contributed to improved

performance. Uh these include: the diffusion action proposals improve performance and simplify the planning. We do

multi-step diffusion action proposals and the fact that we do multi-step also uh contributes to improved performance and finally multi-step dynamics

modeling also uh contributes to improved performance. Uh that's it.
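
The sample, score, rank, pick-the-best loop described in this talk can be sketched as below. The two model functions are deliberately simple stand-ins (a uniform random proposal and a 1-D integrator), not diffusion models, and every name here (`propose_actions`, `predict_states`, `plan`, `run_mpc`) is hypothetical. The sketch only illustrates the factorization into an action proposal, a multi-step dynamics model, and a reward that can be swapped at test time.

```python
import random


def propose_actions(obs, horizon, rng):
    """Stand-in for the learned multi-step action proposal."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]


def predict_states(obs, actions):
    """Stand-in for the learned multi-step dynamics model (1-D integrator)."""
    states, state = [], obs
    for action in actions:
        state = state + 0.1 * action
        states.append(state)
    return states


def plan(obs, reward_fn, horizon=8, num_samples=64, rng=None):
    """Sampling-based planner: propose, roll out, score, keep the best."""
    if rng is None:
        rng = random.Random(0)
    best_score, best_actions = float("-inf"), None
    for _ in range(num_samples):
        actions = propose_actions(obs, horizon, rng)
        states = predict_states(obs, actions)
        score = sum(reward_fn(s) for s in states)
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions


def run_mpc(obs, reward_fn, steps, rng):
    """Receding horizon: execute only the first planned action, then replan."""
    for _ in range(steps):
        actions = plan(obs, reward_fn, rng=rng)
        obs = obs + 0.1 * actions[0]  # toy environment step
    return obs
```

Swapping `reward_fn` (for example, rewarding states near +1 versus states near -1) changes the chosen plan without retraining either stand-in model, which is the test-time adaptation property the talk highlights.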

All right. And that was the last Google DeepMind paper that they're going to publish. So, good luck out there. Um, this next one is one of my lab

mates that I work with a lot, who is the most world-model-pilled person that I know. And so, I can't imagine, you know, anyone else presenting this

paper other than Yann LeCun himself. Um, Isaac Ward. There you go. Thanks a lot. >> All right, guys. Is that a

good distance? You all can hear me at the back. Cool. Cool. Yeah, I'm enjoying a uh a cool little period in life where I started working on world

models a couple years ago, kind of before they got really hot and now they're enjoying a moment in the sun and suddenly everyone wants to talk to me

which is nice. I'm presenting LeWorldModel, which came, of course, out of Yann LeCun's group. Uh

QR code here if you want to follow along with the project page, but I'll explain through it and yeah, really excited to talk to you about this one. Uh

hidden in this presentation is really like a billion-dollar question and it's not hyperbole. Uh Yann LeCun's raise of $1.03 billion back in

March basically just to train world models is sort of what this presentation is about. I want to get

at some of the questions that they're going to be testing. First five slides here just going to do some basics on world models. I think we've all

heard the term but I want to just make sure we're all on the same page and then we'll jump into uh what this paper is really uh offering and what it

means for world models at large. But first of all, world models, what are they? Why do we care about

them? So really it's about learning the dynamics of the world, which is to say we're trying to come up with some model. Typically, we're using like a

big neural network to predict how a system will change over time based on its inputs. So, you have your current state or scenario using S for notation

here. You're playing some action, maybe that's like a movement or a command for a robot, um, or a

language command for a robot, and then you're trying to predict like what its outcome is going to be, like what scenario will it end up in once it's

executed that action. So, you're really trying to model the system or the environment that the robot is in, modeling the world. It's a world model.
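
As a toy illustration of the definition just given (current state plus action in, predicted next state out), the sketch below fits a tiny linear model to logged transitions. The hidden system, the linear form, and the names `collect_transitions` and `fit_linear_world_model` are all assumptions made for illustration; a real world model would typically be a large neural network over high-dimensional observations.

```python
import random


def collect_transitions(n, rng):
    """Log (state, action, next_state) from a hidden toy system."""
    data, state = [], 1.0
    for _ in range(n):
        action = rng.uniform(-1.0, 1.0)
        next_state = 0.9 * state + 0.5 * action  # unknown to the learner
        data.append((state, action, next_state))
        state = next_state
    return data


def fit_linear_world_model(data):
    """Least-squares fit of next_state ~ w_s * state + w_a * action.

    Solves the 2x2 normal equations in closed form.
    """
    ss = sum(s * s for s, a, _ in data)
    sa = sum(s * a for s, a, _ in data)
    aa = sum(a * a for s, a, _ in data)
    sy = sum(s * y for s, a, y in data)
    ay = sum(a * y for s, a, y in data)
    det = ss * aa - sa * sa
    w_s = (sy * aa - ay * sa) / det
    w_a = (ay * ss - sy * sa) / det

    def world_model(state, action):
        """Predict the next state from the current state and action."""
        return w_s * state + w_a * action

    return world_model


model = fit_linear_world_model(collect_transitions(50, random.Random(0)))
predicted = model(2.0, 1.0)  # the hidden system would give 0.9*2.0 + 0.5*1.0
```

Because the logged data here is noise-free and the model family matches the hidden system, the fit recovers the dynamics almost exactly, which real sensor data would not allow.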

Uh, these kinds of models are really cool. They enable a few really interesting capabilities. One of

them is generating imagined outcomes. We've probably all seen like the sort of weird kind of um hallucinatory uh imagination sequences coming out of

world models over the last couple years. We'll talk more about those and why they're useful. Uh this allows us to get to model based control. I'm glad

Stannis kind of explained that in the last talk for me, so I'll skip over it. Um and the last piece is

really cool. Surprise quantification. Uh I'll get to that later. Um but a really powerful capability of world models. I wanted to communicate to you

all that this is not a new idea at all. It's really just kind of new advertising or packaging on an old idea. So I started going back through Google

Scholar and this is a paper that I think is older than the average age of this room. Um from NeurIPS

1990 and of course Richard S. Sutton who we know from reinforcement learning basically describes exactly a modern world model a black box that takes

as input its situation and its action that it's going to execute and outputs a prediction of its immediate next situation. So really old idea and uh

that's the flyer from NeurIPS 1990. Great. Right. So, getting a little bit more explicit um and

changing the notation from state to observation just because in real world systems, we typically don't have access to the exact true state. We

typically have some observation from sensors. This is just an example that I pulled up from some world models that we're training on a quadrotor. So,

as an example, the observation that the quadrotor gets might be its current kinematic state, position,

velocity, this kind of thing, in addition to the images that it's taken from a forward-facing camera. The action might be a control input, in this

case a yaw, and move back to the left. And then we want to make a prediction that says well if you do that action you're going to end up slightly back

in the room and looking to the left. And we actually want to generate what the sensor um would result

uh in this case. So high-dimensional observations, images uh and also LiDAR and things like that, are completely on the table in world models. Uh

they're really challenging because action sequences can be quite long. Um and the really big thing is that the minimum in the optimization landscape

for these kinds of models may not correspond to the desired behavior. And more on that later. Um, but

hopefully you'll agree that if you have trained a system that's capable of doing this thing, it must have an internal model of the world. And imbuing

agents with an internal model of the world, um, is potentially a very useful capability. And that really is the big question. Are we going to have

model free or model based policies? Are our agents going to have an internal model of the world or are

they not? And this is sort of being fought out right now both in the research community and in like the startup community. So on the left, model free.

The idea is you're taking some observations, you're feeding this into some kind of big neural network potentially with a bunch of interesting learning

tricks there, but you're getting some optimal action out. So, it's just mapping between

observation and some optimal action. But at no point is there an explicit representation of what the future might look like if you execute that

action. These kinds of models are pretty good. There is growing evidence to show that internal to these neural networks are highly obfuscated and

challenging to interpret world models uh sort of in the weights. Uh I'll talk about a paper very briefly

that um speaks to that and maybe someone can present on it in a future week. And then over on the um other side, model based approaches, right? So

now we're saying we're going to train this world model up explicitly and actually use that in our policy to be able to explicitly predict the outcome

of potential actions. So yeah, totally like two different species of policies. The model free stuff,

one of the weaknesses is they show a little bit of brittleness to out-of-distribution inputs. Um, model based ones are great because you can kind of

quantify modeling error and this is really important when you're deploying things in the real world. Uh, we'll talk a little bit about this. I have a

little asterisk here, some biological precedent which we'll speak to more. Um, and you have to have this

additional mechanism of course which is a downside where you actually need to propose action candidates to evaluate with the world model um, which

Stannis spoke to in the previous talk. This is a great paper. But I just wanted to chuck this in there uh which talks about how even model free

policies do have world models in them and a really cool paper that hopefully can be presented in a

future week. Uh just to make it concrete before we jump into the paper I wanted to just bring a little toy here just to show you what this looks like.

So of course I went to Push-T like all good researchers do, and in Push-T we basically just have an image of a little blue ball agent and you're trying

to push the blue T into the green slot. Uh the state, or rather the observation, is comprised of

that image plus the 2D position of the end effector, and the action is the 2D position you're going to move the end effector to. So you can make a little architecture

that looks like this. I just whipped this up. Couple hundred thousand parameters and um oh let's play this. So if that's the actual rollout, this is

what the model thinks the action sequence is going to do. So you can see it's a little bit wobbly

because it's a tiny model, but we can certainly train up models of these kinds of toy environments and indeed more complex ones. So what are the

challenges associated with training this kind of model? Well, one is you're trying to learn the representation of the world. So how you're going to

compactly represent those high-dimensional images or LiDAR inputs or high-dimensional sensor inputs at

the same time as you're trying to learn how actions change that representation. So you're co-learning representation and dynamics. And there are many

solutions in the optimization landscape that will essentially just cause you to do nothing. So for example a local minimum in the optimization

landscape is to say well every state is just the same it's a trivial collapse basically um and there

are many techniques in the literature to say how can you avoid these. So there are solutions of a variety of different kinds that basically say there's a

way to avoid the collapse associated with training world models, and that's really where LeWorldModel comes in. It says, well, instead of having to

use some manner of trick or like special method or a bunch of like hyperparameter tuning schedule,

we're instead going to really drastically simplify this and go for a more elegant method. So, if you know a little bit about world models, there's

some popular ones in the top right here. This is a figure straight out of the paper. So, PLDM is planning with latent dynamics models; DINO-WM, DINO being

um, distillation with no labels, plus world model; Dreamer out of DeepMind; and then temporal difference MPC

as the final one. So, in some way, shape or form, and I'll explain this, they use some kind of trick or um like challenging-to-configure design to get

away from uh this collapse, to avoid this collapse, and LeWorldModel is coming in and saying basically we can do this with sort of one hyperparameter

and one loss term, which I'll talk about. There's really no time to go through all the different tricks

that different world model approaches use because it really is the wild west out there right now so many different methods but they basically fall

into one of these three categories. So one is you could do some explicit heuristic that stops collapse by like enforcing some special um healthiness in

like the latent space of your embeddings. Um the language trick is maybe a bit unfair here, but it's

what's used in the paper. Uh you could use some foundational methods. So you could take some like existing autoencoder or diffusion model or video

model and use that as a basis for your world model and add an action conditioning element in there. Um or you could use some privileged data that may

not be usually available to the model outside of train time uh to be able to avoid collapse. And

LeWorldModel, even though it says that it's doing something very different, I really think uh it's just offering a new kind of trick uh which I'll talk

about here. So JEPA is joint embedding predictive architecture. It's sort of Yann LeCun's main work, and LeWorldModel is a kind of JEPA model. Uh

basically the way it works is you're going to take an autoencoder um or I should say an image encoder uh

encode this observation, in this case it's of a robot doing a push cube task. That's going to turn that image into a latent vector in the latent space

of this encoder. Uh you're going to train an action-conditioned forecasting module, this predictor, to be able to predict what is the next latent embedding

going to look like when I execute this action. So not what the next image is going to look like but

what's the next latent going to look like and you can use the decoder attached to that encoder to decode that back out into a useful image. But for

the most part all the interesting work is going to be done in the latent space. And basically what they say is over a batch all of those latent

embeddings uh should be in a healthy distribution, which they describe as a Gaussian distributed uh

distribution in the latent space, and thus enters the SIGReg regularizer, which is the sort of new term they add. So S for sketching, as in uh doing

one-dimensional passes over high-dimensional data. Um I for isotropic, so this should look the same when you slice it in any direction, and G for

Gaussian distributed: SIGReg. So basically you're taking all of these embeddings of your different

predictions, doing a one-dimensional slice over each direction like in that high-dimensional space, and then you want each of the curves across those

slices to be Gaussian distributed, and if that's true then your um distribution in the latent space must be very healthy. Uh so the idea is you can

quite cheaply evaluate how Gaussian distributed your embeddings are and thus how healthy your world model is

and how non-collapsing it is. So essentially I'd just say, instead of training only on the normal predict-the-next uh latent loss, you add on this additional SIGReg

term. So I'd argue that basically this paper is just um providing a very elegant kind of regularization. And to finish off I'll just talk about three

capabilities that you get from this. So one is the open-loop prediction quality. This is what world

models do. So you feed in like the context, this Push-T at the top, and you can see the top row is the real example. The bottom is the imagined and they

look about the same. This is good. It means your world model is really good at predicting what your next action is going to do. They do that on Push-T

and then on a slightly um like a 3D analog task like a push cube. This is all great. I love seeing

these um these plots. Um but really what matters is how does this actually affect the policy like for the actual task completion. How is this useful?
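The regularizer described a moment ago (slice the batch of latents along random directions and check that each 1-D slice looks Gaussian) can be sketched roughly as below. This is a simplified stand-in, not the paper's implementation: the Gaussianity check here compares each slice's empirical characteristic function to that of a standard normal at a few hand-picked frequencies, and the names `sketched_gaussian_penalty` and `world_model_loss`, the frequencies `ts`, the number of slices, and the weight `lam` are all assumptions made for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slice_gaussianity_penalty(projections, ts=(0.5, 1.0, 1.5, 2.0)):
    """Squared gap between a 1-D sample's empirical characteristic function and N(0, 1)'s."""
    n = len(projections)
    penalty = 0.0
    for t in ts:
        real = sum(math.cos(t * p) for p in projections) / n
        imag = sum(math.sin(t * p) for p in projections) / n
        target = math.exp(-0.5 * t * t)  # characteristic function of N(0, 1) at t
        penalty += (real - target) ** 2 + imag ** 2
    return penalty / len(ts)


def sketched_gaussian_penalty(embeddings, num_slices=16, seed=0):
    """Average the 1-D penalty over random slices of a batch of latent vectors."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_slices):
        u = random_unit_vector(dim, rng)
        projections = [sum(ui * zi for ui, zi in zip(u, z)) for z in embeddings]
        total += slice_gaussianity_penalty(projections)
    return total / num_slices


def world_model_loss(prediction_loss, embeddings, lam=0.1):
    """Next-latent prediction loss plus the weighted regularizer (lam is hypothetical)."""
    return prediction_loss + lam * sketched_gaussian_penalty(embeddings)


# A healthy batch (isotropic Gaussian latents) versus a collapsed one (every latent identical).
rng = random.Random(1)
healthy = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(512)]
collapsed = [[0.0] * 8 for _ in range(512)]
healthy_penalty = sketched_gaussian_penalty(healthy)
collapsed_penalty = sketched_gaussian_penalty(collapsed)
```

In this toy setup the collapsed batch gets a much larger penalty than the healthy one, which is the signal that pushes training away from the trivial "every state is the same" solution.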

Um and that sort of brings us into how you can use these models for model predictive control. Basically you take your initial observation and a goal

observation. I put an asterisk there because how often do you have a goal observation in a robotics

task? Like you don't always know exactly the situation that you want to end up in. But in this case, that's how they frame it. So they say, you know,

the world looks like this right now. I want the world to look like this. You encode both of those. And then you're basically doing a search over the

actions that will get you in the latent space from this starting point to this ending point. And

there are well-defined optimization methods to um to achieve that. It works pretty well. I'll make it um make it simple. LeWorldModel is better

than the competition on these like small 2D tasks. As soon as you go to 3D, DINO-WM wins. It does have a big foundational backbone trained on

that kind of image data. So you'd expect it to um to win. Um they run on a really simple

environment called Two-Room and kind of say you know we don't do so well on this but that's because we're promoting like really high dimensional

healthy embeddings and it's a very low dimensional problem. I'm not sure if I'd truly go for that. Um but a good takeaway is that it's about 50 times

faster than any of the competition across the board because it's doing all this work in the latent space

and it doesn't have to have any like additional tricks relating to more forward passes or like having two copies of the model in memory. And uh you

can actually boot this thing up on like a single card, less than 24 gigabytes of VRAM and it's only 15 million parameters. So that is pretty nice.
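The goal-conditioned planning loop described above (encode the current and goal observations, then search over actions in latent space) can be sketched as follows. Random shooting is used here as the simplest possible search and is an assumption, not necessarily the optimizer the paper uses. The `encode` and `predict` functions are toy stand-ins for the trained encoder and predictor, and the horizon, candidate count, and step limit are made-up values.

```python
import random


def encode(observation):
    """Stand-in encoder: the toy observation is already a 2-D latent."""
    return tuple(observation)


def predict(z, action):
    """Stand-in latent dynamics: the action shifts the latent directly."""
    return (z[0] + action[0], z[1] + action[1])


def latent_cost(z, z_goal):
    """Squared distance between a latent and the goal latent."""
    return sum((x - g) ** 2 for x, g in zip(z, z_goal))


def plan(z_start, z_goal, horizon=5, num_candidates=512, max_step=0.2, seed=0):
    """Random shooting: sample action sequences, imagine each one, keep the best."""
    rng = random.Random(seed)
    best_cost, best_actions = float("inf"), None
    for _ in range(num_candidates):
        actions = [
            (rng.uniform(-max_step, max_step), rng.uniform(-max_step, max_step))
            for _ in range(horizon)
        ]
        z = z_start
        for a in actions:
            z = predict(z, a)  # roll forward entirely in latent space
        cost = latent_cost(z, z_goal)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


z_start = encode((0.0, 0.0))
z_goal = encode((0.3, -0.2))
best_actions, best_cost = plan(z_start, z_goal)
# In model predictive control only the first action is executed, then the plan is recomputed.
first_action = best_actions[0]
```

Because every candidate is evaluated with cheap latent-space predictions rather than decoded images, the cost of planning scales with the size of the predictor, which is one plausible reading of why the speed numbers look so favorable.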

Final piece, this is what I think is a really cool capability of world models. Um you can quantify the

model error. So basically they just come up with some trajectories that kind of screw with the world model. So the top one is going from left to

right. That's time. Uh so that's just like a nominal example. Everything's normal. Then they take the same example, but they change the color of the

T. And then they take the same example, but they just teleport the T into a different location. And

this is really cool because you can actually see the moment they apply those perturbations, you get a spike in the model error and this is detectable

which is to say world model enabled agents can quantify how poor their predictions are. They have good estimates of their uncertainty. This is really

powerful. Model-free approaches don't natively give you this stuff. This is my last slide. Um a

few discussion points and broader themes maybe we can chat about here. Obviously, you know, are we going to go with model based? Are we going to go

with model free? Um what's going to be the best way to enable intelligent agents to do interesting things in the world? Regularization and

representation learning. Um, in this paper they are co-learning the representation of the world that the agent

has and the dynamics of the world. Should this be separated? Can we take some bio inspiration? Should we use pre-existing um like foundation models

and stuff like that? And then finally, how can we fight uh representational collapse elegantly? I think this work does a really great job of that, but

the question is still out on what the best way to do it is. So um that's my talk. Thanks very much

for your attention.
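The surprise signal from the perturbation experiment in this talk can be sketched in a few lines: predict the next latent, compare it with the latent of what was actually observed, and flag steps where the gap spikes. The dynamics, the teleport at step 5, and the threshold below are all toy assumptions chosen for illustration.

```python
def predict(z, action):
    """Stand-in latent dynamics: the action shifts the latent directly."""
    return (z[0] + action[0], z[1] + action[1])


def prediction_errors(latents, actions):
    """Squared error between each predicted next latent and the observed next latent."""
    errors = []
    for t, action in enumerate(actions):
        z_hat = predict(latents[t], action)
        z_next = latents[t + 1]
        errors.append(sum((p - o) ** 2 for p, o in zip(z_hat, z_next)))
    return errors


def flag_surprises(errors, threshold):
    """Indices of the steps whose prediction error exceeds the threshold."""
    return [t for t, e in enumerate(errors) if e > threshold]


# Build a nominal trajectory, but teleport the observed latent at step 5.
actions = [(0.1, 0.0)] * 10
latents = [(0.0, 0.0)]
for t, action in enumerate(actions):
    z = predict(latents[-1], action)
    if t == 5:
        z = (z[0] + 0.5, z[1] + 0.5)  # perturbation the model could not have predicted
    latents.append(z)

errors = prediction_errors(latents, actions)
surprising_steps = flag_surprises(errors, threshold=0.1)
```

Only the teleport step is flagged; the steps after it are unsurprising again because the model predicts from the new, already teleported latent.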

All right. Okay. So, for the next two, um, we're kind of focusing on, um, less world model stuff and more heady, high level stuff that I think is

pretty interesting. Um, this is a paper that's going to be presented by Ashe, from one of the YC uh startups here named Q Labs. And you're co-founder,

president. You're president of Q Labs. Is that right? >> Okay. Welcome Ashe. >> Hey everybody. Today I'm going

to be talking through Andrew Gordon Wilson's paper uh Deep Learning is Not So Mysterious or Different. Uh we actually work with Andrew on the

generalization problem at Q Labs. So I'm really excited for more people to know about his work. The current state of machine learning is that we know

that scaling models leads to better generalization. But we don't have a mechanistic

understanding of why that is the case. Um yeah, if we can understand generalization, then we might be able to optimize for it as well. So the

payoff to understanding it is actually really large. Um when you talk to people in the field, they often explain that generalization is a mystery and

they point to examples like overparameterization, benign overfitting and double descent as reasons

why we might not be able to understand generalization at all. So Andrew's work here basically dispels those mysteries by using classical theories of

generalization uh which have to date not really been used to explain things like overparameterization. So the first classical theory that

we'll go through is uh PAC-Bayes. So PAC-Bayes basically bounds the test loss, which is the generalization,

the quantity that we care about, with a training loss and a compression term. Um the thing is in the past when people overparameterize models

this compression term tends to dominate and so in practice these bounds become loose and vacuous meaning that we can't use them for anything at all.
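For reference, one standard way to write a bound of this shape is the closely related countable-hypothesis form below. This is a sketch under assumptions, not necessarily the exact statement on the slide. The symbols are introduced here for illustration: R for the test risk, \hat{R} for the training risk, P(h) for a prior over hypotheses, K(h) for the description length of h in bits, n for the number of training points, \delta for the failure probability, and \Delta for the range of the loss.

```latex
% With probability at least 1 - \delta over the draw of n training points:
R(h) \;\le\; \hat{R}(h) \;+\; \Delta \sqrt{\frac{\log\frac{1}{P(h)} + \log\frac{1}{\delta}}{2n}}

% With a prior that favors compressible hypotheses, P(h) \propto 2^{-K(h)}:
\log\frac{1}{P(h)} \;\le\; K(h)\,\log 2
```

The term under the square root is the compression term. It is small whenever the trained model has a short description, regardless of how many raw parameters it has.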

This was basically due to a misapplication of the bound. You can compute the compression term in an

alternative way as we'll get into sort of later in the talk here. So let's go through the first mystery that uh Andrew goes through in his paper. Um

the mystery that he talks about is overparameterization. And this is basically the idea that as you scale up the model parameter size, from the

bias-variance trade-off you would expect that you might overfit. But in practice, we see the

opposite. The scaling laws tell us that we actually get better generalization. Um the scaling and the better generalization from

overparameterization is behind like the massive gains in model capability over the last couple of years. But we still don't really understand

why it improves generalization. So the PAC-Bayes framework gives us a pretty useful way to think about

the success of overparameterization. The first is with empirical risk. Empirical risk is basically training loss. When you increase the number of

parameters you can fit your data better. Um so the empirical risk the left uh the first term goes down. And Andrew's work also finds that when we

increase the model, when we increase the number of parameters, um we also find more compressible

solutions. So this is work by Lotfi et al., and they develop methods to basically compress the uh yeah they compress the training set

and the model and they basically find a negative correlation between the bits required to encode the training set and the number of parameters. Um and

so we find that as we increase the model size we can find more efficient encodings of the training

set. So the second term in this bound also gets lower. Another perspective on this model compressibility point is a perspective of flatness. As you

increase the number of parameters, it turns out that the volume of flat minima in parameter space exponentially increases. This is the

green region and uh and comparatively the volume of sharp minima increases much less and uh this is

interesting, and this is useful for the compressibility view, because flat minima are known to be more compressible than sharp minima and so

overparameterization fits within existing theories and through Andrew's work we actually see useful bounds on generalization even for models at like a

billion parameter scale and so we go to the next so-called mystery of deep learning which is called uh benign

overfitting which Andrew also dispels in or at least partially explains in his paper. So the idea of benign overfitting is that deep neural networks

are able to fit totally random noise but at the same time they are able to to generalize well when you have structured data. The mystery is how can

you have an inductive bias that allows you to generalize well if you can also fit totally random data.

I think a regularized polynomial model um in Andrew's paper gives us pretty good intuition for how this might be the case. Here you can see that on

random data, so section C of the figure, we have enough parameters to fit the data and so we can fit the totally random data. But on

structured data, the regularization pushes us to use the lower order terms. And so we are able to both get the

flexibility but also have inductive bias that allows us to generalize. And generally this is the view to take um for neural networks:

they are expressive models with a soft inductive bias. Um we can go through this concept um just using this figure right here. So uh on the left hand

side we have an example of what's like a flexible hypothesis space. And a flexible hypothesis space

would allow you to fit the data that you have. But the problem is that you would almost certainly overfit if you um do not have a bias

towards one solution over the other. But on the other hand, if you have an inductive bias, you would solve this overfitting problem, but instead you

wouldn't be able to model all of the details of reality. Um and so the middle ground is

to have a very expressive hypothesis space, but also have a bias towards solutions that might generalize. For example, in the PAC-Bayes framework, we

might want to bias towards more compressible models if we can. And so we see that uh deep learning so-called mysteries are actually consistent and

partially explained by existing theories such as soft inductive biases and PAC-Bayes. And sort of the

thing I want to leave you with is that um if we can find the right inductive biases building on these theories, we might be able to optimize for them

as well. And by the no free lunch theorem, the only way that we get improvements in learning efficiency is through inductive biases. So I think that

working on this problem is a really good bet to make. Given the massive sample

efficiency gap between AI and humans, we might actually see massive gains in capability if we work on this problem. Um and so yeah, that's what I

want to leave you with. Short presentation. >> Okay. Um so for this last paper, then after this we have some boba for everyone. So sit tight 15 minutes. Um

this is an idea that you know I've been obsessed with. Back to the sample efficiency thing. I think

that like the two major problems we have left really to solve in AI is intelligence per watt um and intelligence per sample. And if you compare that

to where we're at today compared to humans, um I would say that we're still an order or two of magnitude off on intelligence per watt. Uh and we're

like orders of magnitude off on intelligence per sample. I don't know what percent of the internet

that you guys have read, but I have not read the entire internet. In Chris Ré's lab in particular, we've been obsessed with this idea that um if I

have uh a fixed-size amount of data and I have infinite compute, just go nuts, how much generalization can I actually achieve? And so this

is exactly uh the paper that starts to answer that question. And I'm really excited to uh introduce uh

Konwoo.

Uh hi, I'm Konwoo. Um this is a paper that I co-led with my amazing collaborator Suhas as well as Percy and Tatsu. So part of the motivation for this

paper is just the fact that over the past uh six or seven years pre-training has continued to improve model capabilities in pretty surprising ways. So

in 2020 with GPT-3 we had sort of the emergence of in-context learning. In 2022 with Anthropic's RLHF,

we had sort of the advent of alignment. And maybe most notably in 2024 with both o1 from OpenAI and then later DeepSeek-R1, we had the emergence of

reasoning. And in fact, even still today, we see that with these newer and bigger pre-training runs like Mythos and 5.5, the models just continue to

get better. And so because pre-training is very expensive, a lot of the focus on the research side of

things has been on how do we improve compute efficiency. And in general, people have found that to improve compute efficiency, you need to scale both

the number of parameters in your model and the number of data points that you train your model on. And so these were quantified with the so-called

Chinchilla scaling laws. The problem with compute efficiency is that we're soon going to be constrained

by data. And so if you look at these sort of public projections of the rate of growth of internet data, they suggest that the amount of sort of human

generated text on the internet grows by roughly 3% per year. And the amount of compute that we're spending on pre-training is growing by roughly four

or 5x per year. And so what this suggests is that as time passes on, the amount of compute that

we're willing to spend per data point is going to continue to increase by roughly 4x year-over-year. And so this sort of motivates the core question

in this paper which is how should you approach pre-training when you're constrained by data but totally unconstrained by compute. And it's worth maybe

spending a few seconds to think for yourself if you haven't already seen this paper like what would

you do in this situation. This is a very different algorithmic regime from sort of the compute-efficient pre-training world that we've sort of lived

in for sort of most of uh modern time. And it's also worth noting that this question is not that different from how machine learning worked before the

modern LLM. So for things like classical statistics where maybe you really care about your rates

with respect to the number of points of data you have and you don't care about compute, or even older benchmarks like MNIST and Penn Treebank where

you're sort of implicitly data constrained because the benchmarks don't have that many data points. And so sort of the core contribution that I'll

explain in this paper is that we bring the modern toolkit of scaling laws to sort of answer this problem.

And so what we'll show is that we'll propose a few different scaling recipes and we'll sort of chase scaling recipes that monotonically decrease your

i.i.d. validation loss, so sort of in-distribution generalization, and we'll show that these scaling laws have a really clean functional form and they

follow a super clean power law. And when you're able to fit these power laws, what you can do is you

can estimate the best possible loss of your recipe by looking at the asymptote of the power law. And this is in some sense a quantification of your best

possible performance under infinite compute. And our goal in this paper is sort of to think more carefully about what types of algorithms allow you to

lower your compute asymptote. Uh and we're sort of going to chase these types of infinite compute

wins. And so to start, I'm going to introduce this canonical setting that we referenced in this paper, which is that we're going to simulate a data

constrained world by just constraining the number of pre-training tokens we have to be a very small amount. So we're going to assume access to only

200 million tokens from DCLM, which is general web data. And what we're going to do is we're going to

pre-train larger and larger models, which is the x-axis, using different kinds of pre-training recipes. And the y-axis here is going to be again our i.i.d.

validation loss on DCLM. And our goal is going to be to find recipes that allow us to spend more compute and train larger models while

monotonically decreasing our loss. So to start, we can consider sort of the obvious approach that you might

take when you're in this setting, which is first to epoch your data. So to train on the same data points over and over again until you start

overfitting as well as scaling up your model. So making your model larger and larger. And what we can do is we can do both of these at the same time.

And we can do sort of an exhaustive grid search over these parameters until we start

overfitting and then we do early stopping. And this is sort of the red line which is what we call the standard recipe. And what you'll see with the

standard recipe is that even if you are willing to spend more compute, as you train more and more overparameterized models, you start to overfit more

quickly and your loss starts to increase after a certain point. And so if you see this line, sort of

the natural instinct you should have is how do we fix this? And one possible approach is to do really aggressive regularization. And so sort of the

first baseline in this paper is going to be doing really aggressive regularization by cranking up your weight decay. And so what we do is we show that

if you optimally tune your weight decay for each total parameter count. So we're going to optimally

tune learning rate, weight decay, and epoch count for each one of these purple points. You can show that your loss follows a really clean power law as

you increase the number of parameters in your model. And this is really aggressive regularization. So for context, we use weight decays that are

something like 30 times larger than the weight decays that people do for compute optimal pre-training.
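To make the "fit a power law and read off its asymptote" step concrete, here is a small sketch that fits loss ≈ A / N^alpha + E, where E is the asymptote, meaning the loss the recipe would approach with unlimited parameters. The functional form is a common choice assumed here, and the fitting method (scan alpha on a grid, solve A and E in closed form) is an illustrative shortcut. The data points are synthetic: the model sizes and the constant 30.0 are made up, and 3.43 is borrowed from the talk only as a target for the synthetic curve.

```python
def fit_power_law(sizes, losses, alphas):
    """Fit loss ~ A / size**alpha + E by scanning alpha and solving A, E by least squares."""
    n = len(sizes)
    mean_y = sum(losses) / n
    best = None
    for alpha in alphas:
        xs = [size ** -alpha for size in sizes]
        mean_x = sum(xs) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
        A = cov_xy / var_x
        E = mean_y - A * mean_x
        sse = sum((A * x + E - y) ** 2 for x, y in zip(xs, losses))
        if best is None or sse < best[0]:
            best = (sse, A, alpha, E)
    _, A, alpha, E = best
    return A, alpha, E


# Synthetic losses for hypothetical model sizes (in millions of parameters).
sizes = [150.0, 300.0, 600.0, 1400.0]
losses = [3.43 + 30.0 / size for size in sizes]

alphas = [i / 100 for i in range(10, 301)]  # candidate exponents 0.10 .. 3.00
A, alpha, asymptote = fit_power_law(sizes, losses, alphas)
```

The fitted `asymptote` is the number being compared across recipes: a recipe with a lower asymptote is better in the infinite-compute limit even if it looks worse at small scale.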

And so on the legend here, you can see the sort of the form of this power law. And it has a few nice properties. One is that the exponent on the model

parameters N is one. And this is actually predicted by sort of the data-constrained theory. The second nice property that it has is that the scaling

law has an asymptote which is 3.43 in this case. And this characterizes the performance of the best

possible regularized model in this setting if you had like infinite compute. So you'll notice that the baseline approaches, because they overfit more

quickly, don't even have a measurable asymptote. And so once we start going down the rabbit hole of regularization and these other types of

classical machine learning techniques, there's a whole basket of techniques to get into. And so perhaps

maybe the most famous one is to do ensembling. And so what we show in this paper is that you can bring back ensembling in the modern world of

pre-training language models and they turn out to be incredibly data efficient. So what these light blue points correspond to is they correspond to

300 million parameter models that we're ensembling with more and more members. So the fifth point will

correspond to 1.5 billion total parameters, which is a five-ensemble of 300 million parameter models. We show that you can also fit really clean

scaling laws to ensembles. So you also get a power law that has exponent one in the number of ensemble members, and it also has an asymptote. But most

importantly the asymptote of ensembling is much lower than the asymptote of the regularized recipe. So it's

giving you a true data efficiency win if you had an infinite amount of compute. There's also this interesting property which is that ensembles, if

you do a compute-matched comparison, so the same number of parameters, are actually better than the regularized recipe. So if your goal is just to train

the best 1.5 billion parameter model it's better to train an ensemble of a bunch of small models when

you're data constrained than to train one really large model. The last thing we show in this plot is that you can actually compose the benefits of

regularization and ensembling. So one way to think about this is that regularization gives you this ability to continue to make the models larger and

larger while ensembling introduces this new axis for scaling compute which is by training more and more

models. And so with this gold line, which we call the joint scaling recipe, we quantify this hypothetical performance if we were able to train

an infinitely large ensemble of infinitely large models. And so the way in which we actually quantify this performance is we fit two scaling

laws. So we'll take a double limit. What we'll first do is we'll train ensembles of 150 million

parameter models, 300 million parameter models and so on and so forth. And then we'll look at the asymptotes of the ensembles. And then we'll

fit a second scaling law to the asymptotes of these ensembles. So essentially the first limit is taking the limit over K.

And the second limit is taking the limit over n. And what we find is that if you're willing to sort

of go through the effort of training infinitely large models and infinitely many ensembles, uh you get a huge loss improvement. And so all of these

experiments are sort of in this toy data constrained setup of 200 million tokens. And obviously this is very different from sort of the standard

regime of pre-training. So what we also do in this paper is we spend some effort on trying to confirm that

our recipes scale. So the first way in which we do this is that we build data scaling laws. So what data scaling laws are is that we repeat the exact

same set of experiments from the previous slide at four different pre-training token counts up to 1.7 billion uh tokens. And so for each slice on the

x-axis at each seed token count, we're going to quantify the best possible performance of each

recipe if we had an infinite amount of compute. So for the red points, they overfit more quickly. So these will be actual models. While for the purple

and the gold points, these will correspond to sort of a single limit or a double limit. What these data scaling laws let us do is they let us quantify

the data efficiency numbers of our approaches. So one way in which we do this is if we have some

new recipe that we believe should improve upon the standard recipe that we're using right now, you can take the loss of your new recipe and you can

project it onto the data scaling law, so the red line of the standard recipe, and this projection lets you measure essentially the effective number of

extra tokens that your algorithmic improvement is buying you. So in this case what we see is

that this joint scaling recipe gives you roughly a 5x data efficiency win over the standard recipe. It's also worth noting that these data

efficiency wins are something that we can realize with sort of finite models not just double limits. So for example if you're willing to train a five

ensemble of 1 billion parameter models this will give you roughly a 3.7x data efficiency win. The other

interesting aspect about these data scaling laws is if you look at the functional form in the legend, you'll see that they all have really similar

exponents and they all have very similar asymptotes. And so the reason why this matters is this suggests that even if you repeated these experiments at

a much larger token scale, if you believe that these data scaling laws extrapolate, this data

efficiency win is going to be constant over the actual number of tokens that you have. So they suggest that this joint scaling

recipe has a 5x data efficiency win even if you are willing to send the seed token count to like 10 trillion tokens or whatever people are doing

pre-training at these days. So now I'll go over some methods to sort of make this data efficiency win perhaps

slightly more practical. And so even though these recipes require a lot of training compute we also show that you can reduce the amount of inference

compute you need by using distillation. So the plot on the right here, the purple line corresponds to the same regularized recipe. The light blue

points correspond to the same ensemble scaling. So we first show that what you can do is you can take an

eight ensemble which is roughly 2.4 billion total parameters and you can distill it into a single dense 300 million parameter model which is the pink

star in the bottom. And you can do this while retaining roughly 83% of the loss improvement. So this shows you that data efficiency is not something

that you need a large amount of inference compute for. If you're willing to amortize the test

time compute during training time, you can get an extremely data efficient model that's still very small. The other surprising result we show in this

section is that you can do self-distillation to even improve your loss. So with self-distillation, what we're doing is we're starting with the 300

million parameter model at the start of the light blue curve and then we're distilling this model into

a fresh 300 million parameter model which is the green star. And what we find, very surprisingly, is that even doing self-distillation gives you a huge loss

improvement. It even beats the asymptote of the regularized recipe. This is actually pretty counterintuitive and we have a longer sort of description

of this result in the paper but it turns out to have pretty surprising connections to ensembling

and there's actually a view from prior work of self-distillation as implicitly training a two ensemble. We also show that even though we're

only chasing IID val loss in all of our experiments, pretty much all of the trends in this paper directly work on downstream benchmarks. And this is

like a fully held out sort of test set where we only looked at the benchmarks at the very end of the

paper because the advisers told us to. And you can see that everything tracks: the standard recipe overfits, model scaling still gives you

improvements, ensembling is even better, and you can still retain a lot of the benefits through distillation. And finally, we also show that you can

do this for other settings beyond pre-training. So things like continued pre-training. So we consider a setup

where you're trying to CPT a 3B model and we assume access to sort of this restricted set of 4 billion math related tokens where the whole corpus of

data is actually 73 billion tokens. And what we show is that if you're willing to do these data efficiency tricks like aggressive epoing and things

like ensembling, you can match the performance of training on the full 73 billion tokens even using

only 4 billion tokens which is roughly a 17x data efficiency win. So to sort of wrap up this talk, maybe the main point I want to make is that when

you're constrained by data and you're unconstrained by compute, in this sort of new algorithmic regime, the types of algorithmic choices you make

matter a lot and we should be willing to sort of rethink every aspect of the stack. In this paper, we mostly

do this by revisiting a lot of these classical ideas from machine learning and deep learning. Things like regularization, ensembling, distillation

have existed for many years. And we also introduce this evaluative tool of asymptotes. And maybe the hope is that if you're willing to chase

algorithms that have lower compute asymptotes, these will give you like better ideas for data efficiency.
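
As a rough illustration of the asymptote idea and the double limit described earlier in the talk, the sketch below fits a power law with an offset, first over ensemble size for each model size and then over model size using the fitted asymptotes. This is only a sketch under stated assumptions: the functional form `asymptote + a / x**alpha`, the reading of K as ensemble size and N as model size, the grid of exponents, the 600M and 1400M model sizes, and the synthetic loss surface are all made up for the example and are not taken from the paper.

```python
# Illustrative sketch: the functional form, the alpha grid, and every number
# below are assumptions made for this example, not values from the paper.

def fit_asymptote(xs, ys, alphas):
    """Fit y ~ asymptote + a / x**alpha.

    alpha is chosen by grid search; for each alpha, (asymptote, a) come from
    ordinary least squares. Returns (asymptote, a, alpha) for the best fit.
    """
    n = len(xs)
    y_mean = sum(ys) / n
    best = None
    for alpha in alphas:
        feats = [x ** -alpha for x in xs]
        f_mean = sum(feats) / n
        var = sum((f - f_mean) ** 2 for f in feats)
        cov = sum((f - f_mean) * (y - y_mean) for f, y in zip(feats, ys))
        a = cov / var
        asymptote = y_mean - a * f_mean
        err = sum((asymptote + a * f - y) ** 2 for f, y in zip(feats, ys))
        if best is None or err < best[0]:
            best = (err, asymptote, a, alpha)
    return best[1], best[2], best[3]


def synthetic_loss(n_millions, k):
    """Made-up loss surface, used only to exercise the fitting code."""
    return 3.0 + 4.0 / n_millions ** 0.5 + 0.2 / k


ALPHAS = [i / 10 for i in range(1, 21)]   # candidate exponents 0.1 .. 2.0
MODEL_SIZES = [150, 300, 600, 1400]       # parameters in millions (N), hypothetical
ENSEMBLE_SIZES = [1, 2, 3, 4, 5, 8]       # members per ensemble (K), hypothetical

# First limit: for each model size, extrapolate K -> infinity.
ensemble_asymptotes = []
for n_millions in MODEL_SIZES:
    losses = [synthetic_loss(n_millions, k) for k in ENSEMBLE_SIZES]
    asymptote, _, _ = fit_asymptote(ENSEMBLE_SIZES, losses, ALPHAS)
    ensemble_asymptotes.append(asymptote)

# Second limit: extrapolate those asymptotes as N -> infinity.
double_limit, _, _ = fit_asymptote(MODEL_SIZES, ensemble_asymptotes, ALPHAS)
print(f"estimated double-limit loss: {double_limit:.4f}")
```

With real measurements, the two calls to `fit_asymptote` would take observed losses instead of `synthetic_loss`, and a proper nonlinear fit would likely replace the coarse grid over `alpha`.
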

But like ultimately what we really want to do is we want these asymptotes to help us develop new and better ideas under infinite compute that don't

already exist. And so if you're interested in the details, that's a QR code for the paper. And we've also done some follow-up work on looking at how

synthetic data interacts with data efficiency. So feel free to check that out as well if you're

interested. Thanks.
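
The data efficiency measurement described in the talk, projecting a new recipe's loss onto the standard recipe's data scaling law, can be sketched as follows. This is a minimal illustration, assuming the data scaling law has the form `E + B / D**beta`; the constants `E`, `B`, `beta` and the function names are hypothetical and not taken from the paper. It also checks the point made about similar exponents and asymptotes: if two recipes share `E` and `beta`, the effective-data multiplier does not depend on the token count.

```python
# Illustrative sketch: the scaling-law form and all constants are assumptions.

def scaling_law_loss(tokens, E, B, beta):
    """Loss predicted by a data scaling law of the assumed form E + B / D**beta."""
    return E + B / tokens ** beta


def effective_tokens(loss, E, B, beta):
    """Invert the standard recipe's law: how many tokens would it need to reach `loss`?"""
    if loss <= E:
        raise ValueError("loss is at or below the asymptote; no finite token count matches it")
    return (B / (loss - E)) ** (1.0 / beta)


def data_efficiency(new_loss, tokens, E, B, beta):
    """Effective-data multiplier of a recipe that reaches `new_loss` using `tokens` tokens."""
    return effective_tokens(new_loss, E, B, beta) / tokens


# Hypothetical standard-recipe law.
E, B, BETA = 3.0, 50.0, 0.3

# Hypothetical improved recipe: same asymptote and exponent, smaller coefficient,
# chosen here so that the multiplier comes out to 5.
B_NEW = B / 5.0 ** BETA

for tokens in (200e6, 1.7e9, 10e12):
    new_loss = scaling_law_loss(tokens, E, B_NEW, BETA)
    multiplier = data_efficiency(new_loss, tokens, E, B, BETA)
    print(f"{tokens:.1e} tokens -> data efficiency multiplier {multiplier:.2f}")
```

When the two laws share `E` and `beta`, the multiplier reduces to `(B / B_NEW) ** (1 / beta)`, which is why it stays fixed as the token count grows in this sketch.
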

All right. Thank you guys so much for coming. This is like a dream come true. I'm in one of my favorite places, one of the most important places of

my life and now I get to talk about AI here. So super fun. I think there's a lot of potential for this club. I think I don't have nearly, you know, 1%

of all the ideas to make this club really great that are probably in all of your heads. And

so we want to make sure all of you guys get in on the Slack. So I'll make sure that you know, please send me a note if you're not already on there.

And then we can kind of make this thing whatever we want. So it's kind of fun and I intend to. So like please come with ideas. We want to make this

super fun. Obviously, you know, there's some ground rules, be respectful, all that kind of stuff.

And definitely be involved. And that's kind of the biggest thing that we really ask. That's all I got. That's a wrap. Go get some boba tea. Thank you.
