Diffusion Gemma First Look & Demo – A BIG Step for Local AI Models! Transcript

In some ways, this one's a little better because the car is actually not sucked into the ground. Google has released a very exciting open-source model

called Diffusion Gemma. Now, aside from being Apache 2.0 licensed, this is something that really may represent one of the future directions of AI that

can run quickly and efficiently on basically your at-home gaming PC. So, for today's video, we're

going to be taking a look at this with some technical things thrown in because I personally have a difficult time conceptualizing sometimes like the

diffusion models and how they work. I find it can be difficult to grasp with some of the examples that do exist. So, we're going to get started by

just taking a quick peek at some of the interesting things of note about this model. Please do feel free

to subscribe as I do want that 100K plaque. And let's take a peek at Diffusion Gemma. So, Diffusion Gemma, as they say right here, is a 26 billion

parameter mixture-of-experts model with 4 billion active. So, comparable to Gemma 4 26B-A4B, at least in terms of some of the benchmarks we'll be

seeing here. And we will also run some side-by-side comparisons, not only to get a feel for the speed

differential with this model, but also the quality of the output, as that is a consideration, especially with an experimental model like

this. Now, they do mention right here, and this specific benchmark was performed on an H100 GPU, which is still not something a consumer is going to

have in their home. But we can see right here that the speed for the diffusion Gemma model versus the

regular Gemma 4 26B with MTP, which just means that it's going to be a little faster in generating tokens. This is significantly faster. And

considering that increase in speed, the drop-off in intelligence is small compared to the massive boost in speed that one gets. So this is very

exciting for local AI. And basically the next section is going to give us specific insight into why. The

thing that is important here for folks who are interested in local AI, which is very popular right now: they basically say most language

models act like a typewriter generating one token at a time from left to right. The gist of this paragraph is essentially that works very well for big

data centers where models are being served to a large number of users at one time. So the requests can

be batched together and that efficiently utilizes those GPUs as they are set up in a data center to serve many different users. A home user like you

or I, who is just playing with this on our own system, has basically different bounds on what their GPU is able to do. So a model like that, working

traditionally as an autoregressive LLM, is not going to take full advantage of our GPU efficiently; there's

going to be a lot of waiting happening as it generates that one token at a time. This diffusion model essentially inverts that because it keeps the

GPU busy while it's generating those batches of 256 tokens at a time. So it does more work in that same amount of time, more properly utilizing the

GPU's computational ability, I guess could be said. Diffusion Gemma utilizes your hardware to its full

potential. The processor is doing a larger chunk of work at once, and it upgrades your model inference from a single sequential

typewriter to a massive printing press that stamps an entire block of text simultaneously. TL;DR: a home user with, say, a 5090 will have their GPU more

efficiently utilized for this model. Therefore, it runs faster. It's a better fit for a single

user setup. So, there is a section here on how text diffusion works and they pretty simply list it in three steps. One, the canvas. The model starts

with a canvas of random placeholder tokens. Two, the model makes multiple passes. This is iterative refinement, locking in correct tokens and using

them as context clues to refine the rest. And then finally, the text converges into high-quality output.
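To make those three steps a bit more concrete, here is a minimal Python sketch of the loop. It is only a toy under stated assumptions, not Google's implementation: the `MASK` placeholder, the `denoise` function, the 0.9 confidence threshold, and the hand-written `toy_predict` "model" (which simply becomes more confident as more of the canvas is locked in) are all hypothetical names and values made up for illustration.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical placeholder token for an undecided position

# A predictor looks at the whole canvas and one position, and returns
# (proposed token, confidence between 0 and 1) for that position.
Predictor = Callable[[List[str], int], Tuple[str, float]]


def denoise(canvas: List[str], predict: Predictor,
            threshold: float = 0.9, max_steps: int = 32) -> Tuple[List[str], int]:
    """Refine the whole canvas in parallel passes, locking in confident tokens."""
    canvas = list(canvas)
    passes = 0
    while MASK in canvas and passes < max_steps:
        passes += 1
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        # Every undecided position is predicted from the same canvas state.
        preds = {i: predict(canvas, i) for i in masked}
        keep = [i for i in masked if preds[i][1] >= threshold]
        if not keep:
            # Nothing cleared the bar: commit the single best guess so we progress.
            keep = [max(masked, key=lambda i: preds[i][1])]
        for i in keep:
            canvas[i] = preds[i][0]  # locked in; everything else stays masked
    return canvas, passes


# Toy stand-in for a trained model (an assumption, purely for illustration).
TARGET = ["the", "cat", "sat", "on", "the", "mat"]
BASE_CONFIDENCE = [0.95, 0.60, 0.92, 0.70, 0.50, 0.91]


def toy_predict(canvas: List[str], i: int) -> Tuple[str, float]:
    locked = sum(tok != MASK for tok in canvas)
    # More locked-in context means more confidence about the remaining positions.
    confidence = min(1.0, BASE_CONFIDENCE[i] + 0.5 * locked / len(canvas))
    return TARGET[i], confidence


result, passes = denoise([MASK] * len(TARGET), toy_predict)
print(result, passes)
```

In this toy run the six tokens are filled in over four passes rather than six one-at-a-time steps, because several positions clear the threshold in the same pass. A real model predicts every position with one large forward pass per step, which is where the GPU gets kept busy.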

Now, I personally find this graph a little difficult to conceptualize. So, I have a different example here that I'd like to kind of map these

three steps to. This was made with the help of Claude Fable. Um, you know, rest in peace. But basically, this is a 3D model of a keyboard. And the

whole thing that we're going to see here is these blank keys are essentially what we see in step one

where the model is going to start with a canvas of random placeholder tokens. Assume any blank key here is the same as those step-one

blank tokens. Now the task here is basically that, step by step as this denoises, it's going to converge towards one of four keyboard layouts. This is actually

using a real diffusion model that was trained on this laptop and it has awareness of four real

keyboard layouts: QWERTY, Dvorak, Colemak, and AZERTY. So, as this starts to refine its predictions and actually denoise, it's going to lock in

keys which will lead us to seeing step two right here where the model makes multiple passes locking in correct tokens and using them as context clues

to refine the rest. So when we start to see keys here that are locked in, those will then be used in

the next denoising step to refine the rest or, in this specific example, point towards one of these specific keyboard layouts because it knows, oh, if Q

is right here then this is very likely going to be QWERTY or something of the sort. Now I understand it can still be a little difficult for this to

make sense. So let's just start by manually initiating the first step where this is actually going to

be what we see right here where we have our noisy canvas and it is going to perform a denoising step. Now keep in mind them being blank right here is

not a one-to-one example of how this is showing it right here, but more or less this is just designed to be a visual thing. So okay, that's where we

start with the noisy canvas and now it is going to predict a few and then it is going to remask these.

So what we just saw there is it made a first pass and prediction and these are predictions of keys that it was confident enough in to actually save.

So these are not getting remasked which is just basically when these go back to blank from where they were. Now in step two they say that the model is

locking in correct tokens and using them as context clues to refine the rest. So B and H being right

here in our example, we see H has eliminated the Dvorak keyboard option. Now, this is a simplified example because there are only four potential correct

options. In reality, this is a gigantic neural network with 262,000 potential, like, vocabulary pieces or something. The point I'm trying to make is this

is a much simpler explanation, but the process is the same. So, we're going to run another

denoising step right here. And okay, it has predicted another key. And if we hover over these keys that have been remasked, we're actually seeing

potential probability distributions of what this key may be, depending on the canvas as it sits right now. So that really ties back into step two

right here where it iteratively refines, if we go back right here, multiple passes, locking in correct

tokens and using them as context clues to refine the rest. So these are all being used as context clues, which can be seen by the fact that the actual

probability distributions for what any of these specific keys could be will change as we run additional steps. So I think this one let's find one

that's like not 100% certain and that would be this position 37. So let's run one other step and see how

our probability distribution changes with some more tokens locked in. Okay, I do believe those two got inverted in terms of their percentage, but I

don't actually 100% remember. So, okay. So, things are changing a bit. And as we run steps, we'll see more keys are actually getting filled in. And

these would just essentially be locked in as the denoising steps proceed. Okay, that still seems to be

going. And it's still fairly unsure about this one. I've played with this, so I know this key is one that generally does not get confidently

generated until much later in the process. But we can see that the percentages are changing slightly as to what this is expecting it to be. And we'll

just keep running steps right here. And the elimination of potential keyboard layouts that we see

here is a function of this specific demonstration that simplifies kind of what happens in the actual diffusion process. But just for a simple visual

demo, I find it can be helpful to see like, oh, okay, so that token being that can actually allow this to eliminate something. So now it's swapped

back. So now it's more likely that M will appear there. Whereas these had kind of flip-flopped a bit and

then finally it got to a point where it knew enough tokens to know definitively that okay, the actual layout is QWERTY. Again, that's a

simplification of what happens. Another caveat of this demo is that it basically just starts completely randomly, like a roll of the dice. In reality,

when speaking with something like Diffusion Gemma here, the prompt that we actually send it will heavily influence the result,

like it won't just randomly start picking letters here. It will have some context as to where to I guess begin for lack of a better term. So really

this is something I just wanted to show because I found that it actually helped me personally to better conceptually understand what's going on in the

diffusion process. Keep in mind I'm not a Google DeepMind scientist and I whipped this together with

a now-deceased model, Claude Fable. So, I mean, I will put this on GitHub, with the link in the description, but be sure to, like, run it through an LLM if

you're confused about anything. Though if we do auto right here, this is actually a trained diffusion model that's running locally on the system. A

potato computer will be able to train the model as the example will be on GitHub. So no worries about that.
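For anyone who wants the layout-elimination idea from the keyboard demo in code form, here is a small Python sketch. To be clear, this is not the demo's code and there is no trained model in it: it is a hand-written lookup over simplified 30-key blocks (three rows of ten keys) for the four layouts, and the names `LAYOUTS`, `consistent_layouts`, and `key_distribution` are hypothetical. It only shows how locked-in keys act as context clues that change the probabilities for the keys that are still blank.

```python
from collections import Counter
from typing import Dict, List

# Simplified 30-key blocks (top, home, bottom row), an assumption for illustration.
LAYOUTS: Dict[str, str] = {
    "QWERTY":  "qwertyuiop" "asdfghjkl;" "zxcvbnm,./",
    "Dvorak":  "',.pyfgcrl" "aoeuidhtns" ";qjkxbmwvz",
    "Colemak": "qwfpgjluy;" "arstdhneio" "zxcvbkm,./",
    "AZERTY":  "azertyuiop" "qsdfghjklm" "wxcvbn,;:!",
}


def consistent_layouts(locked: Dict[int, str]) -> List[str]:
    """Layouts that agree with every key locked in so far."""
    return [name for name, keys in LAYOUTS.items()
            if all(keys[pos] == ch for pos, ch in locked.items())]


def key_distribution(position: int, locked: Dict[int, str]) -> Dict[str, float]:
    """Probability of each character at a blank position, given the locked keys."""
    remaining = consistent_layouts(locked)
    if not remaining:
        return {}
    counts = Counter(LAYOUTS[name][position] for name in remaining)
    return {ch: n / len(remaining) for ch, n in counts.items()}


# Lock in H on the home row (index 15): only Dvorak disagrees.
locked = {15: "h"}
after_h = consistent_layouts(locked)
dist_before = key_distribution(0, locked)

# Lock in W at index 1 as well: AZERTY has Z there, so it drops out too.
locked[1] = "w"
after_w = consistent_layouts(locked)
dist_after = key_distribution(0, locked)

print(after_h, dist_before)
print(after_w, dist_after)
```

With only H locked in, the first key is still split between Q and A; once W is locked in as well, the only remaining layouts both put Q there, so the distribution for that blank key collapses. That is the same kind of shift in the hover percentages seen in the demo, just with a lookup table standing in for the model.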

And then if you click auto you get to see the process happen just in real time. I was manually clicking step by step just so we could get a better

look at some of these things. And finally, before we get to playing with this, this is just the actual model on Hugging Face here. It is Apache 2.0,

which is awesome. And there are a bunch of different quantizations as well. For Mac systems, there

are MLX, and I do believe this is now supported in Unsloth Studio. When I first started playing with this, it wasn't. So, there was kind of a patch to

llama.cpp that allowed me to use this. Therefore, the speeds that we're going to see here are not going to be anywhere close to what's listed in that

specific announcement post where it's doing like 1100 tokens per second. And that's because I'm not

necessarily running this in the most GPU optimized way. But the whole point of it is just to see that there is a speed up even without any like real

optimization for how this should be served. In addition to that, we're going to do side-by-side tests just of how the results compare to its non-diffusion

sibling. So to begin, we're just going to run some side-by-side speed comparisons with the same

context length on a 5090 mobile laptop. So this laptop has 24 gigs of video RAM. I am just running this at a low context length of 48 tokens. I

have thinking disabled and we're basically going to send it something simple just to get a feel for the speed. And what we're going to see right here,

this is just a cool visual trick. This is not actually visualization of the diffusion process. Unsloth

does have a command-line thing that you can use if you go and click on their documentation right here in the run Diffusion Gemma guide. There is

actually a command line way where you can see the diffusion process happen in real time. This just is simplified just to make it look pretty. Okay, so

we see right here that the total speed was 93.66 tokens per second, which really is not great, but

this is a 5090 laptop system, and this is a 4-bit quantization of a 26 billion parameter mixture-of-experts model. Now, we're going to go into LM

Studio instead. I now have the non-diffusion version of this model in LM Studio here. This is the Gemma 26BA, and this is the QAT version, which just

means that it hypothetically performs better even at this 4-bit quant than the non-QAT version. It's

called quantization-aware training, and it makes it perform better when it's quantized versus just the standard model being quantized down to 4-bit. I

know that may seem confusing. It's not super important right now, but just know that this will have good quality comparatively. So, I'm giving this

the same exact prompt. We have the same exact low context length set and thinking is also disabled

here. So, we're going to see immediately the token speed that we get and we'll be able to compare them side by side. And right there we see that we

got 61.3 tokens per second in very similar testing environments with the same prompt sent to it and the same total context length set. So the

differential right there just between the two was 94 tokens per second versus 61 tokens per second. And this

is not in an optimized way. I'm not serving this through vLLM or using like a special version that would accelerate it more on this specific Nvidia

card because I'm just more focused on the display of the diffusion model I suppose here. However, let's just do something else now. So, I'm just

instructing this to write a few paragraphs about the iMac G3. Okay. And this did actually think. So, that

may have messed up our first initial test. I don't know why the thinking toggle button isn't working here. Um, Fable must have messed that up. And we

see, okay, we got a few paragraphs as well as thinking at a speed of 114.2 tokens per second. Now, it is very possible that there was an unfair result

run here because we didn't actually run the LM Studio version with thinking enabled. So, let's fix

that. All right. So, now we're back in LM Studio and I have this time ensured that thinking is toggled on here as well, as I missed it the first time.

I do apologize for my incompetence there. And we've just asked it to write a few paragraphs about the iMac G3. And we see that was 57 tokens per

second versus what we got online for the same exact prompt, the same context length, which was 114 tokens

per second. So that is quite a speed difference. I understand this is not a scientifically accurate test setup, but more or less it gives us some

understanding of okay, on the same exact system, both with 4-bit quantized models, this one is running significantly faster. And that is what is very

cool about the whole idea of the text diffusion model, specifically for local AI: it's going

to properly utilize the GPU in the way that a single user is going to get more benefit out of it. And really what was said in the announcement post

there where they said that an autoregressive LLM like what we see and what we're using in LM Studio is better suited for a data center because it's

not as fast right here. But if there's a thousand of me speaking to this right now, the architecture

works more efficiently for a data center to be serving this to a bunch of different people with batched requests and things like that. But a single

person at home is going to get more out of the diffusion model in terms of speed. Now, let's talk a little about some of the downsides of this. And

this is going to bring me into a next test where I'm going to need a system that's going to be able to

handle a significantly longer bit of context than this laptop right here. So, I'm going to use my RTX 6000 Pro Blackwell card right now. And that is

on the beige box behind me. And it's running here. It's the same exact model, the same exact quantization, the same GGUF, everything. The only thing

is I'll be able to really extend the context length here. So, we can do some true side-by-side testing

in terms of the result quality. So, I'm going to be giving this a simpler version of the tried-and-true browser OS test. I have begun this just using

the web interface that I have hooked up to the system behind me with a longer context length as well as thinking enabled and I'm also going to just

begin it from LM Studio which is actually running locally on this system. So from this point on any

speed differential we notice is not at all comparable because one is a laptop and one is a desktop with a big beefy card, though we're only focused on

comparing quality of the results right now, as that's important as well, though this is experimental and they do specifically mention that. All right.

So I now have two browser OS results. One created by the regular Gemma 4 26B-A4B 4-bit quant, QAT version.

So a good 4-bit quant of the normal version. So that is right here and we'll open that and take a look at it. Okay, there is no right click. I'm not

going to spend a lot of time going through these, but just to give ourselves a sense of quality comparability between some random stuff. We have a

clock that shows the correct time. We have a start menu. Okay. And it just tells us like there is no start

menu, but all right. A decent notepad that opens when you click once. A decent calculator. 55 * 6, 330. Good. And then a somewhat functional terminal

actually. Okay. So, this is overall not a bad result for a model of this size, especially that heavily quantized. Now let's take a peek at what we

received from the diffusion version of this model. Okay, we have a gradient background. There still is

no right click, but there is a clock with the correct time in our locale. There is also a start menu which just doesn't do anything. So, okay, we have a

notepad. Good. Very similar in terms of the notepad. Our calculator much simpler and like the aesthetic. 54 * 30. Where's the equal sign? Okay, so

that's perhaps like a good demonstration of some of the differences. Okay, so 54 * 3 would be 162, but

that's okay. So just a good thing there. And then a browser, which is very interesting. Okay, I didn't expect this to whip out a functional browser

from the 4-bit diffusion version, but nonetheless, it did. And it actually led us to Wikipedia, which would work in this because it's not going to

block embedding. So impressive, but just a good simple demonstration of some of the quality differential

in a, like, non-controlled environment. So let's do another one. So the next one is going to be: in a single HTML file, make me a 3D driving game. And it

is also running online here with the 6000 Pro. All right, let's take a look at our 3D driving games now. First, we'll start with

the normal model now. So this is the one that was run through LM Studio, the non-diffusion model.

Okay, you know what? Pretty solid. We even have some obstacles. Can we lose? Interesting. So, if we hit those, sorry. Sometimes I get like sucked into

actually playing these games. I have to remember this isn't like a normal model test, but really this is not bad for the quantization and the model.

Very, very solid output. Now, let's take a peek at what we got from the diffusion model. And it's

you see the relation between the two in how similar the actual yellow obstacles they put in are. I would almost go out on a limb and say in some ways

this one's a little better because the car is actually not sucked into the ground. But the entire point of this is just to showcase that for the speed

increase we get with the diffusion model, the loss in intelligence is not huge compared to the

gain in speed. And depending on the type of task you're going to do, that may be a very acceptable trade-off. This is dare I say good. And let's just

do one more, maybe like a beautiful static front end for something. We'll go back to the throwback test that I liked to run very often. So, we'll do

the classic Steve's PC repair website generation and we'll also go and run that from within LM

Studio. Again, I would like to reiterate that everything we're doing now is purely a quality based comparison because the LM Studio version right here

is running locally on this laptop and then the version that's running on the web right here is running on the big desktop behind me. So, there's no

comparison of speed here, only quality. And finally, let's take a look at the comparison between our

Steve's PC repair websites. I think this time we'll just stick to the normal. We'll start with the non-diffusion version. Okay, this looks quite good.

Everything here is arranged nicely. We do have some slight hover effects on these cards. The icons it's chosen to use are nice and well done. Even the

header up here goes translucent when you scroll down. And there's a competent contact card as well

as a footer. So, overall, nicely done. Now, let's take a peek at the diffusion model result. Okay, again, very similar. I do see perhaps a bit less

eloquence in terms of the hero section right here. It's less high-tech, modern, a bit simpler. Nice hover effect, though. And if we scroll down, yeah,

we can definitely see this is slightly worse just in terms of overall quality. Although, now that I

see it, again, it's a sort of a tossup. This one's more together and coherent. But that is a good-looking contact card when judged independently. And

we also have a footer and the header does go translucent as well. So really the whole point of this is just to give a few sidebyside examples from the

diffusion model versus the non-diffusion model. Same quantization. The non-diffusion model was the

QAT version. So it will produce some strong results, just as we saw here. Overall, I wanted to just do a video on this model because while it may not be

the most interesting thing to test by itself, it's very exciting from a research and directional standpoint, especially for local AI, which is a very

hot topic right now because of what happened with Fable. I'm not going to waste anyone's time making a

video where I just give my opinions on the situation. I don't like doing stuff like that. I personally just, I don't know, I don't enjoy that.

I'd rather test models. I will say I see a lot of folks fear-mongering saying like buy a GPU, go into debt if you have to. That is probably one of the

dumbest reactions to that specific scenario that I could imagine. Yes, it shows that intelligence

can be taken away at any point in time if it is a model that does not exist locally with you. I don't think somebody should go spend $5,000 on a DGX

Spark to protect themselves against that. I think a proper trade-off and probably what a non-reactionary person would say is go buy a hard drive with

a couple terabytes and download some of the models that are available on Hugging Face. So if the day

comes when things are actually really being taken away from us, you have them and then at that point you can go into debt and buy the system then that

will inevitably be more powerful than the gold box you would buy now, kind of accomplishing the same thing without the fear-mongering of giving

people advice to spend money they don't have when, until that point, a $20-a-month ChatGPT or Claude

subscription will fill the void better than a DGX Spark that you take a loan out on. So I am a bit, you know, I have opinions, but I try not to let them

come in the channel too much. So back to Diffusion Gemma: that is going to conclude our first look and test of this model. It is very exciting from

a development standpoint. Again, I will put the link for the keyboard demo on Hugging, uh, not

Hugging Face, on GitHub. I'll put that in the description. I'm not a DeepMind research scientist, so take it with a grain of salt, but more or less

it's just a visual aid, I think, to help conceptualize some of what's going on here. And overall, it's really cool to see this and play with it. So, I

wanted to cover it. Some folks had suggested I will or would, and I have. So, thanks for the

suggestions. And if you have any questions, leave them in the comments. Thanks for watching.
