In some ways, this one's a little better because the car is actually not sucked into the ground. Google has released a very exciting open-source model
called Diffusion Gemma. Now, aside from being Apache 2.0 licensed, this is something that really may represent one of the future directions of AI that
can run quickly and efficiently on basically your at-home gaming PC. So, for today's video, we're
going to be taking a look at this with some technical things thrown in because I personally have a difficult time conceptualizing sometimes like the
diffusion models and how they work. I find it can be difficult to grasp with some of the examples that do exist. So, we're going to get started by
just taking a quick peek at some of the interesting things of note about this model. Please do feel free
to subscribe as I do want that 100K plaque. And let's take a peek at Diffusion Gemma. So, Diffusion Gemma, as they say right here, is a 26 billion
parameter mixture-of-experts model with 4 billion active. So, comparable to Gemma 4 26B-A4B, at least in terms of some of the benchmarks we'll be
seeing here. And we will also run some side-by-side comparisons, not only to get a feel for the speed
differential with this model, but also the quality of the output, as that is a consideration, especially with an experimental model like
this. Now, they do mention right here, and this specific benchmark was performed on an H100 GPU, which is still not something a consumer is going to
have in their home. But we can see right here the speed of the Diffusion Gemma model versus the
regular Gemma 4 26B with MTP, multi-token prediction, which just means the regular model is already a little faster at generating tokens. Diffusion Gemma is still significantly faster. And
considering that increase in speed, the drop-off in intelligence is small compared to the massive boost in speed that one gets. So this is very
exciting for local AI. And basically the next section is going to give us specific insight into why. The
thing that is important here for folks who are interested in local AI, which is very popular right now, is this: they basically say most language
models act like a typewriter, generating one token at a time from left to right. The gist of this paragraph is essentially that this works very well for big
data centers where models are being served to a large number of users at one time. So the requests can
be batched together and that efficiently utilizes those GPUs as they are set up in a data center to serve many different users. A home user like you
or me, who is just playing with this on our own system, is bound by different limits in what a single GPU is able to do. So a model that works the
traditional way, an autoregressive LLM, is not going to take full advantage of our GPU. There's
going to be a lot of waiting happening as it generates that one token at a time. This diffusion model essentially inverts that because it keeps the
GPU busy while it's generating those batches of 256 tokens at a time. So it does more work in that same amount of time. More properly utilizing the
GPU's computational ability, I guess you could say. Diffusion Gemma utilizes your hardware to its full
potential. The processor is doing a larger chunk of work at once, and it upgrades your model inference from a single sequential
typewriter to a massive printing press that stamps an entire block of text simultaneously. TL;DR: a home user with, say, a 5090 will have their GPU more
efficiently utilized by this model. Therefore, it runs faster. It's a better fit for a single-user
setup. So, there is a section here on how text diffusion works, and they pretty simply list it in three steps. One, the canvas. The model starts
with a canvas of random placeholder tokens. Two, the model makes multiple passes. This is iterative refinement, locking in correct tokens and using
them as context clues to refine the rest. And then finally, the text converges into high-quality output.
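The three steps above can be sketched as a toy loop. This is only an illustration under stated assumptions, not how Diffusion Gemma is actually implemented: `toy_predict` is a hypothetical stand-in for the neural network (it simply reads the answer from `TARGET` and attaches a made-up confidence), the canvas starts as a single mask symbol rather than random placeholder tokens, and locking in the few most confident positions per pass is just one plausible schedule picked for the example.

```python
import math

MASK = "_"                  # placeholder symbol for an undecided position
TARGET = "hello world"      # what the pretend model "knows" (assumption)


def toy_predict(canvas):
    """Hypothetical stand-in for the network: one (token, confidence) per position."""
    preds = []
    for i, tok in enumerate(canvas):
        if tok != MASK:
            preds.append((tok, 1.0))    # already locked in
            continue
        # Made-up confidence: a fixed per-position base, plus a bonus for each
        # locked neighbour (locked tokens act as context clues).
        neighbours = [canvas[j] for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        locked = sum(1 for n in neighbours if n != MASK)
        base = 0.1 + 0.05 * ((i * 7) % 11)
        preds.append((TARGET[i], base + 0.15 * locked))
    return preds


def denoise(length, steps):
    """Step 1: blank canvas. Step 2: repeated passes. Step 3: converged text."""
    canvas = [MASK] * length
    history = ["".join(canvas)]
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        preds = toy_predict(canvas)
        # Lock in enough positions this pass to finish within the step budget.
        k = math.ceil(len(masked) / (steps - step))
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            canvas[i] = preds[i][0]
        history.append("".join(canvas))
    return "".join(canvas), history


result, history = denoise(len(TARGET), steps=4)
for row in history:
    print(row)
```

In this sketch, eleven characters get filled in over four passes, where a one-token-at-a-time typewriter would need eleven passes. Each pass touches several positions at once, which is the printing-press idea in miniature.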
Now, I personally find this graph a little difficult to conceptualize. So, I have a different example here that I'd like to kind of ascribe these
three steps to. This was made with the help of Claude Fable. Um, you know, rest in peace. But basically, this is a 3D model of a keyboard. And the
whole thing that we're going to see here is these blank keys are essentially what we see in step one
where the model is going to start with a canvas of random placeholder tokens. Assume any blank key here is equivalent to one of those step-one
placeholder tokens. Now the task here is basically that, step by step, as this denoises, it's going to converge towards one of four keyboard layouts. This is actually
using a real diffusion model that was trained on this laptop and it has awareness of four real
keyboard layouts: QWERTY, Dvorak, Colemak, and AZERTY. So, as this starts to refine its predictions and actually denoise, it's going to lock in
keys which will lead us to seeing step two right here where the model makes multiple passes locking in correct tokens and using them as context clues
to refine the rest. So when we start to see keys here that are locked in, those will then be used in
the next denoising step to refine the rest, or in this specific example, to point towards one of these specific keyboard layouts, because it knows, oh, if Q
is right here, then this is very likely going to be QWERTY or something of the sort. Now I understand it can still be a little difficult for this to
make sense. So let's just start by manually initiating the first step where this is actually going to
be what we see right here, where we have our noisy canvas and it is going to perform a denoising step. Now keep in mind, the keys being blank right here is
not a one-to-one match for how this is shown right here, but more or less this is just designed to be a visual aid. So okay, that's where we
start with the noisy canvas and now it is going to predict a few and then it is going to remask these.
So what we just saw there is it made a first pass and prediction, and these are the predictions of keys that it was confident enough in to actually keep.
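One way to picture the keep-or-remask decision from that first pass is a simple confidence cutoff. This is a minimal sketch, assuming a fixed threshold; the keyboard demo and Diffusion Gemma may well use a different rule, and the positions, keys, and confidence numbers below are invented purely for illustration.

```python
MASK = None  # a blank key


def denoise_step(canvas, predictions, threshold=0.9):
    """Keep predictions at or above the threshold; everything else stays blank.

    canvas:      list of keys, with MASK for positions not yet decided
    predictions: dict of position -> (predicted key, confidence)
    threshold:   hypothetical cutoff, chosen only for this example
    """
    new_canvas = list(canvas)
    for pos, (key, confidence) in predictions.items():
        if canvas[pos] is not MASK:
            continue                    # already locked in on an earlier pass
        if confidence >= threshold:
            new_canvas[pos] = key       # confident enough: lock it in
        else:
            new_canvas[pos] = MASK      # not confident: remask for the next pass
    return new_canvas


blank = [MASK] * 5
first_pass = {0: ("Q", 0.55), 1: ("B", 0.97), 2: ("H", 0.93), 3: ("M", 0.48), 4: ("N", 0.52)}
after_first_pass = denoise_step(blank, first_pass)
print(after_first_pass)
```

With these made-up numbers only B and H clear the cutoff, so they stay on the canvas while the other three positions go back to blank and get another chance on the next pass.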
So these are not getting remasked which is just basically when these go back to blank from where they were. Now in step two they say that the model is
locking in correct tokens and using them as context clues to refine the rest. So B and H being right
here in our example, we see H has eliminated the Dvorak keyboard option. Now, this is a simplified example because there are only four potential correct
options. In reality, this is a gigantic neural network with a vocabulary of something like 262,000 tokens. The point I'm trying to make is this
is a much simpler explanation, but the process is the same. So, we're going to run another
denoising step right here. And okay, it has predicted another key. And if we hover over these keys that have been remasked, we're actually seeing
potential probability distributions of what this key may be, depending on the canvas as it sits right now. So that really ties back into step two
right here where it iteratively refines, if we go back right here, with multiple passes, locking in correct
tokens and using them as context clues to refine the rest. So these are all being used as context clues, which can be seen by the fact that the actual
probability distributions for what any of these specific keys could be will change as we run additional steps. So I think this one let's find one
that's like not 100% certain and that would be this position 37. So let's run one other step and see how
our probability distribution changes with some more tokens locked in. Okay, I do believe those two got inverted in terms of their percentage, but I
don't actually 100% remember. So, okay. So, things are changing a bit. And as we run steps, we'll see more keys are actually getting filled in. And
these would just essentially be locked in as the denoising steps proceed. Okay, that still seems to be
going. And it's still fairly unsure about this one. I've played with this, so I know this key is one that generally does not get confidently
predicted until much later in the process. But we can see that the percentages are changing slightly as to what it expects this key to be. And we'll
just keep running steps right here. And the elimination of potential keyboard layouts that we see
here is a function of this specific demonstration that simplifies kind of what happens in the actual diffusion process. But just for a simple visual
demo, I find it can be helpful to see like, oh, okay, so that token being that can actually allow this to eliminate something. So now it's swapped
back. So now it's more likely that M will appear there. Whereas these had kind of flip-flopped a bit, and
then finally it got to a point where it knew enough tokens to know definitively that, okay, the actual layout is QWERTY. Again, that's a
simplification of what happens. Another caveat of this demo is that it just starts completely randomly, like a roll of the dice. In reality,
when speaking with something like Diffusion Gemma here, the prompt that we actually send it will heavily influence the output.
It won't just randomly start picking letters here. It will have some context as to where to, I guess, begin, for lack of a better term. So really
this is something I just wanted to show because I found that it actually helped me personally to better conceptually understand what's going on in the
diffusion process. Keep in mind I'm not a Google DeepMind scientist, and I whipped this together with
a now-deceased model, Claude Fable. So, I mean, I will put this on GitHub, with the link in the description, but be sure to, like, run it through an LLM if
you're confused about anything. Though if we hit auto right here, this is actually a trained diffusion model that's running locally on the system. A
potato computer will be able to train the model as the example will be on GitHub. So no worries about that.
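The layout elimination seen in the keyboard demo can be approximated without any neural network at all. The sketch below is an assumption-heavy simplification, not the trained diffusion model from the demo: it uses only the three letter rows of each layout, numbers the keys 0 to 29 (which is not the demo's own numbering), and weights every surviving layout equally when guessing what a blank key might be.

```python
# Three letter rows of ten keys each, flattened into positions 0..29.
LAYOUTS = {
    "QWERTY":  "qwertyuiop" "asdfghjkl;" "zxcvbnm,./",
    "Dvorak":  "',.pyfgcrl" "aoeuidhtns" ";qjkxbmwvz",
    "Colemak": "qwfpgjluy;" "arstdhneio" "zxcvbnm,./",
    "AZERTY":  "azertyuiop" "qsdfghjklm" "wxcvbn,;:!",
}


def surviving_layouts(locked):
    """Return the layouts that agree with every locked-in key.

    locked: dict of position -> key that has been locked in so far
    """
    return [
        name
        for name, keys in LAYOUTS.items()
        if all(keys[pos] == key for pos, key in locked.items())
    ]


def key_distribution(position, locked):
    """Guess a blank key by giving each surviving layout an equal vote."""
    survivors = surviving_layouts(locked)
    dist = {}
    for name in survivors:
        key = LAYOUTS[name][position]
        dist[key] = dist.get(key, 0.0) + 1.0 / len(survivors)
    return dist


locked = {15: "h", 24: "b"}                  # H and B locked in, as in the demo
print(surviving_layouts(locked))             # Dvorak no longer fits
print(key_distribution(26, locked))          # a blank key with more than one candidate

locked.update({0: "q", 2: "e"})              # more keys locked in on later steps
print(surviving_layouts(locked))
```

Locking in more keys changes the distribution for every remaining blank key, which is the same effect the hover percentages show in the demo, just with a far cruder rule than a trained model would use.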
And then if you click auto you get to see the process happen just in real time. I was manually clicking step by step just so we could get a better
look at some of these things. And finally, before we get to playing with this, this is just the actual model on Hugging Face here. It is Apache 2.0,
which is awesome. And there are a bunch of different quantizations as well. For Mac systems, there
are MLX versions, and I do believe this is now supported in Unsloth Studio. When I first started playing with this, it wasn't. So, there was kind of a patch to
llama.cpp that allowed me to use this. Therefore, the speeds that we're going to see here are not going to be anywhere close to what's listed in that
specific announcement post where it's doing like 1100 tokens per second. And that's because I'm not
necessarily running this in the most GPU optimized way. But the whole point of it is just to see that there is a speed up even without any like real
optimization for how this should be served. In addition to that, we're going to do side-by-side tests just of how the results compare to its non-diffusion
sibling. So to begin, we're just going to run some side-by-side speed comparisons with the same
context length on a 5090 mobile laptop. So this laptop has 24 gigs of video RAM. I am just running this at a low context length of 48 tokens. I
have thinking disabled and we're basically going to send it something simple just to get a feel for the speed. And what we're going to see right here,
this is just a cool visual trick. This is not actually a visualization of the diffusion process. Unsloth
does have a command-line tool that you can use if you go and click on their documentation right here in the run Diffusion Gemma guide. There is
actually a command-line way to see the diffusion process happen in real time. This is just simplified to make it look pretty. Okay, so
we see right here that the total speed was 93.6 tokens per second, which really is not great, but
this is a 5090 laptop system, and this is a 4-bit quantization of a 26 billion parameter mixture-of-experts model. Now, we're going to go into LM
Studio instead. I now have the non-diffusion version of this model in LM Studio here. This is the Gemma 4 26B-A4B, and this is the QAT version, which just
means that it hypothetically performs better even at this 4-bit quant than the non-QAT version. It's
called quantization-aware training, and it makes the model perform better when it's quantized versus just the standard model being quantized down to 4-bit. I
know that may seem confusing. It's not super important right now, but just know that this will have good quality comparatively. So, I'm giving this
the same exact prompt. We have the same exact low context length set and thinking is also disabled
here. So, we're going to see immediately the token speed that we get and we'll be able to compare them side by side. And right there we see that we
got 61.3 tokens per second in very similar testing environments with the same prompt sent to it and the same total context length set. So the
differential right there just between the two was 94 tokens per second versus 61 tokens per second. And this
is not in an optimized way. I'm not serving this through vLLM or using, like, a special version that would accelerate it more on this specific Nvidia
card because I'm just more focused on the display of the diffusion model I suppose here. However, let's just do something else now. So, I'm just
instructing this to write a few paragraphs about the iMac G3. Okay. And this did actually think. So, that
may have messed up our first initial test. I don't know why the thinking toggle button isn't working here. Um, Fable must have messed that up. And we
see, okay, we got a few paragraphs as well as thinking at a speed of 114.2 tokens per second. Now, it is very possible that this was an unfair
comparison because we didn't actually run the LM Studio version with thinking enabled. So, let's fix
that. All right. So, now we're back in LM Studio, and this time I have ensured that thinking is toggled on here as well, as I missed it the first time.
I do apologize for my incompetence there. And we've just asked it to write a few paragraphs about the iMac G3. And we see that was 57 tokens per
second versus what we got from the diffusion model for the same exact prompt, the same context length, which was 114 tokens
per second. So that is quite a speed difference. I understand this is not a scientifically accurate test setup, but more or less it gives us some
understanding of, okay, on the same exact system, both with 4-bit quantized models, this one is running significantly faster. And that is what is very
cool about the whole idea of a diffusion text model, specifically for local AI: it's going
to utilize the GPU in a way that a single user is going to get more benefit out of. And that's really what was said in the announcement post
there, where they said that an autoregressive LLM, like what we're using in LM Studio, is better suited for a data center. It's
not as fast right here. But if there were a thousand of me speaking to this right now, the architecture
would work more efficiently for a data center serving this to a bunch of different people with batched requests and things like that. But a single
person at home is going to get more out of the diffusion model in terms of speed. Now, let's talk a little about some of the downsides of this. And
this is going to bring me into the next test, where I'm going to need a system that's going to be able to
handle a significantly longer bit of context than this laptop right here. So, I'm going to use my RTX 6000 Pro Blackwell card right now. And that is
on the beige box behind me. And it's running here. It's the same exact model, the same exact quantization, the same GGUF, everything. The only thing
is I'll be able to really extend the context length here. So, we can do some true side-by-side testing
in terms of the result quality. So, I'm going to be giving this a simpler version of the tried-and-true browser OS test. I have begun this just using
the web interface that I have hooked up to the system behind me with a longer context length as well as thinking enabled and I'm also going to just
begin it from LM Studio which is actually running locally on this system. So from this point on any
speed differential we notice is not at all comparable, because one is a laptop and one is a desktop with a big beefy card. We're only focused on
comparing the quality of the results right now, as that's important as well, though this is experimental and they do specifically mention that. All right.
So I now have two browser OS results. One was created by the regular Gemma 4 26B-A4B 4-bit quant QAT version.
So a good 4-bit quant of the normal version. So that is right here and we'll open that and take a look at it. Okay, there is no right click. I'm not
going to spend a lot of time going through these, but just to give ourselves a sense of quality comparability between some random stuff. We have a
clock that is the correct time. We have a start menu. Okay. And it just tells us like there is no start
menu, but all right. A decent notepad that opens when you click once. A decent calculator. 55 * 6 is 330. Good. And then a somewhat functional terminal
actually. Okay. So, this is overall not a bad result for a model of this size, especially that heavily quantized. Now let's take a peek at what we
received from the diffusion version of this model. Okay, we have a gradient background. There still is
no right click, but there is a clock with the correct local time. There is also a start menu, which just doesn't do anything. So, okay, we have a
notepad. Good. Very similar in terms of the notepad. Our calculator is much simpler in, like, the aesthetic. 54 * 30. Where's the equal sign? Okay, so
that's perhaps, like, a good demonstration of some of the differences. Okay, so 54 * 3 would be 162, but
that's okay. So just a good thing there. And then a browser, which is very interesting. Okay, I didn't expect this to whip out a functional browser
from the 4-bit diffusion version, but nonetheless, it did. And it actually led us to Wikipedia, which would work in this because it's not going to
block embedding. So, impressive, but also just a good simple demonstration of some of the quality differential
in a, like, non-controlled environment. So let's do another one. So the next prompt is going to be: in a single HTML file, make me a 3D driving game. And it
is also running online here with the 6000 Pro. All right, let's take a look at our 3D driving games now. We'll start with
the normal model. So this is the one that was run through LM Studio, the non-diffusion model.
Okay, you know what? Pretty solid. We even have some obstacles. Can we lose? Interesting. So, if we hit those... sorry. Sometimes I get, like, sucked into
actually playing these games. I have to remember this isn't like a normal model test, but really this is not bad for the quantization and the model size.
Very, very solid output. Now, let's take a peek at what we got from the diffusion model. And
you can see the relation between the two in how similar the yellow obstacles they placed are. I would almost go out on a limb and say in some ways
this one's a little better because the car is actually not sucked into the ground. But the entire point of this is just to showcase that for the speed
increase we get with the diffusion model, the loss in intelligence is not huge compared to the
gain in speed. And depending on the type of task you're going to do, that may be a very acceptable trade-off. This is, dare I say, good. And let's just
do one more, maybe, like, a beautiful static front end for something. We'll go back to the throwback test that I liked to run very often. So, we'll do
the classic Steve's PC repair website generation and we'll also go and run that from within LM
Studio. Again, I would like to reiterate that everything we're doing now is purely a quality based comparison because the LM Studio version right here
is running locally on this laptop and then the version that's running on the web right here is running on the big desktop behind me. So, there's no
comparison of speed here, only quality. And finally, let's take a look at the comparison between our
Steve's PC repair websites. I think this time we'll just stick to the normal order. We'll start with the non-diffusion version. Okay, this looks quite good.
Everything here is arranged nicely. We do have some slight hover effects on these cards. The icons it's chosen to use are nice and well done. Even the
header up here goes translucent when you scroll down. And there's a competent contact card as well
as a footer. So, overall, nicely done. Now, let's take a peek at the diffusion model result. Okay, again, very similar. I do see perhaps a bit less
elegance in terms of the hero section right here. It's less high-tech and modern, a bit simpler. Nice hover effect, though. And if we scroll down, yeah,
we can definitely see this is slightly worse just in terms of overall quality. Although, now that I
see it, again, it's a sort of a tossup. This one's more together and coherent. But that is a good-looking contact card when judged independently. And
we also have a footer, and the header does go translucent as well. So really the whole point of this is just to give a few side-by-side examples from the
diffusion model versus the non-diffusion model. Same quantization. The non-diffusion model was the
QAT version, so it will produce some strong results, just as we saw here. Overall, I wanted to just do a video on this model because while it may not be
the most interesting thing to test by itself, it's very exciting from a research and directional standpoint, especially for local AI, which is a very
hot topic right now because of what happened with Fable. I'm not going to waste anyone's time making a
video where I just give my opinions on the situation. I don't like doing stuff like that. I personally just, I don't know, I don't enjoy that.
I'd rather test models. I will say I see a lot of folks fear-mongering saying like buy a GPU, go into debt if you have to. That is probably one of the
dumbest reactions to that specific scenario that I could imagine. Yes, it shows that intelligence
can be taken away at any point in time if it is a model that does not exist locally with you. I don't think somebody should go spend $5,000 on a DGX
Spark to protect themselves against that. I think a proper trade-off and probably what a non-reactionary person would say is go buy a hard drive with
a couple terabytes and download some of the models that are available on Hugging Face. So if the day
comes when things are actually really being taken away from us, you have them, and at that point you can go into debt and buy a system then, one that
will inevitably be more powerful than the gold box you would buy now. That kind of accomplishes the same thing without the fear-mongering of giving
people advice to spend money they don't have, when until that point a $20-a-month ChatGPT or Claude
subscription will fill the void better than a DGX Spark that you take a loan out on. So I am a bit, you know, I have opinions, but I try not to let them
come into the channel too much. So, back to Diffusion Gemma. That is going to conclude our first look and test of this model. It is very exciting from
a development standpoint. Again, I will put the link for the keyboard demo on Hugging, uh, not
Hugging Face, on GitHub. I'll put that in the description. I'm not a DeepMind research scientist, so take it with a grain of salt, but more or less
it's just a visual aid, I think, to help conceptualize some of what's going on here. And overall, it's really cool to see this and play with it. So, I
wanted to cover it. Some folks had suggested I should, and I have. So, thanks for the
suggestions. And if you have any questions, leave them in the comments. Thanks for watching.