Qwen 3.6 27B Breakthrough: Running Local AI on Nvidia DGX Spark? (Transcript)

Let's say my name is Zero and on October 25th, 2025, I had a really important decision to make. I had two distinct choices, one of which was to

purchase an ounce of gold at the current market price since it seemed like it might be a good idea. The other was to pre-order an Nvidia DGX Spark,

which at the time was heralded to be this massive advancement in local AI, even though token prices were

lower than they'd been in quite some time, and I just didn't know what to do. So, let's say I made one of those choices and then sped up to today.

What would that actually look like in terms of local AI? Should I have sold the gold? Should I have just bought 3090s and put those in my personal

safe instead? What would have been the better option for running Qwen 3.6 27B? Welcome to AI Flux. Let's

get into it. I would like to kindly ask that you like, subscribe, and share this video. But, let's get into the real question here of whether or not

you should have purchased gold or an Nvidia DGX Spark. So, obviously the price of gold went up a lot. It went up a lot, a lot. Approximately to the

point where it really would have cost about as much as buying an Nvidia DGX Spark at MSRP. And what's

interesting is the DGX Spark had kind of a rough launch, predominantly because most people running local AI, back before agentic AI was really

a big reason why people were buying GPUs, were not researchers. And researchers and people who do local AI use GPUs very

differently. And this has to do with what they're doing, right? With people who are using local AI, generally

speaking this is quantization, but mostly it's just running inference because we're using these models. We are power users of this technology to get

things out of them. And people who are researchers and who are producing models, they needed just a smaller version of a GB200 or GB300 with a very

similar shared memory architecture so that they could have a better idea of how to actually make these

models. And initially, this was a very confusing thing to local AI enthusiasts. Mostly that models that you would think would perform massively well

performed okay, and in certain cases just not very well on these systems, even if you had multiples of them, which you should watch this video if you

want to learn more about that. And what's interesting is we've finally reached a point where we're

understanding how to use these models better with the DGX Spark and these unified memory models. Now, what do I mean by unified memory? Basically, I

mean that the CPU and the GPU share the same pool of memory. So yes, there is in theory a huge speed up because everything's in the same lookup table.

If the CPU wants to do something, it's looking in the same place that a GPU is. However, the memory

model that is built for inference and running a lot of these models on GPUs is very, very different, and it's why both of these configurations are

just very different animals that come from entirely different continents. And for the longest time, people wrongly thought that the DGX

Spark and Apple Silicon, you know, like the Mac Studio or a Mac mini, God forbid if you bought one for

OpenClaw, were actually very similar. And what's important to note here is that Apple Silicon and what Nvidia does on the DGX Spark could not be more

different. Sure, they're both using memory shared across a number of different processor types, but each architecture is uniquely its own. And what I'd

like to talk about today are some advancements that have started to build a much better case to use

the DGX Spark over a pile of GPUs. Because what's really interesting at this price point, which ironically coincides with how much an ounce of gold

was at the absolute top, so around $4,000, is that for the same price or probably a little bit less, you could just go and buy 3090s with or without

NVLink bridges, and at that point, you have what is otherwise called the value king of AI rigs with the

same amount of VRAM as you would get with a DGX Spark from Nvidia. 128 gigs, of course at a different speed, of course the 3090s don't support

things like NVFP4, but the question is, do you really need NVFP4 to have kind of the best performance with these models? And we're going to get into

that right now. So, there have been a lot of interesting advancements that have happened really just in

the past week. So, on April 24th, we started to see the first real deployments of DFlash in usable Qwen 3.6 27B quants. And this is one of the first

leaps that was something you just can't do on a 3090. In theory, you could potentially try this with a 5090, but whether or not you would see a

performance gain is an open question. And it's also cool to see that ZLab is finally getting some

well-deserved attention. So, if you want to learn more about DFlash, there's a video in the description you should definitely watch. But what this

provided was a three to five x performance boost on Qwen 3.6 27B on the DGX Spark. It's also really cool that this came out of ZLab, and this is

currently pinned to vLLM. You could probably do something similar with SGLang, but vLLM is currently the way

to do this. And things got even better. And we're starting to see some of the fastest entries on the LLM leaderboard be models you actually can't run

on a 3090 and that also require enough unified RAM to run anything. So, the idea that you could even spread this across multiple GPUs is no longer as

cozy a reality as it used to be. And you might think, well, this is kind of bad. I really like my

pile of GPUs, but what's interesting is in terms of producing local GPUs at scale and having these things be something you can have in your closet

that won't run up, you know, a $400 power bill, the move towards unified RAM Nvidia hardware I don't think is the worst thing, and I think the

potential for this to become a much more juicy market for Nvidia might kind of start to save us from Nvidia

doing crazy things like re-releasing the 3060 Ti because they need to sell a GPU to someone that's not a data center. What else can we learn here?
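
As a rough way to put numbers on the speedup described above, the sketch below estimates a decode-speed ceiling for a model whose every forward pass has to stream all of its weights from memory. The 273 GB/s figure is the DGX Spark's published memory bandwidth; the 27B dense parameter count is read off the model name, and treating the quoted 3 to 5x boost as roughly five accepted tokens per weight pass is purely an illustrative assumption, not something the video states. The function names are made up for this example.

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound dense model.
# All hardware and model numbers below are assumptions for illustration.


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (ignores KV cache, scales, activations)."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billion: float,
                         bits_per_weight: float,
                         tokens_per_pass: float = 1.0) -> float:
    """Upper bound on tokens/s if every forward pass streams all weights once.

    tokens_per_pass > 1 models speculative-style decoding, where one pass over
    the weights yields several accepted tokens on average.
    """
    passes_per_s = bandwidth_gb_s / weight_gb(params_billion, bits_per_weight)
    return passes_per_s * tokens_per_pass


SPARK_BW = 273.0   # GB/s, published DGX Spark memory bandwidth
PARAMS_B = 27.0    # assumed dense parameter count, taken from the model name

for label, bits in [("FP8", 8), ("4-bit", 4)]:
    base = decode_ceiling_tok_s(SPARK_BW, PARAMS_B, bits)
    boosted = decode_ceiling_tok_s(SPARK_BW, PARAMS_B, bits, tokens_per_pass=5.0)
    print(f"{label}: {weight_gb(PARAMS_B, bits):.1f} GB weights, "
          f"~{base:.1f} tok/s plain, ~{boosted:.1f} tok/s at 5 tokens/pass")
```

Under these assumptions the plain ceiling lands around 10 tokens per second at FP8 and around 20 at 4-bit, which is why numbers near 100 tokens per second only make sense if each pass over the weights yields several tokens.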

This development didn't happen in a vacuum. This started to happen when Qwen 3.6 was released, and it turned out to be really similar to Qwen 3.5.

This is something that Meta could take a hint from, which is people who quantize and really love using

your models and doing crazy things with them enjoy it when you don't completely rip out and change the architecture every single time you release a

model. There were people that initially started posting the ability to get well over 100 tokens per second on the generative side on DGX Spark

GPUs. And the impressive part about this is that this performance is coming from a system that actually

has relatively low memory bandwidth. And this goes back to my point about why unified memory systems are so different from GPUs. And it's the fact

that the bottleneck is not necessarily how fast you can pull stuff out of the GPU back into a PCI bus and then back up because it's actually the

same going both ways. So, you sort of get a speed up just on the fact that you're not having a

conversion loss every time you're going from the GPU to the CPU and system memory and back because it's all the same memory. This is another example

of someone getting really great performance using DFlash. And what's also really cool is this doesn't have to kind of cheat with Nvidia's

proprietary NVFP4 precision. It's actually using FP8 precision, which a lot of people would generally refer to

as a better model to work from. INT4 models initially were probably the first really quirky performance gain that we saw with the DGX Spark. And

people initially were like, "Well, why is this happening?" And it pretty much again is using hardware to skip a conversion loss of having to push

weights and activations over system memory, basically loading them into the GPU to do anything and then

waiting for that to happen. And NVFP4 pretty much just also compressed the activations as well as the weights, and that's why we saw a nice little

speed up. But what's nice is the FP8 model is in theory just a more capable model. So, it's cool to see that this approach isn't leaning on only NVFP4

and FP8. And unfortunately, you can still really only do this on, at minimum, a DGX Spark. I'll make some

future videos on whether or not you should buy the B200 or the B300, but for most people, myself included, I think that's a little bit outside of

their budget. Now, another really cool tool that has come onto the scene is a site called Local Maxing. So, I'll put a link below. And what's really

cool is the number one spot for Qwen 3.6 is a model that was hosted on the DGX Spark. And ironically

enough, yes, it is an NVFP4 model, but it still proves my point that the hardware that the fastest version of

this is being run on is in fact a DGX Spark and not a 5090. So, I'm curious what you guys think. We're starting to see a lot of really interesting

developments around this. I personally still like my pile of GPUs because I happen to do a lot of

other processing on these that are ML vision pipelines. So, to round out the video, I do want to look at a bit of pricing. So if you go into eBay,

there are actually a lot of very curious deals on the DGX Spark coming out of Japan right now. So if I could put DGX Spark, and let's say we want to

buy the Nvidia version. So you'll see that the Nvidia versions are whoa, wow, here's one in Germany for

only $1,200. You'll see here that brand new from some electronics retailer, $5,000 for the PNY DGX Spark GB10. And generally speaking, you know, if

you want to really cheap out and get the ASUS version, which frankly, it's nice that they have the power button on the front, you're still going to

be paying, you know, around $4,000. And what's also interesting is there are quite a few deals to

be had in Europe. So if you're in Europe and you happen to want to buy one of these, you might be in luck. Wow, in Spain as well. So yeah, buying

these new is potentially kind of a hit or miss. If you're willing to buy them used, you might be able to get a really good deal on one of these. Now

if you go to sold items, so I want to see here. So this is where things get interesting. So if you're

willing to pay a little bit in tariffs and just shipping, you can actually still get these for a pretty good deal. So they're right around 4,000 for

the most part. Curiously outside the US, these are still pretty affordable. Of course, there's still some really interesting scams going on. Like this

is a reservation only, so I wouldn't fall for that. Yeah, unfortunately, unless you're in Europe

where these are surprisingly inexpensive, here's someone who sold two of them for $1,600. That's very... Yeah, so if you're in Europe, please start

hoarding these and sending them to me in boxes that are not labeled Nvidia. So I'm curious, are you using GPUs? Are you using an Apple silicon-based

system? Let me know in the comments below. As always, I hope you learned something, and I'll see you in

the next one.

