Running Qwen 3.6 27B Local AI on Nvidia 1080 Ti? Transcript

Currently, you're probably more likely to find gold buried in your backyard than affordable VRAM of any kind. And these problems aren't just affecting

local AI enthusiasts. They're affecting ordinary gamers as well. And the irony is that at this point it's pretty hard to find any GPUs even

all the way down through the 2000 series as people are just trying to find GPUs to run anything on

their gaming rig. And one enthusiast thought, "What if I could actually try to run the latest version of Qwen 3.6 27B on two GTX 1080 Tis?" So, not

even RTX. They hadn't even invented that yet. Welcome to AI Flux. Let's get into it. If you're into local AI anything, you should follow Sandro and

Daniel Moll on Twitter. So, Sandro posted this today: 2016 hardware running Qwen 3.6 27B on two Pascal-era

GTX 1080 Ti Nvidia GPUs, which is kind of crazy. What's really impressive is he managed to eke out 14 tokens per second with a 131K context window,

predominantly by using Turbo Quant for the KV cache, with the keys at Q8 and the values at Turbo 4, which about doubles the context at zero speed cost. And this is something that

you would probably expect to see from someone explaining how to run this on RTX 3090s with a lot more

VRAM and NVLink, which is normally how you see these kinds of speed-ups: the software finds a much more efficient way to

shuffle information between GPUs directly via VRAM instead of going through the system. Let's hop into what made this possible. I want to be

clear: the reason I'm talking about this is that it's another option for a bare-minimum entry point for

doing local AI, in this case specifically agentic tasks, not really anything that involves too much heavy lifting. Because once we start getting into models

that are quantized this much, there are things that start to drop off, and even with Turbo Quant we'll start to see that here. What I should say

is that this version of Qwen 3.6 with Turbo Quant, and what that enables, is a state-of-the-art advancement.
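To get a feel for what a quantized KV cache buys in memory terms, here is a rough Python sketch. The attention shape (layer count, KV heads, head size) and the bits-per-element figure for the 4-bit Turbo Quant value cache are assumptions picked for illustration, not Qwen's published config or the fork's real numbers. Only the f16 and q8_0 sizes follow llama.cpp's block layouts. The VRAM figures are the rough ones from this video: two 11 GB cards and about 17 GB of weights.

```python
# Bits stored per cache element for each format.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 8.5,     # llama.cpp q8_0: 32 int8 values + one fp16 scale = 34 bytes
    "turbo4": 4.25,  # ASSUMPTION: 4-bit values plus a small per-block scale
}


def kv_bytes_per_token(n_kv_layers, n_kv_heads, head_dim, k_type, v_type):
    """Bytes of KV cache needed for one token of context."""
    # K and V each store this many elements per token.
    elements = n_kv_layers * n_kv_heads * head_dim
    bits = elements * (BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type])
    return bits / 8


def max_context(budget_bytes, bytes_per_token):
    """Largest whole number of tokens that fits in the given budget."""
    return int(budget_bytes // bytes_per_token)


# HYPOTHETICAL attention shape, chosen only to make the arithmetic concrete.
shape = dict(n_kv_layers=16, n_kv_heads=4, head_dim=256)

# Two 11 GB cards minus roughly 17 GB of weights, treating GB as GiB and
# ignoring compute buffers, so this is an optimistic upper bound.
free_bytes = (2 * 11 - 17) * 1024**3

for k_type, v_type in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "turbo4")]:
    per_token = kv_bytes_per_token(**shape, k_type=k_type, v_type=v_type)
    tokens = max_context(free_bytes, per_token)
    print(f"K={k_type:<5} V={v_type:<6} {per_token:>8.0f} bytes/token  ~{tokens:,} tokens")
```

The point is only the ratio between rows: shrinking the bits per element for K and V stretches the same leftover VRAM over proportionally more context. The absolute token counts depend entirely on the real model shape.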

Previously, with two 1080 Tis, you would only be able to run maybe some of the more capable one-bit models. And one-bit

models should be viewed as models that can pretty much only do one skill at a time. Think of it as using part of a

mixture-of-experts model with just one of the experts running. So, what was Daniel actually running

to make this possible, and what did the performance look like? Again, two GTX 1080 Tis, so these are roughly a decade old. He was running this with

an HP Z840 workstation. So, pretty much just a bare-bones PC that you can find on eBay that has enough slots and decent throughput. One reason the

1080 Ti is such an interesting pick here is that it's one of the few consumer GPUs back in the day that had more than

10 gigs of VRAM. That was a really big deal, and it's something that Nvidia didn't really do for a long time afterwards. For instance, the RTX 3080

only had 10 gigs of VRAM. And the system does have 128 gigs of DDR4 ECC. I would guess there's going to be some offloading here, and I'm curious

to see what back-end they were using. So, the stack. What they're using here is llama.cpp, which is

surprising but also not, because llama.cpp, although it is loathed for being a bit slower than SGLang and vLLM, does tend to have better support for

older GPUs that are just slower at moving information around. Specifically, he was using this llama.cpp Turbo Quant fork, which I'll link below.
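For reference, a launch of this kind of setup might look roughly like the following, written here as a small Python helper that assembles the command line. The flag names are the ones upstream llama.cpp's `llama-server` accepts. Whether the Turbo Quant fork keeps them, and the `turbo4` cache-type name, are assumptions based on the description in this video, and the model path is a placeholder. This is not Sandro's actual command.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size=131072,
                           k_type="q8_0", v_type="turbo4"):
    """Assemble a llama-server command line for a two-GPU box.

    Flags are upstream llama.cpp flags; v_type="turbo4" is an ASSUMED
    name for the fork's 4-bit value-cache format.
    """
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),   # 131K context, as quoted in the video
        "--n-gpu-layers", "99",        # ask for every layer to go on the GPUs
        "--split-mode", "layer",       # spread whole layers across the cards
        "--tensor-split", "1,1",       # equal share for two identical GPUs
        "--cache-type-k", k_type,      # 8-bit key cache
        "--cache-type-v", v_type,      # ASSUMED fork-specific value cache type
        "--numa", "distribute",        # optional hint for a dual-socket host
    ]


# Placeholder path, not a real file.
cmd = build_llama_server_cmd("/models/your-model.gguf")
print(shlex.join(cmd))
```

Building the argument list in code rather than as one long string keeps each flag next to the reason it is there, which makes it easier to swap the cache types and compare runs.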

The model he was actually using was Qwen 3.6 27B, specifically the UD-Q4_K_XL quant, which is only about

17 GB. So, maybe they actually did manage to fit this in VRAM across those two GPUs. The other big thing here is NUMA.

This is an ancient dual-Xeon setup, and NUMA is basically a means of organizing and orchestrating system RAM between both CPUs and

not having it get messed up across PCIe devices. It's an older problem, but this is truly an ancient

system with pretty old GPUs plugged into it, so it's surprising that tooling this new even works. Obviously, the key to making a lot of

this work is the Turbo Quant KV cache. There are a number of things that Turbo Quant does, but in terms of making this possible, the KV cache component

is it. What's cool is that with these two GPUs, you can push the context to around 131,000 tokens. And what's

really interesting is that, as a function of Turbo Quant, this happens with no speed cost. There's no extra information that has to shuffle between the

GPUs to make this happen. We're still looking at 14 tokens per second. So, this is kind of on the rusting, degrading, oxidizing edge of what I would

say is even usable performance for these models. This is a great option if you want to have it run

as a slower agentic stack, something that you're not going to be directly interacting with and watching output come back from the

model. Instead, you're just going to have it churn through some emails, or use the recently released Cloudflare system for answering or sending

emails. It's a great option for that. The other thing is that Pascal is also hitting hard

limits with driver support. But if newer hardware is just out of the question, which I understand because that could be a lot of money, buying two of these is a pretty good

option. And again, 11 gigs of VRAM at this price point basically does not exist otherwise. If you're willing to buy a GPU that is maybe a little bit

older or from an OEM that is a little bit less popular, you have a lot of great options here. We're even

seeing pricing down to $100 if you're willing to find a cooler. And the main host workstations here are also incredibly cheap. So,

for pretty much $450, including the workstation itself without RAM, you can have an agentic local AI test bed that you can run. And

I would say this is probably the cheapest I would go. There are some people who would say you can go

buy some of those embedded AMD boards, but then again, that's an entirely different architecture, and AMD tooling, ironically, is still quite bad. So, I

would still probably suggest buying an ancient HP Z840 and two GTX 1080 Tis before I'd recommend buying AMD hardware. And if you have some questions

about that or you disagree with me, please let me know in the comments below. So, I'm curious what

you guys think. Are you guys going to go out and buy some 1080 Tis to try this out? Or maybe you have a 1080 Ti for gaming and now you've realized you can run the

latest Qwen model. Does this change what you were thinking? Are you going to go buy a second one? Let me know in the comments below. So, as

always, I hope you learned something from this video. Please like, subscribe, and share. It helps me

out a lot with making future videos. And I'll see you in the next one.
