Currently, you're probably more likely to find gold buried in your backyard than affordable VRAM of any kind. And these problems aren't just affecting
local AI enthusiasts. They're actually affecting just ordinary gamers as well. And the irony is that at this point it's pretty hard to find any GPUs,
even all the way down through the 2000 series, as people are just trying to find GPUs to run anything on
their gaming rig. And one enthusiast thought, "What if I could try to actually run the latest version of Qwen 3.6 27B on two GTX 1080 Tis?" So, not
even RTX. They hadn't even invented that yet. Welcome to AI Flux. Let's get into it. If you're into local AI anything, you should follow Sandro and
Daniel Moll on Twitter. So, Sandro posted this today. 2016 hardware running Qwen 3.6 27B with two Pascal
era GTX 1080 Ti Nvidia GPUs, which is kind of crazy. What's really impressive is he managed to eke out 14 tokens per second with a 131K context window,
predominantly by using Turbo Quant for the KV cache, with Q8 for the keys and Turbo 4 for the values, which about doubles the context at zero speed cost. And this is something that
you would probably expect to see from someone explaining how to run this on RTX 3090s with a lot more
VRAM and NVLink, which is normally how you see these kinds of VRAM speed-ups, where you find a much more efficient way to
shuffle information between GPUs directly via VRAM without having to go through the system. Let's hop into what made this possible. I want to be
clear, the reason I'm talking about this is that it's another option for a bare-minimum entry point for
doing local AI, and in this case specifically agentic tasks, not really anything that is too heavy lifting. Because once we start getting into models
that are quantized this much, there are things that start to drop off. And even with Turbo Quant, we'll start to see that here. What I should say, though,
is that this version of Qwen 3.6 with Turbo Quant, and what that enables, is a state-of-the-art advancement.
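To make the context-window point a bit more concrete, here is a minimal sketch of how KV cache size scales with context length and with the bits stored per element. The formula is the standard one for multi-head or grouped-query attention. The architecture numbers are hypothetical placeholders, not taken from the model card, and the 4-bit entry uses the upstream llama.cpp q4_0 block layout as a stand-in, because the exact layout of the Turbo Quant format isn't given in the video.

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, k_bits, v_bits):
    """Approximate KV cache size in GiB for standard multi-head / grouped-query attention.

    Each token stores one K vector and one V vector of n_kv_heads * head_dim
    elements in every layer that keeps a KV cache.
    """
    elems_per_token = n_layers * n_kv_heads * head_dim
    total_bits = n_ctx * elems_per_token * (k_bits + v_bits)
    return total_bits / 8 / 2**30


# HYPOTHETICAL architecture numbers, for illustration only.
# n_layers counts only the layers that actually keep a KV cache.
ARCH = {"n_layers": 16, "n_kv_heads": 8, "head_dim": 128}
N_CTX = 131_072  # the 131K context window

# Effective bits per element, including the per-block scale.
F16_BITS = 16.0
Q8_0_BITS = 8.5      # q8_0: 32 int8 values + one fp16 scale = 34 bytes per 32 elements
FOUR_BIT_BITS = 4.5  # ASSUMPTION: q4_0 layout (18 bytes per 32 elements) as a stand-in

configs = {
    "f16 K + f16 V": (F16_BITS, F16_BITS),
    "q8_0 K + q8_0 V": (Q8_0_BITS, Q8_0_BITS),
    "q8_0 K + 4-bit V": (Q8_0_BITS, FOUR_BIT_BITS),
}

sizes = {
    name: kv_cache_gib(N_CTX, k_bits=k, v_bits=v, **ARCH)
    for name, (k, v) in configs.items()
}

for name, gib in sizes.items():
    print(f"{name:>18}: {gib:.2f} GiB at {N_CTX} tokens")
```

Swap in the real layer count, KV head count, and head dimension from the model's config to get usable numbers. The takeaway is just that cache size is linear in both context length and bits per element, so fewer bits per element buys proportionally more context in the same VRAM.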
This is something where previously, with two 1080 Tis, you would only be able to run maybe some of the more capable one-bit models. And one-bit
models should be viewed as models that pretty much can only do one skill at a time. It's basically as if you're using part of a mixture
of experts model with just one of the experts running. So, what was Daniel actually running
to make this possible and what did the performance look like? So, again, two GTX 1080 Tis. These are roughly a decade old. He was running this with
an HP Z840 workstation. So, pretty much just a bare-bones PC that you can find on eBay that has enough slots and decent throughput. One reason the
1080 Ti is such an interesting pick here is that it was one of the only consumer GPUs back in the day that had more than
10 gigs of VRAM, 11 to be exact. That was a really big deal and it's something that Nvidia didn't really do for a long time afterwards. For instance, the RTX 3080
launched with only 10 gigs of VRAM. And the system does have 128 gigs of DDR4 ECC. I would guess there's going to be some offloading here, and I'm curious
to see what back-end they were using. So, the stack. What they're using here is llama.cpp, which is
surprising but also not, because although llama.cpp is loathed for being a bit slower than SGLang and vLLM, it does tend to have better support for
older GPUs that are just slower at moving information around. And specifically, he was using this llama.cpp Turbo Quant fork, which I'll link below.
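As a rough sanity check on whether this stack can keep everything on the GPUs, here is a back-of-the-envelope VRAM budget. It uses the 11 GB per 1080 Ti and the roughly 17 GB model file quoted in this video. The per-GPU reserve for the CUDA context and scratch buffers is an assumption, and the sketch assumes the weights are split about evenly across the two cards.

```python
def vram_headroom_gb(model_gb, n_gpus, vram_per_gpu_gb, reserved_per_gpu_gb):
    """VRAM left for KV cache and activations once the weights are loaded."""
    usable_gb = n_gpus * (vram_per_gpu_gb - reserved_per_gpu_gb)
    return usable_gb - model_gb


GPU_VRAM_GB = 11.0         # GTX 1080 Ti
N_GPUS = 2
MODEL_FILE_GB = 17.0       # approximate size of the quantized model file
RESERVED_PER_GPU_GB = 0.5  # ASSUMPTION: driver / CUDA context / scratch buffers

headroom_gb = vram_headroom_gb(
    MODEL_FILE_GB, N_GPUS, GPU_VRAM_GB, RESERVED_PER_GPU_GB
)
weights_per_gpu_gb = MODEL_FILE_GB / N_GPUS  # assumes an even split

print(f"Weights per GPU:  {weights_per_gpu_gb:.1f} GB")
print(f"Headroom overall: {headroom_gb:.1f} GB for KV cache and activations")
```

With these numbers the weights fit, and whatever is left over is the budget the KV cache has to live in, which is why the cache format matters so much on cards this small.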
The model he was actually using was Qwen 3.6 27B, specifically the UD-Q4_K_XL quant, which is only about
17 GB. So, maybe they actually did manage to fit this in VRAM between those GPUs. The other big thing here is NUMA, because this is
an ancient dual Xeon setup. NUMA is basically how system RAM is organized between the two CPUs: each socket has its own local memory, and
you don't want that getting in the way of traffic to the PCIe devices. So, it's an older problem, but this is truly an ancient
system with pretty old GPUs plugged into it. So, it's surprising that tooling this new actually even works. So, obviously the key to making a lot of
this work is the Turbo Quant KV cache. There are a number of things that Turbo Quant does, but in terms of making this possible, the KV cache component
is it. What's cool is with these two GPUs, you can push it to around 131,000 tokens. And what's
really interesting is as a function of Turbo Quant, this happens with no speed cost. So, there's no information that's having to shuffle between the
GPUs to make this happen. We're still looking at 14 tokens per second. So, this is kind of on the rusting, degrading, oxidizing edge of what I would
say is even usable performance for these models. This is a great option if you want to have it run
kind of as a slower agentic stack, something that you're not going to be directly interacting with in terms of watching feedback come back from the
model. But you're just going to have it churn through some emails or use the recently released Cloudflare system for answering emails or sending
emails. It's a great option for that. The other thing to keep in mind is that Pascal is also hitting hard
limits with driver support. But if a newer card is just out of the question, which I understand, because that could be a lot of money, buying two of these is a pretty good
option. And again, 11 gigs of VRAM at this price point basically does not exist anywhere else. And if you're willing to buy a GPU that maybe is a little bit
older or from an OEM that is a little bit less popular, you have a lot of great options here. So, we're even
seeing pricing down to $100 if you're willing to find a cooler. And for the main kind of host workstation here, these are also incredibly cheap. So,
for pretty much $450 including the workstation itself without RAM, you can pretty much have an agentic local AI kind of test bed that you can run. And
I would say this is probably the cheapest I would go. There are some people who would say you can go
buy some of those embedded AMD boards, but then again, that's an entirely different architecture and AMD tooling ironically is still quite bad. So, I
would still probably suggest buying an ancient HP Z840 and two GTX 1080 Tis before I'd recommend buying AMD hardware. And if you have some questions
about that or you disagree with me, please let me know in the comments below. So, I'm curious what
you guys think. Are you guys going to go out and buy some 1080 Tis to try this out? Or maybe you have a 1080 Ti for gaming and have now realized you can run the
latest Qwen model. Does this change kind of what you were thinking? Are you going to go buy a second one? Let me know in the comments below. So, as
always, I hope you learned something from this video. Please like, subscribe, and share. It helps me
out a lot with making future videos. And I'll see you in the next one.