How to run agentic 35B models with only 8gb of vram (nvidia 4060ti) Transcript

In 2026, it's okay to be cute, vulnerable, and only have 8 gigs of VRAM to run AI agents locally. Sure, I've told you multiple times that in most

cases, if you don't want to be GPU poor, you should really try to have more than 8 gigs of VRAM, or at least two GPUs when you have 8 gigs of VRAM to

have an okay experience running AI locally. And what's interesting and also pretty cool is this is no

longer the case with the recent advancements. Welcome to AI Flux. Let's get into it. Real quick, please like, subscribe, and share. It helps the

channel out a ton. So, today we had some interesting advancements in the continuing saga of putting Qwen 3.6 onto smaller and smaller GPUs that have

even less capability. We've seen Qwen 3.6 on a 1080 Ti, we've seen it on the PS5 GPU, and today we're

seeing it on a 4060 Ti with only 8 gigs of VRAM. And yes, previously, in the video on GPUs not to buy for local AI, I did say you

shouldn't buy the 4060 Ti. But what's interesting is GPU markets are softening a bit, and this is now a really affordable GPU that is plentiful all

over eBay. So, I have changed my opinion. Some people on Twitter have also shared some strong opinions about

this, so I'm going to talk about it. So, this comes from Above Spec, who you should definitely follow on X. And he says here, "The idea that you need

a 24 GB GPU for serious local LLMs in 2026 is a lie that we don't have to repeat to ourselves anymore." He just ran a 35 billion parameter version of

Qwen 3.6 on an RTX 4060 Ti 8 gig at 41 tokens per second with 16K context, and 24 tokens per second

at 200K context. Which is actually really doable. Like that's something you could use with Hermes agent quite well. And I'm curious to see if this is

Llama.cpp or vLLM. And you can see here, it's all proven up in his data. So, it shows the context depth, the prompt processing tokens per

second, and then how fast the token generation was. And of course, there are some caveats to this. Some of the

caveats that still affect the 3090 also apply to the 4060 Ti. So, how did Above Spec make this work? It turns out it's a little thing

called MoE offload. Of course, you can't fit all of this on the GPU. We're just putting what's important on the GPU, and then being very careful

and selective with what we shuffle between the system and the GPU. It also kind of means you need to have

a pretty decent box to plug this into, so you can't just go on to eBay and buy the cheapest HP barebones PC you can find, and assume this will all

work. You do have to do a few things right. So, he says here, "MoE offload is that Qwen 3.6 35B activates only 3 billion parameters per token. Keep

attention and shared weights on the GPU, and move cold expert FFNs to system RAM." And curiously, this is

using Llama.cpp. So again, I know some of you in the comments really hate that I don't like Llama.cpp as a first-party inference engine, but what's

really cool with Llama.cpp is it's really flexible and tunable, way more so than vLLM. So, if you're only using two GPUs, or they're old, or you want

to do some really wild things with what you're choosing to have on the GPU or your system, it's a

great option. And he has even more receipts here. And specifically, another big part here is he's using the Q8 KV cache at around 10 kilobytes per

token, which is how he's getting this pretty decent context window, which is required for using agents. Basically, you always want to have at least a

90K token context window when you're running AI agents. Don't ask me why. I've tried a bunch of

different ones. What's also cool is you can actually fit that entire context window within 2 gigs of VRAM with flash attention on. That will vary your

performance a bit by task, but it's crazy this is possible. And we have even more receipts here. So, this is again Qwen 3.6 35B A3B with Q4 quants,

with Q8 KV cache, in a single batch, with flash attention on. And this again shows that even with flash

attention on, we're still getting pretty decent performance, whether it's tokens per second on prompt ingest or on token generation. And it's

painful to say this, but 24 tokens per second is still very useful if you're using Hermes, or you have your own harness that is kind of going through

simple tasks. Dense models have been shown to be pretty good at this. And another curious point he makes

is that the RTX 3070 8 gig also is a pretty good option. It's obviously not as fast as the 4060. And what's funny is the 3070 actually has higher

memory bandwidth than the 4060 Ti at 448 GB per second compared to 288. So, the irony is Nvidia was just less cheap back in the day with

bandwidth, and ironically, you can actually get better performance from worse VRAM and for less money at the same

time. And the key here is you really need a half-decent CPU and a pretty decent 64 gigs of RAM in your system. But what's cool is Micro Center

actually has some great deals on AMD Ryzen CPUs that include RAM as a combo, so I'll include some links below for that. So, this is a really

interesting thing to do. You know, the 3060 is also potentially a great option for this, especially if you can

snag one of the 12 gig versions. There are other questions as to what version of KV cache you should use, but this is really, really cool. And this

isn't just something that's been happening with 3.6. There have also been kind of breakthroughs doing this with Qwen 3.5, which is a tad smaller. It's

still usable. Ironically, you can kind of pull this off with vLLM as well. Granted, Llama.cpp is

going to be the most flexible for doing something like this. And if it means anything, there are even people on Hacker News doing this. So, people who

are highly opinionated, some would say too opinionated, have also tried this and are saying, "You know, why bother spending all the money for 24 gigs

of VRAM, let alone 48 for a modded GPU?" You know, if we look back at the 4060, the specs here are

actually not bad. As I said, you know, in theory you can get better memory performance from a 3070. However, that's getting dangerously close to a

level of usability where, if anything changes, like if you don't just stick with Qwen 3.6 for the rest of your time with the machine, there might be

some questions as to whether or not that was a good call. And I would still say, if you can find it and

if you can afford it, please buy the 16 gig version. Like yes, you can do this on 8 gigs, and it's really cool. Maybe as like an edge situation, or

if you really just don't have the budget for it. But still, getting as much RAM as you can afford is going to give you the best experience. And if you

look at pricing, 4060 Ti 16 gig cards, which this person claims are super rare, are still

pretty expensive. I mean, they're really affordable relative to a lot of other GPUs currently. And the 4060 Ti 8 gig cards are really inexpensive.

We're talking under $300 at this point, and they're pretty plentiful. This is a generation of cards from Nvidia that was pretty high quality, too. So,

I wouldn't have any problems buying these, and no one really mined crypto on these, either. And just

for the sake of argument, I know this video is about cheap GPUs, but let's just look at the 3070 8 gig. And so, this is the funny thing. So,

ironically, yes, these are cheaper than the 4060 Ti. Yes, they in theory have more memory bandwidth. But the thing is this is a much slower GPU. So,

unless you find like a really good deal, I mean, under $200 kind of a deal for this, I would still recommend

getting the 4060 8 gig, or really, come on, guys, get the 4060 Ti 16 gig. So, I'm curious what you all think. I think this is a really interesting

advancement in the tooling that we have at our disposal to make these models fit and do agentic things locally. Again, I think it's much better for

agents rather than other things, like just flat-out generating prompts, or kind of doing a back and

forth in real time. So, I'm curious, are you going to buy this? Are you going to buy the 3070? Are you going to buy the 4060? Are you just going to

buy a GPU like the 3090 because you think it's a much better use of your money? Let me know in the comments below. Please like, subscribe, and share,

and I'll see you in the next one.
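
As a rough check on the KV cache figures quoted in the video (around 10 kilobytes per token with a Q8 KV cache, and a long context window fitting within about 2 gigs of VRAM), here is a minimal Python sketch of the arithmetic. It assumes that "10 kilobytes" means 10 × 1024 bytes and that the per-token cost stays constant across context lengths. The function names are illustrative only and are not taken from Above Spec's setup or from Llama.cpp.

```python
# Rough KV-cache sizing from the figures quoted in the video.
# Assumption: "around 10 kilobytes per token" is treated as 10 * 1024 bytes,
# and the cost per token is taken to be constant for every context length.

KV_BYTES_PER_TOKEN = 10 * 1024  # approximate Q8 KV cache cost per token


def kv_cache_bytes(context_tokens: int, bytes_per_token: int = KV_BYTES_PER_TOKEN) -> int:
    """Return the approximate KV-cache size in bytes for a given context length."""
    return context_tokens * bytes_per_token


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / (1024 ** 3)


if __name__ == "__main__":
    # Context lengths mentioned in the video: 16K, 90K, and 200K tokens.
    for ctx in (16_000, 90_000, 200_000):
        print(f"{ctx:>7} tokens -> {to_gib(kv_cache_bytes(ctx)):.2f} GiB")
```

Under those assumptions, 200K tokens works out to just under 2 GiB of KV cache, which lines up with the 2 gig figure mentioned in the video. The real number depends on the model's layer count and head layout, so treat this as an estimate rather than a measurement.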
