The world of local AI and AI agents wouldn't be what it is today without Mistral AI. And today we got a huge update with their latest model Mistral
Medium 3.5 and a ton of information that gives us a really interesting view into how Mistral sees the future of local agentic AI playing out. So, to
see if the Mistral AI office has completely run out of croissants, let's get into it. But really quick,
I'd like to kindly ask that you like, subscribe, and share. It helps out the channel a ton. This new model release from Mistral is kind of
interesting. They're calling it Mistral Medium 3.5. And this is kind of a middle-of-the-road model for them. But what they're claiming it can do is
very different from the past with Mistral. Some Mistral updates are huge core blocks of their model and kind of
their approach for making models in their architecture. And others are bits and pieces that give a hint of what they're developing. So, their VLM
format for vision was incredible. And when they were just first releasing models that were only intended to really be run locally with the famous
Mistral 7B model, they were really pioneering different combinations of experts in MoE and more broadly just
making MoE small enough to put on GPUs. Now, one thing that I think is worth noting is Mistral was one of the first to come up with the idea of really
small models that you could run concurrently in a day when that application seemed completely useless. But now it couldn't be more different with AI
agents. And with this latest model, that is their entire approach. So, let's see what they're
telling us so far on their website. So, coding agents are a really big part of this release. They say here that coding agents have mostly lived on
your laptop. Today we're moving them to the cloud where they run on their own in parallel and notify you when they're done. So, if you have a laptop
with a big enough GPU, I guess that's the case. But this is kind of an interesting push from Mistral.
You can start them on the Mistral vibe CLI or directly in latechat. Offloading a coding task without leaving the conversation. Powering Mistral 3.5 in
public preview, our new default model in Mistral vibe and latechat. Built for long stretches on coding and productivity work, the new work mode in
latechat extends this with a powerful agent for complex multi-step tasks like research, analysis, and
cross-tool actions. What's interesting is this model wasn't created just for local AI. This is really meant to be kind of a product push for Mistral
to compete more with Cursor and a lot of these other kind of enterprise agentic task runners. And what's interesting is Nvidia has really pushed their
unified memory architecture in hardware because it's really good at these kind of ongoing,
long-running tasks with multiple instances. So, this may be Mistral just trying to push some compute to try something new. And there are some
performance gotchas in this and the reasons why they picked the benchmarks they did. But let's look at the benchmarks. So, in terms of this agentic
benchmarking with competing models that have clearly been kind of carefully selected, they do say this is a
dense 128 billion parameter model with a 256K token context window that can do instruction following, reasoning, and coding with a single set of
weights. And it can also be self-hosted with as few as four GPUs. Now, those four GPUs are probably really expensive, but we'll get to that in a bit.
So, reasoning effort is now configurable per request, so you can kind of tell it how much you want it to
think before doing something. This has been an issue in the past with other reasoning benchmarks because a lot of times they'll underperform because
they'll spend most of the token budget reasoning and they won't actually finish putting out the result. But let's look at this blurry image of a
benchmark. This is crazy. Why are Why are these not interactive? So, basically Mistral Medium 3.5 128B is
not necessarily ahead, you know, they're comparing to Kimmy K2.5, GLM 5.1, which is a pretty old model, and Qwen 3.5. So, notably they do not include
Qwen 3.6 27B dense, which is interesting. And we'll get to that in just a bit. So, they show math instruction, which is basically about the same. They
show agentic benchmarks versus previous Mistral models. So, what we can say is this is without a
doubt the best Mistral agentic model. We'll see if it's efficient to actually run on GPUs. Here's another agentic benchmark versus Mistral coding
models. So, for now this is the best Mistral coding model based on these latest benchmarks. One thing that is really big about this is this model has
been tuned and created for Mistral's new agentic runtime. So, they're not using Pydantic, they're not
using any of these existing frameworks. Maybe you could use this with Hermes, but their approach is we're just going to do our own because we think we
can make our harness better. So, they say here that this is mostly meant for coding. And again, this is all meant to work through the vibe agentic
runtime where they're kind of intelligently placing a human in the loop behind all of these tools. And
then supposedly they think that's a much better thing to do. And it appears to be kind of a an agentic play on a new rendition or a new imagination of
Claude Code. For me, I like using Claude Code for certain things. Not the biggest fan of it. I'm still kind of a Cursor power user to an extent, so
I'm not sure if that makes me unk status or not. But if there are any zoomers in chat, please let me
know. One thing that they appear to be really bragging about is this cross-tool workflow. So, basically if a single task with a single bit of context
has to be completed across multiple tools. So, like you write an email, you get some feedback, and then you implement a change in code somewhere,
which is kind of interesting. Research and synthesis is kind of an emerging area in the agentic space.
You know, Auto Research was an interesting paper that lent itself relatively well to being useful. And now with these things like Pi, they're these
newer kind of lighter-weight harnesses. The Auto Research stuff is turning out to be a bit more useful now that web use has been kind of federated
better than it was a year ago. And yeah, so this is maybe meant to be like a part of a software team, but
it's still something you're supposed to use in latechat. Let me see if I can use this pretty easily. I'll accept the terms of service. And let's see
here. So, tools-wise we have a code interpreter, we can do images, uh we can do web search. So, let me ask if I'm using the raw version of Mistral
Medium 3.5, how many 3090s will I need to use? So, let's see if it understands I want to host this
locally. All right, so it understands I'm not using a quantized version. It understands how much RAM those have. It gives us a kind of a source. All
right, well there is no 2-bit quantization currently, so that's a little odd. All right, so we need 6 to 8. It's about right. And what's weird is it
didn't find the Unsloth article about this, which is my next segway. Unsloth, of course, like minutes
after this was released already has a quant that is quite good. Now, the funny thing is most of the comments here are that this model is just not as
good as Qwen. And we're going to get to that in just a bit. But Unsloth has a fantastic guide about this. This is a large model even in its dense form
even when it's quantized down to three and four-bit, which is kind of surprising if I'm being
honest. Of course, this will only get better. The one thing we have to remember is that Qwen has been really forward with pushing their architecture
and it's really kind of become popular. So, Qwen right now will just have the best quants and will be the best model to run on GPUs if you're doing
agentic or coding things. But anyway, also what's interesting is the Hugging Face page for the Mistral
Medium 3.5 GGUF quant is currently dead, so I surely that's going to be updated. Let's look at the comments here. First off, this is a huge model. So,
it's put well by Yusuf here that this model from Mistral, Mistral Medium 3.5, is even in its 3-bit and 4-bit form is four basically five times bigger
than Qwen 3.6 27B dense for the same score. So, there's clearly a reason that they didn't pick to
benchmark against that. And it's really disappointing. I mean, this is this is kind of a AMD level fail where you're claiming to compete on the world
stage at kind of the current state of the art and then keep referencing benchmarks from a year ago or in this case about 6 months ago, which is, you
know, not great. And of course, it's cool that you can run the 3-bit version of this locally. But a
lot of what matters when you're doing agentic stuff is accuracy and capability. So, actually when you're falling off on performance, even if you can
fit it in RAM with agentic stuff, of course you're not looking at raw performance as much. You're not looking at Can it solve the hardest problems?
You know, can it figure out the hardest architectural things? You know, we're still using APIs for that
for the most part. But for agentic tasks, you want it to be able to, you know, manipulate multiple tools, understand kind of a chain of reasoning, and
write emails that are coherent. So, when you're cold emailing hundreds of people to try to find a job, you actually get a job. That said, there's
another pretty solid write-up here that has to do with again kind of calling Mistral out and saying
why, you know, why are we claiming this is such a powerful model? And this is someone from an org called Zen Magnets that is directly comparing
Mistral Medium 128B dense and Qwen 27B dense, specifically Qwen 3.6. So again, it's citing that they have the same Swee verified score. It's actually
much worse at browser and agentic tasks compared to Quen 3.6 27B, which to an extent wasn't even
deliberately made to be agentic. So, it's another L in the Mistral corner of France. And this is even before, as he says here, which is a great point,
before Quen 3.6 122B comes out. So, that's going to be a smaller model that is technically using kind of an older architecture, so mixture of experts.
And if that outperforms this brand new Mistral model, there's a question of if Mistral is kind of
slipping in terms of their previous kind of reigning performance in this whole realm of local AI. Of course, there are a lot of people who will say,
"Oh, but they just don't benchmark max. It's hard. They're just being honest about benchmarking numbers." And the issue is that just so many people
have used this model so far, the Quen 3.6 model, that it's hard to only argue with benchmarks. Right
now, this is really disappointing. I hope this will be something that we can improve on quite a bit. But we're just going to have to see. And in
conclusion, I do want to go over kind of what hardware you would need to run this. So, unfortunately, currently you're probably still going to need
about three RTX 3090s to run this within reason. If you have larger GPUs, there's a chance for that. But
for right now, if you're deciding to make purchasing decisions around this, I think just using late chat is probably your best bet. And for now, if
you guys want to see more of this, I might do a stream later. So, I'm curious, are you guys going to go out and start running this model locally? Are
you already running Quen 3.6, in which case I assume you're not going to be running this. Let me know
in the comments below. So, as always, if you learned something, please like, subscribe, and share, and we'll see you in the next one.