Hmm, 30 billion parameters in a new free and open AI model where images, video, and audio all work. Hmm, why? There are a bunch of other
free systems around in this area like the amazing Gemma 4. So, what does this do better than those? Two words, throughput and cost efficiency. Okay,
what does that mean in practice? Now, hold on to your papers, fellow scholars, because it processes
almost 10 hours of video per hour. Whoo, that is nearly 10 times real time. That is insanely quick. Wow, almost three times faster than Qwen3-Omni.
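To make "nearly 10 times real time" concrete, here is a minimal arithmetic sketch. The helper names are hypothetical, and the figure of 10 is an upper bound, since the number quoted above is "almost 10 hours of video per hour".

```python
def real_time_factor(media_hours: float, wall_clock_hours: float) -> float:
    """Hours of media processed per hour of wall-clock time."""
    if wall_clock_hours <= 0:
        raise ValueError("wall_clock_hours must be positive")
    return media_hours / wall_clock_hours


def processing_minutes(media_minutes: float, rtf: float) -> float:
    """Wall-clock minutes needed to process a clip at a given real-time factor."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return media_minutes / rtf


# Assumption: we round "almost 10 hours of video per hour" up to exactly 10.
rtf = real_time_factor(media_hours=10.0, wall_clock_hours=1.0)
print(rtf)                            # 10.0
print(processing_minutes(60.0, rtf))  # 6.0, a one-hour video takes about six minutes
```

So at this rate, a one-hour recording would take roughly six minutes to process, or a little more, given the "almost".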
And when processing documents, it gets up to seven times faster. To run it locally, you'll want something like this or a beefy desktop GPU. We're
talking about 25 gigs of video memory, not something you run on your phone. And to run it in the cloud, I
use Lambda. Okay, so how did they do that? Where's the magic sauce? Well, it does five things really well and one thing not so well. Dear fellow
scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Well, one, Mamba layers scale linearly with context length instead of
quadratically. What does that mean? Well, it means you can throw everything you've got at it. The more documents you have,
the longer video or audio you have, the bigger the advantage this one has. So, if you're running something online that processes those on a mass
scale, this is going to be incredible. Two, when audio comes in, this side converts raw audio waves into tokens, but differently than elsewhere.
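Before going further into point two, the linear versus quadratic scaling from point one can be sketched with a toy cost model. This is only an illustration, not the model's actual cost: the function names are hypothetical, the costs are unitless operation counts, and the state size of 16 is an arbitrary assumption.

```python
def attention_cost(n_tokens: int) -> int:
    """Toy cost of full attention: every token interacts with every token."""
    return n_tokens * n_tokens


def linear_layer_cost(n_tokens: int, state_size: int = 16) -> int:
    """Toy cost of a linear-time layer: one fixed-size state update per token.

    state_size=16 is an assumed constant, chosen only for illustration.
    """
    return n_tokens * state_size


for n in (1_000, 10_000, 100_000):
    ratio = attention_cost(n) / linear_layer_cost(n)
    print(f"{n:>7} tokens: quadratic/linear cost ratio = {ratio:.1f}")
```

Doubling the input doubles the linear cost but quadruples the quadratic one, which is why the gap keeps widening as the documents, videos, or audio get longer.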
Normally, you have a speech recognition model here. Those are often huge and expensive and strip away all
emotion and tone from the input. But this one keeps all these data and still does the job well. So much cheaper than running a whole separate model
like Whisper on top. Three, when you give it an image or video, many previous generation techniques smash it into a different aspect ratio. This one
keeps it. Then, oh, look at this. Convolutions in 3D. Now we're talking. Many other techniques look at
the video frame by frame. It takes tons and tons of computation to finish these videos. Here, the 3D convolution looks at blocks of frames. It looks
at a package of frames at the same time, and thus it can compress it a great deal. Faster, cheaper. Four, now that's really interesting, somewhat
unexpected. You would expect a huge standalone CLIP model here. These essentially predict what text would
match the image well. You need that here, too. But, here's the trick. Not one standalone CLIP model. Nope, this one distills down three models. One
for matching images to text, one for fine details, and one for object segmentation. Now, all three of these are smashed down into one small encoder
neural network. Once again, super efficient. Five, efficient video sampling. This is a good one. At this
point, we have thrown, let's say, a video with 300 images into the neural network. That's still a lot of data, but it turns out not all frames are
completely unique. Many of them share the same background, for instance. And this one finally throws away this duplicate information. And it makes it,
you guessed it right, even cheaper and more efficient. Okay, scholarly question. So, what is the
license attached to it? What I would love to see is Apache 2.0, which is highly permissive, and I don't see it here. It has its own license. That's
usually not great news, but in this case, it's better than I thought. Derivative works and commercial use are fine. On the other hand, it needs a bit
of attribution and is a little stricter on patent grants. If Apache 2.0 were a 10 out of 10, this is a
seven out of 10, in my opinion. And we don't shy away from talking about limitations here. So, anything else? Oh, yes. If you're doing pure text
reasoning or pure coding, I would probably look elsewhere. It is not the number one smartest open model. No. But, if you need multimodal input, like
audio or video, processed super fast and super cheap, this is the one. So, we now have free and open AI
models that we can own and run ourselves, which is only going to get more and more important in the future. And since we have so many models,
they are starting to specialize. They are becoming good in different directions. So, better models and more value for us fellow scholars, for free.
Sign me up for that. Hugely appreciated. What a time to be alive. Here you see me running the full
DeepSeek AI model through Lambda GPU Cloud. 671 billion parameters, running super fast and super reliably. This is insane. I love it and I use it on a
regular basis. Lambda provides you with powerful Nvidia GPUs to run your own chatbots and experiments. Seriously, try it out now at lambda.ai/papers
or click the link in the description.