Today, we're going to be looking at a very interesting new model called SenseNova U1. Now, aside from being
Apache 2.0 open source, which is always good to see, this is a unified multimodal model, which as we'll get into later means that essentially it does
not translate between text and images the way other multimodal models may do for image
generation, where it is text to image. This is a different architecture and that is partially what makes it interesting. Another thing that has me very
excited to play with this is the fact that it seems to excel at infographics. Now, often times when we do the image model tests on this channel, they
do kind of devolve into chaos later on in the video, but I do find that there's probably a good
balance between funny and educational infographics. So, I'm pretty excited to play with this, because there could be some fun to
be had. So before we get into it, please do feel free to subscribe. I am trying to hit that 100K plaque, which we're closer to
than to zero. So that is always nice. In addition, this model is available online, and
you can just play with it online as we see right here. But for today's video, because we do love our local AI, we are going to be testing this on both
the DGX Spark through ComfyUI, where I do currently have it set up and waiting for us to play with it, and
on a cloud GPU. I do have an A100 on standby, rented just to see some of the speed on a
proper system and not something like the Spark, which is a bit memory constrained. So with that, let's take a peek at this model and some of the
interesting architectural considerations and then we'll jump into some fun and light testing. Now to begin, I would actually like to start on the
GitHub repository for this model because it shows us, a bit more concisely, the information pertaining to the
different sizes and things like that for this model. So we can see right here, of course, as we'll look at a little more in the technical section, they
do also talk about the core architecture of it, which is this NEO-Unify thing, which is just a blog on Hugging Face that has some pertinent information
about that. However, we can see if we scroll down here that there are a bunch of different variations
of this model. So we have this LoRA adapter, which speeds things up massively and produces relatively decent generations in eight steps, which is
pretty low in the scheme of image generation and things like that. However, this is an interesting architecture because, as we'll find right here
in this parameter breakdown, this is not just an 8 billion parameter model. It is actually 8 billion
parameters per half. So, as we see right here, it makes a little more sense if we say it contains roughly 8 billion understanding parameters and
8 billion generation parameters. So, basically what that means is the VRAM utilization that we are going to be seeing for this model, as shown right
here in bfloat16, is around 35 gigs. So the total parameter size, as we see right here, is like 17.5
billion parameters, which at two bytes per parameter is where that 35 gigs comes from. So it's just interesting architecturally and definitely something that I want to bring up. We will also hopefully soon have the 3B
variation, which I would imagine will be kind of a 6B model in the same way the 8B is eight in and then eight out, non-scientifically speaking, because that will
open up a lot of ability to run this on more resource-constrained devices as opposed to needing that
35 gigs or around that sort of size. So, in terms of the unique architecture for this model, I'm not normally one who just wants to like regurgitate
the page back to the person who's watching the video. I try to not do that, but in this case, I do believe that may be one of the more ideal ways to
talk about this architecture. But one of the things to note immediately right here is they say, "rather
than relying on adapters to translate between modalities, SenseNova U1 models think and act across vision and language natively." So that would be
kind of the native unified multimodal architecture that they talk about here. In addition to that, there is also a specifically linked blog post for
more architectural information about this model just here on Hugging Face. However, for the purpose of
this video, and to keep things kind of concise and more focused on actually testing the model and having some fun, we can see right here the key
pillars, as they have chosen to outline these with a crane emoji. So, you know, I suppose that works. "At the core of U1 is NEO-Unify, a novel
architecture designed from the first principles for multimodal AI. It eliminates both visual encoder and
variational autoencoder, where pixel-word information are inherently and deeply correlated. Several important features are as follows: model language
and visual information end to end as a unified compound." So basically that's just talking about the unique architecture of this, where it's different
and unified so that it does not kind of translate between modalities, like from text to image and
things like that. As they say up here, instead of translating between modalities, it thinks and acts across language and vision
natively. So finally, before we just get into testing it, at the bottom of this GitHub repository we can see there's a bunch of specific ways that we
can use this. So there's visual understanding, which is actually kind of interesting because the
example prompt right here is essentially sending it a menu and asking it what specifically it would recommend. So it can see the image and then
understand it and then respond, which is neat. Also, we have our normal text-to-image, where we can just describe an image and then it will create it. The
default resolution is 2048x2048, which is fairly large. So it'll be interesting to see how the speed for
something that size is, especially on the Spark. Image editing: now, this will be interesting. I don't really know what to expect here, but it's
interesting to have nonetheless. And then finally, this interleaved generation, which is probably where my interest lies the most because this is where we
can essentially start to have fun with infographics and things like that. So, it'll be interesting to
play with that. And really our next step now, at least for the beginning of this, we are going to of course play with this on the DGX Spark as well as
an A100 that I do have running. However, for now, I do just want to run a few things through the studio app right here where we have U1 fast. And as
it says right here, this is an accelerated version dedicated to infographic generation. So, I think
there's probably some trade-off here between speed and quality. I do want to try some infographics. And now I have to think of some prompts,
which shouldn't be too difficult. This is just going to be an initial test. Keep in mind that there is the whole garbage in, garbage out thing. So this
would probably be considered a fairly lackluster prompt, but I think it might be kind of funny
because I've basically told it to generate an infographic that would show us having a computer that is powered by somebody doing bicep curls, which is,
you know, ridiculous for now, but maybe in like 5 or 10 years there may be such a contraption to do this. Okay. All right, let's
take a peek at this. And again, this does also have the capability to just do
general image generation and things like that, which they do have some examples of right here, but my personal interest in this is definitely
more in the infographics because they can be funny and it's probably better suited to this. So, okay, we have "mechanical to electrical conversion."
I will say the bicep here does look quite nice. "74% energy capture efficiency." Ah, interesting. It
almost went ahead and actually tried to design the device as well, beyond just showcasing the concept. So what I see right here, there's
like a little electric motor. Okay. Or the "sustainable energy matrix," as it says. I would imagine you curl that and then you would spin
the middle handle and then it would power the GPU. "Sustainable desktop
power." That is a laptop, so I will knock it there in terms of its attention to detail. "Future integration, wearable micro grids, advanced personal
energy harvesting." That is a terrifying sentence. But "environmental impact, reduced e-waste, autonomous computing, projected energy contribution for
2025 to 2030, renewable kinetic source." And that's our environmental impact overall. I do have to
say this. Oh, okay. So, I was looking, and right as I was about to say I don't see any spelling weirdness or anything like that, we unfortunately did
lose it with "sustain cron emission." Oh, yeah. Okay. But everything else there aside from that was actually not bad. And I like how it actually went ahead
and designed the device as well, beyond just doing a funny infographic. I have to balance, like,
can I post this on YouTube, and this is funny. Let me try something serious. All right. So, I've gone into the sample image gallery because I want
to see how long the prompts were that were being used to generate these, which were significantly longer than our bicep power device. But I've now
gone and just translated one of those and used it like in the same format, but of course into English,
and it is still going to be the comic book style. Now, this specific one is for two people arguing over whether they should use a unified memory Mac
for AI or a beefy Nvidia dedicated GPU. So, we'll see how this goes. And the whole prompt is quite lengthy here. So, I was actually surprised to see
it did all right with the like one sentence prompt that we previously saw. Especially when I noticed
in the gallery, they were all multiple paragraphs like this. Oh, all right. All right. "Desktop War: A Comic Guide to Choosing AI Hardware." Okay.
"The countdown to local AI begins with the clatter of keys." Okay. This is our Mac Mini fan: "Unified memory is clean, quiet, and efficient." "No way. A
dedicated GPU is the real compute beast." "Feud." Okay. Very good. "Boom." And we do have the dot
comic book style there. "Unified memory flexibility and dedicated GPU for collide on the same desktop." And then this one's more eco-conscious just
throughout the progression of this specific side of the panel. And then we also have this individual: CUDA, Nvidia. Okay, I do like seeing these
specific graphics coming out of the desktop there. Then we have unified memory and dedicated GPU: "I want
quiet efficiency for large local models." "I want full CUDA power until the VRAM glows." Inside: "There is no single winner, only the right battlefield
for your AI dreams." Okay, so next up, we're going to be testing this locally on the Nvidia DGX Spark. I do have this all set up right here through
ComfyUI. Just recently, like a couple of days ago, I think, the ComfyUI workflow was released
officially for this, which will make it simpler for a lot of folks to be able to play with and test, given that this is pretty well-known software.
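As a quick sanity check before the local run, the memory figure quoted from the repository earlier follows from simple arithmetic. This is a rough sketch, assuming 2 bytes per bfloat16 weight and the roughly 17.5 billion total parameters read off the parameter breakdown. The function name is made up for illustration, and activations or framework overhead are not counted. The 6B total for a future 3B variant is only the guess made earlier in the video, not a published number.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Roughly 17.5 billion parameters in total, per the breakdown read out earlier.
full_model_gb = weight_memory_gb(17.5e9)

# Speculative: a 3B variant guessed to total about 6 billion parameters.
small_model_gb = weight_memory_gb(6e9)

print(f"8B-per-half model: ~{full_model_gb:.0f} GB of weights in bfloat16")
print(f"guessed 3B variant: ~{small_model_gb:.0f} GB of weights in bfloat16")
```

The first figure lines up with the "around 35 gigs" quoted from the repository, which is why the smaller variant would matter for more constrained hardware.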
So with that, we're at least going to just begin with the text-to-image with the eight-step LoRA, as this will run pretty quickly on the DGX Spark,
being that it is a little constrained in terms of memory bandwidth. And I do just have a really simple
sample image prompt right here. Nothing really funny, at least yet. And we're going to run this right now. Now, I do believe this may take a little bit
longer because... Okay, now, I didn't know if the model was still loaded in memory right now or not. It seems to have already been loaded in memory, as it
would not be going this quickly if it still needed to load the model. So, we're going to see this.
And keep in mind, this won't be a full measure of maximum potential quality right here. This will just be kind of showcasing that even on the DGX
Spark, it can make a little infographic, or not an infographic, but kind of a visual, pretty quickly. So, okay. So, we have the "DJX Sparu1 lock spark."
Okay. So, we're going to notice that some of the text and stuff is not necessarily words
that one would commonly come across. However, something I will note that I am actually quite pleased with is that the actual drawing of the separate
letters is quite clean. So although this is not a word that I've ever come across, the individual letters are
drawn nicely, and we do have some interesting numbering, like 1, 2, 3, and then two is going to two, three is going to four,
and then we have five, and then eight is up there. So again, this was more just a measure of speed on the DGX Spark right here. And we can see this is
a pretty big resolution for this image right here. I mean, if we were to really zoom into this, you can see that to generate
an image at this resolution on this system at a reasonable speed like that is always nice to see.
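To put "pretty big resolution" in numbers, here is a small sketch comparing pixel counts. It assumes the 2048x2048 default mentioned earlier from the repository. The 1024x1024 baseline is only an illustrative comparison point chosen for this example and is not something the repository states.

```python
def megapixels(width: int, height: int) -> float:
    """Total pixel count of an image, in millions of pixels."""
    return width * height / 1_000_000

default_mp = megapixels(2048, 2048)   # default resolution quoted from the repository
baseline_mp = megapixels(1024, 1024)  # illustrative baseline, an assumption for comparison

print(f"2048x2048: {default_mp:.2f} MP")
print(f"1024x1024: {baseline_mp:.2f} MP")
print(f"ratio: {default_mp / baseline_mp:.0f}x the pixels")
```

Doubling both sides quadruples the pixel count, which is part of why speed at this size is worth noting on a bandwidth-limited system.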
And that is kind of what the eight-step LoRA really gives us. But now with that, I think we can perhaps have a little fun with this. All
right, so I'm just going to change the subject here and we'll see what it does with this. And keep in mind it should run pretty quickly here. But I am
somewhat interested to see how it goes about actually coming up with the graphical depiction of this
idea, which I would imagine would be like an angry human with rage lines emanating out of their head as they look at a phone, and maybe a face
like... and then perhaps a graphics card being powered. So yep. Okay. So that was actually, I mean... Okay. So we're going to notice the text is not
necessarily... "speed ditch." So we have the word down here. But if we focus more on the actual visuals right
here, this is very good adherence to the specific thing that I had outlined. Aside from perhaps that, this was more or less what I had
envisioned in my head, where it's an angry person. Although, to be honest with you, this looks like the skin you could get on a graphics card
in like 2007, 2008. I feel like a lot of cards had these sorts of graphics just overlaid on
them. In fact, I even have some that look like this. But okay. Well, now this I'm actually going to save because something I will want to do is
run this in a stronger one. So, okay. So, right here we can see that this is not running on the local Spark. This is using an A100 80 gig, thanks to
the homies at Thunder Compute, who have allowed me to have a large allotment of free compute. So, thank
you to Thunder Compute for allowing me to do things like this on your A100s, H100s, etc. But now what we're going to do is this is the 50-step one
running on the A100 right here. Now, this is likely going to take a little longer, at least to begin, because this will need to load the model in fresh,
but it will hypothetically produce a much higher quality result, being that as opposed to
eight steps, this will do 50. All right, the model has been loaded in now. So, it is just going through the 50 steps. As we can see, the progress bar
here is moving significantly slower because this is doing, if my calculations are correct, around six-something times as many steps (50 divided by 8 is 6.25) as it did with the
eight-step one that we ran on the Spark. But I will be very interested to see how it differs from
this. All right. And this is again just the text-to-image. So we're not doing any specific infographic things right now, or the interleaved one, which is
actually really quite neat. And we'll focus specifically on that a little later because that's a rather unique thing to actually witness. And
again, this is on an A100 from Thunder Compute. Oh wow. Okay. So, it's interesting. I will say it
is possible that more steps didn't really help. And I think this actually speaks more to the impressiveness of seeing how this runs on
eight steps as opposed to running on 50, where we don't necessarily see anything that's significantly better here. There may be some more visual
fidelity to the actual graphics here. Yeah, okay, I could see that. But in terms of the actual text, it
changed nothing really at all. And this is the text-to-image path. So this is not the specific infographic one or things like that, which we
will test additionally, probably right now. And I will at least start out with the eight-step ones on the Spark because it is always fun to run things
locally if we can. So next up, I'm going to be using the infographic eight-step workflow. Okay. Uh,
okay. Maybe I'll just keep that one open for now. And we see I have a sample one in here just from when I was setting this up, where it basically
just talks about like some things pertaining to the model. I guess I'll run this just to see what it looks like in terms of the ideal generation that
we should be seeing right here. And we can see this is going pretty quickly because it can use the
same model that's loaded in right now. So, it didn't need to like dump it and then reload it. Okay. And then we have an infographic here. Just on
first glance, everything here actually looks quite good. I don't see anything here that is actually off. I will look at it in more detail, but in
terms of size, that was a 2720x1536, roughly 16:9 image. So again, the resolution here is fairly high. "SenseNova U1, NEO-Unify:
one native model for understanding and generation." "Traditional multimodal stack." Okay. Yes. Yes. Yeah. All the text here was spot on. So this perhaps
would have been better for the rage-powered GPU. All right. So I've modified this to basically create the same concept that we had before, where it's
creating the VC pitch titled "Rage Compute: Human Emotion Powered GPU." So I'll be interested to see the
text as well as the actual graphics that it chooses to use here. I would imagine this will be more kind of PowerPoint-ish and perhaps less visual.
Very good. "Rage to FLOPS conversion engine." Genius. "Raw human rage input. Unstructured emotion. Angry Reddit threads. Comments. Sentiment extractor.
Wasted outrage. Caps lock intensity meter. Downvote volatility. Toxity heat capture. Wasted outrage.
Seed round ready. Synthetic benchmarks 10 times matter. Angry Reddit comments for the pilot data set." And then we have our rage to FLOPS conversion.
Okay. Some oddities here, but there is a lot of text that was specified in this prompt. And then it kind of gets a little odd over here. "Meme
derived boost clock." Okay, that is understandable. Now, next up, I want to showcase something that is
quite unique to this model, which is the interleaved storyboard demonstration. And essentially what we're going to be seeing right here is this is
going to almost reason through the actual process of generating these images. And we're going to see in this specific pane here, when we do actually
get the output... Let me just close the workflow tab. We're actually going to see text generated as well
as images as well as additional text. So we're almost going to see the model natively produce text and images kind of interleaved, I guess it could be said.
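One way to picture that kind of output is as an ordered list of text and image segments. The sketch below is purely illustrative: the `Segment` type, its fields, and the file names are hypothetical and are not the model's or ComfyUI's actual API. It only shows how an interleaved result could be represented and walked through in generation order.

```python
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class Segment:
    """One piece of an interleaved result (hypothetical structure)."""
    kind: Literal["text", "image"]
    content: str  # the text itself, or a path to a saved image

def describe(segments: List[Segment]) -> List[str]:
    """Return one summary line per segment, preserving generation order."""
    lines = []
    for index, seg in enumerate(segments, start=1):
        if seg.kind == "text":
            lines.append(f"{index}. text ({len(seg.content.split())} words)")
        else:
            lines.append(f"{index}. image -> {seg.content}")
    return lines

# Made-up example: text, then an image, then more text, then a second image.
example = [
    Segment("text", "Plan the layout for the first panel."),
    Segment("image", "panel_1.png"),
    Segment("text", "Review the first panel and plan the second."),
    Segment("image", "panel_2.png"),
]

for line in describe(example):
    print(line)
```

Keeping the segments in a single ordered list, rather than separate text and image lists, preserves the alternation that makes this output different from a plain text-to-image result.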
And it's interesting to see this process because it's using both modalities to actually come up with the final answer. Now, of course, the
specific prompt right here, at least for the first time, is just a simple technical demonstration of
what this is actually doing here. So the prompt is just saying local DGX Spark, though this is running on the Thunder Compute A100 right now, but I do want to just
showcase it at least with something a little serious first. And we're going to notice this will take a bit longer because it still
needs to actually load in the model, and then it will take a bit of time in addition to that
because it also has to generate the text and the images, as we'll see right here in this specific pane. So, we can see right there that yes, we got our
output there, but that's not necessarily what I'm focused on right here. What I find interesting, and we're going to notice it also did produce two
images, as was specified in this prompt, is that it's actually generating this text here as well. And we can
see that it's actually thinking through this. So, it's able to basically reason about what the image is going to contain, and it's simultaneously
generating the images and this text. So, it's almost generating the text kind of how you would see an autoregressive LLM or something like
that do this. I just find this architecturally to be very interesting because this is vastly
different from our normal text-to-image stuff that we may see. Okay, "this layout emphasizes the technical workflow and local deployment
capabilities," and it's just building that based off of the specific prompt right here. So this text is actually being generated by this model from
how it's interpreting and understanding this prompt right here. Then we get one image result right here. Now I
see the verdict is perhaps a bit difficult for one to parse, but everything else prior to that is actually cool. But the whole point of this is not
necessarily even to see this. It's unique architecturally and I find that really at least kind of interesting. I suppose if you're still watching
the video at this point, you probably will as well. And this is definitely cool to see. So we can
see more text right here. And then we have our second image produced down there. And then finally we have some text there. An interesting conclusion:
"fail, fix is needed." So look at this. It actually was able to assess the image right here. "Successfully captures the layout and the tech review
aesthetic. However, there are several visual redundancies and alignment issues that need correction.
Specifically, the bridge sentence 'optimized for local hardware and iterative generation' is repeated twice, creating unnecessary clutter. In the
comparison panel, these are duplicated with their respective boxes," so it's noticing this as well, "which looks like a rendering error. Additionally,
the main title is missing the 'model or locally DGX Spark' suffix mentioned in the content requirements. The
vertical spacing is also slightly cramped due to the redundant text. Conclusion: fail, fix is needed." So now if we go and take a peek at what we
see right here, it didn't go ahead and fix it following that because we told it only to generate two images here, which it did. However, it's really
interesting to see that it's actually able to assess its own output here and then reason off of that,
which I just find architecturally kind of cool. All right, next up, I want to try this, but with something that we're all likely familiar
with, well, I would assume most people would be. It relates to Doc Brown's time machine from the movie Back to the Future. So, it has things
specific to the film like 1955, the DeLorean, the 88 mph threshold, the flux capacitor, and things like
that. So, we should get some form of infographic as well as some interleaved text generation here, with a couple of images pertaining to the time
machine and the movie BTTF. All right, we should be about to get our results. Very cool. So, we see that this has generated a large amount of text.
Okay, I do see almost like an inverted DeLorean there. "The design plan aims to transform the iconic
DeLorean time machine into a high-end tech..." Now, I'm not going to read this through verbatim. I just find it cool that it's generating these
interleaved, or whatever you'd like to call it. All right, let me just take a look at some of these. So, obviously the first thing right here is a bit
unique, one could say. We have our normal mall parking lot and then of course we have our 1955 Hill Valley
arrival. It reads, and it's good at actually reading the specific text right here: "Verdict: a 10 to 10 for temporal reliability and 88 miles per hour
performance." Okay. And then we have the "DeLorean's time machine tech review." So the text here is not 100% proper, but it also does
understand that. So it's good at actually assessing these and then saying like, oh okay, this needs some
additional fixes and things of the sort. But the whole cool part of this, and why I want to demonstrate this, is the architecture. It's very
interesting in the way that it's going about doing this, where it's doing text and then an image and then text and then an image, and then it's
giving us our conclusion there, which is "fail, fix is needed," and it's good self-awareness on the part of the
model, which I am happy about. And then we can see those are just the same images that we received right here in this pane. Now I want to just do
something like odd. So again this is where usually all the image testing videos at some point kind of descend into chaos. This would very likely be
that point in the video. But I will say I think I kept it together longer than normal. So that's good. This
that's just not right. Okay. Um what other fruits do we have? Oh no. Let's do maybe not a human turning into a helicopter. And again I don't know like
you know do you think the helicopter will have banana-like aesthetics to it? I think it will. I think it'll be yellow. Oh, nope. It's not. That is.
So, the skulls are pretty consistent across generations here. This is quite something. And again,
this is a fairly large resolution right here. Let's try something that's perhaps not as disturbing. A beautiful and it was interesting the way that's
almost like transparent to the actual background. Not as entertaining though. Let's try Steve the PC repair man. Steve the PC repair man. Very good.
Okay. He's using a heat gun on some form of flat motherboard. I like the apron. All right, Steve.
Okay. I mean, it was cartoon style. I didn't really know if it would be; I kind of assumed it would be. Let's try some of these old-school tags that
used to help image generations back in the days of early image models that could run locally. Interesting. That was almost like an identical
juxtaposition. It just changed it from cartoon to non-cartoon. Very interesting. And that is
inevitably a Mac. So for those of you that are familiar with the lore of Steve's PC repair, this is actually quite a fitting image. Very interesting.
And overall, that is probably going to wrap this video up. I don't have a traditional, so to speak, results overview because we've just been
generating these through ComfyUI as we go. Obviously, this model is just interesting from an architectural
perspective, with specific emphasis on this interleaved generation capability where it's generating the text and images in parallel almost, and then it gives
us the end result here as well. But seeing its ability to actually judge the final generated image right here is very interesting. And its focus
also on the ability to do infographics and things, as we had seen in some of the other nodes that
we tested on the DGX Spark, is pretty cool, and it's just architecturally unique as well as Apache 2.0. So I figured I'd take a peek at this because I
had seen some other coverage on it and it was kind of interesting. So I wanted to also throw my hat into the ring, so to speak. So, that is going to
conclude our first look and test of SenseNova U1. I am going to be interested to see if the 3B models come
out at some point because those will definitely lower the barrier to entry to run this on a local system. Additionally, I noticed
the quality difference between the 50-step and the eight-step was not insanely large, which was very interesting to see, and it did run very
quickly even on the DGX Spark, as we saw right here with that eight-step result. So that was also nice to
see. So that is overall going to conclude our first look and test of SenseNova U1. If you have any questions, please feel free to leave them in the comments. And thanks for watching.