SenseNova U1 First Test – A New Kind of Open Source Image Model! Transcript

So, we have the, um, the word down here. Today, we're going to be looking at a very interesting new model called SenseNova U1. Now, aside from being

Apache 2.0 open source, which is always good to see. This is a unified multimodal model, which as we'll get into later means that essentially it does

not translate between text and then to images, as a native multimodal model may do for image

generation where it is text to image. This is a different architecture and that is partially what makes it interesting. Another thing that has me very

excited to play with this is the fact that it seems to excel at infographics. Now, often times when we do the image model tests on this channel, they

do kind of delve into chaos later on in the video, but I do find that there's probably a good

balance between funny and also educational infographics. So, I'm pretty excited to get playing with that with this because there could be some fun to

be had. I suppose could be said. So before we get into it, please do feel free to subscribe. I am trying to hit that 100K plaque which we're closer to

than to zero. So that is always nice. Additionally to that, this model is available online and like

you can just play with it online as we see right here. But for today's video, because we do love our local AI, we are going to be testing this on both

the DGX Spark through ComfyUI where I do currently have it set up and waiting for us to play with it. Additionally to that, we're also going to be

testing it in a cloud GPU. I do have an A100 on standby rented just to see some of the speed on a

proper system and not something like the Spark which is a bit memory constrained. So with that, let's take a peek at this model and some of the

interesting architectural considerations and then we'll jump into some fun and light testing. Now to begin, I would actually like to start on the

GitHub repository for this model because it shows us, a bit more concisely, the information pertaining to the

different sizes and things like that for this model. So we can see right here of course as we'll look at a little more in the technical section they

do also talk about the core architecture of it, which is this NEO-unify thing, which is just a blog on Hugging Face which has some pertinent information

about that. However we can see if we scroll down here that there are a bunch of different variations

of this model. So we have this LoRA adapter which speeds things up massively and produces relatively decent generations for eight steps which is

pretty low in the scheme of like image generation and things like that. However, this is an interesting architecture because as we'll find right here

in this parameter breakdown, this is not just an 8 billion parameter model. It is actually 8 billion

parameters per like half. So, as we see right here, it makes a little more sense if we say it contains roughly 8 billion understanding parameters and

8 billion generation parameters. So, basically what that means is the VRAM utilization that we are going to be seeing for this model as shown right

here in bfloat16 is around 35 gigs. So the total parameter size as we see right here is like 17.5

billion parameters. So it's just interesting architecturally and definitely something that I want to bring up. We will also hopefully soon have the 3B

variation, which I would imagine will be kind of a 6B model in the same way the 8B is, non-scientifically, eight in and then eight out, because that will

open up a lot of ability to run this on more resource constrained devices as opposed to needing that

35 gigs or around that sort of size. So, in terms of the unique architecture for this model, I'm not normally one who just wants to like regurgitate

the page back to the person who's watching the video. I try to not do that, but in this case, I do believe that may be one of the more ideal ways to

talk about this architecture. But one of the things to note immediately right here is they say rather

than relying on adapters to translate between modalities, SenseNova U1 models think and act across vision and language natively. So that would be

kind of the native unified multimodal architecture that they talk about here. Additionally to that, there is also a specifically linked blog post for

more architectural information about this model just here on Hugging Face. However, for the purpose of

this video and to keep things kind of concise and more focused on actually testing the model and having some fun, we can see right here the key

pillars as they have chosen to outline these with a crane emoji. So, you know, I suppose that works. At the core of U1 is NEO-unify, a novel

architecture designed from the first principles for multimodal AI. It eliminates both visual encoder and

variational autoencoder, where pixel-word information are inherently and deeply correlated. Several important features are as follows. Model language

and visual information end to end as a unified compound. So basically that's just talking about the unique architecture of this where it's different

and unified so that it does not kind of translate between modalities like from text to image and

things like that. As they say up here, instead of translating between modalities, it thinks and acts across language and vision

natively. So finally before we just get into testing it at the bottom of this GitHub repository we can see there's a bunch of specific ways that we

can use this. So there's visual understanding, which is actually kind of interesting because the

example prompt right here is essentially sending it a menu and asking it what specifically it would recommend. So it can see the image and then

understand it and then respond, which is neat. Also, we have our normal text-to-image, where we can just describe an image and then it will create it. The

default resolution is 2048x2048, which is fairly large. So it'll be interesting to see how the speed for

something that size is, especially on the Spark. Then image editing. Now, this will be interesting. I don't really know what to expect here, but it's

interesting to have nonetheless. And then finally, this interleaved generation, which is probably where my interest lies the most because this is where we

can essentially start to have fun with infographics and things like that. So, it'll be interesting to

play with that. And really our next step now, at least for the beginning of this, we are going to of course play with this on the DGX Spark as well as

an A100 that I do have running. However, for now, I do just want to run a few things through the studio app right here where we have U1 fast. And as

it says right here, this is an accelerated version dedicated to infographic generation. So, I think

there's probably some like crossover here between speed and then quality. I do want to try some infographs. And now I have to think of some prompts,

which shouldn't be too difficult. This is just going to be an initial test. Keep in mind that there is like the thing garbage in garbage out. So this

would be probably considered a fairly lackluster prompt, but I think it might be kind of funny

because I've basically told it to generate an infographic that would show us having a computer that is powered by somebody doing bicep curls, which is

who knows, you know, ridiculous for now, but maybe in like 5 to 10 years there may be such a contraption to do this. Hm, okay. All right, let's

take a peek at this. So, and again this is I mean this does also have the capability to just do

general image generation and things like that which they do have some examples of right here but I find and my personal interest in this is definitely

more in the infographics because uh they can be funny and it's probably better suited to this. So, okay, we have mechanical to electrical conversion.

I will say the bicep here does look quite nice. 74% energy capture efficiency. Ah, interesting. It

almost went ahead and actually like tried to design the device as well beyond just like showcasing the like concept. So what I see right here there's

there's like a little electric motor. Okay. Or the sustainable energy matrix as it says. I would imagine like if you curl that and then you would spin

like the you would spin the middle handle and then it would power the GPU. Sustainable desktop

power. That is a laptop so I will knock it there in terms of its like attention to detail. Future integration, wearable micro grids, advanced personal

energy harvesting. That is a terrifying sentence. But environmental impact, reduced e-waste, autonomous computing, projected energy contribution for

2025 to 2030, renewable kinetic source. And that's our environmental impact overall. I do have to

say this. Oh, okay. So, I was looking and right as I was about to say, I don't see any spelling weirdness or anything like that. We unfortunately did

lose with sustain cron emission. Oh, yeah. Okay. But everything else there aside from that was actually not bad. And I like how it actually went ahead

and designed the device as well beyond just doing like a funny infograph. I have to balance like

can I post this on YouTube and this is funny. Let me try something like serious. All right. So, I've gone into the sample image gallery because I want

to see how long the prompts were that were being used to generate these, which were significantly longer than our bicep power device. But I've now

gone and just translated one of those and used it like in the same format, but of course into English,

and it is still going to be the comic book style. Now, this specific one is for two people arguing over whether they should use a unified memory Mac

for AI or a beefy Nvidia dedicated GPU. So, we'll see how this goes. And the whole prompt is quite lengthy here. So, I was actually surprised to see

it did all right with the like one sentence prompt that we previously saw. Especially when I noticed

in the gallery, they were all like multiple paragraphs like this. Oh, all right. All right. Desktop war. A Okay. Comic guide to choosing AI hardware.

The countdown to local AI begins with the clatter of keys. Okay. This is our Mac Mini fan. Unified memory is clean, quiet, and efficient. No way. A

dedicated GPU is the real compute beast. Feud. Okay. Very good. Boom. And we do have like the dot

comic book style there. Unified memory flexibility and dedicated GPU for collide on the same desktop. And then this one's more eco-conscious just

throughout like the progression of this specific side of the pane. And then we also have this individual CUDA Nvidia. Okay, I do like seeing these

specific graphics coming out of the desktop there. Then we have unified memory and dedicated GPU. I want

quiet efficiency for large local models. I want full CUDA power until the VRAM glows. inside. There is no single winner, only the right battlefield

for your AI dreams. Okay, so next up, we're going to be testing this locally on the Nvidia DGX Spark. I do have this all set up right here through

ComfyUI. Recently, like a couple of days ago, I think, the ComfyUI workflow was released

officially for this, which will make it simpler for a lot of folks to be able to play with and test given that this is a pretty well-known software.

So with that, we're at least going to just begin with the text to image with the eight-step LoRA as this will run pretty quickly on the DGX Spark

being that it is a little memory constrained in terms of bandwidth. And I do just have a really simple

sample image prompt right here. Nothing really funny at least yet. And we're going to run this right now. Now, I do believe this may take a little bit

longer because Okay, now I didn't know if the model was still loaded in memory right now or not. It seems to have already been loaded in memory as it

would not be going this quickly if it still needed to load the model. So, we're going to see this.

And keep in mind, this won't be like a full measure of maximum potential quality right here. This will just be kind of showcasing that even on the DGX

Spark, it can make a little infograph or not an infograph, but kind of like a visual much quicker. So, okay. So, we have the DJX Sparu1 lock spark.

Okay. So, we're going to notice that like some of the text and stuff is not necessarily um like words

that one would commonly come across. However, something I will note that I am actually quite pleased with is the actual drawing of the separate

letters is quite clean. So although this is, like, not a word that I've ever come across, the individual letters are actually

drawn nicely and we do have some interesting like 1 2 3 and then two is going to two, three is going to four

and then we have five and then eight is up there. So again, this was more just a measure of speed on the DGX Spark right here. And we can see this is

a pretty big resolution for this photo right here. I mean, if we were to really like zoom into this, you can see it. To generate

an image at this resolution on this system at a reasonable speed like that is always nice to see.
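
For a sense of scale on that resolution, here is a rough sketch of the pixel arithmetic. The 2048x2048 figure is the default resolution mentioned from the repository; the helper name is hypothetical, and the 1024x1024 baseline is only an assumed comparison point, not something taken from the model's documentation.

```python
from math import gcd


def describe_resolution(width: int, height: int) -> tuple[float, str]:
    """Return (megapixels, reduced aspect ratio) for an image size."""
    megapixels = width * height / 1_000_000
    divisor = gcd(width, height)
    aspect = f"{width // divisor}:{height // divisor}"
    return megapixels, aspect


# Default resolution mentioned in the repository walkthrough.
megapixels, aspect = describe_resolution(2048, 2048)
print(f"2048x2048: {megapixels:.2f} MP, aspect {aspect}")

# Assumed baseline for comparison only.
baseline, _ = describe_resolution(1024, 1024)
print(f"{megapixels / baseline:.0f}x the pixels of 1024x1024")
```

So each image at the default size carries about four times the pixels of a 1024x1024 image, which is part of why speed at this size is worth watching on a bandwidth-limited machine.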

And that is kind of what the eight-step LoRA really creates. But now with that, I think we can perhaps have a little fun with this. All

right, so I'm just going to change the subject here and we'll see what it does with this. And keep in mind it should run pretty quickly here. But I am

somewhat interested to see how it goes about actually coming up with the graphical depiction of this

idea, which I would imagine would be like an angry human with like rage lines emanating out of their head as they look at a phone and maybe a face

like and then perhaps a graphics card being powered. So yep. Okay. So that was actually I mean Okay. So we're going to notice the text is not

necessarily speed ditch. So we have the um the word down here. But if we focus more on the actual visuals right

here, this is very good adherence to the specific thing that I had outlined. Aside from perhaps that, this was more or less what I had

envisioned in my head where it's an angry person. Although, to be honest with you, this looks like the skin you could get on a graphics card

like in like 2007, 2008. I feel like a lot of cards had like these sorts of graphics just overlaid on

them. In fact, I even have some that like look like this. But okay. Well, now this I'm actually going to save because something I will want to do is

run this in a stronger one. So, okay. So, right here we can see that this is not running on the local Spark. This is using an A100 80 gig. Thanks to

the homies at Thunder Compute who have allowed me to have a large allotment of free compute. So, thank

you to Thunder Compute for allowing me to do things like this on your A100s, H100s, etc. But now what we're going to do is this is the 50-step one

running on the A100 right here. Now, this is likely going to take a little longer at least to begin because this will need to load the model in fresh,

but it will produce hypothetically a much higher quality generated result being that as opposed to

eight steps, this will do 50. All right, the model has been loaded in now. So, it is just going through the 50 steps. As we can see, the progress bar

here is moving significantly slower because this is doing, if my calculations are correct, around six-something times as many steps as it did with the

eight-step one that we ran on the Spark. But I will be very interested to see how it differs from

this. All right. And this is again just the text to image. So we're not doing any specific infographic things right now or the interleaved one, which is

actually really quite neat. And we'll focus specifically on that at least a little later because that's a rather unique thing to actually witness. And

again, this is on an A100 from Thunder Compute. Oh wow. Okay. So, it's interesting. I will say it

is possible that more steps didn't really help. And I think this actually speaks to more of the impressiveness at least of seeing how this runs on

eight steps as opposed to running on 50 where we don't necessarily see anything that's significantly better here. There may be some more like visual

fidelity to the actual graphics here. Yeah, okay, I could see that. But in terms of the actual text, it

changed nothing really at all. And this is the text to image path. So this is not the specific like um infographic one or things like that, which we

will test additionally, probably right now. And I will at least start out the eight-step ones from the Spark because it is always fun to run things

locally if we can. So next up, I'm going to be using the infographic eight-step workflow. Okay. Uh,

okay. Maybe I'll just keep that one open for now. And we see I have like a sample one in here just from when I was setting this up where it basically

just talks about like some things pertaining to the model. I guess I'll run this just to see what it looks like in terms of the ideal generation that

we should be seeing right here. And we can see this is going pretty quickly because it can use the

same model that's loaded in right now. So, it didn't need to like dump it and then reload it. Okay. And then we have an infographic here. Just on

first glance, everything here actually looks quite good. I don't see anything here that is actually off. I will look at it more fine grained, but in

terms of resolution, that was a 2720x1536, 16:9 photo. So again, the resolution here is fairly high. SenseNova U1, NEO-unify:

one native model for understanding and gen... traditional multimodal stack. Okay. Yes. Yes. Yeah. All the text here was spot on. So this perhaps

would have been better for the rage powered GPU. All right. So I've modified this to basically create the same concept that we had before where it's

creating the VC pitch titled Rage Compute Human Emotion Power GPU. So I'll be interested to see the

text as well as the actual graphics that it chooses to use here. I would imagine this will be more kind of like PowerPointish and less perhaps visual.

Very good. Rage to flops conversion engine. Genius. Raw human rage input. Unstructured emotion. Angry Reddit threads. Comments. Sentiment extractor.

Wasted outrage. Caps lock intensity meter. Downvote volatility. Toxity heat capture. Wasted outrage.

Seed round ready. Synthetic benchmarks 10 times matter. Angry Reddit comments for the pilot data set. And then we have our rage to flops conversion.

Okay. Some oddities here, but the there is a lot of text that was denoted with this prompt. And then it kind of gets a little oddity over here. Meme

derived boost clock. Okay, that is understandable. Now, next up, I want to showcase something that is

quite unique to this model where this is the interleaved storyboard demonstration. And essentially what we're going to be seeing right here is this is

going to almost reason through the actual process of generating these images. And we're going to see in this specific pane here when we do actually

get the output. Let me just close the workflow tab. We're actually going to see text generated as well

as images as well as additional text. So we're almost going to see the model natively produce text and images kind of interleaved, I guess could be said.

And it's interesting to see this process because it's using both modalities to actually come up with the final answer. Now inevitably of course the

specific prompt right here at least for the first time is just a simple technical demonstration of

what this is actually doing here. So this is just saying local DGX Spark. This is running on the Thunder Compute A100 right now, but I do want to just

showcase it at least with something a little serious first. And we're going to notice this will take a bit longer because it still

needs to actually like load in the model and then it will take a bit of time additionally to that

because it also has to generate the text and the images as we'll see right here in this specific pane. So, we can see right there that yes, we got our

output there, but that's not necessarily what I'm focused on right here. What I find interesting, and we're going to notice it also did produce two

images as was denoted in this prompt, but it's actually generating this text here as well. And we can

see that it's actually thinking through this. So, it's able to basically reason about what the image is going to contain, and it's simultaneously

generating the images and this text. So, it's almost generating the text kind of how you would just see like an autoregressive LLM or something like

that do this. But it's just I find this architecturally to be very interesting because this is vastly

different than like our normal text to image stuff that we may see. Okay, this layout emphasizes the technical workflow and local deployment

capabilities and it's just building that based off of the specific prompt right here. So this text is actually coming generated from this model and

how it's interpreting and understanding this prompt right here. Then we get one image result right here. Now I

see the verdict is perhaps a bit um difficult for one to parse but everything else prior to that is actually cool. But the whole point of this is not

necessarily even to see this. It's it's unique architecturally and I find that really at least kind of interesting. I suppose if you're still watching

the video at this point you probably will as well. And this is definitely cool to see. So we can

see more text right here. And then we have our second image produced down there. And then finally we have some text there. An interesting conclusion

fail fix is needed. So look at this. It actually was able to assess the image right here. Successfully captures the layout and the tech review

aesthetic. However, there are several visual redundancies and alignment issues that need correction.

Specifically, the bridge sentence optimized for local hardware and iterative generation is repeated twice creating unnecessary clutter. In the

comparison panel, these are duplicated with their respective boxes. So it's noticing this as well, which looks like a rendering error. Additionally,

the main title is missing the model or locally DGX Spark suffix mentioned in the content requirements. The

vertical spacing is also slightly cramped due to the redundant text. Conclusion fail fix is needed. So now if we go and we'll take a peek at what we

see right here. It didn't go ahead and fix it following that because we told it only to generate two images here, which it did. However, it's really

interesting to see that it's actually able to assess its own output here and then reason off of that,

which I just find like it's architecturally kind of cool. All right, next up, I want to try this, but with something that we're all likely familiar

with, well, I would assume most people would be. It is relating to Doc Brown's time machine from the movie Back to the Future. So, it has things

specific to the film like 1955, the DeLorean, the 88 mph threshold, the flux capacitor, and things like

that. So, we should get some form of infographic as well as some interleaved text generation here as well with a couple images pertaining to the time

machine and the movie BTTF. All right, we should be about to get our results. Very cool. So, we see that this has generated a large amount of text.

Okay, I do see almost like an inverted DeLorean there. The design plan aims to transform the iconic

DeLorean time machine into a high-end tech... Now, I'm not going to read this through verbosely. I just find it cool that it's generating these like

interleaved or whatever you'd like to call it. All right, let me just take a look at some of these. So, obviously the first thing right here is a bit

unique, one could say. We have our normal mall parking lot and then of course we have our 1955 Hill Valley

arrival. It reads and it's good at actually reading the specific text right here. Verdict a 10 to 10 for temporal reliability and 88 miles per hour

performance. Okay. And then we have the Dellran's time machine tech review. So the text here is not 100% proper, but it also does

understand that. So it's good at actually assessing these and then saying like oh okay this needs some

additional fixes and things of the sort here. But the whole cool part of this and why I want to demonstrate this is the architecture. It's very

interesting in the way that it's going about doing this where it's doing like text and then an image and then text and then an image and then it's

giving us our conclusion there which is fail fix is needed and it's good self-awareness on the part of the

model which I am happy about and then we can see those are just the same images that we received right here in this pane. Now I want to just do

something like odd. So again this is where usually all the image testing videos at some point kind of descend into chaos. This would very likely be

that point in the video. But I will say I think I kept it together longer than normal. So that's good. This

that's just not right. Okay. Um what other fruits do we have? Oh no. Let's do maybe not a human turning into a helicopter. And again I don't know like

you know do you think the helicopter will have banana-like aesthetics to it? I think it will. I think it'll be yellow. Oh, nope. It's not. That is.

So, the skulls are pretty consistent across generations here. This is quite something. And again,

this is a fairly large resolution right here. Let's try something that's perhaps not as disturbing. A beautiful and it was interesting the way that's

almost like transparent to the actual background. Not as entertaining though. Let's try Steve the PC repair man. Steve the PC repair man. Very good.

Okay. He's using a heat gun on some form of flat motherboard. I like the apron. All right, Steve.

Okay. I mean, it was cartoon style. I didn't really know if it would be; I kind of assumed it would be. Let's try some of these old school tags that

used to help image generations like back in the days of early image models that could run locally. Interesting. That was almost like an identical like

juxtaposition. It just changed it from cartoon to non-cartoon. Very interesting. And that is

inevitably a Mac. So for those of you that are familiar with the lore of Steve's PC repair, this is actually quite a fitting image. Very interesting.

And overall, that is probably going to wrap this video up. I don't have a traditional, so to speak, results overview because we've just been

generating these through ComfyUI as we go. Obviously, this model is just interesting from an architectural

perspective with specific emphasis on this interleaved generation capability where it's generating the text and images in parallel almost and then it gives

us the end result here as well. But seeing the ability of it to actually judge the end generated photo right here is very interesting. And its focus

also on the ability to do like infographics and things as we had seen in some of the other nodes that

we tested on the DGX Spark is pretty cool and it's just architecturally unique as well as Apache 2.0. So I figured I'd take a peek at this because I

had seen some other coverage on it and it was kind of interesting. So I wanted to also throw my hat into the ring so to speak. So, that is going to

conclude our first look and test of SenseNova U1. I am going to be interested to see if the 3Bs come

out at some point because those will definitely lower the barrier to entry to run this in a local system. Additionally to that, I noticed the

quality difference between the 50-step and the eight-step was not like insanely large, which was very interesting to see and it did run very

quickly even on the DGX Spark as we saw right here with that eight-step result. So that was also nice to

see. So that is overall going to conclude our first look and test of SenseNova U1. If you have any questions, please feel free to leave them in the comments. And thanks for watching.
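
As a back-of-the-envelope check on two numbers from the video: roughly 17.5 billion parameters stored in bfloat16 (2 bytes each) comes to about 35 GB for the weights alone, and 50 steps is 6.25 times as many as 8. The sketch below is only that arithmetic; the function names are hypothetical, and the estimate assumes weights only, ignoring activations and any other runtime overhead.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def weight_memory_gb(params_billions: float,
                     bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


def step_ratio(full_steps: int, fast_steps: int) -> float:
    """How many times more denoising steps the full run performs."""
    return full_steps / fast_steps


# Roughly 8B understanding + 8B generation parameters, about 17.5B total.
print(f"Weights in bf16: ~{weight_memory_gb(17.5):.0f} GB")

# 50-step run on the A100 versus the eight-step LoRA run on the Spark.
print(f"Step ratio: {step_ratio(50, 8):.2f}x")
```

The same function gives a rough idea of what a smaller variant would need once its total parameter count is known; the per-half parameter split of the 3B model is not confirmed in the video, so no figure is plugged in for it here.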
