Local LLM fine-tuning on the NVIDIA DGX Spark - Part 1 文字稿

testing. Let's see if this works.

I just need to make sure that this live stream works. >> Testing. Let's this works. Okay, beautiful. Let me know if you can hear me in the chat. It's

been a while since I've done a live stream, so going to be a bit rusty. >> This live stream works. >> Okay, I think I can hear the audio on my phone.

That should be okay. So, ladies and gentlemen, welcome to some local LLM fine tuning. We want to we

want to get an LLM. We want to fine tune an LM for a specific task. All right. So, of course, you can use all the big dogs, APIs, and that sort of

stuff, but maybe you want a little one. So, let's do it in here. We're going to do supervised fine-tuning. And let me make sure. There we go. Okay.

Beautiful. So, we could use Google Collab, which I actually already have done in the past, but we're

going to use uh I can't show you, but I have one of these. Nvidia DGX Spark. Nvidia sent me one of these. So, we're going to play around with it, see

if it's any good. They actually didn't tell me what to say, so I'm just going to tell you if it's good or not. Um, it's pretty expensive. It's 3,999

USD. Um, but they kind of were so nice to send me one for free, so we're going to see if it's any

good. So, Google Collab, I've tested that this workflow works in Google Collab. Now, we're going to test it if it works local. So, I've set up the

stream. Okay, I'm going to get better at these live streams, too. So, be sure to always put stuff in the chat of anything that you want to see. I want

to do more live streaming in 2026, but don't tell anyone your goals. Just do them. So, get some LLM

fine-tuning code going. So, we're going to do SFT equals supervised finetuning, but we'll get to what that is later on. All right. Um, today's goals, fine-tune a SLM, small language model. Okay.

Pop out chat. Okay. Yo, testing the chat. Where's everyone tuning in from? I'm from Brisbane, Australia. Cool. I can get the chat over here. I'm a

real streamer now. Okay. So, this is See down here. I don't know if you can see that, but it's got SSH there. It's running on the Nvidia DJX Spark.

So, if we do this, this is a segmentation model, EOM MT, which is a just a transformer encoder that is a

segmentation model. So, it's actually quite fast at segmentation, which is really cool. So, then here we go. We do it on another image, but this is

not what we're doing. I'm just showing you that this code is running on there's a photo of me and my wife. Wonderful. Okay. So, segmentation model

running on here. This is on the Nvidia DGX Spark. Let's have a look. Iran. We got Iran. Lovely. We got

New York. Is it a big latency? What do you mean by that, Andra? Is it Is the stream lagging? Is it? Let me know if the stream's lagging. I haven't

done one of these in a while. So, we're just going to This is a working out the kinks stream until I get my setup fully done. But, let's start a new

notebook and we'll do it in why not just encode, right? Um, let's go fine-tune llm.ip.

Okay. We could of course do it in a script, but I'm going to do it in So, we need a couple of libraries. Let's see if we've got transformers. Maybe I

zoom in a little bit more. We don't want AI for now. Import TRL. Do we have TRL? We don't have TRL. So, TRL equals transformers reinforcement

learning. Okay, so this is the library we're going to be using for our finetuning. So transformers is going to give us the model. TRL is going to help

us do the finetuning. So let's go um TRL GitHub.

There we go. I'm not signed into GitHub. Maybe I need to um One second. Just going to sign into GitHub. I need to get used to doing credentials and

stuff like that without my leaking my passwords to the public. That would be not fun. Okay, signed into GitHub. We can go back. Okay, there we go. So,

TRL is transformers reinforcement learning. Okay, and so why would you want to fine-tune your own

LLM? because well for one maybe you don't want an LLM maybe you want a small LLM which is what we're going to work on we're going to use Gemma 3 270

mil so let's go into here and put that there so markdown I'm going to go fine-tune a small language model why fine-tune because we want small model

for a specific task eg don't want to pay API credits or leak our data online. Right? If we have our own

model, we can run it anywhere and everywhere we like. Okay. So let's go here and we're going to go Gemma 3 270 mil instruction tuned. So this is the

model we're going to do finetune which is Google's Gemma 3 but it's a 270 mil model. So if we go here, why won't why would we want this? More zoom

would be better. What about this? Is this a good level of zoom? This this amount? let me know. Uh, I am

still training jiu-jitsu. I trained jiu-jitsu last night. Oh, we got greetings from Russia, Boston, Canada. What's going on? Nice to see everyone

here. Uh, I trained jiu-jitsu last night. So, um, how to fine-tune an LM model. We are going to do supervised fine-tuning which is also known as SFT

which is basically SFT equals give samples of inputs and outputs. So for example, input equals hello, my

name is Daniel. If our goal was to extract names, then the output is Daniel. Does that make sense? input output. Okay, beautiful. But we need TRL. So, let's install that.

Honda activate AI. And then we're going to go UV pip install TRL. I've been using UV. Anyone used UV before? It's fun. It's fast. Okay. Do we have

Let's restart this notebook. Boom. We have TRL. That's wonderful. Okay. Now, a few things we need cuz this is a machine learning cooking show. Maybe

that's what I could stream. Machine learning cooking show. Ingredients. We're going to go one, we need a model, and we're going to use Gemma 3 270

mil. Uh two, we need a data set and we're going to use um a pre-baked data set to extract food from text.

What we're cooking, we're going to make

using PIP. What's up, brother? My brother just got here. We're in an office in Brisbane and we work together, but I'm going to rudely be streaming

while he's going to be trying to work. So what we're cooking is we're going to build uh small LM small language model to extract food and drink items

from text. Why? If we needed to extract uh needed to go over a large data set of um image captions and filter them for food items. We could then use

these filtered captions for a food app.

In which case should we fine-tune an LLM instead of RAG? So let's write this here. This is a great question from WLWA. Fine-tuning LLM verse rag. So

fine-tuning uh you would want that you would want fine-tuning when you have a very specific task like a well- definfined task of you could treat an

LLM as a language processing model. So you might fine-tune something to do a very specific task eg

structured data extraction. Okay. So, say for example, you had uh a set of emails or something like that and you wanted to or receipts or something

like that, you wanted to extract it to JSON, you would fine-tune an LLM for that. But for rag, uh you want to inject custom knowledge into an LLM. So,

let's write an example for this. So fine-tuning equals um an example would be you're an insurance

company who gets 10,000 emails a day and you want to extract structured data directly from these emails to JSON. And an example for RAG would be

you're an insurance company wanting to send automatic responses to people, but you want the responses to include information from your own docs.

Does that make sense? So fine-tune, you have a very specific task that's defined and you just want to repeat that task 10,000 times a day. Or with

rag, if you are an insurance company and they send someone send you an email like, "Hey, my um my fence broke, right?" You might send them an email

back of um information with links to your documentation, right? That's a very simple use case but that's

whereas with the fine-tuning model you might extract that email and go okay this person their problem is the fence is broken uh their address is this

cuz they put in the form etc. So these are simple use cases but that's a kind of an idea of which to use which one and when. Now we've got our

ingredients model data set three we need some training code uh four eval code and then five demo. So our

method is going to be one download model two um download data set three inspect data set and then we're going to go four train model on data set five

eval model six launch create an interactive demo. Okay. Seven, the bonus can be make the demo public so people other people can use it. Okay, welcome

to cooking with machine learning. So like all good cooking shows, I've got something that I've

prepared earlier here. Now this is I'm making a course on this. So just stay tuned for that. Oh, we need data sets as well. So download model, we use

transformers for that data set. We use data sets hugging face. Uh all these videos all these videos will be available after the stream. I'm going to I

really want to improve my YouTube stuff this year. So, um I'll put it on just uh I think you can

just watch live stream replays on YouTube. So, it'll be there available. Uh but then I want to take the best parts of the stream and turn that into an

actual video. Not sure how to do that yet, but we're going to figure it out, aren't we? So, train model and data sets. We're going to use hugging face

TRL eval model. Um, basically just look at a bunch of samples and create an interactive demo.

We're going to use hugging faces gradio. And then bonus, we can do this on hugging face spaces. Okay. So, and I want all these tutorials to be end to

end. Like, there's got to be something working at the end of it. You can only hear through one headphone. I'm so sorry about that.

H I think that may be because I've had that in the past. I need to fix that. I'm going to fix that for the next stream. I'm sorry. But that's a good note. I'm going to write that down.

I didn't test with headphones. Next stream. Fix headphones. Um mic so the speakers. Fix headphone. Fix mic so the sound comes through both headphones.

Sorry, it doesn't work on this one. Next time I'll do my test on headphones as well as a speaker. Okay, so we need data sets as well.

Beautiful. Okay, so we've got we're starting to get our ingredients. Let's get our model over here. Oh, we could get our data first. No, I'll get the model. That's going to be more fun.

So, our model name. And by the way, this Google Collab over here, this is I've already worked on this, so it's kind of cheating. We're not exactly

doing it live, but oh well. Google, we want Gemma 3 270 mil it. And I need to turn this into a code cell. Okay. Now, where does this model come from?

Of course, it comes from our friends at Hugging Face. So, and Google, thank you so much. Model link

here. Now, it doesn't really matter what model you use. You could I'm just using Gemma 270 mil because this is quite a small model, 536 megabytes.

There are other models out there like um baguetteron model hugging face which apparently works better. I just haven't tried it. It's a 321 million

parameter model. Um if we go have a look at this image there we go. So this is ML MMLU which is like a LLM

benchmark gem 3 270 mil's down here but we don't exactly need a extremely knowledgeable LLM to do what we're doing. We just need one that's good at

text processing basically. Um and then Quen 3 600 mil parameters over here. And then they've released this one bagatron uh which is 321 mil

parameters. We could even try Monad at a future point too. do it 6x smaller. So that may be an experiment for a

future stream, but for now we're going to stick with Gemma 3 270 mil because I know that works. Okay. So let's get torch and we'll go um device cuda

if torch cuda is available else cpu and then go print using device CPU what we don't have cuda available. That's blasphemy. Python import torch cuda

dot is false. What has happened here? We might have to reinstall PyTorch here.

The last time I used this spark cuda was available. So

that's fun. So let's check out torch version. So, there's a problem already, but that's okay. Plus CPU. Why do I have Let's go to PyTorch's website. PieTorch.

Linux.

Now we'll install PyTorch. Did it work? Maybe I already had it.

No. Yo yo everyone joining the stream. Pip uninstall torch vision.

So, I really thought that I had PyTorch installed with CUDA

cuz we want it to be on the GPU, don't we? Okay, we've got

for some reason PyTorch wasn't available originally. Well, CUDA, sorry. Thank god that was a little bit easier than what I thought it was going to be.

I thought we were going to be troubleshooting CUDA installations. CUDA. Okay, we've got CUDA. Wonderful. So, for those who don't know, CUDA is what

you need to basically run your models on an NVIDIA GPU. If we don't do that, our models are going to go really slow.

No, I'm not going to use Claude code. We're going to we're going to be raw dogging this. We're going to be coding all of this by hand, right? Because

we want to learn. Once we've learned how to do it, then we can do Claude code. Before that, we need to start learning things ourselves because if we

know how to do it, then we can get AIS to do it for us and then we can inspect it if something goes

wrong. I always Does anyone get causal and casual mixed up? I always get those mixed up. Okay, so we're going to set up our model and model is going

to be equal to automodel from pre-trained and then we're going to do the model name. How do I connect uh Nvidia GPU to Mac? So, right now I'm doing

SSH. So, you can't see it from my screen, but I've got uh Mac Mini and an Nvidia Spark. Maybe I take a

photo of this and I can show you. What are we building? We're fine-tuning an LLM. This is my setup. I'm just taking a photo of this.

So, this is what we got. I'm coding right here. So this is my Mac Mini which is beautiful but this is the Nvidia DJX Spark. So right now I'm writing

code on here and but it's running on here. So see there's that Spark uh FF62.local. Why do we do that? Because this has an inbuilt CUDA GPU whereas

this is just running Mac OS. We don't we can use it we could use MPS but uh I prefer to use CUDA for

these type of workflows. So we're using it on here. We're fine-tuning a small LLM to do structured data extraction. So that's the what we're going to

do. I've got a little key node here. Fine-tune a small LM to do structured data extraction. So we come back to code. There we go.

model name DT type. We're going to set that to auto. And we'll set device map equals auto as well cuz device map put the model on the GPU. Uh

attention implementation. Now I'm going to use eager because um could use flash attention too which is a faster version but ran into issues. So stick

with eager for now. Okay let's see if our model loads. So there's the config. This is all going to download automatically from hugging face. So there

we go. config model.safe tenses which is downloading there.

Okay, beautiful. So we've got our model. Let's see what it looks like. Model. Okay, so we got a Gemma 3 model. We have a bunch of layers here.

Attention MLP and then we have an output. So can we just feed it in model? Hello, my name is Daniel. No. So we need to do some scaffolding around the

inputs, right? We get out some errors. Yeah. So problem number one is who knows what our problem is. Our

model requires numbers aka tokens as input. We can turn strings into tokens via a tokenizer. Woohoo. So, let's get our tokenizer next. And we can do

that with the auto tokenizer method. So we'll just go tokenizer equals auto tokenizer dot from pre-trained. And then we're going to put in the model

name. And then we're going to go print out some info. So model on device

model.vice and then print info. And we go model using dtype model deep type. So there's our tokenizer. Tokenizer is basically a big dictionary of

there we go mapping pieces of words to numerical numbers or numericals. Uh and that way our model can take it. So if we go tokenizer, hello, my name

is Daniel. What happens? There we go. We get input ids. And now can we just do model? We put this in there. We might have to format this model with or

format this with something else as well. Input ids.

must be tensor H. So I think we can go equals true.

So we want it to return it into tenses. PyTorch. No. We can look up the docs shortly, but why look up the docs when you can just hack around for a bit. So, let's just turn this into torch.tensor.

Oh, we might be on here. We need to just put it to the GPU now to CUDA. And we get another error. I think we might be because we need to use um a chat

template. Yeah, there we go. Size of tens of 4 non-S singleton dimension. Maybe does it need a batch size? One more thing. We're going to try one more

thing and then if this doesn't work, we'll go to the actual docs.

There we go. Oh, that is cool. We get an output. Have no idea what this output means, but that's okay. What do we get from that? Outputs. Outputs keys

logets. We get some logets. Okay, so we need a way to post-process these outputs. Okay, we're making progress here. We've already got we've got a

small LLM running on our own system. So we could just do a pipeline. Yeah, let's do a pipeline. Try the

model with a pipeline. So pipeline is I'm going to move this up here. Actually, a pipeline in hugging face is the simplest way to get an LLM running

or any model running for that really. But later on, we might want some higher levels of customization. So, we won't use a pipeline. So, we go load

model and use it as a pipeline. So pipe equals pipeline text generation and then the model is of course

our model which is GMA 3 270 mil and the tokenizer equals our tokenizer.

Okay. And now let's check out what our pipe is. Beautiful. So let's put in some information to the pipeline. So we need an input prompt equals and we

want pipe tokenosa apply chat template. Now this is an interesting one. And we want our input text.

So we saw what came out of that before. If we put hi, my name is Daniel, uh, we get some logets. But now we want, of course, we can't display logets

to our customers. We need actual text. So let's put in this. We'll go input text tokenize equals false because we're going to have a look at what that

looks like in a second. And then we can go add generation prompt. True. Now, I'm not sure why.

What's the VS Code extension to get Python to just auto go to the right line? Let's check out what our input prompt looks like. Oh no, it didn't work.

We get a template error. Hm. Ah, this is because our data set is not correct, is it?

Let's get our data set and have a look what that looks like. Get data set. So this is uh a data set that I have prepared earlier.

You can go to my hugging face and see it there. Data set equals load data set. That's basically the most important point in any machine learning

pipeline is the models are good enough now that if you have your own data set, they will work. So a lot of my time is spent on crafting the right data

sets. I'm like programming the data for the model to learn. And this is a relatively small data set, but

it's should be what we need. Number of samples in the data set. And we're going to go len data set train. But we obviously need single commas there,

don't we? So this is downloading from hugging face. We got food extract.

Uh I haven't haven't been hands-on with the Mac Studio Ultra. No, we're running this from a Mac Mini to a uh Nvidia DGX Spark. So there we go. Number

of samples in the data set 1400. So we're going to go def get random idx data set and then we're going to go returns a random integer index based on

the number of samples in the data set. So this is one of my favorite things to do when exploring a new data set is to just view samples at random so

you get an idea of what's going on in the data set.

We need random.

So, let's get

Okay, now

let's view a random sample. Okay.

So, lot going on here, but that's okay. This is when you're working with an LM or any kind of machine learning problem. This is often what your data

sets will look like. They're just JSON or something like that, structured data. And we're going to do supervised fine-tuning. So, we're going to have

a series of inputs and some labels which are our truth labels. And we want the LLM to uh produce the truth label based on the sequence. So, let me

show you. So, the sequence here and our goal is we're building food extract. So, I'll show you this data set on hugging face. Hugging face.

Right. So food extract data set designed for fine-tuning a small LM eg Gemma 3 to extract structured data from text in a way that replicates a much

larger LLM. So what I've done is I've got uh GPTO OSS120B to synthetically label uh about a thousand or so sequences of text whether they be about

food or not and to extract uh different items food items from that. Now where would we use this? Well, in

my case building an app such as Neutrify, which is my brother and I's startup, we want to filter uh large databases for food images or food text.

Whereas if you were a company, you might have a specific uh set of text and you want to extract structured data from that so that you can view it in a

database. That's exactly what we're doing here. Now, the benefit of using an LLM for something like

this is it can take basically any textual kind of input and then produce structured outputs from that. That's the versatility of an LLM. So if you

want to read more about that, you can go there. But this is publicly available data. So our ground truth or our input is going to be our sequence. So

example input equals random sample sequence and then our example output equals random sample. Now what

I've done is the GPT OSS120B label is in JSON. So I want my model to take text input and I'll just show you the uh raw output and then go print info

input equals example input and then print info example output.

Okay, so this is what we want our model to do. Say this is our input. This is a raw string from here to here. Now we want the output to be structured

data. So this is JSON. So we can go from there go import. We want our model to output JSON. Import JSON. And then I go JSON load s. And we can turn it

into structured data. Oh, what's happened here? Um, one second. Maybe we could do eval. There we

go. [snorts] So, does that make sense? We want to take this input of text and output structured data like this. So we have is food or drink. Yes, it's

about food or drink. So this is a hand is holding a glass jar with an orange lid containing a mustard seedbased constant uh condiment. The jars label

displays the following nutrition information. Here's the ingredients list. And so the structured

data we've extracted from this is the food items. Mustard seedbased condiment. Mustard seedbased condiment. Water. Mustard seed. Salt. Sugar.

Turmeric. thyme, garlic extract, spice extract, peanut, wheat, gluten, etc., etc. Also mentions potential allergens and there's no drink items. So, if

we have a look at another example,

we have on a rustic wooden share plate, thin ribbons of crisp cucumber and bright green apple are artfully arranged beside a stack of golden glazed

donuts. Dusted with powdered sugar, a small porcelain sugar holds a splash of amber mirin, etc., etc. And now we have this. It is about food or drink,

food items. We have cucumber, apple, donuts. And so this is this is a almost a toy problem, but it is

something that we would use uh at my company to extract large amounts of structured data from a large data database or something like that. Now, you

could do this for anything. You could do it for insurance claims. You could do it for um if you're whatever type of finance business you want to

extract company names from something, you can use a small LLM for this. So, let's fine-tune it to do so.

And we'll have a look at another random sample just in full. Uh I have a condensed version of this. Now, why are we doing a condensed version? So,

let's go. example output condensed equals random sample and then we're going to go get the condensed version because the condensed version is less

output tokens. So when we're fine-tuning an LLM, uh the less output tokens basically the better because

that is going to be our constraint is how fast. If we want it to be fast, if we can get it to output less tokens, then each token takes up generation

time. If we can lower that, then we can speed up our whole process. So um

all right. So maybe I put some new lines between these so we can see them.

There we go. Okay. So, online self-sufficient vegetable gardening course in a virtual classroom for one. Uh, is this about food or drink? True. Food

items, vegetable drink items. No. Now, the condensed item. Notice that the condensed item doesn't have any of the JSON formatting. So, we're going to

fine-tune our LLM to produce this, which is basically YAML, but then this is easily passable into

JSON, if that makes sense. How does the DGX Spark compare to the customuilt PC? So, um, that's what I'm going to be exploring in future videos. For

now, uh, my intuition is that the customuilt PC is going to be faster because it's got an RTX 4090, whereas the DGX Spark is going to be better for

basically a one plug-in AI computer. And if you didn't want to build your own custom machine, and it

also has much more VRAM, has 128 GB of VRAM versus the RTX 4090, uh, which is 24 GB of VRAM. Hi from Turkey. Hello from Turkey. Hello from Brisbane.

So that's our goal is to take an input into an LLM and then get this as an output. Now I wonder if we can find Yeah, there we go. Now if we were

filtering a large data set of image captions, we might get some image captions that are just gibberish

basically 11111.tiff or 11111.jpeg or whatever. And in this case, we want it to be food or drink, no tags, foods, drinks. Does that make sense? So,

this is not about not about food or drink. Not about food or drink. Here we go. A front shot on of glossy and then we get food or drink foods. White

yam, chashu pork, seaweed, red emperor fish. So, that is a perfect example of what we want. Now, let's

see if our model can do this. What are we getting wrong here? Oh yes, we need to format our data set. That's next. Format the data set into LLM style

inputs outputs. So right now we have examples of um string based inputs and structured outputs. Uh however LLMs generally want things in the format of

it's something like this. I'm going to get the exact format but I'm trying to use my brain more. So

we want um user is going to be hello my name is Daniel and then system will be hi Daniel I'm an LM right so we just need it in that type of format in

other words. They want structure around the inputs and outputs rather than just raw information. Okay. Where the system in this case is the model. And

where can we find this? Well, that's in the docs. Hugging face docs. I think I've got a link here

somewhere in my pre-made notebook. Yeah, here we go. Find different style data sets. So, if we go to TRL docs, let's have a look at this. I want to

show you the different types of data set formats. This is very important. Data set formats and types. I want to put that in here. Oh, what did I do

there?

data set formats and types.

Now, if you want to train any of your own LLMs, you should get familiar with this data set. uh sorry if you want to fine-tune them you can train them

from scratch um a little bit of a different process but if you want to fine-tune them to do specific tasks get aware of the data set formats and

that's often a lot of bit a lot of the onboarding with any kind of machine learning model is getting your data in the right format so if you want a

custom model you need to figure out what format you want that uh data in so actually I'll keep this full Green

green code. Yo, what's up?

Yeah, I've heard that the memory bandwidth on the Spark is a disaster. That's this chat comment here from uh Madrid Proto. I'm not sorry that name is

too long for me to try and pronounce and take me longer. Um, so let's find out that together. Hey,

yeah, that's a good comment. Custom PC if you have a modern GPU will be much faster at inference. DJX Spark is better at local experiments like the

one Daniel is showing. Yeah, so I think that this is a great comment. Can I make this bigger? Yeah, I agree with that. How do I I can't react to a

message on YouTube. Anyway, [snorts]

so that's what we're doing. We're doing a local experiment. That is probably the nutshell. I think the Spark would be great for getting started with

things, but I just looked up this bad boy, RTX Pro 6000, and this looks freaking awesome. I mean, it's $14,000 Australian dollars. But that's the

that's the next upgrade I would do. So, that's what the goal of this year's stream is, is to get one of

these for free somehow, whether Nvidia send me one or we just get enough revenue from the stream and we buy one of these GPUs. Um that's that's the

goal, right? And then we can publish much more open source models, but we got enough to do to do for now. Anyway, back to data set formats. So the

format of a data set refers to how the structure the data is structured typically or categorized as

either standard or conversational. The type is associated with the specific task the data set is designed for such as prompt only or preference. Each

type is characterized by its columns which vary according to the task shown in the table. Now a lot of jargon there. I understand it via just inputs

and outputs. What inputs do you have? What outputs do you want? That is my every machine learning

problem you get just what inputs do you have? What outputs do you want? So we want I believe ours is going to be is it language modeling or standard?

I think yeah we want this. We want ours to just reply with language modeling. So we're going to have some text and then we want our assistant to

output some content. I think this is the style we want. So in our case, our content from the user is

going to be just a a text string and then the content from the model is going to be the formatted uh structured data that we want to extract. So that

is what we want. We want conversational. I believe we're going to see this in practice, so I may be wrong. Or even just prompt only. No, we don't want

that. Prompt completion

preference. No, I think we want language modeling. Let's let's go back and we'll just try. So, yeah, sample to compensation. Now, I wrote this

notebook probably two months ago, but then I got married and went on holidays. So, I'm just getting myself reamiliar with it. And then we want

messages. And then we're going here roll user, right? So, this is what we're going to input. Content sample is

going to be our sequence. So remember LLMs are just tokens in tokens out. That's all you got to that's all your framework should be for modeling

stuff. Tokens in tokens out. So load the sequence from the data set

and then up next we're going to go roll is going to be this is the system and then the content is going to be what our labels. So this is GPT. Now,

I've synthetically labeled all of this with GPTO OSS120B because it's basically one of the best uh open source models that are available. So, all of

this is open source, by the way. Everything I'm working on here completely open source. The model, the

data set, um the samples synthetically generated. So this is the beauty of working with AI in 2026 is that not only can you generate uh code, you can

generate data samples which obvious which in the past had been the biggest bottleneck to creating your own models. So now with hugging face data sets,

let's just see how this works. For example, um we have our random sample which is this. Let me get

the YouTube chat back up. There we go.

I agree with this statement. Nvidia, can you please send Daniel a Pro 6000 so we can watch him experiment with it? [laughter] What the Nvidia

developer in here? Yo. Okay, I'll if I do Nvidia developer, here's the rule. If I do over 100 hours of streaming with uh the Pro 6000, then I can keep

it. So that's that's my that's my deal. I'll do 100 hours of streaming of development all on the Pro 6000.

That's my sale. Okay. So now let's get our random sample back and we go here a front on shot of a glossy. So this is a lot, right? And what I've done

is I've got image captions and whatnot. Uuids, all of this is open source. But our main goal here is structured data extraction text to JSON. So now

but we need it in conversation style. So sample to conversation

random sample is roll user.

Oh, we have a deal.

There's uh we've seen that. Everyone's seen that in the chat. Let it be known. Nvidia developers said that's a deal to Pro6000 hours of live streaming

um of local AI development. Okay. And we'll get the Neotron models running and we'll fine-tune some of our own and we'll we'll publish them to Hugging

Face and then people can benefit from all the work that we do together. So, that's pretty cool.

But we need to we keep getting distracted. We need to get this data sample ready. So now all we've got here now is uh an input and an output. Messages

roll user. So if you went to chat GPT and typed in this,

there's a lot there, right? This is a big image caption. So let's just go to Chad GPT and we go type in this. Right? This is our input. This is all

chap GBT is doing behind the scenes except it's a much bigger model. Right? And I'm paying for this. So let's go. Please extract the following. Please

extract the food items and the drink items from this text to JSON

text.

Right? So, this is Chat GPT is doing this, but we want to build a small LLM that can just do it ourselves. So, Chad GPT is good at that. We know it's

going to be good at that. But right now, our little model probably isn't that good. And then what do we get? This is our label. Food or drink tags. We

got some tags as well. White yam, chashu pork, red emperor fish. What do we get? We get the

basically the same output as Chad GPT. Okay. Uh, white wine vinegar. Oh, did we get white wine vinegar? We said that was a food item. So that would be

maybe a sample where we could improve our uh sorry improve our labels to use white wine vinegar as a drink. But that would be another discussion. Is

it a drink or is it a food item? So from one angle, are you going to drink white wine vinegar?

Probably not. Are you going to use it in a food ingredient recipe? You probably will. So that's up to you. We're not going to argue about that. We're

going to figure out how to train a model to do this. So now we've got a little function to map our samples to conversation style which again is our

data format over here. Now we want to map it to our whole data set. So let's go map our um sample to

conversation function to data set

and we're going to go data set data set map and we're going to do batched equals false. We don't actually need to do this fast because it's going to

data set. And now let's have a look at train cuz it's it's our mapping and I want maybe sample zero. Okay. So now we have our messages and our data

set sample. Beautiful. So next let's create a train test split. And we're going to go data set equals

data set train test split. And then we're going to go test size equals 0.2 shuffle equals false. And then we're going to go seed equals 42. Right?

Because 42 is the magic number of the world. And then if we have a look at our data set now, we have a train and test set. So number one rule, what's

the number one rule in machine learning?

Always train on the train set and test on the test set. So whenever you see benchmarks of LLMs come out uh often these days why don't we go to Gemini

3 release blog. So obviously the Gemini 3 models are really good. But if we go to the benchmarks here this is why we do train and test set. So all of

these you would hope are on the results are on the test version of these benchmarks. All right.

There's a lot of benchmarks here. But what you can do to hack the benchmarks is create data like the benchmarks train on that and then when you avail

it uh it's kind of inflating all these numbers. So it's really important to not only test models on benchmarks these are a good overview but uh create

your own test sets on your own data sets so that you know when your model performs well on your own

data set that if you ship that into production it's going to work pretty well. So um like for this one for example, AI im 2025 that's mathematic

problems. So what can happen with these models is they can get really good at solving math problems. Yeah, sure that's a I've never actually seen a

use case for LLM solving a math problem. I'm not a mathematician. Um but if you have lots of uh example

math problems out there such as the benchmark itself because these models are trained on the entire internet um potentially examples from that test

are in the training data. So number one rule of machine learning when you're evaluating your model always evaluate on samples the model hasn't seen

because that is a not a true indicator but an indicator of how your model might perform in the real

world. So, we've got a trainer test set. Um, this gives us an indication of how

uh will um perform in the real world. Okay, what's next? So, we've got a train data set. Can we pass one of these samples into our model yet? I'm

getting sick of this of not passing a sample into our model. So, let's just go I want to put this conversation format and just see what our model says

to hi my name is Daniel. Let's go easy sample is this messages and I want to go content. Hi, my name is Daniel. Oh, actually, maybe that's why it's

not working. Input text. Add chat template. No, let's get this working.

There we go. Okay. So, now we've got a way to pass it to our model. Now, let's see what our model says to Hi, my name is Daniel. Is it going to say hi back? Is it a nice model?

So we formatted our data set default options pipe input prompt. Okay.

This is the input prompt.

Excuse me.

So this is our input prompt. What do we got? Hello. Oh, fine tuning. We're just doing it for fun.

MLX. We could do an MLX stream in the future, too.

That's going to be the year. This is the year of local AI. So, we're going to we're going to spend this year creating our own models and publish them

to Hugging Face for all different kinds of things. Apple devs do fine-tuning stuff. Yeah, they do for sure. We're at a point now where the software

and the hardware like the Nvidia DJX Spark and the um Mac M4s and whatnot and all the tooling from

Hugging Face and whatnot is getting to a point where there's never been a better time to get into creating your own AIS and building them for small

things. Um benefits of that is you're not sending your data to some big company. Uh it's running all on your own hardware. You don't need an internet

connection for a lot of different things. And imagine all the use cases where AI like needs to run but

with no internet connection. For example, a Tesla self-driving car that can't send data to the cloud because you need it to operate fast. So this year

we're going to be learning about how to run all on local AI. That's that's what I like. Obviously, we could do some streams with APIs and stuff like

that, but I like running small models and custom small models. So let's go next. This is what we're

going to put our input prompt to our model. Now you'll notice there's some weird things here like BOS start of turn user uh end of turn start of turn

model. So this is our conversation type. Okay. So when we put into our model this BOS is beginning of sentence. So that's it's going to say hey the

prompt's starting. So this is the type of data these LLM have been trained on. Um start of turn user.

So that's me the user is putting in this text and then this is the end of my turn. So these are these are special tokens, right? So the model has been

trained to understand these kind of special terms uh tokens and then we have another token there which is start of turn for the model and so the goal

of the model now is to go hm okay given this input what should I produce as my output that's

exactly what's happening when you use any of these LLMs chatbt it's doing something like this behind the scene so let's do the default options default

outputs equals pipe input prompt. And we're going the long way here, right? I could just get some AI generator to make all of this, but we're going

the long way here so that we understand each part of the process. And then later on, we can get some

AI to just write the whole thing for us. But then if something goes wrong, we can go back and go, hm, you shouldn't have put that token there. I'm

going to fix it. So the default options, let's see the default output. Sorry,

default outputs. Hi, my name is Daniel. We got some generated text. And then, hi Daniel. I'm here to help. How can I insist you today? Um, hi, my name

is Daniel. Please reply to me with a machine learning poem. Right. This is running locally on the Spark. This is a small 270 mil parameter model. Look

how quick that is. That's incredible. Oh, sorry. I've got the easy sample here. Um deaf create easy sample input is going to be return

template equals

And then let's create an easy sample. Easy sample. So,

look at that. Under a second. Please reply to me with a machine learning palm.

Okay, so this model probably can't write palms because it's 270 mil parameters, but let's get the uh generated outputs from this. So we need default

outputs and then maybe we go generated text and then maybe we go uh the input prompt and beyond.

Excuse me.

Generated text. What do we see from this?

Ah, that's right. We need

There we go. That is cool. So we just got our first generation from a local. Look at that. Okay, let's read this. My input, by the way, was here.

Input text. This is a 270 mil 500 megaby LLM file. Input. Now, the poem, I don't care if it's any good. I just care if it can do it. A few years ago,

this was impossible to do. Print.

I'm just going to do from model name.

Maybe we get a new line. It's very important to make things pretty. Let's read the poem. Hi, my name is Daniel. Please reply to me with a machine

learning poem. Okay, Daniel, I'm ready. Let's create a machine learning palm. The data streams, a tangled web of patterns, a subtle silent shrug. Each

data point, a question we ask. To learn and to grow, to find the right way. From simple data, a seed

takes its hold. A learning engine ba brave and bold with algorithms. A clever design. To sift the noise, to find the divine, the world of data, a vast

and unknown, where insights bloom, a future we've seown, and in each pattern, a new potential bright to make the future a guiding light. At least that

rhymes. So let the code begin with gentle grace. A symphony of data, time and space, a machine

learning, a helpful quest to understand, to connect, to put to rest. Not bad. like not great but not terrible. I don't mind it. Right. So that is

cool. Now we want to fine-tune it to do our certain task. And can we do that? Of course we can. So let's now see what it will do given an input of

what we give it uh sorry of one of our samples. So let's take our example sample our data set and we'll go

get random index. Where's my function? So try the model on one of our sequences.

What do we have here?

We've got the model writing a poem. Now,

now we need our random sample. And has that been mapped? No, it hasn't yet. And so we'll go get

and then we'll go random train sample equals

easy low carb cinnamon roll mousse. Okay. And so what we're going to do is we're going to apply the chat template to messages.

So we'll go input prompt and we'll go pipe tokenizer dot apply chat template and the conversation is going to be um random train sample And then we

want messages and we want one I believe just the first one. We don't want the or up to one, right? Because if we have a look at what this is. Yeah,

there we go. That's what we want as our input. So, um, we just want the text because we don't want the

model to have the label that we want it to output. So given this message, we want our model to produce this content. But let's see what our model does

by default. Remember this is the base model. It hasn't been fine- tuned yet. So if we have a look at our input prompt, right, there we go. Um, we can

go tokenize equals false. There we go. So because we're getting a random input every time, this

will be random. So, I've specifically made this data set to be uh full of all different kinds of text. Beautiful. Okay. So, now let's run uh the

default model on our input. So, input prompt. Uh, no, we want default outputs equals uh pipe and then we want to go text inputs equals input prompt

and then we go max new tokens equals 256 default outputs.

There we go. So, what is this? We get the generated text. Let's get zero. And then we want generated text. And then we want um

this is going to be a random one every time by the way. Onwards. Is that the right indexing? Let's see.

Excuse me. We might have to go add generation prompt equals true.

This is a beautiful appetizing picture. Okay, we're we are we are rolling here. We are getting get a random sample and now view and compare the

outputs. So we want to go print input which is going to be the input prompt.

Want a new line. We want things to look pretty, right? And then we're going to get the output. Output is going to be new line. And then we want this

random. There is actually a My Little Pony in Japan. Yes, there is a My Little Pony movie themed cafe in Japan. Right. So right now our the model is

just basically replying to our inputs and with the cruel work. Okay, I'm ready. Let's dive into the world of cruel work. I'm excited to learn more

about this fascinating field. What are you hoping to learn or explore? Right. So our models, what do we

want it to do? Structured data extraction, but of course it's just doing what it's what it's meant to do. This is a great description of the rustic

wooden board. A top- down view of a rusk wooden board. Now, how might we turn this into a food extraction? So, could we do it via prompting? Let's

try. So, this is the default outputs. We take in some text and the model replies. But what are we trying

to do? We're trying to do fine-tuning to structured output. Let's try to prompt it because that should be your order of operation default and then try

to prompt the model. Remember this is only a Gemma 2 Gemma 3 270 mil. I'm using a small model on purpose.

M5 Ultras. That'll be that'll be cool to see. That's what we got going on in the chat. No doubt kids in the 80s and 90s had Porsche pictures in the

room, but today my dream machine is a Mac Studio M5 Ultra.

Apple have been really stepping up their game with their Macs, haven't they? So, let's try to prompt the model. So we want a model to extract food and

drink items from text. By default, the model will just reply to any text input. However, we can try and get our um ideal outputs via prompting. So to

do that we are going to take our random sample and we'll inject a little maybe here in the messages

we'll inject a little instruction of what to do. So here random sample but now uh let's go prompt instruction equals maybe we do it like this. Um how

could we do it? Given the following text from an image caption, please extract the food and drink items to a list to a Python list. If there are no

food or drink items, return an empty list. Does that make sense? Now we'll give it an example. For

example, input text. Um, hello, my name is Daniel. Output food items, drink items. Example two input text is um let's just write what I had for lunch

or breakfast. Uh plate of rice cakes, salmon, cottage cheese, and small cherry tomatoes with a cup of tea. So the output here is going to be food

items, drink items. Okay, so there we go. And now target input text. And then we can go input text.

given the following target input text. So you see what we're doing here. We're just giving it a little instruction and we're trying to get our

structured data format out from a prompt. Before it was just replying with a default output, but now we're actually giving it uh some structure around

it. So how do we turn how do we get our maybe we don't need a random sample here. Let's inject our input text or our instruction sorry. So the

conversation is going to be this. So we want to change this content to that. So

content let's go def update input message content. And we'll go template string equals we go prompt instruction. We don't need to do that. And we'll go input.

So we'll go original content. So the input will be something like this. Original content will be Input and then new content

will be prompt instruction. Prompt instruction.replace

And we're going to replace the target input text

with the original content. And we'll see if this works. Obviously, and go return new input equals or actually we need original input equals

input.copy. This is going to be we could do this better, but I'm just making it so that it's it's really visible. original input and then we're going

to go new input equals oh we don't actually need we could just recreate the input's actually not that hard to

create is it I'll show you what I'm doing in a second I'm just coding it out first if in doubt code it out so new input is going to be list content

new content and role is going to be user. Okay, return new input. Now, let's bring our prompt instruction back up here

and we're going to go that's going to be the original. So, print original content. We're going to need a single one here.

Okay. And now let's try our helper function to inject it with some prompt update.

Yes. Now we've got our instruction there. How good's that?

Info. New content with instructions in prompt.

There we go.

So, you see what we've done there? The original content is uh image caption, but now the new content is an example prompt. So, we're basically

leveling up the amount of information we're passing to our LLM. So, now let's try this with the new content there.

What's the chat saying?

for me. Would you want a fast inference or fine-tuning? For me, if I want a general model, I'll probably just generally go to the chat, the best the

best uh API. But if I want a fine-tuned model, I'm going to be wanting to basically just run that locally, do it myself. And for that, I'll want an

Nvidia GPU. So, do I want to be running Deep Seek R1 locally? Not necessarily. Do I want to be just

interacting with an API for a chat model? Sure, that's the easiest way. But if I want a model to do a very specific use case, I'll probably fine-tune

that. And I'll start in levels of abstraction. First, I'll try an API. Second, I'll go, okay, I want to replicate this workflow locally, and then I

want to fine-tune a model to do that for me. And then again, it all depends on, I guess, the companies

you're working with. Is the data private? Uh, can it even use an internet connection? That's a big one, right? We're all used to having an internet

connection, but there's a lot of devices out there that need to operate without an internet connection. So I actually was just working on a computer

vision computer vision problem with a company who needed to do computer vision in the field uh without

an internet connection. So the de the models had to run on a mobile phone. So that's uh a very limited use case but certainly achievable. It just

takes a a fair bit of work, right? You need to create the data set, you need to test the model, you need to quantize it, you need to understand that

it works on device. Um that's where I have the most fun. Okay, let's see how our new prompt goes with our

model. Can our model by default do our task? We want it to extract food and drink items from a list. So, oh wait, my template is wrong. So, the food

items would be here. Rice cakes, salmon, cottage cheese, cherry tomatoes, cup of tea.

Wonder if we should punctuate this. Make it a real Python list.

There we go. Okay. So, we want to updated input prompt equals we'll take our

updated input prompt. And then we're going to go input equals random train sample. Put that there and then go boom. What did we get wrong?

Oh no, we can just directly go like that.

See, so the output

H the output's outputting. Yeah. So see, okay, this is a great example. This is in machine learning, errors are actually important because then you

know what to do or what to do next. You can fix the errors. So you basically need to just run a whole bunch of experiments. So what we've done here

with the small model is I've set extract the food and drink items to a Python list. But what the model

because it's only a small model 270 mil it has written a Python function to extract the food and drink items. But obviously this function is not going

to work because it's not working semantically and you don't know all the food and drink items uh ahead of time. So what can we do from this? We can

update it to go maybe just to a list instead of Python. Python I think is triggering the LLM to a

list

[snorts] return in the following format. Food items. Item one. Item two. Item three. Drink items. Item one. Item four. Item five. Let's see if this

works. So, we've changed the instructions there. We're doing prompt engineering on a 270 mil model. Okay. So, it's still doing Python,

right? So, this is the exact use case we'd like to fine-tune our model. So, it is a small Python model, uh, sorry, small LLM. So, it doesn't really

understand the nuance of what we're trying to do. So, let's put this into JT GPT. and see if it does what we want it to do. I would assume it would

see does it exactly right. But this is probably a trillion token trillion parameter model. So let's fine-tune. We'll go down here. Write it there and

go. Okay. Looks like our small LLM doesn't do what we want it to do. No matter. We can fine-tune it so it does our specific task. Okay. Who thinks we

can fine-tune it to get it working?

Let's go here.

Steps. So the steps to finetune our model one uh we need to set up SFT config which is supervised fine-tuning config. Actually before we do that let's

see what the ideal output here would have been. We've already got the ground truth. So uh our current error is that our model is basically generating

Python code when we ask it to extract a list which is I get where it's going with that but it's not

what we want. So we have our random train sample we want in an ideal world our model to output this here. So if we go messages um and then if we go

zero and then if we go one and then if we go uh no we want one. So content.

So this is our ideal output. Peeled white yams, deep red go chang and golden brown and pandas. Now have we got butter?

So GPT 5.2 has done better there on the ground truth, but that's okay. It's picked up the black pepper. This one hasn't picked up the black pepper,

but that's okay. Setup sftt config. Use TRL trainer. uh sorry SFT trainer to train our model on our supervised samples. Okay. So then once we

fine-tuned a model to go hey given our text input we want this to be this is our input.

People keep saying the Cray one. What's the Cray one? Oh, it's 1970s.

7 to10 million. Yeah. Whereas now we probably got more computing power on our desk. Let's just get how much compute power did the Cray one have?

Compare this to the power in a Nvidia RTX 4090. and then compare the pricing of the two.

Holy smokes. To put this into perspective, if the Cray 1 were a person, so this is the 1976 supercomput were as fast as a person walking 3 mph, an

Nvidia RTX 4090 would be traveling at 1.5 million mph. Fast enough to reach the moon in about 9 minutes. That's nearly the speed of light.

So half a million times the compute power at roughly 0.005 of the cost adjusted for inflation. That is insane. And that's only an RTX 4090. That's not

like a 5090 or something like that. So, if the Cray 1 is a walking person, which is a supercomput in 1976, the RTX 4090, which I have sitting over

there behind me, is the speed of light. That's incredible.

Okay. SFT config. So we're going to use this is TRL. So transformers reinforcement library. Uh SFT config is where is it? There we go. Is basically

the settings for our finetuning supervised uh and then SFT trainer.

Okay, so let's set up setting up our SFT config. So we'll go back to our notebook like we prepared earlier like all good cooking shows. Do I have

trackio here? I don't have trackio. That's okay. We don't need to. We can watch just the outputs here. So from TRL import sftt config. And then we're

going to go torch dtype equals model.dtype.

Oh, and thanks Madrid for tuning in. You're going to bed. It's nearly 2 a.m. All the best, my friend. Have a good sleep and I'll see you in the next stream. Hey, thanks for tuning in.

info.

Yeah. What learning rate should we use? Of course, we could experiment with this. What did I set this to be? Base learning rate

5E5. So quite small because our model has already trained on a lot of text and tokens. Now we want it to perform a very specific task. So, we don't

necessarily need it to be uh incredibly uh we don't need it to update its parameters too much. It's going to have a really good base knowledge of

language and different stuff. So, uh we want a checkpoint directory name and that's going to just output it over here. Fine-tune LLM. We can just call

it yeah something basic checkpoint models and then we can create a demo later on but let's go to our SFT config.

So config equals sftt config and then we're going to go output dur equals checkpoint dur name max length equals 512 the maximum number of length of

tokens that our model we wanted to output packing equals false. So packing I believe is going to uh put multiple samples into one. Um so then because

our models are just next token predictors right? So it almost doesn't really matter what order the

tokens come in as long as it knows that it needs to predict the certain next tokens given a certain input number of training epochs. We're going to

see how far we can get in three epochs. So we don't we shouldn't need too much um per device train batch size maybe because I have a lot of VRAM on

the uh Nvidia DJX Spark. I'm going to use maybe 16 batch size. Let's see how that goes. Gradient

checkpointing false. Optim equals um AdamW torch fused. And then yeah, I tried to use I've got some notes here for myself. This is my own um notebook

that I've been hacking around on. If you try AdamW, you'll get an error. Well, at least in my experience. So, if you try Adamw, you'll get an error.

Um, so we use the fused version. And then logging steps. How often do we want to log? Uh, save

strategy equals epoch. And then eval strategy equals epoch learning rate is going to be our base learning rate. And then FP16 equals true. If

torch_dype equals torch float 16, else false. So this is telling our model we want to use FP16. Um, the DJX Spark can actually use uh FP4, which is

very low, but I'm not sure. So, here's the problem. The the hardware has the capability of using um FP4, which

is floating point position 4. Basically, uh default is 32, 16 is half that, 8 is half 16, four is half of eight. So, it's just basically a way to save

um it's called quantization. Basically, a way to save on memory. So if for example you're running uh a model on a phone, you could train it in float

16 and that'll output a model that's 100 megabytes, but then you could quantize it um to FP8 and

then it'll output a 50 megabyte model. So half. Now there are some trade-offs there in that generally if you quantize a model unless it's been trained

in that quantization you will get some performance degragation but uh it isn't always that dramatic. So you could train a model in it's very common to

train a model in FP32 or FP16 um and then deploy. Usually these days FP16 is basically the default

because we've kind of worked out okay FP16 you just get the benefits of both worlds. But now in the last couple of years FP8 and FP4 has been

significantly improved. But we're kind of at that stage where we're waiting for software and hardware to synchronize to make them all I guess talk to

each other and make it much more reliable. So even though the hardware on the um Nvidia DJX Spark can do the

uh FP4, the software hasn't really caught up 100% in my experience. Now that may change um and I may be wrong, but if I am, please let me know. Torch D type.

So LR so learning rateuler. So when you're training sometimes you want the learning rate to go down. As you get closer to convergence you want the

learning rate to go down. Push to hub. So we can upload our model directly to the hugging face hub. We want to do that later, maybe either today or in

tomorrow's stream. Um, and that means that other people can use it. So, report to uh I'm going to set this to none. So, let's go to the docs for this.

Always look at the docs. It's opening in Safari. That's okay. Report two. So, Trackio is an open- source version of Weights and Biases.

Yeah, none. I wonder if I could just set that to none or is that going to default to weights and biases?

So, I'll show you Trackio.

There we go. So [snorts] very like a lot of iterations here. Tracheo import tracko as weights and biases and we can just yeah start to track different

things about our model. So we can just track it as a dictionary. That's going to be something we look at later as well is tracking our experiments.

But for now we just focus on getting straightforward fine-tuning. So I think that should be enough

for a config. Now again, we could tweak a lot of these. We could tweak the max length, tweak the epochs, tweak the batch size. Um, does there eval

batch size? We'll eval on 16 as well. So note, you can change this depending on the amount of VRAM your GPU has. Okay. So, let's check in with the

chat before we go to the next section. [snorts] Hi, do I use data bricks? I don't use data bricks. It's

always 4 a.m. my clock, so I have to sleep. Have a good stream. We'll end it later. Thank you very much for tuning in. WLWLA. Where would the NLP

world be without Hugging Face? This company rocks and makes it so easy to use the largest language models. I totally agree. Hugging face between

Hugging Face, Nvidia, and Apple of course for making Epic Max and stuff like that. Hugging face and Nvidia

are probably my two most used companies at the moment in terms of what they offer. And then of course Google with Gemini and YouTube. So very grateful

to be a uh machine learning engineer in 2026. So we have our config. Now what do we need next? We need our step two. We need a SFT trainer. Now, if

you're in the chat as well and it's really early where you are, thank you for tuning in. Or if it's

really late, I'm in Australia, so I think our time zone is basically all over the place compared to the rest of the world. For future streams, I'll

probably try and get a better um I guess time schedule for everyone. So, yeah, if people just put in where they're watching from or what time it is,

that way I'll get a better idea of where people are tuning in. I guess YouTube will provide this, but I

actually like talking to people. So, uh, maybe it's best for me to start early in the morning in Australia, like a 6:00 a.m. for Australia, or I do it

later at night, like I start at 8:00 p.m. or something like that. I personally prefer the morning, so maybe that's where it's at. So, let's go next.

We want our trainer. So, from TRL, import SFTT trainer. So create trainer objects. So SFTT trainer

is going to use transformers trainer under the hood. But SFT uh SFT stands for supervised fine-tuning

equal provide um input and desired output samples create the trainer object. So we can do trainer equals sftt trainer model equals model args equals f

sftt config and then train data set equals data set train eval data set test processing class equals tokenizer. Right? So if we have done this all

correctly, have I missed a comma somewhere? Yeah, there. If we have done this all correctly, we should

be able to train a model after this. Now, it may take Okay, so we get people over the world. So 4:00 a.m. 1:55 in Germany, nearly lunch. Um 7:55 p.m.

in New York. Cool. Hey, Alex. All the way from New York. I've been to New York. I love New York. So, trainer train. Who's ready to train our first

small This is pretty cool. We're two hours into the stream and we're going to train our first um small

language model, Gemma 3 270 mil, to do a specific task. So, we want it to extract food. We've seen how it's performed before. Now, I haven't trained

I've done inference on the Nvidia uh DJX Spark. I haven't trained a model. So, this is going to be my first model training on the Nvidia DJX Spark 2.

So, hopefully it works. You ready? 3 2 1 Let's train a model.

Okay, there we go. We might 3 minutes. Is it going to run in 3 minutes? I can hear the fan going off. I don't know if you'll be able to hear this.

So, that very quiet faint noise is the Nvidia DJX Spark training. That is cool. Okay, so 3 minutes apparently it's going to take. Now we can go watch

this happen. So watch N1 Nvidia SMI. There we go. So GPU util 93%. That's a good amount. So we're using most of the GPU. And we'll watch this train.

Okay. So, one of the things, one of the rules of this machine learning cooking show is that every

time we train a model, we have to do 10 push-ups. Okay, that could be my series is uh machine learning and muscle. Oh, Buunes Aries, hello. Hello from

Argentina. Hello from Turkey. Oh, look at that. Accuracy is already to 60%. 10 push-ups. [snorts]

Okay. After this model is finished, we're going to try it out and then I'm going to take a 5minute break, go to the bathroom, and maybe get a coffee

downstairs. My office is above a cafe. So downstairs is a cafe, so might have to get a coffee for the next part of the stream. But this is cool. We're

watching a a small language model fine tune in live real time. Accuracy has gone up. That's cool.

Loss has gone down. This is how many number of tokens our model has seen. 246,000. So, this is faster than I thought on the DGX Spark. I thought it

was going to be a little bit slower than this, but 3 minutes is not bad.

Yes. Alex says, "Bringing back ML fit with the push-ups." Yeah, that's the new series. New series idea. ML and muscle.

[snorts] Adita hello from India.

What data set are you using? Got late to the stream. Okay, let me show you the data set. Training is almost done by the way. So, I'm just using this

data set here, food extract. Um, which is essentially Can I just view one of these samples? So, these are image captions with uh a JSON of food

information extracted from it. I'll show you what it looks like in actual format. Yeah, here we go. Example

sample. So, sequence. So, that's an image caption and then I've used GPT OSS120B to label it with a prompt and then I'm trying to get this um Gemma 3

270 mil to reproduce this output here. So food or drink um sorry the condensed version. So uh food or drink one. So one or one or zero for is it food

or drink or not because there's some not food captions in there. Tags a uh food items, drink items,

um food advertisement, food packaging, all that sort of stuff. Drinks, foods. Yeah. There we go. 61% accuracy. Finished. We're done. We've just

fine-tuned a small model on an NVIDIA DJX Spark locally in what, two hours, writing every line of code from scratch. That is pretty cool. No

generation generators were used here. So, let's bring in our tags dict. And I'll just link this here so you can see

it in the chat.

Gemma, we're doing uh text only. So um reminder input equals text of image caption output equals um structured data. So we have we should have a

fine-tuned model right now. Um, I'll just show we'll remind ourselves of what the inputs and outputs are. So, this is where our model before we saw

what our model was doing before. It was creating Python code, which is not what we want. It's it's okay,

but not what we want. So this is our example input, an image caption, a top- down view of a rustic wooden board featuring three distinct items, amount

of uh peeled white yams sliced into thick ivory colored rounds is smooth, etc., etc. And then we want to output this. So food or drink one, it gets

one cuz the caption is about food. Tags fi because it's got food items. And then foods is the peeled

white yams, the deep red goju chang, golden brown, and pandas, right? Right? And then there's no drinks. So why would we use this? We could take this

model over a large data set such as data comp 1B. Maybe in the future we work out how to do that, [snorts] right? This is a 1 billion data set of

images and captions. And we might go we want all the food images uh for food image URLs from this data

set. There's 1 billion images. We don't want a billion images. We want 100 million or even less. We want 5 million, but we want only food because

we're building an app called Neutrify, which is the app me and my brother build and we want it to be able to take photos of food and understand what's

in there, right? We want to build this app and then but we only want food images. So, we could run it

across all of these text items here and then extract all the food items from that. So, that's where you would use some sort of model like this. So,

let's go back to where we're at. We've got our trained model. We could have tracked our um information, but now we're going to get Let's look at the

loss curves. Hey, or should we just try it out? I might just do a little cheeky thing here. Copy the

loss curves. There we go. How cool is that? So, training and validation loss prepo. to the training loss. Did we did we Yeah, we used the train and

test data set there. Okay, that's good. So that's basically what we want. The training loss going down. The validation loss is already pretty low as

it is. So our model might be overfitting here. But that's a weird thing in the case of LLMs. kind of

sometimes want them to overfit because the my intuition of this is because the token space is so large. Overfitting in an LLM is not too bad if you

have a specific use case for it. For example, if we just want to extract structured data, overfitting is actually probably okay because it just means

that the model is outputting to me. Again, test it on your own things. But my intuition is going,

well, that's okay because we want it to output the same structure every time. So, let's figure out how we Oh, we can save the model with trainer. We

should have got some checkpoints over here. Um, checkpoint models. There we go. Here's our checkpoints. Model.safe tenses. Checkpoint one, two, three.

We could have just saved the best one, but it's going to be loaded by default. That's okay. So, we

can go back to VS Code. So,

save the model. Checkpoints. And there's our saved model here. So, I believe next we're going to load the model back in and see how it goes. So, let's

let's run it on one sample. Uh, and then I'm going to take a quick break and then I'm going to come back and then we'll see if we can create a little

demo with it. So remind ourselves of where the checkpoint dur is there. And then we can go. Let's go here.

I'm just going to load the model back in. If you got any questions, put them in the chat. Otherwise, and I'll get to that in a second. Otherwise, I'll

just keep steamrolling ahead. Auto model. Do we need to retype all this? Absolutely not. But are we going to do it anyway? Yes, we are. Because that

way we start to get all of this stuff ingrained in our brains. Sets and reps, like just going to the gym every time we train a model, 10 push-ups. Or

even if we don't train a model for a while, every hour we do 10 push-ups. D type equals auto

and then attention.

Okay, let's check our loaded model. This is going to be a loaded version of Gemma 3. Woohoo. Look at that. So, it's just exactly before as the

original model. We've got our loaded model, but now of course the weights will be different. So, this is our original Gemma 3. Same architecture. All

we've done is update the weights behind the scenes to do our token task instead of um its default token

task. Because what is an LLM in general? Reminder, LLMs equal tokens in, tokens out. If you if you get the right tokens in, you'll get the right

tokens out. And of course, you define what the right is for you. So, how can we do this? We do some inference on some samples. Yeah, that's a good

idea. This is this is a notebook I created a couple of months ago as you can see um 10th of December. So,

we're going to load our trained model in and then we're going to do predictions on the test samples. Remember, we've trained on the training sample,

we want to do it on the test now evaluations. So data set train we trained on these and now we want to eval on these data set test. Is everyone up to

date with where we are

and pandas? Argentinas are so delicious. Oh yes. It's getting me hungry. Did it run faster than your 5090? I have a 4090. Um, but I haven't run this

exact workflow on there. We can compare that later on once I've got this notebook working. We can just run it end to end on the 4090. But, uh, I'm

going to see how it runs on the Nvidia DJX Spark first because I mean it's this is such a cool little

thing to have set up on the desk. I had a photo of it before. For those wondering, this is what we're running on. So there's my Mac Mini. Um, I'm

coding on that and then all the training has been done on the Spark as you can see here via SSH. So the Mac is running this display, but all of our

code is running directly on the Nvidia DGX Spark. So such a cool little setup.

Okay, let's test our loaded model. We'll just create a pipeline to begin with.

Loaded model pipeline text generation and then model equals loaded model and then tokenizer equals tokenizer. We can use the same tokenizer cuz we

haven't actually changed that. So I'm going to bring this out here. Comment that. And then let's see loaded model pipeline. How do we run this on the

exact sample we did before? So if we recall, here's our um prompt that we tried to get our model, but

it didn't really work. And here's our random sample that we tried before. And that is a a ground truth output. So if our model can replicate something

like that, we'll be really happy. So, let's grab this. Oh, actually, we can't run it on the samples for the train because we need one from test cuz

our model has seen all of the train samples. So, we want to go random test

and I'll put the test sample in here. Random test sample input prompt pipe. I want the loaded model pipeline, not just pipe. And this should work.

Let's see how it goes. Invalid key. Oh, excuse me. Boom. There we go. There's our first sample running on the Nvidia DJ Spark with our own fine-tuned

model. That's exactly what we want. Now, let's try it on another one. These are This is exactly right. Input. There we go. Okay, that's a great one.

So, let's look at the ground truth and compare it.

This is the ground truth. So, uh, this is test sample by the way. Our model hasn't seen any of these. Okay, so it missed some drink items. So, that's

that's an error of our model. But did it get most of the food items? Soybean oil, garlic, shellot, kafir limes, fish sauce, parsley, pepper, fish. It

didn't double up. It missed a few. Okay. Water and lime juice. Yep, lime juice. It missed that.

So, that's a good example of where we could improve our model in the future with more samples with drink items. I think that's a a good example. Exclamation point. Yep. That's not food or drink.

Oh, this is a big one.

water. So, it missed the water. Peter bread, wheat flour, vegetable shortening. Look at this. Our small tiny little model is performing not perfect,

but it's on par with GPT OSS 12B after just 3 minutes of training. Like, this is only 3 minutes of training and our model is already starting to

output exactly what we'd like.

Okay, it got that one wrong. That should be a not food item. Flavored cream food item. Yes. Nice. Drink it. So, I got the wrong tags there. Okay.

Okay, so we're seeing some errors in our model, but it's not too far off a very powerful model. So, what do we want to do now? Well, we've seen our

model. Let's let's make it a little bit um I guess this is only a thousand samples, too, by the way. So,

that's not even really that big of a data set. Let's make it a bit nicer of an output. as in I think we want in the next stream probably we'll create

a uh a demo of running our fine tune LLM and then we'll also get it to um hugging face. So I think that'll be a good place to do next. So what have we

done here? Supervised fine tuning. Done. Done. demo app ready to use. Okay, so that's actually

let's do that tomorrow. Tomorrow's goals. So this is December or January, sorry, January 8th,

Jan 7th. demo app ready to use bonus or and then we want to go evals. How does our model compare to GPT OSS 12B? Okay. So I think we're right here.

Next create small demo of using the model. Um, and then we also want to save the model to hugging face three. Create a sharable demo to gradio eg text

in model extracted outputs out. And then next is eval. How does our model perform compared to GPT OSS 12B?

Oh, the size of the train model. Let's get let's get uh how do we chat here?

create me a function to count the number of parameters in a PyTorch model both um trainable and not trainable as well as total parameters. So, we'll

get Gemini 3 flash to make a small little function here.

There we go. There's our number of parameters in our model is 270 mil. So compared to GPTO OSS which is 120 billion I'm going to write here up to

here. So our model is 270 mil parameters. GPT OSS 12B is 120B parameters. So, let's figure out the ratio. I think it's about 400x smaller. So, 120 1 2

3 1 2 3 1 2 3 Or actually 1 2 3 1 2 3 1 2 3 That's 120 billion 9 zeros divided by 270. 1 2 3 1 2 3. Yeah. So our model is 444 times smaller.

How cool is that? 440 times smaller. So yo Daniel, how are you? Hey Sjed, nice to see you. Thanks for tuning in. Uh Su Hail says, "Hi. So we take the

base model then fine-tune train it with our own data set. So we get specific model. We can ask anything to that model around the data set context.

Yes, that's exactly right. So no matter our model is fine-tuned for a very specific task. So as we'll

see in the demo on the next video um which will be tomorrow, tomorrow's live stream. No matter what we input to our model, we've fine-tuned it to only

output a specific format. So that's the power of uh fine-tuning is now in this case, there we go. Have a look at this. We want our model to extract

food items and drinks from a food from a image caption. So we've got lint extra dark chocolate

packaging link uh double shot latte. So there we go. Double shot latte. Dark chocolate is the food lint difference. That's incorrect, but that's I

guess we could work with that. Russian caravan, which is a different type of um a tea. a tea, drinkable tea. So that's very cool. And this is only off

a thousand samples, right? We could always upgrade this with more samples. Chocolate bars, crunchy,

dairy, milk, caramel, picnic, moa, boost, turkish delight, cherry ripe. There we go. Um whereas this is gibberish. It's not about food or drink. And

so therefore, we have no tags, foods, or drinks. So thank you everyone for tuning into the stream. We have officially loaded a small language model on

the Nvidia DJX Spark. Let's get our little image up here. From scratch, we have gone in two hours or

thereabouts. We have gone from with a lot of talking in between gone from coding on a Mac Mini to coding on the Nvidia DJX Spark. We've loaded Gemma 3

270 mil and a data set with a,000 samples and we've fine-tuned that Gemma 3 model. We saw that it didn't work too well. um the default weights even

when we tried to prompt it. But now it is working quite well on our target uh problem, right? We

wanted to extract food and drink items from any given input sequence of text. And so that's LMS in a nutshell. Tokens in tokens out. In the next

stream, we're going to create our demo of our model. We're going to save it to HuggingFace. Um and then we're going to share the demo to Graddio so

that you could try it as well via a link. Right. And then we might even maybe that's a third stream is to

do proper evals. So um see how our model compares uh to a model like GPT OSS um 12B which is where our labels came from. So um stay tuned. If you want

to see anything otherwise in particular in a stream you can just email me here um Daniel at my website. So you'll be able to find my website. Um, but

that's my email if you want to see anything in particular. Otherwise, leave a comment. I'll leave

all these streams on YouTube. I'm trying to get back into streaming a bit more. So, I'm keen to hear all your ideas. But, thank you everyone for

tuning in all over the world. I appreciate it. And I'll see you in the next stream. Hey,

Local LLM fine-tuning on the NVIDIA DGX Spark - Part 1 · 全文文字稿