AI Dev 26 x SF | Ara Khan: Evals Are Broken Use Them Anyway (Transcript)

I'm Ara. I'm going to be talking about evals, specifically AI evals like coding agent evals. And I'm going to talk about how they're broken and how you can still use them anyway.

Before I start, I just want to say one thing. It boggles my mind: when you're working on something and you're cooped up in a room for so long, in your head you think that no one cares. And then you talk about it and there are so many people here. It makes me very happy. It makes you feel like it means something, and I'm very thankful that you all came. So thank you so much.

All right, so speaking of evals, my fundamental claim in this entire conversation is that people are wrong about evals. Most people claim to know a lot about evals, and they'll say things, but they're wrong. So how do we go from being wrong about evals to being right about evals? To be able to do that, I want you to be able to build them, I want you to be able to interpret them, and I want you to be able to use them in your agent flows.

Your agent flows could be anything. It could be a coding agent. It could be a shopping agent. It could be something very trivial, or it could be something super complex, like a production workflow that's used by millions of people. In all those cases, you can learn from evals. And regardless of which direction you go in your agent-building experience, I think evals are one of the most critical aspects, from my years of working with AI agents. So let's reverse it.

How do we know that people are wrong about them? Why do I claim that people are wrong about them? There are two camps of wrong.

The first camp of wrong is the objective metrics camp. What does that mean? The objective metrics camp is basically people who take everything at face value. If you look at Artificial Analysis, if you look at Epoch AI, all these companies are doing great work, and they come up with these objective numbers. Whenever a model comes out, they post the benchmark scores, and your whole Twitter feed is filled with this score on that eval, an array of information coming at you. They're supposed to be real numbers, so you're supposed to believe them, because that's the number that came out. I don't think that's the answer. I don't think these exact numbers tell you how precisely one model is better than another. To be very precise, you would notice Sonnet 4.6 at 52, and then you'll notice a few other models quite close to it. It's very difficult to claim that the models that are close to each other in score here are actually equally good, because they're not. If you spent half an hour using any of these models, you'd know very quickly that these scores don't necessarily mean much.

This was a tweet from Francis, where he made the claim that Meta came out with a new model and it was a huge disappointment because it was benchmaxxed. Tons and tons of labs these days are playing this game where you just get the highest score on the evals. It doesn't matter how good the model is. It will get the tweets in, it will get the clout in, and then you pull people in, and then who knows, maybe the model's good, maybe not. So that's one end of the spectrum.

What about the other end of the spectrum? The other end of the spectrum is taste. The taste-is-king people basically don't believe in the numbers at all. They think these numbers are completely pointless, just made up. Their argument is basically: it's all about vibes, man. It doesn't matter what the numbers say. If you ask them, "Why do you like Claude models?", they'll say, "Oh, I like talking to her. She sounds nice." They'll talk about an AI model like it's an actual person. At this point I don't even know where to start.

I think both of those groups are wrong, and I think the truth is somewhere in the middle: evals are not the be-all and end-all, and they're not completely useless. There are right ways to use them and there are wrong ways to use them.

So the purpose of this conversation is to take you through a few levels, and as I walk you through these levels, you'll have a much better understanding of how to work with evals. Level one, which is very rudimentary: how can you use other people's evals? If they come out from the model labs, from Cursor, from Claude Code, whatever, how do you interpret them? Level two: how do you use evals to improve your own agents? And level three: if you have a lot of money and a lot of time, you can even build your own evals. So that's basically the point of this conversation here today.

So instead of giving you hard rules for how to interpret evals, I'm just going to give you some heuristics. If you follow these heuristics, I think you'll have a much better understanding of somebody else's evals. When you get these numbers, you'll be much more confident about what they are and what they mean for you.

Heuristic number one: don't ever believe model lab evals. Just don't. Whenever labs come out with eval numbers for Mythos Preview or GPT-5.5 or whatever, they're great, they're probably accurate, and I'm sure those models are very decent. I'm just saying, don't take those numbers as the word of God. You have to use your own discernment. They're close approximations, but they're not perfect. There's this tweet, which is very profound, where this guy asks: has any engineer actually made a decision based on a benchmark result? Basically the claim is that a lot of people routinely dismiss eval results. They'll run evals, they'll get the numbers, and then they'll dismiss them anyway. A lot of the time, real AI researchers take them with a grain of salt, and I think that's the right way to think about evals.

Heuristic number two: you've got to stay current, but you don't have to be the earliest adopter. A lot of you work for very big companies, and this matters more for you than for the rest of us. So what do I mean by this? This is a chart from Epoch AI which shows how well the models have been scoring over the last couple of years. Well, from 2024 to now is two years, but in AI that's like 27 years. It moves so fast. What you'll notice is that if you look at the SOTA score, every couple of months the SOTA model changes, and it changes very quickly. If I time-traveled to a couple of months ago, Sonnet 4.6 or Opus was the best model. Not so much anymore, right? If you keep playing this game of "I want the best thing all the time," the mental bandwidth you'll spend trying to always be on top is just not worth it. I think what you want to do is let new models come out, let new things come out, wait a couple of weeks, and say, "Okay, let the dust settle." That's when you try your own thing. There are people like me who spend all their time trying to find out what the new thing is, what the best frontier thing is at any point in time. That's what I do for a living, so sure, I'll do that. But I don't think you should. I think you should stay current, but you don't necessarily have to pick the most recent thing.

The third heuristic, which is a very important one, is about the problem you're working on. Personally, because I work at Cline, I work on the problem of coding agents. Coding agents have very specific kinds of evals: Terminal-Bench, some evolved versions like Frontier SWE, and some other coding benchmarks.

Those are very specific and pertinent to me. Maybe you work on a different problem. Maybe you work for some kind of shopping company, or for an infrastructure company, and the evals applicable to you are very different. When a lot of these model labs come up with a score, it's on generic, general-purpose evals, which may not necessarily apply to you. As a problem solver, you should always look for evals that match your problem, or get as close as you can. I think that's a much better measure.

To give a very precise example, SWE-bench was a very standard eval marker for coding agents for so long, and then OpenAI came along and said, yeah, this benchmark is so saturated we can't use it anymore. If you've been in this space, you'd know that this eval got saturated so hard that when model labs release models now, they don't even mention the score. That's how saturated some of these evals are, and they're not applicable to your problems. Okay, so that was the first part.

The first part was figuring out the heuristics you can use to understand and interpret other people's evals. But how do you use evals to improve your own agent? This is where I come in with my own experience of working at Cline on this very hard problem, which is a problem of both engineering and philosophy.

The way you want to think about this is that AI has such a high variance of response. It's not very deterministic, and the answer space is basically infinite, right? If you let an agent run for, you know, 10 minutes, at every step of the way it could take a different turn. If you let the tree branch out that way, it's an infinite space of things an agent could do. So it's very hard to measure whether an agent is actually doing the thing you wanted it to do. That's why I think of evals as partly an engineering problem, but also a philosophy problem.

We've been working with coding agents for a couple of years, and we found last year that there were all these evals, but they were so different from day-to-day problems that we just didn't bother using them. I talked to OpenAI, I talked to Anthropic last year, and they were basically like, "Yeah, evals are great, but bro, it's just about the vibes." Part of the reason at the time was that the evals were measuring something completely wrong. To give a very precise example, a lot of evals would have things like "implement the Fibonacci sequence, implement the unit test." They'd have these algorithms problems that you solved in your sophomore year of university, and that doesn't apply at all to your real-world coding experience.

So, with time, Cline wanted to build our own evals, which would be more applicable, more accurate, more pertinent to real-world software problems. As we were working on them, we found this incredible group from Stanford and the Laude Institute, who simultaneously came out with a benchmark called Terminal-Bench. The best part of Terminal-Bench was that it had this small set of problems, well, 89 problems, which are very applicable to real-world software engineering tasks. These could include database issues, race conditions, front-end bugs, just real, actual problems that real software engineers such as yourselves face day to day. We realized halfway through building our own evals that, hey, they've built this great ecosystem with a good set of problems. It's easy to run them, it's easy to replicate them, and it's easy to make these evals work with any of the coding agents, whether it's Codex, Claude Code, Cline, whatever. So we basically adopted their evals.

Now, the hardest part. When you're measuring an AI system, if you measure something very trivial, like how many Rs there are in "strawberry," or how many toes a cat has, or what the weather is somewhere, those things have a somewhat deterministic answer, and they're single-turn. Where agents diverge is that if you ask an agent, "Hey, write an MCP server to connect to my app using this auth," the agent will do a ton of different things. It will use a web search tool. Maybe it will install a Python library. Maybe it will access some kind of sandbox. Maybe it will read a few files, maybe it will edit a few files. The whole process could take 5 to 10 minutes. So what you want to do in this kind of eval is let all those steps happen. You really let the agent run for 5, 10, 20, 30, 40 minutes and do the whole thing. Then, once it's done, there are deterministic unit tests which check: did it make the file? Does it run? Does it pass the tests?

That's what Terminal-Bench does. These are agentic evals, which take a while.
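A minimal sketch of that shape of eval, assuming a hypothetical `agent` callable and made-up check names; this is not Terminal-Bench's or Harbor's actual API. The agent gets a fresh workspace and runs for however long it runs, and only afterwards do deterministic checks grade what it left behind.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path
from typing import Callable

# Hypothetical types for illustration: an "agent" is any callable that is
# handed a workspace directory and may take many steps inside it.
Agent = Callable[[Path], None]
Check = Callable[[Path], bool]


def file_was_created(workspace: Path) -> bool:
    """Did the agent make the file?"""
    return (workspace / "solution.py").is_file()


def file_runs(workspace: Path) -> bool:
    """Does it run? Executes the file with the current interpreter."""
    result = subprocess.run(
        [sys.executable, "solution.py"],
        cwd=workspace,
        capture_output=True,
        timeout=60,
    )
    return result.returncode == 0


def run_task(agent: Agent, checks: dict[str, Check]) -> dict:
    """Let the agent finish first, then grade with deterministic checks."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        started = time.monotonic()
        agent(workspace)  # could be many minutes of tool calls in a real run
        elapsed = time.monotonic() - started
        results = {name: check(workspace) for name, check in checks.items()}
    return {"passed": all(results.values()), "checks": results, "seconds": elapsed}


def toy_agent(workspace: Path) -> None:
    """Stand-in for a real agent: writes a trivially valid script."""
    (workspace / "solution.py").write_text("print('ok')\n")


if __name__ == "__main__":
    checks = {"created": file_was_created, "runs": file_runs}
    print(run_task(toy_agent, checks))
```

The point of the design is that grading never looks at how the agent got there, only at the final state of the workspace, which is what keeps the score deterministic even when the agent's path is not.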

Some of those problems easily take 30 to 45 minutes of the agent continuously running, churning through different attempts to solve the problem, and only once it's all done does it grade the problem. So this is Terminal-Bench, and I'm very thankful to the team. Shout out to them.

So when you have an evaluation suite, what do you learn from it? I'm just talking about my thing, but how can you take something from this? There are a couple of things you want to track when you're working on agentic evals. How many turns is it taking? How many tool calls is it making? How many tokens is it using? How long does the whole run take? Sometimes there are models which perform very well but take 45 minutes because the inference is so slow, right? As you tweak these parameters of what exactly you're looking for, and you run it on different models, I think you get much closer to: this is what I really want, this is what I'm okay with, and this is how much money I'm willing to spend for this much quality.

Once you track all these things, I think that's what you really need. As much as I would love for everyone to use the most expensive frontier model for every problem, I don't think that's how the world works. We don't have an infinite amount of money. Sometimes it just makes more sense to use something like DeepSeek V4 Flash, which is like a 150th of the cost of another model. If you track these things in your evals, they'll tell you how to figure out what to choose.

Specifically for Terminal-Bench, the way the evals work is this. I told you that there were 89 tasks, right? These 89 tasks could be caching bugs, latency issues, regex bugs, front-end bugs, race conditions, whatever, or implementing things. What Terminal-Bench does with Harbor is that, for each problem you want to solve, you make an isolated, containerized environment where you set up the whole thing. You set up the machine, you set up the environment, you install the dependencies, you install everything you need on that specific machine in an isolated container, and then you run the agent on it. So any agent you run has the same starting point: it already has whatever version of Python and JavaScript you needed, everything is working, and from that point on the agent starts.

Harbor is also tied to the Terminal-Bench team. The benefit of using it is this: back in the day, evals would run sequentially, one after the other, and it would take six or seven hours for the evals to finish. The problems would also interfere with each other's code, environments, and systems. What Harbor does is let you split all of them out into different environments and then run the evals on them. So when you run your own evals, when you build your own evals, I would strongly encourage you to really containerize them and really isolate them from each other. That way they won't interfere with each other. For us, we use Modal. Modal is the infrastructure layer that helps us build these parallelized, containerized environments, so that whenever our eval tasks run, they run in different containers, the way I've shown here. So shout out to Modal.

All right, so what's the process? What do you do here? The process is very simple. You run the eval with your agent, a coding agent or any other kind of agent, you get an original score, and you figure out what went wrong. I'll give you a very precise example. Sometimes when you use, say, Claude Code or Codex, it will try to read a file or install something, and it will just go in circles: it can't install this, it can't read this, or it hits the same error and keeps going in circles of "I can't run this command, I can't do this." I'm sure you've experienced this before. When you run evals at a larger scale, those problems become very obvious.
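One way to make "going in circles" visible at scale is to scan each run's tool-call trace for the same failing call repeated back to back. The sketch below assumes a hypothetical trace format (a list of `ToolCall` records) and an arbitrary threshold of three repeats; real agent logs will look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical trace record; real harness logs will differ."""
    tool: str
    args: str
    ok: bool


def is_going_in_circles(trace: list[ToolCall], max_repeats: int = 3) -> bool:
    """True if the same failing call occurs max_repeats times in a row."""
    streak = 0
    previous = None
    for call in trace:
        if not call.ok and call == previous:
            streak += 1  # identical failure repeated
        else:
            streak = 0 if call.ok else 1  # a new failure starts a new streak
        previous = call
        if streak >= max_repeats:
            return True
    return False


def looping_tasks(traces: dict[str, list[ToolCall]]) -> list[str]:
    """Names of tasks whose trace shows a repeated-failure loop."""
    return sorted(name for name, trace in traces.items() if is_going_in_circles(trace))


if __name__ == "__main__":
    stuck = [ToolCall("read_file", "config.yaml", ok=False)] * 4
    fine = [
        ToolCall("read_file", "config.yaml", ok=False),
        ToolCall("list_dir", ".", ok=True),
        ToolCall("read_file", "conf/config.yaml", ok=True),
    ]
    print(looping_tasks({"task-a": stuck, "task-b": fine}))
```

Run over every task in a suite, the length of the returned list is the kind of aggregate signal that is hard to see from a single interactive session.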

So what would happen is that, if there are 89 tasks, on 20 of the tasks the model just went in complete circles and did nothing. It was trying to read a file and couldn't read it, trying to edit a file and couldn't edit it, or it was having installation issues. When you run the evals, you get this portfolio allocation of your failures. If you're failing on reading files, if you're failing on inference, if you're failing on something else, you're able to figure out: okay, what are the broad buckets that my successes and failures fall into? Once you figure out these large buckets of successes and failures, you can iteratively improve and point to the exact, specific problem.

One of the examples we found in our testing was that sometimes we would have a model that just doesn't work well with editing certain files, so we would change the edit-file tool. Sometimes it couldn't use the web browser very well, so we'd change that tool. I think having your problems reflected in an aggregate way is a much easier way to simulate what the user experience would be, because how else are you going to figure out what went wrong?

There are actually three things you're testing. What are they? You're testing the model, obviously, whether or not it's good. But you're testing the harness as well. The harness is your agent scaffolding. When you write the agent, it's possible that the model is really good but you just wrote the harness the wrong way. The best illustration is that when a new model from Anthropic comes out, I guarantee you'll have noticed that it sometimes works better in Claude Code than in, say, Droid or Cursor. If it's the same model, why is it so much better in Claude Code than in some other agent? That's basically what you're testing here: sometimes it's a great model, and your harness just hasn't done it justice. And the last thing you're testing is the problem set, because you could be solving a stupid problem that just doesn't apply to what you're trying to evaluate. You need all three to be in alignment, and you need to be very honest with yourself: this is what I'm trying to do, and this is what works for me.

In our case it went something like this. We ran the eval the first time and got an original score. Then we made some changes to the CPU and memory layer, we raised some timeouts, we improved the thinking behavior, and as we made those changes iteratively, our scores just improved. Eventually we were able to beat Claude Code on the Opus 4.5 evals, and what we found over time is that we're able to beat Claude Code on other evals as well, because we figured out some tiny knobs that they couldn't figure out or didn't optimize for. So if you're working on an interesting problem, you can just say, "Hey, let me figure out what I'm doing, let me figure out what my competitor is doing, let me build some great evals, and I'm just going to beat them. I'm going to do it so much better than them."

There are three zones of improvement. Zone one is the most obvious flaws: what is obviously wrong with your agent? It could be that your read-file tool is wrong. It could be that your agent turns are broken. Maybe your checkpoints are broken. Those things tell you that your agent is broken on a fundamental level. Once you fix those basic things, your agent starts to look like it's actually working, actually functioning. That's a good zone one when you're working with evals, because you want to fix the obvious flaws.

Zone two is where I think you do the real hill climbing. At that point the question is: how do we actually figure out the philosophical aspects of how to make my agent better? A lot of the time you'll find there's all kinds of stuff in the prompt, in the tool call definitions, in the retry logic or whatever, that your agent is just not handling well. To some extent it's a fault of your prompt engineering. Maybe you're using too many tools. Maybe you're using too few tools. Maybe you're using the wrong tools. I think that's the real gift of evals: instead of sitting around pontificating philosophically about whether or not your agent is good, you can make very nuanced judgments about whether it's actually good by giving it real problems to solve.

And then zone three is the danger zone. The reason I call it the danger zone is that as soon as you give people a metric, a number to optimize for, all they do is optimize for that number. They don't really care what the problem at hand is. They'll overfit. They'll change the prompt in such a way that they only pass this one specific task. They'll add weird skills and stuff. That's not nice. So you want to be cautious that you're improving without overfitting or doing something wrong.

So if I could give you a final word, it's this: find a benchmark that works for you. Build some evals if you can. And you should hill climb. Honestly, hill climbing just means improving your score on the eval.
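One common guard against that danger zone, which the talk does not describe and is offered here only as an assumption, is to hold some tasks out of the hill-climbing loop and watch the gap between tuned and held-out pass rates. The task names, the split fraction, and the 0.15 gap threshold below are all made up for illustration.

```python
import random


def split_tasks(task_ids: list[str], holdout_fraction: float = 0.3, seed: int = 0):
    """Deterministically split tasks into a tuning set and a held-out set."""
    shuffled = sorted(task_ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * holdout_fraction)
    return shuffled[cut:], shuffled[:cut]  # (tuning, held_out)


def pass_rate(results: dict[str, bool], task_ids: list[str]) -> float:
    """Fraction of the given tasks that passed."""
    if not task_ids:
        raise ValueError("need at least one task to compute a pass rate")
    return sum(results[t] for t in task_ids) / len(task_ids)


def overfit_gap(results: dict[str, bool], tuning: list[str], held_out: list[str]) -> float:
    """Positive gap: the agent does better on tasks it was tuned against."""
    return pass_rate(results, tuning) - pass_rate(results, held_out)


if __name__ == "__main__":
    tasks = [f"task-{i:02d}" for i in range(10)]
    tuning, held_out = split_tasks(tasks)
    # Hypothetical outcome: everything tuned against passes, held-out does not.
    results = {t: (t in tuning) for t in tasks}
    gap = overfit_gap(results, tuning, held_out)
    if gap > 0.15:  # arbitrary threshold for this sketch
        print(f"possible overfitting: tuned minus held-out pass rate = {gap:.2f}")
```

Prompt or tool changes are only ever judged against the tuning set; the held-out set is run occasionally, and a widening gap suggests the changes are fitting specific tasks rather than making the agent better.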

And then, even if you get a good score, you always need to make sure you're passing the vibe check. You need to know on some emotional level that, yes, my agent makes sense. It's not just about benchmarks. Is this a sensible agent? Is it actually solving our problems? And you've got to start somewhere. I think this is a great discipline. We spent a couple of months working on it at Cline, and we're still working on it. Every time a new model comes out, we run the evals and we improve the experience with the new model. We're using a lot of open source models now, so we're trying to support them and improve on the evals with them. I think we never would have figured out all the beautiful nuances of these open source models, which are incredible and much cheaper, had we not run evals, because we would have completely ignored them and just worked on vibes.
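The cost side of this, where tracking quality against spend lets a cheaper model win on merit, can be sketched as picking the cheapest model that clears a quality and latency bar. The record fields, model names, and numbers below are invented placeholders, not measurements from the talk; only the 89-task suite size comes from it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalSummary:
    """Aggregate of one model's run over an eval suite (hypothetical fields)."""
    model: str
    tasks_passed: int
    tasks_total: int
    total_cost_usd: float
    median_minutes: float

    @property
    def pass_rate(self) -> float:
        return self.tasks_passed / self.tasks_total


def cheapest_good_enough(
    summaries: list[EvalSummary],
    min_pass_rate: float,
    max_minutes: float,
) -> Optional[EvalSummary]:
    """Cheapest model meeting the quality and latency bars, or None."""
    eligible = [
        s for s in summaries
        if s.pass_rate >= min_pass_rate and s.median_minutes <= max_minutes
    ]
    return min(eligible, key=lambda s: s.total_cost_usd, default=None)


if __name__ == "__main__":
    # Placeholder numbers purely for illustration.
    runs = [
        EvalSummary("frontier-model", 60, 89, 400.0, 12.0),
        EvalSummary("open-model", 52, 89, 9.0, 15.0),
        EvalSummary("slow-model", 62, 89, 120.0, 45.0),
    ]
    choice = cheapest_good_enough(runs, min_pass_rate=0.55, max_minutes=30.0)
    print(choice.model if choice else "nothing meets the bar")
```

The thresholds are where the "this is what I'm okay with" judgment lives; the eval numbers only make that judgment explicit.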
