How to Train Your Agent: Building Reliable Agents with RL — Kyle Corbitt, OpenPipe

Watch on YouTube

Category: AI Development

Tags: Agent, AI, Email, Learning, Reinforcement

Entities: AI Engineer World's Fair, ART E, Enron, Gemini 2.5 Pro, OpenAI



Transcript

00:00

[Music] Hey everyone, glad you're all here.

00:17

This is the reasoning and reinforcement learning track on the afternoon of the last day of the AI Engineer World's Fair. Glad you're all here and sharing it with us.

Today, what I'm going to talk about is a very specific case study that we did. For this case study, I'm going to talk about

00:32

lessons learned very concretely: what did and didn't work, and how we were able to build an agent that worked well with reinforcement learning.

Everything I'm talking about in this presentation comes from an open-source codebase that we built. We wanted to share these learnings, and I'll share that link with you at the end as well, for those of you who

00:48

want to replicate what we did. So what is the project we're going to be talking about? It's a project called ART E.

It is a natural-language assistant that helps you answer questions from your email inbox. I'll give you an example of what we're talking about here.

01:05

Let's say you want to ask a question; in this case our example is "When is Sherry's move to Portland targeted for?" You would ask this question to the assistant.

It then goes and searches your inbox. It's got several tools: a search tool, a read-email tool, and then it can actually answer the final question. You can see, if you look here,

01:21

what's going on behind the scenes. This is important so you get a sense of how this agent works as we're talking through how we built it and how we made it work.

Hopefully that helps keep the conversation grounded in a specific task. So anyway, you see the agent searching for certain keywords. It gets

01:36

those messages back, reads one of them, and answers the question. That's what it does. Okay.
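
(For illustration only: a minimal sketch of what that tool surface might look like. The function names, the SQLite full-text-search backing, and the column layout are assumptions for this sketch, not the actual ART E code.)

```python
# Hypothetical sketch of the agent's three tools; not the ART E source.
import sqlite3

DB = sqlite3.connect("enron_inbox.db")  # assumes emails loaded into an FTS5 table named `emails`

def search_inbox(keywords: str, limit: int = 10) -> list[dict]:
    """Keyword search over the inbox; returns message ids, subjects, and snippets."""
    rows = DB.execute(
        "SELECT id, subject, snippet(emails, 3, '', '', '...', 12) "
        "FROM emails WHERE emails MATCH ? LIMIT ?",
        (keywords, limit),
    ).fetchall()
    return [{"id": r[0], "subject": r[1], "snippet": r[2]} for r in rows]

def read_email(message_id: str) -> str:
    """Return the full body of one email."""
    (body,) = DB.execute("SELECT body FROM emails WHERE id = ?", (message_id,)).fetchone()
    return body

def return_answer(answer: str) -> str:
    """Terminal tool: the agent calls this with its final answer (or 'I don't know')."""
    return answer
```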

So, a question: once we've decided this is the task we're trying to solve, why would you use reinforcement learning for this specifically? And the answer

01:53

is: to start with, you shouldn't. In fact, to start off with, we did not.

For the first version of this agent, once we decided we wanted to build it, we didn't use any reinforcement learning at all. We built it purely on prompted models, and this is the first lesson from this talk that I want to share: I would generally always recommend

02:09

starting by getting the best performance you can with a prompted model before going to any training, including reinforcement learning. There are a few different reasons to do that, three specifically.

The first one is just working out the bugs in your environment. Maybe your tools aren't implemented

02:24

properly, maybe they don't have access to the data you think they do. We find this happens a lot, and it's a lot less frustrating to debug that separately from debugging your training loop.

So you want to make sure you can get at least some kind of performance before you start training. Second of all, you

02:40

may find, as you're trying to improve performance with these prompted models, that you can get it working really well, and that's great. That means you don't need to train anything, which saves you a lot of time.

There's a third reason as well that I'll share, which is basically that

02:55

once you've gone to that effort and done your best to get the best prompted baselines you possibly can, then if you find those baselines can't get you where you need to go and you're able to surpass them with reinforcement learning, it feels great.

You get to gloat and say, "Yes, I was able to beat the frontier models on my task."

03:12

I highly recommend it; it feels good. You can post on X about it; there are nice graphs and everything.

So, this is what it looks like when everything goes right. This is an example of a training run for the ART E model I'm going to be talking about. You

03:27

can see there are lines for each of the prompted-model baselines we've got: o3, o4-mini, Gemini 2.5 Pro, and GPT-4.1.

You can see those have a certain level of performance, and then there's this moving line. This is

03:43

the model that we trained. You can see it actually starts out significantly worse than these other models.

That's because we started from Qwen 2.5, the 14-billion-parameter one. It's a relatively small, relatively weak model, so it was doing much worse than these initially. But you can see, as training

03:59

progresses, initially it's learning the right way to do tool calls. There's a very sharp bump as it figures out the basic stuff, then a more gradual climb, until eventually it's able to significantly outperform any of the prompted models on this task. And this is what

04:15

you're looking for in the ideal case when everything works. This is what you're hoping to achieve.

This is another view of the same data we were just looking at. I wanted to highlight it this way because it's important to realize something. On the last graph it looked like the

04:31

lines sort of asymptote out pretty close together. That's because they're getting near 100%.

But you can see, for example, that our best prompted model here, o3, is at 90% accuracy, and with our RL model we're able to get up to 96%. So one way to think about that

04:47

is that about 60% of the errors o3 was making are actually solved by our model (o3's 10% error rate drops to 4%). That's quite a large difference, and we find it can be very important for the user experience of someone using one of these: if you're making half as many errors, that

05:04

can make the product much stronger. So that's where we got to on accuracy.

There are a couple of other metrics we find are often very important. The trade-off between them is very task-dependent, but they matter in many cases. Cost, obviously, is a big one.

05:22

For this email agentic harness, we benchmarked the cost on o3, o4-mini, and our model. If you wanted to do a thousand searches using o3, that's going to cost $55, which is a lot. I think for most use cases that

05:37

probably would be cost-prohibitive just from a unit-economics point of view. On o4-mini we're down to $8, but that's still quite expensive.

Then we drop another order of magnitude by moving to the smaller Qwen 2.5 14B. Again, this is just driven by it being a much smaller model, so it's much cheaper to run. But we're still able to get

05:53

very good performance because we've specialized it on our task.

Beyond cost and accuracy, the third metric that often comes up is latency. Certainly for anything with voice, but really if there's any real-time human interaction with the task, latency is going to

06:09

matter a lot. On this task we were able to get significantly better latency.

There are a number of different ways we achieved this, which I'll go into in more detail later. One is simply that moving to a smaller model helps: there's less loading from memory and fewer matrix multiplies.

06:24

You're just able to get tokens out faster. We were also able to train this model to take fewer turns going back and forth with the email inbox; we trained it to be more efficient with its queries, and I'll get to that in a moment. And so that

06:39

leads to lower latency. There's actually a third thing, which we didn't apply here but can help a lot with these smaller models, called speculative decoding.

That's something you can do on large or small models, but it generally works better on smaller task-specific models because you get higher acceptance rates on your speculator. Basically, there are

06:55

lots of reasons why smaller models work better. Okay.

So the next question, for those of you who haven't done this yet, is: what is the effort required to actually achieve these results? If you'd asked me this question a year ago, I would have said, "Hey, you should really

07:10

only be doing this if you're a big company willing to put months of work into a project." I think that's changing. I honestly do.

In this case, this training run cost us about $80 in GPU time. It did take about a week of engineering time to build, and the

07:25

caveat is that this was with an engineer who was familiar with the domain and had quite a lot of experience with machine learning and RL. But I actually expect that as we collectively figure out the right patterns as an industry, this will keep dropping. And I expect the payback period to get a return on

07:42

investment from these specialized models is going to continue falling as well. Part of the reason I wanted to give this talk is to distribute the knowledge we learned and hopefully move faster toward that world where this is just a thing everyone knows how to do, and it's very

07:57

easy and very fast. So that's what we'll be talking about for the rest of the time: more of the lessons we learned.

Okay. When you are using RL to train an agent, or really using RL for anything else, I find that consistently, across the different

08:14

problems we look at, there are two hard problems that come up every single time.

The first is figuring out a realistic environment. If you're training an agent, you need to be training it with realistic data, realistic inputs and outputs, tools available, everything

08:30

like that, matching how it's going to be used in production. Because if you don't, it's going to optimize for the wrong thing, and you won't get the results you want when you deploy it.

The second thing, which is sometimes hard and sometimes isn't (this one is a little task-dependent), is getting the right reward function. A reward

08:45

function just means that when your agent has gone through and, in this case, given an answer about my email, you have to have some way of knowing whether it did a good job or a bad job.

That's the reward function: it's how you decide if the result is good or bad. Depending on the domain,

09:00

sometimes that's really easy. I don't know if Nathan's here, he's going to be talking next, but he and his team put together this thing called RLVR, where in some verifiable domains it's actually very easy to define a reward.

But not all domains are like that; oftentimes it is kind of hard. So it's somewhat

09:17

task-dependent. I'm going to go through how we solved these problems specifically with ART E.

Okay, first one: the realistic environment. For our ART E task, what is the environment we need? What is the environment this agent's going to be operating in? Well, it needs these tools available. It needs to be able to query an email inbox and get emails back

09:34

that look realistic. The inbox should be large, because that's what most email inboxes are like. The emails in it should be diverse, and they have to look like real emails. That could be kind of hard, because you can't just go ask a thousand people to give you their personal emails to

09:50

train on. Luckily, in this case we were able to solve this with the help of a company that has contributed a lot to the open-data ecosystem generally.

It's quite an iconic company, perhaps I would call it a historic company. I'm of course talking about Enron.

10:07

I'm hearing some laughter. Anyway, Enron was a financialized energy company in the '90s and 2000s that committed massive fraud and ended up getting shut down by the Department of Justice.

As part of that process, the court cases they were going through, a dump of about

10:24

500,000 of their emails was released to the public as part of the discovery process. That's great for things like this, and it's what we used as our environment for the email inboxes.
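
(A hedged sketch of how one might turn the public Enron dump into a searchable inbox for this environment. The CSV layout and column names here are assumptions; the real corpus ships as raw maildir files and needs parsing first.)

```python
# Sketch: build a searchable inbox table from a pre-parsed Enron email CSV (hypothetical schema).
import csv
import sqlite3

def build_inbox(csv_path: str, db_path: str = "enron_inbox.db") -> None:
    db = sqlite3.connect(db_path)
    # FTS5 virtual table so the agent's search tool can do keyword queries.
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS emails USING fts5(id, sender, subject, body, sent_date)"
    )
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            db.execute(
                "INSERT INTO emails VALUES (?, ?, ?, ?, ?)",
                (row["message_id"], row["sender"], row["subject"], row["body"], row["date"]),
            )
    db.commit()
```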

All right. So now we've got realistic email inboxes with tens of thousands of real emails going back and forth. Now we have to

10:40

design our reward function. As we're asking the agent questions and it's giving us answers, we have to know whether the answer is correct or not, so we can reward it when it gets the answer right and it can learn to do that better. There are different ways to do this, and this

10:55

part is very task-dependent. The way we went about it in this case was to basically turn it into more of a verifiable problem.

The way we did that was to take our email inbox and sort of invert the problem. We grabbed batches of 20 emails at

11:11

a time from the inbox and gave them to Gemini 2.5 Pro and said, "Hey, given this set of emails, give us a few questions that a user might realistically ask whose answers are found in these emails." So Gemini generated the questions, it generated the answers, and of

11:26

course the source emails they came from. There were some extra steps on top of that: a lot of the questions it came up with looked a little unrealistic, so we had a separate filtering step where we found the subset that actually look like questions I might ask. We ended up with a list of a few thousand questions along with

11:44

their verified answers. At this point, it becomes much more of a verifiable task.
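
(Roughly, that generation step could look like the sketch below; the prompt wording and the `llm` callable are illustrative stand-ins, not the exact pipeline.)

```python
# Sketch: invert the problem -- synthesize (question, answer, source) triples from email batches.
import json

QGEN_PROMPT = """Here are 20 emails from one user's inbox.
Write a few questions this user might realistically ask an email assistant,
where the answer is found in these emails. Return a JSON list of objects with
keys: question, answer, source_message_ids."""

def generate_qa_pairs(email_batch: list[dict], llm) -> list[dict]:
    """`llm` is any chat-completion callable backed by a strong model (e.g. Gemini 2.5 Pro)."""
    response = llm(QGEN_PROMPT + "\n\n" + json.dumps(email_batch))
    return json.loads(response)

# A separate filtering pass (another LLM call or simple heuristics) then keeps only
# the questions that look like something a real user would actually ask.
```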

The reward function becomes much easier because we know what the correct answer should be. The way we can tell if our agent did a good job is: we give the agent the question, we let it go search the email inbox and try to find the right

11:59

emails, and eventually it comes back with an answer.

Then we can just use an LLM as judge, a very simple one, and say: here's the question, here's the golden answer that we believe is right, here's the answer we got from our model. Is it right or not? We did have to do a little bit of

12:14

iteration there, making sure the judge was well calibrated on what counts as correct or not, but by and large this worked pretty well and made this more of a verifiable task. So that's how we solved the reward-function problem: by

12:29

turning this into something where we had more of a golden dataset. Okay.
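
(A minimal sketch of such a judge-based reward; the prompt and the binary scoring are illustrative, not the calibrated judge used in the project.)

```python
# Sketch: binary LLM-as-judge reward against the golden answer.
JUDGE_PROMPT = """Question: {question}
Reference answer: {golden}
Model answer: {answer}
Does the model answer convey the same information as the reference? Reply YES or NO."""

def correctness_reward(question: str, golden: str, answer: str, judge_llm) -> float:
    verdict = judge_llm(JUDGE_PROMPT.format(question=question, golden=golden, answer=answer))
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0
```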

So, once you've solved those problems, once you have your environment and your reward function defined, then basically you just have to run a loop over and over again where your agent goes through and

12:46

tries to solve the problem, and then you figure out if it did well or badly. You reward it if it's good and punish it if it's bad, and that's it.

You do this over and over again, and hopefully, if you've got everything set up right, it learns

13:03

what good looks like, it learns what bad looks like, and it starts doing it right. This is the curve we saw earlier, where you can see it getting better over time.
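
(In pseudocode, that loop looks something like the sketch below. `agent`, `reward_fn`, and `trainer` are placeholders for whatever RL stack you use; this is a GRPO-style outline, not the actual training script.)

```python
import random

def train(agent, questions, reward_fn, trainer, steps=1000, batch_size=16, group_size=8):
    """Outer RL loop sketch: roll out, score, reinforce."""
    for _ in range(steps):
        for question in random.sample(questions, batch_size):
            # The agent attempts the same question several times.
            trajectories = [agent.rollout(question) for _ in range(group_size)]
            # Did each attempt do a good job or a bad job?
            rewards = [reward_fn(question, traj) for traj in trajectories]
            # Rollouts that scored better than their group get reinforced.
            trainer.accumulate(trajectories, rewards)
        trainer.step()  # policy-gradient update (e.g. GRPO/PPO)
```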

Okay, a few other interesting learnings from this project. One thing we found is that

13:19

you can actually throw a lot of stuff into your reward function beyond just the primary thing you're trying to solve for. We ended up with something like eight different little things that we gave extra credit for, and I'm going to share two of them here. The first

13:34

one is that we were trying to have it optimize for the number of turns: how many times it had to go back and forth querying the email inbox before it came up with the right answer.

The most important thing, of course, is getting the answer right, but between two runs that both get it right, we would rather

13:50

it take fewer turns, because that's fewer tokens, lower latency, and lower cost; it's just a more efficient agent.

You can see on this first graph that early on, while it was getting its feet wet and figuring out what worked, it ended up spiking to over six turns on average. It would

14:07

go back and forth with the email inbox a bunch of times trying to find the right thing. But once it figured out how to use the tools efficiently, how to construct the right keywords and find the right email, it was able to get very efficient, and actually better than any of our prompted models

14:22

on this metric of using fewer turns. And again, this was just because we gave it a little bit of extra credit.

It was a very small amount relative to the reward for getting the answer right, but that little bit of extra credit for using fewer turns was enough for it to optimize against that.
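
(That extra credit can be as simple as a small additive term; the weights below are made-up numbers, not the ones used in training.)

```python
# Sketch: small bonus for finishing in fewer turns, kept tiny relative to correctness.
def turn_bonus(num_turns: int, max_turns: int = 10, weight: float = 0.1) -> float:
    return weight * (1.0 - min(num_turns, max_turns) / max_turns)
```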

14:37

Another extra reward term we gave it was to try to discourage it from hallucinating answers. Obviously the best thing is to get the right answer, but if you can't find the right answer, it's much better to say "hey, I don't know" than to make up an answer in a situation like this. So,

14:53

we basically penalized it if the reward model said, "Hey, you got the answer wrong," but it had tried to give an answer anyway; that got a much lower reward than if it just said, "Hey, I don't know. I can't solve this problem." And as you can see, that worked quite well.

Compared to any of the prompted models, including o3, we

15:09

ended up with a significantly lower hallucination rate, because that was part of our reward function. Again, these are things that are just sort of extra credit, but we found you can throw in a bunch of these and it can jointly optimize all of them at the same time, which is super powerful.
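
(Sketched out, that shaping might look like this; the exact reward values are illustrative.)

```python
# Sketch: wrong-but-confident answers score worse than an honest "I don't know".
def answer_reward(correct: bool, said_dont_know: bool) -> float:
    if correct:
        return 1.0   # right answer: full reward
    if said_dont_know:
        return 0.0   # honest abstention: neutral
    return -1.0      # hallucinated a wrong answer: penalized
```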

15:25

Okay, I want to talk a little bit about reward hacking. It's something that comes up a lot when you're trying to do this, and it's kind of a fun thing to talk about.

This is an iconic video some of you might have seen, released by OpenAI almost a decade ago at this point. They had this

15:41

environment where you were trying to get a boat to complete a race, and instead of learning to complete the race, it learned that "oh, if I just go in this little circle that's not even part of the racetrack, I can get a bunch of points." So it just started doing that over and over again instead of actually following the course. This is

15:57

something that comes up a lot if you're doing reinforcement learning. It's basically the difference between what you actually want the model to do and what you can measure, i.e. what you're actually rewarding it for. Almost always, if you let one of

16:12

these runs go long enough, it will figure out some way to exploit your measure and get a really high reward without actually solving the problem. You need to watch for that. So I'm going to give a couple of examples here.

This is a graph from another project, actually, not this one.

16:29

An engineer on our team was working on the game NYT Connections. Some of you might know it: you get 16 words and you have to put them into four groups of four.

It's quite a challenging game, especially for these language models, because it requires a lot of world knowledge and lateral thinking. Anyway, they

16:46

were trying to train this model to do it, and it wasn't figuring it out, wasn't figuring it out. And then, boom, you can see here around step 40, it just takes off, like, okay, we figured out how to solve this. And this engineer, I'm going to call him out, where's An on our team? He's here at the conference. Yeah, he's great. You should talk to him

17:01

after. He was like, "Hey, we solved it. We got NYT Connections." And it's like, okay, the graph looks good. Let's look at what it's actually doing.

What it was actually doing is it had figured out there was a bug in how we wrote the verification: if it just put every single word in every single category, it was able to

17:16

get a perfect score, because we weren't verifying that there were in fact only four words in each category.

Here's another example; this is a fun one. I was training a model to produce really good titles for Hacker News, titles that would get a post upvoted. I had this reward

17:33

model I'd trained on existing Hacker News articles and how many upvotes they got, and I was trying to train this model to produce new titles. It was working really well for a while.

You can see that on the graph, and subjectively as well: I looked at a bunch of the generated titles, and for the first thousand steps or so it was actually learning

17:49

things where I was like, "Okay, as someone who spends way too much time on Hacker News, yeah, that does look like a good title. You're doing a good job." And then you can see around step 1200 here, it just jumps a bunch.

It's like, okay, it clearly figured something out. I don't know what it figured out, but we should look at that. And so,

18:07

it turns out what the model had figured out was that it could just completely ignore the content of the post and generate the same title for every single one of them, and that would maximize its score. So it generated this title, "Google lays off 80% of workforce," for literally every single

18:22

article; that's what it labeled every one of them. And it was like, yes, that is going to get upvoted on Hacker News for sure, which, to be fair, it probably would.

So anyway, the way we solved this: what we found is that it's really important to watch out for this, and

18:38

solving it typically involves modifying your reward function in some way to penalize things like that. In the second example I talked about, it was actually quite an easy fix once we identified it, which was just to add an extra LLM-as-judge that looked at the title and the content and said, hey, is

18:53

there anything in the title that's not supported by the content? We added that on, and it actually worked great. The important thing here is you want to be looking at your rollouts and figuring out what's actually happening, not just blindly trusting the reward function.
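
(A sketch of that fix: gate the upvote-model reward behind a second judge that checks the title against the content. The prompt and function names are illustrative.)

```python
# Sketch: zero out the reward when the title makes claims the post doesn't support.
GROUNDING_PROMPT = """Post content:
{content}

Proposed title: {title}

Does the title claim anything that is not supported by the content? Reply YES or NO."""

def grounded_title_reward(content: str, title: str, upvote_model, judge_llm) -> float:
    verdict = judge_llm(GROUNDING_PROMPT.format(content=content, title=title))
    unsupported = verdict.strip().upper().startswith("YES")
    return 0.0 if unsupported else upvote_model(title)
```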

Anyway, that's it. I'm almost out of

19:09

time, so I'm going to stop. A couple of QR codes for you: everything in this presentation is in a much longer write-up I have of this whole project. It includes the code, the artifacts, and the datasets along the way. You can check that out there.

One more thing: we have a

19:25

Discord that's open. We have an open-source project for training reinforcement learning models, and there's a Discord you can go to if you're interested in this kind of thing. We're all in there, we answer questions, and there are lots of people from the community trying to do these things. If you're interested in building things with this, feel free to join

19:41

it. And yeah, happy to chat there.

Thank you, everyone. I appreciate your time.