The Future of Evals - Ankur Goyal, Braintrust

Category: AI Development

Tags: AI optimization, AI products, Braintrust, Evals, Loop agent

Entities: Braintrust, Claude 4, Gemini, Loop, OpenAI

Summary

Introduction

The speaker discusses evals to date, where evals are heading, and Braintrust's almost two-year journey.

Braintrust and Evals

Braintrust has worked with companies building some of the leading AI products.
The average organization that signs up for Braintrust runs almost 13 evals a day; some customers run more than 3,000 a day.
The most advanced companies spend more than two hours in the product every day working through their evals.

Loop: A New Agent

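Loop is a new agent built into Braintrust; the speaker says it is only possible because of evals.
It automatically optimizes prompts (up to complex agents) and helps build better data sets and scorers, the three things that together make for good evals.
By default Loop uses Claude 4, which in Braintrust's quarterly evals performed almost six times better than the previous leading model at improving prompts, data sets, and scorers.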
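
To make the three ingredients concrete, here is a minimal sketch of an eval as a data set, a task, and a scorer. It is plain Python and does not use the Braintrust SDK or show how Loop works internally; the names `Case`, `toy_task`, `exact_match`, and `run_eval` are hypothetical, and the task is a lookup table standing in for a real prompt and model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One row of the data set: an input and the expected answer."""
    input: str
    expected: str


def toy_task(question: str) -> str:
    """Stand-in for a prompt plus model call (hypothetical, no real LLM)."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(question, "unknown")


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 on a case-insensitive exact match, otherwise 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    data: List[Case],
    task: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the task on every case and return the mean score."""
    scores = [scorer(task(case.input), case.expected) for case in data]
    return sum(scores) / len(scores) if scores else 0.0


DATA = [
    Case("2 + 2", "4"),
    Case("capital of France", "paris"),
    Case("largest planet", "Jupiter"),
]

mean_score = run_eval(DATA, toy_task, exact_match)
print(f"mean score: {mean_score:.2f}")
```

In this framing, improving an eval means changing one of the three arguments to `run_eval`: the prompt behind the task, the rows in the data set, or the scoring function.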

Using Loop

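Existing Braintrust users, and anyone who signs up, can turn Loop on via a feature flag called Loop.
Loop defaults to Claude 4 but can use any model the user has access to, including OpenAI models, Gemini models, and custom LLMs.
Suggested edits to data, scorers, and prompts appear side by side in the UI; an optional toggle lets Loop apply its optimizations without review.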

Future of Evals

Evals have been critical to building AI products, but running them has been very manual; the speaker expects frontier models to revolutionize evals over the next year.
Braintrust plans to incorporate these advances into the product.

Call to Action

Users are encouraged to try Braintrust and Loop and provide feedback.
Braintrust is hiring for roles in UI, AI, and infrastructure.

Transcript

00:00

[Music]

00:18

[Applause] Awesome. Uh so today we're going to talk a little bit about evals to date and where we think evals are going to be going in the future.

Also for those of you who saw my brother

00:34

earlier um I'm going to do my best to live up to his energy and uh and charisma. But um yeah, you know, it's been an amazing almost two-year journey for us at Braintrust.

We have had the opportunity to work with some of the

00:50

most amazing companies building um I think the best AI products in the world. Uh I'm blown away by how many evals people actually run on the product.

The average org that signs up for Braintrust runs almost 13 evals a day. Some

01:06

of our customers run more than 3,000 evals a day. Uh and some of the most advanced companies that are running evals are spending more than two hours in the product every day working through their evals.

And I think one of the

01:21

things that stands out to me is while we have customers building some of the coolest most automated um AI based products and agents in the world eval

01:40

the best thing you can do is look at a dashboard. And I think we have a pretty cool dashboard in Braintrust, but still it's just a dashboard that you look at and you walk away and think, okay, what changes can I make to my code or to my prompts so that this eval does better.

01:55

Um, and I actually think that is all going to change. Uh, so today I'm excited to talk about something called Loop.

Loop is an agent that we've been working on for some time now that's built into Braintrust. Um, and it's actually only possible because

02:11

of evals. Every quarter for the last two years, we've run evals on the frontier models to see how good they are at actually improving prompts, improving data sets, and improving scorers.

And until very, very recently, they actually weren't very good. In fact, we think

02:28

that Claude 4 in particular was a real breakthrough moment. Um, and it performs almost six times better than the previous leading model before it.

So, Loop runs inside of Braintrust and it can automatically optimize uh your

02:44

prompts all the way to very complex uh agents. Um, but just as importantly, it also helps you build better data sets and better scorers because it's really the combination of these three things that make for really great evals.

03:00

This is a little preview of the UI. Um, you can actually start using it today if you are an existing Braintrust user or you sign up for the product.

There's a feature flag that you can just flip on called Loop and start using it right away. Um, by default it uses Claude 4, but you can actually pick any model

03:17

that you have access to and start using it. Whether it's an OpenAI model, a Gemini model, or maybe some of you are building your own LLMs, you can use those as well.

Um, and as you can see, it runs directly inside of Braintrust. One of the things that we uh learned

03:33

from working with a lot of users is how important it is to actually look at data and look at prompts while you're working with them. And we didn't want that to go away uh when we introduced Loop.

So every time it suggests an edit to your data or it suggests a new idea for

03:49

scoring or it suggests an edit to one of your prompts, you can actually see that side by side directly in the UI. Um, of course, for the more adventurous among you, there's also a toggle that you can turn on that says like just go for it and it will go and optimize away.

Um,

04:05

which actually works really well. So, just to recap, uh, to date, evals have been a critical part of building some of the best AI products in the world, but the task of actually doing evaluation has been incredibly manual.

04:22

And I'm excited about how over the next year uh evals themselves are going to be completely revolutionized by the latest and greatest that's coming out um from, you know, the frontier models themselves, and we're very excited to incorporate that into Braintrust. Please if you're

04:38

not already using the product try it out. Uh try out Loop give us your feedback.

Uh we have a lot of work to do. Um and we'd love to talk to you.

We're also hiring. Uh so if you're interested in working on this kind of problem, whether it's the UI part of it, the AI part of it, or the infrastructure

04:54

uh side of it, we'd love to talk to you. Um you can scan this QR code.

Uh it should be over there. Yeah, you can scan the QR code and and get in touch with us.

Uh we'd love to chat. Thank you.

[Music]