Category: AI Development
Tags: AI optimization, AI products, Braintrust, Evals, Loop agent
Entities: Braintrust, Claude 4, Gemini, Loop, OpenAI
00:00
[Music]
00:18
[Applause] Awesome. So today we're going to talk a little bit about evals to date and where we think evals are going in the future.
Also for those of you who saw my brother
00:34
earlier, I'm going to do my best to live up to his energy and charisma. But yeah, it's been an amazing almost two-year journey for us at Braintrust.
We have had the opportunity to work with some of the
00:50
most amazing companies building, I think, the best AI products in the world. I'm blown away by how many evals people actually run on the product.
The average org that signs up for Braintrust runs almost 13 evals a day. Some
01:06
of our customers run more than 3,000 evals a day, and some of the most advanced companies running evals spend more than two hours in the product every day working through them.
And I think one of the
01:21
things that stands out to me is that, while we have customers building some of the coolest, most automated AI-based products and agents in the world, evals
01:40
the best thing you can do is look at a dashboard. I think we have a pretty cool dashboard in Braintrust, but still, it's just a dashboard: you look at it, walk away, and think, "Okay, what changes can I make to my code or to my prompts so that this eval does better?"
01:55
And I actually think that is all going to change. So today I'm excited to talk about something called Loop.
Loop is an agent built into Braintrust that we've been working on for some time now. And it's actually only possible because
02:11
of evals. Every quarter for the last two years, we've run evals on the frontier models to see how good they are at actually improving prompts, improving data sets, and improving scorers.
And until very, very recently, they actually weren't very good. In fact, we think
02:28
that Claude 4 in particular was a real breakthrough moment. It performs almost six times better than the previous leading model.
So, Loop runs inside of Braintrust, and it can automatically optimize your
02:44
prompts all the way to very complex agents. But just as importantly, it also helps you build better datasets and better scorers, because it's really the combination of these three things that makes for really great evals.
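To make those three pieces concrete, here is a minimal sketch in plain Python of how a prompt-driven task, a dataset, and a scorer combine into one eval score. It is not Braintrust's API and not how Loop works internally: `Case`, `make_task`, `exact_match`, and `run_eval` are hypothetical names, and the "model" is a stand-in function so the example runs without any LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One dataset row: an input and the output we expect for it."""
    input: str
    expected: str


def make_task(prompt: str) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call driven by a prompt.

    A real task would send the prompt and the input to a model; here the
    behavior is faked so the sketch is deterministic.
    """
    def task(text: str) -> str:
        return text.upper() if "uppercase" in prompt else text
    return task


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 if the output equals the expected value, else 0.0."""
    return 1.0 if output == expected else 0.0


def run_eval(task: Callable[[str], str],
             dataset: List[Case],
             scorer: Callable[[str, str], float]) -> float:
    """Run the task over every case and return the mean score."""
    scores = [scorer(task(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores) if scores else 0.0


dataset = [Case("hello", "HELLO"), Case("eval", "EVAL")]

# Same dataset and scorer, two candidate prompts.
baseline = run_eval(make_task("Echo the input."), dataset, exact_match)
improved = run_eval(make_task("Return the input in uppercase."), dataset, exact_match)
```

Changing any one of the three (the prompt, the rows in `dataset`, or the scorer) changes what the final number means, which is why improving them together matters.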
03:00
This is a little preview of the UI. You can actually start using it today if you are an existing Braintrust user or you sign up for the product.
There's a feature flag called Loop that you can just flip on to start using it right away. By default it uses Claude 4, but you can actually pick any model
03:17
that you have access to and start using it. Whether it's an OpenAI model, a Gemini model, or maybe some of you are building your own LLMs, you can use those as well.
And as you can see, it runs directly inside of Braintrust. One of the things that we learned
03:33
from working with a lot of users is how important it is to actually look at data and look at prompts while you're working with them. And we didn't want that to go away when we introduced Loop.
So every time it suggests an edit to your data or it suggests a new idea for
03:49
scoring or it suggests an edit to one of your prompts, you can actually see that side by side directly in the UI. Of course, for the more adventurous among you, there's also a toggle you can turn on that says, basically, "just go for it," and it will go and optimize away,
04:05
which actually works really well. So, just to recap: to date, evals have been a critical part of building some of the best AI products in the world, but the task of actually doing evaluation has been incredibly manual.
04:22
And I'm excited about how, over the next year, evals themselves are going to be completely revolutionized by the latest and greatest coming out of the frontier models themselves, and we're very excited to incorporate that into Braintrust. Please, if you're
04:38
not already using the product, try it out. Try out Loop and give us your feedback.
We have a lot of work to do, and we'd love to talk to you.
We're also hiring. So if you're interested in working on this kind of problem, whether it's the UI part of it, the AI part of it, or the infrastructure
04:54
side of it, we'd love to talk to you. You can scan this QR code.
It should be over there. Yeah, you can scan the QR code and get in touch with us.
We'd love to chat. Thank you.
[Music]