11:10
Okay, we will go ahead and get started. So welcome, everyone, to the second edition of the Agentic AI conference.
Good morning, good afternoon, good evening, based on wherever you are
11:26
located. My name is Raja Bal, and it is a pleasure to kick off this meeting with all of you online.
So over the next five days, here is what we are going to do: the conference runs this
11:42
entire week. Today is primarily panel discussions, tomorrow is tutorials, and on the
11:58
third, fourth, and fifth days there are workshops. So, why agentic AI?
So what I'm going to do is before we go ahead and get started
12:15
with the panel discussions, I would like to set some context on why agentic AI and, more importantly, what the various components of agentic AI are. We
12:31
have panel discussions and tutorials and workshops on various aspects: knowledge management, memory, planning, reflection, keeping your agentic AI workflows safe and secure, ensuring proper governance, and also building
12:50
multi-agent workflows. So we'll talk about a lot of these topics.
As I said, the first day is going to be primarily panel discussions. And for those who want to roll up their sleeves, we have some hands-on
13:05
tutorials tomorrow. And then, for those who want to really get into the weeds and learn how to build actual agentic AI workflows,
we have
13:21
hands-on workshops on a variety of topics on the last three days of the conference. I'm going to go ahead and start sharing my screen here.
So, uh I
13:37
will hand it over to our set of panelists as soon as I'm done with kicking off the conference. Our first panel starts at 9:15.
So let me go ahead and get started with uh
13:53
sharing my screen. It is fascinating how, after having spent so much time doing this during the COVID pandemic, we still
14:10
always struggle with figuring this out. Okay.
So, I will go ahead and uh start sharing the screen here. And here we go.
So, I hope everyone can see my screen.
14:30
Okay. So, we'll um go ahead and get started.
I will uh kick off the conference uh officially formally. We'll get uh started.
So we have 65,874 registrations. We uh closed the
14:47
registration sometime yesterday. It's an online conference, so we don't have to worry about capacity and all of that. So we closed at close to 66,000 people attending the conference, and we were running this poll
15:04
just now on where everyone is joining from, and based on our analytics from the registry, what we understood was that we probably have someone attending from every
15:24
country and territory on the planet. So we have people from all over the world attending, and I see some of these: San Diego, California (not a country yet, as far as I know), but we wanted to know
15:41
other countries. So we see India, Pakistan, the USA, Egypt, and Turkey, right?
So you can see that we have broad representation from all over the world. And first of all, our
15:57
sincere and deepest gratitude to our sponsors and partners who made this conference happen. So, a round of applause for all of them.
So uh before we go ahead and get started,
16:14
right? It took us a while to figure out the agenda, to really make sure it is curated and set up in a manner that the conference is actually a learning experience.
Right? So yes, of course we
16:30
network during a conference, and we accomplish other goals, but primarily this conference is really designed to make sure that you leave with some learning around agentic AI.
Depending upon your role, you could be a
16:48
hardcore AI engineer, a data scientist, or a software developer, or you could be a product manager, a project manager, or any other role. The way we have set up these panels, we want to make sure that everyone leaves with a lot of learning and
17:04
can take whatever they want out of it. If you're a product manager, they'll...
17:27
Okay. So, what are some of the building blocks of agentic AI?
I am getting this message that my internet connection is unstable,
17:44
and this is quite surprising. This is the worst time for this to happen right.
So let me go back and okay
18:03
So we will get started again. We have the perception side of it: we want to understand the intent. And then we want to understand
18:19
the cognition and planning part of it. So, agentic AI is all about goal management.
You want to manage the goal. Um, and then we have uh uh reasoning and inference, planning and memory and self-reflection.
All of that is part of
18:35
the building blocks of agentic AI. Then we have knowledge representation.
How do you manage your sources of knowledge and maintain the context of the conversation, memory, and so on? And context
18:52
engineering. We will have panels and sessions on all of these topics, from knowledge management to memory to MCP, how you manage your goal, how you build multi-agent systems, all of that. Then
19:10
we have the autonomy aspect of it: action and execution, tool use, API calling. So we'll talk about that, and
19:26
safety and governance as well: alignment of values, guardrails and policies, explainability, monitoring, and so on. We'll also talk about evaluation. So, in the agenda
19:42
you will see that across different panels, tutorials, and workshops we will be covering all of these aspects. So today, at a very high level, we'll start with a panel on
19:58
designing intelligent agents. Then, in the next panel, we will be talking about architecting scalable multi-agent workflows.
And in the third panel today we will
20:16
be talking about managing security and governance in MCP deployments. And then we have two more tutorials.
Today we will start our tutorial stream with deploying an end-to-end agentic AI
20:33
workflow. So whatever we learn in the first three panels, we'll actually see how it comes together in an actual product.
And then later on we have a tutorial that will talk
20:48
about how we build agents that are self-improving. On day two we have seven tutorials, roughly an hour each, and then from day three to day five we have workshops.
21:04
We have two workshops a day on day three, day four, and day five. These are half-day workshops where we will be diving deep into specific topics.
You
21:21
can look at the full agenda. I don't want to take up too much time here to uh to actually go through the entire agenda.
Uh when we come back tomorrow, we are going to once again actually look at so we'll be looking at the daily agenda uh every day. But for now, if you
21:38
scan this QR code, you can actually go and take a look. Uh you're registered on the same page for the conference.
This is for your convenience. Just scan it.
Just go take a look and then we are um you know um and we will actually go through the agenda on uh on a daily
21:54
basis uh as soon as we um come to um that day. So with that uh I am going to hand it over uh to the moderator for our uh next
22:10
panel, and I will welcome all the panelists. Our next panel is about designing intelligent agents.
I earlier pointed out that there is a role for memory and cognition and planning. So these are
22:27
some of the leading experts in this space, and the moderator for this next session is not just a moderator but more of an agentic AI celebrity these days. So, Zan, I will ask you to come and take over, and you can take it from here.
22:46
>> Can you hear me? Checking. >> Yes, I can hear you. >> Awesome. >> And I will stop sharing. All right. And let me know how we can bring in the panelists.
I would actually uh we can
23:04
>> Yeah. So, we can go ahead.
Yeah. We still have two to three minutes.
So, I think that our next panelist is Sarah. Okay.
The third panelist is Sarah. So, that sounds good.
I can >> I can I can give a little bit of an
23:20
intro on what I'd like to talk about um uh while we wait. So, uh, how I wanted to start this off was just introducing, um, the panelist, myself to give you guys an idea of what we're, uh, what we work on in our day
23:36
jobs. And then I wanted to get into the buzzwords that this panel is titled with.
I'd like to define each one and ask the panelists questions about each of them. Right?
So I want to do a deep dive on what cognitive
23:53
architectures actually mean and then I I'll ask panelists questions about that stuff. Um I wanted to understand how agents can use memory, how do you compress memory, how do you deal with uh kind of managing memory and then also
24:08
dealing with a limited context size uh for language models. Uh I also wanted to see how we can talk about and understand planning and long horizon reasoning.
Right? So if you think about a difficult enough task, you can you can almost
24:24
think of it as a language model executing consecutive steps anywhere from 20 to 50 steps depending on how uh difficult the task is. Uh I wanted to ask about how you can use agents to execute those steps.
Um and then I want
24:41
to get more into multi-agent systems. So these are not just language models interacting and invoking other language models but human in the loop systems as well.
And so all of our panelists um work at companies that are developing these cognitive architectures, these
24:57
memory modules, these multi-agent interaction frameworks. So I wanted to touch on that, and then I want to get to the last point, which is: how do we take the experiences that these multi-agent, human-in-the-loop systems have had in the past? This could be
25:14
customer interaction data this could be logged interactions how do you take that information and now make the system learn so that the thousandth time a system does something it does it a lot better than the first time that it did it. Um and then I wanted to end off with
25:29
uh future direction. So, I've got a lot to uh I wanted to ask a lot of questions and give people a lot of um kind of a deep dive into uh into um the the topic at hand.
Um and so hopefully we get through it all. But uh let's see if we
25:46
can um Is the is the panel ready? Hey, Loren.
How are you doing?
26:03
>> I'm good. How you doing?
Happy Monday, by the way. >> Yes.
>> Should I kick it off? Like do intros or >> Awesome.
Um I'm not sure.
26:19
>> Yeah, you can go ahead and get started. >> I'm seeing everybody.
Okay. Okay.
Yeah. Sorry, I had to I had to scroll to see everybody uh was present.
Okay, cool. So, I'd like to start off with uh with introductions.
So, I'll introduce myself and then I'll throw it over to the panel um to give an overview of what you're
26:35
working on, and your companies as well. So, my name is Zan Hassan.
I'm a staff engineer at Together AI. Together AI is an AI-native cloud.
It's powering a lot of the training and inference workloads that a lot of leading GenAI companies
26:51
are working on. And we do everything from providing compute to kernels to run these GenAI models that you're hearing all about.
So maybe I'll throw it over to Loren. You can go next.
>> Thanks, Zan. Hi everyone, my name is Loren.
I
27:07
lead the open source team at CrewAI. And for those who don't know what CrewAI is, it's the leading multi-agent orchestration platform.
So we pretty much power and help companies from DocuSign to PwC to Royal Bank of
27:23
Canada not only build but also deploy, iterate on, and monitor their AI agents within these enterprise systems. >> All right, awesome.
Thank you. Thank you, Loren.
Next up, Karthik.
27:39
>> Thank you. Uh hey everyone, I'm Karthik.
I am head of AI at Wand AI. So at Wand we are basically building the agentic operating system for hybrid workforces.
So there is a lot of overlap with things that Loren mentioned as well. We're trying to
27:55
basically build a cognitive architecture that allows both human workers and agentic workers to work together, organize workflows, and then engage in automation, ranging from simple automation all the way to full automation of
28:10
departments and companies and so forth. Before I came to Wand, my background was actually as a research scientist in planning and cognition, so I like to say that a
28:27
decade or so ago I was working on agents before agents were cool. At that time agents were robots; they were actually physically embodied agents. But a lot of the work that I did, both as a grad student and then during my time at IBM Research, is coming back now, it's in vogue, and I'm
28:42
hoping that we can talk about it on the panel: things like planning, observability, replanning, memory, cognition, all of these things.
So, yeah, looking forward to it. >> Awesome.
Thank you so much, Karthik. Uh, and last but not least, Sarah.
28:59
>> Can you hear me okay?
>> Yeah, I can hear you. >> Okay, cool.
Um, hi everyone. Um, my name is Sarah.
I'm one of the co-founders and CTO of Letta, which is the platform for stateful agents. Letta basically came out of the MemGPT project, which my co-founder and I and also our lead
29:14
research scientist worked on during our PhDs at UC Berkeley. MemGPT was kind of the first example of a stateful agent that has memory, can learn, and has a sense of identity.
And so a lot of those ideas we built into Letta, which is essentially making it really easy to create these
29:30
agentic services that are persistent, have memory, and can be integrated into your applications. And today we work with companies like Bilt Rewards, Nokia, and 11x to essentially help them build these agents that are able to learn.
>> Awesome. Thank you so much.
So, uh
29:47
during the introductions, um there's a lot of concepts that were uh mentioned. So, I'd like to kind of define these concepts and then throw it over to to to the experts to answer questions about what's happening at the edge right now.
I wanted to start off talking about cognitive architectures. And the way
30:03
that I understand cognitive architectures, these are really the the blueprints or the scaffolding within which all of these different modules sit, right? You have perception um uh uh IO, memory, the ability to reason and plan and make decisions.
Um and then
30:19
also the ability to do tool calls and take actions in the real world. So, traditionally, a lot of cognitive architectures have been quite modular, where you could have a planning system that feeds into a decision-making system,
30:35
and then you have memory that that that you're able to access. Uh and more recently, we've been seeing kind of these all powerful monolithic LLMs that do everything, right?
They they plan and then they do in context uh kind of management of what's been executed and
30:51
what's not been executed. So I wanted to start off and just pose the question to the folks on the panel:
how do you think about today's language-model-based agents? Do you think they're modular architectures, or do they differ fundamentally from classical
31:07
cognitive models that are um that are more modular? So maybe we can go uh backwards.
So let's maybe we can start off with Sarah. >> Yeah.
So I mean, I think there are different ways that people like to think of LLMs. I personally
31:24
prefer... there's the more neuroscience, cognitive-architectures mentality, but then there's also the more systems, operating-system mentality, and I think that abstraction is actually much easier to think about, because in the end an LLM is
31:40
just tokens in, tokens out. It's just a string generation machine, right?
And so all these things that we do, like tool calling, memory management, multi-agent systems, it's kind of like mapping those systems down onto this LLM computer that's just generating
31:55
tokens. So for me, I personally find it easier to think of the LLM as a CPU, and essentially the tokens that we're putting into it are kind of like a compiled program.
And so of course there are
32:10
different agent architectures, different ways that you can structure things, manage the context. But I personally find that way of thinking about things a lot easier, because I think it is more closely tied to the reality of what an LLM is, which is ultimately just this token
32:27
generation machine rather than um something like a human brain. >> Yeah.
Yeah. And I know that Karpathy put out the Software 3.0 talk where he makes a similar analogy, where language models are really like a CPU or an operating system, and then you've got all
32:42
of these different modules that you give them access to. Awesome. Karthik, how about your take?
>> Yeah, so I agree a lot with what Sarah said. I think it's a great distinction between more of the neuroscience view and, I think now,
32:58
in this modern paradigm, what we would often call neurosymbolic architectures, right? Combining neural methods on the one hand and then putting them in a scaffolding of more symbolic methods, versus the more operating-system metaphor.
But one thing I also want to talk about is a little
33:15
bit of kind of a historical perspective you know from from the research or from the AI side of things right. So cognitive architectures have been around pretty much for as long as AI as a field has been around, right?
So people were working on cognitive architectures even in the 1960s and the 1970s. In fact, some of the earliest work in the whole
33:31
field of AI as we know it was around these well fleshed-out cognitive architectures that could actually run robots and real-world systems and so on. Right?
What I see now, and I actually gave a talk about this last week as well, is that a lot of the ideas from that
33:48
uh time whether they're in terms of communication between agents or they're in terms of the planning the you know observing the world replanning all of that a lot of those are coming back now as we start um you know uh using LLMs for more and more complex tasks now right so essentially I think what's
34:04
happened is the engines like the components in these architectures have become more powerful right so we're replacing each of the individual components with specific models right so there are models that are good for um you know uh long horizon planning for example there are models that are good for you know completion there are models
34:22
that are good for different things. So we have powerful models now, but we still need a little bit of that determinism, because I think one of the big problems that people have had with LLM-based systems is that the consistency is just not there when you try to go from POCs to actual production-
34:38
level things right so one of the encouraging things and for me this is encouraging because it it shows that we're actually building on history and on the experience and errors of you know other people decades before us is that people have been building these uh bringing in these ideas into various things you know whether it is a crew AI
34:54
or AutoGen or AG2, or whether it's a protocol like A2A or MCP. A lot of these ideas from the past are actually making themselves felt again, and I think this is somewhat the right way, because to me at least, and this is a very personal view that I might change if there are further advances, it seems like
35:11
where we should be putting our effort into is not building larger and larger LLMs that can be more and more general and generic, right? But rather smaller, more specialized components or units, right?
And then those units are embedded into these more symbolic structures in some sense because ultimately that gives
35:28
you more of a level of control, more customizability, and more consistency, which is what businesses are looking for at the end of the day, right? There's this big gap between the research and the POCs on one end, and business applications on the other end, which is where all the
35:44
you know money is is basically this notion of you know can you break it down into individual components can you do credit blame assignment if something goes wrong can you tell me which component actually messed that up right and can you effectively gate the propagation of errors right so I think cognitive architectures have a very
35:59
important role to play in that uh and I think there's a lot that we can learn from the past as we try to kind of reinvent the wheel in some sense, right? So, so that's kind of my take on that.
>> Yeah. Awesome.
I I wanted to pull on uh one of the one of the threads that you mentioned there, but I I want to give
36:14
Loren a go at the question, and then we can go into the next question after that. >> Sweet.
Yeah, pretty much agreed with Karthik and Sarah here, the LLM being the core of our operating system. As a
36:31
developer tool slash framework provider, from that perspective, think of the cognitive architecture as components. So we have our LLM that has
36:48
access to memory now, where before it didn't. If you're using a framework like CrewAI or Letta, for example, you have these modules that you can easily plug in and play: for example, if you want memory, just do memory equals true in CrewAI; for reasoning,
37:03
reasoning equals true. So these are components that you can add on to your agents; you can add any tool that you want, you can observe and monitor that, and these are kind of like special abilities that your LLM has out of the box. There are some reasoning models, and to Karthik's point,
37:20
as these models get better and better, some of these abstractions or components are now built in within the inference models, right?
But from the perspective that you might want specialized LLMs for specific types of
37:35
use cases, right? Um maybe you don't need a reasoning model, but you need memory.
So these are kind of like the the things frameworks give you out of the box. They're kind of battle tested.
they're used by you know enterprise companies but at the same time it's
37:51
giving you the right tools you need to make that decision. So if you have a particular use case in mind, you may not need the biggest of models, right?
Maybe an 8-billion-parameter model with memory and the right tools, maybe even fine-tuned for particular use cases,
38:08
are kind of enough. But to answer your question, yeah, I think of them from a perspective of components.
Think of it like having a toolkit, right? You have an LLM that's kind of in the center of all of that, and you can pick and choose what tools you want to include, whether that
38:24
be a framework, or a singular agent that you're making yourself, or an LLM call.
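A minimal sketch of this "toolkit" view, assuming the memory/reasoning switches work the way Loren describes for CrewAI: the class and function names below are a generic stand-in, not the actual CrewAI or Letta API.

```python
# An LLM at the center, with memory, reasoning, and tools as opt-in components.
# Illustrative only: AgentConfig / build_agent are hypothetical names.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentConfig:
    model: str = "an-8b-parameter-model"     # hypothetical model name
    memory: bool = True                      # plug in a memory module
    reasoning: bool = False                  # plug in an explicit reasoning step
    tools: list[Callable[[str], str]] = field(default_factory=list)


def build_agent(config: AgentConfig) -> dict:
    """Assemble only the components the config asks for around a central LLM."""
    components = {"llm": config.model, "tools": [t.__name__ for t in config.tools]}
    if config.memory:
        components["memory"] = []            # e.g. a store the agent reads/writes
    if config.reasoning:
        components["reasoning"] = "scratchpad"  # e.g. a chain-of-thought buffer
    return components


def web_search(query: str) -> str:
    return f"results for {query!r}"          # placeholder tool


print(build_agent(AgentConfig(tools=[web_search])))
```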
>> Awesome. Thank you so much. So what I wanted to do for maybe the next half hour is essentially pick each one of these tools that all of you have mentioned and get your understanding
38:39
of them, and ask questions that the community is grappling with. So the first one, and this is the one that I'm most interested in, is memory.
So Loren, you just mentioned that you've now given these language models access to memory. Without a
38:54
connection to an external database, whether a vector database or a classical database, or even in-context memory, really the only memory they have is what they've distilled from their pre-training and post-training. Right?
So I wanted to start off and kind of pose the question of how do you think
39:11
of memory, and what are the different types of memory that you can give to these agents? >> Yeah.
So within CrewAI we have similarities to the Soar kind of cognitive architecture when
39:27
it comes to memory systems, right? We have episodic memory, which is kind of like long-term memory where you can self-learn and evaluate over time. We have semantic memory, kind of like knowledge sources that you can plug and play.
Um and then we have um procedural memory which is kind of like
39:42
the rules that you have. Then we have something called entity memory, which remembers certain things and attributes, like your writing style or who's important in this particular company.
We focus our memory a lot when it comes to orchestrating multiple tasks together. So instead of
39:59
just remembering a birthday, for example, it's remembering how you are passing in tool inputs for a particular task. We have long-term memory that does a self-evaluation after every task gets executed, and that self-evaluation has a
40:15
suggestion on how it can improve, and a score; it uses an LLM as a judge to calculate all these things. But that's kind of how we see a crew of agents self-improving over time. And if you were to picture what a crew is: it's multiple agents, right?
It's not just
40:31
one agent. Each agent has its own knowledge store and memory store and state if you're using like a flow for example, but um together it's kind of like how do we orchestrate a automation use case versus like a chatbot use case.
40:47
So think of an example being like let's do lead enrichment. We're pulling data from HubSpot.
We're trying to enrich this user using maybe something from online. We're trying to find their LinkedIn.
So the tools that come into that the uh memories that we generate like entities like who's important for
41:04
this type of company when it comes to lead enrichment, what company do they work in, right? These get stored as memories for future executions. So from that perspective, our memories are coupled to automation and agentic pipelines,
41:22
you could say, versus "oh, when was my birthday," right? So I think this is kind of the road we're taking when it comes to memory systems, more so than anything else.
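To make the four memory types Loren lists concrete, here is an illustrative record store using his lead-enrichment example. The record layout and field names are assumptions for illustration, not CrewAI's internal schema.

```python
# Episodic / semantic / procedural / entity memory as simple tagged records.

from dataclasses import dataclass
from enum import Enum


class MemoryKind(Enum):
    EPISODIC = "episodic"      # long-term: past task runs and self-evaluations
    SEMANTIC = "semantic"      # knowledge sources you plug in
    PROCEDURAL = "procedural"  # rules about how the work gets done
    ENTITY = "entity"          # people, companies, and attributes worth remembering


@dataclass
class MemoryRecord:
    kind: MemoryKind
    content: str


memory_store = [
    MemoryRecord(MemoryKind.ENTITY, "Acme Corp: target account, CTO is the buyer"),
    MemoryRecord(MemoryKind.PROCEDURAL, "Enrich leads with a LinkedIn URL before scoring"),
    MemoryRecord(MemoryKind.EPISODIC, "Run 14: HubSpot pull succeeded, scoring step was weak"),
]

# On the next execution, relevant records are retrieved and injected into the prompt.
relevant = [r.content for r in memory_store if r.kind is MemoryKind.ENTITY]
print(relevant)
```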
>> Awesome. I'd like to go to Sarah next
41:38
and I'd like to augment the question a little bit as well. I'd like for you to talk about the kind of multi-tier memory system that Letta has, but then maybe also touch on practically how those multi-tiered memories are implemented.
So a lot of the community
41:55
is kind of grappling with all sorts of different databases, classical databases, uh embedding uh embedding approaches. Um so how does Letta uh deal with this?
>> Yeah. So I I think you know this this actually just comes back to once again
42:10
the fact that LLMs are just text-in, text-out machines, right? And so the way that we think of memory, or I think the way most people think of memory now, is that there are kind of two tiers.
There's the memory in terms of like the state or what information is in the context window. And then there's also external
42:27
information. So stuff that's outside of the context window, but it's somehow accessible to the LLM, whether it's through like tool calling, having like another agent retrieve things.
Um, and and I think like this external form of memory is like what most people think of as memory. Like whenever you have like
42:42
plug-in memory, it's almost always just like some rag based thing. So, you know, if I add like a memory MCP server or something, that's all that's essentially doing is just like fetching um some records that I have in some external storage, bringing them back into the context window.
And so now the LM is
42:58
aware of that. But I think memory is actually a lot more than that.
Because our DevRel actually had a pretty good way to put this, essentially explaining that memory is not just a matter of
43:14
doing RAG or doing recall. And I think the example he gave was: recall is remembering that you hit me; memory is hating you because you hit me. A bit of a violent example, but I think that does really get to the crux of it,
43:30
because I think with these ragbased systems where you're just retrieving potentially relevant records um into the context like you know maybe the last user message doing some embedding similarity or some like graph search over like some external system and bringing that back into context all that
43:45
really gives you is recall. So if you want to have you know more of like an agent having an ability to learn or to like adapt its persona you need to have something that's more like system prompt learning.
So essentially rewriting the um actual like context of the agent over time. And the the way that we did that
44:00
in MemGPT. I think MemGPT was kind of the first example of this: there was just a human section and a persona section inside of the context window.
And the agent had the ability to edit those portions of its context window. And in Letta, we've generalized it more to have something called memory blocks.
Um so essentially the context
44:17
window is broken up into multiple kind of allocated sections of of memory or context. Um, and the agent is able to essentially like rewrite like you know maybe the human section or the organizational section or the general purpose or like tools used section.
And
44:32
so through this the agent is essentially like continuously like rewriting the system prompt. Um, and that's kind of what allows it to actually kind of change its own instructions in a way that you know I think is much closer to learning than these more like rag based approaches.
And we also had some recent
44:47
work called like sleeptime compute where we actually offloaded a lot of this learning onto a different agent. So now in Letta like you can actually choose a new agent architecture where it's not just a single agent that's modifying its memory, managing its memory, doing retrieval.
It's actually one agent
45:02
that's just talking to you. It's kind of general purpose and then it has another agent um that's essentially working in the background to do memory management operations.
So that other sleeptime agent is getting all the conversational history, all the events that are happening and its sole purpose is to just rewrite context to make sure that
45:19
the primary agent is as adaptive as possible, learning as much as possible. And that's really only possible because of the way that we architected Letta.
So everything in Letta is essentially persisted in the database, including these context blocks. So you can do really cool things
45:34
like allowing different agents to share context blocks. So that's kind of like shared context where if either agent modifies it, um it's propagated to all the agents that are, you know, attached to it.
Um so you could theoretically have like, you know, synced organizational memory, stuff like that. Um and the blocks are also standalone.
45:51
So you can always like kind of query a block that's about a specific topic. We have a really cool um example agent called like void.
That's like a an agent on blue sky. And void is kind of interesting because it's swapping out its context blocks continually.
So it always has like 10 context blocks about
46:07
different users, and depending on which set of users it's talking to, it actually swaps out which set of context blocks it has. So that's another layer, I guess, where you're doing retrieval on the core pieces of context themselves and then also rewriting those.
Um so yeah there's
46:23
like a lot of different crazy things you can do but I I think ultimately it does just come down to like you know what is stuff that's stored in context that's being rewritten either by the agent itself or something else. And then also what is like your external store and I actually think like the details of a lot of these things like don't really
46:39
matter. In terms of the in-context memory, it's basically just replacing the text that you have, right? And I think for external stores, there are a lot of things that people do, like graphs or just embedding search.
You can actually even just use grep and
46:54
files. We did like an experiment with that recently and it works surprisingly well.
Um but yeah, I think that kind of just depends like more on your application and the details of what you want to store. Um but it's like not actually as impactful as you might think as compared to like um system prompt rewriting.
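A sketch of the memory-block idea Sarah describes: the prompt is assembled from named, editable blocks, and the agent (or a background "sleep-time" agent) calls a tool to rewrite a block. The block names and the tool signature below are illustrative assumptions, not the actual Letta SDK.

```python
# In-context memory as named blocks the agent can rewrite, so the system prompt
# itself evolves over time (closer to learning than plain RAG recall).

memory_blocks = {
    "persona": "You are a concise assistant.",
    "human": "The user is named Zan and prefers short answers.",
    "scratch": "",
}


def core_memory_replace(block: str, new_text: str) -> None:
    """Tool the agent can call to rewrite one of its in-context memory blocks."""
    memory_blocks[block] = new_text


def build_system_prompt() -> str:
    """Recompile the system prompt from the blocks before every LLM call."""
    return "\n\n".join(f"<{name}>\n{text}\n</{name}>" for name, text in memory_blocks.items())


# e.g. the user says "start using emojis" -> the agent edits its own instructions;
# a background sleep-time agent could make the same edits from conversation history.
core_memory_replace("persona", "You are a concise assistant. Use emojis in replies.")
print(build_system_prompt())
```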
47:11
Awesome. Uh Karthik, I'd like to pose the same question to you, but touching on some of uh what Sarah mentioned around thinking of memory as system prompt rewriting.
One of the very practical problems that the that developers and um in the community um has is running into context length
47:28
issues. So if memory really is kind of uh making sure that the system prompt is fresh, all of the information that you put into the language model that you condition its generation on is fresh.
Um how do you deal with kind of compressing those memories? Uh having to compress
47:45
those memories when you run into context length issues? State-of-the-art models are at about 200,000 tokens for a lot of these models;
Gemini is around a million. So maybe, how can we deal with that?
>> Yeah. No, I think that's a great question.
In fact, I think that's a
48:01
question of very uh you know huge practical importance because like you said developers who are making things right now they're working mostly with those things right. So the way I think of this is so so there's a phrase that's been thrown around a lot in the last one or two months.
I'm sure all of you here
48:16
on this call have heard it context engineering, right? Which is basically this notion of you have context, you have context windows, right?
But now what you need to do is you need to kind of impose some sort of structure and some sort of hierarchy and you know some sort of engineering basically on top of that context itself and obviously then
48:33
that problem comes up because you have models whose context windows are being exceeded by you know what we want to give those models and also remember that even if a model advertises itself as a I don't know 125k or 200k or 1 million you know context window model it's not
48:48
necessarily the case that it's going to actually process all of the 1 million tokens that you're putting into that context, right? This is a very well-studied and well-known effect in language models at this point, right?
The whole notion that models tend to focus more on the beginning and the end of the context, and there's this whole needle-in-the-haystack problem in the middle of
49:04
it, right? It's like this thin, almost non-existent middle.
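One common way to handle the context-budget problem Karthik is describing is to keep the most recent turns verbatim and fold older turns into a running summary. This is a generic sketch; the token counter and summarizer are placeholders, not any particular framework's API.

```python
# Compress history once it exceeds a token budget, preserving recent turns.

MAX_TOKENS = 8_000


def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer


def summarize(messages: list[str]) -> str:
    # In practice this would be an LLM call; here we just truncate each turn.
    return "Summary of earlier conversation: " + " | ".join(m[:40] for m in messages)


def compress_context(history: list[str], keep_recent: int = 6) -> list[str]:
    if sum(count_tokens(m) for m in history) <= MAX_TOKENS:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent
```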
So, the way that I like to think about it, and I actually want to go back: I really liked the aphorism that Sarah used about recall versus memory, right? The difference between remembering that something is bad for you versus hating it because,
49:20
you know, it was bad for you. And I actually want to extend it out one further step.
Right? So there's recall, there's memory, but then there's also this notion of knowledge and wisdom, right?
So the idea there is that if something has happened a bunch of times over and over and over again, right? Then it actually is knowledge that you can encode and write down, right?
So
49:36
that it actually becomes something that you can use for you know future uh uh computation and for creating things in the future and so on. And the reason I talk about this is because again I think a lot of this panel for me I'm going to be the guy who keeps saying you know but you know remember this stuff that was
49:52
done in the past right which which I think there is a lesson there for all of us who are building and developing right um one of the very uh powerful tools in AI uh you know over the last 20 years but of course the work has been going on for many decades is this whole notion of knowledge graphs right so the idea there
50:09
is that you know you can basically uh you know extract from the context that is there and you can start making these structural representations of knowledge itself. Right?
So a knowledge graph at its simplest is basically a graph you know nodes and edges and then the nodes are basically concepts or entities and
50:25
the edges are relationships between them right but one of the things I think that could be really powerful is the distillation of the knowledge of the world right into these structured graph representations and that's one step you know representing that knowledge the
50:40
other step is teaching our LLMs how to actually use you know access and use the knowledge that is there in these graphical structures for that you need obviously a bunch of different tools and you know you need a bunch of different uh models to be trained to access that and the reason I think that that's so important right is the world's knowledge
50:58
can be many different things depending on your application right it can be knowledge about your particular company it can be knowledge about a particular use case it can be knowledge uh about a particular country's you know sovereign uh I don't know wealth things for example right um so these are very
51:14
customizable things and if you think about it this is the way that we as humans tend to interact with stored knowledge and stored context, right? This is the way that older, let me say older AI agents, you know, preLLM AI agents, this is how they used to access that as well, right?
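A minimal sketch of the distillation step Karthik outlines just above: turning content from the context into a graph of entities (nodes) and relationships (edges). The triples here are hand-written for illustration; in practice an LLM or an information-extraction pipeline would produce them.

```python
# Knowledge graph at its simplest: nodes are entities, edges are relationships.

from collections import defaultdict

triples = [
    ("Acme Corp", "headquartered_in", "Toronto"),
    ("Acme Corp", "uses", "HubSpot"),
    ("Jane Doe", "works_at", "Acme Corp"),
]

graph: dict[str, list[tuple[str, str]]] = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))   # adjacency list: node -> [(edge, node)]

# An agent can then answer by walking the graph instead of re-reading raw context.
print(graph["Acme Corp"])   # [('headquartered_in', 'Toronto'), ('uses', 'HubSpot')]
```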
And so for me, this is very much tied into this concept of
51:30
memory because knowledge is nothing but memory that has been aged like fine wine, right? If you think about it, knowledge whether it is knowledge that we have or knowledge that's been passed on through previous generations or through teaching and so on is basically just memory and context that has been you know stacked up layer after layer
51:47
after layer until you know that okay this is just received wisdom. I can just you know kind of depend on it and so on.
Of course there are other problems on the other side of it. you know there are biases that get built in and all of that right but ideally what we want our agentic systems to be heading towards is the ability to both represent memory
52:04
right in that in that way and also use that memory in that in that way and I think one of the places that we can turn to just like we've had all these advances incredible advances in the last one or two years including all of us here we've been working on some of these we've had advances on cognitive architectures on uh kind of
52:19
interoperable platforms and and you know tools and things like that I think what's coming Next is also advances in how to represent context, how to represent knowledge, memory, all of these things, right? And how to actually have models that can learn how to use these, right?
These don't need to be
52:35
huge models, right? You don't need to have multi-billion parameter models because remember, if you're able to externalize the knowledge and teach the model how to actually retrieve the information that it needs, right?
You no longer need to store all of that information inside of the model during pre-training. And I think that really will help with scale up.
It'll also help
52:51
with I think uh uh interpretability right because once the knowledge is stored outside it automatically becomes a lot more interpretable than knowledge that is inside a model during pre-training and so forth right so I think that's my kind of maybe slightly tangential take on memory which is that we need to focus more on systematizing
53:07
it into knowledge, in some sense. >> Awesome. So this actually gives me multiple ways that I can go forward now. I want to skip the next two things that I wanted to talk about and get into the learning aspect that you
53:22
all talked about. So how I want to take the conversation is: going from memory, and then seeing how these agents, these language models that are token-in, token-out, can learn from previous memories, previous experiences.
Um so
53:40
I'd like to touch on that. Is it a matter of reinforcement learning and fine-tuning, or is it a matter of context engineering and retrieving the appropriate context for the task at hand?
Um and so this
53:55
time the maybe maybe we can do uh kind of a round on on this and then we can get into the other the other tools. So um Loren, maybe you can start us off.
Yeah, I'm gonna touch on two things that we currently do and then kind of like the future of how I essentially see this
54:13
from the perspective of how we do it at CrewAI. So, we have two ways of self-improving, right?
One is through reinforcement learning with human feedback. And again, just painting a picture of how that works in CrewAI: for every task that gets executed, it will pause
54:30
and it will give you a suggestion like a field where you can give a suggestion on how it can improve over time and you as a as a person or as um um someone who's administrating this this agent workflow. You have the ability to talk like hey maybe use these types of websites when
54:48
you're scraping the web. um these are the types of entities I want uh enriched when it comes to like a lead for example that I've I'm trying to search for right so after every task usually in a crew there's like two to three tasks per crew that we have uh these are these get
55:04
stored into memory for future iterations to load. It's a little bit on the simpler end but super practical: we just re-inject that into the prompts later on, as memories of how it can improve. These are suggestions at the prompt
55:19
layer that we inject for future iterations, using that retrieved memory store, when you use memory equals true within CrewAI. So this is the second piece, long-term memory: after every task, it automatically scores itself.
It gives a
55:36
a suggested score of how it did based off the task output. So when you define a task in CrewAI, you have something called a description and an expected output.
Expected output is kind of like your criteria of success, right? So this is something we require by
55:52
default when you're defining an agent task. And what we do, using an LLM as a judge, is take a score based off the task output itself against that expected output, and we store that into long-term memory.
And again, kind
56:07
of like the same cycle as reinforcement learning with human feedback. Kind of like the same concepts of this task evaluator gets stored into memory.
So it thinks of a way it can improve itself; that's the prompt and score that we give it.
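Sketched below is the self-scoring loop Loren just walked through: each task carries a description and an expected output, an LLM-as-judge compares the actual output against that expectation, and the score plus an improvement suggestion go into long-term memory for future runs. The judge call and record format are placeholders, not CrewAI's exact implementation.

```python
# Task self-evaluation against an expected_output, stored for future iterations.

from dataclasses import dataclass


@dataclass
class Task:
    description: str
    expected_output: str   # the criteria of success


long_term_memory: list[dict] = []


def judge(output: str, expected: str) -> tuple[float, str]:
    # Stand-in for an LLM-as-judge call that returns a score and a suggestion.
    score = 10.0 if expected.lower() in output.lower() else 6.0
    return score, "Include the lead's LinkedIn URL explicitly next time."


def evaluate_and_store(task: Task, output: str) -> None:
    score, suggestion = judge(output, task.expected_output)
    long_term_memory.append(
        {"task": task.description, "score": score, "suggestion": suggestion}
    )   # re-injected into the prompt on future executions


task = Task("Enrich the lead from HubSpot", "profile with company and LinkedIn URL")
evaluate_and_store(task, "Found company: Acme Corp")
print(long_term_memory)
```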
The future of how we can
56:26
take this, to Karthik's point, is this concept of context engineering: can you, as an engineer using a framework, have the toolkit to define how this kind of context is used, stored, and retrieved?
56:44
So bringing all these together: you have reinforcement learning with human feedback, which ties into long-term memory for agentic self-improvement over time. And then, for the future, for developers, not
56:59
the future, but for engineers who want to have more control over this context, this is something where where context engineering comes into play. >> Awesome.
Um Sarah, I'd like to get your take on this, but then I'd also uh modify the question a little bit, which
57:14
is um how do uh how do these language models, how do you think they can learn from past experiences, but then also how do you allow users or developers to inject hard constraints? So let's say uh you want to alter the behavior of the language model.
How can they learn from
57:31
human-imposed constraints or business-imposed constraints as well? Yeah.
So I mean, I think with Letta, the primary mechanism for learning is essentially what we were talking about before, which is rewriting the memory blocks, which
57:49
eventually go into the system prompt. And if you just try Letta today, if you tell it, oh, start using emojis in your responses, or don't do this, you'll see that it'll often actually update its memory blocks, its in-context memory, to
58:04
know basically like write its own prompt to give itself instructions to to do that in the future. Sure.
So I I think that's kind of like the most like basic form of learning that we support. Um in terms of kind of you know more like reinforcement learning based things.
I think this is like a tough problem. Like
58:20
my my co-founder Charles his PhD was actually in in RL and he actually I think is like a big believer in more just like using um you know kind of like using that for as like another form of like system prompt learning or like you know potentially like using the feedback
58:36
that you get from the agents. So in Letta, we do have a thumbs up, thumbs down that you can put on different responses.
So eventually that could be used as a mechanism of feedback to potentially do automated prompt tuning, maybe similar to something like DSPy, to
58:52
essentially encourage those agents in the long run um to kind of like rewrite their prompts or like to automatically rewrite the prompts to um potentially like you know do better in whatever the task is. Um, another kind of related topic is also like sleeptime compute which I mentioned before which is
59:08
essentially having you know these agents which are offline and so these agents because they're not like part of the conversational agent they can go back through previous histories they can maybe revise feedback um to essentially like process information and then generate like a better prompt um or like
59:25
better like in context memory blocks from that. Um so I I think these are like all all things that you can do.
Um but yeah, it's it's definitely hard to get right. >> Yeah, for sure.
And Karthik, I know that Wand put out work on this as well, where you've talked about language
59:41
models, benchmarking performance, current state of the language model, uh retraining and replacing uh to to improve performance across those benchmarks. Um can you maybe talk a little bit about how these systems learn from experience and human input as well?
>> Yeah, actually that's great. I I
59:57
actually wanted to pick up also on the part that Sarah was talking about with respect to reinforcement learning, right? So we've been doing a lot of work at Wand that is very related to that area, right?
So what happens is we get a lot of information that comes in from let's say our customers, right?
00:13
Both in terms of information that they have prior to actually starting to use the system, right? So these are usually in the form of documents and process knowledge and so on, right?
And then there's also usage information as they start using the system, as they start chatting with it, as they start giving it goals, they start giving it also
00:28
feedback in terms of how it's doing and so on. And so the big, you know, the the billion or trillion dollar question is how do you actually use all of this knowledge to improve the performance of the system?
I agree that I think there is a lot of potential in various reinforcement learning methods, right? But one thing I'd like to talk about is
00:45
there is also a drawback to a lot of this, right? Apart from the you know the obvious drawbacks of like scale and time and compute and all of that stuff right one of the pro one of the fundamental issues with reinforcement learning right is it can tend to kind of overoptimize it can tend to go into a bit of a hole
01:00
right and so nowhere is this more evident than when you're trying to use reinforcement learning in order to train models to do better on challenging domains. So we wrote a paper back in I think March or April.
It's currently under review at the NeurIPS conference; we're hoping for the best.
But basically that
01:17
paper which was about concise reasoning was talking about this effect where as the uh length of the output generated by the model gets longer and longer right the accuracy of your model overall actually starts dropping right and the
01:34
problem here is essentially, and I'm greatly oversimplifying, I'll drop a link to the paper and the work for people who want to read it in more detail, but the fundamental concept is that the model is basically learning to overfit on specific instances, right? And so it's kind of
01:51
losing a lot of its predictive power in some sense. So the fundamental u idea behind the paper that we wrote was if you can somehow restrict these models to uh shorter outputs or smaller uh outputs, right?
You can actually greatly
02:07
improve the accuracy of those models. And the way that we tested it is we took a bunch of these smaller models, you know, like a 1.5 billion parameter model and a 3 billion parameter model, a 7 billion parameter model, and we said if we were to impose this artificial constraint of now outputting much
02:23
shorter responses, right? Can we actually increase the uh accuracy of the model?
So you actually get two wins, right? you get one win in terms of the accuracy improves and your second win is that you're greatly reducing the number of output tokens which is ultimately the biggest cost when you're calling these
02:39
models, right? And so the TL;DR of the work, and again, like I said, I'll put a link to the paper right after I finish speaking here, was that we can reduce the number of output tokens to almost one-third of what it was previously, while preserving or in some cases even improving the accuracy on challenging
02:56
domains and challenging problems. So there are algorithmic advances that we can make. In this case we're talking about RL, reinforcement learning, but there are a lot of algorithmic advances that we can make in order to make these models more accurate, but also less expensive in some sense, right?
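One simple way to encode the "shorter outputs" constraint Karthik describes inside an RL objective is a length-penalized reward: reward correctness, subtract a cost per output token. This is only an illustration of the idea, not necessarily the formulation used in the paper he mentions.

```python
# Length-penalized reward: correct-and-concise beats correct-and-verbose.

def length_penalized_reward(is_correct: bool, num_output_tokens: int,
                            penalty_per_token: float = 0.001) -> float:
    base = 1.0 if is_correct else 0.0
    return base - penalty_per_token * num_output_tokens


# A correct 300-token answer (~0.7) now scores higher than a correct
# 900-token answer (~0.1), pushing the policy toward concise reasoning.
print(length_penalized_reward(True, 300))
print(length_penalized_reward(True, 900))
```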
03:12
And then one final thing I'll also say, which is slightly different, is in terms of personalization of these models.
We've also been doing a lot of work at Wand, basically since the beginning of this year, where sometimes, because you have a lot of usage data, instead of just taking
03:29
kind of the plus one minus one you know just the thumbs up thumbs down you can also take that usage data and you can train adapters right so you can train like a last layer on a smaller model right which can be used as an adapter for a particular company you know for a particular use case right so these are
03:44
kind of use-and-throw, and in fact we even got it down to the level where we could train adapters for specific users, right? So Zan, for example, if you're using, let's say, our product, right?
And then you have a set of interactions with it. We can actually customize the performance and the
04:00
behavior of the model to take into account everything that you have talked to it about. Now obviously there are a lot of you know different problems that need to be solved in that space.
But these are two different things like the RL approach I think is a much more general heavy-handed approach in some sense versus the adapter approach I
04:16
think is much more niche and narrow and like targeted towards a specific use case or a specific person a specific company and so on. So yeah I think that's kind of my view on that that I think RL is very powerful but we also need to be a little careful about how much we end up optimizing with reinforcement learning techniques.
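A sketch of the per-company (or per-user) adapter idea Karthik described a moment ago: freeze the base model and train only a small adapter layer on that customer's usage data. The dimensions, data, and training loop are illustrative placeholders, not Wand's actual setup.

```python
# Frozen base model + small trainable adapter head, trained on usage data.

import torch
import torch.nn as nn

base = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
for p in base.parameters():
    p.requires_grad = False            # base model stays frozen

adapter = nn.Linear(512, 512)          # small "use and throw" layer per customer
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)

# Pretend usage data: hidden states paired with preferred-output targets.
x, y = torch.randn(32, 512), torch.randn(32, 512)

for _ in range(100):
    pred = adapter(base(x))
    loss = nn.functional.mse_loss(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```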
04:33
Yeah, and I wanted to pull on one one thread. If you can if you can optimize the number of output tokens, I I think the third benefit of that is also it's easier to manage the context because if you think of a a language model or an agent task uh for to successfully
04:52
complete a difficult task at hand, it can take anywhere from 20 to 50 to 100 consecutive executions. these could be tool calls feeding the input back in reasoning over that.
Uh, and so if you're more efficiently managing that context, then you you're less likely to
05:08
run into those long context issues that we talked about. Um, but this is a good segue into the into the last topic that I want to talk about, which is this long horizon reasoning and planning.
Um, difficult tasks require multiple successful uh, language model calls. So
05:26
maybe I I wanted to pose this question to everybody which is how do your systems handle these long horizon task completions and reasoning. Uh and specifically the interesting thing that developers are are having to deal with is what happens if the if the system or the model makes an error on the 10th
05:43
step, how does it execute on the 11th step and how does it go forward? So, um maybe maybe you guys can talk about how you're dealing with that issue and maybe what are some of the uh future uh research directions to help uh deal with planning and long horizon uh reasoning.
So, I'll throw it over to Sarah and then
05:59
we can go from there. >> Yeah.
So, we actually had some recent research work that's currently going into a paper, called RecoveryBench, on this topic. It was essentially doing a comparison on Terminal-Bench, which is this terminal-use
06:14
benchmark. It's a pretty good benchmark for these long-horizon, complex tasks in the terminal. And there's this kind of interesting result where, depending on which model you're using, the ability of the LLM or agent to recover actually differs quite a bit. So, you know, Claude is
06:32
incredibly good at coding so it's obviously number one for like uh you know doing the task in a fresh state but in terms of like you know the recovery state so like once you kind of prefill the agent with a failed trajectory um we actually found that GPD 5. Um, and I think Gemini also like bumped up in the
06:47
ranking. So, GPT-5 was the best-performing model.
Um, and then yeah, I think there's also been work from SWE-bench kind of showing that if you randomly alternate between GPT-5 and the different models, you get better results than just using a single model. Um, so yeah, I do think one interesting aspect is that maybe
07:04
throughout the agent's lifetime you should be changing um what model you're actually using. So maybe if there's a failure, you should like switch to like a more like um you know an LLM that's better at recovery.
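A hedged sketch of the "swap models mid-trajectory" idea just described: run steps on a primary model, and once a step fails, hand the accumulated failed trajectory to a model that recovers better. The model names and the simulated run_step stand-in are hypothetical, not any panelist's implementation.

```python
import random

def run_step(model: str, task: str, history: list) -> tuple[str, bool]:
    # Stand-in for a real LLM + tool call; here we just simulate an outcome,
    # with the "recovery" model assumed slightly more reliable after failures.
    ok = random.random() > (0.2 if model == "recovery-model" else 0.35)
    return f"{model} worked on '{task}' (step {len(history) + 1})", ok

def solve(task: str, max_steps: int = 20) -> list:
    primary, recovery = "primary-coder", "recovery-model"  # placeholder names
    model, trajectory = primary, []
    for _ in range(max_steps):
        observation, ok = run_step(model, task, trajectory)
        trajectory.append((model, observation, ok))
        # Healthy step: stay on (or return to) the primary model.
        # Failed step: let the model that recovers better continue the trajectory.
        model = primary if ok else recovery
    return trajectory

for step in solve("fix the failing build"):
    print(step)
```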
Um so I think like that's maybe one really interesting angle. Um and then yeah I think in terms
07:20
of long horizon tasks more generally um one thing that we've actually found is that memory plays like a surprisingly important role here because going back to this idea of you know rewriting the system prompt rewriting the context blocks that's essentially like a way for the agent to be organized about what
07:37
it's doing at the moment. If you've ever used Claude Code, you might have seen that it's always making these to-do lists and then checking off the to-do lists.
So, I think having something like that happening inside of the agent's memory is actually incredibly important for the agent to understand what it's doing to not get
07:52
derailed, which which is ultimately like the biggest risk. Um, you know, the agent like blows up its context, forgets what it's doing, um, and like fails the task, you know, on iteration number 10.
Um, so yeah, I I think like memory rewriting, making sure that you're, you know, consolidating the context window,
08:08
avoiding derailment, these are all things that are really, really important and something that we focused a lot on. And so I think for a lot of these tasks, if you actually do that context management right, you can get really, really far.
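A minimal, generic sketch of the memory-rewriting idea described here (not Letta's actual implementation): the agent keeps named context blocks, such as a to-do list and a progress note, that it can rewrite with a tool call, and each LLM call sees the current blocks instead of the full raw history.

```python
class MemoryBlocks:
    """Small set of named context blocks the agent can consolidate over time."""
    def __init__(self):
        self.blocks = {"todo": "", "progress": ""}

    def rewrite(self, name: str, new_text: str) -> str:
        # Tool the agent calls to replace/consolidate a block.
        self.blocks[name] = new_text
        return f"block '{name}' updated ({len(new_text)} chars)"

    def render(self) -> str:
        return "\n".join(f"## {name}\n{text}" for name, text in self.blocks.items())

memory = MemoryBlocks()
memory.rewrite("todo", "- [x] clone repo\n- [ ] fix failing test\n- [ ] open PR")
memory.rewrite("progress", "Failure traced to an off-by-one in parser.py")

# The system prompt is rebuilt from the blocks on every step, so the agent
# stays oriented even after many tool calls and context trims.
system_prompt = "You are a long-running coding agent.\n" + memory.render()
print(system_prompt)
```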
So with Letta, we actually, I think a couple weeks ago, made an example, just a
08:25
terminal-use agent, and terminal use is not our main thing, obviously. Um, but despite that, it was actually the number one open-source implementation at the time for Terminal-Bench, just because a lot of the built-in features of Letta for being able to write
08:40
to context, being able to modify context over time, and then giving the agent the tools to do that, turned out to be extremely useful for creating an agent that can do really long-horizon tasks like the ones in Terminal-Bench. >> Awesome.
Thank you. Um, so we're almost
08:58
at time here. I've been given six more minutes to take some audience questions.
So let me look through to see if there's a good question. There's a lot of questions.
So let's see uh
09:17
Yeah, so this is a topic that I wanted to talk about, and I'll use this question to talk about that. I wanted to pose this question about single-language-model-powered agent systems versus multi-language-model-powered agent systems.
And there's a lot
09:32
of debate in the community around when the added complexity of multi-agent systems is worth it versus when should I just stick with the most powerful model that I can find. And Sarah, you mentioned how it might be worth swapping out that most powerful
09:48
model to recover, but how do you think about going with multi-agent systems versus just using the most powerful model and letting it fix the trajectory on these long-horizon tasks and task completions? So I'll throw it over to
10:03
Loren uh for this one maybe. >> Yeah.
So again, it always depends on your use case, and that's what we tell customers and our users, right? There are two orchestration types that we have at Crew.
One is a crew, which you can think of like a Waymo: you have
10:19
a bunch of agents working together and it's autonomously getting to its end destination. Then you have something called flows which is more deterministic.
It's an event-based orchestrator. So you have a start method, you have routers, and in between, each node can have
10:35
regular Python code, single-shot LLM calls, singular agents, or inside that could be a group. So the way we've laid it out, like a 2x2, is: depending on your use case, if it's a high-complexity, high-precision use case, you
10:51
might want the best models that you can get. For a lower-complexity, lower-precision use case, think of something like lead enrichment: it's not as complex, you don't need the best of models, you might need tools to get you there. But for example, when it comes to high-
11:07
complexity, high-precision use cases, for example, we had a customer that's filling out IRS tax form data. You can't make a mistake on those, right?
So maybe use the latest and best models for something that can do OCR really well, or something that can do image
11:23
processing really well or image understanding really well or even writing into um like tool calls specifically to um writing into those forms. It just depends on your use case.
So again, if you were to have a framework or framing for this: if it's a
11:39
high-complexity, high-precision use case, potentially use the best models you can get. If it's a lower-complexity, lower-precision use case, you might get away with smaller models as well.
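A generic sketch of the flow idea the CrewAI speaker describes (not the actual CrewAI Flows API): an event-based backbone with a start step and a router, where each node can be plain Python, a single LLM call, or a whole group of agents, and the 2x2 above decides how strong a model the route needs. All names below are illustrative.

```python
def classify(task: dict) -> str:
    """Router: pick a route from the complexity/precision 2x2."""
    high_cx = task.get("complexity", "low") == "high"
    high_pr = task.get("precision", "low") == "high"
    return "frontier" if (high_cx and high_pr) else "lightweight"

def lead_enrichment(task: dict) -> str:
    # Low complexity / low precision: a small model plus tools is enough.
    return f"enriched lead with small model: {task['payload']}"

def tax_form_filling(task: dict) -> str:
    # High complexity / high precision: use the best available model.
    return f"filled tax form with frontier model: {task['payload']}"

ROUTES = {"lightweight": lead_enrichment, "frontier": tax_form_filling}

def run_flow(task: dict) -> str:
    # start -> router -> node; in a real flow each node could itself be a
    # deterministic function, one agent, or a multi-agent crew.
    return ROUTES[classify(task)](task)

print(run_flow({"complexity": "high", "precision": "high", "payload": "W-2 data"}))
print(run_flow({"complexity": "low", "precision": "low", "payload": "new signup"}))
```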
>> Awesome. Caric, I'd like to hear your
11:54
take on this because you've mentioned the use of very small fine-tuned models that are specialized, and that tells me that you lean more towards the multi-agent system. So I'd love to hear your take on that. >> Yeah, yes, I think I will keep
12:09
it brief. Uh I actually think that this is going to be an emerging field in the next 6 months.
So to everyone here at the conference as well, and I intend to also follow my own advice in some sense: remember how RAG, maybe two years back or a little bit more, became this whole thing
12:25
where there were so many different options: you can choose a different vector DB, you can choose the model that you want in order to vectorize the data, and a lot of other different things. I think this question of single agent versus multi-agent, or single model
12:41
versus multiple models, is also going to get there, right? And so in some sense there is no good answer, there is no correct answer. I think it's going to be very use-case dependent. It's going to be very application dependent, right?
Uh obviously you get a lot of, you know, uh immediate trade-offs if you
12:57
go with like a large model, let's say, versus a bunch of smaller models. There's communication overhead and there's all kinds of other overheads when you're trying to manage the output from a bunch of different models.
But it might net-net end up being cheaper for you, right? Versus using a much larger model may not always work, both from a
13:12
cost perspective, but also your larger model may not be trained on the specific data that you want it to be able to use or access, and so forth. So I think this whole notion of architecture engineering, for lack of a better term, is going to be very important in the next six
13:29
months to one year. And if you are someone who, from a hands-on perspective, can actually build both with a single model and with multiple models, and you can give stakeholders that choice, you can tell them you can go this way or you can go this other way, and let me advise you on what will work better
13:46
and what will work when you do this, I think you stand to make a lot of progress, make a lot of money, whatever your metric is. So that's kind of what I'll say about it. I don't think there is one correct answer, but I do think people should be focusing on how to actually turn this into an engineering field,
14:03
not a science field: how do you turn this into something that can actually be engineered? So yeah. >> Awesome.
Thank you. Uh, and we've just got one minute left, so Sarah, I'd like to throw the same question over to you and then we'll go over to a break.
>> Yeah, I mean, I think I maybe disagree a
14:19
little bit, in that I think you'll often get the best performance from having one really good agent running on a really good model. Um, and I think part of the reason for this is because ultimately that's how a lot of these models are trained, right?
Like, in general, you want to design your systems around the way that
14:35
you are guessing that OpenAI and Anthropic and these other providers are training these models. And I think there have also been recent blog posts from Cognition and, I think, OpenDevin or OpenHands that were talking about how, in these really frontier use
14:51
cases like coding, where you're really pushing the limits of what agents can do, they've actually found that single-agent systems are much easier to actually work with and engineer, or just have better performance, than multi-agent systems. I do of course think that there are cases where you want multi-agent systems; like,
15:07
I think Anthropic's research agent was multi-agent. It had multiple sub-agents with specialized context, researching and exploring different things, with one main aggregator agent.
Um, but I I think in general like for our users, what we kind of recommend is like
15:23
starting with a single agent, and then, if there are reasons to go multi-agent, for example maybe there's just so much context that you can't fit it into a single agent, then breaking it up into multiple agents. And then I think we also kind of suggest, you know, start with the best model and just a single agent, and then if you need to reduce
15:40
your cost, if you need to reduce the amount of context per LLM call, or maybe use cheaper models, then break it down. But I do think it's really important, because in most cases you will be getting the best performance from one really good model, to kind of
15:56
start from that. >> Yeah.
Awesome. Yeah.
I wanted to end with that because that's kind of a high-level decision that a lot of developers are having to make as soon as they start on a problem. Do they want to go with a bunch of models, or just go the easy route and use the
16:11
best model? So um we're at time.
I wanted to thank the panel and just give a quick summary. We touched on cognitive architectures.
We talked about memory. We then looked at how you can use past experiences, past logs of previous memories, to then learn for
16:28
future execution. We talked about long-horizon reasoning and planning.
Uh, and then, thanks to the questioner, we also got the opportunity to talk about multi-agent versus single-agent systems, which the community is debating. Like Sarah mentioned, Anthropic put out a blog post on multi-agent systems,
16:44
and OpenHands and Cognition put out blogs on how maybe you don't need the added complexity of multi-agent systems. Um, so I hope this discussion was useful to all the developers out there and to everybody that's listening, and I wanted to thank the panel as well.
Thank you so much for your time.
17:00
>> Yeah, thanks for having us. Thanks so much folks.
>> Thanks everyone. >> Thanks everyone.
>> Uh, we are going to take an 8-minute break, and we'll come back from the break and continue with our next panel.
17:18
23:28
Okay, we'll get started in about a minute. I will hand it over to Timur, who's the moderator for the next session.
How's everyone doing? Hi.
Hi, Gabriella. Nice to meet you.
Yeah, >> great to see you.
23:44
>> Good to see you all. >> Happy Monday.
>> Happy Monday. >> Hi.
How are you? >> Good to see you.
I'm doing well. Um, okay.
Timur, maybe we still have a minute or so. So, I will hand it over to you.
So, I will hand it over to
24:00
Timur. Timur is going to take this forward.
Timur is the managing director for the AWS Generative AI Innovation Center. So Timur is going to walk us through our next
24:17
topic, and yeah, without further ado, Timur, over to you. >> All right, thank you so much, Raja.
Um good morning everyone. Uh my name is Timur Rasheed.
Um I'm part of the AWS team. Um I lead the generative AI
24:33
Innovation Center. We started the innovation center about two and a half years ago, and we are a multidisciplinary team that does forward-deployed engineering for customers, and we help them accelerate their journey with GenAI and agentic
24:49
AI, making sure that they can maximize the value from that investment in the shortest amount of time. I want to thank the Data Science Dojo team for inviting us.
I'm really excited to be leading this panel right now, architecting scalable multi-agent workflows, and what excites me the most
25:06
is the group of folks that I have here that are part of the panel. We have great representation, and I do want to welcome them all over here as well. We have João from CrewAI, Gabriela, and Ali, and I will ask them all to maybe
25:24
take one minute and introduce themselves, in that order, and then we'll get into the discussion. >> So let's start with our gentleman from CrewAI.
Yep.
25:41
>> Hey there everyone. First of all, thank you so much for having me.
Timur, long time no see. I remember we met in Seattle not too long ago.
Very excited to be seeing you again even though it's online. >> It's good to see you too.
>> Well uh one again thank you so much. I'm very excited to be here.
Thank you so
25:56
much to the folks from Data Science Dojo for the help; very excited about it. So, one, my name is kind of hard to pronounce.
I do go by Joe, and that makes everyone's lives easier. So that is one, and I'm very excited for the conversation today.
If people don't know CrewAI, um, we basically started an open
26:14
source tool to help engineers to create AI agents. We still have a project that goes by the same name and we keep working on that.
Very excited to be closing and launching a version 1.0 at some point next month. So let's see how that will go.
And then we turned into a company uh that now is helping uh many
26:31
customers across the globe including the United States Department of Defense, uh Papico, Royal Bank of Canada and many others. So, uh, very excited to be here.
>> Well, it's great to have you. Gabriella, did you want to go next?
26:46
>> Yeah, of course. Uh, glad to be here.
Great panel. João, or Joe, and I, we are originally from Brazil, so it's great to see another Brazilian on a panel and also doing agents.
Uh, it's amazing. Uh, so my name is Gabriela
27:01
de Queiroz. I now have my own company at F02 Labs, where I help startups stand out and scale through AI strategy and developer advocacy.
I have worked with over 100 AI startups helping them go from like this
27:17
invisible product into recognized leaders within these developer communities, because we know that it's very hard to stand out from the crowd, especially now with so many startups. Previously I was working at Microsoft. I was a director of AI over
I was a director of AI over
27:33
there and then at IBM uh doing AI strategy innovation but I also worked in at different startups. Three startups that actually got acquired and my focus is always helping founders to drive adoption, build visibility and grow
27:49
sustainable communities around the product. On the side I founded two global organizations.
One is called our ladies and the other one is called AI inclusive and together they reach over 200,000 members globally and the idea is
28:06
to make technology more inclusive. Very excited to be here and can't wait for our panel and the conversations.
>> Thank you so much Gabriella. Really excited to have you.
And finally Ali, did you want to introduce yourself? >> Absolutely.
Greetings everyone. Ali
28:23
Arsanjani. Um, I'm the director for applied AI engineering in the Google Cloud AI team.
Um my team leads um engagements uh on the cutting edge of where uh we're aspiring as an uh as a
28:38
community of uh of people accelerating in agentic AI um and trying to make sure that uh these systems are actually ready for production uh and to kind of um go over the hurdles and uh speed bumps that
28:56
we see along the way uh as we all collectively um as an industry strive in that direction. The acceleration is uh is very mind-boggling uh exciting and scary at the same time.
So I'm very glad we're having these types of panels so that we
29:11
can share experiences. >> Yeah, know that's great.
Thank you so much, Ali, and welcome. You know, what I'm excited about is just getting this really broad and diverse perspective from you all as we look at a set of topics. And so, as multi-agent
29:27
systems gain traction in the AI landscape, many organizations are experimenting with pilot projects and proofs of concept, very similar to what companies were doing about two and a half years ago with GenAI. And clearly,
29:42
with GenAI we've made great progress in the maturity of the tools, the frameworks, the models, right? We're at a very similar point here with multi-agent systems, right?
And so as you draw from your diverse experience uh
29:59
working with multiple customers and organizations across different sectors, I'd really like to explore initially what some of those obstacles are, and also the success stories, in scaling multi-agent workflows. So a question that I'd like to start off with for
30:14
all the panelists is: what are the biggest barriers that you're seeing with scaling these multi-agent workflows, and, for that matter, even building them from the ground up? And are you seeing organizations that are quicker at adopting, quicker at
30:32
moving from pilot to production and so Joe maybe we'll start with you first. Yeah, sure.
I mean, that's such a good topic, and I agree with you. It follows a similar pattern to GenAI, and I would say a similar pattern to what other industries went through in
30:48
the past. I can find a lot of correlations between this and, for example, data lakes back in the day, which were a brand-new technology that people didn't really understand how to even buy. So they're trying to understand and navigate this a little bit, but once they get to value,
31:04
expansion starts very rapidly. So it's a similar pattern to what we're seeing now.
It's very much cross-vertical and cross-horizontal. So we see everything from financial to insurance to CPG; it's across many different industries.
I
31:20
think a few things that I expected to be a little different: I expected to see SaaS companies moving faster, but a lot of them are going the route of "let me try to build it first, and if I can't build it, then I might look into buying it." Now, bigger
31:37
enterprise companies are going more toward "I know that I won't be able to build this and keep up with the market pace, so I'm better off buying something or partnering with someone that has done this." I'm seeing a lot of interest from highly regulated industries; I think because they are very
31:52
highly regulated, there's a lot they can automate in their processes. That said, migrating to production is usually a little trickier because of, again, all the regulations and the environments and all that. But I'm definitely seeing that companies that come prepared, knowing what they want to build, with a very
32:08
clear understanding of what success looks like for them and knowing how to measure it, usually correlate with success very highly. When companies show up not being clear on what they want to do, or maybe they have a small list of things but
32:23
they didn't think it through, that usually means they have a lot of homework to do before they actually jump into the technical side of things. I would say that my quote-unquote hot take is: the tech is there.
I think companies need to do a little bit of the prep work before jumping into it.
32:40
>> Yeah. Yeah.
No, that's a fair point. You know, I mean, I uh, you know, I echo a lot of that, right?
Which is when companies sort of have very intentional things that they want to accomplish, right? You know, the success rate in actually productionizing a system gets higher, but naturally many people are
32:56
still trying to understand this whole space, right? So um Gabriella what about from your perspective what are you seeing?
>> Yeah, you know, I see it similar to João, threefold. So the first one is about operational complexity.
33:12
So how do we orchestrate multiple agents across tools, APIs, and different data sources in a way that is reliable? It's still immature in several companies and startups.
The second piece is around evaluation and trust. Some
33:30
organizations don't have robust ways to measure agent behavior, so correctness and failure modes at scale.
That's something that I see very, very often. And then the third one, which is something that we have seen in the industry for so many years,
33:47
is integration. Most pilots live in sandbox demos, but when you plug these agents into production systems with security, compliance, and monitoring, it's a different game. Then if we
34:03
move to the part where I've seen them succeed, it's been when companies treat these multi-agent workflows like software systems, not research prototypes.
>> So think about starting small with a high value use case and add
34:21
observability from day one, and then you create this feedback loop with the users. And the ones that I see that scale, they also build this evaluation so they can integrate safely. And then the other piece that we
34:38
kind of forget is aligning with the business. It's so important to align with the business very early on.
So try to tie this agent workflow to a clear ROI. This is what business people and stakeholders
34:54
usually uh value. It's like what is the ROI?
So this helps with like all the support from executive stakeholders and also to get budgets so you can move from pilot to production. Yeah, know that's a great point, Gabriella.
And it's funny, you know,
35:09
just the other day, like, I've been talking to customers over this past week, and more and more they're talking about the eval part, the observability end of the spectrum, and what some people are calling AI agent tracing, right? It's just really getting that
35:25
intelligence about how do agents perform so ultimately they can drive, you know, a much better outcome through that closed loop, right? >> Yeah.
And one particularly interesting thing is that in the industry we see this loop, this cycle, happening over
35:41
and over and over again. When we had machine learning models, we did all of this already. So now with AI agents it's the same thing: don't forget that we need to evaluate, we need to observe, we need to have systems in place to make sure it's doing the thing it's supposed to do.
35:56
>> Very true. Very true.
Yeah. >> No thank you so much for that.
Um, Ali, and finally, from your perspective, I'd love to hear what you're seeing, so please share that with the group. >> Yeah, absolutely, of course. My colleagues here have shared some keen
36:12
insights in this area. I think some slightly non-overlapping areas would be that, in general, current operational frameworks don't quite have standardized tooling for the entire life cycle of
36:28
managing interdependent agents. Agents that are not just operating individually but have interdependent actions and dependencies.
What this deficit essentially manifests itself in is the ability,
36:45
or rather the inability, to effectively monitor agent drift, to ensure what we might call non-repudiation, with comprehensive audit trails of agent-to-agent interactions. For example,
37:03
there's a protocol that has now been given to the Linux Foundation called the Agent2Agent (A2A) interoperability protocol. Implementing those types of protocols becomes really important because you can have audit trails across companies.
Maybe uh a platform specific agentic
37:21
platform has their own observability and tools and all that. That's great.
But when you have cross-cutting workflows across organizations, it's important to be able to have audit trails. The agentic trails, or rather the tracing that
37:36
you mentioned, Timur, is basically looking at the agent-to-agent communication and how you transfer information from one agent to another across the organizations. And in that vein,
37:53
implementing reliable, auditable protocols, not just for agentic interactions but for human-in-the-loop intervention as well. Because the trend towards "okay, we'll have agents do everything" is a nice
38:09
moonshot objective, but we're going to get there in stages. We have to be able to rely on the agents: maybe send an email reliably, maybe don't mess up the email so I don't get into trouble, right? And once we're okay with that, then fine, but
38:24
if they want to charge my credit card, they need to come to me. So there's a human-in-the-loop escalation or approval, that sort of thing. So, the agentic operations gap: if I can summarize that in one clump, in a category, I would call it the agentic operations gap,
38:40
where the tooling for life-cycle management of interdependent agents is not necessarily there. >> The second thing is architectural fragmentation.
Pilot deployments generally rely on sequential prompt chaining.
38:57
Sequential prompt chaining doesn't have the robustness that you would expect for production. So if you want scale, you have to adhere to some of the well-known ways of solving these problems:
agentic design patterns
39:13
like role-based agent routing, where agents have modularity, they have very specific tasks, they have versioned tool sets, and their interactions are governed by some formal communication protocol like the A2A, agent-to-agent. Um, and
39:29
Gabriela mentioned another one I was going to mention, which is agentic behavior evaluation. So that's kind of third on my top priority list.
>> Yeah. No, that's great.
And you know, Ali, I really like the way you positioned it, because I'd like to build on that conversation, which nicely segues
39:46
into the next topic, which is: when you look at these multi-agent systems, the big thing here is how do they collaborate and how do you coordinate them, right? And so if you look at the analogy of how human capital
40:02
is deployed within organizations, uh we think of so many different ways that we can organize teams around certain outcomes, right? And so we've looked at centralized models, we looked at decentralized models, we've actually looked at, you know, how do you mix them
40:18
both within an organization, right? When you look at multi-agent systems now, you're not only talking about the technical architecture and all those foundational things that support the multi-agent system, but you also have to think about the operational part that you mentioned as
40:34
well. So could you share, from your perspective, what are some trade-offs that builders and companies should think about between decentralization and centralization in a multi-agent system?
Um yeah. Um so you have you have on the
40:52
one hand, you have what you would consider centralized orchestration. Some people like to call it a centralized orchestration control plane,
an OCP: a control plane where the orchestration is happening. And then you
41:08
also have a distributed agent execution environment: agents interacting between companies via A2A, or just randomly, maybe unfortunately, maybe well standardized.
So you have the centralized orchestration control plane
41:23
and you have a distributed agent execution environment. These are these are kind of two different elements.
Um the functionality of the centralized one you have the benefits of it is you have global state management you can control everything from one place. policy and
41:39
compliance can be enforced, you know, in one location. You can have workflow auditing.
You have dynamic resource routing to the best possible tool. Um, and there in the centralized version, you your primary architectural objective
41:55
is centralized governance, accountability, and observability. The trade-off consideration there is or the potential point of contention maybe that you have to trade off on uh if it's not optimized.
It requires very complex
42:11
state serialization in order to make this happen. On the other hand, you have this notion of a distributed agentic execution environment, and in that, local task execution, tool utilization, asynchronous
42:28
A2A communication, and local decision processing are paramount. So the primary architectural objectives you try to get to there are reducing latency, because you want things done fast;
uh you can't have people wait for the orchestration
42:44
to happen. You got to have it local fast uh taking actions being taken and decisions being taken.
You want resiliency. So if something has to robustly recover, it can.
And you to some degree you want autonomy. So reduction of latency, resiliency and autonomy are those architectural uh
43:01
objectives for the decentralized version. Now there the tradeoff is uh you risk emergent unconstrained behavior.
Emergent systems have behavior that emerges without the coordination between the agents because they're not necessarily coordinated without OCP, you
43:19
know, the centralized supervision. So the way to balance them is to make sure you understand the architectural objectives.
Do you want the centralized, the decentralized, and then try to balance them by ingesting a little bit of the centralized governance over that
43:35
decentralized area? So at least you have auditability there and vice versa.
um the complex state management that's required in the uh centralized version you would want to delegate to external uh agents and tools so that you have a balance.
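A hedged sketch of the balance just described: a centralized orchestration control plane (OCP) that owns policy checks and the audit trail, while the actual work is delegated to distributed agent executors running locally and asynchronously. The class names, policy shape, and executor interface are illustrative assumptions, not any vendor's design.

```python
import asyncio
import time

class ControlPlane:
    """Centralized OCP: global policy enforcement plus a queryable audit trail."""
    def __init__(self, policy: dict):
        self.policy = policy          # which agent may perform which actions
        self.audit_log = []

    def authorize(self, agent: str, action: str) -> bool:
        allowed = action in self.policy.get(agent, set())
        self.audit_log.append((time.time(), agent, action, allowed))
        return allowed

async def remote_agent(name: str, action: str, ocp: ControlPlane) -> str:
    # Executes "locally" at the edge for low latency; only the authorization
    # decision and the audit record go through the centralized plane.
    if not ocp.authorize(name, action):
        return f"{name}: '{action}' denied by policy"
    await asyncio.sleep(0.1)          # stand-in for local tool execution
    return f"{name}: '{action}' done"

async def main():
    ocp = ControlPlane({"billing-agent": {"create_invoice"},
                        "support-agent": {"draft_reply"}})
    results = await asyncio.gather(
        remote_agent("billing-agent", "create_invoice", ocp),
        remote_agent("support-agent", "charge_credit_card", ocp))
    print(results)
    print(ocp.audit_log)

asyncio.run(main())
```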
43:52
>> Yeah. No, that that's a fair point and I do think the way you're characterizing it's very telling, right?
Because there are those trade-offs you have to consider, right? There are protocols and standards, which, you know, Gabriela, I wanted to learn from you about. You advise
44:07
so many startups, with strategy, with technical guidance, and all of that. Are you seeing certain standards or frameworks emerge from your conversations?
>> Yeah, definitely. So like I've seen all these early building blocks of like the
44:23
standard for agent orchestration. So frameworks like LangGraph and DSPy, I think that's how you pronounce it, are moving us from this ad hoc chaining toward more graph-based, declarative workflows that are easy to
44:40
scale and debug. At the same time, I'm seeing protocols such as MCP, A2A, and function-calling interfaces that are shaping this common language for tool use and interoperability.
Um but I in the longer term when I think
44:57
about several years, maybe even months, I believe that the real foundation will combine composability, observability, and governance, giving enterprises the confidence to run these multi-agent systems
45:14
in production, much like what Kubernetes did for containers. And I like thinking about this like cloud infrastructure in a way. So the composability would be like microservices, where we mix and match modules instead of rewriting
45:31
everything. And then the observability, as we mentioned a lot of times here, monitoring and logging, because we need to keep the system reliable. And then the governance, which is compliance and access control, that
45:46
ensures that whatever you build is not just powerful but safe and aligned with the business rules. >> Yeah.
No, that's great. I like you mentioned uh you know composability, observability and governance, right?
And I do think those are very foundational
46:02
aspects to it. And I like the analogy of Kubernetes and containers, because in many ways we have heuristics across that world and the world that we're in right now that are very similar, right?
46:18
And I want to segue to the next question, which is for Joe: in the work that you do with customers, and obviously you have this great platform which gives these digital workers, which you guys are calling crews, are there effective strategies that you are seeing
46:36
as you help customers through that journey? >> Yeah.
You mean more on the technical side or or overall? >> You know, more so technical, but feel free to go beyond the tech.
>> Yeah. I mean, there's definitely a new idea of a stack that has been coming up, right?
And if you look into this, it
46:53
goes all the way back to the data management layer. Like these agents will need to tap into data at some point.
So whether that is Databricks, Snowflake, BigQuery, Redshift, whatever you name it, right? It starts all the way back there, and then you have the LLMs, then you have the orchestration layer.
Then you have like
47:09
memory, observability, and a few things spread on top. There are authentication connectors, and I would say the outermost layer would be this idea of an agentic app, something that you can actually interact with, right?
Either conversational or buttons or whatever it
47:25
might be. But there's definitely the idea of this entire stack coming up, and I think people are trying to understand how much of the stack is grouped together,
how much of the stack lives apart from the rest; they need to figure that out.
So there are definitely a few patterns that we are seeing in there that seem to be working pretty well.
47:42
I would say that every customer goes through three phases. There's an initial phase of building and integration.
Uh that is again you want to make sure that you get that right. So you want to make sure that you're trying everything.
You're setting up things in a way that you can
47:57
actually monitor them. You're making sure that you have the definition of success that what we call before.
And then they go to a second stage where like all right this is running now I need to make sure that I'm able to observe this and optimize it. Right?
So it's not only that I want to see traces and what's going on, or even
48:13
have this ability to use LLMs as judges to highlight or call out hallucinations and things like that, but I want to be able to use that to proactively optimize what I'm building and see that it's getting better. And then the final stage would be the management and scaling, right? Where it's like: all
48:30
right, I care about who gets access to this. I care about what the scope of permissions is.
I care that I'm not rebuilding the same agents twice throughout my organization, and that I can run either 100,000 or a million of those and it's all fine. Um, so I
48:45
would say those are the ways that most customers are thinking about it, and it seems they're having a lot of success with that. Um, it's the way that we have built our entire agent management platform as well:
this idea that it needs to be extremely easy to build and integrate,
49:01
needs to be extremely powerful to observe and optimize, and managing at scale should give trust, right? People should trust that these things can run and they can just turn it up to 11. Because what I think most people don't understand is: traces are very interesting and
49:18
useful, especially when you're developing, right? They allow you to get those things going and improving very quickly. But once you turn it up to 11, these things run at machine speed. So it's hard for you to put your finger on them and monitor them unless you can set up, as someone
49:34
else brought up, this idea of automatic runtime interjecting, rerouting, and all that. Uh, I am seeing a new trend of what I'm calling opt-in agency. So I feel like up to
49:53
this point, or up to a few months ago, people would either choose one pattern or the other: people would push you to graphs, or people would push you to agents, completely using agency. But now what I'm seeing is more of those two getting
50:09
intertwined, where people might want to have a structure or a backbone where they do if-this-then-that and have all the control that they need, and then in certain steps they might choose how much agency they want: from a single LLM
50:24
call, to one single agent, to an entire group of agents working together. And that is a pattern that we are definitely seeing; we call these Flows, and I think we're now running 12 million of those a day, so it's insane to see how this is growing as a pattern for us
50:40
in the last quarter. Um, so those are some of the things that come to mind. >> Yeah, that's really interesting, Joe. One way to look at that is: how do you ensure that the overall system has the right
50:56
balance of determinism and non-determinism, right? >> Yeah.
I think it's interesting, because the reason why these agents are powerful is because they are probabilistic, right? Which means that they self-heal.
You can throw any data at them. If
51:11
something fails, they're able to like figure it out. So that is that is interesting.
But then people do want to make sure that they get not the same output but reliable outputs repeatably, right? So that they can trust that.
So it's always like you you need to strike
51:27
the right balance of how many deterministic controls I can add to this probabilistic system. And I think this is the kind of skill that people who have built agents and crews before have developed, but a lot of these companies that have never done this,
51:42
they need some coaching and training to get there. >> Yeah.
And you mentioned a few things: you talked about trust, you talked about permissions, and I think one of the complex things with these multi-agent systems is ensuring that you have the right level of security,
51:59
privacy, and compliance, right? And while some folks are not necessarily dealing with this right now, over time, as you look at the bigger opportunity, systems have to be very sophisticated in how they manage complex regulatory challenges, data protection challenges,
52:16
cross-border considerations, and those kinds of things. And so there is an architectural approach to that. So, Ali, maybe I'll turn this question over to you: when you're managing sensitive data in a very distributed, agent-based architecture and environment, how should builders
52:34
think about what the first principles are for multi-agent architectures that have the right security, governance, and compliance built in? >> Yeah, actually you left the toughest question for me, huh? Just kidding. Just
Just
52:50
kidding. >> I have a good answer for you.
>> I sure hope so. Um, well, well, I what I can do is I can hint on mitigating factors and, you know, principles essentially to take into consideration.
So, I would say that we're talking about
53:06
what I would consider advanced data security. So what this means, when I say advanced data security, is essentially policy-based agentic data governance.
So it's agentic data governance which is an offshoot a
53:21
derivative of general data governance because agents these little creatures can get away and may not be trackable. So you have to govern them through policy.
So, policy-based agentic governance of data. You need a
53:36
framework for managing the sensitive data in a distributed agent environment. I would say that there are three principles to take into consideration.
One is the principle of least privilege. In the principle of least privilege, as it sounds, each agent has a unique
53:53
identity with role-based access control. And then, second, access is strictly limited to the data that's essential for the specific role and tools.
So when you say this, everybody's like, "Yeah, that's obvious." No, it
54:09
isn't. Why isn't it obvious?
Because agents tend to have these general LLMs as their brains. Think of it this way: you want to go to a physician who's a specialist in, I don't know, skin.
54:28
When you go to a general practitioner, they know generally what to do about skin, but you need a specialist, or an ophthalmologist, an eye doctor; you need very specific expertise in that area. So the agents you build have to have brains that are tuned, fine-tuned, LoRA fine-
54:44
tuned, however adjusted, for that specific task and domain. If they're generalist agents, it's going to be difficult for them to receive information that's very specific to their role and task and yet deliver non-hallucinatory,
55:01
acceptable outputs. So essentially, when you're limiting data to what's essential for that role and tool, you're preventing unnecessary exposure to sensitive information like PII.
So that's principle number one, principle
55:16
of least privilege. The second one is the abstraction of what we call a context graph.
Uh tokenize and abstract the context graph. So agents don't just pass raw data one to the other.
They query a knowledge graph that is very
55:34
specific for context. So that's a context graph that's tokenized, abstracted, uh and they share only derived insights.
They're not going to send PII back and forth so that they can keep the original data secure. So
55:52
that's principle number two: context graph abstraction and tokenization. So you're not just passing around raw data.
The third one is basically a kind of compliance. So if an
56:08
agent has data residency controls, let's say in Europe and APAC, you have data residency controls like GDPR, CCPA, etc. in specific areas, and agents need to be configured to operate only within the legal jurisdiction
56:25
of the data they're processing. So this can be thought of as a principle of cross-jurisdictional compliance.
So you're compliant because, yes, you have ten agents, but each of the agents is operating
56:42
in the jurisdiction it is actually supposed to operate in, and just because you have agents, you're not sending data back and forth across jurisdictions in a way that breaks your compliance. So cross-jurisdictional compliance becomes
56:58
very important. Agents know and operate in their own jurisdiction.
So, those are the three: the principle of least privilege, context graph abstraction and tokenization, and then this cross-jurisdictional compliance of agents.
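A hedged sketch of the three principles as a declarative, per-agent policy: a unique identity, a least-privilege data scope, a rule that only derived insights may be shared (no raw PII), and a jurisdiction the agent must stay within. The field names and the check are illustrative, not a real product's schema.

```python
AGENT_POLICIES = {
    "claims-triage-agent": {
        "role": "claims_triage",
        "allowed_data": ["claim_summary", "policy_tier"],  # least privilege
        "share_mode": "derived_insights_only",             # no raw PII passed on
        "jurisdiction": "EU",                               # data residency (e.g. GDPR)
    },
    "us-billing-agent": {
        "role": "billing",
        "allowed_data": ["invoice_totals"],
        "share_mode": "derived_insights_only",
        "jurisdiction": "US",
    },
}

def check_access(agent_id: str, field: str, data_region: str) -> bool:
    """Enforce principles 1 and 3 before an agent touches a record."""
    policy = AGENT_POLICIES[agent_id]
    in_scope = field in policy["allowed_data"]       # principle 1: least privilege
    in_region = data_region == policy["jurisdiction"] # principle 3: jurisdiction
    return in_scope and in_region

# Principle 2 lives in how agents exchange data: they query a context graph
# and pass tokens/summaries, so a failed check here never leaks raw records.
print(check_access("claims-triage-agent", "claim_summary", "EU"))   # True
print(check_access("claims-triage-agent", "claim_summary", "US"))   # False
```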
>> Yeah, no, that's great, Ali. I really
57:14
like those three aspects because they're very foundational. They're first principles, right?
Because like you said, this is a very complex thing, with all sorts of systems engineering and security engineering that needs to play out here. You know, we've been
57:31
getting some interesting questions from the group here in the chat and what what I'd like to do is, you know, we have about 13 minutes or so. Uh, some of these questions are very relevant to the conversations that we're having and so I'd like to bring these questions up and see, you know, Gabriella, Joe, Alli, if
57:48
any of you feel like you want to answer it, feel free to. One common set of questions that I'm seeing from the group here is: how should guardrails be planned for multi-agent systems? And I think to a large extent it relates to
58:04
the fact that we've been talking about security, governance, compliance, but then, Joe, you also mentioned some things around permissions. So maybe I'll throw that out there.
58:19
>> Yeah, that's interesting. I feel like there are two sorts of guardrails. I would put connectors to apps as a separate topic, but I would say that, in terms of guardrails, there are two kinds that I
58:34
have been seeing. One is less deterministic: basically, I want to use another LLM to fact-check once my agent is done with a specific task, before it keeps going.
That is definitely one pattern, and the other is people resorting straight to code, like: hey, I
58:51
want to get the output of this task and validate something with actual code. Whether it's a JSON output or even a string, I might want to do something with it to validate it, and then, depending on the result, I want one of two
59:07
things: either this keeps going and the next agent can take over from there, or I'm going to send it back to the same agent and say, no, you did this wrong, and this is what you got wrong, so it can keep working and iterating on that.
So those are like
59:22
two patterns that I have seen people using a lot. Now, there are also, again to Ali's point, people building some of those around PII detection or personal information, depending on what regulation you're trying to comply with.
Um and then also people doing that to check
59:39
things prior to execution, like making sure that you're checking for prompt injection and a bunch of other things in there as well. And the prompt injection techniques have gotten very sophisticated, I would say.
So there are a lot of checks that you can run before the agent actually takes on
59:55
a request, to make sure that you're rolling some of those out. And then the other thing that I'd say we see is code generation validation.
I think that's another big concern as well: whenever you have agents that are generating code, how do you make sure that you're checking that code for any kind of security concerns, and then
00:12
executing it in a way that is also super secure. So, in terms of guardrails, those are the kinds of guardrails that I'm seeing out there.
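A minimal sketch of the second guardrail pattern just described, validating an agent's output with plain code: here a JSON structure check that either hands the output to the next step or sends it back to the same agent with a description of what was wrong. The required fields and the agent interface are hypothetical.

```python
import json

REQUIRED_FIELDS = {"customer_id": str, "amount": (int, float), "currency": str}

def validate(raw_output: str):
    """Return (ok, errors) for a task output expected to be a JSON object."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc}"]
    errors = [f"missing or wrong type for '{key}'"
              for key, expected in REQUIRED_FIELDS.items()
              if not isinstance(data.get(key), expected)]
    return not errors, errors

def guarded_step(agent_call, task: str, max_retries: int = 2) -> str:
    output, errors = agent_call(task), []
    for _ in range(max_retries):
        ok, errors = validate(output)
        if ok:
            return output  # passes the code guardrail; next agent can take over
        # send it back to the same agent with what it got wrong
        output = agent_call(f"{task}\nYour previous answer was invalid: {errors}")
    raise ValueError(f"output still invalid after retries: {errors}")
```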
I don't know if like anyone else would add another kind though. >> Yeah, sure.
Ali or Gabriel any thoughts? >> Yeah, Gabriela, do you want to go or
00:29
should I? >> You can go first.
>> Okay, cool. So I can think of at least two levels. I know Joe, you mentioned them really nicely, but if you think of them in two levels, you have policy adherence at the high, business level, that's, you
00:45
know, the business policies, and then agent guardrails at the lower level: authentication, authorization, filtering PII, etc. So you have these two levels to deal with in terms of, essentially, auditing, compliance, and
01:01
agent policy management. So I think thinking about it at those two levels is helpful, because you can enforce them in different ways.
For example, if you're using callbacks in order to monitor and
01:17
deal with human-in-the-loop, or filtering responses back from a model, or back from an agent, or back from a tool, you can then, in that callback, trap the information and see whether I should even be able to access that
01:33
information if I'm calling it. So you have a mechanism to see if the response is adhering to organizational policies.
So the very tangible thing I would say is, for example, if you're using
01:49
something like CrewAI, of course, or let's say you're using the Agent Development Kit, there are callbacks you can use: before and after agent, tool, and model callbacks, where, when you're retrieving the information or even sending the information, you can check for policy adherence.
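A framework-agnostic sketch of that callback idea: wrap tool or model calls so a policy check runs on the request before it goes out and on the response before the agent sees it. Frameworks such as CrewAI or the Agent Development Kit expose hooks of this general shape; the names and the keyword-based policy below are generic illustrations, not their actual APIs.

```python
BLOCKED_TERMS = ("ssn", "credit_card_number")   # stand-in for a real policy engine

def policy_check(payload: str) -> None:
    hits = [term for term in BLOCKED_TERMS if term in payload.lower()]
    if hits:
        raise PermissionError(f"policy violation: {hits}")

def with_policy_callbacks(call):
    """Decorator: before-callback on the request, after-callback on the response."""
    def wrapped(request: str) -> str:
        policy_check(request)    # before: is the agent even allowed to ask this?
        response = call(request)
        policy_check(response)   # after: is it allowed to receive this answer?
        return response
    return wrapped

@with_policy_callbacks
def call_tool(request: str) -> str:
    return f"tool result for: {request}"   # stand-in for a real tool call

print(call_tool("look up invoice totals for account 42"))
```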
02:06
>> Yeah, that's fair.
No, thank you so much for that. Ali, >> you know, one other question that I'm sort of seeing in the chat um in it sort of combines a few topics, right?
Gabriel, you talked about ROI and clearly, you know, when you look at
02:24
multi-agent systems and the investment that's going into deploying these systems, and then, eventually, over time, them doing a lot of the tasks that typically human beings would do, there's a huge emphasis and a promise
02:39
around ROI from agentic architectures. And one of the other thoughts that I'm seeing here is that accuracy is going to be so important, right? And so you have generalized agents which are leveraging generalized models, and then is there
02:56
this opportunity where, in order to drive higher levels of accuracy and therefore ROI, there's a need to create specialized, domain-specific agents, which are then leveraging domain-specific models? And so I think it's a three-pronged
03:13
question where in order to really get the ROI, there's got to be an investment in higher levels of accuracy with the tasks being done and and sort of how should people think about that and customizing models and you know building
03:28
specialized agents and so I I will leave it open for anyone to sort of take that question or just share some thoughts on that. >> Yeah.
Yeah. I can I can talk a little bit about the ROI.
Um, because one of the things that I see is: how do I start, and which one should I start with
03:45
when building? I would say: think about the one where you have high volume and a very well-defined pain point. So, for example, if you have an e-commerce company, probably it's going to be support
04:00
tickets and then a way that you can measure uh clearly. So you have like some metrics in place.
There is one example that I want to share, an e-commerce company, where they were able to use minimal
04:18
infrastructure. They were able to scale up only about 10%, but they were able to get a very high ROI.
I don't remember what the exact number was, but the investment was very little while still getting a very good ROI.
04:36
So what I'm saying here is: start small, and then build on top once it's working well. >> Yeah.
No, that's a fair point. Uh Joe or Ali, any additional thoughts here or we can uh go to the next uh
04:53
question. >> Yeah, I think I would second that but yeah, let's go to the next one.
I want to cover as much ground as we can. >> Yeah.
Okay. No, that sounds good.
Um, so, you know, as we're as we're getting closer and closer to the end of our uh
05:08
panel discussion here, um, I'd really like to get each of your thoughts on how should we look at the horizon ahead, right? And so, we're at a fascinating inflection point with the evolution of, you know, Gen AI and now we're at the
05:23
start of Agentic AI and we're talking multi- agent systems. uh there's a lot of rapid advancement in foundation models and you know various breakthroughs along the way right so the pace of innovation is just moving very very fast um I'd like to get from each
05:39
of you, as we look ahead, what technological or research breakthroughs do you anticipate that can unlock where we are today and these inflection points that we can have as we go forward? And so, maybe,
05:55
Joe will we'll start with you first. >> Yeah, I think like that that's a good question and um it's interesting.
I think the most interesting thing that I have I keep thinking about within the industry there and I go back to my first answer uh on the panel. I think the tech
06:11
has come such a long way and there's a bunch of things that are already kind of like kind of working if we string them along correctly. where I want to see us going on moving next and I think like that's where the industry is going to go as well is how much work these agents can do by themselves how long these
06:29
agents can work non-stop. I think that's probably where things are going to start to trend a little bit, because that's very highly correlated with how valuable they can be, and how much ROI, to Gabriela's point, they can actually bring back. So if you can get these agents doing work for, like, one hour
06:45
uninterruptedly, that means that you can throw a lot of harder and more complex problems at them, right? If I say to these agents, I don't know, a silly example, but just to highlight, say: go over all the emails I ever got and classify everything, right?
07:01
And you don't have to worry about context engineering. You don't have to worry about any of that and these agents are just able to get it done.
I think like that you start to get in the realm of the possibilities of what you can throw into them that uh makes them extremely more interesting and more valuable for many of these companies and
07:16
use cases. So in terms of the horizon, I think there's going to be a if I had to pick one thing, I think there's many things, but this is going to be probably ones that um I'm very curious to see is how they come is how the industry starts to kind like converge and optimize a little more to how long this agents can
07:33
run autonomously and how that translate into actual ROI. >> Yeah, that's fascinating.
Yeah, like the ability for agents to do long-range tasks, right? And for what duration?
No, that's great. Thank you for that.
Um Ali, what about from you?
07:49
>> Um, I mean, there are a lot of areas, honestly. One area, if I would pick one, is how to balance global optimality: when you have emergence, you have a multi-agent system that's decentralized,
08:05
you have uh emergent behavior but you need to globally optimize that emergent behavior. So there's a lot of research going into how to do that.
How to balance local um and global optimization uh and deal with the emergent behavior
08:22
in a multi- aent system. So that's number one.
Number two, a corollary to that, would be: as agents become more autonomous, how does that affect that global optimality? And then, double-clicking even further, the third level would be multi-agent
08:38
reinforcement learning, MARL as people call it, which is a way in which you can do reinforcement learning with not just one LLM or one agent but multiple agents. That's a challenge
08:54
that I think is going to be extremely important. It goes back to the ROI question.
It goes back to the OKR question, the metrics question. It kind of encompasses a lot of these things.
Um, and I think one uh one person in the
09:10
audience actually asked a very similar question and uh I actually have a blog that kind of talks about the mitigating factors there. Uh and I can point people to that.
>> That sounds great. No, thank you so much.
And Gabriella, finally from your
09:26
perspective. Yeah, I think there are two things for me.
One is, when I think about scalability, it will come less from bigger models and more from the infrastructure that we have around them. And then the other piece that I mentioned a lot was the
09:44
standardized protocols, which will make it easier for agents from different frameworks and vendors to talk to each other seamlessly. Um, I don't know what else, but those are the two things that I see coming and being there for a while.
10:03
>> Okay. No, that's great.
No, thank you for sharing that. Um, well, um, Joe, Ali, and Gabriella, I really want to thank you for your time, your insights, and all that you shared.
I really enjoyed the conversation, and we're so early in this journey as an industry uh,
10:20
that I'm sure we will learn a lot more as we go. Uh so thank you so much for that.
I want to thank all the viewers for attending this panel and the data science dojo team for making this possible. So I hope you all have a great day.
Thank you so much.
10:35
>> Okay, thank you all. So before we move on to the next panel, we are going to have about a 15-minute break.
We will come back and start our next panel on security and
10:52
governance in MCP deployments. So stay tuned.
Uh we'll be back shortly. [Music]
22:07
Okay, we'll give it maybe another couple of minutes before we get started. Hey, Alex.
Nice to meet you. >> Nice to meet you as well.
Great, Mark. Pleasure meeting you.
22:23
>> Great to see you again. >> Yeah.
So, we'll just give it maybe another minute or so and we'll uh get started. Um, how's uh how's your Monday going?
>> Pretty good. How about you guys?
>> Good, good, good. Busy, busy.
Yeah.
22:43
Oh, so we have uh like 1,600 people in this session right now as we speak. So it's a huge session.
So these are the joys of being online, right? I don't think any physical conference
23:01
sessions can be that big. So this makes it easy.
>> How many did you say? Sorry the microphone kind of cut out.
>> Yeah. So, so what I was saying was that we have like almost 1,600 people.
Can
23:17
you guys hear me clearly? >> Yes, I can hear you.
I just I just the mic just cut out at 1600. Yeah, that's that's great.
>> I see. I see.
Okay, that is great. Okay, my team is telling me that it's not an issue on my side.
So, yeah. So,
23:34
that's that's reassuring. >> Very cool.
Very cool. Yeah.
Excited for the conference today. Yeah.
So I think we should go ahead and get started. It doesn't hurt to start maybe two
23:51
minutes earlier. So, yeah, let's uh let's go ahead and get started.
Um um so some of you have already seen me for those of you who have been here this since this morning. So, you have seen me.
My name is Raj Bal. I'm the founder of Aento AI.
I'm also the founder of
24:08
Data Science Dojo. Aento AI is the new thing that I'm doing.
So in this panel we are going to focus on MCP, which is really central to what we call context
24:24
engineering, and MCP is being positioned as a powerful way to simplify context engineering for agentic AI applications. And I like that analogy of the USB-C of all the data
24:39
sources for LLM applications. So making it easy to connect, really standardizing your data sharing, in a manner that is very similar to
24:55
REST APIs. So you standardize all the AI agent connections to tools, data sources, and workflows.
Uh but but as we uh start adopting uh MCP tools, there are uh there are actually
25:13
concerns around governance and security, primarily around whether these MCP servers and tools will become new vectors for security risks, what the governance practices will be, and
25:31
on the platform that we are building, one of the things that is a concern is role-based access to any MCP calls, and how organizations
25:49
actually balance the speed of innovation with the guardrails, required compliance, and safety. So I will go ahead and welcome all of you, and ask you to go around and introduce
26:05
yourselves and uh we'll take it from there. Uh Alex, you want to go first?
>> Hey uh I'm Alex Salazar. I am the co-founder CEO of arcade.dev and we are an MCP execution and authorization platform.
Meaning that uh we have a
26:23
number of MCP servers and toolkits for out-of-the-box use, like Google and Slack, etc.
We can execute custom MCP servers, but more importantly we can handle what's called agent authorization: can
26:39
this agent, on behalf of this user, perform this action on this resource? That's one of the big problems today. And most of my team is former Okta, including myself. >> Thank you, Alex. Mark, you want to go?
26:55
>> Sure, nice to meet all of you. My name is Mark Cavage, I'm president of Docker. My background personally is building developer platforms for, I don't know, 25 or 30 years; I guess I'm old at this point. I was early AWS, I ran Java for a while, I ran most of Stripe. I've done a lot of stuff, but I'm super excited
27:10
about where Docker is going. We're super invested in developer productivity, security, and AI, and across all of those, AI and MCP kind of sit at the intersection. And what I'll say is, we're in the nice position of Docker being such a widely used and
27:26
ubiquitous platform that I get the privilege to talk to a ton of customers and users. And at this point I feel like there's at best 1% of the population that is not thinking about how to adopt MCP, and there's an equal amount of excitement and euphoria
27:41
and also confusion and, kind of, oh my god, I'm scared, for all the reasons previously talked about. So I'm really excited to talk to you all today and be here and talk about that.
>> Thank you Mark. Uh >> hey thanks Raja.
uh I lead uh
27:57
engineering for ecosystem and marketplace at Atlassian. Similar to you, Mark, I'm very passionate about platforms and toolsets, and about enabling all builders, across multiple geographies as well as different industries,
28:15
to use the Atlassian products. Essentially, my teams build platforms that others can extend and build on top of Atlassian. So we have the Forge platform to build applications, and now the Atlassian MCP, as well as, of course, APIs more directly.
So this helps uh the builders
28:31
and each of the enterprises then customize the Atlassian instance they have, to serve more and more purposes beyond what we built out of the box. So I'm super passionate about this space and really looking forward to this conversation today about how we think about MCP across the
28:46
industry. >> Thank you so much, Reena, and thanks everyone for the introductions.
So let's uh uh dive right in. Um so um so I'm an educator at heart, right?
So I'm an engineer but also an educator. So first thing that I would like to do is uh so we have a diverse audience.
Uh
29:03
would any of you like to describe MCP for anyone who has been in technology but is new to agentic AI? So what is MCP and why does it matter?
29:19
>> Nobody's going to jump in? I can jump in. >> Just any of you.
>> Sure. Yeah.
I'll start, and then, yeah, Alex and Reena jump in. The way, I mean, Raja, you kind of said this at the beginning: MCP is just sort of this universal description of how to present your systems and your
29:36
your data um and your services to an LLM. And you know, an LLM doesn't actually make calls for you.
It tells you what it wants to go call. So something has to describe what that what those things are.
So for example, if you have, you know, a Postgres database or a Grafana interface or a Jira catalog
29:53
or whatever, in the old days you would basically have to describe that all manually to the LLM: here's this thing that you can call me back for and get access to. MCP is a standardized way of doing that, and it gives you these building blocks for both how to connect to it and a sort of standard
30:09
for how to go present that information. So the metaphor that always goes around is the USB-C socket for LLMs, but think of it as effectively just a standard way of describing some system that you want
30:24
to get the LM to understand and be able to interrogate and act on. >> Yeah, I'll that's a great definition.
I'll I'll build on that. Um if you look at how an agent's built, there's there's just there's a few tiers that every agent has.
You have the large language
30:41
model acting as the brain coordinating how the workflow is being executed. You have some semblance of an orchestration system that's building some predetermined set of workflows or rough approximations of workflows around that large language model.
And then all of
30:58
those workflows ultimately tend to terminate a tool call. Um this is where the that the agent takes an action and that could be retrieval, it could be a manipulation, um it could be anything.
when the agent goes to take an action,
31:13
the protocol between the agent and the action increasingly is becoming MCP. Now, I think where I think people are getting really tripped up with MCP in a in true agent building is they they're using them like like APIs.
Um, and I
31:32
think that's why so many uh agents are having a hard time. um MCP like APIs are a service contract.
They are describing how your service works so that another service can interact with it. That's not a really good use of MCP.
Um
31:49
APIs are actually great for that. Um MCP is best used when it's the actually the inverse.
It is kind of like the service contract for the agent of what it's trying to do. And so, um, you know, having like a like a MongoDB or a Jira
32:05
MCP service, it's valuable for sure, but the agent almost never actually cares about the downstream service. It really cares about the action it's trying to take.
And I'll give you a really, really mundane example. Um, let's say that I'm building a sales agent and it's going to
32:21
help me brief, you know, my rep before they walk into an account, you know, so I have a one pager and the right brochures and the right talking points, all that fun stuff. It's going to hit the CRM.
It's going to hit Google Drive for brochures. It's going to check email for past communication.
So, let's say we have a Google Drive tool, Google Drive
32:37
toolkit, Google Drive MCP. Let's assume Google was all in on MCP.
Uh, my agent, if I was using the direct Google Drive MCP server, would need to know how to traverse the Google Drive, you know, file structure to go find the right
32:54
brochures. To do that, right, I need to stuff context into the agent to explain to it how to find the right brochures by using drive.
The better way to do it is to build a brochure MCP server because that's what the agent's attempting to do. And then that thing can talk to a
33:11
Google Drive MCP server if it wants to. Truth is, it's probably better off talking to the Google APIs because that's really the the thing that's trying to be achieved.
MCP, or tools in general, are intent-based workflows, not service-based contracts.
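To make the intent-based versus service-based distinction concrete, here is a minimal sketch of an intent-level tool, assuming the official Python MCP SDK's FastMCP interface is installed; the server name, the tool, and the Drive-search helper are hypothetical stand-ins for whatever the agent actually needs, not anyone's real implementation.

```python
# Sketch of an intent-level MCP tool (assumes `pip install mcp` for the Python MCP SDK).
# The agent asks for "brochures for this account"; the server hides how Drive is traversed.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sales-briefing")

@mcp.tool()
def get_account_brochures(account_name: str, max_results: int = 5) -> list[dict]:
    """Return the most relevant brochure files for an account, newest first."""
    # Hypothetical helper: in practice this would call the Google Drive API (or a
    # Drive MCP server) and encode the folder/search conventions here, so the agent
    # never has to learn Drive's file structure or spend turns exploring it.
    results = search_drive_for_brochures(account_name)
    return results[:max_results]

def search_drive_for_brochures(account_name: str) -> list[dict]:
    # Stub so the sketch runs without Google credentials; a real version would query Drive.
    return [{"name": f"{account_name} one-pager.pdf", "id": "stub-1"}]

if __name__ == "__main__":
    mcp.run()  # stdio transport, suitable for local MCP clients
```

The point of the sketch is only that the tool's name and docstring describe the agent's intent ("get brochures for an account"), while the downstream service is an implementation detail behind it.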
33:28
>> Yeah. Uh thank you, Alex and Mark.
Right. So I I love the I love the way you described it.
Uh so so basically what you're saying is that the responsibility for uh describing the contract is on the publisher side. Whoever publishes the MCP server and
33:43
then the caller it because I just uh I just call uh in natural language and the on the other side it is the job of the uh the MCP server publisher to actually standardize I would say the way around right like
34:00
you know hot take I think the reason why you know so many of the services are releasing MCP servers and they're getting adoption is because the first wave of MCP adoption are is cursor and warp, right? It is it is general
34:16
engineering services. And so when the user is trying to >> like when trying to communicate about what they want to do in a prompt when they want to interact with Jira or linear or you know docker or MongoDB they are stating intentions that are
34:32
almost tightly onetoone mapped to what those services do. Um, I'll give you an example.
Like I I actually have just as much success, if not more, using the AWS or the GitHub command line tool in my agent than I do the actual MCP server
34:50
because the intent is mapped almost one to one. Um, but if you look at the agents that are being built in the enterprise or in production, they have nothing to do with the downstream services, right?
like they're trying to achieve a higher order business goal and the downstream services are really
35:06
implementation details and so like an MCP server in in practice in in in production is just tightly coupled to the agents intent and so what we've been seeing is that they do best when they're written by the application developer not
35:24
the downstream service provider. >> Okay.
>> Yeah, I would I would plus on that. I think the best uses that we're finding as well is when we're describing in the MCP server tools as how we would want that agentic application to use it and
35:39
you know really um inside the MCP server the most critical part is the tools that we're exposing and then how we are actually describing the intent of those tools and the tools are not a onetoone mapping with the APIs they will use APIs underlying it but often times it's a combination of multiple APIs that bring
35:56
it together >> and Mark you wanted to add something. >> I I'll just double down on what Alex and Rena both said.
Alex said it, and Reena just said at the end, this point that the de facto MCP today is effectively: I have a REST API and I slapped an MCP on top of it. That is a
36:12
starting point. But to Alex's point doesn't it only it only kind of works.
And actually the way I I give a lot of people advice this is it's incredibly easy to write an basic MCP server that does that. It's actually incredibly hard to write one that's good that actually expresses intent and actually solves
36:27
goals that are built for an LLM. Like there's a big difference between REST APIs for curl and what the LM actually needs.
And I think that gap is something everybody's kind of trying to still figure out as we as basically as we speak. So >> yeah, and that gap is measured very objectively um and quantifiably.
It is
36:45
about token minimization, error-rate minimization, and latency minimization. And so if you give an agent a generic MCP server, the hardest example, the one that most people fail on, is a
37:01
database MCP server, right? If you give your agent like a MongoDB MCP server and you just interact directly with it, you're going to have a very hard time.
You're going to have to stuff it with so much context. you're going to have to do
37:16
so many turns on it to get what you want that you're ultimately going to consume a ton of tokens. You're going to have really high latency and as a result of both of those, you're going to have high error rates.
And that's not nothing to do with MongoDB. They did a great job
37:31
with their with their MCP server. I think it's a fantastic implementation.
It's just you're using the wrong tool for the wrong job. What the developer should be doing is they should be building an MCP server themselves that represents the kinds of activity or
37:47
intentions that the agent wants to perform that just happen to consume MongoDB. And when you do that, it's a one shot.
It's like, hey, pull up users. There's no checking the database to see do you
38:02
What does your table even look like? The MCP server already encapsulates what it means to look up a user.
So, it's one it's one call. There's you don't you don't need very much context at all.
Um and as a result, it's cheaper, faster, and lower error rate.
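Since the gap is framed in terms of tokens, latency, and error rate, here is a small, purely illustrative sketch of how a team might instrument its own tool layer to track those three numbers per call; the token counting is deliberately crude (whitespace splitting) and the metric names and example tool are made up.

```python
# Rough per-tool-call instrumentation: tokens in/out, latency, and error count.
# Illustrative only; a real system would use the model provider's token counts
# and an observability backend rather than an in-memory dict.
import time
from collections import defaultdict

metrics = defaultdict(lambda: {"calls": 0, "errors": 0, "tokens": 0, "latency_s": 0.0})

def rough_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def instrumented(tool_name: str, tool_fn):
    def wrapper(**kwargs):
        m = metrics[tool_name]
        m["calls"] += 1
        m["tokens"] += rough_tokens(str(kwargs))  # approximate input size
        start = time.perf_counter()
        try:
            result = tool_fn(**kwargs)
            m["tokens"] += rough_tokens(str(result))  # approximate output size
            return result
        except Exception:
            m["errors"] += 1
            raise
        finally:
            m["latency_s"] += time.perf_counter() - start
    return wrapper

# Usage: wrap a tool before registering it with your agent framework.
lookup_user = instrumented("lookup_user", lambda email: {"email": email, "plan": "pro"})
lookup_user(email="jane@example.com")
print(dict(metrics))
```

An intent-level "lookup_user" tool that resolves in one call will show up in numbers like these as fewer turns, fewer tokens, and lower latency than letting the agent explore a raw database server.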
38:19
>> Okay. >> And Alex, um I'm glad you pointed this out, right?
So, so some of the MCP servers that u I've used, I don't want to name names here, right? So, in many cases, right?
So, um so my intent is missed, right? So what I'm trying to do
38:34
it it really it really doesn't get done right. So um so uh does does this uh uh does this actually um make you um you know so if it is more of a descriptive or generative task
38:50
that's a different story but if it is more an action that I have to take right so any misfired action based on a wrong understanding or interpretation of my intent it can actually expose the organization to all sorts of problems.
39:06
So my next question, the follow-up question, is going to be: what are the most pressing security and governance risks when deploying MCP? The so-called RBAC, or role-based access control,
39:22
right so uh when you deploy an MCP server uh uh is it going to be the system user that is going to be deploying it or is it going to be individual when I use it is it my application that is using that MCP server or am I using it as Raja right so
39:38
and going and using that MCP server so my question for for you um perhaps Mark maybe I can start with you here um so h how what are the most pressing uh security risks uh when deploying MCPS uh because you are you're doing a lot of
39:55
work in that space so let let's start with Mark right and and how do these differ uh from a traditional API or plug-in integration >> sure yeah actually I'm going to get I I do want to hit on a point or build on something Alex just said though on this because I I don't know about Reena and
40:10
Alex if you guys have this experience. Um I talked to a lot of people about MCP and like the thing that comes up is like oh what are we talking about?
It'll all be MCP remotes and it's all fine. And the thing I keep telling people is literally what Alex just said and why why we're so invested in Docker on these being containers being the right platform from this.
You can can change
40:26
where you deploy it. I deeply believe everybody's got to get their head around the thing Alex said of you you the application developer have to go write an MCP server that can itself build on somebody.
You can go build on the Stripe one, you can build on others, that'll all be useful, but so much of these things are
40:42
actually about business goals and the business goal is fundamentally about that organization and their setup that I think most people you're not going to actually get what you want with most off-the-shelf things. You're going to have to go build a lot of these yourselves and actually put the thought and logic into that.
So that's anyway just as a as an assertion this is why
40:58
like we know I'll put the one the one shameless plug above um we >> Mark you're referring to this as a consumer on my consumer or as a publisher of MCP server right so so as a consumer I have to add >> honestly honestly I think either so
41:14
there are the people building MCP servers, let's call them independent software vendors; Reena, you're an ISV, you need to have an MCP server out for your stack, or your Docker, or whatever else. And then there's the consumer side. If you posit that everybody is writing agents in the
41:29
year 2025 or if everybody's at least thinking about writing agents and in 2026 they're definitely writing agents. If you're doing that, you basically have to both consume MCP and produce MCPS yourself.
And so there's not really a difference in my opinion. And in which case this really requires you to then
41:45
effectively go back to: okay, what does it take to build and package software, because that's all it is, and how do I share that with somebody else? And you have all the same problems you've always had: where are the credentials, where does it run, how do I get it there, how do I share it with somebody else, how do I version it, how do I revoke it, all that stuff. And so that's, you
42:01
know, we at Docker have been putting a very emphatic voice out to the industry that OCI, the Open Container Initiative format, is the right way to do that and to package the ones that you yourself build, regardless of what you're consuming. Anyway, that's my plug on the
42:18
before I get into the security the security panel a second Al, Alex, you're going to jump in. So, >> no, I I actually that's a great lead into a security comp to your own security answer you're about to tell.
But I would say that we actually agree with you. We all of our all of our toolkits are containerized with Docker
42:34
>> because I don't know how else you would do it. It's madness without it.
Uh, but anyway, Mark, back to you. I think you're answering the security question.
>> Yeah. Yeah.
So that mostly that's a it was a that was a good was a good build and a good you know a good thing for people to understand that you're going to be writing them yourself whether
42:50
you're going to use them or not you're going to write them yourself. Okay.
So in terms of like what are the problems and what's different. Um I'll give you some I'll give you some kind of hot take snarky takes and then some actual you know substantive things.
Um hot take I think right now like nobody knows what's going on. You go talk to any company and effectively I get the following story of
43:06
like what are you doing with AI? I'm all in on co-pilot.
We love the co-pilot. Oh, but then the developers found cursor and then oh my god, I don't know.
They connected Cursor to Jira, to ServiceNow, and it's madness and I don't know what's going on. That's effectively every company that I speak to.
And I think what that means
43:23
then, really, at the most basic level, before we get to all the fancy things of MCP, it's actually to start with basic observability, auditing, and kill switches. Those are the things I think people are fundamentally struggling with right now, because you've got very much decentralized adoption of the
43:38
coding agents all cloud code etc. And very much decentralized and democratized um kind of organizational push and thesis that we don't want to miss out.
We want everybody using all the AI they can. We'll figure out how to, you know, control costs and control things later.
But right now, I think just basics on
43:55
inside most organizations, they're really struggling with just even getting to that number one. But then in terms of the actual problems then beyond the the you know the the baseline, you know, whatever you you can go it'll take you 30 seconds to ask chat GPT or or Google for a long list and we'll tell you about
44:10
a 100 different problems. But um if I were to pick just a couple of them um you know prompt injection I think is the one and tool poisoning are the big ones that you know everybody thinks about.
But because it's so um it's so easy to say but it's also so easy for it to actually happen. Like there's a you know
44:25
we we at Docker published these horror story blogs. They're worth reading if you haven't but one of them is about actually the GitHub injection.
It's like cool. This is actually like you could almost imagine even if it weren't like the truly most malicious attacker that was like poisoning GitHub issues with like prompt injections.
you could almost
44:41
imagine the ele getting tricked and even with innocuous things. So one is one is that um you know I think the identity the identity and secrets management is like I don't know else to put it but it's a total pain in the ass.
Um, and so that's I think just a, you know, kind of a really hard thing for people to to get their head around in particular around
44:56
where they run and like again who you share it to. And there's just so much of this divide between what's a system agent, what's an agent acting on my Mark's behalf or my Roger's behalf or so on versus what's one for the team.
Like it's just it's hard. Um I think you know basically people what in practice like
45:12
we're kind of building on this story of well people connect the cursor to all their systems and all the tools are just super promiscuous and over permissive and I don't know everybody I know just hits the yolo button of like okay I'm I'm tired of hitting the I'm tired of hitting the allow button I'll hit the allow every time like just make it all
45:28
go away in which case it becomes very easy for these things to start getting chained in ways that you don't understand and because you don't have observability it makes it that much worse. Um, and then of course you'd have all the other supply chain risks of people, you know, injecting a bad tool and so on.
And then you kind of asked the very last thing and I'll I'll turn
45:45
over to to Reena and Alex to build on. You know, you asked like what's different about this than just APIs and and so on as part of the question.
And Alex kind of said this at the very beginning like there's this huge shift of you're not writing code anymore and you're not making decisions. The LM is the LM is the decision maker which means
46:00
you get very non-deterministic results every time it runs. you use a tool which makes it very hard to have trustable guardrails.
And so that, coupled with just how low-friction it is to connect these things, makes for a
46:16
massive blast radius. So, as I keep saying, it's incredibly easy to write a basic MCP server, in particular one that does reads, and even then you'll still have injections and risks and data leakage; and it's incredibly hard to write something that is both useful and safe.
So that's my my my take on what
46:33
what I see a lot of. >> Yeah.
Can can I build on that? >> Uh go uh Alex if I can then I'll let you jump in.
So yeah I I think I think great great context on that one Mark. I think what what kind of you know being on the
46:49
builder side here and where we are kind of you know really custodians of enterprise data and protectors of it. We do kind of you know take all of those very seriously from our perspective.
So I think it you know all of your points actually just don't go away. So I think kind of you know as we're moving from like you know uh humans making decisions
47:05
to LLM making decisions like you said um the the controls kind of stay the same. In fact, they just get enhanced is how we think about it, right?
And how I think about it also is that uh you know, giving admins still the controls about what what uh what MCPS they allow inside
47:21
their their system to then connect a cursor or copilot or whatever that new agent may be is very very critical and it in fact becomes even more kind of required now to say okay what are the tools that those uh that those kind of agents are actually or those MCB servers are actually building and providing to
47:37
my users? how much access do they control do they have and really kind of you know the question that Raja was asking is like do the uh do the granularity of it at the system level or the arbback or the user level go away it's like no they they stay the same I think the those kind of you know just manifest into different uh personas and
47:54
different kind of entities right now so I think as a as a MCP server builder we not just have to worry about you know uh what our server is doing but also what is the client we're enabling on the other side well who's the host and what are their security practices and
48:09
requirements and and possibly going to protect ourselves from other MCP servers that they are also going to be enabling in the same context because that is where the leakage of the data and other kind of you know proliferations could happen as well. So there is a kind of you know I think the security piece is just getting more and more heightened exactly what you're saying here Mark too
48:26
and um I think the uh the onus on the data governance both from the admins as well as the builders of software like us kind of you know just goes up even more. >> So if if I if I bring it down a level right now no one can write an MCP server
48:43
that is not a service account unless you are the vendor. And so um everything we just talked about I'm building an agent I want to talk to you know I want to talk to Jira as an example um or I want to talk to Google I can't write that MCP server and not and do anything other
48:59
than give it a service account because the spec does not support >> y >> uh delegated authorization and so there is no way for today for an agent to act on as as an agent on my behalf in a
49:16
scoped privileged a to any downstream service. The spec doesn't support it.
Um, now we as arcade, we contributed this an RFC to the spec. It was recently accepted um to go enable this uh to
49:32
enable delegated authorization, so that ChatGPT can send an email, because it's almost 2026 and you still can't send an email from ChatGPT for this reason. If you want to build an agent that sends email, you can't.
Uh, as a user, if you The reason
49:47
there aren't personal assistant agents is because they can't access your calendar. Um, and so that is the biggest problem uh in an MCP right now in my opinion and in agent development.
You can't have an agent act on your behalf unless you give it a service account.
50:04
And service accounts are a non-starter in the enterprise because it creates what's called an authorization bypass vulnerability. And so if I give a service account, you know, any level of of of of access in a system like Jira or Google or SAP or whatever and it's
50:21
anything other than the lowest common denominator of the user and then the intern gets access to the agent, the intern just privile just escalated their privileges. And so this is why when we talk to organizations that are trying to put agents in production, which surprisingly
50:37
are a lot of the Fortune 500 and 1000, they're having to kill their agent projects because they can't pass security review because they can't handle scoped access on behalf of a user. >> The only So I'll I'll jump on one thing there.
So I agree on all that. The one
50:54
caveat that I I will say that I'm seeing now, which has been surprising to me, but I see this because we're the desktop people. We have Docker Desktop.
We're seeing a bunch of agents actually get built that are effectively not being deployed to something like a backend which limits, you know, things like Slack connectivity and so on. But for
51:10
me, I want to make my Claude send an email. Effectively, this is what people are using Docker Desktop for, because I can get
51:26
endpoint because because a the coding agent runs there and b well you can actually solve a bunch of these problems because you already have them effectively with either you know one password of the keychain or you can at least get you know configuration of ooth injected manually. And so funny that's
51:42
all true except for this one thing of all of a sudden I'm seeing desktop usage like spike for that exact purpose. >> Yeah, actually I I'll comment on that.
So we we've been seeing that too, but it is a bit of an illusion largely and I'm not surprised that you guys at Docker are seeing it because the illusion is largely that the early adopters of MCP
51:59
and MCP desktop with standard IO with a standard IO interface are developers and DevOps people, right? and so that they're very comfortable, you know, running and installing these things on their machines and and and the types of things that they're talking to lend themselves these interactions.
One of
52:14
the big problems though, even with that model, is that in those scenarios that MCP server, it assumes the identity >> of the user. It's not acting on behalf of the user.
It is the user. And that's that's not a security problem.
It's a
52:30
safety problem. Yep.
And so I've seen Cursor try to delete my root directory accidentally. Now, thankfully, it doesn't have sudo access.
And so when it tried it failed and it was very kind. It apologized profusely.
Um but when the
52:46
agents act at like when they act as you and they assume your roles and your credentials without having like a delegated authorization model. Then they can delete emails, they can delete drive, they can delete folders.
like
53:01
there's no way, other than how you construct the MCP server, to scope what the agent can do separately from the user.
And so the right way to do it is basically OAuth. You have an application that's registered, that has its own scopes and
53:17
claims as your agent. And so you can say, "Hey agent, you can't delete email, but you can read them." And then you have the user's identity and their permissions, which is I can only read and access my email.
And then the intersection of the two gives you security and safety, but it's early
53:33
innings like this is what we do as a business. Um, help people with that, but the spec is catching up.
And so it's a problem right now. We're all bullish that it's going to be much easier in about 6 months or earlier, but it's one of the biggest problems right now.
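To make the "intersection of the two" idea concrete, here is a tiny, hypothetical sketch of computing an effective permission set from the agent's registered scopes and the user's own grants; the scope strings are made up, and in a real system this decision would live in the authorization server rather than application code.

```python
# Delegated-authorization sketch: the agent's registered scopes are intersected
# with what the end user is actually allowed to do (illustrative scope names).
AGENT_SCOPES = {"email.read", "calendar.read"}               # app registration: no deletes
USER_GRANTS  = {"email.read", "email.delete", "drive.read"}  # what this user can do

def effective_permissions(agent_scopes: set[str], user_grants: set[str]) -> set[str]:
    # The agent can never exceed the user, and the user can never exceed the agent.
    return agent_scopes & user_grants

def authorize(action: str, agent_scopes=AGENT_SCOPES, user_grants=USER_GRANTS) -> bool:
    return action in effective_permissions(agent_scopes, user_grants)

assert authorize("email.read") is True     # both the agent registration and the user allow it
assert authorize("email.delete") is False  # the user could, but the agent's scopes say no
assert authorize("drive.read") is False    # the agent was never granted Drive access
```

The safety half of the argument is the first denial (the agent is prevented from destructive actions even when the user could do them), and the security half is that the agent still cannot reach anything the signed-in user was never allowed to touch.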
53:49
>> Yeah. So, Alex, speaking of Cursor apologizing profusely: if it had actually deleted it, it would have apologized even more profusely.
I mean, I' I've read those horror stories. I mean, they just uh you know, but but yeah, so let's let's actually use this
54:05
as a segue, right? So, on the building side of it, so uh so Reena, I know that Atlassian has been working on uh MCP servers for their products.
So uh so what what are some of the in your opinion as as someone who's releasing
54:21
this to your customer I mean you have a huge customer base a lot of adoption what are you what are the potential implications of any lapses that uh that came up in your product meetings say hey I mean we don't want to do this if we if we allow this action this might happen
54:37
if we disallow this is going to be a product you know it's going to be a poor customer experience so um how do you can Can you walk us through and give us an idea what are some of the potential implications and what is the thinking process behind uh when you're uh building these MCPs?
54:54
>> Yeah, definitely you know security is paramount for however we think about it. You know before we develop a product the first thing is like okay what's what's the security guard rails and constraints and you know um audit points and checkpoints that we're going to put in the system.
So that's definitely been one of the key guiding principles for
55:11
MCP, and MCP, as we were just discussing, is so new that we're still getting around to understanding it fully. Of course, the OAuth 2.1 implementation is something that we've adopted completely, and that gives us the overarching boundaries around the
55:29
authentication piece that Alex was mentioning. So that's one that we kind of you know make sure that we are doing definitely to start with.
But uh what kind of you know continues to be a challenge is uh uh the uh the pieces that you know we do want to kind of give access and control to the admins to do.
55:44
So we want to give the admins much more much more of the controls at the same time while we are building our MCP servers in a way which is secure. um give the admins controls for you know what they want to be enabling, which types of agents they want to be enabling to be using RMCP server itself as well
55:59
so that they can control their uh the uh egress of the enterprise data which is really kind of you know their crown jewels and u the audit trails that they need to have in addition to say you know the pseudo access type issues or something really nasty if it did happen
56:15
as well do they even know kind of what happened and do they have the ability to trace back uh we definitely don't want any of those things to happen in the first place. What our approach really has been you know let's start with the being mindful of exposing the tools that we start with as well so that we are
56:30
kind of you know being super careful about what could be destructive actions that could be taken uh around the data itself and uh you know making those much more u uh constraint and restraint around that aspect as well. Um so this
56:45
is this is really kind of you know our uh key tenants here that we ensure that the security and data governance of our customers data is something that we are controlling and then of course we are giving the admins the tools and the granularity that they can then uh control while also kind of you know
57:00
really enjoying all of the benefits that they uh that they want to be and we we want them to also have through the new agentic AI workflows that they can enable. So, so uh do you think a misconfigure misconfigured MCP um tool or MCP server
57:16
I mean, can an intern delete the entire Git repository that I've been working on? >> Definitely not. That's exactly why we don't expose tools that make those things happen either.
So you know that's kind of definitely the paramount piece is like what tools are we exposing in the first place and that's up to us to
57:32
to control. >> This is great and thanks Reena.
Uh so so Mark um uh Docker is sitting at this uh you know they they have built this ecosystem you know for you know other uh you know containerization and and now
57:48
even for MCP right so you have an MCP image it makes it actually quite convenient to go and get that MCP server deployed in no time actually I love I love that as uh you know as an engineer how do you vet this out I mean this is I I don't even know how you must vet this
58:03
out because how do you mitigate at uh the the supply chain and open-source risks and uh you know abuse and misuse or how do you vet out? >> Yep.
No, great. Well, one I'm very I'm very glad you found it easy and and you were happy with it.
That's actually
58:19
>> Yeah, I love it too. Um in terms of, you know, how do we vet it?
Um yeah, this is the part where it this is part of why I'm so excited for us at Docker to be doing this. Look, we can build on top of the shoulders and foundation of what we've been doing for the last 10 years.
So ignoring MCP for a moment um you know DockerHub is this vast
58:36
vast ocean of things out there that you can go pull and um you know we billions of polls you know a month come out of it. Um there is however a docker official images curated section where we've been for the last decade working with open source and upstream on
58:52
effectively every project that you know is the top whatever 100 200 or so that that really matter to folks. Um and effectively it kind of looks like a distro where we stamp and certify that the source and the maintenance of those packages is good that there's not
59:07
vulnerabilities and we build that on our systems. So we've been doing that for a long time for base images, application images, framework images etc.
For the MCP catalog of course you can publish your own but for the point of getting to curation of how do you actually know what you trust and what you vet and so
59:24
on. effectively, we have this cheat code of, well, I could build on top of the exact processes we have or we're already pretty good at working with open source and upstream.
And so, yes, MCP requires a new cast of characters that didn't exist before um to work in there, but effectively, we're able to use the same
59:39
process and the same, you know, and this this is not a thing I can tell you about where it's like, oh my god, we do all the amazing AI that just makes it all possible. Well, at the end of the day, it's humans and us actually going through with a high degree of confidence and a high degree of vetting and trust that actually what's coming in is actually something we will stamp and
59:54
we'll certify and that's what makes it into our curated catalog. This is part of why besides, you know, really the the goal from us is make it incredibly easy and low friction for a developer to connect to, you know, cursor cloud code or whatever and work with MCPs and or
00:09
build on top of them, but also make it safe and possible for you to work from out of the registry. So, you know, we're we're it's an open registry.
We try to have anybody and everybody come in that can work with us and we you know do all the the vetting like at last scene is in there and Stripes in there and I don't remember if there's an arcade dev one off the top of my head but you know we'd
00:25
love to have one. So >> yeah there's a lot actually.
So we are using quite a lot of those actually in our uh on our end and um yeah it it makes it actually quite convenient. Um so um Alex u how how is uh how are you
00:43
guys actually helping this entire ecosystem? If you can give us an idea. >> Yeah, so the biggest thing that we do is that we solve the auth problem, the agent auth problem. And so if you're trying to build
00:59
an agent and you want the agent to interact with some of these secure systems which is almost everything Google ajira slack um SAP your own custom APIs. We make it very easy to build secure uh toolkits
01:15
that that the the agent can communicate with and offload work to. Um and as part of that, you know, we're very big believers and contributors to the MCP spec.
Um we just contributed a lot of our a lot of the way we do it to the spec. Um and as and as part of that, we
01:31
ourselves also build out a lot of these toolkits. And so we have a large library of pre-built, LLM-optimized, LLM-evaluated tools for all the major services, including Jira, which just came out about a month ago.
Uh it's
01:47
great use it um to make it very easy for agent developers to just go turn this stuff on and build whatever it is they want. Um, and then we have a really nice SDK so that a developer can very easily and very quickly build their own toolkits for what their agent needs that
02:04
might consume other downstream services or just do something for their own services. >> So should I expect role-based access coming up coming up with MC?
>> Yes. So we already support role-based access right now.
We support delegated
02:19
delegated authorizations. We support end-user-based OAuth today.
We just don't do it via MCP until the spec can support it. But we added that to the spec along with our partners at Anthropic and other companies.
Now
02:35
we're all just collectively waiting for that to get merged into the spec. And once that happens, all of our functionality around O will also support MCP.
In the short term, until that happens, we have toolkits that work without MCP and do all the same work. Okay.
Um so um um I will present this uh
02:55
not so hypothetical question right so as um as someone who's building products uh and I'm uh I'm someone who is adopting MCP servers and uh integrating them in our my agenti workflows. So uh um let's say I'm the CIO of a company.
How should
03:12
I decide or um evaluate an MCP product and what should be my considerations uh when I bring in uh of course I I cannot go MCP server by MCP server right? So what would be the best practices of bringing MCP in my uh in my
03:31
organization? I know it's a tough one.
I know that right. So you know because there there's no there are no books.
I mean we are still it's very nent but I'm just curious uh you know for instance I mean my first rule internally is it has to be
03:48
the official MC MCP server from Docker and that's that's the first starting point. So I took the easy one.
Uh what would be the what would be the recommendation beside this? How how should a CIO evaluate?
Yeah, I mean I I'll take the I'll take the opposite position
04:04
kind of back to my earlier point like the official >> the official anything MCP server is not terribly valuable if you're using any of >> course. Yeah.
Yeah. >> And so I I I don't even know that that would work.
Um >> yeah, >> I would inverse the question and be like
04:20
what's the agent you're trying to build >> and let's go make sure we get like let's go it's basic product management. Let's go figure out what the agent is.
Let's go figure out how we're going to do it. And then and then if an MCP server is the right answer, yes, let's use that.
If if a diff, you know, if A toa or something else or, you know, customuilt
04:38
thing is is the right thing and let's take that path. Um, you know, I don't I don't yet believe MCP is we're huge believers in MCP and we think MCP will be like, you know, the thing for everything, but it isn't that yet.
And I
04:53
think people people people jumping ahead to where MCP is going to be and expecting it to be that today is honestly the biggest headwind to MCP like so much of the negative press is because people are expecting too much of a of a of a protocol that's not even a
05:09
year old yet. Um and so you know right tool for the right job.
I would say I I I will say well agree I mean at the end of the day I agree and um I'd even go further. I I I I spend
05:24
even for our own organization we have a lot of build versus buy conversations because everybody is out pitching you their you know every every SAS vendor is expectedly telling you like oh I got an agent now so you know you got to figure out do I actually think this is going to be useful or not. >> Um but so you agree with Alex actually in terms of >> at the end of the day like it's it's
05:40
just not that big of a protocol. It's only a year old.
Like I'm whatever. I'm old.
I've never seen a protocol in my life go this fast from a blog post to like everybody on Earth uses it and we're having conferences and panels about it. Um it is, you know, you you are still going to connect.
It's not
05:55
everything Alex said is true. You are still going to need to find them to go connect to something in which case, yeah, the easy button is well start with Docker, start with the remotes because at least you're getting a warranty on those where they come from.
And if not for that, then you're basically in kind of like to Al's point in product management 101 of and quality assurance
06:10
101 of well, you're gonna have to in introspect it, understand what it does, understand how it breaks. Like you know the every every as I I've been describing to to people for 30 years and teaching interns and so on.
Like if it magically works, it will magically break and LM are magic in which case they may
06:26
magically break. An MCP may magically break.
And so really it's on you know it's on you the organization to understand what it is you're consuming and understand what the boundaries are for it and then you can make an assessment and there's not you know we at Docker can give you the shortcut for getting them from at least an upstream
06:43
place at least you know they're trustworthy and they're good whether they solve your exact problem and I loved Alex's framing in terms of errors um cost uh and so on at the beginning like in terms of your actual objective goals but at least you can trust them you know you still have to you basically have to go through the same things
06:58
you've always got when you bring in any kind of a technology whether it's MCP or just generic open source or a thirdparty vendor which is do I understand what it does am I going to get value from it and do I trust it and have I actually quality done some level of acceptance testing that it meets what I want as an organization
07:14
>> That's great, thanks Mark. So there was this report in the news from MIT that 95% of investments in AI actually result in zero ROI, and only 5%
07:31
actually uh provide any meaningful business uh value and uh um I think one of the reasons among many other things I mean it's it's far more complex than uh this that but one of the one of the complexity in building these agents
07:47
systems is uh well uh getting the right context at the right time uh and correctly uh and uh lack of standardization siloed data sources and all of that right do you think uh that characterization that assessment is correct among many other
08:04
things and do do we think that MCP is going to help in this case or is it going to hurt right so maybe people are going to become more cautious and hey I mean we don't want to deal with this right so Reena you want to go ahead and uh maybe what's your take on this
08:20
>> definitely so um that was such a fascinating report I was like really really I'm first of all I'm very very grateful for the insights that they provided, the depth of findings that they've gone into. Um, and of course like I think uh I suspect you know it's a little more stark than what
08:37
probably the reality is but it's good to have I think the a little bit of kind of you know cold water on all of the uh the high expectations that we've had on AI for a bit. So uh coming to kind of you know your question on what what I think about kind of you know where things are heading and how uh maybe in the context
08:53
of this particular question or with this particular panel here as to how is MCP going to kind of help help with this space or not. I definitely think that you know I'll go back to what Alex said a huge plus one to that which is that first of all I think what we haven't been able to do in the AI space really
09:09
is understand what are those hero use cases what is this you know uh working backwards from the customer or the user of these which is you know the knowledge worker and most of these enterprises what do they really kind of you know want and care for and how where is the product market fit right now like today
09:24
with the agentic systems that we can provide them and uh this is where I think the uh of course the MCP P can be a very rich set of tools that we can give to the agentic systems to then help those solve those problems. And I think the fact that uh the MCP makes these
09:40
much more kind of you know discrete and easy to understand more deterministic in their flows solve some of the quality assurance and of course provides a little bit more guardrails around security if we add them to the mix a little more. I do think that there are ways for us to be able to solve some
09:57
more of the customer use cases problems and really kind of know getting into the reliability pieces, getting into the quality pieces, getting into the understanding of really where the user use cases are. So I think if we were to kind of for our customers do better with
10:12
understanding their use cases and then being able to you know bring the right tools for the right uh uh for the right problems we'll be able to get much faster into the adoption and the usage of these agent systems. I think we are still kind of grouping our wares around for various of these uh installations
10:29
unfortunately today. So Reena, so if I understood correctly, so it is basically not having well- definfined use cases.
uh and um so we'll need a lot of uh very technical talented product managers then who can
10:44
>> yeah I think I think it's it's a question of really marrying the use cases and where the users are at bringing them along the journey and bringing the tools that match their kind of you know use cases and their requirements and really MCPS can help with that because we do kind of you know
11:00
then pro able to provide them those specific tools those specific kind of you know understanding while leveraging all of the kind of you know really great smarts from the agentic AI spaces of you know being understanding the inference and you know understanding the context bringing the context along both in the
11:17
user space as well as from multiple other sources uh to the same agent in the same in the same session. >> Thanks Reena Mark Alex do you want to add anything there?
Sure. I'm I'm full of hot and snarky takes.
So basically if
11:33
you take um the report on AI everything said like I agree with by the way. Um but if you take that report from MIT the 95% of people get zero value.
Um if if you just I'll draw a comparison. If you just start with just developers and I'll come back to the business view of all AI
11:48
in a second. For my entire career I've watched people try to measure developer productivity and largely fail at it, where it's like, oh, we have all these metrics, we have DORA,
we have SPACE. We have all these things that we go do, and at the end of the day it's incredibly hard and subjective to figure that out, so it's hard to figure
12:04
out have you actually moved the needle on developer productivity inside an organization from an ROI perspective of like whatever you brought in and you kind of like it's like I subjectively know like oh this was better this was not and I got some metrics that may or may not line up to that but so from an AI um in whether the right number is 95%
12:21
or not I don't know but in that paper what I particularly loved was the shadow there's a call out but the shadow effectively the where it succeeds ing is the shadow organizations that are basically the people that are off close to the close to the actual problem thinking hard about it and then just kind of doing their own thing in the
12:36
weeds and they come back with something that actually solved the business problem because when you go to do I find that number kind of believable like I do at the end of the day because I think if we can't articulate very well what ROI is on a development organization and everybody struggles with this most businesses I think especially from a
12:52
productivity productivity is very hard to measure and manage from an actual productivity measure of just any function whether it's HR, sales, accounting, whatever it is, it's incredibly hard and so you're just like pouring money into AI hoping for some magic to come back and it's really the
13:08
cases where the people understand what the pain point is and what the value is that have actually solved it and so my suspicion is it's better than 5%. But from if you're looking at it from like a board level view, probably that actually number is pretty accurate to what I would imagine is actually happening.
But you know, the truth is somewhere in the middle
13:23
>> and you can never make the board happy, right? So I mean you have to keep doing more, right?
So >> yeah. >> Um okay, Alex, you want to add something?
>> Yeah. Yeah.
You know, we're we're in the field all day with customers that are actively like have staffed and budgeted agent building teams. Um, and so from
13:40
from that from that perch, um, you know, I don't know what the number is, but it's directionally correct. We see a lot of failed projects and killed projects.
So, I'll use that as the metric. Um, it's actually a it's a very finite set of reasons why these projects fail.
It's
13:55
the costs were just untenable because of the token weight; they were just using too many tokens.
So, the LLM costs were were uneconomical for what they were trying to build. latency was just similarly just unachievable.
It was just too long uh for user time to
14:12
wait that long. And then um the third one was the accuracy rates weren't high enough.
Um in a demo you can get something really cool working but if you can't get it above like 80 something% consistency, no one will use it in production. You can release it but no one will adopt it.
And then oh and then
14:28
there's the fourth which is security kills it. Um, so we see we see some combination of those happening to the majority of projects we see in the field.
Um, and MCP isn't, by itself, the solution, frankly. Not because it's bad.
It's just a protocol, right? It's not a
14:45
white knight. The problem is multi-layered.
Part of it is talent: very few people know how to build agents today.
And so they're running into a lot of basic problems. Nobody knows how to do evals, as an example.
The second one is there are
15:00
real security issues, around authorization, around safety, et cetera, that people are trying to figure out. But I do think that MCP represents a body of work that is part of the solution, which is tool use. But I'll remind the
15:17
audience that MCP is a protocol for tool use. It is not itself tool use or tools.
>> Thanks Alex. We are actually over time.
I will take take a look at um probably we can answer two questions from here.
15:34
So I will just quickly scan through. Um so I will take this one.
Rolling out MCP org-wide: what's your no-brainer guardrail stack, and how do you keep dev speed up?
So basically I think the question is around what kind of guardrails you would put around MCP,
15:52
and how you keep up dev speed while staying compliant? >> I feel like this is a gimme for Alex and me, where the two things you would clearly want to roll out are Docker and Arcade.dev, both of whom are actually solving a whole lot of MCP governance
16:08
problems. Um but to some extent that's kind of a serious answer to some extent.
You know, I don't think this space is anywhere near mature enough yet that we can all point at one stack. I'm going to date myself here, but there's no equivalent of the LAMP stack yet, where we all agree on the exact
16:24
right configuration of things. A lot of this is still evolving: evals today are this very basic, narrow thing for prompts, but really you need overall observability of the whole platform.
And we're just not there yet, frankly, in terms of something where everything works end to end across supply chain and
16:41
discovery, through building, through deployment. It's just not there. And I think we're living in the world that paper describes.
It's the shadow. It's the shadow IT or the shadow organizations that are doing this and they're kind of like wild west putting together what they need
16:57
from a bunch of different places because the space is just moving so fast. So that said, I think Docker and Arcade.dev are both solving very meaningful problems for people and we will help.
>> Yeah. And actually I think you have the right panel for it, because as much as we can solve things like delegated
17:12
authorization, we can only do what the downstream service, like an Atlassian, has exposed at its policy layer. So if you're a SaaS vendor and you want people to use you in their agentic flows in a secure fashion, or you
17:28
yourself are trying to expose part of your own functionality for your own agents to consume, the policy layer, RBAC, ABAC, contextual access management, all of that stuff, it actually needs to live in
17:46
those services that you want the agent to talk to. You do not want to create an intermediate layer for this stuff, because you're going to end up with identity silos, and those don't work at all.
You will create more problems than you'll solve. And so
18:01
it's like the work that Atlassian's doing and Mongo's doing and other people are doing: having richer policies available to lock this stuff down, so services like Docker and Arcade can best enforce them.
18:18
>> Thank you, Alex. Uh we are way over time, so I will uh I will close with one last question.
I will go around and uh all of you can answer this. What are you what are you most excited about?
in the context of whatever you're building at Atlassian maybe Reena you can go first
18:33
and then Docker and then Arcade. >> Yeah, specifically in the context of MCP servers, probably what is most exciting for me is the collaborative agent workflows we can build across multiple different SaaS tools, and how we get our users
18:50
so much more value outside of just the Atlassian boundaries, so they can connect their whole system of work, not just Atlassian tool sets, but email, Slack, Salesforce, and the other
19:07
ways they manage their campaigns and create content. So really, how do we become more and more available to our builders and knowledge workers, so they can leverage all of these agentic workflows in a
19:23
collaborative way, in a way that is totally secure and follows where they want to take their systems. Thank you.
Yeah, Mark, >> I can go next. Yeah, I um uh sort of similar at the end of the day, we're an
19:40
app dev company. We're kind of a DevOps company.
We help people build and package apps and get them to prod. And I think what I'm most excited about is doing for agents what we did for microservices, right?
You know, Docker basically made the modern era of cloud native possible by being container native, by
19:55
helping people factor things down and then actually have flows around that. And at the end of the day, we have a lot of stuff we're doing around security and governance and management and deployment and all the rest for MCP.
And fundamentally, what I think is most exciting is that as people get through this first wave of having a whole lot of agentic projects fail, we
20:13
will get there. And I deeply believe everyone will figure out that they have to spend more time than they think, using something like what Reena built, building their own logic on top of that to suit their business need, and then packaging that up as applications.
And I think you know we at Docker are
20:29
this is this is kind of what we do. It's kind of our jam.
So I'm I'm excited about the new world as we evolve forward. >> Uh I'm super excited because uh we've been talking about models models models models for the last you know two years and the models are now good enough and
20:46
uh the real problems the real bottlenecks in agent development are now what can the models actually do? What can the agents do with all of this intelligence?
And the answer to that question is tool use. And that's what we're enabling along with MCP.
And so,
21:02
uh, I'm super excited to be to be helping to drive that stuff. >> Okay.
Um, thanks, uh, Alex, Mark, and Reena. Thank you so much, um, for having us.
>> I I would love to continue the conversation, but we are approaching the next uh, next panel, right? So, we'll we
21:18
going to take a five minute break and then we'll come back. Thanks everyone.
Okay. Bye.
Thanks everybody. >> Thank you.
26:03
Okay. So we are going to continue with uh think of this as a more of a session that puts everything that we have learned so far together.
Uh so what we will talk about is um we'll look at u
26:21
this part tutorial, part demo, where we are going to look at how everything that we have learned so far actually comes together. The panels were scheduled with a lot of attention to
26:38
detail. We thought very carefully about the different aspects of agentic AI, and the entire conference was put together by practitioners, so no fluff.
So uh so whatever we have learned so far in terms of cognition and memory
26:56
building context using MCP, security and compliance issues, we will see why it is so hard to build an end-to-end agentic workflow and how organizations
27:13
can build these systems. So I'm going to do a short tutorial followed by a demo of an actual working product, including, of course, how MCP connections work,
27:28
what kind of controls are there someone mentioned guardrails I think that some of the panelists and also some of the questions uh came around uh guardrails uh and eval so on so we are going to talk about all of that um let me go
27:44
ahead and start sharing my screen. Okay.
And okay, we are here.
28:04
So, how do we how do we take um uh an agentic AI workflow and deploy it end to end? So that is uh what we are going to talk about.
Um so u I have not done my formal introduction uh so far right. So
28:22
I have been doing AI for quite some time. I am also a part-time faculty member at the University of Pittsburgh. I recently built
28:39
an agentic AI platform called Ejento. Prior to that I founded Data Science Dojo, and I worked at Microsoft; as I said, I've been doing AI for quite some time.
So here is a very simple agenda. We will
28:56
talk about what are the considerations when you're building an agentic AI uh workflow and agentic AI application and we will actually go ahead and see an Aentic AI platform um that is uh built keeping all of uh all of the
29:12
considerations in mind. I like to characterize this whole agentic ecosystem using this pyramid, and for those of
29:28
you who recognize it, this is inspired by Maslow's hierarchy of needs. If you look at it, most organizations start building their agentic AI applications
29:46
by simply building a wrapper around some of their existing models. But the
30:01
reality is that before you even get to building and scaling AI agents in production, you have to take care of your infrastructure, your data, and your tools. Are the tools connected?
Uh are the tools interoperable? Are they
30:18
connected or not? Um is uh uh is everything uh secure and you know are you compliant?
Um uh you have to worry about data governance. And now we talk about AI governance.
Uh we talked about
30:33
prompt hijacking and jailbreaking. And in the previous session we were discussing having an appreciation and understanding of organizational alignment, in terms of what use cases we are working
30:49
on and what we are not working on. And that MIT publication that talked about 95% of POCs never
31:07
making it to production: this is primarily because of all of these factors. Most organizations jump directly to scaling agents, which is fine for a POC, but beyond that there
31:26
has to be some legwork that gets done. I'm going to talk about some of the lessons learned, and after that we are going to look at a demo, and I will show you what a complete, finished product looks like. Okay.
Um so I showed this uh slide
31:45
earlier this morning as well. Uh in an Aentic AI workflow you have uh the first thing that you have to have is uh the perception right.
So are you you may be reading data from a PDF from a from an image from it could be a multimodal
32:01
input. Uh do you understand the intent?
Intent is everything. Uh so um uh if the user types in something um what if it is misunderstood?
So there is this perception aspect to uh agentic AI. Uh
32:16
then a lot of legwork, a lot of grunt work, happens in cognition and planning.
You have goal management, reasoning and inference, then planning, memory, and self-reflection. Then there is knowledge representation:
32:33
context engineering, short-term and long-term memory. Do I know this user from previous interactions with the application? Do I know what the user asked earlier within the same session? Do I have access
32:49
to the user's HR records, or to the user's previous orders, depending on the application? Then there is the autonomy aspect of it. Once it has perceived and planned and
33:07
understood the knowledge um does it take actions uh what kind of actions it will take? Then the safety and governance aspect of it uh which actions are allowed and which actions are not allowed.
And this this really aligns
33:23
with the discussion that we were having not just in the previous session and the sessions before. We have talked about all of this the entire day.
Uh so safety and governance is important right. So we want to make sure that uh you know uh the applications that we are building they are safe and evaluation.
Uh how do
33:42
we evaluate? In traditional machine learning, if I have to evaluate an application
33:58
that is, say, a fraud prediction model, it is very simple: either the transaction was fraud or it wasn't, so my prediction is either correct or wrong. That was a
34:16
classification model. In a regression model, say a housing price prediction model, my prediction could be off by 100, or 1,000, or 10,000.
I can measure that deviation. And for other
34:32
predictive models there are other metrics. But how do we evaluate an LLM application? One common pattern is to use another model as a judge; there is a tiny sketch of that right below.
So these are some of the building blocks that eventually turn into challenges when we are building
34:47
agentic AI applications. We will talk about some of the lessons we have learned, and for those of you who have attended all the sessions, you can see how these pieces are connected.
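(A minimal sketch of the LLM-as-judge idea, included for illustration only. `call_llm` is a placeholder for whatever chat-completion client you use, and the rubric fields are made up, not taken from any particular product.)

```python
# Minimal LLM-as-judge sketch: a second model grades an answer against reference notes.
import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference notes: {reference}
Answer: {answer}
Return JSON: {{"faithful": true or false, "relevance": 1-5, "reason": "..."}}"""

def judge_answer(call_llm, question: str, reference: str, answer: str) -> dict:
    """Ask a separate model to grade faithfulness and relevance of an answer."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, reference=reference, answer=answer))
    return json.loads(raw)  # in practice, validate the JSON and retry if it is malformed
```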
Now,
35:06
as much as many of the leading companies try to convince you otherwise, turnkey enterprise generative AI is a pipe dream in most cases. I'm not saying in every single case, but in most cases. Building a
35:23
production system that is safe, that is compliant, that keeps you out of trouble, and I'm talking about enterprise applications of course, that is accurate, that is cost-efficient: there's no turnkey
35:41
solution. Usually barring a few uh simple use cases, there is no um turnkey uh solution that is possible.
Um you have to work on it, you have to tweak it, you have to build it and there will always be some aspect of working on um
35:58
on these uh working on these applications. And this is uh this idea that when um when a technology trigger happens um case in point chart GPT chart GPT came out uh generative models claude uh
36:16
Gemini um uh Chad GPT all of them the inflations are the expectations are inflated I mean PC's on like maybe 10 PDFs or 100 documents they work amazing right and suddenly when We when we try
36:33
to operationalize or deploy these models or deploy deploy these applications um there are adoption barriers, there are challenges um and then there is this sustained learning curve after that. Uh you know organizations they invest on uh invest on learning and making sure that
36:50
um uh that uh um eventually the use cases are figured out. Eventually uh you know all the security issues and uh you know governance issues they are taken care of and use cases have been identified.
37:09
Uh um this goes back actually so a lot of this will go back to the previous previous session right so uh as uh in the previous session Reena was mentioning this organizations are often unclear about AI use cases right so many companies they want to adopt AI but do
37:25
not have a strong problem statement um and I was I was I was looking at the uh the comments or the questions in our uh Q&A someone meh someone mentioned isn't it obvious that uh It should start with the business use case. I mean shouldn't
37:41
that start with that? But you know as surprising as it may sound a lot of times it's it's not that obvious to many organizations.
Um there are uh you know challenges around uh estab well established use cases within organizations.
37:58
The third thing, which I have found fascinating: we teach a boot camp on agentic AI and we also have an agentic AI platform, so interacting with the learners and students in
38:14
our boot camp, and also with some of our customers, somehow there is this perception that these models are all-knowing and will magically figure out everything, that Claude and Llama and
38:31
OpenAI's GPT-4 and GPT-5 are going to take care of everything. But the truth of the matter is that data is still king. You have to provide context, you have to provide the right prompts, you have to provide information in
38:47
a manner that can be ingested. For instance, if I ask ChatGPT a question about anything that happened recently, the model will say: as of my last update on such and
39:05
such date, I don't have any information about what happened in April of 2025. These models know whatever was given to them by their training cutoff, but they don't know anything that happened after that, and for that
39:21
purpose you have to connect different sources of context, different sources of data. We have talked about MCP: you can connect an MCP server for web search, you can connect one for your Google Drive, for your CRM
39:36
system, for your website. You really build a context provider using tools, APIs, or, as we discussed, some sort of MCP connection that gives you standardization and interoperability. A rough sketch of what that client-side connection looks like is below.
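(Illustrative sketch of connecting to an MCP server from client code using the MCP Python SDK. The class and method names follow that SDK as I recall them, and the server command is hypothetical; check the current MCP documentation before relying on this.)

```python
# Rough sketch: an MCP client launches a server, discovers its tools, and calls one.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch an MCP server as a subprocess (here, a hypothetical web-search server).
    server = StdioServerParameters(command="my-search-mcp-server", args=[])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()       # discover what the server exposes
            print([t.name for t in tools.tools])
            result = await session.call_tool("search", {"query": "agentic AI news"})
            print(result)

asyncio.run(main())
```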
39:53
Compliance will be a major barrier. Agentic applications, from our experience, have been the nightmare of the infosec team.
I
40:10
mean, for infosec people, especially the CISOs, the chief information security officers at companies, they simply cannot wrap their heads around the non-deterministic nature of this. Are
40:27
you telling me that no matter what the user asks, it will just interpret it and can take any action? And then when they learn about prompt hijacking and jailbreaking and all of that, they seriously go crazy, right? Can someone
40:47
prompt-hack my customer support chatbot and make it say anything? Compliance is actually a major issue.
Your legal and infosec teams
41:03
won't like it.
Or your customers' legal and infosec teams are not going to like your agentic AI applications. The nature of AI models is making compliance-related tasks very difficult.
You have data privacy and protection issues some
41:19
regulatory compliance uh problems sensitive data handling problem. you have intellectual property and copyright.
Um you know the the uh the data that you or the the content that you consume and the content you create.
41:35
Um copyright intellectual property issues around that uh responsible AI amplification of uh biases and uh uh transparency and explanability and um security and misuse and auditability. Why was I given this answer?
41:53
uh can I explain this answer based on uh sources? uh and whichever way you go uh in this case if you uh the example is not very obvious.
If I ask uh a model uh they are
42:10
trying to correct this but not long ago when you try to create uh an image uh give me an image of a taxi driver, give me an image of a housekeeper or software developer or a flight flight flight attendant. Um you can clearly see the biases that ex have existed in the
42:26
society and in in our um you know on on uh internet as because it's primarily this is the uh internet is the source of data for us. Uh you see these biases and these um model builders these uh in this
42:45
case the image generation model builders they are constantly working on fixing these biases. Well in this case well there has to be diversity and then you have to have uh um you have to have representations from um you know
43:02
potentially whatever the source of biases. in all genders and uh all ethnicities and you know all colors and all races and so on.
But um but if you try to correct it, this is an example where Google actually uh tried to fix
43:19
these biases. And then here is an example, right?
So someone asked for an image um um create an image for um portrait of founding father of America and then it gave these options uh which was very inclusive for and you know all
43:37
that DEI business uh it kicked in but is this now this is not correct right so from a uh from a factual standpoint this is not correct and uh so if they do this they are made a fun of they get in
43:53
trouble. If they do this they are made fun of and they get in trouble.
So when you deploy these applications it is um it is very easy to build an application a proof of concept no big deal but building something that does not get you
44:10
in trouble like this, that is actually quite hard. Also, LLMs have what we call the eagerness-to-please problem.
LLMs tend to be like this by design.
44:27
The discussion is beyond the scope of this presentation, but from a UX or design standpoint, these LLM applications, ChatGPT and the underlying foundation models, have
44:43
this eagerness-to-please problem. So much so that one of the models, I believe it was a version of GPT-4.5, was rolled back because it was showing signs of sycophancy. And the idea is, if you
45:00
look at this chart, on the horizontal axis we have some categories of terms-of-service violations, and on the vertical axis we have some persuasion techniques. The darker the color, or the higher the
45:15
number, the more often the jailbreak succeeds. For instance, column one is illegal activity, and this row is logical appeal. This was done on ChatGPT, and I think on Llama 2 as well.
So
45:32
there was a study that showed empirically that these models are prone to hijacking. You can
45:48
use logical appeal to get the model to help with an illegal activity. On the other hand, some categories are much harder to elicit.
For instance, number three, which is hate, harassment, and violence:
46:03
in that case it is very hard. There is a paper on this, Zeng et al.; I think the title is "How Johnny Can Persuade LLMs to Jailbreak Them." So
46:20
there's a problem. Why am I concerned about this? Because when most of us build these agentic applications, under the hood the main engine is going to be one of the Llama, Claude, or GPT models, and if
46:36
they are prone to jailbreaking, our applications are prone to jailbreaking. A while back, I think one of the Ford or GM car dealers somewhere in the Midwest
46:52
got in trouble because one of the customers convinced their chatbot to give them a brand new truck for a dollar, right? So
47:09
many things are possible. uh and um the models are easy to jailbreak.
Similarly, Air Canada got sued by a customer because of the chatbot they deployed. The
47:24
customer somehow convinced the chatbot, or the bot was misled; maybe, I don't know exactly what happened.
The bot had stale information and offered
47:41
a special discount on a bereavement fare. The matter went to court, and the court ruled in favor of the customer. When you're building this, for those of you who are in software engineering,
47:57
you probably know the term technical debt. When you build a product, an agentic AI application, technical debt is going to be inevitable.
You will be building these systems while things keep changing.
You want to use
48:14
one library, and the library has moved on. Again, I will refer to the previous session, where Mark mentioned that everything is still evolving.
So we are playing catch-up. There is
48:30
no standardization yet for a lot of things, and they keep changing. Since things will keep changing, technical debt is going to be inevitable.
Um you will need amazing um like good
48:50
good product managers who are also technically very solid. It is very hard for a non-technical, or less-than-technical, product manager to build gen AI products.
It is simply not possible,
49:07
because the nuances involved in making the right decisions for the product will simply consume the entire organization. And here is this notion:
49:22
if you have a gen AI PM who is not technical, you start with the strategy, but if you don't know the problems, you don't understand the product, and you're not very technical, then your devs and
49:39
other stakeholders, they hijack the road map. They hijack the road map.
Uh of course, product does not meet expectations uh more interference and more uh control. And this is a vicious cycle that you remain stuck in forever after
49:54
once your PM is not in control. Rollouts are going to be a bigger pain in the neck than you can even imagine. When I release
50:12
a bunch of new lines of code in a traditional software product, my product is deterministic: I'm calling some lines of code and I know exactly what to expect. But when
50:29
you roll out prompts, when you roll out changes in your gen AI application, even adding a new document or a new data source, it can completely change how your
50:44
agentic AI application behaves. So how do you roll out? How do you do source control on prompts, for lack of a better term?
Can you have version control on
50:59
your prompts? Even if you implement version control on your prompts, can you have version control on your documents and all of that? One lightweight way to think about prompt versioning is sketched below.
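(A toy sketch, for illustration only, of pinning prompts to explicit versions; the names here are made up and not from any particular tool.)

```python
# Keep prompts as versioned, immutable entries and pin each deployment to one version.
PROMPTS = {
    ("proposal_writer", "v3"): "You are a training-proposal assistant. Cite the CRM deal notes...",
    ("proposal_writer", "v4"): "You are a training-proposal assistant. Cite deal notes and emails...",
}

def get_prompt(name: str, version: str) -> str:
    return PROMPTS[(name, version)]

# Roll forward or back by changing one pin, then re-run your eval suite against it.
DEPLOYED = {"proposal_writer": "v3"}
print(get_prompt("proposal_writer", DEPLOYED["proposal_writer"]))
```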
So there is actually a lot of things that can go wrong. So and there are different techniques right?
So you know you can
51:14
have universal deployment, you can have opt-in rollouts, and there are different approaches that are, once again, beyond the scope of this discussion. Setting up evaluation, monitoring, and guardrails early on is going to be
51:30
a lifesaver for you. Make sure that you are continuously monitoring your application: which MCP server was called, which tool was called, how many documents were returned, how the
51:46
reasoning happened, how many tokens were consumed, what the inference latency was, and so on. A minimal sketch of that kind of per-request logging is below. Setting these up early is very helpful, and so is having guardrails on both inputs and outputs.
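(Illustrative sketch of the kind of per-step trace record you want from day one. The field names and the logging destination are placeholders, not from any specific platform.)

```python
# Log one structured record per agent step: which tool was called, tokens, latency.
import json, time, logging

logging.basicConfig(level=logging.INFO)

def log_agent_step(agent: str, tool: str, n_docs: int, tokens_in: int, tokens_out: int, started: float):
    record = {
        "agent": agent,
        "tool_called": tool,
        "documents_returned": n_docs,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": round((time.time() - started) * 1000, 1),
    }
    logging.info(json.dumps(record))  # in production, ship this to your observability stack

# usage
t0 = time.time()
# ... call the tool / model here ...
log_agent_step("proposal_writer", "hubspot.get_deal", n_docs=3, tokens_in=1200, tokens_out=450, started=t0)
```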
I will show all of this in the demo.
52:04
I will make some queries without the guardrails and some queries with the guardrails, and then I will show you what that actually means. Having guardrails is going to
52:22
be very helpful. it will keep you and your company out of trouble.
Um because in this case if you look at this uh the question here is how many rocks shall I eat? And um uh this is from Google Gemini by the way.
So if you don't
52:37
recognize this is Google and uh you should eat at least one small rock per day. I'm feeling depressed.
Right. So and um uh one Reddit user suggests jumping off the Golden Gate Bridge.
Um then um then you can u so cheese not
52:55
sticking to pizza uh you can add also add about 1/8 cup of non-toxic glue to the sauce to give it more tackiness. So if you look at this right so these are u and why is this happening um Google and Gemini they rely on
53:16
Reddit as one of the most important sources of data. Perhaps these answers that they fetched it from they were one of the highest voted answers.
So basically it is a community guiding these models and uh and these kind of things can happen based of sarcasm or
53:32
for many reasons that I will let you think about. Um it is important to build a UI that is defensive.
Make sure that you are using proper
53:49
um you know a monitoring and evaluation guardrails. In addition to that, make sure that you have uh a UI that don't doesn't let people do things that they should not be doing.
And uh the last thing is that you will
54:05
always find yourself uh stuck in this uh you know performance cost and correctness. um in addition to of course the compliance and governance and safety and guardrails type uh challenges you will always find yourself struggling with these when you're building these
54:22
applications. So before I go to the demo: the product that I'm going to show you is called Ejento, and the approach that we took, starting about two years ago, was that Ejento is a compliance
54:41
and safety first platform. So it was not an afterthought.
It was something that was built into the design. Right.
So compliance and safety, interoperability of agents, agents can collaborate with each other. Um and they are
54:56
interoperable, and it is an entire platform. So how do we handle the governance and compliance aspect, and the security and control aspect? Ejento actually
55:12
goes and sits within the customer's cloud. You might ask, why can't I build the same thing using OpenAI or Claude, you know, ChatGPT? Well, I can build my custom GPTs, but
55:29
then I will be uploading data and giving it to a third party then access and access controls um you know I hope that the those of you who were there in the previous sessions you are seeing the dots being connected uh there is authorization and access controls uh
55:44
that are built in. Uh data governance is uh built in.
AI governance is also there. AI governance refers to the fact that I know I can explain the responses that I get.
I have guardrails on inputs. What kind of questions are okay?
What kind kind of questions are cool? What
56:00
kind of questions are not okay? Um then you have observability.
You can actually go and see what what happened in each uh each uh user prompt. Um then there are feedback loop feedback
56:16
loops and uh you know custom uh or a multi- aent collaboration as well. Um and the way agent works is that instead of letting everyone just go and
56:31
deploy their agents, what we do here is um it is basically the IT and the subject matter uh subject matter experts at a company let's say company X uh you know 10,000 people company there's only 10 or 15 people who
56:47
are building these agents along with uh along with uh some of the subject matter experts. And these subject matter experts they make sure that they identify the use case and uh deployment scenarios.
Um
57:02
they continuously look at the uh and monitor the feedback given for the agents. Hey this is a good response is a bad response and those uh form the feedback loop.
Uh agent ops observability and cost and continuous monitoring. uh it's almost like uh it's
57:20
almost like just like uh your HR your HR actually does your onboarding and recruiting learning and development you know um just offboarding if needed organizational alignment and compliance. So all of that is actually built in into
57:38
the platform. I think it is a good time.
I will answer a few questions and after that I will go to um I will go to the the demo here. Let
57:55
me see. Uh I will start with the most recent one.
Let's see. [Music] Okay.
Uh I think these are okay. These are related to this session.
So with respect
58:12
to data governance and policies which we have in India, how can we leverage aki connecting with MCP servers in terms of data security? As NOS's who handle large volumes of data, they can rely and use this.
So this question is from Sep. Uh Cindi
58:30
uh I am uh um I think it is a question that is complex enough complex enough that uh I don't know how to answer it here. Um because you will have to actually look
58:47
at uh if you're working for a government organization um uh you will have to actually look at I mean uh if your if your government has uh this requirement or this uh uh this uh law uh that uh the data has to reside
59:04
within the within uh within the country within the the jurisdiction or the the geography of the country then you will have to actually find ways you have to find some local cloud or of course the um some of the companies I I believe Azure and AWS they
59:22
they should have uh they should have uh data centers within uh India and then you have other uh you have other concerns but primarily I think from government I think often the especially um the requirements are around you know
59:40
where the data is kept Uh let me see are these slides. Uh we will not be sharing the slides.
We'll be sharing the recording and it will be on the uh on the on our YouTube channel
59:57
or also on the LinkedIn uh our LinkedIn um uh page. Uh okay.
00:12
Let me see IML required knowledge of technology and functions whereas other not. Okay.
Okay. So I think these are some thoughts as I was presenting.
Yes. I mean I actually think uh so there's a comment here.
I'm glad tech
00:29
technical product management will become critical. Absolutely.
Right. So uh if you are a good AI or genai technical product manager um I think despite all the issues or all the challenges in the job market that all of us see um this uh
00:46
I mean this should be something this should be a skill that can actually help you a lot in getting your uh next role and if you're already a TPM uh learn some agentic AI learn genai and understand how products are built it
01:01
will help you in the long run. Okay.
What was the technical stack tools used to build the applications for various components? Uh you can build this in any uh any of uh the mainstream platforms.
You can build it in AWS, GCP,
01:17
these are all contain containerized. Um so of course the MCP servers that I will show you in a bit.
I think we still have a few more minutes. uh I will show you the MCP uh connections and really give you a sense of how do we control the you
01:32
know certain types of access and all of that. So, so um we built uh everything in Azure but I don't want to mislead that it cannot be done in in a different platform right so you can do the same thing uh if you wanted we can actually
01:49
go and deploy this in Azure as well because most of our services are containerized u and you know it's it's very easy to port from one platform to another okay uh there's a comment from the uh centralized Centralized AI agent uh
02:06
agent platforms promise scalability and consistency but how do they balance the benefits of central orchestration with the risks of single point of failure vendor lock in and reduced flexibility for enterprise team. This is a great question debucker.
Um so
02:24
um so of course uh I mean no matter what we uh where we go if it's a non-trivial solution you will have to choose one of the clouds. I mean um in principle this uh um you know you can have your MCP servers your sources of data you can
02:41
still have your uh data in Amazon red shift and you're running it in this this platform in Azure but this the the uh the platform has to actually run in Azure but the data sources and the context and even other agents they can be in different platforms so I think uh
02:59
uh yes I see that um uh you know it will reduce uh some flexibility but you um I'm assuming from your question I I see um um I sense that you have been in this space for some time um you um you know
03:15
if you have to manage uh the same um manage a very complex multicloud platform it has its own challenges so um so um I I think there are pros and cons of both right so being in a single
03:32
platform platform uh or single cloud platform versus uh being in multiple uh platforms. Okay, I think I will move on to the demo.
I apologize. Uh let me see if I can see very quickly if there are
03:49
okay what eval were used in the in this agentic AI OS that you were building. We have actually a lot of different evaluations.
I can actually show you. Let's move on to the demo.
Uh and then it should actually bring uh this um it
04:05
should uh start to make sense. Let me see.
Okay, I will have to No, I think this is not the one.
04:22
I will go to I'm just trying to find the agent that I
04:39
wanted to show for the demo. assistance and so let me show you this uh show you all this uh this particular so um so this will actually um this will show you both
04:57
uh the MCP server and also how we are actually using this right so this is a dummy deal that we have in our um CRM so if you look at this uh so these are Um uh these are our MCP servers that we
05:13
have connected. Um so if you look at this uh I have connected HubSpot MCP server with the agent.
I will go back and show you there's a SharePoint MCP server. Uh there's a notion Slack MCP server, Perplexity MCP server and so on.
And um uh if you recall from our
05:30
discussion in the previous uh in the previous uh session uh we were talking about how docker is well dockerizing or containerizing all the MCP server. So literally these so we have built an MCP client and once we have built that MCP
05:47
client we can actually uh connect any MCP server that is available out there or we can write our own MCP servers and um put it there. Right.
So if you now if you look at this move a few windows around. So if you
06:04
look at this um um so this is the official HubSpot MCP server and if you look at this I have this uh I can get a contact I can update a contact I can create a deal I can update a deal. Uh and what kind of actions are available?
06:19
I can if you look at this I can create associations uh batch create objects list associate. So these are the actions and these this is the information that is available to me.
If I want to connect it um uh so
06:34
there is this uh MCP server that we have deployed and then we have given it some permission. Uh and how does this work?
Uh and similarly I can connect a perplexity MCP server. I can uh uh use a uh web search Google drive and so on.
06:52
Let me go back and show you what are the implications of this. So this is my um once again uh we do trainings all the time uh on with data science dojo.
So um you know I have connected this training proposal writer to uh our HubSpot MCP
07:09
server. Uh there's a retrieval retriever tool which is the agentic uh retriever.
Uh then we have uh a perplexity web search tool rack tool etc. Right?
And u so this is a deal a dummy deal
07:24
that I created yesterday and uh so we are pretending that agenttoai wants an agenti training. Uh so I asked it a question check the HubSpot enterprise training deals pipeline for uh for um um from Aentto AI agentic AI
07:44
training. Let me see what it looks like.
So yeah, I think it might not be very helpful but so I mean basically it is some information we're pretending that someone reached out and asked from Magento asked us for a training and then
08:00
now we have to provide uh the context and this is saying um so summarize the communication so far and check the notes and email communication on this deal right and without me even going to the HubSpot MCP uh or HubSpot uh CRM it is
08:16
actually telling me all the notes and all the communication that has happened so far. Okay.
Um so um now how do I how do I make it uh you know use different kind of models? I can
08:34
actually go and make it use any kind of models. If you look at this, right?
So I can make it use web search. I can change it from GPT-4.1
to GPT-5 or any other model. I can also
08:52
change the reasoning approach: right now it is using the ReAct framework, but there could be reflection.
If you remember uh from the first or the second session we were talking about being able to reflect on your output. So uh if you uh give it a
09:08
reflection type approach, it will look once once it has generated the answer, it will go back and correct or uh reflect on the answers and make sure that the answers are correct. Um I can also uh set up uh some topical
09:27
guardrails, right? Let's say um I can tell it do not talk about uh agentic AI training.
So this is my guardrail. When I go and save this guardrail
09:43
um now I hope that it was updated. Let me see.
Let me see. Um um what is uh tell me uh about the agentic AI training and
10:01
after enabling this guardrail I expect it to comp uh not to respond. Let's see uh I'm doing this in a rush so hopefully this should work out.
Okay. So you see Okay.
So this uh for
10:19
some reason I mean like any demos this is let me see if it it is actually let's see what it does in the end. Okay so it was able to give it to me.
Um um so um I can actually tell it to redact
10:37
the PII. If I redacted the PII, it will actually remove any names from responses and so on.
So this this is an example of guardrails. Um what else?
Uh we were talking about monitoring and uh uh
10:53
monitoring and analytics. I can actually go and see how many queries, how many active agents, uh which ones are the most active agents and so on.
Um I will show you one more agent and I think I will be out of here.
11:09
Actually the guardrail worked here fine. I should have actually.
So in this case if you look at this I have set this guardrail and here I have set this guardrail which says uh
11:26
do not let me actually suppose disable the guardrail. And so this is uh um this agent uh this is another another assistant agent that is connected to the clinical trials API.
So this is not an MCP server this is an API connection.
11:44
Um and then I will go and ask it this question uh find all clinical trials uh for women with breast cancer. The trial should be in the United States.
And what it does is it just goes and calls that API. Uh I ask in natural language.
Uh
12:02
let's say I'm a doctor. I'm looking for some medical tri uh some clinical trials or maybe some administrator for a hospital.
Uh it converts my query into uh into uh a clinical trials.gov u API call and
12:21
it gives me it fetches this in real time. I can ask it give me more information on uh this one.
Okay.
12:39
So this is a specific clinical trial uh ID and we are actually asking this question. Let's see what it does.
And you can see that we are analyzing the query uh processing the query and it is going and fetching uh uh fetching the
12:57
results right now for this. Let's see.
Hope this goes through quickly. Don't have much time.
13:16
And this is in reflection mode. So that's why it is taking a lot of time.
Okay. So it is giving me more details right.
So University of Maryland uh NCI uh giving me all the details. I can actually
13:32
um I can look at it here. I want to show one more thing.
Okay. So let me actually try to apply guardrail here.
Okay. Uh do not answer any question
13:52
about uh uh um any trials related to breast cancer. I don't know why we would do it but I
14:07
mean it is what it is right. So we will uh we'll try and let me ask this question.
uh find all clinical trials. I will let me start a new thread and I will ask a question and you expect it to block it.
14:22
If it doesn't then you know I need to go back and check with the team why it is not blocking the but it should actually block this query. Okay, I love it.
Right. So if you look at this it is saying I'm not actually configured to answer a question about
14:39
this. And there are many many use cases.
I wish we had more time. I would love to would have loved to uh but you know there are many many use cases where you even a legit question you might say no I mean this is not what I do do uh no uh
14:56
do not answer this question. Okay I am going to stop sharing and I will uh take see if there are any questions.
Okay. Is it possible to add more comprehensive guardrails experience show
15:11
that any context or system from guardrail? Yes.
Yes. Uh so Doug uh this is a wonderful question.
So can we add more guardrails? Yes. These guardrails are
15:28
what we call generative guardrails, but we also have predictive guardrails and some regex-based guardrails: if this word is in the input, or this word is in the output, just block it. A tiny sketch of that regex layer is below.
15:46
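(Illustrative sketch of the deterministic regex layer described above; the patterns and function names are made up, not Ejento's actual implementation. `call_llm` is a placeholder.)

```python
# Deterministic checks that run before and after the model, independent of any generative guardrail.
import re

BLOCKED_INPUT = [r"\bignore (all|previous) instructions\b", r"\bjailbreak\b"]
BLOCKED_OUTPUT = [r"\b\d{3}-\d{2}-\d{4}\b"]   # e.g. something shaped like a US SSN

def violates(text: str, patterns: list[str]) -> bool:
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)

def guarded_call(call_llm, user_input: str) -> str:
    if violates(user_input, BLOCKED_INPUT):
        return "Sorry, I can't help with that request."
    answer = call_llm(user_input)
    if violates(answer, BLOCKED_OUTPUT):
        return "The response was withheld by a safety filter."
    return answer
```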
Because the generative guardrails on their own are prone to failing, layering these checks is definitely worthwhile. So Ahmed, your question is whether Ejento AI is our platform. Yes.
Uh this is our platform and we actually have more than 10 enterprise customers at the
16:03
moment and uh you know you can think of like companies ranging from 10 million to like 20 billion in revenue. So this is our own platform.
Uh uh okay. Um let me see.
16:20
Yeah, I think I will stop here. I I would really really love to answer more questions but we have one more session um after this and I want to be uh you know respectful to all of you maybe uh just take a break and we will be back in
16:38
9 minutes and I will hand it over to Isaac for a session on Isaac is with AWS. So uh and Isaac is going to start at 115 with a tutorial on building a
16:53
self-improving u a gentic system using I'm assuming Amazon bedrock, right? So and we will we'll come back in at 1:15 p.m.
and that would be the last session of the day. Okay.
So
17:09
be back in 8 minutes. [Music]
24:19
Okay. Um, I'm going to hand it over to Isaac.
Uh, hey Isaac. How are you doing?
>> Hello. Very good.
>> Okay, it is all yours. Uh, you can get started whenever you like.
I will go off
24:35
camera, but I will be around. >> Okay, great.
Let me go ahead and share my screen here.
24:51
All right. Is that working?
>> Yes, I can see it very clearly. >> Okay.
Excellent. All right.
Uh, hey folks. My name is
25:07
Isaac Privitera. I'm a principal scientist in the Generative AI Innovation Center at Amazon Web Services.
Um, and today I'm going to be talking about um, an area I personally find fascinating, which is um, how do you build a self-improving agentic system. Uh, so I'll be talking about the
25:24
system my team and I built using the open-source Strands SDK from Amazon, an SDK our team developed in-house and then open sourced, as well as Amazon Bedrock AgentCore, which is our new agentic
25:39
productionization tool chain for deploying agents.
Uh and I think at this point clearly everyone is pretty wellversed in a gentic AI but just I'll talk about it for like two seconds. Um so obviously huge step forward for LLMs um giving them agency.
Uh primary
25:57
components of this are um having some kind of u memory tools uh reasoning loop um and then giving them uh goals uh or tasks that can um complete those. we can use um those different tools to complete
26:12
actions with. So we at AWS believe uh are going to start seeing very large scale propagation of these agents and see them as really a transformative technology.
And we're thinking quite a bit about how we facilitate the productionalization of agents in a stable, secure and scalable manner.
26:32
So with today's LLMs, it's not really a question anymore of if AI agents are viable. Instead, the discussion has shifted more towards how AI agents can best be put in a position to succeed.
Um, and I think this has major implications for how we work. And while I don't think quite yet agents are in a position to fully replace humans, the
26:49
best workers are going to be those that are able to effectively use AI to improve their productivity. Um, and I think that several reasons why we think that um, agents are uh, really ready for enterprise use enterprise use.
Um this is uh model reasoning capabilities are
27:04
really as good as they've ever been. Um data and knowledge integration has become easier and easier.
Um now it's very simple to spin up a vector database. Um we on the the AWS side have uh Amazon Bedrock knowledge bases very easy to deploy.
Um also very excited to
27:22
see the infrastructure, and especially the protocols, mature. The maturity of MCP and A2A, these are huge factors, and then, as well, a lot of great folks here have been developing agentic tools. I think the set of tools that are available is
27:38
really mature and it's it's really great to see and I think that this altogether makes uh AI agents much more viable than they were let's say even a year ago. Um, and also just to set the stage in terms of where we came from versus where we are now.
Um, guys are probably familiar
27:53
with robotic process automation. Um, entirely rules-based and just executed with predefined flows.
Um, then 2023 2024 era of rag chat bots, generative AI assistance, the ability to kind of go through a rag database and answer questions. And I think right now we're
28:09
in tier three, which is where we're seeing the propagation of kind of goal- driven agents. um they're a able to independently execute tasks.
You can have multi- aent systems where they're able to work together towards a goal. But we're not quite yet at this fourth tier which I would we would say is kind
28:27
of fully autonomous agentic systems. That being said, if you had asked me even a year and a half ago whether I thought AI agents were ready for prime time, I would have absolutely told you no, because the consistency of the LLMs at the time was just not good enough.
Now we're seeing LLMs
28:43
with substantially lower hallucination rates and greater accuracy. That being said, the hallucination rate even for the top LLMs is not zero, and we'll talk later about ways to address and evaluate that. Another thing I want to state is that agentic AI is not a magic wand.
29:01
um it has really impressive capabilities um and it's really powerful um but you need to be careful to carefully tune your prompts and your tool definitions Um, and even when your prompts are um, tuned as well as possible, you still
29:16
will likely see some mistakes in really complex tasks. Um, and this is why human in the loop is a critical component to the success of AI agents.
Um, and I think, you know, it it's it helps to think carefully about what granularity you want your tool definitions to exist at. Uh, so they could be very granular
29:33
if you need a lot of flexibility in the task. But something I like to say is that if something was a directed acyclic graph, a DAG, before, there's not really a reason to insert an agent into the middle of it.
However, you could have an agent control and execute that
29:49
DAG. Um, so I think that uh it there's different ways in which you can implement tools.
So it just makes sense to to think carefully about what the best way for you to implement tools for your use case is. Um, and just generally, I think there's a lot of
30:05
different considerations for the the way you implement agents. Um, latency implications, cost implications, complexity.
Um, do you need to be able to have human in the loop? Um, you have task uh, execution stopping mid um, execution.
So, number of different
30:21
things to think about. Um, but generally you want them use agents where there are tasks that are require flexibility.
So, just talking about the the agentic loop. Um as I mentioned before create an agent LM but providing it with memory tools um
30:38
goals to achieve, and then allowing it to reason. The flow for this reasoning can vary, but the dominant agentic reasoning paradigm right now is ReAct. The loop starts with a user interaction.
The agent receives the input, then decides what action to take next. It'll
30:54
determine if any tools are needed and then answer the question. This can be single or multi-step: each time the LLM reasons about its previous action, it looks at whether that was successful and decides whether to proceed or stop. A stripped-down version of that loop is sketched below.
determine if any tools are needed um and then answer the question. So this can be single multi-step um and after each LM uh each time the LM reasons it's about its previous action it'll look and see if that was successful and then decide if it needs to proceed or stop.
And something I like to talk about is um
31:11
just because I think React is a dominant paradigm right now. Um but there are other paradigms that do exist and I think it it's valuable for us to know about these just because um there's different times in which they may make more sense.
Um so React was first introduced in 2023 by Yao at all. Um
31:28
paper react synergizing reasoning and acting language models. Um and I've kind of mentioned this already.
It just combines reasoning and acting to create uh the agentic glue. Uh it does tend to be the default with pretty much all um different agentic uh
31:46
development kits. Um agent's going to be given a task.
reason about how to accomplish it and then we'll begin calling tools and reason after each action. Ryu is an interesting one in that um the agent generates a plan and then executes it but not intermediate risk.
Um so it will just kind of create
32:02
this step-by-step um execution plan and then we'll just start executing um until the answer or gets the answer and then you can restart execution once it's done. Um and then there's also reflection which is another interesting one um which uh promotes self-reflection
32:18
and iterative refinement. Um however this self-refinement can introduce some latency so it's not always the best choice for consumerf facing applications but it is an interesting one especially for kind of like thinking about um like deep research for example um things where you don't necessarily have these
32:33
latency bounds. All right and another concept that we can kind of talk about we've already talked about agentic reasoning.
Um so now we can talk a little bit about when does it make sense to use single versus multi- aent. Um, single agents tend to be pretty capable, especially with the the new top of the
32:49
line LLMs that are out there. Um, but the more that we expand the scope of our agents, the more likely they are to make mistakes.
So, for the most complex tasks and goals, um, could be a good idea to let agents specialize in specific areas. If we provide an agent with like, let's
33:04
say, you know, 50 tools, 100 tools, um, it's going to be more and more difficult for a single agent to know when to use all those tools. And also the the thing to consider is that um when we have all these different tool definitions um those are all going to wind up in its
33:20
prompt. And if you have a hundred different tool definitions that need to wind up in a prompt, that's going to make it balloon in size.
So it's not exactly efficient to give one agent that much access. There are also ways to address this via semantic search over tools (a rough sketch of that idea follows below), but generally I like to have agents specialize in these multi-agent
33:37
systems. So let's talk a little bit about multi-agent patterns.
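Here is the rough, dependency-free sketch of the semantic-search-over-tools idea mentioned above: score each tool description against the query and only inject the closest matches into the prompt. The bag-of-words "embedding" is a toy stand-in for a real embedding model, and the tool definitions are made up.

```python
# Toy sketch: select only the most relevant tool definitions for a query so
# the prompt doesn't balloon. A real system would use an embedding model.
from collections import Counter
import math

TOOL_DEFINITIONS = {
    "search_flights": "Search for flights between two airports on given dates.",
    "get_weather": "Get the weather forecast for a city.",
    "add_to_cart": "Add a product to the user's shopping cart.",
}

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_tools(query, k=2):
    q = embed(query)
    ranked = sorted(TOOL_DEFINITIONS, key=lambda name: cosine(q, embed(TOOL_DEFINITIONS[name])), reverse=True)
    return ranked[:k]

print(select_tools("what's the weather in Bali next week?"))
```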
These are probably the two most common patterns: agentic swarms, and supervisor/sub-agent. There are more patterns than this, obviously, but these are two of the most common.
The difference between these two patterns is
33:52
that while in supervisor/sub-agent the supervisor controls all the routing and the final decision-making, a swarm is a little more collaborative. Each agent in a swarm has awareness of the other agents and thinks about when to hand off.
Supervisor/sub-agent is a little more structured, and it also probably
34:09
scales better, because you're having the supervisor focus on which agents to call and in what sequence.
That being said, there are advantages to using a swarm. The awareness of other agents within the swarm can actually be an advantage in preventing
34:26
duplicate work: if there is some tool overlap between agents, each agent within the swarm is aware of all the other agents in its ecosystem, whereas in the supervisor pattern it's on the supervisor to inform the other sub-
34:42
agents of what was done previously. So there are definitely ways in which both of them make sense.
I think as you scale in particular, supervisor/sub-agent makes more sense, but swarm patterns can definitely be pretty useful, and there are
34:59
hybrid patterns as well. Think of a supervisor that hands off to a research swarm.
I'm not going to talk too much about that here. But there are also competitive patterns.
Think about agentic
35:15
arbitrage, where you have two agents that are trying to negotiate.
In our solution, we used a supervisor/sub-agent paradigm. But there are tons of different ways you can implement this, and I'd just recommend testing and trying as many as possible and thinking about
35:31
which ones make the most sense. So, as I mentioned before, we open sourced our Strands SDK, which is the agentic SDK that we use
35:46
internally at Amazon. Basically, what we wanted to do is develop something that was simple, streamlined, and model driven.
The idea is that it's easy to use and very intuitive, with strong capabilities: MCP integration is
36:04
native, it's integrated with AWS services, and it's also very easy for prototyping. So we use it a lot. That being said, you can use whatever framework you want within Amazon Bedrock. For Strands in particular,
36:22
the way it works is that it's built on this agentic loop, which is ReAct by default, but there are ways to do custom implementations for different types of custom agentic orchestration. So,
36:39
as with any standard agentic loop, it receives user input, processes it, decides what to do, generates a response, and then continues that loop iteratively. So, that covers Strands.
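For orientation, a minimal Strands-style agent looks roughly like the sketch below. It follows the pattern in the open-source Strands Agents documentation (an Agent plus @tool-decorated functions), but treat the exact names and defaults as approximate rather than authoritative.

```python
# Rough sketch of a Strands-style agent; names follow the open-source
# Strands Agents SDK but may differ from the version you install.
from strands import Agent, tool

@tool
def get_weather(city: str) -> str:
    """Return a (stubbed) weather report for a city."""
    return f"Sunny and 30C in {city}"

agent = Agent(
    system_prompt="You are a helpful travel assistant.",
    tools=[get_weather],
)

# By default the agent runs a ReAct-style loop: reason, call tools, respond.
print(agent("What's the weather like in Bali right now?"))
```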
Now, we'll
36:56
talk a little bit about something that we at AWS see quite a bit with all of our customers: it's pretty easy to get one agent out there and just have it running locally, but scaling can be a real challenge,
especially when you
37:13
think about the number of ways agents need to communicate and how you manage agents at scale. If you have your agent and it has access to its tools, how do you scope that at the department level or the organization level?
How do you decide which agents are able to communicate
37:29
externally? This problem only gets worse as agents continue to propagate.
So it's definitely something that can be pretty challenging. And when you're trying to
37:44
implement all of these things at production scale, all of a sudden you need to think about: how am I monitoring performance? How is my system scaling?
Are my agents secure? How am I governing my agents?
These are all
38:01
really complex questions, and I think it's something that doesn't get addressed quite enough, especially when people just go grab some agentic example off the internet and start implementing it. That part is pretty
38:17
easy; actually deploying all of this and getting it to run in prod is much more difficult. So that was the motivation for us to launch Amazon Bedrock AgentCore.
So the idea for AgentCore
38:35
is that it's our agentic deployment tool chain. It has components for agentic runtimes, memory, tool hosting, identity, and observability.
The idea is that we wanted to handle a lot of the undifferentiated heavy lifting when it comes to deploying
38:52
agents, particularly at scale.
Currently the service is in preview, but it's going to launch soon. You can build whatever you want using AWS or any other cloud infrastructure,
but I think there are certain
39:08
problems you may run into that tend to be fairly persistent. One example: for your runtime, if you want to use some kind of serverless solution, timeouts tend to be an issue if you're doing some kind of deep research project, or you're figuring out how to do
39:25
bidirectional streaming. These are problems you'll run into as you go into prod.
So that was one of the goals of AgentCore: to solve a lot of these problems and remove that undifferentiated heavy lifting. For folks who may not be as
39:41
familiar, I can just show you what the AWS agentic stack looks like. For folks who are not familiar, Amazon Bedrock is our overall generative AI service.
This includes pure model hosting and serving, but
39:58
also a number of other abstractions. We have optimization, model customization, Guardrails, and Knowledge Bases, which are effectively a managed vector database.
We also now have AgentCore, which provides
40:15
the AgentCore Runtime, Gateway, Memory, Identity, Observability, and a few first-party tools as well. And we also have plenty at the infrastructure layer.
These all work together. Then we also have Kiro, which is our agentic
40:31
coding IDE. I won't be talking too much about that one, but it is interesting,
so it is definitely worth a look. And then we also have Nova Act, which is more of our browser-use toolkit.
That's another thing you can use, either running it on its own or using it as a
40:47
tool for different kinds of browser-use and computer-use workloads. And then there are also Strands agents.
But one thing I'll say is that AgentCore and all of this is designed to work with whatever framework you want. There's nothing stopping you from using any model or any framework.
If you
41:04
want to use OpenAI models, there's no issue deploying that on AgentCore. We have no restrictions on what you can do there.
It's effectively the same as just using a Lambda. So, just double-clicking a little bit on AgentCore.
I mentioned some different components here, and
41:20
our solution is built using all of these components. In particular, the Runtime is for hosting everything. This is where you specify all of the pieces: your agent instructions, your context, your tools,
41:37
and whatever framework you want. It's pretty much agnostic: if it can work within a Docker container, it can be deployed on Runtime. Then we have some of the other ancillary services.
For example, AgentCore Gateway. You can think of it
41:52
as an MCP gateway. You can have your tools effectively set up as gateway targets, and then the gateway will manage access to those tools.
That way your agents are able to use different tools from the gateway.
42:09
You can also run semantic search over your tools, across all the different tool definitions.
So if you want, you can just make a call to the AgentCore Gateway API via boto3 (we have several other APIs and SDKs you can use to make
42:26
that call) and then semantically retrieve only the tools that you want. There are a couple of other first-party tools that we have as well;
we didn't use them in this solution, but they are interesting and definitely worth a look. We have AgentCore Browser, which is a managed browser-use solution,
42:44
so it will help launch browser-use workloads in a secure, isolated environment. It launches Chromium for you, and you can run it either headless or in a mode where you're
42:59
actually able to view what the agent is doing. We also have the AgentCore Code Interpreter. Both of these are workloads that potentially have more security implications:
having something with access to a browser, or being able to
43:15
launch and execute code arbitrarily. So we wanted to make sure these environments were secure and isolated, and that you can send instructions to them without giving open access to any tool or system.
Then,
43:30
we have AgentCore Identity. This is not one of the sexier services, if you will, but it's a really fundamental one.
It's a really important one: the ability to authenticate and have tokens for
43:48
your agents, not only to communicate internally but also externally with external APIs. This is something I think is really fundamental for anyone who's building agents.
So Identity is one that we built into the solution.
Also we
44:06
have AgentCore Memory. Memory works on a couple of different levels.
There's short-term memory, which is your standard state saving of text, as well as long-term memory abstractions. For example, we have user preference memory.
So
44:22
this effectively looks at your conversations, extracts different user preference details out of them, and saves them to a database. We also have long-term summarization, and you can implement whatever custom memory implementation you
44:38
want that will effectively operate over this long-term memory. And then finally, arguably one of the most important pieces, and something that's really critical to the application we put together, is AgentCore Observability.
Uh so agent core
44:54
observability is built in. You don't have to use Runtime in order to use AgentCore Observability, and something I should mention about all of these components is that they're designed to be modular.
So you can use any one of them on its
45:12
own, or use them all together. Obviously they're designed to work together interoperably, but you can pick and choose which ones you want to use.
If you just want to use Memory, you can just use Memory. No problem there.
Or if you just want to use Identity, you can do that as well. And for AgentCore
45:30
Observability, this is built in for Runtime: any agent you deploy on AgentCore Runtime is automatically going to be logged in AgentCore Observability.
You'll be able to look at your traces there. And that's a pretty important part of the system that we put together.
45:46
And then just a little bit more on what Runtime does. I won't spend too much more time on this, but effectively the idea is that you take whatever model and whichever framework you want to use, you add your runtime decorator and
46:02
your identity config, then you configure what kind of observability depth you want, and you create a Dockerfile (or use our starter kit, which will create a Dockerfile for you) and launch. What we'll do is have that
46:18
Docker image hosted in our Elastic Container Registry, and then we have the concept of an agent as well as an endpoint. An endpoint is a specific deployment of that agent, while the agent is the larger abstraction; a rough sketch of this flow is below. And then there's AgentCore Observability.
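As a rough sketch of that flow (based on the AgentCore starter-kit pattern; the exact module, class, and parameter names are assumptions and may differ in the released SDK), the agent code is wrapped with an entrypoint decorator, containerized, and deployed behind an endpoint:

```python
# Sketch of hosting an agent on AgentCore Runtime; module and class names
# follow the starter-kit pattern and may differ in the released SDK.
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent

app = BedrockAgentCoreApp()
agent = Agent(system_prompt="You are a helpful travel assistant.")

@app.entrypoint
def invoke(payload):
    """Handle one invocation; the payload carries the user prompt."""
    return {"result": str(agent(payload.get("prompt", "")))}

if __name__ == "__main__":
    # Locally this serves HTTP for testing; deployed, Runtime hosts the
    # container image (e.g. from ECR) behind an agent endpoint.
    app.run()
```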
So
46:35
This is just your standard tracing: everything you would expect from a tracing framework, you'll be able to find here.
And as I mentioned before, agents deployed on AgentCore Runtime work, and agents running outside of AgentCore
46:51
Runtime work as well. So, I think there are some interesting concepts when it comes to evaluation.
There are two main feedback paradigms. One is offline evaluation.
This is pretty standard. These are
47:07
your standard eval metrics: think success rate, think tool-call accuracy. These are things that are straightforward to implement, and they can just be run at
47:23
will. You can integrate them with your Jenkins pipeline, so whenever you make a change to your code, your offline evals run. I think that's a really important thing to implement, and definitely something you want a mature pipeline for. Online evaluation, I would argue, is
47:39
almost, or actually I would say it is, more important. When you think about offline versus online, offline is really helpful in terms of giving directional guidance, but online, when you're looking at actual feedback from
47:55
your users, is incredibly valuable information. It can come in different forms, but fundamentally it's really important to have some kind of pipeline where you're actively looking at your traces and at the feedback you're getting
48:10
from users: are you getting thumbs up or thumbs down, and what kind of comments are they leaving? Another thing that's really interesting and very valuable to look at is implicit feedback. Implicit feedback can mean something like a user leaving the chat early when that's
48:26
unexpected. There are different ways to look into this, but it's definitely something that bears some emphasis. So next, let's talk a little bit about some of the different metrics. I'll go back to offline.
Um so I think for
48:43
agentic evaluation, some different things I like to look at are tool-calling evaluation, final response, and also operational evaluations. For tool calling, it's about assessing the appropriateness of the
48:58
tools, and this is on a per-agent basis; you can also roll these up for your multi-agent system as well.
You're looking at incorrect tools, missing tools, tool argument accuracy,
49:13
weighted averages of these, and then tool-calling accuracy with an LLM as a judge. One of the things that I like about this offline tool-calling evaluation is that if you have
49:29
the expected tool for a query, it's deterministic. LLM-as-a-judge is sometimes really effective, but I think it helps to have those different types of metrics mixed in, where some are deterministic and some are LLM-as-a-judge based. I like to have a good balance, just because it
49:47
gives you a better view of your overall accuracy. It definitely helps in terms of reasoning, multi-step actions, and long-term coherence. Then, a little bit about response evaluation metrics:
these are ones
50:05
you've definitely seen before if you've implemented any kind of offline evaluation, correctness, similarity, precision, recall, relevancy. Um so these are pretty standard.
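To make the offline side concrete, here is a small illustrative sketch of deterministic checks over a labeled eval set: expected versus actual tool calls, plus a simple exact-match answer check. The dataset and the run_agent helper are made up; an LLM-as-a-judge metric would sit alongside these.

```python
# Toy offline-eval sketch: deterministic tool-call accuracy plus a simple
# answer check over a labeled dataset (both the data and run_agent are stubs).
eval_set = [
    {"query": "Find sunglasses for a beach trip",
     "expected_tools": ["search_products"],
     "expected_answer_contains": "sunglasses"},
]

def run_agent(query):
    """Hypothetical: returns (tools_called, final_answer) from your agent."""
    return ["search_products"], "Here are some Oakley sunglasses ..."

def evaluate(cases):
    tool_hits = answer_hits = 0
    for case in cases:
        tools_called, answer = run_agent(case["query"])
        tool_hits += set(case["expected_tools"]) <= set(tools_called)
        answer_hits += case["expected_answer_contains"].lower() in answer.lower()
    n = len(cases)
    return {"tool_call_accuracy": tool_hits / n, "answer_accuracy": answer_hits / n}

print(evaluate(eval_set))
```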
I think in most libraries you can find these. Coming back to online metrics:
50:21
these tend to be a little trickier, because you're often not going to have ground truth for them. So analyzing this data is really valuable for anyone who has large-scale agentic deployments,
50:37
because there's quite a bit of information within those traces. The thumbs up / thumbs down feedback is really useful, and it's the most common implementation just due to its simplicity.
Personally, I'm definitely a bigger fan of Likert-scale feedback over thumbs up / thumbs down.
50:53
But the problem is that if you're asking a user to give a rating of one to five stars, a lot of times their eyes will glaze over while they're trying to decide whether this is a one or a five, and then they just don't answer.
So, thumbs up, thumbs down: even though I
51:10
prefer the metrics you're able to calculate with a Likert score, you're just going to get less engagement on that metric. For that reason, my go-to tends to be thumbs up / thumbs down, even though there's more information and more that you can derive
51:26
from a Likert scale. Comments, obviously, are really valuable, but only a very small percentage of your users are actually going to take the time to write a comment.
So for online metrics, it definitely makes sense to
51:41
have a pipeline for evaluating these comments, but it's definitely going to be a subset of your users that actually leave comments at all.
Um, then also there's
51:57
There are some other implicit ones as well. Latency, obviously, is pretty easy to measure.
Path efficiency is an interesting one: even if the agent got the correct answer and the user was happy, did the agent take unnecessary
52:14
steps in getting to that answer? That's one that I think is definitely valuable to look at.
Abandonment rate: whether users are leaving mid-conversation.
And then question repetition. These are ones where you
52:30
have to actually spend some time figuring out a pipeline for how to tag your traces. And especially as it gets larger and larger, if your application has thousands or hundreds of thousands of users, that's where it bears a
52:49
lot of thought in terms of how you're going to implement this pipeline. But there's a tremendous amount of information you can retrieve if you can build this pipeline up.
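As one illustration of that kind of pipeline, explicit feedback (thumbs, comments) and implicit signals (abandonment, latency) can be written next to the trace ID so they can be aggregated later, for example into a DynamoDB table as in the demo that follows. The table name and schema here are made up for the example.

```python
# Sketch: record explicit and implicit feedback per trace for later analysis.
# The table name and item schema are illustrative, not a prescribed format.
import time
import boto3

feedback_table = boto3.resource("dynamodb").Table("agent-feedback")  # assumed table

def record_feedback(trace_id, session_id, thumbs_up=None, comment=None,
                    abandoned=False, latency_ms=None):
    feedback_table.put_item(Item={
        "trace_id": trace_id,
        "session_id": session_id,
        "timestamp": int(time.time()),
        "thumbs_up": thumbs_up,      # True / False / None (no response)
        "comment": comment,
        "abandoned": abandoned,      # implicit signal: left mid-conversation
        "latency_ms": latency_ms,
    })

record_feedback("trace-123", "sess-9", thumbs_up=True, latency_ms=4200)
```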
So uh now this is the actual architecture of what my team and I um
53:05
put together. This is the self-improving agent concept.
Fundamentally, it's just a multi-agent system where we have a supervisor, and then we have four different agents.
In this case we were just using a
53:21
fairly simple tool set, although one tool is a little more complex, and you'll see why. The main idea is that it's kind of a travel assistant, but it's also got a shopping component mixed in, because it uses the Amazon Product
53:36
Advertising API, which has access to the entire Amazon product catalog. So you're able to ask travel questions, search the internet, or ask for product recommendations that would make sense for your travel.
So, in this case, there's two
53:52
agents that are dedicated to that. One is a shopping assistant:
this is the one that actually controls access to those APIs. And then there's a cart manager that is just dedicated to managing, removing, or adding items to your cart.
It's a
54:08
fairly straightforward implementation in terms of the main piece: just using Strands, with a supervisor setup. The agents are set up as tools for the supervisor,
and each one of the agents has access to its own tools, which are deployed as MCP servers.
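Roughly, the "agents as tools" setup looks like the sketch below: each specialist is its own agent, wrapped in a @tool function that the supervisor can call. The prompts and tool names are illustrative (the real solution also attaches MCP-served tools to each specialist), and the Strands names may differ slightly from the installed SDK.

```python
# Sketch of the supervisor / sub-agent ("agents as tools") pattern with a
# Strands-style SDK; prompts and tool names are illustrative only.
from strands import Agent, tool

travel_agent = Agent(system_prompt="Answer travel questions.")
shopping_agent = Agent(system_prompt="Search the product catalog.")

@tool
def ask_travel_assistant(question: str) -> str:
    """Route travel questions to the travel specialist."""
    return str(travel_agent(question))

@tool
def ask_shopping_assistant(request: str) -> str:
    """Route product searches to the shopping specialist."""
    return str(shopping_agent(request))

supervisor = Agent(
    system_prompt="Decide which specialist to call and compose the final answer.",
    tools=[ask_travel_assistant, ask_shopping_assistant],
)

print(supervisor("What sunglasses should I pack for Bali?"))
```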
54:26
The thing that I think is probably the most interesting here is the analysis agent on the right-hand side. The analysis agent is set up so that it's able to look at and analyze traces; it also takes in the prompts and has an understanding of
54:43
the system architecture. So you're able to pass all of this information to the analysis agent.
The analysis agent can then assess, through those traces and prompts, what changes need to be made, what the common
54:59
errors are, and what needs to change to improve the performance of the system. You can implement this in an automated way.
What we did is set it up so that it returns those recommendations, which the user can then
55:14
implement. But it's pretty straightforward to do on your own.
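Conceptually, the analysis agent is just another LLM call that receives traces, prompts, and a description of the architecture and returns recommendations. A hedged sketch, with a hypothetical call_llm helper and made-up inputs:

```python
# Sketch of the analysis-agent idea: feed sampled traces, current prompts,
# and a system description to an LLM and ask for improvement recommendations.
import json

def call_llm(prompt):
    """Hypothetical stand-in for your model client."""
    return "(stub) recommended prompt and tool-definition changes ..."

def analyze_system(traces, agent_prompts, architecture_notes):
    prompt = (
        "You are an analysis agent. Given the system description, prompts, and "
        "traces below, identify common errors and recommend prompt or "
        "tool-definition changes.\n\n"
        f"ARCHITECTURE:\n{architecture_notes}\n\n"
        f"PROMPTS:\n{json.dumps(agent_prompts, indent=2)}\n\n"
        f"TRACES (sample):\n{json.dumps(traces[:20], indent=2)}"
    )
    return call_llm(prompt)

print(analyze_system([{"span": "tool_call", "error": "timeout"}],
                     {"shopping_agent": "You search the catalog ..."},
                     "supervisor + 4 sub-agents"))
```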
And just to give you an idea of what this looks like: here you can see the current instruction.
This is the shopping agent.
55:29
This was its existing prompt, and this was the recommended adjustment.
It will also look at tool definitions and say, "Oh, this tool definition needs to be a little more robust," or even, "You should consider taking
55:45
this tool and moving it over to a different agent." All right. And so with that, we'll jump into the demo.
So here, let me jump over here.
56:00
Just a simple React front end. Um, nothing too crazy.
Uh, but let's just start by just kind of showing what it does.
56:15
So here you can see we have our supervisor agent, which is answering our questions, but there are a couple of other wrinkles within this UI. We have profile information that we're storing, which can be dynamically adjusted here, and we also have a
56:31
shopping cart. So let's go ahead and start with a simple question.
Oh, and another important piece is the ability to provide feedback. What we did is,
56:48
all the feedback goes to a DynamoDB table (I cleared this out because it was getting large). You can see the feedback and any comments stored here, and they can then be pulled by the agent when it's
57:05
running its cycle. So here we go: this is some information about Bali.
In this case you can see that when the supervisor was given a prompt different from the initial question, it then goes to
57:22
the travel agent. But now let's run something using the Amazon Product Advertising API.
Here we can see it's calling a different agent: it's now using the shopping assistant.
Previously
57:37
it was using the travel assistant, and now it's running searches over the Amazon product catalog. While it's doing this (one of the things we still need to do is implement sub-agent streaming), what I can do in the meantime is
57:53
show you what this looks like within Amazon Bedrock AgentCore. This is the AgentCore console.
Um so you can see these are the the different components here. Um so we have the the runtime.
So this is where uh we have our agent
58:08
deployed here. Uh so you can see you can see the last time it was updated.
So um when you click on an agent, this will show you you know how you can invoke it um using TypeScript, JavaScript or Python. And this also shows you um all of your different versions as well as your endpoints.
Um
58:24
so uh in this case I just have one endpoint up and running. Um and that's what we're we're invoking right now.
Jump back over here. um taking a little bit longer.
It's probably running multiple searches. But
58:41
anyway, what we can also do is look at observability within a given agent. Here I can go into any agent that's deployed within AgentCore,
and I can look and see how many
58:57
agents I have, how many sessions I have, my traces, my error rates, all these standard metrics, and then I can jump into a specific agent and go look at traces.
Here we can see
59:13
just standard trace metrics: what happened each time, and all of the different components in terms of what was actually run using AgentCore.
So then here we've got
59:28
finally got our answer. Um so in this case we can see it's recommending a few different things.
Um so top recommendation go take a look here. So looks like some decent sunglasses.
59:43
Then to kind of round it out, what we can do is uh we can then say, "Please add the Oakley sunglasses to my cart." Oh, I think I should probably specify
59:59
cables. And so then what it'll do is it will um engage cart manager assistant.
And so then it will um go take a look at my cart. In this case, the cart is Dynamo
00:15
DB. There are ways you can do an add-to-cart button as well. By the way, the thing we're using here is the Amazon Product Advertising API,
which is something you can get access to, for example if you're an Amazon
00:32
seller. But overall the idea is that you're able to use these different sub-agents.
So what I'll do, since it looks like there was an issue here, is just jump back over
00:48
to show you what the analysis agent looks like, because that one takes the longest: it's got to pull in a lot of different traces, all the different prompts, and all the tool definitions.
Um, so here, this is one that I ran earlier. So kind of analyze system performance to give me prompt recommendations.
Um so here it's kind of
01:06
looking at response times and usage statistics, plus prompt recommendations. And then here are some system recommendations: the current sequential model creates bottlenecks, optimize the internet
01:21
search, and so on. But you can also ask it for more direct prompt recommendations.
So this is the updated prompt it thinks you should use, using some of those recommendations it had previously.
The same goes for all the different
01:38
agents, including understanding from the tool definitions what structure the data comes in for the Product Advertising API, and the same for the cart manager as well as the actual supervisor agent
01:55
itself. And then you can double-click on different trends and things that it noticed.
In effect, it's a way to talk to your traces. The idea is that you're able to pull all of
02:10
this different information and then actually use it to get actionable recommendations for how to improve your system.
And we were able to see some pretty substantial improvements in going from the base prompts we had to
02:27
more robust prompts. So this is something that we want to try to implement more robustly and at scale.
So with that, I'm going to jump back over to the
02:42
presentation and jump out of this. So, in terms of recommendations for implementation: the example provided
02:58
is kind of a toy example. At scale you need a more robust search mechanism to identify relevant traces.
I think this is where it makes sense to do some kind of tagging on your traces, because if you're using a top-end model to
03:14
analyze 100,000 traces, it's going to cost you a lot of money. So it makes more sense to start with a smaller model to do some analysis, or to effectively add tags, and then filter on those tags to do a more robust
03:32
analysis. In our version, we didn't implement automatic prompt optimization, but it's pretty straightforward to deploy that:
you effectively just need to give it access to the repo where your prompts live, whether that's in GitHub (we have Bedrock
03:49
prompt management as well) or whatever prompt repository you're using. And then, generally speaking, standardized analysis of your user traces is really important for pretty much every production application. Offline
04:07
evaluation pipelines tend to be really robust; I believe online evaluation pipelines should be just as robust, if not more.
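A hedged sketch of that tag-then-filter idea: a cheaper model labels every trace with a coarse tag, and only the flagged subset goes to the expensive analysis pass. The call_small_llm helper and the tag set are hypothetical.

```python
# Sketch: tag traces cheaply, then send only the flagged subset for deeper
# analysis. call_small_llm stands in for a low-cost model call.
TAGS = ["tool_error", "user_frustration", "repetition", "ok"]

def call_small_llm(prompt):
    """Hypothetical cheap model call; returns one of the tags."""
    return "ok"

def tag_trace(trace_text):
    prompt = f"Label this trace with exactly one of {TAGS}:\n{trace_text[:4000]}"
    tag = call_small_llm(prompt).strip()
    return tag if tag in TAGS else "ok"

def traces_for_deep_analysis(traces):
    return [t for t in traces if tag_trace(t) != "ok"]

print(traces_for_deep_analysis(["user asked twice, agent repeated itself ..."]))
```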
And then in terms of examples, we have plenty of examples to get started with for Amazon Bedrock AgentCore if you're
04:23
interested. These are very easy to use, and we have integrations with a number of different agentic frameworks.
So you can go get started and play around. And that is pretty much it.
So I can address any
04:40
questions now that folks have.
04:57
Okay. Yeah.
So there's some interesting questions here. I see one.
Can you enable the agentic workflow to decide on its own what type of agentic reasoning strategy to use, based on the specific use case? Yes.
This is an area I've definitely been looking into, and I think there are ways you could either put some kind of
05:15
small LLM in front that makes a really quick decision about whether this is low, high, or medium, or you can have the main LLM make that judgment call.
The thing is, in order to do that, you really need to make sure you're optimizing that call
05:31
and using as few tokens as possible, because otherwise you're going to be adding additional latency while trying to do dynamic reasoning.
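For example (a hedged sketch; the classifier prompt and the call_small_llm helper are hypothetical), a small, cheap classification call can pick a reasoning-effort level before the main agent runs:

```python
# Sketch: a tiny router that picks a reasoning-effort level before invoking
# the main agent. call_small_llm stands in for a low-latency model call.
def call_small_llm(prompt):
    """Hypothetical quick, low-cost model call."""
    return "medium"

def pick_reasoning_effort(user_query):
    prompt = ("Classify the reasoning effort needed for this request as "
              f"exactly one of low, medium, high.\nRequest: {user_query}")
    level = call_small_llm(prompt).strip().lower()
    return level if level in {"low", "medium", "high"} else "medium"

# effort = pick_reasoning_effort("Plan a 10-day multi-city trip under $3k")
# then pass `effort` into however your agent configures its model call
```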
Um and then yes, happy to to share the presentation. Um
05:49
I can follow up on that. We do have a free tier, and one thing I will say is that we have the Amazon Nova models,
which are priced very competitively. So if you want to run experimentation,
06:04
those are a pretty good way to save money. In terms of business metrics, I would just look at typical metrics, for example marketing metrics: if
06:21
you have a retail system, think about how you calculate lift, for example. Those are pretty well-established metrics; it's just a matter of tying them back to
06:37
the LLM and the agentic performance. That can take some creativity, but there are a lot of standard, fundamental metrics that exist; you just need to figure out how to take
06:52
the data generated from your agent interactions, tie it back, and look at how that interaction is happening. On EC2: AgentCore is effectively like Lambda; it's serverless.
Under the hood it is using EC2, but
07:08
the idea is that you don't have to manage anything. The whole concept of AgentCore is that it's all serverless,
so nothing is something you have to manage. It's going to handle all those different pieces on
07:24
its own, and you'll effectively just call it, and you're only billed as you use it. With EC2, if you're hosting an LLM server, you're going to be paying for every second that
07:40
it's up, versus AgentCore, where you're only paying on a per-invocation basis, very similar to Lambda.
07:55
All right, let's see: are these types of agents now commercialized by businesses? We're starting to see that. At least from my perspective, we've started to see some customers actually moving these to production.
I think they're pretty nascent in general.
08:12
The places that have the most robust deployments are probably ISVs, the software vendors. There are some retailers, though, that are launching them.
I think we're starting
08:29
to see them in terms of the actual capabilities being given to them. Companies are being a little on the conservative side, which totally makes sense given that
08:45
only recently has MCP started to become more secure and have its security vulnerabilities addressed. That's something that we at AWS are spending a lot of time thinking about, and it's one of the reasons we really wanted to make sure we put an identity
09:01
piece into AgentCore. Let's see: if I'm using memory from a third-party
09:16
platform provider, can I integrate their observability metrics into AgentCore? Within the Runtime, you can call whatever API you want, so as long as they have an API, then within the Runtime you can pull that data in.
09:36
So, the difference between AgentCore and Lambda: Lambda has a 15-minute timeout, which is one of the main things.
AgentCore Runtime is designed to have an 8-hour timeout, so it can do much longer jobs,
any kind of deep research. It's
09:53
also configured for streaming, and there's no limit on payload size, which you might hit with Lambda.
It's also priced a little differently; it's more competitive for agents.
Um so when for example
10:08
you're running a multi-agent flow and you have different nodes or different runtimes running, with Lambda you'd pay for that entire process, whereas when you have everything deployed on AgentCore Runtime,
10:25
you're just paying for the actual time each individual runtime's agent was invoked. All right.
Well, um, want to be respectful. I don't know if I'm I'm out of time here.
Uh, I think my session
10:42
officially has ended. >> Isaac, if there are any questions, please feel free to take those.
>> Okay, great. >> Great.
You may take maybe five more minutes. Yeah, that should be okay.
>> Okay, great. Happy to do that.
Um,
10:59
yeah. Yeah.
So I think deep research is probably one where using these really high-reasoning models makes sense. It's just tough to use some of the high-reasoning
11:15
capabilities otherwise. What I'm talking about is that most of these models now tend to have different settings, like low, medium, and high reasoning. Setting high reasoning is going to make the output pretty verbose,
so you're going
11:31
to get a lot of intermediate tokens. For something that's not latency sensitive, that definitely makes sense.
So I think it just depends on
11:47
what kind of application you have. And yeah, there may not be a free tier for Bedrock itself, but what I would say is that, generally speaking, as I mentioned, the Nova models are
12:04
pretty competitively priced. Bedrock itself is just a generative AI service.
Um,
12:22
oh man, I love all these questions. I wish I could answer every single one of them.
Um, okay. Uh, let's go back up.
So, what is special about Bedrock in terms of agentic AI? Bedrock fundamentally has different abstractions.
Maybe let me go back to the stack; that might be helpful.
12:42
Yeah. So, Bedrock is a suite of different services.
You can see that we have the base
13:00
models; that's just the serving.
Those are all the different models we have available. Also, by the way, we do have the OpenAI OSS models, which are also very competitively priced,
so I would check those out as well. They're really powerful
and have the different
13:17
reasoning capabilities as well. Then within Bedrock there are different components.
AgentCore is specifically the piece of Bedrock that is designed for agents, so that's the main agentic piece that we
13:33
are emphasizing. And then we have our Strands SDK, as I mentioned, but you can use whatever framework you want.
There's no restriction: if you prefer CrewAI or LangGraph, you can totally use that with AgentCore Runtime as well.
13:56
So in terms of why the output is verbose on high reasoning: it's because you're instructing the model to think carefully about what it's going to do. Its reasoning process is basically printing out a whole bunch of tokens, effectively thinking out loud.
So,
14:12
you're making it think quite a bit, and that's going to make it more verbose, whereas on low it's going to output a much lower number of tokens. In terms of whether it would be
14:30
overkill to set up a single agent with Bedrock AgentCore? Definitely not.
If you would deploy a single agent on Lambda, you can do the same thing with AgentCore. And then I see another question I really like: the multi-agent setup reminds someone
14:46
of microservices; do they work together implementation-wise?
This is actually another project we're working on, because one of the things we see is that a lot of these multi-agent systems are deployed as a monolith, where you have all the different agents on the same compute.
A more realistic scenario is actually
15:02
having each agent run as its own microservice and then communicating in an agent-to-agent (A2A) style. So that is something where we're looking into more common deployment patterns.
Um
15:18
Let's see, am I out of time, Raja, or do I still have more time? >> So Isaac, please feel free; since the audience is engaged, I think
15:34
uh you can take a few more minutes. >> Okay.
Some of these I'm happy to take offline. I'm not a security expert, so I just want to make sure I'm giving the
15:50
correct answers on some of these. I would want to get back to you folks on those.
Yeah. So, in terms of how to make sure these agents are not
16:05
overkilling the process: yeah, under the hood that was actually me getting throttled.
We actually have lower limits as AWS employees than we give to customers. But
16:20
effectively, the way I think about it is similar to security: each sub-agent gets the least number of tools it needs and the least amount of context it needs.
And I'm sure this got mentioned earlier, but I really like the
16:39
concept of context engineering. Thinking carefully about how to manage that context is definitely going to help.
And one thing I'll also say: you don't necessarily need to use agents for everything. You can use
16:54
workflows if something's simpler and just needs a couple of directional pieces of information and you don't need that flexibility; you don't necessarily need to use agents there.
But for something more complex, where there is a lot of reasoning that
17:09
needs to go into it, there are multiple different paths, and you also want error correction, that's where a workflow kind of falls apart, because you can only have so many branching trees before the logic just gets too complex to manage.
17:28
The agentic framework you shared conforms to the Amazon-native frameworks; if the client wants the solution to run on another cloud platform, how much additional effort will that be? Within the Strands framework you can use different model providers; there isn't necessarily a
17:45
restriction on using just Amazon Bedrock. I think the integrations are going to be best with Bedrock, but we have integrations for all kinds of different model providers. And yes, you can use your own
18:02
MCP server as a target for AgentCore; for that, I would take a look at AgentCore Gateway.
Um see
18:22
trying to see what other questions I've missed.
18:42
Yeah. So, in terms of the trust model, I would go take a look at the AgentCore Identity documentation.
Maybe I have a slide that I can share here, although I don't think I
18:57
have that handy, unfortunately. But I would go take a look at the documentation, and we can pass along some links.
Um, well I I unfortunately have to run but uh thank you so much. I'm I'm really happy to see uh all the questions from the audience.
19:13
These are great questions. Um, and for, uh, folks whose questions I wasn't able to answer, uh, I will try to see if I can find an answer and then post something on LinkedIn.
>> Thank you so much, Isaac. Uh, really there was a lot of engagement.
So, thank
19:29
you so much. >> Thanks everyone.
>> Thank you. Okay, so we are coming to an end for the day.
So, this is our first day out of the five days we will be here. What I'm going to do now is
19:47
give an overview of what we are going to do for the rest of the week, and then we'll call it a day, and maybe also a little bit of an overview of the agentic AI boot camp.
20:04
There have actually been a lot of questions about that boot camp, so I would like to give a quick overview, maybe five to ten minutes.
So let me just make sure that my screen is visible. So um so
20:22
Tomorrow when we come back, we will start with a tutorial by Andrea from Landing AI on complex document processing. The problem is that a lot of
20:41
types of documents, most notably structured data like your Excel documents and CSVs, and sometimes multimodal documents where you have images and text in the same document, are not
20:57
easy to process in most cases. So we will have a tutorial on how Landing AI can help in that case.
Then we have a tutorial on how
21:13
to build a knowledge graph on top of your agentic AI application. Knowledge graphs can correct some of the shortcomings here; they can help you reduce hallucination in your agentic AI applications.
Then uh we have
21:30
vision-enabled agents with Haystack. Then we have a tutorial on building an agentic research assistant with Reka.
Going beyond the chatbot: Michael
21:46
from Docker will be telling us how we can build event-driven agents with GitHub webhooks. Then we have building smarter AI, building agentic workflows with FloTorch, and then we
22:03
have AI agents with vector databases, hands-on with Weaviate agents. Weaviate is one of those databases that I would characterize as AI-first or AI-native vector databases. They have been building a
22:19
lot of interesting features, so basically the agentic behavior is built inside the vector database, and Scott is going to be presenting that.
Then we have uh another tutorial from um uh
22:36
Landing AI on agentic document extraction. Then we have a workshop from Pinecone about agentic AI for semantic search.
Then we have,
22:54
since there were a lot of questions around observability, monitoring, and evaluation when I was presenting, a workshop on observability, evaluation, monitoring, and optimization of agents.
And then finally we have a
23:12
workshop on transformer models and how they come together in our agent applications. There's one more, by LanceDB, on retriever optimization:
when you're building a
23:28
retrieval-augmented generation application, how do you optimize your retrieval? And that should be it.
So there has been questions about uh the boot camp. Uh so
23:43
What I would like to say is that we are actually the first boot camp of this kind in the industry; I don't think there is any other yet.
Maybe there are boot camps popping up now. We have been running this agentic AI and large language models boot camp.
The agentic AI part is from
23:58
the last 6 to 8 months, but we have been running a large language models boot camp covering a lot of the same content for the last two years. We are the first boot camp in the industry that was launched to
24:14
address how you build these agentic AI applications. I will quickly go over this: some of the speakers that you saw today are also the instructors, and in terms of the content, think of this as going from
24:30
0 to 60 in 9 to 10 weeks. We start with the fundamentals of agentic AI.
Uh we have um uh uh pretty much all the
24:46
fundamentals, as long as you can write basic code in Python. If you have a working knowledge of Python and have been working in the space a little bit, no deep expertise in coding and programming is needed, and we
25:03
walk you through it. We start with the very fundamentals and take you all the way through: we start with the introduction, then we go through how you set up memory, how you connect your application to a
25:19
retriever, how you connect it to a vector store, and how you build chains. For all of these topics there are hands-on exercises, and we have labs that are integrated into our learning platform.
So you don't have to set a lot of things up
25:36
on your laptop. Just bring your laptop in, and we take care of everything for you.
There's context engineering that we talk about; we dive deep into all sorts of planning, reflection, and tool usage,
25:53
MCP-related learning, vector databases that you will learn from scratch, agent protocols and interoperability, and how to evaluate agents. And finally, at the end
26:10
of the boot camp, everyone builds a fully functional end-to-end application and deploys it, and everyone leaves with a web application that is connected to
26:25
different kinds of data sources: you build your own RAG setup, your own vector database, and then set up a lot of orchestration around your agents. Let me
26:41
actually very quickly go over this. It's a 30-hour curriculum.
And since we are offering this boot camp in association with the University of New Mexico, what that means
26:59
is that in many companies' cases you may be eligible for tuition reimbursement, because we are partnering with the university. So you don't have to get
27:14
a separate approval; this is part of your benefit.
Many companies have this tuition benefit. If they allow you to take courses at a university, then owing to our partnership with the University of New Mexico, you should be able to attend the boot camp
27:29
potentially for free, but you may want to check with your company. We have also partnered with a lot of companies:
in many cases you get free compute, and in many cases you will be
27:46
getting resources; we have built an ecosystem, so you will be getting a lot of resources for free as part of your registration. I think I've mentioned the
28:01
curriculum already: as I said, we'll start with the fundamentals and take you all the way to the implementation. So even if you don't have any background in machine learning or
28:19
AI, the way we have structured the boot camp, you should be able to get started, and we have some preliminary, preparatory material that we can give you so you can get ready in the
28:35
weeks leading up to the boot camp. Evaluation, we talk about that too. Some of the instructors that you see here, depending on who's available, we will have people from different companies.
28:53
We have been around for quite some time; we are one of the longest-running boot camps in data science. Many boot camps came and disappeared, but we are still around, and we have had
29:10
more than 11,000-12,000 graduates by now. We have two cohorts coming up: one is going to run from September 30th to November 25th, every Tuesday.
This is the morning
29:25
cohort, but if you are in a different time zone and our evening is your morning, or it just works better for you for whatever reason, we have an evening cohort as well, which starts on October 9th and goes all the way to December 4th.
This is the
29:41
Thursday cohort. The morning cohort is 9:00 a.m. to 12:00 p.m.
Pacific time, and the evening cohort is 5:00 p.m. to 8:00 p.m.
Pacific time. And if you register today or tomorrow, or
29:58
during the conference, you can get 40% off the entire boot camp. If you would like to, scan and go check it out, and if you have any questions, let us know. I will go and see if anyone has any
30:15
questions. If not, then we can call it a day.
And I'm trying to find the the chat window. And let me see.
So,
30:30
okay, I had moved it to the other screen. Okay.
Is there any question? Most recent.
Okay. Okay.
So in that case, since there are no
30:47
questions, I am going to end the session, and I will see you all at 9:00 a.m. tomorrow morning.
Thank you everyone.