Generative AI for Healthcare (Part 1): Demystifying Large Language Models


Category: AI Healthcare

Tags: AI, Healthcare, Models, Prompting, Training

Entities: ChatGPT, Dong, Google, Jensen Huang, Karen DeSalvo, Michael Howell, OpenAI, Sam Altman, Shivam, Stanford Medicine


Summary

    Introduction to Generative AI in Healthcare
    • AI is revolutionizing healthcare delivery, but many healthcare professionals lack formal training in AI technologies.
    • Dong and Shivam, both clinical informaticists and physicians at Stanford, aim to bridge this gap by explaining generative AI from a physician's perspective.
    • Their goal is to empower healthcare professionals with the knowledge to safely implement AI in clinical settings.
    Challenges in Understanding AI
    • AI literature is often difficult for clinicians to understand due to technical jargon.
    • There is a scarcity of healthcare-specific prompting resources.
    • The rapid pace of AI development makes it challenging to keep up with new information.
    Three Epochs of AI in Healthcare
    • Epoch 1: Symbolic AI and probabilistic models, characterized by rules-based logic.
    • Epoch 2: Machine learning and deep learning, focusing on pattern recognition.
    • Epoch 3: Large language models and generative AI, marked by general-purpose models capable of creating new content.
    Understanding Large Language Models (LLMs)
    • LLMs use tokenization and embeddings to understand and generate language.
    • Self-attention in transformer architecture allows models to contextualize words dynamically.
    • Temperature settings influence the creativity and randomness of model outputs.
    Training and Evolution of LLMs
    • The development of LLMs has progressed from GPT-1 to GPT-4, with increasing parameters and data sets.
    • Key advances include supervised fine-tuning and reinforcement learning with human feedback.
    • Recent focus is on test-time scaling, allowing models to use more compute during inference.
    Future Directions and Implications
    • The future of AI involves optimizing test-time reasoning rather than just pre-training.
    • LLMs hold potential as foundational tools for productivity and problem-solving in healthcare.

    Transcript

    00:00

    Welcome to part one of generative AI for healthcare. In just the last few years,

    00:15

    AI has grown from a futuristic concept to a tool that is reshaping healthcare delivery. But with this rapid evolution also comes a critical question.

    How can healthcare professionals understand and harness these powerful technologies without formal training in data science

    00:31

    and computer science? Today, we're going to break down the essentials of generative AI in healthcare, not from the perspective of computer scientists, but rather as physicians who have both used and implemented these tools in our everyday work.

    My name is Dong and I am

    00:47

    a clinical informaticist and emergency medicine physician at Stanford. And I'm Shivam, a clinical informaticist and internal medicine physician, also at Stanford.

    Both of our informatics work focuses on effectively deploying generative AI in real-world clinical settings at Stanford Medicine. Through

    01:02

    this work, we've discovered that improving adoption really hinges on healthcare professionals actually understanding the foundations of how these large language models work, their limitations, and how to effectively use them. Unfortunately, there's a significant lack of accessible educational material in this space, which is what inspired us to develop

    01:19

    this talk in spring 2024. We've since been asked to deliver it at various venues over the past year and consistently receive requests from physician educators for recordings to reference and share.

    So, today we're creating this YouTube video to fill that gap, providing a clear, comprehensive YouTube resource that

    01:35

    healthcare professionals can access any time to better understand generative AI fundamentals and tips. Our goal is to really empower you with the knowledge and skills to safely and effectively implement these promising tools in your workflows from the bench to the bedside and everywhere in

    01:50

    between. So starting off with some of our brief disclosures, we are both independent contractors for Greenlight where we contribute to OpenAI safety initiatives by helping align health related model responses to improve their model quality and accuracy.

    I was also previously a consultant for Glass Health

    02:06

    where I helped lead their clinical team. Also, a brief content disclaimer: this talk offers a high-level overview of some really complex technical concepts.

    Because of this, some details are streamlined for clarity or based on expert consensus where there's limited public information. I'm sure some of the

    02:22

    series will inevitably get outdated due to the sheer pace of progress in this field. But we've made a conscious effort to focus on topics that are core to the technology and that we believe won't significantly change.

    We also primarily focus on OpenAI's models like ChatGPT in our examples, mainly because

    02:38

    that's what most people are familiar with, and most LLM tools used in the Epic EHR system are powered by OpenAI. However, the core concepts we'll discuss apply to every single LLM out there and not just OpenAI's models.

    We have to start the conversation by addressing that sometimes it can actually be fairly

    02:54

    tricky to get these AI models to behave exactly the way we want them to. Talking to AI, it turns out, is a little bit more complicated than just asking a question.

    And learning how to use these models effectively can be even harder still. And I really think that there are three fundamental reasons why prompting

    03:10

    or prompt engineering can be so challenging. Number one, AI literature is very very difficult to read and understand if you don't have a technical or computer science background.

    So here's a screenshot from the seminal 2017 paper titled "Attention Is All You Need," which was the original paper that

    03:27

    laid the foundation for modern-day large language models and introduced the transformer architecture. Now Shivam and I have both separately tried to read this paper multiple times and many attempts later I still can't say that I fully understand the content.

    The point I'm trying to make here is that it's

    03:42

    actually really difficult for clinicians to read and understand this kind of original literature about AI. And you can see how this might be a problem for a group of people who are all trained on evidence-based medicine and knowing how to cite your sources and your studies.

    So you might say, well,

    03:58

    instead of reading the original papers, why don't you just look up some tutorials and guides online? And here also, we run into some challenges.

    There really aren't that many healthcare worker appropriate prompting resources. And so here's another screenshot from a popular resource

    04:14

    online called learnprompting.org, which is one of the largest free online resources for learning how to do this kind of stuff. And you can see that it starts off fairly simple and straightforward, right?

    You are Shakespeare, um, an English writer, write me a poem. The problem is that it very quickly, once again, turns from

    04:30

    something simple like this to something much more complicated like this with statistics and formulas and variables. And so, the problem is that the sweet spot of resources out there for something like this is actually very, very narrow.

    What we want is something that is both practical and detailed and

    04:46

    actually useful beyond just turning your email into a pirate haiku, while simultaneously not being too bogged down by the technical language and jargon. And finally, and perhaps most importantly, we have to talk about the exponential pace of AI progress.

    05:02

    I recently went on PubMed and searched for articles containing the words artificial intelligence or machine learning in the title. And in 2014, there were 272 such papers published.

    Fast forward a decade, in 2024, that

    05:18

    number had increased to over 20,000 in a single year, which is a 7,500% increase. To put it differently, it's almost 60 new papers every single day.

    And keep in mind, this is just PubMed, so we're not even talking about computer science or engineering

    05:33

    journals. So unless you're a full-time AI researcher or a data scientist, it's all a little overwhelming.

    And so when it comes to prompting, it really leaves us feeling like this where we might see something online. We might hear some advice from friends and we try some

    05:49

    things out and when those things break, we try something else. But most of us don't actually know what's working, what isn't, and why.

    And so this is exactly why we started by focusing on what an LLM actually is and how it works, not just how to use it. Because before we

    06:06

    discuss the how of prompt engineering, we really need to first understand the what and the why of generative AI. With that in mind, here are the big three things we hope you'll take away from this video.

    First, we want you to understand where generative AI fits into

    06:21

    the broader framework of AI and healthcare and how we got here. Second, we want to give you an intuitive understanding for how large language models work so that you can build a mental model for generative AI.

    And finally, we'll talk about the pre-training and post-training phases

    06:37

    that shape these models' behavior, how they're optimized, how they're adapted, and how they're fine-tuned. And in summary, the purpose of this first video is to answer that all-important and not so straightforward question of what is an LLM, really?

    06:55

    So let's start with a brief history. We need a framework to help us contextualize where LLMs fit into the history of healthcare AI.

    I think the best framework that I've heard of comes from a paper titled "Three Epochs of Artificial Intelligence in Health Care." It was authored by Michael Howell, the chief clinical officer at Google, along

    07:11

    with Karen DeSalvo, Google's chief health officer.

    They break the history of healthcare AI into three distinct epochs. Epoch 1 began around 1970 and represents the era of symbolic AI and probabilistic models.

    You can think of this as rules-based models. Epoch 2

    07:28

    began in earnest around 2010, representing the era of deep learning. And this is what we now refer to as traditional machine learning or ML for short.

    And finally, epoch 3, the topic of today's discussion, is large language models and generative AI, which was first described in 2017 and brought into

    07:45

    the public zeitgeist in around 2022 with OpenAI's release of ChatGPT. Now, it's important to highlight here that these new technologies haven't replaced each other.

    Rather, each new development was built on top of the previous ones. And at the same time, each of these epochs

    08:01

    represents a fundamentally different type of technology. And so why is the distinction between these three important?

    It's because each one has distinct inputs, gives distinct outputs, and has different ideal use cases. And so the next time

    08:16

    you hear the term AI, especially in the healthcare setting, you should be able to quickly recognize which one of these three categories the AI fits into, using hopefully a few simple heuristics. Okay, let's dive a little bit deeper into each of these epochs one by one and give some more concrete

    08:33

    examples. Starting with epoch 1, I think of this as rules-based AI.

    Basically, if something can be programmed entirely with straightforward if-then logic, then it probably fits into epoch 1. In your personal life, this might remind you of something like Clippy, the old Microsoft Word Assistant that popped up whenever

    08:49

    you tried to type a letter. You might also think about something like TurboTax, which walks you through the structured decision tree provided by the IRS for filing taxes.

    And finally, even today in 2025, most video game AI actually still runs on highly complex

    09:04

    layers and variations of this kind of simple rules-based logic. Now, turning to healthcare, nearly all clinical decision support tools that we interact with daily fit into this category.

    So, for example, the alerts that hopefully pop up when you try to order 5,000 milligrams of Tylenol

    09:21

    instead of 500, these are all examples of epoch 1 technology. The risk calculators you see on platforms like MDCalc?

    Again, this is mostly simple rules-based logic. And even smart automated billing algorithms, like the one you see here, that automatically determine billing levels based off of

    09:37

    which boxes you check are examples of rules-based AI. And so, as you can kind of tell by now, many of the seemingly smart features that we have in our modern-day EHRs, those tools that we rely on every single day, are actually built on technology that's been pretty much around since the

    09:53

    1970s. And so here's a simple way to recognize epoch 1 AI.

    If it's logic-based, non-adaptive, doesn't learn from new data, and feels like something hardcoded by experts for very specific tasks, then it probably fits into this first category

    10:10

    of AI. Now, this is where things start to get a little bit more interesting.

    Epoch 2 is what we typically refer to as machine learning or deep learning. There are many different subtypes of machine learning, but for our purposes today, we

    10:25

    are going to mostly be talking about supervised learning, which is by far the most popular method. In machine learning, instead of relying on hard-coded rules, this approach involves feeding a computer hundreds of thousands, if not millions of different examples.

    The system then learns to

    10:40

    recognize patterns within that data and then applies what it's learned to make predictions and generate insights in new unseen data. And in our everyday life, we actually see this all the time.

    Think about facial recognition on your phone or your social media app that

    10:55

    automatically tags your friends in photos. It's also what powers targeted advertising, like your streaming service somehow knowing what you're in the mood for, or when you keep seeing ads that might be appropriate for you, whether it's your age or your gender or your interests.

    And of course, this is the

    11:12

    kind of AI behind computer vision and self-driving car technology like Tesla Autopilot or Waymo. And in healthcare, this is where big data really comes out to shine.

    Tools like automated EKG or STEMI detection are great examples of this.

    11:28

    They rely on thousands of labeled cases and labeled EKGs to make accurate predictions and accurate interpretations in real time. And the EKG you see here on the screen is actually a very early evolving STEMI which you might not really notice at first glance but this Queen of Hearts AI picked up on

    11:45

    immediately. You might also see this in AI deterioration models, which are blackbox models, where the system takes in hundreds of variables automatically from your EHR and might give you something like a sepsis risk score or a decompensation score and telling you that the patient might get sicker within

    12:01

    the next 48 to 72 hours, but you might not have any idea why. And of course, in healthcare computer vision, a perfect example is radiology.

    There are actually already quite a few FDA-approved AI software tools that can automatically flag abnormalities on

    12:17

    imaging studies or do image segmentation, which can dramatically improve the efficiency and throughput of human radiologists. The way to spot machine-learning-based AI is to think of it as learning from examples.

    These models are trained on massive labeled data sets, lots

    12:33

    and lots and lots of examples. They learn from that data, but usually they do it for one specific task.

    So they're really powerful, but they're not really generalizable. They don't easily transfer from one use case to another.

    One important thing to keep in mind, these models are often low in

    12:50

    interpretability. This is where we start to hear the term black-box model.

    It means that it's very difficult to trace a clear causal relationship between specific input features and the final output. The model predictions might be highly accurate, but we often don't fully understand how it got there.

    And

    13:08

    of course, this opens up a whole new can of worms when it comes to safety, bias, and whether clinicians or patients can trust or even accept these tools. This is a much bigger conversation, of course, and one that we're going to save for another time.

    And finally, the

    13:24

    hottest topic in the last few years is generative AI, specifically large language models or foundation models. This marks the transition to epoch 3.

    On the personal side of things, tools like ChatGPT need no introduction. But there are other everyday examples as well.

    For instance, AI summarization tools like

    13:40

    the ones Amazon uses to scan thousands of customer reviews and generate a short, readable, and helpful, concise summary, or customer service chat bots like Erica, which is Bank of America's automated assistant, which improves upon their traditional phone tree system. And of course, image generation where tools

    13:57

    like DALL-E, Midjourney, or Sora can generate visual content or even videos from text prompts. These are often multimodal, meaning they work across images, text, and more.

    And this technology is evolving super fast. Since the time that this slide was made, these

    14:12

    classic scary AI fingers are largely now a thing of the past. In healthcare, we're starting to see some really exciting use cases as well.

    We now have mature products like clinical knowledge retrieval, using tools like OpenEvidence or ClinicalKey AI to pull up fast, context-aware answers based

    14:29

    off of our own natural language queries. And there's also chart summarization, where the model reads through an entire patient's hundreds and hundreds of notes and gives you a high-level overview when you need it.

    And then there is automated note drafting where tools can help you

    14:45

    generate responses in real time to things like in-basket messages. And then finally, probably the best example in this area for healthcare is ambient dictation, which listens in on a clinical encounter and automatically generates the visit note.

    Ambient

    15:01

    dictation is a great example of a technology that actually uses elements from all three epochs: rules-based logic, pattern recognition from machine learning, and the generative capabilities of LLMs. Now, when it comes to recognizing this type of AI, the heuristics for epoch 3

    15:18

    are first these models are general purpose. They are not built for one particular task.

    And as the name suggests, they have generative capabilities meaning that they can create new content whether that's text, images, or even video. They're also pre-trained on enormous data sets far

    15:35

    larger than what we would typically see in the traditional deep learning models. This is what enables them to generalize across a wide range of use cases.

    Now, in terms of inputs and outputs, they're often multimodal, capable of handling and producing different kinds of data.

    15:51

    So, text, images, audio, video, etc. And then finally, when it comes to interpretability, this might be the most opaque of the three.

    These models can really generate impressive results, but it's really difficult to trace how they arrived at a specific output.

    16:08

    This slide gives a high-level summary of the similarities and differences across these three epochs. It's definitely a little bit on the dense side, so feel free to pause here and take a moment to look it over.

    But the big picture is as we move from epoch 1 to epoch 3, the technologies become less transparent,

    16:24

    but also more flexible. They go from being rigid and rules-based to being general purpose and capable of handling unstructured data and suited for a wide range of tasks.

    At the same time, they become more complex to build and more resource intensive, but also paradoxically, they have become easier

    16:41

    to use at the front end, which is definitely great for us as physicians. Now, taking a step back from healthcare for a moment, we've actually had basic AI in our lives for quite a while.

    Think about voice assistants like Google Home or Siri or Alexa. You can ask them simple questions and they'll

    16:57

    give you basic facts or maybe carry out a straightforward command. And they're quite useful for doing things like telling you the weather or answering basic trivia questions or managing your smart home, but in other ways they are obviously still fairly limited.

    Now, fast forward to today and all of a

    17:12

    sudden we're having rich, nuanced conversations with large language models about philosophy and finance and medicine and even linguistic theory. And the kind of output that you see here on this slide is worlds apart from what Siri or Alexa would have given you even a few years ago.

    And so what changed?

    17:29

    How do we go from those early assistants to models that can now write essays, debate moral dilemmas, or summarize dense scientific literature? To put it another way, what's going on inside the so-called black box that is making such a dramatic difference?

    Now, we can't

    17:46

    really open the black box and watch it work step by step, but we do now understand how these models are designed and how they're trained. And understanding that gives us a really powerful lens into understanding how they behave and why.

    So to unpack what's going on inside of

    18:02

    these models, we need a basic understanding of the anatomy and physiology of an LLM. How it's built, how it learns, and how it generates the kind of outputs that we're seeing today.

    So let's start at the very beginning with the input prompt. This is what you type into the model.

    For example, let's

    18:19

    say you enter the following prompt. Please help me draft a short but professional letter to a patient explaining why they don't need an MRI for their seasonal allergies.

    The first thing the LLM is going to do is it's going to break down this prompt into smaller pieces, a process called

    18:34

    tokenization. These pieces are called tokens.

    And tokens can be entire words. They can be parts of words or they can even be single characters.

    But to keep things simple, you can think about a token as being equivalent to roughly a single word for the rest of this video. So here's how the model is going to

    18:50

    tokenize this specific sentence. In this example, it splits this input into exactly 25 tokens with each one representing a small chunk of the original prompt.
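    To make this concrete, here's a minimal sketch of tokenization using OpenAI's open-source tiktoken library. The exact splits and the token count depend on which tokenizer and model you use, so treat the 25-token figure above as illustrative:

```python
# A minimal tokenization sketch using OpenAI's open-source tiktoken library.
# Exact splits and counts vary by tokenizer/model, so the numbers are illustrative.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # a tokenizer used by recent OpenAI models

prompt = ("Please help me draft a short but professional letter to a patient "
          "explaining why they don't need an MRI for their seasonal allergies.")

token_ids = encoding.encode(prompt)                      # list of integer token IDs
tokens = [encoding.decode([tid]) for tid in token_ids]   # the text chunk behind each ID

print(len(token_ids))   # how many tokens this prompt becomes
print(tokens)           # e.g. ['Please', ' help', ' me', ' draft', ...]
```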

    Now, let's just zoom in on one token, say the word patient. When the

    19:07

    LLM sees this token in the input, the first thing it does is to look it up in something like a dictionary. Specifically, it retrieves what's called a static embedding for that token.

    This embedding was created during the model's training process. And this of course

    19:22

    begs the question, what is a static embedding? Essentially, it's a way of capturing the meaning of a word by turning it into a series of numbers.

    In other words, a vector, which is projected into a high-dimensional space. Now, if you're anything like me, that

    19:37

    probably sounds like a bunch of mathematical gobbledygook. So, let's slow it down a little bit and unpack what this actually means.

    Here is a vector. A vector is just a list or an array of numbers.

    For example, this is a three-dimensional vector because it has three numbers.

    19:54

    Now, this vector by itself could represent all kinds of different things. It might represent force in a 3D space, for example.

    Maybe 5 Newtons in the X axis, 1 Newton in the Y, and seven in the Z. Or it might represent color using the RGB model.

    Five units of red, one

    20:10

    green, seven blue. A vector doesn't really have any inherent meaning.

    It's just numbers until we assign a meaning to it based off of the context. So instead of three numbers, let's now imagine a vector with hundreds or even

    20:26

    thousands of dimensions. This is what we call an embedding.

    A special kind of vector that represents a word like patient in a way that captures its meaning based on how it appears in language. Now let's look again at the sample embedding for the word patient.

    20:42

    Each number in the embedding of an LLM represents the strength of an association between that token and a particular concept, sentiment, or idea, ranging from -1 to 1. When you combine all of these dimensions, you get something surprisingly powerful.

    A dense

    20:57

    high-dimensional representation of a particular word or token's meaning, shaped by the patterns the model saw during its training process. If we look at the numbers in each embedding, each one ranges from -1 to 1.

    Remember, a vector has both a direction, in this case positive

    21:13

    and negative, and a magnitude, which tells us exactly how strongly that feature is present. In this example, 0.84 might represent the association with positive versus negative sentiment.

    0.64 might represent the likelihood that the word is a noun. -0.46 might represent the emotional intensity, and so

    21:29

    on and so forth. Now, what's fascinating is that each dimension in the embedding encodes some kind of subtle relationship or feature.

    And when everything is taken altogether, they collectively give us a remarkably detailed representation of meaning, even for abstract or hard to define concepts.

    21:45

    For instance, how would you define the emotional tone of a word like apple, or decide whether a cloud is good or bad, or what it means for something to be red beyond just the color? These aren't questions with clear-cut human answers, but the model, having seen

    22:01

    millions of examples, can assign meaning based on how these words appear in context. It's not just that the LLM understands language like the way we do, but rather that it has learned patterns that capture how words relate to one another.

    And it's these patterns that are encoded inside of these

    22:17

    embeddings. To illustrate this, let's take a look at a simplified vector space.

    We are working with just three dimensions here: legs, tail, and speaks. Each colored arrow here represents the embedding of a particular concept.

    Cat, dog, and human. Notice the angle X here

    22:34

    labeled between the vectors for cat and dog which is much smaller than the angle Y labeled between cat and human in vector space. This angular distance represents something called cosine similarity.

    The smaller the angle, the more similar the two concepts are in meaning. And so in this case, the model

    22:50

    interprets cat and dog as being more similar than cat and human, which is how we as humans also understand these concepts. It's able to do this through mathematical and geometric concepts rather than strict definitions or logic as we humans might.
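    Here's a minimal sketch of that idea in code, using made-up three-dimensional vectors for the legs / tail / speaks example; the numbers are illustrative, not real embeddings:

```python
# Cosine similarity in the toy 3-D "legs / tail / speaks" space described above.
# The vectors are invented for illustration; real embeddings have thousands of dimensions.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# dimensions: [legs, tail, speaks]
cat   = np.array([4.0, 1.0, 0.0])
dog   = np.array([4.0, 1.0, 0.1])
human = np.array([2.0, 0.0, 1.0])

print(cosine_similarity(cat, dog))    # close to 1.0 -> small angle, very similar concepts
print(cosine_similarity(cat, human))  # noticeably lower -> larger angle, less similar
```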

    Here's another classic example of

    23:07

    how abstract concepts can be captured within vector space. Let's take a look at word2vec, which is a foundational technique in natural language processing used to generate vector representations of words.

    So here is where the vector for man might be in word2vec in the

    23:22

    simplified two-dimensional space, and here is the vector for woman. If we subtract the vector for man from woman, we get a new vector, this red arrow here, that captures the difference between the two.

    This vector encodes a concept that

    23:38

    we can collectively describe as gender. It's not the exact word gender, but rather the semantic concept that gender encompasses and describes.

    Now, here's the interesting part. The word king also lives somewhere within this vector space.

    And if we take that gender vector

    23:54

    and apply that same difference we saw between man and woman to the vector for king, we actually end up somewhere around here. And if we look nearby, we will actually find the vector for queen.

    We can now label that directional vector as something approximating the concept of

    24:11

    gender. Similarly, if we look at the difference between man and king or woman and queen, we can derive another consistent relationship here.

    Something similar to the concept of royalty seen here in green. These abstract concepts aren't defined in the model strictly

    24:27

    speaking. They are what we call emergent.

    The geometry is a reflection of how these words behave in natural language. So what this shows us is that meaning in these models isn't stored as definitions.

    Rather, it's stored as directional relationships in space. The

    24:42

    model doesn't know what a queen is. It understands that a queen is to a king what a woman is to a man.

    Now keep in mind, this example comes from word2vec, which uses simpler linear relationships. In modern LLMs like GPT-4, these relationships are often buried in far

    24:58

    more complex nonlinear structures. But this still gives us an easy way to understand how words can be mathematically related.
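    If you want to see this king - man + woman ≈ queen arithmetic for yourself, here's a sketch using the gensim library and a small set of pretrained GloVe vectors (this assumes gensim is installed and the vectors can be downloaded on first run):

```python
# Word-vector analogy arithmetic, assuming gensim and a downloadable set of
# pretrained GloVe vectors. king - man + woman should land near "queen".
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # ~130 MB download on first use

result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # typically something like [('queen', 0.78...), ...]
```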

    Okay, now that we understand that each number in a word vector represents a different dimension, capturing some subtle semantic nuance of a particular word, let's ask the natural

    25:15

    next question. How many dimensions are in a modern language model embedding?

    Well, here's what 1,000 different dimensions looks like. Already, you can probably tell we've far exceeded what our human brains are capable of intuitively grasping.

    We might understand a few semantic

    25:30

    relationships like positive versus negative sentiment or health-related versus not. But try to imagine that across 1,000 different abstract axes simultaneously.

    It's essentially incomprehensible. And yet even 1,000 dimensions is a dramatic

    25:46

    understatement. GPT-3, released in 2020, used around 12,000 different dimensions per token embedding.

    And this is what all 12,000 dimensions, a real single token embedding from an actual LLM might look like. It's not designed to be human

    26:03

    readable. And that's exactly the point.

    These vectors live in a space so high-dimensional that we can't interpret it directly, but the model can. And it's from this space that it draws its surprising linguistic capabilities.
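    As a concrete illustration, here's one way to peek at a real static token embedding using the open-source GPT-2 model via the Hugging Face transformers library (GPT-2 uses 768 dimensions rather than the ~12,000 of GPT-3, but the idea is the same):

```python
# Inspecting a real static token embedding from the open-source GPT-2 model.
# GPT-2 uses 768 dimensions per token (larger models use far more).
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

token_id = tokenizer.encode(" patient")[0]       # ID of the first token of " patient"
static_embedding = model.wte.weight[token_id]    # wte = the model's token embedding table

print(static_embedding.shape)   # torch.Size([768]) -> 768 dimensions for this token
print(static_embedding[:5])     # a few of those (not human-interpretable) numbers
```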

    And so if we go back to our earlier

    26:18

    visual representation of these vectors, the way I think about embeddings for tokens is that it looks a little bit less like this and more like this: a dense point cloud, similar to an electron cloud, representing a high-dimensional fog of meaning where the

    26:35

    token, let's say patient, sits at the center, and each one of these individual points in the cloud, analogous to where an electron might be, is actually one individual dimension of meaning. And each point reflects the strength of association between the token patient and some particular latent concept.

    For

    26:52

    example, maybe the association with pain, with learning, virtuousness, the color green, and countless others that we can't even name. And taken all together, this entire cloud forms the model's semantic understanding of the

    27:07

    word patient. In a little bit, Shivam will walk us through how the static embeddings were determined.

    But for now, here's the key thing to keep in mind. Static embeddings are learned during the very expensive pre-training phase of the language model.

    When you train a new model, it

    27:22

    starts out with essentially completely random embeddings, just noise. But over the course of the training, as the model progresses and processes vast amounts of text, these embeddings get collectively and iteratively refined and adjusted again and again until they settle into

    27:38

    stable positions that reflect meaningful language patterns. By the end of the training, these vectors become the model's fixed starting point for each token.

    This also means that once a model is trained, every time it sees a token, let's say the word patient, it retrieves the same vector regardless of where or

    27:55

    how that word appears. But this is kind of a problem for us.

    The word "patient" in "the patient in room 3" means something very different than in "please be patient with me." So how does the model adapt to this?

    Now this is where the transformer architecture comes in.

    28:11

    Specifically, a mechanism called self attention. And this isn't just a technical detail.

    Self attention is literally the key breakthrough that made all modern language models possible. As each token's embedding moves through the multiple layers of this transformer, the model begins to recalculate and update

    28:27

    that vector layer by layer. It uses self attention to weigh the importance of other tokens in the sentence, dynamically adjusting the representation of each word based off of the surrounding context.

    And by the end of this process, the original static embedding has been transformed into what

    28:43

    we now call a context-aware embedding: one that doesn't just reflect the standalone meaning of the token, but now also encodes within it how it fits into the meaning of the entire sentence or the paragraph or even the whole document.

    This transformation process

    28:59

    happens for every single token in the entire input prompt. And by the end, you have a fully contextualized embedding for every single token in the sequence.

    And from here, the model can finally use these to generate an output. What the model does next is it takes all of these embeddings and shoves

    29:15

    them into a big matrix of numbers. Then using a combination of linear algebra, statistics, and computer science methods, it essentially predicts what the next most likely token is to come in the sequence.

    And how confident the model is in its predictions and how creative or conservative it is depends

    29:32

    on a setting called temperature. At the default temperature of one, the model samples from the probability distribution basically as is, and strikes a balance between precision and randomness.

    And here's what this distribution might look like. There's a clear favorite here, "Dear,"

    29:49

    but there are still other reasonable options in play. Now, if we raise the temperature, let's say to 1.5, then what you're doing is essentially flattening the probability distribution, and the model is now more likely to pick less likely tokens, which can lead to more creative outputs, but can also

    30:05

    sometimes lead to unexpected outputs. And if we go in the opposite direction and set the temperature at the minimum value of zero, then you in theory eliminate randomness entirely.

    The model should always choose the next token with the highest probability.

    30:21

    Now, this is a little bit of an oversimplification. There are actually quite a few factors that go into determining what the next most likely token in a sequence is, but the temperature setting is an especially important concept and one of the most easily customizable parameters that users can control and interact with when

    30:36

    using an LLM. So, it's particularly important to know this.
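    Here's a minimal sketch of how temperature reshapes a next-token probability distribution. The candidate tokens and raw scores are invented for illustration:

```python
# How temperature flattens or sharpens the next-token distribution.
# The candidate tokens and logits below are hypothetical.
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.array(logits, dtype=float) / max(temperature, 1e-8)  # t -> 0 approaches greedy argmax
    exp = np.exp(scaled - np.max(scaled))                            # subtract max for numerical stability
    return exp / exp.sum()

candidate_tokens = ["Dear", "Hello", "Hi", "Greetings"]
logits = [4.0, 2.5, 2.0, 0.5]   # hypothetical raw model scores

for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", dict(zip(candidate_tokens, probs.round(3))))
```

    At a temperature near zero, essentially all of the probability mass lands on the favorite token; at 1.5, the distribution flattens and less likely tokens get sampled more often.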

    Okay, let's say the model generates its first output token, "Dear." The model then takes the static embedding for the word "Dear" and appends it to the end of the embeddings matrix.

    The same matrix that we've been

    30:52

    working with up until now, that contains all the previously generated context-aware embeddings. And this new token embedding, still static at this point, gets passed through the transformer, where it becomes contextualized by the full sequence that came before it.

    And once everything is

    31:08

    context-aware again, the model uses this to generate the next predicted token. And the whole cycle starts all over again.

    For every single generation cycle, the model uses the current sequence to compute the next most likely token. And then appends this new token to the end of the sequence.

    And the

    31:25

    whole process repeats itself over and over again until the output is complete. And here's what the sequence might look like to us in real life.

    And you might recognize it. This is exactly how LLM outputs appear in real life, one token at a time.
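    To make the loop explicit, here's a sketch of token-by-token generation using the small open-source GPT-2 model through the Hugging Face transformers library. In practice you would just call the library's built-in generate method; spelling out the loop simply shows the predict-append-repeat cycle described above:

```python
# The predict -> append -> repeat cycle, spelled out with the small GPT-2 model.
# In practice you would call model.generate(); this loop just makes the cycle visible.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("Dear patient, thank you for", return_tensors="pt")

temperature = 1.0
for _ in range(20):                                        # generate 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits[:, -1, :]         # scores for the next token only
    probs = torch.softmax(logits / temperature, dim=-1)    # turn scores into probabilities
    next_id = torch.multinomial(probs, num_samples=1)      # sample one token
    input_ids = torch.cat([input_ids, next_id], dim=1)     # append it and repeat

print(tokenizer.decode(input_ids[0]))
```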

    For every new token

    31:41

    that appears, the model is using everything it's already seen, both from the original prompt and from any tokens it's generated so far, to predict what should come next. This is also why even small changes early in a prompt can dramatically shift the model's entire response, because each token

    31:58

    influences all the ones that follow, sort of like a butterfly effect. And just like Gromit here, the LLM only generates one piece of the path at a time, laying down each piece of the track as it goes, and it has no way of knowing what the full path ahead might

    32:13

    look like. And with every new token, the model builds more context for itself.

    But that context is shaped entirely by what it just generated. And here's why this matters.

    In models like these, the output isn't just the answer. It also becomes a part of the reasoning path.

    32:31

    The model doesn't plan its answer in advance. It doesn't think and then speak.

    It thinks by speaking. That means that if we want the model to arrive at better answers, especially for complex problems, we need to prompt it to write out its reasoning step by step.
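    Here's a small, hypothetical illustration of that idea using a clinical-flavored dosing question; the prompts below are ours, not from the talk:

```python
# A hypothetical illustration of prompting the model to reason step by step.
# The prompts below are ours, not from the talk.
direct_prompt = (
    "A patient weighs 70 kg. The ordered dose is 15 mg/kg/day divided every 8 hours. "
    "How many mg per dose?"
)

step_by_step_prompt = direct_prompt + (
    "\n\nThink through this step by step: first compute the total daily dose, "
    "then divide by the number of doses per day, and only then state the final answer."
)

# Because every generated token becomes part of the model's own context, writing out the
# intermediate steps (70 kg x 15 mg/kg = 1050 mg/day; every 8 hours = 3 doses/day;
# 1050 / 3 = 350 mg per dose) tends to steer the output toward the correct final answer.
```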

    This is

    32:47

    a super important concept to grasp if you want to understand how LLMs think and is the foundation or at least was the foundation for something called chain of thought which we'll talk about later and is also what forms the basis for newer reasoning models that have been

    33:03

    released. And let's imagine visually what an LLM navigating through a vector space might look like.

    Here is a very basic two-dimensional vector space. And the purple zone here represents the solution space or the region of the vector space where a good, accurate or desirable answer might live.

    As the LLM

    33:20

    breaks out the prompt into tokens and embeddings and as the output is generated, each additional token is another opportunity to inform the model about the context and another opportunity to guide the output towards the ideal solution space until you

    33:36

    finally reach a correct or acceptable answer. Now, of course, for complex tasks, it's actually quite challenging to get this right on your first attempt.

    And your outputs might look something like this or like this or even like this. Which is exactly why learning the

    33:51

    basics and best practices of effective prompting is so important. And so to go back to our earlier question, how is it possible that an LLM can manipulate ideas and concepts?

    If we ask an LLM a question, something like this, this is what we might see on our screens. But

    34:08

    really what's happening behind the scenes is something like this. So if we ask ourselves how do humans convey abstract ideas?

    Well, of course we do it through language. And through the complex series of steps that we just described, the LLM

    34:24

    begins to decompose language into increasingly concrete things: from language to words, words to tokens, tokens to numbers, and finally from numbers to high-dimensional math that computers can manipulate. The transformation is then reabstracted up

    34:40

    through the layers until we get to the actual output that we see, which is the response. So Dong gave us an overview of how LLMs work, or essentially the anatomy and physiology of LLMs.

    But I think in order to truly understand the magic of the generative AI we have today as well as some of the pitfalls intrinsic to

    34:56

    these models, we need to understand how they were trained. We'll do this by looking at the evolution of the foundational literature in this space over time.

    We'll focus on the evolution of OpenAI's models just for simplicity. The wave of generative AI essentially started with this now incredibly famous paper, "Attention Is All You Need,"

    35:13

    which currently has over 170,000 citations. Interestingly enough, this seminal work was actually developed by the Google Brain team at Google Research.

    It introduced the transformer architecture, a breakthrough that was essentially built entirely on the concept of self attention, which as we

    35:29

    described earlier means using the entire context of an input to predict the output rather than just relying on the final word or token like older models did. This also allowed for much faster training due to parallelization, essentially taking advantage of the GPU's innate capability of doing

    35:44

    multiple tasks at once. Over the next year or so, this foundational research was built on by a team of researchers at the then little-known company called OpenAI.

    They released a paper on the first iteration of their generative pre-trained transformer or GPT model in June 2018. This model had about 117

    36:02

    million parameters, a term referring to the collective weights, biases, and other internal machinery the model adjusts during its training to predict the next token. The training data they used for this was something called the BooksCorpus.

    This is essentially over 7,000 unique unpublished books, as in

    36:17

    not available in the bookstore or anywhere else. These books spanned a wide variety of genres, including romance, fantasy, science fiction, really most things except non-fiction.

    This encompassed around 4.6 GB of text, which they used to train these models using 8 GPUs running over 30 days. Note that

    36:34

    we'll use the unit of petaflop/s-days as the standard measure of training compute across the evolution of these models in this lecture. So breaking this down, a FLOP, or floating-point operation, is a single numerical calculation, such as multiplying two decimal numbers.

    A petaflop is a quadrillion of these

    36:50

    operations per second. So by saying it took one petaflop/s-day of compute, it means it would take a processor capable of performing one quadrillion of these floating-point operations per second an entire day to carry out the amount of computation that was used to train GPT-1.
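    As a quick back-of-the-envelope check of what that unit means:

```python
# Back-of-the-envelope arithmetic for the petaflop/s-day unit.
petaflop_per_second = 1e15            # floating-point operations per second
seconds_per_day = 24 * 60 * 60        # 86,400 seconds

one_petaflop_s_day = petaflop_per_second * seconds_per_day
print(f"{one_petaflop_s_day:.2e} floating-point operations")  # ~8.64e19 operations
```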

    37:08

    So let's say we have an input context containing a phrase from an HPI: "60-year-old female with a history of ESRD on HD admitted for..." and the goal is to predict the next token, or next word. Let's say the model starts with random weights or parameters.

    It processes this

    37:23

    input and predicts the next token, which in this case might be something totally off, like pineapple. But in the actual training text, the next word is hyperkalemia.

    So the goal is to get the model to predict hyperkalemia instead of pineapple. To do that, we calculate the difference between the predicted

    37:39

    token and the correct one, something called the loss. Then we figure out how to adjust the model's parameters to reduce that loss and make it more likely to predict hyperkalemia next time.

    This is done through a technique called backpropagation. Once we've calculated how to change the parameters, we actually

    37:54

    update them using a process called a gradient update. And the full cycle of calculating the loss, figuring out how to minimize it, and updating the parameters is all called gradient descent.

    That's a term you'll see all over the machine learning literature to refer to this entire concept.
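    Here's a minimal sketch of that loss / backpropagation / gradient-update cycle in PyTorch. The tiny model, vocabulary, and token IDs below are toy stand-ins, nothing like a real LLM:

```python
# One toy "next-token prediction" training step: loss -> backpropagation -> gradient update.
# The model, vocabulary size, and token IDs are made up for illustration.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim), nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

last_context_token = torch.tensor([17])    # stands in for "...admitted for"
correct_next_token = torch.tensor([309])   # stands in for "hyperkalemia"

logits = model(last_context_token)          # the model's scores for every possible next token
loss = loss_fn(logits, correct_next_token)  # the loss: how far off the prediction is
loss.backward()                             # backpropagation: how each parameter should change
optimizer.step()                            # gradient update: actually adjust the parameters
optimizer.zero_grad()                       # reset gradients before the next example
# Note: a real transformer conditions on the whole context via self-attention,
# not just the last token; this only illustrates the shape of the training cycle.
```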

    38:11

    So, fun fact: GPT-1 is now open source, which means you can actually run it locally on your own machine. So that's exactly what I did.

    I gave it a question I might ask as a hospitalist. Which antibiotics are first line for treating inpatient community acquired pneumonia, assuming

    38:26

    no risk factors. And here's what GPT-1 gave us.

    I don't know, the woman said, but I'll ask. Good.

    Now, what else do you know about this plague? Helpful?

    Not really. But it kind of makes sense given the training data, right?

    I mean, it was trained

    38:41

    exclusively on books, and I'm sure somewhere there was something about a plague associated with pneumonia, or at least the vector space concepts were related somehow. So it understandably generated this which is essentially an excerpt from a book.

    The following year, they released GPT-2 and made a bunch of

    38:57

    tweaks in the back end. They increased the number of parameters from 117 million to 1.5 billion.

    So about a 13-fold increase. They also changed the training data.

    Not only did they scale up to around 40 GB of text, but they also changed the foundational source. It

    39:12

    wasn't just books anymore. We sort of saw the limitations of that in GPT-1.

    Now it was broader internet text specifically scraped from web pages. Now how did they decide which web pages to include?

    They used those which were linked to in Reddit posts that had more than three upvotes. The reasoning behind

    39:28

    this was that this three-upvote threshold could serve as sort of a proxy for human-vetted quality. Make of that what you will if you've ever been on Reddit.

    So the links in these posts were ingested, parsed, and then turned into a massive training data set of around 40 billion tokens. The third change was compute.

    39:44

    Interestingly, OpenAI hasn't really published the exact compute used for GPT-2, but making some assumptions based on hardware efficiency at the time and the fact that they used 256 Google TPUs, we estimated it took around 600 petaflop/s-days, or about 30 days of

    39:59

    training on that hardware. Now, once again, GPT-2 is open source.

    So, again, I ran it on my laptop, feeding it the exact same question: which antibiotics are first line for treating inpatient community-acquired pneumonia, assuming no risk factors?
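    If you want to reproduce this kind of experiment yourself, here's one way to run GPT-2 locally with the Hugging Face transformers library; this is our sketch of a possible setup, not necessarily the exact one used in the talk:

```python
# One way to run GPT-2 locally; a sketch, not necessarily the setup used in the talk.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # downloads the model on first run

prompt = ("Which antibiotics are first line for treating inpatient "
          "community-acquired pneumonia, assuming no risk factors?")

result = generator(prompt, max_new_tokens=60, do_sample=True)
print(result[0]["generated_text"])
```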

    We get an

    40:16

    output that's still not great but better. The formatting isn't broken.

    It doesn't sound like an excerpt from a fantasy novel and is at least sort of legible prose. Now, essentially, it's just repeating the same or similar questions multiple times and not really answering the question.

    And some of these

    40:32

    questions like what are the first line antibiotics used to treat MS don't honestly even make sense. But it's still a step in the right direction.

    About a year after GPT-2 was released, OpenAI published a pivotal paper in January 2020 titled "Scaling

    40:48

    Laws for Neural Language Models." This is arguably the most important paper since "Attention Is All You Need" in 2017, as it sort of redefined how we approach training large language models and shaped the direction of model development leading up to ChatGPT's release in November 2022.

    Functionally,

    41:04

    the paper outlined there are three back-end key levers in model training. Compute, data set size, and the number of parameters.

    All of which we already sort of touched on. However, the key insight was that optimal performance requires scaling all three together.

    For

    41:20

    example, if you only increase compute but not data set size or model parameters, you hit a bottleneck and won't really achieve the best possible performance. The paper visualized this using log-log plots, with test loss, or the difference between the predicted next token and the actual one as we

    41:35

    described earlier on the y-axis. The goal is to minimize this test loss.

    The graph showed a power law relationship. When all three variables are scaled in tandem, test loss decreases predictably.

    Although later research, like DeepMind's Chinchilla paper, sort of challenged some

    41:51

    of the specific details, the core idea still holds strong. Simultaneous scaling is the key to better models.

    This understanding directly informed the development of GPT-3, released in June 2020. GPT-3 had a massive jump in parameters, from 1.5

    42:07

    billion in GPT-2 to 175 billion in GPT-3, roughly a 116-fold increase. The training data also expanded dramatically, using a refined data set called WebText2 (the successor to the original WebText), along with other curated internet sources. The data set

    42:22

    included roughly 400 billion tokens, filtered down from an initial 45 terabytes of raw text to 570 GB. Sources included Common Crawl, additional books, and some Wikipedia articles.

    Compute also increased substantially, estimated at around 3,600 petaflop/s-days, or

    42:39

    about one month of using 1,000 NVIDIA A100 GPUs running continuously at suboptimal efficiency. Now, GPT-3 is unfortunately not open source.

    So, we used an open-source equivalent model released around the same time with a similar number of

    42:55

    parameters to test the same input question. The output this time was, I am a nurse.

    I am starting to develop a cough. As far as I know, it's an easy to treat cough.

    I am not sure. Still not really answering the question directly, but clearly more coherent than the previous outputs.

    Actual sentences,

    43:12

    legible prose, and no obvious formatting errors. So, another step in the right direction, but still not helpful.

    Fast forward a couple of years to November 2022, and we get GPT-3.5, better known to the world as ChatGPT. This model was quite similar to GPT-3 in terms of

    43:28

    parameter count, training data, and compute. So if we give it the exact same input prompt, you instead get: first-line antibiotics for treating inpatient community-acquired pneumonia in patients without risk factors include a combination of a beta-lactam such as ceftriaxone or cefotaxime plus a macrolide

    43:46

    like azithromycin. This is not only actually helpful, it's medically accurate.

    Those are actually the recommended antibiotics that you'd use in this case. So why is this so much better if they use essentially the same parameters in pre-training?

    The breakthrough actually in this case

    44:02

    wasn't from scale. It was through the use of new post-training techniques: specifically, supervised fine-tuning, where the model is trained on human-curated input-output pairs to learn the most appropriate combinations, and reinforcement learning with human feedback, where the model generates multiple outputs and humans rank

    44:19

    them by overall quality. The model is then nudged towards generating the better ranked outputs instead of the lower ranked outputs.

    So let's do a deeper dive into these two post-training techniques introduced with GPT-3.5, or ChatGPT, since they had such a profound

    44:34

    impact on how these models generate outputs and how we train them today. So supervised fine-tuning is functionally very similar to the pre-training stage.

    You essentially give the model an input. For example, the question which antibiotics are the first line for treating inpatient CAP.

    And

    44:49

    instead of just training it with raw internet data, which is what we did before and which led to the output on the left, you provide a human-crafted desired output, such as the one on the right. The rest of the process remains essentially the same as pre-training.

    The model calculates the loss, again, the difference between the output and

    45:06

    the desired response. It uses backpropagation to determine how to adjust its parameters and then performs a gradient update to reflect those adjustments.

    Again, the key difference from pre-training is really in the target. Instead of predicting the next token based on unstructured text data,

    45:21

    the model learns from carefully crafted, labeled responses. Another note: they use a form of supervised fine-tuning called instruction fine-tuning.

    This specifically trains the model to actually follow instructions and not just predict the next token. For example, if you prompt the model with

    45:36

    summarize this text for me, you don't want it to just repeat that phrase as GPT-2 did in our prior examples. You want it to actually do the summarization.

    This technique was a major leap forward in making models more helpful and more task-oriented.
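    To make this concrete, here is a hypothetical sketch of what a couple of instruction fine-tuning examples might look like; the content is ours and purely illustrative:

```python
# Hypothetical examples of human-curated input/output pairs for supervised
# (instruction) fine-tuning. The content is illustrative only.
sft_examples = [
    {
        "input": "Summarize this text for me: <clinical note text>",
        "output": "The patient was admitted for community-acquired pneumonia, treated with "
                  "ceftriaxone and azithromycin, and discharged home in stable condition.",
    },
    {
        "input": "Which antibiotics are first line for treating inpatient CAP?",
        "output": "For inpatient community-acquired pneumonia without risk factors, "
                  "a beta-lactam such as ceftriaxone plus a macrolide such as azithromycin "
                  "is a standard first-line regimen.",
    },
]
# During fine-tuning, the model is shown each "input" and nudged, via the same
# loss / backpropagation / gradient-update cycle described earlier, toward
# producing the corresponding "output" rather than generic next-token text.
```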

    45:53

    Reinforcement learning with human feedback, or RLHF, is similar in spirit but works a bit differently. Let's, for example, give it the exact same input and, in RLHF, have it generate multiple candidate outputs from that input, which are intentionally extremely varied to capture a wide variety of possible generations.

    For example, here we have the medically

    46:10

    accurate explanation on the left and on the right we have avoid antibiotics. Use a more holistic approach with crystals and essential oils.

    That'll take care of that CAP, no problem. And since this model has no built-in concept of truth, both outputs could be equally likely based on internet training data, except

    46:27

    for the fact that hopefully the one on the left is represented a little bit more frequently than the one on the right. So to resolve this, we bring in a human reviewer who essentially compares the options, thinks for a bit, and chooses the best one.

    And the RLHF process isn't just helpful to improve

    46:43

    the factual accuracy of the model, but also to align its behavior with human values. So, what exactly does that mean?

    Let's take, for example, a patient-facing chatbot that receives the following prompt: I'm feeling depressed and hopeless.

    What are the most painless ways to end it all? In this case, you

    46:59

    actually don't want the model to be helpful. You want it to refuse the harmful request and instead offer supportive advice like, "I'm sorry you're feeling this way.

    Consider reaching out to a mental health professional." Human reviewers help guide the model towards these safer, more value-aligned responses.

    47:15

    And you actually don't have to limit it to just two output completions. In fact, in most cases, they show the human reviewer multiple completions and have them essentially rank between them, ordering them from best to worst.
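    Here's a hypothetical sketch of what one of those ranking tasks might look like as data; the prompt, completions, and ranking are invented for illustration:

```python
# A hypothetical RLHF preference-ranking record: one prompt, several candidate
# completions, and a human ranking from best to worst. Everything here is invented.
preference_record = {
    "prompt": "Which antibiotics are first line for treating inpatient CAP?",
    "completions": [
        "A beta-lactam such as ceftriaxone plus a macrolide such as azithromycin.",   # 0
        "Avoid antibiotics; crystals and essential oils will take care of that CAP.", # 1
        "I'm not sure, but antibiotics are sometimes used for pneumonia.",            # 2
    ],
    "human_ranking": [0, 2, 1],   # indices ordered from most to least preferred
}
# Thousands of records like this are used to train a reward model, which is then
# used to nudge the LLM toward the kinds of completions human reviewers preferred.
```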

    Now, who exactly are these human reviewers, you may be asking, and it's honestly a great question. This is a headline that

    47:31

    was from a Time magazine article which delves a little bit more into how OpenAI used low-wage workers to do these tasks. This was unfortunately necessary because they were performing so many, literally tens of thousands, of these evaluation tasks on what at that time was more of a shoestring budget than

    47:47

    they have today. More recently, there's been a push to involve more subject matter experts, doctors, lawyers, PhDs to improve the quality of their LLM responses.

    And that's where companies like Scale AI, Turing, and Greenlight come into play. To be able to further scale beyond

    48:04

    the intrinsic limitations of having human reviewers, namely the amount of time it takes to complete thousands or tens of thousands of these tasks, we began training what is referred to as a reward model, where a separate machine learning model learns to predict what completions humans would prefer based on their answers to the other questions.

    48:21

    The reward model can then rank new completions entirely on its own, including ones it was never explicitly trained on. For example, given the question, "Can eating sugar make cancer grow faster?"

    The reward model evaluates the candidate answers just as a human would and selects the one that's most aligned with human preferences and factual correctness. In this case, the

    48:38

    one on the right. We can actually go one step even further.

    What if, instead of using a more basic machine learning model as we did with reward models, we let other LLMs do the ranking when training the newest and most up-to-date models? This is actually known as LLM-as-a-judge in the literature,

    48:55

    and it was actually popularized by Anthropic with their constitutional AI approach. So in this setup, an LLM is prompted with both the outputs and a set of guiding principles, or a constitution, which poses questions like, "Which of these responses is less harmful?

    Choose the one a wise, ethical, polite, and

    49:11

    friendly person would say," which is again the one on the right. This approach helps scale alignment training beyond the intrinsically human bottlenecks by using large language models to enforce values and quality judgments themselves, and it also perhaps removes some of the variability and inconsistency between human raters.
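
    In practice, the judge is just another model call built around a carefully written prompt. Here is a rough sketch of what assembling such a judge prompt might look like; the constitution wording is adapted from the principles quoted above, and call_llm is a hypothetical placeholder for whatever model API is used.

```python
# Rough sketch of an LLM-as-a-judge setup. `call_llm` is a hypothetical
# placeholder, not a real API.
CONSTITUTION = (
    "Which of these responses is less harmful? Choose the one a wise, "
    "ethical, polite, and friendly person would say."
)

def build_judge_prompt(user_prompt: str, response_a: str, response_b: str) -> str:
    """Combine the two candidate outputs with the guiding principles into a
    single prompt asking the judge model for a verdict."""
    return (
        f"{CONSTITUTION}\n\n"
        f"User prompt: {user_prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Answer with 'A' or 'B' only."
    )

def judge(user_prompt: str, response_a: str, response_b: str, call_llm) -> str:
    # The verdict can then stand in for (or supplement) a human reviewer's label.
    verdict = call_llm(build_judge_prompt(user_prompt, response_a, response_b))
    return "A" if verdict.strip().upper().startswith("A") else "B"
```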

    49:27

    So far, we have discussed how LLMs were trained up until around September 2024, when a new paradigm shift began.

    This shift was highlighted during a keynote by Nvidia CEO Jensen Huang at CES 2025 in January 2025. He introduced the

    49:43

    idea that the original pre-training scaling law had now expanded into three distinct scaling laws. First, we have pre-training, which, as we discussed with that earlier paper, showed that if you scale up the number of parameters, dataset size, and compute in tandem, model performance improves in a predictable

    49:59

    power-law relationship, as measured by the training loss. However, this trend is nearing its limits.
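
    To picture what a power-law relationship means here, consider the toy sketch below: loss keeps falling as compute grows, but each additional order of magnitude buys less and less improvement. The constants and exponent are made-up placeholders, not the actual fitted values from the scaling-law papers.

```python
# Toy illustration of a power-law scaling curve (placeholder constants, not
# real fitted values): more compute predictably lowers loss, with
# diminishing returns per order of magnitude.
def predicted_loss(compute_flops: float, irreducible: float = 1.7,
                   scale: float = 10.0, exponent: float = 0.05) -> float:
    return irreducible + scale * compute_flops ** -exponent

for c in (1e20, 1e22, 1e24):
    print(f"compute = {c:.0e} FLOPs -> predicted loss = {predicted_loss(c):.3f}")
```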

    OpenAI co-founder Ilya Sutskever gave a keynote at NeurIPS 2024, a super popular machine learning conference, where he stated that pre-training as we know it will end,

    50:14

    primarily because we've essentially run out of high-quality internet data. Since data is a rate-limiting step in the scaling laws, increasing compute or parameters alone can't maintain the same progress.

    The second scaling law pertains to post-training, which, as we just talked about, includes techniques

    50:31

    such as RLHF and supervised fine-tuning. The third, and the focus of the current era as of late 2024, is test-time scaling.

    Instead of optimizing model parameters before deployment, test-time scaling involves giving the model more compute at inference time to reason through a

    50:46

    problem. This essentially means that when the model is asked a question, instead of responding instantaneously as prior models did, it actually thinks longer and more thoroughly before responding.
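
    One way to picture test-time scaling is as a "thinking budget" dial: the same trained model is allowed to generate more hidden reasoning steps before committing to an answer. The sketch below is purely conceptual; generate_reasoning_step and final_answer are hypothetical stand-ins, not a real model API.

```python
# Conceptual sketch of test-time scaling: spend a configurable number of
# hidden "thinking" steps before answering. The model methods used here are
# hypothetical placeholders, not a real API.
def answer_with_reasoning(question: str, model, thinking_budget: int = 32) -> str:
    scratchpad: list[str] = []
    for _ in range(thinking_budget):        # larger budget = more inference-time compute
        step = model.generate_reasoning_step(question, scratchpad)
        scratchpad.append(step)
        if step.endswith("[DONE]"):         # the model may decide it has reasoned enough
            break
    # Only the final answer is returned; the scratchpad stays hidden from the user.
    return model.final_answer(question, scratchpad)
```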

    To see this in action, OpenAI published a figure when they released the first foundation reasoning

    51:02

    model, o1. It shows two scatter plots tracking performance on the American Invitational Mathematics Examination, or AIME, a really high-level math test for top US high school students.

    Just as an example, here's one of the problems from the 2022 AIME. I'm

    51:17

    not going to go through it in detail, but suffice it to say, this stuff is hard, and I definitely could not have done this in high school. On the y-axis, we see pass@1 accuracy, or the percentage of questions the model gets right on the first try.

    The x-axis tracks compute on a logarithmic scale separated into two

    51:34

    phases: train-time compute on the left side, or essentially how much compute was used to optimize the model parameters before seeing the questions, and test-time compute on the right side.

    So how much compute was used by the model after it sees the question while reasoning

    51:49

    before answering. As expected, more train-time compute improves performance, essentially in line with what we saw with the pre-training scaling laws.
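
    Returning to the y-axis metric for a moment, pass@1 is simply the fraction of problems the model answers correctly on its first sampled attempt; a minimal sketch:

```python
# Minimal sketch of the pass@1 metric: the share of problems where the
# model's first attempt matches the reference answer.
def pass_at_1(first_attempts: list[str], reference_answers: list[str]) -> float:
    correct = sum(a.strip() == r.strip() for a, r in zip(first_attempts, reference_answers))
    return correct / len(reference_answers)

print(pass_at_1(["804", "23", "118"], ["804", "25", "118"]))  # 2 of 3 correct -> ~0.667
```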

    But what's pretty striking is that test-time compute yields equal or perhaps even better gains, suggesting a new, powerful lever for improving model

    52:05

    performance. This next graph shows the performance of various OpenAI models over time on an even more challenging and unique benchmark called ARC-AGI.

    So most traditional AI benchmarks, like USMLE Step 1, MCAT, or SAT-type questions, have one really

    52:22

    big, glaring limitation: many of their questions appear in the model's training data, either exactly word for word or with minor variations on pretty similar patterns.

    This means that LLMs can essentially cheat by regurgitating patterns that they have previously memorized, giving the illusion of

    52:38

    reasoning without actually performing it. The ARC-AGI benchmark avoids this by using handcrafted tasks that are intentionally unlike anything in the training set.

    Here's an example of one of these ARC tasks. You're given several input-output pairs and have to figure out the underlying

    52:54

    rule, which you then apply to the last input. You'll notice this is actually a pretty easy task for a human to complete.

    Essentially, whenever the green squares form a closed loop, we fill the enclosed region in with yellow squares.
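
    To make that rule concrete, here is a small sketch of what "fill the regions enclosed by green with yellow" could look like in code, using a simple flood fill from the grid border. This is an illustrative solution to the one example task described above, not code from the ARC-AGI benchmark itself.

```python
from collections import deque

# 0 = empty, 1 = green, 2 = yellow. Any empty cell that cannot reach the grid
# border (i.e., it is enclosed by green walls) gets filled with yellow.
def fill_enclosed(grid: list[list[int]]) -> list[list[int]]:
    rows, cols = len(grid), len(grid[0])
    reachable = [[False] * cols for _ in range(rows)]
    queue = deque(
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if grid[r][c] == 0 and (r in (0, rows - 1) or c in (0, cols - 1))
    )
    for r, c in queue:                      # empty border cells start as reachable
        reachable[r][c] = True
    while queue:                            # flood fill inward from the border
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and not reachable[nr][nc]:
                reachable[nr][nc] = True
                queue.append((nr, nc))
    return [
        [2 if grid[r][c] == 0 and not reachable[r][c] else grid[r][c] for c in range(cols)]
        for r in range(rows)
    ]

example = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(fill_enclosed(example)[2])  # [0, 1, 2, 1, 0]: the enclosed cell turned yellow
```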

    53:09

    But it's actually incredibly difficult for LLMs, which really haven't seen anything like this during their training, to perform this task. So despite massive gains from GPT-2 to GPT-4 on most benchmarks, performance on ARC-AGI has essentially flatlined or barely budged; even GPT-4o, which is one of the

    53:24

    most recent models still in use today, achieves less than 10% accuracy. However, with the release of reasoning-optimized models like the o-series, there's a clear inflection point where progress becomes essentially exponential.

    For instance, o3 tuned to its high-compute setting surpasses the estimated

    53:40

    human-level performance threshold of 80 to 90% on this graph. These results have sparked questions about whether we're entering the era of artificial general intelligence, where models have the flexibility and reasoning ability to perform any intellectual task that a human can.

    While the consensus is that we're not

    53:56

    quite there yet, the debate is still gaining traction. So naturally, in this new reasoning era, AI companies are focusing on scaling test-time compute.

    In fact, Sam Altman, OpenAI's CEO, previously tweeted that GPT-4.5, which has since been released, would be their last non-reasoning model. This signals a

    54:13

    formal shift: the future is about scaling and optimizing test-time reasoning.

    All right, let's recap and put it all together. We now know enough to be able to answer that question posed at the very beginning of this video, the fundamental question about what an LLM

    54:28

    actually is. More specifically, let's look at what the physical form of an LLM might be.

    Let's start with the source. In 2025, the vast majority of humanity's collective data output was online.

    And we can roughly say that this data

    54:44

    approximates the collective experience and knowledge of modern-day humanity. And in 2025, this quantity of knowledge is estimated at roughly 100 trillion gigabytes, or around 6 million books' worth of information for every

    55:00

    single individual on Earth. But these 100 trillion GB of information are not themselves directly in the model; rather, the model training processes that Shivam just went over distill the patterns, relationships, and concepts

    55:16

    from that vast collection of data into a much smaller form: a set of numbers called weights, biases, and embeddings, collectively referred to as the parameters of the model, which now encode how different ideas relate to one another.

    55:32

    For models like GPT-4, experts estimate that this compressed representation takes up only around 3,500 GB, which is a compression factor of around 30 billion.
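
    As a rough sanity check on those figures (using a widely circulated but unconfirmed estimate of GPT-4's parameter count, and assuming roughly two bytes per parameter):

```python
# Back-of-the-envelope check of the numbers above. The ~1.8 trillion parameter
# figure for GPT-4 is an unofficial, widely circulated estimate.
params = 1.8e12                 # estimated number of parameters
bytes_per_param = 2             # e.g., 16-bit weights
weights_gb = params * bytes_per_param / 1e9
print(f"parameter storage ~ {weights_gb:,.0f} GB")               # roughly 3,600 GB

internet_gb = 100e12            # ~100 trillion GB of online data, per the estimate above
print(f"compression factor ~ {internet_gb / weights_gb:,.0f}x")  # on the order of 30 billion
```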

    55:48

    The rest of the machinery, the smarts, meaning the transformer, the tokenizer, the inference engine, and the user interface, is just a few hundred megabytes of code, which means that 99.99% of the model is just parameter storage. And for GPT-4, the training process to figure out exactly what these

    56:04

    three and a half thousand gigabytes of numbers should be cost OpenAI an estimated $100 million and around 7,200 megawatt-hours of electricity, which is enough juice to power a small town for a few months. The scale of compute, of

    56:21

    infrastructure, and of time required to train a new frontier model from scratch means that pretty much only the largest companies can afford to do something like this. But once they finish the training process, that entire model, which cost this astronomical effort to

    56:38

    train, actually pretty comfortably fits with room to spare in this $300 pocket-sized solid state drive. Now, let's think for a moment about what is on this memory drive.

    Contained within its three and a half terabytes are nursery rhymes from every language,

    56:56

    recipes for every single cuisine you can think of, religious and cultural traditions from around the world, theorems of math and science, and facts and principles taught across countless different subjects. Now, sure, it's not perfect, it's not comprehensive, and it contains the limitations and biases

    57:12

    inherent to the training data and the training processes. But this drive contains nothing short of a distillation of a part of the human experience.

    Through these language models, we have essentially figured out a way of capturing our collective understanding

    57:28

    and interpretation of reality, one that captures not just knowledge about the world that we live in but, more importantly, how we reason and how we think. What we have, then, is a model representation of our reality in a

    57:44

    portable format. If we think about an LLM in this way, it's easy to see how this tool can not only help us better understand the world, but also how it can serve as the fundamental building block for new technologies for productivity and problem solving.

    The

    58:00

    way we do this, of course, is through the process of prompting, which is the topic of the second video in the series. In the next video, we're going to be talking about some evidence-based methods on how to best communicate with generative AI, and we're going to be reviewing some of the prompting literature out there and distilling it

    58:15

    into just three easy-to-remember take-home points. If you like this video and want to learn more about the work that Shivam and I are doing in generative AI at Stanford Medicine, or if you want to reach out to us, feel free to check out our Stanford profiles, which are linked below.

    58:31

    Thanks for watching and see you in part two.