Nobel Laureate John Jumper: AI is Revolutionizing Scientific Discovery


Category: AI in Science

Tags: AI, biology, DeepMind, proteins, research

Entities: AI for science, AlphaFold, DNA, Google DeepMind, proteins



Transcript

00:00

This is something of a nice change. I've given a lot of scientific talks, and normally no one claps and cheers when I come on. It's really exciting.

It's really

00:15

wonderful to be here. I guess I should start off assuming that not everyone in this cavernous hall knows who I am.

Who am I? I'm someone who has done some work in AI for science, someone who really believes that we can use AI systems,

00:31

these technologies, these ideas, to change the world in a very specific way: to make science go faster, to enable new discoveries. I think it's really wonderful.

We have the opportunity to take these tools, these ideas

00:47

and aim them toward the question of how we can build the right AI systems so that sick people can become healthy and go home from the hospital. And it's been a really wonderful and winding journey for me to end up here.

I was originally trained as a physicist. I

01:04

thought I was going to be a laws-of-the-universe physicist. If I was very, very lucky, I could do something that would end up as one sentence in a textbook.

And I did physics, and I went off to do a PhD in physics. And then kind of

01:19

what I was working on didn't really grab me. It just didn't feel like what I wanted to do.

So I dropped out. I didn't start a startup.

That would have been very on point for this event, but I dropped out, and I ended up working at a company that was doing computational

01:35

biology. How do we get computers to say something smart about biology?

And I loved it. I loved it not just because it was fun, but because it was something that would let me do what I thought I was good at.

Write code, manipulate equations, think hard thoughts about the nature of the

01:51

world, and use it toward this very applied purpose: at the end, we want to make medicines, or we want to enable others to make medicines. Then I really became a biologist and a machine learner.

Actually, a machine learner, because I left that job

02:07

and I went back to grad school in biophysics and chemistry, and I no longer had access to the incredible computer hardware I had at my previous job. In fact, they had custom ASICs for simulating how proteins, this part of your body that

02:23

I'll talk about, move. Since I didn't have that anymore, but still wanted to work on the same problems, well, I didn't want to just do the same thing with less compute.

So I started to learn, and I was getting very interested in statistics, in machine learning.

We

02:38

didn't call it AI back then. In fact, we didn't even call it machine learning.

That was a bit disreputable. I said, I'm working in statistical physics.

But you know, how are we going to develop algorithms? How are we going to learn from data and do that instead of very large compute?

And I guess it turns out that in AI you need those ideas in addition to very large

02:56

compute to answer new problems. And after this I joined Google DeepMind, really joining a company that wanted to ask: how are we going to take these powerful technologies and all of

03:11

these ideas, and it was becoming very readily apparent how powerful these technologies were, with applications especially to games, but also to things like data centers and others. How are we going to take these technologies and use them to advance science and

03:27

really push forward the scientific frontier? And how can we do this in an industrial setting, at an incredibly fast pace, working with some really smart people and with great computer resources? With all that, you darn well better make some progress. And it's been really

03:42

really fun, and the fact that I'm on this stage indicates that we made some progress. The guiding principle for me has been that when we do this work, ultimately we are building tools that will enable

03:58

scientists to make discoveries. And what I think is really heartening about the work we've done, the part that still resonates with me at my core, is that there are about 35,000 citations of AlphaFold.

But

04:13

within that are tens of thousands of examples of people using our tools to do science that I couldn't do on my own, using them to make discoveries: be it vaccines, be it drug development, be it how the body works.

04:30

And I think that's really exciting. The story I want to tell you today is a bit about the problem, a bit about how we did it,

and especially about the role of research, machine learning research, and the fact that it isn't just off-the-shelf machine

04:46

learning. And then I want to tell you a little bit about what happens when you make something great, how people use it, and what it does for the world. So, I'll start with the world's shortest biology lesson.

The cell is complex. For people who have only studied

05:04

biology in high school or in college, you might have this idea that the cell is a couple of parts with labels attached to them, and it's kind of simple. But really it looks much more like what you see on the screen.

It's dense. It's complex.

In terms of

05:19

crowding, it's like the swimming pool on the 4th of July, and it's full of enormous complexity. Humans have about 20,000 different types of proteins.

Those are some of the blobs you see on the screen. They come together to do practically every function in your cell.

05:36

You can see that the green tail is the flagellum of an E. coli; that's how it moves around.

And you can see, in fact, how it moves around. And you can see that thing that looks like it turns; in fact, it turns and drives this motor.

All of this is made of proteins.

05:52

When people say that DNA is the instruction manual for life, well, this is what it's telling you how to do. It's telling you how to build these tiny machines.

And biology has evolved an incredible mechanism to build the machines it needs, literal nano

06:09

machines, and build them out of atoms. And so your DNA gives you instructions that say: build a protein.

Now, you might say, your DNA is a line, and so are proteins, in a certain sense. It's instructions on how to attach one bead after another, where each bead is a

06:24

specific kind of molecular arrangement of atoms. And you should wonder: if my DNA is a line, and I am very much not one-dimensional, what happens in between?

And the answer is after you make this protein and assemble it one piece at a time, it will

06:41

fold up spontaneously into a shape, like you've opened your IKEA bookshelf and, instead of you having to do the hard work, it simply builds itself, and you get this quite complex structure. You can see a quite typical protein, a kinase, for those of you who

06:56

are biologists in the audience. And you can see this very complex arrangement of atoms, and that arrangement is functional. The majority, though not all, of the proteins in your body undergo this transformation, and that is what functions. And that is

07:13

incredibly small. Light itself is a few hundred nanometers in wavelength, and that's a few nanometers in size.

So it's smaller than you can see in a microscope. And for a long time scientists have wanted to

07:28

understand this structure because they use it to predict how changes in that protein might affect disease. How does that work?

How does biology work? Often if you make a drug it is to interrupt the function of a certain protein like this one.

07:44

Now, scientists have, through an incredible amount of cleverness, figured out the structure of lots of proteins, and it remains to this day exceptionally difficult, right?

You shouldn't imagine this as: I want to determine the

07:59

structure of a protein, so I shall open the lab protocol for protein structure determination and follow the steps.

Rather, it consists of cleverness, of ideas, of trying many ways.

In this case, I'm describing one type of protein structure prediction, or protein structure,

08:16

sorry, determination: experimental measurement, where you convince that big, ugly molecule I just showed you to form a regular crystal, kind of like table salt. No one has an easy recipe for this.

So, they try many things. They have ideas and it's exceptionally

08:32

difficult and filled with failure, like many things in science. And you're really looking at one way to get an idea of how difficult this is.

Just one kind of ordinary paper that we were using. I flipped to the back and it said, you

08:48

know, in their protocol, after more than a year, crystals began to form. Right?

So, not only did they do all these hard experiments, but they had to wait about a year to find out if it worked. And probably that year wasn't spent waiting.

It was spent trying a thousand other things that didn't work as well.

09:03

Once you do that, you can take this to a synchrotron, a modest thing; you can see the cars ringing the outside of this instrument. You shine incredibly bright X-rays on it and get what is called a diffraction pattern, and you can solve that and deposit

09:20

it in what's called the PDB, or Protein Data Bank. And one of the things that enabled the work we did is that scientists 50 years ago had the foresight to say: these are important, these are hard.

We should collect them all in one place. So there's a data set

09:37

that represents essentially all the academic output of protein structures in the community, available to everyone. So our work was on very public data.

About 200,000 protein structures are known. They increase pretty regularly at

09:53

about 12,000 a year. But this is much much smaller than the need.

Getting the kind of input information, the DNA that tells you about a protein, is much, much easier. So

10:09

billions of protein sequences are being discovered. We are learning about protein sequence roughly 3,000 times faster than protein structure.
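
To make that gap concrete, here is the arithmetic using only the numbers quoted in the talk, as a minimal Python sketch:

    # Figures quoted in the talk: ~12,000 new structures a year, with
    # sequences arriving roughly 3,000x faster.
    structures_per_year = 12_000
    sequence_speedup = 3_000

    sequences_per_year = structures_per_year * sequence_speedup
    print(f"~{sequences_per_year:,} sequences/yr vs ~{structures_per_year:,} structures/yr")
    # -> ~36,000,000 sequences a year, consistent with billions piling up
    #    while known structures stay in the hundreds of thousands.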

Okay, that's all scientific content, but I should talk to you about the little

10:24

thing we did, which has this kind of schematic diagram. We wanted to build an AI system.

In fact, we didn't even care if it was an AI system. That's one of the nice things about working in AI for science: you don't care how you solve it.

If it

10:39

ended up being a computer program, if it ended up being anything else. We wanted to find some way to get from the left, where each of those letters represents a specific building block of the protein in a particular order, to put something in the middle, the AlphaFold, and to end up with

10:55

something on the right. And you'll see two structures there if you look closely, where the blue is our prediction and the green is the experimental structure that took someone a year or two of effort.
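
How close is "close" in a picture like that? One standard score is the RMSD between predicted and experimental C-alpha coordinates after optimal superposition. A minimal sketch in Python with NumPy, using toy coordinates in place of real parsed structures; this is an illustration, not AlphaFold's own evaluation code:

    import numpy as np

    def kabsch_rmsd(pred, expt):
        # Center both N x 3 coordinate sets, then find the optimal rotation
        # (Kabsch algorithm) before measuring the remaining deviation.
        p = pred - pred.mean(axis=0)
        q = expt - expt.mean(axis=0)
        u, _, vt = np.linalg.svd(p.T @ q)
        d = np.sign(np.linalg.det(vt.T @ u.T))   # guard against reflections
        r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T  # optimal rotation matrix
        return float(np.sqrt((((p @ r.T) - q) ** 2).sum(axis=1).mean()))

    rng = np.random.default_rng(0)
    coords = rng.normal(size=(100, 3))           # a toy 100-residue "protein"
    print(kabsch_rmsd(coords, coords))           # identical structures -> 0.0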

If you want to put an economic value on it, it's on the order of

11:10

$100,000. And you can see we were able to do this, and I want to tell you how. There were really three components to doing this, or to doing any machine learning problem. You can say you

11:25

have data, you have compute, and you have research, and I feel like we tell too many stories about the first two and not enough about the third. In data, we had 200,000 protein structures.

Everyone has the same data.

11:41

In terms of compute, this isn't LLM scale. The final model itself was 128 TPU v3 cores, roughly equivalent to a GPU per core, for two weeks.

This is again within the scope of, say, academic

11:58

resources. But it's worth saying: when you think about how much compute you need, don't get distracted by the number for the final model. The real cost of compute is the cost of the ideas that didn't work, all the things you had to do to get there.
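
For scale, the arithmetic on that final run, sketched in Python; the 20x multiplier for discarded ideas below is an illustrative assumption, not a figure from the talk:

    # Final training run as described: 128 TPU v3 cores, each treated as
    # roughly one GPU, running for two weeks.
    cores = 128
    days = 14
    final_run_hours = cores * days * 24
    print(f"final run: ~{final_run_hours:,} device-hours")  # ~43,008

    # Hypothetical multiplier for all the experiments that didn't pan out.
    exploration_multiplier = 20
    print(f"with failed ideas: ~{final_run_hours * exploration_multiplier:,} device-hours")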

12:14

And then, finally, research. I would say this is all but about two of the people who worked on this; it's a small group of people that ends up doing this. So really, when you look at these machine learning breakthroughs, there are probably fewer people behind them than you imagine. And really,

12:31

this is where our work was differentiated. We came up with a new set of ideas on how to bring machine learning to this problem. And I can say earlier systems, largely based on convolutional neural networks, did okay.

12:46

They certainly made progress. If you replace that with a transformer, you're honestly about the same.

If you take the ideas of a transformer and much experimentation and many more ideas, then that's when you start to get real change. And in almost all the AI systems

13:03

you can see today, a tremendous amount of research and ideas, what I would call mid-scale ideas, are involved. It isn't just about the headlines, where people will say transformers, you know, scaling, test-time inference.

These are all important but they're one

13:20

of many ingredients in a really powerful system. And in fact, we can measure how much our research was worth. AlphaFold 2 is the system that is quite famous, the one that was quite a large improvement.

AlphaFold 1 was the best

13:35

in the world, but the AlQuraishi lab did a very careful experiment where they took the AlphaFold 2 architecture and trained it on 1% of the available data, and they could show that AlphaFold 2 trained on 1% of

13:51

the data was as accurate as, or more accurate than, AlphaFold 1, which was previously the state-of-the-art system. So there's a very clean result that says the third of these ingredients, research, was worth a hundredfold of the first of these

14:08

ingredients, data. And I think this is generally really important. As you're all in startups, or thinking about startups, think about the amount to which ideas, research, discoveries amplify data,

14:26

amplify compute. They work together: we wouldn't want to use less data than we have, and we wouldn't want to use less compute than we have available. But ideas are a core component when you're doing machine learning research, and they really helped to transform the world.

14:41

>> YC's Next Batch is now taking applications. Got a startup in you?

Apply at ycombinator.com/apply. It's never too early.

And filling out the app will level up your idea. Okay, back to the video.

We can even go back and we can do ablations and we can say

14:58

what parts matter. And don't focus too much on the details.

We pulled this from our paper. You can see here this is the difference compared to the baseline.

And you take either of those, and you can see that each of the ideas you might remove from our final system, kind of discrete, identifiable ideas, some of

15:15

which were incredibly popular research areas within the field. Like, this work came out, and a part of it was equivariant, and people said: equivariance, that is the answer; AlphaFold is an equivariant system, and it's great; we must do more research on equivariance to

15:31

get even more great systems. Well, I was very confused by this, because the sixth row there, no IPA, invariant point attention, removes all the equivariance in AlphaFold, and it hurts a bit, but only a bit. AlphaFold itself,

15:48

on this GDT scale that you can see on the left graph: AlphaFold 2 was about 30 GDT better than AlphaFold 1, and equivariance explains two or three points of this.
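
For readers unfamiliar with the metric: GDT_TS runs from 0 to 100 and averages the fraction of residues placed within 1, 2, 4, and 8 angstroms of the experimental positions. A simplified Python sketch; real GDT searches over many superpositions for the best coverage, whereas this version assumes the structures are already superposed:

    import numpy as np

    def gdt_ts(pred, expt):
        # Per-residue C-alpha error, in angstroms, for superposed N x 3 arrays.
        dist = np.linalg.norm(pred - expt, axis=1)
        # Fraction within each cutoff, averaged, on a 0-100 scale.
        return 100.0 * np.mean([(dist <= c).mean() for c in (1.0, 2.0, 4.0, 8.0)])

    # Toy check: every residue off by 3 A passes only the 4 A and 8 A cutoffs,
    # so the score is (0 + 0 + 1 + 1) / 4 * 100 = 50.
    pred = np.zeros((100, 3))
    expt = pred + np.array([3.0, 0.0, 0.0])
    print(gdt_ts(pred, expt))  # 50.0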

It isn't about one idea. It's about many mid-scale ideas that add up to

16:05

a transformative system. And it's very very important when you're building these systems to think about what we would call in this context biological relevance.

We would have ideas that were better; we kind of got our system going, grinding out 1% at a time.

But what really

16:21

mattered was when we crossed the accuracy threshold at which it mattered to an experimental biologist who didn't care about machine learning. And you have to get there through a lot of work and a lot of effort.

And when you do, it is incredibly transformative. And we can measure against this axis, where the

16:38

dark blue shows the other systems available at the time. And this was assessed.

Protein structure prediction is in some ways far ahead of LLMs, or the general machine learning space, in having blind assessment. Since 1994,

16:53

every two years, everyone interested in predicting the structure of proteins gets together and predicts the structure of a hundred proteins whose answer isn't known to anyone except the research group that just solved it, right? Unpublished.

And so, you really do know what works. And we had about a third of the error of any other group on this

17:10

assessment. But it matters because once you are working on problems in which you don't know the answer, you get to really measure how good things are.

And you can really find that a lot of systems don't live up to what people believe over the course of their research. And because

17:26

even if you have a benchmark, we all overfit our ideas to the benchmark, right? Unless you have a held-out set.

And in fact, the problems you have in the real world are almost always harder than the problems you train on, right? Because you have to learn from a lot of data and you

17:41

apply it to very important, singular problems. So it is very important that you measure well, both as you're developing and when people are trying to decide whether they should use your system.

External benchmarks are absolutely critical to figuring out what

17:57

works, and that's what really helps drive the world forward. So, just some wonderful examples of this: this is typical performance for us.

These are blind predictions. You can see they're pretty darn good.

Also important: we made it available. We did a lot of assessment, but we decided

18:13

that it was very important to make it available in two ways. One is that we open-sourced the code, about a week before we released a database of predictions, starting originally at 300,000 predictions and later going to 200 million: essentially every protein

18:29

from an organism whose genome has been sequenced. And this made an enormous difference.
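
That database lives on as the AlphaFold Protein Structure Database hosted at EMBL-EBI. A minimal sketch of pulling one prediction from it; the file-naming pattern and model version (v4) reflect the database at the time of writing, so check the site for the current scheme, and P69905 here is human hemoglobin subunit alpha:

    import urllib.request

    uniprot_id = "P69905"  # human hemoglobin subunit alpha
    url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"

    with urllib.request.urlopen(url) as resp:
        pdb_text = resp.read().decode("utf-8")

    # Quick sanity check: count C-alpha atoms, one per modeled residue.
    n_ca = sum(1 for line in pdb_text.splitlines()
               if line.startswith("ATOM") and line[12:16].strip() == "CA")
    print(f"{uniprot_id}: {n_ca} residues in the predicted model")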

And one of the most interesting, kind of sociological, things is this huge difference between when we released a piece of code that specialists could use, and we got some information, and then when we made it

18:44

available to the world in this database form. It was really interesting: you know, you release something, and every day you check Twitter, or check X, to find out what's going on.

And what we would really see is even after

19:01

that CASP assessment, I would say that the structure predictors were convinced this enormous advance had obviously solved the problem. But general biologists, the people we wanted to use it, the people who didn't care about structure prediction but cared about proteins to do their experiments, they

19:16

weren't as sure. They said, "Well, maybe CASP was easy.

I don't know." And then this database came out and people got curious and they clicked in and the amount to which the proof was social was extraordinary that people would look and say how did deep mind get access to my

19:34

unpublished structure? You know, this was the moment at which they really believed it: everyone either had a protein that they hadn't solved, or had a friend who had a protein that was unpublished, and they could compare, and that's what really made the difference.

And having this database, this

19:50

accessibility, this ease led everyone to try it and figure out how it worked. Word of mouth is really how this trust is built.

And you can kind of see some of these testimonials, right? I wrestled for three to four months trying to do

20:06

this scientific task. You know, this morning I got an AlphaFold prediction, and now it's much better.

I want my time back, right? You know, you really appreciate AlphaFold when you run it on a protein that for a year refused to get

20:22

expressed and purified. Meaning, for a year they couldn't even get the material to start experiments.

These are really important. When you build the right tool, when you solve the right problem, it matters, and it changes the lives of people who are doing things, not things that you would do, but things built on top of

20:39

your work. And I think it's just extraordinary to see these and the number of people I talked to.

The time that I really knew this tool mattered: there was a special issue of Science on the nuclear pore complex a few months after the tool came out.

And

20:54

the special issue was all about this particular, very large, several-hundred-protein system. And three out of the four papers in Science about this made extensive use of AlphaFold.

I think I counted over a hundred mentions of the word AlphaFold in Science, and we

21:11

had nothing to do with it. We didn't know it was happening.

We weren't collaborating. It was just people doing new science on top of the tools we had built and that is the greatest feeling in the world.

And in fact, users do the darnedest things. They will use tools in ways you didn't know were possible.

The

21:28

tweet on the left, from Yoshitaka Moriwaki, came out two days after our code was available. We had predicted the structures of individual proteins, and we were working on building a system that would predict how proteins come together.

But this researcher said, "Well, I have AlphaFold. Why don't

21:45

I just put two proteins together and put something in between?" You could think of this as prompt engineering, but for proteins. And suddenly they found out this is the best protein interaction prediction in the world, right?
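
The hack in that tweet can be sketched in a few lines: treat two chains as one artificial chain joined by a flexible spacer, and see whether the model docks the two halves together. The sequences and the glycine-serine linker below are purely illustrative, and later multimer-aware models made the trick unnecessary:

    # Two hypothetical protein sequences we want to test for interaction.
    chain_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    chain_b = "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS"

    # A long, flexible glycine-serine spacer so the halves can move freely.
    linker = "GGGGS" * 6

    # Fold this as a single chain; if the two segments pack against each
    # other with high confidence, that's a predicted interaction.
    fused = chain_a + linker + chain_b
    print(f"single-chain input, {len(fused)} residues")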

When you train a really, really powerful system on this data,

22:00

it will have additional, in some sense emergent, skills, as long as they're aligned. People started to find all sorts of problems that AlphaFold would work on that we hadn't anticipated.

It was so interesting to see the field of

22:16

science, in real time, reacting to the existence of these tools, finding their limitations, finding their possibilities. And this continues: people do all sorts of exciting work, be it in protein design, be it in other areas, on top of either

22:31

the ideas or, often, the systems we have built. One application that I thought was really important is that people have started to learn how to use it to engineer big proteins, or to use it as one part of a larger effort. And I want to tell this story

22:48

for two reasons. One is that I think it's a really cool application, but the second is how it really changes the work of science. Often people will say science is all about experiments and validation:

"So it's great that you have all these AlphaFold predictions. Now

23:03

all we have to do is solve all the proteins the classic way, so we can tell whether your predictions are right or wrong." And they're right about one thing.

Science is about experiments. Science is about doing these experiments.

23:19

But they're wrong about another thing. Science is about making hypotheses and testing them, not about the structure of a particular protein.

In this case, the question was: they took this protein on the left, called the contractile

23:34

injection system, but that's a mouthful; they like to call it the molecular syringe.

And what it does is attach to a cell and inject a protein into it. And the scientists at the Zhang Lab at MIT were asking:

23:49

well, can we use this protein to do targeted drug delivery? Can we use it to get gene editors like Cas9 into the cell?

They tried over a hundred methods to figure out how to take this protein, which they didn't have a structure of. This is just kind of a

24:05

rendition after the fact. They asked: how can we change what it recognizes? I think it's originally involved in plant defense or something like that, and they didn't know how to do it.

And they ran an AlphaFold prediction. You can see the one on the left.

I wouldn't even say it's a great AlphaFold prediction, but almost immediately they looked at that

24:21

and said, "Wait a minute. those legs at the bottom are how it must recognize and attach to cells.

Why don't we just replace those with a designed protein?" And so almost immediately, as soon as they got the AlphaFold prediction, they re-engineered it to add this designed protein

24:36

that you see in red, to target a new type of cell. And they took this system and then showed, in fact, that they can choose cells within a mouse and inject proteins, in this case fluorescent

24:52

proteins, so there you'll see the color, and they can target the cells they want within a mouse brain.

And so they are using this to develop a new type of system for targeted drug delivery. And we see many more examples.

We see some in which scientists are using this tool to try

25:10

thousands and thousands of interactions to figure out which ones are likely to be real. In fact, they discovered a new component of how egg and sperm come together in fertilization.

Many, many of these discoveries are built on top of this. And I like to think that our

25:26

work made the whole field of what's called structural biology, biology that deals with structures, you know, five or 10% faster. But the amount to which that matters for the world is enormous and we will have more of these discoveries.

And

25:43

I think ultimately structure prediction, and AI for science more broadly, should be thought of as an incredible capability, an amplifier for the work of experimentalists: we start from these scattered observations, these natural data. This is our equivalent of

25:58

all the words on the internet. And then we train a general model that understands the rules underneath it and can fill in the rest of the picture.

And I think that we will continue to see this pattern, and it will get more general: we will find the right foundational data sources in order to do

26:15

this. And I think the other thing that has really been a pattern is that you start where you have data, but then you find what problems it can be applied to.

And so we find enormous advances, enormous capability to understand

26:30

interactions in the cell, and other things downstream of extracting the scientific content of these predictions, and the rules these models learn can be adapted to new purposes. And I think this is really where we see the foundation model aspect of AlphaFold

26:47

or other narrow systems. And in fact, I think we will start to see this with more general systems, be they LLMs or others: we will find more and more scientific knowledge within them, and we'll use them for important purposes.

And I think this is really

27:03

where this is going. And I think the most exciting question in AI for science is how general it will be.

Will we find a couple of narrow places where we have transformative impact, or will we have very, very broad systems? And I expect it will ultimately be the latter, as we

27:19

figure it out. Thank you.