Chelsea Finn: Building Robots That Can Do Anything


Category: Robotics and AI Development

Tags: Data Collection, Foundation Models, General Purpose Robotics, Machine Learning, Robotics Challenges

Entities: Chelsea, Michael, Physical Intelligence, Pi Zero (π0), PaliGemma, Siraj, Y Combinator


Summary

    Introduction to General Purpose Robotics
    • Chelsea introduces the challenge of developing general-purpose robots that can perform any task in any environment.
    • She co-founded a company called Physical Intelligence to address this problem.
    • The goal is to create a foundation model for robotics, similar to language models, to enable robots to perform diverse tasks.
    Challenges in Robotics Data Collection
    • Scale is necessary but not sufficient for developing models; diversity in data is crucial.
    • Various data sources like industrial automation, YouTube, and simulation have limitations.
    • Physical Intelligence collects real-world robot data to train models for diverse tasks.
    Developing a Laundry Folding Robot
    • The team developed a robot that can fold laundry, starting with simple tasks and gradually increasing complexity.
    • Pre-training on all data and fine-tuning on curated high-quality data improved the robot's performance.
    • The robot can now fold different types of clothing with some degree of generalization.
    Expanding to Other Tasks
    • The same model was applied to other tasks like cleaning tables and making coffee.
    • Foundation models allow for building on existing data rather than starting from scratch for each task.
    Robots in New Environments
    • Collecting diverse data allows robots to perform tasks in environments they have never encountered.
    • The model was tested in new locations, such as Airbnbs, and succeeded in performing tasks.
    Responding to Open-Ended Prompts
    • The team developed a hierarchical vision-language-action model to handle open-ended prompts.
    • Synthetic data from language models augmented robot data to improve performance.
    • The model can follow complex prompts and respond to interjections.
    Future Directions and Challenges
    • Foundation models for robotics can potentially replace specialized robots.
    • Large-scale real-world data is necessary but not sufficient for developing physical intelligence.
    • More research and open-source contributions are needed to advance the field.

    Transcript

    00:00

    Hi everyone. I'm really excited to talk about developing general-purpose robots and how we might truly develop and bring intelligence into the physical world.

    So, um, to start off, I'd like to talk about this problem, which is that if you want to

    00:16

    truly solve a robotics application, you essentially need to build an entire company around that application. You need to build a different company for logistics, for wet-lab automation, for robots in kitchens, for surgical robots, and so on.

    And this is really really

    00:34

    hard to do because that company needs to make new hardware, develop custom software, design unique movement primitives for that application, handle edge cases, and so on. And you have to do all of that from scratch if you want to solve a robotics problem.

    And as a result, uh, a lot of robotics companies

    00:50

    haven't been very successful in actually bringing robots into the physical world, into our daily lives. I co-founded a company called Physical Intelligence that's trying to solve this problem.

    And in particular, we're trying to develop a general purpose model that can enable any robot to do

    01:07

    any task in any environment. And we think that this sort of generalist model may work better and be easier to use than purpose-built models, just like we've seen in the development of foundation models for language and other

    01:24

    applications. For example, if you want to build a coding assistant, nowadays you don't develop something specifically for coding; you build on models that were trained on large amounts of data, not just on code.

    And essentially this is the problem of trying to develop these

    01:40

    sorts of foundation models and bring this sort of intelligence into the physical world rather than the digital world where they largely are today. So how do we do this?

    Uh in this talk I'd like to talk about how we go about doing this. And if we were to take a lesson

    01:56

    from language models we know that language models have taught us the importance of scale. And so one possible conclusion would be that perhaps scale is the most important ingredient for developing these models.

    And if you were to say this conclusion is true, then

    02:14

    you might look to certain data sources for large-scale data. For example, we might look at data from industrial automation, where you get tons and tons of data of robots doing tasks over and over again like this.

    But this sort of data isn't going to allow robots to go

    02:30

    into disaster zones or to make a sandwich uh or to bag groceries. And so this massive scale doesn't have the diversity of behaviors that we need in order to solve this general problem.

    Alternatively, maybe we look at data from YouTube, which is also a massive

    02:46

    data source, with many videos of humans doing tasks that could be useful for training robots. But at the same time, we don't learn how to write by watching other people write, and we don't become expert tennis players by watching Wimbledon.

    Even though there's a massive scale of data here, it's very challenging to use, and there's also a

    03:02

    gap between the embodiment of robots and humans. And lastly, we might look at data from simulation. You can also get a massive scale of data here, but this data lacks realism and has a gap from reality.

    And so I think the lesson here is that scale is necessary

    03:19

    for developing these models that can generalize in open-world conditions, but scale alone won't solve the problem. You need scale, but it's not sufficient for the entire problem.

    So at Physical Intelligence, here is an example of a data episode that we've

    03:34

    collected. This is in honor of our first anniversary, which was a few months ago.

    Here you can see a teleoperator, a person who's operating some leader arms to control the robot to light a match and light a candle with the match. With this

    03:51

    sort of data, we can train robots to do a variety of different tasks. So what I'd like to talk about is some of our recent results at trying to develop physical intelligence with large-scale real robot data. I should mention this is large scale by today's

    04:07

    robot standards and arguably a minuscule amount of data compared to the sorts of robot data that we should have in the years to come. In particular, we'll be looking at whether robots can do a variety of dexterous long-horizon tasks, whether robots can succeed in places they've never been, and whether

    04:22

    robots can respond to open-ended prompts and interjections. And even if you're not excited about robotics, I think the lessons we've learned from trying to address these problems are applicable outside of the physical world.

    So, can we develop robots that can complete dexterous long-horizon tasks? In

    04:40

    particular, in this first part I'd like to talk about how we trained a π0 (pi-zero) foundation model to do this task, which is to unload a dryer and fold laundry. And to date, I think this is the most impressive thing that I've seen a robot do in the physical world.

    04:56

    It's really hard. [Applause] This is an incredibly difficult problem.

    You can see that it's not perfect. Here it's making some misgrasps, making some mistakes, but it's really, really hard because you have to deal with the

    05:12

    variability in the clothes and the way in which they might be positioned and crumpled, and be able to handle all those sorts of things. And as you're doing this task, which takes about 10 minutes for the robot, there are many opportunities to fail, even catastrophically.

    For example, dropping

    05:28

    things on the ground, which is hard to recover from. And you have to be able to recover from even small mistakes.

    I was personally working quite a bit on this laundry folding robot along with Michael and Siraj, and of course with support and contributions from the whole

    05:44

    Physical Intelligence team. So how do you even approach this sort of problem?

    This is a really, really hard thing for a robot to do, and what we did is we started simple. We started with: can a robot fold a single-size, single-brand shirt, and can a robot dynamically

    06:01

    flatten one shirt, again single brand, single size? If you start simple, this makes the problem quite a bit easier. We collected some data with teleoperation and trained a policy with imitation learning. Our model had around 100 million parameters, mapping from images from the robot's

    06:16

    cameras to target joint positions on the robot arms, and we do this sort of control at 50 Hz on the robot. We founded the company in mid-March of 2024, and a couple months later, after we had set everything up, we were able to get a

    06:33

    policy that could fairly reliably fold a single-size, single-brand shirt. You can see that I'm testing the policy right here.
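The early recipe described here is plain behavior cloning: a policy maps camera images to target joint positions at 50 Hz, trained to match teleoperated demonstrations. A minimal sketch of that mapping follows; all sizes are made-up toy values (the real policy had around 100 million parameters), and the random "images" stand in for actual camera features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, NOT the real system's: this sketch only illustrates the
# mapping from image features to joint targets, one 50 Hz control step.
IMG_FEATS, HIDDEN, N_JOINTS = 64, 32, 14

W1 = rng.normal(0.0, 0.1, (IMG_FEATS, HIDDEN))
W2 = rng.normal(0.0, 0.1, (HIDDEN, N_JOINTS))

def policy(image_features: np.ndarray) -> np.ndarray:
    """Map image features to target joint positions."""
    return np.tanh(image_features @ W1) @ W2

def bc_loss(image_batch: np.ndarray, expert_joints: np.ndarray) -> float:
    """Behavior cloning: mean squared error against teleoperated targets."""
    return float(np.mean((policy(image_batch) - expert_joints) ** 2))

# One fake teleoperation batch: 8 frames with expert joint targets.
images = rng.normal(size=(8, IMG_FEATS))
expert = rng.normal(size=(8, N_JOINTS))
print(bc_loss(images, expert))
```

In a real system the two weight matrices would be a large vision network and the loss would be minimized by gradient descent over many demonstrations; the structure of the supervision is the same.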

    Uh, and we also wanted to test some dynamic motions because you need to be able to match the control frequency accurately in order to do these sorts of dynamic motions. Um and

    06:48

    so these were some of our very initial tests at uh addressing this sort of laundry folding problem. Then from there we wanted to make the problem incrementally harder.

    Uh and so we instead of starting from the shirt flat on the table, we started in a crumpled position like these. And it turns out

    07:06

    that this actually makes it a lot harder. Uh and so here are some videos of some of our initial attempts at trying to train the robot to fold these shirts.

    And the robot struggles. The robot does some things that kind of look somewhat sensible, but generally

    07:21

    isn't able to make progress on the task. Uh with many tests we frequently were getting 0% success rate in our tests of this uh system and really struggling to make progress.

    So really this introduces the challenge of handling the sorts of variability in the ways in

    07:36

    which shirts might be crumpled on the table. We had some initial signs of life in late June of last year.

    Uh and so in this case, the robot was able to kind of make progress on flattening the shirt. Uh it's also then able to fold the shirt uh decently well uh from that

    07:53

    initial state. Still not perfect.

    And as you can see, it takes quite a while to do this. This is a video that was sped up 8x.

    So not something that you might have the patience for a robot to do. With some initial signs of life, but also a very low success rate, we

    08:10

    started to transition to a slightly harder version of the task where the laundry starts in a laundry basket. We also introduced variable size shirts and shorts into the mix.

    Uh, and again, the robot really struggled. So in many of our tests, we were getting 0% success

    08:25

    rate across the board, and we were really struggling to get the robots to learn how to do these tasks. At this point, we were considering a lot of different things.

    uh we thought that maybe the robot needs memory, needs history in some way. Uh maybe we need to just train our models for longer.

    Maybe

    08:40

    we should be doing control in end-effector space rather than in joint space of the robot. Our encoders, we knew, had calibration issues, and maybe we needed that calibration to be more consistent. Maybe we need to condition the model on more information about the data. Maybe we need hierarchy, because this is a pretty long-horizon task and it needs to break it

    Uh maybe we need to condition the model on more information about the data. Uh maybe we need hierarchy because this is a pretty long horizon task and it needs to break it

    08:57

    down into different subtasks. Maybe we need higher resolution images.

    uh maybe we need to introduce kind of interventions in data collection. A lot of these things we also tried.

    We had around two to three months of failure where nothing was really working at addressing this task. But then at some point we actually had a bit of a

    09:13

    breakthrough: we found one thing that really seemed to make a difference in the robot's ability to do the task. And this was to take some inspiration from the world of language modeling: instead of just training a policy on all of our data, we pre-train on all the data and

    09:30

    then fine-tune on a curated, consistent, high-quality set of demonstration data. When we did this, we found that the robot was able to make progress and fold articles of clothing a lot more reliably.
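The two-stage recipe just described, pre-train on everything, then fine-tune on a small curated set, can be sketched as follows. The `train` function is a toy stand-in for an imitation-learning loop, and both datasets are hypothetical; only the two-stage structure reflects the talk.

```python
import numpy as np

def train(weights: np.ndarray, dataset: np.ndarray,
          lr: float = 0.1, steps: int = 200) -> np.ndarray:
    """Stand-in for an imitation-learning loop: nudge the policy's output
    toward the mean demonstrated action in the dataset."""
    for _ in range(steps):
        weights = weights + lr * (dataset.mean(axis=0) - weights)
    return weights

rng = np.random.default_rng(0)
# "All data": large, noisy, inconsistent demonstrations (hypothetical).
all_data = rng.normal(0.0, 1.0, size=(5000, 4))
# Curated set: small, consistent, high-quality demonstrations (hypothetical).
curated = rng.normal(0.5, 0.05, size=(100, 4))

w = np.zeros(4)
w = train(w, all_data)   # stage 1: pre-train on all the data
w = train(w, curated)    # stage 2: fine-tune on the curated subset
```

The point of the structure is that stage 1 gives broad coverage while stage 2 pulls final behavior toward the consistent, high-quality demonstrations.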

    Uh, and so I think that this video was the

    09:45

    first video where the robot was able to fold five items in a row and stack them. I went home very excited that day.

    Uh, this was in September of 2024, so multiple months after our initial tests. Uh, now this is far from perfect.

    Uh, it

    10:02

    takes 20 minutes to fold five items of clothes. At the same time, though, it suggested that this sort of recipe was able to unlock the capability in the robot to actually fold these articles of clothing. So you can see these sorts of failures here. In

    So you can see these sorts of failures here. In

    10:18

    this case, it attempted to fold the the blue shirt around seven times uh before eventually actually figuring out how to do that. Um there's also other failure modes as well.

    So, here's an example where the robot pushes the stack to the corner of the table uh and decides to kind of fiddle with it a bit uh and then eventually uh slides it off the table

    10:35

    and then it proceeds as if nothing had happened and it's going to continue to fold. We continue to iterate on this recipe.

    We iterated on our strategy for curating a higher-quality set of demonstration data, and we got it from 20 minutes down to 12 minutes for these five items.

    This is

    10:50

    how we were evaluating how good our robot system was. It still makes mistakes.

    The fold quality still varies, but it's significantly better than our previous curation recipe. Now, at this point, we were still

    11:06

    pre-training and fine-tuning only on laundry data, and we weren't leveraging pre-trained models in the community. And there were some folks at Physical Intelligence who were working on developing a pre-trained model trained on all of the robot data.

    And um we then started to try to introduce

    11:22

    these models into our recipe. So we took an open-source vision-language model, a three-billion-parameter model called PaliGemma.

    The previous videos were all with models of around 100 to 300 million parameters that we were iterating on. This model takes as input images

    11:39

    from the robot and a language command, and then has a diffusion head that attends to all the internal values of the vision-language model and, together with the joint angles, predicts a chunk of 50 actions into the future.

    So about 1 second of

    11:55

    action steps, and we're using flow matching, a variant of diffusion, to output these continuous actions. We took this model, and instead of pre-training only on laundry, we pre-trained on all of the robot data

    12:11

    that we had collected. And then we just fine-tuned it with the same exact post-training recipe that we had developed without using the vision-language models.
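The architecture just described, a VLM backbone whose features condition a flow-matching head that outputs a continuous chunk of 50 future actions, might be sketched like this. Both components here are toy stand-ins (a deterministic embedding instead of PaliGemma, a hand-written velocity field instead of a learned one), and the 14-dimensional action space is an assumption; only the chunked, iterative-integration sampling structure mirrors the talk.

```python
import numpy as np

CHUNK, ACT_DIM = 50, 14          # ~1 s of actions at 50 Hz; 14 dims assumed

def vlm_features(image: np.ndarray, command: str) -> np.ndarray:
    """Stand-in for the VLM backbone: a deterministic embedding of inputs."""
    return np.full(32, np.tanh(image.mean() + 0.01 * len(command)))

def velocity(actions: np.ndarray, t: float, ctx: np.ndarray) -> np.ndarray:
    """Toy velocity field: flows noise toward a context-dependent target.
    The real head is trained with a flow-matching objective instead."""
    return ctx.mean() - actions

def sample_chunk(image: np.ndarray, command: str, steps: int = 10) -> np.ndarray:
    """Integrate the flow from Gaussian noise to a 50-step action chunk."""
    rng = np.random.default_rng(0)
    ctx = vlm_features(image, command)
    a = rng.normal(size=(CHUNK, ACT_DIM))
    for i in range(steps):       # Euler integration of the flow
        a = a + (1.0 / steps) * velocity(a, i / steps, ctx)
    return a

chunk = sample_chunk(np.zeros((8, 8)), "fold the shirt")
print(chunk.shape)  # (50, 14)
```

Predicting a whole chunk at once is what makes ~1 second of smooth 50 Hz control possible from a single (slower) forward pass of the big model.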

    Uh and when we did this, we actually saw the robot uh continue to actually get better when we just plugged in that new pre-trained model. Uh and so in the left video, it's

    12:28

    able to do five items in 9 minutes, which was faster than the 12 minutes we had before. In the right videos, we were testing with um some novel clothing items and found that it was also quite efficient at folding multiple items in a row.

    And we also saw, as a result, more consistent fold

    12:44

    quality by using this model that was about 10 times larger and had seen more robot data as input. To look at a few highlights of this, here's a pair of shorts that the robot hasn't seen before.

    And this is kind of a tricky scenario where to flatten it, it actually kind of needs to reach under

    13:00

    the kind of the bottom of the shorts. And it's able to do that.

    It's able to figure out that it should reach under the left part of the shorts in order to eventually flatten it. And then, once it actually successfully flattens it, it's able

    13:15

    to fold it successfully. It also has to do something similar at times to fold shirts.

    So in this case, it needs to fold the shirt over on itself, which arguably puts it in a more crumpled state, but allows it to find the corners of the shirt and then go ahead and fold it.

    13:31

    And then, like I mentioned, it's also able to handle unseen clothing items. Here's an example of a shirt with a V-neck that it's able to fold, even though this shirt was completely held out and the post-training data set

    13:46

    didn't have any V-necks in it. It's also able to fold shirts with buttons. So it has some degree of generalization to different clothing items.

    And then lastly, because this policy is a neural network taking as input the current image,

    14:02

    it's able to handle interruptions. So here, Michael is uh continuing to mess with the robot and the robot uh figures out that it should put the the shirt away uh while it's trying to fold the other shirt.

    In this case, Michael's going to continue messing with the robot. So, Michael unfolds one side

    14:19

    and the robot reacts. Michael goes in again, and the robot makes some mistakes here but is able to recover.

    Michael messes it up again. So those are some results of what the robot's able to do. Now I

    Now I

    14:37

    talked about this pre-training and post-training recipe being really important. We can quantitatively measure that and make sure that this is actually what's leading to the improvement.

    So, we compared this pre-training and post-training recipe to not using any pre-training and only training on the curated data set

    14:52

    versus no post-training, where you're training on all of the data rather than fine-tuning on the curated data set. And we evaluated these models in terms of their progress on the task, where you make partial progress for getting it out of the bin, which is the easiest part, and then further progress for flattening, folding, and stacking the

    15:09

    items. And we see that the pre-training and post-training recipe is able to get far higher performance than omitting pre-training and omitting post-training.

    Uh and notably omitting pre-training and post- training is basically able to get it out of the bin and make very little progress after that. Whereas when we

    15:26

    combine pre-training and curated post-raining, we get far higher performance whereas able to reliably uh flatten and fold objects. Um and then the last thing that I'll mention on this note is that uh nothing in this recipe is specific to laundry.

    And so we took the same recipe um and fine-tuned on

    15:42

    other tasks. So here uh the task is to um kind of clean up a table.

    And the robot's also able to successfully do this task, despite the fact that we were primarily iterating on laundry; it's able to apply this recipe to this task as well. It also is

    15:58

    able to scoop uh coffee beans into a coffee grinder. Uh this task is pretty hard.

    It has to construct the bottom part of a cardboard box, which requires quite a bit of dexterity. And then lastly, autonomously lighting a

    16:15

    candle with a match, again with this same pre-training and post-training recipe. And so this points at the benefit of foundation models that I alluded to before, which is that to do these different tasks, you don't have to start completely from scratch.

    you can

    16:31

    actually leverage pre-training across multiple robots and across multiple tasks. And then we're also able to apply that same recipe to robots at other companies.

    Uh this is a robot that I've actually never seen in person before. Uh they collected data.

    They sent the data

    16:47

    to us. We fine-tuned our model on their data.

    We actually didn't even know exactly how the robot was being controlled, or the exact representation of their actions.

    uh but by fine-tuning the model on this new robot, the model is able to control the robot in order to uh make a cup of

    17:04

    coffee in this case. So, some takeaways for this part: we were able to independently develop post-training and pre-training, decouple the problem, and then eventually get the best of both. We found that training on all the data doesn't work for complex tasks, and this sort of

    17:20

    pre-training plus post-training on curated data leads to far better performance. And then we broke up this really hard problem of folding laundry by gradually starting with folding single shirts and going to more and more complex versions of the task.

    Now there's a number of limitations here and

    17:35

    one limitation I'd like to point out is that these robots were inevitably trained in the environments in which they were tested. And so this means that in principle you could use these methods to collect a lot of data in one environment and then deploy in that same environment.

    But ultimately,

    17:51

    there are going to be things that change about an environment, and scenarios where we would want to apply these robots to environments that they've never seen before. And so, how can robots actually succeed in places that they've never been?

    The lesson we've learned from machine learning in other places is that we should collect diverse

    18:07

    data. Uh, and so we started by collecting data of tidying bedrooms and kitchens in many different environments.

    Uh, and here's an example, kind of a sample of that data. uh and we collected robot data in homes across San Francisco here uh and also collected data in

    18:25

    diverse mock kitchens and mock bedrooms, and in total we had more than 100 unique rooms represented in the data set that ended up being part of a bigger pre-training mixture. So we trained on this diverse mobile manipulation data, including the low-level action prediction as well as predicting high-level

    18:41

    subtask commands for how to complete the task. But we also trained on previously collected static manipulation data that was also fairly diverse.

    Static manipulation data that we had collected in our office and in labs, as well as web data and high-level instructional data.

    18:57

    And I should point out here that the mobile manipulation data of tidying bedrooms and kitchens only accounted for 2.4% of the overall pre-training mix. And so the lesson here is that you were basically able to spin up a new task, and actually an entirely new robot, without redoing all of the data collection.

    The rest

    19:12

    of the mixture didn't have any mobile manipulation data with this particular mobile manipulator in it. We're able to build upon everything that had been done before.
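One way to picture a pre-training mixture like this is as weighted sampling over data sources. Only the 2.4% mobile-manipulation figure comes from the talk; how the remaining 97.6% splits between sources is a made-up placeholder here.

```python
import numpy as np

# Hypothetical pre-training mixture weights. Only the 2.4% figure for
# mobile manipulation is stated in the talk; the rest is assumed.
mixture = {
    "mobile_manipulation": 0.024,
    "static_manipulation": 0.776,
    "web_and_instructional": 0.200,
}

rng = np.random.default_rng(0)
names = list(mixture)
probs = np.array(list(mixture.values()))

# Sample a large "pre-training run" and check the realized proportions.
batch = rng.choice(names, size=100_000, p=probs)
frac = float((batch == "mobile_manipulation").mean())
print(round(frac, 3))  # close to 0.024
```

The point of the ablation that follows is exactly this knob: zeroing out the other 97.6% (training on the 2.4% alone) measurably hurts performance in novel homes.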

    And it's kind of this kind of same story of foundation models being able to make it easier to

    19:28

    spin up um a new problem, a new application without starting from scratch. Um now this wasn't completely easy.

    Um we had a couple challenges. One of the challenges that we ran into is that naively uh this model can ignore language instructions.

    So we had

    19:44

    actually in this case asked it to pick up the cutting board and it chose to pick up the plate instead. Now we're again asking it to pick up the cutting board.

    And instead the robot, with a mind of its own, decided to pick up the plate. And then we tell it to put the plate in the sink.

    And eventually, after moving

    20:00

    away from the cutting board, it decided that it would actually pick up the cutting board. And so in the early development of our model, we found that it often ignored language.

    And to solve this, we thought about how vision language models actually follow language well. And so maybe there's a

    20:16

    way to preserve the inherent abilities of the pre-trained models when addressing this task. And so, what we did: with this π0 architecture, the action head that's using diffusion is randomly initialized.

    And this ends up actually deteriorating the

    20:33

    pre-trained knowledge that's present in the vision language model. Uh and we found that if we can prevent this deterioration, we might be able to get better language following.

    And so the recipe that we came up with was in some ways fairly similar, but instead

    20:48

    we're going to predict tokenized actions. And then when we have the diffusion head, we stop the gradient from the randomly initialized diffusion head to prevent it from deteriorating the language-following abilities of the VLM backbone.
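The stop-gradient idea can be illustrated with a tiny scalar model: a "backbone" feeds two heads, but only the tokenized-action loss sends gradient back into the backbone, while the randomly initialized diffusion/flow head still trains on its own loss. Everything here (the scalar weights, targets, and numbers) is hypothetical; only the gradient-routing pattern reflects the recipe.

```python
# Tiny scalar illustration of the stop-gradient recipe.

def losses_and_grads(w_backbone, w_head, x, token_target, action_target):
    feats = w_backbone * x                 # "backbone" features
    token_pred = feats                     # tokenized-action branch
    action_pred = w_head * feats           # randomly initialized action head

    token_loss = (token_pred - token_target) ** 2
    action_loss = (action_pred - action_target) ** 2

    # Backbone gradient: ONLY the token-loss term. The action-loss gradient
    # is stopped before the backbone, so the fresh head can't deteriorate
    # the pre-trained language-following abilities.
    g_backbone = 2 * (token_pred - token_target) * x
    # The head itself still receives its full gradient and keeps training.
    g_head = 2 * (action_pred - action_target) * feats
    return token_loss + action_loss, g_backbone, g_head

loss, g_b, g_h = losses_and_grads(
    w_backbone=0.5, w_head=0.1, x=1.0, token_target=1.0, action_target=0.3
)
```

In an autograd framework the same effect would come from a stop-gradient / detach on the backbone features before they enter the action head.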

    Uh and we found that this first led to faster training because the tokenized actions

    21:05

    are a more direct supervision signal. And second, it also followed language far better.

    An 80% follow rate rather than a 20% follow rate, which suggests that we're able to preserve the pre-training in the vision-language model backbone. So, we put

    So, we put

    21:20

    those pieces together. We took that recipe and pre-trained it on all of our data, including the mobile manipulation data.

    We fine-tuned it on mobile manipulation data in a variety of environments. And then we tested the model in places it had never been before. So, we rented three Airbnbs

    So, we rented uh three Airbnbs

    21:35

    that uh we had never been to before. Uh we put the robot in those homes, in this case, in the kitchen, and I asked it to close the cabinet.

    I asked it to put away the dishes. It has also never seen these dishes, or these forks, these objects. And the robot's able to

    And the robot's able to

    21:51

    succeed even though it's never been the here before. There's different uh countertops, different furniture, different objects, and so forth.

    Uh lastly, I asked it to clean up the spill, and the robot is able to oblige and wipe down the spill and eventually put the sponge into the sink.

    22:16

    It's also able to do this for bedrooms. So Laura asked it, in this case, to just clean the bedroom, and it puts away articles of clothing.

    It throws away the trash, and then is able to tidy the bed by putting the pillow at the top of the bed and

    22:33

    tidying the blanket or the comforter of the bed.

    22:51

    So, quantitatively, I talked about how only around 2.4% of the mixture is this mobile manipulation data, so how much does that other data actually help? Could we just train on that 2.4%?

    23:06

    And we find that the bars on the right, which exclude data from static robots in labs and other environments, show significantly reduced performance. The performance goes down to less than 60% when you exclude that data, evaluated in novel homes, compared to the full

    23:22

    pre-training mixture, which has more than 20% higher performance. Lastly, we also looked at: is the diversity of data helpful?

    Is it important? And so we increased the amount of data from these environments to test this.

    It's always good, you can do a vibe

    23:38

    eval, but it's really helpful to actually measure how well these things work, and that's what this is measuring. We find that if we increase the number of homes, the number of locations represented in the data, the performance increases, which is great. And it actually gets to the

    23:54

    same level of performance as if we train on data from that target environment. So it means we're mostly closing the generalization gap, and it suggests that the bottlenecks at this point for this sort of task lie not in collecting more diverse data but in

    24:11

    actually getting higher reliability and higher performance. Now, I should also mention that there are failure modes like this; the success rate was around 80%.

    There's lots of room for improvement. Uh here are a couple examples of those failure modes.

    So um here it's told to put the items in the drawer. Uh it is able to put it in the drawer but the

    24:27

    item isn't fully in the drawer at the end and it decides that it's done and kind of moves on to the next thing. Uh here the robot uh needs to put the clothes in the laundry basket.

    It drives over the shirt um and then it gets stuck and it's not able to lift it up. Uh here we asked it to put the dishes in the

    24:43

    sink, and it successfully puts a number of the dishes in the sink, but it struggles to pick up the cutting board in this particular case because it's very thin and flush against the surface of the countertop. And in the last case, probably my

    24:58

    favorite case, it's told to put the spatula into a drawer, and it decides that the oven looks a lot like a drawer, so it opens the oven and tries to put it in there. Beyond this, there are also challenges with regard to speed, partial observability, long-term planning,

    25:15

    and so uh yeah, lots of work to do still. So the takeaway here is that with diverse data, uh, robots can follow a variety of instructions in environments that the robot has never been in before.

    Uh, which is a big step up from a lot of robotic scenarios where they're trained

    25:31

    in the scenarios that they are being tested. Now the last kind of bit I'd like to talk about is this model has a fairly limited instruction set.

    It can only follow kind of a certain set of commands. And if we think about how other forms of AI technology have been

    25:48

    deployed, people really like to customize and actually tell the robot what they want or tell the system what they want from these kinds of models. And so just like we prompt language models, can we allow robots to respond to open-ended prompts and open-ended interjections?

To do this, and actually to do the

    26:04

past work, we're leveraging hierarchical vision-language-action models. A high-level policy breaks down the prompt into intermediate verbal responses and intermediate atomic language commands.

So the high-level prompt might be

    26:22

something like "can you make me a sandwich," and the high-level policy will break it down into the subtask of "pick up one slice of bread." This is passed to a low-level model that actually executes and predicts target joint angles to fulfill the low-level command of picking

    26:38

up one slice of bread. Now, on its own, this isn't going to be able to follow all sorts of prompts, and it's actually fairly tricky to handle open-ended language, because it's challenging to collect a large number of human-robot interactions with the real

    26:53

robot in the loop. This is also fairly hard to scale.
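As a rough sketch, the hierarchy just described looks something like the following. Everything here is my own illustration, not Physical Intelligence's actual system: the toy subtask table stands in for a trained high-level vision-language model, and the low-level policy just emits placeholder joint targets.

```python
def high_level_policy(prompt: str) -> list[str]:
    """Break an open-ended prompt into atomic language commands.
    A real system uses a trained VLM; a toy lookup stands in here."""
    plans = {
        "can you make me a sandwich": [
            "pick up one slice of bread",
            "place the bread on the cutting board",
            "pick up one slice of cheese",
            "place the cheese on the bread",
        ],
    }
    return plans.get(prompt.lower().rstrip("?"), [])

def low_level_policy(command: str, observation: dict) -> list[float]:
    """Map an atomic command plus an observation to target joint angles.
    A real model regresses these from pixels and text; this is a stub."""
    n_joints = 7
    return [0.0] * n_joints

def run(prompt: str, observation: dict) -> list[list[float]]:
    """Execute each subtask the high-level policy produces."""
    return [low_level_policy(s, observation) for s in high_level_policy(prompt)]

steps = run("Can you make me a sandwich?", observation={})
print(len(steps))  # 4: one action vector per subtask
```

The key design point is the interface: the high level only ever emits short atomic language commands, so the low level can be trained once on (command, observation, action) data.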

So what we did is take all of our existing robot data and generate synthetic data on top of it. In particular, we can use language models to relabel and

    27:09

generate hypothetical human prompts for the scenarios the robots are in. What this looks like is: we'll take data that says, here's a video, and the next skill is to pick up a Kit Kat, because that's what the robot does next in terms of basic low-level annotation.

    And

    27:25

then, for the scenario where the robot is about to pick up the Kit Kat, we can ask a vision-language model: what is a hypothetical prompt that a human might have asked that led to this particular scenario and to the robot choosing to pick up a Kit Kat? And then we can train our high-level

    27:40

policy on these synthetic prompts, to augment the robot data with the various human interactions that might have led to those different situations. As a result, we're able to have robots follow a variety of different prompts.
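The relabeling idea can be sketched roughly like this; `query_vlm` is a hypothetical stand-in for a real vision-language model call, and the episode format is illustrative:

```python
def query_vlm(frames: list, next_skill: str) -> str:
    """Hypothetical VLM call: invent a user prompt that could have led
    the robot to perform `next_skill` in this scene. A real system would
    condition on the frames; this is a templated stub."""
    return f"Could you {next_skill} for me?"

def relabel(episodes: list[dict]) -> list[dict]:
    """Turn logged (video, next_skill) annotations into (prompt, subtask)
    training pairs for the high-level policy."""
    pairs = []
    for ep in episodes:
        prompt = query_vlm(ep["frames"], ep["next_skill"])
        pairs.append({"prompt": prompt, "subtask": ep["next_skill"]})
    return pairs

data = relabel([{"frames": [], "next_skill": "pick up the Kit Kat"}])
print(data[0]["prompt"])  # Could you pick up the Kit Kat for me?
```

The point is that no new robot data is collected: the human side of the interaction is synthesized after the fact from logs the robot already produced.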

    So on the left, we ask, "Hi, robot. Can you make me a ham and cheese sandwich?"

    27:58

The robot says, "Sure, I'll start with the bread and add ham and cheese next." And it's able to break down this task into the various subtasks of picking up a slice of bread, putting it on the cutting board, picking up a slice of cheese, putting it on the bread, picking up some ham, and so on and so forth. It

    28:14

can also follow more complicated prompts like, "Hi robot, can you make me a vegan sandwich? I don't like pickles, though." In this case it's able to break the task down and decide to add lettuce and tomatoes to the sandwich, and not add pickles, cheese, or meat.

    28:31

In addition to prompts, we're also able to train the robot to handle different interjections. Actually, here's a case with a different kind of prompt first.

So on the left, we train the robot to clean tables: put trash away and put dishes into the bin.

And on the right, we ask the robot to clean up only the

    28:47

trash, but not the dishes. The robot is able to understand what that means, connect it to its low-level actions, put away only the trash, and finish when all the trash is put away.

And lastly, it's able to handle interjections and situated corrections. In this case, the

    29:04

robot is getting items for a user. The user interjects and says, "Get me something sweet that's not in the basket," right after it had put a Kit Kat into the basket, and the robot says, "Sure.

Let me get you some Skittles." It reasons through how

    29:20

to fulfill the user's request and is able to respond to those kinds of corrections, situated in the world the robot is in. Now, you might also wonder whether some existing foundation models could serve as a high-level planner for robots and do this sort of high-level reasoning without

    29:35

us training a separate model. So we evaluated that too, and we found that, in blue, the performance at following instructions and making progress on the task was substantially lower than the performance of our system, shown in green.

In general, we found that these frontier

    29:51

models struggle with visual understanding as it pertains to robotics, which makes sense: these models generally aren't targeting physical applications and have very little data from the physical world. Okay.

So, to start to wrap up, and then we'll have

    30:07

some time for questions. I talked a bit about how robots can do a variety of dexterous, long-horizon tasks with pre-training and post-training,

how robots can succeed in places they've never been, and how they can respond to open-ended prompts and interjections by leveraging synthetic data from language models on top of the

    30:24

robot data that we had collected. Now, some closing notes. We've seen a few different scenarios in this talk where general-purpose robots might be more successful than specialist robots, because rather than starting from scratch for every single application, we can build upon a much

    30:41

broader foundation for physical intelligence in the real world. We also saw that large-scale data in the real world is really helpful for developing these systems; I think it's necessary but not sufficient for physical intelligence, and there are a lot of

    30:57

challenges, and more research needs to be done, by us and through open-source contributions, before robots will be truly ready to tackle the open world. I'd also like to mention that at Physical Intelligence we're hiring for a number of roles.

If you're excited about some of the things we

    31:13

talked about, you can see a list of the open roles on the Physical Intelligence website. Awesome.

    Happy to take some questions. Let's

    31:28

start on the left. Hi Chelsea.

So, first I want to say thank you for all your work on robot learning. It's all really impressive.

Yeah. So, mainly I have two questions, especially regarding the post-training part you mentioned.

The first is: you

    31:45

mentioned that in post-training the most important part is to have high-quality action data. I'm wondering what the components of that would be. And the second question is: what role do you think RL will play in post-training?

Yeah, absolutely. So I think the

    32:04

different components of it largely come down to consistency of the data and the strategy being followed, and whether the data completes the task efficiently and with a reliable strategy. Then, on the second question, I think reinforcement learning can play a very

    32:19

large role in post-training. Online data from the robots, which reinforcement learning allows you to use, can give robots a much higher success rate and also make them faster than if they're trained with imitation learning alone.

    32:35

    Yeah, thank you. Hi, thank you so much for your talk.

Your work is really fascinating, and there is no doubt that it will have a lot of impact in the future. But can I ask, at this stage, how can you

    32:50

find funding? Because honestly, I can't imagine how hard it must be to convince people to invest in a robot that folds clothes and deals with the dishes. Yeah.

So, it's a good question. I guess

    33:06

first I'll mention that we aren't just focused on applications in the home. We really want to solve the broader problem of physical intelligence, and we've been starting with those applications because they're ones where it's relatively easy to make progress.

But we've also been doing tasks like inserting an Ethernet cable, which I

    33:22

showed in the talk, as well as constructing a cardboard box. Generally, I think this sort of problem has a ton of potential for impact in all sorts of realms, not just domestic tasks.

And even in domestic tasks, I think there's a huge market for

    33:38

this kind of technology. We ourselves haven't had a lot of trouble with fundraising, and I think a lot of robotics companies recently have also done a great job; there's actually a lot of excitement around this sort of technology because things are starting to work.

I started

    33:54

working on this technology more than 10 years ago, and things really weren't working then. So I think there's a lot of excitement now that the field is starting to mature and actually be ready for the real world. There's a lot more work to do, but generally it

    34:10

seems like a lot of people are excited about this technology and eager to actually put funds behind it. Okay, thank you so much.

    Yeah. Hi.

Thank you so much. I have two questions, one more broad and one more technical.

So, the technical one:

    34:25

VLAs, at least to my understanding, are a framework that is a bit separate from world modeling, and I wonder how the two of them will interplay and

    34:40

whether you actually plan to somehow use them together. As I see it right now, VLAs are more like policies that could actually benefit a lot from world modeling. And from a business perspective, I wonder which

    34:56

kinds of infrastructure layers would be the most useful to work on, such as explainability, traceability, or safety in general, to deploy such models in the real world.

    35:11

Yeah, great question. On the first point, there are actually fairly natural ways to incorporate world-model objectives into vision-language-action models. We've done some work where, instead of only predicting the next action, you predict some intermediate subgoal image, like what should happen

    35:28

in the future in order to accomplish the task, and then predict an action from there. We've seen some signs of life suggesting that's quite promising, so I think there are ways to merge the two paradigms.
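As a toy numeric illustration of that combined objective, assuming nothing about the actual model: the "subgoal image" is just a list of floats, the networks are scalar-parameter maps, and the two prediction errors are summed into one training loss.

```python
def predict_subgoal(obs: list[float], w: float) -> list[float]:
    """Toy 'subgoal image' prediction: scale the observation by w."""
    return [w * x for x in obs]

def predict_action(obs: list[float], subgoal: list[float], v: float) -> float:
    """Toy action head conditioned on the predicted subgoal."""
    return v * (sum(subgoal) - sum(obs))

def joint_loss(obs, target_subgoal, target_action, w, v):
    """Squared error on the subgoal plus squared error on the action:
    the world-model term and the policy term are trained jointly."""
    g = predict_subgoal(obs, w)
    a = predict_action(obs, g, v)
    subgoal_err = sum((gi - ti) ** 2 for gi, ti in zip(g, target_subgoal))
    action_err = (a - target_action) ** 2
    return subgoal_err + action_err

# With parameters that reproduce the targets exactly, the loss is zero.
print(joint_loss([1.0, 2.0], [2.0, 4.0], 1.5, w=2.0, v=0.5))  # 0.0
```

The structure, not the arithmetic, is the point: the action prediction consumes the predicted subgoal, so improving the world-model term directly feeds the policy term.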

At the same time, I think there are a lot of challenges that come up with world modeling, with regard to the ways in

    35:44

which the data you put into it isn't necessarily reflective of the ways you're going to use it. You might train it on demonstration data of successfully completing the task, and then try to use it to evaluate actions that don't optimally complete the task.

    And then the world

    36:01

model will hallucinate a video of completing the task successfully, even if the actions you provide as input weren't actually going to lead to a good outcome. So there are challenges there to overcome, but there are also

    36:17

ways to integrate it into the VLA paradigm. Could you remind me of your second question? What infrastructure layers would you want the field to work on, in the shortest term, to bring the most

    36:33

improvements, let's say, to actually run these models on robots? Yeah, we have a real-time system that needs to hit a certain frequency to execute actions successfully.

And if you have lag in that system, it introduces all sorts of challenges. So thinking about fast

    36:49

inference, and about infrastructure that's actually going to run on the robot, is a big part of what our software team does. Then there's also large-scale machine learning infrastructure: training large models, ingesting large amounts of data.
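A minimal sketch of such a fixed-frequency loop; the timing values and the resync strategy here are illustrative, not the actual on-robot stack:

```python
import time

def control_loop(policy, hz: float = 50.0, steps: int = 5) -> list:
    """Call `policy` once per tick at the target frequency. If inference
    runs long, resynchronize instead of letting the schedule drift."""
    period = 1.0 / hz
    next_tick = time.monotonic()
    actions = []
    for _ in range(steps):
        actions.append(policy())           # inference must fit in `period`
        next_tick += period
        delay = next_tick - time.monotonic()
        if delay > 0:
            time.sleep(delay)              # on schedule: wait for the tick
        else:
            next_tick = time.monotonic()   # lagged: resync to avoid drift
    return actions

# Toy policy emitting 7 joint targets per tick.
acts = control_loop(lambda: [0.0] * 7, hz=200.0, steps=3)
print(len(acts))  # 3
```

This is why inference latency matters so much: if the policy call overruns the period, the robot either skips ticks or acts on stale observations.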

The data that we have is different from a lot of typical datasets because

    37:05

it's very multimodal in nature: videos, actions, language segments, and various other components as well.

So yeah, there are some interesting infrastructure problems, I think, both on the robot side and on

    37:21

the model-training side. Thank you so much.

Yep. Hi, I'm Frederick, and I've got a question about model sizes in general.

I think what we're seeing right now is that, in general, larger model sizes lead to better accuracy, for example

    37:36

in your experiments, and it's also what OpenAI, Anthropic, and others are doing right now with their LLMs. However, there's also the approach of using a quite small model and outsourcing the world knowledge into a database of some sort that the model can interact with.

What is your take

    37:53

on that? Do you think that's a valid approach, or do you think encapsulating all the world knowledge inside the model works better?

Yeah, it's an interesting question. In my experience working on retrieval-based systems, it

    38:08

is a little bit tricky, first, to figure out what should be offloaded versus done by the model, and second, the model will sometimes ignore the retrieved content and try to generate something itself. It actually seems to be quite tricky to get that to work

    38:24

exactly the way you want. I think whether that makes sense will probably depend on the application and the use case, but in my experience it ends up being quite tricky to figure out what the division of labor is.

And even the model part will need to have some degree

    38:41

of intelligence in order to actually make use of the retrieved information. So I think it's a really fascinating research problem,

but it also needs a lot of research to make it work successfully. Thank you.

    38:57

    Yeah. Hi, Chelsea.

My name is Charu Thomas. First off, I really appreciate the talk.

It was really fascinating, and I've been a big fan of your work since meta-learning. When you think about how software and hardware are going to continue to evolve, what are the

    39:13

biggest opportunities for builders today, given your vision of physical intelligence? I think there are lots of opportunities to make things work a lot better, and a lot of open questions.

Like I was mentioning before, thinking about

    39:30

better infrastructure on the robot side. There's some open-source code for that sort of thing, but there are a lot of opportunities to make robot infrastructure better,

and not a lot

    39:45

of people, I think, are working on that aspect of the problem. There are also lots of opportunities elsewhere. One of the things I love about AI and computer science as a whole is that there's a really big open-source community, and I think there's a ton of opportunity to actually do open-source work and

    40:00

contribute to a broader community that's trying to collect data, open-source models, fix bugs in those models, fine-tune those models, and figure out new recipes for fine-tuning them. So yeah, all sorts of questions on the research side too, especially in the open-source realm. Thank you.

    40:17

Hi, Chelsea. I also, just like everyone else, am a big fan of all your work.

So, thank you for putting that all out. I've been reading through a lot of your group's work recently and particularly enjoyed reading Siraj's PhD thesis.

    It taught me a lot

    40:32

about scaling real-world robotics with data. And a question I have is: how do you think synthetic data will scale for robotics in the future?

As we've seen with LLMs, we've moved, not away from pre-training, but away from

    40:49

human-collected data toward creating synthetic data, with a lot of filtering and a lot of self-grading. So how do you think using generative synthetic data for creating environments or reward models will impact robotics?

    Yeah, I have many thoughts on this

    41:04

topic. I think that at the end of the day there's going to be no replacement for real data, so large amounts of real robot data are going to be a necessary component of any system that's going to work in a generalizable way.

So we're going to need that. At the same time, I do

    41:19

think that tools like simulation and synthetic data can potentially play a role, especially on the evaluation side. For example, as you generalize to many environments, it's very tricky to evaluate how well the model generalizes, not just in one new environment but in 10 new environments, because then you actually

    41:35

need to bring the robot to those 10 environments or construct 10 environments, whereas in simulation that gets a lot easier.

So I'm really excited about simulation and synthetic data for that use case. I should also mention that I think the analog of synthetic data in language models is actually not

    41:52

necessarily simulation in robotics, but closer to something like reinforcement learning. A lot of synthetic data is generated by the model that's actually trying to do the task, reasoning through different ways of doing it.

And I think the analogy there is a robot that attempts the

    42:07

task and learns and gets better from its own attempts. That sort of online data from the model, I think, will also play a really critical role in post-training, and it's something we're working on quite a bit.

So yeah, I think that's really important and really helpful. Thank you.

    42:22

    Cool. I think we have time for one more question.

    Sorry we won't be able to get to everyone. Yeah.

Hi. It's super cool to see you, as an MIT EECS alumna, now working in a really cool robotics company and talking to us about robotics and entrepreneurship.

I've been wondering how robotics research that involves hardware

    42:38

components plays out differently in academia versus industry. Are there typically more resources, fewer constraints, or broader applications in one setting over the other? And what kinds of people or goals do you think are better suited to each path?

Yeah, it's an interesting question.

    42:54

I still love startup environments, academic environments, and industry environments. I think they all have various pros and cons.

Certainly, I think academic environments generally aren't as well resourced in terms of data-collection throughput, eval throughput,

    43:10

and compute as startups and industry labs. But at the same time, I think there are a lot of problems you can solve without large amounts of resources, problems we need to figure out on the algorithm side.

So I think there's a lot of really interesting work to be done

    43:26

there. And then in industry and in startups, actually trying to do research on these big models, scaling up data, and seeing what happens at large scales is really great to do.

Yeah, I think there's a place for both. I also

    43:42

think the gap isn't as large as people often make it seem. Oftentimes, people in industry environments still wish they had more compute.

You kind of always wish you had more resources. And sometimes when you have a lot of resources, you don't actually think as carefully and as critically about what

    43:59

runs you're going to be doing, and you end up being more wasteful of compute than if you were more compute-constrained. So, in my experience, there are actually downsides to having more resources as well.

I'm really sorry. Can I just ask one quick question on architecture?

    I know

    44:14

that scaling laws have worked well for transformer-based architectures, and I was wondering whether you currently see limits in VLM-based architectures, which are made for text tokens, since they don't have

    44:30

    modules for physical awareness. Yeah.

    And how do you deal with that? Yeah.

So, we tokenize the actions, and I'd encourage you to take a look at the FAST tokenizer paper that we put out as one way to accomplish that. And yeah, we should wrap up there.
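As a hedged sketch of the general idea of action tokenization, here is the simplest version: uniform binning of each action dimension into discrete tokens a text-style transformer can predict. The FAST tokenizer itself uses a more sophisticated scheme, so see the paper for the real method.

```python
def tokenize(action: list[float], low=-1.0, high=1.0, bins=256) -> list[int]:
    """Map each continuous action dimension to an integer token in [0, bins-1]."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)                      # clamp to the range
        tokens.append(int((a - low) / (high - low) * (bins - 1) + 0.5))
    return tokens

def detokenize(tokens: list[int], low=-1.0, high=1.0, bins=256) -> list[float]:
    """Invert the binning back to (approximate) continuous actions."""
    return [low + t / (bins - 1) * (high - low) for t in tokens]

toks = tokenize([0.0, -1.0, 1.0])
print(toks)               # [128, 0, 255]
print(detokenize(toks))   # approximately [0.0, -1.0, 1.0]
```

Once actions are tokens, the VLM's existing next-token prediction machinery applies unchanged; the cost is the quantization error introduced by the binning.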

Thanks everyone, and

    44:47

    yeah, hope you enjoy the event.