Behind the scenes of Google's state-of-the-art "nano-banana" image model


Category: AI Technology

Tags: AI, Editing, Generation, Image Model

Entities: Gemini, Google DeepMind, Kaushik, Logan Kilpatrick, Mostafa, Nicole, Robert


Summary

    Introduction
    • Logan Kilpatrick introduces team members Kaushik, Robert, Nicole, and Mostafa from Google DeepMind to discuss the Gemini native image generation model.
    • The team is excited about the model's state-of-the-art image generation and editing capabilities.
    Model Capabilities
    • The Gemini model allows for rendering characters from different angles while maintaining consistency.
    • Users can interact with the model using natural language, making it feel smart and conversational.
    • The model can interpret vague prompts creatively and maintain scene consistency across multiple edits.
    • Text rendering is a focus area, with improvements being made for better integration in images.
    Development and Evaluation
    • Human preference has been used historically for evaluation, but new metrics are being developed for efficiency.
    • The team tracks model performance on specific metrics to ensure no regression and continual improvement.
    • Failure cases from previous models are used to build benchmarks for future models.
    Use Cases and Applications
    • The model can be used for practical applications like home and garden redesign.
    • It supports complex workflows, allowing for iterative and incremental editing processes.
    • Gemini's multimodal capabilities make it suitable for creative tasks involving both text and images.
    Future Directions
    • Focus on enhancing model smartness, allowing it to make creative decisions beyond user instructions.
    • Improvements in factual accuracy for tasks like creating infographics and presentations.
    • The team is excited about upcoming releases and the momentum in image generation advancements.
    Actionable Takeaways
    • Explore the Gemini model's ability to maintain character consistency across edits.
    • Utilize natural language to interact with the model for creative image generation.
    • Experiment with the model's iterative editing capabilities for complex image tasks.
    • Provide feedback on model performance to help improve future iterations.
    • Consider using the model for practical applications like home design or text-based image creation.

    Transcript

    00:00

    Today we're talking about native image generation with the team behind the new model that we're releasing. It's a giant quality leap, the model is state-of-the-art, and we're really excited about both the generation and editing capabilities.

    You can ask it to, for example, render the character from different angles and it

    00:15

    will look like the exact same character. When users interact with this, not only are they impressed by the quality of the images, but they feel like, "Wow, this is smart," and they can have a fun conversation with the model over multiple turns.

    So I think this iterative process of creating is kind of the magic behind it,

    00:31

    and I think we're just scratching the surface of what these models can do. Hey everyone, welcome back to Release Notes.

    My name is Logan Kilpatrick. I'm on the Google DeepMind team.

    Today we're joined by Kaushik, Robert, Nicole, and Mostafa. These are the folks who are doing

    00:46

    research and product for our Gemini native image generation model, which we're here to talk about today and which I'm super excited about. So Nicole, do you want to kick us off?

    What's the good news? I'm excited to hear about the release.

    Yeah, we're releasing an update to our image generation and editing capabilities in Gemini 2.5

    01:03

    Flash. And it's a giant quality leap.

    The model's state-of-the-art, and we're really excited about both the generation and editing capabilities. Why don't I just show you what the model does, because that's the best way to get that across.

    I'm excited. I've played around with it once, but I haven't done as much playing around as y'all have. So, I'm

    01:19

    excited to see some examples. Great.

    I'm going to take a picture of you. Okay.

    And let's just start with, let's say, zoom out and show him wearing a giant banana costume and keep his face

    01:36

    visible, because we want to make sure it still looks like you. All right, it's going to take a couple of seconds to generate, but it's still pretty snappy, which I think you'll remember from our last release.

    It was a pretty fast model.

    01:51

    This was one of my favorite things, because I feel like this pace of editing makes these models a ton of fun to play with.

    Can you make it slightly bigger for me? You can go full screen, I think.

    Click on this. Click on this.

    Let me just click on this. So, there we go.

    This is Logan. This is still

    02:06

    your face. And what's awesome about this model is that this still looks like you, right?

    This is you, but you're wearing a giant banana costume, and now there's a nice background of you walking through a city. That's so interesting, because this picture is in Chicago, and that actually is pretty much

    02:22

    what that street looks like. So there's world knowledge coming through in this model.
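
    For developers, the edit shown here boils down to sending an input image plus a natural-language instruction in one request. A minimal sketch of what that might look like is below, assuming the google-genai Python SDK; the local file name and the model id are placeholders, not details from the episode.

        # Minimal sketch: edit an input photo with a natural-language instruction.
        # Assumes the google-genai Python SDK; "logan.jpg" and the model id are placeholders.
        from google import genai
        from google.genai import types

        client = genai.Client()  # picks up the API key from the environment

        with open("logan.jpg", "rb") as f:
            photo = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

        response = client.models.generate_content(
            model="gemini-2.5-flash-image-preview",  # placeholder image-capable model id
            contents=[
                photo,
                "Zoom out and show him wearing a giant banana costume; keep his face visible.",
            ],
            config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
        )

        # Save whatever image parts the model returns.
        for i, part in enumerate(response.candidates[0].content.parts):
            if part.inline_data:
                with open(f"edit_{i}.png", "wb") as out:
                    out.write(part.inline_data.data)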

    And now let's keep going and say, make it nano. What does that mean?

    What does make it

    02:38

    nano mean? So let's see.

    Let's see what the model does. When we first released it on LMArena, we gave it the code name Nano Banana.

    Yeah. And people started speculating that it's an updated model from us.

    And it is an updated model from us. And there

    02:54

    you go. Now the model takes you and creates this cute nano version of you wearing a giant banana costume.

    I love that. That's awesome.

    And the awesome thing here is that obviously this was a very vague prompt, right? You were like, "What does this mean?" I actually did not know what that meant.

    03:11

    But the model's creative enough to interpret it and create a scene that fulfills your prompt, still makes sense in context, and keeps the rest of the scene relevant. And this is really exciting because it's the first time, I think, that we're seeing LLMs

    03:26

    really be able to keep the scene consistent across these multiple edits and let users use really natural language to interact with the model, right? I don't have to put in a super long prompt.

    I'm just giving it very natural language instructions and can have a fun conversation with the model over multiple turns. So

    03:43

    that's super exciting. I love that.

    How good is it at text rendering, which is one of the use cases I care the most about? Do you want me to... Yeah.

    Yeah, put something on this picture.

    Why don't you give me a prompt?

    04:00

    Gemini Nano. That's the only nano thing that comes to mind.

    I feel like the use case that I'm always trying to do is announcement tweets with billboards

    04:16

    with text on them. That's my use case. All right, let's go.

    There you go. Nice.

    And so this is relatively simple text, right? It's a pretty small number of letters, easy words, and that

    04:31

    worked really well. We do have some gaps in text rendering that we call out in the release.

    And we're working really hard on it. Folks on the team, Kaushik maybe can talk about that,

    are working on making text rendering even better in our next model. I love it.

    Are there any

    04:47

    other examples you want to show, or is there any other metric story around this launch? I know one of the challenges, and I'm curious how you all think about this, is that the eval story is a lot of human preference stuff in terms of what you're measuring.

    It's hard to have a...

    05:02

    I think there are probably some things that you could have a source of truth on, but I'm curious how you all think about that for this release, but also just in general as we're training these models. I think generally with multimodal stuff like image and video, it's very hard to hill

    05:19

    climb, and the historic approach has been to use a bunch of human preference data and hill climb that. Obviously images are super subjective, so you're getting signal from a large group of people, and

    05:36

    it takes time, right? It's not necessarily the fastest metric, and it takes real hours to get anything back from it. So we've been working really hard to come up with other metrics that we can hill climb on as we train.

    05:52

    And I think text rendering has been a really interesting story, because Kaushik has been talking about it for a long time. He's one of the biggest advocates of it.

    And we were kind of brushing him off for a long time, like, this guy's a little crazy. He's really obsessed with text

    06:07

    rendering. But eventually it became one of the staple things we looked at.

    And you can think about it like this: when the model learns how to do this structure for text, it's also able to learn other

    06:22

    structure in an image as well. In an image you have these different frequencies: you can have structure, but you can also have texture and things like that.

    So it really gives you signal into how good the model is at generating the

    06:38

    structure of the scene. And I'll let Kaushik talk a bit more about it, because he's the main guy.

    Yeah. I'm also curious what the initial conviction was. Was it just that, as you were doing a bunch of research experiments, it became clear that this was the case? Yeah,

    06:54

    I'm curious to double-click on it. Yeah, I think it started from a place of figuring out what these models were bad at.

    In order to improve any model, you need a signal for what is not working well, and then you try a bunch of ideas, whether it's related to

    07:09

    the model architecture, data, or other things. Once you have that clear signal, you can definitely make good progress on it. And I think if we go back a few years, there were pretty much no models that were doing a decent job, even on prompts that were on the order of

    07:25

    short lengths, like this Gemini Nano prompt here, for example. So as we spent more time looking into this metric and always tracking it, whatever experiment we run now, if we track this metric we can make sure that we don't regress on it. And just by virtue

    07:41

    of having that as a signal, we might even find that changes we didn't expect to make a difference here actually do make a difference, and then we can make sure we continue improving that metric over time. Yeah.

    Yeah, and like Robert said, it's a great way to just measure overall image quality in

    07:58

    the absence of other metrics for image quality that don't saturate very quickly. Right?

    I was actually a little bit skeptical of the human-rater approach to doing evals for image generation. But I think what

    08:14

    I've realized over time, at least, is that when you have enough humans looking at enough prompts across a variety of categories, you actually do get quite a bit of good signal. But obviously this is expensive.

    You don't want to always be asking a bunch of humans to grade images. So looking at this text

    08:31

    rendering metric, for example, while a model is training gives you great signal as to whether it's performing like you expect. That's super interesting.
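
    To make the idea concrete, a text-rendering check that can be tracked across training runs can be as simple as generating images for a fixed set of short prompts and scoring whether the target string comes out legibly. The sketch below is illustrative only: it assumes pytesseract for OCR and a generate_image(prompt) helper wrapping whichever checkpoint is being evaluated, neither of which is the team's actual tooling.

        # Illustrative text-rendering eval: OCR the generated image, check the target string.
        import re
        import pytesseract  # assumed OCR dependency

        CASES = [
            ("a billboard that says 'Gemini Nano'", "gemini nano"),
            ("a neon sign reading 'Release Notes'", "release notes"),
            # ... in practice, many prompts covering fonts, lengths, and layouts
        ]

        def normalize(s: str) -> str:
            return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

        def text_rendering_score(generate_image) -> float:
            """generate_image(prompt) -> PIL.Image is a stand-in for the checkpoint under test."""
            hits = 0
            for prompt, target in CASES:
                image = generate_image(prompt)
                ocr_text = normalize(pytesseract.image_to_string(image))
                hits += normalize(target) in ocr_text  # substring match after cleanup
            return hits / len(CASES)

        # Logged every so many training steps, this gives a cheap regression signal
        # without waiting hours for human preference ratings.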

    I'm curious about this interplay between the native image generation capability and the native image understanding capability.

    08:46

    We did an episode with Ani, and that team has obviously been pushing super hard; Gemini has state-of-the-art image understanding. Is it a reasonable mental model that, as our models get better at understanding images,

    some of that capability is

    09:02

    actually transferable to generation as well, and vice versa? Is that... I think... Yeah.

    So basically the hope with native image generation, or native multimodal understanding and generation,

    09:18

    and learning all these modalities and different capabilities in the same model within the same training run, is that you want to end up having positive transfer across these different axes. Right?

    And it's not only understanding and generation

    09:34

    for a single modality; it's also about whether we can learn something about the world from images or videos or audio that is going to help us with text understanding or text generation. So for sure, image understanding and image

    09:50

    generation are like sisters. We definitely see them going hand in hand in interleaved generation, for example.

    But also, the ultimate goal... let me just give you one example. In language we have this

    10:06

    phenomenon that we call reporting bias. What it means is that you go to your friend's place, and when you come back you never talk about their normal sofa in conversation, right? But if you show someone an image of that room, it's there. So if you want to learn about a lot of things in the whole world,

    10:24

    images and videos have that information there without any explicit request for it. So what I want to say is that eventually, with text or with other modalities, you can learn a lot about

    10:40

    different things, but it might take more tokens. So visual signals are definitely a good shortcut for learning about the world. And back to the understanding and generation question: as I said, these two go hand in hand, and coming to interleaved generation, you can see

    10:56

    that there's actually a huge help from understanding to better generation, and the other way around. Image generation can help too; you can draw something on a board to solve a problem.

    So maybe you can better understand

    11:12

    a problem that is given to you as a visual image. So maybe we can actually show some interleaved generation that is related to understanding and generation going hand in hand with text as well.

    Let me do: transform this subject

    11:27

    into a 1980s American glamour mall shot in five different ways. All right, fingers crossed this works.

    11:44

    Okay, this looks promising. And this takes obviously a little bit longer, right?

    Because we're trying to generate multiple images and then we're also trying to generate the text that would describe what's in those images. And one of the things that you'll notice about native image generation is that it's generating these images one after

    12:00

    another. So the model may choose to look at a previous image and either try to generate something very different from it or try to generate a minor modification of it.

    It at least has that context of what is already generated. So that's what we mean by native image generation models.

    They have access to

    12:15

    multimodal context, and then they generate an image. Yeah, that's interesting.

    My mental model had always been, and I guess maybe that doesn't even make sense, that it would have just been four independent forward passes or something like that, but this is actually all in a single... it's

    12:31

    all in the context of the model. All in the context of the model.
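
    As an aside, this "everything in one model context" behavior is what you get when a single request asks for interleaved text and images. A rough sketch, again assuming the google-genai Python SDK and a placeholder model id:

        # Rough sketch: one request, interleaved captions and images from a single model context.
        from google import genai
        from google.genai import types

        client = genai.Client()

        response = client.models.generate_content(
            model="gemini-2.5-flash-image-preview",  # placeholder
            contents=(
                "Create a 1980s American glamour mall character in five different ways, "
                "with a short caption naming each look."
            ),
            config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
        )

        # Parts come back in generation order, so each image was produced with the
        # earlier captions and images already in context.
        for i, part in enumerate(response.candidates[0].content.parts):
            if part.text:
                print(part.text)
            elif part.inline_data:
                with open(f"variant_{i}.png", "wb") as out:
                    out.write(part.inline_data.data)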

    That's super interesting. And what's nice is then the style is kind of similar, right?

    The model's also doing this funny thing where it has you twice in every single one. Interesting. Could we make some of these full screen?

    I'm going to make some of these full screen. So this is the 'arcade king' logo.

    12:50

    If we scroll, this is 'red dude.' And see, none of these descriptions that go with the images were something that we came up with.

    The prompt was just you as a 1980s American glamour mall shot. This one, you should consider some of these

    13:06

    outfits as attire. And the fourth option, 'chill bro.'

    See, you have a different outfit in all of them. They all look like you.

    The fact that you were there twice is probably a little bit of a failure mode. But it's really cool to be able to

    13:21

    see the model come up with these five separate ideas, give them different names, give you different outfits, right?

    And keep the character consistent. And this is not just useful for character building; it's also useful if you have a picture of your room.

    Yeah. And you can say, "Hey, help me decorate this in five different ways,

    13:38

    right? And maybe you can go from really creative to something more conservative that's a little more incremental to what you have." And we've seen a lot of people on the team already using it to redesign their gardens and homes.

    And it's been really cool to see that kind of more practical application, not just us making fun of '80s Logan.

    Yeah.

    13:53

    I vibe coded an app in AI Studio for my girlfriend, actually, to visualize her office with every different color of blinds or curtains.

    And she was like, I don't know what curtain color is going to fit this vibe. So, it literally

    14:09

    just... This was with 2.0, and I'll have to retry it with 2.5 to check all the different vibes. It actually worked really well.

    It was very helpful. Sometimes with 2.0, and actually this will be a good thing to retest, it would change the bed or change

    14:25

    other artifacts, not just the curtain. So it was interesting to see that use case.

    One of my favorites. You should give it a try.

    The model does a pretty good job keeping the rest of the scene consistent, and we call this pixel-perfect editing. And that's really important, right?

    14:40

    Because sometimes you want to just edit that one thing in your image, but you want everything else to stay the same. Again, if you're doing character building, you just want to turn the character's head, but everything they're wearing should stay the same across the scenes.

    And the model's really good at that. It will not always 100% work.

    But we're

    14:56

    really excited about how far it's come. Robert, were you going to say something?

    Yeah. Yeah.

    I was going to say, I think one really cool thing is just how fast it still is, right? How long was this whole thing?

    All right, let's give this a... this is 13 seconds. Wow.

    So I think each image was

    15:12

    13 seconds, right? Is it? And so... Okay.

    This is the cumulative now. Yeah.

    Yeah. This is me, not AI Studio.

    Yeah. Yeah.

    So I think the cool thing is, even when 2.0 came out, I was using it for very

    15:27

    similar things. I had a bookshelf.

    I had all the stuff on the ground, and I'm like, decorate this: what configuration of these items should be placed on my bookshelf? And my girlfriend might not have agreed with the output, so sometimes we want to iterate on that. So rerunning it really quickly and iterating, even if it sometimes fails, you just tweak the

    15:43

    prompt, rerun it, and you get something really good afterwards. So I think this iterative process of creating is kind of the magic behind it. Any difference for folks who had tried 2.0? As an example, one of the things for me using 2.0 was wanting to do only

    16:01

    single edits, one at a time. If you had asked it to change six different things, the model would sometimes not do a great job of that.

    Is that still something you should do with this model, those types of targeted edits? Or is there any other general usability advice or things

    16:17

    that folks should know as they're playing around with the model? This is something that I wanted to mention, basically.

    One of the magics of interleaved generation is that it offers you a new paradigm for image generation. So if you have a very complex prompt, say you're talking about six different edits,

    16:34

    what if I go with 50 different edits? Now that the model has a really good mechanism to grab information from the context, pixel perfect, and use it in the next turn, what you can do is ask the model to break the complex prompt down, whether it

    16:51

    is editing or image generation, into multiple steps, and do the edits one by one over different steps. So in the first step you do, say, five different edits, then in the next one the next five, and so on and so forth.

    So it's very

    17:06

    similar to the test-time compute that we have on the language side, right? You spend more flops and you let the model bring this thinking into the pixel space, plus breaking it down into smaller pieces so that you can

    17:23

    really nail down that specific stage, but, accumulated, you can do whatever complex task you want. So again, this is the magic of interleaved generation: you can think about incremental generation of really complex images, as opposed to the traditional way of doing it, which was

    17:38

    really pushing hard to get the best image in one shot, right? At the end of the day, there's a limit to the capacity you can push the model to.

    At some point you realize that, okay, with 100 details, we cannot do that. But when you have interleaved generation breaking it into steps, you can

    17:55

    always go for whatever capacity and complexity you want to generate.
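
    One way to read the "test-time compute in pixel space" idea: instead of one giant prompt, run a short loop that applies a few edits per turn and feeds each output image back in as context for the next turn. A hedged sketch of that pattern, assuming the same google-genai Python SDK; the model id, file names, chunk size, and edit list are made up for illustration.

        # Sketch: apply a long list of edits a few at a time, feeding each intermediate
        # image back in as context. All names below are illustrative placeholders.
        from google import genai
        from google.genai import types

        client = genai.Client()
        MODEL = "gemini-2.5-flash-image-preview"  # placeholder

        edits = [
            "replace the curtains with light linen ones",
            "add a small reading lamp on the desk",
            "repaint the far wall a warm off-white",
            "swap the rug for a striped one",
            # ... potentially dozens more
        ]

        with open("room.jpg", "rb") as f:
            image = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

        CHUNK = 2  # a few edits per turn rather than all at once
        for start in range(0, len(edits), CHUNK):
            instruction = (
                "Apply only these edits and keep everything else unchanged: "
                + "; ".join(edits[start:start + CHUNK])
            )
            response = client.models.generate_content(
                model=MODEL,
                contents=[image, instruction],
                config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
            )
            parts = response.candidates[0].content.parts
            image_bytes = next(p.inline_data.data for p in parts if p.inline_data)
            # The latest output becomes the input context for the next batch of edits.
            image = types.Part.from_bytes(data=image_bytes, mime_type="image/png")

        with open("final.png", "wb") as out:
            out.write(image_bytes)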

    One of the things that's always top of mind for me, especially since, Nicole, you're also the PM for our Imagen models: how should people,

    18:10

    developers, or just people who have knowledge of all the models, think about Imagen versus this native capability that we have? Yeah, and you know this, but our goal is always to build one model with Gemini, right? Ultimately our goal is to bring all the modalities into Gemini so we can benefit

    18:27

    from all the knowledge transfer that Mostafa was talking about and ultimately build towards AGI. On the way there, there's a lot of usefulness in having specialized models that are just very, very good at a specific thing that you need them to do, and Imagen is an amazing model for text-to-image generation,

    18:43

    right? And we have a lot of different Imagen variants that also do image editing, and those are available in Vertex AI. They're just optimized for that specific task, right?

    So, if you just want text-to-image and you want just one image out of that model and you want really amazing visual quality, and

    18:59

    you also want that to be really cost-effective and snappy in generation time, Imagen is the place to go, right? If you want some of these more complex workflows where you want to generate with the model, but then also edit in that same workflow across multiple turns, or you want to do

    19:15

    some of this ideation like we were doing with the model, like what design ideas could you help me come up with for my room or this library, then Gemini is the place to go, right? So it really is more of a multimodal creative partner, where it can output images,

    19:30

    it can output text. You can be less precise with the instructions that you give to Gemini,

    because, like when we said 'make it nano' at the beginning, it has that kind of world understanding. It will just more creatively interpret your instructions.

    But Imagen is still a

    19:46

    great family of models for developers to go to if they want a super optimized model for that specific task. Yeah.

    Yeah, one of the examples I was trying today, and I'm curious what your take is on which model, or whether the native image generation model fixes this problem: I was saying, generate this image and make the... this is

    20:04

    my dumb billboard use case. I was like, make the billboard...

    I need billboards. Make the billboard in the style of some company that I mentioned.

    Is that something that native image generation benefits from, because it's a little bit better at this world knowledge piece, relative to

    20:20

    Imagen being really good if you give it a good prompt, but less good at understanding the actual intent behind my prompts? Yeah. So I think that's part of it.

    The other part is, with native image

    20:36

    generation, if you just want to grab that style reference from the other company whose style you were trying to emulate, you can also insert that into the model and use it as a reference. So the fact that you can also input an image as a reference helps with that prompt, and that is just easier to

    20:52

    do in Gemini natively than it is in Imagen. So you should try it, and you should let us know; we should add this to our evals. I'll let you know whether or not the billboard use case works; I'll make a billboard eval. I love that.

    Back to this

    21:07

    thread of the progress from 2.0: one of the most fun things was that when that model launched, people were sending us tons of feedback about the experience in AI Studio and then ultimately the Gemini app,

    just general failure modes for the model and all that stuff. I

    21:22

    made my only contribution to the original launch, which was adding that hot tag in AI Studio. We're bringing the hot tag back for this model, actually, and it's going to go away on the other model.

    Can we talk about that story of the

    21:38

    progress, and the failure modes that we got a ton of feedback on, things that didn't work well for 2.0 that now hopefully work well for 2.5? Yeah, I mean, we literally sat on X, or Twitter, and went through a bunch of feedback, and I

    21:54

    remember Kaushik and I and some others on the team gathering all the failure cases and making evals out of that. So we have a benchmark that we take from real user feedback, just from Twitter, and it's just people tagging us and saying, hey, this

    22:10

    didn't work. And for every model we make in the future, we just append to that, so that we know, for example, that when we released 2.0, one of the failure cases we would sometimes see is that if you made an edit, it would add your edit, but it wouldn't necessarily be consistent with the

    22:27

    rest of the image, right? So that was one of the things that was in there and that we hill climbed on, and there are plenty more. So we're always just gathering that feedback. Yeah, send us the examples that don't work well. Any ones for you all that particularly stand out,

    22:44

    of things that just did not work before but are now a slam dunk? I don't know if there's anything top of mind.

    You all play with... I think the team plays with this model so much, I assume, in the process as we're actually building it and bringing it to life. I don't know if

    22:59

    there are any go-to use cases for you all to test whether this is actually a good model. Yeah, I think one thing I've noticed specifically while playing with the 2.5 model, versus the 2.0 model...

    Actually, one of the things that we thought was going to be hard was

    23:14

    consistency from image to image, specifically the cases where you have an object or, say, a character that you're building, and you want that character to remain consistent across images. And if you actually leave the character in the same place that it was in the input image, it turns out that

    23:30

    this is actually quite easy and the 2.0 model could do this really well. It could, for example, add a hat, change the expression and stuff like that while kind of keeping the pose and overall structure of the scene the same.

    What the 2.5 model adds on top of what these

    23:46

    capabilities looked like in 2.0 is that you can ask it to, for example, render the character from different angles and it will look like the exact same character but from, say, the side. Or you could take a piece of furniture and place it into a completely

    24:02

    different context, reorient it, and create a whole scene. But that piece of furniture would remain faithful to the original that you uploaded, while being transformed in very substantial ways, rather than

    just taking the input image and pasting those pixels into the output image.

    24:17

    I love that. One of the reactions I had to some of the 2.0 stuff was that sometimes, when you would add something, say I take a picture of my face and add a goofy mustache or a hat or something,

    it almost looked like it was

    24:33

    superimposed, or kind of photoshopped onto it. Is that something that's also related here?

    It seems tangential to this character consistency, but it feels like a similar-ish problem, where it's just taking pixels from memory and putting them into the image, almost,

    24:50

    versus the pixel transfer. I'm curious if that's a capability that's improved.

    Yeah. And actually I think that comes down a lot to the actual teams working on this model.

    With the previous model, we were kind of of the mindset that, okay, it did the edit, that's

    25:06

    it, it was successful. But when we started working more and more closely with the Imagen teams, they would look at the same exact edit that we were looking at from the Gemini side and say, this is terrible, why would you ever want the model to do something like this? So this is one example where blending the

    25:23

    perspectives from both teams helped: on the Gemini side, the instruction following, world knowledge, all of these things, and then on the Imagen side, making the images actually look natural, aesthetically pleasing, and genuinely useful. So I think it takes both of these, and having these teams work together on this led to 2.5 being much

    25:40

    better at the stuff you're describing. I love it.

    Yeah, and just on that point, we actually have folks on the team, who mostly come from the Imagen team, who have a really honed aesthetic taste. And so a lot of the time when we do evals, they will actually just look at hundreds and thousands

    25:56

    of images and be like, "No, this model is better than this other model." And a lot of other people on the team will look at it and be like, "Okay." You kind of have to hone that sensibility over a couple of years, I think. And I've gotten a lot better at it over the

    26:11

    years. But there are definitely people on the team who are amazing at it, and we always go to them when we try to pick between models.

    Can you train auto-raters on people's personal... We haven't been able to do it yet. Fun side project.

    That's a fun side project. I'm very excited, as Gemini gets better

    26:27

    at understanding, to have an aesthetic auto-rater based on one of the folks on the team who is really amazing at this, and just have that person provide the training signal. Yes.

    Yes. We'll take that as a side project after this.

    I love that. Lots of progress on 2.5,

    26:45

    and obviously I think folks are going to be super excited to try out the model. What comes next?

    We've made a great model. I'm sure we have more stuff cooking in the pipeline, but I don't know how much we want to say about the future direction and what other capabilities will hopefully land in

    27:01

    the future. So when it comes to image generation, we do care about the visual quality, but one thing that is new, and that we want with a unified omni model, is smartness.

    You want your image generation model to

    27:17

    feel smart. When users interact with this, not only are they impressed by the quality of the images, but they feel like, wow, this is smart. One example that I have in mind, and I'm looking forward to seeing this happen, and it's a bit controversial because I cannot even define it well,

    27:32

    is when I ask the model to do something and it doesn't follow my instruction, but it does something where, at the end of the generation, I say I'm glad it didn't follow my instruction, because it's even better than what I actually described. So it has this kind of

    27:48

    edge to it. Do you think the model is intentionally doing this, or is it kind of an unintended accident? Is that what you're trying to say?

    No, no, it's not just that. Basically, sometimes your prompt is underspecified, or sometimes you're wrong about

    28:04

    something that is a reality, but the outside world, with the knowledge of Gemini, is different from your perspective, right? And I think, again, it's not intentional, it just happens organically. And I think, again,

    28:21

    you just feel, I'm interacting with a system that is smarter than me, right? And when I'm asking for some images, I don't mind if it goes off the rails from my prompt and generates something that is different from what I asked,

    28:36

    because most of the time it's better than what I had in mind. So I think smartness, at a high level, is definitely the direction that we are pushing forward, while maintaining or improving the visual quality.

    But there are so many specifics

    28:51

    and capabilities and use cases, especially for developers. This release has some, but the next release is going to have more, and we have these coming releases in the pipeline. I cannot share the timeline, but it's just so exciting, and yeah, I

    29:08

    should... maybe I should. Yeah.

    But I'm so excited. I'm happy, and the momentum is unmatched here on the image generation side.

    I love that. Any other capabilities folks are excited about?

    I'm really excited about factuality. And that kind of goes back

    29:25

    to the point that sometimes you need to make a little diagram or an infographic for a work presentation, right? And it's amazing if it looks nice, but that's not enough for that case.

    It actually has to be accurate. You can't have any

    29:40

    extraneous text. It just has to both look good and also be functional for that purpose.

    And I think we're just scratching the surface of what these models can do there. I'm really excited about some of these upcoming releases getting better at that type of use case, so that... my dream one day is that these

    29:57

    models can actually make a slide deck for me for work that looks nice. This is every PM's dream. Every dream.

    I'm trying to outsource that part of my job to Gemini, and I think we play a really big part in it.

    So awesome. I love it.

    Well, I think folks are going to be super excited to try these models. Thank you to all four of

    30:13

    you, and to the rest of the team, for making this happen. I appreciate all the hard work.

    I'm excited for this. And thanks, everyone, for watching Release Notes.

    We'll see you in the next episode. [Music]