GPT-5 Fails. AGI Cancelled. It's all over...


Category: AI Technology

Tags: AI Development, GPT-5, Model Performance, OpenAI, Tech Reviews

Entities: Elon Musk, Ethan Mollick, Gary Marcus, GPT-5, Matt Shumer, OpenAI, Rune, Sam Altman


Summary

    Release and Reception of GPT-5
    • GPT-5 was released within the last 24 hours, receiving mixed reviews.
    • Gary Marcus criticized GPT-5, labeling it as disappointing and not a step towards AGI.
    • There's a debate on whether investments in AI data centers will pay off.
    Technical Performance and Issues
    • GPT-5's routing system is currently malfunctioning, causing incorrect model assignments.
    • Rune from OpenAI mentioned that the model auto switcher is broken but will be fixed soon.
    • Despite issues, GPT-5 can perform well when routed to the correct model with maximum reasoning effort.
    Capabilities and Limitations
    • GPT-5 excels in creating code and software solutions, especially for complex tasks.
    • It struggles with basic tasks when not using high reasoning models.
    • The model is particularly good at instruction following and tool calling.
    Community and Developer Feedback
    • Developers report mixed experiences, with some showcasing impressive projects created with GPT-5.
    • Ethan Mollick demonstrated GPT-5's ability to create a 3D city-building game.
    • Matt Shumer suggests waiting for optimized agent harnesses for better performance.
    Future Prospects and Improvements
    • Sam Altman acknowledged initial issues with GPT-5 but expects improvements soon.
    • The model will become smarter as routing issues are resolved.
    • OpenAI plans to make model selection more transparent for users.

    Transcript

    00:00

    So, the much-awaited GPT-5 went live in the last 24 hours, and the results are mixed, to say the least. Gary Marcus is saying that GPT-5 is very disappointing.

    A lot of it was just hype and marketing. It's not the path to AGI.

    In this post,

    00:16

    he's specifically talking about how OpenAI is falling behind. But in other posts, he's also mentioning that a lot of other people, like Elon with Grok, are investing tons of money into these AI data centers, and those bets are probably not going to pay off because they're not getting us closer to

    00:32

    AGI. It's really interesting to see how different people's opinions are on this same model that just got released.

    GPT-5 just refactored my entire codebase in one call. None of it worked, but boy was it beautiful.

    Disappointment of ChatGPT. GPT-5 has burst the AI hype bubble.

    The

    00:49

    narrative changed overnight. GPT-5 is disappointing.

    Hallucinates. The big router keeps failing me.

    GPT-5 was rumored to do extremely well, better than the human baseline, on SimpleBench. That does not appear to be the case.

    Looks like it's in fifth place. How good

    01:05

    is it at math? Well, it's beyond anything we've seen before.

    Previous models would be able to answer various math questions correctly. This model completely redefines math as we know it.

    As you can see here, 69 is equal to

    01:21

    30. Okay, 69 equals 30, but also 69 is less than 52.

    I'm sure you learned something today. You're welcome.

    I have no idea what happened here. This is beyond me.

    Apparently, Reddit hates it. They're canceling their subscriptions.

    OpenAI lost all of their respect.

    01:38

    And this is highly upvoted. A lot of people are expressing the same sentiment.

    So, what happened? Is AGI cancelled?

    Did we just hit a plateau and there's not going to be any AI progress moving forward? Let's break it down a little bit.

    First and foremost, one

    01:53

    thing GPT-5 introduced that a lot of us were looking forward to, if it was done correctly, is routing, and right now it's not working very well. When you ask GPT-5 for whatever you ask it for, it tries to figure out: okay, do we

    02:08

    send you to the big smart model? Do we send you to something that's a little bit, you know, faster and cheaper to use?

    You know, how big is your request? How much reasoning power do we need to allocate to it?
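
    To make that routing idea concrete, here's a toy sketch of what such a router could look like. This is purely illustrative, not OpenAI's actual routing logic; the heuristics and model-tier names are assumptions of the sketch.

```python
# Purely illustrative toy router; not OpenAI's actual logic.
# It guesses request difficulty from crude heuristics and picks a tier.
def route(prompt: str) -> str:
    hard_markers = ("prove", "refactor", "debug", "simulate", "design")
    long_request = len(prompt.split()) > 200
    looks_hard = long_request or any(m in prompt.lower() for m in hard_markers)
    # Tier names are hypothetical stand-ins for "big smart model"
    # vs. "faster, cheaper model".
    return "gpt-5-thinking" if looks_hard else "gpt-5-mini"

print(route("What's the capital of France?"))        # -> gpt-5-mini
print(route("Refactor this 2,000-line parser ..."))  # -> gpt-5-thinking
```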

    Now, the reason a lot of us were looking forward to that, I think, is that there used to be a bunch of

    02:24

    different little models. Sometimes it got confusing.

    Sometimes you'd be in the wrong one. It would be a little bit of a pain to select the correct one.

    Now, there are certain situations where, if you're a developer, you need a specific model to do a specific thing. That was helpful.

    You can custom-tailor the model you want to use for

    02:40

    some specific task, but for a lot of everyday use, you kind of wished it was a little bit more streamlined. But of course, this meant that people would often use the best model.

    You know, it would cost more money and of course that would be an expense for OpenAI. So, one of the reasons why we're thinking some

    02:56

    of this stuff might be happening is that they're routing too much to the cheaper models to decrease their expenses and increase their profits. Now, Rune, who is part of OpenAI, did reply saying, "By the way, the model auto-switcher is apparently broken, which is why it's not routing you correctly; will be fixed

    03:13

    soon." And I think the really big thing that people need to understand is that GPT-5 isn't one model. So, when people say GPT-5 is mind-blowing and very, very good, that's probably because GPT-5 did something mind-blowing and very, very good.

    And if they're saying it's horrible, it

    03:28

    failed the basic task they asked of it, that's probably true as well. At this point, tons of people, myself included, have seen that when we give hard reasoning tasks to these models and ensure they get routed to the, you know, maximum reasoning effort

    03:45

    model, the results were kind of amazing, kind of mind-blowing. I showcased some of my results yesterday.

    They were pretty good. Those were one shot.

    So, one prompt, one output, and the results were kind of stunning. Definitely better than what we've seen before.

    Here's

    04:01

    something that took more than one shot. It's, if you know the game Vampire Survivors, that style of game.

    This is Nightfall Survivors. And I'll show you in a little bit how it sounds, because I've actually added music and sound effects to it.

    But as you can see here, it's very smooth. It works very well.

    You have the HP system, the leveling

    04:17

    system. You have ammo reloading.

    You have this dash functionality where you dash to a certain place and it blows it up. You only have limited amount of uses that recharge over time.

    You can update your, you know, Razor Instinct and drones. there's going to be multiple drones that are floating around shooting and slowing down enemies and stuff like

    04:32

    that. This feels really smooth.

    It feels really good, and iterating on the existing code was very simple. At no point did the model screw something up where, you know, I added one thing and the whole thing just broke.

    That never

    04:48

    happened. [Music] Fresh meat.

    05:17

    I'll be back.

    05:32

    Fresh meat. Developing this game was a joy.

    It felt

    05:49

    very easy, very straightforward. I'd have the model running in one window, and then I'd have the index.html file open in another.

    And so every time I would make updates, I would just hit refresh and the new version of the game would be playable. I would keep testing it while the model would be

    06:05

    working on the next iteration. And it was pretty fast.

    So as it was working on the next feature that I wanted, I could see it thinking over here, while I was testing the game in a different window.

    It was this beautiful, perfect sort of development flow, I guess, where

    06:21

    just in real time I was watching different features being added. I'd be able to test them and tell it what to work on. It felt amazing. That idea of vibe coding, or whatever; you really got a feel for what that means: that quick-paced iteration where, as soon as an idea pops into your head, you're able to execute

    06:37

    and see it live, usually within under a minute. Now, this one, I believe, as it stands right now, is one of the best models for that. I have to test it head-to-head with the new Claude 4.1. But so far, this model has met and exceeded my expectations.
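
    If you want to reproduce that refresh-and-test loop yourself, here's a minimal sketch: serve the working directory locally, then hit refresh in the browser after each iteration. The index.html file name is from the video; the port is an arbitrary choice.

```python
# Minimal local server for the refresh-and-test loop described above.
# Run this in the folder containing index.html, then open the printed URL.
import http.server
import socketserver

PORT = 8000  # arbitrary local port

handler = http.server.SimpleHTTPRequestHandler  # serves files from the cwd
with socketserver.TCPServer(("", PORT), handler) as httpd:
    print(f"Serving the game at http://localhost:{PORT}/index.html")
    httpd.serve_forever()  # Ctrl+C to stop
```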

    Here's the thing. If I'm

    06:54

    testing in Cursor, you can see here I'm using GPT-5 Max, which enables the maximum context window for advanced users who are cost-insensitive. So again, I'm not using their automatic model picker.

    I'm just like, nope, top shelf. Give me the best one.

    I don't care about the

    07:10

    price. Give me the max one.

    Here was the original prompt that created the Vampire Survivors-style game. I'm using ChatGPT's GPT-5 Pro, right?

    So, I'm not using this one, which is the routing one. I'm going down to the other models.

    We have GPT-5 Thinking,

    07:25

    and then, if you want to turn it up to 11, GPT-5 Pro. So, you get what I'm saying here.

    There are a bunch of different models, and people are saying whether they're bad or good, but they're not comparing apples to apples. If you really want to test its maximum reasoning abilities, use GPT-5 Pro or put

    07:43

    it on Max. Or, if you're testing in the playground version, GPT-5 here, you're able to select the reasoning effort, right?

    So you're able to put it on high, to say, "Hey, think really hard about this one." If

    07:58

    you're using the default GPT-5, you can also approximate it by saying things like "think very hard" or "use maximum reasoning effort." So that's kind of my big point here. I'm not necessarily defending OpenAI's decision to go with a router.
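
    If you're going through the API rather than the playground, here's a minimal sketch of pinning both the model and the reasoning effort explicitly, assuming the OpenAI Python SDK's Responses API; double-check the exact parameter names against your SDK version.

```python
# A minimal sketch: bypass the auto-router by naming the model and
# forcing high reasoning effort. Parameter names assume the OpenAI
# Python SDK's Responses API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5",                 # explicit model, no auto-routing
    reasoning={"effort": "high"},  # maximum reasoning effort
    input="Think really hard: design a wave-spawning system for a survivors-style game.",
)
print(response.output_text)
```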

    And hopefully it's

    08:14

    true that right now it's just not working very well and not directing requests to the correct model, and that's why a lot of people are having issues with it and complaining about it.

    But it's important to understand that when I'm saying, "Hey, I'm seeing some good things created by this model, I'm not tossing a coin and

    08:30

    seeing what model it picks for me. I'm making sure that I'm using max reasoning effort, maximum model.

    I'm asking it to give its best effort, and I'm judging it based on that. If it routes to the nano model and that gives some bad answer, okay, that's not what I'm trying

    08:45

    to test. I'm trying to see what it can do at its best capabilities." Here's Ethan Mollick, who asked it to do something mind-blowing.

    So notice the phrase "this is a big deal." Here's the output it wrote.

    Thunder struck here. Watch.

    I build worlds. See ideas become

    09:01

    instruments. I code, compose, and converse.

    Each sentence is, you know, one word, two words, three words, four words, etc. So each one is one word longer, and the beginning letter of each word spells out "this is a big deal."

    It

    09:21

    spells out "this is a big deal." Ethan was also able to create a 3D city-building game.
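
    As an aside, those two constraints are easy to check mechanically. Here's a small sketch that verifies them on a made-up sample (not Mollick's actual output): sentence n must have n words, and the first letters of all the words, in order, must spell the hidden phrase.

```python
import re

def check(text: str, phrase: str) -> bool:
    # Sentence n must have n words, and the first letters of all words,
    # read in order, must spell the hidden phrase.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    lengths_ok = all(len(s.split()) == i + 1 for i, s in enumerate(sentences))
    letters = "".join(w[0].lower() for s in sentences for w in s.split())
    return lengths_ok and letters == phrase.replace(" ", "").lower()

# Hypothetical sample spelling "this is" with sentences of 1, 2, 3 words.
print(check("Today. History is. Sunsets inspire souls.", "this is"))  # True
```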

    I was actually able to emulate this as well; I showed it in my previous video.

    I was very impressed with it. Again, I just did it in one shot, and in the future I plan to do more with it.

    I got to say it's

    09:38

    incredibly impressive as a one-shot result. With more iterations, I feel like it could build an entire game or software tool to create cities.

    Here's somebody saying that GPT-5 is sick. It seems like they're creating some sort of multiplayer

    09:53

    online game. So, some sort of MMORPG with 3D characters, built with Three.js.

    So, it sounds like he used Cursor and GPT-5, and it took about 6 minutes, which, after some of my experimenting with it, would not super surprise me.

    I

    10:09

    mean, this definitely seems very advanced. I would like to see if I can replicate it. I wouldn't be shocked if this was indeed done in 6 minutes, though I feel like it would take a little bit longer to create something like this. But again, I haven't tried it.

    That's on my to-do list. But these are

    10:26

    the sorts of things that people are building. Here's Matt Shumer saying that a lot of folks who are having a bad experience are using GPT-5 in agent harnesses that aren't yet optimized for it.

    If you had a bad experience using GPT-5 in a coding harness, give it a week and try again. I think you'll be

    10:41

    pleasantly surprised. Here's kind of my take on it.

    There's a series of models that you get routed to. And the higher you are on that sort of chart, the more expensive it is for OpenAI to run it or for yourself if you're using the API.

    I'm just going to put "IQ," for lack of a better term. So it's more expensive and

    10:57

    it's smarter. So, the one here at the top being the full GPT-5 model with high reasoning effort.

    Now this model is impressive. It's impressive with the stuff that it can do for code.

    We're getting to the point where its coding abilities are

    11:14

    getting much better than its verbal abilities, its ability to reason within the word space, if you will. So if you ask it to create a complicated chart, and it's able to do that by generating some code to create that chart, that output is

    11:29

    probably going to be phenomenal. If it has to think through it in words and try to use words to come up with the final answer, the quality might not be as good.

    So any task where it can default to coding something up is going to be better. A lot of people are asking it simple math questions like what's

    11:46

    this plus that? And when it fails, they say, oh, this is a stupid model.

    To me, it seems like maybe kind of a pointless question because it can easily create a little bit of code to make the calculation. It'll be right 100% of the time.

    You and I might not be able to answer some complicated math problem.

    12:02

    We'll just reach for a calculator, right? If we if we have a tool, then we're able to solve that problem.

    What GPT-5 is excellent at is, number one, instruction following. It's really good at understanding what you want, what your intent is, and then translating that into the end output.

    12:18

    It's good at tool calling and using code to build its own tools. The point of large language models and neural nets and AI wasn't so that they're able to answer complicated math problems.

    We have a calculator for that. I mean problems where you have to calculate stuff, right?

    Some addition or

    12:34

    subtraction, division, etc. If these models can code up a solution and calculate it, then they can solve it.

    They don't need to do the math in their head, so to speak. But what they are getting really good at is creating these bespoke, custom-made little software apps to do what you want them to do.
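
    Here's a minimal sketch of that "reach for a calculator" idea using a function tool, assuming the OpenAI Python SDK's Responses API; treat the exact field names as assumptions of this sketch.

```python
# Expose a tiny calculator tool so the model can call it instead of
# doing arithmetic "in its head". Field names assume the Responses API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "name": "calculate",
    "description": "Evaluate a basic arithmetic expression, e.g. '69 + 52'.",
    "parameters": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}]

response = client.responses.create(
    model="gpt-5",
    tools=tools,
    input="What is 69 + 52? Use the calculator tool.",
)

# Rather than guessing the sum, the model should emit a function call,
# e.g. calculate(expression="69 + 52"), which your code executes and
# feeds back in a follow-up request.
for item in response.output:
    if item.type == "function_call":
        print(item.name, item.arguments)
```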

    12:51

    What these models are very good at is this: you give them some task, and with GPT-5 this is becoming more of a long-horizon or medium-horizon task, right? It's something that would take a human intern maybe a few hours to do, right?

    If I ask you to create some software where I can

    13:06

    create 3D cities and change the shapes of buildings and line them up in a row on a 3D grid. I mean, how long would that take you to code up?

    At least a few hours. Probably more, actually.

    If you're used to doing that sort of work, it will probably go faster. But the point of these things is, if you give

    13:21

    them tasks like that, they can create code or software that then does that task, and this is what it's getting very, very good at. Again, that's the top model, the more expensive model.

    So keep that in mind as this debate rages on and some

    13:37

    people say it's bad and some people say it's good. They're both right, but they're not talking about the same thing.

    For tasks where you can create code to complete them, whether that's Excel or 3D buildings or simulations of something, making Gantt charts, or creating some software to

    13:54

    analyze the expenses of a company, this thing is getting very, very good at creating these small little apps, these little tools, that are able to solve those problems quickly.

    There are massive applications for this and right now it's incredible at doing that thing.

    14:11

    All the other stuff that people are complaining about, that it's horrible at? They're right.

    It's really bad at those things. And maybe some of those things will get fixed, maybe not.

    But I am very impressed with this particular section of things that it can do. I think there's a lot of potential there if you think about it.

    But it does feel like

    14:28

    maybe there's a little bit of plateauing going on, and maybe we've had some sort of an S-curve, where we had this massive upward momentum and it's flattening out a little bit. We can only scale so far, but I do think we're going to see a lot of applications of this.

    I don't think AI progress is done.

    14:44

    We still have tons of room to grow, but it does seem like just bigger and bigger models might not be the way to progress. And just as I'm finishing this up, Sam Altman and the OpenAI team are having a Reddit AMA.

    One thing that jumped out at me is one of Sam Altman's responses to some

    15:00

    feedback that he was getting on the GPT-5 model. He does point out that some things were bumpy, but GPT-5 will seem smarter starting today.

    Again, this was two hours ago, so this is about 24 hours after it was released.

    He's saying that moving forward, it's going to seem smarter. They had an issue with the auto-switcher that was out of commission

    15:16

    for a chunk of the day, and the result was that GPT-5 seemed dumber.

    Also, they're making some changes to how the decision boundary works, so the router will help you pick the right model more often.

    And, this is important to me, "we will make it more transparent about which model is answering a given query." It would be nice to know where

    15:32

    it's getting routed, so we know what model is answering our questions. So again, that's one thing to keep in mind: moving forward, it's going to be better because the correct model, or the more appropriate model, I should say, is going to be called.

    And while this isn't a massive leap forward,

    15:48

    this is a good incremental change. GPT-5 is better in a lot of ways.

    Now, of course, if we were expecting something completely next generation, next level, this does seem like a letdown, but the model isn't bad. It's improved.

    It's better. There's some very powerful and

    16:04

    strong things it can do. But let me know what your experience was.

    Are you excited about this model? Is it doing anything for you?

    Are you seeing the powerful use cases? Or do you feel like it was kind of a letdown?

    If you're using it for coding, is it better than Anthropic's set of models? Is it better than Gemini?

    Let me know in the

    16:20

    comments. If you made it this far, thank you so much for watching, and I'll see you in the next one.