Category: AI Research
Tags: AI, Learning, Models, Reinforcement, Training
Entities: Fudan University, Gemini 2.5 Pro, GitHub, Grok 4, LLM, Markov decision process, Shanghai Innovation Institute
00:00
Hello community. So great that you are back.
Our paper today is about a new reinforcement learning framework, AgentGym-RL, about training an LLM agent, or multiple of them, on long-horizon decision-making through
00:15
multi-turn reinforcement learning, published here September 10, 2025, by Fudan University, ByteDance, and the Shanghai Innovation Institute. And they say, you know what, we have beautiful autonomous LLM agents and they are able to do a lot of things, but
00:31
we do have problems, and we introduce here a new reinforcement learning methodology. We make the code open source.
So you have here everything available. You have here the GitHub and
00:47
you have a project page with everything, with videos. So everything is explained, plus you have 39 pages here explaining exactly what it is.
But today let's go a different route, you know, because I want to show you that a
01:05
lot of people ask me how they can use AI to learn, to really have an AI assistant for them: thinking, critical thinking, learning, understanding what it is all about. So let's do this, and let's have a look at this paper here from Anthropic, and
01:21
they tell us here, you know, how people use Anthropic's AI, their AI skill breakdown: critical thinking, look at this. And then writing, and programming, as you see, is about half.
So there's a lot going on about thinking and writing
01:37
and understanding. Isn't this beautiful?
So let's have a look at this, although we are looking at a paper about how to code a better reinforcement learning structure.
And also, if you look here, this is how US adults, so we are focusing just on
01:53
one market, because this is the latest information I could find, young adults use AI for search and for ideas. And if you look here at the red one, where we look at people under 30 years old: companionship 25%,
02:10
editing images 47%, okay, writing emails 50%, unbelievable. Yes, of course you have multiple indicators here, for work 50%.
But coming up with
02:25
ideas 62% and searching for information 74%. So there's a lot of application of AI that is not only for code and code optimization but simply for information, ideas, work emails, and so on.
02:41
So let's jump into this. So which model are we going to use?
And I thought I'd have a look at LMArena and take the best model and the model ranked number 10. So from 1 to 10 we have everything.
Best one is Gemini 2.5 Pro as voted by the
02:56
community for text, and on position number 10 we have Grok 4. So those will be my two models and I will use here the free version.
We'll just use email. We'll log in with our email and this is it.
We are not going to pay anything. So if you
03:13
only have access to a browser, it's absolutely enough. As I told you, we do have a homepage for this particular project.
As you can see: paper, code, everything, arXiv. They explain everything to you and they give you here
03:29
some great information here in this encoded visual. But honestly, if I look at this, I say I don't understand this thing.
So we have different environments like web navigation, or deep search, or digital games, scientific tasks. I say
03:45
yeah, this is interesting. They give you here visualizations, demo videos, and everything.
And of course we do have here a complete GitHub repo. You have all the code, scripts, versions, everything that you need.
So everything is available for you. So let's have a
04:03
look at how we can learn what the content of this is. Okay, community.
So this is the paper of today: AgentGym-RL. So let's have a look at what we do.
We have here Grok 4 and I say very simply: hey, examine the
04:18
technical context of the PDF and explain the main insights to me. First, show me the general picture of what we are talking about in this PDF, and then, second, do a technical explanatory deep dive into the main insights for me to understand, in the form of
04:34
storytelling. So I just provide the PDF.
The thought process: no content at all in the thought process. So let's have a look at the result.
Grok 4 tells me here the general picture. So let's have a look.
It is a research paper September 2025
04:49
titled, yes, beautiful, from, okay, great. It addresses the training of LLMs as autonomous agents capable of handling complex multi-step tasks in real-world environments. The core contribution is a new
05:06
framework, a modular, extendable system for end-to-end reinforcement learning. It supports standard reinforcement learning algorithms like PPO and GRPO. Additionally, it introduces a new method, ScalingInter-RL, a
05:21
progressive training method that gradually increases the interaction length to balance exploration, to discover new strategies, and exploitation, to refine the known ones, leading to a more stable optimization process overall.
05:37
It gives me the result, demonstrated across 27 tasks: this approach boosts the open-source 7B parameter model to an average of 33 points, enabling it to match or exceed proprietary models like GPT-4 or Gemini 2.5 Pro.
Real nice. I know
05:54
exactly where I am. Let's start the technical deep dive, but now in a storytelling form.
So they tell us, okay, we have explorers now, or agents, and they must navigate some complex terrain, like booking a flight on a dynamic website,
06:10
synthesizing information from multiple sources in a deep search or deep research, or conducting virtual experiments and coding in a scientific lab. And this is why I've chosen the paper.
Now you can have a rigid map like supervised
06:26
fine-tuning, but of course you need a detailed map like reinforcement learning for discovering new paths, one that emphasizes interaction, adaptation, and growth in the complexity of the solutions. And they say, okay, what is the problem?
How is it formulated? How is the problem framed?
06:43
It is framed as a partially observable Markov decision process. We have the instruction space, the states, the actions, the observations, the transition function, the reward, everything that is classical there.
The journey's engine is the policy
06:58
gradient methodology, which optimizes the policy itself, the core strategy itself, to maximize the expected cumulative reward function.
Absolute standard. Nothing new.
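To pin down what "classical" means here, this is the standard formulation as I would write it, in my own shorthand symbols, not necessarily the paper's exact notation:

```latex
% POMDP tuple and RL objective, written in my own shorthand (illustrative, not the paper's exact notation)
\mathcal{M} = (\mathcal{U}, \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{R})
% U: instruction space, S: states, A: actions, O: observations,
% T(s' \mid s, a): transition function, R: (binary) task reward

J(\theta) = \mathbb{E}_{u \sim \mathcal{U},\; \tau \sim \pi_\theta(\cdot \mid u)}
            \Big[ \textstyle\sum_{t} r_t \Big]
% maximize the expected cumulative reward of trajectories
% \tau = (s_0, a_0, o_0, s_1, \dots) generated by the policy \pi_\theta
```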
Unlike value-based approaches that estimate a
07:14
reward indirectly, policy gradient directly ascends the gradient ∇θ, updating the parameters of π_θ. It integrates PPO, REINFORCE++, GRPO, everything. Great. And now the hero, the new element, the new framework:
07:32
we have a decoupled architecture with three modules, the environment, the agent, and the training, all connected via standardized HTTP protocols for scalability. The environments are realistic.
We have
07:48
web navigation, deep search, digital games, scientific tasks. The agent handles the multi-turn reasoning, planning, and reflection, and the training module implements diverse reinforcement learning pipelines with
08:03
parallel rollouts. Great.
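Just to make this decoupling concrete, here is a minimal sketch of what an agent loop talking to a separate environment server over HTTP could look like. The endpoint names, payload fields, and the call_llm() helper are my own illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of the decoupled setup: an agent process talks to a
# separate environment service over HTTP. Endpoint names, payload fields and
# the call_llm() helper are assumptions for illustration only.
import requests

ENV_SERVER = "http://localhost:8000"  # assumed address of an environment service


def call_llm(messages):
    """Placeholder for the agent module's LLM call (reasoning, planning, reflection)."""
    raise NotImplementedError


def run_episode(instruction, max_turns=5):
    # create a fresh environment instance for this rollout
    env_id = requests.post(f"{ENV_SERVER}/create", json={"task": instruction}).json()["env_id"]
    obs = requests.get(f"{ENV_SERVER}/observe", params={"env_id": env_id}).json()["obs"]
    history = [{"role": "user", "content": f"{instruction}\n{obs}"}]

    for _ in range(max_turns):                      # the interaction horizon
        action = call_llm(history)                  # multi-turn reasoning lives in the agent module
        step = requests.post(f"{ENV_SERVER}/step",
                             json={"env_id": env_id, "action": action}).json()
        history += [{"role": "assistant", "content": action},
                    {"role": "user", "content": step["obs"]}]
        if step["done"]:
            return step["reward"], history          # binary task reward at the end
    return 0.0, history                             # ran out of turns, no reward
```

In the real framework the training module would collect many such rollouts in parallel and feed them to the RL update; the point here is only the separation of concerns over HTTP.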
Okay, we come back to the wilderness here. The agent might exploit shortcuts early, thereby failing or missing broader strategies, or explore too widely, leading
08:22
in total to an unstable optimization problem under the long horizon, if you execute multiple steps. And they tell us this is where the ScalingInter-RL method emerges as the agent's evolving
08:38
compass. Okay.
So this new methodology progressively scales the maximum interaction turns. This means the horizon during the training.
Okay. We understand immediately.
Now, in the early stages it's restricted here to only five
08:54
steps. So it emphasizes exploitation, building more reliable basic skills.
Let's say up to five turns. And then, as the training progresses, it incrementally increases the horizon to 20-plus turns, shifting towards
09:10
exploration to uncover diverse behaviors like multi-step planning or error correction, reducing the risk that the model collapses completely. And this is the main problem we have currently with reinforcement learning.
The model collapses simply mathematically. Yes,
09:26
this aligns the policy depth with the capability. Start from short trajectories for efficient learning, baby steps, then extend to richer ones, balancing the variance in the gradients. And the experiments confirm it:
09:44
it outperforms the fixed-horizon baselines across 27 tasks, and they have all the different models. Beautiful.
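A tiny sketch of the scheduling idea, assuming a simple step-based curriculum; the phase boundaries and horizon values are illustrative (they echo the 5/10/15/20 example discussed later), not the paper's actual hyperparameters.

```python
# Toy sketch of the ScalingInter idea: grow the allowed interaction horizon as
# training progresses. The phase boundaries and horizon values below are
# illustrative assumptions, not the paper's actual hyperparameters.
def max_turns(training_step, phases=((0, 5), (100, 10), (200, 15), (300, 20))):
    """Return the maximum number of agent-environment turns allowed at this training step."""
    horizon = phases[0][1]
    for start_step, h in phases:
        if training_step >= start_step:
            horizon = h   # early phases: short horizon -> exploit basic skills
    return horizon        # later phases: long horizon -> explore multi-step plans

# e.g. max_turns(0) == 5, max_turns(150) == 10, max_turns(350) == 20
```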
And they tell us yes we can have the same level of performance. Beautiful
10:00
agents can conquer the wilderness of the long-horizon task, offering a logical blueprint for future advancements in intelligent decision-making. Great.
And now you might say, okay, what about Google AI Studio here? This is the identical prompt.
This is the identical
10:16
PDF. I have temperature at one Gemini 2.5 Pro.
I have set the thinking budget to about 28K. Let's do it.
So in real time we have here the start, you just have to log in. I am not on any paid subscription, nothing at all. We have here
We have here
10:34
syncing model is syncing presenting the overview. Let's have a look.
Okay, this is very general thinking process. And here we have now step one and step two.
So let's have a look at step one. Yeah, we can get rid of
10:50
this. Step one, the general picture: it introduces AgentGym-RL, a new comprehensive software framework for LLM agents to solve complex multi-step problems in various simulated real-world environments.
Such a nice sentence. The
11:07
key idea is to move beyond simple single-command interactions and teach those agents to make a series of intelligent decisions over a long sequence of steps. Much like in the human world, the researchers also proposed a novel
11:23
training methodology called ScalingInter-RL that acts like a curriculum. Beautiful explanation: starting the agent with shorter, simpler tasks and gradually increasing the length and the complexity of the interaction.
This approach helps the
11:39
agent learn foundational skills and prevents it from getting overwhelmed leading to more stable and more effective training. Nice.
Plus, they tell us even a moderately sized open-source LLM with 7 billion
11:55
trainable parameters can be trained to match or even outperform a GPT-4o or a Gemini 2.5 Pro on the 27 different tasks they evaluated in the paper. You see, a much nicer summary.
12:10
Now, step two, technical deep dive. Imagine a team of brilliant AI researchers.
Yes, those are my subscribers. Now, they look at the current landscape of AI and see a fascinating new frontier: autonomous LLM agents.
Okay.
12:26
Digital beings that can actively do things: navigate websites, play games, or even conduct virtual science experiments. The dream is to create an agent that can receive a complex goal from a human, like find the full names of repos I contributed to that have over 10 or 100
12:43
stars, and intelligently interact with its environment step by step until the task is complete. Okay.
And now the story. Now we have a storyboard. And now, okay,
12:59
the researchers found a problem. They are excited about the agents, but
there is no standardized way to train them effectively. Most methods rely on pre-existing data, expert solutions from supervised fine-tuning, which are expensive to create and limit the agent's ability to discover new
13:17
better strategies with unseen data. So what they really wanted was to let the agent learn through trial and error, again trial and error, but with a search tree, not just pure trial and error, like a person learning a new skill.
Reinforcement learning where the
13:34
agent gets a reward for good actions and punishment for bad ones. The existing tools are fragmented.
Absolutely. Training an agent for web navigation was completely different from training an agent for a pure scientific task.
There was no unified flexible
13:50
gymnasium where those agents could practice and learn across a wide variety of challenges. And now: building the ultimate playground, AgentGym-RL. So the team of researchers behind the publication decided to build it.
They called it AgentGym-RL.
14:07
Three parts: the environment itself, a collection of diverse, realistic digital worlds.
We have WebArena. We have controlling a robot or a scientific lab.
14:22
Then we have the agent itself, with the LLM plug-and-play, and the training module. Beautiful.
And this is where the new reinforcement learning algorithm comes into play. Absolutely.
The training dilemma: too much freedom is chaos.
So the researchers put
14:39
their first agent in the gym. A complex task with a long interaction limit.
Say the agent was allowed to do 50 steps. Whatever the agent wanted to do, they just limited the number of steps the agent was allowed.
But the results were disastrous. The agent would wander
14:54
around aimlessly, repeat the same mistakes over and over again, and eventually the performance would collapse. Great.
This is the normal RL training exercise. So, it was like dropping a toddler in the middle of a sprawling city and expecting them to
15:11
find their way home. The vast exploration space was just too overwhelming.
And now, what they discovered: if they gave the agent just very short tasks, maximum five steps, the agent learned quickly but never developed the ability to
15:26
plan for the long term. So they were just doing baby steps.
It mastered the basics but could not handle complex multi-stage problems. This is the classic exploration-versus-exploitation trade-off that you currently have in
15:42
reinforcement learning training scenarios. And the masterstroke of this new publication, a new curriculum learning, plus the new breakthrough idea, was ScalingInter-RL: learning should be progressive, as a
15:57
curriculum, something we are familiar with. This is nothing new, just implemented in a different way.
They started the agent training with a very short leash, allowing only a small number, let's say five, of interaction turns. The agent
16:12
was forced to learn the most direct and efficient ways to solve simple tasks. It was all about exploitation, mastering fundamental skills.
The agent got better. They have here more complex multi-step strategies.
And yeah, absolutely clear.
16:31
And as I told you, 7 billion parameters could outperform a much bigger model. Okay, a different storytelling.
You tell me if you prefer it or not. And then, as I told you, there was not enough mathematics in this.
So I
16:48
now say: explain the new scaling methodology in pure mathematical terms for me to understand the main principle, and maybe provide a simplified example. So, the thought process here by Grok 4: nothing.
The only thing is that it goes to
17:05
the GitHub. Nice. So let's have a look.
Mathematical formulation: a progressive horizon scaling algorithm for reinforcement learning, modeled as a partially observable Markov decision process. You have U the instruction space, S the state space, A the action
17:21
space, O the observation space, T is here the transition function, and R is the binary reward function indicating task success or failure. Standard procedure, nothing new.
The agent policy π_θ, parameterized here by the parameters θ,
17:36
generates trajectories, you remember, state zero, action zero, observation zero, and then state one, and yes, you got it, of length up to N. This is the interaction horizon, like N = 5 or N = 20, where we have the actions a, and here
17:54
beautiful, the objective is to maximize the expected reward, optimized via a policy gradient methodology such as PPO from 2017 or the group-relative GRPO, with a gradient ∇θ and a parameter update.
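For reference, the standard policy-gradient update being referenced looks like this, in textbook form with the trajectory length capped at the horizon H; nothing here is specific to the paper:

```latex
% Standard policy-gradient (REINFORCE-style) update, trajectories capped at horizon H
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta,\; |\tau| \le H}
    \Big[ R(\tau) \sum_{t=0}^{|\tau|-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big],
\qquad
\theta \leftarrow \theta + \eta \, \nabla_\theta J(\theta)
```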
18:09
Great. The core innovation, now comes something new, is to dynamically scale the maximum horizon h at the training step t, starting with a small h to prioritize exploitation.
And this means now in mathematical terms
18:25
this is it: low-variance gradients for basic skill acquisition, and later increase the horizon to promote exploration, higher-variance gradients for more diverse behavior, go out and explore.
18:42
This addresses the optimization instability from the mathematical point of view. The short horizon reduces the gradient variance but limits the behavioral complexity at the very first step, while the long horizon enables a richer pattern for planning, for a
18:59
deeper strategy, for a deeper understanding of the complexity, but risks policy collapse because the sparse rewards are sampled under a length constraint. Yes, beautiful, and this is what we know. So
19:14
the main principle. Let's have a look at the mathematical explanation by Grok 4: a staged adaptation in RL optimization.
The initial restriction of the horizon minimizes the variance in ∇θ J(θ) by favoring short, high-reward
19:31
trajectories. This enables for sure some stable convergence to a local optimum for simple subtasks, and then we progressively increase the horizon.
Let's say from five to 20 steps in the trajectory. This introduces a controlled
19:48
variance, encouraging the policy to explore longer paths and escape shortcuts, thus balancing the exploration-to-exploitation trade-off. This yields more robust gradients over time.
This is
20:03
what we wanted. And the policy refines its behavior hierarchically: at first baby steps to learn the basic actions, then we build composite strategies
20:19
without the normal instability of a fixed long horizon. Nice.
Here we have a simplified example: a linear-chain Markov decision process with states. Yes.
So the actions are just forward or stay, and the transitions are
20:35
forward to s plus one, or stay at s, and a reward. Beautiful. The optimal policy is to always move forward, but with a sparse reward, a fixed long horizon leads to high-variance gradients and slow learning. And now, to give you here a
20:52
beautiful example, look, now the horizon is defined as 5, 10, 15, or 20, with a delta of 100. This means we increase it every 100 steps.
So they give you now, for t = 0, the value, at t = 100
21:08
the value, and at t = 300 the value, and they show you that it simply works. Real nice. 5 seconds of thinking here by Grok 4, total response time 66 seconds by Grok 4.
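If you want to play with this toy example yourself, here is a minimal sketch along the lines Grok 4 describes; the chain length, reward value, and random policy are my own toy choices, not taken from the paper.

```python
# Minimal linear-chain MDP along the lines described above: start at state 0,
# actions are "forward" (s -> s+1) or "stay" (s -> s), and the only reward is a
# binary one for reaching the goal before the horizon runs out. Chain length,
# reward value and the random policy are toy choices for illustration.
import random

GOAL = 5  # reach state 5 to collect the single sparse reward


def rollout(policy, horizon):
    """Run one episode under the given horizon; return (reward, trajectory)."""
    s, trajectory = 0, []
    for _ in range(horizon):
        a = policy(s)                        # "forward" or "stay"
        s = s + 1 if a == "forward" else s
        trajectory.append((s, a))
        if s == GOAL:
            return 1.0, trajectory           # binary reward only at the goal
    return 0.0, trajectory                   # horizon exhausted, no reward signal


random_policy = lambda s: random.choice(["forward", "stay"])

# Under a short horizon only short, direct rollouts ever see the reward, so each
# gradient estimate is built from few actions (lower variance); under a long
# horizon the sparse reward gets spread over many more actions per trajectory,
# which is the higher-variance, less stable regime discussed above.
print(sum(rollout(random_policy, 5)[0] for _ in range(1000)) / 1000)
```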
But I think a very nice explanation of the
21:25
mathematical formulation. I would now go a step deeper and say: hey, can you explain this in much more detail, give me a step-by-step mathematical deduction of each and every term, go deeper, maybe it's not even in the paper itself. But nice. Now, you notice that I copy this and
21:44
I just go over here and I just paste this in and I run this, you're not going to believe it, with Gemini. Let's see how Gemini interprets this now in pure mathematical terms for me to understand. So here we have
22:02
clarifying the trajectory length, a little bit more of thinking, but it's not really a thinking trace. It is only a very meta exploration.
Oh, it's taking some 25 seconds. Come on.
22:19
Okay, let's have a look at it. The new principle in mathematical terms.
What is the standard reinforcement learning objective? The policy π_θ is a function of the state s and chooses an action a, interacts with the environment; a trajectory is a record of states and actions, beautiful, we know this; it receives
22:35
a total reward, we know this; the objective function is written down, the goal is to find the perfect parameters, standard. So now the modification: this new methodology challenges the idea that K should be a large fixed number
22:51
from the start. So it's curriculum learning, nothing else. It's always the same ideas.
It introduces a dynamic schedule for the interaction horizon, the maximum allowed trajectory length, as a variable that increases over the course of training. Ah, it creates a
23:08
curriculum. You see, it's beautifully explained.
Let's define a maximum interaction horizon, constrain the trajectory sampling, a phase-dependent objective function, standard, a curriculum schedule, standard. We transform a single complex long-horizon
23:23
optimization problem into a sequence of simpler, shorter-horizon problems that approach the complexity of the final task. You see, also here we have a high-complexity task with a long-horizon optimization, with a complex reasoning
23:39
chain. So what do we do?
We break it up into pieces. It's always, in AI, the same stupid idea.
I just decompose a higher complexity into multiple lower-complexity entities. And hopefully I'm allowed to do this, because otherwise I'm really in deep trouble.
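Written out, Gemini's description boils down to something like this; the notation is mine, a sketch rather than the paper's exact formulation:

```latex
% Curriculum over interaction horizons (my own notation, illustrative):
% pick an increasing schedule of horizons, e.g. 5 -> 10 -> 15 -> 20,
H_1 < H_2 < \dots < H_K,
% and in phase k optimize the same reward, but only over trajectories
% whose length is capped at H_k:
J_k(\theta) = \mathbb{E}_{\tau \sim \pi_\theta,\; |\tau| \le H_k}\big[ R(\tau) \big],
\qquad k = 1, \dots, K .
```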
23:56
So, a simplified example. Given the pure mathematical part, I think Grok 4 did a better job.
Okay. But now the lost robot as an example.
A simple 10 x 10 grid to find a charging station
24:14
for a robot. So what do we have? An environment, 10 by 10, an agent that can move up, down, left, and right, like in chess, to reach a charging station defined at the (9, 9) x and y coordinates. There's a wall in the middle
24:29
of the grid. There's a short, direct path with eight steps and a much longer path that goes around the wall with 18 steps.
Okay. Phase one, the toddler phase.
Oh, baby steps. No,
24:44
the robot is only given a maximum of 10 moves per attempt. It tries to explore the long path around the wall.
It runs out of moves and fails, because the reward given is zero. It's a binary reward at the end of all the actions. So we have no idea which of those 10 steps were
So we have no idea what of those 10 steps were
25:00
great genius or simply a complete failure. Learning outcome.
The robot is forced to discover the short eightstep path because it's the only reliable way to get a positive reward within the 10step limit. It masters the basic skill of direct
25:17
navigation. This is a bad example because I want not that it immediately finds the most complex eightstep path because this should be an iterative uh approach.
So that you argue with baby step I find immediately the best
25:34
solution. This is not a good example.
Sorry Gemini, I'm not happy with this. No.
So you see for this particular reason I prefer to go here with GR for
25:52
it's not really elegant. It's not really where I would say hey this is really a challenging exploration but at least it gives you a basic corate understanding because Google know this is not here really helpful in your
26:09
exploration. you did not understand the complexity of this search strategy for more complex strategies.
However, it's given here if you go for the storytelling part quite here
26:26
a nice answer here explain the standard objective and then the modification also not really. So you see for this particular exercise for my default prompt I would go and try to learn here
26:42
understand here grog 4 gives me here for my particular needs a better explanation but you know nothing is final until we really seen the real stuff the real thing so let's have a look at the paper did we miss out on anything here in the
26:58
paper where were the hallucination by rock for where were the hallucination by gemini what was wrongly presented at all. So abstract beautiful.
We have the code on GitHub and we have a project page, as I showed you before. Plus we see now
27:16
the overall accuracy of the model and here the size of the model. And if you go here, you see this red star is the new one that we are talking about.
Oh, look, even when it's very small, you have an excellent overall performance that outperforms all the
27:32
other models. So great.
So really, a small open-source model with an open-source methodology can, if you want, push a small open LLM to really perform like the proprietary models. Nice
27:49
intro. Great.
Oh yeah. Okay.
This is the overall schema of the AgentGym-RL framework. Now we understand what is happening.
We have our agent as a partially observable Markov decision process. Plus we have
28:04
here, if you want, the gymnasium where we have multiple training stations like web complexity, science complexity, search complexity, games, pure gaming and coding, and embodied tasks. And these are the stations that our agent,
28:20
our athlete, has to learn, has to run through, experience it. And then we have the trajectory length, and we have different policy optimization updates. And great, now we understand what's happening.
28:36
Formulation. Yeah, a partially observable Markov decision process here.
A very short policy gradient exploration. Nice.
Here's our nabla. Great.
And now the AgentGym framework overview. Okay, here we have some pseudocode.
Not
28:53
really helpful. Yeah, this is not really helpful.
Okay, we have the three elements. The environment, the agent, and the training.
Got it. Features: diverse scenarios.
Okay. Okay.
Web navigation, deep search, digital games, scientific tasks. Nice.
Comprehensive
29:11
algorithms: PPO, GRPO, REINFORCE++, supervised fine-tuning, DPO. Great.
We know this. An overview of the visualized user interfaces, scalability, reliability. Great.
And here we have now this new methodology, the ScalingInter reinforcement
29:26
learning approach. And now I understand what these gadgets mean here, because otherwise I would not be able to understand what we are talking about just by looking at this particular visualization.
So you see AI can really
29:42
be your tutor, a helpful assistant to augment your understanding of the visualization. Great.
Then ScalingInter-RL here. Now the detailed explanation, a little bit on the short side, but it is
29:59
simple. Okay.
And then the experiments, and they run all the experiments even with an OpenAI o3. GPT-5, I suppose, was not available for free for them.
Settings, scenarios, overall results. Here we go.
30:16
Scaling post-training. Yeah.
Here we go. WebArena.
Here's the benchmark now for WebArena, the first station in our gymnasium. So what do we have here?
Okay, you see the last three lines are
30:33
the new lines that we talked about. So we have here the proprietary models, from GPT-4o and o3 to Gemini 2.5 Pro, and the performance data, and then we have the open-source models that are greater than 100 billion,
30:49
like Qwen3 or DeepSeek, and then the open-source models less than 100 billion: 4 billion, 8 billion, 32 billion, up to 70 billion, and Llama 3.1, my goodness, this is still in operation.
31:06
Okay, so what we see here: we do have some bold indicators here for our new methodology. Now, if you compare this to an o3, that is identical performance.
If you compare this 20 here, okay, o3 is
31:24
significantly better. If you compare those 30 here, okay, o3 is significantly better.
But hey, for a 7B model, not bad at all. But let's go now for Deep Search.
This is also interesting. Now how does it
31:39
change if you look at Deep Search? You know, these are our last three rows: 52.
Yeah, super performance compared to 28 by o3. 70, absolutely the identical level as an o3 or a GPT-4o. 46,
uh, o3 is better
31:58
with 56. 42, uh, o3 is better with 46.
But remember, this is a 7B model; o3, I don't know the size, but it's huge.
So this is great. Overall we have here a 38 compared to a
32:15
49 with o3. Excellent.
Come on. It even outperforms here a Gemini 2.5 Pro that has 36.
And here we have 38 with a 7B model. So there is a way how you can
32:30
push your smaller models here into the higher league of high-performance systems. TextCraft, not really interesting.
BabyAI, not really interesting. ScienceWorld.
Yeah, ScienceWorld. Okay, let's go.
32:47
Do ours have a chance at all at anything scientific? Yes.
Look, open source 33 to 47 with o3. Okay, o3 is better.
But look at this: 59 compared to 25 with o3.
33:03
Beautiful, 88 compared to an o3 of 56. Beautiful.
Overall, we have 57. And this is better than everything else we have here.
This looks here really interesting. We should have a deep dive here in everything in
33:20
the explanation. A little bit of a graph, a case study.
Okay, beautiful: WebArena, base model, and here now the demonstration itself, related works of course, conclusion and further work, this is what they achieved, real nice. Then we
33:38
have all the references, but the most important part, you know this, is now the annex, where is it, the appendix here: details on the architecture. If you really want to learn about it, I highly recommend those
33:53
pages. And you have here all the different prompts and everything. Beautiful.
So you see, we now have, hopefully, a much better understanding of this new reinforcement learning methodology from our friends in China. I've shown you
34:10
how you can use, in a very simple case, two different LLMs. They are really augmenting my learning, maybe also your learning, for literature, scientific literature, whatever science
34:25
topic you're interested in. Give it a try, test out the default version of multiple LLMs, and then you will see what you like, the style that you like, and then you can enforce it with specific prompts. But don't start with a hyper-predefined
34:43
prompt structure with a tight template; it does not work out. Try it here in the default version.
See what the base version is, the version that you personally like for your task, for your complexity, for your domain knowledge. And then, if
35:01
you have found this single model, in this case it would be Grok 4 for me, then we can optimize the prompt to bring out more mathematical details, a higher level of mathematical deduction from step one to step two, and
35:16
whatever you prefer. So I hope I've shown you here a very simple way to understand scientific literature, a playful approach to learning with AI systems.
Everything was free. You don't have to pay anything.
You just need a
35:32
browser. This is it.
And you can improve your learning experience significantly. Take a chance and, hey, please give me your feedback, some response, whatever you experienced, if you are interested in some other topics.
Comments are open
35:49
for you. I hope I see you in my next video.