What Are Large Reasoning Models (LRMs)? Smarter AI Beyond LLMs


Category: AI Technology

Tags: AI, Inference, Models, Reasoning, Training

Entities: Distillation, Large Language Models, Large Reasoning Models, LLMs, LRMs, Reinforcement Learning


Summary

    Introduction to Language Models
    • Large Language Models (LLMs) predict the next token in a sequence using statistical patterns.
    • Large Reasoning Models (LRMs) extend LLMs by planning and evaluating before generating responses.
    Advantages of LRMs
    • LRMs can handle complex tasks requiring multi-step logic and abstract reasoning.
    • They offer improved decision-making through internal verification and deliberation.
    • LRMs require less prompt engineering as they inherently think step by step.
    Challenges of LRMs
    • LRMs incur higher computational costs and increased latency due to their reasoning processes.
    • They require more VRAM and energy, and result in higher cloud service costs.
    Training LRMs
    • LRMs build upon LLMs with massive pre-training on diverse data sources.
    • They undergo reasoning-focused tuning using logic puzzles and multi-step problems.
    • Reinforcement learning and distillation are used to enhance reasoning capabilities.
    Inference Time Considerations
    • Different questions can be assigned varying amounts of thinking time during inference.
    • Extended inference time allows for multiple chains of thought and external checks.
    Actionable Takeaways
    • Consider using LRMs for tasks requiring complex reasoning and multi-step logic.
    • Be aware of the higher computational costs associated with LRMs.
    • Utilize LRMs for improved decision-making and nuanced answers.
    • Understand that LRMs reduce the need for detailed prompt engineering.
    • Balance the cost of accuracy and latency with the benefits of deeper reasoning.

    Transcript

    00:00

    You already know large language models, or LLMs. They predict the next token in a sequence, using statistical pattern matching to crank out human-like text.

    There's also LRMs, large reasoning models,

    00:19

    and they go a bit further. They think before they talk.

    Now, give an LLM a prompt and it will projectile-predict whatever word statistically fits next. It will output a token, and then another token, and then another token

    00:40

    and LRMs, they still do that too. But first they sketch out a plan.

    They weigh options and they double-check calculations in a sandbox before building their response. So before they start outputting tokens, they will plan, they will evaluate

    00:59

    what comes back, and eventually that will lead to an answer. And those extra steps, they can matter.
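
    As a purely illustrative sketch of that difference, here is some Python pseudocode; the helper callables (next_token, draft_plan, evaluate, revise, write_answer) are hypothetical stand-ins, not any real model's API.

```python
# Contrast, in miniature: a plain LLM reflexively emits the next statistically
# likely token, while an LRM-style loop drafts and checks a plan before answering.

def llm_answer(prompt, next_token, max_tokens=256):
    """Plain LLM: output whatever token statistically fits next, one at a time."""
    tokens = []
    for _ in range(max_tokens):
        tok = next_token(prompt, tokens)   # statistical next-token prediction
        if tok == "<eos>":                 # stop once the model signals it is done
            break
        tokens.append(tok)
    return "".join(tokens)

def lrm_answer(prompt, draft_plan, evaluate, revise, write_answer, max_revisions=3):
    """LRM-style: sketch a plan, self-check it, and only then write the reply."""
    plan = draft_plan(prompt)                    # think before talking
    for _ in range(max_revisions):
        ok, feedback = evaluate(prompt, plan)    # weigh options, double-check work
        if ok:
            break
        plan = revise(plan, feedback)            # discard dead ends and try again
    return write_answer(prompt, plan)            # tokens still come out one by one,
                                                 # but now guided by the vetted plan
```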

    Now, if your question is to write a fun social media post, well then an LLM's reflex is usually fine. But if your question is

    01:20

    debug this gnarly stack trace, or perhaps it's trace my cash flow through four different shell companies? Well, reflex isn't enough.

    The LRM's internal chain of thought lets it test hypotheses, discard dead ends, and land on a reasoned answer, rather than just following a

    01:39

    statistically likely pattern. Now, of course, this doesn't come for free.

    It costs inference time and GPU dollars. Each extra pass through the network, each self-check, each search branch, it all adds latency and processing time.

    So LRMs, they buy you deeper reasoning at the

    01:59

    cost of a longer, pricier think. So how do you build one of these thinking machines?

    Well, an LRM usually builds upon an existing LLM that has undergone massive pre-training. So this is the stage where we

    02:20

    teach a model about the world. So billions of web pages, books, code repos, and the like.

    And this gives it language skills and a broad knowledge base. And then after the pre-training, an LRM undergoes specialized reasoning-focused tuning.

    So

    02:40

    we're now going to fine-tune the model specifically to provide reasoning capabilities. So an LRM, it's fed curated datasets of logic puzzles, multi-step math problems, and tricky coding tasks.

    And each one of these examples comes with a full

    02:59

    chain-of-thought answer key, and the model learns to show its work. So it basically starts with a problem that it's been given.

    And from that problem, its job is to come up with a plan for a solution. Once it's come up with a plan, it needs to execute that plan, which will be

    03:20

    in multiple steps. So we might go to step one, step two, and so forth.

    And then ultimately the model needs to arrive at a solution. It's learning to reason.
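
    As a hypothetical illustration of what one such training example might look like (the exact schema here is an assumption for illustration, not taken from any specific dataset):

```python
# One reasoning-focused fine-tuning example: problem, plan, worked steps, and
# final solution flattened into a single supervised target, so the model learns
# to show its work rather than jump straight to an answer.

cot_example = {
    "problem": "A train travels 120 km in 1.5 hours. What is its average speed?",
    "plan": "Use speed = distance / time, then divide.",
    "steps": [
        "Step 1: distance = 120 km, time = 1.5 h.",
        "Step 2: speed = 120 / 1.5 = 80.",
    ],
    "solution": "80 km/h",
}

def to_training_text(example):
    """Serialize the chain-of-thought answer key into one training string."""
    return "\n".join([
        f"Problem: {example['problem']}",
        f"Plan: {example['plan']}",
        *example["steps"],
        f"Solution: {example['solution']}",
    ])

print(to_training_text(cot_example))
```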

    Then we let the model loose trying to solve some fresh problems of its

    03:40

    own. And that's where it goes through a process of reinforcement learning.

    Now, that uses a reward system where either humans, through reinforcement learning from human feedback, or RLHF, give a thumbs up or thumbs down for each

    04:00

    one of these steps as they're written. Or the reinforcement learning can come from smaller models that are really judging models, like process reward models. And process reward models

    judge each step of a reasoning chain as good or as bad. And the reasoning

    04:20

    model learns via this reinforcement learning to generate sequences of thoughts that maximize these thumbs-up rewards, ultimately improving its logical coherence.
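
    As a rough sketch of that per-step judging idea (the judge callable below is a hypothetical stand-in for either a human rater or a process reward model):

```python
# Score a reasoning chain step by step and average the judgments; that average
# is the reward the reasoning model is trained to maximize.

from typing import Callable, List

def chain_reward(steps: List[str], judge: Callable[[str], float]) -> float:
    """Average per-step judgments: higher means a more coherent chain of thought."""
    if not steps:
        return 0.0
    scores = [judge(step) for step in steps]   # thumbs up ~ 1.0, thumbs down ~ 0.0
    return sum(scores) / len(scores)

# Toy judge for illustration only: it rewards steps that show an explicit calculation.
toy_judge = lambda step: 1.0 if "=" in step else 0.5

reasoning_chain = ["Let x = 4.", "Then 2 * x = 8.", "So the answer is 8."]
print(chain_reward(reasoning_chain, toy_judge))  # 0.833...
```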

    Now, there are some other training methods that can be used as well. For example, we might choose to use something called

    04:40

    distillation to train the model further. And that's where we have a larger teacher model that's used to generate reasoning traces.

    And then those reasoning steps are used to train a smaller model or a newer model on those traces. So basically, if the advanced teacher model can solve

    04:59

    a puzzle by thinking through a solution, that solution path can then be added to the training data of the new LRM model. And the result of all of this is a model that can plan, that can verify, and that can explain.
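
    A minimal sketch of that distillation loop, assuming we can ask a teacher model for a full reasoning trace and fine-tune a smaller student on text pairs (teacher_solve and finetune_student are hypothetical stand-ins, not real library calls):

```python
# Distillation for reasoning: collect the teacher's solution paths, then train
# the student model on those traces.

def build_distillation_dataset(problems, teacher_solve):
    """Have the teacher think through each problem and keep its reasoning trace."""
    dataset = []
    for problem in problems:
        trace = teacher_solve(problem)                    # teacher's chain of thought
        dataset.append({"prompt": problem, "target": trace})
    return dataset

def distill(problems, teacher_solve, finetune_student):
    traces = build_distillation_dataset(problems, teacher_solve)
    return finetune_student(traces)                       # student learns those solution paths
```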

    Ready to finally make sense of those shell company cash flows. So

    05:19

    the LRM is trained to think. And now the question is how much thinking time do you give it at runtime?

    Well, that's a question all about inference-time compute, or test-time compute as it's sometimes called. This is what happens every time you ask a question.

    And different questions can be

    05:38

    assigned different amounts of thinking time. So debug my stack trace.

    That might get a good amount of compute allowance, while write a fun caption, that kind of gets the budget version where the model just goes through one quick pass. And during extended inference time, a model may run

    05:55

    multiple chains of thought. Then it might vote on the best one.

    It might backtrack with a tree search if it hits a dead end, and it might call external stuff like a calculator or a database, or a code sandbox for spot checks. And each extra pass through the model, well, it comes

    06:13

    at a cost. It comes at a cost of more compute that is needed.

    It comes at a cost of how long you're going to be waiting for a response, with higher latency. But hopefully this all also comes with higher accuracy.
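
    One common pattern here is self-consistency style voting; the sketch below assumes a hypothetical sample_chain callable that runs one full chain of thought, with n_chains acting as the thinking budget you choose to spend on a given question.

```python
# Extended inference-time compute in miniature: sample several independent
# chains of thought and majority-vote on the final answer.

from collections import Counter

def answer_with_budget(question, sample_chain, n_chains=5):
    """More chains = more compute and latency, but hopefully higher accuracy."""
    answers = []
    for _ in range(n_chains):
        chain = sample_chain(question)            # one complete pass / chain of thought
        answers.append(chain["final_answer"])     # keep only its final answer
    return Counter(answers).most_common(1)[0][0]  # majority vote across chains

# "Write a fun caption" might get n_chains=1 (the budget version), while
# "debug this gnarly stack trace" might justify n_chains=10 or a tree search.
```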

    06:34

    So is this bump in accuracy worth the cost to get it? Well, it depends on the problem you're trying to solve.

    Now, on the positive side, an LRM offers complex reasoning. LRMs excel at tasks that require multi-step logic, planning, or abstract

    06:54

    reasoning. They also offer improved decision-making, because LRMs can internally verify and deliberate, which means that answers tend to be a bit more nuanced and hopefully more accurate. And LRMs, they usually require less in the way of prompt engineering.

    We don't need

    07:13

    to sprinkle magic words into our prompting like let's think step by step, because the model already does it. That's less prompt hackery. But you might be better off with a regular LLM or just a smaller model overall in some situations, because as I've mentioned, there

    07:33

    is that higher computational cost. That means more VRAM, more energy.

    A higher invoice from your cloud provider. And then there's also the increase in latency.

    Slower replies while the model stops to think. Although I'm endlessly kind of

    07:53

    amused by reading those replies, the model's thinking steps as it works through building a response. But that's probably just me.

    So, look, with LRMs, AI models are no longer just spewing language out at you as fast as they can predict the next word in a sentence,

    08:09

    they are taking time to think through responses. And today, the most intelligent models, the ones scoring highest on AI benchmarks, well, they tend to be the reasoning models, the LRMs.