Context Rot: How Increasing Input Tokens Impacts LLM Performance

Watch on YouTube

Category: AI Research

Tags: context, LLM, models, performance, tokens

Entities: Llama 4, Chroma, Gemini, GPT, Kelly, LongMemEval



Transcript

00:00

Hi, I'm Kelly, and I'm a researcher at Chroma. Today I'll be talking about context rot: how increasing input tokens degrades LLM performance. You've probably heard a lot about longer and longer context windows in new model

00:16

releases. For instance, the newer Gemini and GPT models support up to 1 million tokens, with Llama 4 supporting up to 10 million tokens.

To give some perspective, these four books combined are about a million tokens, which might be more than you'd expect.

00:32

As these models achieve near-perfect scores on the well-known needle-in-a-haystack benchmark, it's natural to assume that they can reliably handle any long input for any task you give them. But there's a reason why models do so well on this.

Needle-in-a-haystack is a simple task that doesn't require much processing. It's

00:50

essentially an identification task in which a random fact is placed in the middle of a long document. Then the model is asked to identify it.

Oftentimes this task is designed with lexical matches between the question and the needle. For example, if we have the following question: What was the best writing advice I got from my college

01:06

classmate? A needle could be, "The best writing advice I got from my college classmate was to write every week,"

which means that the model just needs to perform a simple lexical match, not reason through ambiguity or do any deeper processing. But in practice, these models have to deal with more

01:22

complicated tasks beyond lexical matching. Once we introduce slight challenges like ambiguous needle question pairs or distractors, performance starts to degrade with increasing input length.
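As a rough sketch of what this kind of task looks like in code, assuming made-up filler text and the writing-advice example from above rather than the actual benchmark data, a needle-in-a-haystack prompt could be assembled like this:

```python
# Minimal sketch of a needle-in-a-haystack prompt (illustrative placeholders,
# not the actual benchmark data or filler corpus).
filler = "The weather was mild and the streets were quiet that afternoon. " * 2000
needle = "The best writing advice I got from my college classmate was to write every week."
question = "What was the best writing advice I got from my college classmate?"

# Place the needle roughly in the middle of the long, irrelevant haystack.
midpoint = len(filler) // 2
haystack = filler[:midpoint] + needle + " " + filler[midpoint:]

prompt = f"{haystack}\n\nQuestion: {question}\nAnswer:"
# The model only needs to locate the one sentence that lexically matches the question.
```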

The main point you should take away from this video is that models struggle as input length increases, even on tasks

01:39

they can handle perfectly well at shorter lengths. Let's walk through some of our key findings.

First, we demonstrate that models struggle with reasoning over long conversations. Consider a simple use case where you're building a chat assistant with memory.

A user has a multi-session conversation

01:55

with your assistant and a few chats ago, they mentioned that they're living in San Francisco. Now, in the current session, they ask, "What are some good outdoor activities for a sunny day?" You want the assistant to remember that the user is living in San Francisco and suggest San Francisco specific

02:11

recommendations without the user having to repeat themselves. One naive approach is to just shove the full chat history into the prompt and hope that it works.

But we demonstrate that this doesn't work well in practice, and you get unreliable outputs. To evaluate this systematically, we use LongMemEval, a

02:27

benchmark designed to test conversational memory over a long context. Each prompt is around 500 messages from both the user and assistant.

Then it ends with a simple question regarding one part of that conversation. The model's task is to find the relevant portion and answer correctly.

To isolate the effect of

02:44

input length, we compare two versions: a full version with all 500 messages, averaging out to around 120k tokens.

Then a condensed version with only the relevant snippets needed to answer the final question. This is around 300 tokens.
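As a simplified sketch of this comparison, assuming a toy conversation with hand-labeled relevance flags rather than the actual LongMemEval data, the two prompt variants might be built like this:

```python
# Sketch of the full vs. condensed comparison (toy data, not the benchmark itself).
history = [
    {"text": "user: I just moved to San Francisco.", "relevant": True},
    {"text": "assistant: Congrats on the move! How are you settling in?", "relevant": False},
    {"text": "user: Pretty well. Also, my favorite color is green.", "relevant": False},
    # ...in the benchmark, roughly 500 messages totaling ~120k tokens...
]
question = "What are some good outdoor activities for a sunny day?"

full_prompt = "\n".join(m["text"] for m in history) + f"\n\nQuestion: {question}"
condensed_prompt = (
    "\n".join(m["text"] for m in history if m["relevant"]) + f"\n\nQuestion: {question}"
)
# Models answer the condensed prompt more reliably, even though the needed
# information is present in both versions.
```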

We can clearly see that models

02:59

perform better on the condensed version which should not be the case if performance is uniform across input lengths. Even the most advanced models struggle to find the right information when too much noise is present.

Another important finding is that ambiguity compounds the challenge of long inputs.

03:15

We can consider a common real world scenario. You're prompting a model to fix a coding bug.

You're unlikely to tell it exactly which lines to look at. Instead, you'll probably give a broader instruction like "figure out what's causing this bug," along with a large chunk of surrounding code.

To evaluate

03:30

this, we run a modified version of needle-in-a-haystack where we vary the level of ambiguity in the needle, quantified by cosine similarity between the needle and the question. We have the same setup.

We give the model a long haystack of content with a needle somewhere in the text and a question at the end.

03:45

For example, if we have the following question: What was the best writing advice I got from my college classmate? A high-similarity needle would be, "The best writing advice I got from a college classmate was to write every week."

A more ambiguous needle would be, "One thing people may not know about me is that I

04:01

write every week. It's the most useful habit I've developed, and it started back in my college days when a random guy in my English course suggested it to me."

We write eight needles of varying levels of ambiguity, and we see that as ambiguity increases, model performance degrades faster.
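For reference, needle-question similarity can be measured with a plain cosine similarity over embedding vectors. The sketch below uses small made-up vectors; in practice these would come from whatever embedding model you use.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings standing in for real model outputs.
question_vec = [0.12, 0.85, 0.03, 0.44]          # "What was the best writing advice..."
close_needle_vec = [0.10, 0.80, 0.05, 0.47]      # near-paraphrase of the question
ambiguous_needle_vec = [0.60, 0.20, 0.55, 0.10]  # topically related but indirect

print(cosine_similarity(question_vec, close_needle_vec))      # high similarity
print(cosine_similarity(question_vec, ambiguous_needle_vec))  # lower similarity
```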

It's important to note

04:17

here that at short inputs, models succeed even with the most ambiguous needle question pairs. This shows that these models are capable of handling ambiguity, but their performance breaks down when input length increases.

Adding on to this, we also show that models struggle with distractors at long inputs. A distractor is topically

04:34

related to the correct answer, but doesn't quite answer the question. Distractors are very common in real scenarios.

For example, going back to our conversational memory example, you might have user messages that are semantically similar, like "I just moved to San Francisco" and "I just traveled to

04:49

New York," which may both surface when you have a location-related query. In the experiment we ran, we used the same question from before: What was the best writing advice I got from my college classmate?

And this needle: "The best writing advice I got from my college classmate was to write every week." And

05:06

we have the following distractor: "The best writing advice I got from a classmate was to write each essay in three different styles.

This was back in high school." It shares similar phrasing and semantic content but answers a different question.
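As a rough sketch, assuming made-up filler text and arbitrary insertion positions, a haystack containing both the needle and this distractor could be constructed like so:

```python
# Sketch of a haystack with both the needle and a distractor buried in filler text.
# Filler content and insertion positions are illustrative choices.
filler = "He spent the afternoon reorganizing his bookshelf by color and size. " * 3000
needle = "The best writing advice I got from my college classmate was to write every week."
distractor = (
    "The best writing advice I got from a classmate was to write each essay "
    "in three different styles. This was back in high school."
)
question = "What was the best writing advice I got from my college classmate?"

third = len(filler) // 3
haystack = (
    filler[:third] + needle + " " + filler[third:2 * third] + distractor + " " + filler[2 * third:]
)
prompt = f"{haystack}\n\nQuestion: {question}\nAnswer:"
# At short lengths, models reliably pick the needle over the distractor;
# as the haystack grows, they increasingly answer from the distractor instead.
```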

When we run this experiment, we see the pattern that at short inputs,

05:23

models are able to disambiguate the correct needle from the distractor. But as input length grows, performance drops even though the task itself hasn't changed.

Finally, we show that these models cannot be treated as reliable computing systems. We would like to rely on LLMs to get consistent quality outputs, especially

05:39

for very simple tasks. Consider a simple program that repeats a string n times.

It will always return the same result regardless of what n is or what kind of string you use. We run the synthetic test on our LLMs.

We prompt the model to replicate a list of repeated words with

05:56

one unique word inserted at a specific position. The task is simple.

Replicate exactly what is given. We score model outputs using Levenshtein distance, which measures how many string edits, insertions, deletions, or substitutions are needed to match the gold reference.
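As an illustration of the setup, with the prompt wording, word choice, and counts here being placeholders rather than the exact experiment, the repeated-words task and its Levenshtein scoring might look like this:

```python
# Build the repeated-words input: one unique word inserted into a run of
# identical words (word choice, count, and position are illustrative).
words = ["apple"] * 500
words[150] = "banana"
reference = " ".join(words)
prompt = f"Replicate the following text exactly:\n{reference}"

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# score = levenshtein(model_output, reference)  # 0 would mean a perfect copy
```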

06:12

We see performance drop starting from as few as 500 words, often caused by the model repeating words beyond what it is given or generating random outputs. This serves as another demonstration that models do not process their context uniformly.

All of this leads to one key point that

06:28

you need to engineer your context to get reliable performance. Technically, you can use up to a million tokens, but in practice, your optimal context window is much smaller.

This becomes an optimization problem. You want to maximize the amount of relevant information and minimize irrelevant

06:43

context. We call this process context engineering.

There's no single right way to do this. It depends on your use case.

If you're working with a multi-step agent, summarization might be a useful strategy. Instead of chaining long action histories, you can insert

06:58

summarization steps, letting the model distill previous thoughts into shorter, more relevant memory.
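One way this could look, assuming a hypothetical `llm` callable and an arbitrary length threshold, is sketched below:

```python
# Sketch of a summarization step for an agent loop (llm is a hypothetical
# callable wrapping whatever model client you use; the threshold is arbitrary).
MAX_HISTORY_CHARS = 8_000

def compress_history(history: list[str], llm) -> list[str]:
    """Distill a long action history into a short summary once it grows too large."""
    if sum(len(step) for step in history) < MAX_HISTORY_CHARS:
        return history
    summary = llm(
        "Summarize these agent steps, keeping only details needed for future decisions:\n"
        + "\n".join(history)
    )
    return [f"Summary of earlier steps: {summary}"]
```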

Another strategy is retrieval. If you're working with a recurring set of knowledge, like tools or documentation, you can store it in a vector database and retrieve only what's relevant at each step.
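A minimal retrieval sketch using Chroma might look like the following; the collection name, documents, and query are made up, and the model call itself is omitted:

```python
import chromadb

# Store a recurring knowledge set once (example documents are made up).
client = chromadb.Client()
collection = client.create_collection(name="agent_knowledge")
collection.add(
    ids=["tool-weather", "tool-flights", "doc-refunds"],
    documents=[
        "Tool: get_weather(city) returns the current forecast for a city.",
        "Tool: search_flights(origin, destination, date) lists available flights.",
        "Policy: refunds are processed within 5 business days.",
    ],
)

# At each step, retrieve only the snippets relevant to the current query.
results = collection.query(
    query_texts=["What are some good outdoor activities for a sunny day?"],
    n_results=2,
)
relevant_context = "\n".join(results["documents"][0])
# Only relevant_context, not the entire knowledge base, goes into the prompt.
```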

07:15

It's fast and cost-efficient, but you need to invest more time into experimenting with your retrieval strategy. There's no one-size-fits-all solution here.

What works in one application might fail in another, so experimentation is important.

To conclude, even if your model has a 1 million token context window, that does not guarantee reliable

07:32

performance at 1 million tokens. Even the best models today struggle with simple long context tasks.

So you need effective context window management. To read our full technical report, check out research.trychroma.com.

Thank you.

07:48

[Music]