Understanding LLM performance degradation: a deep dive into Context Window limits

Large Language Models (LLMs) have revolutionized how we interact with AI, but they come with a critical constraint: the context window. This limitation isn’t just a theoretical boundary; it has real, measurable impacts on performance.

I discussed this topic a few weeks ago during an AI training session I delivered at Microsoft Italy, and (as promised during that training) in this post I want to show in practice what happens when an LLM approaches its context window limit. For these tests I used the GPT-4.1 and GPT-5 models, but the same approach can be extended to any AI model you want.

What is a Context Window?

The context window represents the maximum amount of text (measured in tokens) that an LLM can process in a single request. Think of it as the model’s “working memory.” For GPT-4.1, this limit is approximately 128,000 tokens, which translates to roughly 450,000 characters (using an average of 3.5 characters per token). For GPT-5 this limit is approximately 200,000 tokens (~700,000 characters).

This window includes everything: your prompt, any context you provide, the conversation history, and the model’s response. When you approach this limit, you’re essentially maxing out the model’s capacity to understand and process information.
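
To make the character-to-token arithmetic above concrete, here is a tiny C# helper that uses the same ~3.5 characters-per-token heuristic. This is only an approximation used for illustration; a real tokenizer library gives exact counts.

```csharp
using System;

// Rough token estimation based on the ~3.5 characters-per-token average mentioned above.
// This is an approximation: use a real tokenizer for exact counts.
public static class TokenEstimator
{
    private const double CharsPerToken = 3.5;

    public static int EstimateTokens(string text) =>
        (int)Math.Ceiling(text.Length / CharsPerToken);

    // Fraction of the model's context window that a given text would consume.
    public static double ContextUsage(string text, int contextLimitTokens) =>
        (double)EstimateTokens(text) / contextLimitTokens;
}

// Example: a 450,000-character prompt against GPT-4.1's ~128K-token window.
// TokenEstimator.ContextUsage(new string('x', 450_000), 128_000) ≈ 1.0 (the window is essentially full)
```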

The Performance Challenge

While having a large context window is incredibly powerful (allowing you to include entire documents, extensive code files, or long conversation histories), it comes at a cost. As you fill more of that window, the model needs to process an increasingly large amount of information, which directly impacts response time and, potentially, quality.

Let’s do our test…

To understand this performance degradation, I created a C# program that systematically tests GPT-4.1 and GPT-5 performance at different context sizes. The program:

1. Generates prompts of increasing size (from 10,000 to 450,000 characters for GPT-4.1, and from 10,000 to 700,000 characters for GPT-5).

2. Measures response time for each request

3. Tracks token estimates to understand how close we are to the limit

4. Visualizes the results in an easy-to-understand chart

The test is simple but revealing: we ask the same question (“What is the capital of Italy?”) but progressively add filler text to increase the context size. This isolates the performance impact of context length from the complexity of the task.
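
The full program isn’t included in this post, but here is a minimal sketch of the measurement loop against an Azure OpenAI deployment via the plain REST endpoint. The endpoint, deployment name, API key and api-version values are placeholders, and in a real test use varied filler text rather than a single repeated character so the character-to-token ratio stays realistic:

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

class ContextWindowBenchmark
{
    // Placeholders: replace with your own Azure OpenAI resource, deployment and key.
    const string Endpoint   = "https://<your-resource>.openai.azure.com";
    const string Deployment = "<your-gpt-4.1-deployment>";
    const string ApiVersion = "2024-10-21";
    const string ApiKey     = "<your-api-key>";

    static async Task Main()
    {
        using var http = new HttpClient { Timeout = TimeSpan.FromMinutes(2) };
        http.DefaultRequestHeaders.Add("api-key", ApiKey);

        // Context sizes to test, in characters (up to ~450K for GPT-4.1, ~700K for GPT-5).
        int[] sizes = { 10_000, 50_000, 100_000, 200_000, 300_000, 400_000, 450_000 };

        foreach (int size in sizes)
        {
            // Same question every time, padded with filler text to reach the target size.
            // Use varied filler in practice so the chars-per-token ratio stays realistic.
            string filler = new string('x', size);
            string prompt = $"{filler}\n\nWhat is the capital of Italy?";

            string body = JsonSerializer.Serialize(new
            {
                messages = new[] { new { role = "user", content = prompt } }
            });

            string url = $"{Endpoint}/openai/deployments/{Deployment}/chat/completions?api-version={ApiVersion}";

            var sw = Stopwatch.StartNew();
            using var response = await http.PostAsync(url, new StringContent(body, Encoding.UTF8, "application/json"));
            sw.Stop();

            Console.WriteLine($"{size:N0} chars (~{size / 3.5:N0} tokens): {sw.ElapsedMilliseconds} ms, status {response.StatusCode}");
        }
    }
}
```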

The Results:

Here’s what the testing revealed for GPT-4.1:

Between 10K and 300K characters (3K-100K tokens), response times remained remarkably consistent, ranging from 538 ms to 1,192 ms. That means roughly a one-second delay at most, which is generally acceptable for most applications.

At 400,000 characters (~133K tokens), we hit a dramatic performance drop. The response time jumped to nearly 60 seconds, a 50x increase compared to the previous test point! This suggests we exceeded the practical context limit, forcing the model into a significantly more expensive processing mode or triggering internal throttling mechanisms.

Interestingly, at 450,000 characters, the response time dropped back to 1.5 seconds. This could indicate one of the following:

– The request was handled differently by the API

– Some internal optimization kicked in

– Natural variance in API response times

However, the spike at 400K characters is the critical finding: it demonstrates a clear boundary where performance catastrophically degrades. Here is the chart generated from the program itself:

Testing GPT-5 with its larger context window revealed even more interesting patterns:

GPT-5 maintained excellent performance (under 2 seconds) up to 300,000 characters (~100K tokens), demonstrating better optimization for larger contexts compared to GPT-4.1.

Unlike GPT-4.1’s single major spike, GPT-5 exhibited a more complex pattern:

– First performance drop at 400K characters (~58 seconds), matching the GPT-4.1 drop.

– A recovery at around 500K characters, where the response time dropped back to ~3.8 seconds.

– Second performance drop at 600K characters (~61 seconds)

– Continued degradation at 700K characters (~62 seconds)

Here is the generated chart:

The consistent 60+ second response times at 600K and 700K characters suggest that GPT-5’s practical limit sits below roughly 600,000 characters (~171K tokens), significantly under its theoretical 200K token maximum.

The other curious result is that both models exhibit severe performance degradation at 400K characters, which suggests a common architectural constraint or API throttling point across the Azure OpenAI infrastructure. I need to go more in-depth on this…

Why does this happen?

Several factors contribute to this performance degradation:

Attention mechanism complexity: LLMs use attention mechanisms to understand relationships between different parts of the input. The computational cost of attention scales quadratically (O(n²)) with sequence length, so doubling the input roughly quadruples the work: going from 50K to 100K tokens means about four times the attention cost, not two. As you approach the context limit, this becomes dramatically more expensive.

Memory and processing constraints: Processing massive contexts requires significant GPU memory and computational resources. When you push the limits, the system may need to use more conservative, slower processing strategies to avoid out-of-memory errors.

API rate limiting and throttling: Cloud providers may implement throttling mechanisms when detecting extremely large requests to ensure fair resource allocation across all users. This could explain the dramatic spike we observed.

Practical implications for Developers

These are the practical things that every developer should always check:

1. Monitor your context usage: Don’t assume you have all 128K tokens available. In practice, you want to stay well below this limit (aim for no more than 80-85% of the model’s maximum token limit to maintain good performance).

2. Implement context pruning strategies: if you’re building chatbots or applications with long conversations (see the sliding-window sketch after this list):

– Summarize old messages

– Remove less relevant context

– Implement sliding windows that keep only recent interactions

3. Chunk large documents: Instead of sending an entire 100-page document, break it into smaller chunks and process them sequentially or in parallel, then aggregate the results.

4. Use token counting libraries: don’t rely on rough character-based estimates; use proper tokenization libraries to accurately count tokens before sending requests.

5. Implement timeouts and fallbacks: Given the unpredictable performance near the limit, always implement the following (a timeout/retry sketch follows this list):

– Request timeouts (30-60 seconds)

– Retry logic with exponential backoff

– Fallback strategies when requests take too long
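
As a concrete illustration of point 2, here is a minimal sliding-window pruner. It is just a sketch: it keeps the system message plus the most recent turns that fit into a token budget, and it estimates tokens with the same ~3.5 characters-per-token heuristic (swap in a real tokenizer for production use). The ChatTurn type and the budget value are illustrative, not part of any SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative message type: role is "system", "user" or "assistant".
public record ChatTurn(string Role, string Content);

public static class ContextPruner
{
    // Crude token estimate (~3.5 chars per token); replace with a real tokenizer where possible.
    static int EstimateTokens(string text) => (int)Math.Ceiling(text.Length / 3.5);

    // Keeps the system message (if any) plus the most recent turns that fit the token budget.
    public static List<ChatTurn> Prune(List<ChatTurn> history, int tokenBudget)
    {
        var pruned = new List<ChatTurn>();
        int used = 0;

        var system = history.FirstOrDefault(t => t.Role == "system");
        if (system is not null)
        {
            pruned.Add(system);
            used += EstimateTokens(system.Content);
        }

        // Walk backwards from the newest message and keep turns until the budget is spent.
        foreach (var turn in Enumerable.Reverse(history).Where(t => t.Role != "system"))
        {
            int cost = EstimateTokens(turn.Content);
            if (used + cost > tokenBudget) break;

            // Insert right after the system message so chronological order is preserved.
            pruned.Insert(system is null ? 0 : 1, turn);
            used += cost;
        }

        return pruned;
    }
}

// Example: keep roughly 80% of GPT-4.1's 128K-token window for the conversation history.
// var trimmed = ContextPruner.Prune(conversation, tokenBudget: 100_000);
```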
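
And for point 5, a simple timeout-plus-retry wrapper around HttpClient, again only a sketch: the attempt count, timeout and backoff values are arbitrary, and in a real application you may prefer a resilience library such as Polly instead of hand-rolling this.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class ResilientCaller
{
    // Sends a request with a per-attempt timeout and exponential backoff between retries.
    // A factory is used because an HttpRequestMessage cannot be sent twice.
    public static async Task<HttpResponseMessage> SendWithRetryAsync(
        HttpClient http,
        Func<HttpRequestMessage> requestFactory,
        int maxAttempts = 3,
        int timeoutSeconds = 60)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(timeoutSeconds));
            try
            {
                var response = await http.SendAsync(requestFactory(), cts.Token);
                if (response.IsSuccessStatusCode) return response;
            }
            catch (OperationCanceledException)
            {
                // The attempt exceeded the timeout: fall through to the backoff and retry.
            }

            if (attempt < maxAttempts)
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt))); // 2s, 4s, ...
        }

        // Fallback point: degrade gracefully (e.g. return a cached answer or retry with a smaller context).
        throw new TimeoutException($"Request did not succeed after {maxAttempts} attempts.");
    }
}
```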

My conclusion

The context window is one of the most important (and often underestimated) constraints in LLM applications. By staying within 80% of the practical limit (not the theoretical one) and implementing smart context management strategies, you can build responsive, reliable LLM-powered applications. These tests revealed that:

– For GPT-4.1: stay under 300K characters (100K tokens) for best performance.

– For GPT-5: stay under 500K characters (142K tokens) for best performance.

P.S. If you’re also interested in the C# program I used for the tests, I can upload it to GitHub…

Original Post https://demiliani.com/2025/11/02/understanding-llm-performance-degradation-a-deep-dive-into-context-window-limits/
