Context Window¶
- The context window is like the model's short-term memory: it is the total number of tokens the model can process in a single interaction, counting the prompt you give it, the conversation history, any uploaded documents, and the answer it generates (see the token-counting sketch after these examples).
- The context window plays an important role in how models work: it determines how much surrounding text is available for capturing the semantic relationships between words and sentences.
- The size of the context window can significantly affect performance: a larger window gives the model more information to draw on when making predictions, but it also increases computational cost.
- For example:
  - GPT-4 Turbo: 128,000-token context window, output limit of roughly 4,096 tokens.
  - Claude 3.5 Sonnet: 200,000-token context window, output limit typically around 4,096 tokens.
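To get a feel for how much of a window a prompt actually uses, you can count its tokens before sending it. Below is a minimal sketch assuming the open-source `tiktoken` tokenizer and its `cl100k_base` encoding (used by GPT-4-family models); the 128,000-token budget is simply the GPT-4 Turbo figure from the list above.

```python
# pip install tiktoken
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count how many tokens `text` occupies under the given encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

prompt = "Summarize the attached report in three bullet points."
context_window = 128_000  # GPT-4 Turbo-sized window, as listed above
used = count_tokens(prompt)
print(f"Prompt uses {used} tokens; {context_window - used} remain for history and output.")
```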
Key point:¶
If a model has a context length of $N$ tokens:
- It can only “remember” the last $N$ tokens from the conversation or document.
- Once the conversation exceeds this window, the model starts dropping (forgetting) the oldest tokens so it can stay under the limit.
Why Context Windows Matter:¶
- If your prompt exceeds the context limit, some tokens will be truncated and ignored.
- A larger context window allows the model to process more information at once.
- Shorter context windows require careful prompt engineering to fit necessary details.
Example¶
Model with context window = 8 tokens:
If $C = 8$ and current sequence is:
$$ [1, 2, 3, 4, 5, 6, 7, 8] $$
- When a new token ($9$) comes in, the oldest token ($1$) is dropped:
$$ [2, 3, 4, 5, 6, 7, 8, 9] $$
- Mathematically:
$$ \text{New context} = (\text{Old context} \cup \{t_{new}\}) - \{t_{oldest}\} $$
- This ensures the context always contains the latest $C$ tokens (the sketch below shows the same step in code).
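The same drop-the-oldest bookkeeping can be sketched in a few lines of Python using a bounded deque. This is a toy illustration of the arithmetic above, not how a real model manages its internal state.

```python
from collections import deque

C = 8  # context window size from the example above
context = deque([1, 2, 3, 4, 5, 6, 7, 8], maxlen=C)

context.append(9)     # new token arrives; the deque silently drops the oldest token (1)
print(list(context))  # [2, 3, 4, 5, 6, 7, 8, 9]
```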
Fixed Context Windows Vs Sliding Context Windows¶
Fixed Context Windows: Certain models operate with a fixed context window, determined during training and unchangeable afterward. This restriction limits their ability to process long sequences, often requiring truncation of the input or alternative strategies such as the sliding window technique.
Sliding Context Windows: To overcome fixed-window limitations, some models implement a sliding window mechanism. As new tokens are processed, the context shifts forward, ensuring the model always works within its defined window size. However, this approach can lead to partial loss of earlier context, as older tokens eventually move out of the active window.
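At the application level, a rough analogue of the sliding window is trimming conversation history before each request so that the most recent turns always fit. The sketch below is an illustrative assumption, not any provider's actual mechanism; `trim_history` is a hypothetical helper, and its default whitespace-based counter is only a crude stand-in for a real tokenizer.

```python
def trim_history(messages: list[dict], budget: int,
                 count_tokens=lambda text: len(text.split())) -> list[dict]:
    """Keep only the most recent messages whose combined token count fits the budget.

    count_tokens defaults to a crude whitespace count; swap in a real tokenizer
    (e.g. the tiktoken-based helper sketched earlier) for accurate budgeting.
    """
    kept, used = [], 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break                        # older messages no longer fit in the window
        kept.append(message)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = [
    {"role": "user", "content": "First question ..."},
    {"role": "assistant", "content": "First answer ..."},
    {"role": "user", "content": "Latest question ..."},
]
recent = trim_history(history, budget=2_000)
```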
Mathematical Understanding of Context Windows¶
- If the context length = $N$, then at any given time:
$$ \text{Input Tokens (Prompt + Conversation History)} + \text{Output Tokens (Generated)} \leq N $$
- If the sum exceeds $N$, older tokens are dropped (sliding mechanism).
- If the input itself exceeds $N$, the extra part is truncated (as sketched below).
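One way to picture the truncation rule is a small helper that keeps only the most recent input tokens after reserving room for the output. This is a sketch under the assumption that tokens are already available as a list of IDs; `fit_to_window` is a hypothetical function, and real APIs apply their own truncation behaviour.

```python
def fit_to_window(input_tokens: list[int], n: int, reserved_output: int) -> list[int]:
    """Drop the oldest input tokens so that input + reserved output stays within the window."""
    budget = n - reserved_output
    if budget <= 0:
        raise ValueError("Reserved output alone exceeds the context window.")
    return input_tokens[-budget:]  # keep only the most recent tokens that fit

tokens = list(range(10_000))                # pretend tokenized prompt
trimmed = fit_to_window(tokens, n=8_192, reserved_output=1_024)
print(len(trimmed))                         # 7168 input tokens survive
```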
Token Limits¶
- A token limit typically refers to the specific maximum number of tokens allowed for either the input or the output, a value that is always less than or equal to the total context window.
- Token limits are a hard constraint because models have finite context windows.
- If your request exceeds this limit, the model will:
- Truncate the input (cut off text beyond the limit), or
- Refuse to process it entirely.
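In practice, the usual way to stay inside these limits deliberately is to cap the generated length per request. The snippet below sketches this with the OpenAI Python SDK's `max_tokens` parameter; the model name is only an example, and some newer models use `max_completion_tokens` instead, so check your provider's documentation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",  # example model; substitute whatever you use
    messages=[{"role": "user", "content": "Explain context windows in two sentences."}],
    max_tokens=200,       # hard cap on the generated (output) tokens
)
print(response.choices[0].message.content)
```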
Key Differences in Token Limit Handling¶
- Token limitations are handled differently by various LLM providers.
Combined Limit (e.g., OpenAI):
- In many cases, such as with OpenAI's chat models, the token limit applies to the sum of the input and output token counts.
- For example, if a model has a 4,096-token limit and your input uses 1,000 tokens, the maximum response you can get back is 3,096 tokens.
Separate Input/Output Limits (e.g., Gemini):
- In other cases, such as with Gemini, limits are imposed on the input and output separately.
- This means you could have a large input and a substantial output, as long as neither individually exceeds its respective limit and the two together still fit within the larger overall context window.
Why Token Limit Matters¶
- Memory Constraint: Transformer-based LLMs must keep every in-context token in memory (e.g., in the key-value cache) while generating.
- Performance Constraint: More tokens mean more computational cost (slower inference, higher API cost).
- Accuracy Constraint: If the prompt is too long, crucial context may be truncated and lost.
Mathematical Understanding of Token Limits Inside Context Window¶
- Some providers set separate maximums for input tokens and output tokens.
Let:
- $C = \text{Total context window size}$
- $I = \text{Max allowed input tokens}$
- $O = \text{Max allowed output tokens}$
Then:
$$ I + O \leq C $$
But, depending on the provider:
- Combined limit (OpenAI-style): $I + O \leq C$, with no independent restrictions on either side. Example: $I + O \leq 128,000$.
- Separate limits (Gemini-style): $I \leq I_{max}$ and $O \leq O_{max}$, while still $I + O \leq C$. (Both cases are compared in the sketch below.)
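A small helper can make the difference between the two regimes concrete. This is an illustrative sketch; `max_output_budget` is a hypothetical function, and the numbers passed to it below are the example figures used in this section, not official limits of any specific model.

```python
def max_output_budget(input_tokens: int, c: int, o_max: int | None = None) -> int:
    """Return the largest output the model could generate for a given input.

    c     -- total context window (C)
    o_max -- separate output cap (O_max); None means a purely combined limit
    """
    remaining = c - input_tokens  # space left in the window after the input
    if remaining <= 0:
        raise ValueError("Input alone already fills the context window.")
    if o_max is None:
        return remaining          # combined limit: output bounded only by leftover space
    return min(remaining, o_max)  # separate limit: also capped by O_max

print(max_output_budget(1_000, c=4_096))                    # 3096  (combined, as in the example above)
print(max_output_budget(100_000, c=128_000, o_max=16_384))  # 16384 (the separate output cap binds)
```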
Difference between context window and token limit¶
| Aspect | Context Window (Total Token Limit) | Specific Token Limits (Input/Output) |
|---|---|---|
| Definition | The overall size of a model's "working memory," measured in tokens. It includes the user's prompt, any conversation history, system instructions, and the model's generated output. | The maximum number of tokens allocated to a specific part of the interaction, such as the maximum length of the model's response (`max_output_tokens`). |
| Scope | A global constraint on the entire interaction. For a model with a 128k context window, your input and the model's output must together stay within 128k tokens. | A finer-grained constraint that operates within the total context window. For example, a model with a 128k context window might have a separate 16k-token limit on its generated output. |
| Effect on performance | A larger context window lets the model analyze longer documents or remember more of a conversation, leading to more coherent and contextually accurate responses. If this limit is exceeded, the model may "forget" older information. | A limited output token count can cause the model to stop generating text abruptly, even if there is more information available. A limited input token count can prevent you from submitting a very long prompt or document at once. |
| Trade-offs | Large context windows require substantially more computation (attention cost grows roughly quadratically with sequence length), leading to higher costs and slower processing. They can also suffer from the "lost in the middle" problem, where important details buried in long documents are overlooked. | Limiting the number of output tokens is a way to control cost and latency, since generating more text consumes more resources. However, it can compromise the completeness of a response. |
| Analogy | The context window is the total space on a stage for a play: all the actors (input) and scenery (output) have to fit on that stage. | The output token limit is a specific rule, like "the play can have no more than 10 lines of dialogue from any single character." |
Practical examples¶
- Consider a model with a total context window of 128,000 tokens and a separate maximum output token limit of 16,384 tokens.
Scenario 1: Summarizing a book
- If you submit a 100,000-token book as input, you have 28,000 tokens remaining in the context window (128k - 100k).
- Since your output limit is 16,384 tokens, the model can generate a summary of up to this length.
- The model would not be able to generate a summary longer than 16,384 tokens, even though more space is available in the total context window.
Scenario 2: A long conversation
- In an ongoing chat, the conversation history (both your inputs and the model's responses) fills up the context window.
- If the conversation history reaches 120,000 tokens, only 8,000 tokens are left in the context window (128k - 120k).
- The model's next response is still capped by the 16,384-token output limit, but also by the remaining context window space. If the model generates a 5,000-token response, the context window will then hold 125,000 tokens.
- If you continue the conversation and exceed the total 128,000-token window, the model will start to "forget" the earliest messages in the chat to make room for new ones.
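The arithmetic in both scenarios can be checked directly; the figures below are the example values used above (a 128,000-token window and a 16,384-token output cap), not limits of any particular product.

```python
C, O_MAX = 128_000, 16_384  # example context window and output cap from above

# Scenario 1: a 100,000-token book as input
book_input = 100_000
print(min(C - book_input, O_MAX))  # 16384 -> the output cap binds, not the window

# Scenario 2: conversation history already at 120,000 tokens
history_tokens = 120_000
print(min(C - history_tokens, O_MAX))  # 8000 -> the remaining window space binds
```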