Evaluating Long Context (Reasoning) Ability
Date read: 18th October 2025
Blog post
Key Points
- Reasoning performance tends to decrease with context length, long before the max context length is reached. This is especially important when doing CoT, as this generates a bunch more tokens.
- E.g. ChatGPT-4's performance decays at >100K tokens
- Measuring this ability tends to come in two parts:
- Determine which parts of the text are relevant
- Perform tasks required
- Examples:
- Long story short, the examples differ in how difficult it is to determine which parts are relevant and in how easy it is to then process this relevant information.
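To make the two-part split concrete, here is a minimal sketch of how such an evaluation could be set up. It is not taken from the blog post: the function names (build_haystack, score), the filler text, and the scoring scheme are all assumptions for illustration. It hides a few relevant sentences in irrelevant filler, then scores "found the relevant parts" separately from "performed the task".

```python
import random


def build_haystack(needles, filler_sentences, total_sentences, seed=0):
    """Hide the relevant sentences (needles) at random positions in filler.

    Hypothetical helper: returns the document as a list of sentences plus
    the sorted positions where the needles were placed.
    """
    rng = random.Random(seed)
    doc = [rng.choice(filler_sentences) for _ in range(total_sentences)]
    # Distinct positions, so no needle overwrites another.
    positions = sorted(rng.sample(range(total_sentences), len(needles)))
    for pos, needle in zip(positions, needles):
        doc[pos] = needle
    return doc, positions


def score(predicted_positions, predicted_answer, gold_positions, gold_answer):
    """Score the two parts separately (an assumed scheme, not a standard one).

    retrieval: fraction of relevant sentences the model pointed at.
    task: 1.0 if the final answer matches exactly, else 0.0.
    """
    gold = set(gold_positions)
    retrieval = len(gold & set(predicted_positions)) / len(gold)
    task = 1.0 if predicted_answer == gold_answer else 0.0
    return {"retrieval": retrieval, "task": task}


if __name__ == "__main__":
    needles = ["Alice has 3 apples.", "Bob has 4 apples."]
    filler = ["The sky was grey.", "A train passed by.", "Nothing happened."]
    doc, gold_positions = build_haystack(needles, filler, total_sentences=200)

    # Stand-in for a model's output: it found one needle and still
    # answered the question "How many apples in total?" correctly.
    print(score([gold_positions[0]], 7, gold_positions, 7))
```

Raising total_sentences lengthens the context while keeping the task fixed, and making the needles look more like the filler makes the relevance step harder, which are the two knobs the notes above describe.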