All AI Models Might Be The Same
Date: 18th July 2025
Blog Post
Key Points
- Language models are simply compressing data into model weights
- Discusses Shannon’s Source Coding Theorem: the minimum average number of bits per token any lossless compression scheme can achieve over a distribution is that distribution’s entropy (see the entropy sketch after this list).
- Learning to compress real data well is equivalent to generalisation, which is why smaller models often generalise better in practice.
- Platonic Representation Hypothesis: there is essentially one correct way to compress reality, and different models are converging towards that shared representation (see the similarity sketch after this list).
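
To make the source coding bound concrete, here is a minimal Python sketch (the toy distribution and names like `true_probs` and `model_probs` are illustrative, not from the post) that computes the entropy of a small token distribution and checks that a model's measured bits per token can only approach that bound from above.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: the lower bound on average bits per token
    for any lossless code over this distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def cross_entropy_bits(true_probs, model_probs):
    """Average bits per token paid when coding data drawn from true_probs
    using the model's predicted probabilities model_probs."""
    return -sum(p * math.log2(q) for p, q in zip(true_probs, model_probs) if p > 0)

# Toy 4-token "language"; probabilities are made up for illustration.
true_probs = [0.5, 0.25, 0.15, 0.10]
model_probs = [0.4, 0.3, 0.2, 0.10]  # an imperfect model of the same tokens

h = entropy_bits(true_probs)
ce = cross_entropy_bits(true_probs, model_probs)

print(f"entropy (lower bound): {h:.3f} bits/token")
print(f"model cross-entropy:   {ce:.3f} bits/token")
# Cross-entropy is always >= entropy; the gap is the KL divergence,
# i.e. the extra bits paid for imperfect compression.
assert ce >= h
```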
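
The convergence claim behind the Platonic Representation Hypothesis is typically tested by measuring how similar two models' representations of the same inputs are. Below is a hedged sketch using linear CKA, a standard representation-similarity metric (not necessarily the one the post has in mind); `feats_a` and `feats_b` are stand-ins for per-example embeddings extracted from two different models.

```python
import numpy as np

def linear_cka(feats_a, feats_b):
    """Linear Centered Kernel Alignment between two sets of representations.

    feats_a: (n_examples, d_a) embeddings from model A
    feats_b: (n_examples, d_b) embeddings from model B
    Returns a similarity in [0, 1]; values near 1 mean the two models
    represent the same inputs in (linearly) equivalent ways.
    """
    # Centre each feature matrix across examples.
    a = feats_a - feats_a.mean(axis=0, keepdims=True)
    b = feats_b - feats_b.mean(axis=0, keepdims=True)
    # CKA = ||A^T B||_F^2 / (||A^T A||_F * ||B^T B||_F)
    cross = np.linalg.norm(a.T @ b, "fro") ** 2
    norm_a = np.linalg.norm(a.T @ a, "fro")
    norm_b = np.linalg.norm(b.T @ b, "fro")
    return cross / (norm_a * norm_b)

# Illustrative usage with random stand-ins for two models' embeddings of the
# same 512 inputs (a real experiment would use actual model activations).
rng = np.random.default_rng(0)
shared = rng.normal(size=(512, 64))             # common underlying structure
feats_a = shared @ rng.normal(size=(64, 128))   # "model A" view of the inputs
feats_b = shared @ rng.normal(size=(64, 256))   # "model B" view of the inputs
print(f"CKA between the two models: {linear_cka(feats_a, feats_b):.3f}")
```

A high CKA score across models trained on different data or modalities is the kind of evidence the hypothesis predicts; the random features above only demonstrate how the measurement is run.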