All AI Models Might Be The Same
Date: 18th July 2025
Blog Post
Key Points
- Language models are simply compressing data into model weights
- Discusses Shannon’s Source Coding Theorem: the minimum average number of bits per token any lossless compression scheme can achieve over a distribution is that distribution’s entropy (see the entropy sketch after this list).
- Learning to compress real data well is equivalent to generalisation, which is why smaller models often generalise better in practice.
- Platonic Representation Hypothesis: there is essentially one correct way to compress reality, and different models are converging towards that shared representation (see the similarity sketch after this list).
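
To make the source coding bound concrete, here is a minimal Python sketch (the toy distribution and names like `true_probs` and `model_probs` are illustrative, not from the post) that computes the entropy of a small token distribution and checks that a model's measured bits per token can only approach that bound from above.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: the lower bound on average bits per token
    for any lossless code over this distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def cross_entropy_bits(true_probs, model_probs):
    """Average bits per token paid when coding data drawn from true_probs
    using the model's predicted probabilities model_probs."""
    return -sum(p * math.log2(q) for p, q in zip(true_probs, model_probs) if p > 0)

# Toy 4-token "language"; probabilities are made up for illustration.
true_probs = [0.5, 0.25, 0.15, 0.10]
model_probs = [0.4, 0.3, 0.2, 0.10]  # an imperfect model of the same tokens

h = entropy_bits(true_probs)
ce = cross_entropy_bits(true_probs, model_probs)

print(f"entropy (lower bound): {h:.3f} bits/token")
print(f"model cross-entropy:   {ce:.3f} bits/token")
# Cross-entropy is always >= entropy; the gap is the KL divergence,
# i.e. the extra bits paid for imperfect compression.
assert ce >= h
```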
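
The convergence claim behind the Platonic Representation Hypothesis is typically tested by measuring how similar two models' representations of the same inputs are. Below is a hedged sketch using linear CKA, a standard representation-similarity metric (not necessarily the one the post has in mind); `feats_a` and `feats_b` are stand-ins for per-example embeddings extracted from two different models.

```python
import numpy as np

def linear_cka(feats_a, feats_b):
    """Linear Centered Kernel Alignment between two sets of representations.

    feats_a: (n_examples, d_a) embeddings from model A
    feats_b: (n_examples, d_b) embeddings from model B
    Returns a similarity in [0, 1]; values near 1 mean the two models
    represent the same inputs in (linearly) equivalent ways.
    """
    # Centre each feature matrix across examples.
    a = feats_a - feats_a.mean(axis=0, keepdims=True)
    b = feats_b - feats_b.mean(axis=0, keepdims=True)
    # CKA = ||A^T B||_F^2 / (||A^T A||_F * ||B^T B||_F)
    cross = np.linalg.norm(a.T @ b, "fro") ** 2
    norm_a = np.linalg.norm(a.T @ a, "fro")
    norm_b = np.linalg.norm(b.T @ b, "fro")
    return cross / (norm_a * norm_b)

# Illustrative usage with random stand-ins for two models' embeddings of the
# same 512 inputs (a real experiment would use actual model activations).
rng = np.random.default_rng(0)
shared = rng.normal(size=(512, 64))             # common underlying structure
feats_a = shared @ rng.normal(size=(64, 128))   # "model A" view of the inputs
feats_b = shared @ rng.normal(size=(64, 256))   # "model B" view of the inputs
print(f"CKA between the two models: {linear_cka(feats_a, feats_b):.3f}")
```

A high CKA score across models trained on different data or modalities is the kind of evidence the hypothesis predicts; the random features above only demonstrate how the measurement is run.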