Self-Supervised Video Models Enable Understanding, Prediction and Planning
Date: 30th June 2025
Key Points
- Trains the model in a self-supervised manner, like LLMs, but instead of predicting missing words it predicts masked regions of video (see the latent-prediction sketch after this list).
- Predictions are made in latent space, encoded by a ~1B-parameter ViT-g, not in pixel space.
- Essentially predicts forward in time: given some of a video, it anticipates what the rest looks like.
- A method for models to scale and begin to understand physical dynamics rather than just words.
- Fine-tuned on action-labelled videos of robots performing manipulation tasks, so the model learns to predict cause and effect and can be used for planning (see the planning sketch below).
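
To make the first two points concrete, here is a minimal sketch of the masked latent-prediction objective: a target encoder (an EMA copy of the online encoder) produces latents for hidden patches, and a predictor regresses those latents from the visible context. All module shapes, the MLP encoders, the pooled-context trick, and the mask queries are illustrative assumptions; the actual model uses a ViT-g video transformer with spatiotemporal masking.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLatentPredictor(nn.Module):
    """Toy stand-in for the masked latent-prediction objective: regress the
    latents of hidden video patches from the latents of visible ones."""

    def __init__(self, dim=64, n_patches=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Learned queries marking *which* masked patch to predict.
        self.mask_queries = nn.Parameter(torch.zeros(n_patches, dim))
        # Target encoder: EMA copy of the encoder, never updated by backprop.
        self.target_encoder = copy.deepcopy(self.encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad = False

    def loss(self, patches, mask):
        # patches: (batch, n_patches, dim) tokenized video clip.
        # mask:    (n_patches,) bool, True where patches are hidden.
        ctx = self.encoder(patches[:, ~mask]).mean(dim=1, keepdim=True)  # pooled visible context
        with torch.no_grad():
            targets = self.target_encoder(patches)[:, mask]              # latents to regress
        preds = self.predictor(ctx + self.mask_queries[mask].unsqueeze(0))
        return F.l1_loss(preds, targets)  # loss lives in latent space, not pixels

    @torch.no_grad()
    def ema_update(self, tau=0.999):
        # Slowly drag the target encoder toward the online encoder.
        for p, tp in zip(self.encoder.parameters(), self.target_encoder.parameters()):
            tp.mul_(tau).add_(p, alpha=1.0 - tau)

model = MaskedLatentPredictor()
mask = torch.arange(16) >= 4            # hide all but the first 4 patches
loss = model.loss(torch.randn(8, 16, 64), mask)
loss.backward()
model.ema_update()
```

The key design choice the sketch preserves: the loss compares representations, so the model never wastes capacity reconstructing unpredictable pixel detail.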
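And a sketch of how an action-conditioned latent predictor supports planning: sample candidate action sequences, roll each one forward in latent space, and execute the first action of the sequence whose predicted outcome lands closest to the goal. The `encode`/`predictor` signatures, the 7-dim actions, and the random-shooting optimizer are assumptions for illustration, not the paper's exact method (a real planner would typically use a stronger optimizer such as the cross-entropy method).

```python
import torch

def plan_one_step(encode, predictor, frame, goal_frame,
                  horizon=5, n_samples=256, action_dim=7):
    """Random-shooting MPC in latent space. `encode(frame) -> (dim,)` and
    `predictor(z, a) -> next z` are hypothetical callables."""
    z0, z_goal = encode(frame), encode(goal_frame)
    actions = torch.randn(n_samples, horizon, action_dim)  # candidate action sequences
    z = z0.expand(n_samples, -1)                           # one rollout per candidate
    for t in range(horizon):
        z = predictor(z, actions[:, t])                    # imagine the next latent state
    costs = (z - z_goal).norm(dim=-1)                      # distance to goal latent
    return actions[costs.argmin(), 0]                      # execute first action, then replan

# Dummy callables just to show the loop runs end to end.
dummy_enc = lambda f: torch.zeros(32)
dummy_pred = lambda z, a: z + 0.1 * a.mean(-1, keepdim=True)
first_action = plan_one_step(dummy_enc, dummy_pred, frame=None, goal_frame=None)
```

Replanning after every executed action is what turns a purely predictive model into a closed-loop controller.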