Self-Supervised Video Models Enable Understanding, Prediction and Planning
Date: 30th June 2025
Key Points
- Trains the model in a self-supervised manner, like LLMs, but instead of predicting missing words it predicts masked regions of video (see the latent-prediction sketch after this list).
- Predictions are made in latent space, encoded by a ~1B-parameter ViT-g, not in pixel space.
- Essentially predicts forward in time: given some of a video, it anticipates what the rest looks like.
- A method for models to scale and begin to understand physical dynamics rather than just words.
- Fine-tuned on action-labelled videos of robots performing manipulation tasks, so the model learns to predict cause and effect and can be used for planning (see the planning sketch below).
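
To make the first two points concrete, here is a minimal sketch of the masked latent-prediction objective: a target encoder (an EMA copy of the online encoder) produces latents for hidden patches, and a predictor regresses those latents from the visible context. All module shapes, the MLP encoders, the pooled-context trick, and the mask queries are illustrative assumptions; the actual model uses a ViT-g video transformer with spatiotemporal masking.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLatentPredictor(nn.Module):
    """Toy stand-in for the masked latent-prediction objective: regress the
    latents of hidden video patches from the latents of visible ones."""

    def __init__(self, dim=64, n_patches=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Learned queries marking *which* masked patch to predict.
        self.mask_queries = nn.Parameter(torch.zeros(n_patches, dim))
        # Target encoder: EMA copy of the encoder, never updated by backprop.
        self.target_encoder = copy.deepcopy(self.encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad = False

    def loss(self, patches, mask):
        # patches: (batch, n_patches, dim) tokenized video clip.
        # mask:    (n_patches,) bool, True where patches are hidden.
        ctx = self.encoder(patches[:, ~mask]).mean(dim=1, keepdim=True)  # pooled visible context
        with torch.no_grad():
            targets = self.target_encoder(patches)[:, mask]              # latents to regress
        preds = self.predictor(ctx + self.mask_queries[mask].unsqueeze(0))
        return F.l1_loss(preds, targets)  # loss lives in latent space, not pixels

    @torch.no_grad()
    def ema_update(self, tau=0.999):
        # Slowly drag the target encoder toward the online encoder.
        for p, tp in zip(self.encoder.parameters(), self.target_encoder.parameters()):
            tp.mul_(tau).add_(p, alpha=1.0 - tau)

model = MaskedLatentPredictor()
mask = torch.arange(16) >= 4            # hide all but the first 4 patches
loss = model.loss(torch.randn(8, 16, 64), mask)
loss.backward()
model.ema_update()
```

The key design choice the sketch preserves: the loss compares representations, so the model never wastes capacity reconstructing unpredictable pixel detail.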
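And a sketch of how an action-conditioned latent predictor supports planning: sample candidate action sequences, roll each one forward in latent space, and execute the first action of the sequence whose predicted outcome lands closest to the goal. The `encode`/`predictor` signatures, the 7-dim actions, and the random-shooting optimizer are assumptions for illustration, not the paper's exact method (a real planner would typically use a stronger optimizer such as the cross-entropy method).

```python
import torch

def plan_one_step(encode, predictor, frame, goal_frame,
                  horizon=5, n_samples=256, action_dim=7):
    """Random-shooting MPC in latent space. `encode(frame) -> (dim,)` and
    `predictor(z, a) -> next z` are hypothetical callables."""
    z0, z_goal = encode(frame), encode(goal_frame)
    actions = torch.randn(n_samples, horizon, action_dim)  # candidate action sequences
    z = z0.expand(n_samples, -1)                           # one rollout per candidate
    for t in range(horizon):
        z = predictor(z, actions[:, t])                    # imagine the next latent state
    costs = (z - z_goal).norm(dim=-1)                      # distance to goal latent
    return actions[costs.argmin(), 0]                      # execute first action, then replan

# Dummy callables just to show the loop runs end to end.
dummy_enc = lambda f: torch.zeros(32)
dummy_pred = lambda z, a: z + 0.1 * a.mean(-1, keepdim=True)
first_action = plan_one_step(dummy_enc, dummy_pred, frame=None, goal_frame=None)
```

Replanning after every executed action is what turns a purely predictive model into a closed-loop controller.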