How to scale RL to 10^26 FLOPs
Blog post by Jack Morris
Key Points
- We are starting to approach the upper bound of web-scale data pre-training. How do we go beyond this?
- Reinforcement learning provides a potential path: it allows the model to create novel reasoning paths on its own.
However, it currently has a few issues:
- Relies on verifiable rewards: having a well-defined scoring function (e.g. Go is easy to score once the game is over)
- Massive compute issues: current models only do a few hundred or a few thousand RL steps
- Sampling reasoning tokens is expensive
- Rollouts require verification at the end, which is expensive. This can mean compiling code or running unit tests, for example.
It can also mean that the bottleneck is on the CPU! (See the first sketch after this list.)
- The way forward, in his opinion: RL-based pre-training that allows the model to think during pre-training. He calls this
Reasoning with Next Token Prediction (RNTP); a rough sketch follows after this list.
- This lets us get more out of the data: next-token prediction is verifiable, and RL can let the model learn to think beyond that.
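
To make the "verifiable rewards" point concrete, here is a minimal sketch of scoring a code-generation rollout by executing unit tests. The `verifiable_reward` helper, the subprocess call, and the 10-second timeout are illustrative assumptions rather than the post's actual setup; the sketch just shows why verification happens on the CPU and can become the bottleneck of the RL loop.

```python
import subprocess
import sys
import tempfile

def verifiable_reward(generated_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Score a rollout by executing unit tests against the generated code.

    Returns 1.0 if every assert passes, 0.0 otherwise. Note that this whole
    step (writing a file, spawning an interpreter, running the tests) happens
    on the CPU, which is why verification can bottleneck the RL loop.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout_s
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # a hung rollout counts as a failure

# Hypothetical rollout: the model was asked to implement `add`.
rollout = "def add(a, b):\n    return a + b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
print(verifiable_reward(rollout, tests))  # 1.0 if the generated code passes the tests
```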
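For the RNTP idea, one plausible reading is sketched below: the model samples reasoning tokens after a corpus prefix, then commits to a prediction of the true next token, and is rewarded only if that prediction matches the data. The callables `sample_reasoning` and `predict_next` and the 0/1 reward are assumptions for illustration, not the post's definition.

```python
from typing import Callable, List

def rntp_reward(
    sample_reasoning: Callable[[List[int]], List[int]],  # hypothetical: prefix -> reasoning tokens
    predict_next: Callable[[List[int]], int],            # hypothetical: context -> predicted token id
    prefix_ids: List[int],
    true_next_id: int,
) -> float:
    """Sketch of a Reasoning with Next Token Prediction (RNTP) reward.

    The model first 'thinks' (samples reasoning tokens) after the corpus
    prefix, then commits to a prediction of the true next token. The reward
    is verifiable straight from the pre-training data: 1.0 if the committed
    token matches the corpus token, 0.0 otherwise.
    """
    reasoning_ids = sample_reasoning(prefix_ids)             # the thoughts are not scored directly
    predicted_id = predict_next(prefix_ids + reasoning_ids)  # prediction conditioned on the thoughts
    return 1.0 if predicted_id == true_next_id else 0.0
```

The appeal is that the reward stays verifiable, since it comes straight from the pre-training corpus, while the reasoning tokens give RL room to improve how the model arrives at each prediction.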