Dominic Rigby

Absolute Zero Reasoner

Date: 23rd May 2025
arXiv Link

Key Points:

Reinforcement learning in which a single LLM both proposes problems and solves them, in a self-play loop.
Because the model generates its own tasks, no external training examples are needed.
The problems can be verified using a code executor. In theory this allows open-ended reasoning tasks, since programming languages are Turing complete.

Key Methods:

Each problem is built of three parts: input, code and output. One of the three is hidden and the LLM has to infer it. The code engine then verifies the answer by running the code.
AlphaZero-style self-play: the same model acts as proposer and solver, and both roles are trained with reinforcement learning on rewards derived from the code executor.
Unified code‐execution environment that serves both as verifier and data generator, allowing AZR to remain entirely “zero‐data.”
Open‐source implementation; the method is reported to work across model scales and model classes.
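
The hidden-part idea above can be sketched in a few lines of Python. This is only an illustration, not AZR's actual implementation: the function names (run_program, check_hidden_output, check_hidden_input, check_hidden_code) are hypothetical, the convention that every program defines a function f(x) is an assumption, and plain exec is used with no sandboxing, which a real system would need.

```python
from typing import Any, Dict


def run_program(code: str, arg: Any) -> Any:
    """Execute `code`, which is assumed to define f(x), and return f(arg)."""
    namespace: Dict[str, Any] = {}
    exec(code, namespace)  # illustration only: no sandboxing or timeouts
    return namespace["f"](arg)


def _matches(code: str, inp: Any, out: Any) -> bool:
    """True if running `code` on `inp` yields `out`; any error counts as wrong."""
    try:
        return run_program(code, inp) == out
    except Exception:
        return False


def check_hidden_output(code: str, inp: Any, guessed_output: Any) -> bool:
    # Output hidden: the model predicts what the code returns for the input.
    return _matches(code, inp, guessed_output)


def check_hidden_input(code: str, guessed_input: Any, out: Any) -> bool:
    # Input hidden: any input that reproduces the output is accepted.
    return _matches(code, guessed_input, out)


def check_hidden_code(guessed_code: str, inp: Any, out: Any) -> bool:
    # Code hidden: the model writes a program that maps the input to the output.
    return _matches(guessed_code, inp, out)


# A toy problem: the executor computes the ground-truth output itself.
code = "def f(x):\n    return sorted(x)[::-1]"
inp = [3, 1, 2]
out = run_program(code, inp)

print(check_hidden_output(code, inp, [3, 2, 1]))
print(check_hidden_input(code, [2, 3, 1], out))
print(check_hidden_code("def f(x):\n    return sorted(x, reverse=True)", inp, out))
```

In all three cases the check is the same operation: run the code on the input and compare with the output. That is why a single executor can verify every task type without any labelled data.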