Play To Generalise: Learning to Reason Through Game Play
Paper link
Key Points
- Learns transferable reasoning via reinforcement learning on games.
- Showed strong out-of-distribution generalisation and also strong out-of-domain generalisation, e.g. a model trained on games generalises to maths.
- Why? They believe the model learns ‘generalisable cognitive primitives or skills’.
- Specific reasoning capabilities can be induced by game choice (e.g. Snake for spatial reasoning).
- Prompting the model with explicit reasoning techniques improved performance.
Key Methods
- Reasoning models think first and then give both the best and the worst move. Prompting the model for both improved performance.
- Curriculum of game difficulty was used.
- Many different synthetic prompts were used (generated by GPT-4o).
- Snake was played as a two-player competitive game against a PPO-trained policy, which made the game reasoning harder.
- Used RLOO (REINFORCE Leave-One-Out) rather than GRPO, with no KL-divergence term, to encourage exploration.
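
To make the leave-one-out idea concrete, here is a minimal sketch of how an RLOO-style advantage is commonly computed: each sampled response is scored against the mean reward of the other responses to the same prompt. The function name `rloo_advantages` and the example rewards are hypothetical and chosen for illustration; this is not the paper's training code.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k sampled responses to one prompt.

    For sample i, the baseline is the mean reward of the other k - 1
    samples, so a sample never contributes to its own baseline.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean reward of every sample except this one.
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for four sampled moves (1.0 = correct, 0.0 = incorrect).
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Samples that beat the average of their peers get a positive advantage and the rest get a negative one, so the advantages within a group sum to zero. Unlike the usual description of GRPO, this sketch does not divide by the group's reward standard deviation.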