Dominic Rigby

Play To Generalise: Learning to Reason Through Game Play

Key Points

Learns transferable reasoning via reinforcement learning on games.
Showed strong out-of-distribution generalisation and also strong out-of-domain generalisation, e.g. a model trained on games generalises to maths.
Why? They believe the model learns ‘generalisable cognitive primitives or skills’.
Specific reasoning capabilities can be induced by game choice (e.g. Snake for spatial reasoning).
Prompting the model with reasoning techniques improved performance.

Key Methods

The reasoning model would think and then give both the best and the worst move. Prompting the model for both improved performance.
Curriculum of game difficulty was used.
Many different synthetic prompts were used (generated by GPT-4o).
Snake was played as a two-player competitive game against a PPO-trained policy. This made the game reasoning harder.
Used RLOO (REINFORCE Leave-One-Out) rather than GRPO, with no KL-divergence term, to encourage exploration.
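
The leave-one-out idea in the last point can be made concrete with a minimal sketch. It assumes K sampled responses per prompt, each with a scalar reward; the function names and the sample rewards are illustrative and not taken from the paper. Each response is scored against the mean reward of the other K-1 responses, and a GRPO-style group normalisation is shown alongside for comparison only.

```python
from statistics import mean, pstdev


def rloo_advantages(rewards):
    """Leave-one-out advantage: reward minus the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean of the other k - 1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalised advantage for comparison: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one game prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
print(rloo_advantages(rewards))
print(grpo_advantages(rewards))
```

In this sketch the leave-one-out advantages keep the scale of the raw rewards, whereas the group-normalised ones are rescaled by the group's standard deviation. Dropping the KL term simply means no penalty for drifting from the reference policy is added to the loss, so it does not appear here.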