Date: 20th July 2025
arXiv link
Key Points
- Aims:
- TiZero is an algorithm which learns to play 11 vs 11 football in Google Research Football from scratch.
- Football is a very hard problem: it involves multi-agent coordination, both long- and short-term planning, and non-transitivity.
It is both cooperative and competitive, and rewards are extremely sparse, for example for agents who don't have the ball.
- How:
- Tackles the large search space using curriculum learning. Difficulty is set using the speed of the opponent,
the position of the ball on the pitch and the positions of the players (ball close to our goal = hard, close to their goal = easy); see the sketch below.
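A minimal sketch of how such a difficulty score might be computed; the inputs and weights below are my own illustrative assumptions, not the paper's exact formula:

```python
def curriculum_difficulty(ball_x: float, opponent_speed: float,
                          avg_player_x: float) -> float:
    """Score a scenario's difficulty in [0, 1] (higher = harder).

    Assumes pitch coordinates in [-1, 1] where -1 is our own goal and
    +1 is the opponent's: a ball near our goal is hard, near theirs is
    easy. The weights are illustrative, not from the paper.
    """
    ball_term = (1.0 - ball_x) / 2.0          # ball near our goal -> harder
    player_term = (1.0 - avg_player_x) / 2.0  # players pinned back -> harder
    speed_term = opponent_speed               # faster opponents -> harder
    return 0.5 * ball_term + 0.2 * player_term + 0.3 * speed_term
```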
- Tackles non-transitivity and the mixed cooperative-competitive setting using self-play. Opponents are sampled from a probability
distribution over the agent pool that scales linearly, biasing towards agents which are more difficult to beat (sketched below).
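A sketch of such a linearly scaling distribution over the pool; the linear weighting here is an assumption, the exact form is given in section 4.3 of the paper:

```python
def opponent_probs(n_agents: int) -> list:
    """Probability of facing each agent in the pool. Weights scale
    linearly with the agent's index, so later (typically harder)
    agents are sampled more often. Linear weighting is an assumption;
    see section 4.3 of the paper for the exact distribution."""
    weights = [i + 1 for i in range(n_agents)]
    total = sum(weights)
    return [w / total for w in weights]
```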
- Interesting implementation details:
- They implement a tweak of MAPPO called Joint-ratio Policy Optimisation. This optimises a joint policy which is the product of all
the individual policies, which supposedly achieves better coordination (see the sketch below).
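A sketch of what the clipped objective could look like: because the joint policy is the product of the individual policies, the joint ratio is the product of the per-agent ratios. Tensor shapes and the shared team advantage are my assumptions:

```python
import torch

def jrpo_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO-style clipped loss on the joint policy ratio (sketch).

    new_logp / old_logp: per-agent log-probs, shape (batch, n_agents).
    Since the joint policy is a product of individual policies, the
    joint ratio is the exp of the summed per-agent log-prob differences.
    Shapes and the shared team advantage are assumptions on my part.
    """
    joint_ratio = torch.exp((new_logp - old_logp).sum(dim=-1))  # (batch,)
    unclipped = joint_ratio * advantages
    clipped = torch.clamp(joint_ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```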
- Uses different observations for the policy (individual) and the critic (team); a minimal sketch follows below.
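A minimal sketch of that actor-critic split, assuming flat observation vectors; all sizes are illustrative:

```python
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor sees an individual observation, the critic a team-wide one.

    Observation and action sizes are illustrative assumptions.
    """
    def __init__(self, local_dim=115, global_dim=330, n_actions=19):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(local_dim, 128), nn.ReLU(),
                                   nn.Linear(128, n_actions))
        self.critic = nn.Sequential(nn.Linear(global_dim, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, local_obs, global_obs):
        # Policy logits from the individual view, value from the team view.
        return self.actor(local_obs), self.critic(global_obs)
```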
- As mentioned above: utilises a curriculum.
- Two-stage self-play (sketched below):
- Challenge self-play: plays the most recent agent 80% of the time and an older agent 20% of the time.
- Generalise self-play: plays against the entire opponent pool. The opponent is sampled using the probability distribution given
in section 4.3.
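A sketch of the two-stage opponent selection; the 80/20 split is from the notes above, while the generalise-stage weighting is assumed linear (the exact distribution is in section 4.3):

```python
import random

def pick_opponent(pool, stage):
    """Two-stage self-play opponent selection (sketch; the
    generalise-stage weighting is an assumption)."""
    if stage == "challenge":
        # Latest agent 80% of the time, otherwise a random older one.
        if random.random() < 0.8 or len(pool) == 1:
            return pool[-1]
        return random.choice(pool[:-1])
    # Generalise stage: whole pool, weighted towards harder agents.
    weights = [i + 1 for i in range(len(pool))]
    return random.choices(pool, weights=weights, k=1)[0]
```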
- Utilises an LSTM to learn from earlier observations in the episode. This gives insight into the opponent's strategy and helps
with long-term planning.
- Masks out actions that aren't currently possible, reducing the search space (sketched below).
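A common way to implement this is to set the logits of unavailable actions to -inf before the softmax; a sketch, where the `legal` mask interface is my assumption:

```python
import torch

def masked_logits(logits: torch.Tensor, legal: torch.Tensor) -> torch.Tensor:
    """Set logits of unavailable actions to -inf so the softmax assigns
    them zero probability. `legal` is a boolean mask, True where the
    action is currently available (interface is an assumption)."""
    return logits.masked_fill(~legal, float("-inf"))
```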
- Utilises reward shaping but ensures the game is still zero-sum (sketched below).
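One way to keep shaping zero-sum is to mirror every shaped term granted to one team as a penalty to the other; a sketch under that assumption (the paper's exact shaping terms may differ):

```python
def shaped_rewards(base_left: float, shaping_left: float,
                   shaping_right: float) -> tuple:
    """Apply reward shaping while keeping the game zero-sum: each team's
    shaped bonus is subtracted from the other side, so the two rewards
    always sum to zero (the base reward is assumed zero-sum already,
    e.g. +1/-1 on a goal). Illustrative, not the paper's exact terms."""
    left = base_left + shaping_left - shaping_right
    right = -left
    return left, right
```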
- Processes each feature of the game with a separate MLP and then processes the results using a player-conditioned
LSTM. This saves compute, since the feature encodings can be reused between agents (see the sketch below).
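A sketch of that encoder layout, with hypothetical feature groups and sizes; the key point is that the feature encodings depend only on the observation and can be shared across agents, while only the player embedding differs:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Per-feature MLP encoders followed by a player-conditioned LSTM.

    Feature groups and sizes are hypothetical. The feature encodings
    can be computed once and reused for every agent; only the player
    embedding differs per agent.
    """
    def __init__(self, feat_dims=None, hidden=64, n_players=11):
        super().__init__()
        feat_dims = feat_dims or {"ball": 9, "team": 88, "opp": 88}
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
            for name, dim in feat_dims.items()
        })
        self.player_emb = nn.Embedding(n_players, hidden)
        self.lstm = nn.LSTM(hidden * len(feat_dims) + hidden, hidden,
                            batch_first=True)

    def forward(self, feats, player_id, state=None):
        # feats: {name: (batch, time, dim)}; player_id: (batch,)
        enc = torch.cat([self.encoders[k](v) for k, v in feats.items()],
                        dim=-1)
        pid = self.player_emb(player_id).unsqueeze(1).expand(-1, enc.size(1), -1)
        x, state = self.lstm(torch.cat([enc, pid], dim=-1), state)
        return x, state
```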
- Takes random actions at the start of episodes to create varied scenarios.
- Utilises a distributed actor-learner framework (sketched below).
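A bare-bones sketch of an actor-learner setup using Python multiprocessing; `collect_rollout` and `update_policy` are hypothetical placeholders, not the paper's implementation:

```python
import multiprocessing as mp

def collect_rollout(params):
    """Placeholder: run the environment with the current policy."""
    return {"obs": [], "actions": [], "rewards": []}

def update_policy(batch):
    """Placeholder: take a gradient step on a batch of trajectories."""
    pass

def actor_loop(rollouts):
    # Actors step environments and stream trajectories to the learner.
    while True:
        rollouts.put(collect_rollout(params=None))

def learner_loop(rollouts, batch_size=32):
    # The learner consumes trajectories and updates the policy,
    # decoupling environment stepping from gradient computation.
    while True:
        batch = [rollouts.get() for _ in range(batch_size)]
        update_policy(batch)

if __name__ == "__main__":
    q = mp.Queue(maxsize=64)
    for _ in range(4):
        mp.Process(target=actor_loop, args=(q,), daemon=True).start()
    learner_loop(q)
```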