Puffing Up Reinforcement Learning
Date read: 9th November 2025
Blog link
Key Points
- Algorithm changes:
- Switched from Adam to the Muon optimiser: solved problems ~30% faster.
- Used cosine annealing for the learning rate (see the first sketch after this list)
- Data filtering: dropped uninformative trajectories and sampled the rest with prioritised experience replay (second sketch below)
- Mixed PPO and IMPALA:
- PPO contributes clipping and GAE; IMPALA corrects for off-policy data with V-Trace (third sketch below)
- The implementation isn't described in detail, but they note that standard implementations are often slow, so they wrote a custom CUDA kernel to speed it up.
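A minimal sketch of cosine-annealed learning rates using PyTorch's built-in scheduler. The model, learning rates, and step count are placeholder assumptions, not values from the post, and Muon itself isn't shipped with PyTorch, so Adam stands in here.

```python
import torch

model = torch.nn.Linear(128, 6)                       # stand-in policy head
opt = torch.optim.Adam(model.parameters(), lr=3e-4)   # swap in Muon if you have an implementation
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10_000, eta_min=3e-5)

for step in range(10_000):
    loss = model(torch.randn(32, 128)).pow(2).mean()  # dummy loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()  # learning rate decays along a cosine curve towards eta_min
```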
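A rough sketch of prioritised sampling over a trajectory buffer. The priority definition (e.g. mean |advantage| per trajectory), the filtering threshold, and the buffer layout are my assumptions; the post doesn't specify how it scores trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trajectories(priorities: np.ndarray, batch_size: int, alpha: float = 0.6):
    """Sample trajectory indices with probability proportional to priority**alpha."""
    probs = priorities ** alpha
    probs /= probs.sum()
    idx = rng.choice(len(priorities), size=batch_size, p=probs)
    # Importance weights correct for the non-uniform sampling.
    weights = (len(priorities) * probs[idx]) ** -1
    return idx, weights / weights.max()

# Hypothetical priorities; drop "uninformative" trajectories before sampling.
priorities = np.abs(rng.normal(size=1024))
keep = priorities > 0.05
idx, w = sample_trajectories(priorities[keep], batch_size=64)
```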
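A hedged sketch of how PPO-style clipping can be combined with V-Trace off-policy corrections. Tensor shapes, the advantage definition, and all coefficients are illustrative assumptions, not the post's CUDA kernel.

```python
import torch

def vtrace_ppo_loss(behaviour_logp, target_logp, values, next_value,
                    rewards, dones, gamma=0.99, rho_bar=1.0, c_bar=1.0,
                    clip_eps=0.2):
    """All per-step inputs are 1-D tensors of length T."""
    ratio = torch.exp(target_logp - behaviour_logp)          # pi / mu, keeps grad

    with torch.no_grad():                                    # targets need no grad
        rho = torch.clamp(ratio, max=rho_bar)                # V-Trace truncations
        c = torch.clamp(ratio, max=c_bar)
        values_tp1 = torch.cat([values[1:], next_value.view(1)])
        deltas = rho * (rewards + gamma * (1 - dones) * values_tp1 - values)

        # Backward recursion for the V-Trace value targets v_s.
        vs = torch.zeros_like(values)
        acc = torch.zeros(())
        for t in reversed(range(len(rewards))):
            acc = deltas[t] + gamma * (1 - dones[t]) * c[t] * acc
            vs[t] = values[t] + acc

        vs_tp1 = torch.cat([vs[1:], next_value.view(1)])
        adv = rho * (rewards + gamma * (1 - dones) * vs_tp1 - values)

    # PPO clipped surrogate on the (off-policy) importance ratio.
    policy_loss = -torch.min(
        ratio * adv,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
    ).mean()
    value_loss = 0.5 * (values - vs).pow(2).mean()
    return policy_loss + value_loss
```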
- Performance optimisations:
- No dynamic allocations: all allocated at initialisation.
- No observation copies: environments write directly into preallocated buffers (see the sketch after this list)
- Aggressive caching: reuse as much data and memory as possible
- Very asynchronous: inspired by EnvPool
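A simplified sketch of the "no allocations, no copies" pattern: observation and reward buffers are allocated once, and each environment writes directly into its slice of the shared arrays. The `Vectoriser` and `DummyEnv` classes and the buffer shapes here are hypothetical, not the post's implementation.

```python
import numpy as np

NUM_ENVS, OBS_SHAPE = 64, (11, 11, 4)

class DummyEnv:
    """Hypothetical env that writes results in place into shared buffer views."""
    def __init__(self, obs_view, reward_view, done_view):
        self.obs, self.reward, self.done = obs_view, reward_view, done_view

    def step(self, action):
        self.obs[:] = action % 256      # write observation in place, no copy
        self.reward[0] = 1.0
        self.done[0] = False

class Vectoriser:
    def __init__(self, make_env):
        # Everything allocated up front; nothing is allocated per step.
        self.obs = np.zeros((NUM_ENVS, *OBS_SHAPE), dtype=np.uint8)
        self.rewards = np.zeros(NUM_ENVS, dtype=np.float32)
        self.dones = np.zeros(NUM_ENVS, dtype=bool)
        # Each env receives views into the shared buffers and mutates them in place.
        self.envs = [make_env(self.obs[i], self.rewards[i:i+1], self.dones[i:i+1])
                     for i in range(NUM_ENVS)]

    def step(self, actions):
        for env, action in zip(self.envs, actions):
            env.step(action)            # no per-step allocation or copying
        return self.obs, self.rewards, self.dones

vec = Vectoriser(DummyEnv)
obs, rew, done = vec.step(np.arange(NUM_ENVS))
```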
- Hyperparameter tuning:
- Aim: find Pareto efficient hyperparameters between cost and performance
- Created modified version of CARBS algorithm called Protein
- CARBS:
- Randomly generates an initial Pareto frontier of hyperparameter configurations
- Mutates them, then trains Gaussian processes to predict their scores from the hyperparameters
- Uses the GPs to identify promising new hyperparameters (see the sketch after this list)
- Issues: biased towards the Pareto front and susceptible to noise
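A loose sketch of a CARBS-style loop as described above: keep a set of Pareto-efficient configs, mutate them, fit Gaussian processes to predict score and cost from hyperparameters, and pick the most promising mutant. The search space, scoring, and acquisition rule are illustrative assumptions, not the Protein algorithm itself.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def mutate(configs, scale=0.1):
    """Perturb existing (log-scaled) hyperparameter vectors."""
    return configs + rng.normal(scale=scale, size=configs.shape)

def suggest(history_x, history_score, history_cost, pareto_x, n_candidates=256):
    score_gp = GaussianProcessRegressor().fit(history_x, history_score)
    cost_gp = GaussianProcessRegressor().fit(history_x, history_cost)
    candidates = mutate(pareto_x[rng.integers(len(pareto_x), size=n_candidates)])
    mu_s, sd_s = score_gp.predict(candidates, return_std=True)
    mu_c = cost_gp.predict(candidates)
    # Optimistic score per unit of predicted cost (one of many possible rules).
    acquisition = (mu_s + sd_s) / np.maximum(mu_c, 1e-6)
    return candidates[np.argmax(acquisition)]

# Tiny usage with random stand-in data (3 hyperparameters, log-scaled).
hist_x = rng.normal(size=(20, 3))
hist_score = rng.normal(size=20)
hist_cost = np.abs(rng.normal(size=20)) + 0.1
next_config = suggest(hist_x, hist_score, hist_cost, pareto_x=hist_x[:5])
```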
- RL tips:
- Results > methods: make sure experiments use fast environments, or the results can be pure noise
- PPO is normally the go-to
- PPO hyperparameter tips:
- Sweep learning rate
- Gamma and lambda: ask yourself how far into the future matters in this game (worked example after this list)
- Gamma = 1 - 1/(number of steps in that time)
- Lambda: a bit less than gamma
- Don't set clipping too low or the experiment will be 'on-rails'
- Perform hyperparameter searches on reward scalings
- Always use white-box RL where you can: everything you make will break
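A small worked example of the rules of thumb above: pick gamma from the horizon that actually matters, set lambda a bit lower, and sweep learning rate, reward scaling, and clipping. The horizon, frame rate, and sweep grids are my assumptions for illustration.

```python
def gamma_from_horizon(seconds_that_matter: float, steps_per_second: float) -> float:
    """Gamma = 1 - 1/(number of steps in the horizon that matters)."""
    n_steps = seconds_that_matter * steps_per_second
    return 1.0 - 1.0 / n_steps

gamma = gamma_from_horizon(seconds_that_matter=10, steps_per_second=30)  # ~0.9967
lam = gamma - 0.02                                                        # "a bit less" than gamma

sweep = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "reward_scale": [0.1, 1.0, 10.0],
    "clip_coef": [0.1, 0.2, 0.3],   # not too low, or the run is 'on-rails'
    "gamma": [gamma],
    "gae_lambda": [lam],
}
```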