Dominic Rigby

Paper Diary

By Dom Rigby

Note: this is a GitHub Pages website. If viewing on GitHub, please go to domrigby.github.io for the full experience.


📌 Introduction

Welcome to my Paper Diary! Due to the seemingly never-ending supply of interesting reinforcement learning papers which have come out in the last few years, I began trying to read at least one per day. I was, however, having the issue that after a month or two I could not for the life of me remember where I had read that interesting fact, method or algorithm. I therefore began keeping a diary of the papers and blog posts I was reading. I recently decided to start compressing the key points of each paper into short, bite-size summaries. I hope you find something useful!



⚙️ Website Workings

This website is a user-friendly entry point and summary of the repository. It hosts the top-level themes and the parts I thought were most interesting. All paper summaries are stored in this repository.

A list of papers read and links to their summaries is in the full diary section.


📈 My Interest Areas

I am fascinated by emergent behaviour, especially when this behaviour is diverse and unexpected. I therefore tend to focus on reinforcement learning, auto-curriculums and open-endedness, but I also enjoy reading about how this is made possible through clever engineering and distribution.

Inspired by figure 2 of OMNI-EPIC and the policy diversity method in Foundation Model Self-Play, I clustered the papers I have read using the following method:

  1. Embedding: I get o4-mini-high to create a short, one-sentence description of each paper using this prompt. This description is then embedded using the Sentence-Transformers Python library.

  2. Dimensionality Reduction: The embedding dimension is then reduced from 384D to 30D using PCA, and then to 2D using t-SNE and UMAP (see the more plots section).

  3. Clustering: The resultant 2D data points are then clustered using K-Means. The list of paper titles in each cluster is fed into GPT-4o using this prompt (link pending), which asks it to come up with a title for each cluster. This gives me an interesting second opinion on the themes I am exploring. A minimal sketch of this pipeline is shown below.
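The sketch below assumes a list of one-sentence descriptions already exists; the embedding model name, t-SNE perplexity and cluster count are illustrative choices rather than the exact ones used here.

```python
# A minimal sketch of the embed -> reduce -> cluster pipeline described above.
# Model name, perplexity and number of clusters are assumptions for illustration;
# it needs a reasonably large list of descriptions to run.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

descriptions = ["one-sentence summary of paper 1", "..."]  # one entry per paper

# 1. Embed each description into a 384-dimensional vector
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(descriptions)

# 2. Reduce dimensionality: 384D -> 30D with PCA, then 30D -> 2D with t-SNE
reduced = PCA(n_components=30).fit_transform(embeddings)
points_2d = TSNE(n_components=2, perplexity=30).fit_transform(reduced)

# 3. Cluster the resulting 2D points with K-Means
labels = KMeans(n_clusters=8, n_init=10).fit_predict(points_2d)
```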

Hover over any data point to see the name of the paper/blog post. On mobile, go into landscape mode and tap.


🔍 Highlights & Lessons Learned

The following section includes:

1. Reinforcement Learning (RL)

  1. Moments of uncertainty are the best moments to learn from
    • You learn the most when a decision is uncertain, as these moments correspond to “forks in the road” where the decision made will likely strongly affect the outcome.
    • These moments can be described mathematically as high entropy tokens or decisions.
    • Unsurprisingly, training on these tokens yields significant performance gains when training reasoning models (80:20 Rule).
    • A bit of intuition behind this: many tokens in language are determined by other words, so they provide little information in the RL process when they are chosen.

      E.g. in “I went to the shop”, “to” and “the” are determined by the other words, so they provide little information.
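As a rough illustration of the idea (not code from the papers above), per-token entropy can be computed from a model's logits and a mask kept for the most uncertain positions:

```python
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy (in nats) from logits of shape [seq_len, vocab_size]."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

# Keep only the top 20% highest-entropy positions ("forks in the road"),
# e.g. to compute the policy-gradient loss only on those tokens.
logits = torch.randn(128, 50_000)          # dummy logits for illustration
entropy = token_entropies(logits)
threshold = torch.quantile(entropy, 0.8)
high_entropy_mask = entropy >= threshold
```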

  2. You don’t learn anything from always winning… but equally little if you are always losing!
    • There exists a ‘zone of proximal development’ in which agents learn the most about what is right and wrong. This is shown simply in methods such as ProRL and Absolute Zero Reasoner, which filter out consistently correct or incorrect prompts (a minimal sketch of this filter is shown below). This process shares some similarities with auto-curriculum learning, of which a more in-depth discussion can be found in section 2.
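A minimal sketch of this kind of filter, assuming a hypothetical record of recent rollout results per prompt (the thresholds are illustrative):

```python
# Drop prompts the policy already always solves or always fails, keeping those
# in the "zone of proximal development".
def filter_prompts(rollout_success: dict[str, list[bool]],
                   low: float = 0.1, high: float = 0.9) -> list[str]:
    kept = []
    for prompt, results in rollout_success.items():
        win_rate = sum(results) / len(results)
        if low < win_rate < high:   # neither trivially easy nor impossible
            kept.append(prompt)
    return kept
```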
  3. It is possible to make Non‑Verifiable Reward Models (e.g. rewards for creative writing!)
    • Writing‑Zero trains an LLM based preference model to grade creative writing pieces and then uses this to train agents to become better at creative writing.
  4. You can use generative AI to expand the experience buffer
    • SynthER trains a diffusion model to expand the replay buffer with synthetic experiences for mixed real‑fake training.
  5. You can learn to reason by simply playing games
    • Play to Generalise demonstrates that game‑based move prediction enhances reasoning capabilities. Whilst it was trained on games, it showed improved performance on a variety of out-of-domain tasks (maths and multi-modal reasoning).
  6. GPU‑Accelerated Environments provide monumental speed-ups
    • Frameworks like Kinetix and JaxMARL allow you to run tens of thousands of environments in parallel, as well as minimise CPU-GPU overhead.
    • This could allow for some LLM-like RL ‘pre-training’ on vast amounts of data from diverse scenarios before fine-tuning to the ones of interest.
    • Kinetix demonstrates reasonable zero-shot capability on 2D control tasks by training on randomly generated (then filtered) scenarios.
    • I highly recommend visiting their website and having a play around with their online demo: https://kinetix-env.github.io/


    Figure 1: Example of Kinetix general agent zero-shotting unseen handmade scenario [source](https://github.com/FlairOx/Kinetix/)

    • Learning to walk in minutes trains robotic locomotion policies in under ten minutes using GPU environments and provides advice on how to tune the PPO hyperparameters to take advantage of the huge parallelism (e.g. massive mini-batches, short rollouts, etc.).
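A toy sketch of the core pattern behind these frameworks: because the environment step is a pure JAX function, thousands of environments can be stepped in lockstep with jax.vmap. The env_reset and env_step functions here are made-up stand-ins, not the Kinetix or JaxMARL APIs.

```python
import jax
import jax.numpy as jnp

def env_reset(key):
    return jnp.zeros(4)                       # toy 4-dimensional state

def env_step(state, action):
    next_state = state + action               # toy dynamics
    reward = -jnp.sum(next_state ** 2)
    return next_state, reward

num_envs = 16_384
keys = jax.random.split(jax.random.PRNGKey(0), num_envs)
states = jax.vmap(env_reset)(keys)            # [num_envs, 4]

actions = jnp.ones((num_envs, 4)) * 0.01
step_fn = jax.jit(jax.vmap(env_step))         # all environments step on the GPU at once
states, rewards = step_fn(states, actions)
```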
  7. Foundation models have a large role to play in future RL:
    • Foundation models have intuition about what humans find interesting. They are therefore capable of designing curriculums for RL or being involved in the policy improvement steps. See more in the open-endedness section of this blog. Summary of a few interesting methods:
  8. Quality Diversity can be used for testing:
    • MADRID uses a MAP-Elites-style quality-diversity search to find a diverse set of scenarios that the algorithm struggles with. It does this by maximising regret across the search grid.
    • The above was done on TiZero (a football-playing agent) and it found a variety of areas of the pitch in which the agent was not only vulnerable, but also exhibited unexpected behaviours like scoring own-goals.
  9. Hierarchical planning is the more natural way forward:
    • When humans plan, we don’t plan how we’re going to move every muscle in order to get where we want to go. This would be extremely computationally heavy and would make planning over long time horizons impossible, not to mention the drift in our plans due to errors.
    • Hierarchical planning breaks down plans into high-level actions, or options, which are then achieved by lower levels in the hierarchy.

      E.g. if we wanted to go to the shop, the high level planner might plan to walk down the road and then turn right. The low level planner would then do the actual muscle movement.

    • In RL, this tends to be made up of two or more levels of RNNs, with the higher levels being called at lower frequencies.
    • Forecaster introduces a manager-worker world-model framework for this. The manager picks high-level goals on which to condition the worker. It then performs tree search across a set of possible goals in order to pick the best one.
    • Hierarchical Reasoning Model uses this approach but for reasoning.
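A hedged sketch of the general manager/worker pattern: the manager picks a goal every k steps, while the worker acts every step conditioned on that goal. The networks and dimensions are placeholders, not the actual Forecaster or HRM architectures.

```python
import torch
import torch.nn as nn

class Manager(nn.Module):
    def __init__(self, obs_dim=32, goal_dim=8):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim, 64)
        self.to_goal = nn.Linear(64, goal_dim)
        self.h = torch.zeros(1, 64)

    def forward(self, obs):
        self.h = self.cell(obs, self.h)       # recurrent high-level state
        return self.to_goal(self.h)

class Worker(nn.Module):
    def __init__(self, obs_dim=32, goal_dim=8, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + goal_dim, 64),
                                 nn.Tanh(), nn.Linear(64, act_dim))

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=-1))

manager, worker, k = Manager(), Worker(), 10
obs, goal = torch.zeros(1, 32), None
for t in range(100):
    if t % k == 0:                 # manager runs at 1/k the frequency
        goal = manager(obs)
    action = worker(obs, goal)     # worker acts every step, conditioned on the goal
    # obs = env.step(action) ...   # environment step omitted in this sketch
```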
  10. Sometimes you need to stop and have a think about it (scaling test time compute)

2. Open‑Endedness & Auto‑Curricula

Open-endedness and auto-curriculums are crucial for building truly intelligent agents. In the same way that humans didn’t get to the moon by starting out building a rocket, agents can’t achieve superintelligence by just training on a set of pre-defined tasks. Human technology and intelligence have advanced by constantly solving iteratively harder tasks, with the knowledge from old tasks helping us solve the new ones. We can do this because the world around us is open-ended: we can constantly try new experiments and create new artefacts from which we can learn new things. Research in open-endedness tends to focus on how we could do this for reinforcement learning agents. Could we: 1) create environments which are sufficiently complex to be constantly learnable (world models)? (see Genie for the most advanced version of this) 2) create algorithms which can explore this vast search space in a meaningful way?

If you are interested in this, I would highly recommend reading some of Jeff Clune’s or UCL DARK’s work on the topic.

Common Themes in Open-Endedness and Auto-Curriculums Research:

  1. Open-Endedness requires the generation of novel and learnable artefacts:
    • Open-endedness is defined in Open-Endedness is Key to ASI: a system is open-ended if it continually creates novel and learnable artefacts. This is dependent on the observer, the memory and the time horizon.

      Observer example: a mouse can’t learn chess and a computer will eventually plateau in performance. Open-endedness depends on the observer.

      Time-horizon example: AlphaZero is open-ended in chess, but given enough time it will eventually plateau in performance.

      Memory example: Wikipedia might appear open-ended to a human, who could constantly read it and learn new things they had forgotten the last time they read it. An LLM however might be able to memorise the entire thing, given enough weights.

  2. Learnability metrics:
    • Auto-curriculums need a way to rank the novelty and learnability of levels. The main themes I have come across are:
      1. Learning errors: if the network can’t make good predictions about this state, it is likely learnable
      2. Performance: if the network always wins or always loses, there is nothing to be learned. This means auto-curricula prefer levels with a medium win rate, e.g. 0.5 (a minimal sketch is shown below).
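A minimal sketch of a win-rate-based learnability score: levels with intermediate win rates score highest, while levels the agent always wins or always loses score zero. The exact scoring function varies between methods; this one is illustrative.

```python
def learnability(win_rate: float) -> float:
    return win_rate * (1.0 - win_rate)   # peaks at 0.5, zero at 0 and 1

levels = {"easy": 0.98, "hard": 0.03, "just_right": 0.55}
ranked = sorted(levels, key=lambda name: learnability(levels[name]), reverse=True)
# -> ["just_right", "hard", "easy"]
```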
  3. Procedural Level Generation is used to create novel environments to learn in
    • Procedural generation allows you to algorithmically create new levels, often parameterised by the curriculum.

      E.g. Minecraft procedurally generates landscapes as you explore. This could be made into a curriculum by spawning resources near the user.

    • Auto-curriculum methods can learn to choose parameters which are in the zone of proximal development.
    • E.g. POET introduces new level generation parameters, checks they meet a minimum learnability criterion and then only adds the most novel.
  4. Prioritized Level Replay is a way to choose the previous levels which are the most learnable
  5. Randomly generate a new level, or edit an existing one!: this creates a population or pool of environments for the agent to interact with
    • Auto-Curriculum Learning for Driving Scenarios, POET and many other methods introduce the idea of a random generator + editor as the basic building blocks for creating levels. One creates random new levels and the other perturbs existing interesting levels. These new levels are then tested and filtered to ensure they are sufficiently learnable.
  6. Curriculum generation can be more intelligent using Foundation Models
    • FMs can act as ‘intelligent search operators’ to create new learning opportunities based on what they have learned that the agent would find difficult (e.g. EUREKAVERSE) or humans would find interesting (e.g. OMNI-EPIC).


    Figure 2: The architecture OMNI-EPIC uses to utilise foundation models to create interesting, novel scenarios through code [source](https://omni-epic.vercel.app/)

  7. Performance annealed exploration reward:
  8. Euclidean distance in the embedding space as a novelty metric:
    • Many papers use Euclidean distance in the embedding or feature space as a novelty metric: Foundation Model Self-Play, Enhanced POET, OMNI-EPIC. The basic premise is: the closer a new data point is to the others, the less novel it is (a minimal sketch is shown below).
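A minimal sketch of this premise, using the mean distance to the k nearest neighbours in an archive of previously seen embeddings; the names, dimensions and value of k are illustrative, not taken from the papers above.

```python
import numpy as np

def novelty(candidate: np.ndarray, archive: np.ndarray, k: int = 5) -> float:
    """Mean Euclidean distance from a candidate embedding to its k nearest neighbours."""
    dists = np.linalg.norm(archive - candidate, axis=1)
    return float(np.sort(dists)[:k].mean())

archive = np.random.randn(200, 384)       # e.g. embeddings of previously seen levels
candidate = np.random.randn(384)
score = novelty(candidate, archive)       # higher = further from anything seen = more novel
```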
  9. We can learn to learn to generate curriculums
    • MM-ACL introduces a method to learn a model which predicts the improvement an agent will gain on a new level from a history of its past performances. This model is then used to generate new levels which have the highest possible performance improvement.
  10. DISCOVER uses value and uncertainty of an ensemble of critics to form an auto-curriculum for sparse-rewards
    • Policy and values are conditioned on intermediate goal states (g) which are chosen to maximise novelty, achievability and relevance to goal state (g*).
    • Insight:
      • High V(s0, g) means the task is likely achievable from the start state s0.
      • High std(s0, g) means this estimate is unreliable and therefore the sub-goal is likely novel.
      • High V(g, g*) means the sub-goal g is close to the target goal g*.
      • We can therefore aim the agent at increasingly difficult (but achievable) goals.
  11. Parallelisable planning for model-based RL (GPU-able MCTS?!)
    • SMX uses a particle-filtering method to perform rollouts and identify a target for policy improvement.
    • The advantage of using particle filters over MCTS is that they are entirely parallelisable (GPU-able!) and don’t require storing a tree.
    • It also works for both continuous and discrete action spaces.

3. Pretraining, Fine-Tuning & General Training Tips

  1. Heterogeneous Pretraining: think outside the box when it comes to data
    • Pi0.5 and V‑JEPA both use video data to train robotics models. These videos contain a lot of information of interest to robotics. Pre-training data can come from a wide range of sources!
  2. Reasoning with Next Token Prediction (RNTP): (allowing the model to reason about the next token during pre-training)
  3. When doing PPO/GRPO, make the upper bound clip larger
    • Raising the upper clip bound increases the probability of unlikely choices being picked, improving exploration and stability (as in ProRL and Play to Generalise). A minimal sketch is shown below.
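A hedged sketch of the asymmetric ("clip-higher") surrogate; the bounds are illustrative, not the exact values used in the cited papers.

```python
import torch

def clipped_surrogate(log_probs, old_log_probs, advantages,
                      eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate objective (to be maximised) with a larger upper clip bound.

    eps_high > eps_low lets the probability of initially unlikely actions grow
    further before being clipped, which encourages exploration.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.min(ratio * advantages, clipped * advantages).mean()
```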
  4. Dual‑Outcome Reasoning: knowing what’s bad is also useful!
    • Generating both best and worst moves in game scenarios deepens model understanding of decision boundaries (Play to Generalise).
    • XLand did something analogous with their self reward-play, in which agents had to learn to achieve a goal but then also learn how to undo it, increasing their generalisability.
  5. Beware When Using Qwen for RL
    • RL with Spurious Rewards shows that random reward signals can still improve performance on Qwen-2.5-maths. The authors explain that this is likely caused by RL encouraging the model to produce more code.
  6. Telling the model how to think improves performance (CoT prompting)
    • FinCoT improved performance by giving the reasoning model **structured chain-of-thought prompts**. For finance problems, the methods for solving certain types of problems are well known, or at least the important things to look for are. These chain-of-thought patterns are generated using DeepResearch and then added to the prompt after the question as a suggestion of how to think.
  7. Creating ‘soups’ of all your different hyperparameter fine-tuned models can improve performance.
    • ModelSoups achieved SotA performance on ImageNet by doing a greedy mix (only adding a model if it improves performance). This works because fine-tuned models often end up in the same loss valley, so averaging their weights can lead to lower loss and better performance. A minimal sketch is shown below.
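A minimal sketch of the greedy-soup idea, assuming a list of fine-tuned checkpoints (state dicts sorted by individual validation accuracy) and a hypothetical evaluate function.

```python
import copy

def greedy_soup(checkpoints, evaluate):
    """Average checkpoints one at a time, keeping each only if held-out accuracy improves."""
    soup, n = copy.deepcopy(checkpoints[0]), 1
    best_acc = evaluate(soup)
    for sd in checkpoints[1:]:
        candidate = {k: (soup[k] * n + sd[k]) / (n + 1) for k in soup}  # running average
        acc = evaluate(candidate)
        if acc >= best_acc:                 # keep the ingredient only if it helps
            soup, n, best_acc = candidate, n + 1, acc
    return soup
```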
  8. Prompt optimisation can outperform RL on single tasks
    • GEPA showed that optimising prompts can be far more effective and sample-efficient than GRPO. This is done by mutating prompts according to feedback on the chain of thought from other LLMs (intelligent search operators! (1.7)). This makes sense if RL mostly just increases the likelihood of using knowledge already baked into the model.
  9. General solvers through pre-training:
    • GOAL trained a transformer to solve a set of combinatorial optimisation problems. Whilst it performed slightly worse than tailor-made solutions, it showed that features of these problems are shared, meaning specialist solvers could be fine-tuned quickly. It was, however, trained on problems solved by dynamic programming. It would be interesting to see how this could be combined with DRL, perhaps using GPU environments to generate the vast amounts of data needed.
  10. RL leads to less catastrophic forgetting than SFT:
    • As explained in RL’s Razor, RL will choose a new policy close to the original policy by gradually updating the non-zero probabilities. SFT does not do this, and instead drags the whole policy to a random point in the new task’s optimal policy space.

4. Robotics & Control

  1. Predict multiple actions at once rather than one
    • Mimic One predicts chunks of actions to enforce temporal consistency.
  2. Using diffusion models as policies
    • Diffusion models generate continuous action fields for robot control (Pi0.5, Mimic One). This could also be considered hierarchical planning, as we create the field of the action we want to perform and then allow the lower-level control systems to actually perform it.
  3. Learning world models from large scale video data
    • V‑JEPA pretrains on millions of videos to predict missing frames, then fine‑tunes on robotic datasets for causal understanding and planning.
  4. Pre-Training is possible in robotics
    • V-JEPA and Pi0.5 both used huge amounts of internet video data to train world models to predict actions and effects.

5. Distribution

  1. This blog post by Jeremy Jordan covers the basics of how to train a network on thousands of GPUs. Some of the key methods spoken about were:
    • Types of parallelism:
      1. Data parallelism: each GPU has a copy of the model and a different batch of data. They then share gradients to do joint updates.
      2. Model parallelism: for large models. Model layers are split over many GPUs.
    • Communication methods:
      1. Scatter: send different data to each GPU
      2. Broadcast: same data to all
      3. Reduce: combine all data on one GPU.
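A minimal sketch of the data-parallel case in PyTorch, assuming the distributed process group has already been initialised (setup code omitted). Each process computes gradients on its own shard of data, then gradients are summed across GPUs and averaged so every replica applies the same update.

```python
import torch
import torch.distributed as dist

def all_reduce_gradients(model: torch.nn.Module):
    """Average gradients across all workers so every replica takes the same step."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # combine on all GPUs
            param.grad /= world_size                           # average

# Typical loop on every worker:
#   loss = compute_loss(model, local_batch)
#   loss.backward()
#   all_reduce_gradients(model)
#   optimizer.step(); optimizer.zero_grad()
```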
  2. This blog post on distributed PPO outlines some extra factors to think about:
    1. Synchronous: waits for all agents to calculate their respective gradients before doing a weights update.
    2. Asynchronous: doesn’t wait.
    3. Centralised: single server does all gradient accumulation and weights updates.
    4. Decentralised: all share gradients (all-reduce) but have their own model.
  3. IMPALA outlines a now-common distributed reinforcement learning setup with multiple actors and a single centralised learner which broadcasts weight updates. This is mimicked in PyTorch in TorchBeast.
    • V-trace is an important part of this setup. It uses importance sampling to account for the collected data becoming increasingly off-policy over time.
  4. Decentralised PPO can scale better
    • DD-PPO scales PPO almost linearly up to 128 parallel agents using decentralised, synchronous training.
    • It crucially relies on a preemptive threshold to end rollouts and start training once a high proportion of environments are finished and only stragglers remain.
  5. Docker can be used like a lightweight virtual machine for distributing actors or learners across large clusters.

6. Multi‑Agent Reinforcement Learning (MARL)

  1. Stabilise MARL by conditioning agents’ actions on the actions of other agents
    • JointPPO orders agents by decision importance, then uses a recurrent action‑conditioned network to generate actions sequentially
  2. GPU-based environments are key to tackling the complexity of MARL
    • JaxMARL allows you to run tens of thousands of environments in parallel. This means the monumental search space can be explored a bit more thoroughly.
  3. Population‑based methods prevent overfitting, foster diverse behaviours and can help tackle non-transitivity
  4. Focus on playing the agents which you struggle against. (similar to curriculums)
    • Agent selection via ELO‑weighted sampling encourages robustness and competitive balance. This is used in Multi-Agent Pommerman, AlphaStar and more.
    • Simpler heuristics can also be used: e.g. TiZero used $(1-p)^2$ (where p is the probability of victory against an opponent) to define a probability distribution which encourages you to focus on agents you can’t beat (a minimal sketch is shown below).
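A minimal sketch of that heuristic; the win probabilities are illustrative.

```python
import numpy as np

def opponent_distribution(win_probs: np.ndarray) -> np.ndarray:
    """Sampling probability proportional to (1 - p)^2, where p is the win probability."""
    weights = (1.0 - win_probs) ** 2
    return weights / weights.sum()

win_probs = np.array([0.95, 0.60, 0.20])       # past opponents, easiest to hardest
probs = opponent_distribution(win_probs)       # hardest opponent is sampled most often
opponent = np.random.choice(len(win_probs), p=probs)
```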
  5. TiZero Football: Strong implementation example of many-on-many competitive and collaborative game
    • Their paper provides a strong example of a system designed to play many-on-many games and gives a detailed account of the architecture choices, curriculum and self-play methodology.

7. Self‑Improvement Strategies

  1. LLMs can do self-play for reasoning, as long as they’re grounded in reality
  2. Unsupervised Self‑Dialog Games
    • VLMs play in‑domain “Guess Who” style games to self‑improve vision‑language reasoning. (VLM Self‑Dialog Games)
  3. Adaptive Prompting & Team Agents
    • Agents of Change evolve prompts and orchestrate agent teams (analyst, coder, researcher) for strategic planning tasks.
  4. Self‑Adapting LLMs
    • SEAL uses RL to generate synthetic edits and hyperparameters, enabling rapid adaptation to new tasks.

8. Architectures:

  1. Rohit Bandaru’s blog post summarised Yann LeCun’s JEPA architecture, which makes the following suggestions:
    1. A framework for human-level AI: it includes a number of different modules, each playing a role found in the human brain.


      Figure 1: Yann LeCun's architecture for human-level AI [source](https://openreview.net/pdf?id=BZ5a1r-kVsf)

    2. Energy Based Models:
      • Energy-based models predict how plausible a future state is.
      • It’s impossible to know exactly what will happen in the next state… but it is possible to predict a latent representation of it.
      • EBMs aim to predict the distance between the embeddings of the current and future state.
      • There is, however, still uncertainty, so a random latent variable is used in the prediction of the future state to account for this randomness (a sketch is shown below).
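A hedged sketch of the idea, with placeholder modules rather than the actual JEPA implementation: encode the current and future observations, predict the future latent from the current one plus a latent variable z for the unpredictable part, and use the distance between the prediction and the actual future embedding as the energy.

```python
import torch
import torch.nn as nn

obs_dim, latent_dim, z_dim = 64, 32, 8
encoder = nn.Linear(obs_dim, latent_dim)           # placeholder encoder
predictor = nn.Linear(latent_dim + z_dim, latent_dim)

def energy(obs_t, obs_next, z):
    s_t = encoder(obs_t)
    s_next = encoder(obs_next)
    s_pred = predictor(torch.cat([s_t, z], dim=-1))
    return ((s_pred - s_next) ** 2).sum(dim=-1)    # low energy = plausible future

obs_t, obs_next = torch.randn(1, obs_dim), torch.randn(1, obs_dim)
z = torch.randn(1, z_dim)                          # latent variable capturing the uncertainty
print(energy(obs_t, obs_next, z))
```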
  2. Hierarchical multi-timescale planning:
    • When humans plan, we do it at multiple timescales. When you think “I’m going to go to work”, you don’t think about every single joint movement you are going to make to get there. You plan the highest-level actions and then break them down into sub-tasks. This is what Yann LeCun suggests and is what the Hierarchical Reasoning Model implements: a high-level planner runs at a low frequency while a high-frequency recurrent neural network carries out the plans which the high-level planner creates.
  3. Interesting Observation Spaces:
  4. Graphs are a great way to represent data which includes relationships
    • Intro to Graph Neural Networks provides a great introduction to graphs and how we can build neural networks to learn things about them. It also introduces key ideas like how to present the edges to the network, how to batch varying-sized graphs, and message passing (a sketch is shown below).
    • Graph Transformers provide a highly capable model for evaluating graphs. Their self-attention models connections between all nodes and/or edges. As is the case with transformers, this comes at a high compute and memory cost. A GT was applied in an RL context in this paper.
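A minimal sketch of one round of message passing on a toy graph; shapes and modules are illustrative, not from the linked posts.

```python
import torch
import torch.nn as nn

num_nodes, feat_dim = 5, 16
x = torch.randn(num_nodes, feat_dim)               # node features
edge_index = torch.tensor([[0, 1, 2, 3],           # source nodes
                           [1, 2, 3, 4]])          # target nodes

update = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())

def message_pass(x, edge_index):
    src, dst = edge_index
    messages = torch.zeros_like(x)
    messages.index_add_(0, dst, x[src])            # sum messages arriving at each node
    return update(torch.cat([x, messages], dim=-1))  # update node representations

x = message_pass(x, edge_index)
```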

9. Quantisation

  1. Maarten Grootendorst’s blog post on quantisation for LLMs gives a nice intro to the topic with some intuitive explanations. A brief overview:
    • Quantisation:
      • Reducing the precision of a model’s numerical representation to reduce its memory overhead.
      • This essentially means storing high-precision datatypes such as float32 as smaller datatypes such as uint8.
    • Why quantise?
      • LLMs require billions of parameters and therefore massive amounts of memory… smaller datatypes = smaller memory footprint
      • Using smaller datatypes runs faster (faster memory access, more parallelism, integer accelerated operations)
    • Techniques:
      • Linear mapping:
        • Symmetric: scales all values by s and then uses a signed integer (the range is -max to +max)
        • Asymmetric: scales and then applies a bias (zero-point) such that the range is min to max (more efficient and precise; see the sketch after this list)
      • Clipping and calibration:
        • Including outliers can massively reduce precision, as they increase range.
        • Methods often set a reasonable range (e.g. ±5 standard deviations) and then clip the rest of the values
      • Activation quantisation: activation ranges depend on the input and are not known in advance, so you must come up with a strategy to quantise them when they appear:
        1. Dynamic quantisation: the scale and zero-point are calculated during inference
        2. Static quantisation: the quantisation parameters are set before inference using a pre-defined calibration dataset.
    • Types:
      • Post Training Quantisation:
        • Weights are quantised after training
      • Quantisation Aware Training
        • Quantises and dequantises during training such that the model can locate the best minima which accounts for its effects.
        • Often lowers FP32 accuracy (no quant) but increases accuracy in low precision models (e.g. int4)
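A minimal sketch of the asymmetric linear mapping described above: values are scaled by s and shifted by a zero-point so the tensor's [min, max] range maps onto [0, 255]. Clipping and calibration choices are not shown.

```python
import numpy as np

def quantise(x: np.ndarray):
    qmin, qmax = 0, 255
    scale = float(x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - float(x.min()) / scale))
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantise(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1000).astype(np.float32)
q, scale, zp = quantise(x)
x_hat = dequantise(q, scale, zp)       # approximate reconstruction of x
```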

10. GPU Architecture and PyTorch


🛠️ Method

Identification of Papers

  1. X (Twitter): there is a huge AI community on Twitter which posts papers with discussion in the comments.
    • TIP: If others choose to use this, I would highly recommend using the ‘Not Interested’ feature on posts, otherwise your feed will rapidly deteriorate and show fewer papers.
  2. Reddit: r/MachineLearning
  3. Conferences: I recently attended ICLR and came back with a treasure trove of interesting reads.
  4. Paper references

Use of LLMs

  1. LLMs are NOT used for the analysis of the papers. They are, however, used for checking: I read the paper and write down what I think the key points are. I then ask o4-mini-high to do the same and double-check anywhere we disagree.
  2. Paper recommendations
  3. Formatting and helping with markdown.
  4. Quick analysis scripts.

⚙️ Repository Structure

├── LLM_reinforcement_learning/    # Papers on RL with language models
├── marl/                          # Multi‑agent RL resources
├── non_LLM_reinforcement_learning/ # RL methods outside LLM context
├── robotics/                      # Robotic learning and control papers
├── self_improvement/              # Self‑play and self‑dialog approaches
├── distribution_and_gpu_acceleration/ # GPU‑accelerated training methods
├── open_endedness_and_auto_curriculums/ # Curriculum learning and open‑endedness
└── README.md                      # This overview and highlights

📖 Full Diary

Click the links to see the summaries and get links to the original paper.

May 2025

June 2025

July 2025

August 2025

September 2025

October 2025

1st: Current Best Practices for Training LLMs from Scratch


More Plots

Papers Read Over Time

UMAP

The t-SNE for comparison: