Dominic Rigby

Paper Diary

By Dom Rigby

Note: this is a GitHub Pages website. If viewing on GitHub, please go to domrigby.github.io for the full experience.


📌 Introduction

Welcome to my Paper Diary! Due to the seemingly never-ending supply of interesting reinforcement learning papers which have come out in the last few years, I began to try and read at least one every day. I was, however, having the issue that after a month or two I could not remember for the life of me where I had read that interesting fact, method or algorithm. I therefore began keeping a diary of the papers and blog posts I was reading. I recently decided I may as well start posting these in case anyone else finds them interesting or useful!

Entries are added on the go (often from my phone or iPad) and later refined on my laptop.

Note: Layout and formatting are continuously improved when time permits.

🛠️ Method

Identification of Papers

  1. X (Twitter): there is a huge AI community on X which posts papers with discussion in the comments.
    • TIP: if others choose to use this, I would highly recommend using the ‘Not Interested’ feature on posts, otherwise your feed will rapidly deteriorate and show fewer papers.
  2. Reddit: r/MachineLearning
  3. Conferences: I recently attended ICLR and came back with a treasure trove of interesting reads.
  4. Paper references

Use of LLMs

  1. LLMs are NOT used for the analysis of the papers. They are, however, used for checking: I read the paper and write down what I think the key points are, then ask o4-mini-high to do the same and double-check anywhere we disagree.
  2. Paper recommendations
  3. Formatting and helping with markdown.
  4. Quick analysis scripts.

⚙️ Website Workings

This website is a user-friendly entry point and summary of the repository. It hosts the top-level themes and the parts I thought were most interesting. All paper summaries are stored in the repository itself.


📈 Fun Plots

Inspired by Figure 2 of OMNI-EPIC and the method used to create the policy diversity metrics in Foundation Model Self Play, I got o4-mini-high to write a short description of each paper using this prompt, then embedded each description using the sentence-transformers Python library. These embeddings were reduced to 2D using t-SNE and clustered using K-Means. The titles of the clusters are chosen by GPT-4o, which is given the title of each paper in the cluster.
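
A minimal sketch of this pipeline, assuming the LLM-generated descriptions are already saved in a hypothetical `descriptions.json` file (paper title → description); the embedding model named here is an assumption, not necessarily the one used:

```python
import json

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Hypothetical file: paper title -> short LLM-generated description.
with open("descriptions.json") as f:
    papers = json.load(f)

titles = list(papers)
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = model.encode([papers[t] for t in titles])

# 2D coordinates for the scatter plot and a cluster label per paper.
coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(embeddings)
clusters = KMeans(n_clusters=8, random_state=0).fit_predict(embeddings)

for title, (x, y), c in zip(titles, coords, clusters):
    print(f"{title}: cluster {c}, ({x:.2f}, {y:.2f})")
```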

I hope to soon update the embedding model!

There is an option in the legend to turn on convex hulls around the clusters. There is a UMAP version of this plot in the More Plots section. Hover over (or tap on mobile) any point to see the name of the paper.


🔍 Highlights & Lessons Learned

The following section includes:

1. Reinforcement Learning (RL)

  1. High‑Entropy Token Training
    • Training only on high‑entropy (“forks in the road”) tokens yields significant performance gains in LLMs (the 80:20 Rule). Many tokens in language are determined by the surrounding words, so they provide little information to the RL process when chosen: e.g. in “I went to the shop”, “to” and “the” are determined by the other words. See the sketch after this list.
  2. Zone of Proximal Development
    • Methods like ProRL and Absolute Zero Reasoner filter out consistently correct or incorrect prompts to focus learning in the optimal difficulty zone. This is discussed in detail in section 2.
  3. It is possible to make Non‑Verifiable Reward Models
    • Writing‑Zero introduces LLM-based preference training in non‑verifiable environments, then uses that model as a reward in competitive creative writing games.
  4. You can use generative AI to expand experience buffer
    • SynthER trains a diffusion model to expand the replay buffer with synthetic experiences for mixed real‑fake training.
  5. You can learn to reason by simply playing games
    • Play to Generalise demonstrates that game‑based move prediction enhances specific reasoning capabilities.
  6. GPU‑Accelerated Environments provide monumental speed-ups
    • Frameworks like Kinetix and JaxMARL allow you to run tens of thousands of environments in parallel, as well as minimise CPU-GPU overhead.
  7. Foundation models’ roles in RL:
    • Foundation models have intuition about what humans find interesting. They are therefore capable of designing curricula for RL or being involved in the policy-improvement steps. See the open-endedness section of this blog for a summary of a few interesting methods.
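
A minimal sketch of the high-entropy token idea from item 1, assuming per-token logits from the policy are already available; the 20% keep fraction simply mirrors the 80:20 framing and is not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def high_entropy_mask(logits: torch.Tensor, keep_fraction: float = 0.2) -> torch.Tensor:
    """Return a boolean mask selecting the highest-entropy ("fork") tokens.

    logits: (batch, seq_len, vocab) tensor of policy logits.
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1)                    # (batch, seq_len)
    threshold = torch.quantile(entropy.flatten(), 1.0 - keep_fraction)
    return entropy >= threshold

# Usage: zero out the policy-gradient loss on low-entropy ("determined") tokens.
# per_token_loss = per_token_loss * high_entropy_mask(logits).float()
```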

2. Open‑Endedness & Auto‑Curricula

  1. Open-Endedness Requires Novel and Learnable Artefacts
    • Open-endedness is defined in Open-Endedness is Key to ASI: a system is open-ended if it continually creates novel and learnable artefacts. This is dependent on the observer and the time horizon, e.g. a mouse can’t learn chess and a computer will eventually plateau in performance.
  2. Procedural Level Generation is used to create novel environments to learn in
    • POET introduces new levels, checks they meet a minimum learnability criterion and then only adds the most novel.
  3. Prioritized Level Replay is a way to order those environments such that they are learnable. This creates an auto-curriculum (see the sketch after this list).
  4. Randomly generate a new level, or edit an existing one! This creates a population or pool of environments for the agent to interact with.
    • Auto-Curriculum Learning for Driving Scenarios, POET and many other methods introduce the idea of a random generator + editor as the basic building blocks for creating levels. One creates random new levels and the other perturbs existing interesting levels.
  5. Foundation models can act as ‘intelligent search operators’ to create new learning opportunities based on what they have learned that humans would find interesting.
  6. Performance-annealed exploration rewards: exploration bonuses that are annealed as the agent’s performance improves.
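
A minimal sketch of a generator-plus-replay level buffer in the spirit of POET and Prioritized Level Replay; the scoring and sampling details here are assumptions, not any single paper's exact method:

```python
import random
from dataclasses import dataclass, field

@dataclass
class LevelBuffer:
    """Keep a pool of levels and replay the most 'learnable' ones more often."""
    levels: list = field(default_factory=list)   # level seeds / parameters
    scores: list = field(default_factory=list)   # e.g. recent learning progress per level

    def add(self, level, score: float) -> None:
        self.levels.append(level)
        self.scores.append(score)

    def sample(self, replay_prob: float = 0.5):
        # With some probability generate a brand-new random level (the "generator"),
        # otherwise replay an existing level weighted by its learnability score.
        if not self.levels or random.random() > replay_prob:
            return make_random_level()            # hypothetical generator
        return random.choices(self.levels, weights=self.scores, k=1)[0]

def make_random_level():
    return {"seed": random.randrange(1_000_000)}  # placeholder level description
```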

3. Pretraining, Fine-Tuning & General Training Tips

  1. Heterogeneous Pretraining: think outside the box when it comes to data
    • Pi0.5 and V‑JEPA both use video data to train robotics models. This video still contains information of interest to robotics. Pre-training data can come from a wide range of sources!
  2. Reasoning with Next Token Prediction (RNTP): allowing the model to reason about the next token during pre-training.
  3. **When doing PPO/GRPO, make the upper clip bound larger than the lower one** ($\epsilon_{clip,high} - 1 > 1 - \epsilon_{clip,low}$, where the importance ratio is clipped to $[\epsilon_{clip,low}, \epsilon_{clip,high}]$)
    • A higher upper clip bound lets the probability of unlikely tokens grow further per update, which improves exploration and stability (as in ProRL and Play to Generalise). See the clipping sketch after this list.
  4. Dual‑Outcome Reasoning: knowing what’s bad is also useful!
    • Generating both best and worst moves in game scenarios deepens model understanding of decision boundaries (Play to Generalise)
  5. Always use a GPU-based environment when possible
    • Hosting simulation environments on the GPU allows you to run tens of thousands of environments in parallel (JaxMARL, Kinetix).
  6. Beware When Using Qwen for RL
  7. Telling the model how to think improves performance
    • FinCoT improved performance by giving the reasoning model **structured chain-of-thought prompts**. For finance problems, the methods for solving certain types of problem, or at least the important things to look for, are well known. These chain-of-thought patterns are generated using DeepResearch and then added to the prompt after the question as a suggestion of how to think.
  8. Creating ‘soups’ of the models from your different hyperparameter fine-tuning runs can improve performance.
    • ModelSoups achieved SotA performance on ImageNet with a greedy mix (only adding a model if it improves held-out performance). This works because fine-tuned models often end up in the same loss valley, so averaging their weights can lead to lower loss and better performance. See the soup sketch after this list.
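
A minimal sketch of the asymmetric (“clip-higher”) surrogate from item 3; the clip values are illustrative, not the papers' exact settings:

```python
import torch

def clipped_surrogate(log_probs, old_log_probs, advantages,
                      clip_low: float = 0.8, clip_high: float = 1.28):
    """PPO/GRPO-style surrogate where the importance ratio is clipped to
    [clip_low, clip_high] with the upper margin wider than the lower one
    (clip_high - 1 > 1 - clip_low), so unlikely tokens can grow faster."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, clip_low, clip_high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```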
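
And a minimal sketch of the greedy model soup from item 8, assuming you supply the fine-tuned checkpoints (as state dicts, sorted by individual validation accuracy) and an `evaluate` function:

```python
import copy

def greedy_soup(checkpoints, evaluate):
    """Average fine-tuned weights one model at a time, keeping each addition
    only if held-out accuracy improves. `checkpoints` is a list of state_dicts
    sorted by individual validation accuracy; `evaluate(state_dict) -> float`."""
    soup, n = copy.deepcopy(checkpoints[0]), 1
    best = evaluate(soup)
    for sd in checkpoints[1:]:
        # Candidate soup = running average of the accepted models plus this one.
        candidate = {k: (soup[k] * n + sd[k]) / (n + 1) for k in soup}
        score = evaluate(candidate)
        if score >= best:
            soup, n, best = candidate, n + 1, score
    return soup
```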

4. Robotics & Control

  1. Predict multiple actions at once rather than one
    • Mimic One predicts chunks of actions to enforce temporal consistency (see the sketch after this list).
  2. Using diffusion models as policies
    • Diffusion models generate continuous action fields for robot control (Pi0.5, Mimic One).
  3. Learning world models from large-scale video data
    • V‑JEPA pretrains on millions of videos to predict missing frames, then fine‑tunes on robotic datasets for causal understanding and planning.
  4. Pre-Training is possible in robotics
    • V-JEPA and Pi0.5 both used huge amounts of internet video data to train world models to predict actions and effects.
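
A minimal sketch of the action-chunking idea from item 1: a policy head that predicts a short horizon of actions in one forward pass (the sizes are illustrative assumptions, not Mimic One's architecture):

```python
import torch
import torch.nn as nn

class ChunkedPolicyHead(nn.Module):
    """Predict a chunk of `horizon` future actions at once for temporal consistency."""
    def __init__(self, obs_dim: int = 64, action_dim: int = 7, horizon: int = 16):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * action_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # (batch, horizon, action_dim): a short trajectory of actions per observation.
        return self.net(obs).view(-1, self.horizon, self.action_dim)

# At execution time, typically only the first few actions of each chunk are
# executed before re-planning.
```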

5. Distribution

  1. This blog post by Jeremy Jordan covers the basics of how to train a network on thousands of GPUs. Some of the key methods discussed were (see the sketch after this list):
    • Types of parallelism:
      1. Data parallelism: each GPU has a copy of the model and a different batch of data. They then share gradients to do joint updates.
      2. Model parallelism: for large models. Model layers are split over many GPUs.
    • Communication methods:
      1. Scatter: send different data to each GPU
      2. Broadcast: same data to all
      3. Reduce: combine all data on one GPU.
  2. This blog post on distributed PPO outlines some extra factors to think about:
    1. Synchronous: waits for all agents to calculate their respective gradients before doing a weights update.
    2. Asynchronous: doesn’t wait.
    3. Centralised: single server does all gradient accumulation and weights updates.
    4. Decentralised: all share gradients (all-reduce) but have their own model.
  3. IMPALA outlines a now-common distributed reinforcement learning method with multiple actors and a single centralised learner which broadcasts weight updates. This is mimicked in PyTorch in TorchBeast.
  4. Docker can be used like a lightweight virtual machine for distributing actors or learners across large clusters.
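
A minimal sketch of decentralised data parallelism and the collectives mentioned above, using torch.distributed (process-group setup is assumed to be handled by the launcher, e.g. torchrun):

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, loss):
    """Decentralised data parallelism: every rank computes gradients on its own
    batch, then an all-reduce averages them so all replicas stay in sync."""
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

# The basic collectives map onto the terms above:
#   dist.scatter(...)   - send a different shard of data to each GPU
#   dist.broadcast(...) - send the same tensor (e.g. updated weights) to all GPUs
#   dist.reduce(...)    - combine tensors (e.g. gradients) onto one rank
```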

6. Multi‑Agent Reinforcement Learning (MARL)

  1. Stabilise MARL by conditioning agents’ actions on the actions of other agents
    • JointPPO orders agents by decision importance, then uses a recurrent action‑conditioned network to generate actions sequentially.
  2. GPU-based environments are key to tackling the complexity of MARL
    • JaxMARL allows you to run the environment tens of thousands of times in parallel. This means the monumental search space can be explored a bit more thoroughly.
  3. Population‑based methods prevent overfitting and foster diverse behaviors.
  4. Agent selection via Elo‑weighted sampling encourages robustness and competitive balance. This is used in Foundation Model Self Play and Multi-Agent Pommerman (see the sketch below).
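
A minimal sketch of Elo-weighted opponent sampling from item 4; the softmax weighting and temperature are assumptions about one reasonable implementation:

```python
import math
import random

def sample_opponent(agents, elos, temperature: float = 400.0):
    """Sample an opponent with probability increasing in its Elo rating, so
    stronger opponents are seen more often but weaker ones still appear."""
    max_elo = max(elos)
    weights = [math.exp((e - max_elo) / temperature) for e in elos]
    return random.choices(agents, weights=weights, k=1)[0]

# Example:
# opponent = sample_opponent(["v1", "v2", "v3"], [1200.0, 1350.0, 1500.0])
```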

7. Self‑Improvement Strategies

  1. LLMs can do self-play for reasoning, as long as they’re grounded in reality
  2. Unsupervised Self‑Dialog Games
    • VLMs play in‑domain “Guess Who” style games to self‑improve vision‑language reasoning. (VLM Self‑Dialog Games)
  3. Adaptive Prompting & Team Agents
    • Agents of Change evolve prompts and orchestrate agent teams (analyst, coder, researcher) for strategic planning tasks.
  4. Self‑Adapting LLMs
    • SEAL uses RL to generate synthetic edits and hyperparameters, enabling rapid adaptation to new tasks.

⚙️ Repository Structure

├── LLM_reinforcement_learning/    # Papers on RL with language models
├── marl/                          # Multi‑agent RL resources
├── non_LLM_reinforcement_learning/ # RL methods outside LLM context
├── robotics/                      # Robotic learning and control papers
├── self_improvement/              # Self‑play and self‑dialog approaches
├── distribution_and_gpu_acceleration/ # GPU‑accelerated training methods
├── open_endedness_and_auto_curriculums/ # Curriculum learning and open‑endedness
└── README.md                      # This overview and highlights

📖 Full Diary

May 2025

June 2025

July 2025


More Plots

Papers Read Over Time

UMAP

The t-SNE for comparison:


Last updated: 12th July 2025