Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards
Date: 12th June 2025
arXiv Link
Key Points:
- Aims to bridge the gap between non-verifiable tasks and verifiable rewards in reinforcement learning (RL) for LLMs.
- Subjective tasks such as creative writing are inherently unverifiable and opinion-based, which makes it very hard to
assign them rewards in RL.
- Scoring normally relies on preferences between two pieces. This paper trains a Generative Reward Model (GenRM) to score
two pieces of writing against each other, out of a total of 10. The GenRM itself is trained with LLM-style training.
Key Methods:
- Bootstrapped Relative Policy Optimisation (BRPO): as mentioned above, pairwise scoring needs a reference piece to compare
each output against. BRPO picks a random sample from the batch as that reference, so the reference is always fresh.
- DAPO: filters out any prompts whose samples all score 0 or all score 1, as these give no contrast to learn from.
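
The two methods above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the +1/-1 reward values, the zero reward for the reference itself, and the length-based `toy_judge` (a stand-in for the GenRM) are all assumptions made for the example.

```python
import random
from typing import Callable, List


def toy_judge(candidate: str, reference: str) -> bool:
    """Hypothetical stand-in for the pairwise GenRM: the longer text 'wins'."""
    return len(candidate) > len(reference)


def bootstrapped_rewards(
    responses: List[str],
    judge: Callable[[str, str], bool],
    rng: random.Random,
) -> List[float]:
    """Pick one response at random as the reference and judge the rest against it.

    Assumed reward scheme: +1 if the judge prefers the response over the
    reference, -1 otherwise, and 0 for the reference itself.
    """
    ref_idx = rng.randrange(len(responses))
    reference = responses[ref_idx]
    rewards = []
    for i, response in enumerate(responses):
        if i == ref_idx:
            rewards.append(0.0)  # assumption: the reference is not rewarded
        else:
            rewards.append(1.0 if judge(response, reference) else -1.0)
    return rewards


def keep_group(rewards: List[float]) -> bool:
    """Filter in the spirit of the DAPO bullet: drop groups whose judged
    responses all won or all lost, since they carry no contrast."""
    judged = [r for r in rewards if r != 0.0]
    if not judged:
        return False  # nothing was compared, so there is nothing to learn from
    wins = sum(1 for r in judged if r > 0)
    return 0 < wins < len(judged)


if __name__ == "__main__":
    rng = random.Random(0)
    group = ["short", "a bit longer", "the longest response of all"]
    rewards = bootstrapped_rewards(group, toy_judge, rng)
    print(rewards, "keep" if keep_group(rewards) else "drop")
```

Because the reference is redrawn on every call, each group of samples is compared against a piece from the current policy rather than a fixed, possibly stale, reference text.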