Defeating the Training-Inference Mismatch via FP16
Date read: 1st November 2025
Paper link
Key Points
- In RL, the training engine (optimised for calculating gradients) and the inference engine (optimised for fast sampling) are often separate.
- There is often a slight mismatch between the inference and training policies, which causes instability in training (see the second sketch below).
- This is exaggerated if one is in FP16 and the other in BF16 (the first sketch below shows the numeric behaviour of each):
  - FP16: 10 mantissa bits and 5 exponent bits. High precision but low range; susceptible to overflow and underflow (values become inf or 0).
  - BF16: 7 mantissa bits and 8 exponent bits. Low precision but the full FP32 range. Common for training neural networks.
- BF16 is the culprit: even when both inference and training run in BF16, its low precision makes the two policies drift apart and destabilises training.
- The paper finds that using FP16 for RL fixes these issues. During fine-tuning, overflow should not be a problem, as the model's dynamic range is already bound during training.
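A minimal sketch of the two formats' behaviour, assuming PyTorch is available (`torch.finfo` reports each dtype's limits; the example values are chosen to expose the range and precision differences):

```python
import torch

# Numeric limits of the two 16-bit formats.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny, "eps:", info.eps)
# float16:  max 65504.0,   tiny ~6.1e-05, eps ~9.8e-04
# bfloat16: max ~3.4e+38,  tiny ~1.2e-38, eps ~7.8e-03

# FP16's narrow range: 70000 overflows to inf, while BF16 holds it (coarsely).
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf
print(x.to(torch.bfloat16))  # 70144. (in range, but rounded by the short mantissa)

# BF16's short mantissa: a value FP16 can resolve rounds away to 1.0 in BF16.
y = torch.tensor(1.0009765625)  # 1 + 2**-10, exactly one FP16 ulp above 1
print(y.to(torch.float16))   # 1.0010
print(y.to(torch.bfloat16))  # 1.0
```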
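And a toy illustration of the mismatch itself, not the paper's setup: in practice the drift comes from the training and inference engines using different kernels, but simply rounding the same logits to a lower precision already shows the scale of the error. The shapes and seed below are made up for illustration; the importance ratio pi_train/pi_infer would be exactly 1 with no mismatch:

```python
import torch

torch.manual_seed(0)
# Hypothetical logits over a 32k vocabulary, a stand-in for a policy's output.
logits_fp32 = torch.randn(8, 32000)

logp_train = torch.log_softmax(logits_fp32, dim=-1)  # "training" pass in FP32
for dtype in (torch.bfloat16, torch.float16):
    # Simulate the inference engine computing the same logits at lower precision.
    logp_infer = torch.log_softmax(logits_fp32.to(dtype).to(torch.float32), dim=-1)
    # Per-token importance ratio pi_train / pi_infer; 1.0 means no mismatch.
    ratio = (logp_train - logp_infer).exp()
    print(dtype, "max |ratio - 1|:", (ratio - 1).abs().max().item())
# BF16's coarser rounding produces noticeably larger deviations than FP16.
```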