Dominic Rigby

CHIRP: CHange Induced Regret Policy Metrics for Lifelong RL

The paper proposes a metric called SOPR (Scaled Optimal Policy Regret), which quantifies how much expected reward is lost when the environment changes. However, SOPR is infeasible to compute in anything other than simple environments.
It therefore sets out CHIRP (CHange Induced Regret Policy) requirements for a computable metric that can replace SOPR.
Because SOPR is so difficult to compute, the paper proposes another metric for measuring the similarity between different Markov Decision Processes: $W_{1}$-MDP.
$W_{1}$-MDP applies to both discrete and continuous action spaces and is cheap to calculate.
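
As a rough illustration of why a Wasserstein-style comparison can be cheap, the sketch below computes the 1-Wasserstein distance between two equal-sized one-dimensional samples, for example rewards observed for the same state-action pair in two MDPs. Reading $W_{1}$ as the 1-Wasserstein distance is an assumption based on the notation, and the function `w1_distance_1d` and the sample data are hypothetical. This is not the paper's definition of the MDP metric, only the basic building block such a metric might rest on.

```python
def w1_distance_1d(xs, ys):
    """1-Wasserstein distance between two equal-sized 1-D empirical samples.

    For equal-sized samples with uniform weights, the optimal transport plan
    matches sorted values pairwise, so W1 is the mean absolute difference
    of the sorted samples.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical rewards sampled for one state-action pair in two MDPs.
rewards_mdp_a = [0.0, 1.0, 1.0, 2.0]
rewards_mdp_b = [1.0, 2.0, 2.0, 3.0]  # same distribution shifted by +1

print(w1_distance_1d(rewards_mdp_a, rewards_mdp_b))  # 1.0
```

The cost is dominated by sorting, so it grows as O(n log n) in the number of samples and needs no policy to be trained or evaluated.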

CHIRP Policy Reuse (CPR): a method for multi-task training. It clusters similar tasks and then reuses the same policy for the tasks within each cluster.
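
The cluster-then-reuse idea can be sketched as follows, assuming a precomputed matrix of pairwise task distances. The greedy threshold clustering, the function names `cluster_tasks` and `assign_policies`, the threshold, and the distance values are all hypothetical stand-ins chosen for illustration; the notes do not say which clustering procedure CPR actually uses.

```python
def cluster_tasks(distances, threshold):
    """Greedily group task indices whose distance to a cluster's first
    member (its representative) is at most `threshold`."""
    clusters = []  # each cluster is a list of task indices
    for task in range(len(distances)):
        for cluster in clusters:
            if distances[cluster[0]][task] <= threshold:
                cluster.append(task)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([task])
    return clusters


def assign_policies(clusters):
    """Map every task to the index of its cluster, i.e. its shared policy."""
    return {task: policy_id
            for policy_id, cluster in enumerate(clusters)
            for task in cluster}


# Hypothetical pairwise task distances (symmetric, zero diagonal).
distances = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]

clusters = cluster_tasks(distances, threshold=0.3)
print(clusters)                   # [[0, 1], [2, 3]]
print(assign_policies(clusters))  # {0: 0, 1: 0, 2: 1, 3: 1}
```

In this toy example four tasks need only two policies, which is the saving that policy reuse is after.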