Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations
Date: 2025-06-04
[arXiv Link](https://arxiv.org/pdf/1709.10087)
Key Points:
- Integrates expert demonstrations into RL by adding a demonstration‐constraint term to the policy gradient loss.
- Ensures the policy initially mimics expert trajectories strongly; over time, the imitation weight is annealed to allow exploration.
- Bridges the gap between pure imitation learning and pure RL, resulting in faster convergence and better final performance.
Key Methods:
- Loss function:
$$
\mathcal{L} = \mathcal{L}_{\text{RL}} + \lambda(t)\,\mathcal{L}_{\text{demo}},
$$
where $\mathcal{L}_{\text{demo}}$ encourages matching the expert’s action distribution (e.g., cross‐entropy), and $\lambda(t)$ decays from $\lambda_0$ to 0 over training; a minimal `combined_loss` sketch follows this list.
- Annealing schedule: often linear or exponential decay for $\lambda(t)$, ensuring early guidance from demonstrations and eventual policy autonomy (see the `imitation_weight` sketch below).
- Expert buffer: store a fixed set of high‐quality trajectories; sample mini‐batches from both the on‐policy rollout buffer (for the RL loss) and the expert buffer (for the imitation loss), as in the `ExpertBuffer` sketch below.
- PPO‐based optimization: combine GAE (Generalized Advantage Estimation) for $\mathcal{L}_{\text{RL}}$ with a cross‐entropy term on expert states for $\mathcal{L}_{\text{demo}}$; a generic GAE sketch closes the examples below.
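
A minimal sketch of the combined objective. It assumes a PyTorch policy exposing a hypothetical `log_prob(obs, act)` method; for a Gaussian policy, the cross‐entropy imitation term reduces to the negative log-likelihood of the expert's actions. This illustrates the technique under those assumptions, not the authors' implementation.

```python
import torch

def combined_loss(policy, rl_batch, demo_batch, lam, clip_eps=0.2):
    """L = L_RL + lambda(t) * L_demo (policy.log_prob is an assumed interface)."""
    obs, act, old_logp, adv = rl_batch  # on-policy rollout mini-batch
    logp = policy.log_prob(obs, act)
    ratio = torch.exp(logp - old_logp)
    # PPO clipped surrogate for L_RL; `adv` are GAE advantages (see final sketch)
    l_rl = -torch.min(ratio * adv,
                      torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv).mean()
    # L_demo: negative log-likelihood of expert actions under the current policy
    demo_obs, demo_act = demo_batch
    l_demo = -policy.log_prob(demo_obs, demo_act).mean()
    return l_rl + lam * l_demo
```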
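
Two possible forms of $\lambda(t)$; $\lambda_0 = 0.1$ and the decay rate are hypothetical defaults chosen for illustration, not values from the paper.

```python
import math

def imitation_weight(step, total_steps, lam0=0.1, schedule="linear", decay_rate=5.0):
    """Anneal lambda(t) from lam0 toward 0 over training (illustrative defaults)."""
    frac = step / total_steps
    if schedule == "linear":
        return lam0 * max(0.0, 1.0 - frac)
    return lam0 * math.exp(-decay_rate * frac)  # exponential decay
```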
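
A sketch of the fixed expert buffer and an update step that mixes mini‐batches from both sources; it reuses `combined_loss` from the first sketch and assumes demonstrations are stored as (obs, action) tensor pairs.

```python
import random
import torch

class ExpertBuffer:
    """Fixed store of (obs, action) pairs from expert demonstrations."""
    def __init__(self, demo_pairs):
        self.pairs = list(demo_pairs)  # never grows during training

    def sample(self, batch_size):
        obs, act = zip(*random.sample(self.pairs, batch_size))
        return torch.stack(obs), torch.stack(act)

def train_step(policy, optimizer, rollout_buffer, expert_buffer, lam,
               batch_size=64, ppo_epochs=10):
    """One update mixing on-policy (RL loss) and expert (imitation loss) batches."""
    for _ in range(ppo_epochs):
        rl_batch = rollout_buffer.sample(batch_size)   # -> L_RL
        demo_batch = expert_buffer.sample(batch_size)  # -> L_demo
        loss = combined_loss(policy, rl_batch, demo_batch, lam)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Keeping the expert buffer fixed means the imitation signal always reflects the original demonstrations, even as the policy's own data distribution shifts during training.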
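
For completeness, the standard GAE recursion that produces the advantages consumed by $\mathcal{L}_{\text{RL}}$. Note that GAE's trace parameter is conventionally also called $\lambda$, but it is unrelated to the imitation weight $\lambda(t)$ above.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    `values` has length T + 1 (includes a bootstrap value for the final state);
    `lam` here is the GAE trace parameter, not the imitation weight lambda(t).
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv
```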