John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, Pieter Abbeel
Iterative procedure for optimizing policies, with guaranteed monotonic improvement.
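For reference, the lower bound that underlies the monotonic improvement guarantee, as I recall it from the paper (here $\eta$ is the expected return, $L_\pi$ the local surrogate objective, $D_{\mathrm{KL}}^{\max}$ the maximum KL divergence over states, and $\epsilon = \max_{s,a} |A_\pi(s,a)|$):

```latex
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad C = \frac{4\,\epsilon\,\gamma}{(1-\gamma)^{2}}
```

Maximizing the right-hand side at each iteration therefore guarantees the true return $\eta$ does not decrease.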
The theory guarantees monotonic improvement, but several approximations are needed to make it practical. The resulting practical algorithm is Trust Region Policy Optimization (TRPO); its constrained optimization problem is sketched after this list.
- similar to natural policy gradient methods
- effective for optimizing large nonlinear policies such as neural networks.
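The practical TRPO step replaces the penalty from the theoretical bound with a hard trust-region constraint (with step size $\delta$ a hyperparameter), optimizing the importance-sampled surrogate under an average-KL constraint:

```latex
\max_{\theta}\; \mathbb{E}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\theta_{\mathrm{old}}}(s,a) \right]
\quad \text{subject to} \quad
\mathbb{E}\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \le \delta
```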
Performs robustly on a wide variety of tasks:
- learning simulated robotic swimming, hopping, and walking gaits;
- playing Atari games using images of the screen as input.
- despite approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
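A minimal numpy sketch of one such update, to make the structure concrete: conjugate gradient approximates the natural-gradient direction $F^{-1}g$ without forming the Fisher matrix, then a backtracking line search enforces the KL constraint. The helpers `policy_grad`, `fisher_vector_product`, `surrogate`, and `kl` are hypothetical stand-ins assumed to be provided by surrounding RL code; this illustrates the idea under those assumptions, not the paper's exact implementation.

```python
import numpy as np

# Assumed (hypothetical) helpers from the surrounding RL code:
#   policy_grad(theta)              -> gradient g of the surrogate L(theta)
#   fisher_vector_product(theta, v) -> F v, Fisher-information matrix times v
#   surrogate(theta), kl(theta)     -> scalar surrogate and mean KL vs. theta_old

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products."""
    x = np.zeros_like(g)
    r = g.copy()
    p = g.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta, policy_grad, fisher_vector_product,
              surrogate, kl, delta=0.01):
    g = policy_grad(theta)
    fvp = lambda v: fisher_vector_product(theta, v)
    step_dir = conjugate_gradient(fvp, g)            # x ~ F^{-1} g
    # Scale the step so that (1/2) s^T F s = delta, the KL budget.
    step_size = np.sqrt(2.0 * delta / (step_dir @ fvp(step_dir)))
    full_step = step_size * step_dir
    # Backtracking line search: shrink the step until the surrogate
    # improves and the KL constraint actually holds.
    L_old = surrogate(theta)
    for frac in 0.5 ** np.arange(10):
        theta_new = theta + frac * full_step
        if surrogate(theta_new) > L_old and kl(theta_new) <= delta:
            return theta_new
    return theta  # no acceptable step found; keep the old parameters
```

The line search is what lets the method tolerate the approximations noted above: even when the quadratic KL model or the surrogate is locally inaccurate, a step is only accepted if it empirically improves the surrogate within the trust region.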