John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, Pieter Abbeel
Iterative procedure for optimizing policies, with guaranteed monotonic improvement.
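For reference, the lower bound that underlies the monotonic improvement guarantee, as I recall it from the paper (here $\eta$ is the expected return, $L_\pi$ the local surrogate objective, $D_{\mathrm{KL}}^{\max}$ the maximum KL divergence over states, and $\epsilon = \max_{s,a} |A_\pi(s,a)|$):

```latex
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad C = \frac{4\,\epsilon\,\gamma}{(1-\gamma)^{2}}
```

Maximizing the right-hand side at each iteration therefore guarantees the true return $\eta$ does not decrease.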
The theory guarantees monotonic improvement, but several approximations are needed to make it practical. The resulting practical algorithm is Trust Region Policy Optimization (TRPO); its constrained optimization problem is sketched after this list.
- similar to natural policy gradient methods
- effective for optimizing large nonlinear policies such as neural networks.
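The practical TRPO step replaces the penalty from the theoretical bound with a hard trust-region constraint (with step size $\delta$ a hyperparameter), optimizing the importance-sampled surrogate under an average-KL constraint:

```latex
\max_{\theta}\; \mathbb{E}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\theta_{\mathrm{old}}}(s,a) \right]
\quad \text{subject to} \quad
\mathbb{E}\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \le \delta
```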
Performs robustly on a wide variety of tasks:
- learning simulated robotic swimming, hopping, and walking gaits;
- playing Atari games using images of the screen as input.
- despite approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
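A minimal numpy sketch of one such update, to make the structure concrete: conjugate gradient approximates the natural-gradient direction $F^{-1}g$ without forming the Fisher matrix, then a backtracking line search enforces the KL constraint. The helpers `policy_grad`, `fisher_vector_product`, `surrogate`, and `kl` are hypothetical stand-ins assumed to be provided by surrounding RL code; this illustrates the idea under those assumptions, not the paper's exact implementation.

```python
import numpy as np

# Assumed (hypothetical) helpers from the surrounding RL code:
#   policy_grad(theta)              -> gradient g of the surrogate L(theta)
#   fisher_vector_product(theta, v) -> F v, Fisher-information matrix times v
#   surrogate(theta), kl(theta)     -> scalar surrogate and mean KL vs. theta_old

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products."""
    x = np.zeros_like(g)
    r = g.copy()
    p = g.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta, policy_grad, fisher_vector_product,
              surrogate, kl, delta=0.01):
    g = policy_grad(theta)
    fvp = lambda v: fisher_vector_product(theta, v)
    step_dir = conjugate_gradient(fvp, g)            # x ~ F^{-1} g
    # Scale the step so that (1/2) s^T F s = delta, the KL budget.
    step_size = np.sqrt(2.0 * delta / (step_dir @ fvp(step_dir)))
    full_step = step_size * step_dir
    # Backtracking line search: shrink the step until the surrogate
    # improves and the KL constraint actually holds.
    L_old = surrogate(theta)
    for frac in 0.5 ** np.arange(10):
        theta_new = theta + frac * full_step
        if surrogate(theta_new) > L_old and kl(theta_new) <= delta:
            return theta_new
    return theta  # no acceptable step found; keep the old parameters
```

The line search is what lets the method tolerate the approximations noted above: even when the quadratic KL model or the surrogate is locally inaccurate, a step is only accepted if it empirically improves the surrogate within the trust region.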