diff --git a/README.md b/README.md index 30084ff91..bb1ffed96 100644 --- a/README.md +++ b/README.md @@ -23,7 +23,7 @@ - [Deep Q-Network (DQN)](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) - [Double DQN](https://arxiv.org/pdf/1509.06461.pdf) - [Dueling DQN](https://arxiv.org/pdf/1511.06581.pdf) -- [C51](https://arxiv.org/pdf/1707.06887.pdf) +- [Categorical DQN (C51)](https://arxiv.org/pdf/1707.06887.pdf) - [Quantile Regression DQN (QRDQN)](https://arxiv.org/pdf/1710.10044.pdf) - [Advantage Actor-Critic (A2C)](https://openai.com/blog/baselines-acktr-a2c/) - [Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/pdf/1509.02971.pdf) @@ -39,7 +39,8 @@ Here is Tianshou's other features: -- Elegant framework, using only ~2000 lines of code +- Elegant framework, using only ~3000 lines of code +- State-of-the-art [MuJoCo benchmark](https://github.com/thu-ml/tianshou/tree/master/examples/mujoco) for REINFORCE/A2C/PPO/DDPG/TD3/SAC algorithms - Support parallel environment simulation (synchronous or asynchronous) for all algorithms [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html#parallel-sampling) - Support recurrent state representation in actor network and critic network (RNN-style training for POMDP) [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html#rnn-style-training) - Support any type of environment state/action (e.g. a dict, a self-defined class, ...) [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html#user-defined-environment-and-different-state-representation) diff --git a/docs/index.rst b/docs/index.rst index a3acbe64c..531b13070 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -13,7 +13,7 @@ Welcome to Tianshou! * :class:`~tianshou.policy.DQNPolicy` `Deep Q-Network `_ * :class:`~tianshou.policy.DQNPolicy` `Double DQN `_ * :class:`~tianshou.policy.DQNPolicy` `Dueling DQN `_ -* :class:`~tianshou.policy.C51Policy` `C51 `_ +* :class:`~tianshou.policy.C51Policy` `Categorical DQN `_ * :class:`~tianshou.policy.QRDQNPolicy` `Quantile Regression DQN `_ * :class:`~tianshou.policy.A2CPolicy` `Advantage Actor-Critic `_ * :class:`~tianshou.policy.DDPGPolicy` `Deep Deterministic Policy Gradient `_ @@ -30,6 +30,7 @@ Welcome to Tianshou! Here is Tianshou's other features: * Elegant framework, using only ~2000 lines of code +* State-of-the-art `MuJoCo benchmark `_ * Support parallel environment simulation (synchronous or asynchronous) for all algorithms: :ref:`parallel_sampling` * Support recurrent state representation in actor network and critic network (RNN-style training for POMDP): :ref:`rnn_training` * Support any type of environment state/action (e.g. 
a dict, a self-defined class, ...): :ref:`self_defined_env`
diff --git a/examples/mujoco/README.md b/examples/mujoco/README.md
index 9b32d9063..798b9e97e 100644
--- a/examples/mujoco/README.md
+++ b/examples/mujoco/README.md
@@ -16,9 +16,8 @@ Supported algorithms are listed below:
- [Twin Delayed DDPG (TD3)](https://arxiv.org/pdf/1802.09477.pdf), [commit id](https://github.com/thu-ml/tianshou/tree/e605bdea942b408126ef4fbc740359773259c9ec)
- [Soft Actor-Critic (SAC)](https://arxiv.org/pdf/1812.05905.pdf), [commit id](https://github.com/thu-ml/tianshou/tree/e605bdea942b408126ef4fbc740359773259c9ec)
- [REINFORCE algorithm](https://papers.nips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf), [commit id](https://github.com/thu-ml/tianshou/tree/e27b5a26f330de446fe15388bf81c3777f024fb9)
-- A2C, commit id (TODO)
-
-## Offpolicy algorithms
+- [Advantage Actor-Critic (A2C)](https://openai.com/blog/baselines-acktr-a2c/), [commit id](https://github.com/thu-ml/tianshou/tree/1730a9008ad6bb67cac3b21347bed33b532b17bc)
+- [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf), [commit id](https://github.com/thu-ml/tianshou/tree/5d580c36624df0548818edf1f9b111b318dd7fd8)

#### Usage

@@ -48,15 +47,16 @@ This will start 10 experiments with different seeds.

Other graphs can be found under `/examples/mujuco/benchmark/`

-#### Hints
-
-In offpolicy algorithms(DDPG, TD3, SAC), the shared hyperparameters are almost the same[[8]](#footnote8), and most hyperparameters are consistent with those used for benchmark in SpinningUp's implementations[[9]](#footnote9).
+## Offpolicy Algorithms
+#### Notes

-By comparison to both classic literature and open source implementations (e.g., SpinningUp)[[1]](#footnote1)[[2]](#footnote2), Tianshou's implementations of DDPG, TD3, and SAC are roughly at-parity with or better than the best reported results for these algorithms.
+1. In offpolicy algorithms (DDPG, TD3, SAC), the shared hyperparameters are almost the same, and unless otherwise stated, hyperparameters are consistent with those used for the benchmark in SpinningUp's implementations (e.g. we use batch size 256 in DDPG/TD3/SAC while SpinningUp uses 100; minor differences also lie in `start-timesteps`, the data-collection loop (`step_per_collect`), and the method used to bootstrap steps truncated by the time limit or by unfinished/still-collecting episodes, which contributes to the performance improvement; see the sketch right after these notes).
+2. By comparison to both classic literature and open source implementations (e.g., SpinningUp)[[1]](#footnote1)[[2]](#footnote2), Tianshou's implementations of DDPG, TD3, and SAC are roughly at-parity with or better than the best reported results for these algorithms, so you can definitely use Tianshou's benchmark for research purposes.
+3. We didn't compare offpolicy algorithms to the OpenAI Baselines [benchmark](https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm), because for now it seems that they haven't provided a benchmark for offpolicy algorithms; however, the [SpinningUp docs](https://spinningup.openai.com/en/latest/spinningup/bench.html) state that "SpinningUp implementations of DDPG, TD3, and SAC are roughly at-parity with the best-reported results for these algorithms", so we think the lack of a comparison with OpenAI Baselines is okay.
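To make the bootstrapping point in note 1 concrete, here is a minimal, self-contained sketch (illustrative only, not Tianshou's actual implementation; the function and argument names are invented) of how a 1-step TD target can keep bootstrapping when an episode is cut off by the time limit instead of ending for real:

```python
# Illustrative sketch: a time-limit truncation is not a true terminal state,
# so the target still bootstraps from the critic's estimate of the next state.
def td_target(reward, next_value, terminal, truncated, gamma=0.99):
    bootstrap = (not terminal) or truncated  # keep next_value unless truly terminal
    return reward + gamma * next_value * float(bootstrap)

print(td_target(1.0, 5.0, terminal=True, truncated=False))  # 1.0 (real episode end)
print(td_target(1.0, 5.0, terminal=True, truncated=True))   # 5.95 (time-limit cut-off)
```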
### DDPG

-| Environment | Tianshou | [SpinningUp (PyTorch)](https://spinningup.openai.com/en/latest/spinningup/bench.html) | [TD3 paper (DDPG)](https://arxiv.org/abs/1802.09477) | [TD3 paper (OurDDPG)](https://arxiv.org/abs/1802.09477) |
+| Environment | Tianshou (1M) | [SpinningUp (PyTorch)](https://spinningup.openai.com/en/latest/spinningup/bench.html) | [TD3 paper (DDPG)](https://arxiv.org/abs/1802.09477) | [TD3 paper (OurDDPG)](https://arxiv.org/abs/1802.09477) |
| :--------------------: | :---------------: | :----------------------------------------------------------: | :--------------------------------------------------: | :-----------------------------------------------------: |
| Ant | 990.4±4.3 | ~840 | **1005.3** | 888.8 |
| HalfCheetah | **11718.7±465.6** | ~11000 | 3305.6 | 8577.3 |
@@ -68,126 +68,169 @@ By comparison to both classic literature and open source implementations (e.g.,
| InvertedPendulum | **1000.0±0.0** | N | **1000.0** | **1000.0** |
| InvertedDoublePendulum | 8364.3±2778.9 | N | **9355.5** | 8370.0 |

-\* details[[5]](#footnote5)[[6]](#footnote6)[[7]](#footnote7)
+\* details[[4]](#footnote4)[[5]](#footnote5)[[6]](#footnote6)

### TD3

-| Environment | Tianshou | [SpinningUp (Pytorch)](https://spinningup.openai.com/en/latest/spinningup/bench.html) | [TD3 paper](https://arxiv.org/abs/1802.09477) |
-| :--------------------: | :---------------: | :-------------------: | :--------------: |
-| Ant | **5116.4±799.9** | ~3800 | 4372.4±1000.3 |
-| HalfCheetah | **10201.2±772.8** | ~9750 | 9637.0±859.1 |
-| Hopper | 3472.2±116.8 | ~2860 | **3564.1±114.7** |
-| Walker2d | 3982.4±274.5 | ~4000 | **4682.8±539.6** |
-| Swimmer | **104.2±34.2** | ~78 | N |
-| Humanoid | **5189.5±178.5** | N | N |
-| Reacher | **-2.7±0.2** | N | -3.6±0.6 |
-| InvertedPendulum | **1000.0±0.0** | N | **1000.0±0.0** |
-| InvertedDoublePendulum | **9349.2±14.3** | N | **9337.5±15.0** |
+| Environment | Tianshou (1M) | [SpinningUp (Pytorch)](https://spinningup.openai.com/en/latest/spinningup/bench.html) | [TD3 paper](https://arxiv.org/abs/1802.09477) |
+| :--------------------: | :---------------: | :----------------------------------------------------------: | :-------------------------------------------: |
+| Ant | **5116.4±799.9** | ~3800 | 4372.4±1000.3 |
+| HalfCheetah | **10201.2±772.8** | ~9750 | 9637.0±859.1 |
+| Hopper | 3472.2±116.8 | ~2860 | **3564.1±114.7** |
+| Walker2d | 3982.4±274.5 | ~4000 | **4682.8±539.6** |
+| Swimmer | **104.2±34.2** | ~78 | N |
+| Humanoid | **5189.5±178.5** | N | N |
+| Reacher | **-2.7±0.2** | N | -3.6±0.6 |
+| InvertedPendulum | **1000.0±0.0** | N | **1000.0±0.0** |
+| InvertedDoublePendulum | **9349.2±14.3** | N | **9337.5±15.0** |
+
+\* details[[4]](#footnote4)[[5]](#footnote5)[[6]](#footnote6)

-\* details[[5]](#footnote5)[[6]](#footnote6)[[7]](#footnote7)

+#### Hints for TD3
+1. TD3's learning rate is set to 3e-4 while it is 1e-3 for DDPG/SAC. However, there is NOT enough evidence to support our choice of such hyperparameters (we simply follow SpinningUp's choice), so you can try playing with those hyperparameters to see if you can improve performance (a minimal sketch follows below). Do tell us if you can!
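If you want to experiment with the learning rates mentioned in the hint above, a minimal PyTorch sketch is shown below (generic toy networks, not the benchmark script; the layer sizes and variable names are placeholders):

```python
import torch
from torch import nn

# Toy actor/critic MLPs standing in for the real TD3 networks.
actor = nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 6), nn.Tanh())
critic1 = nn.Sequential(nn.Linear(17 + 6, 256), nn.ReLU(), nn.Linear(256, 1))
critic2 = nn.Sequential(nn.Linear(17 + 6, 256), nn.ReLU(), nn.Linear(256, 1))

# 3e-4 mirrors the SpinningUp-style default mentioned above; try 1e-3
# (the DDPG/SAC value) to see whether it changes performance.
actor_optim = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_optim = torch.optim.Adam(
    list(critic1.parameters()) + list(critic2.parameters()), lr=3e-4)
```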
### SAC

-| Environment | Tianshou | [SpinningUp (Pytorch)](https://spinningup.openai.com/en/latest/spinningup/bench.html) | [SAC paper](https://arxiv.org/abs/1801.01290) |
-| :--------------------: | :----------------: | :-------------------: | :---------: |
-| Ant | **5850.2±475.7** | ~3980 | ~3720 |
-| HalfCheetah | **12138.8±1049.3** | ~11520 | ~10400 |
-| Hopper | **3542.2±51.5** | ~3150 | ~3370 |
-| Walker2d | **5007.0±251.5** | ~4250 | ~3740 |
-| Swimmer | **44.4±0.5** | ~41.7 | N |
-| Humanoid | **5488.5±81.2** | N | ~5200 |
-| Reacher | **-2.6±0.2** | N | N |
-| InvertedPendulum | **1000.0±0.0** | N | N |
-| InvertedDoublePendulum | **9359.5±0.4** | N | N |
+| Environment | Tianshou (1M) | [SpinningUp (Pytorch)](https://spinningup.openai.com/en/latest/spinningup/bench.html) | [SAC paper](https://arxiv.org/abs/1801.01290) |
+| :--------------------: | :----------------: | :----------------------------------------------------------: | :-------------------------------------------: |
+| Ant | **5850.2±475.7** | ~3980 | ~3720 |
+| HalfCheetah | **12138.8±1049.3** | ~11520 | ~10400 |
+| Hopper | **3542.2±51.5** | ~3150 | ~3370 |
+| Walker2d | **5007.0±251.5** | ~4250 | ~3740 |
+| Swimmer | **44.4±0.5** | ~41.7 | N |
+| Humanoid | **5488.5±81.2** | N | ~5200 |
+| Reacher | **-2.6±0.2** | N | N |
+| InvertedPendulum | **1000.0±0.0** | N | N |
+| InvertedDoublePendulum | **9359.5±0.4** | N | N |

-\* details[[5]](#footnote5)[[6]](#footnote6)
+\* details[[4]](#footnote4)[[5]](#footnote5)

#### Hints for SAC
-
-0. DO NOT share the same network with two critic networks.
-1. The sigma (of the Gaussian policy) should be conditioned on input.
-2. The network size should not be less than 256.
-3. The deterministic evaluation helps a lot :)
+1. SAC's start-timesteps is set to 10000 by default while it is 25000 in DDPG/TD3. However, there is NOT enough evidence to support our choice of such hyperparameters (we simply follow SpinningUp's choice), so you can try playing with those hyperparameters to see if you can improve performance. Do tell us if you can!
+2. DO NOT share the same network between the two critic networks (see the sketch after the on-policy notes below).
+3. The sigma (of the Gaussian policy) should be conditioned on the input.
+4. The deterministic evaluation helps a lot :)

## Onpolicy Algorithms
+#### Notes
+1. In A2C and PPO, unless otherwise stated, most hyperparameters are consistent with those used for the benchmark in [ikostrikov/pytorch-a2c-ppo-acktr-gail](https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail).
+2. Generally speaking, by comparison to both classic literature and open source implementations (e.g., OpenAI Baselines)[[1]](#footnote1)[[2]](#footnote2), Tianshou's implementations of REINFORCE, A2C, and PPO are better than the best reported results for these algorithms, so you can definitely use Tianshou's benchmark for research purposes.
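Looking back at SAC hints 2-3 above, here is a minimal plain-PyTorch sketch (illustrative only; it does not use Tianshou's actor/critic classes, and the layer sizes are placeholders) of two fully independent critics and a Gaussian policy whose sigma is produced from the observation:

```python
import torch
from torch import nn

class GaussianActor(nn.Module):
    """Policy head whose mu and sigma are both conditioned on the observation."""
    def __init__(self, obs_dim=17, act_dim=6, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_sigma = nn.Linear(hidden, act_dim)  # state-dependent sigma

    def forward(self, obs):
        h = self.body(obs)
        return self.mu(h), self.log_sigma(h).clamp(-20, 2).exp()

def make_critic(obs_dim=17, act_dim=6, hidden=256):
    # Each critic is its own network; the two critics share no parameters.
    return nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, 1))

actor = GaussianActor()
critic1, critic2 = make_critic(), make_critic()
mu, sigma = actor(torch.zeros(1, 17))
print(mu.shape, sigma.shape)  # torch.Size([1, 6]) torch.Size([1, 6])
```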
+ ### REINFORCE -| Environment | Tianshou(10M steps) | -| :--------------------: | :-----------------: | -| Ant | **1108.1±323.1** | -| HalfCheetah | **1138.8±104.7** | -| Hopper | **416.0±104.7** | -| Walker2d | **440.9±148.2** | -| Swimmer | **35.6±2.6** | -| Humanoid | **464.3±58.4** | -| Reacher | **-5.5±0.2** | -| InvertedPendulum | **1000.0±0.0** | -| InvertedDoublePendulum | **7726.2±1287.3** | - - -| Environment | Tianshou(3M steps) | [SpinningUp (VPG Pytorch)](https://spinningup.openai.com/en/latest/spinningup/bench_vpg.html)[[10]](#footnote10) | -| :--------------------: | :--------------------------: | :------------------------: | -| Ant | **474.9+-133.5** | ~5 | -| HalfCheetah | **884.0+-41.0** | ~600 | -| Hopper | 395.8+-64.5* | **~800** | -| Walker2d | 412.0+-52.4 | **~460** | -| Swimmer | 35.3+-1.4 | **~51** | -| Humanoid | **438.2+-47.8** | N | -| Reacher | **-10.5+-0.7** | N | -| InvertedPendulum | **999.2+-2.4** | N | -| InvertedDoublePendulum | **1059.7+-307.7** | N | - -\* details[[5]](#footnote5)[[6]](#footnote6) +| Environment | Tianshou (10M) | +| :--------------------: | :---------------: | +| Ant | **1108.1±323.1** | +| HalfCheetah | **1138.8±104.7** | +| Hopper | **416.0±104.7** | +| Walker2d | **440.9±148.2** | +| Swimmer | **35.6±2.6** | +| Humanoid | **464.3±58.4** | +| Reacher | **-5.5±0.2** | +| InvertedPendulum | **1000.0±0.0** | +| InvertedDoublePendulum | **7726.2±1287.3** | + + +| Environment | Tianshou (3M) | [SpinningUp (VPG Pytorch)](https://spinningup.openai.com/en/latest/spinningup/bench_vpg.html)[[7]](#footnote7) | +| :--------------------: | :---------------: | :----------------------------------------------------------: | +| Ant | **474.9+-133.5** | ~5 | +| HalfCheetah | **884.0+-41.0** | ~600 | +| Hopper | 395.8+-64.5* | **~800** | +| Walker2d | 412.0+-52.4 | **~460** | +| Swimmer | 35.3+-1.4 | **~51** | +| Humanoid | **438.2+-47.8** | N | +| Reacher | **-10.5+-0.7** | N | +| InvertedPendulum | **999.2+-2.4** | N | +| InvertedDoublePendulum | **1059.7+-307.7** | N | + +\* details[[4]](#footnote4)[[5]](#footnote5) #### Hints for REINFORCE -0. Following [Andrychowicz, Marcin, et al](https://arxiv.org/abs/2006.05990), we downscale last layer of policy network by a factor of 0.01 after orthogonal initialization. -1. We choose "tanh" function to squash sampled action from range (-inf, inf) to (-1, 1) rather than usually used clipping method (As in StableBaselines3). We did full scale ablation studies and results show that tanh squashing performs a tiny little bit better than clipping overall, and is much better than no action bounding. However, "clip" method is still a very good method, considering its simplicity. -2. We use global observation normalization and global rew-to-go (value) normalization by default. Both are crucial to good performances of REINFORCE algorithm. Since we minus mean when doing rew-to-go normalization, you can treat global mean of rew-to-go as a naive version of "baseline". -3. Since we do not have a value estimator, we use global rew-to-go mean to bootstrap truncated steps because of timelimit and unfinished collecting, while most other implementations use 0. We feel this would help because mean is more likely a better estimate than 0 (no ablation study has been done). -4. We have done full scale ablation study on learning rate and lr decay strategy. We experiment with lr of 3e-4, 5e-4, 1e-3, each have 2 options: no lr decay or linear decay to 0. 
Experiments show that 3e-4 learning rate will cause slowly learning and make agent step in local optima easily for certain environments like InvertedDoublePendulum, Ant, HalfCheetah, and 1e-3 lr helps a lot. However, after training agents with lr 1e-3 for 5M steps or so, agents in certain environments like InvertedPendulum will become unstable. Conclusion is that we should start with a large learning rate and linearly decay it, but for a small initial learning rate or if you only train agents for limited timesteps, DO NOT decay it.
-5. We didn't tune `step-per-collect` option and `training-num` option. Default values are finetuned with PPO algorithm so we assume they are also good for REINFORCE. You can play with them if you want, but remember that `buffer-size` should always be larger than `step-per-collect`, and if `step-per-collect` is too small and `training-num` too large, episodes will be truncated and bootstrapped very often, which will harm performances. If `training-num` is too small (e.g., less than 8), speed will go down.
-6. Sigma of action is not fixed (normally seen in other implementation) or conditioned on observation, but is an independent parameter which can be updated by gradient descent. We choose this setting because it works well in PPO, and is recommended by [Andrychowicz, Marcin, et al](https://arxiv.org/abs/2006.05990). See Fig. 23.
+1. Following [Andrychowicz, Marcin, et al](https://arxiv.org/abs/2006.05990), we downscale the last layer of the policy network by a factor of 0.01 after orthogonal initialization.
+2. We choose the "tanh" function to squash sampled actions from (-inf, inf) to (-1, 1) rather than the commonly used clipping method (as in Stable-Baselines3). We did full-scale ablation studies, and the results show that tanh squashing performs slightly better than clipping overall and is much better than no action bounding. However, "clip" is still a very good method, considering its simplicity.
+3. We use global observation normalization and global rew-to-go (value) normalization by default. Both are crucial to the good performance of the REINFORCE algorithm. Since we subtract the mean when normalizing rew-to-go, you can treat the global mean of rew-to-go as a naive version of a "baseline".
+4. Since we do not have a value estimator, we use the global rew-to-go mean to bootstrap steps truncated by the time limit or by unfinished collecting, while most other implementations use 0. We feel this helps because the mean is likely a better estimate than 0 (no ablation study has been done).
+5. We have done a full-scale ablation study on the learning rate and lr decay strategy. We experimented with lr of 3e-4, 5e-4, and 1e-3, each with 2 options: no lr decay or linear decay to 0. Experiments show that a 3e-4 learning rate causes slow learning and makes the agent get stuck in local optima easily for certain environments like InvertedDoublePendulum, Ant, and HalfCheetah, while a 1e-3 lr helps a lot. However, after training agents with lr 1e-3 for 5M steps or so, agents in certain environments like InvertedPendulum become unstable. The conclusion is that we should start with a large learning rate and linearly decay it, but for a small initial learning rate, or if you only train agents for limited timesteps, DO NOT decay it.
+6. We didn't tune the `step-per-collect` and `training-num` options. The default values are fine-tuned for the PPO algorithm, so we assume they are also good for REINFORCE.
You can play with them if you want, but remember that `buffer-size` should always be larger than `step-per-collect`, and that if `step-per-collect` is too small and `training-num` too large, episodes will be truncated and bootstrapped very often, which will harm performance. If `training-num` is too small (e.g., less than 8), speed will go down.
+7. The sigma of the action distribution is neither fixed (as normally seen in other implementations) nor conditioned on the observation, but is an independent parameter updated by gradient descent. We choose this setting because it works well in PPO, and it is recommended by [Andrychowicz, Marcin, et al](https://arxiv.org/abs/2006.05990). See Fig. 23.

### A2C

-| Environment | Tianshou(3M steps) | [Spinning Up(Pytorch)](https://spinningup.openai.com/en/latest/spinningup/bench_vpg.html)|
-| :--------------------: | :----------------: | :--------------------: |
-| Ant | **5236.8+-236.7** | ~5 |
-| HalfCheetah | **2377.3+-1363.7** | ~600 |
-| Hopper | **1608.6+-529.5** | ~800 |
-| Walker2d | **1805.4+-1055.9** | ~460 |
-| Swimmer | 40.2+-1.8 | **~51** |
-| Humanoid | **5316.6+-554.8** | N |
-| Reacher | **-5.2+-0.5** | N |
-| InvertedPendulum | **1000.0+-0.0** | N |
-| InvertedDoublePendulum | **9351.3+-12.8** | N |
-
-| Environment | Tianshou | [PPO paper](https://arxiv.org/abs/1707.06347) A2C | [PPO paper](https://arxiv.org/abs/1707.06347) A2C + Trust Region |
-| :--------------------: | :----------------: | :-------------: | :-------------: |
-| Ant | **3485.4+-433.1** | N | N |
-| HalfCheetah | **1829.9+-1068.3** | ~1000 | ~930 |
-| Hopper | **1253.2+-458.0** | ~900 | ~1220 |
-| Walker2d | **1091.6+-709.2** | ~850 | ~700 |
-| Swimmer | **36.6+-2.1** | ~31 | **~36** |
-| Humanoid | **1726.0+-1070.1** | N | N |
-| Reacher | **-6.7+-2.3** | ~-24 | ~-27 |
-| InvertedPendulum | **1000.0+-0.0** | **~1000** | **~1000** |
-| InvertedDoublePendulum | **9257.7+-277.4** | ~7100 | ~8100 |
-\* details[[5]](#footnote5)[[6]](#footnote6)
+| Environment | Tianshou (3M) | [Spinning Up(Pytorch)](https://spinningup.openai.com/en/latest/spinningup/bench_vpg.html) |
+| :--------------------: | :----------------: | :----------------------------------------------------------: |
+| Ant | **5236.8+-236.7** | ~5 |
+| HalfCheetah | **2377.3+-1363.7** | ~600 |
+| Hopper | **1608.6+-529.5** | ~800 |
+| Walker2d | **1805.4+-1055.9** | ~460 |
+| Swimmer | 40.2+-1.8 | **~51** |
+| Humanoid | **5316.6+-554.8** | N |
+| Reacher | **-5.2+-0.5** | N |
+| InvertedPendulum | **1000.0+-0.0** | N |
+| InvertedDoublePendulum | **9351.3+-12.8** | N |
+
+| Environment | Tianshou (1M) | [PPO paper](https://arxiv.org/abs/1707.06347) A2C | [PPO paper](https://arxiv.org/abs/1707.06347) A2C + Trust Region |
+| :--------------------: | :----------------: | :-----------------------------------------------: | :----------------------------------------------------------: |
+| Ant | **3485.4+-433.1** | N | N |
+| HalfCheetah | **1829.9+-1068.3** | ~1000 | ~930 |
+| Hopper | **1253.2+-458.0** | ~900 | ~1220 |
+| Walker2d | **1091.6+-709.2** | ~850 | ~700 |
+| Swimmer | **36.6+-2.1** | ~31 | **~36** |
+| Humanoid | **1726.0+-1070.1** | N | N |
+| Reacher | **-6.7+-2.3** | ~-24 | ~-27 |
+| InvertedPendulum | **1000.0+-0.0** | **~1000** | **~1000** |
+| InvertedDoublePendulum | **9257.7+-277.4** | ~7100 | ~8100 |
+
+\* details[[4]](#footnote4)[[5]](#footnote5)

#### Hints for A2C
-0. We choose `clip` action method in A2C instead `tanh` option as used in REINFORCE simply to be consistent with original implementation.
`tanh` may be better or equally well but we didn't try.
-1. (Initial) learning rate, lr decay, and `step-per-collect`, `training-num` affect the performance of A2C to a great extend. These 4 hyperparameters also affect each other and should be tuned together. We have done full scale ablation studies on these 4 hyperparameters (more than 800 agents trained), below are our findings.
-2. `step-per-collect`/`training-num` = `bootstrap-lenghth`, which is max length of an "episode" used in GAE estimator, 80/16=5 in default settings. When `bootstrap-lenghth` is small, (maybe) because GAE can at most looks forward 5 steps, and use bootstrap strategy very often, the critic is less well-trained, so they actor cannot converge to very high scores. However, if we increase `step-per-collect` to increase `bootstrap-lenghth` (e.g. 256/16=16), actor/critic will be updated less often, so sample efficiency is low, which will make training process slow. To conclude, If you don't restrict env timesteps, you can try to use larger `bootstrap-lenghth`, and train for more steps, which perhaps will give you better converged scores. Train slower, achieve higher.
-3. 7e-4 learning rate with decay strategy if proper for `step-per-collect=80`, `training-num=16`, but if you use larger `step-per-collect`(e.g. 256 - 2048), 7e-4 `lr` is a little bit small, because now you have more data and less noise for each update, and will be more confidence if taking larger steps; so higher learning rate(e.g. 1e-3) is more appropriate and usually boost performance in this setting. If plotting results arises fast in early stages and become unstable later, consider lr decay before decreasing lr.
-4. `max-grad-norm` doesn't really help in our experiments, we simply keep it for consistency with other open-source implementations (e.g. SB3).
-5. We original paper of A3C use RMSprop optimizer, we find that Adam with the same learning rate works equally well. We use RMSprop anyway. Again, for consistency.
-6. We notice that in SB3's implementation of A2C that set `gae-lambda` to 1 by default, we don't know why and after doing some experiments, results show 0.95 is better overall.
-7. We find out that `step-per-collect=256`, `training-num=8` are also good hyperparameters. You can have a try.
+1. We choose the `clip` action method in A2C instead of the `tanh` option used in REINFORCE simply to be consistent with the original implementation. `tanh` may work better or equally well, but we didn't try it.
+2. (Initial) learning rate, lr decay, `step-per-collect` and `training-num` affect the performance of A2C to a great extent. These 4 hyperparameters also affect each other and should be tuned together. We have done full-scale ablation studies on these 4 hyperparameters (more than 800 agents have been trained). Below are our findings.
+3. `step-per-collect` / `training-num` equals `bootstrap-length`, the max length of an "episode" used in the GAE estimator; it is 80/16=5 in the default settings. When `bootstrap-length` is small, (maybe) because GAE can look forward at most 5 steps and has to bootstrap very often, the critic is less well-trained, so the actor cannot converge to very high scores. However, if we increase `step-per-collect` to increase `bootstrap-length` (e.g. 256/16=16), the actor/critic will be updated less often, resulting in low sample efficiency and a slow training process. To conclude, if you don't restrict env timesteps, you can try using a larger `bootstrap-length` and train for more steps to get a better converged score.
Train slower, achieve higher.
+4. The learning rate 7e-4 with the decay strategy is appropriate for `step-per-collect=80` and `training-num=16`. But if you use a larger `step-per-collect` (e.g. 256 - 2048), 7e-4 is a little bit small for `lr` because each update will have more data, less noise and thus a smaller deviation. So it is more appropriate to use a higher learning rate (e.g. 1e-3) to boost performance in this setting. If the plotted results rise fast in the early stages but become unstable later, consider lr decay first before decreasing lr.
+5. `max-grad-norm` didn't really help in our experiments. We simply keep it for consistency with other open-source implementations (e.g. SB3).
+6. Although the original A3C paper uses the RMSprop optimizer, we found that Adam with the same learning rate works equally well. We use RMSprop anyway. Again, for consistency.
+7. We noticed that the implementation of A2C in SB3 sets `gae-lambda` to 1 by default without explanation; after some experiments, our results show that 0.95 is better overall.
+8. We found out that `step-per-collect=256` and `training-num=8` are also good settings. You can have a try.
+
+### PPO
+
+| Environment | Tianshou (1M) | [ikostrikov/pytorch-a2c-ppo-acktr-gail](https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail) | [PPO paper](https://arxiv.org/pdf/1707.06347.pdf) | [baselines](http://htmlpreview.github.io/?https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm) | [spinningup(pytorch)](https://spinningup.openai.com/en/latest/spinningup/bench_ppo.html) |
+| :--------------------: | :----------------: | :----------------------------------------------------------: | :-----------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: |
+| Ant | **3258.4+-1079.3** | N | N | N | ~650 |
+| HalfCheetah | **5783.9+-1244.0** | ~3120 | ~1800 | ~1700 | ~1670 |
+| Hopper | **2609.3+-700.8** | ~2300 | ~2330 | ~2400 | ~1850 |
+| Walker2d | 3588.5+-756.6 | **~4000** | ~3460 | ~3510 | ~1230 |
+| Swimmer | 66.7+-99.1 | N | ~108 | ~111 | **~120** |
+| Humanoid | **787.1+-193.5** | N | N | N | N |
+| Reacher | **-4.1+-0.3** | ~-5 | ~-7 | ~-6 | N |
+| InvertedPendulum | **1000.0+-0.0** | N | **~1000** | ~940 | N |
+| InvertedDoublePendulum | **9231.3+-270.4** | N | ~8000 | ~7350 | N |
+
+| Environment | Tianshou (3M) | [Spinning Up(Pytorch)](https://spinningup.openai.com/en/latest/spinningup/bench_ppo.html) |
+| :--------------------: | :----------------: | :----------------------------------------------------------: |
+| Ant | **4079.3+-880.2** | ~3000 |
+| HalfCheetah | **7337.4+-1508.2** | ~3130 |
+| Hopper | **3127.7+-413.0** | ~2460 |
+| Walker2d | **4895.6+-704.3** | ~2600 |
+| Swimmer | 81.4+-96.0 | **~120** |
+| Humanoid | **1359.7+-572.7** | N |
+| Reacher | **-3.7+-0.3** | N |
+| InvertedPendulum | **1000.0+-0.0** | N |
+| InvertedDoublePendulum | **9231.3+-270.4** | N |
+
+\* details[[4]](#footnote4)[[5]](#footnote5)
+
+#### Hints for PPO
+1. Following [Andrychowicz, Marcin, et al](https://arxiv.org/abs/2006.05990) Sec 3.5, we use the "recompute advantage" strategy, which contributes a lot to our SOTA benchmark. However, I personally don't quite agree with their explanation of why "recompute advantage" helps.
They stated that it's because the old strategy "makes it impossible to compute advantages as the temporal structure is broken", but PPO's update equation is designed to learn from slightly outdated advantages. I think the only reason "recompute advantage" works is that it updates the critic several times rather than just once per update, which leads to a better value function estimate.
+2. We have done full-scale ablation studies of the PPO algorithm's hyperparameters. Here are our findings: in mujoco settings, `value-clip` and `norm-adv` may help a little bit in some games (e.g. `norm-adv` helps stabilize training in InvertedPendulum-v2), but they make no difference to overall performance, so in our benchmark we do not use such tricks. We validate that setting `ent-coef` to 0.0 rather than 0.01 increases overall performance in mujoco environments. `max-grad-norm` still offers no help for the PPO algorithm, but we keep it for consistency.
+3. [Andrychowicz, Marcin, et al](https://arxiv.org/abs/2006.05990)'s work indicates that using `gae-lambda` 0.9 and changing the policy network's width based on which game you play (e.g. use [16, 16] `hidden-sizes` for the `actor` network in HalfCheetah and [256, 256] for Ant) may help boost performance. Our ablation studies say otherwise: both options may lead to equal or lower performance overall in our experiments. We are not confident about this claim because we didn't change the learning rate and other possibly correlated factors in our experiments. So if you want, you can still have a try.
+4. `batch-size` 128 and 64 (default) work equally well. Changing `training-num` alone slightly (maybe in the range [8, 128]) won't affect performance. For the action bounding method, both `clip` and `tanh` work quite well.
+5. In OpenAI's implementation of PPO, the value loss is multiplied by a factor of 0.5 for no good reason (see this [issue](https://github.com/openai/baselines/issues/445#issuecomment-777988738)). We do not do so, and therefore set our `vf-coef` to 0.25 (half of the standard 0.5). However, since the value loss is only used to optimize the `critic` network, a different `vf-coef` should in theory make no difference when using the Adam optimizer.
+
+
## Note

@@ -197,16 +240,10 @@ By comparison to both classic literature and open source implementations (e.g.,
[3] We used the latest version of all mujoco environments in gym (0.17.3 with mujoco==2.0.2.13), but it's not often the case with other benchmarks. Please check for details yourself in the original paper. (Different version's outcomes are usually similar, though)

-[4] We didn't compare offpolicy algorithms to OpenAI baselines [benchmark](https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm), because for now it seems that they haven't provided benchmark for offpolicy algorithms, but in [SpinningUp docs](https://spinningup.openai.com/en/latest/spinningup/bench.html) they stated that "SpinningUp implementations of DDPG, TD3, and SAC are roughly at-parity with the best-reported results for these algorithms", so we think lack of comparisons with OpenAI baselines is okay.
-
-[5] ~ means the number is approximated from the graph because accurate numbers is not provided in the paper. N means graphs not provided.
-
-[6] Reward metric: The meaning of the table value is the max average return over 10 trails (different seeds) ± a single standard deviation over trails. Each trial is averaged on another 10 test seeds. Only the first 1M steps data will be considered, if not otherwise stated.
The shaded region on the graph also represents a single standard deviation. It is the same as [TD3 evaluation method](https://github.com/sfujim/TD3/issues/34).
-
-[7] In TD3 paper, shaded region represents only half of standard deviation.
+[4] ~ means the number is approximated from the graph because accurate numbers are not provided in the paper. N means graphs are not provided.
-[8] SAC's start-timesteps is set to 10000 by default while it is 25000 is DDPG/TD3. TD3's learning rate is set to 3e-4 while it is 1e-3 for DDPG/SAC. However, there is NO enough evidence to support our choice of such hyperparameters (we simply choose them because of SpinningUp) and you can try playing with those hyperparameters to see if you can improve performance. Do tell us if you can!
+[5] Reward metric: The meaning of the table value is the max average return over 10 trials (different seeds) ± a single standard deviation over trials. Each trial is averaged over another 10 test seeds. Only the first 1M steps of data are considered, if not otherwise stated. The shaded region on the graph also represents a single standard deviation. It is the same as the [TD3 evaluation method](https://github.com/sfujim/TD3/issues/34). (A small sketch of this computation follows the footnotes.)
-[9] We use batchsize of 256 in DDPG/TD3/SAC while SpinningUp use 100. Minor difference also lies with `start-timesteps`, data loop method `step_per_collect`, method to deal with/bootstrap truncated steps because of timelimit and unfinished/collecting episodes (contribute to performance improvement), etc.
+[6] In the TD3 paper, the shaded region represents only half of the standard deviation.
-[10] Comparing Tianshou's REINFORCE algorithm with SpinningUp's VPG is quite unfair because SpinningUp's VPG uses a generative advantage estimator (GAE) which requires a dnn value predictor (critic network), which makes so called "VPG" more like A2C (advantage actor critic) algorithm. Even so, you can see that we are roughly at-parity with each other even if tianshou's REINFORCE do not use a critic or GAE.
+[7] Comparing Tianshou's REINFORCE algorithm with SpinningUp's VPG is quite unfair because SpinningUp's VPG uses a generalized advantage estimator (GAE), which requires a neural network value predictor (critic network) and makes the so-called "VPG" more like an A2C (advantage actor-critic) algorithm. Even so, you can see that we are roughly at-parity with each other even though Tianshou's REINFORCE does not use a critic or GAE.
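As a reading aid for footnote [5], here is a small NumPy sketch (illustrative only; the array shape and numbers are made up) of how the reported "max average return ± std" values can be computed from raw evaluation returns:

```python
import numpy as np

# returns[i, j] = average test return (already averaged over 10 test seeds) of
# trial i (one training seed) at evaluation checkpoint j within the first 1M steps.
returns = np.random.default_rng(0).normal(3000, 300, size=(10, 20))  # fake data

best_per_trial = returns.max(axis=1)   # max average return of each trial
mean = best_per_trial.mean()           # the number reported in the tables
std = best_per_trial.std()             # the "±" value: one std over the 10 trials
print(f"{mean:.1f}±{std:.1f}")
```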
diff --git a/examples/mujoco/benchmark/Ant-v3/ppo/figure.png b/examples/mujoco/benchmark/Ant-v3/ppo/figure.png new file mode 100644 index 000000000..45002b20c Binary files /dev/null and b/examples/mujoco/benchmark/Ant-v3/ppo/figure.png differ diff --git a/examples/mujoco/benchmark/HalfCheetah-v3/ppo/figure.png b/examples/mujoco/benchmark/HalfCheetah-v3/ppo/figure.png new file mode 100644 index 000000000..1135ee16d Binary files /dev/null and b/examples/mujoco/benchmark/HalfCheetah-v3/ppo/figure.png differ diff --git a/examples/mujoco/benchmark/Hopper-v3/ppo/figure.png b/examples/mujoco/benchmark/Hopper-v3/ppo/figure.png new file mode 100644 index 000000000..ac882be67 Binary files /dev/null and b/examples/mujoco/benchmark/Hopper-v3/ppo/figure.png differ diff --git a/examples/mujoco/benchmark/Humanoid-v3/ppo/figure.png b/examples/mujoco/benchmark/Humanoid-v3/ppo/figure.png new file mode 100644 index 000000000..d1aa4e335 Binary files /dev/null and b/examples/mujoco/benchmark/Humanoid-v3/ppo/figure.png differ diff --git a/examples/mujoco/benchmark/InvertedDoublePendulum-v2/ppo/figure.png b/examples/mujoco/benchmark/InvertedDoublePendulum-v2/ppo/figure.png new file mode 100644 index 000000000..b9ad873d7 Binary files /dev/null and b/examples/mujoco/benchmark/InvertedDoublePendulum-v2/ppo/figure.png differ diff --git a/examples/mujoco/benchmark/InvertedPendulum-v2/ppo/figure.png b/examples/mujoco/benchmark/InvertedPendulum-v2/ppo/figure.png new file mode 100644 index 000000000..ed945acf1 Binary files /dev/null and b/examples/mujoco/benchmark/InvertedPendulum-v2/ppo/figure.png differ diff --git a/examples/mujoco/benchmark/Reacher-v2/ppo/figure.png b/examples/mujoco/benchmark/Reacher-v2/ppo/figure.png new file mode 100644 index 000000000..c4480c876 Binary files /dev/null and b/examples/mujoco/benchmark/Reacher-v2/ppo/figure.png differ diff --git a/examples/mujoco/benchmark/Swimmer-v3/ppo/figure.png b/examples/mujoco/benchmark/Swimmer-v3/ppo/figure.png new file mode 100644 index 000000000..a61e5cf3f Binary files /dev/null and b/examples/mujoco/benchmark/Swimmer-v3/ppo/figure.png differ diff --git a/examples/mujoco/benchmark/Walker2d-v3/ppo/figure.png b/examples/mujoco/benchmark/Walker2d-v3/ppo/figure.png new file mode 100644 index 000000000..c6fdd5f9b Binary files /dev/null and b/examples/mujoco/benchmark/Walker2d-v3/ppo/figure.png differ diff --git a/examples/mujoco/mujoco_a2c.py b/examples/mujoco/mujoco_a2c.py index bdfa3cb78..f4f30a550 100755 --- a/examples/mujoco/mujoco_a2c.py +++ b/examples/mujoco/mujoco_a2c.py @@ -3,6 +3,7 @@ import os import gym import torch +import pprint import datetime import argparse import numpy as np @@ -36,12 +37,6 @@ def get_args(): parser.add_argument('--batch-size', type=int, default=99999) parser.add_argument('--training-num', type=int, default=16) parser.add_argument('--test-num', type=int, default=10) - parser.add_argument('--logdir', type=str, default='log') - parser.add_argument('--render', type=float, default=0.) 
- parser.add_argument( - '--device', type=str, - default='cuda' if torch.cuda.is_available() else 'cpu') - parser.add_argument('--resume-path', type=str, default=None) # a2c special parser.add_argument('--rew-norm', type=int, default=True) parser.add_argument('--vf-coef', type=float, default=0.5) @@ -50,6 +45,14 @@ def get_args(): parser.add_argument('--bound-action-method', type=str, default="clip") parser.add_argument('--lr-decay', type=int, default=True) parser.add_argument('--max-grad-norm', type=float, default=0.5) + parser.add_argument('--logdir', type=str, default='log') + parser.add_argument('--render', type=float, default=0.) + parser.add_argument( + '--device', type=str, + default='cuda' if torch.cuda.is_available() else 'cpu') + parser.add_argument('--resume-path', type=str, default=None) + parser.add_argument('--watch', default=False, action='store_true', + help='watch the play of pre-trained policy only') return parser.parse_args() @@ -120,6 +123,11 @@ def dist(*logits): action_bound_method=args.bound_action_method, lr_scheduler=lr_scheduler, action_space=env.action_space) + # load a previous policy + if args.resume_path: + policy.load_state_dict(torch.load(args.resume_path, map_location=args.device)) + print("Loaded agent from: ", args.resume_path) + # collector if args.training_num > 1: buffer = VectorReplayBuffer(args.buffer_size, len(train_envs)) @@ -138,12 +146,14 @@ def dist(*logits): def save_fn(policy): torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth')) - # trainer - result = onpolicy_trainer( - policy, train_collector, test_collector, args.epoch, args.step_per_epoch, - args.repeat_per_collect, args.test_num, args.batch_size, - step_per_collect=args.step_per_collect, save_fn=save_fn, logger=logger, - test_in_train=False) + if not args.watch: + # trainer + result = onpolicy_trainer( + policy, train_collector, test_collector, args.epoch, args.step_per_epoch, + args.repeat_per_collect, args.test_num, args.batch_size, + step_per_collect=args.step_per_collect, save_fn=save_fn, logger=logger, + test_in_train=False) + pprint.pprint(result) # Let's watch its performance! 
policy.eval() diff --git a/examples/mujoco/mujoco_ddpg.py b/examples/mujoco/mujoco_ddpg.py index d491ee711..ebd590fd7 100755 --- a/examples/mujoco/mujoco_ddpg.py +++ b/examples/mujoco/mujoco_ddpg.py @@ -3,6 +3,7 @@ import os import gym import torch +import pprint import datetime import argparse import numpy as np @@ -44,6 +45,8 @@ def get_args(): '--device', type=str, default='cuda' if torch.cuda.is_available() else 'cpu') parser.add_argument('--resume-path', type=str, default=None) + parser.add_argument('--watch', default=False, action='store_true', + help='watch the play of pre-trained policy only') return parser.parse_args() @@ -87,11 +90,10 @@ def test_ddpg(args=get_args()): tau=args.tau, gamma=args.gamma, exploration_noise=GaussianNoise(sigma=args.exploration_noise), estimation_step=args.n_step, action_space=env.action_space) + # load a previous policy if args.resume_path: - policy.load_state_dict(torch.load( - args.resume_path, map_location=args.device - )) + policy.load_state_dict(torch.load(args.resume_path, map_location=args.device)) print("Loaded agent from: ", args.resume_path) # collector @@ -113,12 +115,14 @@ def test_ddpg(args=get_args()): def save_fn(policy): torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth')) - # trainer - result = offpolicy_trainer( - policy, train_collector, test_collector, args.epoch, - args.step_per_epoch, args.step_per_collect, args.test_num, - args.batch_size, save_fn=save_fn, logger=logger, - update_per_step=args.update_per_step, test_in_train=False) + if not args.watch: + # trainer + result = offpolicy_trainer( + policy, train_collector, test_collector, args.epoch, + args.step_per_epoch, args.step_per_collect, args.test_num, + args.batch_size, save_fn=save_fn, logger=logger, + update_per_step=args.update_per_step, test_in_train=False) + pprint.pprint(result) # Let's watch its performance! 
policy.eval() diff --git a/examples/mujoco/mujoco_ppo.py b/examples/mujoco/mujoco_ppo.py new file mode 100755 index 000000000..3974c2e63 --- /dev/null +++ b/examples/mujoco/mujoco_ppo.py @@ -0,0 +1,175 @@ +#!/usr/bin/env python3 + +import os +import gym +import torch +import pprint +import datetime +import argparse +import numpy as np +from torch import nn +from torch.optim.lr_scheduler import LambdaLR +from torch.utils.tensorboard import SummaryWriter +from torch.distributions import Independent, Normal + +from tianshou.policy import PPOPolicy +from tianshou.utils import BasicLogger +from tianshou.env import SubprocVectorEnv +from tianshou.utils.net.common import Net +from tianshou.trainer import onpolicy_trainer +from tianshou.utils.net.continuous import ActorProb, Critic +from tianshou.data import Collector, ReplayBuffer, VectorReplayBuffer + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument('--task', type=str, default='HalfCheetah-v3') + parser.add_argument('--seed', type=int, default=0) + parser.add_argument('--buffer-size', type=int, default=4096) + parser.add_argument('--hidden-sizes', type=int, nargs='*', default=[64, 64]) + parser.add_argument('--lr', type=float, default=3e-4) + parser.add_argument('--gamma', type=float, default=0.99) + parser.add_argument('--epoch', type=int, default=100) + parser.add_argument('--step-per-epoch', type=int, default=30000) + parser.add_argument('--step-per-collect', type=int, default=2048) + parser.add_argument('--repeat-per-collect', type=int, default=10) + parser.add_argument('--batch-size', type=int, default=64) + parser.add_argument('--training-num', type=int, default=64) + parser.add_argument('--test-num', type=int, default=10) + # ppo special + parser.add_argument('--rew-norm', type=int, default=True) + # In theory, `vf-coef` will not make any difference if using Adam optimizer. + parser.add_argument('--vf-coef', type=float, default=0.25) + parser.add_argument('--ent-coef', type=float, default=0.0) + parser.add_argument('--gae-lambda', type=float, default=0.95) + parser.add_argument('--bound-action-method', type=str, default="clip") + parser.add_argument('--lr-decay', type=int, default=True) + parser.add_argument('--max-grad-norm', type=float, default=0.5) + parser.add_argument('--eps-clip', type=float, default=0.2) + parser.add_argument('--dual-clip', type=float, default=None) + parser.add_argument('--value-clip', type=int, default=0) + parser.add_argument('--norm-adv', type=int, default=0) + parser.add_argument('--recompute-adv', type=int, default=1) + parser.add_argument('--logdir', type=str, default='log') + parser.add_argument('--render', type=float, default=0.) 
+ parser.add_argument( + '--device', type=str, + default='cuda' if torch.cuda.is_available() else 'cpu') + parser.add_argument('--resume-path', type=str, default=None) + parser.add_argument('--watch', default=False, action='store_true', + help='watch the play of pre-trained policy only') + return parser.parse_args() + + +def test_ppo(args=get_args()): + env = gym.make(args.task) + args.state_shape = env.observation_space.shape or env.observation_space.n + args.action_shape = env.action_space.shape or env.action_space.n + args.max_action = env.action_space.high[0] + print("Observations shape:", args.state_shape) + print("Actions shape:", args.action_shape) + print("Action range:", np.min(env.action_space.low), + np.max(env.action_space.high)) + # train_envs = gym.make(args.task) + train_envs = SubprocVectorEnv( + [lambda: gym.make(args.task) for _ in range(args.training_num)], + norm_obs=True) + # test_envs = gym.make(args.task) + test_envs = SubprocVectorEnv( + [lambda: gym.make(args.task) for _ in range(args.test_num)], + norm_obs=True, obs_rms=train_envs.obs_rms, update_obs_rms=False) + + # seed + np.random.seed(args.seed) + torch.manual_seed(args.seed) + train_envs.seed(args.seed) + test_envs.seed(args.seed) + # model + net_a = Net(args.state_shape, hidden_sizes=args.hidden_sizes, + activation=nn.Tanh, device=args.device) + actor = ActorProb(net_a, args.action_shape, max_action=args.max_action, + unbounded=True, device=args.device).to(args.device) + net_c = Net(args.state_shape, hidden_sizes=args.hidden_sizes, + activation=nn.Tanh, device=args.device) + critic = Critic(net_c, device=args.device).to(args.device) + torch.nn.init.constant_(actor.sigma_param, -0.5) + for m in list(actor.modules()) + list(critic.modules()): + if isinstance(m, torch.nn.Linear): + # orthogonal initialization + torch.nn.init.orthogonal_(m.weight, gain=np.sqrt(2)) + torch.nn.init.zeros_(m.bias) + # do last policy layer scaling, this will make initial actions have (close to) + # 0 mean and std, and will help boost performances, + # see https://arxiv.org/abs/2006.05990, Fig.24 for details + for m in actor.mu.modules(): + if isinstance(m, torch.nn.Linear): + torch.nn.init.zeros_(m.bias) + m.weight.data.copy_(0.01 * m.weight.data) + + optim = torch.optim.Adam(set( + actor.parameters()).union(critic.parameters()), lr=args.lr) + + lr_scheduler = None + if args.lr_decay: + # decay learning rate to 0 linearly + max_update_num = np.ceil( + args.step_per_epoch / args.step_per_collect) * args.epoch + + lr_scheduler = LambdaLR( + optim, lr_lambda=lambda epoch: 1 - epoch / max_update_num) + + def dist(*logits): + return Independent(Normal(*logits), 1) + + policy = PPOPolicy(actor, critic, optim, dist, discount_factor=args.gamma, + gae_lambda=args.gae_lambda, max_grad_norm=args.max_grad_norm, + vf_coef=args.vf_coef, ent_coef=args.ent_coef, + reward_normalization=args.rew_norm, action_scaling=True, + action_bound_method=args.bound_action_method, + lr_scheduler=lr_scheduler, action_space=env.action_space, + eps_clip=args.eps_clip, value_clip=args.value_clip, + dual_clip=args.dual_clip, advantage_normalization=args.norm_adv, + recompute_advantage=args.recompute_adv) + + # load a previous policy + if args.resume_path: + policy.load_state_dict(torch.load(args.resume_path, map_location=args.device)) + print("Loaded agent from: ", args.resume_path) + + # collector + if args.training_num > 1: + buffer = VectorReplayBuffer(args.buffer_size, len(train_envs)) + else: + buffer = ReplayBuffer(args.buffer_size) + train_collector = 
Collector(policy, train_envs, buffer, exploration_noise=True) + test_collector = Collector(policy, test_envs) + # log + t0 = datetime.datetime.now().strftime("%m%d_%H%M%S") + log_file = f'seed_{args.seed}_{t0}-{args.task.replace("-", "_")}_ppo' + log_path = os.path.join(args.logdir, args.task, 'ppo', log_file) + writer = SummaryWriter(log_path) + writer.add_text("args", str(args)) + logger = BasicLogger(writer, update_interval=100, train_interval=100) + + def save_fn(policy): + torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth')) + + if not args.watch: + # trainer + result = onpolicy_trainer( + policy, train_collector, test_collector, args.epoch, args.step_per_epoch, + args.repeat_per_collect, args.test_num, args.batch_size, + step_per_collect=args.step_per_collect, save_fn=save_fn, logger=logger, + test_in_train=False) + pprint.pprint(result) + + # Let's watch its performance! + policy.eval() + test_envs.seed(args.seed) + test_collector.reset() + result = test_collector.collect(n_episode=args.test_num, render=args.render) + print(f'Final reward: {result["rews"].mean()}, length: {result["lens"].mean()}') + + +if __name__ == '__main__': + test_ppo() diff --git a/examples/mujoco/mujoco_reinforce.py b/examples/mujoco/mujoco_reinforce.py index 81683e632..da7a4af7a 100755 --- a/examples/mujoco/mujoco_reinforce.py +++ b/examples/mujoco/mujoco_reinforce.py @@ -3,6 +3,7 @@ import os import gym import torch +import pprint import datetime import argparse import numpy as np @@ -36,17 +37,19 @@ def get_args(): parser.add_argument('--batch-size', type=int, default=99999) parser.add_argument('--training-num', type=int, default=64) parser.add_argument('--test-num', type=int, default=10) + # reinforce special + parser.add_argument('--rew-norm', type=int, default=True) + # "clip" option also works well. + parser.add_argument('--action-bound-method', type=str, default="tanh") + parser.add_argument('--lr-decay', type=int, default=True) parser.add_argument('--logdir', type=str, default='log') parser.add_argument('--render', type=float, default=0.) parser.add_argument( '--device', type=str, default='cuda' if torch.cuda.is_available() else 'cpu') parser.add_argument('--resume-path', type=str, default=None) - # reinforce special - parser.add_argument('--rew-norm', type=int, default=True) - # "clip" option also works well. 
- parser.add_argument('--action-bound-method', type=str, default="tanh") - parser.add_argument('--lr-decay', type=int, default=True) + parser.add_argument('--watch', default=False, action='store_true', + help='watch the play of pre-trained policy only') return parser.parse_args() @@ -110,6 +113,11 @@ def dist(*logits): action_bound_method=args.action_bound_method, lr_scheduler=lr_scheduler, action_space=env.action_space) + # load a previous policy + if args.resume_path: + policy.load_state_dict(torch.load(args.resume_path, map_location=args.device)) + print("Loaded agent from: ", args.resume_path) + # collector if args.training_num > 1: buffer = VectorReplayBuffer(args.buffer_size, len(train_envs)) @@ -128,12 +136,14 @@ def dist(*logits): def save_fn(policy): torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth')) - # trainer - result = onpolicy_trainer( - policy, train_collector, test_collector, args.epoch, args.step_per_epoch, - args.repeat_per_collect, args.test_num, args.batch_size, - step_per_collect=args.step_per_collect, save_fn=save_fn, logger=logger, - test_in_train=False) + if not args.watch: + # trainer + result = onpolicy_trainer( + policy, train_collector, test_collector, args.epoch, args.step_per_epoch, + args.repeat_per_collect, args.test_num, args.batch_size, + step_per_collect=args.step_per_collect, save_fn=save_fn, logger=logger, + test_in_train=False) + pprint.pprint(result) # Let's watch its performance! policy.eval() diff --git a/examples/mujoco/mujoco_sac.py b/examples/mujoco/mujoco_sac.py index 1800944a0..46d7ac56e 100755 --- a/examples/mujoco/mujoco_sac.py +++ b/examples/mujoco/mujoco_sac.py @@ -3,6 +3,7 @@ import os import gym import torch +import pprint import datetime import argparse import numpy as np @@ -45,6 +46,8 @@ def get_args(): '--device', type=str, default='cuda' if torch.cuda.is_available() else 'cpu') parser.add_argument('--resume-path', type=str, default=None) + parser.add_argument('--watch', default=False, action='store_true', + help='watch the play of pre-trained policy only') return parser.parse_args() @@ -99,11 +102,10 @@ def test_sac(args=get_args()): actor, actor_optim, critic1, critic1_optim, critic2, critic2_optim, tau=args.tau, gamma=args.gamma, alpha=args.alpha, estimation_step=args.n_step, action_space=env.action_space) + # load a previous policy if args.resume_path: - policy.load_state_dict(torch.load( - args.resume_path, map_location=args.device - )) + policy.load_state_dict(torch.load(args.resume_path, map_location=args.device)) print("Loaded agent from: ", args.resume_path) # collector @@ -125,12 +127,14 @@ def test_sac(args=get_args()): def save_fn(policy): torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth')) - # trainer - result = offpolicy_trainer( - policy, train_collector, test_collector, args.epoch, - args.step_per_epoch, args.step_per_collect, args.test_num, - args.batch_size, save_fn=save_fn, logger=logger, - update_per_step=args.update_per_step, test_in_train=False) + if not args.watch: + # trainer + result = offpolicy_trainer( + policy, train_collector, test_collector, args.epoch, + args.step_per_epoch, args.step_per_collect, args.test_num, + args.batch_size, save_fn=save_fn, logger=logger, + update_per_step=args.update_per_step, test_in_train=False) + pprint.pprint(result) # Let's watch its performance! 
policy.eval() diff --git a/examples/mujoco/mujoco_td3.py b/examples/mujoco/mujoco_td3.py index 28fc2a8f4..97b4e0a0c 100755 --- a/examples/mujoco/mujoco_td3.py +++ b/examples/mujoco/mujoco_td3.py @@ -3,6 +3,7 @@ import os import gym import torch +import pprint import datetime import argparse import numpy as np @@ -47,6 +48,8 @@ def get_args(): '--device', type=str, default='cuda' if torch.cuda.is_available() else 'cpu') parser.add_argument('--resume-path', type=str, default=None) + parser.add_argument('--watch', default=False, action='store_true', + help='watch the play of pre-trained policy only') return parser.parse_args() @@ -103,9 +106,7 @@ def test_td3(args=get_args()): # load a previous policy if args.resume_path: - policy.load_state_dict(torch.load( - args.resume_path, map_location=args.device - )) + policy.load_state_dict(torch.load(args.resume_path, map_location=args.device)) print("Loaded agent from: ", args.resume_path) # collector @@ -127,12 +128,14 @@ def test_td3(args=get_args()): def save_fn(policy): torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth')) - # trainer - result = offpolicy_trainer( - policy, train_collector, test_collector, args.epoch, - args.step_per_epoch, args.step_per_collect, args.test_num, - args.batch_size, save_fn=save_fn, logger=logger, - update_per_step=args.update_per_step, test_in_train=False) + if not args.watch: + # trainer + result = offpolicy_trainer( + policy, train_collector, test_collector, args.epoch, + args.step_per_epoch, args.step_per_collect, args.test_num, + args.batch_size, save_fn=save_fn, logger=logger, + update_per_step=args.update_per_step, test_in_train=False) + pprint.pprint(result) # Let's watch its performance! policy.eval() diff --git a/test/continuous/test_ppo.py b/test/continuous/test_ppo.py index 336e4b673..b1f17faa7 100644 --- a/test/continuous/test_ppo.py +++ b/test/continuous/test_ppo.py @@ -22,14 +22,13 @@ def get_args(): parser.add_argument('--seed', type=int, default=1) parser.add_argument('--buffer-size', type=int, default=20000) parser.add_argument('--lr', type=float, default=1e-3) - parser.add_argument('--gamma', type=float, default=0.99) + parser.add_argument('--gamma', type=float, default=0.95) parser.add_argument('--epoch', type=int, default=5) parser.add_argument('--step-per-epoch', type=int, default=150000) parser.add_argument('--episode-per-collect', type=int, default=16) parser.add_argument('--repeat-per-collect', type=int, default=2) parser.add_argument('--batch-size', type=int, default=128) - parser.add_argument('--hidden-sizes', type=int, - nargs='*', default=[128, 128]) + parser.add_argument('--hidden-sizes', type=int, nargs='*', default=[64, 64]) parser.add_argument('--training-num', type=int, default=16) parser.add_argument('--test-num', type=int, default=100) parser.add_argument('--logdir', type=str, default='log') @@ -38,14 +37,16 @@ def get_args(): '--device', type=str, default='cuda' if torch.cuda.is_available() else 'cpu') # ppo special - parser.add_argument('--vf-coef', type=float, default=0.5) - parser.add_argument('--ent-coef', type=float, default=0.01) + parser.add_argument('--vf-coef', type=float, default=0.25) + parser.add_argument('--ent-coef', type=float, default=0.0) parser.add_argument('--eps-clip', type=float, default=0.2) parser.add_argument('--max-grad-norm', type=float, default=0.5) parser.add_argument('--gae-lambda', type=float, default=0.95) parser.add_argument('--rew-norm', type=int, default=1) parser.add_argument('--dual-clip', type=float, default=None) 
parser.add_argument('--value-clip', type=int, default=1) + parser.add_argument('--norm-adv', type=int, default=1) + parser.add_argument('--recompute-adv', type=int, default=0) args = parser.parse_known_args()[0] return args @@ -90,6 +91,7 @@ def test_ppo(args=get_args()): # pass *logits to be consistent with policy.forward def dist(*logits): return Independent(Normal(*logits), 1) + policy = PPOPolicy( actor, critic, optim, dist, discount_factor=args.gamma, @@ -98,6 +100,8 @@ def dist(*logits): vf_coef=args.vf_coef, ent_coef=args.ent_coef, reward_normalization=args.rew_norm, + advantage_normalization=args.norm_adv, + recompute_advantage=args.recompute_adv, # dual_clip=args.dual_clip, # dual clip cause monotonically increasing log_std :) value_clip=args.value_clip, diff --git a/test/discrete/test_ppo.py b/test/discrete/test_ppo.py index 8ba380e9e..f98e14003 100644 --- a/test/discrete/test_ppo.py +++ b/test/discrete/test_ppo.py @@ -27,8 +27,7 @@ def get_args(): parser.add_argument('--episode-per-collect', type=int, default=20) parser.add_argument('--repeat-per-collect', type=int, default=2) parser.add_argument('--batch-size', type=int, default=64) - parser.add_argument('--hidden-sizes', type=int, - nargs='*', default=[128, 128]) + parser.add_argument('--hidden-sizes', type=int, nargs='*', default=[64, 64]) parser.add_argument('--training-num', type=int, default=20) parser.add_argument('--test-num', type=int, default=100) parser.add_argument('--logdir', type=str, default='log') @@ -41,7 +40,7 @@ def get_args(): parser.add_argument('--ent-coef', type=float, default=0.0) parser.add_argument('--eps-clip', type=float, default=0.2) parser.add_argument('--max-grad-norm', type=float, default=0.5) - parser.add_argument('--gae-lambda', type=float, default=0.8) + parser.add_argument('--gae-lambda', type=float, default=0.95) parser.add_argument('--rew-norm', type=int, default=1) parser.add_argument('--dual-clip', type=float, default=None) parser.add_argument('--value-clip', type=int, default=1) diff --git a/tianshou/policy/modelfree/ppo.py b/tianshou/policy/modelfree/ppo.py index 9b9c61272..8c0575def 100644 --- a/tianshou/policy/modelfree/ppo.py +++ b/tianshou/policy/modelfree/ppo.py @@ -86,8 +86,7 @@ def process_fn( ) -> Batch: if self._recompute_adv: # buffer input `buffer` and `indice` to be used in `learn()`. - self._buffer = buffer - self._indice = indice + self._buffer, self._indice = buffer, indice batch = self._compute_returns(batch, buffer, indice) batch.act = to_torch_as(batch.act, batch.v_s) old_log_prob = []