Match PPG implementation #186
Merged
Commits
- `419041d` added nit changes from ppg code (dipamc)
- `2e1190b` change observation buffer to uint8 (dipamc)
- `86f5be7` sample full rollouts (dipamc)
- `beff293` minor device fix (dipamc)
- `4cb85d5` update optimizer settings (dipamc)
- `d6ee26b` add ppg documentation
- `fea4531` update mkdocs (dipamc)
- `20f15da` update images to png for codespell errors (dipamc)
- `6c3cb05` trigger CI (vwxyzjn)
- `631ab96` Minor format change (vwxyzjn)
- `d961d0f` format by running `pre-commit` (vwxyzjn)
- `4cff11d` removes trailing space (vwxyzjn)
- `fb9c832` Add an extra note (vwxyzjn)
- `31bb5c4` argument names and documentation changes (dipamc)
- `ed66604` add capture video (dipamc)
- `1610191` add experiment report (dipamc)
- `51c6aac` Merge branch 'master' into ppg-dev (vwxyzjn)
- `a4342f8` Update documentation (vwxyzjn)
- `3d4711c` Quick css fix (vwxyzjn)
- `b780521` Update documentation (vwxyzjn)
- `9c4edf8` Fix documentation for PPO (vwxyzjn)
- `23cd48e` Add benchmark commands (vwxyzjn)
- `8e4f977` Add benchmark commands (vwxyzjn)
- `72e8cce` add metrics section (dipamc)
- `aa695c1` Add more docs (vwxyzjn)
- `0564584` Quick fix on ddpg docs (vwxyzjn)
- `a08039e` Add procgen test cases (vwxyzjn)
- `31a175c` Update CI (vwxyzjn)
- `f063a7b` test CI (vwxyzjn)
- `60df2c8` test ci (vwxyzjn)
- `e70c71a` Update tests (vwxyzjn)
- `6ebaaae` normalization axis documentation (dipamc)
# Phasic Policy Gradient (PPG)

## Overview

PPG is a DRL algorithm that separates policy and value function training by introducing an auxiliary phase. Training proceeds by running PPO during the policy phase while saving all the experience into a replay buffer. The replay buffer is then used to train the value function during the auxiliary phase. This makes the algorithm considerably slower than PPO, but improves sample efficiency on the Procgen benchmark.
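To make the two phases concrete, here is a minimal runnable sketch of the phase structure described above. `collect_rollout`, `ppo_update`, and `aux_update` are illustrative stand-ins, not functions from `ppg_procgen.py`; the loop counts are the paper's reported defaults.

```python
# Minimal sketch of PPG's two-phase structure; all names are illustrative.

def collect_rollout():
    return {}  # stand-in for stepping the vectorized envs for 256 steps

def ppo_update(rollout):
    pass  # stand-in for the usual clipped PPO policy/value update

def aux_update(rollout):
    pass  # stand-in for the joint loss: value error + KL to the old policy

N_PI = 32   # PPO iterations per policy phase (paper default)
E_AUX = 6   # epochs over the buffer during the auxiliary phase (paper default)

for phase in range(100):
    aux_buffer = []
    # Policy phase: plain PPO, but every rollout is kept for later reuse.
    for _ in range(N_PI):
        rollout = collect_rollout()
        ppo_update(rollout)
        aux_buffer.append(rollout)
    # Auxiliary phase: distill value-function knowledge from all the saved
    # experience while a KL penalty keeps the policy from drifting.
    for _ in range(E_AUX):
        for rollout in aux_buffer:
            aux_update(rollout)
```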
Original paper:

* [Phasic Policy Gradient](https://arxiv.org/abs/2009.04416)

Reference resources:

* [Code for the paper "Phasic Policy Gradient"](https://github.com/openai/phasic-policy-gradient) - by the original authors from OpenAI

The original code has multiple code-level details that are not mentioned in the paper. We found these details to be important for reproducing the results claimed by the paper.

## Implemented Variants

| Variants Implemented | Description |
| ----------- | ----------- |
| :material-github: [`ppg_procgen.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppg_procgen.py), :material-file-document: [docs](/rl-algorithms/ppg/#ppg_procgenpy) | For the Procgen benchmark with 64x64 RGB image observations and discrete actions. |
Below are our single-file implementations of PPG:

## `ppg_procgen.py`

`ppg_procgen.py` works with the Procgen benchmark, which uses 64x64 RGB image observations and discrete actions.
### Usage

```bash
poetry install -E procgen
python cleanrl/ppg_procgen.py --help
python cleanrl/ppg_procgen.py --env-id "bigfish"
```
## Implementation details

`ppg_procgen.py` includes the following implementation details that differ from PPO:
1. Full rollout sampling during auxiliary phase - (:material-github: [phasic_policy_gradient/ppg.py#L173](https://github.com/openai/phasic-policy-gradient/blob/c789b00be58aa704f7223b6fc8cd28a5aaa2e101/phasic_policy_gradient/ppg.py#L173)) - Instead of randomly sampling observations over the entire auxiliary buffer, PPG samples full rollouts from the buffer (sets of 256 steps). This full rollout sampling is only done during the auxiliary phase. Note that the rollouts will still start at random points, because PPO truncates the rollouts per environment. This change gives a decent performance boost. (A sketch of this sampling scheme follows this list.)

1. Batch-level advantage normalization - (:material-github: [phasic_policy_gradient/ppo.py#L70](https://github.com/openai/phasic-policy-gradient/blob/c789b00be58aa704f7223b6fc8cd28a5aaa2e101/phasic_policy_gradient/ppo.py#L70)) - PPG normalizes the full batch of advantage values before the PPO updates, instead of normalizing the advantages of each minibatch. (A sketch also follows this list.)

1. Normalized network initialization - (:material-github: [phasic_policy_gradient/impala_cnn.py#L64](https://github.com/openai/phasic-policy-gradient/blob/c789b00be58aa704f7223b6fc8cd28a5aaa2e101/phasic_policy_gradient/impala_cnn.py#L64)) - PPG uses normalized initialization for all layers, with different scales. (A hedged sketch follows this list.)
    * The original PPO used orthogonal initialization of only the policy head and value head, with scales of 0.01 and 1.0, respectively.
    * For PPG:
        * All weights are initialized with the default torch initialization (Kaiming uniform).
        * Each layer's weights are divided by the L2 norm of the weights along the (which axis?), and multiplied by a scale factor.
        * Scale factors for the different layers:
            * Value head, policy head, auxiliary value head - 0.1
            * Fully connected layer after the last conv layer - 1.4
            * Convolutional layers - approximately 0.638
1. The Adam optimizer's epsilon parameter - (:material-github: [phasic_policy_gradient/ppg.py#L239](https://github.com/openai/phasic-policy-gradient/blob/c789b00be58aa704f7223b6fc8cd28a5aaa2e101/phasic_policy_gradient/ppg.py#L239)) - Set to the torch default of 1e-8, instead of the 1e-5 used in PPO.
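For the first detail, here is a minimal sketch of rollout-level (rather than transition-level) sampling for the auxiliary phase. The buffer layout, sizes, and names are assumptions for illustration, not the exact shapes used in `ppg_procgen.py`:

```python
import torch

# Assumed buffer layout: one 256-step rollout per env per PPO iteration,
# stored as uint8 images (small sizes here, purely for illustration).
num_rollouts, num_steps = 16, 256
obs_buffer = torch.zeros(num_rollouts, num_steps, 64, 64, 3, dtype=torch.uint8)

# Transition-level sampling would draw random (rollout, step) pairs.
# PPG instead shuffles whole rollouts, keeping each one's 256 steps together:
rollouts_per_batch = 4
for idx in torch.randperm(num_rollouts).split(rollouts_per_batch):
    minibatch = obs_buffer[idx]  # shape (4, 256, 64, 64, 3): full rollouts
    # run the auxiliary update on `minibatch` here
```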
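For the second detail, a minimal sketch of batch-level advantage normalization; the batch and minibatch sizes are illustrative:

```python
import torch

advantages = torch.randn(64 * 256)  # flattened batch: num_envs * num_steps

# Normalize ONCE over the full batch, before any minibatching (PPG style):
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Minibatches then reuse the already-normalized values, whereas many PPO
# implementations would instead re-normalize inside this loop:
for mb_inds in torch.randperm(advantages.numel()).split(2048):
    mb_advantages = advantages[mb_inds]
    # compute the clipped PPO policy loss with mb_advantages here
```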
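And for the third detail, a sketch of the normalized initialization. The normalization axis is our assumption (one norm per output unit, reduced over all fan-in axes), which is exactly the point the review comment at the bottom asks to pin down:

```python
import torch
import torch.nn as nn

def normed_init_(layer: nn.Module, scale: float) -> nn.Module:
    """Rescale default (Kaiming-uniform) weights to unit L2 norm times `scale`."""
    with torch.no_grad():
        w = layer.weight
        # ASSUMPTION: one norm per output unit, taken over all fan-in axes.
        norms = w.flatten(1).norm(p=2, dim=1).view(-1, *([1] * (w.dim() - 1)))
        w.mul_(scale / norms)
        if layer.bias is not None:
            layer.bias.zero_()
    return layer

conv = normed_init_(nn.Conv2d(3, 16, kernel_size=3), scale=0.638)  # conv layers
fc = normed_init_(nn.Linear(256, 256), scale=1.4)   # fc after the last conv
value_head = normed_init_(nn.Linear(256, 1), scale=0.1)  # heads use 0.1
```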
### Extra notes

- All the default hyperparameters from the original PPG implementation are used, except that the number of environments is set to 64.
- The original PPG paper does not report results on easy environments, so further hyperparameter tuning may give better results.
- Skipping every alternate auxiliary phase gives similar performance on easy environments while saving compute.
- The normalized network initialization scheme seems to matter a lot, but using layer norm with orthogonal initialization also works.
- Using mixed precision in the auxiliary phase also works well to save compute, but using it in the policy phase makes training unstable.
### Differences from the original PPG code

- The original PPG code supports LSTM whereas the CleanRL code does not.
- The original PPG code uses separate optimizers for the policy and auxiliary phases, but we do not implement this, as we found it does not make much difference.
- The original PPG code utilizes multiple GPUs, but our implementation does not.
### Experiment results

Below are the average episodic returns for `ppg_procgen.py`, compared with `ppo_procgen.py`, at 25M timesteps.

| Environment | `ppg_procgen.py` | `ppo_procgen.py` |
| ----------- | ----------- | ----------- |
| Bigfish (easy) | 27.670 ± 9.523 | 21.605 ± 7.996 |
| Starpilot (easy) | 39.086 ± 11.042 | 34.025 ± 12.535 |
Learning curves:

<div class="grid-container">
<img src="../ppg/bigfish-easy-ppg-ppo.png">

<img src="../ppg/starpilot-easy-ppg-ppo.png">

<img src="../ppg/bossfight-easy-ppg-ppo.png">
</div>

Tracked experiments and game play videos:

Please check this [wandb report](https://wandb.ai/openrlbenchmark/cleanrl/reports/CleanRL-PPG-vs-PPO-results--VmlldzoyMDY2NzQ5) for tracked results.
Review comment (on the normalized initialization detail): Please clarify "which axis" here.