Match PPG implementation #186
Conversation
This is massive! Thanks for making this PR.
I think before any additional changes, our next step is to establish a strong baseline. We should compare our results against the original results; otherwise, it's hard to attest to the quality of our PPG implementation, given that we included divergent changes such as using a shared optimizer.
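(For context, here is a minimal sketch of what "shared optimizer" means here; the network shapes and learning rate are illustrative, not taken from either codebase:)

```python
import torch.nn as nn
import torch.optim as optim

# Illustrative stand-ins for the actual policy and value networks.
policy, value = nn.Linear(64, 15), nn.Linear(64, 1)

# Shared optimizer: a single Adam instance updates both parameter sets,
# coupling their optimizer state and any learning-rate schedule.
shared_opt = optim.Adam(
    list(policy.parameters()) + list(value.parameters()), lr=5e-4
)

# Separate optimizers: each network gets its own Adam instance and state.
policy_opt = optim.Adam(policy.parameters(), lr=5e-4)
value_opt = optim.Adam(value.parameters(), lr=5e-4)
```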
The original paper ran its experiments using the hard distribution mode, so we might have to re-run them. I made a fork available here https://github.com/openrlbenchmark/phasic-policy-gradient to run tracked experiments. Unfortunately, I was not able to run them due to insufficient GPU memory... Would you mind giving it a try? The benchmark commands are at https://github.com/openrlbenchmark/phasic-policy-gradient/blob/add-wandb/benchmark.sh
Hey @Dipamc77, I made some modifications to the documentation. Would you mind adding a section called "Explanation of the logged metrics"? (See here as an example.) Everything else looks good on my end. I did notice that the wall-time performance of openai/phasic-policy-gradient is better, but the reason could be that we record videos (which can be costly) or hardware differences. I am not too worried about it, though.
docs/rl-algorithms/ppg.md
Outdated
* The original PPO used orthogonal initialization of only the policy and value heads, with scales of 0.01 and 1.0, respectively.
* For PPG:
  * All weights are initialized with the default PyTorch initialization (Kaiming uniform).
  * Each layer's weights are divided by the L2 norm of the weights along the (which axis?), and multiplied by a scale factor.
Please clarify "which axis" here.
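For reference, a minimal sketch of this initialization scheme, assuming (as in openai/phasic-policy-gradient's normed layers) that the norm is taken over the fan-in axes, i.e. `dim=1` for `nn.Linear` and `dim=(1, 2, 3)` for `nn.Conv2d`; the `normed_init_` helper and the scale values are illustrative, not part of the PR:

```python
import torch.nn as nn

def normed_init_(layer, scale=1.0):
    # Rescale the layer's default (Kaiming-uniform) weights so that each
    # output unit's weight vector has L2 norm equal to `scale`.
    # Assumption: the norm is taken over all fan-in axes (every dim except 0),
    # i.e. (1,) for nn.Linear and (1, 2, 3) for nn.Conv2d.
    fan_in_dims = tuple(range(1, layer.weight.dim()))
    norm = layer.weight.data.norm(p=2, dim=fan_in_dims, keepdim=True)
    layer.weight.data *= scale / norm
    if layer.bias is not None:
        layer.bias.data.zero_()
    return layer

# Illustrative usage: a larger scale for a hidden layer, a small scale for
# the policy head (analogous to PPO's 0.01 orthogonal init of that head).
hidden = normed_init_(nn.Linear(256, 256), scale=1.4)
policy_head = normed_init_(nn.Linear(256, 15), scale=0.1)
```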
Description
Types of changes
Checklist:
- [ ] I have ensured `pre-commit run --all-files` passes (required).
- [ ] I have updated the documentation and previewed the changes via `mkdocs serve`.

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

- [ ] I have tracked applicable experiments with the `--capture-video` flag toggled on (required).
- [ ] I have added documentation and previewed the changes via `mkdocs serve`.
- [ ] I have added the learning curves (in PNG format with `width=500` and `height=300`).