TRPO Implementation #799

Closed
wants to merge 7 commits into from

Conversation

plutasnyy

What does this PR do?

Hello! As discussed in issue #596, together with @NaIwo we implemented the TRPO method in the RL module.

Implementation & Results

To validate the implementation and demonstrate correctness, our code is heavily based on the implementation accompanying the publication Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO by Logan Engstrom et al., with source code on GitHub.

We tested our code on environments from the publication (Walker, Hopper, and Humanoid from MuJoCo, all continuous) and on one simple discrete environment, LunarLander, to verify the correctness of the implementation.

Of course, the implementations differ in several ways: we compute the improvement direction from the whole batch rather than 10% of it, use different actor and critic architectures, train in a multi/single-agent framework, etc.
The point is not to reproduce the results exactly, but simply to show that our implementation meets the expected minimum ;)
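For readers unfamiliar with the method, here is a minimal sketch (not the code in this PR) of the KL-constrained update at the core of TRPO: the step direction is found with conjugate gradient using Fisher-vector products, then scaled so the quadratic KL estimate matches the threshold (the role of the --kl-div-threshold flag in the commands below). All names are illustrative assumptions.

```python
# Minimal, illustrative TRPO trust-region step (not the PR's actual code).
import torch


def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Solve F x = b using only Fisher-vector products fvp(v) = F v."""
    x = torch.zeros_like(b)
    r = b.clone()
    p = b.clone()
    rs_old = r.dot(r)
    for _ in range(iters):
        Ap = fvp(p)
        alpha = rs_old / p.dot(Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


def trust_region_step(policy_grad, fvp, kl_div_threshold):
    """Scale the natural-gradient direction to the KL trust region."""
    step_dir = conjugate_gradient(fvp, policy_grad)
    shs = 0.5 * step_dir.dot(fvp(step_dir))    # 0.5 * s^T F s
    beta = torch.sqrt(kl_div_threshold / shs)  # so that KL(beta * s) ~= threshold
    return beta * step_dir                     # typically followed by a backtracking line search
```

As a sanity check, with an identity Fisher (fvp = lambda v: v) this reduces to a norm-constrained gradient step.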

The scores presented by the authors are aggregated as expected returns based on agent performance after 500 training epochs.
We simply report the average return of the last 100 episodes after training (one epoch typically contains significantly fewer episodes, around 2-10 depending on the environment).
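In other words, the reported number is just the mean return over the final 100 logged episodes; a hypothetical sketch of that aggregation, assuming a list of per-episode returns:

```python
import numpy as np

def last_100_average(episode_returns):
    """Mean return over the last 100 episodes (fewer if the run is shorter)."""
    return float(np.mean(episode_returns[-100:]))
```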

Humanoid

Expected: [576, 596]
Ours: cumulative rewards around 600 are obtained without any problem

Command to reproduce:

python3 trpo_model.py --max-episode-len 200 --lr-critic 1e-4 --kl-div-threshold 0.07 --env Humanoid-v2  --max_epochs 500 --log_every_n_steps 10
Hopper

Expected: [1948, 2136]
Ours: on average, the model performs similarly

Command to reproduce:

python3 trpo_model.py --max-episode-len 2048 --lr-critic 1e-3 --kl-div-threshold 0.05 --env Hopper-v2  --max_epochs 500 --log_every_n_steps 10
Walker

Expected: [2709, 2873]
Ours: the model performs worse even with more experience; however, with tuned parameters it meets the expected threshold

Command to reproduce:

python3 trpo_model.py --max-episode-len 2048 --lr-critic 1e-3 --kl-div-threshold 0.1 --env Walker2d-v2  --max_epochs 1000 --log_every_n_steps 10
LunarLander

Expected: in short, a score around 200 indicates the task has been solved correctly
Ours: the agent was able to learn how to solve the task

Command to reproduce:

python3 trpo_model.py --max-episode-len 2048 --lr-critic 1e-3 --kl-div-threshold 0.1 --env LunarLander-v2  --max_epochs 1000 --log_every_n_steps 10

You can see more plots at the wandb link.

If you would like to run our code without installing MuJoCo, you can use Colab, which we also used for a while ;)

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? - Yes, but please check whether we have missed anything
  • Did you write any new necessary tests? [not needed for typos/docs]
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Absolutely!

plutasnyy and others added 5 commits January 18, 2022 07:55
Implementation of TRPO model
Co-authored-by: Kamil Pluciński <kamil.plucinski@deepsense.ai>
Co-authored-by: Iwo Naglik <iwo.naglik98@gmail.com>
@plutasnyy
Author

Hello!
I noticed that, unfortunately, new models are no longer accepted 😢 However, this was announced after we had started the work, and if someone would like to make an exception and still consider our contribution, that would be very much appreciated, as we have put a lot of work into it.
Otherwise, feel free to close the pull request.

@Borda
Member

Borda commented Apr 1, 2022

I noticed that unfortunately new models are no longer accepted

RL is still welcome; we are just limiting SSL contributions as we are rethinking/reorganizing that part...
Also, please bear with us, as we are a bit short on time recently... On the other hand, if you are interested in taking a larger contribution stake, please ping me on Slack and I'm happy to chat about options :)

@stale

stale bot commented Jun 5, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the won't fix label on Jun 5, 2022
@stale stale bot closed this Jun 15, 2022
Labels: model, won't fix