What does this PR do?
Hello! As discussed in issue #596, @NaIwo and I implemented the TRPO method in the RL module.
Implementation & Results
To validate the implementation and demonstrate its correctness, our code is based quite heavily on the reference implementation from the publication Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO by Logan Engstrom et al., whose source code is available on GitHub.
We tested our code on environments from the publication (Walker, Hopper, and Humanoid from MuJoCo, all continuous) and on one simple discrete environment, LunarLander, to verify the correctness of the implementation.
Of course, our implementation differs in several ways: we estimate the direction of improvement on the whole batch rather than on a 10% subsample, we use different actor and critic architectures, we train in a single/multi-agent framework, etc. (the sketch below illustrates the first difference). The point here is not to reproduce the results exactly, but simply to show that our implementation meets the expected minimum ;)
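For context, here is a minimal sketch (not our actual code) of the conjugate-gradient solve at the heart of a TRPO update, assuming PyTorch, a hypothetical `fvp` callable that returns Fisher-vector products, and a flattened policy gradient `b`. In our implementation the Fisher-vector products are estimated on the whole collected batch, while the reference code uses a random 10% subsample:

```python
import torch

def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    # Solve F x = b for the TRPO step direction x, where fvp(v)
    # returns the Fisher-vector product F v of the policy's KL Hessian.
    x = torch.zeros_like(b)
    r = b.clone()
    p = b.clone()
    rs_old = torch.dot(r, r)
    for _ in range(iters):
        Fp = fvp(p)                      # F p, estimated from sampled states
        alpha = rs_old / torch.dot(p, Fp)
        x = x + alpha * p                # move along the conjugate direction
        r = r - alpha * Fp               # update the residual
        rs_new = torch.dot(r, r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p    # next conjugate direction
        rs_old = rs_new
    return x
```

The resulting direction is then scaled to satisfy the KL trust-region constraint and checked with a backtracking line search, as in standard TRPO.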
The scores presented by the authors are aggregated as expected returns of the agent after 500 learning epochs. We simply report the average return over the last 100 episodes once the epochs finish (one epoch typically contains significantly fewer episodes, around 2-10 depending on the environment).
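Concretely, the number we report per run is just the following (a trivial sketch; `episode_returns` is a hypothetical name for the list of per-episode cumulative rewards logged during training):

```python
import numpy as np

def last_100_average(episode_returns):
    # Average return over the final 100 logged episodes.
    return float(np.mean(episode_returns[-100:]))
```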
Humanoid
Expected: [576, 596]
Ours: the agent obtains cumulative rewards around 600 without any problem
Command to reproduce:
Hopper
Expected: [1948, 2136]
Ours: on average, the model performs similarly
Command to reproduce:
Walker
Expected: [2709, 2873]
Ours: our model performs worse, even with more experience; however, with tuned parameters it meets the expected threshold
Command to reproduce:
LunarLander
Expected: in short, a score around 200 indicates that the task is solved
Ours: the agent was able to learn to solve the task
Command to reproduce:
You can see more plots at the link: wandb.
If you would like to run our code without installing MuJoCo, you can use Colab, which we also used for a while ;)
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Absolutely!