
Add RPO to CleanRL #331

Merged: 43 commits merged into vwxyzjn:master on Jan 4, 2023

Conversation

Collaborator

@masud99r masud99r commented Dec 16, 2022

Description

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithm variants or your change could result in a performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • [x] I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have added links to the PR related to the algorithm variant.
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format).
    • I have added links to the tracked experiments.
    • I have updated the overview sections at the docs and the repo
  • I have updated the tests accordingly (if applicable).


@vwxyzjn
Owner

vwxyzjn commented Dec 19, 2022

Hey btw I noticed you ran into some issues with `AttributeError: 'GLFWContext' object has no attribute '_context'`. You just need to install some mujoco dependencies for rendering.

    - name: install mujoco dependencies
      run: |
        sudo apt-get update && sudo apt-get -y install libgl1-mesa-glx libosmesa6 libglfw3

See https://github.com/deepmind/dm_control#rendering. I added a bit of docs to explain this issue.

@masud99r
Collaborator Author

> Hey btw I noticed you ran into some issues with `AttributeError: 'GLFWContext' object has no attribute '_context'`. You just need to install some mujoco dependencies for rendering.
>
>     - name: install mujoco dependencies
>       run: |
>         sudo apt-get update && sudo apt-get -y install libgl1-mesa-glx libosmesa6 libglfw3
>
> See https://github.com/deepmind/dm_control#rendering. I added a bit of docs to explain this issue.

I have tried these commands on my local machine, but the issue persists. I just wanted to give an update.

@vwxyzjn
Owner

vwxyzjn commented Dec 19, 2022

It might be the case that you are using hardware-accelerated rendering. Maybe look at https://github.com/deepmind/dm_control#rendering and google-deepmind/dm_control#136.
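
For reference, dm_control selects its rendering backend via the `MUJOCO_GL` environment variable (`glfw`, `egl`, or `osmesa`). Below is a minimal sketch, not part of this PR, of forcing the software OSMesa backend before dm_control is imported; it assumes `libosmesa6` is installed, and the walker/run task is just an illustrative choice.

```python
import os

# Force the software renderer before dm_control/mujoco is imported;
# valid values are "glfw", "egl", and "osmesa" (see the dm_control rendering docs).
os.environ["MUJOCO_GL"] = "osmesa"

from dm_control import suite  # imported only after MUJOCO_GL is set

env = suite.load(domain_name="walker", task_name="run")
timestep = env.reset()
# render() returns an RGB array without needing a display, which is what
# --capture-video needs on a headless machine.
pixels = env.physics.render(height=240, width=320)
print(pixels.shape)  # (240, 320, 3)
```

On machines with an EGL-capable GPU, `MUJOCO_GL="egl"` is the usual headless alternative.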

@masud99r
Collaborator Author

masud99r commented Dec 20, 2022

Hi @vwxyzjn, did you get around this issue? Farama-Foundation/Shimmy#15 (comment)
I am facing a similar issue with dm_control.

@vwxyzjn
Owner

vwxyzjn commented Dec 20, 2022

> Hi @vwxyzjn, did you get around this issue? Farama-Foundation/Shimmy#15 (comment) I am facing a similar issue with dm_control.

Actually yes, I can reproduce this error on my end. For now, try running the code without `--capture-video`.

@masud99r
Collaborator Author

masud99r commented Dec 21, 2022

Here is the result comparison between PPO and RPO on dm_control environments used in the RPO paper (https://arxiv.org/pdf/2212.07536.pdf).

PPO and RPO were run for 8M timesteps, and results are computed over 10 random seeds.

| Environment | ppo_continuous_action ({'tag': ['v1.0.0-13-gcbd83f6']}) | rpo_continuous_action ({'tag': ['pr-331']}) |
|---|---|---|
| dm_control/humanoid-stand-v0 | 21.41 ± 30.31 | 140.84 ± 57.21 |
| dm_control/humanoid-run-v0 | 5.47 ± 9.32 | 23.82 ± 19.57 |
| dm_control/hopper-hop-v0 | 6.33 ± 18.58 | 62.46 ± 88.61 |
| dm_control/quadruped-walk-v0 | 241.46 ± 92.15 | 610.52 ± 215.04 |
| dm_control/quadruped-run-v0 | 172.81 ± 40.87 | 370.70 ± 119.32 |
| dm_control/quadruped-escape-v0 | 22.78 ± 11.22 | 67.82 ± 28.01 |
| dm_control/walker-stand-v0 | 451.71 ± 220.04 | 724.61 ± 158.48 |
| dm_control/walker-run-v0 | 124.14 ± 88.88 | 389.88 ± 126.93 |
| dm_control/fish-swim-v0 | 83.03 ± 8.93 | 130.44 ± 51.66 |

Learning curve (ppo_continuous_action_slurm is ppo_continuous_action.py run for 8M timesteps over 10 random seeds):

(figure: dm_control_ppo_rpo — PPO vs. RPO learning curves)

The rest of the dm_control experiments are running and will be added later.

@vwxyzjn
Owner

vwxyzjn commented Dec 21, 2022

This is perfect. Looking forward to seeing the rest of the results. A quick tip: instead of creating a separate ppo_continuous_action_slurm file, you could do `python ppo_continuous_action.py --exp-name ppo_continuous_action_8M --total-timesteps 8000000`.

Feel free to prepare the documentation in /docs/rl-algorithms/rpo.md. https://github.com/vwxyzjn/cleanrl/blob/master/docs/rl-algorithms/ppg.md might be a good example.

@masud99r
Collaborator Author

Thanks for the tips. I was worried about messing up the existing experiments named ppo_continuous_action. Using the --exp-name arg is a better option.
Additionally, I wanted to direct the experiment files (wandb, tensorboard, video, etc.) to separate storage so that they do not overflow the workspace where I keep my code, which has a small storage budget. That is why I edited ppo_continuous_action.py into ppo_continuous_action_slurm.py. Is there a better way of doing that? Is there an arg I can use?

For the final merge, I will keep only rpo_continuous_action.py and remove the rest of the files I used to run the jobs, including ppo_continuous_action_slurm.py. The code will still be available in the wandb experiment files (openrlbenchmark). Is that fine? Or let me know if there is a better way.

@vwxyzjn
Owner

vwxyzjn commented Dec 21, 2022

> Thanks for the tips. I was worried about messing up the existing experiments named ppo_continuous_action.

This used to be a concern, but not anymore, since we now use a filter-based workflow.

> Additionally, I wanted to direct the experiment files (wandb, tensorboard, video, etc.) to separate storage so that they do not overflow the workspace where I keep my code, which has a small storage budget. That is why I edited the …

`--exp-name` directs files to folders prefixed with exp_name, so it has the same effect as creating a ppo_continuous_action_slurm.py. Another thing you can leverage is that you can call the script from any directory, and it will save the tensorboard and video files in that directory. For example, `python cleanrl/ppo.py` vs. `cd cleanrl && python ppo.py` will put the storage files in different directories.

run_name = f"{args.env_id}__{args.exp_name}__{args.seed}__{int(time.time())}"
if args.track:
    import wandb

    wandb.init(
        project=args.wandb_project_name,
        entity=args.wandb_entity,
        sync_tensorboard=True,
        config=vars(args),
        name=run_name,
        # monitor_gym=True, no longer works for gymnasium
        save_code=True,
    )
writer = SummaryWriter(f"runs/{run_name}")
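
If the goal is to keep all tracking artifacts on a separate filesystem rather than under the working directory, one possible approach (a hedged sketch, not what this PR does) is to point wandb at that storage via the `WANDB_DIR` environment variable and give the TensorBoard SummaryWriter an explicit path; the /scratch path below is purely illustrative.

```python
import os
import time

from torch.utils.tensorboard import SummaryWriter

# Purely illustrative scratch location with a larger storage budget.
SCRATCH = "/scratch/cleanrl_runs"

# wandb reads WANDB_DIR and places its local run files under $WANDB_DIR/wandb.
os.environ["WANDB_DIR"] = SCRATCH

run_name = f"dm_control/walker-run-v0__ppo_continuous_action_8M__1__{int(time.time())}"
# Send TensorBoard event files to the scratch space instead of ./runs.
writer = SummaryWriter(os.path.join(SCRATCH, "runs", run_name))
writer.add_scalar("charts/example_metric", 0.0, 0)
writer.close()
```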

> For the final merge, I will keep only rpo_continuous_action.py and remove the rest of the files I used to run the jobs, including ppo_continuous_action_slurm.py. The code will still be available in the wandb experiment files (openrlbenchmark). Is that fine? Or let me know if there is a better way.

There is no need to delete those experiments, because the last benchmarked experiments are tagged with `latest`, which is what will be used for querying. So having these ppo_continuous_action_slurm.py experiments would not affect any existing workflows. I would say, though, that you should remove failed and crashed experiments.

@masud99r
Collaborator Author

> There is no need to delete those experiments, because the last benchmarked experiments are tagged with `latest`, which is what will be used for querying. So having these ppo_continuous_action_slurm.py experiments would not affect any existing workflows. I would say, though, that you should remove failed and crashed experiments.

Thinking about this PR: I have created some additional files (run_rpo.sh, ppo_continuous_action_slurm.py, etc.) to facilitate running the experiments. My understanding is that this PR will add only two files to vwxyzjn/cleanrl: rpo_continuous_action.py, and rpo.sh for the benchmarking.

@vwxyzjn
Owner

vwxyzjn commented Dec 21, 2022

> My understanding is that this PR will add only two files to vwxyzjn/cleanrl: rpo_continuous_action.py, and rpo.sh for the benchmarking.

That is correct. To keep track of the ppo_continuous_action_slurm.py experiments, I would suggest updating ppo.sh to add lines such as:

OMP_NUM_THREADS=1 xvfb-run -a poetry run python -m cleanrl_utils.benchmark \
     --env-ids dm_control/acrobot-swingup-v0 dm_control/acrobot-swingup_sparse-v0 dm_control/ball_in_cup-catch-v0 dm_control/cartpole-balance-v0 dm_control/cartpole-balance_sparse-v0 dm_control/cartpole-swingup-v0 dm_control/cartpole-swingup_sparse-v0 dm_control/cartpole-two_poles-v0 dm_control/cartpole-three_poles-v0 dm_control/cheetah-run-v0 dm_control/dog-stand-v0 dm_control/dog-walk-v0 dm_control/dog-trot-v0 dm_control/dog-run-v0 dm_control/dog-fetch-v0 dm_control/finger-spin-v0 dm_control/finger-turn_easy-v0 dm_control/finger-turn_hard-v0 dm_control/fish-upright-v0 dm_control/fish-swim-v0 dm_control/hopper-stand-v0 dm_control/hopper-hop-v0 dm_control/humanoid-stand-v0 dm_control/humanoid-walk-v0 dm_control/humanoid-run-v0 dm_control/humanoid-run_pure_state-v0 dm_control/humanoid_CMU-stand-v0 dm_control/humanoid_CMU-run-v0 dm_control/lqr-lqr_2_1-v0 dm_control/lqr-lqr_6_2-v0 dm_control/manipulator-bring_ball-v0 dm_control/manipulator-bring_peg-v0 dm_control/manipulator-insert_ball-v0 dm_control/manipulator-insert_peg-v0 dm_control/pendulum-swingup-v0 dm_control/point_mass-easy-v0 dm_control/point_mass-hard-v0 dm_control/quadruped-walk-v0 dm_control/quadruped-run-v0 dm_control/quadruped-escape-v0 dm_control/quadruped-fetch-v0 dm_control/reacher-easy-v0 dm_control/reacher-hard-v0 dm_control/stacker-stack_2-v0 dm_control/stacker-stack_4-v0 dm_control/swimmer-swimmer6-v0 dm_control/swimmer-swimmer15-v0 dm_control/walker-stand-v0 dm_control/walker-walk-v0 dm_control/walker-run-v0 \
     --command "poetry run python cleanrl/ppo_continuous_action.py --exp-name ppo_continuous_action_slurm --cuda False --track" \
     --num-seeds 3 \
     --workers 9

I would, however, say that the name ppo_continuous_action_slurm is quite hacky... If it is not too much trouble, it may be preferable to re-run the experiments with the name ppo_continuous_action_8M, but it's not a big deal to me if you want to keep the current ones. We would just need to add a short explanation of it in the docs.

@masud99r
Collaborator Author

Re-running would not be a problem.
However, I want to make sure I understand it fully: the exp name needs to be ppo_continuous_action_8M instead of ppo_continuous_action_slurm. Is that right?
In that case, for the benchmarking, it is easier for me to rename ppo_continuous_action_slurm.py to ppo_continuous_action_8M.py. This is because I need to store the experiment data in a storage-only file system.
I will edit the command in ppo.sh as you mentioned.

@vwxyzjn
Owner

vwxyzjn commented Dec 22, 2022

> Re-running would not be a problem.
> However, I want to make sure I understand it fully: the exp name needs to be ppo_continuous_action_8M instead of ppo_continuous_action_slurm. Is that right?

That’s right. The exp_name needs to be ppo_continuous_action_8M for filtering.

> In that case, for the benchmarking, it is easier for me to rename ppo_continuous_action_slurm.py to ppo_continuous_action_8M.py. This is because I need to store the experiment data in a storage-only file system.

Sounds good. Thanks.

@masud99r
Collaborator Author

Results on all dm_control environments. PPO and RPO were run for 8M timesteps, and results are computed over 10 random seeds.

Table:

| Environment | ppo_continuous_action_8M ({'tag': ['v1.0.0-13-gcbd83f6']}) | rpo_continuous_action ({'tag': ['pr-331']}) |
|---|---|---|
| dm_control/acrobot-swingup-v0 | 25.12 ± 7.77 | 41.62 ± 5.26 |
| dm_control/acrobot-swingup_sparse-v0 | 1.71 ± 0.74 | 3.25 ± 0.91 |
| dm_control/ball_in_cup-catch-v0 | 937.71 ± 8.68 | 941.16 ± 9.67 |
| dm_control/cartpole-balance-v0 | 791.93 ± 13.67 | 797.17 ± 9.63 |
| dm_control/cartpole-balance_sparse-v0 | 990.23 ± 7.34 | 989.03 ± 5.20 |
| dm_control/cartpole-swingup-v0 | 583.60 ± 13.82 | 615.99 ± 11.85 |
| dm_control/cartpole-swingup_sparse-v0 | 240.45 ± 299.08 | 526.18 ± 190.39 |
| dm_control/cartpole-two_poles-v0 | 217.71 ± 5.15 | 219.31 ± 6.76 |
| dm_control/cartpole-three_poles-v0 | 159.93 ± 2.12 | 160.10 ± 2.36 |
| dm_control/cheetah-run-v0 | 473.61 ± 107.87 | 562.38 ± 59.67 |
| dm_control/dog-stand-v0 | 332.40 ± 24.13 | 504.45 ± 135.58 |
| dm_control/dog-walk-v0 | 124.68 ± 23.05 | 166.50 ± 45.85 |
| dm_control/dog-trot-v0 | 80.08 ± 13.04 | 115.28 ± 30.15 |
| dm_control/dog-run-v0 | 67.93 ± 6.84 | 104.17 ± 24.32 |
| dm_control/dog-fetch-v0 | 28.73 ± 4.70 | 44.43 ± 7.21 |
| dm_control/finger-spin-v0 | 628.08 ± 253.43 | 849.49 ± 23.43 |
| dm_control/finger-turn_easy-v0 | 248.87 ± 80.39 | 447.46 ± 129.80 |
| dm_control/finger-turn_hard-v0 | 82.28 ± 34.94 | 238.67 ± 134.88 |
| dm_control/fish-upright-v0 | 541.82 ± 64.54 | 801.81 ± 35.79 |
| dm_control/fish-swim-v0 | 84.81 ± 6.90 | 139.70 ± 40.24 |
| dm_control/hopper-stand-v0 | 3.11 ± 1.63 | 408.24 ± 203.25 |
| dm_control/hopper-hop-v0 | 5.84 ± 17.28 | 63.69 ± 87.18 |
| dm_control/humanoid-stand-v0 | 21.13 ± 30.14 | 140.61 ± 58.36 |
| dm_control/humanoid-walk-v0 | 8.32 ± 17.00 | 77.51 ± 50.81 |
| dm_control/humanoid-run-v0 | 5.49 ± 9.20 | 24.16 ± 19.90 |
| dm_control/humanoid-run_pure_state-v0 | 1.10 ± 0.13 | 3.32 ± 2.50 |
| dm_control/humanoid_CMU-stand-v0 | 4.65 ± 0.30 | 4.34 ± 0.27 |
| dm_control/humanoid_CMU-run-v0 | 0.86 ± 0.08 | 0.85 ± 0.04 |
| dm_control/manipulator-bring_ball-v0 | 0.41 ± 0.22 | 0.55 ± 0.37 |
| dm_control/manipulator-bring_peg-v0 | 0.57 ± 0.24 | 2.64 ± 2.19 |
| dm_control/manipulator-insert_ball-v0 | 41.53 ± 15.58 | 39.24 ± 17.19 |
| dm_control/manipulator-insert_peg-v0 | 49.89 ± 14.26 | 53.23 ± 16.02 |
| dm_control/pendulum-swingup-v0 | 464.82 ± 379.09 | 776.21 ± 20.15 |
| dm_control/point_mass-easy-v0 | 530.17 ± 262.14 | 652.18 ± 21.93 |
| dm_control/point_mass-hard-v0 | 132.29 ± 69.71 | 184.29 ± 24.17 |
| dm_control/quadruped-walk-v0 | 239.11 ± 85.20 | 600.71 ± 213.40 |
| dm_control/quadruped-run-v0 | 165.70 ± 43.21 | 375.78 ± 119.95 |
| dm_control/quadruped-escape-v0 | 23.82 ± 10.82 | 68.19 ± 30.69 |
| dm_control/quadruped-fetch-v0 | 183.03 ± 37.50 | 227.47 ± 21.67 |
| dm_control/reacher-easy-v0 | 769.10 ± 48.35 | 740.93 ± 51.58 |
| dm_control/reacher-hard-v0 | 636.89 ± 77.68 | 584.33 ± 43.76 |
| dm_control/stacker-stack_2-v0 | 61.73 ± 17.34 | 64.32 ± 7.15 |
| dm_control/stacker-stack_4-v0 | 71.68 ± 29.03 | 53.78 ± 20.09 |
| dm_control/swimmer-swimmer6-v0 | 152.82 ± 30.77 | 166.89 ± 14.49 |
| dm_control/swimmer-swimmer15-v0 | 148.59 ± 22.63 | 153.45 ± 17.12 |
| dm_control/walker-stand-v0 | 453.73 ± 217.72 | 736.72 ± 145.16 |
| dm_control/walker-walk-v0 | 308.91 ± 87.85 | 785.61 ± 133.29 |
| dm_control/walker-run-v0 | 123.98 ± 91.34 | 384.08 ± 123.89 |

Learning curves:

(figure: dm_control_all_ppo_rpo_8M — PPO vs. RPO learning curves on all dm_control environments)

@vwxyzjn
Owner

vwxyzjn commented Dec 23, 2022

The results look exceptionally clean 👍. Given all the reported results, I would feel comfortable marking the existing ppo_continuous_action.py as "deprecated" or "not recommended" in favor of rpo_continuous_action.py, because of the simplicity of the approach and the superiority of the empirical results.

Please update the docs with these results. Please also remove the ppo_continuous_action_slurm.py results from the project, as they are no longer needed.

Thank you, and have a great holiday.

@Howuhh
Contributor

Howuhh commented Dec 26, 2022

Seeing how much this improves the scores, I'm now really curious what happens if we change from the Gaussian to the Beta distribution... Or simply clip the log std, so that the std cannot shrink to too small a value.

@masud99r
Collaborator Author

@Howuhh Trying the Beta distribution would be an interesting idea. I observed some performance effects when trying Laplace and Gumbel distributions (Figures 7 & 8 in https://arxiv.org/pdf/2212.07536.pdf).

A key aspect of the RPO method is maintaining randomness in the action distribution through a Uniform perturbation. It would be interesting to see whether clipping the log std can achieve a similar effect.
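
For readers following along: the core change RPO makes on top of a PPO-style Gaussian policy is small. During the policy update, the predicted action mean is perturbed with Uniform(-alpha, alpha) noise before log-probabilities and entropy are computed, which keeps the distribution from collapsing toward a near-deterministic one. Below is a minimal sketch of that idea; the `rpo_alpha` name and the 0.5 default follow my reading of the paper, and the exact code in rpo_continuous_action.py may differ slightly.

```python
import torch
from torch.distributions.normal import Normal


def get_action_and_value(action_mean, action_logstd, action=None, rpo_alpha=0.5):
    """Gaussian policy head with the RPO perturbation.

    action_mean, action_logstd: outputs of the policy network, shape (batch, action_dim).
    If `action` is None we are collecting rollouts and sample exactly as in PPO;
    otherwise we are in the update phase and perturb the mean with
    Uniform(-rpo_alpha, rpo_alpha) noise before computing log-probs and entropy.
    """
    action_std = torch.exp(action_logstd)
    if action is None:
        probs = Normal(action_mean, action_std)
        action = probs.sample()
    else:
        z = torch.empty_like(action_mean).uniform_(-rpo_alpha, rpo_alpha)
        probs = Normal(action_mean + z, action_std)
    return action, probs.log_prob(action).sum(1), probs.entropy().sum(1)
```

During rollouts the behavior is identical to PPO; only the distribution used in the update is perturbed.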

@masud99r
Collaborator Author

masud99r commented Jan 3, 2023

Help needed:
The docs work as expected, except when I add iframes to embed the tracked experiments. When I add the tracked-experiment iframes, the docs render the first iframe's tracked results and break from there. Do you have any idea what I am missing here? @vwxyzjn

RPO doc: https://github.com/masud99r/cleanrl/blob/master/docs/rl-algorithms/rpo.md

Owner

@vwxyzjn vwxyzjn left a comment

Hey I fixed some formatting issues with the iframe. Please let me know if you run into additional issues.

Review comments on docs/rl-algorithms/rpo.md (outdated, resolved)
@masud99r masud99r closed this Jan 4, 2023
@masud99r masud99r reopened this Jan 4, 2023
@masud99r
Collaborator Author

masud99r commented Jan 4, 2023

> Hey I fixed some formatting issues with the iframe. Please let me know if you run into additional issues.

Thanks for the fix. Were those tab/space issues or something different?

I have added a warning section with the deprecation notice.

With that, I think I have addressed all the comments, though I might have missed some in the thread! Could you please take a look and let me know what else needs to be done for this PR?

Owner

@vwxyzjn vwxyzjn left a comment

LGTM. Thanks for your patience and this awesome contribution. Please give it a final read at https://cleanrl-git-fork-masud99r-master-vwxyzjn.vercel.app/rl-algorithms/rpo/. I have added you as a collaborator of this repo, and feel free to go ahead and merge.

@masud99r
Collaborator Author

masud99r commented Jan 4, 2023

> LGTM. Thanks for your patience and this awesome contribution. Please give it a final read at https://cleanrl-git-fork-masud99r-master-vwxyzjn.vercel.app/rl-algorithms/rpo/. I have added you as a collaborator of this repo, and feel free to go ahead and merge.

It was my pleasure working on this. Thanks for your excellent guidance throughout the process. I have learned a lot.

I took a pass on the docs and fixed some typos. I am going to merge.

@masud99r masud99r merged commit a90742a into vwxyzjn:master Jan 4, 2023
vwxyzjn added a commit that referenced this pull request Jan 5, 2023
vwxyzjn added a commit that referenced this pull request Jan 5, 2023