
Add RPO to CleanRL #331

Merged: 43 commits merged into vwxyzjn:master on Jan 4, 2023

Conversation

Collaborator

@masud99r masud99r commented Dec 16, 2022

Description

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithm variants or your change could result in a performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • [x] I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have added links to the PR related to the algorithm variant.
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format).
    • I have added links to the tracked experiments.
    • I have updated the overview sections at the docs and the repo
  • I have updated the tests accordingly (if applicable).


@vwxyzjn
Owner

vwxyzjn commented Dec 19, 2022

Hey btw I noticed you ran into some issues with `AttributeError: 'GLFWContext' object has no attribute '_context'`. You just need to install some mujoco dependencies for rendering.

    - name: install mujoco dependencies
      run: |
        sudo apt-get update && sudo apt-get -y install libgl1-mesa-glx libosmesa6 libglfw3

See https://github.com/deepmind/dm_control#rendering. I added a bit of docs to explain this issue.

@masud99r
Collaborator Author

> Hey btw I noticed you ran into some issues with `AttributeError: 'GLFWContext' object has no attribute '_context'`. You just need to install some mujoco dependencies for rendering.
>
>     - name: install mujoco dependencies
>       run: |
>         sudo apt-get update && sudo apt-get -y install libgl1-mesa-glx libosmesa6 libglfw3
>
> See https://github.com/deepmind/dm_control#rendering. I added a bit of docs to explain this issue.

I have tried these commands on my local machine, but the issue persists. I just wanted to give an update.

@vwxyzjn
Owner

vwxyzjn commented Dec 19, 2022

It might be the case that you are using hardware-accelerated rendering. Maybe look at https://github.com/deepmind/dm_control#rendering and google-deepmind/dm_control#136.
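
For reference, dm_control selects its rendering backend via the `MUJOCO_GL` environment variable (`glfw`, `egl`, or `osmesa`). Below is a minimal sketch, not part of this PR, of forcing the software OSMesa backend before dm_control is imported; it assumes `libosmesa6` is installed, and the walker/run task is just an illustrative choice.

```python
import os

# Force the software renderer before dm_control/mujoco is imported;
# valid values are "glfw", "egl", and "osmesa" (see the dm_control rendering docs).
os.environ["MUJOCO_GL"] = "osmesa"

from dm_control import suite  # imported only after MUJOCO_GL is set

env = suite.load(domain_name="walker", task_name="run")
timestep = env.reset()
# render() returns an RGB array without needing a display, which is what
# --capture-video needs on a headless machine.
pixels = env.physics.render(height=240, width=320)
print(pixels.shape)  # (240, 320, 3)
```

On machines with an EGL-capable GPU, `MUJOCO_GL="egl"` is the usual headless alternative.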

@masud99r
Collaborator Author

masud99r commented Dec 20, 2022

Hi @vwxyzjn, did you get around this issue? Farama-Foundation/Shimmy#15 (comment)
I am facing a similar issue with dm_control.

@vwxyzjn
Owner

vwxyzjn commented Dec 20, 2022

> Hi @vwxyzjn, did you get around this issue? Farama-Foundation/Shimmy#15 (comment) I am facing a similar issue with dm_control.

Actually yes, I can reproduce this error on my end. For now, try running the code without `--capture-video`.

@masud99r
Collaborator Author

masud99r commented Dec 21, 2022

Here is the result comparison between PPO and RPO on dm_control environments used in the RPO paper (https://arxiv.org/pdf/2212.07536.pdf).

PPO and RPO were run for 8M timesteps, and results are computed over 10 random seeds.

| Environment | ppo_continuous_action ({'tag': ['v1.0.0-13-gcbd83f6']}) | rpo_continuous_action ({'tag': ['pr-331']}) |
|---|---|---|
| dm_control/humanoid-stand-v0 | 21.41 ± 30.31 | 140.84 ± 57.21 |
| dm_control/humanoid-run-v0 | 5.47 ± 9.32 | 23.82 ± 19.57 |
| dm_control/hopper-hop-v0 | 6.33 ± 18.58 | 62.46 ± 88.61 |
| dm_control/quadruped-walk-v0 | 241.46 ± 92.15 | 610.52 ± 215.04 |
| dm_control/quadruped-run-v0 | 172.81 ± 40.87 | 370.70 ± 119.32 |
| dm_control/quadruped-escape-v0 | 22.78 ± 11.22 | 67.82 ± 28.01 |
| dm_control/walker-stand-v0 | 451.71 ± 220.04 | 724.61 ± 158.48 |
| dm_control/walker-run-v0 | 124.14 ± 88.88 | 389.88 ± 126.93 |
| dm_control/fish-swim-v0 | 83.03 ± 8.93 | 130.44 ± 51.66 |

Learning curve (ppo_continuous_action_slurm is ppo_continuous_action.py run for 8M timesteps over 10 random seeds):

(figure: dm_control_ppo_rpo — PPO vs. RPO learning curves)

The rest of the dm_control experiments are running and will be added later.

@vwxyzjn
Owner

vwxyzjn commented Dec 21, 2022

This is perfect. Looking forward to seeing the rest of the results. A quick tip: instead of creating a separate ppo_continuous_action_slurm file, you could do `python ppo_continuous_action.py --exp-name ppo_continuous_action_8M --total-timesteps 8000000`.

Feel free to prepare the documentation in /docs/rl-algorithms/rpo.md. https://github.com/vwxyzjn/cleanrl/blob/master/docs/rl-algorithms/ppg.md might be a good example.

@masud99r
Collaborator Author

Thanks for the tips. I was worried about messing up the existing experiments named ppo_continuous_action. Using the --exp-name arg is a better option.
Additionally, I wanted to direct the experiment files (wandb, tensorboard, video, etc.) to separate storage so that they do not overflow the workspace where I keep my code, which has a small storage budget. That is why I edited ppo_continuous_action.py into ppo_continuous_action_slurm.py. Is there a better way of doing that? Is there an arg I can use?

For the final merge, I will keep only rpo_continuous_action.py and remove the rest of the files I used to run the jobs, including ppo_continuous_action_slurm.py. The code will still be available in the wandb experiment files (openrlbenchmark). Is that fine? Or let me know if there is a better way.

@vwxyzjn
Owner

vwxyzjn commented Dec 21, 2022

> Thanks for the tips. I was worried about messing up the existing experiments named ppo_continuous_action.

This used to be a concern, but not anymore, since we now use a filter-based workflow.

> Additionally, I wanted to direct the experiment files (wandb, tensorboard, video, etc.) to separate storage so that they do not overflow the workspace where I keep my code, which has a small storage budget. That is why I edited the …

`--exp-name` directs files to folders prefixed with exp_name, so it has the same effect as creating a ppo_continuous_action_slurm.py. Another thing you can leverage is that you can call the script from any directory, and it will save the tensorboard and video files in that directory. For example, `python cleanrl/ppo.py` vs. `cd cleanrl && python ppo.py` will put the storage files in different directories.

run_name = f"{args.env_id}__{args.exp_name}__{args.seed}__{int(time.time())}"
if args.track:
    import wandb

    wandb.init(
        project=args.wandb_project_name,
        entity=args.wandb_entity,
        sync_tensorboard=True,
        config=vars(args),
        name=run_name,
        # monitor_gym=True, no longer works for gymnasium
        save_code=True,
    )
writer = SummaryWriter(f"runs/{run_name}")
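
If the goal is to keep all tracking artifacts on a separate filesystem rather than under the working directory, one possible approach (a hedged sketch, not what this PR does) is to point wandb at that storage via the `WANDB_DIR` environment variable and give the TensorBoard SummaryWriter an explicit path; the /scratch path below is purely illustrative.

```python
import os
import time

from torch.utils.tensorboard import SummaryWriter

# Purely illustrative scratch location with a larger storage budget.
SCRATCH = "/scratch/cleanrl_runs"

# wandb reads WANDB_DIR and places its local run files under $WANDB_DIR/wandb.
os.environ["WANDB_DIR"] = SCRATCH

run_name = f"dm_control/walker-run-v0__ppo_continuous_action_8M__1__{int(time.time())}"
# Send TensorBoard event files to the scratch space instead of ./runs.
writer = SummaryWriter(os.path.join(SCRATCH, "runs", run_name))
writer.add_scalar("charts/example_metric", 0.0, 0)
writer.close()
```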

> For the final merge, I will keep only rpo_continuous_action.py and remove the rest of the files I used to run the jobs, including ppo_continuous_action_slurm.py. The code will still be available in the wandb experiment files (openrlbenchmark). Is that fine? Or let me know if there is a better way.

There is no need to delete those experiments, because the last benchmarked experiments are tagged with `latest`, which is what will be used for querying. So having these ppo_continuous_action_slurm.py experiments would not affect any existing workflows. I would say, though, that you should remove failed and crashed experiments.

@masud99r
Collaborator Author

> There is no need to delete those experiments, because the last benchmarked experiments are tagged with `latest`, which is what will be used for querying. So having these ppo_continuous_action_slurm.py experiments would not affect any existing workflows. I would say, though, that you should remove failed and crashed experiments.

Thinking about this PR: I have created some additional files (run_rpo.sh, ppo_continuous_action_slurm.py, etc.) to facilitate running the experiments. My understanding is that this PR will add only two files to vwxyzjn/cleanrl: rpo_continuous_action.py, and rpo.sh for the benchmarking.

@vwxyzjn
Owner

vwxyzjn commented Dec 21, 2022

> My understanding is that this PR will add only two files to vwxyzjn/cleanrl: rpo_continuous_action.py, and rpo.sh for the benchmarking.

That is correct. To keep track of the ppo_continuous_action_slurm.py experiments, I would suggest updating ppo.sh to add lines such as:

OMP_NUM_THREADS=1 xvfb-run -a poetry run python -m cleanrl_utils.benchmark \
     --env-ids dm_control/acrobot-swingup-v0 dm_control/acrobot-swingup_sparse-v0 dm_control/ball_in_cup-catch-v0 dm_control/cartpole-balance-v0 dm_control/cartpole-balance_sparse-v0 dm_control/cartpole-swingup-v0 dm_control/cartpole-swingup_sparse-v0 dm_control/cartpole-two_poles-v0 dm_control/cartpole-three_poles-v0 dm_control/cheetah-run-v0 dm_control/dog-stand-v0 dm_control/dog-walk-v0 dm_control/dog-trot-v0 dm_control/dog-run-v0 dm_control/dog-fetch-v0 dm_control/finger-spin-v0 dm_control/finger-turn_easy-v0 dm_control/finger-turn_hard-v0 dm_control/fish-upright-v0 dm_control/fish-swim-v0 dm_control/hopper-stand-v0 dm_control/hopper-hop-v0 dm_control/humanoid-stand-v0 dm_control/humanoid-walk-v0 dm_control/humanoid-run-v0 dm_control/humanoid-run_pure_state-v0 dm_control/humanoid_CMU-stand-v0 dm_control/humanoid_CMU-run-v0 dm_control/lqr-lqr_2_1-v0 dm_control/lqr-lqr_6_2-v0 dm_control/manipulator-bring_ball-v0 dm_control/manipulator-bring_peg-v0 dm_control/manipulator-insert_ball-v0 dm_control/manipulator-insert_peg-v0 dm_control/pendulum-swingup-v0 dm_control/point_mass-easy-v0 dm_control/point_mass-hard-v0 dm_control/quadruped-walk-v0 dm_control/quadruped-run-v0 dm_control/quadruped-escape-v0 dm_control/quadruped-fetch-v0 dm_control/reacher-easy-v0 dm_control/reacher-hard-v0 dm_control/stacker-stack_2-v0 dm_control/stacker-stack_4-v0 dm_control/swimmer-swimmer6-v0 dm_control/swimmer-swimmer15-v0 dm_control/walker-stand-v0 dm_control/walker-walk-v0 dm_control/walker-run-v0 \
     --command "poetry run python cleanrl/ppo_continuous_action.py --exp-name ppo_continuous_action_slurm --cuda False --track" \
     --num-seeds 3 \
     --workers 9

I would, however, say that the name ppo_continuous_action_slurm is quite hacky... If it is not too much trouble, it may be preferable to re-run the experiments with the name ppo_continuous_action_8M, but it's not a big deal to me if you want to keep the current ones. We would just need to add a short explanation of it in the docs.

@masud99r
Collaborator Author

Re-running would not be a problem.
However, I want to make sure I understand it fully: the exp name needs to be ppo_continuous_action_8M instead of ppo_continuous_action_slurm. Is that right?
In that case, for the benchmarking, it is easier for me to rename ppo_continuous_action_slurm.py to ppo_continuous_action_8M.py. This is because I need to store the experiment data in a storage-only file system.
I will edit the command in ppo.sh as you mentioned.

@vwxyzjn
Owner

vwxyzjn commented Dec 22, 2022

> Re-running would not be a problem.
> However, I want to make sure I understand it fully: the exp name needs to be ppo_continuous_action_8M instead of ppo_continuous_action_slurm. Is that right?

That’s right. The exp_name needs to be ppo_continuous_action_8M for filtering.

> In that case, for the benchmarking, it is easier for me to rename ppo_continuous_action_slurm.py to ppo_continuous_action_8M.py. This is because I need to store the experiment data in a storage-only file system.

Sounds good. Thanks.

@masud99r
Collaborator Author

Results on all dm_control environments. PPO and RPO were run for 8M timesteps, and results are computed over 10 random seeds.

Table:

| Environment | ppo_continuous_action_8M ({'tag': ['v1.0.0-13-gcbd83f6']}) | rpo_continuous_action ({'tag': ['pr-331']}) |
|---|---|---|
| dm_control/acrobot-swingup-v0 | 25.12 ± 7.77 | 41.62 ± 5.26 |
| dm_control/acrobot-swingup_sparse-v0 | 1.71 ± 0.74 | 3.25 ± 0.91 |
| dm_control/ball_in_cup-catch-v0 | 937.71 ± 8.68 | 941.16 ± 9.67 |
| dm_control/cartpole-balance-v0 | 791.93 ± 13.67 | 797.17 ± 9.63 |
| dm_control/cartpole-balance_sparse-v0 | 990.23 ± 7.34 | 989.03 ± 5.20 |
| dm_control/cartpole-swingup-v0 | 583.60 ± 13.82 | 615.99 ± 11.85 |
| dm_control/cartpole-swingup_sparse-v0 | 240.45 ± 299.08 | 526.18 ± 190.39 |
| dm_control/cartpole-two_poles-v0 | 217.71 ± 5.15 | 219.31 ± 6.76 |
| dm_control/cartpole-three_poles-v0 | 159.93 ± 2.12 | 160.10 ± 2.36 |
| dm_control/cheetah-run-v0 | 473.61 ± 107.87 | 562.38 ± 59.67 |
| dm_control/dog-stand-v0 | 332.40 ± 24.13 | 504.45 ± 135.58 |
| dm_control/dog-walk-v0 | 124.68 ± 23.05 | 166.50 ± 45.85 |
| dm_control/dog-trot-v0 | 80.08 ± 13.04 | 115.28 ± 30.15 |
| dm_control/dog-run-v0 | 67.93 ± 6.84 | 104.17 ± 24.32 |
| dm_control/dog-fetch-v0 | 28.73 ± 4.70 | 44.43 ± 7.21 |
| dm_control/finger-spin-v0 | 628.08 ± 253.43 | 849.49 ± 23.43 |
| dm_control/finger-turn_easy-v0 | 248.87 ± 80.39 | 447.46 ± 129.80 |
| dm_control/finger-turn_hard-v0 | 82.28 ± 34.94 | 238.67 ± 134.88 |
| dm_control/fish-upright-v0 | 541.82 ± 64.54 | 801.81 ± 35.79 |
| dm_control/fish-swim-v0 | 84.81 ± 6.90 | 139.70 ± 40.24 |
| dm_control/hopper-stand-v0 | 3.11 ± 1.63 | 408.24 ± 203.25 |
| dm_control/hopper-hop-v0 | 5.84 ± 17.28 | 63.69 ± 87.18 |
| dm_control/humanoid-stand-v0 | 21.13 ± 30.14 | 140.61 ± 58.36 |
| dm_control/humanoid-walk-v0 | 8.32 ± 17.00 | 77.51 ± 50.81 |
| dm_control/humanoid-run-v0 | 5.49 ± 9.20 | 24.16 ± 19.90 |
| dm_control/humanoid-run_pure_state-v0 | 1.10 ± 0.13 | 3.32 ± 2.50 |
| dm_control/humanoid_CMU-stand-v0 | 4.65 ± 0.30 | 4.34 ± 0.27 |
| dm_control/humanoid_CMU-run-v0 | 0.86 ± 0.08 | 0.85 ± 0.04 |
| dm_control/manipulator-bring_ball-v0 | 0.41 ± 0.22 | 0.55 ± 0.37 |
| dm_control/manipulator-bring_peg-v0 | 0.57 ± 0.24 | 2.64 ± 2.19 |
| dm_control/manipulator-insert_ball-v0 | 41.53 ± 15.58 | 39.24 ± 17.19 |
| dm_control/manipulator-insert_peg-v0 | 49.89 ± 14.26 | 53.23 ± 16.02 |
| dm_control/pendulum-swingup-v0 | 464.82 ± 379.09 | 776.21 ± 20.15 |
| dm_control/point_mass-easy-v0 | 530.17 ± 262.14 | 652.18 ± 21.93 |
| dm_control/point_mass-hard-v0 | 132.29 ± 69.71 | 184.29 ± 24.17 |
| dm_control/quadruped-walk-v0 | 239.11 ± 85.20 | 600.71 ± 213.40 |
| dm_control/quadruped-run-v0 | 165.70 ± 43.21 | 375.78 ± 119.95 |
| dm_control/quadruped-escape-v0 | 23.82 ± 10.82 | 68.19 ± 30.69 |
| dm_control/quadruped-fetch-v0 | 183.03 ± 37.50 | 227.47 ± 21.67 |
| dm_control/reacher-easy-v0 | 769.10 ± 48.35 | 740.93 ± 51.58 |
| dm_control/reacher-hard-v0 | 636.89 ± 77.68 | 584.33 ± 43.76 |
| dm_control/stacker-stack_2-v0 | 61.73 ± 17.34 | 64.32 ± 7.15 |
| dm_control/stacker-stack_4-v0 | 71.68 ± 29.03 | 53.78 ± 20.09 |
| dm_control/swimmer-swimmer6-v0 | 152.82 ± 30.77 | 166.89 ± 14.49 |
| dm_control/swimmer-swimmer15-v0 | 148.59 ± 22.63 | 153.45 ± 17.12 |
| dm_control/walker-stand-v0 | 453.73 ± 217.72 | 736.72 ± 145.16 |
| dm_control/walker-walk-v0 | 308.91 ± 87.85 | 785.61 ± 133.29 |
| dm_control/walker-run-v0 | 123.98 ± 91.34 | 384.08 ± 123.89 |

Learning curves:

(figure: dm_control_all_ppo_rpo_8M — PPO vs. RPO learning curves on all dm_control environments)

@vwxyzjn
Owner

vwxyzjn commented Dec 23, 2022

The results look exceptionally clean 👍. Given all the reported results, I would feel comfortable marking the existing ppo_continuous_action.py as "deprecated" or "not recommended" in favor of rpo_continuous_action.py, because of the simplicity of the approach and the superiority of the empirical results.

Please update the docs with these results. Please also remove the ppo_continuous_action_slurm.py results from the project, as they are no longer needed.

Thank you, and have a great holiday.

@Howuhh
Contributor

Howuhh commented Dec 26, 2022

Seeing how much this improves the scores, I'm now really curious what happens if we change from the Gaussian to the Beta distribution... Or simply clip the log std, so that the std cannot shrink to too small a value.

@masud99r
Collaborator Author

@Howuhh Trying the Beta distribution would be an interesting idea. I observed some performance effects when trying Laplace and Gumbel distributions (Figures 7 & 8 in https://arxiv.org/pdf/2212.07536.pdf).

A key aspect of the RPO method is maintaining randomness in the action distribution through a Uniform perturbation. It would be interesting to see whether clipping the log std can achieve a similar effect.
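
For readers following along: the core change RPO makes on top of a PPO-style Gaussian policy is small. During the policy update, the predicted action mean is perturbed with Uniform(-alpha, alpha) noise before log-probabilities and entropy are computed, which keeps the distribution from collapsing toward a near-deterministic one. Below is a minimal sketch of that idea; the `rpo_alpha` name and the 0.5 default follow my reading of the paper, and the exact code in rpo_continuous_action.py may differ slightly.

```python
import torch
from torch.distributions.normal import Normal


def get_action_and_value(action_mean, action_logstd, action=None, rpo_alpha=0.5):
    """Gaussian policy head with the RPO perturbation.

    action_mean, action_logstd: outputs of the policy network, shape (batch, action_dim).
    If `action` is None we are collecting rollouts and sample exactly as in PPO;
    otherwise we are in the update phase and perturb the mean with
    Uniform(-rpo_alpha, rpo_alpha) noise before computing log-probs and entropy.
    """
    action_std = torch.exp(action_logstd)
    if action is None:
        probs = Normal(action_mean, action_std)
        action = probs.sample()
    else:
        z = torch.empty_like(action_mean).uniform_(-rpo_alpha, rpo_alpha)
        probs = Normal(action_mean + z, action_std)
    return action, probs.log_prob(action).sum(1), probs.entropy().sum(1)
```

During rollouts the behavior is identical to PPO; only the distribution used in the update is perturbed.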

@masud99r
Collaborator Author

masud99r commented Jan 3, 2023

Help needed:
The docs work as expected, except when I add iframes to embed the tracked experiments. When I add the tracked-experiment iframes, the docs render the first iframe's tracked results and break from there. Do you have any idea what I am missing here? @vwxyzjn

RPO doc: https://github.com/masud99r/cleanrl/blob/master/docs/rl-algorithms/rpo.md

Owner

@vwxyzjn vwxyzjn left a comment

Hey I fixed some formatting issues with the iframe. Please let me know if you run into additional issues.

Review comments on docs/rl-algorithms/rpo.md (outdated, resolved)
@masud99r masud99r closed this Jan 4, 2023
@masud99r masud99r reopened this Jan 4, 2023
@masud99r
Collaborator Author

masud99r commented Jan 4, 2023

> Hey I fixed some formatting issues with the iframe. Please let me know if you run into additional issues.

Thanks for the fix. Were those tab/space issues or something different?

I have added a warning section with the deprecation notice.

With that, I think I have addressed all the comments, though I might have missed some in the thread! Could you please take a look and let me know what else needs to be done for this PR?

Owner

@vwxyzjn vwxyzjn left a comment

LGTM. Thanks for your patience and this awesome contribution. Please give it a final read at https://cleanrl-git-fork-masud99r-master-vwxyzjn.vercel.app/rl-algorithms/rpo/. I have added you as a collaborator of this repo, and feel free to go ahead and merge.

@masud99r
Collaborator Author

masud99r commented Jan 4, 2023

> LGTM. Thanks for your patience and this awesome contribution. Please give it a final read at https://cleanrl-git-fork-masud99r-master-vwxyzjn.vercel.app/rl-algorithms/rpo/. I have added you as a collaborator of this repo, and feel free to go ahead and merge.

It was my pleasure working on this. Thanks for your excellent guidance throughout the process. I have learned a lot.

I took a pass on the docs and fixed some typos. I am going to merge.

@masud99r masud99r merged commit a90742a into vwxyzjn:master Jan 4, 2023
vwxyzjn added a commit that referenced this pull request Jan 5, 2023
vwxyzjn added a commit that referenced this pull request Jan 5, 2023