
RLops Guide #296

Closed
vwxyzjn opened this issue Oct 19, 2022 · 1 comment

Comments


vwxyzjn commented Oct 19, 2022

Our current contribution guide mainly covers the process of contributing new algorithms. However, it is unclear what the process looks like for contributing to existing algorithms, which requires a different set of procedures.

Problem

DRL is brittle and has a series of reproducibility issues; even bug fixes can sometimes introduce performance regressions (e.g., see how a bug fix of the contact force in MuJoCo results in worse performance for PPO). Therefore, it is essential to understand how proposed changes impact the performance of the algorithms. Broadly, we wish to distinguish two types of contributions: 1) non-performance-impacting changes and 2) performance-impacting changes.

Importantly, no matter how slight a performance-impacting change may seem, we need to re-run the benchmark to ensure there is no regression. This post proposes a way for us to re-run the experiments and check for regressions seamlessly.
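To make "check for regressions" concrete, here is a minimal sketch (not part of CleanRL; the function name and threshold are made up for illustration) of flagging a regression from final episodic returns across seeds:

```python
from statistics import mean

def is_regression(old_returns, new_returns, threshold=0.1):
    """Flag a regression if the new mean episodic return drops by more
    than `threshold` (as a fraction) relative to the old mean.

    `old_returns` / `new_returns`: final episodic returns, one per seed.
    """
    old_mean, new_mean = mean(old_returns), mean(new_returns)
    drop = (old_mean - new_mean) / abs(old_mean)
    return drop > threshold

# e.g., TD3 on HalfCheetah-v2 across 3 seeds (made-up numbers)
old = [9500.0, 9800.0, 9200.0]  # runs tagged with the old version
new = [9400.0, 9700.0, 9300.0]  # runs tagged with the candidate version
print(is_regression(old, new))  # small drop within threshold -> False
```

In practice we would also want seed-level learning curves and confidence intervals rather than a single threshold, but the idea is the same: compare tagged runs before and after a change.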

Proposal

We should add a tag for every benchmark run to distinguish the version of CleanRL used to run the experiments. This can be done by

WANDB_TAGS=$(git describe --tags) OMP_NUM_THREADS=1 xvfb-run -a python -m cleanrl_utils.benchmark \
    --env-ids HalfCheetah-v2 Walker2d-v2 Hopper-v2 InvertedPendulum-v2 Humanoid-v2 Pusher-v2 \
    --command "poetry run python cleanrl/td3_continuous_action.py --track --capture-video" \
    --num-seeds 3 \
    --workers 1

This gives us a tag in the tracked experiments, as shown below:

[Screenshot: the git tag shown on the tracked experiments in wandb]

Then we can design APIs to compare results from different tags / versions of the algorithm. Something like

from cleanrl_utils.compare import compare

compare(
    ["HalfCheetah-v2"],
    filters1={"exp_name": "td3_continuous_action", "tag": "v1.0.0b2-7-g4bb6766"},
    filters2={"exp_name": "td3_continuous_action", "tag": "v1.0.0b2-7-gxfd3d3"},
)

which could generate wandb reports with the following figure and corresponding tables.

[Figure: comparison of episodic returns across the two tags]

If the newer tag version v1.0.0b2-7-g4bb6766 works without causing a major regression, we can then label it as latest (and correspondingly remove the latest tag from v1.0.0b2-7-gxfd3d3).

In the future, this will also allow us to compare two completely different versions, such as v1.0.0b2-7-g4bb6766 vs. v1.5.0.
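Moving the latest tag could itself be scripted. The sketch below (hypothetical helper, not part of CleanRL) shows only the tag bookkeeping; in practice the runs would come from `wandb.Api().runs(...)`, and each change would be persisted with `run.tags = ...; run.update()`, which is omitted here:

```python
def move_latest(tags_by_run, new_version):
    """Return updated tag lists so that only runs carrying `new_version`
    also carry the 'latest' tag.

    `tags_by_run` maps run_id -> list of wandb tags.
    """
    updated = {}
    for run_id, tags in tags_by_run.items():
        tags = [t for t in tags if t != "latest"]  # strip any stale 'latest'
        if new_version in tags:
            tags.append("latest")  # promote runs of the new version
        updated[run_id] = tags
    return updated

runs = {
    "run-a": ["v1.0.0b2-7-gxfd3d3", "latest"],  # previous latest
    "run-b": ["v1.0.0b2-7-g4bb6766"],           # candidate version
}
print(move_latest(runs, "v1.0.0b2-7-g4bb6766"))
```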

CC @dosssman @yooceii @dipamc @kinalmehta @joaogui1 @araffin @bragajj @cool-RR @jkterry1 for thoughts

@vwxyzjn vwxyzjn changed the title Contribution guide on performance-impacting changes RLops Guide Oct 20, 2022

vwxyzjn commented Mar 29, 2023

Closed by #368

@vwxyzjn vwxyzjn closed this as completed Mar 29, 2023