Our current contribution guide mainly covers the process of contributing new algorithms. However, it is unclear what the process looks like for contributing changes to existing algorithms, which requires a different set of procedures.
Problem
DRL is brittle and has a series of reproducibility issues; even bug fixes can sometimes introduce performance regressions (e.g., see how a bug fix of contact forces in MuJoCo results in worse performance for PPO). Therefore, it is essential to understand how proposed changes impact the performance of the algorithms. Broadly, we wish to distinguish two types of contributions: 1) non-performance-impacting changes and 2) performance-impacting changes.
Performance-impacting changes include, for example, properly handling the `gamma` parameter in PPO's reward normalization (added gamma to reward normalization wrappers #209), properly handling action bounds in DDPG (Td3 ddpg action bound fix #211), and fixing bugs (TD3: fixed dimension of clipped_noise for target actions, added noise … #281).

Importantly, no matter how slight a performance-impacting change is, we need to re-run the benchmark to ensure there is no regression. This post proposes a way for us to re-run the benchmarks and check for regressions seamlessly.
Proposal
We should add a tag to every benchmark run to distinguish the version of CleanRL used to run the experiments. This can be done by attaching a version string, such as the output of `git describe --tags` (which yields identifiers like `v1.0.0b2-7-g4bb6766`), to each tracked run.
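A minimal sketch, assuming the version string is generated from the git checkout and passed to `wandb.init` (the project name is illustrative, not the actual tracking code):

```python
# Sketch: tag each tracked run with the current CleanRL version string.
# Assumes the benchmark script is launched from inside a git checkout of CleanRL.
import subprocess

import wandb

# e.g. "v1.0.0b2-7-g4bb6766": latest tag, number of commits since it, short commit hash
version_tag = subprocess.check_output(["git", "describe", "--tags"], text=True).strip()

# illustrative project name; the real tracking code would simply forward the same
# `tags` argument wherever it calls wandb.init
wandb.init(project="cleanrl", tags=[version_tag])
```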
This gives us a tag on each tracked experiment.
Then we can design APIs to compare results from different tags / versions of the algorithm. Something like
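the following hypothetical sketch, where `compare` is not an existing CleanRL API but pulls runs by version tag through the public wandb Api and averages each run's final episodic return (the entity/project names, metric key, and filter fields are assumptions):

```python
# Hypothetical comparison helper: fetch runs by version tag and environment,
# then average the final value of a metric across seeds for each tag.
import wandb


def compare(entity, project, env_id, tags, metric="charts/episodic_return"):
    """Return {version_tag: mean final metric} for one environment."""
    api = wandb.Api()
    results = {}
    for tag in tags:
        runs = api.runs(
            f"{entity}/{project}",
            filters={"tags": {"$in": [tag]}, "config.env_id": env_id},
        )
        finals = [run.summary.get(metric) for run in runs]
        finals = [x for x in finals if x is not None]
        results[tag] = sum(finals) / len(finals) if finals else None
    return results


# e.g. compare("openrlbenchmark", "cleanrl", "HalfCheetah-v2",
#              tags=["v1.0.0b2-7-gxfd3d3", "v1.0.0b2-7-g4bb6766"])
```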
which could then be used to generate wandb reports with comparison figures and corresponding tables.
If the newer tag version `v1.0.0b2-7-g4bb6766` works without causing a major regression, we can then label it as `latest` (and correspondingly remove the `latest` tag from `v1.0.0b2-7-gxfd3d3`). In the future, this will also allow us to compare two completely different versions, like `v1.0.0b2-7-g4bb6766` vs `v1.5.0`.
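Relabeling could likewise be scripted. A sketch, assuming the public wandb Api is used to move the `latest` tag between runs (the entity/project path is illustrative):

```python
# Sketch: move the `latest` tag from runs of the previous version to runs of the
# newly validated version.
import wandb

api = wandb.Api()
old_version, new_version = "v1.0.0b2-7-gxfd3d3", "v1.0.0b2-7-g4bb6766"

# drop `latest` from the old version's runs
for run in api.runs("openrlbenchmark/cleanrl", filters={"tags": {"$in": [old_version]}}):
    if "latest" in run.tags:
        run.tags.remove("latest")
        run.update()  # persist the tag change

# add `latest` to the new version's runs
for run in api.runs("openrlbenchmark/cleanrl", filters={"tags": {"$in": [new_version]}}):
    if "latest" not in run.tags:
        run.tags.append("latest")
        run.update()
```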
CC @dosssman @yooceii @dipamc @kinalmehta @joaogui1 @araffin @bragajj @cool-RR @jkterry1 for thoughts