Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TQC to CleanRL #258

Open
2 tasks done
AdityaGudimella opened this issue Aug 13, 2022 · 5 comments
Open
2 tasks done

Add TQC to CleanRL #258

AdityaGudimella opened this issue Aug 13, 2022 · 5 comments

Comments

@AdityaGudimella
Copy link

AdityaGudimella commented Aug 13, 2022

Problem Description

Would you be interested in adding Truncated Quantile Critics to CleanRL? If so I can work on a PR.

If you are interested, then I have a few questions:

  1. Is it okay if I use the new gym.Env api where env.step returns obs, reward, terminated, truncated, info instead of obs, reward, done, info?
  2. Is it okay if I deviate slightly from the style followed by the other algorithms in CleanRL? I would like to put the code for calculating the losses and updating the models into a separate class. Everything else will remain the same. This way, if people are familiar with the general RL procedure, and only want to see how the core of the algorithm changes, they will only need to look at this class.

Checklist

@vwxyzjn
Copy link
Owner

vwxyzjn commented Aug 14, 2022

Hi, @AdityaGudimella thanks for your interest in contribution! We would definitely be interested in the TQC contribution.

Is it okay if I use the new gym.Env api where env.step returns obs, reward, terminated, truncated, info instead of obs, reward, done, info?

It's ok but note the merging timeline may be considerably delayed because of this. Because poetry only supports using a single version for gym, we would need to make sure other scripts at least run with the latest gym. Judging by the list of things gym recently breaks, it's hard to estimate the amount of work involved... We might be able to merge it sooner with an ad-hoc instruction like poetry run pip install gym==0.24.1 in the usage section as a temporary solution. See example here

Is it okay if I deviate slightly from the style followed by the other algorithms in CleanRL? I would like to put the code for calculating the losses and updating the models into a separate class. Everything else will remain the same. This way, if people are familiar with the general RL procedure and only want to see how the core of the algorithm changes, they will only need to look at this class.

I still think it's worth it to adhere to the styles in other scripts for consistency unless the loss function is re-used multiple times. The problems with putting loss function in the class are

  1. the class will have much more lines of code
  2. the overhead of jumping from the bottom of the script to see how update_agent is defined (we don't really do this because we would just literary put a comment like # update agent.

That said, I am happy to be convinced otherwise; feel free to prototype what you think is best and I will help review.

@AdityaGudimella
Copy link
Author

Hi, @vwxyzjn. Thank you for your response.

It's ok but note the merging timeline may be considerably delayed because of this. Because poetry only supports using a single version for gym, we would need to make sure other scripts at least run with the latest gym. Judging by the list of things gym recently breaks, it's hard to estimate the amount of work involved... We might be able to merge it sooner with an ad-hoc instruction like poetry run pip install gym==0.24.1 in the usage section as a temporary solution. See example here

That makes sense to me. I had to update gym from 0.23 to 0.25 in my own library, but it did require too many changes. Out of curiosity, if I put up a PR (separately) just for upgrading the gym version to 0.25.1, what would be required to do to merge it in? Would you be running all the algorithms currently present to convergence?

For now, I can see that TQC even without differentiating between terminated and truncated episodes, performs quite well. I can put up a PR first that uses the dones from the gym env as is. Once the rest of the PR is approved, I can change it to differentiate between terminated and truncated. Is that okay?

I still think it's worth it to adhere to the styles in other scripts for consistency unless the loss function is re-used multiple times. The problems with putting loss function in the class are ...

For your point 1., I don't think the class will have any extra lines of code. If I pass in the args SimpleNamespace to the class constructor, the number of lines should be exactly the same as the other version. For point number 2. while I can see some people having to differentiate between the two, for me it makes it much more clearer to put the loss into a separate function / class. That being said, while I would prefer to put it in a separate class, I'm fine with following the format followed by the other algorithms too. I will first put up a version that has the loss in a separate class, and once you feel that there aren't any mistakes in it, I will refactor it to the format followed by the other algorithms. That way I don't have to be worried that my current implementation itself is wrong (it does reproduce - and in some cases improve upon, the performance as given in the paper, but you never know with RL). Is that okay?

I might have a few more questions while I write the PR. Is it okay to ask them here?

@vwxyzjn
Copy link
Owner

vwxyzjn commented Aug 16, 2022

Out of curiosity, if I put up a PR (separately) just for upgrading the gym version to 0.25.1, what would be required to do to merge it in?

The main thing would be to ensure the scripts run without error.

Would you be running all the algorithms currently present to convergence?

No, this would not be required unless we are highly suspicious about a particular environment / script.

For now, I can see that TQC even without differentiating between terminated and truncated episodes, performs quite well. I can put up a PR first that uses the dones from the gym env as is. Once the rest of the PR is approved, I can change it to differentiate between terminated and truncated. Is that okay?

Yes this is ok :)

For your point 1., I don't think the class will have any extra lines of code. If I pass in the args SimpleNamespace to the class constructor, the number of lines should be exactly the same as the other version. For point number 2. while I can see some people having to differentiate between the two, for me it makes it much more clearer to put the loss into a separate function / class. That being said, while I would prefer to put it in a separate class, I'm fine with following the format followed by the other algorithms too. I will first put up a version that has the loss in a separate class, and once you feel that there aren't any mistakes in it, I will refactor it to the format followed by the other algorithms. That way I don't have to be worried that my current implementation itself is wrong (it does reproduce - and in some cases improve upon, the performance as given in the paper, but you never know with RL). Is that okay?

Perfectly reasonable.

I might have a few more questions while I write the PR. Is it okay to ask them here?

Here would be a perfect place to ask. Discord is also good.

@AdityaGudimella
Copy link
Author

I put up an initial PR here. I've tested it out on Hopper-v3 and Humaoid-v3. Please take a look at it.

@vwxyzjn vwxyzjn mentioned this issue Aug 22, 2022
20 tasks
@AdityaGudimella
Copy link
Author

I also created a new issue for the gym upgrade here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants