The stable baselines implementation of TD3 can not achieve the same performance as the original TD3 [question] #840
Comments
Hello, you should be using the hyperparameters from the rl zoo (which match the original paper), as mentioned in the documentation.
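For reference, a minimal sketch of what such a configuration might look like is shown below; the environment and the values are illustrative assumptions and should be checked against hyperparams/td3.yml in rl-baselines-zoo rather than taken as the tuned settings.

```python
import gym
import numpy as np

from stable_baselines import TD3
from stable_baselines.common.noise import NormalActionNoise

# Illustrative environment; the zoo has tuned settings per environment.
env = gym.make("HalfCheetah-v2")

n_actions = env.action_space.shape[-1]
# Gaussian exploration noise with std 0.1, as in the original paper
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))

model = TD3(
    "MlpPolicy",
    env,
    gamma=0.99,
    learning_rate=3e-4,
    buffer_size=int(1e6),
    learning_starts=10000,
    batch_size=100,
    train_freq=1000,      # collect 1000 steps...
    gradient_steps=1000,  # ...then perform 1000 gradient updates (episodic-style training)
    action_noise=action_noise,
    policy_kwargs=dict(layers=[400, 300]),  # network sizes from the original paper
    verbose=1,
)
model.learn(total_timesteps=int(1e6))
```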
I mentioned this before in another issue here, and I've seen others report the same: I believe something is broken in TD3 in recent releases of stable-baselines. In fact, the same hyperparameter setup that I previously used to take 3rd or 4th place on the OpenAI leaderboard now performs noticeably worse. I have not had time to investigate it more myself and will try to look into it when I have time, but I encourage the stable-baselines authors to look into it ASAP, as TD3 is one of the more promising algorithms included.
If you could provide example code and the versions that produce the different results, that would be a good starting point for debugging. Indeed, if there has been such a bad regression (or even a change in default behaviour), it should be fixed ASAP.
I just ran a quick sanity check: it reaches a mean reward > 2000 in ~5e5 steps, which is the expected behavior (I got similar results using the PyTorch version, and with SAC in both SB and SB3). Some checkpoints:
Hyperparameters:
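As a rough illustration of the kind of sanity check described above (mean episode reward over a handful of deterministic rollouts), a sketch might look like the following; the checkpoint name and environment are assumptions, not the ones actually used.

```python
import gym
import numpy as np

from stable_baselines import TD3

# Assumed checkpoint file and environment -- replace with the actual ones.
env = gym.make("HalfCheetah-v2")
model = TD3.load("td3_checkpoint", env=env)

episode_rewards = []
for _ in range(10):
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        # Deterministic actions for evaluation (no exploration noise)
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    episode_rewards.append(total_reward)

print("Mean reward: {:.1f} +/- {:.1f}".format(np.mean(episode_rewards),
                                              np.std(episode_rewards)))
```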
@araffin The done signal is set to False all the time. Is that sufficient, or do I need to look more into the "hack" that you have cited? I am not sure that I understand how to use it. I am sorry for asking something that could be found in the documentation, but I still haven't found the reference to the RL Zoo that you are talking about. I have been using the hyperparameters that you stated, and it seems, even though I'm not done testing yet, that the stable baselines implementation can reach the same level as the original implementation. Thank you very much.
I think this is related to araffin/rl-baselines-zoo#79; I recommend reading the associated paper.
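For context, the point made in that issue and in the paper is that off-policy algorithms such as TD3 should not treat an episode's time limit as a true termination. One way to handle this, in the spirit of the time-feature wrapper used in the zoo (the class below is a hand-written sketch, not the zoo's actual code), is to append the normalized remaining time to the observation:

```python
import gym
import numpy as np


class TimeFeatureWrapper(gym.Wrapper):
    """Appends the normalized remaining time to the observation so the agent
    can distinguish a timeout from a real terminal state.
    Sketch only; assumes a 1D Box observation space and gym's classic API."""

    def __init__(self, env, max_steps=1000):
        super(TimeFeatureWrapper, self).__init__(env)
        low, high = env.observation_space.low, env.observation_space.high
        # Extend the observation space with one extra feature in [0, 1]
        self.observation_space = gym.spaces.Box(
            low=np.append(low, 0.0), high=np.append(high, 1.0), dtype=np.float32)
        self._max_steps = max_steps
        self._current_step = 0

    def reset(self):
        self._current_step = 0
        return self._get_obs(self.env.reset())

    def step(self, action):
        self._current_step += 1
        obs, reward, done, info = self.env.step(action)
        return self._get_obs(obs), reward, done, info

    def _get_obs(self, obs):
        # 1.0 at the start of the episode, 0.0 when the time limit is reached
        time_feature = 1.0 - self._current_step / float(self._max_steps)
        return np.append(obs, time_feature).astype(np.float32)
```

Note that setting done=False for every transition only makes sense when the environment has no true terminal states other than the time limit; otherwise the done signal should stay True for real terminations.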
Good to hear. Please read the RL tips and tricks carefully next time ;)
It is mentioned in the README, on the first page of the documentation, and in the RL tips, and it has its own section in the documentation...
Closing this issue then.
I wanted to use the stable baselines implementation of TD3 in order to be able to compare the algorithm to other reinforcement learning algorithms more easily.
I have compared the original implementation of TD3 from sfujim to the one from stable baselines. The stable baselines version is faster in terms of computation time, but it does not achieve results as good as the original implementation. I created a custom policy with two hidden layers of 256 nodes and set the hyperparameters according to the original implementation, as seen below:
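A purely illustrative sketch of the kind of setup described above (a custom policy with two hidden layers of 256 units and hyperparameters close to the original implementation, with one gradient update per environment step); the environment name and the exact values are assumptions, not the reporter's actual script:

```python
import gym
import numpy as np

from stable_baselines import TD3
from stable_baselines.td3.policies import FeedForwardPolicy
from stable_baselines.common.noise import NormalActionNoise


class CustomTD3Policy(FeedForwardPolicy):
    """Actor/critic networks with two hidden layers of 256 units each."""
    def __init__(self, *args, **kwargs):
        super(CustomTD3Policy, self).__init__(*args,
                                              layers=[256, 256],
                                              feature_extraction="mlp",
                                              **kwargs)


env = gym.make("HalfCheetah-v2")  # assumed environment
n_actions = env.action_space.shape[-1]
# Gaussian exploration noise with std 0.1, as in the original implementation
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))

model = TD3(
    CustomTD3Policy,
    env,
    gamma=0.99,
    learning_rate=3e-4,
    buffer_size=int(1e6),
    batch_size=256,
    tau=0.005,
    policy_delay=2,
    target_policy_noise=0.2,
    target_noise_clip=0.5,
    learning_starts=10000,
    train_freq=1,       # one gradient update per environment step,
    gradient_steps=1,   # as in the original implementation
    action_noise=action_noise,
    verbose=1,
)
model.learn(total_timesteps=int(1e6))
```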
Using this code, the original implementation consistently achieves approximately 150% of the performance of the stable baselines implementation. Does anyone have a clue about what I might be missing, or any ideas on why the results differ?
System Info