SAC's loss explode on Hopper-v3 #332
Comments
@ChenDRAG could you please take a look?
Looking into that @Trinkle23897 @tongzhoumu
@tongzhoumu I have run 4 experiments on both e605bde (older version, 0.4.0) and 6426a39 (latest), and found that the older version of tianshou works quite well. There does seem to be a problem with SAC in 6426a39. You mentioned that you failed to reproduce the benchmark even in e605bde. Is it possible that you forgot to switch the installed tianshou library to the older version? In other words, your code under tianshou/examples/mujoco is the older version, but the library you use is the latest one? Suggestion: try
and then reproduce the benchmark. Meanwhile, I will try to fix the SAC bug in the latest version as soon as possible.
#313 introduced this bug.
Solved with #333. It turns out that tanh mapping is not supported in policies where the action is used as an input to the critic, because a raw action in the range (-inf, inf) causes instability in training. If you really need to use tanh for mapping, you have to use it by definition in
Hi, I have not entirely understood your fix. After BTW, I took a look at your solution, and it seems you just removed I am looking forward to your reply. Thanks!
The key point is not to remove See, in tianshou, there is a function called For instance, in MuJoCo Humanoid, the action space is (-0.4, 0.4), but the output of your neural network is (-inf, inf). Using This is particularly useful in on-policy algorithms because the action is usually sampled from a Gaussian distribution (which ranges from -inf to inf in theory). However, in SAC, when the action is used as the input of a neural network, a (-inf, inf) action is no longer allowed because it will cause gradient explosion if the action is too large. So in this case, tanh cannot be part of the black box, and can only be manually defined in
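To illustrate the range argument above, here is a minimal sketch in plain NumPy. It is not tianshou's API: `squash_to_bounds` is a hypothetical helper, and the bounds (-0.4, 0.4) are the Humanoid values mentioned in this comment. It only shows how tanh maps an unbounded network output into the environment's action range, and how large the raw values can be by comparison.

```python
import numpy as np


def squash_to_bounds(raw_action, low, high):
    """Hypothetical helper: map an unbounded action into [low, high] via tanh."""
    squashed = np.tanh(raw_action)  # bounded to [-1, 1]
    return low + (squashed + 1.0) * 0.5 * (high - low)


# Unbounded values, as a Gaussian policy head could produce in theory.
raw = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])

# Action that is valid for an environment with bounds (-0.4, 0.4).
env_action = squash_to_bounds(raw, low=-0.4, high=0.4)

# The raw action can have arbitrary magnitude, the squashed one cannot.
print("max |raw|     :", np.abs(raw).max())
print("max |squashed|:", np.abs(env_action).max())
```

If the critic takes the action as input, feeding it the bounded value rather than the raw one keeps its input scale under control, which is the concern raised above.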
After carefully inspecting the code, I noticed that the key difference is
Problems
I am trying to reproduce the results reported in the MuJoCo benchmark. The command I used is
./run_experiments.sh Hopper-v3
which runs the SAC algorithm. The losses exploded after several epochs. I repeated the experiment 6 times and tried both fixed alpha and automatic alpha tuning, but every run eventually exploded.
Other info