
SAC's loss explodes on Hopper-v3 #332

Closed
tongzhoumu opened this issue Mar 31, 2021 · 9 comments · Fixed by #333


tongzhoumu commented Mar 31, 2021

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, torch, sys
    print(tianshou.__version__, torch.__version__, sys.version, sys.platform)

Problems

I am trying to reproduce the results mentioned in the mujoco benchmark. The command I used is ./run_experiments.sh Hopper-v3, which runs the SAC algorithm. The losses exploded after several epochs. I have repeated the experiment 6 times and tried both fixed alpha and automatic alpha tuning, but all runs eventually exploded.

Other info

  • Version: I tried 6426a39 and e605bde; both of them have this issue
  • Operating system: Ubuntu 18.04
  • Environment: 0.4.0 1.8.1+cu102 3.7.10 (default, Feb 26 2021, 18:47:35) [GCC 7.3.0] Linux

[Screenshots: training curves showing the exploding losses]

@Trinkle23897 (Collaborator)

@ChenDRAG could you please take a look?

Trinkle23897 added the bug label Mar 31, 2021
@ChenDRAG (Collaborator)

Looking into that @Trinkle23897 @tongzhoumu

ChenDRAG self-assigned this Mar 31, 2021
ChenDRAG (Collaborator) commented Mar 31, 2021

@tongzhoumu I have run 4 experiments on both e605bde (older version, 0.4.0) and 6426a39 (latest), and found that the older version of tianshou works quite well. There does seem to be a problem with SAC in 6426a39.
[Screenshot: older version results]

You mentioned that you failed to reproduce the benchmark even on e605bde. Is it possible that you forgot to switch the installed tianshou library to the older version? In other words, your code under tianshou/examples/mujoco is the older version, but the library you use is the latest one?

Suggestion: try

cd tianshou
git checkout e605bde
pip install -e .

and then reproduce benchmark.

Meanwhile, I will try to fix the bug with SAC in the latest version as soon as possible.

tongzhoumu (Author) commented Mar 31, 2021

@ChenDRAG Yes, you are right. I forgot to reinstall tianshou after switching to the old commit. Now I can reproduce your results with the e605bde code.
I am looking forward to your fix for the latest version. Thank you so much!

@Trinkle23897 (Collaborator)

#313 introduces this bug

ChenDRAG (Collaborator) commented Apr 1, 2021

Solved with #333. It turns out that tanh mapping is not supported in policies where the action is used as an input to the critic, because a raw action in the range (-inf, inf) will cause instability in training. If you really need to use tanh for mapping, you have to apply it yourself inside forward().
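For illustration, here is a minimal sketch of what applying tanh inside forward() looks like (hypothetical names, not the actual code from #333): the squashing happens in the policy's forward pass, so everything downstream only ever sees bounded actions.

import torch

def actor_forward_sketch(gaussian_head, obs):
    # Hypothetical sketch: the network outputs Gaussian parameters, we sample
    # a raw (unbounded) action with the reparameterization trick, and squash
    # it with tanh inside forward() so the critic and the replay buffer only
    # ever see actions in (-1, 1).
    mu, sigma = gaussian_head(obs)
    dist = torch.distributions.Normal(mu, sigma)
    raw_act = dist.rsample()            # unbounded, roughly (-inf, inf)
    squashed_act = torch.tanh(raw_act)  # bounded, (-1, 1)
    return squashed_act, raw_act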

tongzhoumu (Author) commented Apr 1, 2021

Solved with #333. It turns out that tanh mapping is not supported in policies where the action is used as an input to the critic, because a raw action in the range (-inf, inf) will cause instability in training. If you really need to use tanh for mapping, you have to apply it yourself inside forward().

Hi, I have not entirely understood your fix. After tanh, the action should be in [-1, 1], so why is it incorrect?

BTW, I took a look at your solution, and it seems you just removed if self.action_bound_method == "tanh" and self.action_space is not None:. What is the reason for this?

I am looking forward to your reply. Thanks!

ChenDRAG (Collaborator) commented Apr 2, 2021

BTW, I took a look at your solution, and it seems you just removed if self.action_bound_method == "tanh" and self.action_space is not None:. What is the reason for this?

The key point is not removing if self.action_bound_method == "tanh" and self.action_space is not None:; the key point is making batch.act = squashed_act.

See, in tianshou there is a function called map_action that maps an action in the range (-inf, inf) to the range (action_space.low, action_space.high). You can think of map_action as a kind of gym environment wrapper. It is part of the environment, like a black box, because the remapped action is not stored in the buffer.

For instance, in MuJoCo Humanoid the action space is (-0.4, 0.4), but the output of your neural network is in (-inf, inf). Using map_action, you don't have to care about this.
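To make this concrete, here is a rough sketch of what map_action conceptually does (my own illustration, not Tianshou's exact implementation): squash the unbounded output and rescale it to the environment's bounds.

import numpy as np

def map_action_sketch(raw_act, low, high):
    # Conceptual sketch: squash the unbounded network output into (-1, 1),
    # then rescale linearly to the environment's (low, high) bounds.
    squashed = np.tanh(raw_act)
    return low + (high - low) * (squashed + 1.0) / 2.0

# For Humanoid-like bounds (-0.4, 0.4):
# map_action_sketch(np.array([100.0, -3.0]), -0.4, 0.4)  ->  approx. [0.4, -0.398]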

This is particularly useful in on-policy algorithms, because the action is usually sampled from a Gaussian distribution (which ranges from -inf to inf in theory). However, in SAC the action is used as an input to a neural network, so an action in (-inf, inf) is no longer allowed: it will cause gradient explosion if the action is too large. In this case tanh cannot be part of the black box and has to be applied manually in forward.
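Putting it together, here is a self-contained sketch of why the critic needs the squashed action (hypothetical dimensions and modules, not Tianshou's classes):

import torch
import torch.nn as nn

# Sketch of the SAC data flow after the fix: the critic Q(s, a) takes the
# action as part of its input, so the action stored as batch.act must be the
# tanh-squashed one, not the raw Gaussian sample.
obs_dim, act_dim = 11, 3   # Hopper-v3-like dimensions
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

obs = torch.randn(4, obs_dim)
raw_act = torch.randn(4, act_dim) * 100.0   # unbounded Gaussian sample
squashed_act = torch.tanh(raw_act)          # what should be stored as batch.act

q_value = critic(torch.cat([obs, squashed_act], dim=-1))  # stable: inputs in (-1, 1)
# Feeding raw_act here instead lets arbitrarily large values into the
# Q-network, which is the instability described above.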

tongzhoumu (Author) commented Apr 2, 2021


After carefully inspecting the code, I noticed that the key difference is act=squashed_action. Thanks so much for your detailed explanation!
