Why is Collector mapping randomly sampled actions using map_action? #512
Comments
Yes, you are right, thanks for reporting this bug.
Cool. I'll just leave one more note here regarding a possible fix then. It will probably not be enough to just move the action remap under the appropriate branch of the if. Since, for policy actions, we store the raw, unremapped actions in the replay buffer, to maintain consistency we'll probably also need to apply the inverse of map_action to random actions before storing them.
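To make that suggestion concrete, here is a minimal sketch of such an inverse mapping for the common case where map_action linearly rescales a [-1, 1] policy output to the environment's [low, high] range. The function names and signatures below are illustrative, not tianshou's actual API:

```python
import numpy as np

def map_action(act, low, high):
    # forward mapping used by the policy: linearly rescale a [-1, 1]
    # output into the environment's action range [low, high]
    return low + (high - low) * (act + 1.0) / 2.0

def map_action_inverse(act, low, high):
    # inverse mapping: convert an action sampled directly from the
    # environment's action space back to the policy's [-1, 1] range,
    # so it is stored in the replay buffer in the same (unmapped)
    # form as actions produced by the policy
    return 2.0 * (act - low) / (high - low) - 1.0

low, high = 0.0, 10.0                      # hypothetical Box(0, 10) action space
random_act = np.random.uniform(low, high)  # what the Collector samples when random=True
stored_act = map_action_inverse(random_act, low, high)  # what should go into the buffer
assert np.isclose(map_action(stored_act, low, high), random_act)
```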
Storing unremapped actions presumably also has implications for offline training. There, we pass in a replay buffer whose actions may have been collected using a policy with a different remapping. That is, under the policy being trained, which remaps differently, those actions might actually lead to completely different transitions than the ones recorded in the buffer.
Thank you so much for your suggestions. Here is our idea from when we first designed the action remapping function.
Yes, I get what the difficulty is. The context in which I am dealing with this now is a setting somewhat similar to offline learning, in that I pass transitions to the agent from the outside. I don't think there is currently a safe way to do it: I always risk passing the actions in the wrong format, depending on how the policy remaps. This would perhaps not be such a problem with just clipping (which does not actually rescale), but with tanh, or with any custom implementation of map_action in a derived policy, things will break down, and I am not sure how to deal with that. The solution with an env wrapper would be much cleaner in this respect, but I get that it would put more demands on users and could become very confusing in the context of point 4. A small illustration of the clipping-vs-tanh distinction follows below.
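Here is a small, self-contained sketch of that distinction (not tianshou code; the range and function names are made up): clipping leaves an in-range, env-scale action untouched, whereas tanh bounding treats it as a raw policy output and silently distorts it:

```python
import numpy as np

low, high = -2.0, 2.0  # hypothetical environment action range

def map_action_clip(act):
    # clipping only bounds the value; an action already in env scale
    # and within range passes through unchanged
    return np.clip(act, low, high)

def map_action_tanh(act):
    # tanh bounding assumes `act` is a raw, unbounded policy output,
    # so feeding it an action that is already in env scale distorts it
    return low + (high - low) * (np.tanh(act) + 1.0) / 2.0

env_action = np.array([1.5])           # a valid env-scale action
print(map_action_clip(env_action))     # [1.5]   -- unchanged
print(map_action_tanh(env_action))     # ~[1.81] -- silently a different action
```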
(Issue #512) Random starts in Collector sample actions from the action space, while policies output actions in a fixed range (typically [-1, 1]) and map them to the action space. The buffer only stores unmapped actions, so the randomly initialized actions are not correct when the action range is not [-1, 1]. This may influence policy learning and, in particular, model learning in model-based methods. This PR fixes it by applying an inverse operation before adding random initial actions to the buffer.
Hi everyone, it seems that in Collector, if random is true, we sample actions from the action space using self._action_space[i].sample(). We then apply action_remap = self.policy.map_action(self.data.act) to them, just as we do to actions generated by the policy. Is this correct? It seems to me that the actions sampled from the action space should already be scaled correctly, and squashing them probably changes their distribution.
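A self-contained way to see the problem (not the tianshou source, just the same arithmetic the linear remap performs): a sample drawn directly from a hypothetical [0, 10] action space, when pushed through the policy-style [-1, 1] to [low, high] remap, ends up outside the action space:

```python
import numpy as np

low, high = 0.0, 10.0  # hypothetical Box(0, 10) action space

def map_action(act):
    # policy-style linear remap from [-1, 1] to [low, high]
    return low + (high - low) * (act + 1.0) / 2.0

rng = np.random.default_rng(0)
random_act = rng.uniform(low, high, size=3)  # already in env scale
print(random_act)              # values in [0, 10], already valid env actions
print(map_action(random_act))  # values in [5, 55], mostly outside the action space
```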