Why is Collector mapping randomly sampled actions using map_action? #512

Closed
michalgregor opened this issue Jan 19, 2022 · 6 comments · Fixed by #568
Labels
bug Something isn't working

Comments

@michalgregor
Contributor

Hi, everyone; it seems that in Collector, if random is true, we sample actions from the action space using self._action_space[i].sample().

We then apply action_remap = self.policy.map_action(self.data.act) to them, just as we do to actions generated by the policy.

Is this correct? It seems to me that actions sampled from the action space should already be scaled correctly, and squashing them probably changes their distribution.
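To make the concern concrete, here is a minimal, self-contained sketch; the rescale formula and the map_action name stand in for a typical bounded remap that assumes [-1, 1] policy outputs, and are not the exact Tianshou source:

```python
# Toy illustration: applying a [-1, 1] -> [low, high] remap to an action that
# is ALREADY in [low, high] distorts (and here even overflows) its range.
import numpy as np

low, high = 0.0, 10.0                        # the env's true action bounds

def map_action(act):
    # typical bounded remap: assumes act is a policy output living in [-1, 1]
    return low + (high - low) * (act + 1.0) / 2.0

random_act = np.random.uniform(low, high)    # like action_space.sample(): already in [0, 10]
print(map_action(random_act))                # lands somewhere in [5, 55], no longer uniform on [0, 10]
```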

@Trinkle23897
Collaborator

Yes, you are right; thanks for reporting this bug.

@Trinkle23897 Trinkle23897 added the bug Something isn't working label Jan 19, 2022
@michalgregor
Contributor Author

Cool. I'll just leave one more note here regarding a possible fix then.

It will probably not be enough to just move the action remap under the appropriate branch of the if statement. Since we store the raw, unremapped actions in the replay buffer for policy actions, we will probably also need to apply the inverse of map_action to random actions before storing them, to keep things consistent.
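A rough sketch of what such an inverse might look like for the common remapping variants (this is an assumption about the shape of a fix, not an actual implementation):

```python
import numpy as np

def map_action_inverse(act, low, high, bound_method="clip"):
    # Undo the bounded rescale so that an action sampled from the env's action
    # space is stored in the same normalized range as raw policy outputs.
    act = 2.0 * (act - low) / (high - low) - 1.0          # [low, high] -> [-1, 1]
    if bound_method == "tanh":
        # if map_action squashes with tanh, invert it with arctanh
        act = np.arctanh(np.clip(act, -1.0 + 1e-7, 1.0 - 1e-7))
    return act
```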

@Trinkle23897
Collaborator

@ChenDRAG

@michalgregor
Contributor Author

Storing unremapped actions presumably also has implications for offline training, where we pass in a replay buffer whose actions may have been collected by a policy with a different remapping. Under the policy being trained, which remaps differently, those actions might lead to completely different transitions than the ones recorded in the buffer.

@ChenDRAG
Collaborator

ChenDRAG commented Jan 20, 2022

Thank you so much for your suggestions. Here is the idea we had when we first designed the action-remapping function.

  1. Action remapping is designed to work like an env wrapper. The remapped action is never stored in the replay buffer, so to users it behaves more like a function inside the black box of the environment.
  2. Why do we need this? Take the REINFORCE algorithm as an example: the action is usually sampled from a Gaussian distribution, which is unconstrained. This particular action is sent to the env and stored in the buffer. When we update the policy, this action is needed to calculate dist.log_prob(action). Now, if the action space is [-1, 1] and this action is clipped in the forward function, dist.log_prob(action) can be wrong, because the action distribution is now a clipped Gaussian instead of a Gaussian (see the sketch after this list). I personally ran some experiments about a year ago, and the remapping can really boost performance in some envs.
  3. In other cases, users should always clip or squash actions to make them legal in the policy.forward function.
  4. I acknowledge that action remapping can be confusing sometimes, and it can be erroneous in the offline setting if not handled carefully (especially when we are generating the data ourselves), but I really haven't come up with a neater design so far. Perhaps we could replace remapping with an env wrapper? But how would we bind a certain env wrapper to a certain algorithm?
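The log-prob mismatch in point 2 can be shown with a few lines of plain PyTorch (illustrative only, not Tianshou code):

```python
import torch

dist = torch.distributions.Normal(loc=torch.tensor(0.8), scale=torch.tensor(0.5))
raw_act = torch.tensor(1.6)                # suppose this unconstrained value was sampled
clipped_act = raw_act.clamp(-1.0, 1.0)     # what the env receives if forward clips to [-1, 1]

print(dist.log_prob(raw_act))              # log-prob of the action that was actually drawn
print(dist.log_prob(clipped_act))          # a different value: the density is evaluated at the wrong point
```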

@michalgregor
Contributor Author

Yes, I get what the difficulty is. The context in which I am dealing with this now is a setting somewhat similar to offline learning, in that I pass transitions to the agent from the outside. I don't think there is currently a safe way to do that: I always risk passing the actions in the wrong format, depending on how the policy remaps them.

This would perhaps not be such a problem with plain clipping (which does not actually rescale), but with tanh, or with any custom implementation of map_action in a derived policy, things will break down, and I am not sure how to deal with that.
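For illustration, the same externally supplied, env-scaled action passes through a pure clip untouched but gets moved by a tanh-plus-rescale remap (the formulas below are assumptions about typical map_action variants, not Tianshou's exact ones):

```python
import numpy as np

low, high = -2.0, 2.0
external_act = np.array(0.5)   # an action supplied from the outside, already in env scale

clip_mapped = np.clip(external_act, low, high)                          # 0.5 -> 0.5, unchanged
tanh_mapped = low + (high - low) * (np.tanh(external_act) + 1.0) / 2.0  # 0.5 -> ~0.92

print(clip_mapped, tanh_mapped)  # the same stored value would produce different env transitions
```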

The solution with an env wrapper would be much cleaner in this respect, but I get that it would put more demands on users and could become very confusing in the context of point 4.

@Trinkle23897 Trinkle23897 linked a pull request Mar 12, 2022 that will close this issue
Trinkle23897 pushed a commit that referenced this issue Mar 12, 2022
(Issue #512) Random start in Collector samples actions from the action space, while policies output actions in a range (typically [-1, 1]) and map them to the action space. The buffer only stores unmapped actions, so the randomly initialized actions are not correct when the action range is not [-1, 1]. This may influence policy learning, and particularly model learning in model-based methods.

This PR fixes it by adding an inverse operation before adding random initial actions to the buffer.
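Paraphrasing the fix with the toy remap from earlier (the map_action_inverse name is a hypothetical stand-in for the inverse operation described above; this is not the PR's actual diff):

```python
import numpy as np

low, high = 0.0, 10.0

def map_action(a):             # toy [-1, 1] -> [low, high] rescale
    return low + (high - low) * (a + 1.0) / 2.0

def map_action_inverse(a):     # toy [low, high] -> [-1, 1] rescale
    return 2.0 * (a - low) / (high - low) - 1.0

random_act = np.random.uniform(low, high)     # sampled from the env's action space
stored_act = map_action_inverse(random_act)   # pulled back to [-1, 1] before hitting the buffer
env_act = map_action(stored_act)              # Collector still remaps right before env.step
assert np.isclose(env_act, random_act)        # the env sees exactly the action that was sampled
```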
BFAnas pushed a commit to BFAnas/tianshou that referenced this issue May 5, 2024