Why is Collector mapping randomly sampled actions using map_action? #512

Closed
michalgregor opened this issue Jan 19, 2022 · 6 comments · Fixed by #568
Labels
bug Something isn't working

Comments

@michalgregor
Contributor

Hi, everyone; it seems that in Collector, if random is true, we sample actions from the action space using self._action_space[i].sample().

We then apply action_remap = self.policy.map_action(self.data.act) to them, just as we do to actions generated by the policy.

Is this correct? It seems to me that actions sampled from the action space should already be scaled correctly, and squashing them probably changes their distribution.
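To make the concern concrete, here is a minimal, self-contained sketch; the rescale formula and the map_action name stand in for a typical bounded remap that assumes [-1, 1] policy outputs, and are not the exact Tianshou source:

```python
# Toy illustration: applying a [-1, 1] -> [low, high] remap to an action that
# is ALREADY in [low, high] distorts (and here even overflows) its range.
import numpy as np

low, high = 0.0, 10.0                        # the env's true action bounds

def map_action(act):
    # typical bounded remap: assumes act is a policy output living in [-1, 1]
    return low + (high - low) * (act + 1.0) / 2.0

random_act = np.random.uniform(low, high)    # like action_space.sample(): already in [0, 10]
print(map_action(random_act))                # lands somewhere in [5, 55], no longer uniform on [0, 10]
```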

@Trinkle23897
Collaborator

Yes, you are right; thanks for reporting this bug.

@Trinkle23897 Trinkle23897 added the bug Something isn't working label Jan 19, 2022
@michalgregor
Contributor Author

Cool. I'll just leave one more note here regarding a possible fix then.

It will probably not be enough to just move the action remap under the appropriate branch of the if statement. Since we store the raw, unremapped actions in the replay buffer for policy actions, we will probably also need to apply the inverse of map_action to random actions before storing them, to keep things consistent.
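A rough sketch of what such an inverse might look like for the common remapping variants (this is an assumption about the shape of a fix, not an actual implementation):

```python
import numpy as np

def map_action_inverse(act, low, high, bound_method="clip"):
    # Undo the bounded rescale so that an action sampled from the env's action
    # space is stored in the same normalized range as raw policy outputs.
    act = 2.0 * (act - low) / (high - low) - 1.0          # [low, high] -> [-1, 1]
    if bound_method == "tanh":
        # if map_action squashes with tanh, invert it with arctanh
        act = np.arctanh(np.clip(act, -1.0 + 1e-7, 1.0 - 1e-7))
    return act
```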

@Trinkle23897
Collaborator

@ChenDRAG

@michalgregor
Contributor Author

Storing unremapped actions presumably also has implications for offline training, where we pass in a replay buffer whose actions may have been collected by a policy with a different remapping. Under the policy being trained, which remaps differently, those actions might lead to completely different transitions than the ones recorded in the buffer.

@ChenDRAG
Collaborator

ChenDRAG commented Jan 20, 2022

Thank you so much for your suggestions. Here is the idea we had when we first designed the action-remapping function.

  1. Action remapping is designed to work like an env wrapper. The remapped action is never stored in the replay buffer, so to users it behaves more like a function inside the black box of the environment.
  2. Why do we need this? Take the REINFORCE algorithm as an example: the action is usually sampled from a Gaussian distribution, which is unconstrained. This particular action is sent to the env and stored in the buffer. When we update the policy, this action is needed to calculate dist.log_prob(action). Now, if the action space is [-1, 1] and this action is clipped in the forward function, dist.log_prob(action) can be wrong, because the action distribution is now a clipped Gaussian instead of a Gaussian (see the sketch after this list). I personally ran some experiments about a year ago, and the remapping can really boost performance in some envs.
  3. In other cases, users should always clip or squash actions to make them legal in the policy.forward function.
  4. I acknowledge that action remapping can be confusing sometimes, and it can be erroneous in the offline setting if not handled carefully (especially when we are generating the data ourselves), but I really haven't come up with a neater design so far. Perhaps we could replace remapping with an env wrapper? But how would we bind a certain env wrapper to a certain algorithm?
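The log-prob mismatch in point 2 can be shown with a few lines of plain PyTorch (illustrative only, not Tianshou code):

```python
import torch

dist = torch.distributions.Normal(loc=torch.tensor(0.8), scale=torch.tensor(0.5))
raw_act = torch.tensor(1.6)                # suppose this unconstrained value was sampled
clipped_act = raw_act.clamp(-1.0, 1.0)     # what the env receives if forward clips to [-1, 1]

print(dist.log_prob(raw_act))              # log-prob of the action that was actually drawn
print(dist.log_prob(clipped_act))          # a different value: the density is evaluated at the wrong point
```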

@michalgregor
Contributor Author

Yes, I get what the difficulty is. The context in which I am dealing with this now is a setting somewhat similar to offline learning, in that I pass transitions to the agent from the outside. I don't think there is currently a safe way to do that: I always risk passing the actions in the wrong format, depending on how the policy remaps them.

This would perhaps not be such a problem with plain clipping (which does not actually rescale), but with tanh, or with any custom implementation of map_action in a derived policy, things will break down, and I am not sure how to deal with that.
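For illustration, the same externally supplied, env-scaled action passes through a pure clip untouched but gets moved by a tanh-plus-rescale remap (the formulas below are assumptions about typical map_action variants, not Tianshou's exact ones):

```python
import numpy as np

low, high = -2.0, 2.0
external_act = np.array(0.5)   # an action supplied from the outside, already in env scale

clip_mapped = np.clip(external_act, low, high)                          # 0.5 -> 0.5, unchanged
tanh_mapped = low + (high - low) * (np.tanh(external_act) + 1.0) / 2.0  # 0.5 -> ~0.92

print(clip_mapped, tanh_mapped)  # the same stored value would produce different env transitions
```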

The solution with an env wrapper would be much cleaner in this respect, but I get that it would put more demands on users and could become very confusing in the context of point 4.

@Trinkle23897 Trinkle23897 linked a pull request Mar 12, 2022 that will close this issue
Trinkle23897 pushed a commit that referenced this issue Mar 12, 2022
(Issue #512) Random start in Collector samples actions from the action space, while policies output actions in a range (typically [-1, 1]) and map them to the action space. The buffer only stores unmapped actions, so the randomly initialized actions are not correct when the action range is not [-1, 1]. This may influence policy learning, and particularly model learning in model-based methods.

This PR fixes it by adding an inverse operation before adding random initial actions to the buffer.
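Paraphrasing the fix with the toy remap from earlier (the map_action_inverse name is a hypothetical stand-in for the inverse operation described above; this is not the PR's actual diff):

```python
import numpy as np

low, high = 0.0, 10.0

def map_action(a):             # toy [-1, 1] -> [low, high] rescale
    return low + (high - low) * (a + 1.0) / 2.0

def map_action_inverse(a):     # toy [low, high] -> [-1, 1] rescale
    return 2.0 * (a - low) / (high - low) - 1.0

random_act = np.random.uniform(low, high)     # sampled from the env's action space
stored_act = map_action_inverse(random_act)   # pulled back to [-1, 1] before hitting the buffer
env_act = map_action(stored_act)              # Collector still remaps right before env.step
assert np.isclose(env_act, random_act)        # the env sees exactly the action that was sampled
```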
BFAnas pushed a commit to BFAnas/tianshou that referenced this issue May 5, 2024