[BUG] Incorrect reset handling in collectors #937
Comments
Thanks for reporting this. I'll make a PR to fix this bug, but feel free to keep the conversation going regarding the "done" vs ("next", "done") semantics. |
Yep, that's challenging and requires a lot of code changes, including the objective and loss modules. But I do think having "done" and ("next", "done") mean different things is sometimes favorable: e.g., resetting a recurrent module to some non-trivial hidden state requires identifying whether the current step comes just after a reset, which can be done by checking the root "done". Currently, the "done" that is returned only captures the termination case. BTW, a test for this:

```python
import pytest
import torch

from torchrl.collectors import SyncDataCollector
# MockSerialEnv and MockBatchedLockedEnv come from torchrl's test mocking classes.
from mocking_classes import MockBatchedLockedEnv, MockSerialEnv


@pytest.mark.parametrize("env_class", [MockSerialEnv, MockBatchedLockedEnv])
def test_initial_obs_consistency(env_class, seed=1):
    if env_class == MockSerialEnv:
        num_envs = 1
        env = MockSerialEnv(device="cpu")
    elif env_class == MockBatchedLockedEnv:
        num_envs = 2
        env = MockBatchedLockedEnv(device="cpu", batch_size=[num_envs])
    env.set_seed(seed)
    collector = SyncDataCollector(
        create_env_fn=env,
        frames_per_batch=(env.max_val * 2 + 2) * num_envs,  # at least two episodes
        split_trajs=False,
    )
    for _, d in enumerate(collector):
        break
    obs = d["observation"].squeeze()
    arange = torch.arange(1, collector.env.counter).float().expand_as(obs)
    assert torch.allclose(obs, arange)
```
|
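To make the recurrent-module use case concrete, here is a minimal sketch assuming a hypothetical GRU policy (none of these class or attribute names exist in TorchRL) that uses a root-level "done"/just-after-reset indicator to re-initialize its hidden state to a learned, non-trivial value:

```python
import torch
from torch import nn


class RecurrentPolicy(nn.Module):
    """Hypothetical recurrent policy: resets its hidden state to a learned
    initial value whenever the current step comes right after a reset."""

    def __init__(self, obs_dim: int, hidden_dim: int, act_dim: int):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim, hidden_dim)
        self.h0 = nn.Parameter(torch.zeros(hidden_dim))  # learned initial state
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs: torch.Tensor, done: torch.Tensor, h: torch.Tensor):
        # `done` marks "this step follows a reset": replace the carried hidden
        # state with the learned initial state for those batch entries only.
        h = torch.where(done, self.h0.expand_as(h), h)
        h = self.cell(obs, h)
        return self.head(h), h
```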
After due consideration, I'm open to it. We would move from a structure where "done" sits at the root,

```python
TensorDict({
    "done": torch.Tensor(...),
    "action": torch.Tensor(...),
    "reward": torch.Tensor(...),
    "next": TensorDict({
        "observation": torch.Tensor(...),
    }, batch_size),
}, batch_size)
```

to one where the "done" lives under "next":

```python
TensorDict({
    "action": torch.Tensor(...),
    "reward": torch.Tensor(...),
    "next": TensorDict({
        "done": torch.Tensor(...),
        "observation": torch.Tensor(...),
    }, batch_size),
}, batch_size)
```

This is a major BC-breaking change, meaning that if we don't do it now (before beta) it'll be more difficult to bring it in later. It has several advantages, but mainly the point is that "done" is a property of the next state (i.e., it is final) and not of the current state. See above for the full discussion. cc @shagunsodhani @matteobettini @albertbou92 @Benjamin-eecs @riiswa @smorad @XuehaiPan |
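Purely as an illustration of that layout (a toy stand-in written against the public TensorDict API, not TorchRL's actual `step_mdp`; the `advance` helper and all shapes are made up), advancing the MDP would amount to promoting the content of "next" to the root:

```python
import torch
from tensordict import TensorDict


def advance(td: TensorDict) -> TensorDict:
    """Toy stand-in: under the proposed layout, the content of ("next", ...)
    -- here "done" and "observation" -- becomes the new root view."""
    return td.get("next").clone()


td = TensorDict(
    {
        "action": torch.randn(4, 2),
        "reward": torch.randn(4, 1),
        "next": TensorDict(
            {
                "done": torch.zeros(4, 1, dtype=torch.bool),
                "observation": torch.randn(4, 3),
            },
            batch_size=[4],
        ),
    },
    batch_size=[4],
)
root = advance(td)  # contains "done" and "observation" for the next step
```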
For policy input consistency, I suggest having it either inside the policy or in the collector. |
Isn't it sufficient that ("next", "done") is True? Like this, two consecutive trajectories are clearly delimited. |
Would it make sense to have |
So just to recap and think about this. Before the step we have (o_t, done_t, r_t, a_t), which means that we saw state (o_t, done_t, r_t) and took action (a_t):

```python
TensorDict({
    "done": torch.Tensor(...),
    "action": torch.Tensor(...),
    "reward": torch.Tensor(...),
    "observation": torch.Tensor(...),
}, batch_size)
```

Now, after the step, we should get (o_t+1, done_t+1, r_t+1), which, paired with the previous data, would be a td like

```python
TensorDict({
    "done": torch.Tensor(...),
    "action": torch.Tensor(...),
    "reward": torch.Tensor(...),
    "observation": torch.Tensor(...),
    "next": TensorDict({
        "observation": torch.Tensor(...),
        "done": torch.Tensor(...),
        "reward": torch.Tensor(...),
    }, batch_size),
}, batch_size)
```

Now, we can take the new action

```python
TensorDict({
    "done": torch.Tensor(...),
    "action": torch.Tensor(...),
    "reward": torch.Tensor(...),
    "observation": torch.Tensor(...),
    "next": TensorDict({
        "observation": torch.Tensor(...),
        "done": torch.Tensor(...),
        "reward": torch.Tensor(...),
        "action": torch.Tensor(...),
    }, batch_size),
}, batch_size)
```

and finally step the MDP to go back to the top of this comment:

```python
TensorDict({
    "done": torch.Tensor(...),
    "reward": torch.Tensor(...),
    "observation": torch.Tensor(...),
    "action": torch.Tensor(...),
}, batch_size)
```

The intermediate view

```python
TensorDict({
    "done": torch.Tensor(...),
    "action": torch.Tensor(...),
    "reward": torch.Tensor(...),
    "observation": torch.Tensor(...),
    "next": TensorDict({
        "observation": torch.Tensor(...),
        "done": torch.Tensor(...),
        "reward": torch.Tensor(...),
        "action": torch.Tensor(...),
    }, batch_size),
}, batch_size)
```

contains all the info we could possibly want. |
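To tie the recap together, a minimal self-contained sketch of that loop (toy dynamics and arbitrary shapes; `toy_reset`, `toy_step` and the key names are made up for illustration and are not TorchRL classes):

```python
import torch
from tensordict import TensorDict

batch_size = [1]


def toy_reset() -> TensorDict:
    # Root-level view at t=0: (o_0, done_0, r_0), with r_0 conventionally zero.
    return TensorDict(
        {
            "observation": torch.zeros(1, 3),
            "done": torch.zeros(1, 1, dtype=torch.bool),
            "reward": torch.zeros(1, 1),
        },
        batch_size=batch_size,
    )


def toy_step(td: TensorDict) -> TensorDict:
    # env.step fills "next" with (o_{t+1}, done_{t+1}, r_{t+1}).
    td.set(
        "next",
        TensorDict(
            {
                "observation": td["observation"] + 1.0,
                "done": torch.zeros(1, 1, dtype=torch.bool),
                "reward": torch.ones(1, 1),
            },
            batch_size=batch_size,
        ),
    )
    return td


td = toy_reset()
for t in range(3):
    td["action"] = torch.randn(1, 2)  # take a_t given (o_t, done_t, r_t)
    td = toy_step(td)                 # the intermediate view, with "next" filled
    td = td.get("next").clone()       # step the MDP: back to the root view
```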
Commenting on this, I think reward should go together with done and obs, whatever we do |
EDIT: |
See my edited comment above |
Would reset need to return a reward in that case? Couldn't it just return done and obs? Done is needed in the return because the reset could be partial. |
Yeah, maybe we can do without. So for a trajectory we'd have

```python
TensorDict({
    "obs": torch.Tensor([T, ...]),
    "action": torch.Tensor([T, ...]),
    "done": torch.Tensor([T, ...]),
    "next": TensorDict({
        "obs": torch.Tensor([T, ...]),
        "done": torch.Tensor([T, ...]),
        "reward": torch.Tensor([T, ...]),
    }, batch_size=[T]),
}, batch_size=[T])
```
|
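As a quick illustration of why this layout is convenient (a self-contained toy trajectory; key names as above, T and shapes arbitrary), the root entries at t+1 coincide with the "next" entries at t within an episode, and each index already holds a full transition:

```python
import torch
from tensordict import TensorDict

T = 3
obs = torch.arange(T + 1, dtype=torch.float32).unsqueeze(-1)  # o_0 .. o_T
traj = TensorDict(
    {
        "obs": obs[:-1],
        "action": torch.zeros(T, 1),
        "done": torch.zeros(T, 1, dtype=torch.bool),
        "next": TensorDict(
            {
                "obs": obs[1:],
                "done": torch.zeros(T, 1, dtype=torch.bool),
                "reward": torch.ones(T, 1),
            },
            batch_size=[T],
        ),
    },
    batch_size=[T],
)

# Within an episode, the root at t+1 coincides with "next" at t:
assert torch.equal(traj["obs"][1:], traj["next", "obs"][:-1])
assert torch.equal(traj["done"][1:], traj["next", "done"][:-1])

# Each index t is a complete transition (s_t, a_t, r_{t+1}, done_{t+1}, s_{t+1}):
s, a = traj["obs"][0], traj["action"][0]
r, d, s_next = traj["next", "reward"][0], traj["next", "done"][0], traj["next", "obs"][0]
```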
Following up on the related slack conversation, this seems fine to me.
|
This looks good to me |
At this point, the reward spec should go in the observation_spec, like the action spec goes in the input_spec |
I mean it could be an output_spec, which is a composite of the reward, obs and **info specs. Reset would return an output that fits the obs_spec and **info specs |
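A rough sketch of that idea, assuming the spec class names available in TorchRL 0.x (they have changed across releases, so treat the exact constructors as an assumption rather than the final design):

```python
from torchrl.data import CompositeSpec, UnboundedContinuousTensorSpec

# Hypothetical output_spec grouping what env.step would write back.
output_spec = CompositeSpec(
    observation=UnboundedContinuousTensorSpec(),
    reward=UnboundedContinuousTensorSpec(),
)
sample = output_spec.rand()  # a TensorDict with "observation" and "reward" entries
```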
Following the standard convention in RL:
Maybe I missed something, but is the plan to go with
I thought we wanted to keep reward next to obs, so I would have expected this:
|
I think the solution proposed by Vincent (the first one in your comment):

```python
TensorDict({
    "obs": torch.Tensor([T, ...]),
    "action": torch.Tensor([T, ...]),
    "done": torch.Tensor([T, ...]),
    "next": TensorDict({
        "obs": torch.Tensor([T, ...]),
        "done": torch.Tensor([T, ...]),
        "reward": torch.Tensor([T, ...]),
    }, batch_size=[T]),
}, batch_size=[T])
```

does what we all want. If you think about it, at time step 0 you have

```python
TensorDict({
    "obs": torch.Tensor([1, ...]),
    "action": torch.Tensor([1, ...]),
    "done": torch.Tensor([1, ...]),
}, batch_size=[1])
```

which has obs_0 and done_0 coming from reset, and action_0, taken for that obs. You call step with this and you get done_1, obs_1, rew_1, which go in "next":

```python
TensorDict({
    "obs": torch.Tensor([1, ...]),
    "action": torch.Tensor([1, ...]),
    "done": torch.Tensor([1, ...]),
    "next": TensorDict({
        "obs": torch.Tensor([1, ...]),
        "done": torch.Tensor([1, ...]),
        "reward": torch.Tensor([1, ...]),
    }, batch_size=[1]),
}, batch_size=[1])
```

Then, when you step the mdp, you obtain

```python
TensorDict({
    "obs": torch.Tensor([1, ...]),
    "action": torch.Tensor([1, ...]),
    "done": torch.Tensor([1, ...]),
}, batch_size=[1])
```

again, which now contains obs_1 and done_1, with added action_1. |
Sorry, I didn't follow. Let's take the example of the gym API and then we will continue with the TorchRL example. At time 0, we just reset the env, so we get just obs. At time t, the agent uses the last observed state [0], obs_{t-1}, to predict the action to take (a_t), performs that action, reaches a state obs_t, and gets a reward r_t and a done signal d_t. Now, in the standard gym API, obs_t, r_t, d_t are returned by the env.step call. Now let's extend this to TorchRL:
- What is the action here? The agent hasn't taken any action so far. If the agent has taken an action, there should be a reward field as well.
- Why are we still returning the old obs and old done?
- Just to confirm, we will have the reward as well here, correct?

[0]: or observation, or all observations seen so far, etc. I am overloading the notation here. |
I think you are using a different notation for the action. My notation is (as in the picture from the RL book above): a_t is the action taken when seeing obs_t, so at time 0 that is the action taken on the reset observation. The reward, done and state obtained after taking action a_t are r_t+1, done_t+1 and s_t+1. So in OpenAI Gym, step(a_t) = r_t+1, done_t+1, s_t+1. |
@matteobettini Thanks for the clarification. In that case, is the following flow correct: at time t = 0, env emits
agent chooses an action and returns
The action is executed and the env returns
Could you also clarify what object the agent sees at time step 1? |
What you wrote is correct. At timestep 1 the agent sees what you have put in "next" (obs_1, d_1):

```python
TensorDict({
    "obs": torch.Tensor([1, ...]),   # o_1
    "done": torch.Tensor([1, ...]),  # d_1
}, batch_size=[1])
```

We could also keep r_1 in memory, but there is not really a purpose for this. Here we are keeping d_1 and o_1 in memory because this aligns with the info returned by reset and thus available at the start of a trajectory. |
Sounds good - thanks for the clarification :) |
Something I'm having some trouble figuring out is:
|
We need to keep in mind vectorized and multi-agent environments. If I have a batch of 32 vectorized environments, each with 4 agents, say, I want to be able to reset just one of the agents in one of the envs, independently of whether they are done or not. So, if I reset just one agent in just one part of the batch, the rest of the done flags will stay what they were before when returning from the reset. In other words, we need to keep in mind that "done" is multidimensional and the dimensions can be independent and unrelated. For the same reason, any part of the env that is done and not reset has to stay done. Therefore, I think reset should only depend on the "_reset" flag, which may or may not match the previous done. It will return a truthful done, stating which dimensions are still done after the partial reset (always think of a reset as partial, since we are batched). |
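To make the partial-reset semantics concrete, a small sketch in plain torch (32 envs × 4 agents; all names, shapes and indices are arbitrary) of how a "_reset" mask would interact with a multi-dimensional done flag: only the reset entries are cleared, everything else keeps its previous done value:

```python
import torch

num_envs, num_agents = 32, 4
done = torch.zeros(num_envs, num_agents, 1, dtype=torch.bool)
done[3, 2] = True  # one agent of one env happens to be done
done[7, 0] = True  # another agent, in another env, is done too

# Ask to reset only agent 2 of env 3, regardless of the others' done state.
_reset = torch.zeros_like(done)
_reset[3, 2] = True

# After the partial reset, only the reset entries are cleared; the rest of the
# done flags keep their previous values (env 7 / agent 0 stays done).
done_after = done & ~_reset
assert not done_after[3, 2].item()
assert done_after[7, 0].item()
```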
@vmoens I don't think so. You should never do that if you only have one boolean flag (i.e., a single done). The old API folds termination and truncation into that single flag. For value learning (V or Q), the bootstrap target is

$$
y_t = r_{t+1} + \gamma \, (1 - \mathrm{terminated}_{t+1}) \, V(s_{t+1})
$$

so the two cases must be distinguished. Ref: |
It's something we could support but I don't think we want to enforce that. |
I agree that it's painful to migrate to the Gymnasium API. There are still many RL frameworks using the old Gym API. But I think implementation correctness is the top priority for an RL framework. It is not worth shipping "wrong" implementations of value-based algorithms (e.g., Q-learning) or policy-based algorithms (e.g., Actor-Critic, PPO). All RL algorithms using TD learning need to consider the distinction between termination and truncation.
A step counter is not doing the same thing as a truncation signal coming from the environment. |
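A hedged sketch of the TD-learning point (plain torch; the function and variable names are mine, γ is arbitrary): the bootstrap term must be cut only on true termination, so a single conflated done flag yields the wrong target for truncated-but-not-terminated steps:

```python
import torch


def td_target(reward, terminated, next_value, gamma=0.99):
    # Bootstrap from V(s_{t+1}) unless the episode truly terminated.
    # Truncated-but-not-terminated steps must still bootstrap, which is why a
    # single `done = terminated or truncated` flag is not enough.
    return reward + gamma * (1.0 - terminated.float()) * next_value


reward = torch.tensor([1.0])
next_value = torch.tensor([10.0])
print(td_target(reward, torch.tensor([False]), next_value))  # truncated only: 10.9
print(td_target(reward, torch.tensor([True]), next_value))   # terminated: 1.0
```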
I have the following suggestions:
|
This one is almost ready
I will do a transform that allocates a
I updated the README by adding a gif of the env API, putting the env feature on top, and referring to the tutorial and docs in there. Suggestions are welcome!
We should do that, but as of now it is a private key that is handled by the classes on a per-usage basis. @matteobettini may have an opinion on where and how it should be documented? |
Can you check #962 ? |
Describe the bug

After auto-resetting an environment with _reset_if_necessary, the initial obs is ignored. The actual obs seen by the policy at the next step is always a zero TensorDict.

To Reproduce

Here we use a dummy env where the obs is just the time stamp (starting from 1).

Running the above code gives

Expected behavior

Reason and Possible fixes

This occurs because, in SyncDataCollector, the initial obs of the new episode gets discarded by step_mdp because it is not in self._tensordict["next"]. What the policy will see is the zeros set by self._tensordict.masked_fill_(done_or_terminated, 0).

The most straightforward fix is to change the above to:

so that the initial obs gets carried to the next step by step_mdp.
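The author's actual snippet is not reproduced above; purely to illustrate the direction (a hypothetical helper, not SyncDataCollector code), the idea is to write the freshly reset observations into ("next", "observation") for the done sub-envs, so that step_mdp carries them over instead of the zeros:

```python
import torch
from tensordict import TensorDict


def carry_reset_obs(td: TensorDict, reset_td: TensorDict, done: torch.Tensor) -> TensorDict:
    """Hypothetical helper: copy the post-reset observation into
    ("next", "observation") for the entries that were done, so that step_mdp
    propagates it to the next policy call instead of a zeroed observation."""
    mask = done.squeeze(-1)
    td.get(("next", "observation"))[mask] = reset_td.get("observation")[mask]
    return td
```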
However, this would break some tests, e.g., test_collector.py::test_traj_len_consistency, because now we have ("next", "done") in the keys after the first reset, which causes a key inconsistency when doing torch.cat.

I recall that earlier versions of torchrl required step_mdp(env.reset()). Here I wonder whether the coexistence of "done" and ("next", "done") makes sense. I personally think having both is more rigorous: "done" indicates whether this step is an initial step, and ("next", "done") indicates whether the episode is terminated after this step (IIUC currently we have only "done" for the latter). In this way, inside an RNN policy module, we can decide whether we need to reset some of its hidden states.

Checklist