New step API with terminated, truncated bools instead of done #115
Conversation
It is possible for `terminated=True` and `truncated=True` to occur at the same time, when termination and truncation happen at the same step.

#### Motivation
This is done to remove the ambiguity in the `done` signal. `done=True` does not distinguish between the environment terminating and the episode truncating. This problem was avoided previously by setting `info['TimeLimit.truncated']` in case of a time limit through the TimeLimit wrapper. However, this forces the truncation to happen only through a wrapper, and does not accommodate environments which truncate in the core environment.
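Concretely, the two conventions compare roughly as follows. This is only a minimal sketch assuming the five-element `step` return proposed in this PR; the helper names are illustrative and not part of any API.

```python
def step_new_api(env, action):
    """Sketch: unpacking the proposed 5-tuple step return."""
    obs, reward, terminated, truncated, info = env.step(action)
    # Both flags can be True on the same step, e.g. when the goal state is
    # reached exactly as the time limit expires.
    episode_over = terminated or truncated
    return obs, reward, episode_over, info


def step_old_api(env, action):
    """Sketch: recovering the same information from the old `done` + info convention."""
    obs, reward, done, info = env.step(action)
    # The TimeLimit wrapper signalled truncation through the info dict.
    truncated = info.get("TimeLimit.truncated", False)
    terminated = done and not truncated
    return obs, reward, terminated, truncated, info
```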
"However, this forces the truncation to happen only through a wrapper"
This is not true, any env can set info['TimeLimit.truncated'] = True
...
Oh of course, my bad!
The main reason we need to distinguish between these two signals is the implications for implementing the Temporal Difference update, where the reward along with the value of the next state(s) is used to update the value of the current state. At terminal states, there is no contribution from the value of the next state(s) and we only consider the value as a function of the reward at that state. However, this is not true when we forcibly truncate the episode. Time limits are often set during training to diversify experience, but the goal of the agent is not to maximize reward over this time period. The `done` signal is often used to control whether the value of the next state …
The main difference is more high-level than the TD update: it is whether we treat the task as a finite or infinite horizon one.
See https://github.com/DLR-RM/stable-baselines3/blob/master/tests/test_gae.py#L136 for a concrete example and test.
Thank you for your comment! I thought about it a bit. Here's my understanding:
For infinite horizon tasks, the situation is simple: we always have to set practical time limits, so we need the terminated / truncated distinction here, where basically terminated is always False (and without the distinction, bootstrapping the next state value is skipped, leading to incorrect value estimates; this is also the test you linked).
For finite horizon tasks, it's a little trickier. This can be further split into two scenarios:
- The agent is aware of the time limit (it is included in the agent's observation): here we need the distinction. Suppose the task time limit is 1000 steps and the agent is aware of this, but the user sets an additional episode time limit of 50 steps while training and lets it run to the end during eval. This needs a distinction, since the agent should not optimize for the 50-step truncation. (So at 1000 steps we don't bootstrap; at 50 steps we bootstrap.)
- The agent is not aware of the time limit: here the distinction doesn't matter, since the agent is unaware of either time limit, and we can always safely bootstrap.
So it's not just about infinite vs finite horizon, but also about whether the agent is time-aware 🤔 since these lead to different situations. Please correct me if I'm wrong.
For the TD update, my point was to emphasize `reward + gamma * V(next_state)` (when terminated=False, truncated=True) versus just `reward` (when terminated=True). My specific example, I think, is restricted to value-based approaches, but a similar bootstrapping happens in almost all of RL (in SB3 on-policy algos, for example, when truncated=True the final reward is bootstrapped with the next value). Maybe I can update the example with this bootstrapping, instead of mentioning it as a TD error update?
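To make the bootstrapping point concrete, here is a minimal sketch of the one-step target described above (value-based view; `gamma` and the function name are illustrative):

```python
def one_step_target(reward, next_value, terminated, gamma=0.99):
    """Sketch of the 1-step target discussed above.

    terminated=True:  no contribution from the next state -> target = reward
    terminated=False: bootstrap -> target = reward + gamma * V(s')
    Note that truncated=True with terminated=False falls in the second case:
    the episode was cut short, but the underlying task continues.
    """
    if terminated:
        return reward
    return reward + gamma * next_value
```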
Apologies for the long comment; I hope this explanation is accurate. I will update the docs to be clearer and more general.
> where basically terminated is always False

Terminated could be true (for instance in case of a failure), but I would prefer to talk about termination reasons (see my comments in openai/gym#2510 (comment)), as in practice an episode is terminated when you reach the timeout.

> bootstrapping the next state value is skipped, leading to incorrect value estimates

Yes, unless you pass the remaining time as input (as described in https://arxiv.org/abs/1712.00378, what you call "time-aware" below).

> Suppose the task time limit is 1000 steps and the agent is aware of this, but the user sets an additional episode time limit

I'm not sure I get the example, in the sense that it does not make sense to me... do you have a concrete example where you want to set two time limits?
In your example, it feels like your new problem (the 50-step limit) is treated as an infinite horizon problem when you bootstrap (because the agent never sees the real time limit). As in the infinite-horizon case, the true horizon depends on the discount factor.
The only case where time-aware and infinite horizon are not equivalent for me is when you want your agent to do something special depending on the time, for instance a jump in the last steps.

> The agent is not aware of the time limit: here the distinction doesn't matter, since the agent is unaware of either time limit, and we can always safely bootstrap.

In that case, you are breaking the Markov assumption...

> For the TD update,

My point was more to give a high-level reason for why we bootstrap or not, and dive into the details later.
> Terminated could be true (for instance in case of a failure)

Ohh oops, yes. I meant terminated cannot be True due to a time limit here (given this PR's meaning and usage of terminated).

> I'm not sure I get the example, in the sense that it does not make sense to me... do you have a concrete example where you want to set two time limits?

I was imagining a scenario where the agent has a very long lifetime (say 10k steps) which it's aware of, but for the purpose of diversifying experience we wish to restart the episode from a point where the agent is at t >> 0, and let it learn for 50 steps at a time. But it occurs to me that this lifetime would no longer be considered a time limit (since it's no longer 10k steps from the start of the episode), but rather a regular termination condition. So my point is invalid 🤔 I think I also need to think more about this and about what 'time limit' means. I'll get back on the issue thread.

> In that case, you are breaking the Markov assumption...

Right, so if the agent has to optimize for a time limit, it has to be time-aware to satisfy the Markov assumption (see the sketch at the end of this comment).

> My point was more to give a high-level reason for why we bootstrap or not, and dive into the details later.

Oh okay, that makes sense.
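As a rough sketch of what "time-aware" could look like in practice (along the lines of the paper linked above, where the remaining time is appended to the observation): this is only an illustration, not gym's actual wrapper; the class name and details are made up, and it assumes the 5-tuple step return proposed in this PR.

```python
import numpy as np


class RemainingTimeObservation:
    """Illustrative wrapper: append normalized remaining time to each observation,
    so the policy can stay Markov under a known time limit."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self._elapsed = 0

    def _augment(self, obs):
        remaining = 1.0 - self._elapsed / self.max_episode_steps
        return np.append(np.asarray(obs, dtype=np.float32), remaining)

    def reset(self, **kwargs):
        self._elapsed = 0
        obs = self.env.reset(**kwargs)  # info dict omitted for brevity
        return self._augment(obs)

    def step(self, action):
        # Assumes the 5-tuple step return discussed in this PR.
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._elapsed += 1
        if self._elapsed >= self.max_episode_steps:
            truncated = True
        return self._augment(obs), reward, terminated, truncated, info
```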
Can you do a corresponding PR in gym-examples once everything is ready?
Please resolve conflicts.
@trigaten so there are some things that need to be resolved before this PR; I might need to edit again depending on other reviews. I'll update soon.
@arjun-kg Would you be able to fix the merge conflicts?
Refer to openai/gym#2752 for more details