New step API with terminated, truncated bools instead of done #115
Conversation
It is possible for `terminated=True` and `truncated=True` to occur at the same time, when termination and truncation happen at the same step.

#### Motivation
This is done to remove the ambiguity in the `done` signal. `done=True` does not distinguish between the environment terminating and the episode truncating. This problem was avoided previously by setting `info['TimeLimit.truncated']` in case of a time limit through the TimeLimit wrapper. However, this forces the truncation to happen only through a wrapper, and does not accommodate environments which truncate in the core environment.
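Concretely, the two conventions compare roughly as follows. This is only a minimal sketch assuming the five-element `step` return proposed in this PR; the helper names are illustrative and not part of any API.

```python
def step_new_api(env, action):
    """Sketch: unpacking the proposed 5-tuple step return."""
    obs, reward, terminated, truncated, info = env.step(action)
    # Both flags can be True on the same step, e.g. when the goal state is
    # reached exactly as the time limit expires.
    episode_over = terminated or truncated
    return obs, reward, episode_over, info


def step_old_api(env, action):
    """Sketch: recovering the same information from the old `done` + info convention."""
    obs, reward, done, info = env.step(action)
    # The TimeLimit wrapper signalled truncation through the info dict.
    truncated = info.get("TimeLimit.truncated", False)
    terminated = done and not truncated
    return obs, reward, terminated, truncated, info
```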
"However, this forces the truncation to happen only through a wrapper"
This is not true, any env can set info['TimeLimit.truncated'] = True
...
Oh of course, my bad!
The main reason we need to distinguish between these two signals is the implications for implementing the Temporal Difference update, where the reward along with the value of the next state(s) is used to update the value of the current state. At terminal states, there is no contribution from the value of the next state(s) and we only consider the value as a function of the reward at that state. However, this is not true when we forcibly truncate the episode. Time limits are often set during training to diversify experience, but the goal of the agent is not to maximize reward over this time period. The `done` signal is often used to control whether the value of the next state …
The main difference is more high-level than the TD update: it is whether we treat the task as a finite or infinite horizon one.
See https://github.com/DLR-RM/stable-baselines3/blob/master/tests/test_gae.py#L136 for a concrete example and test.
Thank you for your comment! I thought about it a bit. Here's my understanding:
For infinite horizon tasks, the situation is simple: we always have to set practical time limits, so we need the terminated / truncated distinction here, where basically terminated is always False (and without the distinction, bootstrapping the next state value is skipped, leading to incorrect value estimates; this is also the test you linked).
For finite horizon tasks, it's a little trickier. This can be further split into two scenarios:
- The agent is aware of the time limit (it is included in the agent's observation): here we need the distinction. Suppose the task time limit is 1000 steps and the agent is aware of this, but the user sets an additional episode time limit of 50 steps while training and lets it run to the end during eval. This needs a distinction, since the agent should not optimize for the 50-step truncation. (So at 1000 steps we don't bootstrap; at 50 steps we bootstrap.)
- The agent is not aware of the time limit: here the distinction doesn't matter, since the agent is unaware of either time limit, and we can always safely bootstrap.
So it's not just about infinite vs finite horizon, but also about whether the agent is time-aware 🤔 since these lead to different situations. Please correct me if I'm wrong.
For the TD update, my point was to emphasize `reward + gamma * V(next_state)` (when terminated=False, truncated=True) versus just `reward` (when terminated=True). My specific example, I think, is restricted to value-based approaches, but a similar bootstrapping happens in almost all of RL (in SB3 on-policy algos, for example, when truncated=True the final reward is bootstrapped with the next value). Maybe I can update the example with this bootstrapping, instead of mentioning it as a TD error update?
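To make the bootstrapping point concrete, here is a minimal sketch of the one-step target described above (value-based view; `gamma` and the function name are illustrative):

```python
def one_step_target(reward, next_value, terminated, gamma=0.99):
    """Sketch of the 1-step target discussed above.

    terminated=True:  no contribution from the next state -> target = reward
    terminated=False: bootstrap -> target = reward + gamma * V(s')
    Note that truncated=True with terminated=False falls in the second case:
    the episode was cut short, but the underlying task continues.
    """
    if terminated:
        return reward
    return reward + gamma * next_value
```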
Apologies for the long comment; I hope this explanation is accurate. I will update the docs to be clearer and more general.
> where basically terminated is always False

Terminated could be true (for instance in case of a failure), but I would prefer to talk about termination reasons (see my comments in openai/gym#2510 (comment)), as in practice an episode is terminated when you reach the timeout.

> bootstrapping the next state value is skipped, leading to incorrect value estimates

Yes, unless you pass the remaining time as input (as described in https://arxiv.org/abs/1712.00378, what you call "time-aware" below).

> Suppose the task time limit is 1000 steps and the agent is aware of this, but the user sets an additional episode time limit

I'm not sure I get the example, in the sense that it does not make sense to me... do you have a concrete example where you want to set two time limits?
In your example, it feels like your new problem (the 50-step limit) is treated as an infinite horizon problem when you bootstrap (because the agent never sees the real time limit). As in the infinite-horizon case, the true horizon depends on the discount factor.
The only case where time-aware and infinite horizon are not equivalent for me is when you want your agent to do something special depending on the time, for instance a jump in the last steps.

> The agent is not aware of the time limit: here the distinction doesn't matter, since the agent is unaware of either time limit, and we can always safely bootstrap.

In that case, you are breaking the Markov assumption...

> For the TD update,

My point was more to give a high-level reason for why we bootstrap or not, and dive into the details later.
> Terminated could be true (for instance in case of a failure)

Ohh oops, yes. I meant terminated cannot be True due to a time limit here (given this PR's meaning and usage of terminated).

> I'm not sure I get the example, in the sense that it does not make sense to me... do you have a concrete example where you want to set two time limits?

I was imagining a scenario where the agent has a very long lifetime (say 10k steps) which it's aware of, but for the purpose of diversifying experience we wish to restart the episode from a point where the agent is at t >> 0, and let it learn for 50 steps at a time. But it occurs to me that this lifetime would no longer be considered a time limit (since it's no longer 10k steps from the start of the episode), but rather a regular termination condition. So my point is invalid 🤔 I think I also need to think more about this and about what 'time limit' means. I'll get back on the issue thread.

> In that case, you are breaking the Markov assumption...

Right, so if the agent has to optimize for a time limit, it has to be time-aware to satisfy the Markov assumption (see the sketch at the end of this comment).

> My point was more to give a high-level reason for why we bootstrap or not, and dive into the details later.

Oh okay, that makes sense.
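As a rough sketch of what "time-aware" could look like in practice (along the lines of the paper linked above, where the remaining time is appended to the observation): this is only an illustration, not gym's actual wrapper; the class name and details are made up, and it assumes the 5-tuple step return proposed in this PR.

```python
import numpy as np


class RemainingTimeObservation:
    """Illustrative wrapper: append normalized remaining time to each observation,
    so the policy can stay Markov under a known time limit."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self._elapsed = 0

    def _augment(self, obs):
        remaining = 1.0 - self._elapsed / self.max_episode_steps
        return np.append(np.asarray(obs, dtype=np.float32), remaining)

    def reset(self, **kwargs):
        self._elapsed = 0
        obs = self.env.reset(**kwargs)  # info dict omitted for brevity
        return self._augment(obs)

    def step(self, action):
        # Assumes the 5-tuple step return discussed in this PR.
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._elapsed += 1
        if self._elapsed >= self.max_episode_steps:
            truncated = True
        return self._augment(obs), reward, terminated, truncated, info
```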
Can you do a corresponding PR in gym-examples once everything is ready?
Please resolve conflicts.
@trigaten so there are some things that need to be resolved before this PR; I might need to edit again depending on other reviews. I'll update soon.
@arjun-kg Would you be able to fix the merge conflicts?
Refer to openai/gym#2752 for more details