New step API with terminated, truncated bools instead of done #115

Closed
wants to merge 2 commits

Conversation

@arjun-kg (Contributor) commented Apr 14, 2022

Refer to openai/gym#2752 for more details


It is possible for both `terminated=True` and `truncated=True` to be returned on the same step, when termination and truncation coincide.

#### Motivation
This is done to remove the ambiguity in the `done` signal. `done=True` does not distinguish between the environment terminating and the episode truncating. This problem was avoided previously by setting `info['TimeLimit.truncated']` in case of a time limit through the TimeLimit wrapper. However, this forces the truncation to happen only through a wrapper, and does not accommodate environments which truncate in the core environment.
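For illustration, here is a minimal sketch of an environment exposing the proposed two-flag step API (the `ToyEnv` below and its step limit are hypothetical, not part of this PR):

```python
import numpy as np

class ToyEnv:
    """Hypothetical environment using the proposed (terminated, truncated) step API."""

    def __init__(self, max_steps=50):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(2), {}

    def step(self, action):
        self.t += 1
        obs, reward = np.zeros(2), 1.0
        terminated = False                    # True when the MDP itself reaches a terminal state
        truncated = self.t >= self.max_steps  # True when the episode is cut short, e.g. a time limit
        return obs, reward, terminated, truncated, {}

env = ToyEnv()
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(0)
done = terminated or truncated  # if a single episode-over flag is still needed

# Old API, for comparison:
#     obs, reward, done, info = env.step(action)
# where a time-limit truncation could only be recovered from info["TimeLimit.truncated"].
```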
A reviewer commented:

"However, this forces the truncation to happen only through a wrapper"

This is not true; any env can set info['TimeLimit.truncated'] = True...
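(For concreteness, a rough sketch of an environment that signals truncation this way under the old API; the class and its attributes are illustrative, not from gym or this PR.)

```python
class OldAPIEnv:
    """Hypothetical environment using the old (obs, reward, done, info) step API."""

    def __init__(self, max_episode_steps=100):
        self._max_episode_steps = max_episode_steps
        self._elapsed_steps = 0

    def reset(self):
        self._elapsed_steps = 0
        return 0.0

    def step(self, action):
        obs, reward, info = 0.0, 1.0, {}
        self._elapsed_steps += 1
        done = self._elapsed_steps >= self._max_episode_steps
        if done:
            # mark this `done` as a truncation, without going through the TimeLimit wrapper
            info["TimeLimit.truncated"] = True
        return obs, reward, done, info
```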

@arjun-kg (Contributor, Author) replied:

Oh of course, my bad!

#### Motivation
This is done to remove the ambiguity in the `done` signal. `done=True` does not distinguish between the environment terminating and the episode truncating. This problem was avoided previously by setting `info['TimeLimit.truncated']` in case of a time limit through the TimeLimit wrapper. However, this forces the truncation to happen only through a wrapper, and does not accommodate environments which truncate in the core environment.

The main reason we need to distinguish between these two signals is the implication for implementing the temporal-difference (TD) update, where the reward along with the value of the next state(s) is used to update the value of the current state. At terminal states, there is no contribution from the value of the next state(s), and we only consider the value as a function of the reward at that state. However, this is not true when we forcibly truncate the episode. Time limits are often set during training to diversify experience, but the goal of the agent is not to maximize reward over this time period. The `done` signal is often used to control whether the value of the next state is used in this update, so conflating termination and truncation leads to incorrect value estimates.
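A minimal sketch of the update this paragraph describes (a 1-step TD target; the helper and the `next_value` estimate below are illustrative, not part of this PR):

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """1-step TD target under the two-flag API.

    Only a true termination removes the next state's contribution;
    truncation (e.g. a time limit) does not.
    """
    if terminated:
        # terminal state: the target is the reward alone
        return reward
    # non-terminal, including truncated episodes: bootstrap as usual
    return reward + gamma * next_value

# With a single `done` flag, a time-limit done would wrongly skip the bootstrap
# unless info["TimeLimit.truncated"] were checked separately.
```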
A reviewer commented:

The main difference is more high-level than the TD update: it is whether we treat the task as a finite- or infinite-horizon one.
See https://github.com/DLR-RM/stable-baselines3/blob/master/tests/test_gae.py#L136 for a concrete example and test.

@arjun-kg (Contributor, Author) replied:

Thank you for your comment! I thought about it a bit. Here's my understanding -

For infinite horizon tasks, the situation is simple: we always have to set practical time limits, so we need the terminated / truncated distinction here, where basically terminated is always False (and without the distinction, bootstrapping the next state value is skipped, leading to incorrect value estimates; this is also the test you linked).

For finite horizon tasks, it's a little bit trickier; this can be further split into two scenarios:

  1. The agent is aware of the time limit (it is included in the agent's observation): here we need the distinction. Suppose the task's time limit is 1000 steps and the agent is aware of this, the user sets an additional episode time limit of 50 steps while training, and lets it run to the end during eval. This then needs a distinction, since the agent should not optimize for the 50-step truncation. (So at 1000 steps we don't bootstrap, at 50 steps we bootstrap.)
  2. The agent is not aware of the time limit: here the distinction doesn't matter, since the agent is unaware of either time limit, and we can always safely bootstrap.

So it's not just about infinite vs finite horizon, but also about whether the agent is time-aware 🤔 since these lead to different situations. Please correct me if I'm wrong.

For the TD update, my point was to emphasize reward + gamma*v(next_state) (when terminated=False, truncated=True) versus just reward (terminated=True). I think my specific example is restricted to value-based approaches, but a similar bootstrapping happens in almost all of RL (in SB3's on-policy algos, for example, when truncated=True the final reward is bootstrapped with the value of the next state). Maybe I can update the example with this bootstrapping, instead of mentioning it as a TD error update?

Apologies for the random long comment, I hope this explanation is accurate. I will update the docs to be more clear and general.
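(To make the bootstrapping mentioned above concrete, here is a rough sketch of the kind of adjustment an on-policy rollout could apply; it is a simplified illustration with assumed names, not SB3's actual code.)

```python
def adjust_final_reward(reward, next_value, terminated, truncated, gamma=0.99):
    """Fold the bootstrap into the last reward of a rollout when the episode
    was truncated rather than terminated."""
    if truncated and not terminated:
        # the episode was cut short: keep the next state's value in the return
        return reward + gamma * next_value
    # genuine termination (or no episode end): leave the reward untouched
    return reward
```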

The reviewer replied:

where basically terminated is always False

Terminated could be true (for instance in case of a failure), but I would prefer to talk about termination reasons (see my comments in openai/gym#2510 (comment)), as in practice, an episode is terminated if you reach the timeout.

bootstrapping next state value is skipped leading to incorrect value estimates,

yes, unless you pass the remaining time as input (as described in https://arxiv.org/abs/1712.00378, what you call "time-aware" below).
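For reference, a rough sketch of what passing the remaining time as input could look like (the wrapper and names below are illustrative, not taken from the paper or this PR):

```python
import numpy as np

class TimeAwareObs:
    """Hypothetical wrapper that appends the normalized remaining time to the
    observation, so a time-limited task keeps the Markov property."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.t = 0

    def _augment(self, obs):
        remaining = 1.0 - self.t / self.max_episode_steps
        return np.append(obs, remaining)

    def reset(self):
        self.t = 0
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        return self._augment(obs), reward, done, info
```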

Suppose the task's time limit is 1000 steps and the agent is aware of this, the user sets an additional episode time limit

I'm not sure I get the example, in the sense that it does not make sense to me... do you have a concrete example where you want to set two time limits?
In your example, it feels like your new problem (50-step limit) is treated as an infinite horizon problem when you bootstrap (because the agent never sees the real time limit). As in the infinite horizon case, the true horizon depends on the discount factor.

The only case where time-aware and infinite horizon are not equivalent for me is when you want your agent to do something special depending on the time, for instance a jump in the last steps.

The agent is not aware of the time limit: here the distinction doesn't matter, since the agent is unaware of either time limit, and we can always safely bootstrap.

In that case, you are breaking the Markov assumption...

For the TD update,

My point was more to give a high-level reason of why we bootstrap or not, and dive into the details later.

@arjun-kg (Contributor, Author) replied:

Terminated could be true (for instance in case of a failure)

Ohh oops yes, I meant terminated cannot be True due to a time limit here (given this PR's meaning and usage of terminated)

I'm not sure to get the example, in the sense that it does not make sense to me... do you have a concrete example where you want to set two time limits?

I was imagining a scenario where the agent has an age limit that is very long (say 10k steps), which it's aware of, but for the purpose of diversifying experience we wish to restart the episode from when the agent is at t>>0 and let it learn for 50 steps at a time. But it occurs to me that the age would no longer be considered a time limit (since it's no longer 10k steps from the start of the episode), but rather a regular termination condition. So my point is invalid 🤔 I think I also need to think more about this and what 'time limit' means. I'll get back on the issue thread.

In that case, you are breaking Markov assumption...

Right, so if the agent has to optimize for a time limit, it has to be time-aware to satisfy the Markov assumption.

My point was more to give a high-level reason of why we bootstrap or not, and dive into the details later.

Oh okay, that makes sense

@Markus28 (Contributor) commented May 6, 2022

Can you do a corresponding PR in gym-examples once everything is ready?

@trigaten (Collaborator) commented May 7, 2022

Please resolve conflicts

@arjun-kg (Contributor, Author) commented May 7, 2022

@trigaten so there are some things that need to be resolved before this PR; I might need to edit again depending on other reviews. I'll update soon.

@trigaten marked this pull request as draft May 7, 2022 18:32
@pseudo-rnd-thoughts (Member) commented:
@arjun-kg Would you be able to fix the merge conflicts?
From quickly looking through the PR, the backward compatibility section needs to be updated, as does the step compatibility wrapper.
