Improving Documentation for Handling Time Limits #128

Closed
wants to merge 2 commits into from

Conversation

arjun-kg
Contributor

@arjun-kg arjun-kg commented May 6, 2022

A common complaint is the lack of documentation on handling time limits with the current Gym API. This PR tries to address that. I've added this as a separate section since I didn't feel it fit anywhere else, and it seems important enough to have its own section (the closest existing section would be under API, but even that is a long shot). I think in the future we can have tutorials that are not external links, and this could go in there.

There is some RL theory behind this, so it needs careful reviewing, especially the terminology, which has had conflicting meanings.

@vercel

vercel bot commented May 6, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Updated |
| --- | --- | --- | --- |
| gym-docs | ✅ Ready (Inspect) | Visit Preview | May 31, 2022 at 7:11PM (UTC) |

@pseudo-rnd-thoughts
Member

@arjun-kg I would like to clear up some of the text, but overall I don't have any major issues with the sections. I will message you on Discord about it.

@araffin

araffin commented May 9, 2022

tagging myself for a review (if I have time this week, otherwise next week)

@pseudo-rnd-thoughts
Member

@arjun-kg Do you have any more changes to this PR?
@araffin I forgot to poke you earlier; would you be able to look at this when you have time?

@arjun-kg
Contributor Author

> @arjun-kg Do you have any more changes to this PR?

Just the math rendering, and maybe adding an image explanation. I'll look at it again after the core text is reviewed, since the content could potentially change.

@araffin

araffin commented May 28, 2022

> @arjun-kg Do you have any more changes to this PR?
> @araffin I forgot to poke you earlier; would you be able to look at this when you have time?

Thanks for pinging me. Unfortunately I had no time because of ICRA tutorial preparation (ICRA was this week), and now I'm sick :/
I hope to be able to do a review by the end of this week.

@@ -0,0 +1,49 @@
# Handling Time Limits
In using Gym environments with reinforcement learning code, a common problem observed is how time limits are incorrectly handled. The `done` signal received from `env.step` indicates whether an episode has ended. However, this signal does not distinguish between the two different reasons why an episode can end - `termination` and `truncation`.

I would probably rephrase the intro paragraph as "Terminations due to time limits must be handled with care, as it is common to overlook that particular case, which violates a core RL assumption (the Markov property) and may hinder performance" (i.e. start with the subject to make it clearer, not with "gym with RL code").

I wouldn't say there are "two different reasons"; it's more that the signal doesn't tell you whether the termination was due to a timeout/time limit or not.


### Termination

I'm still not a big fan of opposing "termination" vs "truncation/time limit", because a timeout is a termination in practice (even though you treat it differently in an infinite-horizon problem).
I would rather frame it as "truncation/termination due to a time limit" vs "other terminations".

Contributor Author

About the terminology: unless I'm mistaken, there is no clear literature defining termination and truncation. But I've seen "terminal states" used specifically to mean special states after which the episode ends, in the sense that this is built into the MDP. Sutton and Barto, for example, say "Note that the value of the terminal state, if any, is always zero." and use terminal states in this sense throughout the book. That property does not hold if we also call the last state after a timeout in an infinite-horizon problem a terminal state.

The term termination is an extrapolation from this: "If and only if an environment reaches a terminal state, it is said to have terminated."

In all other cases, we say the episode is truncated. We want to make this distinction explicitly, since it also makes a big difference for bootstrapping.


I think they also used "absorbing terminal state"; I need to double-check.



### Truncation
Truncation refers to the episode ending after an externally defined time-limit. This time-limit is not part of the environment definition, and its sole purpose is practicality for the user collecting rollouts of the episode.

Actually, truncation is wider than time limits: for instance, for a robot, truncation could also be going out of the tracking boundaries (e.g. quadruped locomotion).

But the treatment should be the same; in that example, you need to stop the episode (because you need to track your robot to compute reward/safety) but not punish the robot for it.

"is not part of the env definition" sounds weird, because you actually usually define it there (in the `register()` method; see the sketch after this list).

"sole purpose is practicality for the user collecting rollouts of the episode." -> this is quite vague...
And there are different reasons for timeouts:

  • more diverse data collection (with different starting states)
  • avoiding infinite loops (for instance when using episodic RL or a population-based method)
  • data at the beginning of an episode may be different from data later on (for instance the first lap for a racing car vs subsequent laps)
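
As a rough illustration of the `register()` point above (a minimal sketch; the environment id and entry point are hypothetical, only `CartPole-v1` is a real registered env), the time limit is usually attached when the environment is registered, or by wrapping it explicitly:

```python
import gym
from gym.envs.registration import register
from gym.wrappers import TimeLimit

# Hypothetical environment: `max_episode_steps` makes gym.make() wrap it in TimeLimit.
register(
    id="MyRobotEnv-v0",                        # hypothetical id
    entry_point="my_package.envs:MyRobotEnv",  # hypothetical entry point
    max_episode_steps=1000,
)

# Or wrap an existing environment explicitly with a custom limit.
env = TimeLimit(gym.make("CartPole-v1"), max_episode_steps=200)
```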

Contributor Author

> for a robot, truncation would also be going out of the tracking boundaries (e.g. quadruped locomotion).

I had never considered this; I will definitely add it.


well, you don't think about it until you apply RL directly on a real robot ;)


An infinite-horizon environment is an obvious example where this is needed. We cannot wait forever for the episode to complete, so we set a practical time-limit after which we forcibly halt the episode. The last state in this case is not a terminal state since it has a non-zero transition probability of moving to another state as per the environment definition. This is also different from time-limits in finite horizon environments as the agent in this case has no idea about this time-limit.

### Importance in learning code

I'm not sure where to put it, but to understand the issue for value estimation, I usually give the example of a cyclic task (for example locomotion) where we try to estimate V(s_0) (the initial state) and V(s_T) (the terminal state, here due to timeout only), but where s_0 = s_T.

With a naive implementation (see the sketch below):

  • V(s_T) = r_T (terminal reward)
  • V(s_0) = discounted sum of rewards

But s_0 = s_T, so we have a problem (which is where it hinders performance, in addition to breaking the assumption).
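
To make that concrete, here is a small numeric sketch (hypothetical numbers, not from the PR) of the inconsistency: in a cyclic task with constant reward, treating the timeout state as terminal gives it a value wildly different from the identical initial state.

```python
# Hypothetical cyclic task: constant reward r per step, s_T = s_0 after a timeout.
gamma, r = 0.99, 1.0

# True value of any state in the cycle (infinite discounted sum of rewards).
v_true = r / (1 - gamma)                  # = 100.0

# Naive target for the last transition, treating the timeout as a terminal state:
naive_target = r                          # = 1.0, although s_T equals s_0 with value ~100

# Time-limit-aware target: keep bootstrapping through the truncation.
bootstrapped_target = r + gamma * v_true  # = 100.0, consistent with v_true
```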

In classical RL, the new `Q` estimate is a weighted average of previous `Q` estimate and `Q_target` while in Deep Q-Learning, the error between `Q_target` and previous `Q` estimate is minimized.

However, at the terminal state, bootstrapping is not done,

maybe explain what bootstrapping is?


By doing so, the solution to the problem should become obvious: "the value function tells you how much return (discounted sum of rewards) you should expect if you follow the current policy starting from that state", and "even if the episode stops here, by bootstrapping (adding the potential future return to the current reward) the agent looks into the future as if it would continue the episode...".

Currently, gym supplies truncation information through the TimeLimit wrapper which adds `TimeLimit.truncated` key to `info` which is returned by `env.step`. The correct way to handle terminations and truncations now would be,

```python
terminated = done and 'TimeLimit.truncated' not in info
```

Actually, this is not correct.
If you look at the TimeLimit wrapper code, you should do:

```python
# should_bootstrap could be renamed to `next_transition_valid` or equivalent
should_bootstrap = not done or info.get("TimeLimit.truncated", False)
rew + gamma * should_bootstrap * vf_next_state
```

Btw, TimeLimit needs to be patched in order to allow envs to define TimeLimit.truncated themselves.
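
In other words (a sketch based on the suggestion above, assuming `reward`, `gamma` and `vf_next_state`, the value estimate for the next observation, come from the surrounding training loop), the doc's snippet could be corrected along these lines:

```python
# `done` and `info` come from: obs, reward, done, info = env.step(action)
truncated = info.get("TimeLimit.truncated", False)
terminated = done and not truncated

# Bootstrap unless the episode truly terminated
# (i.e. it is still running, or it was only cut short by the time limit).
should_bootstrap = not terminated            # same as: not done or truncated
td_target = reward + gamma * should_bootstrap * vf_next_state
```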

Contributor Author

Ooh you're right


@araffin araffin left a comment


Thanks for the PR =), definitely a missing piece of the Gym docs.
I guess it is a good starting point for discussions on how to present it.

@arjun-kg
Contributor Author

> Thanks for the PR =), definitely a missing piece of the Gym docs. I guess it is a good starting point for discussions on how to present it.

Thanks for your review! I will also try to get other people to look at it.
