Improving Documentation for Handling Time Limits #128
Conversation
@arjun-kg I would like to clear up some of the text but overall I don't have any major issues with the sections. I will message on discord about it
Tagging myself for a review (if I have time this week, otherwise next week)
Just the math rendering and maybe adding an image explanation. I'll look at it again after the core text is reviewed since the content could potentially change
Thanks for pinging me, I unfortunately had no time with ICRA tutorial preparation (ICRA was this week), and now I'm sick :/
@@ -0,0 +1,49 @@
# Handling Time Limits
In using Gym environments with reinforcement learning code, a common problem observed is how time limits are incorrectly handled. The `done` signal received from `env.step` indicates whether an episode has ended. However, this signal does not distinguish between the two different reasons why an episode can end - `termination` and `truncation`.
I would probably rephrase the intro paragraph as "Terminations due to time limits must be handled with care, as it is common to overlook that particular case, which violates a core RL assumption (the Markov property) and may hinder performance" (i.e. start with the subject to make it clearer, not with "Gym with RL code").
I wouldn't say there are "two different reasons", but rather "it doesn't tell whether the termination was due to a timeout/time limit or not".
### Termination
I'm still not a big fan of opposing "termination" vs "truncation/timelimit", because a timeout is a termination in practice (even though you treat it differently in infinite-horizon problems).
I would rather treat it as "truncation/termination due to time limit" vs "other terminations".
About the terminology: unless I'm mistaken, there is no clear literature on definitions of termination and truncation. But I've seen "terminal states" used specifically to mean 'special states' after which the episode ends, in the sense that this is built into the MDP. Sutton and Barto, for example, say "Note that the value of the terminal state, if any, is always zero." and use terminal states in this sense throughout the book. This would not hold if we also called the last state after a timeout in an infinite-horizon problem a terminal state.
The term `termination` is an extrapolation from this: if and only if an environment reaches a terminal state, it is said to have `terminated`. For all other cases, we say it is `truncated`. We want to make this distinction specifically because it also makes a big difference for bootstrapping.
I think they used "absorbing terminal state" too, need to double-check.
### Truncation
Truncation refers to the episode ending after an externally defined time-limit. This time-limit is not part of the environment definition, and its sole purpose is practicality for the user collecting rollouts of the episode.
Actually, truncation is wider than time limits; for instance, for a robot, truncation would also be going out of the tracking boundaries (e.g. quadruped locomotion). But the treatment should be the same: in that example, you need to stop the episode (because you need to track your robot to compute reward/safety) but not punish the robot for it.
"is not part of the env definition" sounds weird because you actually usually define it there (in the `register()` method).
"sole purpose is practicality for the user collecting rollouts of the episode." -> this is quite vague...
And there are different reasons for timeouts:
- more diverse data collection (with different starting states)
- avoiding infinite loops (for instance when using episodic RL or population-based methods)
- data at the beginning of an episode may differ from data later on (for instance the first lap for a racing car vs subsequent laps)
> for a robot, truncation would also be going out of the tracking boundaries (e.g. quadruped locomotion).

I had never considered this, I will definitely add it.
well, you don't think about it until you apply RL directly on a real robot ;)
An infinite-horizon environment is an obvious example where this is needed. We cannot wait forever for the episode to complete, so we set a practical time-limit after which we forcibly halt the episode. The last state in this case is not a terminal state since it has a non-zero transition probability of moving to another state as per the environment definition. This is also different from time-limits in finite horizon environments as the agent in this case has no idea about this time-limit.
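For illustration, here is a minimal sketch of how such a limit is applied in practice, assuming a pre-0.26 Gym version where `env.step` returns a single `done` flag; `Pendulum-v1` is registered with `max_episode_steps=200`, so `gym.make` already wraps it in the `TimeLimit` wrapper:

```python
import gym

env = gym.make("Pendulum-v1")  # registered with max_episode_steps=200

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())

# Pendulum never terminates on its own, so this episode was truncated:
# on the final step the wrapper sets done=True and adds the key below.
print(info.get("TimeLimit.truncated", False))  # True
```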
### Importance in learning code
I'm not sure where to put it, but to understand the issue for value estimation, I usually give the example of a cyclic task (for example locomotion) where we try to estimate V(s_0) (the initial state) and V(s_T) (the terminal state, here due to timeout only), but where s_0 = s_T.
With a naive implementation:
- V(s_T) = r_T (terminal reward)
- V(s_0) = discounted sum of rewards

But s_0 = s_T, so we have a problem (which is where it hinders performance, in addition to breaking the assumption).
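To make that conflict concrete, here is a small numerical sketch (toy numbers, not taken from the PR) comparing the naive target with the bootstrapped target when s_0 = s_T and the reward is constant:

```python
gamma = 0.99
r = 1.0                    # constant per-step reward in the toy cyclic task
v_true = r / (1 - gamma)   # true infinite-horizon value of s_0 (= s_T): 100.0

# a naive implementation treats the timeout state s_T as terminal:
naive_target = r                          # 1.0
# the bootstrapped target keeps looking past the time limit:
bootstrapped_target = r + gamma * v_true  # 100.0, consistent with v_true

print(naive_target, bootstrapped_target, v_true)
```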
In classical RL, the new `Q` estimate is a weighted average of the previous `Q` estimate and `Q_target`, while in Deep Q-Learning, the error between `Q_target` and the previous `Q` estimate is minimized.
However, at the terminal state, bootstrapping is not done, |
maybe explain what bootstrapping is?
By doing so, the solution to the problem should become obvious ("the value function tells you how much return (discounted sum of rewards) you should expect if you follow the current policy starting from that state" and "even if the episode stops here, by bootstrapping (adding the potential future return to the current reward) the agent looks into the future as if it would continue the episode...").
Currently, gym supplies truncation information through the `TimeLimit` wrapper, which adds a `TimeLimit.truncated` key to the `info` dict returned by `env.step`. The correct way to handle terminations and truncations now would be,
```python
terminated = done and 'TimeLimit.truncated' not in info
```
Actually, this is not correct. If you look at the `TimeLimit` wrapper code, you should do:

```python
# should_bootstrap can be renamed to `next_transition_valid` or equivalent
should_bootstrap = not done or info.get("TimeLimit.truncated", False)
rew + gamma * should_bootstrap * vf_next_state
```

Btw, `TimeLimit` needs to be patched in order to allow envs to define `TimeLimit.truncated` themselves.
Ooh you're right
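As a usage sketch of the corrected check above (variable names are hypothetical, not Gym API), the single `done` flag can be split into the two signals this documentation is trying to distinguish:

```python
def split_done(done, info):
    # Truncation is reported by the TimeLimit wrapper via `info`;
    # anything else that ends the episode is a real termination.
    truncated = done and info.get("TimeLimit.truncated", False)
    terminated = done and not truncated
    return terminated, truncated

# equivalent to the check above: should_bootstrap == not terminated
terminated, truncated = split_done(True, {"TimeLimit.truncated": True})
print(terminated, truncated)  # False True
```

Storing `terminated` (rather than the raw `done`) in the replay buffer keeps the target computation simple, e.g. `rew + gamma * (1 - terminated) * vf_next_state`.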
Thanks for the PR =), definitely a missing piece of the Gym docs.
I guess it is a good starting point for discussions on how to present it.
Thanks for your review! I will also try to get other people to look at it.
A common complaint is the lack of documentation on handling time limits with the current Gym API. This PR tries to address that. I've added this as a separate section since I didn't feel it fit anywhere else, and it might be important enough to have its own section (the closest existing section would be under API, but even that is a long shot). I think in the future we can have tutorials that are not external links, and this can go in there.
There is some RL theory behind this, so it needs careful reviewing, especially the terminology, which has had conflicting meanings.