Improving Documentation for Handling Time Limits #128
@@ -0,0 +1,49 @@
# Handling Time Limits

A common problem observed when using Gym environments with reinforcement learning code is the incorrect handling of time limits. The `done` signal received from `env.step` indicates whether an episode has ended. However, this signal does not distinguish between the two different reasons an episode can end: `termination` and `truncation`.
### Termination
I'm still not a big fan of opposing "termination" vs "truncation/timelimit", because a timeout is a termination in practice (even though you treat it differently in infinite-horizon problems).

About the terminology, unless I'm mistaken there is no clear literature on definitions of termination and truncation. But, I've seen states referred to as

The term

And for all other cases, we say it is

I think they used "absorbing terminal state" too, need to double check.
Termination refers to the episode ending after reaching a terminal state that is part of the environment definition. Examples are: task success, task failure, the robot falling down, etc. Notably, this also includes episodes ending due to a time limit inherent to the environment in finite-horizon settings. Note that to preserve the Markov property, a representation of the remaining time must be present in the agent's observation in finite-horizon environments. [(Reference)](https://arxiv.org/abs/1712.00378)
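As an illustration of the last point, below is a minimal sketch of an observation wrapper that appends the normalized remaining time to the observation. The class name, the `max_episode_steps` argument, and the assumption of a `Box` observation space are illustrative choices, not part of Gym's API.

```python
import numpy as np
import gym
from gym import spaces


class RemainingTimeWrapper(gym.Wrapper):
    """Illustrative sketch: append normalized remaining time to a Box observation."""

    def __init__(self, env, max_episode_steps):
        super().__init__(env)
        self.max_episode_steps = max_episode_steps
        self._elapsed_steps = 0
        # Extend the observation space by one dimension for the time feature.
        low = np.append(env.observation_space.low, 0.0).astype(np.float32)
        high = np.append(env.observation_space.high, 1.0).astype(np.float32)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self._elapsed_steps = 0
        obs = self.env.reset(**kwargs)
        return self._add_time(obs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        return self._add_time(obs), reward, done, info

    def _add_time(self, obs):
        # Fraction of the episode budget remaining, so the state stays Markovian.
        remaining = 1.0 - self._elapsed_steps / self.max_episode_steps
        return np.append(obs, remaining).astype(np.float32)
```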
### Truncation
Truncation refers to the episode ending after an externally defined time limit. This time limit is not part of the environment definition; its sole purpose is practicality for the user collecting rollouts of the episode.
Actually, truncation is wider than timelimit; for instance, for a robot, truncation would also be going out of the tracking boundaries (e.g. quadruped locomotion). But the treatment should be the same: in that example, you need to stop the episode (because you need to track your robot for computing reward/safety) but not punish the robot for that.

"is not part of the env definition" sounds weird because you actually usually define it there (in the

"sole purpose is practicality for the user collecting rollouts of the episode." -> this is quite vague...

I had never considered this, I will definitely add it

well, you don't think about it until you apply RL directly on a real robot ;)
An infinite-horizon environment is an obvious example where this is needed. We cannot wait forever for the episode to complete, so we set a practical time limit after which we forcibly halt the episode. The last state in this case is not a terminal state, since it has a non-zero transition probability of moving to another state as per the environment definition. This is also different from time limits in finite-horizon environments, as the agent in this case has no idea about the time limit.
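For example, an environment with no natural terminal state can be given such a practical time limit with Gym's `TimeLimit` wrapper. The environment id and step count below are illustrative choices:

```python
import gym
from gym.wrappers import TimeLimit

# Pendulum has no natural terminal state, so episodes only end via the time limit.
env = TimeLimit(gym.make("Pendulum-v1").unwrapped, max_episode_steps=200)

obs = env.reset()
done = False
while not done:
    obs, rew, done, info = env.step(env.action_space.sample())

# When the wrapper (and not the environment) ended the episode,
# `info` carries the truncation flag discussed in the Solution section.
print(info.get("TimeLimit.truncated", False))
```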
### Importance in learning code
I'm not sure where to put it, but to understand the issue for value estimation, I usually give the example of a cyclic task (for example locomotion) where we try to estimate V(s_0) (initial state) and V(s_T) (terminal state, here due to timeout only), but where s_0 = s_T. With a naive implementation:

but s_0 = s_T, so we get a problem (which is where it hinders performance, in addition to breaking assumptions).
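For illustration, in the notation used elsewhere on this page, the naive targets being described would be roughly:

```math
V_{target}(s_T) = r_T \qquad \text{vs.} \qquad V_{target}(s_0) = r_0 + \gamma \cdot V(s_1)
```

Since s_0 = s_T, the two targets cannot both be consistent estimates of the same value.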
Bootstrapping (using one or more estimated values of a variable to update estimates of the same variable) is a key aspect of reinforcement learning. A common example of bootstrapping in RL is updating the estimate of the Q-value function:
```math
Q_{target}(o_t, a_t) = r_t + \gamma \cdot \max_{a} Q(o_{t+1}, a)
```
In classical RL, the new `Q` estimate is a weighted average of the previous `Q` estimate and `Q_target`, while in Deep Q-Learning, the error between `Q_target` and the previous `Q` estimate is minimized.
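A minimal tabular sketch of this weighted-average update (the table `Q`, the learning rate `alpha`, and the sizes below are illustrative):

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))  # tabular Q estimates
alpha, gamma = 0.1, 0.99             # learning rate and discount factor

def q_update(s, a, r, s_next):
    # Bootstrapped target, as in the formula above.
    q_target = r + gamma * np.max(Q[s_next])
    # New estimate is a weighted average of the old estimate and the target.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * q_target
```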
However, at the terminal state, bootstrapping is not done:

maybe explain what bootstrapping is?

by doing so, the solution to the problem should become obvious
```math
Q_{target}(o_t, a_t) = r_t
```
This is where the distinction between termination and truncation becomes important. When an episode ends due to termination, we don't bootstrap; when it ends due to truncation, we bootstrap.
While using Gym environments, the `done` signal is frequently used to determine whether to bootstrap or not. However, this is incorrect since it does not differentiate between termination and truncation.
A simple example for value functions is shown below. This is an illustrative example and not part of any specific algorithm.
```python
# INCORRECT
vf_target = rew + gamma * (1 - done) * vf_next_state
```
This is incorrect when an episode ends due to truncation, where bootstrapping needs to happen but it doesn't.
### Solution
Currently, Gym supplies truncation information through the `TimeLimit` wrapper, which adds a `TimeLimit.truncated` key to the `info` dict returned by `env.step`. The correct way to handle terminations and truncations now would be:
```python
terminated = done and not info.get('TimeLimit.truncated', False)

vf_target = rew + gamma * (1 - terminated) * vf_next_state
```
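Putting this together, a data-collection loop could record the `terminated` flag instead of the raw `done`, so the learning code can bootstrap correctly later. The environment id and the plain list used as a buffer are illustrative choices:

```python
import gym

env = gym.make("CartPole-v1")
transitions = []  # (obs, action, reward, next_obs, terminated)

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()
    next_obs, rew, done, info = env.step(action)
    # True only if the environment itself ended the episode,
    # not the TimeLimit wrapper.
    terminated = done and not info.get('TimeLimit.truncated', False)
    transitions.append((obs, action, rew, next_obs, terminated))
    obs = next_obs
```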
I would probably rephrase the intro paragraph as "Terminations due to timelimits must be handled with care, as it is common to overlook that particular case, which violates the RL core assumption (Markov property) and may hinder performance" (so start with the subject to make it clearer, not with "gym with RL code").

I wouldn't say there are "two different reasons", but rather "it doesn't tell if the termination was due to timeout/timelimit or not".