Improving Documentation for Handling Time Limits #128

Closed
wants to merge 2 commits into from
49 changes: 49 additions & 0 deletions docs/source/content/handling_time_limits.md
@@ -0,0 +1,49 @@
# Handling Time Limits
A common problem when using Gym environments with reinforcement learning code is the incorrect handling of time limits. The `done` signal received from `env.step` indicates whether an episode has ended. However, this signal does not distinguish between the two different reasons an episode can end: `termination` and `truncation`.
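A minimal sketch of the ambiguity (illustrative only, not part of this PR; it assumes the old 4-tuple step API and CartPole's registered 200-step limit):

```python
import gym

# `done` is True both when the pole falls over (termination) and when the
# registered 200-step limit is reached (truncation), so the flag alone
# cannot tell the two cases apart.
env = gym.make("CartPole-v0")
obs = env.reset()
done = False
while not done:
    obs, rew, done, info = env.step(env.action_space.sample())
env.close()
```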

I would probably rephrase the intro paragraph as "Terminations due to time limits must be handled with care, as it is common to overlook that particular case, which violates a core RL assumption (the Markov property) and may hinder performance" (so start with the subject to make it clearer, not with "Gym with RL code")

I wouldn't say there are "two different reasons", but rather "it doesn't tell whether the termination was due to a timeout/time limit or not".


### Termination

I'm still not a big fan of opposing "termination" vs "truncation/timelimit", because a timeout is a termination in practice (even though you treat it differently in infinite-horizon problems).
I would rather treat it as "truncation/termination due to a time limit" vs "other terminations".

Contributor Author

About the terminology: unless I'm mistaken, there is no clear literature on the definitions of termination and truncation. But I've seen states referred to as terminal states specifically to mean 'special states' after which the episode ends, in the sense that this is built into the MDP. Sutton and Barto, for example, say "Note that the value of the terminal state, if any, is always zero." and use terminal states in this sense throughout the book. This is not true if we also call the last state after a timeout in an infinite-horizon problem a terminal state.

The term termination is an extrapolation from this, saying that, "If and only if an environment reaches a terminal state, it is said to have terminated".

For all other cases, we say the episode is truncated. We want to make this distinction specifically because it also makes a big difference for bootstrapping.


I think they used "absorbing terminal state" too, need to double-check

Termination refers to the episode ending after reaching a terminal state that is defined as part of the environment definition. Examples include task success, task failure, the robot falling down, etc. Notably, this also includes the episode ending in a finite-horizon environment due to a time limit inherent to the environment. Note that to preserve the Markov property, a representation of the remaining time must be present in the agent's observation in finite-horizon environments. [(Reference)](https://arxiv.org/abs/1712.00378)
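A hedged sketch of that last point (illustrative only, not part of this PR; the wrapper name and normalization are assumptions): append the normalized remaining time to the observation of a finite-horizon environment so that the augmented state stays Markov.

```python
import numpy as np
import gym

class RemainingTimeObservation(gym.Wrapper):  # hypothetical helper name
    """Appends the normalized remaining time to every observation."""

    def __init__(self, env, horizon):
        super().__init__(env)
        self.horizon = horizon
        self._t = 0

    def reset(self, **kwargs):
        self._t = 0
        return self._augment(self.env.reset(**kwargs))

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        self._t += 1
        return self._augment(obs), rew, done, info

    def _augment(self, obs):
        # Remaining time in [0, 1]; observation_space is left untouched to keep the sketch short.
        remaining = (self.horizon - self._t) / self.horizon
        return np.append(obs, remaining)
```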


### Truncation
Truncation refers to the episode ending after an externally defined time limit. This time limit is not part of the environment definition, and its sole purpose is practicality for the user collecting rollouts of the episode.

Actually, truncation is wider than time limits; for instance, for a robot, truncation would also be going out of the tracking boundaries (e.g. quadruped locomotion).

But the treatment should be the same: in that example, you need to stop the episode (because you need to track your robot for computing reward/safety) but not punish the robot for that.

"is not part of the env definition" sounds weird, because you actually usually define it there (in the `register()` method).

"sole purpose is practicality for the user collecting rollouts of the episode." -> this is quite vague...
And there are different reasons for timeouts:

  • more diverse data collection (with different starting states)
  • avoiding infinite loops (for instance when using episodic RL or population-based methods)
  • data at the beginning of an episode may differ from data later on (for instance the first lap for a racing car vs subsequent laps)

Contributor Author

for a robot, truncation would also be going out of the tracking boundaries (e.g. quadruped locomotion).

I had never considered this, I will definitely add it


well, you don't think about it until you apply RL directly on a real robot ;)


An infinite-horizon environment is an obvious example of where this is needed. We cannot wait forever for the episode to complete, so we set a practical time limit after which we forcibly halt the episode. The last state in this case is not a terminal state, since it has a non-zero transition probability of moving to another state as per the environment definition. This is also different from time limits in finite-horizon environments, as the agent here has no knowledge of the time limit.
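A hedged sketch of how such an external limit is usually imposed in Gym (the environment id, entry point, and step caps below are illustrative assumptions, not part of this PR): either wrap the environment in `TimeLimit` directly, or declare `max_episode_steps` at registration time, which applies the wrapper automatically.

```python
import gym
from gym.wrappers import TimeLimit
from gym.envs.registration import register

# Option 1: wrap an already-created environment with an external step cap.
env = TimeLimit(gym.make("MyEnv-v0"), max_episode_steps=1000)  # hypothetical env id

# Option 2: declare the cap at registration; gym then wraps the environment
# in TimeLimit whenever it is created with gym.make.
register(
    id="MyEnvCapped-v0",                  # hypothetical id
    entry_point="my_package.envs:MyEnv",  # hypothetical entry point
    max_episode_steps=1000,
)
```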

### Importance in learning code

I'm not sure where to put it, but to understand the issue for value estimation, I usually give the example of a cyclic task (for example locomotion) where we try to estimate V(s_0) (the initial state) and V(s_T) (the terminal state, here due to timeout only), but where s_0 = s_T.

With a naive implementation:

  • V(s_T) = r_T (terminal reward)
  • V(s_0) = discounted sum of rewards

but s_0 = s_T, so we have a problem (which is where it hinders performance, in addition to breaking the assumption)


Bootstrapping (using one or more estimated values of a variable to update estimates of the same variable) is a key aspect of Reinforcement Learning. A common example of bootstrapping in RL is updating the estimate of the Q-value function,

```math
Q_{target}(o_t, a_t) = r_t + \gamma \cdot \max_{a'} Q(o_{t+1}, a')
```
In classical RL, the new `Q` estimate is a weighted average of the previous `Q` estimate and `Q_target`, while in Deep Q-Learning, the error between `Q_target` and the previous `Q` estimate is minimized.
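A minimal tabular sketch of that weighted-average update (illustrative only, not from this PR; table sizes, step size, and discount are arbitrary assumptions):

```python
import numpy as np

n_states, n_actions = 16, 4        # arbitrary sizes for the sketch
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99           # step size and discount, chosen arbitrarily

def q_update(obs, act, rew, next_obs):
    # Bootstrapped target: immediate reward plus discounted best next-state value.
    q_target = rew + gamma * np.max(Q[next_obs])
    # Classical update: weighted average of the old estimate and the target.
    Q[obs, act] = (1 - alpha) * Q[obs, act] + alpha * q_target
```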

However, at the terminal state, bootstrapping is not done,

maybe explain what bootstrapping is?


By doing so, the solution to the problem should become obvious
("the value function tells you how much return (discounted sum of rewards) you should expect if you follow the current policy starting from that state", and "even though the episode stops here, by bootstrapping (adding the potential future return to the current reward) the agent looks into the future as if it would continue the episode...")


```math
Q_{target}(o_t, a_t) = r_t
```

This is where the distinction between termination and truncation becomes important. When an episode ends due to termination, we don't bootstrap; when it ends due to truncation, we do bootstrap.

When using gym environments, the `done` signal is frequently used to determine whether to bootstrap or not. However, this is incorrect, since it does not differentiate between termination and truncation.

A simple example for value functions is shown below. This is an illustrative example and not part of any specific algorithm.

```python
# INCORRECT
vf_target = rew + gamma * (1 - done) * vf_next_state
```

This is incorrect when an episode ends due to a truncation, where bootstrapping needs to happen but doesn't.

### Solution

Currently, gym supplies truncation information through the `TimeLimit` wrapper, which adds a `TimeLimit.truncated` key to the `info` dict returned by `env.step`. The correct way to handle terminations and truncations now would be,

```python
terminated = done and not info.get('TimeLimit.truncated', False)

vf_target = rew + gamma * (1 - terminated) * vf_next_state
```
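As a usage note (a hedged sketch, not part of this PR; `td_target` is a hypothetical helper name), the same logic can be wrapped in a small function and applied to each transition as it is collected:

```python
def td_target(rew, done, info, next_value, gamma=0.99):
    """One-step value target that bootstraps unless the episode truly terminated."""
    terminated = done and not info.get("TimeLimit.truncated", False)
    return rew + gamma * (1 - terminated) * next_value

# Typical use while stepping the environment (old 4-tuple API):
#   next_obs, rew, done, info = env.step(act)
#   target = td_target(rew, done, info, vf(next_obs))
```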
1 change: 1 addition & 0 deletions docs/source/index.md
@@ -38,6 +38,7 @@ content/spaces
content/vector_api
content/tutorials
content/wrappers
content/handling_time_limits
Github <https://github.com/openai/gym>
```
