From ec007b03a34da435e55710379c6397535c55a251 Mon Sep 17 00:00:00 2001
From: arjun_kg
Date: Fri, 6 May 2022 20:06:46 +0530
Subject: [PATCH 1/2] handling time limits

---
 docs/source/content/handling_time_limits.md | 49 +++++++++++++++++++++
 docs/source/index.md                        |  1 +
 2 files changed, 50 insertions(+)
 create mode 100644 docs/source/content/handling_time_limits.md

diff --git a/docs/source/content/handling_time_limits.md b/docs/source/content/handling_time_limits.md
new file mode 100644
index 00000000..22921830
--- /dev/null
+++ b/docs/source/content/handling_time_limits.md
@@ -0,0 +1,49 @@
+# Handling Time Limits
+A common problem when using Gym environments with reinforcement learning code is the incorrect handling of time limits. The `done` signal received from `env.step` indicates whether an episode has ended. However, this signal does not distinguish between the two different reasons an episode can end: `termination` and `truncation`.
+
+### Termination
+Termination refers to the episode ending after reaching a terminal state that is defined as part of the environment definition. Examples include task success, task failure, and a robot falling over. Notably, this also includes episodes ending in finite-horizon environments due to a time limit inherent to the environment. Note that to preserve the Markov property, a representation of the remaining time must be present in the agent's observation in finite-horizon environments. [(Reference)](https://arxiv.org/abs/1712.00378)
+
+### Truncation
+Truncation refers to the episode ending after an externally defined time limit. This time limit is not part of the environment definition; its sole purpose is practicality for the user collecting rollouts of the episode.
+
+An infinite-horizon environment is an obvious example of where this is needed. We cannot wait forever for the episode to complete, so we set a practical time limit after which we forcibly halt the episode. The last state in this case is not a terminal state, since it has a non-zero transition probability of moving to another state as per the environment definition. This is also different from time limits in finite-horizon environments, as the agent in this case has no knowledge of the time limit.
+
+### Importance in learning code
+
+Bootstrapping (using one or more estimated values of a variable to update estimates of the same variable) is a key aspect of Reinforcement Learning. A common example of bootstrapping in RL is updating the estimate of the Q-value function:
+
+```math
+Q_{target}(o_t, a_t) = r_t + \gamma \cdot \max_{a'} Q(o_{t+1}, a')
+```
+
+In classical RL, the new `Q` estimate is a weighted average of the previous `Q` estimate and `Q_target`, while in Deep Q-Learning, the error between `Q_target` and the previous `Q` estimate is minimized.
+
+However, at the terminal state, bootstrapping is not done:
+
+```math
+Q_{target}(o_t, a_t) = r_t
+```
+
+This is where the distinction between termination and truncation becomes important. When an episode ends due to termination, we don't bootstrap; when it ends due to truncation, we do.
+
+While using gym environments, the `done` signal is frequently used to determine whether to bootstrap or not. However, this is incorrect, since it does not differentiate between termination and truncation.
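+
+To make the bootstrapped update above concrete, below is a minimal sketch of a single tabular Q-learning step. The state and action space sizes, learning rate, and sampled transition are illustrative assumptions, not part of any Gym API.
+
+```python
+import numpy as np
+
+n_states, n_actions = 16, 4
+Q = np.zeros((n_states, n_actions))
+alpha, gamma = 0.1, 0.99  # learning rate and discount factor
+
+# a single sampled transition (s, a, r, s'), assumed to come from a rollout
+s, a, r, s_next = 3, 1, 1.0, 7
+
+q_target = r + gamma * np.max(Q[s_next])            # bootstrapped target
+Q[s, a] = (1 - alpha) * Q[s, a] + alpha * q_target  # weighted-average update
+```
+
+At a terminal state, the `gamma * np.max(Q[s_next])` term would be dropped, as in the second equation above.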
+
+A simple example for value functions is shown below. This is an illustrative example and not part of any specific algorithm.
+
+```python
+# INCORRECT
+vf_target = rew + gamma * (1 - done) * vf_next_state
+```
+
+This is incorrect in the case of an episode ending due to truncation, where bootstrapping needs to happen but doesn't.
+
+### Solution
+
+Currently, gym supplies truncation information through the `TimeLimit` wrapper, which adds the `TimeLimit.truncated` key to the `info` dict returned by `env.step`. The correct way to handle terminations and truncations now would be:
+
+```python
+terminated = done and 'TimeLimit.truncated' not in info
+
+vf_target = rew + gamma * (1 - terminated) * vf_next_state
+```
diff --git a/docs/source/index.md b/docs/source/index.md
index 174a23be..56600a09 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -38,6 +38,7 @@ content/spaces
 content/vector_api
 content/tutorials
 content/wrappers
+content/handling_time_limits
 
 Github
 ```

From e04723e681ec95750c2c8799cbc68851aa81bd19 Mon Sep 17 00:00:00 2001
From: Arjun KG
Date: Wed, 1 Jun 2022 00:40:28 +0530
Subject: [PATCH 2/2] correct mistake with terminated assignment

---
 docs/source/content/handling_time_limits.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/content/handling_time_limits.md b/docs/source/content/handling_time_limits.md
index 22921830..76dc2d54 100644
--- a/docs/source/content/handling_time_limits.md
+++ b/docs/source/content/handling_time_limits.md
@@ -43,7 +43,7 @@ This is incorrect in the case of an episode ending due to truncation, where boots
 Currently, gym supplies truncation information through the `TimeLimit` wrapper, which adds the `TimeLimit.truncated` key to the `info` dict returned by `env.step`. The correct way to handle terminations and truncations now would be:
 
 ```python
-terminated = done and 'TimeLimit.truncated' not in info
+terminated = done and not info.get('TimeLimit.truncated', False)
 
 vf_target = rew + gamma * (1 - terminated) * vf_next_state
 ```
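+
+A fuller sketch of how this looks inside a rollout loop is given below. It is an illustrative sketch, not a prescribed implementation: the environment name, the random placeholder policy, and the placeholder value function are assumptions, and it targets Gym versions where `env.step` returns a single `done` flag and the `TimeLimit` wrapper writes `TimeLimit.truncated` into `info`.
+
+```python
+import gym
+
+gamma = 0.99
+
+def value(obs):
+    # placeholder value estimate; a real agent would use a learned function
+    return 0.0
+
+env = gym.make("CartPole-v1")  # any environment wrapped with TimeLimit
+obs = env.reset()
+done = False
+while not done:
+    action = env.action_space.sample()  # placeholder random policy
+    next_obs, rew, done, info = env.step(action)
+    # TimeLimit.truncated is True only when the wrapper, not the environment
+    # itself, ended the episode on this step
+    terminated = done and not info.get('TimeLimit.truncated', False)
+    # bootstrap from the next observation unless the episode truly terminated
+    vf_target = rew + gamma * (1 - terminated) * value(next_obs)
+    obs = next_obs
+```
+
+For replay-based algorithms, it is this `terminated` flag, rather than `done`, that should be stored in the buffer, since the stored flag later decides whether the bootstrap term is masked.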