Improving Documentation for Handling Time Limits #128
Conversation
@arjun-kg I would like to clear up some of the text but overall I don't have any major issues with the sections. I will message on discord about it
Tagging myself for a review (if I have time this week, otherwise next week)
Just the math rendering and maybe adding an image explanation. I'll look at it again after the core text is reviewed since the content could potentially change
Thanks for pinging me, I unfortunately had no time with ICRA tutorial preparation (ICRA was this week), and now I'm sick :/
@@ -0,0 +1,49 @@
# Handling Time Limits
In using Gym environments with reinforcement learning code, a common problem observed is how time limits are incorrectly handled. The `done` signal received from `env.step` indicates whether an episode has ended. However, this signal does not distinguish between the two different reasons why an episode can end - `termination` and `truncation`.
I would probably rephrase the intro paragraph as "Terminations due to time limits must be handled with care, as it is common to overlook that particular case, which violates a core RL assumption (the Markov property) and may hinder performance" (i.e. start with the subject to make it clearer, not with "Gym with RL code").
I wouldn't say there are "two different reasons", but rather "it doesn't tell whether the termination was due to a timeout/time limit or not".
### Termination
I'm still not a big fan of opposing "termination" vs "truncation/timelimit", because a timeout is a termination in practice (even though you treat it differently in infinite-horizon problems).
I would rather treat it as "truncation/termination due to time limit" vs "other terminations".
About the terminology: unless I'm mistaken, there is no clear literature on definitions of termination and truncation. But I've seen "terminal states" used specifically to mean 'special states' after which the episode ends, in the sense that this is built into the MDP. Sutton and Barto, for example, say "Note that the value of the terminal state, if any, is always zero." and use terminal states in this sense throughout the book. This would not hold if we also called the last state after a timeout in an infinite-horizon problem a terminal state.
The term `termination` is an extrapolation from this: if and only if an environment reaches a terminal state, it is said to have `terminated`. For all other cases, we say it is `truncated`. We want to make this distinction specifically because it also makes a big difference for bootstrapping.
I think they used "absorbing terminal state" too, need to double-check.
### Truncation
Truncation refers to the episode ending after an externally defined time-limit. This time-limit is not part of the environment definition, and its sole purpose is practicality for the user collecting rollouts of the episode.
Actually, truncation is wider than time limits; for instance, for a robot, truncation would also be going out of the tracking boundaries (e.g. quadruped locomotion). But the treatment should be the same: in that example, you need to stop the episode (because you need to track your robot to compute reward/safety) but not punish the robot for it.
"is not part of the env definition" sounds weird because you actually usually define it there (in the `register()` method).
"sole purpose is practicality for the user collecting rollouts of the episode." -> this is quite vague...
And there are different reasons for timeouts:
- more diverse data collection (with different starting states)
- avoiding infinite loops (for instance when using episodic RL or population-based methods)
- data at the beginning of an episode may differ from data later on (for instance the first lap for a racing car vs subsequent laps)
> for a robot, truncation would also be going out of the tracking boundaries (e.g. quadruped locomotion).

I had never considered this, I will definitely add it.
well, you don't think about it until you apply RL directly on a real robot ;)
An infinite-horizon environment is an obvious example where this is needed. We cannot wait forever for the episode to complete, so we set a practical time-limit after which we forcibly halt the episode. The last state in this case is not a terminal state since it has a non-zero transition probability of moving to another state as per the environment definition. This is also different from time-limits in finite horizon environments as the agent in this case has no idea about this time-limit.
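For illustration, here is a minimal sketch of how such a limit is applied in practice, assuming a pre-0.26 Gym version where `env.step` returns a single `done` flag; `Pendulum-v1` is registered with `max_episode_steps=200`, so `gym.make` already wraps it in the `TimeLimit` wrapper:

```python
import gym

env = gym.make("Pendulum-v1")  # registered with max_episode_steps=200

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())

# Pendulum never terminates on its own, so this episode was truncated:
# on the final step the wrapper sets done=True and adds the key below.
print(info.get("TimeLimit.truncated", False))  # True
```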
### Importance in learning code
I'm not sure where to put it, but to understand the issue for value estimation, I usually give the example of a cyclic task (for example locomotion) where we try to estimate V(s_0) (the initial state) and V(s_T) (the terminal state, here due to timeout only), but where s_0 = s_T.
With a naive implementation:
- V(s_T) = r_T (terminal reward)
- V(s_0) = discounted sum of rewards

But s_0 = s_T, so we have a problem (which is where it hinders performance, in addition to breaking the assumption).
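To make that conflict concrete, here is a small numerical sketch (toy numbers, not taken from the PR) comparing the naive target with the bootstrapped target when s_0 = s_T and the reward is constant:

```python
gamma = 0.99
r = 1.0                    # constant per-step reward in the toy cyclic task
v_true = r / (1 - gamma)   # true infinite-horizon value of s_0 (= s_T): 100.0

# a naive implementation treats the timeout state s_T as terminal:
naive_target = r                          # 1.0
# the bootstrapped target keeps looking past the time limit:
bootstrapped_target = r + gamma * v_true  # 100.0, consistent with v_true

print(naive_target, bootstrapped_target, v_true)
```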
In classical RL, the new `Q` estimate is a weighted average of the previous `Q` estimate and `Q_target`, while in Deep Q-Learning, the error between `Q_target` and the previous `Q` estimate is minimized.
However, at the terminal state, bootstrapping is not done, |
maybe explain what bootstrapping is?
By doing so, the solution to the problem should become obvious ("the value function tells you how much return (discounted sum of rewards) you should expect if you follow the current policy starting from that state" and "even if the episode stops here, by bootstrapping (adding the potential future return to the current reward) the agent looks into the future as if it would continue the episode...").
Currently, gym supplies truncation information through the `TimeLimit` wrapper, which adds a `TimeLimit.truncated` key to the `info` dict returned by `env.step`. The correct way to handle terminations and truncations now would be,
```python
terminated = done and 'TimeLimit.truncated' not in info
```
Actually, this is not correct. If you look at the `TimeLimit` wrapper code, you should do:

```python
# should_bootstrap can be renamed to `next_transition_valid` or equivalent
should_bootstrap = not done or info.get("TimeLimit.truncated", False)
rew + gamma * should_bootstrap * vf_next_state
```

Btw, `TimeLimit` needs to be patched in order to allow envs to define `TimeLimit.truncated` themselves.
Ooh you're right
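As a usage sketch of the corrected check above (variable names are hypothetical, not Gym API), the single `done` flag can be split into the two signals this documentation is trying to distinguish:

```python
def split_done(done, info):
    # Truncation is reported by the TimeLimit wrapper via `info`;
    # anything else that ends the episode is a real termination.
    truncated = done and info.get("TimeLimit.truncated", False)
    terminated = done and not truncated
    return terminated, truncated

# equivalent to the check above: should_bootstrap == not terminated
terminated, truncated = split_done(True, {"TimeLimit.truncated": True})
print(terminated, truncated)  # False True
```

Storing `terminated` (rather than the raw `done`) in the replay buffer keeps the target computation simple, e.g. `rew + gamma * (1 - terminated) * vf_next_state`.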
Thanks for the PR =), definitely a missing piece of the Gym docs.
I guess it is a good starting point for discussions on how to present it.
Thanks for your review! I will also try to get other people to look at it.
A common complaint is the lack of documentation on handling time limits with the current Gym API. This PR tries to address that. I've added this as a separate section since I didn't feel it fit anywhere else, and it might be important enough to have its own section (the closest existing section would be under API, but even that is a long shot). I think in the future we can have tutorials that are not external links, and this can go in there.
There is some RL theory behind this, so it needs careful reviewing, especially the terminology, which has had conflicting meanings.