
Filtering out artificial terminal states #863

Closed
mhtb32 opened this issue May 18, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

mhtb32 commented May 18, 2020

In many gym environments, like MountainCarContinuous, there is an episode step limit. This leads to episode termination before the actual end of the trajectory is reached (which in this case is reaching the top of the hill).

Saving these experiences to the buffer without changing the artificial terminal flags to False (for example, here) leads to incorrect TD errors. I think the agent's prediction of future rewards should still be taken into account when the real end of the trajectory has not been reached yet.
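To make the problem concrete, here is a minimal sketch (hypothetical names, not code from any library) of a one-step TD target, showing how an artificial `done=True` wrongly zeroes out the bootstrap term:

```python
GAMMA = 0.99  # discount factor (illustrative value)

def td_target(reward, next_value, done):
    """One-step TD target: bootstrap from next_value unless the episode ended."""
    return reward + GAMMA * (1.0 - float(done)) * next_value

# With an artificial time-limit termination, done=True drops the bootstrap
# term even though the trajectory did not really end:
truncated_target = td_target(reward=0.0, next_value=10.0, done=True)   # -> 0.0
corrected_target = td_target(reward=0.0, next_value=10.0, done=False)  # -> 9.9
```

The two targets differ by the entire discounted value of the successor state, which is exactly the bias introduced when a time-limit cutoff is stored as a real terminal.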

This is why some implementations, like OpenAI SpinningUp, change these terminal flags before saving the experience, like this:

```python
"""From OpenAI SpinningUp source code"""

# Ignore the "done" signal if it comes from hitting the time
# horizon (that is, when it's an artificial terminal signal
# that isn't based on the agent's state)
d = False if ep_len == max_ep_len else d

# Store experience to replay buffer
replay_buffer.store(o, a, r, o2, d)
```
@araffin araffin added the enhancement New feature or request label May 18, 2020
araffin (Collaborator) commented May 18, 2020

Hello,

thanks for pointing out that problem.
There are different ways of dealing with this problem. One easy way is to add a time feature, as is done in the zoo:

Actually, the right way would be to check for TimeLimit.truncated in the info:
https://github.com/openai/gym/blob/master/gym/wrappers/time_limit.py#L19
it is a recent gym feature.
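For illustration, a defensive sketch of that check (hypothetical helper name; it uses `dict.get` because gym's `TimeLimit` wrapper only sets the key when it actually truncated the episode):

```python
def filter_time_limit_done(done, info):
    """Treat a time-limit truncation as a non-terminal transition.

    gym's TimeLimit wrapper only adds 'TimeLimit.truncated' to `info`
    when it cut the episode short, so default to False when absent.
    """
    return False if info.get("TimeLimit.truncated", False) else done
```

With this sketch, a time-limit cutoff (`done=True`, `info={'TimeLimit.truncated': True}`) is stored as non-terminal, while a genuine termination (`done=True`, empty `info`) is kept as-is.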

mhtb32 (Author) commented May 18, 2020

So if I get it right, this is the right way to filter out artificial terminal flags:

```python
done = False if info['TimeLimit.truncated'] else done
```

Am I right?

And do you have any plan to add this to stable-baselines or stable-baselines3?

araffin (Collaborator) commented May 18, 2020

> Am I right?

Looks good ;)

> And do you have any plan to add this to stable-baselines or stable-baselines3?

Not for now, as the time feature is sufficient and avoids adding extra complexity to the code (it gets a little more complex when using multiple environments).
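The time-feature idea mentioned above can be sketched as a small observation augmenter that appends the fraction of the episode remaining, so the time limit becomes part of the state (a simplified, hypothetical version of the idea behind the zoo's TimeFeatureWrapper, written without gym to stay self-contained):

```python
import numpy as np

class TimeFeature:
    """Append the remaining-time fraction to each observation.

    Simplified sketch: with the time limit visible in the state, a
    time-limit termination is no longer unpredictable for the agent.
    """

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.step_count = 0

    def reset(self, obs):
        self.step_count = 0
        return self._augment(obs)

    def step(self, obs):
        self.step_count += 1
        return self._augment(obs)

    def _augment(self, obs):
        remaining = 1.0 - self.step_count / self.max_steps
        return np.concatenate([obs, [remaining]])
```

At reset the appended feature is 1.0 and it decays linearly to 0.0 as the limit approaches, e.g. 0.9 after one step with `max_steps=10`.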

araffin (Collaborator) commented Sep 30, 2020

I created a branch on SB3, but it is in fact a bit trickier than expected (notably because VecEnv resets automatically): https://github.com/DLR-RM/stable-baselines3/compare/feat/remove-timelimit

For A2C/PPO or any n-step method, we would need to keep track of two types of termination signals...
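A minimal sketch of what keeping "two types of termination signals" could look like in a rollout buffer (hypothetical field names, not the SB3 branch): real terminations stop bootstrapping, time-limit truncations do not.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Rollout:
    """Track real terminations and time-limit truncations separately,
    so n-step returns can bootstrap through artificial terminals while
    still stopping at genuine episode ends."""
    dones: List[bool] = field(default_factory=list)
    truncateds: List[bool] = field(default_factory=list)

    def add(self, done: bool, truncated: bool) -> None:
        self.dones.append(done)
        self.truncateds.append(truncated)

    def bootstrap_mask(self) -> List[float]:
        # 1.0 where bootstrapping is allowed: either the episode did not
        # end at all, or it "ended" only because the time limit was hit.
        return [1.0 if (not d) or t else 0.0
                for d, t in zip(self.dones, self.truncateds)]
```

For a rollout with a running step, a truncated step, and a real terminal step, the mask would be `[1.0, 1.0, 0.0]`.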

Gregwar commented Mar 22, 2022

@araffin what is the status of this?
In off_policy_algorithm.py I see a mention of remove_time_limit_termination, but it looks like dead code to me since it is not used.

araffin (Collaborator) commented Mar 23, 2022

Answered here DLR-RM/stable-baselines3#829
