
Filtering out artificial terminal states #863

Closed
mhtb32 opened this issue May 18, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

mhtb32 commented May 18, 2020

In many gym environments, like MountainCarContinuous, there is an episode step limit. This leads to episode termination before the actual end of the trajectory is reached (which in this case is reaching the top of the hill).

Saving these experiences to the buffer without changing the artificial terminal flags to False (for example, here) leads to incorrect TD errors. I think the agent's prediction of future rewards should still be taken into account when the real end of the trajectory has not been reached yet.
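To make the problem concrete, here is a minimal sketch (hypothetical names, not code from any library) of a one-step TD target, showing how an artificial `done=True` wrongly zeroes out the bootstrap term:

```python
GAMMA = 0.99  # discount factor (illustrative value)

def td_target(reward, next_value, done):
    """One-step TD target: bootstrap from next_value unless the episode ended."""
    return reward + GAMMA * (1.0 - float(done)) * next_value

# With an artificial time-limit termination, done=True drops the bootstrap
# term even though the trajectory did not really end:
truncated_target = td_target(reward=0.0, next_value=10.0, done=True)   # -> 0.0
corrected_target = td_target(reward=0.0, next_value=10.0, done=False)  # -> 9.9
```

The two targets differ by the entire discounted value of the successor state, which is exactly the bias introduced when a time-limit cutoff is stored as a real terminal.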

This is why some implementations, like OpenAI SpinningUp, change these terminal flags before saving the experience, like this:

```python
"""From OpenAI SpinningUp source code"""

# Ignore the "done" signal if it comes from hitting the time
# horizon (that is, when it's an artificial terminal signal
# that isn't based on the agent's state)
d = False if ep_len == max_ep_len else d

# Store experience to replay buffer
replay_buffer.store(o, a, r, o2, d)
```
@araffin araffin added the enhancement New feature or request label May 18, 2020
araffin (Collaborator) commented May 18, 2020

Hello,

thanks for pointing out that problem.
There are different ways of dealing with this problem. One easy way is to add a time feature, as is done in the zoo:

Actually, the right way would be to check for TimeLimit.truncated in the info:
https://github.com/openai/gym/blob/master/gym/wrappers/time_limit.py#L19
it is a recent gym feature.
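For illustration, a defensive sketch of that check (hypothetical helper name; it uses `dict.get` because gym's `TimeLimit` wrapper only sets the key when it actually truncated the episode):

```python
def filter_time_limit_done(done, info):
    """Treat a time-limit truncation as a non-terminal transition.

    gym's TimeLimit wrapper only adds 'TimeLimit.truncated' to `info`
    when it cut the episode short, so default to False when absent.
    """
    return False if info.get("TimeLimit.truncated", False) else done
```

With this sketch, a time-limit cutoff (`done=True`, `info={'TimeLimit.truncated': True}`) is stored as non-terminal, while a genuine termination (`done=True`, empty `info`) is kept as-is.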

mhtb32 (Author) commented May 18, 2020

So if I get it right, this is the right way to filter out artificial terminal flags:

```python
done = False if info['TimeLimit.truncated'] else done
```

Am I right?

And do you have any plan to add this to stable-baselines or stable-baselines3?

araffin (Collaborator) commented May 18, 2020

> Am I right?

Looks good ;)

> And do you have any plan to add this to stable-baselines or stable-baselines3?

Not for now, as the time feature is sufficient and avoids adding extra complexity to the code (it gets a little more complex when using multiple environments).
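The time-feature idea mentioned above can be sketched as a small observation augmenter that appends the fraction of the episode remaining, so the time limit becomes part of the state (a simplified, hypothetical version of the idea behind the zoo's TimeFeatureWrapper, written without gym to stay self-contained):

```python
import numpy as np

class TimeFeature:
    """Append the remaining-time fraction to each observation.

    Simplified sketch: with the time limit visible in the state, a
    time-limit termination is no longer unpredictable for the agent.
    """

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.step_count = 0

    def reset(self, obs):
        self.step_count = 0
        return self._augment(obs)

    def step(self, obs):
        self.step_count += 1
        return self._augment(obs)

    def _augment(self, obs):
        remaining = 1.0 - self.step_count / self.max_steps
        return np.concatenate([obs, [remaining]])
```

At reset the appended feature is 1.0 and it decays linearly to 0.0 as the limit approaches, e.g. 0.9 after one step with `max_steps=10`.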

araffin (Collaborator) commented Sep 30, 2020

I created a branch on SB3, but it is in fact a bit trickier than expected (notably because VecEnv resets automatically): https://github.com/DLR-RM/stable-baselines3/compare/feat/remove-timelimit

For A2C/PPO or any n-step method, we would need to keep track of two types of termination signals...
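A minimal sketch of what keeping "two types of termination signals" could look like in a rollout buffer (hypothetical field names, not the SB3 branch): real terminations stop bootstrapping, time-limit truncations do not.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Rollout:
    """Track real terminations and time-limit truncations separately,
    so n-step returns can bootstrap through artificial terminals while
    still stopping at genuine episode ends."""
    dones: List[bool] = field(default_factory=list)
    truncateds: List[bool] = field(default_factory=list)

    def add(self, done: bool, truncated: bool) -> None:
        self.dones.append(done)
        self.truncateds.append(truncated)

    def bootstrap_mask(self) -> List[float]:
        # 1.0 where bootstrapping is allowed: either the episode did not
        # end at all, or it "ended" only because the time limit was hit.
        return [1.0 if (not d) or t else 0.0
                for d, t in zip(self.dones, self.truncateds)]
```

For a rollout with a running step, a truncated step, and a real terminal step, the mask would be `[1.0, 1.0, 0.0]`.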

Gregwar commented Mar 22, 2022

@araffin what is the status of this?
In off_policy_algorithm.py I see a mention of remove_time_limit_termination, but it looks like dead code to me since it is not used.

araffin (Collaborator) commented Mar 23, 2022

Answered here DLR-RM/stable-baselines3#829
