
[Question] Does PPO handle timeout and bootstrap correctly? #651

Closed
2 tasks done
zhihanyang2022 opened this issue Nov 4, 2021 · 3 comments
Labels: duplicate (This issue or pull request already exists), question (Further information is requested)

Comments

zhihanyang2022 commented Nov 4, 2021

Question

SB3's PPO does not seem to distinguish between done and timeout, and only relies on done flags when computing GAE return:

def compute_returns_and_advantage(self, last_values: th.Tensor, dones: np.ndarray) -> None:

For example, when GAE lambda is set to 1, the comment says that R - V(s) is computed, where R is the discounted return with bootstrapping. What if bootstrapping is not appropriate for certain envs (where done = 1 literally means the episode terminated)?

Also, as the documentation for VecEnv points out, the real next observation is only available in the terminal_observation key of the info dictionary. However, I don't see the real next observation being used when computing the bootstrap.
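For context, the GAE recursion in question can be sketched as follows. This is a simplified NumPy rendition of what compute_returns_and_advantage does, not the exact SB3 implementation; variable names are illustrative. It shows the behavior the question is about: a single episode-boundary flag both cuts the advantage recursion and zeroes the bootstrap, so true terminations and timeouts are treated identically.

```python
import numpy as np

def compute_gae(rewards, values, episode_starts, last_value, last_done,
                gamma=0.99, gae_lambda=0.95):
    """Sketch of GAE as computed from a rollout buffer.

    episode_starts[t] == 1.0 means a new episode began at step t,
    i.e. step t - 1 ended with done=True.
    """
    n = len(rewards)
    advantages = np.zeros(n)
    last_gae = 0.0
    for t in reversed(range(n)):
        if t == n - 1:
            next_non_terminal = 1.0 - last_done
            next_value = last_value
        else:
            next_non_terminal = 1.0 - episode_starts[t + 1]
            next_value = values[t + 1]
        # The same flag zeroes the bootstrap for both true terminations
        # and time-limit truncations -- the distinction the issue raises.
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns
```

With gae_lambda=1 and gamma=1 this reduces to the R - V(s) form mentioned above, where R is the (bootstrapped) sum of future rewards.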

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
@zhihanyang2022 zhihanyang2022 added the question Further information is requested label Nov 4, 2021
@zhihanyang2022 zhihanyang2022 changed the title [Question] Does PPO handle timeout correctly? [Question] Does PPO handle timeout and bootstrap correctly? Nov 4, 2021
@araffin araffin added the duplicate This issue or pull request already exists label Nov 4, 2021
araffin (Member) commented Nov 4, 2021

Hello,
Duplicate of #633

It currently does not (but you can use a TimeFeatureWrapper, as in the RL Zoo), and we would welcome a PR that implements proper handling of timeouts ;)

EDIT: but timeouts are handled properly for off-policy algorithms
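The off-policy recipe referred to here can be sketched like this. This is a hedged simplification, not SB3's exact code: it assumes Gym's TimeLimit wrapper, which sets the "TimeLimit.truncated" info key, and treats a truncation as non-terminal for the one-step TD target.

```python
def td_target(reward, next_value, done, info, gamma=0.99):
    """One-step TD target that only cuts the bootstrap on a *true*
    termination, not on a time-limit truncation.

    For a truncated transition, next_value should be computed from the
    real next observation (available in info["terminal_observation"]
    when using a VecEnv), since the auto-reset observation belongs to
    the next episode.
    """
    truncated = info.get("TimeLimit.truncated", False)
    true_terminal = done and not truncated
    return reward + gamma * next_value * (0.0 if true_terminal else 1.0)
```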

zhihanyang2022 (Author) commented Nov 4, 2021

we would welcome a PR that implements proper handling of timeouts ;)

I'm happy to work on this, as it's related to my current project as well.

Before I proceed, I want to clear up a few things:

  • (I asked this earlier) As the documentation for VecEnv points out, the real next observation is only available in the terminal_observation key of the info dictionary. However, I don't see the real next observation being used for bootstrapping in the current PPO code, which is a bit weird to me.
  • How should one interpret the naming convention _last_episode_starts, and why isn't the name done used instead? I've read Add test for GAE + rename RolloutBuffer.dones for clarification (#375), but I'm still not sure.

araffin (Member) commented Nov 4, 2021

However, I don't see the real observation being used for bootstrapping in the current code for PPO, which is a bit weird to me.

It is not used because we currently don't bootstrap when done=True.

and why isn't the naming done used instead?

Compared to off-policy algorithms, last_episode_start is shifted by one step relative to done (hence the renaming), and we initialize _last_episode_starts to True:

https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/base_class.py#L430
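To make the shift concrete, here is an illustrative trace (not code from the repo): episode_starts[t] is just dones[t - 1] from the previous step, with the very first entry forced to True because the first step of a rollout always begins an episode.

```python
# dones[t] marks the step at which an episode *ended*;
# episode_starts[t] marks the step at which a new one *began*.
dones = [0, 0, 1, 0, 0]
episode_starts = [1] + dones[:-1]
print(episode_starts)  # [1, 0, 0, 1, 0]
```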

PS: I will close this to have all the discussion in #633
