[Question] Does PPO handle timeout and bootstrap correctly? #651
Comments
Hello, it currently does not (but you can use a EDIT: but timeouts are handled properly for off-policy algorithms
I'm happy to work on this, as it's related to my current project as well. Before I proceed, I want to clear up a few things:
it is not used because we don't bootstrap when
compared to off-policy algorithms,
PS: I will close this to have all the discussion in #633.
Question
SB3's PPO does not seem to distinguish between done and timeout, and only relies on done flags when computing GAE return:
stable-baselines3/stable_baselines3/common/buffers.py, line 349 in 2bb4500
For example, when GAE lambda is set to 1, the comment says that `R - V(s)` would be computed, where R is the discounted reward with bootstrap. What if bootstrap is not appropriate for certain envs (`done = 1` means done literally)?

Also, as the documentation for VecEnv points out, the real next observation is only available in the `terminal_observation` key of the `info` dictionary. However, I don't see the real next observation being used in computing the bootstrap.
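To make the distinction concrete, here is a minimal sketch of one common way to treat timeouts in GAE: fold `gamma * V(terminal_observation)` into the reward at every timeout step, then run the usual recursion that masks on `done`. This is not SB3's code. The function name `gae_with_timeouts` and the arguments `timeouts` and `terminal_values` are hypothetical, and the sketch assumes a single environment whose caller already knows which `done` flags came from a time limit and has evaluated the value function on the corresponding `terminal_observation`.

```python
import numpy as np


def gae_with_timeouts(rewards, values, dones, timeouts, terminal_values,
                      last_value, gamma=0.99, gae_lambda=0.95):
    """GAE for one environment, bootstrapping through timeouts (illustrative).

    dones[t]           -- episode ended after step t (termination OR timeout)
    timeouts[t]        -- episode ended after step t because of a time limit
    terminal_values[t] -- V(terminal_observation) at timeout steps (ignored elsewhere)
    last_value         -- V of the observation that follows the final step
    """
    rewards = np.asarray(rewards, dtype=np.float64).copy()
    values = np.asarray(values, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)
    timeouts = np.asarray(timeouts, dtype=bool)
    terminal_values = np.asarray(terminal_values, dtype=np.float64)

    # A timeout is not a real terminal state, so add the discounted value
    # of the true next observation to the reward at those steps.
    rewards[timeouts] += gamma * terminal_values[timeouts]

    n_steps = len(rewards)
    advantages = np.zeros(n_steps, dtype=np.float64)
    last_gae = 0.0
    for t in reversed(range(n_steps)):
        next_value = last_value if t == n_steps - 1 else values[t + 1]
        # Any episode boundary stops the recursion; for timeouts the
        # bootstrap term is already contained in rewards[t].
        non_terminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * non_terminal * last_gae
        advantages[t] = last_gae

    returns = advantages + values
    return advantages, returns


if __name__ == "__main__":
    common = dict(rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0],
                  dones=[False, False, True], last_value=0.0,
                  gamma=0.9, gae_lambda=1.0)
    # Same trajectory, once ending in a true termination, once in a timeout.
    _, ret_term = gae_with_timeouts(timeouts=[False, False, False],
                                    terminal_values=[0.0, 0.0, 0.0], **common)
    _, ret_timeout = gae_with_timeouts(timeouts=[False, False, True],
                                       terminal_values=[0.0, 0.0, 10.0], **common)
    print("true termination:", ret_term)
    print("timeout:         ", ret_timeout)
```

With `gae_lambda=1` the two runs differ only in the last step: the terminating episode gets plain discounted returns, while the timed-out one also receives the bootstrapped value of the real next observation. If `done = 1` always means a true termination in a given env, passing all-`False` `timeouts` reduces this to the `done`-only computation.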