[Feature Request] Proper TimeLimit/Infinite Horizon Handling for On-Policy algorithm #633
Comments
after thinking about it, the fix is maybe quite easy to implement: replace |
Linking as a related paper on this same topic. There are some experiments we could try to replicate (e.g., in Walker2D the results seem to change quite a bit): https://arxiv.org/abs/1712.00378
Hmm, staring at the lines below, I am not sure if this would work out due to GAE: if
stable-baselines3/stable_baselines3/common/buffers.py, lines 380 to 381 in 2bb4500
One alternative could be to also store terminal observations as part of the trajectory with dummy actions, in which case we do not have to resort to new trickery code. This could also be potentially used to clean the naming hassle with dones (e.g., get rid of the |
you are referring to L380 and 381? |
Yup, and I understood your idea. Turns out I was wrong ^^'. I only noticed it now that I tried to type it out.
1) Current setup when termination is encountered in next step
delta = self.rewards[step] +
2) Ideal setup where timeouts are handled correctly (next state is timeout termination)
delta = self.rewards[step] + self.gamma * next_values
3) Changing timeout reward to
|
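For readers following the three cases above, here is a minimal pure-Python sketch of the comparison. It is not the stable-baselines3 code: `compute_gae`, `bootstrap_timeouts`, and `terminal_values` are hypothetical names, and the sketch assumes the hack under discussion amounts to adding `gamma * V(terminal_obs)` to the reward of any step that ended by timeout (the thread text is cut off at that point, so treat this as an assumption).

```python
def compute_gae(rewards, values, episode_starts, last_value, last_done,
                gamma=0.99, gae_lambda=0.95):
    """GAE(lambda) for one environment. episode_starts[t] is True when step t
    begins a new episode, i.e. step t - 1 ended one (for any reason)."""
    n = len(rewards)
    advantages = [0.0] * n
    last_gae_lam = 0.0
    for step in reversed(range(n)):
        if step == n - 1:
            next_non_terminal = 1.0 - float(last_done)
            next_value = last_value
        else:
            next_non_terminal = 1.0 - float(episode_starts[step + 1])
            next_value = values[step + 1]
        # Bootstrapping is masked out whenever the episode ended at this step.
        delta = rewards[step] + gamma * next_value * next_non_terminal - values[step]
        last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam
        advantages[step] = last_gae_lam
    return advantages


def bootstrap_timeouts(rewards, timeouts, terminal_values, gamma=0.99):
    """Assumed form of the hack: fold gamma * V(terminal_obs) into the reward
    of every step that was cut off by a time limit."""
    return [r + gamma * v if t else r
            for r, t, v in zip(rewards, timeouts, terminal_values)]


# Toy rollout: the episode times out at index 1, a new one starts at index 2.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
episode_starts = [True, False, True]
timeouts = [False, True, False]
terminal_values = [0.0, 2.0, 0.0]  # made-up V(terminal_obs), only used on timeout

adv_plain = compute_gae(rewards, values, episode_starts, last_value=0.5, last_done=False)
adv_fixed = compute_gae(bootstrap_timeouts(rewards, timeouts, terminal_values),
                        values, episode_starts, last_value=0.5, last_done=False)

# "Ideal" delta at the timeout step: bootstrap from the terminal observation.
delta_ideal = rewards[1] + 0.99 * terminal_values[1] - values[1]
```

In this toy case the advantage at the timeout step matches the ideal delta once the reward is modified, while the GAE accumulation is still cut at the episode boundary because `next_non_terminal` is 0 there. That only checks this part of the computation; it says nothing about side effects elsewhere in the training loop.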
so the conclusion is that my proposed hack is valid :p? |
Yup, at least in this part of the code ^^. I would still rethink the whole process through carefully, as "hacks" like this often break something (and sadly it is hard to test). |
🚀 Feature
Same as #284 but for on-policy algorithms.
The current workaround is to use a `TimeFeatureWrapper` (cf. zoo).
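To illustrate the idea behind that workaround, here is a rough, dependency-free sketch of a time-feature wrapper. It is not the zoo's `TimeFeatureWrapper`: `TimeFeatureSketch` and `CountingEnv` are hypothetical names, the environment is assumed to expose a simple `reset()` / `step(action)` interface returning `(obs, reward, done, info)`, and the episode length `max_steps` is assumed to be fixed and known.

```python
class CountingEnv:
    """Hypothetical toy environment: the observation is a step counter."""

    def __init__(self):
        self.count = 0

    def reset(self):
        self.count = 0
        return [0.0]

    def step(self, action):
        self.count += 1
        return [float(self.count)], 1.0, False, {}


class TimeFeatureSketch:
    """Appends the normalized remaining time (1.0 at reset, 0.0 at the limit)
    to each observation, so the agent can see a timeout coming."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self._t = 0

    def _augment(self, obs):
        remaining = 1.0 - self._t / self.max_steps
        return list(obs) + [remaining]

    def reset(self):
        self._t = 0
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._t += 1
        return self._augment(obs), reward, done, info


wrapped = TimeFeatureSketch(CountingEnv(), max_steps=4)
first_obs = wrapped.reset()
next_obs, _, _, _ = wrapped.step(action=0)
```

With the remaining time in the observation, a time-limit termination becomes predictable from the state, which is why it can stand in for explicit timeout handling.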