[question] Enabling agents to keep bootstrapping in the last step per episode #1004
Comments
Thanks, setting …
Ah sorry, I was referring to the final comment from araffin in that issue:
There is no support for …
For a longer answer regarding the time feature, you can read araffin/rl-baselines-zoo#79
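(For context on the time-feature trick mentioned above: it amounts to appending the normalized remaining time to every observation, so the policy can "see" how close the timeout is. A minimal sketch of such a wrapper is below; the class name, the fixed max_steps argument, and the normalization are illustrative assumptions, not the exact implementation from the zoo.)

```python
import numpy as np
import gym
from gym import spaces


class TimeFeatureWrapper(gym.Wrapper):
    """Append the normalized remaining time to the observation.

    Sketch only: assumes a 1-D Box observation space and a fixed
    episode length of `max_steps` (both are assumptions made for
    this example).
    """

    def __init__(self, env, max_steps=1000):
        super().__init__(env)
        self._max_steps = max_steps
        self._step = 0
        low = np.append(self.observation_space.low, 0.0)
        high = np.append(self.observation_space.high, 1.0)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self._step = 0
        return self._augment(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._step += 1
        return self._augment(obs), reward, done, info

    def _augment(self, obs):
        # 1.0 at the start of the episode, 0.0 when the time limit is hit.
        remaining = 1.0 - self._step / self._max_steps
        return np.append(obs, remaining).astype(np.float32)
```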
I see. My bad for jumping straight to #863 without noticing "to provide episode time in observations". Yes, that is a simple and necessary trick, especially when the timeout is intrinsic to the system, as described in one of the two cases in the paper Time Limits in Reinforcement Learning. However, in the paper's second case, where the time limit is imposed only to facilitate learning (e.g. to get more diverse trajectories), bootstrapping in the last step is mandated. My case is more like the second one, but it is simpler than those mixed with both env …
The ease of implementing this yourself depends on the algorithm you want to use, as some do not store …
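(Regardless of the algorithm, one practical way to tell a time-limit termination apart from a genuine terminal state is gym's TimeLimit wrapper, which sets info["TimeLimit.truncated"] when it, rather than the environment itself, ends the episode. A rough sketch; the rollout loop and the two lists are illustrative assumptions:)

```python
import gym

# Pendulum-v0 never terminates on its own, so every episode here
# ends via the TimeLimit wrapper that gym.make() registers for it.
env = gym.make("Pendulum-v0")

obs = env.reset()
dones, timeouts = [], []
for _ in range(1000):
    obs, reward, done, info = env.step(env.action_space.sample())
    # TimeLimit sets this flag only when it cut the episode short,
    # so done=True without the flag means a genuine terminal state.
    timeouts.append(info.get("TimeLimit.truncated", False))
    dones.append(done)
    if done:
        obs = env.reset()
```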
Thanks. I am mainly using AC/ACER; according to your hint, should I modify this method by removing the (1. - done) term?
I have little experience with ACER, but that seems to be in the right direction. Sorry, I cannot be of further assistance with this topic, as your guess will probably be more correct than mine :)
It is ok. I have little experience with PPO, so I am trying ACER. Thank you, Miffyli, your comments and quotes are very helpful. I will test it by myself.
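(To make that direction concrete, here is a rough sketch of a discounted-return computation for a single finished episode that keeps bootstrapping when the episode ended only because of the time limit. The function name, the timed_out flag, and last_value are illustrative assumptions; this is not the actual stable-baselines method discussed above.)

```python
import numpy as np


def episode_returns(rewards, timed_out, last_value, gamma=0.99):
    """Discounted returns for one finished episode.

    rewards:    length-T array of rewards for the episode.
    timed_out:  True if the episode ended only because the time limit was hit.
    last_value: V(S^T), the critic's estimate for the final observation.

    A genuine terminal state is treated as absorbing (no value after it),
    while a timeout keeps bootstrapping from V(S^T) -- i.e. the
    (1. - done) masking is dropped for that last step only.
    """
    T = len(rewards)
    returns = np.zeros(T)
    next_return = last_value if timed_out else 0.0
    for t in reversed(range(T)):
        returns[t] = rewards[t] + gamma * next_return
        next_return = returns[t]
    return returns


# Example: a 3-step episode cut off by the time limit.
print(episode_returns(np.array([1.0, 1.0, 1.0]), timed_out=True, last_value=5.0))
```

In a batched rollout this additionally requires keeping the observation (or its value estimate) at the step where the timeout happened, which some implementations do not keep around.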
Closing as now done in SB3.
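(For anyone arriving here after the move to SB3: the general pattern behind that kind of timeout handling is to fold the discounted value of the cut-off state back into the last reward and then treat the step as terminal as usual. The sketch below shows the idea only; the function and argument names are assumptions, not SB3's actual code.)

```python
import numpy as np


def patch_timeout_rewards(rewards, timeouts, terminal_values, gamma=0.99):
    """Add gamma * V(terminal_obs) to the reward of steps that ended only
    due to a time limit, so a plain (1 - done) return computation ends up
    with the same target as explicit bootstrapping: r + gamma * V(S^T)."""
    rewards = np.asarray(rewards, dtype=np.float64).copy()
    for t, cut_short in enumerate(timeouts):
        if cut_short:
            rewards[t] += gamma * terminal_values[t]
    return rewards
```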
I am using stable-baselines 2.10.1 to train AC/ACER agents in a custom environment with a time limit of [0, T] per episode. In the last update of an episode, the value target for the final transition is normally computed as

R_T + gamma * (1 - done) * V(S^T) = R_T (since done = 1),

which treats the state S^T as an absorbing state after which no further value is incurred. In the code, the factor

(1. - done)

is used. However, for my time-limited case the update is expected to be

R_T + gamma * V(S^T).

The training terminates not because a terminal state is reached but because time runs out, so V(S^T) still has a value and the update should keep bootstrapping in this last step.

I skimmed through the source code but neither found this functionality nor figured out where to change it. I was wondering how to enable this?