[Bug] Infinite horizon tasks are handled like episodic tasks #284
Comments
To summarize, so that I understood things right: you have a non-episodic task (never truly "done"), but you use the `TimeLimit` wrapper, which sets `done = True` when the time limit is hit. There should not be a problem with this while using SAC, as long as you always feed in `done = False` for transitions that end only because of the timeout. A more sophisticated solution would indeed be a nice enhancement though, as errors like these are easy to miss. I will mark it as an enhancement for some later version of stable-baselines.
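For illustration, a minimal sketch of what "always feed in `done = False` for timeouts" could look like in a manual collection loop with the classic gym API. The environment name, the commented-out `replay_buffer.add` call, and reliance on gym's `TimeLimit.truncated` info key are assumptions for this sketch, not something prescribed in this thread:

```python
import gym

# gym.make already applies a TimeLimit wrapper to Pendulum (200 steps);
# that wrapper flags artificial terminations via info["TimeLimit.truncated"].
env = gym.make("Pendulum-v0")

obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # stand-in for the SAC policy
    next_obs, reward, done, info = env.step(action)

    # Timeouts are not real terminations: do not let SAC treat them as such.
    timeout = info.get("TimeLimit.truncated", False)
    done_for_buffer = done and not timeout  # store False for pure timeouts

    # replay_buffer.add(obs, action, reward, next_obs, done_for_buffer)  # sketch

    # Still reset the environment on timeout, but never store the
    # (last state -> initial state) transition in the buffer.
    obs = env.reset() if done else next_obs
```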
Yes, your description is accurate. Would you like me to open a PR for this? I'm happy to follow any guidelines you point me to.
I would not start on a PR yet, as a proper implementation (if any) should be discussed first, and I do not see an easy way to incorporate this into the contrib repository either. Worse yet, this will get more complicated once support for n-step updates is included. For now I suggest you modify your own fork of SB3 to support this in the way that suits your experiments.
Okay. Feel free to ping me if you require support with this feature later on.
TimeFeature is one solution, and it is equivalent in performance to specific handling of timeouts. Personally, this is the recommended solution (and you can use its test mode at test time too). Note: the timeout handling is indeed important, see the appendix of https://arxiv.org/abs/2005.05719. Related issues (linking to all relevant issues):
Experimental branch
"I created a branch on SB3 but it in fact a bit more tricky than expected (notably because VecEnv resets automatically): "
you don't need
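As a rough illustration of the time-feature idea recommended above, here is a minimal wrapper that appends the remaining-time fraction to a flat Box observation. This is only a sketch of the principle (sb3-contrib ships a more complete `TimeFeatureWrapper` with a test mode); the class name and the assumption of a Box observation space are illustrative:

```python
import gym
import numpy as np
from gym import spaces


class SimpleTimeFeatureWrapper(gym.Wrapper):
    """Append the remaining-time fraction to the observation (illustrative sketch)."""

    def __init__(self, env, max_steps=1000):
        super().__init__(env)
        self.max_steps = max_steps
        self.step_count = 0
        # Extend the Box observation space with one extra dimension in [0, 1].
        low = np.append(self.observation_space.low, 0.0).astype(np.float32)
        high = np.append(self.observation_space.high, 1.0).astype(np.float32)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self.step_count = 0
        return self._add_time(self.env.reset(**kwargs))

    def step(self, action):
        self.step_count += 1
        obs, reward, done, info = self.env.step(action)
        return self._add_time(obs), reward, done, info

    def _add_time(self, obs):
        # 1.0 at the start of an episode, 0.0 when the time limit is reached.
        remaining = 1.0 - self.step_count / self.max_steps
        return np.append(obs, remaining).astype(np.float32)
```

With the remaining time included in the observation, the timeout termination becomes part of the state, so the usual `done = True` handling no longer aliases states.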
I've implemented correct handling of timeouts in #351. Note: for on-policy algorithms, I would still recommend using the TimeFeature wrapper (the change will be even more tricky there, I think, due to the lambda-return).
Hi,
I wonder how to correctly use SAC with infinite horizon environments. I saw @araffin's answer to hill-a/stable-baselines#776 where he points out that the algorithms are step-based. Our environments could always return `done = False`, but we would then have to reset the environment manually. As a consequence, we would add transitions to the replay buffer going from the last state to the initial state, which is bad. Is the only solution to include a time-feature? That means messing with the `observation_space` size and handling dict spaces correctly, plus explaining what this "time-feature" is in papers. Let me know if I've missed a thread treating this issue already 😄 Greetings!
🐛 Bug / Background
My understanding is that SAC skips the target if `s'` is a terminal state.
In infinite horizon tasks, we wrap our env with `gym.wrappers.TimeLimit`, which sets `done = True` when the maximum episode length is reached. This stops the episode in SAC, and the transition is saved in the replay buffer for learning.
However, according to "Time Limits in Reinforcement Learning" (https://arxiv.org/abs/1712.00378), we should not treat that last state as a "terminal" state, since the termination has nothing to do with the MDP. If we ignore this, we are doing "state aliasing" and violating the Markov property.
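To make the state-aliasing concern concrete, here is a small sketch of the one-step soft target that SAC-style updates bootstrap from, showing how a timeout-induced `done = True` wrongly zeroes out the future return. The numbers and variable names are purely illustrative, not SB3 internals:

```python
def soft_td_target(reward, done, next_q_min, next_log_prob, gamma=0.99, alpha=0.2):
    # One-step SAC-style target:
    #   r + gamma * (1 - done) * (min_i Q_i(s', a') - alpha * log pi(a'|s'))
    # If done=True came only from gym.wrappers.TimeLimit, the (1 - done)
    # factor discards the future return even though the MDP would continue.
    return reward + gamma * (1.0 - float(done)) * (next_q_min - alpha * next_log_prob)


# The same transition, interpreted as a true termination vs. a pure timeout:
as_terminal = soft_td_target(reward=1.0, done=True, next_q_min=10.0, next_log_prob=-1.0)
as_timeout = soft_td_target(reward=1.0, done=False, next_q_min=10.0, next_log_prob=-1.0)
print(as_terminal, as_timeout)  # 1.0 vs. 11.098
```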