[Bug] Infinite horizon tasks are handled like episodic tasks #284
Comments
To summarize, so that I understood things right: you have a non-episodic task (never truly "done"), but you use the `TimeLimit` wrapper, which sets `done = True` when the time limit is hit. There should not be a problem with this while using SAC, as long as you always feed in `done = False` for transitions that end only because of the timeout. A more sophisticated solution would indeed be a nice enhancement though, as errors like these are easy to miss. I will mark it as an enhancement for some later version of stable-baselines.
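For illustration, a minimal sketch of what "always feed in `done = False` for timeouts" could look like in a manual collection loop with the classic gym API. The environment name, the commented-out `replay_buffer.add` call, and reliance on gym's `TimeLimit.truncated` info key are assumptions for this sketch, not something prescribed in this thread:

```python
import gym

# gym.make already applies a TimeLimit wrapper to Pendulum (200 steps);
# that wrapper flags artificial terminations via info["TimeLimit.truncated"].
env = gym.make("Pendulum-v0")

obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # stand-in for the SAC policy
    next_obs, reward, done, info = env.step(action)

    # Timeouts are not real terminations: do not let SAC treat them as such.
    timeout = info.get("TimeLimit.truncated", False)
    done_for_buffer = done and not timeout  # store False for pure timeouts

    # replay_buffer.add(obs, action, reward, next_obs, done_for_buffer)  # sketch

    # Still reset the environment on timeout, but never store the
    # (last state -> initial state) transition in the buffer.
    obs = env.reset() if done else next_obs
```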
Yes, your description is accurate. Would you like me to open a PR for this? I'm happy to follow any guidelines you point me to.
I would not start on a PR yet, as a proper implementation (if any) should be discussed first, and I do not see an easy way to incorporate this into the contrib repository either. Worse yet, this will get more complicated once support for n-step updates is included. For now I suggest you modify your own fork of SB3 to support this in the way that suits your experiments.
Okay. Feel free to ping me if you require support with this feature later on.
TimeFeature is one solution, and it is equivalent in performance to specific handling of timeouts. Personally, this is the recommended solution (and you can use its test mode at test time too). Note: the timeout handling is indeed important, see the appendix of https://arxiv.org/abs/2005.05719. Related issues (linking to all relevant issues):
Experimental branch
"I created a branch on SB3 but it in fact a bit more tricky than expected (notably because VecEnv resets automatically): "
you don't need
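As a rough illustration of the time-feature idea recommended above, here is a minimal wrapper that appends the remaining-time fraction to a flat Box observation. This is only a sketch of the principle (sb3-contrib ships a more complete `TimeFeatureWrapper` with a test mode); the class name and the assumption of a Box observation space are illustrative:

```python
import gym
import numpy as np
from gym import spaces


class SimpleTimeFeatureWrapper(gym.Wrapper):
    """Append the remaining-time fraction to the observation (illustrative sketch)."""

    def __init__(self, env, max_steps=1000):
        super().__init__(env)
        self.max_steps = max_steps
        self.step_count = 0
        # Extend the Box observation space with one extra dimension in [0, 1].
        low = np.append(self.observation_space.low, 0.0).astype(np.float32)
        high = np.append(self.observation_space.high, 1.0).astype(np.float32)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self.step_count = 0
        return self._add_time(self.env.reset(**kwargs))

    def step(self, action):
        self.step_count += 1
        obs, reward, done, info = self.env.step(action)
        return self._add_time(obs), reward, done, info

    def _add_time(self, obs):
        # 1.0 at the start of an episode, 0.0 when the time limit is reached.
        remaining = 1.0 - self.step_count / self.max_steps
        return np.append(obs, remaining).astype(np.float32)
```

With the remaining time included in the observation, the timeout termination becomes part of the state, so the usual `done = True` handling no longer aliases states.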
I've implemented correct handling of timeouts in #351. Note: for on-policy algorithms, I would still recommend using the TimeFeature wrapper (the change will be even more tricky there, I think, due to the lambda-return).
Hi,
I wonder how to correctly use SAC with infinite horizon environments. I saw @araffin's answer to hill-a/stable-baselines#776 where he points out that the algorithms are step-based. Our environments could always return `done = False`, but we would then have to reset the environment manually. As a consequence, we would add transitions to the replay buffer going from the last state to the initial state, which is bad. Is the only solution to include a time-feature? That means messing with the `observation_space` size and handling dict spaces correctly, plus explaining what this "time-feature" is in papers. Let me know if I've missed a thread treating this issue already 😄 Greetings!
🐛 Bug / Background
My understanding is that SAC skips the target if `s'` is a terminal state.
In infinite horizon tasks, we wrap our env with `gym.wrappers.TimeLimit`, which sets `done = True` when the maximum episode length is reached. This stops the episode in SAC, and the transition is saved in the replay buffer for learning.
However, according to "Time Limits in Reinforcement Learning" (https://arxiv.org/abs/1712.00378), we should not treat that last state as a "terminal" state, since the termination has nothing to do with the MDP. If we ignore this, we are doing "state aliasing" and violating the Markov property.
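To make the state-aliasing concern concrete, here is a small sketch of the one-step soft target that SAC-style updates bootstrap from, showing how a timeout-induced `done = True` wrongly zeroes out the future return. The numbers and variable names are purely illustrative, not SB3 internals:

```python
def soft_td_target(reward, done, next_q_min, next_log_prob, gamma=0.99, alpha=0.2):
    # One-step SAC-style target:
    #   r + gamma * (1 - done) * (min_i Q_i(s', a') - alpha * log pi(a'|s'))
    # If done=True came only from gym.wrappers.TimeLimit, the (1 - done)
    # factor discards the future return even though the MDP would continue.
    return reward + gamma * (1.0 - float(done)) * (next_q_min - alpha * next_log_prob)


# The same transition, interpreted as a true termination vs. a pure timeout:
as_terminal = soft_td_target(reward=1.0, done=True, next_q_min=10.0, next_log_prob=-1.0)
as_timeout = soft_td_target(reward=1.0, done=False, next_q_min=10.0, next_log_prob=-1.0)
print(as_terminal, as_timeout)  # 1.0 vs. 11.098
```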