
[question] Enabling agents to keep bootstrapping in the last step per episode #1004

Closed
guoyangqin opened this issue Sep 15, 2020 · 10 comments
Labels: question (Further information is requested)

guoyangqin commented Sep 15, 2020

I am using stable-baselines 2.10.1 to train AC/ACER agents in a custom environment with a time limit [0, T] per episode. In the last update of each episode, the value function is normally updated as

V(S^{T-1}) = r + 0

which treats state S^T as an absorbing state from which no further value accrues. In the code, this is implemented with the (1. - done) factor:

def discount_with_dones(rewards, dones, gamma):
    discounted = []
    ret = 0  # Return: discounted reward
    for reward, done in zip(rewards[::-1], dones[::-1]):
        ret = reward + gamma * ret * (1. - done)  # fixed off by one bug
        discounted.append(ret)
    return discounted[::-1]

However, for my time-limited case, the update should instead be

V(S^{T-1}) = r + gamma * V(S^T)

since training terminates not because a terminal state is reached but because time runs out; V(S^T) still has a meaningful value, so the update should keep bootstrapping through this last step.

I skimmed through the source code but neither found this functionality nor figured out where to modify it. How could I enable this?
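For concreteness, the backward pass I have in mind would look like this sketch (illustrative only, not stable-baselines code; last_value stands for the critic's estimate V(S^T) of the state at which the episode was cut):

def discount_with_timeout_bootstrap(rewards, last_value, gamma):
    # Seed the backward pass with V(S^T) instead of 0, so the last step
    # becomes V(S^{T-1}) = r_{T-1} + gamma * V(S^T) and every earlier step
    # keeps bootstrapping through the time-limit cut.
    discounted = []
    ret = last_value
    for reward in rewards[::-1]:
        ret = reward + gamma * ret
        discounted.append(ret)
    return discounted[::-1]

With rewards = [r_0, ..., r_{T-1}] this reproduces the update above for the final step; an environment that also had genuine terminal states would still need the (1. - done) mask for those.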

Miffyli added the question label on Sep 15, 2020
Miffyli (Collaborator) commented Sep 15, 2020

Related to #863

There is no functionality to support this per se (indicating the episode ended on timeout is not standardized in Gym, although some environments provide this in the info dict). An easy solution for this problem is to provide episode time in observations as suggested in #863.
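For illustration, such a time feature can be added with a small observation wrapper along these lines (a sketch assuming the old gym step API and a 1-D Box observation space; rl-baselines-zoo provides a more complete TimeFeatureWrapper):

import gym
import numpy as np

class TimeFeatureSketch(gym.Wrapper):
    # Appends the normalized remaining time to each observation so the value
    # function can distinguish states that are close to the time limit.
    def __init__(self, env, max_steps):
        super().__init__(env)
        self._max_steps = max_steps
        self._step = 0
        low = np.append(env.observation_space.low, 0.0).astype(np.float32)
        high = np.append(env.observation_space.high, 1.0).astype(np.float32)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self._step = 0
        return self._augment(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._step += 1
        return self._augment(obs), reward, done, info

    def _augment(self, obs):
        remaining = 1.0 - self._step / self._max_steps
        return np.append(obs, remaining).astype(np.float32)

The wrapped environment is then used as usual when constructing the model; no changes to the algorithm itself are needed.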

guoyangqin (Author) commented:

> Related to #863
>
> There is no functionality to support this per se (indicating the episode ended on timeout is not standardized in Gym, although some environments provide this in the info dict). An easy solution for this problem is to provide episode time in observations as suggested in #863.

Thanks, setting info['TimeLimit.truncated'] = True is an elegant solution that won't complicate the code. Could you point me to the source code where stable-baselines processes info['TimeLimit.truncated']? I couldn't find it by searching.

Miffyli (Collaborator) commented Sep 15, 2020

Ah sorry, I was referring to the final comment from araffin in that issue:

> as the time feature is sufficient and avoid including additional complexity in the code (it gets a little more complex when using multiple environments)

There is no support for TimeLimit.truncated in stable-baselines but this would be a good feature for stable-baselines3, given how common it is.
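For reference, gym's TimeLimit wrapper exposes the truncation through the info dict, so the check an implementation would need is tiny (a sketch, not existing stable-baselines code):

def episode_timed_out(done, info):
    # gym.wrappers.TimeLimit sets info["TimeLimit.truncated"] = True when it
    # ends an episode that had not already terminated on its own.
    return bool(done) and info.get("TimeLimit.truncated", False)

A rollout loop could call this per step (and per environment) to decide whether to bootstrap from the value of the final state or to cut the return.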

araffin (Collaborator) commented Sep 15, 2020

For a longer answer regarding the time feature, you can read araffin/rl-baselines-zoo#79

guoyangqin (Author) commented:

I see. My bad, I jumped straight to #863 without noticing "to provide episode time in observations". Yes, that is a simple and necessary trick, especially when the timeout is intrinsic to the system, which is one of the two cases discussed in the paper Time Limits in Reinforcement Learning. However, in the paper's second case, where the time limit is set only to facilitate learning (e.g. to collect more diverse trajectories), bootstrapping in the last step is required.

My case is more like the second one, but it is simpler than settings that mix environment termination with a time limit: the time limit is the only termination signal. So I just need to override the method and drop the * (1. - done) factor. Is there any way for the user to override specific methods in stable-baselines to achieve this? Such as

Miffyli (Collaborator) commented Sep 15, 2020

The ease of implementing this yourself depends on the algorithm you want to use, as some do not store the info dicts or their contents. However, for PPO2 you can modify the code around this point to gather the infos you want and update the return/discount values and dones accordingly.
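As a hedged sketch of the idea (not PPO2's actual code): for steps that were only cut by the time limit, one can fold the bootstrapped value of the final state back into the reward before returns/advantages are computed, which yields r_t + gamma * V(S^T) for those steps. Here terminal_values stands for the critic's estimates of the states at which episodes were cut, which your rollout code would have to provide:

import numpy as np

def patch_truncated_rewards(rewards, dones, infos, terminal_values, gamma):
    # rewards, dones, infos, terminal_values: one entry per collected step.
    # Where the episode ended only because of the time limit, add the
    # discounted critic estimate of the cut state so the later discounting
    # effectively keeps bootstrapping through the truncation.
    rewards = np.array(rewards, dtype=np.float32, copy=True)
    for t, info in enumerate(infos):
        if dones[t] and info.get("TimeLimit.truncated", False):
            rewards[t] += gamma * terminal_values[t]
    return rewards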

guoyangqin (Author) commented:

Thanks. I am mainly using AC/ACER. Following your hint, should I modify this method,

def q_retrace(rewards, dones, q_i, values, rho_i, n_envs, n_steps, gamma):
    """
    Calculates the target Q-retrace
    ...
    """

by removing the (1.0 - done_seq[i]) factor from the following line?

qret = reward_seq[i] + gamma * qret * (1.0 - done_seq[i])
Miffyli (Collaborator) commented Sep 15, 2020

I have little experience with ACER, but that seems to be the right direction. Sorry, I cannot be of further assistance on this topic; your guess will probably be more correct than mine :)

guoyangqin (Author) commented:

That is ok. I have little experience with PPO, so I am trying ACER. Thank you, Miffyli, your comments and quotes are very helpful. I will test it myself.

araffin (Collaborator) commented May 1, 2022

Closing as now done in SB3

araffin closed this as completed on May 1, 2022