
[Feature Request] Proper TimeLimit/Infinite Horizon Handling for On-Policy Algorithms #633

Closed
1 task done
araffin opened this issue Oct 28, 2021 · 6 comments · Fixed by #658
Labels: enhancement (New feature or request), help wanted (Help from contributors is welcomed)

Comments


araffin commented Oct 28, 2021

🚀 Feature

Same as #284 but for on-policy algorithms.
The current workaround is to use a TimeFeatureWrapper (cf. zoo).
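
For reference, a minimal usage sketch of that workaround (assuming the `TimeFeatureWrapper` shipped in sb3-contrib and a standard Gym env; not taken from the zoo configs):

```python
import gym
from sb3_contrib.common.wrappers import TimeFeatureWrapper
from stable_baselines3 import PPO

# The wrapper appends the (normalized) remaining time to the observation,
# so the upcoming truncation becomes observable and the Markov property is preserved.
env = TimeFeatureWrapper(gym.make("Pendulum-v1"))

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```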

### Checklist

  • I have checked that there is no similar issue in the repo (required)
araffin added the enhancement and help wanted labels on Oct 28, 2021

araffin commented Nov 5, 2021

@zhihanyang2022

After thinking about it, the fix may be quite easy to implement: replace `reward` with `reward + gamma * next_value` when there is a timeout.
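
A minimal sketch of how that could look when filling the rollout buffer (not the actual SB3 code; the `TimeLimit.truncated`/`terminal_observation` info keys follow the Gym/Monitor conventions, `obs_to_tensor`/`predict_values` follow SB3's ActorCriticPolicy API, and the exact hook point is an assumption):

```python
import numpy as np
import torch as th


def bootstrap_truncated_rewards(rewards, dones, infos, policy, gamma):
    """Add gamma * V(terminal_obs) to the reward of steps cut short by a TimeLimit."""
    rewards = np.array(rewards, dtype=np.float32)
    for idx, done in enumerate(dones):
        if done and infos[idx].get("TimeLimit.truncated", False):
            terminal_obs = policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
            with th.no_grad():
                terminal_value = policy.predict_values(terminal_obs)
            # reward <- reward + gamma * V(s_terminal): the truncated step is then
            # treated like a non-terminal transition by the return/advantage computation.
            rewards[idx] += gamma * terminal_value.item()
    return rewards
```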


Miffyli commented Nov 5, 2021

Linking a related paper on this topic. There are some experiments we could try to replicate (e.g., in Walker2D the results seem to change quite a bit): https://arxiv.org/abs/1712.00378

> After thinking about it, the fix may be quite easy to implement: replace `reward` with `reward + gamma * next_value` when there is a timeout.

Hmm, staring at the lines below, I am not sure this would work out, because of GAE: if `reward = reward + gamma * next_value`, then the bootstrapping happens on the first line but not on the second. To me it sounds like we need a proper if-else clause somewhere (but then we would need to pass the terminal observation along...).

delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step]
last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam
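
For context, these two lines sit inside the backward pass over the rollout buffer; a simplified, self-contained sketch of that loop (modeled on SB3's `RolloutBuffer.compute_returns_and_advantage`, not the verbatim code):

```python
import numpy as np


def compute_gae(rewards, values, episode_starts, last_values, last_done,
                gamma=0.99, gae_lambda=0.95):
    """Backward GAE pass, simplified from SB3's RolloutBuffer.compute_returns_and_advantage."""
    buffer_size = len(rewards)
    advantages = np.zeros(buffer_size, dtype=np.float32)
    last_gae_lam = 0.0
    for step in reversed(range(buffer_size)):
        if step == buffer_size - 1:
            next_non_terminal = 1.0 - float(last_done)
            next_values = last_values
        else:
            next_non_terminal = 1.0 - episode_starts[step + 1]
            next_values = values[step + 1]
        # The two lines quoted above:
        delta = rewards[step] + gamma * next_values * next_non_terminal - values[step]
        last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam
        advantages[step] = last_gae_lam
    returns = advantages + np.asarray(values, dtype=np.float32)
    return advantages, returns
```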

One alternative could be to also store terminal observations as part of the trajectory, with dummy actions, in which case we would not need any extra trickery in the code. This could also potentially be used to clean up the naming hassle around dones (e.g., get rid of `episode_starts`) and simplify the code.


araffin commented Nov 5, 2021

> then the bootstrapping happens on the first line but not on the second

You are referring to L380 and 381?

Why not? (And I meant to do `reward = reward + gamma * next_value` when filling the buffer, as in https://github.com/leggedrobotics/rsl_rl/blob/master/rsl_rl/algorithms/ppo.py#L108)


Miffyli commented Nov 5, 2021

> You are referring to L380 and 381?

Yup, and I understood your idea. Turns out I was wrong ^^'. I only noticed it now that I tried to type it out.

1) Current setup, when a termination is encountered in the next step:

    delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step]
    last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam

2) Ideal setup, where timeouts are handled correctly (the next state is a timeout termination):

    delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step]  # bootstrap despite the step being done
    last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam  # avoid leaking from the next episode

3) Changing the timeout reward to reward + next_value * gamma:

    delta = (reward + next_value * self.gamma) + self.gamma * next_values * next_non_terminal - self.values[step]  # same as in the example above
    last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam  # avoid leaking from the next episode
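
To spell out why (3) matches (2): at a timeout step `next_non_terminal` is 0, so the `next_values` term vanishes and neither version lets the recursion leak into the next episode. A quick numeric check (illustrative values only):

```python
gamma = 0.99
reward, value, next_value = 1.0, 3.0, 2.5
next_non_terminal = 0.0  # the step ends the episode (TimeLimit truncation)

# (2) Ideal: bootstrap inside delta, no leak into the next episode
delta_ideal = reward + gamma * next_value - value

# (3) Fold the bootstrap into the reward and keep the standard formula
delta_hack = (reward + gamma * next_value) + gamma * next_value * next_non_terminal - value

assert abs(delta_ideal - delta_hack) < 1e-8  # both equal 0.475 here
```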


araffin commented Nov 5, 2021

> Yup, and I understood your idea. Turns out I was wrong ^^'. I only noticed it now that I tried to type it out.

so the conclusion is that my proposed hack is valid :p?


Miffyli commented Nov 5, 2021

> so the conclusion is that my proposed hack is valid :p?

Yup, at least in this part of the code ^^. I would still think the whole process through carefully, as "hacks" like this often break something (and sadly it is hard to test).
