PPO timeout proper handling #198
Proper timeout handling is definitely an interesting issue I want to look into further. I think the best resource is DLR-RM/stable-baselines3#658. Happy to take a look into this together :) If you are submitting a fix PR, I highly suggest doing it just on
Thanks for the link! I'll take a look and see what I can come up with.
After some thinking and sketching on a piece of paper, it seems to me that it could be solved this way (just a proposal for now):

```python
# define buffers here: states, actions, rewards, dones, values, logprobs

# Initial obs and info
state, prev_info = torch.Tensor(envs.reset()).to(device), {}

for step in range(0, args.num_steps):
    with torch.no_grad():
        action, logprob, _, value = agent.get_action_and_value(state)
    next_state, reward, done, info = envs.step(action.cpu().numpy())

    # full transition on step T
    states[step] = state
    actions[step] = action
    rewards[step] = torch.tensor(reward).to(device).view(-1)
    dones[step] = done
    # also PPO stuff for step T
    logprobs[step] = logprob
    values[step] = value.flatten()

    # Here we should check for a timeout on the previous step
    if step > 0 and torch.any(dones[step - 1]):
        # if the previous step was a timeout, then we should
        # 1. set dones[step - 1][env_id] to False, as it was not a real done
        # 2. set values[step][env_id] to V(prev_info['terminal_observation'])
        #    (as it is the real next_state for the previous step)
        for env_id in range(args.num_envs):
            timeout = "TimeLimit.truncated" in prev_info[env_id] and prev_info[env_id]["TimeLimit.truncated"]
            if timeout:
                terminal_state = torch.tensor(prev_info[env_id]["terminal_observation"], device=device)
                # Set done to False as it was a timeout
                dones[step - 1, env_id] = 0
                # Set value to V of terminal_state (not state, as state is the first
                # state after the reset on the previous step)
                values[step, env_id] = agent.get_value(terminal_state).flatten()

    state, prev_info = torch.Tensor(next_state).to(device), info
```

This small rearrangement would, of course, require separately handling the last state after `num_steps`, since we need it to bootstrap the last transition, but it could be done in a similar way. The GAE computation would then not need to change at all! The main point here is that we never use `values[step]` before the GAE computation, so it seems to me that we can safely swap it in place here. Does that sound reasonable to you?
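A rough sketch (my own, not part of the original proposal) of how the last state after `num_steps` could be handled in the same way; it assumes the final transition is bootstrapped from a `next_value` tensor and from the `state` / `prev_info` variables left over after the loop:

```python
with torch.no_grad():
    # Bootstrap value for the observation left over after the rollout loop.
    next_value = agent.get_value(state).flatten()
    for env_id in range(args.num_envs):
        # Same timeout check as inside the loop, applied to the final step.
        if prev_info[env_id].get("TimeLimit.truncated", False):
            terminal_state = torch.tensor(prev_info[env_id]["terminal_observation"], device=device)
            # The last done was a timeout, not a real termination.
            dones[args.num_steps - 1, env_id] = 0
            # Bootstrap from the true terminal observation, not from the reset state.
            next_value[env_id] = agent.get_value(terminal_state).flatten()
```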
Thanks for the sketch. I think it's a little tricky, though, so I wrote down what the variables look like as below (link). I think the correct implementation is consistent with https://github.com/DLR-RM/stable-baselines3/pull/658/files#diff-384b5f21f2bed58d1d6e64da04a42fee52f353fcec38bf410338524336657bd8R205

```python
# Handle timeout by bootstrapping with value function
# see GitHub issue #633
for idx, done_ in enumerate(dones):
    if (
        done_
        and infos[idx].get("terminal_observation") is not None
        and infos[idx].get("TimeLimit.truncated", False)
    ):
        terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
        with th.no_grad():
            terminal_value = self.policy.predict_values(terminal_obs)[0]
        rewards[idx] += self.gamma * terminal_value
```

Is this equivalent to your implementation? I didn't quite get the reason behind

Next week will be pretty busy but will try to respond. CC @araffin
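To spell out the equivalence question a bit (my own worked step, in CleanRL-style GAE notation, not something from this thread): with the reward patched to include the bootstrap term and the done flag left at $d_t = 1$, the truncated step gives

$$
\delta_t = \big(r_t + \gamma V(s_t^{\mathrm{term}})\big) + \gamma\,V(s_{t+1})(1 - d_t) - V(s_t) = r_t + \gamma V(s_t^{\mathrm{term}}) - V(s_t),
\qquad
\hat{A}_t = \delta_t + \gamma\lambda\,(1 - d_t)\,\hat{A}_{t+1} = \delta_t,
$$

i.e. the step is bootstrapped through the time limit while the GAE recursion is still cut at the episode boundary.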
After thinking a bit more, it seems that my implementation is not correct, because it will not correctly account for the done flag in GAE. I don't like the fact that we're implicitly changing the rewards, though, since in other PPO variants they could be used for something else :(

I will still try to explain the thinking behind

We have two variants of value computation for step 1:

Because we will add

Then value estimation will be ok. Actually, you do the same in cleanrl/cleanrl/ppo_continuous_action.py (line 203, commit ee262da).

For now I think your proposal is better; I will try to make a PR & some runs on gym MuJoCo.
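One way to make the "done flag in GAE" point concrete (again my own sketch, not from the thread): with the in-place swap, $d_t$ is set to $0$ and $V(s_{t+1})$ is replaced by $V(s_t^{\mathrm{term}})$, so $\delta_t$ comes out the same as above, but the advantage recursion no longer stops at the truncated step:

$$
\hat{A}_t = \delta_t + \gamma\lambda\,(1 - d_t)\,\hat{A}_{t+1} = \delta_t + \gamma\lambda\,\hat{A}_{t+1},
$$

so advantages computed from the already-reset next episode leak back across the truncation boundary, which keeping $d_t = 1$ and folding the bootstrap into the reward avoids.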
That's a good point. One way we could address it is to create another storage variable.

Also a quick note: the new gym API for truncation is coming soon (openai/gym#2752). Ideally we should be prototyping using the new API.

One issue with the example you provided is storage -
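A minimal sketch of the "another storage variable" idea (the name `bootstrap_values` and the helper are my own placeholders, not anything decided in this thread): keep the stored rewards untouched and accumulate the extra bootstrap term separately, adding it in only when computing advantages.

```python
import torch

# Hypothetical extra buffer: gamma * V(terminal_obs) for steps that ended in a
# truncation, zero everywhere else. Same (num_steps, num_envs) shape as `rewards`.
bootstrap_values = torch.zeros((args.num_steps, args.num_envs), device=device)

def record_truncation(step, env_id, terminal_obs):
    # Call this from the rollout loop whenever env `env_id` is truncated at `step`.
    with torch.no_grad():
        bootstrap_values[step, env_id] = args.gamma * agent.get_value(terminal_obs).flatten()

# At GAE time, feed `rewards + bootstrap_values` into the advantage computation
# instead of `rewards`, so the raw rewards remain available for other uses.
```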
Found this issue as I was wondering whether the isaacgym ppo was handling timeouts; it is not, right? I will check the code more thoroughly tomorrow and try to find a solution for my use case. Additionally, I remember having issues with how isaacgymenvs handles timeouts and resets; I think my env diverges a bit from theirs, so I'm not sure if it is handling their case but not mine.
Yeah, the isaacgym ppo variant does not deal with truncation properly… I sort of lost interest in it because properly handling it did not seem to result in a significant performance difference. See sail-sg/envpool#194 (comment).
@vwxyzjn did you notice any significant overhead from handling truncations when benchmarking?
Not really any significant overhead.
Just chiming in here to agree with this comment in a related thread: I have experienced that proper timeout handling can make quite a difference. Thought I'd share my experience and solution here in case it's useful to others, and to encourage getting it implemented in CleanRL. (In relation to the discussions above, I am not sure about the best way to do this, though.)

Anyway: I have a vectorised environment that I was getting good results with from SB3, but to speed things up I wanted to move both the environment and the RL to GPU, building from CleanRL.

For reference, in my environment the agent can incur a lot of effort penalties early on while it's trying to learn how to get positive rewards, so early on rewards will be mostly negative. My understanding of what was happening here is that without truncation bootstrapping, just giving up and doing nothing becomes a very strong suboptimum, because every now and then the agent still gets a random (from its Markov perspective) treat of reaching the episode truncation, where without bootstrapping the reached state seems to have about zero value, which is much better than the negative values incurred while trying to learn. And the value function learning becomes noisy and slow, because there is this weird zero value that happens unpredictably every now and then. (Apols if this is obvious to others - it took me a while to suss it out.)

I based my bootstrap implementation on the SB3 one, as referenced in Costa's comment above, changing this line:

```python
next_obs, rewards[step], next_done, info = envs.step(action)
```

to:

```python
next_obs, rewards[step], terminations, truncations, info = envs.step(action)
next_done = (terminations | truncations).long()
with torch.no_grad():
    terminal_values = agent.critic(info['final_observation']).squeeze()
rewards[step] = torch.where(
    truncations, rewards[step] + args.gamma * terminal_values, rewards[step])
```

This approach directly modifies the rewards, as discussed in the thread above, but that doesn't matter to me in this case. Note that I am using my own environment here, not an IsaacGym environment, so the
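A caveat for anyone adapting the snippet above to a standard gymnasium vector env rather than a custom GPU env: in the pre-1.0 dict-style vector API, `final_observation` is (as far as I understand it) only populated for the sub-envs that actually finished that step, so a per-env loop along these lines may be needed. Treat this as a sketch; the exact keys and shapes depend on the gymnasium version:

```python
next_obs, reward, terminations, truncations, infos = envs.step(action.cpu().numpy())
rewards[step] = torch.tensor(reward, dtype=torch.float32, device=device).view(-1)
if "final_observation" in infos:
    for env_id, final_obs in enumerate(infos["final_observation"]):
        # Only bootstrap the sub-envs that were truncated (not terminated) this step.
        if truncations[env_id] and final_obs is not None:
            final_obs_t = torch.tensor(final_obs, dtype=torch.float32, device=device)
            with torch.no_grad():
                rewards[step, env_id] += args.gamma * agent.critic(final_obs_t).squeeze()
```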
Hi! I'm a bit puzzled as to how a timeout could be handled correctly in your implementation of PPO (well, this is relevant for all variants). I am especially surprised by envpool, because it seems like they do not return the last real state at all (like gym vec envs do in `info`). Were there any ideas on how to do it right? I'd like to make a PR with that fix, but I haven't figured out how yet. The buffer in PPO is a bit more confusing than in off-policy algos, where there is a clear (s, s', d) thing.
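For context, a small sketch of what "gym vec envs do in `info`" refers to (old gym API, before the terminated/truncated split; details may vary by version): on the step where a sub-env ends, the auto-reset vector env returns the reset observation as `obs` and stashes the true last state in the per-env info dict.

```python
import gym

envs = gym.vector.make("CartPole-v1", num_envs=2)
obs = envs.reset()
for _ in range(500):
    obs, rewards, dones, infos = envs.step(envs.action_space.sample())
    for env_id, done in enumerate(dones):
        if done:
            # `obs[env_id]` is already the first observation of the new episode;
            # the real final state of the old episode only lives in the info dict.
            final_obs = infos[env_id]["terminal_observation"]
            truncated = infos[env_id].get("TimeLimit.truncated", False)
```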