
Fix reward bug in the CleanRL example #53
Merged: 6 commits, Jan 13, 2022
Conversation

vwxyzjn (Collaborator) commented Jan 13, 2022

No description provided.

Trinkle23897 previously approved these changes Jan 13, 2022
vwxyzjn (Collaborator, Author) commented Jan 13, 2022

Feel free to merge :) Thanks.

Trinkle23897 merged commit 50e8689 into sail-sg:master on Jan 13, 2022
rfali commented Aug 5, 2023

@vwxyzjn I am wondering why this change is not part of envpool_ppo_atari.py in the CleanRL repo. Is this bug specific to 0.26+ code, since the example in this repo can handle both legacy gym and 0.26+, but the example in the CleanRL repo can only handle legacy gym (i.e. before 0.26)?

    self.returned_episode_returns = np.zeros(self.num_envs, dtype=np.float32)
    self.returned_episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
    return observations

  def step(self, action):
    observations, rewards, dones, infos = super(RecordEpisodeStatistics,
                                                self).step(action)
-   self.episode_returns += rewards
+   self.episode_returns += infos["reward"]
rfali commented Aug 6, 2023
@vwxyzjn was this the reward bug you were alluding to? Since there was no description in the PR, I am not sure whether this is the bug, or whether the additional code involving self.has_lives is the bug.

vwxyzjn (Collaborator, Author) commented Aug 6, 2023

@rfali this change was part of https://github.com/vwxyzjn/cleanrl/blob/f62ad1f942890db7b752f4ad79b65be53917d3c2/cleanrl/ppo_atari_envpool.py#L99.

Ehh, maybe we are talking about different things, but basically you could have lives==0 while the episode has not terminated yet. This is the reason I used info["terminated"].
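
(For context, a minimal illustrative sketch of that distinction, assuming the gym-style envpool API used in these examples, where step() returns (obs, rewards, dones, infos) and infos carries "lives" and "terminated" arrays; the task id and step count here are arbitrary.)

  import numpy as np
  import envpool

  # Illustrative only: with episodic_life=True, `dones` can fire when a life is
  # lost even though the underlying game is still running, while
  # infos["terminated"] is only set at the true end of the game.
  envs = envpool.make("Breakout-v5", env_type="gym", num_envs=8, episodic_life=True)
  obs = envs.reset()
  for _ in range(1000):
    actions = np.random.randint(envs.action_space.n, size=8)
    obs, rewards, dones, infos = envs.step(actions)
    life_lost_done = dones & (infos["lives"] > 0)         # not a real episode boundary
    true_episode_end = infos["terminated"].astype(bool)   # the signal to reset stats on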

  self.episode_returns *= (1 - dones)
  self.episode_lengths *= (1 - dones)
  all_lives_exhausted = infos["lives"] == 0
  if self.has_lives:
rfali commented Aug 6, 2023

@vwxyzjn does this change have anything to do with gym versioning (or termination)? Is this code

  all_lives_exhausted = infos["lives"] == 0
  if self.has_lives:
    self.episode_returns *= 1 - all_lives_exhausted
    self.episode_lengths *= 1 - all_lives_exhausted
  else:
    self.episode_returns *= 1 - dones
    self.episode_lengths *= 1 - dones

doing the same thing as

  self.episode_returns *= 1 - infos["terminated"]
  self.episode_lengths *= 1 - infos["terminated"]

as in the cleanrl repo's example?

It is not clear to me why the RecordEpisodeStatistics wrapper is different in
https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool.py
and
https://github.com/sail-sg/envpool/blob/main/examples/cleanrl_examples/ppo_atari_envpool.py.

vwxyzjn (Collaborator, Author) replied

No, you should be doing

  self.episode_returns *= 1 - infos["terminated"]
  self.episode_lengths *= 1 - infos["terminated"]
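
(For reference, a minimal sketch of a step() built around that idea; it mirrors the wrapper fragment quoted earlier rather than the exact code in either repo, and assumes numpy is imported as np and that infos exposes "reward" and "terminated" arrays.)

  def step(self, action):
    observations, rewards, dones, infos = super().step(action)
    # Accumulate the raw per-step reward reported by envpool, not the
    # (possibly clipped) reward returned by step().
    self.episode_returns += infos["reward"]
    self.episode_lengths += 1
    terminated = infos["terminated"].astype(bool)
    # Publish finished returns/lengths for envs whose episode really ended.
    self.returned_episode_returns = np.where(
        terminated, self.episode_returns, self.returned_episode_returns)
    self.returned_episode_lengths = np.where(
        terminated, self.episode_lengths, self.returned_episode_lengths)
    # Reset the accumulators only on true termination, not on per-life dones.
    self.episode_returns = self.episode_returns * ~terminated
    self.episode_lengths = self.episode_lengths * ~terminated
    return observations, rewards, dones, infos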

rfali commented Aug 6, 2023

@vwxyzjn @Trinkle23897 I am also curious to know why the RecordEpisodeStatistics wrapper for envpool has

self.episode_returns += infos["reward"]

whereas gym's and gymnasium's RecordEpisodeStatistics and SB3's VecMonitor use

self.episode_returns += reward

in their step() to calculate rewards, especially since this PR is described as fixing a reward bug.

@@ -375,8 +389,12 @@ def get_action_and_value(self, x, action=None):
      ).to(device)

      for idx, d in enumerate(done):
-       if d:
+       if d and info["lives"][idx] == 0:
rfali commented Aug 6, 2023

@vwxyzjn I also don't understand why we check that all lives have been exhausted (info["lives"][idx] == 0) to record episode rewards, even though the episode may not be over (lives > 0), especially since EpisodicLife is applied already. I have also looked over all the cleanrl examples, and this condition is only checked in the cleanrl examples that are for envpool (e.g. the dqn and c51 atari examples don't check for lives, although they use the standard gym RecordEpisodeStatistics wrapper).

vwxyzjn (Collaborator, Author) commented Aug 6, 2023

> @vwxyzjn @Trinkle23897 I am also curious to know why the RecordEpisodeStatistics wrapper for envpool has
>
> self.episode_returns += infos["reward"]
>
> whereas gym's and gymnasium's RecordEpisodeStatistics and SB3's VecMonitor use
>
> self.episode_returns += reward
>
> in their step() to calculate rewards, especially since this PR is described as fixing a reward bug.

It doesn’t look right. rewards is the clipped reward.
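
(To make that concrete, an illustrative sketch; it assumes, as this discussion suggests, that with reward_clip=True the reward returned by step() is the clipped one while infos["reward"] carries the raw per-step game reward. Task id and step count are arbitrary.)

  import numpy as np
  import envpool

  envs = envpool.make("Breakout-v5", env_type="gym", num_envs=4, reward_clip=True)
  envs.reset()
  clipped_return = np.zeros(4, dtype=np.float32)
  raw_return = np.zeros(4, dtype=np.float32)
  for _ in range(500):
    obs, rewards, dones, infos = envs.step(
        np.random.randint(envs.action_space.n, size=4))
    clipped_return += rewards       # what `+= reward` accumulates: sign-clipped
    raw_return += infos["reward"]   # what `+= infos["reward"]` accumulates: game score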

rfali commented Mar 11, 2024

> It doesn’t look right. rewards is the clipped reward.

Hey @vwxyzjn, can you please specify which one is correct for envpool usage (with Atari, and for other envs like Procgen)? The first one is used in envpool (we used it and saw drastically lower scores in ALE), whereas the second one is the original wrapper from gym (see here) and gymnasium (see here):

  1. self.episode_returns += infos["reward"]
  2. self.episode_returns += reward

When you said

> It doesn’t look right. rewards is the clipped reward.

I thought you were saying that using reward is not right (since it is the clipped reward), and hence I also used infos["reward"] when using envpool, but I think option 1 may be incorrect.

What do envpool authors say about this? @Trinkle23897 @vwxyzjn

rfali commented Mar 12, 2024

> Hey @vwxyzjn, can you please specify which one is correct for envpool usage (with Atari, and for other envs like Procgen)? [...]
>
> What do envpool authors say about this? @Trinkle23897 @vwxyzjn

On further testing, I found that self.episode_returns += reward actually lowers the reward values further when used with envpool, so this must be the clipped reward, as @vwxyzjn said. However, I am not sure why this is different from the standard gym or gymnasium way of handling rewards, which is self.episode_returns += reward. I have both setups: in gym/gymnasium, the flow is env creation, episode stats wrapper, and then clipped reward; when I use envpool, I call envpool.make (passing reward_clip as true) and then apply the EpisodeStats wrapper (this is, I believe, what I saw in the cleanrl_ppo_envpool example).
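
(For illustration, a rough sketch of the two orderings being compared; it assumes a gymnasium Atari env with ale-py set up and uses gymnasium's TransformReward for clipping. The point is only where clipping sits relative to the stats wrapper, not the exact CleanRL setup.)

  import numpy as np
  import gymnasium as gym
  import envpool

  # gym/gymnasium setup: the stats wrapper is applied before reward clipping,
  # so it accumulates unclipped rewards and `+= reward` reports the true score.
  env = gym.make("BreakoutNoFrameskip-v4")
  env = gym.wrappers.RecordEpisodeStatistics(env)
  env = gym.wrappers.TransformReward(env, np.sign)  # clipping sits outside the stats wrapper

  # envpool setup: reward_clip=True clips inside the env itself, so any wrapper
  # applied afterwards only ever sees clipped rewards; the raw per-step reward
  # has to be read from infos["reward"] instead.
  envs = envpool.make("Breakout-v5", env_type="gym", num_envs=8, reward_clip=True)
  # envs = RecordEpisodeStatistics(envs)  # the envpool-style wrapper discussed above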
