
Fix reward bug in the CleanRL example #53
Merged: 6 commits, Jan 13, 2022
Conversation

vwxyzjn (Collaborator) commented Jan 13, 2022

No description provided.

Trinkle23897 previously approved these changes Jan 13, 2022
vwxyzjn (Collaborator, Author) commented Jan 13, 2022

Feel free to merge :) Thanks.

Trinkle23897 merged commit 50e8689 into sail-sg:master on Jan 13, 2022
rfali commented Aug 5, 2023

@vwxyzjn I am wondering why this change is not part of envpool_ppo_atari.py in the CleanRL repo. Is this bug specific to 0.26+ code, since the example in this repo can handle both legacy gym and 0.26+, but the example in the CleanRL repo can only handle legacy gym (i.e. before 0.26)?

    self.returned_episode_returns = np.zeros(self.num_envs, dtype=np.float32)
    self.returned_episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
    return observations

  def step(self, action):
    observations, rewards, dones, infos = super(RecordEpisodeStatistics,
                                                self).step(action)
-   self.episode_returns += rewards
+   self.episode_returns += infos["reward"]
rfali commented Aug 6, 2023
@vwxyzjn was this the reward bug you were alluding to? Since there was no description in the PR, I am not sure whether this is the bug, or whether the additional code involving self.has_lives is the bug.

vwxyzjn (Collaborator, Author) commented Aug 6, 2023

@rfali this change was part of https://github.com/vwxyzjn/cleanrl/blob/f62ad1f942890db7b752f4ad79b65be53917d3c2/cleanrl/ppo_atari_envpool.py#L99.

Ehh, maybe we are talking about different things, but basically you could have lives==0 while the episode has not terminated yet. This is the reason I used info["terminated"].
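
(For context, a minimal illustrative sketch of that distinction, assuming the gym-style envpool API used in these examples, where step() returns (obs, rewards, dones, infos) and infos carries "lives" and "terminated" arrays; the task id and step count here are arbitrary.)

  import numpy as np
  import envpool

  # Illustrative only: with episodic_life=True, `dones` can fire when a life is
  # lost even though the underlying game is still running, while
  # infos["terminated"] is only set at the true end of the game.
  envs = envpool.make("Breakout-v5", env_type="gym", num_envs=8, episodic_life=True)
  obs = envs.reset()
  for _ in range(1000):
    actions = np.random.randint(envs.action_space.n, size=8)
    obs, rewards, dones, infos = envs.step(actions)
    life_lost_done = dones & (infos["lives"] > 0)         # not a real episode boundary
    true_episode_end = infos["terminated"].astype(bool)   # the signal to reset stats on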

  self.episode_returns *= (1 - dones)
  self.episode_lengths *= (1 - dones)
  all_lives_exhausted = infos["lives"] == 0
  if self.has_lives:
rfali commented Aug 6, 2023

@vwxyzjn does this change have anything to do with gym versioning (or termination)? Is this code

  all_lives_exhausted = infos["lives"] == 0
  if self.has_lives:
    self.episode_returns *= 1 - all_lives_exhausted
    self.episode_lengths *= 1 - all_lives_exhausted
  else:
    self.episode_returns *= 1 - dones
    self.episode_lengths *= 1 - dones

doing the same thing as

  self.episode_returns *= 1 - infos["terminated"]
  self.episode_lengths *= 1 - infos["terminated"]

as in the cleanrl repo's example?

It is not clear to me why the RecordEpisodeStatistics wrapper is different in
https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool.py
and
https://github.com/sail-sg/envpool/blob/main/examples/cleanrl_examples/ppo_atari_envpool.py.

vwxyzjn (Collaborator, Author) replied

No, you should be doing

  self.episode_returns *= 1 - infos["terminated"]
  self.episode_lengths *= 1 - infos["terminated"]
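
(For reference, a minimal sketch of a step() built around that idea; it mirrors the wrapper fragment quoted earlier rather than the exact code in either repo, and assumes numpy is imported as np and that infos exposes "reward" and "terminated" arrays.)

  def step(self, action):
    observations, rewards, dones, infos = super().step(action)
    # Accumulate the raw per-step reward reported by envpool, not the
    # (possibly clipped) reward returned by step().
    self.episode_returns += infos["reward"]
    self.episode_lengths += 1
    terminated = infos["terminated"].astype(bool)
    # Publish finished returns/lengths for envs whose episode really ended.
    self.returned_episode_returns = np.where(
        terminated, self.episode_returns, self.returned_episode_returns)
    self.returned_episode_lengths = np.where(
        terminated, self.episode_lengths, self.returned_episode_lengths)
    # Reset the accumulators only on true termination, not on per-life dones.
    self.episode_returns = self.episode_returns * ~terminated
    self.episode_lengths = self.episode_lengths * ~terminated
    return observations, rewards, dones, infos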

rfali commented Aug 6, 2023

@vwxyzjn @Trinkle23897 I am also curious to know why the RecordEpisodeStatistics wrapper for envpool has

self.episode_returns += infos["reward"]

whereas gym's and gymnasium's RecordEpisodeStatistics and SB3's VecMonitor use

self.episode_returns += reward

in their step() to calculate rewards, especially since this PR is described as fixing a reward bug.

@@ -375,8 +389,12 @@ def get_action_and_value(self, x, action=None):
      ).to(device)

      for idx, d in enumerate(done):
-       if d:
+       if d and info["lives"][idx] == 0:
rfali commented Aug 6, 2023

@vwxyzjn I also don't understand why we check that all lives have been exhausted (info["lives"][idx] == 0) to record episode rewards, even though the episode may not be over (lives > 0), especially since EpisodicLife is applied already. I have also looked over all the cleanrl examples, and this condition is only checked in the cleanrl examples that are for envpool (e.g. the dqn and c51 atari examples don't check for lives, although they use the standard gym RecordEpisodeStatistics wrapper).

vwxyzjn (Collaborator, Author) commented Aug 6, 2023

> @vwxyzjn @Trinkle23897 I am also curious to know why the RecordEpisodeStatistics wrapper for envpool has
>
> self.episode_returns += infos["reward"]
>
> whereas gym's and gymnasium's RecordEpisodeStatistics and SB3's VecMonitor use
>
> self.episode_returns += reward
>
> in their step() to calculate rewards, especially since this PR is described as fixing a reward bug.

It doesn’t look right. rewards is the clipped reward.
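
(To make that concrete, an illustrative sketch; it assumes, as this discussion suggests, that with reward_clip=True the reward returned by step() is the clipped one while infos["reward"] carries the raw per-step game reward. Task id and step count are arbitrary.)

  import numpy as np
  import envpool

  envs = envpool.make("Breakout-v5", env_type="gym", num_envs=4, reward_clip=True)
  envs.reset()
  clipped_return = np.zeros(4, dtype=np.float32)
  raw_return = np.zeros(4, dtype=np.float32)
  for _ in range(500):
    obs, rewards, dones, infos = envs.step(
        np.random.randint(envs.action_space.n, size=4))
    clipped_return += rewards       # what `+= reward` accumulates: sign-clipped
    raw_return += infos["reward"]   # what `+= infos["reward"]` accumulates: game score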

rfali commented Mar 11, 2024

> It doesn’t look right. rewards is the clipped reward.

Hey @vwxyzjn, can you please specify which one is correct for envpool usage (with Atari, and for other envs like Procgen)? The first one is used in envpool (we used it and saw drastically lower scores in ALE), whereas the second one is the original wrapper from gym (see here) and gymnasium (see here):

  1. self.episode_returns += infos["reward"]
  2. self.episode_returns += reward

When you said

> It doesn’t look right. rewards is the clipped reward.

I thought you were saying that using reward is not right (since it is the clipped reward), and hence I also used infos["reward"] when using envpool, but I think option 1 may be incorrect.

What do envpool authors say about this? @Trinkle23897 @vwxyzjn

rfali commented Mar 12, 2024

> Hey @vwxyzjn, can you please specify which one is correct for envpool usage (with Atari, and for other envs like Procgen)? [...]
>
> What do envpool authors say about this? @Trinkle23897 @vwxyzjn

On further testing, I found that self.episode_returns += reward actually lowers the reward values further when used with envpool, so this must be the clipped reward, as @vwxyzjn said. However, I am not sure why this is different from the standard gym or gymnasium way of handling rewards, which is self.episode_returns += reward. I have both setups: in gym/gymnasium, the flow is env creation, episode stats wrapper, and then clipped reward; when I use envpool, I call envpool.make (passing reward_clip as true) and then apply the EpisodeStats wrapper (this is, I believe, what I saw in the cleanrl_ppo_envpool example).
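
(For illustration, a rough sketch of the two orderings being compared; it assumes a gymnasium Atari env with ale-py set up and uses gymnasium's TransformReward for clipping. The point is only where clipping sits relative to the stats wrapper, not the exact CleanRL setup.)

  import numpy as np
  import gymnasium as gym
  import envpool

  # gym/gymnasium setup: the stats wrapper is applied before reward clipping,
  # so it accumulates unclipped rewards and `+= reward` reports the true score.
  env = gym.make("BreakoutNoFrameskip-v4")
  env = gym.wrappers.RecordEpisodeStatistics(env)
  env = gym.wrappers.TransformReward(env, np.sign)  # clipping sits outside the stats wrapper

  # envpool setup: reward_clip=True clips inside the env itself, so any wrapper
  # applied afterwards only ever sees clipped rewards; the raw per-step reward
  # has to be read from infos["reward"] instead.
  envs = envpool.make("Breakout-v5", env_type="gym", num_envs=8, reward_clip=True)
  # envs = RecordEpisodeStatistics(envs)  # the envpool-style wrapper discussed above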
