-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix reward bug in the CleanRL example #53
Conversation
Feel free to merge :) Thanks. |
@vwxyzjn I am wondering why this change is not part of envpool_ppo_atari.py in the CleanRL repo? Is this bug specific to 0.26+ code (since the example in this repo is able to handle both legacy_gym and 0.26+, but the example in cleanRL repo can only handle legacy_gym i.e. before 0.26) |
self.returned_episode_returns = np.zeros(self.num_envs, dtype=np.float32) | ||
self.returned_episode_lengths = np.zeros(self.num_envs, dtype=np.int32) | ||
return observations | ||
|
||
def step(self, action): | ||
observations, rewards, dones, infos = super(RecordEpisodeStatistics, | ||
self).step(action) | ||
self.episode_returns += rewards | ||
self.episode_returns += infos["reward"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vwxyzjn was this the reward bug you are alluding to? Since there was no description in the PR, I am not sure if this is the bug or if the additional code about self.has_lives
is the bug?
@rfali this change was part of https://github.com/vwxyzjn/cleanrl/blob/f62ad1f942890db7b752f4ad79b65be53917d3c2/cleanrl/ppo_atari_envpool.py#L99. Ehh maybe we are talking about different things but basically you could have lives==0 but the episode has not terminated yet. This is the reason I used the |
self.episode_returns *= (1 - dones) | ||
self.episode_lengths *= (1 - dones) | ||
all_lives_exhausted = infos["lives"] == 0 | ||
if self.has_lives: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vwxyzjn does this change have anything to do with gym versioning (or termination)? Is this code
all_lives_exhausted = infos["lives"] == 0
if self.has_lives:
self.episode_returns *= 1 - all_lives_exhausted
self.episode_lengths *= 1 - all_lives_exhausted
else:
self.episode_returns *= 1 - dones
self.episode_lengths *= 1 - dones
doing the same thing as
self.episode_returns *= 1 - infos["terminated"]
self.episode_lengths *= 1 - infos["terminated"]
as in cleanrl repo's example?
Why the RecordEpisodeStatistics wrapper is different in
https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool.py
and
https://github.com/sail-sg/envpool/blob/main/examples/cleanrl_examples/ppo_atari_envpool.py
is not clear to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No you should be doing
self.episode_returns *= 1 - infos["terminated"]
self.episode_lengths *= 1 - infos["terminated"]
@vwxyzjn @Trinkle23897 I am also curious to know why a RecordEpisodeStatistics wrapper for envpool has
whereas gym, gymnasium's RecordEpisodeStatistics and SB3's VecMonitor use
in their |
@@ -375,8 +389,12 @@ def get_action_and_value(self, x, action=None): | |||
).to(device) | |||
|
|||
for idx, d in enumerate(done): | |||
if d: | |||
if d and info["lives"][idx] == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vwxyzjn I also don't understand why we are checking that all lives have been exhausted info["lives"][idx] == 0:
to calculate episode rewards although the episode may not be over (lives > 0), especially since the EpisodicLife
is applied already. I have also looked over all the cleanrl examples, and this condition is only being checked for cleanrl examples that are for envpool (e.g. the dqn, c51 atari examples don't check for lives, although they use the standard gym RecordEpisodeStatistics wrapper)
It doesn’t look right. rewards is the clipped reward. |
hey @vwxyzjn , can you please specify which one is correct for envpool usage (with Atari and for other envs like Procgen). the first one is used in envpool (and we used it and saw drastically lower scores in ALE) whereas the second one is the original wrapper from gym (see here) and for gymnasium (see here)
When you said
I thought you were saying that using What do envpool authors say about this? @Trinkle23897 @vwxyzjn |
On further testing, I found out that |
No description provided.