BUG: Atari environments do not default to 108K frames (27K steps) per episode #1234

Open · hexonfox opened this issue Aug 1, 2024 · 0 comments · May be fixed by #1235
hexonfox commented Aug 1, 2024

After testing the PPO algorithm across 56 Atari environments, I noticed a discrepancy in some of them. In particular, the mean rewards attained differed from those attained by the PPO implementations in Stable Baselines3 and CleanRL in nine environments. The table below shows these nine environments; five trials were conducted for each (implementation, environment) pair, and an environment-wise one-way ANOVA was then conducted to determine the effect of implementation source on mean reward. With respect to Baselines (not the 108K variant), the implementation means are significantly different.

[Screenshot: table of mean rewards and ANOVA results for the nine environments]
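For reference, the sketch below (not from the repository) shows how such an environment-wise one-way ANOVA over five trials per implementation could be computed with SciPy; the reward values are hypothetical placeholders, not the results reported above.

```python
# Hedged sketch: one-way ANOVA over five trials per implementation for one environment.
# The reward arrays are hypothetical placeholders, not the reported values.
from scipy.stats import f_oneway

baselines_rewards = [120.0, 118.5, 121.2, 119.8, 120.4]  # 5 trials (hypothetical)
sb3_rewards       = [135.1, 133.9, 136.2, 134.7, 135.5]  # 5 trials (hypothetical)
cleanrl_rewards   = [134.8, 136.0, 133.5, 135.9, 134.1]  # 5 trials (hypothetical)

f_stat, p_value = f_oneway(baselines_rewards, sb3_rewards, cleanrl_rewards)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # small p => implementation means differ
```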

In the figure below, the training curves are aggregated from five trials, with the shaded regions indicating the minimum, maximum, and mean. The y-axis represents the mean reward and the x-axis represents the number of frames (40 million frames in total). The curves for Baselines, Stable Baselines3, and CleanRL are shown in purple, orange, and red, respectively (the blue and green curves can be ignored). Baselines' curves differ significantly from those of CleanRL and Stable Baselines3, consistent with the table above.

[Screenshot: training curves (mean with min/max band over five trials) for the nine environments, 40M frames]
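As an aside, a minimal sketch of how such curves could be aggregated (mean line with a min/max band over five trials) is shown below; the data are placeholders and NumPy/matplotlib are assumed to be available.

```python
# Hedged sketch: aggregate five training curves into a mean line with a
# shaded min/max band, as in the figure above. Data are placeholders.
import numpy as np
import matplotlib.pyplot as plt

frames = np.linspace(0, 40e6, 200)              # x-axis: frames (up to 40M)
trials = np.cumsum(np.random.rand(5, 200), 1)   # 5 trials of placeholder rewards

plt.plot(frames, trials.mean(axis=0), color="purple", label="Baselines (mean)")
plt.fill_between(frames, trials.min(axis=0), trials.max(axis=0),
                 color="purple", alpha=0.2)     # shaded min/max region
plt.xlabel("frames")
plt.ylabel("mean reward")
plt.legend()
plt.show()
```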

After manually debugging the code, I located the inconsistency. The environment was not conforming to the ALE specification of 108K frames per episode for the v4 variant, which is the default variant used by this repository and most DRL libraries (e.g., CleanRL and Stable Baselines3). After setting max_episode_steps in the make_atari function to 27K steps (108K frames), the implementations became consistent in three out of the nine environments, as seen in the table above and the figure below.

[Screenshot: training curves after setting max_episode_steps to 27K (108K frames)]

I will create a pull request that sets the default number of frames per episode to 108K (27K steps), with minimal changes to the original codebase so that it does not affect other components. However, I believe there might still be other inconsistencies, since six environments still differ significantly between the implementations. Any suggestions on the possible causes of these inconsistencies would be much appreciated. In case the pull request is not accepted, I have also included the fix below for those wanting to train on Atari environments :)

```python
# one-line change in baselines/baselines/common/atari_wrappers.py:
# default to a 27K-step (108K-frame) episode limit
def make_atari(env_id, max_episode_steps=27000):
```
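To sanity-check the change, a rollout like the sketch below (not part of the patch; it assumes the classic gym step API used by Baselines) should show that no episode exceeds 27,000 agent steps:

```python
# Hedged sketch: confirm the episode limit after the fix.
# Assumes the classic gym API (step() -> obs, reward, done, info).
from baselines.common.atari_wrappers import make_atari

env = make_atari("AtlantisNoFrameskip-v4", max_episode_steps=27000)
env.reset()
steps, done = 0, False
while not done:
    _, _, done, _ = env.step(env.action_space.sample())
    steps += 1
print("episode length (steps):", steps)  # should never exceed 27000
```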

Run Command To Replicate:

```
python -m baselines.run --alg=ppo2 --env=AtlantisNoFrameskip-v4 --seed 0 --num_timesteps 10e6 --network cnn --num_env 8
```