I can’t seem to replicate the original PPO algorithm's performance when using RLlib's PPO implementation. The hyperparameters used are listed below. They follow the hyperparameters discussed in an ICLR blog post that aims to replicate the results from the original PPO paper (without LSTM).
Hyperparameters
# Environment
Max Frames Per Episode = 108000
Frameskip = 4
Max Of Last 2 Frames = True
Max Steps Per Episode = 27000
Framestack = 4
Observation Type = Grayscale
Frame Size = 84 x 84
Max No Operation Actions = 30
Repeat Action Probability = 0.0
Terminal On Life Loss = True
Fire Action on Reset = True
Reward Clip = {-1, 0, 1}
Full Action Space = False
# Algorithm
Neural Network Feature Extractor = Nature CNN
Neural Network Policy Head = Linear Layer with n_actions output features
Neural Network Value Head = Linear Layer with 1 output feature
Shared Feature Extractor = True
Orthogonal Initialization = True
Scale Images to [0, 1] = True
Optimizer = Adam with 1e-5 Epsilon
Learning Rate = 2.5e-4
Decay Learning Rate = True
Number of Environments = 8
Number of Steps = 128
Batch Size = 256
Number of Minibatches = 4
Number of Epochs = 4
Gamma = 0.99
GAE Lambda = 0.95
Clip Range = 0.1
VF Clip Range = 0.1
Normalize Advantage = True
Entropy Coefficient = 0.01
VF Coefficient = 0.5
Max Gradient Normalization = 0.5
Use Target KL = False
Total Timesteps = 10000000
Log Interval = 1
Evaluation Episodes = 100
Deterministic Evaluation = False
Seed = Random
Number of Trials = 5
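As a sanity check on how the values above fit together, here is a minimal sketch in plain Python. It rests on assumptions that the list does not state outright: that "Batch Size = 256" is the minibatch size (rollout size divided by the number of minibatches), that "Total Timesteps" counts agent steps, and that "Decay Learning Rate = True" means a linear decay towards zero. The function name `linear_lr` is hypothetical.

```python
# Values copied from the hyperparameter list above.
NUM_ENVS = 8
NUM_STEPS = 128
NUM_MINIBATCHES = 4
NUM_EPOCHS = 4
TOTAL_TIMESTEPS = 10_000_000
LEARNING_RATE = 2.5e-4

# Transitions collected per training iteration (all environments combined).
rollout_size = NUM_ENVS * NUM_STEPS

# Assumption: the listed "Batch Size" is this per-minibatch size.
minibatch_size = rollout_size // NUM_MINIBATCHES

# Number of collect-then-update iterations over the whole run.
num_iterations = TOTAL_TIMESTEPS // rollout_size

# Gradient steps taken on each collected rollout.
gradient_steps_per_iteration = NUM_EPOCHS * NUM_MINIBATCHES


def linear_lr(iteration: int) -> float:
    """Assumed schedule: linear decay from LEARNING_RATE to 0 (0-based iteration)."""
    fraction_remaining = 1.0 - iteration / num_iterations
    return LEARNING_RATE * fraction_remaining


print(rollout_size, minibatch_size, num_iterations, gradient_steps_per_iteration)
```

If an RLlib configuration interprets any of these quantities differently (for example, a train batch size that is not 8 x 128), the number of gradient steps per collected sample changes, which is one thing worth comparing across implementations.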
I have tried these same hyperparameters with the Baselines, Stable Baselines3, and CleanRL implementations of the PPO algorithm, and they all achieved the expected results. For example, on the Alien and Pong environments, the agents achieved mean rewards of more than 1000 and 20, respectively. However, the RLlib agent fails to train at all. The trained RLlib agent achieves mean rewards of approx. 240 and -20 for the Alien and Pong environments, respectively, as seen in the training curves below. Am I missing something in my RLlib configuration (see reproduction script 1), or is there a bug (or intentional discrepancy) in RLlib's PPO implementation?
[Figure: RLlib PPO Alien Mean Reward Training Curve]
[Figure: RLlib PPO Pong Mean Reward Training Curve]
Other Issues Found
On a side note, I also found 2 other issues during my experiments.
Firstly, setting preprocessor_pref="deepmind" in .env_runners does not seem to work at all, so I had to configure the environment manually via a custom function make_atari and attach the wrappers there. To test this issue, use the code in reproduction script 2. The only modification relative to reproduction script 1 is that it no longer uses the custom function make_atari with the required wrappers and instead uses preprocessor_pref="deepmind" to demonstrate the issue.
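The wrappers attached inside make_atari are not reproduced in this thread. For readers unfamiliar with them, the following is a hypothetical, library-free sketch of two of the transformations named in the hyperparameter list ("Reward Clip = {-1, 0, 1}" and "Max Of Last 2 Frames = True"). The function names are illustrative only and are not RLlib or Gymnasium APIs.

```python
def clip_reward(reward: float) -> int:
    """Map a raw reward to its sign: -1, 0, or 1."""
    return (reward > 0) - (reward < 0)


def max_of_last_two(prev_frame, curr_frame):
    """Element-wise maximum of two equally sized 2-D frames (lists of rows)."""
    return [
        [max(a, b) for a, b in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


# Tiny 2x2 "frames" standing in for 84 x 84 grayscale images.
frame_a = [[0, 5], [3, 1]]
frame_b = [[2, 4], [3, 0]]
print(clip_reward(7.0), clip_reward(-0.5), clip_reward(0.0))
print(max_of_last_two(frame_a, frame_b))
```

If a preprocessing preference silently skips steps like these, the agent trains on different observations and reward scales than the other implementations, which could plausibly account for part of a performance gap.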
Secondly, specifying the number of episodes in .evaluation seems to have no effect at all. The evaluation function seems to return results over a different number of episodes than the number specified in .evaluation. This is why there is a while loop towards the end of the code in both reproduction scripts: it accumulates the correct number of episodes. You can observe this by looking at the initial_eval_episodes field in the info.json file that is automatically generated after running the scripts.
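The accumulation loop described above can be sketched roughly as follows. This is not the code from the reproduction scripts: `run_eval_round` is a hypothetical stand-in for whatever call returns the episode returns of one evaluation round, and `fake_eval_round` is a stub used only to make the example runnable.

```python
from typing import Callable, List


def collect_eval_returns(
    run_eval_round: Callable[[], List[float]],
    target_episodes: int,
) -> List[float]:
    """Call the evaluation round repeatedly until enough episodes are collected."""
    returns: List[float] = []
    while len(returns) < target_episodes:
        batch = run_eval_round()
        if not batch:
            # Guard against looping forever if a round yields no episodes.
            raise RuntimeError("evaluation round returned no episodes")
        returns.extend(batch)
    # Drop any surplus so exactly target_episodes are reported.
    return returns[:target_episodes]


def fake_eval_round() -> List[float]:
    """Stub: pretends every round finishes 3 episodes, regardless of the request."""
    return [1.0, 2.0, 3.0]


episode_returns = collect_eval_returns(fake_eval_round, target_episodes=10)
print(len(episode_returns), sum(episode_returns) / len(episode_returns))
```

Truncating to the target count keeps the reported mean comparable across trials, at the cost of discarding a few completed episodes from the final round.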
I have also posted all of the aforementioned issues encountered at discuss.ray.io here. Any help/feedback on this would be greatly appreciated. Thanks in advance!
Versions / Dependencies
Python version: 3.11
Ray version: 2.20.0
OS: Ubuntu LTS
Reproduction script
To run the python file for both scripts: python file_name.py --env Alien --gpu 0 --trials 1
hexonfox added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on May 31, 2024.
@sven1977 Hi, I was wondering if you could take a look at this? I've tried running PPO via the tuned examples given and it seems to work fine. However, it fails to work when I customize it to match the hyperparameters from the original PPO paper, which makes me suspect that something is wrong either with my configuration or with one of the RLlib functions used during the customization. It might be the latter because I've already found 2 potential bugs, one with preprocessor_pref and another with .evaluation, both discussed in the issue above. I posted this on discuss.ray.io here about 3 weeks ago and have yet to receive a response.
Hey there,
I am not a Ray maintainer, but I noticed an issue with the PPO implementation in Ray 2.37.0 (the latest release as of yesterday): batches are not shuffled when training, which seems to be fixed in 2.38.0 by this PR.
I believe this might cause a noticeable impact on training, so you could try running your experiment again with 2.38.0.
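To illustrate what per-epoch shuffling means here, the following is a minimal sketch of a common PPO minibatch loop in plain Python. It is not RLlib's implementation and not the code from the referenced PR; the function name `iterate_minibatches` is hypothetical, and the sizes are taken from the hyperparameters in the original post (8 x 128 = 1024 samples, 4 minibatches, 4 epochs).

```python
import random
from typing import Iterator, List


def iterate_minibatches(
    batch_size: int,
    num_minibatches: int,
    num_epochs: int,
    rng: random.Random,
) -> Iterator[List[int]]:
    """Yield index lists for each minibatch, reshuffling the batch every epoch."""
    minibatch_size = batch_size // num_minibatches
    for _ in range(num_epochs):
        indices = list(range(batch_size))
        rng.shuffle(indices)  # a new random order for every epoch
        for start in range(0, batch_size, minibatch_size):
            yield indices[start:start + minibatch_size]


minibatches = list(iterate_minibatches(1024, 4, 4, random.Random(0)))
print(len(minibatches), len(minibatches[0]))
```

Without the shuffle, every epoch would see the same contiguous slices of the rollout, so each minibatch would contain temporally correlated samples from the same few environments.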
@smanolloff Thank you for your suggestion. The version of Ray I am using is 2.34.0, and I have merged the modifications from #47458, which has significantly improved the results. However, the results are still slightly weaker than the training results of the old version under my multi-card training process.
Issue Severity
High: It blocks me from completing my task.