[RLlib] Give more time to impala tests #31910
Conversation
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
lr_schedule: [
    [0, 0.0005],
    [20000000, 0.000000000001],
]
lr: 0.0005
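For background, RLlib resolves a schedule like this by piecewise-linear interpolation between the [timestep, value] anchors. A minimal Python sketch of that behavior (lr_at is a hypothetical helper for illustration, not RLlib's API):

# Hypothetical helper illustrating piecewise-linear interpolation between
# [timestep, lr] anchors, as in the lr_schedule above.
def lr_at(timestep, schedule):
    # schedule: [[t0, v0], [t1, v1], ...], sorted by timestep.
    if timestep <= schedule[0][0]:
        return schedule[0][1]
    for (t0, v0), (t1, v1) in zip(schedule, schedule[1:]):
        if t0 <= timestep < t1:
            frac = (timestep - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
    return schedule[-1][1]  # past the last anchor, lr stays constant

schedule = [[0, 0.0005], [20000000, 0.000000000001]]
print(lr_at(10_000_000, schedule))  # halfway through: ~0.00025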
I noticed that 20000000 (twenty million in words) is never reached, so to avoid confusing it with 2M, we should just leave it out.
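As a hedged sketch of what the suggested simplification would look like (hypothetical snippet, not an actual GitHub suggestion from the review):

# Within the test's time budget the schedule never leaves its first anchor,
# so a constant lr behaves identically and is less confusing.
config = {
    "lr": 0.0005,  # replaces the effectively-unused lr_schedule
}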
this sounds good.
Can you not leave the y scale out of the screenshot?
@gjoliver The variance does indeed not manifest in the max(mean_episode_reward).
ok, but I still have the question: why then do we need 4 hrs for this test?
@gjoliver Sorry, my mistake. I dug out the original tensorboard that you see in the picture: https://tensorboard.dev/experiment/5vi5EGOdSJWdfwoc1A2qNw/#scalars&tagFilter=cur_lr&_smoothingWeight=0
So the run in the picture is actually 40 minutes long.
In the file changes, I have only moved this from 30 to 40 minutes.
oh ok, man, this sounds a lot better. sorry I misread the numbers 😓
Why are these changes needed?
IMPALA release tests have been flaky. They are capable of learning >> 200 mean reward on Breakout, but struggle to do so within 1800s. This PR moves that bar to 2400s.
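A hedged sketch of what the change amounts to (the key names follow Tune's standard stop criteria; the actual release-test YAML schema may differ):

# The test must reach >= 200 mean episode reward on Breakout before the
# wall-clock budget runs out; this PR raises that budget from 1800s to 2400s.
stop_criteria = {
    "episode_reward_mean": 200,
    "time_total_s": 2400,  # was 1800
}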
The following screenshot depicts two representative runs of how the mean reward behaves over time (1 on the horizontal scale is 1h).
Note that TF has much more variance.
I could not find out why, but noted that the major difference in metrics is in the value loss.
I also tuned the LR and number of workers to find out if TF being faster is simply an off-policy thing. It's not.
Related TB https://tensorboard.dev/experiment/5vi5EGOdSJWdfwoc1A2qNw/#scalars&tagFilter=cur_lr&_smoothingWeight=0
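For reference, a minimal sketch of the kind of LR / num-workers sweep described above (the grid values and env name are hypothetical, not the exact sweep that was run):

# Hypothetical Tune sweep over lr and num_workers for IMPALA on Breakout,
# comparing the tf and torch frameworks under the same time budget.
from ray import tune

tune.run(
    "IMPALA",
    config={
        "env": "BreakoutNoFrameskip-v4",
        "framework": tune.grid_search(["tf", "torch"]),
        "lr": tune.grid_search([0.0005, 0.00025]),
        "num_workers": tune.grid_search([16, 32]),
    },
    stop={"time_total_s": 2400},
)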