Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. #2401
Conversation
…raph mode. Signed-off-by: aaron276h <aaron@determined.ai>
Nice work! Awesome to see feature parity among the different optimizers.
Hi, @aaron276h …
@Richie-yan thanks for flagging this issue. Could you take a look at #2415? That should address the issue you are running into.
Hi, @aaron276h @tgaddair
According to my measurements, the gradients of the RoBERTa-large model occupy about 1.17 GB of GPU memory.
Good catch @Richie-yan. I'm not sure it's possible to avoid this, as the gradients need to be stored in a separate variable in order to perform the local aggregation. Do you have some thoughts on how this additional copy can be avoided?
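For context, a minimal sketch of what local gradient aggregation in graph mode typically requires (illustrative only; the helper name `make_aggregation_ops` is hypothetical and this is not Horovod's actual implementation): each gradient is accumulated into a persistent variable, which is the extra copy discussed above.

```python
# Illustrative sketch, not Horovod's implementation: graph-mode gradient
# accumulation needs one persistent variable per gradient, which accounts
# for the additional memory roughly equal to the total gradient size.
import tensorflow as tf

def make_aggregation_ops(grads_and_vars):
    accumulate_ops, reset_ops, aggregated = [], [], []
    for grad, var in grads_and_vars:
        # Extra copy: an accumulator with the same shape as the gradient.
        accum = tf.Variable(tf.zeros_like(var), trainable=False)
        accumulate_ops.append(accum.assign_add(grad))
        reset_ops.append(accum.assign(tf.zeros_like(var)))
        aggregated.append((accum, var))
    return accumulate_ops, reset_ops, aggregated
```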
@tgaddair …
@Richie-yan that's really interesting. This should give the correct results, and if you are observing that it provides better memory performance, we should definitely make this change. It seems potentially related to this old thread, but it's not clear why we see 5x memory usage rather than 2x from using …
@aaron276h …
Checklist before submitting
Description
This PR is a follow-up to #2346. It adds support for backward_passes_per_step > 1 for TF legacy optimizers (tf.train.Optimizer) executing in graph (non-eager) mode. This is one of the features built into Determined AI's fork of Horovod that we would like to upstream.
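For reference, a minimal usage sketch assuming a TF1-style graph-mode setup; the model, data, learning rate, and the value backward_passes_per_step=4 are placeholders for illustration, not taken from this PR:

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.placeholder(tf.float32, shape=[None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(y, pred)

# Legacy tf.train optimizer wrapped by Horovod. With
# backward_passes_per_step=4, gradients are aggregated locally over four
# backward passes before the allreduce and the weight update are applied.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=4)
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(100):
        batch_x = np.random.rand(32, 10).astype(np.float32)
        batch_y = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```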
Review process to land