
Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. #2401

Merged
merged 4 commits into horovod:master on Oct 30, 2020

Conversation

aaron276h
Contributor

@aaron276h aaron276h commented Oct 27, 2020

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR is a follow-up to #2346. It adds support for backward_passes_per_step > 1 for TF legacy optimizers (tf.train.Optimizer) executing in graph (non-eager) mode. This is one of the features we have built into Determined AI's fork of Horovod that we would like to upstream.
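For context, here is a minimal usage sketch of what this feature enables (not code from the PR; the toy model, shapes, and hyperparameters below are made up for illustration):

```python
# Minimal usage sketch: a legacy (tf.compat.v1.train) optimizer in graph mode
# with local gradient aggregation. The model and data are illustrative only.
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
tf.compat.v1.disable_eager_execution()

# Tiny linear model so the example is self-contained.
x = tf.compat.v1.placeholder(tf.float32, [None, 10])
y = tf.compat.v1.placeholder(tf.float32, [None, 1])
w = tf.compat.v1.get_variable("w", [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# Legacy optimizer (a tf.compat.v1.train.Optimizer subclass).
opt = tf.compat.v1.train.GradientDescentOptimizer(0.01 * hvd.size())

# With backward_passes_per_step=4, gradients are accumulated locally for four
# backward passes before a single allreduce and weight update is performed.
opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=4)
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.compat.v1.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(40):
        feed = {x: np.random.rand(8, 10).astype(np.float32),
                y: np.random.rand(8, 1).astype(np.float32)}
        sess.run(train_op, feed_dict=feed)
```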

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@aaron276h aaron276h changed the title Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. [WIP] Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. Oct 27, 2020

@aaron276h aaron276h changed the title [WIP] Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. Oct 28, 2020

Collaborator

@tgaddair tgaddair left a comment


Nice work! Awesome to see feature parity among the different optimizers.

@tgaddair tgaddair merged commit d93887d into horovod:master Oct 30, 2020
@github-actions

Unit Test Results

   522 files  -19      522 suites  -19      4h 27m 49s ⏱️ -33s
   509 tests  +1       481 ✔️ -1      27 💤 +1      1 ❌ +1
 9 924 runs  -362    7 882 ✔️ -305    2 041 💤 -58    1 ❌ +1

Results for commit d93887d. ± Comparison against base commit bb4e4cf.

@Richie-yan
Contributor

Hi @aaron276h,
When I ran tensorflow_mnist_estimator.py with gradient accumulation (backward_passes_per_step set to 3), I ran into the following problem:
[Screenshot: 2020-11-02 11:55:29 AM]
During the broadcast shown in the screenshot above, the counter variable names on the two ranks are inconsistent, which causes a hang. I think it is related to the following code:
[Screenshot: 2020-11-02 12:06:48 PM]
Does this code need to take the rank into account for the counter variable?
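For anyone hitting the same hang, a small illustration of the underlying constraint (the counter name and scope below are hypothetical, not Horovod's actual internals): hvd.broadcast_global_variables() is a collective whose tensors are matched by name across ranks, so the aggregation counter must be created under a name that is identical on every rank.

```python
# Illustration only: why mismatched variable names stall the broadcast.
# "gradient_aggregation" and "backward_passes_counter" are hypothetical names.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
tf.compat.v1.disable_eager_execution()

with tf.compat.v1.variable_scope("gradient_aggregation"):
    # Explicit, rank-independent name. If ranks ended up with different
    # auto-generated names (say "counter_3" on one worker and "counter_5" on
    # another), the collective below would wait forever for a matching tensor.
    counter = tf.compat.v1.get_variable(
        "backward_passes_counter", shape=[], dtype=tf.int32,
        initializer=tf.compat.v1.zeros_initializer(), trainable=False)

# Broadcasts every global variable from rank 0; all ranks must have created
# the same set of identically named variables for this op to complete.
bcast_op = hvd.broadcast_global_variables(0)

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(bcast_op)
```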

@aaron276h
Contributor Author

@Richie-yan thanks for flagging this issue. Could you take a look at #2415? That should address the issue you are running into.

@Richie-yan
Contributor

Richie-yan commented Nov 16, 2020

Hi @aaron276h @tgaddair,
I tried to run TF's RoBERTa-large model with gradient accumulation and found a problem.
With batch_size set to 10:

  • Without gradient accumulation, TF's BFC allocator uses about 23.3 GB of memory;

  • With gradient accumulation, TF's BFC allocator uses about 29.2 GB of memory.

By my measurements, the gradients of the RoBERTa-large model occupy about 1.17 GB of GPU memory, yet enabling gradient accumulation takes up an additional 5.9 GB.
This GPU memory usage does not meet expectations. What might be the cause?
I suspect it is caused by using tf.zeros_initializer() when creating the accumulation variables:
[Screenshot: 2020-11-16 4:32:53 PM]

@tgaddair
Collaborator

Good catch @Richie-yan. I'm not sure it's possible to avoid this, as the gradients need to be stored into a separate variable in order to perform the local aggregation. Do you have some thoughts on how this additional copy can be avoided?
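To make the "additional copy" concrete, here is a stripped-down sketch of the aggregation pattern being discussed (the helper name and structure are illustrative, not Horovod's actual implementation): one persistent zero-initialized buffer per gradient, accumulation into it on every pass, and an allreduce plus buffer reset only on every backward_passes_per_step-th pass.

```python
# Stripped-down sketch of local gradient aggregation with a persistent buffer.
# Hypothetical helper; counter updates and the guarded apply step are omitted.
import tensorflow as tf
import horovod.tensorflow as hvd

def locally_aggregated_grad(grad, var, counter, backward_passes_per_step):
    # The separate variable in question: one zero-initialized, non-trainable
    # buffer per gradient, i.e. an extra full copy of the gradients in memory.
    buf = tf.compat.v1.get_variable(
        var.op.name.replace("/", "_") + "_agg_buffer",
        shape=grad.shape, dtype=grad.dtype,
        initializer=tf.compat.v1.zeros_initializer(), trainable=False)

    accumulate = tf.compat.v1.assign_add(buf, grad)

    def allreduce_and_reset():
        # Average the accumulated gradient across workers, then clear the buffer.
        averaged = hvd.allreduce(buf / backward_passes_per_step)
        with tf.control_dependencies([averaged]):
            reset = tf.compat.v1.assign(buf, tf.zeros_like(buf))
        with tf.control_dependencies([reset]):
            return tf.identity(averaged)

    with tf.control_dependencies([accumulate]):
        # Communicate only on every backward_passes_per_step-th pass; otherwise
        # return a zero gradient (a simplification; a real implementation would
        # also need to guard the optimizer's apply step).
        return tf.cond(
            tf.equal((counter + 1) % backward_passes_per_step, 0),
            allreduce_and_reset,
            lambda: tf.zeros_like(grad))
```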

@Richie-yan
Contributor

@tgaddair
Thank you for your reply.
I haven't thought of another approach that could replace this copy.
However, I tried modifying how the variable is initialized, and GPU memory usage appears to return to normal: the additional usage drops from 5.9 GB to about 1.3 GB. My change is as follows:
[Screenshot: 2020-11-17 10:22:48 AM]
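Since the screenshot of the change is not reproduced above, here is a purely hypothetical illustration of the kind of initialization change being described (not necessarily the actual edit, and no claim that either variant is Horovod's code); whether and why the second form changes peak memory is exactly the open question in this thread.

```python
# Hypothetical illustration of two ways an aggregation buffer could be created.
import tensorflow as tf

def make_agg_buffer_with_initializer(grad, name):
    # Variant A: explicit shape plus tf.zeros_initializer(), the pattern
    # suspected above of inflating memory usage.
    return tf.compat.v1.get_variable(
        name, shape=grad.shape, dtype=grad.dtype,
        initializer=tf.compat.v1.zeros_initializer(), trainable=False)

def make_agg_buffer_from_tensor(grad, name):
    # Variant B: initialize the variable directly from a zero tensor shaped
    # like the gradient, so no separate shape/initializer pair is needed.
    return tf.compat.v1.get_variable(
        name, initializer=tf.zeros_like(grad), trainable=False)
```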

@aaron276h
Contributor Author

@Richie-yan that's really interesting. This change should still give correct behavior, and if you are observing better memory usage with it we should definitely make this change.

It seems potentially related to this old thread, but it's not clear why we see 5x memory usage rather than 2x from using tf.zeros_initializer().

@Richie-yan
Contributor

@aaron276h
Yes, with the change above I can further increase the model's batch_size without hitting OOM, which improves the model's throughput.
I think we should investigate further why tf.zeros_initializer() leads to 5x memory usage.
