
Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. #2401

Merged
merged 4 commits into horovod:master on Oct 30, 2020

Conversation

aaron276h
Contributor

@aaron276h aaron276h commented Oct 27, 2020

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR is a follow-up to #2346. It adds support for backward_passes_per_step > 1 for TF legacy optimizers (tf.train.Optimizer) executing in graph (non-eager) mode. This is one of the features we have built into Determined AI's fork of Horovod that we would like to upstream.
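For context, here is a minimal usage sketch of what this feature enables (not code from the PR; the toy model, shapes, and hyperparameters below are made up for illustration):

```python
# Minimal usage sketch: a legacy (tf.compat.v1.train) optimizer in graph mode
# with local gradient aggregation. The model and data are illustrative only.
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
tf.compat.v1.disable_eager_execution()

# Tiny linear model so the example is self-contained.
x = tf.compat.v1.placeholder(tf.float32, [None, 10])
y = tf.compat.v1.placeholder(tf.float32, [None, 1])
w = tf.compat.v1.get_variable("w", [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# Legacy optimizer (a tf.compat.v1.train.Optimizer subclass).
opt = tf.compat.v1.train.GradientDescentOptimizer(0.01 * hvd.size())

# With backward_passes_per_step=4, gradients are accumulated locally for four
# backward passes before a single allreduce and weight update is performed.
opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=4)
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.compat.v1.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(40):
        feed = {x: np.random.rand(8, 10).astype(np.float32),
                y: np.random.rand(8, 1).astype(np.float32)}
        sess.run(train_op, feed_dict=feed)
```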

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@aaron276h aaron276h changed the title Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. [WIP] Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. Oct 27, 2020

@aaron276h aaron276h changed the title [WIP] Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. Add support for backward_passes_per_step > 1 for LegacyOptimizers (TF) in Graph Mode. Oct 28, 2020

Collaborator

@tgaddair tgaddair left a comment


Nice work! Awesome to see feature parity among the different optimizers.

@tgaddair tgaddair merged commit d93887d into horovod:master Oct 30, 2020
@github-actions

Unit Test Results

   522 files  -19      522 suites  -19      4h 27m 49s ⏱️ -33s
   509 tests  +1       481 ✔️ -1      27 💤 +1      1 ❌ +1
 9 924 runs  -362    7 882 ✔️ -305    2 041 💤 -58    1 ❌ +1

Results for commit d93887d. ± Comparison against base commit bb4e4cf.

@Richie-yan
Contributor

Hi @aaron276h,
When I ran tensorflow_mnist_estimator.py with gradient accumulation (backward_passes_per_step set to 3), I ran into the following problem:
[Screenshot: 2020-11-02 11:55:29 AM]
During the broadcast shown in the screenshot above, the counter variable names on the two ranks are inconsistent, which causes a hang. I think it is related to the following code:
[Screenshot: 2020-11-02 12:06:48 PM]
Does this code need to take the rank into account for the counter variable?
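For anyone hitting the same hang, a small illustration of the underlying constraint (the counter name and scope below are hypothetical, not Horovod's actual internals): hvd.broadcast_global_variables() is a collective whose tensors are matched by name across ranks, so the aggregation counter must be created under a name that is identical on every rank.

```python
# Illustration only: why mismatched variable names stall the broadcast.
# "gradient_aggregation" and "backward_passes_counter" are hypothetical names.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
tf.compat.v1.disable_eager_execution()

with tf.compat.v1.variable_scope("gradient_aggregation"):
    # Explicit, rank-independent name. If ranks ended up with different
    # auto-generated names (say "counter_3" on one worker and "counter_5" on
    # another), the collective below would wait forever for a matching tensor.
    counter = tf.compat.v1.get_variable(
        "backward_passes_counter", shape=[], dtype=tf.int32,
        initializer=tf.compat.v1.zeros_initializer(), trainable=False)

# Broadcasts every global variable from rank 0; all ranks must have created
# the same set of identically named variables for this op to complete.
bcast_op = hvd.broadcast_global_variables(0)

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(bcast_op)
```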

@aaron276h
Contributor Author

@Richie-yan thanks for flagging this issue. Could you take a look at #2415? That should address the issue you are running into.

@Richie-yan
Contributor

Richie-yan commented Nov 16, 2020

Hi @aaron276h @tgaddair,
I tried to run TF's RoBERTa-large model with gradient accumulation and found a problem.
With batch_size set to 10:

  • Without gradient accumulation, TF's BFC allocator uses about 23.3 GB of memory;

  • With gradient accumulation, TF's BFC allocator uses about 29.2 GB of memory.

By my measurements, the gradients of the RoBERTa-large model occupy about 1.17 GB of GPU memory, yet enabling gradient accumulation takes up an additional 5.9 GB.
This GPU memory usage does not meet expectations. What might be the cause?
I suspect it is caused by using tf.zeros_initializer() when creating the accumulation variables:
[Screenshot: 2020-11-16 4:32:53 PM]

@tgaddair
Collaborator

Good catch @Richie-yan. I'm not sure it's possible to avoid this, as the gradients need to be stored into a separate variable in order to perform the local aggregation. Do you have some thoughts on how this additional copy can be avoided?
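To make the "additional copy" concrete, here is a stripped-down sketch of the aggregation pattern being discussed (the helper name and structure are illustrative, not Horovod's actual implementation): one persistent zero-initialized buffer per gradient, accumulation into it on every pass, and an allreduce plus buffer reset only on every backward_passes_per_step-th pass.

```python
# Stripped-down sketch of local gradient aggregation with a persistent buffer.
# Hypothetical helper; counter updates and the guarded apply step are omitted.
import tensorflow as tf
import horovod.tensorflow as hvd

def locally_aggregated_grad(grad, var, counter, backward_passes_per_step):
    # The separate variable in question: one zero-initialized, non-trainable
    # buffer per gradient, i.e. an extra full copy of the gradients in memory.
    buf = tf.compat.v1.get_variable(
        var.op.name.replace("/", "_") + "_agg_buffer",
        shape=grad.shape, dtype=grad.dtype,
        initializer=tf.compat.v1.zeros_initializer(), trainable=False)

    accumulate = tf.compat.v1.assign_add(buf, grad)

    def allreduce_and_reset():
        # Average the accumulated gradient across workers, then clear the buffer.
        averaged = hvd.allreduce(buf / backward_passes_per_step)
        with tf.control_dependencies([averaged]):
            reset = tf.compat.v1.assign(buf, tf.zeros_like(buf))
        with tf.control_dependencies([reset]):
            return tf.identity(averaged)

    with tf.control_dependencies([accumulate]):
        # Communicate only on every backward_passes_per_step-th pass; otherwise
        # return a zero gradient (a simplification; a real implementation would
        # also need to guard the optimizer's apply step).
        return tf.cond(
            tf.equal((counter + 1) % backward_passes_per_step, 0),
            allreduce_and_reset,
            lambda: tf.zeros_like(grad))
```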

@Richie-yan
Contributor

@tgaddair
Thank you for your reply.
I haven't thought of another approach that could replace this copy.
However, I tried modifying how the variable is initialized, and GPU memory usage appears to return to normal: the additional usage drops from 5.9 GB to about 1.3 GB. My change is as follows:
[Screenshot: 2020-11-17 10:22:48 AM]
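Since the screenshot of the change is not reproduced above, here is a purely hypothetical illustration of the kind of initialization change being described (not necessarily the actual edit, and no claim that either variant is Horovod's code); whether and why the second form changes peak memory is exactly the open question in this thread.

```python
# Hypothetical illustration of two ways an aggregation buffer could be created.
import tensorflow as tf

def make_agg_buffer_with_initializer(grad, name):
    # Variant A: explicit shape plus tf.zeros_initializer(), the pattern
    # suspected above of inflating memory usage.
    return tf.compat.v1.get_variable(
        name, shape=grad.shape, dtype=grad.dtype,
        initializer=tf.compat.v1.zeros_initializer(), trainable=False)

def make_agg_buffer_from_tensor(grad, name):
    # Variant B: initialize the variable directly from a zero tensor shaped
    # like the gradient, so no separate shape/initializer pair is needed.
    return tf.compat.v1.get_variable(
        name, initializer=tf.zeros_like(grad), trainable=False)
```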

@aaron276h
Contributor Author

@Richie-yan that's really interesting. This change should still give correct behavior, and if you are observing better memory usage with it we should definitely make this change.

It seems potentially related to this old thread, but it's not clear why we see 5x memory usage rather than 2x from using tf.zeros_initializer().

@Richie-yan
Contributor

@aaron276h
Yes, with the change above I can further increase the model's batch_size without hitting OOM, which improves the model's throughput.
I think we should investigate further why tf.zeros_initializer() leads to 5x memory usage.
