
Fixes gradient update timing in TF AggregationHelperEager #3496

Merged

Conversation

@plliao (Contributor) commented Mar 25, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes gradient update timing in TF LocalGradientAggregationHelperEager.

Why?
The gradient update timing in LocalGradientAggregationHelperEager was set one step before the locally aggregated gradients are allreduced. As a result, the applied gradients were the locally aggregated gradients rather than the allreduced gradients. This PR fixes that issue.

The PR also improves the gradient aggregation test case to cover tf.IndexedSlices gradients and to validate the updated values. Before this change, the gradients and values were always zero, so the test did not actually verify anything.
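The intended aggregation semantics can be sketched with a toy, framework-free model (all names here are hypothetical and for illustration only; the real logic lives in Horovod's `LocalGradientAggregationHelperEager`, and allreduce is mocked as a mean across simulated workers):

```python
# Toy model of local gradient aggregation with backward_passes_per_step.
# Hypothetical helper for illustration only; allreduce is mocked as a
# mean across simulated workers.

def allreduce_mean(per_worker_values):
    # Stand-in for a real averaging allreduce across workers.
    return sum(per_worker_values) / len(per_worker_values)

class ToyAggregationHelper:
    def __init__(self, backward_passes_per_step):
        self.n = backward_passes_per_step
        self.counter = 0
        self.local_sum = 0.0

    def step(self, local_grad, other_worker_aggregates):
        """Accumulate locally; allreduce only on every n-th backward pass.

        Returns the gradient to apply, or None on pure accumulation steps.
        The point of the fix: what gets applied must be the *allreduced*
        aggregate, not the purely local one.
        """
        self.local_sum += local_grad
        self.counter += 1
        if self.counter < self.n:
            return None  # still accumulating locally
        reduced = allreduce_mean([self.local_sum] + other_worker_aggregates)
        self.counter = 0
        self.local_sum = 0.0
        return reduced / self.n  # average over the backward passes

helper = ToyAggregationHelper(backward_passes_per_step=2)
print(helper.step(1.0, [6.0]))  # None: accumulation step
print(helper.step(3.0, [6.0]))  # 2.5: (4.0 + 6.0) / 2 workers / 2 passes
```

The bug this PR fixes corresponds to applying `self.local_sum` directly instead of the `reduced` value in a model like this.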

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Signed-off-by: Pei-Lun Liao <pliao@linkedin.com>
@plliao force-pushed the fix_apply_gradient_timing_in_TF_AggregationHelper branch from ecda0ce to c5f57d2 on March 25, 2022, 23:54
@github-actions

Unit Test Results

446 files (−375) · 446 suites (−375) · 7h 5m 13s ⏱️ (−2h 29m 5s)
766 tests (+1): 615 passed (−107), 149 skipped (+106), 2 failed (+2)
10,330 runs (−8,506): 7,134 passed (−6,419), 3,194 skipped (−2,089), 2 failed (+2)

For more details on these failures, see this check.

Results for commit c5f57d2. ± Comparison against base commit 4b8cc49.

This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.
test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation
test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation_0
test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation_1
This pull request skips 106 tests.
test.parallel.test_adasum_pytorch.TorchAdasumTests ‑ test_orthogonal
test.parallel.test_adasum_pytorch.TorchAdasumTests ‑ test_parallel
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_allgather_object
test.parallel.test_mxnet2.MX2Tests ‑ test_broadcast_object
test.parallel.test_mxnet2.MX2Tests ‑ test_compression_fp16
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
…

@github-actions

Unit Test Results (with flaky tests)

501 files (−404) · 501 suites (−404) · 8h 17m 34s ⏱️ (−1h 38m 26s)
766 tests (+1): 612 passed (−110), 149 skipped (+106), 5 failed (+5)
11,544 runs (−9,488): 8,005 passed (−6,880), 3,530 skipped (−2,617), 9 failed (+9)

For more details on these failures, see this check.

Results for commit c5f57d2. ± Comparison against base commit 4b8cc49.

This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.
test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation
test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation_0
test.parallel.test_tensorflow2_keras.Tf2KerasTests ‑ test_gradient_aggregation_1
This pull request skips 106 tests.
test.parallel.test_adasum_pytorch.TorchAdasumTests ‑ test_orthogonal
test.parallel.test_adasum_pytorch.TorchAdasumTests ‑ test_parallel
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_allgather_object
test.parallel.test_mxnet2.MX2Tests ‑ test_broadcast_object
test.parallel.test_mxnet2.MX2Tests ‑ test_compression_fp16
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
…

@Tixxx (Collaborator) commented Mar 30, 2022

I'm not entirely sure if we should remove it. From the comments in `non_aggregation_step`:

```python
# In TF 2.4+ `_aggregate_gradients()` is called from inside of `apply_gradients()`.
# We account for this by calling `_aggregate_gradients()` for steps where we do
# not call `apply_gradients()`.
```

My understanding is that they return `True` one step early to let `apply_gradients()` do the final step, which then calls `_allreduce` again and returns the reduced tensors.
@tgaddair @aaron276h
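One way to read that "one step early" convention is as pure counter arithmetic (a hypothetical sketch of the described behavior, not the actual Horovod predicate):

```python
# Hypothetical sketch of the "one step early" convention described above.
# With backward_passes_per_step = n, the helper flags the apply step on the
# n-th backward pass itself, before the allreduced result exists, trusting
# apply_gradients() to run the final aggregation/allreduce.

n = 4  # backward_passes_per_step

def is_apply_step(counter):
    # counter is the 0-based index of the current backward pass.
    return (counter + 1) % n == 0

flags = [is_apply_step(c) for c in range(8)]
print(flags)  # True on passes 3 and 7 (the 4th and 8th backward passes)
```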

@plliao (Contributor, Author) commented Mar 30, 2022

> My understanding is that they return `True` one step early to let `apply_gradients()` do the final step, which then calls `_allreduce` again and returns the reduced tensors. @tgaddair @aaron276h

I think this behavior was changed in #2647. Since TF 2.4, the gradients are allreduced in `_compute_gradients` instead of `_aggregate_gradients`. Hence, the applied gradients are not reduced and averaged.
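The call-order change described above can be sketched abstractly (hypothetical function names; the real entry points are Keras optimizer internals and Horovod's wrapper):

```python
# Abstract sketch of why the timing assumption broke in TF 2.4+.
# All names are hypothetical; allreduce is mocked with a marker so we can
# see whether the applied gradients were ever reduced.

def allreduce(grads):
    return [("reduced", g) for g in grads]

def apply(grads):
    return grads

# Pre-TF-2.4 (per the quoted comment): apply_gradients() itself called
# _aggregate_gradients(), which ran the allreduce, so deferring the
# allreduce to the apply step worked.
def apply_gradients_pre_24(local_grads):
    return apply(allreduce(local_grads))

# TF 2.4+ (per #2647): the allreduce moved into _compute_gradients(), and
# apply_gradients() just applies what it is handed. A helper that still
# waits for apply_gradients() to allreduce ends up applying the locally
# aggregated, never-reduced gradients.
def apply_gradients_24_plus(local_grads):
    return apply(local_grads)

print(apply_gradients_pre_24([1.0]))   # [('reduced', 1.0)]
print(apply_gradients_24_plus([1.0]))  # [1.0]: never allreduced
```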

@Tixxx (Collaborator) commented Mar 30, 2022

> I think this behavior was changed in #2647. Since TF 2.4, the gradients are allreduced in `_compute_gradients` instead of `_aggregate_gradients`. Hence, the applied gradients are not reduced and averaged.

I see, thanks for the explanation.

@Tixxx Tixxx merged commit 9e87451 into horovod:master Mar 31, 2022
@shaowei-su

Quick follow-up on this issue: @Tixxx, could you help push these fixes (together with the other PR, #3499) into a production release? The latest release, 0.24.3, does not seem to include this fix. Thanks!

@Tixxx Tixxx added this to the v0.25.0 milestone May 2, 2022
@Tixxx (Collaborator) commented May 2, 2022

> Quick follow-up on this issue: @Tixxx, could you help push these fixes (together with the other PR, #3499) into a production release? The latest release, 0.24.3, does not seem to include this fix. Thanks!

Done, I've added the two PRs to the 0.25.0 release milestone. You can check them out here.

@shaowei-su

Thank you @Tixxx for your quick responses and help!
