Fixes gradient update timing in TF AggregationHelperEager #3496
Conversation
Signed-off-by: Pei-Lun Liao <pliao@linkedin.com>
ecda0ce to c5f57d2
Unit Test Results
446 files (-375) · 446 suites (-375) · 7h 5m 13s ⏱️ (-2h 29m 5s)
For more details on these failures, see this check.
Results for commit c5f57d2. ± Comparison against base commit 4b8cc49.
This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.
This pull request skips 106 tests.
Unit Test Results (with flaky tests)
501 files (-404) · 501 suites (-404) · 8h 17m 34s ⏱️ (-1h 38m 26s)
For more details on these failures, see this check.
Results for commit c5f57d2. ± Comparison against base commit 4b8cc49.
This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.
This pull request skips 106 tests.
I'm not entirely sure if we should remove it. From the comments in `non_aggregation_step`:
My understanding is that they return `true` one step early to let `apply_gradients()` do the final step, which will then call `_allreduce` again, which returns the reduced tensors.
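For readers following along, here is a minimal sketch of the step-counting logic under discussion (class and method names are hypothetical, not Horovod's actual implementation). The off-by-one in the boundary check is exactly what decides whether `apply_gradients()` sees locally aggregated or allreduced tensors:

```python
# Hypothetical sketch of the aggregation-step bookkeeping discussed above.
class AggregationCounter:
    def __init__(self, backward_passes_per_step):
        # Number of local backward passes accumulated per allreduce.
        self.backward_passes_per_step = backward_passes_per_step
        self.counter = 0

    def record_backward_pass(self):
        self.counter += 1

    def is_aggregation_boundary(self):
        # If this fires one step early, apply_gradients() runs with the
        # locally accumulated gradients before the allreduce happens.
        return self.counter % self.backward_passes_per_step == 0
```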
I think this behavior was changed in this PR. After TF 2.4, the gradients are allreduced in
I see, thanks for the explanation.
Thank you @Tixxx for your quick responses and help!
Checklist before submitting
Description
Fixes gradient update timing in TF `LocalGradientAggregationHelperEager`.
Why?
The gradient update timing in `LocalGradientAggregationHelperEager` was set one step before the locally aggregated gradients were allreduced. As a result, the applied gradients were the locally aggregated gradients instead of the allreduced gradients. This PR fixes that issue. A sketch of the intended ordering follows.
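To make the fix concrete, here is a hedged sketch of the intended ordering, using illustrative helper names rather than Horovod's real API (`aggregate_locally` and `at_boundary` are hypothetical stand-ins for the helper's internal bookkeeping): the locally accumulated gradients must be allreduced across workers before `apply_gradients` consumes them.

```python
import horovod.tensorflow as hvd

def train_step(optimizer, grads, variables, aggregate_locally, at_boundary):
    grads = aggregate_locally(grads)  # accumulate this step's gradients
    if at_boundary():
        # Allreduce FIRST, so the applied update reflects all workers ...
        grads = [hvd.allreduce(g) for g in grads]
        # ... and only then apply. Applying before the allreduce would
        # use each worker's local aggregate (the bug this PR fixes).
        optimizer.apply_gradients(zip(grads, variables))
```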
The PR also improves the gradient aggregation test case to cover `tf.IndexedSlices` gradients and to validate the updated values. Before this change, the gradients and values were always zero, so the test did not verify anything.
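As context for the test improvement, here is a small sketch of how a `tf.IndexedSlices` gradient arises and how its values can be checked to be nonzero; the variable shape and gather indices are arbitrary examples, not the PR's actual test values.

```python
import numpy as np
import tensorflow as tf

var = tf.Variable(tf.ones([4, 2]))
with tf.GradientTape() as tape:
    # Gathering rows of a variable yields a tf.IndexedSlices gradient.
    loss = tf.reduce_sum(tf.gather(var, [0, 2]))
grad = tape.gradient(loss, var)

assert isinstance(grad, tf.IndexedSlices)
# Asserting nonzero values ensures the test exercises a real update,
# unlike an all-zero gradient that would pass trivially.
np.testing.assert_allclose(grad.values.numpy(), np.ones([2, 2]))
assert grad.indices.numpy().tolist() == [0, 2]
```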
Review process to land