[zero-1] ignore overlap/contiguous_gradients flags #1246

Merged
jeffra merged 3 commits into master from jeffra/z1-defaults on Jul 27, 2021
Conversation

@jeffra (Collaborator) commented on Jul 23, 2021

  1. Overlap and contiguous gradients are meaningless in stage 1, so these flags should be ignored (see the sketch after this list).
  2. Found a typo'd MPI variable that would cause a crash (I guess it's not a frequently used code path).
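
For illustration, here is a minimal sketch of the override in item 1, assuming a plain dict-style ZeRO config; the helper name and warning message are hypothetical, not DeepSpeed's actual internals.

```python
import logging

logger = logging.getLogger(__name__)

ZERO_STAGE_1 = 1  # stage 1 only partitions optimizer states


def ignore_stage1_flags(zero_config: dict) -> dict:
    """Hypothetical helper: force-disable flags that have no effect in ZeRO stage 1."""
    if zero_config.get("stage") == ZERO_STAGE_1:
        for flag in ("overlap_comm", "contiguous_gradients"):
            if zero_config.get(flag):
                logger.warning(
                    "ZeRO stage 1 registers no backward hooks, so %r is ignored.", flag
                )
            zero_config[flag] = False
    return zero_config


# Both flags are silently turned off for a stage-1 config.
print(ignore_stage1_flags({"stage": 1, "overlap_comm": True, "contiguous_gradients": True}))
```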

@samyam (Contributor) left a comment

LGTM and makes sense to do this. Since none of the backward hooks are triggered by stage 1, there are no overlapping or contiguous gradients.
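
For illustration, a minimal PyTorch sketch of the backward-hook pattern that stage 2 relies on (and that the separate stage-1 implementation never registered); the function below is a simplified assumption, not DeepSpeed's actual code.

```python
import torch


def register_reduction_hooks(params, reduce_fn):
    """Attach a post-accumulation hook to each parameter (the stage-2 style pattern)."""
    handles = []
    for p in params:
        if not p.requires_grad:
            continue
        # The grad_fn of a trivial view exposes the AccumulateGrad node for p.
        grad_acc = p.expand_as(p).grad_fn.next_functions[0][0]

        def make_hook(param):
            def hook(*_):
                reduce_fn(param)  # e.g. launch an (async) all-reduce of param.grad
            return hook

        # Keep a reference to grad_acc so the node is not garbage-collected.
        handles.append((grad_acc, grad_acc.register_hook(make_hook(p))))
    return handles


# Toy usage: the hook fires for each parameter while backward is running.
layer = torch.nn.Linear(4, 4)
register_reduction_hooks(layer.parameters(), lambda p: print("reduce", tuple(p.shape)))
layer(torch.randn(2, 4)).sum().backward()
```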

jeffra merged commit 6ae756c into master on Jul 27, 2021
jeffra deleted the jeffra/z1-defaults branch on July 27, 2021 at 00:13
jeffra added a commit that referenced this pull request Jul 29, 2021
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
@xiaopqr commented on Jan 6, 2023

> LGTM and makes sense to do this. Since none of the backward hooks are triggered by stage 1, there are no overlapping or contiguous gradients.

What should we do if we want to overlap communication and computation in ZeRO 1?

github-merge-queue bot pushed a commit that referenced this pull request Jan 5, 2024
…4887)

The `overlap_comm` and `contiguous_gradients` options have been ignored
in ZeRO stage 1 since #1246.
At that time, ZeRO 1 and 2 were implemented separately (see
https://github.com/microsoft/DeepSpeed/tree/6ae756c03f12674f17aef90622e7664a8af9d2af/deepspeed/runtime/zero).
ZeRO 1 did not register gradient hooks to overlap backward with the
gradient all-reduce, so it was fine to ignore `overlap_comm` and
`contiguous_gradients`. However, in the current implementation, ZeRO 1
and 2 share almost the same code (`stage_1_and_2.py`), so features like
`overlap_comm` and `contiguous_gradients` can also be enabled for
ZeRO 1 (please correct me if I'm mistaken).

With this PR, turning on `overlap_comm` and `contiguous_gradients` for
ZeRO 1 on the [SFT
task](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning)
produces exactly the same training curve as the latest master.


![image](https://github.com/microsoft/DeepSpeed/assets/39846316/bda3be7b-c236-4e08-b687-b3cd01f5cc73)

I also see a ~1.05x end-to-end speedup from overlapping backward with the
gradient all-reduce. The trace confirms that backward and all-reduce do
overlap, and that the per-parameter gradients are indeed copied into a flat
buffer, so these options are effective for ZeRO 1 as well.


![image](https://github.com/microsoft/DeepSpeed/assets/39846316/5f876296-e1b4-404b-8b33-03cee8e5e6b2)


![image](https://github.com/microsoft/DeepSpeed/assets/39846316/9654f6be-5c7a-401a-b0bc-413ecd3f4e6b)

Related issue: #2295

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
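
For reference, a hedged sketch of a DeepSpeed config that turns both options on under ZeRO stage 1; the batch-size and fp16 values are illustrative assumptions, not settings taken from the PR.

```python
# Illustrative ZeRO stage-1 config; only the zero_optimization flags matter here,
# the remaining values are placeholder assumptions.
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,
        "overlap_comm": True,          # start gradient all-reduce during backward
        "contiguous_gradients": True,  # copy gradients into a flat buffer
    },
}

# Typical usage (model definition omitted):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```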
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request on Feb 17, 2024