[Bug] grad_weight can't be NoneType when running with DeepSpeed on Zero3. #428

Merged

tjruwase merged 1 commit into deepspeedai:main from ys950902:sy/bug_fix on Sep 20, 2024

Conversation

@ys950902

When running Megatron-DeepSpeed with DeepSpeed ZeRO-3, grad_weight is set to None by default, which causes the error below:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1253
AttributeError: 'NoneType' object has no attribute 'numel'
This PR fixes the error by making weight.grad equal to grad_weight.
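For illustration, a minimal, self-contained sketch of the failure mode (not the actual DeepSpeed code; `partition_grads` is a hypothetical stand-in for the partitioning logic in stage3.py that queries each gradient's size):

```python
import torch

def partition_grads(params):
    # Hypothetical stand-in for ZeRO-3's gradient partitioning in stage3.py:
    # it queries each parameter gradient's size, so a None gradient raises.
    for p in params:
        numel = p.grad.numel()
        print(f"partitioning {numel} gradient elements")

w = torch.nn.Parameter(torch.randn(4, 4))
(w * 2).sum().backward()   # normal backward: w.grad is a real tensor
partition_grads([w])       # works

w.grad = None              # zero-bubble defers the weight gradient
try:
    partition_grads([w])
except AttributeError as e:
    print(e)               # 'NoneType' object has no attribute 'numel'
```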

@ys950902
Author

@tjruwase, could you please take a look at this PR? It blocks users from using the ZeRO-3 stage.

@ys950902
Author

ys950902 commented Sep 3, 2024

Any concerns about this PR?

@tjruwase

tjruwase commented Sep 3, 2024

@ys950902, apologies for the delay on this PR. I am confused about this issue since Megatron-DeepSpeed and ZeRO3 have always been compatible. Can you share some repro details to help my understanding? Thanks!

@ys950902
Author

ys950902 commented Sep 19, 2024

@ys950902, apologies for the delay on this PR. I am confused about this issue since Megatron-DeepSpeed and ZeRO3 have always been compatible. Can you share some repro details to help my understanding? Thanks!

Hi @tjruwase, sorry for the late response. With the zero-bubble pipeline added in the latest Megatron-DeepSpeed, the backward computation is split into two parts (weight gradient and output gradient); the weight gradient is not computed immediately and is left as None. When running ZeRO-2/3, which partitions the gradients, this can cause unexpected errors. This PR adds a check: when running pipeline parallelism, take the current (deferred) path; when running ZeRO-2/3, take the former path.
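For context, a rough sketch of the zero-bubble idea described above (hypothetical names and simplified structure, not the actual Megatron-DeepSpeed implementation): the grad-input half of backward runs immediately, while the grad-weight half is queued and returns None, which is exactly the None that ZeRO-2/3 later trips over.

```python
import torch

# Work items whose weight gradients zero-bubble defers to later in the schedule.
deferred_wgrad_queue = []

class LinearWithDeferredWgrad(torch.autograd.Function):
    """Hypothetical sketch: backward is split into an immediate grad-input
    computation and a deferred grad-weight computation."""

    @staticmethod
    def forward(ctx, inputs, weight):
        ctx.save_for_backward(inputs, weight)
        return inputs.matmul(weight.t())

    @staticmethod
    def backward(ctx, grad_output):
        inputs, weight = ctx.saved_tensors
        grad_input = grad_output.matmul(weight)  # needed right away by the previous stage
        # Defer the weight gradient: queue the work and return None for it,
        # so weight.grad stays unset until the schedule pops the item.
        deferred_wgrad_queue.append((inputs, grad_output, weight))
        return grad_input, None

def flush_deferred_wgrads():
    # Later in the pipeline schedule the queued items are popped and the
    # weight gradients are finally computed and accumulated.
    with torch.no_grad():
        while deferred_wgrad_queue:
            inputs, grad_output, weight = deferred_wgrad_queue.pop()
            wgrad = grad_output.t().matmul(inputs)
            weight.grad = wgrad if weight.grad is None else weight.grad + wgrad
```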

@ys950902
Author

Hi @tjruwase, if you still have any concerns about this PR, please let me know.

@tjruwase

Hi @tjruwase, sorry for the late response. With the zero-bubble pipeline added in the latest Megatron-DeepSpeed, the backward computation is split into two parts (weight gradient and output gradient),

Thanks for the explanation. In that case, shouldn't this behavior depend on whether zero-bubble is enabled? In other words, check for args.enable_zbh1_pipeline. Can you please clarify?

@ys950902
Author

Hi @tjruwase, sorry for the late response. With the zero-bubble pipeline added in the latest Megatron-DeepSpeed, the backward computation is split into two parts (weight gradient and output gradient),

Thanks for the explanation. In that case, shouldn't this behavior depend on whether zero-bubble is enabled? In other words, check for args.enable_zbh1_pipeline. Can you please clarify?

Thanks for your reply. Yes, checking for args.enable_zbh1_pipeline is more reasonable: pipeline parallelism is only supported with ZeRO-0/ZeRO-1 in DeepSpeed, which does not partition the gradient, so it is also fine to leave the weight gradient as None there. I will change the check to args.enable_zbh1_pipeline to avoid the confusion.
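A minimal sketch of the gating being proposed (illustrative only; `linear_backward_wgrad` and its arguments are hypothetical, while the real change sits in the parallel-linear backward of Megatron-DeepSpeed):

```python
def linear_backward_wgrad(inputs, grad_output, args):
    """Sketch of the check discussed above (illustrative only)."""
    if args.enable_zbh1_pipeline:
        # Zero-bubble path: the weight gradient is deferred and computed later,
        # when the schedule pops the queued work item. Pipeline parallelism only
        # runs with ZeRO-0/1, which does not partition gradients, so a temporary
        # None here is harmless.
        grad_weight = None
    else:
        # Former path: compute the weight gradient eagerly so ZeRO-2/3
        # always has a real tensor to partition.
        grad_weight = grad_output.t().matmul(inputs)
    return grad_weight
```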

@ys950902
Author

ys950902 commented Sep 20, 2024

Hi @tjruwase, I noticed bug #442 in zero-bubble, so I made a small modification for better clarity, and this PR is also the solution for that bug: only when args.enable_zbh1_pipeline is enabled do we take this path and compute the weight gradient when the schedule pops the deferred work; when it is not enabled, we take the former path and compute the weight gradient directly. I think this is clearer. If approved, could you please also merge this PR, so we can upgrade Megatron-DeepSpeed to the latest for the next release.

@ys950902 ys950902 requested a review from tjruwase September 20, 2024 16:42
@tjruwase tjruwase merged commit 598c092 into deepspeedai:main Sep 20, 2024
@ys950902 ys950902 deleted the sy/bug_fix branch September 23, 2024 05:40
loadams pushed a commit that referenced this pull request Feb 7, 2025
…l divided the gradient (#428)

Signed-off-by: Logan Adams <loadams@microsoft.com>
YJHMITWEB pushed a commit to YJHMITWEB/Megatron-DeepSpeed that referenced this pull request Aug 9, 2025
…l divided the gradient (deepspeedai#428)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
tjruwase pushed a commit that referenced this pull request Aug 14, 2025
…nabled (#479)

* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add fused_rms_norm support on XPU device (#431)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [LLaMa] Adding support converting checkpoint from mds to hf (#432)

* add support converting checkpoint from hf to mds

* Fix PP issue

* update

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add device check when import ipex (#436)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix TFLOPs calculation (#371)

* fix TFLOPs calculation

when GQA used, we observe right TFLOPs after this fix.
when GQA is not used, huge difference in TFLOPs is solved with
selective recompute .
some other minor difference will also be observed as logits macs also added.

* add copyrights

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix nan issue when running megatron-deepspeed (#434)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* enable empty cache on XPU device (#438)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [wandb] disable wandb more gracefully (#422)

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [Bug] Fix crash when logging optimizer state to tb (#417)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* remove unnecessary files

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* set the warmup length to be FPDT chunk size if enabled

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* Enable Sequence Parallelism (#429)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* grad_wei can't be NoneType when running with DeepSpeed, for zero3 will divided the gradient (#428)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix init issue for rms_norm in squence_parallel (#448)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* enable profiler for specific ranks (#451)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix init issue for silently ignoring the deepspeed config (#452)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix moe tflops (#445)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [tool]GQA convert support (#454)

* [tools]GQA convert support

* fix readme

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* Fix import error in `deepspeed_to_megatron.py` (#455)

Previously, `deepspeed_to_megatron.py` would raise an import error
due to the relative import.

This commit fixes this issue by changing from the relative import
to the absolute import like in `deepspeed_to_transformers.py`.

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* Update references to new GitHub org (deepspeedai) (#462)

Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add sequence_parallel in layernorm init to enable 3D parallelism can run successfully with DeepSpeed (#468)

Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix bug when FPDT is disabled but with original Ulysses

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Signed-off-by: jinghan yao yjhmitweb@gmail.com
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

---------

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: jinghan yao yjhmitweb@gmail.com
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn>
Co-authored-by: billishyahao <yahao.he@gmail.com>
Co-authored-by: Polisetty V R K Jyothendra Varma <jvarma@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Co-authored-by: ranzhejiang <zhejiang.ran@intel.com>
Co-authored-by: Xinyu Lian <lian7@illinois.edu>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: hotsuyuki <hotsuyuki.kawanishi@gmail.com>
Co-authored-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>