fix sequence parallel(Ulysses) grad scale for zero0 by inkcherry · Pull Request #5555 · deepspeedai/DeepSpeed

inkcherry · 2024-05-21T07:47:06Z

Use dp_world_size for gradient reduction instead of seq_dp_world_size.
Currently, with ZeRO stage 0 (zero0), only sparse tensors use the correct world size.

Grad norm test on a tiny model with sp=4:

| grad_norm | step1 | step2 | step3 | step4 | step5 | step100 |
| -- | -- | -- | -- | -- | -- | -- |
| zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555 |
| zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889 |
| zero0 (this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554 |
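
The numbers in the table can be checked with a few lines of arithmetic. The sketch below only reuses the values reported above; the variable names are illustrative and not part of DeepSpeed. It suggests that the unpatched zero0 grad norm is smaller than the zero1 value by a factor very close to sp=4, which is what one would expect if the gradients were divided by a world size that is sp times too large.

```python
# Grad norms copied from the table above (steps 1-5 and step 100).
zero1 = [15.825, 16.646, 15.853, 16.159, 17.333, 15.555]
zero0_before = [3.956, 4.161, 3.963, 4.040, 4.333, 3.889]
zero0_patched = [15.825, 16.646, 15.853, 16.159, 17.333, 15.554]

sp_size = 4  # sequence parallel degree used in the test

# Ratio of the zero1 norm to the unpatched zero0 norm at each step.
ratios = [a / b for a, b in zip(zero1, zero0_before)]

# Largest deviation of those ratios from sp_size.
max_dev = max(abs(r - sp_size) for r in ratios)

# Largest absolute gap between zero1 and patched zero0.
patched_gap = max(abs(a - b) for a, b in zip(zero1, zero0_patched))

print("ratios:", [round(r, 4) for r in ratios])
print("max deviation from sp_size:", max_dev)
print("max gap after patch:", patched_gap)
```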

samadejacobs · 2024-05-24T17:02:48Z

deepspeed/runtime/engine.py

        # to maintain the gradients value unaffected by ep_size setting,
        # utilize dp_world_size for allreduce average
-        dp_world_size = dist.get_world_size(groups._get_data_parallel_group())
+        dp_world_size = dist.get_world_size(groups._get_data_parallel_group()) / float(self.sequence_parallel_size)


@inkcherry, can you help me understand why scale by sp_size? get_data_parallel_group != get_sequence_data_parallel_group, you should have correct value already, no?

inkcherry · 2024-05-28

Thanks for the review, @samadejacobs! Yes, this should already be the correct value. We should only need to modify the dp_world_size in the above instance.
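
The distinction discussed in this thread can be illustrated with a toy calculation. This is a minimal sketch, not DeepSpeed code: `reduce_and_scale` is a hypothetical helper, the rank counts are made up, and it assumes the gradient sum spans the combined sequence + data parallel group, whose size is taken to be dp_size * sp_size.

```python
def reduce_and_scale(per_rank_grads, divisor):
    """Sum one gradient value over all ranks (a stand-in for an
    all-reduce SUM), then divide by the chosen world size."""
    return sum(per_rank_grads) / divisor


# Hypothetical layout: 8 ranks in total, sequence parallel degree 4.
world_size = 8
sp_size = 4
dp_size = world_size // sp_size   # size of the data parallel group
seq_dp_size = dp_size * sp_size   # assumed size of the combined group

# Toy gradient contribution from every rank in the combined group.
per_rank_grads = [1.0] * seq_dp_size

# Dividing by the data parallel size (what the PR description asks for).
scaled_by_dp = reduce_and_scale(per_rank_grads, dp_size)

# Dividing by the combined size (the behaviour described as incorrect).
scaled_by_seq_dp = reduce_and_scale(per_rank_grads, seq_dp_size)

ratio = scaled_by_dp / scaled_by_seq_dp
print(scaled_by_dp, scaled_by_seq_dp, ratio)
```

Under these assumptions the two results differ by exactly sp_size, which is consistent with the roughly 4x gap reported in the grad norm test for sp=4.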

inkcherry · 2024-06-03T03:48:05Z

Hi @samadejacobs, I have removed the modification you mentioned on that line. Could you please review the other parts again? Thanks!

fix ds-sp grad scale for zero0

cb15ffa

inkcherry requested review from mrwyattii and tjruwase as code owners May 21, 2024 07:47

tjruwase requested review from samadejacobs and tohtana and removed request for mrwyattii and tjruwase May 21, 2024 15:14

samadejacobs reviewed May 24, 2024

keep the correct dp_size

60e0dbc

tohtana approved these changes Jun 5, 2024

samadejacobs added this pull request to the merge queue Jun 5, 2024

Merged via the queue into deepspeedai:master with commit 6b6d641 Jun 5, 2024

delock mentioned this pull request Sep 20, 2024

[TRACKER] Customer support related PR tracker for Intel devices #6556

Open

23 tasks

samadejacobs merged 2 commits into deepspeedai:master from inkcherry:fix_Ulysses_grad

inkcherry May 28, 2024
