fix sequence parallel (Ulysses) grad scale for zero0 #5555

Merged
samadejacobs merged 2 commits into deepspeedai:master from inkcherry:fix_Ulysses_grad
Jun 5, 2024
Conversation

@inkcherry
Contributor

@inkcherry inkcherry commented May 21, 2024

Use dp_world_size for gradient reduction instead of seq_dp_world_size.
Currently, under ZeRO stage 0, only sparse tensors use the correct world size.

Grad-norm test on a tiny model with sp=4:

grad_norm | step1 | step2 | step3 | step4 | step5 | step100
-- | -- | -- | -- | -- | -- | --
zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555
zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889
zero0 (this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554
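The zero0 rows above are smaller than the zero1 rows by almost exactly the sequence-parallel factor (e.g. 15.825 / 3.956 ≈ 4), which is consistent with the gradient allreduce being averaged over the combined sequence-data-parallel group rather than the data-parallel group alone. A minimal arithmetic sketch of this effect (the dp size here is an assumption for illustration, not taken from the PR):

```python
# Illustrative sketch of the scaling bug: with Ulysses sequence
# parallelism, the sp ranks within one model replica hold different
# sequence chunks, so their gradient contributions must be SUMMED,
# while the dp replicas must be AVERAGED.
dp = 2   # data-parallel replicas (assumed for illustration)
sp = 4   # sequence-parallel degree, matching the sp=4 test above

seq_dp = dp * sp          # combined sequence-data-parallel group size
partial_grad = 1.0        # each rank's gradient contribution

allreduced = partial_grad * seq_dp   # sum over all seq_dp ranks
correct = allreduced / dp            # average over dp replicas only
buggy = allreduced / seq_dp          # zero0 path divided by dp * sp

assert correct == buggy * sp  # gradient (and its norm) shrinks by sp
print(correct, buggy)         # 4.0 1.0
```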

@tjruwase tjruwase requested review from samadejacobs and tohtana and removed request for mrwyattii and tjruwase May 21, 2024 15:14
```diff
 # to maintain the gradients value unaffected by ep_size setting,
 # utilize dp_world_size for allreduce average
-dp_world_size = dist.get_world_size(groups._get_data_parallel_group())
+dp_world_size = dist.get_world_size(groups._get_data_parallel_group()) / float(self.sequence_parallel_size)
```
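The review discussion turned on whether this extra division double-corrects: if `groups._get_data_parallel_group()` already excludes the sequence-parallel ranks, its world size is dp, and dividing by sp again would under-scale the average. A toy sketch of the two group sizes (the rank layout and world size here are assumptions for illustration):

```python
# Toy rank layout for world_size = 8 with dp = 2, sp = 4.
# The sequence-data-parallel group spans all ranks; each
# data-parallel group contains only the ranks that hold the
# same sequence chunk across replicas.
dp, sp = 2, 4
world = list(range(dp * sp))

seq_dp_group = world                          # size dp * sp = 8
dp_groups = [[r, r + sp] for r in range(sp)]  # each of size dp = 2

assert len(seq_dp_group) == dp * sp
assert all(len(g) == dp for g in dp_groups)
# Dividing the dp group size by sp again would give 0.5, not dp,
# so the sp-adjustment only makes sense on the seq_dp group size.
assert len(dp_groups[0]) / sp != dp
assert len(seq_dp_group) / sp == dp
```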
Contributor
@inkcherry, can you help me understand why we scale by sp_size? Since get_data_parallel_group != get_sequence_data_parallel_group, you should already have the correct value, no?

Contributor Author

Thanks for the review, @samadejacobs! Yes, this should already be the correct value; we only need to modify dp_world_size in the instance above.

@inkcherry
Contributor Author

Hi @samadejacobs, I have removed the modification you mentioned on that line. Could you please review the other parts again? Thanks!

@samadejacobs samadejacobs added this pull request to the merge queue Jun 5, 2024
Merged via the queue into deepspeedai:master with commit 6b6d641 Jun 5, 2024
sfc-gh-reyazda pushed a commit to Snowflake-Labs/DeepSpeed that referenced this pull request Jun 10, 2024
3 participants