[Random] Random state management #38

Merged: 10 commits merged into awslabs:main on Feb 3, 2023

Conversation

@comaniac (Contributor) commented Feb 2, 2023

Description

  • Add random state management to handle the requirement of using the same or different random seeds within a tensor-parallel (TP) group. The implementation is based on the one in Megatron-LM (see the sketch after this list).
  • Add activation checkpointing that takes the random states into account.
  • Add an op DropoutWithTensorParallel that users can use to replace dropout when writing a schedule.
  • Add unit tests.
  • Disable the randomly plugin in pytest; it makes the random seed setup in our test fixture useless.
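
The following is a minimal, illustrative sketch of the Megatron-LM-style CUDA RNG state tracking referenced above; it is not the code in this PR, and the names (CudaRNGStateTracker, the "global"/"tensor-parallel" mode names, tp_rank) are hypothetical. The idea is to keep one CUDA RNG state per named mode and fork into the per-rank state only inside tensor-parallel regions, so dropout masks differ across TP shards while everything else stays reproducible.

```python
import contextlib

import torch


class CudaRNGStateTracker:
    """Illustrative sketch (not this PR's code) of Megatron-LM-style RNG tracking."""

    def __init__(self):
        self.states = {}

    def add(self, name, seed):
        # Seed the device RNG, snapshot its state under `name`, then restore
        # the previous state so registering a mode does not disturb callers.
        orig = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.states[name] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig)

    @contextlib.contextmanager
    def fork(self, name="tensor-parallel"):
        # Temporarily switch to the tracked state; write it back on exit so
        # the random-number stream of `name` stays contiguous across forks.
        orig = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.states[name])
        try:
            yield
        finally:
            self.states[name] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig)


# Hypothetical usage: every TP rank gets the same "global" seed but a
# rank-specific seed for the tensor-parallel mode, so dropout masks differ
# across shards while non-sharded randomness stays identical.
#   tracker = CudaRNGStateTracker()
#   tracker.add("global", 2013)
#   tracker.add("tensor-parallel", 2013 + 1 + tp_rank)
#   with tracker.fork():
#       out = torch.nn.functional.dropout(x, p=0.1)
```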

Notes:

  1. We now offer an API, set_random_seed, for users to call in the training script. Users have to call it manually and specify the ranks of 3D parallelism (see the sketch after these notes).
  2. All changes in this PR have no effect if set_random_seed is not called in advance.
  3. Fidelity testing shows that the updated GPT schedule with 3D parallelism can align the loss with ZeRO-3 (with and without activation checkpointing), but flash attention has to be disabled.
  4. I'll update flash attention to the latest version and see whether the problem goes away.
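
A minimal usage sketch for Note 1, assuming the signature suggested by the benchmark snippet quoted later in this thread, set_random_seed(seed, dp_rank, pp_rank, tp_rank), and a setup with tensor parallelism only (no DP or PP); how tp_rank is derived here is an assumption for illustration.

```python
import torch.distributed as dist

import slapo

# Assumed setup: a single tensor-parallel group spanning all ranks, with no
# data or pipeline parallelism, so the unused dimensions are passed as None
# (mirroring the benchmark script quoted later in this thread).
dist.init_process_group(backend="nccl")
tp_rank = dist.get_rank()  # assumption: global rank == TP rank in this setup

# Call once in the training script, before building the model (Note 1).
slapo.set_random_seed(2013, None, None, tp_rank)
```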

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [Model], [Tutorial], etc.)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

cc @szhengac @chhzh123

@chhzh123 (Contributor) left a comment

LGTM. Thanks @comaniac.

tests/test_shard_sync_op.py (thread resolved)
conftest.py (thread resolved)
# Note 1: We assume no DP and PP in this script.
# Note 2: This overrides Megatron random seed management, so we only use
# this script for benchmarking.
slapo.set_random_seed(2013, None, None, sch.rank)

Review comment (Contributor):
If I understand correctly, all the DP ranks also use the same seed, so the loss wouldn't be right; but we only use this script for benchmarking.

@szhengac (Contributor) commented Feb 3, 2023

LGTM

@szhengac merged commit d93764c into awslabs:main on Feb 3, 2023.

Successfully merging this pull request may close these issues:

[Feature] Random seed management for dropout layers in distributed environment