Universal ckp fixes #4588
Conversation
@stas00, FYI
Modify ds_to_universal to remove word embeddings padding. Then, when loading from universal ckp, pad with zeros. Signed-off-by: Moshe Island <misland@habana.ai>
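The commit above strips the vocabulary padding when converting to a universal checkpoint and zero-pads again on load. A minimal sketch of that idea (hypothetical helpers, not the PR's actual code; the real change lives in `ds_to_universal` and the checkpoint-loading path):

```python
import torch

def strip_vocab_padding(word_embeddings: torch.Tensor, true_vocab_size: int) -> torch.Tensor:
    """Drop rows that exist only to pad the vocabulary (e.g. to a multiple
    of the tensor-parallel size). Hypothetical helper for illustration."""
    return word_embeddings[:true_vocab_size].clone()

def repad_vocab(word_embeddings: torch.Tensor, padded_vocab_size: int) -> torch.Tensor:
    """Re-pad with zeros on load, so the tensor matches the padded vocab
    size expected by the current model configuration."""
    true_vocab_size, hidden = word_embeddings.shape
    padded = torch.zeros(padded_vocab_size, hidden, dtype=word_embeddings.dtype)
    padded[:true_vocab_size] = word_embeddings
    return padded
```

Storing the unpadded tensor makes the universal checkpoint independent of the padding implied by any particular tensor-parallel degree.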
When loading from a universal checkpoint, the optimizer step is not restored, because base_optimizer is not saved in the universal checkpoint. For now, set the base optimizer step after it is initialized. While at it, remove the unused step_count. Signed-off-by: Moshe Island <misland@habana.ai>
When loading from universal checkpoint with a different model parameter configuration, the loaded tensor parallel RNG tracker states are incorrect. In this case, we reconfigure the tensor parallel RNG tracker states with new seed values (each tp rank with a unique seed). We add an offset=iteration to the base seed. This is to ensure that when we load multiple times from universal checkpoint, we will use a different random sequence at each run. Signed-off-by: Moshe Island <misland@habana.ai>
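A seed scheme with both properties described above (a unique seed per tensor-parallel rank, plus an iteration offset so each reload draws a fresh sequence) might look like this; the stride constant and function name are assumptions for illustration, not the PR's actual values:

```python
def tp_rng_seed(base_seed: int, tp_rank: int, iteration: int,
                rank_stride: int = 2718) -> int:
    # rank_stride separates ranks so each tensor-parallel rank draws from
    # a distinct random stream; adding the iteration as an offset ensures
    # that loading the same universal checkpoint at a later iteration
    # does not replay the same sequence.
    return base_seed + iteration + rank_stride * tp_rank
```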
Verify that all model patterns are matched at least once. Signed-off-by: Moshe Island <misland@habana.ai>
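A validation of this kind can be sketched as a check that every pattern matched at least one parameter name (hypothetical helper mirroring the commit's intent, not its actual code):

```python
import re

def verify_all_patterns_matched(patterns, parameter_names):
    """Raise if any pattern matched no parameter name at all, which
    usually indicates a stale or mistyped pattern."""
    unmatched = [pat for pat in patterns
                 if not any(re.match(pat, name) for name in parameter_names)]
    if unmatched:
        raise ValueError(f"Patterns with zero matches: {unmatched}")
```

Failing fast here surfaces silently skipped parameters that would otherwise load with stale or uninitialized values.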
@tjruwase, I have an additional commit that adds universal checkpoint support for the llama model. Should I add it to this PR, or should I wait for this PR to be merged and create a new one? Note that it also requires changes in Megatron-DeepSpeed PR deepspeedai/Megatron-DeepSpeed#276. Also, I suspect that the unit test failures are not due to changes in this PR but rather to some unstable unit tests; other PRs (not mine) are failing on the same tests.
@mosheisland, so does that mean this PR is good to go? I was not sure because of the unresolved comments. But yes, it is okay for the llama changes to come in a different PR. No need to delay this one any longer.
@tjruwase, my understanding is that there are no open issues. Can you please double-check?
Signed-off-by: Moshe Island <misland@habana.ai> Co-authored-by: Moshe Island <misland@habana.ai> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
No description provided.