Universal ckp fixes #4588
Conversation
@stas00, FYI
Modify ds_to_universal to remove word embeddings padding. Then, when loading from universal ckp, pad with zeros. Signed-off-by: Moshe Island <misland@habana.ai>
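The commit above strips the vocabulary padding when converting to a universal checkpoint and zero-pads again on load. A minimal sketch of that idea (hypothetical helpers, not the PR's actual code; the real change lives in `ds_to_universal` and the checkpoint-loading path):

```python
import torch

def strip_vocab_padding(word_embeddings: torch.Tensor, true_vocab_size: int) -> torch.Tensor:
    """Drop rows that exist only to pad the vocabulary (e.g. to a multiple
    of the tensor-parallel size). Hypothetical helper for illustration."""
    return word_embeddings[:true_vocab_size].clone()

def repad_vocab(word_embeddings: torch.Tensor, padded_vocab_size: int) -> torch.Tensor:
    """Re-pad with zeros on load, so the tensor matches the padded vocab
    size expected by the current model configuration."""
    true_vocab_size, hidden = word_embeddings.shape
    padded = torch.zeros(padded_vocab_size, hidden, dtype=word_embeddings.dtype)
    padded[:true_vocab_size] = word_embeddings
    return padded
```

Storing the unpadded tensor makes the universal checkpoint independent of the padding implied by any particular tensor-parallel degree.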
When loading from a universal checkpoint, the optimizer step is not restored, because base_optimizer is not saved in the universal checkpoint. For now, set the base optimizer step after it is initialized. While at it, remove the unused step_count. Signed-off-by: Moshe Island <misland@habana.ai>
When loading from universal checkpoint with a different model parameter configuration, the loaded tensor parallel RNG tracker states are incorrect. In this case, we reconfigure the tensor parallel RNG tracker states with new seed values (each tp rank with a unique seed). We add an offset=iteration to the base seed. This is to ensure that when we load multiple times from universal checkpoint, we will use a different random sequence at each run. Signed-off-by: Moshe Island <misland@habana.ai>
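A seed scheme with both properties described above (a unique seed per tensor-parallel rank, plus an iteration offset so each reload draws a fresh sequence) might look like this; the stride constant and function name are assumptions for illustration, not the PR's actual values:

```python
def tp_rng_seed(base_seed: int, tp_rank: int, iteration: int,
                rank_stride: int = 2718) -> int:
    # rank_stride separates ranks so each tensor-parallel rank draws from
    # a distinct random stream; adding the iteration as an offset ensures
    # that loading the same universal checkpoint at a later iteration
    # does not replay the same sequence.
    return base_seed + iteration + rank_stride * tp_rank
```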
Verify that all model patterns are matched at least once. Signed-off-by: Moshe Island <misland@habana.ai>
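A validation of this kind can be sketched as a check that every pattern matched at least one parameter name (hypothetical helper mirroring the commit's intent, not its actual code):

```python
import re

def verify_all_patterns_matched(patterns, parameter_names):
    """Raise if any pattern matched no parameter name at all, which
    usually indicates a stale or mistyped pattern."""
    unmatched = [pat for pat in patterns
                 if not any(re.match(pat, name) for name in parameter_names)]
    if unmatched:
        raise ValueError(f"Patterns with zero matches: {unmatched}")
```

Failing fast here surfaces silently skipped parameters that would otherwise load with stale or uninitialized values.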
@tjruwase, I have an additional commit that adds universal checkpoint support for the llama model. Should I add it to this PR, or should I wait for this PR to be merged and create a new one? Note that it also requires changes in Megatron-DeepSpeed PR deepspeedai/Megatron-DeepSpeed#276. Also, I suspect that the unit test failures are not due to changes in this PR but rather to some unstable unit tests; other PRs (not mine) are failing on the same tests.
@mosheisland, so does that mean this PR is good to go? I was not sure because of the unresolved comments. But yes, it is okay for the llama changes to come in a different PR. No need to delay this one any longer.
@tjruwase, my understanding is that there are no open issues. Can you please double-check?
Signed-off-by: Moshe Island <misland@habana.ai> Co-authored-by: Moshe Island <misland@habana.ai> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
No description provided.