
Conversation

@mosheisland
Contributor

No description provided.

@tjruwase
Contributor

@stas00, FYI

Modify ds_to_universal to remove the word embeddings padding.
Then, when loading from a universal checkpoint, pad with zeros.

Signed-off-by: Moshe Island <misland@habana.ai>
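A minimal sketch of the padding round-trip described in this commit, assuming hypothetical helper names and a row-padded vocabulary dimension; the actual ds_to_universal code may differ:

```python
import torch

def strip_vocab_padding(word_embeddings: torch.Tensor, orig_vocab_size: int) -> torch.Tensor:
    # On conversion to universal format, drop the rows that were added
    # to round the vocab size up (e.g. to a multiple of the TP size).
    return word_embeddings[:orig_vocab_size, :]

def repad_vocab(word_embeddings: torch.Tensor, padded_vocab_size: int) -> torch.Tensor:
    # On load, re-pad with zeros to the target padded vocab size, which
    # may differ if the parallel configuration changed.
    pad_rows = padded_vocab_size - word_embeddings.shape[0]
    padding = torch.zeros(pad_rows, word_embeddings.shape[1],
                          dtype=word_embeddings.dtype)
    return torch.cat([word_embeddings, padding], dim=0)
```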
Moshe Island added 3 commits November 1, 2023 17:40
When loading from a universal checkpoint, the optimizer step is not restored.
This is because the base optimizer is not saved in the universal checkpoint.
For now, set the base optimizer step after it is initialized.
While at it, remove the unused step_count.

Signed-off-by: Moshe Island <misland@habana.ai>
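A hedged sketch of the workaround, assuming a PyTorch Adam-style optimizer that tracks a per-parameter `step` in its state; `optimizer` and `training_iteration` are illustrative names, not the actual DeepSpeed API:

```python
def restore_step_after_universal_load(optimizer, training_iteration: int):
    # Adam-style optimizers keep a per-parameter 'step' counter in their
    # state dict; set it to the checkpoint iteration so bias correction
    # resumes from the right point instead of restarting at zero.
    for group in optimizer.param_groups:
        for param in group["params"]:
            state = optimizer.state[param]
            if "step" in state:
                state["step"] = training_iteration
```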
When loading from a universal checkpoint with a different model parallel
configuration, the loaded tensor parallel RNG tracker states are incorrect.
In this case, we reconfigure the tensor parallel RNG tracker states with new
seed values (each TP rank with a unique seed).
We add an offset equal to the iteration to the base seed. This ensures that
when we load multiple times from a universal checkpoint, each run uses a
different random sequence.

Signed-off-by: Moshe Island <misland@habana.ai>
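An illustrative sketch of the reseeding scheme; the exact seed derivation and tracker API here are assumptions, not the Megatron/DeepSpeed implementation:

```python
import torch

def reconfigure_tp_rng(base_seed: int, iteration: int, tp_rank: int) -> None:
    # Offset by the iteration so repeated loads from the same universal
    # checkpoint do not replay the same random sequence, and space seeds
    # by TP rank so each tensor-parallel rank gets a unique stream.
    seed = base_seed + iteration + 2718 * tp_rank  # 2718: arbitrary rank spacing
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
```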
Verify that all model patterns are matched at least once.

Signed-off-by: Moshe Island <misland@habana.ai>
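A minimal sketch of such a check, with hypothetical names; the real validation added to ds_to_universal may be structured differently:

```python
import re
from typing import Dict, List

def verify_all_patterns_matched(patterns: List[str], param_names: List[str]) -> None:
    # Count matches per pattern so a stale or misspelled pattern is caught
    # immediately instead of silently skipping parameters.
    counts: Dict[str, int] = {p: 0 for p in patterns}
    for name in param_names:
        for p in patterns:
            if re.search(p, name):
                counts[p] += 1
    unmatched = [p for p, c in counts.items() if c == 0]
    assert not unmatched, f"Patterns matched zero parameters: {unmatched}"
```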
@mosheisland
Contributor Author

@tjruwase, I have an additional commit that adds universal checkpoint support for the llama model. Should I add it to this PR, or should I wait for this PR to be merged and create a new one? Note that it also requires changes in the Megatron-DeepSpeed PR deepspeedai/Megatron-DeepSpeed#276.

Also, I suspect that the unit test failures are not due to changes in this PR, but rather to some unstable unit tests. I know that other PRs (not mine) are failing on the same unit tests.

@tjruwase
Contributor

tjruwase commented Nov 9, 2023

@mosheisland, so does that mean this PR is good to go? I was not sure because of the unresolved comments. But yes, it is okay for the llama changes to come in a different PR. No need to delay this one any longer.

@mosheisland
Contributor Author

@tjruwase, my understanding is that there are no open issues. Can you please double-check?

@tjruwase tjruwase added this pull request to the merge queue Nov 9, 2023
Merged via the queue into deepspeedai:master with commit 8ad187d Nov 9, 2023
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
