Reproducible checkpoint #11582
Here is another possible way (including the previously suggested `has_torch_generator`, which doesn't currently exist).
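A hypothetical sketch of what such a helper could look like; the name comes from the comment above, but the attribute check is an assumption and this function does not exist in the library:

```python
import torch

# Hypothetical helper (not part of the library): returns True if the object
# exposes a torch.Generator whose state could be saved in the checkpoint.
def has_torch_generator(obj) -> bool:
    return isinstance(getattr(obj, "generator", None), torch.Generator)
```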
I'm just wondering whether you checked that different processes in a distributed environment will have the same CPU RNG state?
And another question: what about Python's main RNG?
In distributed training, the torch generator should have the same state on all processes since it's initially set with the same seed (and then all processes execute the same code), unless someone goes out of their way to execute a random instruction on only one process.
The code has been tested on one GPU, on 2 GPUs with DataParallel, and on 2 GPUs distributed, and the same results are obtained with a full training run and with resuming from the last checkpoint.
As for the Python main RNG and the NumPy main RNG, I have no way of extracting the seed (I can set it but not easily get the current state), so any code that wants to be 100% reproducible when resuming from a checkpoint needs to use torch RNGs only.
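For concreteness, here is a minimal sketch of what saving and restoring the torch RNG states at a checkpoint could look like; the file name and dictionary layout are illustrative assumptions, not the PR's actual implementation:

```python
import torch

def save_rng_state(path: str):
    # Capture the CPU RNG state, plus one CUDA RNG state per visible GPU.
    state = {"cpu": torch.get_rng_state()}
    if torch.cuda.is_available():
        state["cuda"] = torch.cuda.get_rng_state_all()
    torch.save(state, path)

def load_rng_state(path: str):
    # Restore the RNG states captured above when resuming from a checkpoint.
    state = torch.load(path)
    torch.set_rng_state(state["cpu"])
    if torch.cuda.is_available() and "cuda" in state:
        torch.cuda.set_rng_state_all(state["cuda"])
```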
Won't these work?
Do you mean to say that they should stick to torch's random APIs and convert to other formats from there?
Oh TIL! I didn't think it was possible to extract and set those. Will add that to the PR.
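For reference, a minimal sketch of capturing and restoring the Python and NumPy global RNG states; the variable names are illustrative, and how the PR stores them in the checkpoint is not shown here:

```python
import random
import numpy as np

# Capture the global RNG states before writing the checkpoint.
py_state = random.getstate()
np_state = np.random.get_state()

# ... later, when resuming from the checkpoint ...
random.setstate(py_state)
np.random.set_state(np_state)
```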