-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ModelCheckpoint race condition in file existence check #5155
Merged
Merged
Changes from 25 commits
Commits
Show all changes
30 commits
Select commit
Hold shift + click to select a range
1af3218
fix
awaelchli 0f2a986
remove repro script
awaelchli 6ffb64b
type annotation
awaelchli 9b7512a
added changelog
awaelchli b545a0f
fix import problem
awaelchli 622f929
Merge branch 'master' into bugfix/ddp-ckpt
awaelchli 5d1b297
fix NoneType error
awaelchli 2c1a07d
Merge branch 'master' into bugfix/ddp-ckpt
awaelchli e5d12cd
fix test
c13345d
Merge branch 'master' into bugfix/ddp-ckpt
f8529a4
debugging
f7b528a
debug
324c7c7
debug
cbb8b81
Merge branch 'master' into bugfix/ddp-ckpt
b717dcc
debug
awaelchli 988167b
skip horovod apex test to see if all others pass
awaelchli 87357eb
remove debug message
awaelchli 16fbebf
Merge branch 'master' into bugfix/ddp-ckpt
awaelchli f2cad67
Merge branch 'master' into bugfix/ddp-ckpt
awaelchli fb7533c
no-op broadcast for single core tpu
awaelchli 34e76b9
spelling
awaelchli 7ced05b
OMG add a barrier
awaelchli 437b5a2
skip
awaelchli 920df05
Merge branch 'master' into bugfix/ddp-ckpt
awaelchli 68b0d34
Update CHANGELOG.md
Borda 3393868
add back skip in test
awaelchli 3398164
Merge branch 'master' into bugfix/ddp-ckpt
awaelchli 7f3dd80
add changelog
awaelchli 81f7c14
Merge branch 'master' into bugfix/ddp-ckpt
Borda 3fccc5b
Merge branch 'master' into bugfix/ddp-ckpt
SkafteNicki File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tgaddair, sorry to ping you out of nowhere but I am stuck here with this broadcast causing a horovod test to hang (test_horovod_apex).
Do you see something obviously wrong about this broadcasting I'm trying to do here?
(print statements and systemexit above were just for debugging attempts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without having taken a close look at the code, I'm guessing that because we're in
model_checkpoint.py
, that this code is only being executed on rank 0? Is that possible?There should be some messages printed out by Horovod when such a stall occurs after about 30s or so. Are they being printed? Can you share them here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the hint, I found out that I can append the -s option in pytest and get the messages. And indeed as you said, there is the message about the stall
Does the above message mean that rank 0 missed the broadcast?
The model checkpoint code should be exectued on all ranks, the only difference should be that it is only allowed to write to disk on rank 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, looks like rank 1 is entering the checkpoint logic while rank 0 is still training the model. So it seems there is some non-deterministic behavior causing rank 1 to write a checkpoint. For example, there could be something like this going on:
That's hypothetical, but if the above logic existed, and rank 1 satisfied the condition but rank 0 didn't, it could lead to the situation above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @awaelchli, any update there ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what @tgaddair explains, but I can't find where in Lightning the source of the problem occurs. There is one test that fails, and the only difference between that test and the other horovod tests is that apex is turned on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tgaddair does
horovod.broadcast_object
not block? It looks like adding a barrier solves the problem. The failing apex test now passes locally, but the CI drone is still in trouble