Fix ModelCheckpoint race condition in file existence check #5155
Changes from 20 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -125,6 +125,7 @@ def test_horovod_multi_gpu(tmpdir): | |
_run_horovod(trainer_options, on_gpu=True) | ||
|
||
|
||
@pytest.mark.skip(reason="Horovod has a problem with broadcast when using apex?") | ||
So does it or does it not?
I wasn't able to investigate yet. The only difference between this test and the one above is apex. It's as if apex is affected by my broadcast operations in model checkpoint, or the other way around.
||
@pytest.mark.skipif(platform.system() == "Windows", reason="Horovod is not supported on Windows") | ||
@pytest.mark.skipif(not HOROVOD_NCCL_AVAILABLE, reason="test requires Horovod with NCCL support") | ||
@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="test requires multi-GPU machine") | ||
|
Hi @tgaddair, sorry to ping you out of nowhere, but I am stuck here with this broadcast causing a Horovod test to hang (`test_horovod_apex`).
Do you see anything obviously wrong with the broadcast I'm trying to do here?
(The print statements and `SystemExit` above were just debugging attempts.)
Without having taken a close look at the code, I'm guessing that because we're in `model_checkpoint.py`, this code is only being executed on rank 0? Is that possible? There should be some messages printed out by Horovod when such a stall occurs, after about 30s or so. Are they being printed? Can you share them here?
Thanks for the hint. I found out that I can pass the `-s` option to pytest and get the messages. And indeed, as you said, there is the message about the stall.
Does the above message mean that rank 0 missed the broadcast?
The model checkpoint code should be executed on all ranks; the only difference should be that only rank 0 is allowed to write to disk.
Yeah, looks like rank 1 is entering the checkpoint logic while rank 0 is still training the model. So it seems there is some non-deterministic behavior causing rank 1 to write a checkpoint. For example, there could be something like this going on:
That's hypothetical, but if the above logic existed, and rank 1 satisfied the condition but rank 0 didn't, it could lead to the situation above.
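The kind of divergence described above can be imitated without Horovod. The sketch below is a toy model, not Lightning or Horovod code: `FakeCollective`, `run_ranks`, and the per-rank flags are names invented for illustration, and a `threading.Barrier` with a timeout stands in for a collective operation such as a broadcast. It only shows that when a rank-local condition lets one rank enter a collective call and the other skip it, the caller stalls.

```python
import threading


class FakeCollective:
    """Toy stand-in for a collective op: it completes only if every rank calls it."""

    def __init__(self, world_size, timeout=0.2):
        self._barrier = threading.Barrier(world_size)
        self._timeout = timeout

    def broadcast(self):
        try:
            # Blocks until all ranks arrive, or gives up after the timeout.
            self._barrier.wait(timeout=self._timeout)
            return "ok"
        except threading.BrokenBarrierError:
            # Some rank never showed up: this is the "stall".
            return "stalled"


def run_ranks(should_checkpoint):
    """Run one thread per rank; should_checkpoint[rank] is a rank-local condition."""
    world_size = len(should_checkpoint)
    collective = FakeCollective(world_size)
    results = {}

    def worker(rank):
        if should_checkpoint[rank]:
            # Only ranks that satisfy the local condition join the collective.
            results[rank] = collective.broadcast()
        else:
            # This rank keeps "training" and never joins.
            results[rank] = "skipped"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Rank 1 enters the checkpoint logic, rank 0 does not: rank 1 stalls.
diverged = run_ranks([False, True])
# Both ranks agree on the condition: the collective completes.
agreed = run_ranks([True, True])
```

In this toy model the stall goes away as soon as every rank evaluates the condition the same way, which is why a condition that depends on rank-local state is a natural suspect.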
Hey @awaelchli, any update here?
I understand what @tgaddair explains, but I can't find where in Lightning the source of the problem occurs. There is one test that fails, and the only difference between that test and the other horovod tests is that apex is turned on.
@tgaddair does `horovod.broadcast_object` not block? It looks like adding a barrier solves the problem. The failing apex test now passes locally, but the CI drone is still in trouble.
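For readers following the thread, here is a minimal sketch of the pattern being discussed: every rank runs the checkpoint logic, only rank 0 checks the file system and writes, the chosen path is broadcast, and a barrier follows. It is an assumption-laden illustration, not the code in this PR. `resolve_checkpoint_path`, `save_checkpoint`, the `-v{n}` suffix, and the `broadcast`/`barrier` callables are all hypothetical; the defaults are single-process no-ops so the example runs without Horovod.

```python
import os
from typing import Any, Callable


def _identity(obj: Any) -> Any:
    """Single-process stand-in for a broadcast from rank 0 (assumption)."""
    return obj


def _noop() -> None:
    """Single-process stand-in for a barrier (assumption)."""
    return None


def resolve_checkpoint_path(
    dirpath: str,
    name: str,
    rank: int,
    broadcast: Callable[[Any], Any] = _identity,
    barrier: Callable[[], None] = _noop,
) -> str:
    """Called on every rank; only rank 0 performs the file existence check."""
    path = None
    if rank == 0:
        version = 0
        path = os.path.join(dirpath, f"{name}.ckpt")
        # Pick the first file name that is not taken yet.
        while os.path.exists(path):
            version += 1
            path = os.path.join(dirpath, f"{name}-v{version}.ckpt")
    # Every rank calls broadcast, so no rank waits for a call that never comes.
    path = broadcast(path)
    barrier()
    return path


def save_checkpoint(
    dirpath: str,
    name: str,
    rank: int,
    payload: bytes,
    broadcast: Callable[[Any], Any] = _identity,
    barrier: Callable[[], None] = _noop,
) -> str:
    """All ranks agree on the path; only rank 0 writes to disk."""
    path = resolve_checkpoint_path(dirpath, name, rank, broadcast, barrier)
    if rank == 0:
        with open(path, "wb") as f:
            f.write(payload)
    # Keep the other ranks from moving on before the file exists.
    barrier()
    return path
```

In a real multi-process run the two callables would be supplied by the distributed backend, for example a wrapper around `horovod.broadcast_object` with rank 0 as the root; that wiring is not shown or tested here.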