Fix ModelCheckpoint race condition in file existence check #5155
Conversation
print(trainer.global_rank, "torch.distributed.is_initialized:", torch.distributed.is_initialized())
raise SystemExit()
if trainer.accelerator_backend is not None:
    exists = trainer.accelerator_backend.broadcast(exists)
Hi @tgaddair, sorry to ping you out of nowhere, but I am stuck here with this broadcast causing a Horovod test to hang (test_horovod_apex).
Do you see anything obviously wrong about the broadcasting I'm trying to do here?
(The print statement and SystemExit above were only debugging attempts.)
Without having taken a close look at the code, I'm guessing that because we're in model_checkpoint.py, this code is only being executed on rank 0? Is that possible?
There should be some messages printed out by Horovod when such a stall occurs after about 30s or so. Are they being printed? Can you share them here?
(lightning) [aw18f408@vnode03 pytorch-lightning]$ py.test -v tests/models/test_horovod.py::test_horovod_apex -s
================================================================================================ test session starts =================================================================================================
platform linux -- Python 3.8.3, pytest-6.0.1, py-1.9.0, pluggy-0.13.1 -- /home/aw18f408/.conda/envs/lightning/bin/python
cachedir: .pytest_cache
rootdir: /home/aw18f408/repositories/pytorch-lightning, configfile: setup.cfg
plugins: hydra-core-1.0.4, flake8-1.0.6, cov-2.10.0
collected 1 item
tests/models/test_horovod.py::test_horovod_apex [0]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
[1]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
[0]<stderr>: assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'
[1]<stderr>: assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'
[0]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/graphql/type/typemap.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
[0]<stderr>: from collections import OrderedDict, Sequence, defaultdict
[1]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/graphql/type/typemap.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
[1]<stderr>: from collections import OrderedDict, Sequence, defaultdict
[1]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: DeprecationWarning: mode='auto' is deprecated in v1.1 and will be removed in v1.3. Default value for mode with be 'min' in v1.3.
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: DeprecationWarning: mode='auto' is deprecated in v1.1 and will be removed in v1.3. Default value for mode with be 'min' in v1.3.
[0]<stderr>: warnings.warn(*args, **kwargs)
[1]<stderr>: warnings.warn(*args, **kwargs)
[0]<stderr>:GPU available: True, used: True
[0]<stderr>:TPU available: None, using: 0 TPU cores
[1]<stderr>:GPU available: True, used: True
[1]<stderr>:TPU available: None, using: 0 TPU cores
[0]<stderr>:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[1]<stderr>:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[0]<stderr>:Using APEX 16bit precision.
[1]<stderr>:Using APEX 16bit precision.
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/connectors/precision_connector.py:69: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
[1]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/connectors/precision_connector.py:69: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
[0]<stderr>: log.warn("LightningOptimizer doesn't support Apex")
[1]<stderr>: log.warn("LightningOptimizer doesn't support Apex")
[0]<stderr>:LightningOptimizer doesn't support Apex
[1]<stderr>:LightningOptimizer doesn't support Apex
[0]<stdout>:Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
[0]<stdout>:
[0]<stdout>:Defaults for this optimization level are:
[0]<stdout>:enabled : True
[1]<stdout>:Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
[0]<stdout>:opt_level : O2
[1]<stdout>:
[0]<stdout>:cast_model_type : torch.float16
[0]<stdout>:patch_torch_functions : False
[0]<stdout>:keep_batchnorm_fp32 : True
[1]<stdout>:Defaults for this optimization level are:
[0]<stdout>:master_weights : True
[1]<stdout>:enabled : True
[0]<stdout>:loss_scale : dynamic
[1]<stdout>:opt_level : O2
[0]<stdout>:Processing user overrides (additional kwargs that are not None)...
[1]<stdout>:cast_model_type : torch.float16
[0]<stdout>:After processing overrides, optimization options are:
[1]<stdout>:patch_torch_functions : False
[0]<stdout>:enabled : True
[1]<stdout>:keep_batchnorm_fp32 : True
[0]<stdout>:opt_level : O2
[1]<stdout>:master_weights : True
[0]<stdout>:cast_model_type : torch.float16
[1]<stdout>:loss_scale : dynamic
[0]<stdout>:patch_torch_functions : False
[1]<stdout>:Processing user overrides (additional kwargs that are not None)...
[0]<stdout>:keep_batchnorm_fp32 : True
[1]<stdout>:After processing overrides, optimization options are:
[0]<stdout>:master_weights : True
[1]<stdout>:enabled : True
[0]<stdout>:loss_scale : dynamic
[1]<stdout>:opt_level : O2
[1]<stdout>:cast_model_type : torch.float16
[1]<stdout>:patch_torch_functions : False
[1]<stdout>:keep_batchnorm_fp32 : True
[1]<stdout>:master_weights : True
[1]<stdout>:loss_scale : dynamic
[0]<stderr>:
[0]<stderr>: | Name | Type | Params | In sizes | Out sizes
[0]<stderr>:------------------------------------------------------------------
[0]<stderr>:0 | c_d1 | Linear | 785 K | [5, 784] | [5, 1000]
[0]<stderr>:1 | c_d1_bn | BatchNorm1d | 2.0 K | [5, 1000] | [5, 1000]
[0]<stderr>:2 | c_d1_drop | Dropout | 0 | [5, 1000] | [5, 1000]
[0]<stderr>:3 | c_d2 | Linear | 10.0 K | [5, 1000] | [5, 10]
[0]<stderr>:------------------------------------------------------------------
[0]<stderr>:797 K Trainable params
[0]<stderr>:0 Non-trainable params
[0]<stderr>:797 K Total params
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
[0]<stderr>: warnings.warn(*args, **kwargs)
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: UserWarning: The validation_epoch_end should not return anything as of 9.1. To log, use self.log(...) or self.write(...) directly in the LightningModule
[0]<stderr>: warnings.warn(*args, **kwargs)
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: UserWarning: The {progress_bar:dict keyword} was deprecated in 0.9.1 and will be removed in 1.0.0
[0]<stderr>:Please use self.log(...) inside the lightningModule instead.
[0]<stderr>:
[0]<stderr>:# log on a step or aggregate epoch metric to the logger and/or progress bar
[0]<stderr>:# (inside LightningModule)
[0]<stderr>:self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
[0]<stderr>: warnings.warn(*args, **kwargs)
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: UserWarning: The {log:dict keyword} was deprecated in 0.9.1 and will be removed in 1.0.0
[0]<stderr>:Please use self.log(...) inside the lightningModule instead.
[0]<stderr>:
[0]<stderr>:# log on a step or aggregate epoch metric to the logger and/or progress bar
[0]<stderr>:# (inside LightningModule)
[0]<stderr>:self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
[0]<stderr>: warnings.warn(*args, **kwargs)
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
[0]<stderr>: warnings.warn(*args, **kwargs)
[1]<stdout>:Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
[0]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/horovod/torch/optimizer.py:252: UserWarning: optimizer.step() called without optimizer.skip_synchronize() context after optimizer.synchronize(). This can cause training slowdown. You may want to consider using optimizer.skip_synchronize() context if you use optimizer.synchronize() in your code.
[0]<stderr>: warnings.warn("optimizer.step() called without "
[1]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/horovod/torch/optimizer.py:252: UserWarning: optimizer.step() called without optimizer.skip_synchronize() context after optimizer.synchronize(). This can cause training slowdown. You may want to consider using optimizer.skip_synchronize() context if you use optimizer.synchronize() in your code.
[1]<stderr>: warnings.warn("optimizer.step() called without "
[1]<stdout>:1 <pytorch_lightning.accelerators.horovod_accelerator.HorovodAccelerator object at 0x7f9efa5c2490>
[1]<stdout>:1 exists: False /tmp/pytest-of-aw18f408/pytest-0/test_horovod_apex0/epoch=0-step=1.ckpt
[0]<stderr>:[2020-12-29 16:49:57.723209: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:[2020-12-29 16:50:57.724380: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:[2020-12-29 16:51:57.726092: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]
[0]<stderr>:[2020-12-29 16:52:57.726326: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]
[0]<stderr>:[2020-12-29 16:53:57.726988: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]
[0]<stderr>:[2020-12-29 16:54:57.727621: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]
[0]<stderr>:[2020-12-29 16:55:57.728667: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]
Thanks for the hint, I found out that I can append the -s option to pytest and get the messages. And indeed, as you said, there is the message about the stall:
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]
Does the above message mean that rank 0 missed the broadcast?
The model checkpoint code should be executed on all ranks; the only difference should be that only rank 0 is allowed to write to disk.
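A rough sketch of that contract (illustrative only, not the actual callback code; I'm assuming trainer.is_global_zero and trainer.save_checkpoint as the Lightning entry points):

def _save(self, trainer, filepath):
    # every rank runs the same bookkeeping, so the callback state stays in sync
    self.best_model_path = filepath
    # only the global-zero process is allowed to touch the filesystem
    if trainer.is_global_zero:
        trainer.save_checkpoint(filepath)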
Yeah, looks like rank 1 is entering the checkpoint logic while rank 0 is still training the model. So it seems there is some non-deterministic behavior causing rank 1 to write a checkpoint. For example, there could be something like this going on:
if some_local_metric < threshold:
    write_checkpoint()
continue_training_loop()
That's hypothetical, but if the above logic existed, and rank 1 satisfied the condition but rank 0 didn't, it could lead to the situation above.
Hey @awaelchli, any update on this?
I understand what @tgaddair explained, but I can't find where in Lightning the source of the problem is. There is one test that fails, and the only difference between that test and the other Horovod tests is that apex is turned on.
@tgaddair does horovod.broadcast_object not block? It looks like adding a barrier solves the problem. The failing apex test now passes locally, but the CI drone is still in trouble.
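For reference, a minimal sketch of the barrier-before-broadcast idea (illustrative, not the exact accelerator change; it assumes hvd.join() as the blocking synchronisation point):

import horovod.torch as hvd

def broadcast(obj, root_rank=0):
    # Let every rank finish its outstanding collectives (e.g. the gradient
    # allreduce that rank 0 is still running) before the broadcast is submitted,
    # so no rank can miss it and deadlock the others.
    hvd.join()
    return hvd.broadcast_object(obj, root_rank=root_rank)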
How is this P0 going? 🐰
@Borda I have a strange TPU test and an apex test failing, and I don't know how they are related to my changes. I need to do further testing.
@awaelchli The failing tests are the single-core TPU tests. It's happening because broadcast is being called when using a single core. Adding these lines to the broadcast method in tpu_accelerator should fix it:

if self.trainer.tpu_id is not None:
    return obj
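In method form that suggestion would look roughly like the sketch below (illustrative; _xla_broadcast is a placeholder name for the existing multi-core broadcast path):

def broadcast(self, obj, src=0):
    # Single-core TPU run: there are no other processes to broadcast to,
    # so skip the collective and hand the object back unchanged.
    if self.trainer.tpu_id is not None:
        return obj
    return self._xla_broadcast(obj, src)  # placeholder for the existing multi-core logic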
@@ -125,6 +125,7 @@ def test_horovod_multi_gpu(tmpdir):
    _run_horovod(trainer_options, on_gpu=True)


@pytest.mark.skip(reason="Horovod has a problem with broadcast when using apex?")
So does it or does it not?
I wasn't able to investigate yet; the only difference between this test and the one above is apex. It's as if apex is affected by my broadcast operations in the model checkpoint, or the other way around.
LGTM
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
What does this PR do?
Fixes #5144
We need to be careful whenever we have code like this:
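The sketch below only illustrates the shape of the problem; it is not the literal ModelCheckpoint source, and _versioned_path is a hypothetical helper that appends the "-v0" suffix.

if self._fs.exists(filepath):
    # every rank evaluates this check on its own, at its own point in time
    filepath = self._versioned_path(filepath)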
The check should only be performed on rank 0 and the result should be broadcast to the other ranks.
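A sketch of that intent, mirroring the diff quoted earlier in this review (simplified; accelerator_backend.broadcast shares rank 0's value with every rank):

exists = self._fs.exists(filepath)
if trainer.accelerator_backend is not None:
    # keep only rank 0's answer so all ranks resolve the same filename
    exists = trainer.accelerator_backend.broadcast(exists)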
Without the broadcast, it can happen that rank 0 determines the file does not exist and saves it, while rank 1, arriving a moment later, determines the file already exists; the state of the ModelCheckpoint then diverges between the ranks.
The result is that rank 0 will save epoch=X.ckpt as best model path, and rank 1 will save epoch=X-v0.ckpt as best model path, but the latter does not actually exist on disk.

cc: @rohitgr7 @carmocca