
Fix ModelCheckpoint race condition in file existence check #5155

Merged
awaelchli merged 30 commits into master from bugfix/ddp-ckpt on Jan 27, 2021

Conversation

awaelchli
Contributor

What does this PR do?

Fixes #5144

We need to be careful whenever we have code like this:

if exists(file):
    change(self.state)

The check should only be performed on rank 0, and the result should be broadcast to the other ranks.
Otherwise it can happen that rank 0 determines the file does not exist and saves it, while rank 1 checks later and finds that the file already exists, so the state of the ModelCheckpoint diverges between the ranks.
The result is that rank 0 records epoch=X.ckpt as the best model path, while rank 1 records epoch=X-v0.ckpt, a file that never actually exists on disk.
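A minimal sketch of the pattern described above, assuming a broadcast helper like the one on the accelerator backend (the function name and exact call sites here are illustrative, not the code that landed in this PR):

import os

def file_exists_on_all_ranks(trainer, filepath: str) -> bool:
    # Only rank 0's answer matters; broadcasting it keeps the ModelCheckpoint state
    # identical on every rank, so no rank silently switches to "-v0" versioning.
    exists = os.path.exists(filepath) if trainer.is_global_zero else False
    if trainer.accelerator_backend is not None:
        exists = trainer.accelerator_backend.broadcast(exists)
    return exists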

cc: @rohitgr7 @carmocca

@awaelchli awaelchli added bug Something isn't working priority: 0 High priority task distributed Generic distributed-related topic checkpointing Related to checkpointing labels Dec 16, 2020
@awaelchli awaelchli added this to the 1.1.x milestone Dec 16, 2020
@awaelchli awaelchli changed the title Fix ModelCheckpoint race condition in file existance check Fix ModelCheckpoint race condition in file existence check Dec 16, 2020
print(trainer.global_rank, "torch.distributed.is_initialized:", torch.distributed.is_initialized())
raise SystemExit()
if trainer.accelerator_backend is not None:
    exists = trainer.accelerator_backend.broadcast(exists)
Contributor Author

@awaelchli awaelchli Dec 28, 2020


Hi @tgaddair, sorry to ping you out of nowhere, but I am stuck here: this broadcast causes a Horovod test to hang (test_horovod_apex).
Do you see anything obviously wrong with the broadcast I'm trying to do here?

(The print statement and SystemExit above were just for debugging.)

Contributor


Without having taken a close look at the code, I'm guessing that because we're in model_checkpoint.py, this code is only being executed on rank 0? Is that possible?

Horovod should print some messages when such a stall occurs, after about 30s or so. Are they being printed? Can you share them here?

Contributor Author


(lightning) [aw18f408@vnode03 pytorch-lightning]$ py.test -v tests/models/test_horovod.py::test_horovod_apex -s
================================================================================================ test session starts =================================================================================================
platform linux -- Python 3.8.3, pytest-6.0.1, py-1.9.0, pluggy-0.13.1 -- /home/aw18f408/.conda/envs/lightning/bin/python
cachedir: .pytest_cache
rootdir: /home/aw18f408/repositories/pytorch-lightning, configfile: setup.cfg
plugins: hydra-core-1.0.4, flake8-1.0.6, cov-2.10.0
collected 1 item

tests/models/test_horovod.py::test_horovod_apex [0]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
[1]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
[0]<stderr>:  assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'
[1]<stderr>:  assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'
[0]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/graphql/type/typemap.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
[0]<stderr>:  from collections import OrderedDict, Sequence, defaultdict
[1]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/graphql/type/typemap.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
[1]<stderr>:  from collections import OrderedDict, Sequence, defaultdict
[1]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: DeprecationWarning: mode='auto' is deprecated in v1.1 and will be removed in v1.3. Default value for mode with be 'min' in v1.3.
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: DeprecationWarning: mode='auto' is deprecated in v1.1 and will be removed in v1.3. Default value for mode with be 'min' in v1.3.
[0]<stderr>:  warnings.warn(*args, **kwargs)
[1]<stderr>:  warnings.warn(*args, **kwargs)
[0]<stderr>:GPU available: True, used: True
[0]<stderr>:TPU available: None, using: 0 TPU cores
[1]<stderr>:GPU available: True, used: True
[1]<stderr>:TPU available: None, using: 0 TPU cores
[0]<stderr>:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[1]<stderr>:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[0]<stderr>:Using APEX 16bit precision.
[1]<stderr>:Using APEX 16bit precision.
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/connectors/precision_connector.py:69: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
[1]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/trainer/connectors/precision_connector.py:69: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
[0]<stderr>:  log.warn("LightningOptimizer doesn't support Apex")
[1]<stderr>:  log.warn("LightningOptimizer doesn't support Apex")
[0]<stderr>:LightningOptimizer doesn't support Apex
[1]<stderr>:LightningOptimizer doesn't support Apex
[0]<stdout>:Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.
[0]<stdout>:
[0]<stdout>:Defaults for this optimization level are:
[0]<stdout>:enabled                : True
[1]<stdout>:Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.
[0]<stdout>:opt_level              : O2
[1]<stdout>:
[0]<stdout>:cast_model_type        : torch.float16
[0]<stdout>:patch_torch_functions  : False
[0]<stdout>:keep_batchnorm_fp32    : True
[1]<stdout>:Defaults for this optimization level are:
[0]<stdout>:master_weights         : True
[1]<stdout>:enabled                : True
[0]<stdout>:loss_scale             : dynamic
[1]<stdout>:opt_level              : O2
[0]<stdout>:Processing user overrides (additional kwargs that are not None)...
[1]<stdout>:cast_model_type        : torch.float16
[0]<stdout>:After processing overrides, optimization options are:
[1]<stdout>:patch_torch_functions  : False
[0]<stdout>:enabled                : True
[1]<stdout>:keep_batchnorm_fp32    : True
[0]<stdout>:opt_level              : O2
[1]<stdout>:master_weights         : True
[0]<stdout>:cast_model_type        : torch.float16
[1]<stdout>:loss_scale             : dynamic
[0]<stdout>:patch_torch_functions  : False
[1]<stdout>:Processing user overrides (additional kwargs that are not None)...
[0]<stdout>:keep_batchnorm_fp32    : True
[1]<stdout>:After processing overrides, optimization options are:
[0]<stdout>:master_weights         : True
[1]<stdout>:enabled                : True
[0]<stdout>:loss_scale             : dynamic
[1]<stdout>:opt_level              : O2
[1]<stdout>:cast_model_type        : torch.float16
[1]<stdout>:patch_torch_functions  : False
[1]<stdout>:keep_batchnorm_fp32    : True
[1]<stdout>:master_weights         : True
[1]<stdout>:loss_scale             : dynamic
[0]<stderr>:
[0]<stderr>:  | Name      | Type        | Params | In sizes  | Out sizes
[0]<stderr>:------------------------------------------------------------------
[0]<stderr>:0 | c_d1      | Linear      | 785 K  | [5, 784]  | [5, 1000]
[0]<stderr>:1 | c_d1_bn   | BatchNorm1d | 2.0 K  | [5, 1000] | [5, 1000]
[0]<stderr>:2 | c_d1_drop | Dropout     | 0      | [5, 1000] | [5, 1000]
[0]<stderr>:3 | c_d2      | Linear      | 10.0 K | [5, 1000] | [5, 10]
[0]<stderr>:------------------------------------------------------------------
[0]<stderr>:797 K     Trainable params
[0]<stderr>:0         Non-trainable params
[0]<stderr>:797 K     Total params
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
[0]<stderr>:  warnings.warn(*args, **kwargs)
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: UserWarning: The validation_epoch_end should not return anything as of 9.1. To log, use self.log(...) or self.write(...) directly in the LightningModule
[0]<stderr>:  warnings.warn(*args, **kwargs)
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: UserWarning: The {progress_bar:dict keyword} was deprecated in 0.9.1 and will be removed in 1.0.0
[0]<stderr>:Please use self.log(...) inside the lightningModule instead.
[0]<stderr>:
[0]<stderr>:# log on a step or aggregate epoch metric to the logger and/or progress bar
[0]<stderr>:# (inside LightningModule)
[0]<stderr>:self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
[0]<stderr>:  warnings.warn(*args, **kwargs)
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: UserWarning: The {log:dict keyword} was deprecated in 0.9.1 and will be removed in 1.0.0
[0]<stderr>:Please use self.log(...) inside the lightningModule instead.
[0]<stderr>:
[0]<stderr>:# log on a step or aggregate epoch metric to the logger and/or progress bar
[0]<stderr>:# (inside LightningModule)
[0]<stderr>:self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
[0]<stderr>:  warnings.warn(*args, **kwargs)
[0]<stderr>:/home/aw18f408/repositories/pytorch-lightning/pytorch_lightning/utilities/distributed.py:49: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
[0]<stderr>:  warnings.warn(*args, **kwargs)
[1]<stdout>:Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
[0]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/horovod/torch/optimizer.py:252: UserWarning: optimizer.step() called without optimizer.skip_synchronize() context after optimizer.synchronize(). This can cause training slowdown. You may want to consider using optimizer.skip_synchronize() context if you use optimizer.synchronize() in your code.
[0]<stderr>:  warnings.warn("optimizer.step() called without "
[1]<stderr>:/home/aw18f408/.conda/envs/lightning/lib/python3.8/site-packages/horovod/torch/optimizer.py:252: UserWarning: optimizer.step() called without optimizer.skip_synchronize() context after optimizer.synchronize(). This can cause training slowdown. You may want to consider using optimizer.skip_synchronize() context if you use optimizer.synchronize() in your code.
[1]<stderr>:  warnings.warn("optimizer.step() called without "
[1]<stdout>:1 <pytorch_lightning.accelerators.horovod_accelerator.HorovodAccelerator object at 0x7f9efa5c2490>
[1]<stdout>:1 exists: False /tmp/pytest-of-aw18f408/pytest-0/test_horovod_apex0/epoch=0-step=1.ckpt
[0]<stderr>:[2020-12-29 16:49:57.723209: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:[2020-12-29 16:50:57.724380: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:[2020-12-29 16:51:57.726092: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]
[0]<stderr>:[2020-12-29 16:52:57.726326: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]
[0]<stderr>:[2020-12-29 16:53:57.726988: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]
[0]<stderr>:[2020-12-29 16:54:57.727621: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]
[0]<stderr>:[2020-12-29 16:55:57.728667: W /tmp/pip-install-uk4q7bn5/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]

Thanks for the hint. I found out that I can append the -s option to pytest to get the messages, and indeed, as you said, there is a message about the stall:

[0]<stderr>:Missing ranks:
[0]<stderr>:0: [broadcast.bool.sz]
[0]<stderr>:1: [allreduce.c_d1.bias, allreduce.c_d1.weight, allreduce.c_d1_bn.bias, allreduce.c_d1_bn.weight, allreduce.c_d2.bias, allreduce.c_d2.weight ...]

Does the above message mean that rank 0 missed the broadcast?
The model checkpoint code should be executed on all ranks; the only difference should be that only rank 0 is allowed to write to disk.

Contributor

@tgaddair tgaddair Dec 29, 2020


Yeah, looks like rank 1 is entering the checkpoint logic while rank 0 is still training the model. So it seems there is some non-deterministic behavior causing rank 1 to write a checkpoint. For example, there could be something like this going on:

if some_local_metric < threshold:
    write_checkpoint()
continue_training_loop()

That's hypothetical, but if the above logic existed, and rank 1 satisfied the condition but rank 0 didn't, it could lead to the situation above.
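One hedged way to avoid that kind of divergence is to make the per-rank decision collective before any rank acts on it. The sketch below uses Horovod's PyTorch API and assumes hvd.init() has already been called; some_local_metric, threshold, and the voting rule are the hypothetical names from the pseudocode above, not real Lightning symbols:

import torch
import horovod.torch as hvd

def should_write_checkpoint(some_local_metric: float, threshold: float) -> bool:
    # Each rank casts a vote; the averaged allreduce result is identical on every
    # rank, so all ranks take the same branch and collective ops stay in lockstep.
    local_vote = torch.tensor([1.0 if some_local_metric < threshold else 0.0])
    agreement = hvd.allreduce(local_vote, name="ckpt_vote")
    return bool(agreement.item() >= 0.5)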

Contributor


Hey @awaelchli, any update there?

Contributor Author


I understand what @tgaddair explained, but I can't find where in Lightning the problem originates. There is one test that fails, and the only difference between it and the other Horovod tests is that apex is turned on.

Contributor Author


@tgaddair does horovod.broadcast_object not block? It looks like adding a barrier solves the problem. The failing apex test now passes locally, but the Drone CI is still failing.
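For reference, a rough sketch of the broadcast-plus-barrier idea mentioned here, using Horovod's PyTorch API (assumes hvd.init() has already been called; this illustrates the workaround, not the exact change that landed in the PR):

import horovod.torch as hvd

def broadcast_with_barrier(obj, root_rank: int = 0):
    obj = hvd.broadcast_object(obj, root_rank)
    # hvd.join() blocks until every rank has reached this point, acting as a
    # barrier so no rank races ahead before the broadcast completes everywhere.
    hvd.join()
    return obj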

@awaelchli awaelchli changed the title Fix ModelCheckpoint race condition in file existence check Fix ModelCheckpoint race condition in file existence check [skip ci] Dec 29, 2020
@awaelchli awaelchli changed the title Fix ModelCheckpoint race condition in file existence check [skip ci] Fix ModelCheckpoint race condition in file existence check Dec 29, 2020
@Borda
Member

Borda commented Dec 31, 2020

How is this P0 going? 🐰

@awaelchli
Contributor Author

@Borda I have a strange TPU test and an apex test failing, and I don't know how they are related to my changes. I need to do further testing.

@lezwon
Contributor

lezwon commented Jan 8, 2021

@awaelchli The failures are in the single-core TPU tests. They happen because broadcast is being called even when using a single core. Adding these lines to the broadcast method in tpu_accelerator should fix it:

if self.trainer.tpu_id is not None:
    return obj
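For context, a sketch of where that early return might sit inside the TPU accelerator's broadcast method (the multi-core body is elided; this is an illustration of the suggestion, not the exact Lightning implementation):

def broadcast(self, obj, src: int = 0):
    if self.trainer.tpu_id is not None:
        # Single-process (single TPU core) run: there is nothing to broadcast
        # to, so return the object unchanged instead of entering collective ops.
        return obj
    ...  # multi-core path: exchange the object across TPU processes (elided)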

@@ -125,6 +125,7 @@ def test_horovod_multi_gpu(tmpdir):
    _run_horovod(trainer_options, on_gpu=True)


@pytest.mark.skip(reason="Horovod has a problem with broadcast when using apex?")
Contributor


So does it or does it not?

Contributor Author


I wasn't able to investigate yet; the only difference between this test and the one above is apex. It's as if apex is affected by my broadcast operations in the model checkpoint, or the other way around.

@mergify mergify bot removed the has conflicts label Jan 26, 2021
@Borda Borda added the ready PRs ready to be merged label Jan 26, 2021
Member

@SkafteNicki SkafteNicki left a comment


LGTM

@mergify mergify bot removed the has conflicts label Jan 27, 2021
@awaelchli awaelchli enabled auto-merge (squash) January 27, 2021 12:54
@awaelchli awaelchli merged commit 507f3f7 into master Jan 27, 2021
@awaelchli awaelchli deleted the bugfix/ddp-ckpt branch January 27, 2021 15:27
tchaton pushed a commit that referenced this pull request Feb 5, 2021
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Borda added a commit that referenced this pull request Feb 5, 2021
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Labels
bug (Something isn't working) · checkpointing (Related to checkpointing) · distributed (Generic distributed-related topic) · priority: 0 (High priority task) · ready (PRs ready to be merged)
Development

Successfully merging this pull request may close these issues.

Trainer test cannot load from checkpoint when training on multiple GPUs
8 participants