Add check for uninitialized _sync_dir in DDP Plugin to avoid errors during error handling (#9267)
four4fish authored and awaelchli committed Sep 7, 2021
1 parent bb498b3 commit dca4a41
Showing 2 changed files with 10 additions and 3 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -26,6 +26,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed bug where data-loading functions were not getting the correct running stage passed ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))


- Fixed error handling in DDP process reconciliation when `_sync_dir` was not initialized ([#9267](https://github.com/PyTorchLightning/pytorch-lightning/pull/9267))


## [1.4.5] - 2021-08-31

- Fixed reduction using `self.log(sync_dict=True, reduce_fx={mean,max})` ([#9142](https://github.com/PyTorchLightning/pytorch-lightning/pull/9142))
10 changes: 7 additions & 3 deletions pytorch_lightning/plugins/training_type/ddp.py
@@ -330,6 +330,9 @@ def init_ddp_connection(self, global_rank: Optional[int] = None, world_size: Opt
             )

     def pre_dispatch(self):
+        # share ddp pids to all processes
+        self._share_information_to_prevent_deadlock()
+
         # move the model to the correct device
         self.model_to_device()

@@ -338,9 +341,6 @@ def pre_dispatch(self):

         self.configure_ddp()

-        # share ddp pids to all processes
-        self._share_information_to_prevent_deadlock()
-
     def post_dispatch(self) -> None:
         self.cluster_environment.teardown()

@@ -436,6 +436,10 @@ def reconciliate_processes(self, trace: str):

         sync_dir = self._sync_dir

+        if not sync_dir:
+            rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
+            return
+
         # The cluster may be configured to periodically purge the `/tmp`
         # directory, in which case `sync_dir` may not exist anymore at this
         # point. Idempotently create it to ensure its existence.
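
The guard added to `reconciliate_processes` can be illustrated in isolation. The following is a minimal sketch, not the Lightning implementation: `SketchPlugin`, `share_information`, `reconcile`, and the `trace.txt` file name are hypothetical stand-ins, and the standard library's `warnings.warn` is used in place of `rank_zero_warn`. It assumes only what the diff shows, namely that `_sync_dir` may still be unset when error handling runs.

```python
import tempfile
import warnings
from pathlib import Path
from typing import Optional


class SketchPlugin:
    """Hypothetical stand-in for the DDP plugin; not the Lightning class."""

    def __init__(self) -> None:
        # Stays None until the information-sharing step has run.
        self._sync_dir: Optional[str] = None

    def share_information(self) -> None:
        # Stand-in for `_share_information_to_prevent_deadlock`; here it only
        # creates a local temporary directory.
        self._sync_dir = tempfile.mkdtemp()

    def reconcile(self, trace: str) -> bool:
        """Return True if the check ran, False if it was skipped."""
        sync_dir = self._sync_dir
        if not sync_dir:
            # Error handling was triggered before initialization:
            # warn and skip instead of failing on a missing directory.
            warnings.warn("Deadlock detection is uninitialized. Skipping check.")
            return False
        # Idempotently recreate the directory in case it was purged.
        Path(sync_dir).mkdir(parents=True, exist_ok=True)
        (Path(sync_dir) / "trace.txt").write_text(trace)
        return True
```

Calling `reconcile` before `share_information` emits a warning and returns `False`. Calling it afterwards writes the trace and returns `True`, which is also why the commit moves the sharing step to the start of `pre_dispatch`.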