introduce default cluster environment for lightning-specific ddp (#5915)
* handle distributed_sampler_kwargs

* move emptying cache to accelerator

* fix a few tests

* restoring the result from subprocess

* fix queue.get() order for results

* add missing "block_backward_sync" context manager

* add missing "block_backward_sync" context manager

* fix sync_batchnorm

* fix supported gpu-ids for tuple

* fix clip gradients and inf recursion

* accelerator selection: added cluster_environment plugin

* fix torchelastic test

* fix reduce early stopping decision for DDP

* fix tests: callbacks, conversion to lightning optimizer

* fix lightning optimizer does not pickle

* fix setting benchmark and deterministic option

* fix slurm amp test

* fix prepare_data test and determine node_rank

* fix retrieving last path when testing

* remove obsolete plugin argument

* fix test: test_trainer_config

* fix torchscript tests

* fix trainer.model access

* move properties

* fix test_transfer_batch_hook

* fix auto_select_gpus

* fix omegaconf test

* fix test that needs to simulate slurm ddp

* add horovod plugin

* fix test with named arguments

* clean up whitespace

* fix datamodules test

* remove old accelerators

* fix naming

* move old plugins

* move to plugins

* create precision subpackage

* create training_type subpackage

* fix all new import errors

* fix wrong arguments order passed to test

* fix LR finder

* Added sharded training type and amp plugin

* Move clip grad to precision plugin

* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically

* Fix import issue, attempting to fix tests

* Fix initial test

* Reflect hook logic from master, should wrap model after move to device

* Optional state consolidation, since master has optimizers not wrapped

* change attribute for instance test

* reset optimizers

optimizers are not used in the main process, so their state would be wrong.

* legacy

* imports in accel

* legacy2

* trainer imports

* fix import errors after rebase

* move hook to new setup location

* provide unwrapping logic

* fix trainer callback system

* added ddp2 implementation

* fix imports .legacy

* move plugins

* restore legacy

* drop test.py from root

* add tpu accelerator and plugins

* fixes

* fix lightning optimizer merge

* reset bugreportmodel

* unwrapping

* step routing forward

* model access

* unwrap

* opt

* integrate distrib_type

* sync changes

* sync

* fixes

* add forgotten generators

* add missing logic

* update

* import

* missed imports

* import fixes

* isort

* mv f

* changelog

* format

* move helper to parallel plugin

* d

* add world size

* clean up

* duplicate

* activate ddp_sharded and tpu

* set nvidia flags

* remove unused colab var

* use_tpu <-> on_tpu attrs

* make some ddp_cpu and clusterplugin tests pass

* Ref/accelerator connector (#5742)

* final cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* connector cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* trainer cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* accelerator cleanup + missing logic in accelerator connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add missing changes to callbacks

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* reflect accelerator changes to lightning module

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* clean cluster envs

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* cleanup plugins

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add broadcasting

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* yapf

* remove plugin connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* plugins

* manual optimization

* update optimizer routing

* add rank to torchelastic

* fix memory mixed precision

* setstate on trainer for pickling in ddp spawn

* add predict method

* add back commented accelerator code

* adapt test for sync_batch_norm to new plugin

* fix deprecated tests

* fix ddp cpu choice when no num_processes are given

* yapf format

* skip a memory test that cannot pass anymore

* fix pickle error in spawn plugin

* x

* avoid

* x

* fix cyclic import in docs build

* add support for sharded

* update typing

* add sharded and sharded_spawn to distributed types

* make unwrap model default

* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel

* update sharded spawn to reflect changes

* update sharded to reflect changes

* Merge 1.1.5 changes

* fix merge

* fix merge

* yapf isort

* fix merge

* yapf isort

* fix indentation in test

* copy over reinit scheduler implementation from dev1.2

* fix apex tracking calls with dev_debugger

* reduce diff to dev1.2, clean up

* fix trainer config test when gpus > 0 and num_processes > 0 and ddp_cpu

* sort plugin tests legacy/new

* fix error handling for amp on cpu

* fix merge


fix merge


fix merge

* [Feat] Resolve manual_backward (#5837)

* resolve manual_backward

* resolve flake8

* update

* resolve for ddp_spawn

* resolve flake8

* resolve flake8

* resolve flake8

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* fix tests/accelerator tests on cpu

* [BugFix] Resolve manual optimization (#5852)

* resolve manual_optimization

* update

* update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856)

* resolve a bug

* Accelerator refactor sharded rpc (#5854)

* rpc branch

* merge

* update handling of rpc

* make devices etc. Optional in RPC

* set devices etc. later if necessary

* remove devices from sequential

* make devices optional in rpc

* fix import

* uncomment everything

* fix cluster selection

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* resolve bug

* fix assert in rpc test

* resolve a test

* fix docs compilation

* accelerator refactor - fix for sharded parity test (#5866)

* fix memory issue with ddp_spawn

* x


x


x


x


x


x


x


x


x

* x

* Remove DDP2 as this does not apply

* Add missing pre optimizer hook to ensure lambda closure is called

* fix apex docstring

* [accelerator][BugFix] Resolve some test for 1 gpu (#5863)

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* update

* resolve flake8

* update

* update

* update

* update

* update

* all_gather

* update

* make plugins work, add misconfig for RPC

* update

* update

* remove breaking test

* resolve some tests

* resolve flake8

* revert to ddp_spawn

Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>

* yapf isort

* resolve flake8

* fix apex doctests

* fix apex doctests 2

* resolve docs

* update drone

* clean env

* update

* update

* update

* update

* merge

* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881)

* Fix RPC related tests, clean out old API, update for new accelerator API

* Move tests out of legacy folder, update paths and names

* Update test_remove_1-4.py

* Expose properties for tpu cores/gpus/num_gpus

* Add root GPU property

* Move properties to properties.py

* move tests that were previously in drone

* Fix root GPU property (#5908)

* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator

* Add missing tests back

* fix best model path transfer when no checkpoint callback available

* Fix setup hook order [wip] (#5858)

* Call trainer setup hook before accelerator setup

* Add test case

* add new test

* typo

* fix callback order in test

Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* rename ddp sequential -> rpc sequential for special test

* revert

* fix stupid merge problem

* abstract the cluster plugins

* default plugin

* integrate default environment

* fix property

* adapt tests

* adjust test

* fix world size access

* base cluster env

* revert rebase errors

* revert rebase errors

* missing import

* revert unrelated change

* remove unused cluster local rank

* remove unrelated changes

* fix unrelated changes

* fix pep8

* remove unused var

* reset permissions

* yapf

* test default environment

* test torchelastic environment

* world size as int

* tests for slurm environment

* changelog

* test comments

* remove unintended change

* keep master port fixed after it is generated

* test random master port

* yapf

* add missing default environment

* move helper function

* rename default environment

* rename

* rename

* yapf

* Update pytorch_lightning/plugins/environments/lightning_environment.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* spawn -> create

Co-authored-by: justusschock <justus.schock@posteo.de>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
11 people authored Mar 5, 2021
1 parent 248a8e8 commit ec8d46e
Showing 19 changed files with 287 additions and 91 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -15,6 +15,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](https://github.com/PyTorchLightning/pytorch-lightning/pull/6072))


- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](https://github.com/PyTorchLightning/pytorch-lightning/pull/5915))


- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](https://github.com/PyTorchLightning/pytorch-lightning/pull/6274))


1 change: 1 addition & 0 deletions pytorch_lightning/plugins/environments/__init__.py
@@ -12,5 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment # noqa: F401
from pytorch_lightning.plugins.environments.lightning_environment import LightningEnvironment # noqa: F401
from pytorch_lightning.plugins.environments.slurm_environment import SLURMEnvironment # noqa: F401
from pytorch_lightning.plugins.environments.torchelastic_environment import TorchElasticEnvironment # noqa: F401
31 changes: 20 additions & 11 deletions pytorch_lightning/plugins/environments/cluster_environment.py
@@ -11,24 +11,33 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from typing import Optional


class ClusterEnvironment:
class ClusterEnvironment(ABC):
""" Specification of a cluster environment. """

def __init__(self):
self._world_size = None
@abstractmethod
def creates_children(self) -> bool:
""" Whether the environment creates the subprocesses or not. """

def master_address(self):
pass
@abstractmethod
def master_address(self) -> str:
""" The master address through which all processes connect and communicate. """

def master_port(self):
pass
@abstractmethod
def master_port(self) -> int:
""" An open and configured port in the master node through which all processes communicate. """

def world_size(self) -> int:
return self._world_size
@abstractmethod
def world_size(self) -> Optional[int]:
""" The number of processes across all devices and nodes. """

@abstractmethod
def local_rank(self) -> int:
pass
""" The rank (index) of the currently running process inside of the current node. """

@abstractmethod
def node_rank(self) -> int:
pass
""" The rank (index) of the node on which the current process runs. """
71 changes: 71 additions & 0 deletions pytorch_lightning/plugins/environments/lightning_environment.py
@@ -0,0 +1,71 @@
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import socket
from typing import Optional

from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment


class LightningEnvironment(ClusterEnvironment):
"""
The default environment used by Lightning for a single node or free cluster (not managed).
The master process must be launched by the user and Lightning will spawn new
worker processes for distributed training, either in a single node or across multiple nodes.
If the master address and port are not provided, the default environment will choose them
automatically. It is recommended to use this default environment for single-node distributed
training as it provides the most convenient way to launch the training script.
"""

def __init__(self):
super().__init__()
self._master_port = None

def creates_children(self) -> bool:
return False

def master_address(self) -> str:
return os.environ.get("MASTER_ADDR", "127.0.0.1")

def master_port(self) -> int:
if self._master_port is None:
self._master_port = os.environ.get("MASTER_PORT", find_free_network_port())
return int(self._master_port)

def world_size(self) -> Optional[int]:
return None

def local_rank(self) -> int:
return int(os.environ.get("LOCAL_RANK", 0))

def node_rank(self) -> int:
group_rank = os.environ.get("GROUP_RANK", 0)
return int(os.environ.get("NODE_RANK", group_rank))


def find_free_network_port() -> int:
"""
Finds a free port on localhost.
It is useful in single-node training when we don't want to connect to a real master node but
have to set the `MASTER_PORT` environment variable.
"""
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("", 0))
s.listen(1)
port = s.getsockname()[1]
s.close()
return port
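
As a usage sketch (hedged; not part of this diff): the new environment is what the accelerator connector now picks when neither SLURM nor torchelastic variables are present, but it can also be passed explicitly through the Trainer's plugins argument. The exact Trainer arguments below reflect the 1.2-era API and are assumptions.

# Sketch: selecting the environment explicitly; normally it is chosen automatically.
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import LightningEnvironment

trainer = Trainer(
    accelerator="ddp",
    gpus=2,
    plugins=[LightningEnvironment()],  # otherwise selected by the accelerator connector
)
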
17 changes: 10 additions & 7 deletions pytorch_lightning/plugins/environments/slurm_environment.py
@@ -26,7 +26,10 @@ class SLURMEnvironment(ClusterEnvironment):
def __init__(self):
super().__init__()

def master_address(self):
def creates_children(self) -> bool:
return True

def master_address(self) -> str:
# figure out the root node addr
slurm_nodelist = os.environ.get("SLURM_NODELIST")
if slurm_nodelist:
@@ -39,7 +42,7 @@ def master_address(self):
log.debug(f"MASTER_ADDR: {os.environ['MASTER_ADDR']}")
return root_node

def master_port(self):
def master_port(self) -> int:
# -----------------------
# SLURM JOB = PORT number
# -----------------------
@@ -64,18 +67,18 @@ def master_port(self):

log.debug(f"MASTER_PORT: {os.environ['MASTER_PORT']}")

return default_port
return int(default_port)

def world_size(self):
return self._world_size
return None

def local_rank(self):
def local_rank(self) -> int:
return int(os.environ['SLURM_LOCALID'])

def node_rank(self):
def node_rank(self) -> int:
return int(os.environ['SLURM_NODEID'])

def resolve_root_node_address(self, root_node):
def resolve_root_node_address(self, root_node: str) -> str:
if '[' in root_node:
name, numbers = root_node.split('[', maxsplit=1)
number = numbers.split(',', maxsplit=1)[0]
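
A small test-style sketch, in the spirit of the "tests for slurm environment" commit above, showing the scheduler variables this plugin reads; the concrete values are illustrative assumptions.

# Sketch: exercising SLURMEnvironment against fake scheduler variables.
import os
from unittest import mock

from pytorch_lightning.plugins.environments import SLURMEnvironment

with mock.patch.dict(os.environ, {"SLURM_LOCALID": "1", "SLURM_NODEID": "0"}):
    env = SLURMEnvironment()
    assert env.creates_children()     # SLURM launches the processes itself
    assert env.local_rank() == 1      # from SLURM_LOCALID
    assert env.node_rank() == 0       # from SLURM_NODEID
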
17 changes: 11 additions & 6 deletions pytorch_lightning/plugins/environments/torchelastic_environment.py
@@ -14,6 +14,7 @@

import logging
import os
from typing import Optional

from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment
from pytorch_lightning.utilities import rank_zero_warn
@@ -26,27 +27,31 @@ class TorchElasticEnvironment(ClusterEnvironment):
def __init__(self):
super().__init__()

def master_address(self):
def creates_children(self) -> bool:
return True

def master_address(self) -> str:
if "MASTER_ADDR" not in os.environ:
rank_zero_warn("MASTER_ADDR environment variable is not defined. Set as localhost")
os.environ["MASTER_ADDR"] = "127.0.0.1"
log.debug(f"MASTER_ADDR: {os.environ['MASTER_ADDR']}")
master_address = os.environ.get('MASTER_ADDR')
return master_address

def master_port(self):
def master_port(self) -> int:
if "MASTER_PORT" not in os.environ:
rank_zero_warn("MASTER_PORT environment variable is not defined. Set as 12910")
os.environ["MASTER_PORT"] = "12910"
log.debug(f"MASTER_PORT: {os.environ['MASTER_PORT']}")

port = os.environ.get('MASTER_PORT')
port = int(os.environ.get('MASTER_PORT'))
return port

def world_size(self):
return os.environ.get('WORLD_SIZE')
def world_size(self) -> Optional[int]:
world_size = os.environ.get('WORLD_SIZE')
return int(world_size) if world_size is not None else world_size

def local_rank(self):
def local_rank(self) -> int:
return int(os.environ['LOCAL_RANK'])

def node_rank(self) -> int:
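
Similarly, a hedged sketch of TorchElasticEnvironment reading the variables that torchelastic exports; the values are illustrative.

# Sketch: TorchElasticEnvironment with the variables torchelastic would set.
import os
from unittest import mock

from pytorch_lightning.plugins.environments import TorchElasticEnvironment

env_vars = {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "12345", "WORLD_SIZE": "4", "LOCAL_RANK": "2"}
with mock.patch.dict(os.environ, env_vars):
    env = TorchElasticEnvironment()
    assert env.creates_children()
    assert env.master_address() == "10.0.0.1"
    assert env.master_port() == 12345
    assert env.world_size() == 4
    assert env.local_rank() == 2
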
23 changes: 6 additions & 17 deletions pytorch_lightning/plugins/training_type/ddp.py
@@ -30,20 +30,14 @@
from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment
from pytorch_lightning.plugins.training_type.parallel import ParallelPlugin
from pytorch_lightning.utilities import _HYDRA_AVAILABLE, _TORCH_GREATER_EQUAL_1_7, rank_zero_warn
from pytorch_lightning.utilities.distributed import (
find_free_network_port,
rank_zero_only,
ReduceOp,
sync_ddp_if_available,
)
from pytorch_lightning.utilities.distributed import rank_zero_only, ReduceOp, sync_ddp_if_available
from pytorch_lightning.utilities.exceptions import MisconfigurationException
from pytorch_lightning.utilities.seed import seed_everything

if _HYDRA_AVAILABLE:
from hydra.core.hydra_config import HydraConfig
from hydra.utils import get_original_cwd, to_absolute_path


log = logging.getLogger(__name__)


@@ -90,8 +84,7 @@ def setup(self, model):
self._model = model

# start the other scripts
# TODO: refactor and let generic cluster env hold the information about who spawns the processes
if os.environ.get("PL_IN_DDP_SUBPROCESS", "0") != "1":
if not self.cluster_environment.creates_children() and os.environ.get("PL_IN_DDP_SUBPROCESS", "0") != "1":
self._call_children_scripts()

# set the task idx
@@ -105,15 +98,12 @@ def _call_children_scripts(self):
self._has_spawned_children = True

# DDP Environment variables
os.environ["MASTER_ADDR"] = os.environ.get("MASTER_ADDR", "127.0.0.1")
os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", str(find_free_network_port()))
os.environ["MASTER_ADDR"] = self.cluster_environment.master_address()
os.environ["MASTER_PORT"] = str(self.cluster_environment.master_port())

# allow the user to pass the node rank
node_rank = "0"
node_rank = os.environ.get("NODE_RANK", node_rank)
node_rank = os.environ.get("GROUP_RANK", node_rank)
os.environ["NODE_RANK"] = node_rank
os.environ["LOCAL_RANK"] = "0"
os.environ["NODE_RANK"] = str(self.cluster_environment.node_rank())
os.environ["LOCAL_RANK"] = str(self.cluster_environment.local_rank())

# when user is using hydra find the absolute path
path_lib = os.path.abspath if not _HYDRA_AVAILABLE else to_absolute_path
@@ -209,7 +199,6 @@ def determine_ddp_device_ids(self):
return [self.root_device.index]

def init_ddp_connection(self, global_rank: int, world_size: int) -> None:
# TODO: From where to get cluster environment?
os.environ["MASTER_ADDR"] = str(self.cluster_environment.master_address())
os.environ["MASTER_PORT"] = str(self.cluster_environment.master_port())
os.environ["WORLD_SIZE"] = str(self.cluster_environment.world_size())
12 changes: 3 additions & 9 deletions pytorch_lightning/plugins/training_type/ddp_spawn.py
@@ -30,13 +30,7 @@
from pytorch_lightning.utilities import _TORCH_GREATER_EQUAL_1_7
from pytorch_lightning.utilities.cloud_io import atomic_save
from pytorch_lightning.utilities.cloud_io import load as pl_load
from pytorch_lightning.utilities.distributed import (
find_free_network_port,
rank_zero_only,
rank_zero_warn,
ReduceOp,
sync_ddp_if_available,
)
from pytorch_lightning.utilities.distributed import rank_zero_only, rank_zero_warn, ReduceOp, sync_ddp_if_available
from pytorch_lightning.utilities.seed import seed_everything

log = logging.getLogger(__name__)
@@ -84,7 +78,7 @@ def distributed_sampler_kwargs(self):
def setup(self, model):
self._model = model

os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", str(find_free_network_port()))
os.environ["MASTER_PORT"] = str(self.cluster_environment.master_port())

# pass in a state q
smp = mp.get_context("spawn")
@@ -93,7 +87,7 @@ def setup(self, model):
def set_world_ranks(self, process_idx):
self.local_rank = process_idx
self.node_rank = self.cluster_environment.node_rank()
self.task_idx = self.cluster_local_rank
self.task_idx = self.cluster_environment.local_rank()
self.global_rank = self.node_rank * self.num_processes + self.local_rank
self.world_size = self.num_nodes * self.num_processes

Expand Down
7 changes: 0 additions & 7 deletions pytorch_lightning/plugins/training_type/parallel.py
@@ -40,13 +40,6 @@ def __init__(
self.local_rank = 0
self.cluster_environment = cluster_environment

@property
def cluster_local_rank(self):
try:
return self.cluster_environment.local_rank()
except KeyError:
return 0

@property
@abstractmethod
def root_device(self):
16 changes: 7 additions & 9 deletions pytorch_lightning/trainer/connectors/accelerator_connector.py
@@ -42,7 +42,12 @@
TPUSpawnPlugin,
TrainingTypePlugin,
)
from pytorch_lightning.plugins.environments import ClusterEnvironment, SLURMEnvironment, TorchElasticEnvironment
from pytorch_lightning.plugins.environments import (
ClusterEnvironment,
LightningEnvironment,
SLURMEnvironment,
TorchElasticEnvironment,
)
from pytorch_lightning.tuner.auto_gpu_select import pick_multiple_gpus
from pytorch_lightning.utilities import (
_APEX_AVAILABLE,
@@ -451,17 +456,10 @@ def select_cluster_environment(self) -> ClusterEnvironment:
return self._cluster_environment
if self.is_slurm_managing_tasks:
env = SLURMEnvironment()
# TODO: decouple DDP from SLURM
# refactor and let generic cluster env hold the information about who spawns the processes
os.environ["PL_IN_DDP_SUBPROCESS"] = "1"
elif self.is_using_torchelastic:
env = TorchElasticEnvironment()
# TODO: decouple DDP from TE
# refactor and let generic cluster env hold the information about who spawns the processes
os.environ["PL_IN_DDP_SUBPROCESS"] = "1"
else:
# TODO: maybe introduce a DefaultEnvironment?
env = TorchElasticEnvironment()
env = LightningEnvironment()
return env

def set_distributed_mode(self, distributed_backend: Optional[str] = None):
Expand Down
15 changes: 0 additions & 15 deletions pytorch_lightning/utilities/distributed.py
@@ -64,21 +64,6 @@ def _debug(*args, **kwargs):
rank_zero_warn = rank_zero_only(_warn)


def find_free_network_port() -> int:
"""
Finds a free port on localhost.
It is useful in single-node training when we don't want to connect to a real master node but
have to set the `MASTER_PORT` environment variable.
"""
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("", 0))
s.listen(1)
port = s.getsockname()[1]
s.close()
return port


def gather_all_tensors(result: Union[torch.Tensor], group: Optional[Any] = None):
"""
Function to gather all tensors from several ddp processes onto a list that
Expand Down
1 change: 0 additions & 1 deletion setup.cfg
@@ -48,7 +48,6 @@ omit =
pytorch_lightning/accelerators/ddp2_*.py
pytorch_lightning/accelerators/dp_*.py
pytorch_lightning/accelerators/tpu_*.py
pytorch_lightning/cluster_environments/*.py
pytorch_lightning/utilities/xla_device_utils.py
pytorch_lightning/utilities/distributed.py
pytorch_lightning/tuner/auto_gpu_select.py