
[dask] make random port search more resilient to random collisions (fixes #4057) #4133

Merged
merged 7 commits on Mar 31, 2021

Conversation

jameslamb
Collaborator

Changes in this PR

Background

#4057 documents a class of error that can cause distributed training with lightgbm.dask to fail. If you do not provide machines or local_listen_port at training time, LightGBM will randomly search for ports to use on each worker.

To limit the overhead introduced by this random search, a function _find_random_open_port() is run once, at the same time, on every worker, using distributed.Client.run() (added in #3823). If you have multiple Dask worker processes on the same physical host (i.e. if you're using distributed.LocalCluster or nprocs > 1), there is a small but non-zero probability that this setup will choose the same random port for multiple workers on that host. When that happens, LightGBM training will fail with an error like

Exception: LightGBMError('Socket send error, code: 104',)
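
For reference, the random search on each worker amounts to asking the operating system for a currently-free TCP port by binding a socket to port 0. The sketch below is illustrative only (the function name and details are assumptions, not the exact lightgbm.dask code), but it shows why collisions are possible: the socket is closed before training starts, so nothing reserves the port afterwards.

  import socket

  def find_random_open_port() -> int:
      """Illustrative sketch: ask the OS for a free TCP port by binding to port 0."""
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
          s.bind(("", 0))
          port = s.getsockname()[1]
      # the socket is closed here, so the OS is free to hand this port to another caller
      return port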

Based on #4112 (comment), I think my original assumptions about how likely such conflicts were (#4057 (comment)) were not correct.

How this improves lightgbm

Removes a source of instability that could cause distributed training with Dask to fail.

Improves the stability of LightGBM's tests, which should reduce maintainer effort needed in re-running failed builds.

@jameslamb
Collaborator Author

Linking #4132 (comment), the finding that inspired this pull request.

@@ -371,6 +383,18 @@ def _train(
            _find_random_open_port,
            workers=list(worker_addresses)
        )
        # handle the case where _find_random_open_port() produces duplicates
        retries_left = 10
        while _worker_map_has_duplicates(worker_address_to_port) and retries_left > 0:
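
For context, the duplicate check referenced here could look roughly like the following. This is a hypothetical sketch, not the code from this PR: it treats two workers on the same host that were assigned the same port as a conflict.

  from collections import defaultdict
  from typing import Dict
  from urllib.parse import urlparse

  def worker_map_has_duplicates(worker_map: Dict[str, int]) -> bool:
      """Hypothetical sketch: True if two workers on the same host share a port."""
      ports_seen_per_host = defaultdict(set)
      for worker_address, port in worker_map.items():
          # e.g. "tcp://127.0.0.1:36725" -> "127.0.0.1"
          host = urlparse(worker_address).hostname
          if port in ports_seen_per_host[host]:
              return True
          ports_seen_per_host[host].add(port)
      return False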
Collaborator

I'm afraid that multiple re-runs of the same function still have a high probability of producing the same values, especially on the same physical machine.

If a is omitted or None, the current system time is used. If randomness sources are provided by the operating system, they are used instead of the system time (see the os.urandom() function for details on availability).
https://docs.python.org/3/library/random.html#random.seed

I believe a more reliable way to handle the case of duplicate ports would be to resolve it manually, by simply incrementing the conflicting ports until the conflicts are resolved.
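
A rough sketch of that increment-on-conflict idea, for illustration only; resolve_port_conflicts and its details are hypothetical, not part of lightgbm.dask:

  from typing import Dict

  def resolve_port_conflicts(worker_map: Dict[str, int]) -> Dict[str, int]:
      """Hypothetical sketch: bump duplicated ports upward until every
      (host, port) pair in the worker map is unique."""
      claimed = set()   # (host, port) pairs already assigned
      resolved = {}
      for worker_address, port in worker_map.items():
          host = worker_address.split("://")[-1].rsplit(":", 1)[0]
          while (host, port) in claimed:
              port += 1  # increment on conflict
          claimed.add((host, port))
          resolved[worker_address] = port
      # a real implementation would also need to check that each new port is actually free
      return resolved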

Collaborator Author

ok, I'll change this to something like that. I was worried about making this a lot more complicated, but maybe it's unavoidable.

Collaborator

I'm sorry for this, I'll try to come up with a solution as well. FWIW the "randomness" isn't related to python's random.seed, since we're asking the OS for an open port and it decides which one to give us. I believe the collisions happen when a worker has completed the function and freed the port, another one then asks for a port, and the OS just returns the same one (kinda troll). I'll see if we can put the port in wait for a bit or something like that.

Collaborator Author

alright, I tried a different approach in 05303c8. I think this will be more reliable. Instead of re-running _find_random_open_port() all at once for all workers again, the process will be like:

  1. run _find_random_open_port() for every worker
  2. if any duplicate IP-port pairs were found, run _find_random_open_port() again, only for those workers that need to be changed to eliminate duplicates

I think that pattern (only re-running for the workers that need to be changed to resolve duplicates) should give us confidence that duplicates will be resolved, because the system time will change each time that function is run.
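
A rough sketch of that retry pattern, assuming helpers along the lines of the ones in this PR (illustrative only, not the exact implementation of _possibly_fix_worker_map_duplicates()):

  from collections import Counter
  from typing import Dict

  # private port-search helper from lightgbm.dask, as shown in the diff above
  from lightgbm.dask import _find_random_open_port

  def fix_worker_map_duplicates(client, worker_map: Dict[str, int], max_retries: int = 10) -> Dict[str, int]:
      """Illustrative sketch: keep one worker per duplicated port and re-run
      the random port search only for the remaining conflicting workers."""
      for _ in range(max_retries):
          port_counts = Counter(worker_map.values())
          seen_ports = set()
          workers_to_retry = []
          for worker_address, port in worker_map.items():
              # for simplicity, treat any repeated port as a conflict,
              # regardless of which host the workers live on
              if port_counts[port] > 1 and port in seen_ports:
                  workers_to_retry.append(worker_address)
              seen_ports.add(port)
          if not workers_to_retry:
              break
          # ask only the conflicting workers for a new random open port
          new_ports = client.run(_find_random_open_port, workers=workers_to_retry)
          worker_map.update(new_ports)
      return worker_map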

I think this approach, while it still relies on _find_random_open_port(), is preferable to just incrementing and checking whether the new port is open, because it should (on average) find a new open port more quickly than the incrementing approach. Consider the case where, for example, LightGBM tries to put multiple workers in a LocalCluster on port 8887 (1 less than the default port for Jupyter). Jupyter uses such an approach of "increment by one until you find an open port", so if someone has multiple Jupyter sessions running it's possible that they might have ports 8888, 8889, 8890, and 8891 all occupied (for example), which would mean LightGBM would need 5 attempts to find a new open port (if 8892 is open).

I think the existence of systems like this is why Dask also searches randomly (instead of incrementing) if the default port it prefers for its scheduler is occupied when you run dask-scheduler. You can see that in the logs for LightGBM's Dask tests that use distributed.LocalCluster, for example from this build:

  /opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 37211 instead
    warnings.warn(

  /opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 35045 instead
    warnings.warn(

  /opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 35051 instead
    warnings.warn(

  /opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 38031 instead

  /opt/conda/envs/test-env/lib/python3.9/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 41941 instead

Collaborator Author

@jmoralez I don't think you need to do any new work. Your review on the change I just pushed is welcome if you see anything wrong with it, but I'm fairly confident it will get the job done.

    assert retry_msg not in capsys.readouterr().out

    # should handle worker maps with duplicates
    map_without_duplicates = {
Collaborator

map_with_duplicates

Collaborator Author

fixed in 75cfbbb

    }
    patched_map = lgb.dask._possibly_fix_worker_map_duplicates(
        client=client,
        worker_map=map_without_duplicates
Collaborator

map_with_duplicates

Collaborator Author

fixed in 75cfbbb

Collaborator

@StrikerRUS left a comment

Thank you!

@StrikerRUS
Collaborator

@jmoralez Are you OK to merge this?

@jmoralez
Collaborator

@StrikerRUS Yes

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023