-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dask] make random port search more resilient to random collisions (fixes #4057) #4133
Merged
Merged
Changes from 2 commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
4782338
[dask] make random port search more resilient to random collisions
jameslamb b1cc4eb
linting
jameslamb 671619c
Merge branch 'master' into fix/dask-duplicate-ports
jameslamb 05303c8
more reliable ports check
jameslamb 75cfbbb
address review comments
jameslamb 4160f1c
Merge branch 'master' into fix/dask-duplicate-ports
jameslamb bd9a51a
add error message
jameslamb File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid that multiple re-runs of the same function still have high probability to produce same values, especially on the same physical machine.
I believe that more reliable way to handle the case of same ports will be to resolve it manually by simply incrementing ports while conflicts are not resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll change this to something like that. I was worried about making this a lot more complicated, but maybe it's unavoidable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for this, I'll try to come up with a solution as well. FWIW the "randomness" isn't related to python's
random.seed
, since we're asking the OS for an open port and it decides which one to give us. I believe the collisions happen when a worker has completed the function and freed the port and another one is just asking for it and the OS just returns the same one (kinda troll). I'll see if we can put the port in wait for a bit or something like that.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright, I tried a different approach in 05303c8. I think this will be more reliable. Instead of re-running
_find_random_port()
all at once for all workers again, the process will be like:_find_random_port()
for every worker_find_random_port()
again only for those workers that need to be changed to eliminate duplicatesI think that pattern (only re-running for the workers that need to be changed to resolve duplicates) should give us confidence that duplicates will be resolved, because the system time will change each time that function is run.
I think this approach, will still relies on
_find_random_port()
, is preferable to just incrementing and checking if the new port is open, because it should (on average) find a new open port more quickly than the incrementing approach. Consider the case where, for example, LightGBM tries to put multiple workers in aLocalCluster
on port8887
(1 less than the default port for Jupyter). Jupyter uses such an approach of "increment by one until you find an open port", so if someone has multiple Jupyter sessions running it's possible that they might have ports 8888, 8889, 8890, and 8891 all occupied (for example), which would mean LightGBM would need 5 attempts to find a new open port (if 8892 is open).I think the existence of systems like this is why Dask also searches randomly if the default port it prefers for its scheduler is occupied when you run
dask-scheduler
(instead of incrementing). You can see that in the logs for LightGBM's Dask tests that usedsitributed.LocalCluster
, for example from this build:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmoralez I don't think you need to do any new work. Your review on the change I just pushed is welcome if you see anything wrong with it, but I'm fairly confident it will get the job done.