[dask] [python-package] DaskLGBMRegressor training error: 'binding port 12402 failed' #3753
Comments
@ffineis I've assigned this to myself because I'm actively working on it right now. If you have ideas of stuff to try, leave a comment here. @StrikerRUS I know you're somewhat of a LightGBM historian 😂. If you remember similar issues with other LightGBM parallel training, please link them here.
🤣
Unfortunately I don't... As a side note, I'm not sure, but I guess the commit hash referenced in the description might not be quite the right one.
Ha, oh yeah - guess I got lucky with the commit I chose. Fixed.
This morning, I tested the theory that LightGBM just isn't cleaning up the network correctly. I tried to test this by changing local_listen_port between runs. I added this to the beginning of each test iteration

local_listen_port += n_workers

and then ran 10 times in a row again.

n_workers = 3
local_listen_port = 12400

...

client.restart()
local_listen_port += n_workers
print(f"local_listen_port: {local_listen_port}")

# random training data, split into 10 partitions
num_rows = 1e6
num_features = 1e2
num_partitions = 10
rows_per_chunk = num_rows / num_partitions
data = da.random.random((num_rows, num_features), (rows_per_chunk, num_features))
labels = da.random.random((num_rows, 1), (rows_per_chunk, 1))

# materialize the data on the workers before training starts
data = data.persist()
labels = labels.persist()
_ = wait(data)
_ = wait(labels)

dask_reg = DaskLGBMRegressor(
    silent=False,
    max_depth=5,
    random_state=708,
    objective="regression_l2",
    learning_rate=0.1,
    tree_learner="data",
    n_estimators=10,
    min_child_samples=1,
    n_jobs=-1,
    local_listen_port=local_listen_port
)
dask_reg.fit(client=client, X=data, y=labels)
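(For clarity, here is roughly how the repeated runs were wired together. This is just a sketch, not the exact script: the LocalCluster setup, the explicit loop, and the import paths are assumptions.)

import dask.array as da
from dask.distributed import Client, LocalCluster, wait
from lightgbm.dask import DaskLGBMRegressor  # import path assumed for this commit

cluster = LocalCluster(n_workers=3)
client = Client(cluster)

n_workers = 3
local_listen_port = 12400

for attempt in range(10):
    client.restart()
    # move to a fresh block of ports on every attempt
    local_listen_port += n_workers
    print(f"attempt {attempt}, local_listen_port: {local_listen_port}")
    # ... create `data` / `labels`, build DaskLGBMRegressor, and call
    # dask_reg.fit(client=client, X=data, y=labels) exactly as in the snippet above ...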
NOTE: I'm going to keep posting debugging findings here, but please feel free to unsubscribe from this issue. I'll open a PR (or comment here) once I have something concrete.

Summary

I tried the MPI version of LightGBM and found that this issue doesn't occur, but possibly only because only one worker actually ends up participating in distributed training.

How I tested and what I found

I tried building the MPI version of LightGBM (https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html?highlight=mpi#build-mpi-version). I added this to the Dockerfile

RUN apt install -y openmpi-bin libopenmpi-dev

and changed the install to

pip install dist/lightgbm*.tar.gz --install-option=--mpi

I found that with the MPI version of LightGBM, this issue does not happen! 10 straight successful runs.

It's confusing, though. I see information in the logs that says that only one worker is participating, but I can see in the Dask diagnostic dashboard that all three workers in my cluster are participating in training. I'm worried that maybe, for the MPI version, the three worker processes are training three independent models. When training succeeds in the socket-based version of LightGBM, the logs look different.
@jameslamb I think this is all expected; I don't see anything wrong with LightGBM itself - ports could be in use for a variety of reasons. The problem is with the dask-lightgbm code, since it doesn't try different ports. In mmlspark, we have special retry logic that tries a specific port and, if that doesn't work, tries the next one.
@jameslamb this is written in Scala but it might be useful: note how we try different ports, starting from a user-specified default listen port, up to the number of tasks on the machine - we don't assume that some port range will just always work. Once we have the ports we are interested in, we send them back to the driver, which aggregates them and then sends them back to all of the workers, which then call network init and start the parallel training.
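(In rough Python terms, the idea looks something like this - just a sketch, not the actual mmlspark code, and find_ports is a made-up helper name:)

import socket

def find_ports(base_port, n_needed):
    # Try ports starting at base_port; keep the first n_needed that we can
    # actually bind, instead of assuming a fixed range is always free.
    found = []
    port = base_port
    while len(found) < n_needed:
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("", port))  # raises OSError if the port is taken
            found.append(port)
        except OSError:
            pass  # already in use, move on to the next port
        port += 1
    return found

# Each worker finds its own ports, reports them back to the driver, the driver
# aggregates them into a machine list, and every worker then calls network init
# with that list before parallel training starts.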
Nice! Then it just sounds like the Dask module here needs similar retry logic for finding ports it can actually bind.
If/when the fix for finding available ports gets developed, it'd be worth keeping in mind a related issue I've been coming across: when worker memory reaches a critical mass, Dask restarts the worker. Perhaps that could be tackled with the port-trying logic in the same go? When Dask restarts a worker, the new worker has a different port number, so the call to set up the LightGBM network fails. This issue was called out in dask-lightgbm just prior to #3515. Just mentioning this because it seems related to the issue at hand here.
Thanks for noting that! And to @imatiach-msft for the ideas.

I think that one issue I'm seeing is that the ports LightGBM binds aren't released right away when training finishes. I tried running training with my example above, and in a shell I ran the following:

netstat -an | grep 124

During training on a LocalCluster, that command shows the training ports bound by the worker processes. Once training ends, that same command still returns entries for those ports. I ran it every few seconds and it kept returning that result for about 2 minutes, then returned nothing. Once I saw that the command didn't return anything, re-running training succeeded.

So I think there are two issues:

1. The sockets used for training linger for a couple of minutes after training ends, so immediately re-running training on the same ports fails to bind.
2. The current code assumes a fixed range of ports starting at local_listen_port is available, instead of checking which ports are actually free.

I think the "look for available ports first" solution is going to fix this. Hopefully I'll have a PR up shortly 😀

More details

I'm not that familiar with the low-level details of TCP, but this blog post seems relevant to what's happening: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html.
@jameslamb I agree with your findings - it would be great to find a cleaner way to close the ports, if possible. However, the current implementation is wrong anyway, because it assumes all ports in those ranges are available. In practice, any application could be running and using some of those ports already, and the only way to find out whether they are already in use is to actually try to bind to them. So even if the ports are closed in a cleaner way, it would still be better to find open ports first instead of assuming a fixed range is free.
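(For example, one way to get a currently-free port is to let the OS pick one by binding to port 0 - just a sketch, and note there is still a small race window between closing this probe socket and LightGBM re-binding the port:)

import socket

def get_free_port():
    # Bind to port 0 and the OS hands back some port that is free right now.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

print(get_free_port())  # e.g. 40123 - whatever happens to be free at the moment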
Totally agree, thanks for pointing me in the right direction!! I almost have something working right now, will open a PR soon.
This issue has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
How are you using LightGBM?
LightGBM component: Python package
Environment info
Operating System: Ubuntu 18.04
C++ compiler version: 9.3.0
CMake version: 3.16.3
Python version: 3.8.5
LightGBM version or commit hash: master (78d31d9)

Error message and / or logs
Training with DaskLGBMRegressor often fails with an error like the one in the issue title ("binding port 12402 failed"). The error doesn't ALWAYS happen, and training sometimes succeeds. It also doesn't always reference port 12402. I've found that the first time I call DaskLGBMRegressor.fit(), I don't see this error. After that, subsequent tries often result in the error.

Here's the result of 10 calls of .fit(), with client.restart() run after each one.

full traceback:
Steps to reproduce
This is based on https://github.com/jameslamb/talks/tree/main/recent-developments-in-lightgbm, but I'm copying the steps here in case that repo changes in the future.
Create a Dockerfile with the contents from the repo linked above, then build the image:

docker build --no-cache -t dask-lgbm-testing:1 -f Dockerfile .

Then run the container:

docker run \
    -v $(pwd):/home/jovyan/testing \
    -p 8888:8888 \
    -p 8787:8787 \
    --name dask-lgbm-test \
    dask-lgbm-testing:1
That should succeed, and if you click the printed link, you should see the Dask diagnostic dashboard.
Run the training code, then run client.restart(), which clears the memory on all worker processes and removes any work from the scheduler. Repeat this several times.

I expect that you'll see a similar pattern to the one noted above: training will sometimes succeed, but often fail with an error like "cannot bind port XXXX".
Other information
I've noticed that often when this happens, it seems like maybe some of the worker processes were killed and restarted. I don't see messages about that in the logs, but the memory utilization for the workers is really uneven.
I've observed this behavior on FargateClusters from dask-cloudprovider and on the dask-kubernetes clusters from Saturn Cloud. So I don't think this issue is specific to the Dask docker image I used in the example above, or to the use of LocalCluster.

I've also observed this behavior using dask-lightgbm built from current master, with LightGBM 3.0.0.

Given all of that, my best guess is that there is some race condition where workers join the LightGBM cluster in a nondeterministic order, or maybe where two workers claim the same rank.