[dask] [python] lightgbm.dask hangs indefinitely after an error #4771
Comments
Thanks for the example @jameslamb. I couldn't reproduce the hang on the workers test; I didn't realize it was the consecutive runs that failed. I just ran this locally and looked at the worker logs and see this:
I'd say that's strange, to say the least.
Sure! I only had the idea of running the two consecutive tests after noticing @gjoseph92's writeup in dask/distributed#5480.
Oh, I just realized that `test_errors` raises that exception. I see now that it's not strange at all to see that in the worker logs, haha.
Adding a …
Yeah! Since we have evidence that this is a bug in `distributed`, I'd support a PR that adds that.
@jameslamb do you have any logs from the cluster you could share? Specifically, I'm interested in evidence of dask/distributed#5482, or some other error like it. But if you're reproducibly showing a deadlock with 2021.10.0 but not 2021.09.0, that's very helpful information.
Let me try running the repro from this issue's description with more verbose logging! To be honest, I just jumped to conclusions when I saw the combination of these factors.
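On the verbose-logging idea, here is a minimal sketch (not taken from this thread) of one way to surface more detailed scheduler/worker output; it assumes the standard `logging` module and `LocalCluster`'s `silence_logs` argument, and exact behavior may vary by `distributed` version.

```python
# Sketch: raise distributed's log level before reproducing the hang.
# Assumes LocalCluster's `silence_logs` argument; details may vary by version.
import logging

from dask.distributed import Client, LocalCluster

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("distributed").setLevel(logging.DEBUG)

if __name__ == "__main__":
    # Pass DEBUG so worker/scheduler log records are forwarded instead of silenced.
    cluster = LocalCluster(n_workers=2, threads_per_worker=2,
                           silence_logs=logging.DEBUG)
    with Client(cluster) as client:
        # ... run the reproducer here ...
        print(client.get_worker_logs())  # logs captured on each worker
```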
@gjoseph92 the following reproduces what we're seeing in our tests:

```python
from dask.datasets import timeseries
from dask.distributed import Client, LocalCluster


def raise_exc(part):
    raise Exception


if __name__ == '__main__':
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)

    with Client(cluster) as client:
        ts = timeseries()
        ts = ts.map_partitions(raise_exc, meta={})
        try:
            ts.compute()
        except:
            pass

    with Client(cluster) as client:
        X = timeseries().compute()
        print(X.shape)
```
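With `dask`/`distributed` 2021.10.0 this script gets stuck in the second `Client` block; after downgrading to 2021.9.0 it completes and prints the shape (see the description below).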
@jmoralez thanks for the excellent reproducer! Turns out this was a separate issue: dask/distributed#5497. Your code is very helpful!
Description
Shortly after `dask=2021.10.0` and `distributed=2021.10.0` were uploaded to the Anaconda default channels (used in LightGBM's CI), LightGBM's CI job started failing with timeouts (#4769). Logs suggested that one of the unit tests on `lightgbm.dask` was hanging indefinitely. Specifically, the test after `test_errors`.
Reproducible example
Run inside a container with `conda` available.
Using `dask` and `distributed` v2021.10.0, after a run which produces an error on the cluster, the next test gets stuck indefinitely. Downgrading to `dask`/`distributed` 2021.9.0 fixes the issue.
Environment info
See commands in reproducible example.
Additional Comments
I suspect that `lightgbm`'s tests are being affected by the bug documented here: dask/distributed#5480.
I think this because timeout errors happen immediately after a test that produces an error, and as of #4159, all of LightGBM's Dask unit tests share the same cluster.