**Minimal Complete Verifiable Example**:

```python
from dask.datasets import timeseries
from dask.distributed import Client, LocalCluster


def raise_exc(part):
    raise RuntimeError


if __name__ == '__main__':
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    with Client(cluster) as client:
        ts = timeseries()
        ts = ts.map_partitions(raise_exc, meta={})
        print("starting first compute on cluster")
        try:
            ts.compute()
        except:
            pass
        print("first compute done")
        print("starting second compute on cluster")
        X = client.submit(lambda x: x + 1, 1).result()
        print("second compute done")
        print(X)
```
`second compute done` never prints on distributed main, 2021.10.0. It succeeds on 2021.09.1. (Note this was all with dask/dask on main, dask/dask@c2278fe.)

Originally reported in microsoft/LightGBM#4771 (comment) (thank you @jmoralez and @jameslamb!). I've slightly simplified the example (the multiple clients weren't necessary to reproduce, nor was recomputing `timeseries`; it seems that any subsequent computation deadlocks). I haven't been able to simplify `timeseries().map_partitions(raise_exc)`; just submitting Futures, I can't replicate it.
**Anything else we need to know?**:
I poked briefly at the deadlocked cluster from another client. This is not #5480; the workers' `BatchedSend`s have nothing in their buffers. Additionally, the scheduler and the workers actually agree that the workers still have tasks left to run (`client.processing()` lines up with `client.run(lambda dask_worker: dask_worker.ready)`).
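For anyone repeating that cross-check, here is a minimal sketch from a second, debugging client; the scheduler address is a placeholder, and `dask_worker.ready` is the worker's internal heap of queued tasks referenced above:

```python
from dask.distributed import Client

# Attach a separate debugging client to the stuck cluster
# (the address below is a placeholder; use the real scheduler address).
debug_client = Client("tcp://127.0.0.1:8786")

# Scheduler's view: keys it believes each worker is currently processing.
print(debug_client.processing())

# Workers' view: tasks still sitting in each worker's internal `ready` heap.
print(debug_client.run(lambda dask_worker: list(dask_worker.ready)))
```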
This seems to be purely a Worker state machine deadlock. The problem is that there are keys left in `Worker._executing`:
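A sketch of how to peek at that from the debugging client above; `_executing` is a private worker attribute, and depending on the version its entries may be keys or `TaskState` objects, so the sketch just stringifies them:

```python
# List whatever is left in the private `Worker._executing` set on each worker.
def list_executing(dask_worker):
    # Entries may be keys or TaskState objects depending on the version.
    return [str(item) for item in dask_worker._executing]

print(debug_client.run(list_executing))
```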
But threads aren't actually running:
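One way to confirm that, using only the standard library through the same debugging client, is to dump every thread's stack in each worker process and check that none of them are inside user task code:

```python
import sys
import traceback

# Capture the current stack of every thread in the worker process.
def thread_stacks(dask_worker):
    return {
        thread_id: "".join(traceback.format_stack(frame))
        for thread_id, frame in sys._current_frames().items()
    }

for addr, stacks in debug_client.run(thread_stacks).items():
    print(f"{addr}: {len(stacks)} threads")
    for stack in stacks.values():
        print(stack)
```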
And those tasks are already transitioned to the `error` state, yet not removed from `_executing` for some reason:

cc @fjetter @jrbourbeau
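For completeness, a sketch of how that state mismatch can be checked from the debugging client, again relying on private worker internals (it assumes `dask_worker.tasks` maps keys to `TaskState`s, and that `_executing` may hold either keys or `TaskState` objects):

```python
# For each entry stuck in `_executing`, report the worker-side task state;
# per the above, these read "error" even though they're still in the set.
def stuck_task_states(dask_worker):
    out = {}
    for item in dask_worker._executing:
        # Resolve a bare key to its TaskState if needed.
        ts = dask_worker.tasks.get(item) if isinstance(item, str) else item
        out[str(item)] = getattr(ts, "state", None)
    return out

print(debug_client.run(stuck_task_states))
```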
**Environment**: