Workers in a LocalCluster appear to stall at indeterminate time through computation #3878
Comments
If you suspect #3761, have you tested with work stealing off?
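For reference, a minimal sketch of one way to turn work stealing off, assuming the setting is applied before the LocalCluster (and hence the scheduler) is created:

```python
import dask

# Must be set before the scheduler / LocalCluster is created so that
# the scheduler reads the setting at startup.
dask.config.set({"distributed.scheduler.work-stealing": False})

# Alternatively, via an environment variable before launching Python:
#   export DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=False
```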
Just a little update on this; I'm still seeing it occur with work stealing disabled. I have a suspicion that it may be related to the fact that I create a client/local cluster long before I actually trigger a compute - if I break up the workflow to create a new client/local cluster right before I call … I am not sure how the workers being idle (mostly; apart from scanning input CSVs on …
Hi Krishan - You said this no longer occurs when you increase …
Hey @GFleishman, …
I am still trying to narrow down the source of my issue, but I am raising it now in case someone is able to jump in with some insight that helps me either solve it or find an MRE.
What happened:
I create a LocalCluster with between 32 and 90 workers (depending on what size server I am running on).
I create and submit a rather large task graph (~3 million tasks) composed entirely of Dask DataFrame operations.
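For reference, a minimal sketch of this kind of setup; the worker count, input files, and DataFrame operations below are illustrative only, not my actual workload.

```python
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# Start a local cluster with one process per worker.
cluster = LocalCluster(n_workers=64, threads_per_worker=1)
client = Client(cluster)

# Build a large Dask DataFrame graph (files/columns are hypothetical).
df = dd.read_csv("inputs/*.csv")
result = df.groupby("key")["value"].mean()

# Submitting the graph kicks off the computation on the cluster.
result = result.compute()
```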
At indeterminate times through the computation, some of my workers abruptly drop
to 0-4% CPU usage. They still hold keys in memory, and still have tasks listed
in their processing queue, but they do not make any progress.
If I do not intervene, all other workers exhaust their processing queues (taking
new tasks from the scheduler & completing them) until
the only tasks remaining are those that depend on tasks currently in the queue
of the stalled worker(s). At this point, all workers in the cluster sit idle.
If I manually SIGTERM the offending worker(s), the computation is able to finish successfully.
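A sketch of the kind of manual intervention I mean, assuming the stalled worker still responds to `Client.run` (if it doesn't, sending SIGTERM to the worker process from the shell works too); the worker address below is hypothetical.

```python
import os
import signal

# Map each worker address to the PID of its process.
pids = client.run(os.getpid)

# Address of the stalled worker, taken from the dashboard or logs
# (hypothetical value).
stalled = "tcp://127.0.0.1:34334"

# With the default nanny, the killed worker process is typically restarted.
os.kill(pids[stalled], signal.SIGTERM)
```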
I originally thought I might be seeing an instance of #3761, but now I think my problem is different.
Often (but not always), the stalled workers have a status of `closing`. More often (almost always, but not 100% of the time), the workers emit a log line stating something like `Stopping worker at <Worker 127.0.0.1:34334>`, which tells me something is invoking the `Worker.close()` method. I've noticed that, in general, for every worker that hits this state there are ~3-4 other workers that close successfully.
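A sketch of one way to inspect worker status from the client, using `Client.run` to ask each worker directly (a fully stalled worker may simply not respond to this):

```python
# Functions passed to Client.run that accept a `dask_worker` keyword
# receive the Worker instance itself.
statuses = client.run(lambda dask_worker: str(dask_worker.status))

for address, status in sorted(statuses.items()):
    print(address, status)
```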
After adjusting the connection timeouts (`distributed.comm.timeouts.tcp` and `distributed.comm.timeouts.connect`) to 90s each, I have yet to encounter this problem again.
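A minimal sketch of how those timeouts can be raised, either in Python before the cluster is created or in the distributed YAML config; 90s matches the value mentioned above.

```python
import dask

# Raise both comm timeouts to 90s before creating the LocalCluster.
dask.config.set({
    "distributed.comm.timeouts.connect": "90s",
    "distributed.comm.timeouts.tcp": "90s",
})

# Equivalent YAML in ~/.config/dask/distributed.yaml:
#   distributed:
#     comm:
#       timeouts:
#         connect: 90s
#         tcp: 90s
```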
I have two problems:
1. Workers are being closed at all, which may be related to them timing out.
2. The workers being closed stall rather than shutting down cleanly, which blocks the rest of the computation until I intervene manually.

I can live with 1 as long as 2 isn't true, but ideally I'd understand why 1 was occurring to fix the source of the problem.

What you expected to happen:
The workers to close successfully, or not be closed at all.
Minimal Complete Verifiable Example:
I have yet to create an MRE for this, and am losing hope that I will succeed in doing so.
Anything else we need to know?:
I realise that this bug report might lack enough detail to work with - I am sharing it in the hope that someone may help point me in a direction to dig further. I will post updates as I uncover more.
Environment: