Problems with handling of broken Workers #1874
Comments
I'm curious, how is the worker broken?
"I'm curious, how is the worker broken?" Failed network connectivity for example or specially in my case the "database" of the worker is not polulated while setup (I ran the workers in a docker swarm which every worker run a prepare.sh script on setup by docker) The reason why the worker env is broken has nothing to do with DASK |
Having a worker that is connected to the network but incapable of running tasks sounds bad. If you know at startup which workers will be "good", you could use resources:
http://distributed.readthedocs.io/en/latest/resources.html
You could also specify explicitly which workers are capable of running a task with the workers= keyword:
http://distributed.readthedocs.io/en/latest/locality.html#user-control
After #1876 you should be able to kill workers manually from the client. If you look at the implementation of that PR you'll see that it is just a one-liner that you can do today off of a recent release.
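For illustration, here is a minimal sketch of those two approaches plus retiring a worker from the client, assuming "good" workers are started with a resource tag (e.g. `dask-worker scheduler:8786 --resources "GOOD=1"`); the scheduler/worker addresses and the `inc` function are placeholders:

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def inc(x):
    return x + 1

# Option 1: only run on workers started with --resources "GOOD=1"
future = client.submit(inc, 1, resources={"GOOD": 1})

# Option 2: pin the task to explicitly named workers
future = client.submit(inc, 1, workers=["tcp://10.0.0.5:40000"],
                       allow_other_workers=False)

# Retire a worker that turned out to be broken (available on recent distributed releases)
client.retire_workers(workers=["tcp://10.0.0.9:40000"])
```

Whether resources or explicit worker pinning fits better depends on whether you can tell a worker is broken before or only after tasks start failing on it.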
#1876 helps a lot, thanks.
Hey guys,
I have the following problem: my workers need a complex environment setup which can be broken in some cases, and this causes problems for scheduler runs. For example, a task that cannot be completed on a broken worker behaves as follows:
I have tried raising the Reschedule() exception on failure, but the scheduler assigns the failed task again and again to the same broken worker. As a result, all other tasks succeed, but the whole run hangs because of the one (or more) broken tasks, which keep being rerun on the assigned broken worker(s) forever.
The retries option of the submit function gives almost the same behaviour, with the difference that the whole task fails once all retries are used up.
How can I configure the task so that the scheduler rotates the task assignments, and/or how can I kill broken workers, for example when they have had too many failed runs?
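For reference, a minimal sketch of the two approaches described above; the health check `environment_is_healthy` and the work function `do_work` are placeholders for the real setup check and task body:

```python
from dask.distributed import Client, Reschedule

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def process(item):
    if not environment_is_healthy():  # placeholder check for the broken env
        raise Reschedule()            # hand the task back to the scheduler
    return do_work(item)              # placeholder for the real work

# Variant 1: rely on Reschedule inside the task (may land on the same worker again)
futures = client.map(process, range(100))

# Variant 2: let the scheduler retry failed tasks a fixed number of times
futures = client.map(process, range(100), retries=3)
```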