Problems with handling of broken Workers #1874
Comments
I'm curious, how is the worker broken?
"I'm curious, how is the worker broken?" Failed network connectivity for example or specially in my case the "database" of the worker is not polulated while setup (I ran the workers in a docker swarm which every worker run a prepare.sh script on setup by docker) The reason why the worker env is broken has nothing to do with DASK |
Having a worker that is connected to the network but incapable of running tasks sounds bad. If you know at startup which workers will be "good", you could use resources:
http://distributed.readthedocs.io/en/latest/resources.html
You could also specify explicitly which workers are capable of running a task with the workers= keyword:
http://distributed.readthedocs.io/en/latest/locality.html#user-control
After #1876 you should be able to kill workers manually from the client. If you look at the implementation of that PR you'll see that it is just a one-liner that you can do today off of a recent release.
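For illustration, here is a minimal sketch of those two approaches plus retiring a worker from the client, assuming "good" workers are started with a resource tag (e.g. `dask-worker scheduler:8786 --resources "GOOD=1"`); the scheduler/worker addresses and the `inc` function are placeholders:

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def inc(x):
    return x + 1

# Option 1: only run on workers started with --resources "GOOD=1"
future = client.submit(inc, 1, resources={"GOOD": 1})

# Option 2: pin the task to explicitly named workers
future = client.submit(inc, 1, workers=["tcp://10.0.0.5:40000"],
                       allow_other_workers=False)

# Retire a worker that turned out to be broken (available on recent distributed releases)
client.retire_workers(workers=["tcp://10.0.0.9:40000"])
```

Whether resources or explicit worker pinning fits better depends on whether you can tell a worker is broken before or only after tasks start failing on it.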
#1876 helps a lot, thanks.
Hey guys,
I have the following problem: my workers need a complex environment setup which can be broken in some cases, and this causes problems for scheduler runs. For example, a task that cannot be completed on a broken worker behaves as follows:
I have tried raising the Reschedule() exception on failure, but the scheduler assigns the failed task again and again to the same broken worker. As a result, all other tasks succeed, but the whole run hangs because of the one (or more) broken tasks, which keep being rerun on the assigned broken worker(s) forever.
The retries option of the submit function gives almost the same behaviour, with the difference that the whole task fails once all retries are used up.
How can I configure the task so that the scheduler rotates the task assignments, and/or how can I kill broken workers, for example when they have had too many failed runs?
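For reference, a minimal sketch of the two approaches described above; the health check `environment_is_healthy` and the work function `do_work` are placeholders for the real setup check and task body:

```python
from dask.distributed import Client, Reschedule

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def process(item):
    if not environment_is_healthy():  # placeholder check for the broken env
        raise Reschedule()            # hand the task back to the scheduler
    return do_work(item)              # placeholder for the real work

# Variant 1: rely on Reschedule inside the task (may land on the same worker again)
futures = client.map(process, range(100))

# Variant 2: let the scheduler retry failed tasks a fixed number of times
futures = client.map(process, range(100), retries=3)
```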