Problems with handling of broken Workers #1874

Closed
avolution opened this issue Mar 30, 2018 · 4 comments

@avolution

avolution commented Mar 30, 2018

Hey guys

I have the following problem:

My workers need a complex environment setup, which can end up broken in some cases.

This causes problems for scheduler runs.

For example:

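from dask.distributed import Client

client = Client()  # assumption: pass your scheduler address here as needed

def worker_function():
    # do something that will fail on at least one worker node
    return

client.submit(worker_function)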

This task can never complete, for the following reasons:

I have tried raising the Reschedule() exception on failure, but the scheduler assigns the failed task to the same broken worker again and again.
As a result, all other tasks succeed, but the whole run hangs because of the one (or more) broken tasks, which are rerun forever on the broken worker(s) they were assigned to.
The retries option of the submit function gives almost the same behaviour, except that the whole task fails once all retries have been used up.

How can I configure the task so that the scheduler rotates its assignment across workers,

and/or how can I kill broken workers, for example when they have had too many failed runs?
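
One possible way to approach the "kill broken workers" part is sketched below. It is only an illustration, not something confirmed in this thread. It assumes a user-supplied probe function that checks the worker environment (hypothetical, for example "is the database populated?"), runs it on every worker with client.run, and retires the workers where the probe fails with client.retire_workers. The helper names safe_probe, find_broken_workers and retire_broken_workers are made up for this example.

```python
def safe_probe(probe):
    """Wrap a probe so that any exception counts as 'unhealthy'.

    `probe` is a hypothetical, user-supplied zero-argument function that
    returns a truthy value when the worker environment is usable.
    """
    def wrapped():
        try:
            return bool(probe())
        except Exception:
            return False
    return wrapped


def find_broken_workers(client, probe):
    """Run the probe on every worker and return the addresses that failed.

    Relies on client.run(func) returning a dict {worker_address: result}.
    """
    results = client.run(safe_probe(probe))
    return sorted(addr for addr, ok in results.items() if not ok)


def retire_broken_workers(client, probe):
    """Retire every worker whose probe failed; return their addresses."""
    broken = find_broken_workers(client, probe)
    if broken:
        # Retired workers leave the cluster, so the scheduler can no longer
        # assign tasks to them.
        client.retire_workers(workers=broken)
    return broken
```

Calling retire_broken_workers(client, my_probe) before submitting work, or periodically during a long run, would keep tasks off workers whose setup failed. Whether retiring is preferable to repairing or restarting the worker depends on the deployment.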

@mrocklin
Member

I'm curious, how is the worker broken?

@avolution
Author

"I'm curious, how is the worker broken?"

Failed network connectivity, for example, or, specifically in my case, the worker's "database" is not populated during setup (I run the workers in a Docker Swarm, where every worker runs a prepare.sh script on setup).

The reason the worker environment is broken has nothing to do with Dask.

@mrocklin
Member

mrocklin commented Mar 30, 2018 via email

@avolution
Author

#1876 helps a lot, thanks.

@mrocklin mrocklin closed this as completed Apr 6, 2018