Task timeouts and retries #391
IPython doesn't allow retrying a task on timeout (that might be a little tricky, depending on how well you support aborting/invalidating results that are pending), but it does support task reassignment based on special exceptions. In IPython's case it's an UnmetDependency exception, indicating that the worker cannot run the task (meant for missing packages, or resources like memory, GPUs, etc.).
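A rough sketch of that pattern on the IPython side, assuming ipyparallel's `UnmetDependency` (the exact import path is an assumption and varies between older IPython.parallel and newer ipyparallel releases):

```python
# Sketch only: raise UnmetDependency when this engine can't run the task,
# so the scheduler reassigns it to another engine instead of failing it.
from ipyparallel.error import UnmetDependency  # import path assumed; may differ by version

def gpu_task(x):
    try:
        import cupy  # stand-in for any package/resource the task needs
    except ImportError:
        raise UnmetDependency("no GPU stack on this engine")
    return float(cupy.asnumpy(cupy.asarray(x) ** 2))
```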
Just curious, has anything changed on this front? In particular I was looking for a solution to:
I often find that the last couple of tasks (out of many thousands) will hang for unknown reasons; the only solution I've come up with so far is to submit everything w/
Far from ideal, but it at least breaks out of the deadlock where the job would otherwise never finish.
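Purely as illustration of that kind of manual workaround (the exact call used above isn't shown), one pattern is to put a per-future timeout on result collection and cancel/resubmit apparent stragglers. `Client.submit`, `Future.result(timeout=...)`, and `Client.cancel` are standard dask.distributed APIs; `my_task` and the 300-second deadline are placeholders:

```python
import asyncio
from dask.distributed import Client

client = Client()

def my_task(i):                      # placeholder for the real workload
    return i * 2

futures = {client.submit(my_task, i): i for i in range(10_000)}

results = []
for fut, arg in futures.items():
    try:
        results.append(fut.result(timeout=300))       # guessed per-task deadline
    except (TimeoutError, asyncio.TimeoutError):      # exact timeout class varies by version
        client.cancel([fut])                          # drop the apparently hung copy
        # pure=False forces a fresh key so the resubmission isn't tied to the cancelled future
        results.append(client.submit(my_task, arg, pure=False).result())
```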
@bnaul you might be interested in dask/dask#1183
Wow, this line is a life-saver. Executing this by hand also helped with my issue here. Is there a solution for this problem in the meantime?
FWIW today there's a special |
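Presumably the special exception meant here is something like distributed's `Reschedule`; a minimal sketch, under that assumption, of a task handing itself back to the scheduler so it can run elsewhere:

```python
from dask.distributed import Reschedule, get_worker

def task(x):
    # Hypothetical condition: if this worker looks unsuitable, ask the
    # scheduler to release the task and run it again, possibly elsewhere.
    if get_worker().address.endswith(":9999"):   # placeholder check
        raise Reschedule()
    return x + 1
```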
This would be really useful, especially if it could be applied globally to all tasks. My specific use case would basically be annotating any task in a graph with a timeout that triggers a retry. I often find that 99% of my tasks finish quickly, but for some reason one hangs, potentially indefinitely. I could use the exception mentioned by @jrbourbeau above, but that is hard if I am calling a function defined by a package that triggers many tasks. One example is Xarray's
Another use case where this would help is when I have a hypothesis that a dask distributed lock is stuck or something similar. I have many hypotheses, most of them unlikely to be true, about why a very small subset of tasks hang at the very end, and this is the only sweeping mechanism that could help trigger a retry. I have no clue what this would take, just giving my two cents on why this would be a nice-to-have.
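A timeout-triggered retry like the one described doesn't exist as far as I know, but plain retries can already be attached broadly rather than per call. `retries=` on `Client.compute` is an existing keyword; the `dask.annotate` form is assumed to be available in recent dask versions, and either way this only fires when a task actually raises, not when it hangs:

```python
import dask
from dask.distributed import Client

client = Client()

lazy = build_big_xarray_computation()            # placeholder for the real graph

# whole-graph retries at compute time
result = client.compute(lazy, retries=3).result()

# or annotate a block of graph construction (assumed annotation support)
with dask.annotate(retries=3):
    lazy2 = build_big_xarray_computation()       # placeholder
```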
I have another scenario of the same issue where raising anything from the task won't work: my cluster is tuned for raw performance over stability, so tasks are allowed to rarely fail with unexpected consequences (typically segfaults) and can thus sometimes hang without any way to recover. I don't see any "infinite" hangs: when a task stops responding it will eventually trigger a reschedule "automatically", but the delay is huge, probably due to some internal timeout that saves the day, so I can safely keep it running without supervision. The issue is, of course, degraded overall performance. The line I'm seeing in the logs after a long delay:
So I need a way to trigger that internal timeout sooner. Ideally, some routine would learn the 99th percentile of single-job completion time and force-reschedule anything beyond it. I will try to play with TCP keepalive and timeouts on the scheduler node to see if it helps.
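For the TCP timeout angle, a sketch of tightening distributed's comm timeouts via `dask.config`; the key names are taken from distributed's default configuration and the values are guesses, so check both against the installed version before relying on them:

```python
import dask

dask.config.set({
    "distributed.comm.timeouts.connect": "10s",  # how long to wait when establishing a connection
    "distributed.comm.timeouts.tcp": "30s",      # how long a silent connection is tolerated
})
```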
Sometimes we want to retry a task on another worker if it appears to be taking a long time. One approach would be to specify a timeout when submitting the task.
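For illustration, roughly what that interface could look like; the `timeout=` keyword here is hypothetical and not something `Client.submit` accepts today:

```python
import time
from dask.distributed import Client

def slow_function(x):
    time.sleep(x)
    return x

client = Client()

# Hypothetical interface: if the task hasn't finished within 60 seconds,
# the scheduler would retry it on another worker.
future = client.submit(slow_function, 120, timeout=60)   # timeout= does not exist today
```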
Another would be for the function itself to raise a special exception.
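And a sketch of that second approach; the `RetryOnAnotherWorker` exception is invented purely for illustration (today's closest real equivalent is probably distributed's `Reschedule`, shown earlier), and `process` is a placeholder:

```python
import time

class RetryOnAnotherWorker(Exception):
    """Hypothetical marker exception the scheduler would treat as
    'release this task and try it on a different worker'."""

def guarded_task(chunks, deadline=60):
    start = time.monotonic()
    results = []
    for chunk in chunks:
        if time.monotonic() - start > deadline:
            # Give up here and (hypothetically) ask to be retried elsewhere.
            raise RetryOnAnotherWorker()
        results.append(process(chunk))        # placeholder for the real work
    return results
```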
The latter is somewhat attractive because it removes some administrative burden from the scheduler.
Pinging @minrk for thoughts or previous experiences.