Dask issue: job cancelled #69
Some further debugging shows that the worker died because the scheduler failed to hear from it for a certain period of time (300 sec). According to dask/distributed#6324, this occurs because distributed tasks hold the GIL for long periods of time. Pure Python code is unlikely to hold the GIL for very long, but our code involves Cython, and that is what created the problem. I started seeing jobs repeatedly getting canceled after we upgraded the Dask version, which agrees with the observations in the link above, where the behavior is reported with dask.distributed==2022.5.0. You can see the PR in the Dask source code that caused our issue here: dask/distributed#6200

A quick fix for now is simply to downgrade the Dask version. A more elegant fix could be either:

1. Figure out a way for Cython to release the GIL (see the sketch below), or
2. Increase the default timeout interval, which is currently 300 sec.

Unfortunately there does not seem to be an easy fix; both solutions require some workaround. Here are the error logs:
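Regarding fix option 1, here is a minimal sketch of what releasing the GIL from Cython could look like (the function name and signature are hypothetical; the point is that a hot loop touching only C-level data can run inside a `with nogil:` block, so the worker thread that answers scheduler heartbeats is not starved):

```cython
# Illustrative only -- names and signature are hypothetical.
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def heavy_kernel(double[:, :] a):
    """Long-running numeric loop that releases the GIL while it runs."""
    cdef Py_ssize_t i, j
    cdef double total = 0.0
    # No Python objects may be touched inside the nogil block, so the GIL
    # stays released and the worker can keep responding to the scheduler.
    with nogil:
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                total += a[i, j] * a[i, j]
    return total
```

Whether this is feasible depends on how much of our Cython code actually needs Python objects inside its inner loops.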
Regarding fix option 2 (increasing the default timeout interval), here's how to do it: it is basically reverting Dask's update here: dask/distributed@d94ab9a
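For reference, a sketch of what that change could look like in the Dask config file (assuming the relevant setting is the scheduler's `worker-ttl`, i.e. how long the scheduler waits to hear from a worker before declaring it dead; the file path shown is the usual default location):

```yaml
# ~/.config/dask/distributed.yaml  (usual default location)
distributed:
  scheduler:
    # How long the scheduler waits to hear from a worker before removing it.
    # null disables the check; a value like "3600s" would just lengthen it.
    worker-ttl: null
```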
Diyu, I don't totally understand this, but can you make some change, perhaps to the timeout interval, that fixes the problem and then do a PR?
Sure. Basically what happened is that the scheduler does not hear back from a compute node for a long time (300 sec), so the scheduler thinks the node is dead and cancels it, while in fact the node is still actively performing computation. The fix is to set this timeout period to infinity instead of Dask's default of 300 sec. This fix requires the user to manually change the timeout interval value inside the Dask config file (which is usually stored in the home directory). Will do a PR on this.
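If editing the config file is inconvenient, the same value can also be set programmatically, as in this sketch (assuming, as above, that the key is `distributed.scheduler.worker-ttl`; it must be set in the process that starts the scheduler, before the scheduler is created, and `LocalCluster` is only used here for illustration):

```python
import dask
from dask.distributed import Client, LocalCluster

# Disable (or lengthen) the scheduler's worker heartbeat timeout.
# None means the scheduler never removes a worker for missed heartbeats;
# a string like "3600s" would lengthen the window instead.
dask.config.set({"distributed.scheduler.worker-ttl": None})

# The value is read when the scheduler starts, so set it before the
# cluster/client is created in this process.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
```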
Recently I've been running into the issue of nodes getting canceled when performing multi-node computation with Dask.
Here's the error log from a canceled node:
I can confirm that this is not due to a small `death_timeout` duration, as I set `death_timeout` to 1200 sec, while the node cancelation happens rather early (~5 min after I got the nodes). Furthermore, I observed that a large chunk of the multi-node jobs gets canceled:
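For context, a sketch of the two usual places `death_timeout` can be set (the launch mechanism shown is an assumption, not necessarily what we use; as I understand it, `death_timeout` controls how long a worker waits for the scheduler before shutting itself down, which is a different setting from the scheduler-side timeout discussed in the comments above):

```python
# Illustrative only -- the cluster class and host names are hypothetical.

# 1. On the worker command line:
#    dask-worker tcp://scheduler-host:8786 --death-timeout 1200

# 2. When creating workers programmatically, e.g. with SSHCluster:
from dask.distributed import Client, SSHCluster

cluster = SSHCluster(
    hosts=["scheduler-host", "worker-host-1", "worker-host-2"],
    worker_options={"death_timeout": 1200},  # worker waits up to 20 min
)
client = Client(cluster)
```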