Incorrect handling of MemoryError #2297
In principle replacing
I don't see how you would do it. The MemoryError should be raised by the function that is being executed, but the KilledWorker is raised in the scheduler / nanny.
Looking this over, can one set the value to None to prevent the nanny from terminating the worker process in low-memory situations? In my case the tasks get lost: the client is not notified that the task is no longer running, the task stays assigned to that particular worker and is not restarted, and if the worker were restarted, it would enter an infinite loop of running out of memory and restarting. Note that this could happen anyway, since the worker process can run out of memory without the nanny knowing; I seem to have encountered this on occasion as well.
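For reference, the setting being discussed can be expressed through Dask's configuration. This is a hypothetical fragment assuming the config key names used by recent versions of distributed (older releases exposed this as a worker keyword argument instead), so check your installed defaults:

```yaml
# Assumed fragment of ~/.config/dask/distributed.yaml.
# Setting `terminate` to false stops the nanny from killing the worker
# process when its memory use crosses the termination threshold.
distributed:
  worker:
    memory:
      terminate: false
```

Disabling termination trades one failure mode for another: instead of a KilledWorker, the worker may be killed by the operating system's OOM killer, which the nanny also cannot report.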
I think this would require the nanny to communicate to the scheduler (AFAIK it currently doesn't, at all) that it just killed the worker, and this information must reach the scheduler before it has a chance to raise KilledWorker.

Use case 1: Worker dies abruptly. There is a Nanny.
Use case 2: Worker dies abruptly. There is a Nanny.
Use case 3: Worker dies abruptly. There is a Nanny.
Use case 4: Worker dies abruptly. There is no Nanny.
Following the discussion on dask/dask-jobqueue#169, I produced an SSCCE. Thank you @guillaumeeb and @mrocklin!
Description
The issue is that a MemoryError cannot be caught (except when raised explicitly), because the worker is restarted before the error reaches the client.
Motivations
It may seem trivial, but there are situations where you want to catch those errors.
The bug originates from EpistasisLab/tpot#779, where an evolutionary algorithm tests scikit-learn pipelines. Some pipelines will produce a MemoryError (like PolynomialFeatures with a large number of columns).
Code
Note that the same works with the delayed interface.
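The phenomenon can be illustrated without Dask at all, using only the standard library as an analogy: when a worker process dies abruptly, the caller receives a generic pool failure instead of the original MemoryError, just as distributed raises KilledWorker rather than the underlying error. This is a sketch, not the original SSCCE; the function names are illustrative.

```python
# Standard-library analogy (not Dask): a worker process that dies abruptly
# cannot deliver its MemoryError back to the caller. The pool reports a
# generic BrokenProcessPool instead, much like distributed's KilledWorker.
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def die_like_oom():
    # Simulate the OOM killer (or the nanny) terminating the process:
    # the process dies before any exception can be pickled and returned.
    os.kill(os.getpid(), signal.SIGKILL)


def main():
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(die_like_oom)
        try:
            future.result()
        except MemoryError:
            return "MemoryError"  # never reached: the error dies with the process
        except BrokenProcessPool:
            return "BrokenProcessPool"  # what the caller actually sees


if __name__ == "__main__":
    print(main())
```

The caller has no way to distinguish an out-of-memory death from any other abrupt termination, which is exactly the information loss this issue describes.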
Output
Also, one can notice that the warnings are strange, because distributed.nanny should know what killed the workers.
Discussion
Since I only started looking at the internals of Dask yesterday, my opinion is very subjective; but I see two options.
The default behavior is to try executing a task 3 times. If a worker hits a MemoryError because of a temporary problem on its node, then Dask should retry the task.
I don't know if option 1 is possible, but if it is, it offers the most flexibility: it would just be a variation on the default behavior, raising a MemoryError if the task failed 3 times because a worker restarted. The retry count could be customized specifically for this case of worker restarts.
Option 2 could be system-dependent, and would probably need a mechanism to retry the tasks. That mechanism would be simple to implement in application code, but I really think Dask should provide it as well.
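The application-level retry mechanism mentioned above can be sketched in a few lines. The names here are illustrative, not a Dask API; in a real Dask application you would catch KilledWorker (or pass `retries=N` to `client.submit`, which distributed supports) rather than MemoryError:

```python
# Hedged sketch of an application-side retry loop: resubmit a task a fixed
# number of times and surface the error only after every attempt has failed.
# `task` is any zero-argument callable; in Dask you would catch KilledWorker
# instead of MemoryError.
def run_with_retries(task, max_attempts=3):
    last_error = None
    for attempt in range(max_attempts):
        try:
            return task()
        except MemoryError as err:
            last_error = err  # remember the failure and resubmit
    raise last_error  # all attempts exhausted: surface the original error
```

A transient failure (say, a task that fails twice and then succeeds) completes on the third attempt, while a permanent one re-raises the original MemoryError to the caller.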