use_dask and joblib don't handle Exceptions #779
I am not sure, but the timeout handling with the dask backend would be worth testing as well.
Unfortunately, it doesn't work with the joblib backend either. I launched the following code:
After some time, I got the same errors:
and the cell failed: it stopped early without raising an error.
For use_dask, the problem probably comes from Lines 443 to 454 in 507b45d
If the problem is solved for use_dask, it will probably solve the dask joblib backend issue as well.
@louisabraham which version of dask are you using right now? I remember we had this kind of error-handling issue before, but I thought @TomAugspurger had fixed it. Edit: I think that
I am using dask 0.19.3 and tpot 0.9.5.
It's quite likely that the timeouts sent by TPOT are not handled correctly right now on the dask backend. Fixing that should be possible with a little effort. It'll be a while before I can look closely at the other errors.
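To illustrate the behaviour being discussed: locally, a pipeline that exceeds its time budget is abandoned and scored as -inf rather than aborting the run. A stdlib-only sketch of that guard (TPOT's actual timeout mechanism differs; this only illustrates the intended semantics):

```python
# Sketch: map a per-pipeline timeout (or any failure) to a -inf score,
# which is the behaviour the dask backend would need to reproduce.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def scored_with_timeout(fn, timeout, *args):
    """Run fn(*args); on timeout or any exception, return -inf."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            return float("-inf")   # evaluation took too long
        except Exception:
            return float("-inf")   # evaluation crashed

def slow_eval(x):
    time.sleep(0.5)
    return x

print(scored_with_timeout(slow_eval, 0.05, 1.0))  # -inf: timed out
```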
I think it is about MemoryError, not timeouts. I reported the issue at dask/distributed#2297.
I'm facing the same issue. Apparently an exception in a worker leads to a KilledWorker exception being raised, which leads to a mistimed "RuntimeError: A pipeline has not yet been optimized. Please call fit() first.", and then the whole dask machinery shuts down. I'm attaching the relevant portion of the logs and the installed packages (I'm using anaconda inside docker).
My issue is described at dask/dask-jobqueue#169
A huge thanks to @guillaumeeb, who identified the problem!
## Context of the issue
Basically, an error is triggered by the dask backend if a computation encounters exceptions, and after some attempts to relaunch the task, it makes the `fit` function fail.

## Process to reproduce the issue
Set `use_dask=True` with a dataset that will crash with some pipelines (for example PolynomialFeatures with 800 columns).

## Expected result
Without dask, `_wrapped_cross_val_score` catches the exceptions and returns a -inf score. The relevant code is here:
tpot/tpot/gp_deap.py
Lines 434 to 480 in 507b45d
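In outline, that block wraps the cross-validation in a broad except clause and maps any failure to a -inf score. A condensed sketch of the pattern (not TPOT's verbatim code, which also enforces a timeout around the evaluation):

```python
# Condensed sketch of the exception-to-score mapping performed by
# _wrapped_cross_val_score; names here are illustrative.
def wrapped_score(evaluate_pipeline, *args):
    try:
        return evaluate_pipeline(*args)  # e.g. mean CV score
    except KeyboardInterrupt:
        raise                            # let the user abort
    except Exception:
        return float("-inf")             # broken pipeline: worst score

# A pipeline that raises is ranked last instead of killing fit():
print(wrapped_score(lambda: 1 / 0))  # -inf
```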
## Current result
The current result is strange (KeyError), but the error message on the master branch of Dask will indicate that a worker restarted (I think).
## Possible fix
There should be a way to catch the memory error (or worker restart) to return a -inf score.
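One shape such a fix could take is to translate a failed future (killed worker, out-of-memory) into the sentinel score on the client side instead of letting the exception propagate out of `fit`. A dependency-free sketch of that translation (the `KilledWorker` class here is a stand-in for `dask.distributed.KilledWorker`, and the integration point in TPOT is an assumption):

```python
# Sketch: translate per-task failures into -inf instead of aborting fit().
class KilledWorker(Exception):
    """Placeholder for dask.distributed.KilledWorker."""

def score_from_future(result_fn):
    """result_fn mimics Future.result(); worker failures become -inf."""
    try:
        return result_fn()
    except (KilledWorker, MemoryError):
        return float("-inf")  # worker died or ran out of memory

def killed():
    raise KilledWorker("worker restarted too many times")

print(score_from_future(killed))        # -inf
print(score_from_future(lambda: 0.91))  # 0.91
```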
When I try with an LSFCluster backend, it causes an exception. When I try with a LocalCluster (`Client()`), the whole notebook crashes after some attempts.

## Immediate fix
I think using the joblib backend provided by Dask should be fine? I'll try tomorrow! (Edit: it doesn't work, see below.)
## Screenshots
from the dask web interface
## Logs
with one worker
traceback: https://pastebin.com/raw/QvtVXkBD
dask-worker.err: https://ptpb.pw/SwbX
Notice that the dask worker restarted 3 times, which appears to be a constant.