Description
My issue is described at dask/dask-jobqueue#169
A huge thanks to @guillaumeeb, who identified the problem!
Context of the issue
Basically, the Dask backend raises an error when a computation encounters exceptions; after a few attempts to relaunch the task, this makes the fit function fail.
Process to reproduce the issue
Set use_dask=True with a dataset that will crash with some pipelines (for example, PolynomialFeatures with 800 columns).
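To see why this example exhausts memory: a degree-2 polynomial expansion of 800 columns produces over 300,000 output columns. A rough back-of-the-envelope sketch (the row count and dtype below are illustrative assumptions, not from the report):

```python
from math import comb

def n_poly_features(n_cols, include_bias=True):
    """Output columns of a degree-2 polynomial expansion:
    linear terms + squares + pairwise interactions (+ optional bias)."""
    n = n_cols + n_cols + comb(n_cols, 2)
    return n + (1 if include_bias else 0)

print(n_poly_features(800))                       # 321201 output columns
# At float64 (8 bytes/value), even a modest 10,000-row dataset needs
# roughly 10_000 * 321_201 * 8 bytes ~= 25.7 GB for the expanded matrix.
print(10_000 * n_poly_features(800) * 8 / 1e9)
```

Any worker materializing that matrix is likely to be killed for exceeding its memory limit, which is the failure mode discussed below.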
Expected result
Without dask, _wrapped_cross_val_score catches the exceptions and returns a -inf score.
The relevant code is there:
Lines 434 to 480 in 507b45d
```python
if use_dask:
    try:
        import dask_ml.model_selection  # noqa
        import dask  # noqa
        from dask.delayed import Delayed
    except ImportError:
        msg = "'use_dask' requires the optional dask and dask-ml depedencies."
        raise ImportError(msg)

    dsk, keys, n_splits = dask_ml.model_selection._search.build_graph(
        estimator=sklearn_pipeline,
        cv=cv,
        scorer=scorer,
        candidate_params=[{}],
        X=features,
        y=target,
        groups=groups,
        fit_params=sample_weight_dict,
        refit=False,
        error_score=float('-inf'),
    )
    cv_results = Delayed(keys[0], dsk)
    scores = [cv_results['split{}_test_score'.format(i)]
              for i in range(n_splits)]
    CV_score = dask.delayed(np.array)(scores)[:, 0]
    return dask.delayed(np.nanmean)(CV_score)
else:
    try:
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            scores = [_fit_and_score(estimator=clone(sklearn_pipeline),
                                     X=features,
                                     y=target,
                                     scorer=scorer,
                                     train=train,
                                     test=test,
                                     verbose=0,
                                     parameters=None,
                                     fit_params=sample_weight_dict)
                      for train, test in cv_iter]
        CV_score = np.array(scores)[:, 0]
        return np.nanmean(CV_score)
    except TimeoutException:
        return "Timeout"
    except Exception as e:
        return -float('inf')
```
Current result
The current result is strange (a KeyError), but on the master branch of Dask the error message will indicate that a worker restarted (I think).
Possible fix
There should be a way to catch the memory error (or worker restart) and return a -inf score.
When I try with an LSFCluster backend, it causes an exception; when I try with a LocalCluster (Client()), the whole notebook crashes after some attempts.
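One possible shape for such a fix, sketched below. This is not a tested patch: the `compute` callable and the wrapper name are assumptions for illustration. `KilledWorker` is the real exception `dask.distributed` raises when a task's worker dies repeatedly; it is stubbed here so the sketch runs without a cluster.

```python
# Stand-in for dask.distributed.KilledWorker so this sketch is self-contained;
# real code would `from dask.distributed import KilledWorker` instead.
class KilledWorker(Exception):
    pass

def compute_score_or_neg_inf(delayed_score, compute):
    """Evaluate a delayed CV score, mapping repeated worker deaths
    (e.g. from running out of memory) to a -inf score."""
    try:
        return compute(delayed_score)
    except KilledWorker:
        return -float('inf')

def dying_compute(_):
    # Simulates a worker repeatedly killed by the OOM scenario above.
    raise KilledWorker("worker ran out of memory")

compute_score_or_neg_inf(None, dying_compute)  # -inf instead of crashing
```

The idea is simply to extend the existing catch-and-return-(-inf) convention to the point where the delayed result is materialized.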
Immediate fix
I think using the joblib backend provided by Dask should be fine? I'll try tomorrow!
It doesn't work, see below.
Logs
With one worker:
traceback: https://pastebin.com/raw/QvtVXkBD
dask-worker.err: https://ptpb.pw/SwbX
Notice that the dask worker restarted 3 times, which appears to be a fixed limit.