
use_dask and joblib don't handle Exceptions #779

@louisabraham

Description


My issue is described at dask/dask-jobqueue#169

A huge thanks to @guillaumeeb, who identified the problem!

Context of the issue

Basically, the dask backend raises an error when a computation encounters an exception, and after a few attempts to relaunch the task, this makes the fit function fail.

Process to reproduce the issue

Set use_dask=True with a dataset on which some pipelines crash (for example, PolynomialFeatures with 800 columns).
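
Roughly what I run (a sketch only; the dataset shape, generations and population_size below are placeholders, not my exact setup):

import numpy as np
from dask.distributed import Client
from tpot import TPOTClassifier

client = Client()  # a LocalCluster here; an LSFCluster from dask-jobqueue behaves the same

# A wide random dataset so that some pipelines (e.g. PolynomialFeatures on
# 800 columns) blow up the workers' memory.
X = np.random.rand(1000, 800)
y = np.random.randint(0, 2, size=1000)

tpot = TPOTClassifier(generations=5, population_size=20, use_dask=True,
                      n_jobs=-1, verbosity=2)
tpot.fit(X, y)  # eventually fails with the KeyError described below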

Expected result

Without dask, _wrapped_cross_val_score catches the exceptions and returns a -inf score.
The relevant code is here:

tpot/tpot/gp_deap.py, lines 434 to 480 at 507b45d:

if use_dask:
    try:
        import dask_ml.model_selection  # noqa
        import dask  # noqa
        from dask.delayed import Delayed
    except ImportError:
        msg = "'use_dask' requires the optional dask and dask-ml depedencies."
        raise ImportError(msg)

    dsk, keys, n_splits = dask_ml.model_selection._search.build_graph(
        estimator=sklearn_pipeline,
        cv=cv,
        scorer=scorer,
        candidate_params=[{}],
        X=features,
        y=target,
        groups=groups,
        fit_params=sample_weight_dict,
        refit=False,
        error_score=float('-inf'),
    )
    cv_results = Delayed(keys[0], dsk)
    scores = [cv_results['split{}_test_score'.format(i)]
              for i in range(n_splits)]
    CV_score = dask.delayed(np.array)(scores)[:, 0]
    return dask.delayed(np.nanmean)(CV_score)
else:
    try:
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            scores = [_fit_and_score(estimator=clone(sklearn_pipeline),
                                      X=features,
                                      y=target,
                                      scorer=scorer,
                                      train=train,
                                      test=test,
                                      verbose=0,
                                      parameters=None,
                                      fit_params=sample_weight_dict)
                      for train, test in cv_iter]
        CV_score = np.array(scores)[:, 0]
        return np.nanmean(CV_score)
    except TimeoutException:
        return "Timeout"
    except Exception as e:
        return -float('inf')

Current result

The current result is a rather cryptic KeyError, but on the master branch of Dask the error message should indicate that a worker restarted (I think).

Possible fix

There should be a way to catch the memory error (or worker restart) and return a -inf score; see the sketch after the two notes below.

When I try with an LSFCluster backend, it raises an exception.

When I try with a LocalCluster (Client()), the whole notebook crashes after some attempts.
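
A rough sketch of the idea, assuming the Delayed score returned by _wrapped_cross_val_score is computed somewhere in gp_deap; compute_score and delayed_score are made-up names for illustration, not existing TPOT code:

import dask

def compute_score(delayed_score):
    # delayed_score: the Delayed object returned by _wrapped_cross_val_score
    # when use_dask=True (placeholder name).
    try:
        return dask.compute(delayed_score)[0]
    except Exception:
        # A KilledWorker (memory error, worker restarted too many times) or
        # any other failure is mapped to -inf, mirroring the non-dask branch.
        return -float('inf')

If I understand correctly, error_score=float('-inf') in build_graph already covers exceptions raised inside a task, so the remaining case is the one where the worker itself dies and the scheduler gives up on the task.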

Immediate fix

I think using the joblib backend provided by Dask should be fine? I'll try tomorrow!

It doesn't work, see below.
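
For reference, this is what I meant (a sketch only; generations, population_size and the X, y from the reproduction snippet above are placeholders):

from dask.distributed import Client
from joblib import parallel_backend
from tpot import TPOTClassifier

client = Client()  # or the LSFCluster from dask-jobqueue

# Plain TPOT (no use_dask), but with joblib dispatching its internal
# parallelism to the dask cluster.
tpot = TPOTClassifier(generations=5, population_size=20, n_jobs=-1, verbosity=2)
with parallel_backend('dask'):
    tpot.fit(X, y)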

Screenshots

Two screenshots from the dask web interface (attached to the issue).

Logs

with one worker

traceback: https://pastebin.com/raw/QvtVXkBD
dask-worker.err: https://ptpb.pw/SwbX

Notice that the dask worker restarted 3 times, which appears to be a fixed limit.
