TPOT freezing at 0% with n_jobs >4 on linux with large dataset #876

s-marton · 2019-06-03T12:17:49Z

Hello,

I have been trying to get TPOT running for a while now but always encounter the same error. I have a Linux machine with 24 cores. When I run TPOT on a large dataset (~6 million rows, ~20 features), it freezes at 0%, and after about 10-20 minutes CPU usage drops to a few percent. I already tried setting the multiprocessing start method to forkserver, without any change. I also tried the dask implementation, but since max_eval_time_mins does not seem to work there, it runs forever.

However, the problem does not occur for every n_jobs != 1, only when n_jobs > 4. I do not really know what else to try and would appreciate any suggestions.

Thanks!

from tpot import TPOTRegressor

# RANDOM_SEED, X_train and y_train are defined elsewhere in my script.
aml_tpot = TPOTRegressor(
    scoring='neg_mean_squared_error',
    generations=20,
    population_size=50,
    verbosity=3,
    random_state=RANDOM_SEED,
    n_jobs=16,
    max_eval_time_mins=20,
    cv=3,
)

aml_tpot.fit(X_train.values, y_train.values.ravel())
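
For readers unfamiliar with the forkserver attempt mentioned above, here is a minimal sketch of how the start method is usually switched with the standard library. The helper name `configure_start_method` is hypothetical, and the assumption (based on how the TPOT documentation describes this workaround) is that the call has to happen before TPOT is imported and before any worker processes are created.

```python
import multiprocessing


def configure_start_method(method="forkserver"):
    """Select a multiprocessing start method if the platform supports it.

    Hypothetical helper for illustration. Returns the start method that is
    active after the call.
    """
    if method in multiprocessing.get_all_start_methods():
        # force=True allows overriding a start method that was already set.
        multiprocessing.set_start_method(method, force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    active = configure_start_method("forkserver")
    print("multiprocessing start method:", active)
    # Assumption: TPOT would be imported and fitted only after this point.
```

As reported in this thread, switching the start method did not resolve the freeze here, so this only documents what was tried.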

weixuanfu · 2019-06-03T14:00:34Z

It seems that there is a kind of threading deadlock issue (maybe related to this old issue in joblib). Could you please try to update joblib (> 0.13.2) and scikit-learn (>=0.21) via conda or pip and reinstall TPOT development branch via the command below?

pip install --upgrade --no-deps --force-reinstall git+https://github.com/EpistasisLab/tpot.git@development

We recently noticed that the internal joblib module (based on an older version of joblib) in scikit-learn (<0.20) was deprecated (see #867) and may cause the issue here, because it did not have some important updates about limiting the number of threads in joblib (>0.12, see joblib change log). LMK if this solution works or not.
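
Before reinstalling, it can help to confirm which versions are actually installed in the active environment. The following is a rough sketch using only the standard library (`importlib.metadata`, Python 3.8+). The helper names are hypothetical, the version parsing is deliberately simplistic, and the minimum versions are taken from this comment (0.13.2 is used as the joblib threshold).

```python
from importlib import metadata


def version_tuple(version):
    """Turn '0.13.2' into (0, 13, 2); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report(requirements):
    """Print the installed version of each package and whether it is new enough."""
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed")
            continue
        status = "ok" if meets_minimum(installed, minimum) else "too old"
        print(f"{name}: {installed} (need >= {minimum}): {status}")


# Thresholds taken from the comment above.
report({"joblib": "0.13.2", "scikit-learn": "0.21"})
```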

s-marton · 2019-06-03T15:13:19Z

Thanks for the quick reply!
joblib 0.13.2 is the current version, so I can't update to joblib (> 0.13.2). Or did I get something wrong here?

I installed the development branch and tried it again with joblib == 0.13.2 and scikit-learn >=0.21, but unfortunately it is still freezing.

s-marton · 2019-06-03T21:31:56Z

However, I just tried it again and now it is stuck at 5% (54/1050).

s-marton · 2019-06-06T19:06:41Z

Changing the value of DEFAULT_THREAD_BACKEND = 'threading' to e.g. 'loky' in parallel.py of joblib worked for me.
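
A less invasive way to experiment with backends than editing parallel.py is joblib's `parallel_backend` context manager. The sketch below only shows the joblib mechanism on a toy function. Whether the `Parallel` calls inside TPOT and scikit-learn pick up the backend selected this way is an assumption that would need to be verified, and the names `square` and `run_squares` are purely illustrative.

```python
from joblib import Parallel, delayed, parallel_backend


def square(x):
    """Toy workload standing in for a pipeline evaluation."""
    return x * x


def run_squares(values, backend="loky", n_jobs=2):
    """Run the toy workload under an explicitly selected joblib backend."""
    # Parallel() is created without a backend argument, so it uses the
    # backend and n_jobs chosen by the surrounding context manager.
    with parallel_backend(backend, n_jobs=n_jobs):
        return Parallel()(delayed(square)(v) for v in values)


if __name__ == "__main__":
    print(run_squares([1, 2, 3]))
```

Because the context manager is scoped to the `with` block, it leaves the installed joblib package untouched, which makes it easier to switch back if the change does not help.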

huaiyizhao · 2019-06-14T06:19:15Z

Not working. It runs only when n_jobs is set to 1

huaiyizhao · 2019-06-14T06:21:54Z

@Chowkah Could you please explain how you reached 5%? I am still stuck at 0%.

s-marton · 2019-06-14T09:38:59Z

I can't really pin it down to one thing. I tried several different things; sometimes it worked, sometimes not. I changed the parallel backend directly in joblib's parallel.py, which sometimes helped. Additionally, I changed my random seed to some other value, and with otherwise identical settings it worked. So the problem might be related to a specific algorithm (maybe only with some specific parameter setting) that makes TPOT freeze. However, I was not able to identify which one it might be.

huaiyizhao · 2019-06-15T03:56:16Z

Thank you. Maybe I should start with the examples in the official docs, make a few changes at a time, and see what happens.

weixuanfu added the bug label Jun 3, 2019

weixuanfu mentioned this issue Jun 3, 2019

TPOTRegressor Thinks its Running HTOP disagrees #875

Open

perib mentioned this issue Sep 21, 2023

TPOT2 and the future of TPOT development -- From the Devs #1322

Open
