Use use_dask=True or manual import method not working #764

Closed

GinoWoz1 opened this issue Sep 12, 2018 · 15 comments

Comments

@GinoWoz1 commented Sep 12, 2018


I cannot use multiple cores, so my jobs are running extremely slowly.

Context of the issue

In 0.9.4, a fix was put in to allow use_dask=True or a manual import. Both methods return the error:

File "C:\Users\jstnjc\Anaconda3\lib\site-packages\tpot\base.py", line 684, in fit
self._update_top_pipeline()

File "C:\Users\jstnjc\Anaconda3\lib\site-packages\tpot\base.py", line 758, in _update_top_pipeline
raise RuntimeError('A pipeline has not yet been optimized. Please call fit() first.')

RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

Process to reproduce the issue

(I've tested this on 3 different computers, including a cloud service.)

Install Anaconda 3.6 for Windows 64-bit
pip install missingno
pip install these .whl files manually (needed for fancyimpute):
- ecos-2.0.5-cp36-cp36m-win_amd64.whl
- cvxpy-1.0.8-cp36-cp36m-win_amd64.whl
pip install fancyimpute
pip install rfpimp (used for my custom functions import file)
conda install py-xgboost
pip install tpot
pip install msgpack
pip install dask[delayed] dask-ml
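
A quick sanity check before running TPOT, to confirm the trickier installs import cleanly (a minimal sketch; dask and xgboost both expose __version__):

# Verify that the manually installed pieces import without error.
import fancyimpute
import dask
import dask_ml
import xgboost
print(dask.__version__, xgboost.__version__)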


With the above, execute the code below:

from sklearn.metrics import make_scorer
from tpot import TPOTRegressor
import warnings
import pandas as pd
import math
warnings.filterwarnings('ignore')


url = 'https://github.com/GinoWoz1/AdvancedHousePrices/raw/master/'

X_train = pd.read_csv(url + 'train_tpot_issue.csv')
y_train = pd.read_csv(url + 'y_train_tpot_issue.csv', header=None)

def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i, pred in enumerate(y_pred)]
    except ValueError:
        return float('inf')
    # 'or', not 'and': reject the result if either array contains negative values.
    if not (y_true >= 0).all() or not (y_pred >= 0).all():
        raise ValueError("Mean Squared Logarithmic Error cannot be used when "
                         "targets contain negative values.")
    return (sum(terms_to_sum) * (1.0 / len(y_true))) ** 0.5

rmsle_loss = make_scorer(rmsle_loss, greater_is_better=False)

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True)
tpot.fit(X_train, y_train[0])
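
(Note for readers: with use_dask=True, TPOT hands pipeline evaluation to whatever dask scheduler is active, so the usual way to get all cores is to start a dask.distributed Client first. A minimal sketch, assuming dask.distributed is installed:)

from dask.distributed import Client

# A local Client defaults to one worker per core; TPOT's use_dask=True
# then schedules its cross-validation work on this cluster.
client = Client()
print(client)  # reports the number of workers and threads available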

Expected result

Expect the process to run and to use all cores.

Current result

[screenshot: the RuntimeError traceback shown above]

@weixuanfu
Contributor

Hmm, I tested this code in a fresh conda test environment and could not reproduce the error. But I used an easier way to install fancyimpute, via the commands below. Could you please build a conda environment for a test?

conda create -n test_env python=3.6
activate test_env
pip install missingno
conda install -y -c anaconda ecos
conda install -y -c conda-forge lapack
conda install -y -c cvxgrp cvxpy
conda install -y -c cimcb fancyimpute
pip install rfpimp
conda install -y py-xgboost
pip install tpot msgpack dask[delayed] dask-ml

Another suggestion about the customized scorer in your code: it may be more stable if the function returns float('inf') instead of raising ValueError, as in the example below:

def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i, pred in enumerate(y_pred)]
    except ValueError:
        return float('inf')
    # 'or', not 'and': treat negatives in either array as an invalid pipeline.
    if not (y_true >= 0).all() or not (y_pred >= 0).all():
        return float('inf')
    return (sum(terms_to_sum) * (1.0 / len(y_true))) ** 0.5

@GinoWoz1
Author

Thanks Weixuan. Quick question: how do I run the Python script from the conda environment? I am just used to opening the script on my desktop and running it there.

@GinoWoz1
Author

Never mind on the Python script question. I was able to set it up on my laptop.

Any idea why this install process breaks the verbosity argument? Everything else seems to be working fine. Thanks a ton for your help.

Sincerely,
Justin

@weixuanfu
Contributor

You're welcome. Do you mean there is no confirmation prompt when installing packages via conda? If so, the -y flag in the commands is for that purpose.

@GinoWoz1
Author

The progress bar doesn't show up.

@weixuanfu
Contributor

Hmm, I think the progress bar is just hard to spot among the tons of warning messages when use_dask=True, but it did show up in my test (stdout below).

We need to refine this warning-message behavior when use_dask=True.

 **self._backend_args)
D:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py:547: UserWarning: Multiprocessing-backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)
D:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py:547: UserWarning: Multiprocessing-backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)
Generation 1 - Current best internal CV score: -5.969518794583038e-15
Optimization Progress:   4%|█▉                                                | 101/2550 [01:08<51:22,  1.26s/pipeline]
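
In the meantime, a minimal workaround sketch for readers hitting the same noise: filtering that specific joblib UserWarning keeps the progress bar readable (standard-library warnings module; the message pattern is taken from the log above):

import warnings

# Silence the repeated joblib nesting warning so TPOT's progress bar stays visible.
warnings.filterwarnings(
    'ignore',
    message='Multiprocessing-backed parallel loops cannot be nested below threads.*',
)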

@GinoWoz1
Author

Thanks, no problem. I can live without it for now, as long as the periodic checkpoints are being saved. You can close this. Thanks again!

@GinoWoz1
Author

Hmm, after the first generation, the same error came up in the virtual environment. Were you able to finish one generation and save a pipeline? I did exactly as you suggested with the virtual env.

[screenshot: the same RuntimeError traceback]

@weixuanfu
Contributor

weixuanfu commented Sep 14, 2018

Hmm, did you also update rmsle_loss in your code? Could you please provide a random_state so I can reproduce the issue?

def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i, pred in enumerate(y_pred)]
    except ValueError:
        return float('inf')
    # 'or', not 'and': treat negatives in either array as an invalid pipeline.
    if not (y_true >= 0).all() or not (y_pred >= 0).all():
        return float('inf')
    return (sum(terms_to_sum) * (1.0 / len(y_true))) ** 0.5
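
(For reference, a sketch of pinning the seed; random_state is a standard TPOTRegressor parameter, and rmsle_loss here is the wrapped scorer from the script above:)

from tpot import TPOTRegressor

# Fixing random_state makes the evolutionary search reproducible across runs.
tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True,
                     use_dask=True, random_state=42)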

@GinoWoz1
Author

GinoWoz1 commented Sep 17, 2018 via email

@GinoWoz1
Author

GinoWoz1 commented Nov 6, 2018

Hey Weixuan,

With the exact same setup, I am now getting the error below. Any idea? I am unable to get TPOT to finish a single run.

[screenshot: xgboost-related error traceback]

conda create -n test_env python=3.6
activate test_env
pip install missingno
conda install -y -c anaconda ecos
conda install -y -c conda-forge lapack
conda install -y -c cvxgrp cvxpy
conda install -y -c cimcb fancyimpute
pip install rfpimp
conda install -y py-xgboost
pip install tpot msgpack dask[delayed] dask-ml

@weixuanfu
Contributor

Hmm, it seems like an xgboost API issue. I tried to reproduce it via the demo below, but the error didn't show up. I recently updated xgboost to 0.80 via conda install -c anaconda py-xgboost; maybe updating xgboost will help (see the version check after the demo).

from sklearn.metrics import make_scorer
from tpot import TPOTRegressor
import warnings
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import math
warnings.filterwarnings('ignore')
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)
                                                    
def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i, pred in enumerate(y_pred)]
    except ValueError:
        return float('inf')
    # 'or', not 'and': treat negatives in either array as an invalid pipeline.
    if not (y_true >= 0).all() or not (y_pred >= 0).all():
        return float('inf')
    return (sum(terms_to_sum) * (1.0 / len(y_true))) ** 0.5

# Wrap the metric as a scorer, as in the original script.
rmsle_loss = make_scorer(rmsle_loss, greater_is_better=False)

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True)
tpot.fit(X_train, y_train)
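
(A quick way to confirm which xgboost you actually have, a minimal sketch; the package exposes __version__:)

# Check the installed xgboost version; 0.80 worked in the test above.
import xgboost
print(xgboost.__version__)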

@GuillaumeLab

GuillaumeLab commented Aug 22, 2020

I got the same issue. I can't use a conda environment. Whenever I use use_dask=True, I get the following error:

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True)
tpot.fit(X_train, y_train)

RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

I have tried this on an Azure Databricks cluster as well as on my local machine.

@weixuanfu
Contributor

@GuillaumeLab which version of dask is installed in your environment?
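
For example (a quick sketch; each of these packages exposes __version__):

# Report the versions relevant to the dask code path.
import dask, distributed, tpot
print(dask.__version__, distributed.__version__, tpot.__version__)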

@GuillaumeLab

GuillaumeLab commented Aug 23, 2020

dask 2.24.0.

Thanks for your answer.
I also get another error message:
Restarting distributed.nanny - WARNING - Worker exceeded 95% memory budget.

I checked this thread: dask/distributed#2297, and it does not really help solve the issue. TPOT works fine on a single device with no memory issue. Why would distributing it across several devices cause a memory issue?
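
For what it's worth, a minimal sketch of giving each worker an explicit memory budget on a local cluster (LocalCluster's n_workers, threads_per_worker, and memory_limit parameters; the values are illustrative, adjust to your machine):

from dask.distributed import Client, LocalCluster

# Fewer workers with a larger per-worker budget push the nanny's 95%
# threshold further out; TPOT with use_dask=True picks up this client.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit='8GB')
client = Client(cluster)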
