Use use_dask=True or manual import method not working #764

Closed

GinoWoz1 opened this issue Sep 12, 2018 · 15 comments

Comments

@GinoWoz1 commented Sep 12, 2018


I cannot use multiple cores, so my jobs are running extremely slowly.

Context of the issue

In 0.9.4, a fix was put in to allow use_dask=True or a manual import. Both methods return the error:

File "C:\Users\jstnjc\Anaconda3\lib\site-packages\tpot\base.py", line 684, in fit
self._update_top_pipeline()

File "C:\Users\jstnjc\Anaconda3\lib\site-packages\tpot\base.py", line 758, in _update_top_pipeline
raise RuntimeError('A pipeline has not yet been optimized. Please call fit() first.')

RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

Process to reproduce the issue

(I've tested this on 3 different computers, including a cloud service.)

Install Anaconda 3.6 for Windows 64-bit
pip install missingno
pip install these .whl files manually (needed for fancyimpute):
- ecos-2.0.5-cp36-cp36m-win_amd64.whl
- cvxpy-1.0.8-cp36-cp36m-win_amd64.whl
pip install fancyimpute
pip install rfpimp (used for my custom functions import file)
conda install py-xgboost
pip install tpot
pip install msgpack
pip install dask[delayed] dask-ml
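
A quick sanity check before running TPOT, to confirm the trickier installs import cleanly (a minimal sketch; dask and xgboost both expose __version__):

# Verify that the manually installed pieces import without error.
import fancyimpute
import dask
import dask_ml
import xgboost
print(dask.__version__, xgboost.__version__)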


With the above, execute the code below:

from sklearn.metrics import make_scorer
from tpot import TPOTRegressor
import warnings
import pandas as pd
import math
warnings.filterwarnings('ignore')


url = 'https://github.com/GinoWoz1/AdvancedHousePrices/raw/master/'

X_train = pd.read_csv(url + 'train_tpot_issue.csv')
y_train = pd.read_csv(url + 'y_train_tpot_issue.csv', header=None)

def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i, pred in enumerate(y_pred)]
    except ValueError:
        return float('inf')
    # 'or', not 'and': reject the result if either array contains negative values.
    if not (y_true >= 0).all() or not (y_pred >= 0).all():
        raise ValueError("Mean Squared Logarithmic Error cannot be used when "
                         "targets contain negative values.")
    return (sum(terms_to_sum) * (1.0 / len(y_true))) ** 0.5

rmsle_loss = make_scorer(rmsle_loss, greater_is_better=False)

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True)
tpot.fit(X_train, y_train[0])
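
(Note for readers: with use_dask=True, TPOT hands pipeline evaluation to whatever dask scheduler is active, so the usual way to get all cores is to start a dask.distributed Client first. A minimal sketch, assuming dask.distributed is installed:)

from dask.distributed import Client

# A local Client defaults to one worker per core; TPOT's use_dask=True
# then schedules its cross-validation work on this cluster.
client = Client()
print(client)  # reports the number of workers and threads available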

Expected result

Expect the process to run and to use all cores.

Current result

[screenshot: the RuntimeError traceback shown above]

@weixuanfu
Contributor

Hmm, I tested this code in a fresh conda test environment and could not reproduce the error. But I used an easier way to install fancyimpute, via the commands below. Could you please build a conda environment for a test?

conda create -n test_env python=3.6
activate test_env
pip install missingno
conda install -y -c anaconda ecos
conda install -y -c conda-forge lapack
conda install -y -c cvxgrp cvxpy
conda install -y -c cimcb fancyimpute
pip install rfpimp
conda install -y py-xgboost
pip install tpot msgpack dask[delayed] dask-ml

Another suggestion about the customized scorer in your code: it may be more stable if the function returns float('inf') instead of raising ValueError, as in the example below:

def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i, pred in enumerate(y_pred)]
    except ValueError:
        return float('inf')
    # 'or', not 'and': treat negatives in either array as an invalid pipeline.
    if not (y_true >= 0).all() or not (y_pred >= 0).all():
        return float('inf')
    return (sum(terms_to_sum) * (1.0 / len(y_true))) ** 0.5

@GinoWoz1
Author

Thanks Weixuan. Quick question: how do I run the Python script from the conda environment? I am just used to opening the script on my desktop and running it there.

@GinoWoz1
Author

Never mind on the Python script question. I was able to set it up on my laptop.

Any idea why this install process breaks the verbosity argument? Everything else seems to be working fine. Thanks a ton for your help.

Sincerely,
Justin

@weixuanfu
Contributor

You're welcome. Do you mean there is no confirmation prompt when installing packages via conda? If so, the -y flag in the commands is for that purpose.

@GinoWoz1
Author

The progress bar doesn't show up.

@weixuanfu
Contributor

Hmm, I think the progress bar is just hard to spot among the tons of warning messages when use_dask=True, but it did show up in my test (stdout below).

We need to refine this warning-message behavior when use_dask=True.

 **self._backend_args)
D:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py:547: UserWarning: Multiprocessing-backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)
D:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py:547: UserWarning: Multiprocessing-backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)
Generation 1 - Current best internal CV score: -5.969518794583038e-15
Optimization Progress:   4%|█▉                                                | 101/2550 [01:08<51:22,  1.26s/pipeline]
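
In the meantime, a minimal workaround sketch for readers hitting the same noise: filtering that specific joblib UserWarning keeps the progress bar readable (standard-library warnings module; the message pattern is taken from the log above):

import warnings

# Silence the repeated joblib nesting warning so TPOT's progress bar stays visible.
warnings.filterwarnings(
    'ignore',
    message='Multiprocessing-backed parallel loops cannot be nested below threads.*',
)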

@GinoWoz1
Author

Thanks, no problem. I can live without it for now, as long as the periodic checkpoints are being saved. You can close this. Thanks again!

@GinoWoz1
Author

Hmm, after the first generation, the same error came up in the virtual environment. Were you able to finish one generation and save a pipeline? I did exactly as you suggested with the virtual env.

[screenshot: the same RuntimeError traceback]

@weixuanfu
Contributor

weixuanfu commented Sep 14, 2018

Hmm, did you also update rmsle_loss in your code? Could you please provide a random_state so I can reproduce the issue?

def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i, pred in enumerate(y_pred)]
    except ValueError:
        return float('inf')
    # 'or', not 'and': treat negatives in either array as an invalid pipeline.
    if not (y_true >= 0).all() or not (y_pred >= 0).all():
        return float('inf')
    return (sum(terms_to_sum) * (1.0 / len(y_true))) ** 0.5
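
(For reference, a sketch of pinning the seed; random_state is a standard TPOTRegressor parameter, and rmsle_loss here is the wrapped scorer from the script above:)

from tpot import TPOTRegressor

# Fixing random_state makes the evolutionary search reproducible across runs.
tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True,
                     use_dask=True, random_state=42)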

@GinoWoz1
Author

GinoWoz1 commented Sep 17, 2018 via email

@GinoWoz1
Author

GinoWoz1 commented Nov 6, 2018

Hey Weixuan,

With the exact same setup, I am now getting the error below. Any idea? I am unable to get TPOT to finish a single run.

[screenshot: xgboost-related error traceback]

conda create -n test_env python=3.6
activate test_env
pip install missingno
conda install -y -c anaconda ecos
conda install -y -c conda-forge lapack
conda install -y -c cvxgrp cvxpy
conda install -y -c cimcb fancyimpute
pip install rfpimp
conda install -y py-xgboost
pip install tpot msgpack dask[delayed] dask-ml

@weixuanfu
Contributor

Hmm, it seems like an xgboost API issue. I tried to reproduce it via the demo below, but the error didn't show up. I recently updated xgboost to 0.80 via conda install -c anaconda py-xgboost; maybe updating xgboost will help (see the version check after the demo).

from sklearn.metrics import make_scorer
from tpot import TPOTRegressor
import warnings
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import math
warnings.filterwarnings('ignore')
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)
                                                    
def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i, pred in enumerate(y_pred)]
    except ValueError:
        return float('inf')
    # 'or', not 'and': treat negatives in either array as an invalid pipeline.
    if not (y_true >= 0).all() or not (y_pred >= 0).all():
        return float('inf')
    return (sum(terms_to_sum) * (1.0 / len(y_true))) ** 0.5

# Wrap the metric as a scorer, as in the original script.
rmsle_loss = make_scorer(rmsle_loss, greater_is_better=False)

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True)
tpot.fit(X_train, y_train)
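
(A quick way to confirm which xgboost you actually have, a minimal sketch; the package exposes __version__:)

# Check the installed xgboost version; 0.80 worked in the test above.
import xgboost
print(xgboost.__version__)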

@GuillaumeLab

GuillaumeLab commented Aug 22, 2020

I got the same issue. I can't use a conda environment. Whenever I use use_dask=True, I get the following error:

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True)
tpot.fit(X_train, y_train)

RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

I have tried this on an Azure Databricks cluster as well as on my local machine.

@weixuanfu
Contributor

@GuillaumeLab which version of dask is installed in your environment?
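
For example (a quick sketch; each of these packages exposes __version__):

# Report the versions relevant to the dask code path.
import dask, distributed, tpot
print(dask.__version__, distributed.__version__, tpot.__version__)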

@GuillaumeLab

GuillaumeLab commented Aug 23, 2020

dask 2.24.0.

Thanks for your answer.
I also get another error message:
Restarting distributed.nanny - WARNING - Worker exceeded 95% memory budget.

I checked this thread: dask/distributed#2297, and it does not really help solve the issue. TPOT works fine on a single device with no memory issue. Why would distributing it across several devices cause a memory issue?
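
For what it's worth, a minimal sketch of giving each worker an explicit memory budget on a local cluster (LocalCluster's n_workers, threads_per_worker, and memory_limit parameters; the values are illustrative, adjust to your machine):

from dask.distributed import Client, LocalCluster

# Fewer workers with a larger per-worker budget push the nanny's 95%
# threshold further out; TPOT with use_dask=True picks up this client.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit='8GB')
client = Client(cluster)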
