
TPOTRegressor thinks it's running, htop disagrees #875

Open
Whamp opened this issue Jun 1, 2019 · 13 comments
@Whamp

Whamp commented Jun 1, 2019

This may be user error since I'm a new TPOT user, but I tried to run what I expected to be a large job and I think it has failed: the kernel is still busy and I haven't received any errors, but the job went from occupying all CPU cores and nearly all the memory to none.

I've run the Boston training set with no problems so I think my configuration is good.

Here is my code for the job I think failed. I'm running in a Jupyter notebook, and it's been running for about 4.5 hours.

I'm running TPOT 0.10.1, installed via conda.

model_pipe = tpot.TPOTRegressor(generations=100, 
                   population_size=100,                        
                   offspring_size=None, 
                   mutation_rate=0.9,                       
                   crossover_rate=0.1,
                   scoring='neg_mean_squared_error',
                   cv=list(splitter.split(X_trn)),   
                   subsample=1.0,
                   n_jobs=-2,                        
                   max_time_mins=24*60,              
                   max_eval_time_mins=20,           
                   random_state=42,                  
                   config_dict=None,
                   template="RandomTree",
                   warm_start=False,
                   memory='auto',                    
                   use_dask=True,                    
                   periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',
                   early_stop=10,                    
                   verbosity=3,                      
                   disable_update_check=False)

model_pipe.fit(X_trn.drop(columns='Date'), y_trn.drop(columns='Date').values)
print(model_pipe.score(X_tst.drop(columns='Date'), y_tst.drop(columns='Date').values))

[screenshot of notebook output]

output from htop on ubuntu 18.04:
[htop screenshot]

I wanted to highlight this as I searched and couldn't find a similar issue.

@weixuanfu
Contributor

How large is your dataset? I suspect the dask backend may have crashed somehow after using all the resources. I will double-check it next week.

@Whamp
Author

Whamp commented Jun 2, 2019

I don't think the size is the problem; the dataset is only about 2 GB. I think the issue may be the time-series cross-validation I'm using. I'm splitting the DataFrame by date, with 1964 to 2012 in the training set and each month from 2012 to 2018 as a test set, giving roughly 72 monthly cross-validation splits. I assume that means each model needs to be retrained 72 times, which is wasteful but also necessary for my problem.

Here is my code for that:

import pandas as pd
from pandas.tseries.offsets import MonthEnd
import numpy as np

class TimeSeriesSplitMonthLag():
    def __init__(self, date_col='Date',init_trn_date=None, lag=12):
        """
        date_col: string name of column in dataframe containing dates
        init_trn_date: string in format mm/dd/yyyy for the first train split
        lag: number of months to lag forward the test set to handle forward-lagged targets
        
        Example:
        >>> splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date="05/31/2012", lag=12)
        >>> for trn_idx, tst_idx in splitter.split(df):
        ...    print(f"trn index max: {trn_idx.max()}, month {df.loc[trn_idx,'Date'].max()}")
        ...    print(f"tst index min: {tst_idx.min()}, month {df.loc[tst_idx,'Date'].min()}")
        trn index max: 1344164, month 2018-02-28 
        tst index min: 1339,    month 2019-02-28 
        trn index max: 1344262, month 2018-03-31 
        tst index min: 1340,    month 2019-03-31 
        trn index max: 1344483, month 2018-04-30 
        tst index min: 1341,    month 2019-04-30 
        
        """
        self.init_trn_date = pd.to_datetime(init_trn_date)
        self.lag = lag
        self.Date = date_col

    def split(self, df):
        """
        df: dataframe in single index format
        """
        max_tst_date = pd.to_datetime(df[self.Date].unique()).max()
        max_trn_date = max_tst_date - MonthEnd(self.lag)
        trn_date_range = pd.date_range(start= self.init_trn_date, end= max_trn_date,freq= 'M')
        
        for date in trn_date_range:
            trn_idxs = df.loc[df[self.Date] <= date,:].index.values
            tst_idxs = df.loc[df[self.Date] == date + MonthEnd(self.lag),:].index.values
            yield (trn_idxs, tst_idxs)

splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date='05/31/2012', lag=12)

# passed as a parameter to tpot:
cv = list(splitter.split(X_trn))

I'm going to try running a small sample of the dataset with fewer cv monthly splits just to test.
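Before handing the splits to TPOT, the splitter's behavior can be sanity-checked on a toy frame. This is a minimal sketch with the split logic reproduced inline (same `MonthEnd` arithmetic as the class above) so it runs standalone; the dates and lag are arbitrary:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Toy frame: one row per month-end from 2010-01 through 2013-12.
dates = pd.date_range("2010-01-31", "2013-12-31", freq="M")
df = pd.DataFrame({"Date": dates, "x": range(len(dates))})

# Reproduce the splitter's logic inline: train on all rows up to
# `date`, test on the single month `lag` months ahead.
lag = 12
init_trn_date = pd.to_datetime("2011-12-31")
max_trn_date = df["Date"].max() - MonthEnd(lag)

splits = []
for date in pd.date_range(init_trn_date, max_trn_date, freq="M"):
    trn_idxs = df.index[df["Date"] <= date].values
    tst_idxs = df.index[df["Date"] == date + MonthEnd(lag)].values
    splits.append((trn_idxs, tst_idxs))

print(len(splits))  # 13 monthly splits
```

Each element is a `(train_indices, test_indices)` pair, which is the shape scikit-learn (and therefore TPOT's `cv` parameter) accepts for a precomputed CV iterable.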

@Whamp
Author

Whamp commented Jun 2, 2019

Just to update: I tried running with only 12 monthly cross-validation splits, 1 generation, and a population of 2, and it had the same behavior. It never shows a completed pipeline, and CPU usage goes from heavy to nothing. I tried running with dask on and off, with the same behavior.

@Whamp
Author

Whamp commented Jun 2, 2019

After changing a few more parameters and setting n_jobs to 4 from -2, I finally got an error message:

model_pipe = tpot.TPOTRegressor(generations=1, 
                   population_size=2,                        
                   offspring_size=None, 
                   mutation_rate=0.9,                       
                   crossover_rate=0.1,
                   scoring='neg_mean_squared_error',
                   cv=list(splitter.split(X_trn)),   
                   subsample=1.0,
                   n_jobs=4,                       
                   max_time_mins=24*60,              
                   max_eval_time_mins=20,           
                   random_state=42,                 
                   config_dict=None,
                   template="RandomTree",
                   warm_start=False,
                   memory='auto',                   
                   use_dask=False,                    
                   periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp', 
                   early_stop=10,                    
                   verbosity=4,                      
                   disable_update_check=False)

model_pipe.fit(X_trn.drop(columns='Date'), y_trn.drop(columns='Date').values)
print(model_pipe.score(X_tst.drop(columns='Date'), y_tst.drop(columns='Date').values))

[error traceback screenshots]

@weixuanfu
Contributor

weixuanfu commented Jun 3, 2019

Related to issue #876

It seems that there is a kind of threading deadlock issue (maybe related to this old issue in joblib). Could you please try updating joblib (>0.13.2) and scikit-learn (>=0.21) via conda or pip and reinstalling the TPOT development branch via the command below?

pip install --upgrade --no-deps --force-reinstall git+https://github.com/EpistasisLab/tpot.git@development

We recently noticed that the internal joblib module (based on an older version of joblib) in scikit-learn (<0.20) was deprecated (see #867) and may cause the issue here, because it did not have some important updates about limiting the number of threads in joblib (>0.12, see the joblib change log). Let me know if this solution works or not.
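One stdlib-only way to confirm which versions are actually installed before reinstalling (a minimal sketch; the package names are the PyPI distribution names):

```python
from importlib.metadata import version, PackageNotFoundError

# Report the installed versions of the packages relevant to this
# issue; print a note instead of crashing if one is missing.
for pkg in ("joblib", "scikit-learn", "TPOT"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```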

weixuanfu added the bug label Jun 3, 2019
@Whamp
Author

Whamp commented Jun 3, 2019

Unfortunately, I'm already running:
joblib = 0.13.2
scikit-learn = 0.21.1

TPOT = 0.10.1
just in case dask is involved:
dask = 1.2.2
dask-core = 1.2.2
dask-glm = 0.1.0
dask-ml = 0.13.0

I'll create a new conda env with the dev branch and revert back to you.

@weixuanfu
Contributor

TPOT (v0.10.1) only uses the joblib bundled inside scikit-learn instead of the standalone joblib (0.13). The development branch of TPOT should use the standalone joblib module due to the recently merged #867.

@Whamp
Author

Whamp commented Jun 3, 2019

TPOT (v0.10.1) only uses the joblib bundled inside scikit-learn instead of the standalone joblib (0.13). The development branch of TPOT should use the standalone joblib module due to the recently merged #867.

Oh, OK, perfect. I'll let you know once I've tested on the dev branch. And thank you for all your help troubleshooting, by the way.

@Whamp
Author

Whamp commented Jun 3, 2019

OK, I re-ran in a new environment with the dev branch of TPOT and the behavior is unchanged. Is there a sample dataset that is known to work for TPOT regression with time-series CV splits?

I guess I can always add a dummy column of dates to the Boston dataset just to get things running. Is there a base parameter configuration you would recommend for this?
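A minimal sketch of that dummy-date idea (synthetic features stand in for the Boston data here so the snippet is self-contained; the column names and date range are arbitrary):

```python
import numpy as np
import pandas as pd

# Fabricate a small regression frame standing in for the Boston
# data, then bolt on a dummy month-end Date column so the
# time-series splitter has dates to split by.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(120, 3)), columns=["f0", "f1", "f2"])
X["Date"] = pd.date_range("2000-01-31", periods=len(X), freq="M")
y = pd.Series(2.0 * X["f0"] - X["f1"] + rng.normal(scale=0.1, size=len(X)))

print(X.shape)  # (120, 4)
```

The `Date` values carry no signal; they exist only so `splitter.split(X)` produces valid train/test index pairs for a quick end-to-end run.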

@weixuanfu
Contributor

weixuanfu commented Jun 4, 2019

Your configuration above looks fine to me. Also, you may try config_dict="TPOT light" and/or n_jobs=4 for a quick test on your dataset or the Boston dataset to check whether this is a resource problem:

model_pipe = tpot.TPOTRegressor(generations=100, 
                   population_size=100,                        
                   offspring_size=None, 
                   mutation_rate=0.9,                       
                   crossover_rate=0.1,
                   scoring='neg_mean_squared_error',
                   cv=list(splitter.split(X_trn)),   
                   subsample=1.0,
                   n_jobs=4,                        
                   max_time_mins=24*60,              
                   max_eval_time_mins=20,           
                   random_state=42,                  
                   config_dict="TPOT light",
                   template="RandomTree",
                   warm_start=False,
                   memory='auto',                    
                   use_dask=True,                    
                   periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',
                   early_stop=10,                    
                   verbosity=3,                      
                   disable_update_check=False)

@Whamp
Author

Whamp commented Jun 10, 2019

So I've been able to test the dev branch a little more. I was able to get TPOT to return exported pipelines with the standard TPOT config and these parameters. I think n_jobs=4 was an important change from n_jobs=-2, but it's tough to tell with the silent failures and no error messages.

model_pipe = tpot.TPOTRegressor(generations=1, 
                                population_size=2,                        
                                offspring_size=None, 
                                mutation_rate=0.9,                       
                                crossover_rate=0.1,
                                scoring='neg_mean_squared_error', 
                                cv=trn_val_split_idxs,                                         
                                subsample=1.0,
                                n_jobs= 4,                                                     
                                max_time_mins=None,                                            
                                max_eval_time_mins=5,                                        
                                random_state=42,                                            
                                config_dict=None,
                                template="RandomTree",
                                warm_start=False,
                                memory='auto',                                                 
                                use_dask=True,                                                
                                periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',   
                                early_stop=10,                                                 
                                verbosity=4,                                                  
                                disable_update_check=False)

But when I tried to run the TPOT light config, I got the same silent failure where TPOT thought it was running but no jobs were running on the CPU.

@weixuanfu
Contributor

Does the solution (#876 (comment)) from another issue work for you?

@Whamp
Author

Whamp commented Jun 15, 2019

Does the solution (#876 (comment)) from another issue work for you?

That had no effect for me. It's never made it beyond 0% on the progress bar.
