TPOTRegressor thinks it's running, htop disagrees #875
Possible user error since I'm a new TPOT user, but I've tried to run what I expected to be a large job, and I think the job has failed even though the kernel is still busy, and I haven't received any errors. I think it failed because it went from occupying all CPU processes and nearly all the memory to none.
I've run the Boston training set with no problems, so I think my configuration is good.
Here is my code for the job I think failed. I'm running in a Jupyter notebook, and it's been running for about 4.5 hours.
I'm running tpot 0.10.1, installed via conda.
Output from htop on Ubuntu 18.04: [htop screenshot]
I wanted to highlight this as I searched and couldn't find a similar issue.
Comments
How large is your dataset? I suspect the dask backend may have crashed somehow after using all the resources. I will double-check it next week. |
So I don't think the size is the problem; I think it's only 2 GB. I think the issue may be the time-series cross-validation that I'm using. I'm splitting the DataFrame by date, with 1964 to 2012 in the training set and each month from 2012 to 2018 as a test split, so roughly 72 monthly cross-validation splits, which I assume means each model needs to be retrained 72 times. That's sort of dumb, but also necessary for my problem. Here is my code for that:

import pandas as pd
from pandas.tseries.offsets import MonthEnd
import numpy as np

class TimeSeriesSplitMonthLag():
    def __init__(self, date_col='Date', init_trn_date=None, lag=12):
        """
        date_col: string name of the column in the dataframe containing dates
        init_trn_date: string in format mm/dd/yyyy for the first train split
        lag: number of months to lag the test set forward, to handle forward-lagged targets
        Example:
        >>> splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date="05/31/2012", lag=12)
        >>> for trn_idx, tst_idx in splitter.split(df):
        ...     print(f"trn index max: {trn_idx.max()}, month {df.loc[trn_idx, 'Date'].max()}")
        ...     print(f"tst index min: {tst_idx.min()}, month {df.loc[tst_idx, 'Date'].min()}")
        trn index max: 1344164, month 2018-02-28
        tst index min: 1339, month 2019-02-28
        trn index max: 1344262, month 2018-03-31
        tst index min: 1340, month 2019-03-31
        trn index max: 1344483, month 2018-04-30
        tst index min: 1341, month 2019-04-30
        """
        self.init_trn_date = pd.to_datetime(init_trn_date)
        self.lag = lag
        self.Date = date_col

    def split(self, df):
        """
        df: dataframe in single-index format
        """
        # The last usable train month is `lag` months before the latest date in the data.
        max_tst_date = pd.to_datetime(df[self.Date].unique()).max()
        max_trn_date = max_tst_date - MonthEnd(self.lag)
        trn_date_range = pd.date_range(start=self.init_trn_date, end=max_trn_date, freq='M')
        for date in trn_date_range:
            # Train on everything up to `date`; test on the month `lag` months later.
            trn_idxs = df.loc[df[self.Date] <= date, :].index.values
            tst_idxs = df.loc[df[self.Date] == date + MonthEnd(self.lag), :].index.values
            yield (trn_idxs, tst_idxs)

splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date='05/31/2012', lag=12)

# passed as a parameter to tpot:
cv = list(splitter.split(X_trn))

I'm going to try running a small sample of the dataset with fewer monthly CV splits just to test. |
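For that kind of quick test, one option is to subsample the rows and truncate the split list before passing it to TPOT. A minimal sketch of my own (not from the thread; the 10% sample fraction and the 12-split cap are arbitrary choices):

# Hypothetical quick-test setup: sample 10% of rows and reset to a positional
# index so the splitter's .index.values line up with the arrays TPOT sees.
sample_idx = X_trn.sample(frac=0.1, random_state=42).index
X_small = X_trn.loc[sample_idx].reset_index(drop=True)
y_small = y_trn.loc[sample_idx].reset_index(drop=True)

splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date='05/31/2012', lag=12)
cv_small = list(splitter.split(X_small))[-12:]  # keep only the 12 most recent splits

Truncating the list keeps each split honest (every fold still trains on all history up to its cutoff) while cutting the retraining count from roughly 72 to 12.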
Just to update, I tried running with only 12 monthly cross-validation splits, with 1 generation and a population of 2, and it had the same behavior: it never shows a completed pipeline, and CPU usage goes from heavy to nothing. I tried running with dask on and off, with the same behavior. |
Changing a few more parameters and setting n_jobs to 4 from -2, I finally got an error message:

model_pipe = tpot.TPOTRegressor(generations=1,
                                population_size=2,
                                offspring_size=None,
                                mutation_rate=0.9,
                                crossover_rate=0.1,
                                scoring='neg_mean_squared_error',
                                cv=list(splitter.split(X_trn)),
                                subsample=1.0,
                                n_jobs=4,
                                max_time_mins=24*60,
                                max_eval_time_mins=20,
                                random_state=42,
                                config_dict=None,
                                template="RandomTree",
                                warm_start=False,
                                memory='auto',
                                use_dask=False,
                                periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',
                                early_stop=10,
                                verbosity=4,
                                disable_update_check=False)
model_pipe.fit(X_trn.drop(columns='Date'), y_trn.drop(columns='Date').values)
print(model_pipe.score(X_tst.drop(columns='Date'), y_tst.drop(columns='Date').values)) |
Related to issue #876. It seems there is a kind of threading deadlock (maybe related to this old issue in joblib). Could you please update joblib (>0.13.2) and scikit-learn (>=0.21) via conda or pip, and reinstall the TPOT development branch via the command below?

pip install --upgrade --no-deps --force-reinstall git+https://github.com/EpistasisLab/tpot.git@development

We recently noticed that the internal joblib module (based on an older version of joblib) in scikit-learn (<0.20) was deprecated (see #867) and may cause the issue here, because it did not have some important updates about limiting the number of threads in joblib (>0.12; see the joblib changelog). LMK if this solution works or not. |
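A quick way to confirm that the environment actually picked up the suggested versions after the reinstall (a minimal check of my own, not part of the original reply):

import joblib
import sklearn
import tpot

print("joblib:", joblib.__version__)         # should be > 0.13.2
print("scikit-learn:", sklearn.__version__)  # should be >= 0.21
print("TPOT:", tpot.__version__)             # dev build after the reinstall command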
Unfortunately, I'm already running TPOT 0.10.1. I'll create a new conda env with the dev branch and get back to you. |
TPOT (v0.10.1) only uses the joblib built into scikit-learn instead of the standalone joblib (0.13). The development branch of TPOT should use the standalone joblib module, due to a recently merged PR (#867). |
Oh OK, perfect, I'll let you know once I've tested on the dev branch. And thank you for all your help troubleshooting, by the way. |
OK, I re-ran in a new environment with the dev branch of TPOT and the behavior is unchanged. Is there a sample dataset that is known to work for TPOT regression with time-series CV splits? I guess I can always add a dummy column of dates to the Boston dataset just to get things running (along the lines of the sketch below). Is there a base parameter configuration you would recommend for this? |
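For reference, a dummy-date version of the Boston dataset could look like this (my own sketch, not from the thread; the date range and lag are arbitrary, and load_boston has since been removed from scikit-learn, but it was available in the versions discussed here):

import pandas as pd
from sklearn.datasets import load_boston  # present in scikit-learn versions of this era

# Wrap the Boston data in a DataFrame and attach a synthetic month-end Date column.
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target, name='target')

# Cycle 60 month-end dates across the rows so every month has observations.
dates = pd.date_range('1960-01-31', periods=60, freq='M')
X['Date'] = [dates[i % len(dates)] for i in range(len(X))]

splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date='01/31/1962', lag=12)
cv = list(splitter.split(X))  # pass as TPOT's cv; remember to drop 'Date' before fit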
Your configuration above looks fine to me. Also, you may try config_dict="TPOT light" and/or n_jobs=4 for a quick test on your dataset or the Boston dataset, to check whether this is a resource problem:

model_pipe = tpot.TPOTRegressor(generations=100,
                                population_size=100,
                                offspring_size=None,
                                mutation_rate=0.9,
                                crossover_rate=0.1,
                                scoring='neg_mean_squared_error',
                                cv=list(splitter.split(X_trn)),
                                subsample=1.0,
                                n_jobs=4,
                                max_time_mins=24*60,
                                max_eval_time_mins=20,
                                random_state=42,
                                config_dict="TPOT light",
                                template="RandomTree",
                                warm_start=False,
                                memory='auto',
                                use_dask=True,
                                periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',
                                early_stop=10,
                                verbosity=3,
                                disable_update_check=False) |
So I've been able to test the dev branch a little more. I was able to get TPOT to return exported pipelines with the standard TPOT config and these parameters. I think n_jobs=4 was an important change from n_jobs=-2, but it's tough to tell with the silent failures and no error messages.

model_pipe = tpot.TPOTRegressor(generations=1,
                                population_size=2,
                                offspring_size=None,
                                mutation_rate=0.9,
                                crossover_rate=0.1,
                                scoring='neg_mean_squared_error',
                                cv=trn_val_split_idxs,
                                subsample=1.0,
                                n_jobs=4,
                                max_time_mins=None,
                                max_eval_time_mins=5,
                                random_state=42,
                                config_dict=None,
                                template="RandomTree",
                                warm_start=False,
                                memory='auto',
                                use_dask=True,
                                periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',
                                early_stop=10,
                                verbosity=4,
                                disable_update_check=False)

But when I tried to run the TPOT light config, I got the same silent failure where TPOT thought it was running but no jobs were running on the CPU. |
Does the solution (#876 (comment)) from another issue work for you? |
That had no effect for me. It's never made it beyond 0% on the progress bar. |