
Add feature selection operators to select categorical or continuous features #549

Closed
rhiever opened this issue Aug 16, 2017 · 21 comments

@rhiever
Contributor

rhiever commented Aug 16, 2017

Currently, TPOT operators act on the full dataset regardless of the feature types. For example, if TPOT decides to apply a OneHotEncoder on the data, then the OneHotEncoder will be applied to all features---even continuous features. This issue can be problematic for datasets with mixed feature types.

We should look into adding feature selection operators that select only continuous or only categorical features. These operators will allow TPOT to subset the features by feature type and (hopefully) further apply appropriate transformations to those features.

An example pipeline I can imagine with these operators is, e.g.,

input --> select continuous  --> PCA ------------------|
                                                       |--> RandomForest
input --> select categorical --> OneHotEncoder --------|

This should be a simple addition to the current version of TPOT, as all it requires is two new built-in feature selection operators. We could have the operators treat a feature as categorical if it has < 10 levels, and as continuous otherwise. The threshold could be a parameter for these operators, with a default of 10.
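The threshold rule above is easy to sketch outside TPOT. The helper below is illustrative only (the name `select_by_type` and its signature are my own, not TPOT's actual CategoricalSelector/ContinuousSelector operators):

```python
import numpy as np

def select_by_type(X, threshold=10, categorical=True):
    # Illustrative sketch only -- not TPOT's actual implementation.
    # A column counts as categorical when it has fewer than `threshold`
    # unique values, and as continuous otherwise.
    X = np.asarray(X)
    n_unique = np.array([len(np.unique(X[:, j])) for j in range(X.shape[1])])
    mask = n_unique < threshold if categorical else n_unique >= threshold
    return X[:, mask]

# toy data: column 0 is binary (2 levels), column 1 takes 20 distinct values
X = np.column_stack([np.tile([0, 1], 10), np.arange(20, dtype=float)])
print(select_by_type(X, categorical=True).shape)   # (20, 1)
print(select_by_type(X, categorical=False).shape)  # (20, 1)
```

Each selector keeps only its half of the columns, so downstream operators (PCA, OneHotEncoder, etc.) see a homogeneous block.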

@JuanuMusic

JuanuMusic commented Oct 11, 2017

Another useful approach would be to detect categorical dtypes on pandas DataFrames. Of course, this should go along with good documentation of the behaviour.

Also, there could be a parameter indicating the column indices or names of the categorical features.
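Combining dtype-based detection with a user-supplied column list could look roughly like this (the `categorical_columns` helper and its `declared` parameter are hypothetical, not TPOT API):

```python
import pandas as pd

def categorical_columns(df, declared=None):
    # `declared` is a hypothetical user-supplied list of column names;
    # the rest is inferred from the pandas dtypes ('category' or object).
    declared = set(declared or [])
    inferred = {c for c in df.columns
                if isinstance(df[c].dtype, pd.CategoricalDtype)
                or df[c].dtype == object}
    return sorted(declared | inferred)

df = pd.DataFrame({"sex": ["male", "female", "female"],
                   "pclass": pd.Categorical([1, 2, 3]),
                   "fare": [7.25, 71.28, 7.92]})
print(categorical_columns(df))                     # ['pclass', 'sex']
print(categorical_columns(df, declared=["fare"]))  # ['fare', 'pclass', 'sex']
```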

@lstmemery

Hey there. I would like to take a crack at this feature.

@weixuanfu
Contributor

@harkdev please check the PR #560. I will rebase this PR on version 0.9.

@JuanuMusic

@weixuanfu excellent!
I'll see if I can give it a go when it's merged, and test it!
I still have to figure out how to install packages from GitHub in my Anaconda environment.

@rafaelnovello

Hi guys!
I have tried the solution shown here with the Titanic dataset. I believe that with this PR and the right config, TPOT should apply one-hot encoding automatically on my data, but I got the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-60-4ca640247df0> in <module>()
     14 tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, config_dict='TPOT sparse')
     15 
---> 16 tpot.fit(df.drop(['Name', 'Ticket', 'Survived', 'Cabin', 'PassengerId', 'Age', 'Fare'], axis=1).as_matrix(), df.Survived.as_matrix())

~/.virtualenvs/playground/lib/python3.6/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    540         """
    541 
--> 542         features, target = self._check_dataset(features, target)
    543 
    544         # Randomly collect a subsample of training samples for pipeline optimization process.

~/.virtualenvs/playground/lib/python3.6/site-packages/tpot/base.py in _check_dataset(self, features, target)
   1020                 )
   1021         else:
-> 1022             if np.any(np.isnan(features)):
   1023                 self._imputed = True
   1024                 features = self._impute_values(features)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I have tried the following code:

import numpy as np
import pandas as pd

from tpot import TPOTClassifier
from tpot.config import classifier_config_dict

url = "https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv"

df = pd.read_csv(url, sep=None, engine='python')

classifier_config_dict['tpot.builtins.CategoricalSelector'] = {
    'threshold': [10],
    'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
    'sparse': [False]
}

classifier_config_dict['tpot.builtins.ContinuousSelector'] = {
    'threshold': [10],
    'svd_solver': ['randomized'],
    'iterated_power': range(1, 11)
}

tpot = TPOTClassifier(
    generations=5,
    population_size=50,
    verbosity=2,
    config_dict=classifier_config_dict
)

tpot.fit(df.drop('Survived', axis=1), df.Survived)

I'm using Ubuntu 18.04, Python 3.6.5 and the following libs:

Package            Version  
------------------ ---------    
numpy              1.14.3   
pandas             0.22.0     
pip                10.0.1   
scikit-learn       0.19.1   
scipy              1.1.0       
TPOT               0.9.3      
xgboost            0.71  

Sorry if this is not the right place to post this!

@weixuanfu
Contributor

Hmm, the traceback shows config_dict='TPOT sparse', which is different from your example.

I think the issue can be fixed if the last line of code is tpot.fit(df.drop('Survived', axis=1).values, df.Survived.values). tpot.fit() only takes np.ndarray as input for now.

@rafaelnovello

Sorry about the config_dict='TPOT sparse'. It was a mistake!

I have tried your suggestion but I got the same error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-109-a060d5622dbf> in <module>()
     18 )
     19 
---> 20 tpot.fit(df.drop('Survived', axis=1).values, df.Survived.values)

~/.virtualenvs/playground/lib/python3.6/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    540         """
    541 
--> 542         features, target = self._check_dataset(features, target)
    543 
    544         # Randomly collect a subsample of training samples for pipeline optimization process.

~/.virtualenvs/playground/lib/python3.6/site-packages/tpot/base.py in _check_dataset(self, features, target)
   1020                 )
   1021         else:
-> 1022             if np.any(np.isnan(features)):
   1023                 self._imputed = True
   1024                 features = self._impute_values(features)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

This time I used the classifier_config_dict defined in the first post.

I believe the problem is the array's dtype (dtype=object), but it was defined as object because it contains mixed data types (int, float, str):

df.drop('Survived', axis=1).values
array([[1, 3, 'Braund, Mr. Owen Harris', ..., 7.25, nan, 'S'],
       [2, 1, 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', ...,
        71.2833, 'C85', 'C'],
       [3, 3, 'Heikkinen, Miss. Laina', ..., 7.925, nan, 'S'],
       ...,
       [154, 3, 'van Billiard, Mr. Austin Blyler', ..., 14.5, nan, 'S'],
       [155, 3, 'Olsen, Mr. Ole Martin', ..., 7.3125, nan, 'S'],
       [156, 1, 'Williams, Mr. Charles Duane', ..., 61.3792, nan, 'C']],
      dtype=object)
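The traceback can be reproduced in isolation: np.isnan has no ufunc loop for dtype=object arrays, while pd.isnull checks element-wise. This is just a minimal demonstration of the failure mode, not a TPOT fix:

```python
import numpy as np
import pandas as pd

# mixed-type rows like the Titanic matrix above force dtype=object
mixed = np.array([[1, 'Braund', 7.25],
                  [2, 'Cumings', np.nan]], dtype=object)

# np.isnan has no loop for object arrays, hence the TypeError in the traceback
raised = False
try:
    np.isnan(mixed)
except TypeError:
    raised = True

# pd.isnull handles object arrays element-wise
mask = pd.isnull(mixed)
print(raised)      # True
print(mask[1, 2])  # True (the nan cell)
```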

@weixuanfu
Contributor

weixuanfu commented May 15, 2018

Oh, I see what happened. The input of tpot.fit() needs to be a numeric array. You need to convert string values, like "Braund, Mr. Owen Harris", to numbers (like 1, 2 or 3), or just remove the feature if it is not important.
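One way to follow this advice before calling tpot.fit() is pd.factorize, which maps each string column to integer codes. This is a sketch on toy columns named after the Titanic example, a preprocessing suggestion rather than a TPOT feature:

```python
import pandas as pd

# toy columns named after the Titanic example above
df = pd.DataFrame({"Sex": ["male", "female", "female"],
                   "Embarked": ["S", "C", "S"],
                   "Fare": [7.25, 71.2833, 7.925]})

# encode each string column as integer codes; free-text columns such as
# 'Name' carry little signal and are better dropped beforehand
for col in df.select_dtypes(include="object"):
    df[col], _ = pd.factorize(df[col])

print(df["Sex"].tolist())       # [0, 1, 1]
print(df["Embarked"].tolist())  # [0, 1, 0]
```

After this, every column is numeric and df.values can be passed to tpot.fit() without the isnan TypeError.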

@rafaelnovello

Nice! That's the point! Shouldn't TPOT address this itself, e.g. by applying a LabelEncoder or something similar? Isn't that what this feature is about?
Thanks for the help!

@thedatadecoder

Has the issue in TPOT with categorical variables been resolved?

@weixuanfu
Contributor

Two new operators related to this issue were merged into the dev branch.

@thedatadecoder

Hey weixuanfu, can you shed some light on them, please?

@weixuanfu
Contributor

weixuanfu commented Jul 10, 2018

@soorma7 please check the comments in PR #560

@thedatadecoder


[screenshot from 2018-07-10 19-01-07]

ok, can you tell me why it's stuck here?

@weixuanfu
Contributor

Hmm, not sure. How large is your dataset?

@thedatadecoder

It's 200,000 rows × 33 columns.

@weixuanfu
Contributor

Hmm, it may take a while for each pipeline. You could try the 'TPOT light' configuration instead, or test it outside Jupyter. There is an unsolved issue where TPOT may get stuck in Jupyter notebooks (#645).

@thedatadecoder

Ok, I'll try both. Thanks.

@almandsky

@weixuanfu is this issue still open? Does it depend on issue #756?

Is there any workaround, e.g. specifying the categorical columns via categorical_features?

@weixuanfu
Contributor

Hmm, I should close this issue since those two operators were added to the current version of TPOT (v0.9.5).

You can change the configuration of the built-in OneHotEncoder (see the default one here) to specify the categorical columns. Below is a demo for updating the configuration of OneHotEncoder in TPOT:

from tpot.config import classifier_config_dict
# assume that col. 2-4 in input features are categorical_features
classifier_config_dict['tpot.builtins.OneHotEncoder'] = {
    'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
    'sparse': [False],
    'threshold': [10],
    'categorical_features': [[1, 2, 3]]
}


tpot = TPOTClassifier(config_dict=classifier_config_dict)

Please see more details about customizing TPOT's operators and parameters.

@almandsky

Thanks @weixuanfu !
