
Add feature selection operators to select categorical or continuous features #549

Closed
rhiever opened this issue Aug 16, 2017 · 21 comments

@rhiever
Contributor

rhiever commented Aug 16, 2017

Currently, TPOT operators act on the full dataset regardless of the feature types. For example, if TPOT decides to apply a OneHotEncoder on the data, then the OneHotEncoder will be applied to all features---even continuous features. This issue can be problematic for datasets with mixed feature types.

We should look into adding feature selection operators that select only continuous or only categorical features. These operators will allow TPOT to subset the features by feature type and (hopefully) further apply appropriate transformations to those features.

An example pipeline I can imagine with these operators is, e.g.,

input --> select continuous  --> PCA ------------------|
                                                       |--> RandomForest
input --> select categorical --> OneHotEncoder --------|

This should be a simple addition to the current version of TPOT, as all it requires is two new built-in feature selection operators. We could have the operators treat a feature as categorical if it has < 10 levels, and as continuous otherwise. The threshold could be a parameter for these operators, with a default of 10.
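The threshold rule above is easy to sketch outside TPOT. The helper below is illustrative only (the name `select_by_type` and its signature are my own, not TPOT's actual CategoricalSelector/ContinuousSelector operators):

```python
import numpy as np

def select_by_type(X, threshold=10, categorical=True):
    # Illustrative sketch only -- not TPOT's actual implementation.
    # A column counts as categorical when it has fewer than `threshold`
    # unique values, and as continuous otherwise.
    X = np.asarray(X)
    n_unique = np.array([len(np.unique(X[:, j])) for j in range(X.shape[1])])
    mask = n_unique < threshold if categorical else n_unique >= threshold
    return X[:, mask]

# toy data: column 0 is binary (2 levels), column 1 takes 20 distinct values
X = np.column_stack([np.tile([0, 1], 10), np.arange(20, dtype=float)])
print(select_by_type(X, categorical=True).shape)   # (20, 1)
print(select_by_type(X, categorical=False).shape)  # (20, 1)
```

Each selector keeps only its half of the columns, so downstream operators (PCA, OneHotEncoder, etc.) see a homogeneous block.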

@JuanuMusic

JuanuMusic commented Oct 11, 2017

Another useful approach would be to detect categorical dtypes on pandas DataFrames. Of course, this should go along with good documentation of the behaviour.

Also, there could be a parameter indicating the column indices or names of the categorical features.
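Combining dtype-based detection with a user-supplied column list could look roughly like this (the `categorical_columns` helper and its `declared` parameter are hypothetical, not TPOT API):

```python
import pandas as pd

def categorical_columns(df, declared=None):
    # `declared` is a hypothetical user-supplied list of column names;
    # the rest is inferred from the pandas dtypes ('category' or object).
    declared = set(declared or [])
    inferred = {c for c in df.columns
                if isinstance(df[c].dtype, pd.CategoricalDtype)
                or df[c].dtype == object}
    return sorted(declared | inferred)

df = pd.DataFrame({"sex": ["male", "female", "female"],
                   "pclass": pd.Categorical([1, 2, 3]),
                   "fare": [7.25, 71.28, 7.92]})
print(categorical_columns(df))                     # ['pclass', 'sex']
print(categorical_columns(df, declared=["fare"]))  # ['fare', 'pclass', 'sex']
```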

@lstmemery

Hey there. I would like to take a crack at this feature.

@weixuanfu
Contributor

@harkdev please check the PR #560. I will rebase this PR on version 0.9.

@JuanuMusic

@weixuanfu excellent!
I'll see if I can give it a go when it's merged, and test it!
I still have to figure out how to install packages from GitHub in my Anaconda environment.

@rafaelnovello

Hi guys!
I have tried the solution shown here with the Titanic dataset. I believe that with this PR and the right config, TPOT should apply one-hot encoding automatically on my data, but I got the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-60-4ca640247df0> in <module>()
     14 tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, config_dict='TPOT sparse')
     15 
---> 16 tpot.fit(df.drop(['Name', 'Ticket', 'Survived', 'Cabin', 'PassengerId', 'Age', 'Fare'], axis=1).as_matrix(), df.Survived.as_matrix())

~/.virtualenvs/playground/lib/python3.6/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    540         """
    541 
--> 542         features, target = self._check_dataset(features, target)
    543 
    544         # Randomly collect a subsample of training samples for pipeline optimization process.

~/.virtualenvs/playground/lib/python3.6/site-packages/tpot/base.py in _check_dataset(self, features, target)
   1020                 )
   1021         else:
-> 1022             if np.any(np.isnan(features)):
   1023                 self._imputed = True
   1024                 features = self._impute_values(features)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I have tried the following code:

import numpy as np
import pandas as pd

from tpot import TPOTClassifier
from tpot.config import classifier_config_dict

url = "https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv"

df = pd.read_csv(url, sep=None, engine='python')

classifier_config_dict['tpot.builtins.CategoricalSelector'] = {
    'threshold': [10],
    'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
    'sparse': [False]
}

classifier_config_dict['tpot.builtins.ContinuousSelector'] = {
    'threshold': [10],
    'svd_solver': ['randomized'],
    'iterated_power': range(1, 11)
}

tpot = TPOTClassifier(
    generations=5,
    population_size=50,
    verbosity=2,
    config_dict=classifier_config_dict
)

tpot.fit(df.drop('Survived', axis=1), df.Survived)

I'm using Ubuntu 18.04, Python 3.6.5 and the following libs:

Package            Version  
------------------ ---------    
numpy              1.14.3   
pandas             0.22.0     
pip                10.0.1   
scikit-learn       0.19.1   
scipy              1.1.0       
TPOT               0.9.3      
xgboost            0.71  

Sorry if this is not the right place to post this!

@weixuanfu
Contributor

Hmm, the traceback shows config_dict='TPOT sparse', which is different from your example.

I think the issue can be fixed if the last line of code is tpot.fit(df.drop('Survived', axis=1).values, df.Survived.values). tpot.fit() only takes np.ndarray as input for now.

@rafaelnovello

Sorry about the config_dict='TPOT sparse'. It was a mistake!

I have tried your suggestion but I got the same error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-109-a060d5622dbf> in <module>()
     18 )
     19 
---> 20 tpot.fit(df.drop('Survived', axis=1).values, df.Survived.values)

~/.virtualenvs/playground/lib/python3.6/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    540         """
    541 
--> 542         features, target = self._check_dataset(features, target)
    543 
    544         # Randomly collect a subsample of training samples for pipeline optimization process.

~/.virtualenvs/playground/lib/python3.6/site-packages/tpot/base.py in _check_dataset(self, features, target)
   1020                 )
   1021         else:
-> 1022             if np.any(np.isnan(features)):
   1023                 self._imputed = True
   1024                 features = self._impute_values(features)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

This time I used the classifier_config_dict defined in the first post.

I believe the problem is the array's dtype (dtype=object), but it was defined as object because it contains mixed data types (int, float, str):

df.drop('Survived', axis=1).values
array([[1, 3, 'Braund, Mr. Owen Harris', ..., 7.25, nan, 'S'],
       [2, 1, 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', ...,
        71.2833, 'C85', 'C'],
       [3, 3, 'Heikkinen, Miss. Laina', ..., 7.925, nan, 'S'],
       ...,
       [154, 3, 'van Billiard, Mr. Austin Blyler', ..., 14.5, nan, 'S'],
       [155, 3, 'Olsen, Mr. Ole Martin', ..., 7.3125, nan, 'S'],
       [156, 1, 'Williams, Mr. Charles Duane', ..., 61.3792, nan, 'C']],
      dtype=object)
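The traceback can be reproduced in isolation: np.isnan has no ufunc loop for dtype=object arrays, while pd.isnull checks element-wise. This is just a minimal demonstration of the failure mode, not a TPOT fix:

```python
import numpy as np
import pandas as pd

# mixed-type rows like the Titanic matrix above force dtype=object
mixed = np.array([[1, 'Braund', 7.25],
                  [2, 'Cumings', np.nan]], dtype=object)

# np.isnan has no loop for object arrays, hence the TypeError in the traceback
raised = False
try:
    np.isnan(mixed)
except TypeError:
    raised = True

# pd.isnull handles object arrays element-wise
mask = pd.isnull(mixed)
print(raised)      # True
print(mask[1, 2])  # True (the nan cell)
```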

@weixuanfu
Contributor

weixuanfu commented May 15, 2018

Oh, I see what happened. The input of tpot.fit() needs to be a numeric array. You need to convert string values, like "Braund, Mr. Owen Harris", to numbers (like 1, 2 or 3), or just remove the feature if it is not important.
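One way to follow this advice before calling tpot.fit() is pd.factorize, which maps each string column to integer codes. This is a sketch on toy columns named after the Titanic example, a preprocessing suggestion rather than a TPOT feature:

```python
import pandas as pd

# toy columns named after the Titanic example above
df = pd.DataFrame({"Sex": ["male", "female", "female"],
                   "Embarked": ["S", "C", "S"],
                   "Fare": [7.25, 71.2833, 7.925]})

# encode each string column as integer codes; free-text columns such as
# 'Name' carry little signal and are better dropped beforehand
for col in df.select_dtypes(include="object"):
    df[col], _ = pd.factorize(df[col])

print(df["Sex"].tolist())       # [0, 1, 1]
print(df["Embarked"].tolist())  # [0, 1, 0]
```

After this, every column is numeric and df.values can be passed to tpot.fit() without the isnan TypeError.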

@rafaelnovello

Nice! That's the point! Shouldn't TPOT address this itself, e.g. by applying a LabelEncoder or something similar? Isn't that what this feature is about?
Thanks for the help!

@thedatadecoder

Has the issue in TPOT with categorical variables been resolved?

@weixuanfu
Contributor

Two new operators related to this issue were merged into the dev branch.

@thedatadecoder

Hey weixuanfu, can you shed some light on them, please?

@weixuanfu
Contributor

weixuanfu commented Jul 10, 2018

@soorma7 please check the comments in PR #560

@thedatadecoder


[screenshot from 2018-07-10 19-01-07]

ok, can you tell me why it's stuck here?

@weixuanfu
Contributor

Hmm, not sure. How large is your dataset?

@thedatadecoder

It's 200,000 rows × 33 columns.

@weixuanfu
Contributor

Hmm, it may take a while for each pipeline. You could try the 'TPOT light' configuration instead, or test it outside Jupyter. There is an unsolved issue where TPOT may get stuck in Jupyter notebooks (#645).

@thedatadecoder

Ok, I'll try both. Thanks.

@almandsky

@weixuanfu is this issue still open? Does it depend on issue #756?

Is there any workaround, e.g. specifying the categorical columns via categorical_features?

@weixuanfu
Contributor

Hmm, I should close this issue since those two operators were added to the current version of TPOT (v0.9.5).

You can change the configuration of the built-in OneHotEncoder (see the default one here) to specify the categorical columns. Below is a demo for updating the configuration of OneHotEncoder in TPOT:

from tpot.config import classifier_config_dict
# assume that col. 2-4 in input features are categorical_features
classifier_config_dict['tpot.builtins.OneHotEncoder'] = {
    'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
    'sparse': [False],
    'threshold': [10],
    'categorical_features': [[1, 2, 3]]
}


tpot = TPOTClassifier(config_dict=classifier_config_dict)

Please see more details about customizing TPOT's operators and parameters.

@almandsky

Thanks @weixuanfu !
