Add feature selection operators to select categorical or continuous features #549
Comments
Another useful approach is to detect categorical dtypes on pandas DataFrames. Of course, this should come with good documentation of the behaviour. There could also be a parameter giving the column indices or names of the categorical features. |
Hey there. I would like to take a crack at this feature. |
@weixuanfu excellent! |
Hi guys!
I have tried the following code:
I'm using Ubuntu 18.04, Python 3.6.5 and the following libs:
Sorry if this is not the right place to post this! |
Hmm, stderr showed I think the issue can be fixed if the last line of codes is |
Sorry for the I have tried your suggestion but I got the same error:
I have used the I believe the problem is the array type (
|
oh, I understood what happened. The input of |
Nice! That is the point! Shouldn't TPOT address this issue itself, for example by applying a LabelEncoder or something similar? Isn't that what this functionality is about? |
Has the issue in TPOT with categorical variables been resolved? |
Two new operators related to this issue were merged to dev branch. |
Hey weixuanfu, Can you throw some light upon them please? |
@soorma7 please check the comments in PR #560 |
Hmm, not sure. How large is your dataset? |
It's 200,000 rows x 33 columns. |
Hmm, it may take a while for each pipeline. You could try the 'TPOT light' configuration instead, or test it outside Jupyter. There is an unsolved issue where TPOT may get stuck in a Jupyter notebook: #645 |
Ok, I'll try both. Thanks. |
@weixuanfu is this issue still open? Does it depend on issue #756? Is there any workaround, e.g. specifying the categorical columns via |
Hmm I should close this issue since those two operators were added to the current version of TPOT (v0.9.5). You can change the configuration of the built-in OneHotEncoder, for example:
from tpot import TPOTClassifier
from tpot.config import classifier_config_dict
# assume that columns 2-4 (zero-based indices 1-3) of the input features are categorical
classifier_config_dict['tpot.builtins.OneHotEncoder'] = {
    'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
    'sparse': [False],
    'threshold': [10],
    'categorical_features': [[1, 2, 3]]
}
tpot = TPOTClassifier(config_dict=classifier_config_dict)
Please see more details about customizing TPOT's operators and parameters |
Thanks @weixuanfu ! |
Currently, TPOT operators act on the full dataset regardless of the feature types. For example, if TPOT decides to apply a OneHotEncoder on the data, then the OneHotEncoder will be applied to all features---even continuous features. This issue can be problematic for datasets with mixed feature types.
We should look into adding feature selection operators that select only continuous or only categorical features. These operators will allow TPOT to subset the features by feature type and (hopefully) further apply appropriate transformations to those features.
An example pipeline I can imagine with these operators is, e.g.,
input --> select continuous --> PCA ----------------
|-----> RandomForest
input --> select categorical --> OneHotEncoder ---
This should be a simple addition to the current version of TPOT, as all it requires is the addition of two new built-in feature selection operators. We could have the operators decide that a feature is categorical if it has < 10 levels, and it is continuous otherwise. We can make the threshold value a parameter for these operators, with a default of 10.
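The level-count rule proposed above could be sketched roughly as follows. This is only an illustration in plain Python, not TPOT's implementation: the helper names `split_feature_indices` and `select_columns` are hypothetical, and the data is assumed to be a list of equal-length rows.

```python
from typing import List, Sequence, Tuple


def split_feature_indices(
    rows: Sequence[Sequence], threshold: int = 10
) -> Tuple[List[int], List[int]]:
    """Return (categorical, continuous) column indices.

    A column counts as categorical if it has fewer than `threshold`
    distinct values (levels), and as continuous otherwise.
    """
    if not rows:
        return [], []
    categorical, continuous = [], []
    for j in range(len(rows[0])):
        n_levels = len({row[j] for row in rows})
        if n_levels < threshold:
            categorical.append(j)
        else:
            continuous.append(j)
    return categorical, continuous


def select_columns(rows: Sequence[Sequence], indices: Sequence[int]) -> List[list]:
    """Keep only the columns listed in `indices`, preserving row order."""
    return [[row[j] for j in indices] for row in rows]


# Toy data: column 0 has 3 levels, column 1 has 20 distinct values.
rows = [[i % 3, i * 0.5] for i in range(20)]
categorical, continuous = split_feature_indices(rows, threshold=10)
categorical_part = select_columns(rows, categorical)  # would feed a one-hot encoder
continuous_part = select_columns(rows, continuous)    # would feed e.g. PCA
```

With the default threshold of 10, an integer-coded ID column with many distinct values would be treated as continuous, so exposing the threshold as a tunable parameter, as suggested above, seems worthwhile.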