[Question] Are there any alternatives to One-hot encoding? #1268

Open
SaffronWolf opened this issue Oct 10, 2021 · 6 comments
@SaffronWolf

Are there any alternatives to one-hot encoding for categorical features? I may be wrong, but I think one-hot encoding is the only choice available for encoding categorical features, and it doesn't make sense in every case. Extending with a custom component would also not work, because there is no way to specify a component specifically for categorical features.

@eddiebergman
Contributor

Hi @SaffronWolf,

Apologies for the delayed response. Currently we have ordinal encoding and one-hot encoding (here). What other kinds of encoding would you be considering for categorical data?

In theory it should be possible to add your own encoder. You would have to check the example or, more reliably, the source code, and create your own.

@SaffronWolf
Author

I would like to try out encoders from category-encoders such as CountEncoding and CatBoostEncoding. These are supposed to be applied only to categorical features, but I could not find anything in the API that would allow me to do that.

include={ 'data_preprocessor': ['CountEncoding'] },

This line will apply CountEncoding to all features, whereas it should be applied to categorical features only.
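
For reference, this is the behaviour I am after, using category-encoders directly outside of auto-sklearn (a minimal sketch; the column names are made up):

import pandas as pd
import category_encoders as ce

# Toy frame with one categorical and one numerical column
X = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "price": [1.0, 2.5, 3.0, 0.5],
})

# CountEncoder can be restricted to given columns via `cols`;
# "price" is passed through unchanged.
encoder = ce.CountEncoder(cols=["color"])
X_encoded = encoder.fit_transform(X)
print(X_encoded)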

@eddiebergman
Contributor

eddiebergman commented Oct 16, 2021

Unfortunately this doesn't seem possible right now; the relevant steps are hard-coded at the moment.

The only implemented data preprocessor is 'feature_type', a class that applies a set of actions to a column depending on whether it's numerical or categorical.

from autosklearn.pipeline.components.data_preprocessing import _preprocessors
print(_preprocessors)

# OrderedDict([('feature_type', autosklearn.pipeline.components.data_preprocessing.feature_type.FeatTypeSplit)])

We identify in FeatTypeSplit which columns are categorical or numerical and then apply either the CategoricalPreprocessingPipeline or the NumericalPreprocessingPipeline. The hard-coded steps are here.

My suggestion in the meantime, if you just need a quick-and-dirty solution, is to modify the source code to include what you need at the lines mentioned.

The more flexible solution is to create your own two classes identical to FeatTypeSplit and CategoricalPreprocessingPipeline, overwriting some methods, where your own FeatTypeSplitCustom would use CategoricalPreprocessingPipelineCustom instead of the default one.

from autosklearn.pipeline.components.data_preprocessing.feature_type import FeatTypeSplit
from autosklearn.pipeline.components.data_preprocessing.feature_type_categorical import CategoricalPreprocessingPipeline

class CategoricalPreprocessingPipelineCustom(CategoricalPreprocessingPipeline):

    def _get_pipeline_steps(self, dataset_properties):
        ... # overwrite with what you want your pipeline to look like

class FeatTypeSplitCustom(FeatTypeSplit):

    def __init__(
        self,
        config = None,
        pipeline = None,
        dataset_properties = None,
        include = None,
        exclude = None,
        random_state = None,
        init_params = None,
        feat_type = None,
        force_sparse_output = False,
        column_transformer = None,
    ):
        super().__init__(
            config=config,
            pipeline=pipeline,
            dataset_properties=dataset_properties,
            include=include,
            exclude=exclude,
            random_state=random_state,
            init_params=init_params,
            feat_type=feat_type,
            force_sparse_output=force_sparse_output,
            column_transformer=column_transformer
        )
        # Set the categorical pipeline part to use your custom one
        self.categ_ppl = CategoricalPreprocessingPipelineCustom(
            config=None, steps=pipeline, dataset_properties=dataset_properties,
            include=include, exclude=exclude, random_state=random_state,
            init_params=init_params)
        self._transformers = [
            ("categorical_transformer", self.categ_ppl),
            ("numerical_transformer", self.numer_ppl),
        ]

from autosklearn.pipeline.components.data_preprocessing import add_preprocessor, _addons
add_preprocessor(FeatTypeSplitCustom)
print(_addons)

# OrderedDict([('custom_feature_type', FeatTypeSplitCustom)])

You could then use this with include={ 'data_preprocessor': ['custom_feature_type'] }.
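
For example (an untested sketch, using the registered name printed above; swap the toy dataset for your own data with categorical features):

import autosklearn.classification
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    include={'data_preprocessor': ['custom_feature_type']},
)
automl.fit(X, y)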

Note, I haven't tested this code, but it should be enough to get you going with what you need. You may have to implement a wrapper around any custom steps, as we have done for one-hot encoding, as an example.
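
As a rough illustration of such a wrapper (untested; modelled on our one-hot encoding component, with category_encoders.CountEncoder swapped in):

from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import DENSE, SIGNED_DATA, UNSIGNED_DATA, INPUT

class CountEncoding(AutoSklearnPreprocessingAlgorithm):
    # Untested sketch wrapping category_encoders.CountEncoder as an
    # auto-sklearn preprocessing component.

    def __init__(self, random_state=None):
        self.preprocessor = None
        self.random_state = random_state

    def fit(self, X, y=None):
        import category_encoders as ce
        self.preprocessor = ce.CountEncoder()
        self.preprocessor.fit(X, y)
        return self

    def transform(self, X):
        if self.preprocessor is None:
            raise NotImplementedError()
        return self.preprocessor.transform(X)

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            'shortname': 'CountEnc',
            'name': 'Count Encoding',
            'handles_regression': True,
            'handles_classification': True,
            'handles_multiclass': True,
            'handles_multilabel': True,
            'handles_multioutput': True,
            'is_deterministic': True,
            'input': (DENSE, SIGNED_DATA, UNSIGNED_DATA),
            'output': (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        # Nothing to tune for a plain count encoder
        return ConfigurationSpace()

A step like ('count_encoding', CountEncoding()) could then go into the _get_pipeline_steps of your CategoricalPreprocessingPipelineCustom above.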

@SaffronWolf
Author

I will try this out. Thanks a lot for the detailed answer.

@eddiebergman
Contributor

@SaffronWolf, no problem. Let me know if there are any issues with it. We'll address this in the future, as it seems others have issues with data preprocessing as well.

I also updated the comment, as I forgot to subclass the CategoricalPreprocessingPipeline.

@SaffronWolf
Author

Hi @eddiebergman, the above code throws this error:
TypeError: add_component works only with a subclass of <class 'autosklearn.pipeline.components.base.AutoSklearnPreprocessingAlgorithm'>. So I subclassed AutoSklearnPreprocessingAlgorithm.

This leads to the following error: ValueError: Property handles_dense must not be specified for algorithm FeatTypeSplitCustom.
