[Question] Are there any alternatives to One-hot encoding? #1268

Open
SaffronWolf opened this issue Oct 10, 2021 · 6 comments
@SaffronWolf

Are there any alternatives to one-hot encoding for categorical features? I may be wrong, but I think one-hot encoding is the only choice available for encoding categorical features, and it doesn't make sense in every case. Extending with a custom component would also not work, because there is no way to specify a component specifically for categorical features.

@eddiebergman
Contributor

Hi @SaffronWolf,

Apologies for the delayed response. Currently we have ordinal encoding and one-hot encoding (here). What other kinds of encoding would you be considering for categorical data?

In theory it should be possible to add your own encoder. You would have to check the example or, more reliably, the source code, and create your own.

@SaffronWolf
Author

I would like to try out encoders from category-encoders such as CountEncoding and CatBoostEncoding. These are supposed to be applied only to categorical features, but I could not find anything in the API that would allow me to do that.

include={ 'data_preprocessor': ['CountEncoding'] },

This line will apply CountEncoding to all features, whereas it should be applied to categorical features only.
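
For reference, this is the behaviour I am after, using category-encoders directly outside of auto-sklearn (a minimal sketch; the column names are made up):

import pandas as pd
import category_encoders as ce

# Toy frame with one categorical and one numerical column
X = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "price": [1.0, 2.5, 3.0, 0.5],
})

# CountEncoder can be restricted to given columns via `cols`;
# "price" is passed through unchanged.
encoder = ce.CountEncoder(cols=["color"])
X_encoded = encoder.fit_transform(X)
print(X_encoded)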

@eddiebergman
Contributor

eddiebergman commented Oct 16, 2021

Unfortunately this doesn't seem possible right now; the relevant steps are hard-coded at the moment.

The only implemented data preprocessor is 'feature_type', a class that applies a set of actions to a column depending on whether it's numerical or categorical.

from autosklearn.pipeline.components.data_preprocessing import _preprocessors
print(_preprocessors)

# OrderedDict([('feature_type', autosklearn.pipeline.components.data_preprocessing.feature_type.FeatTypeSplit)])

We identify in FeatTypeSplit which columns are categorical or numerical and then apply either the CategoricalPreprocessingPipeline or the NumericalPreprocessingPipeline. The hard-coded steps are here.

My suggestion in the meantime, if you just need a quick-and-dirty solution, is to modify the source code to include what you need at the lines mentioned.

The more flexible solution is to create your own two classes identical to FeatTypeSplit and CategoricalPreprocessingPipeline, overwriting some methods, where your own FeatTypeSplitCustom would use CategoricalPreprocessingPipelineCustom instead of the default one.

from autosklearn.pipeline.components.data_preprocessing.feature_type import FeatTypeSplit
from autosklearn.pipeline.components.data_preprocessing.feature_type_categorical import CategoricalPreprocessingPipeline

class CategoricalPreprocessingPipelineCustom(CategoricalPreprocessingPipeline):

    def _get_pipeline_steps(self, dataset_properties):
        ... # overwrite with what you want your pipeline to look like

class FeatTypeSplitCustom(FeatTypeSplit):

    def __init__(
        self,
        config = None,
        pipeline = None,
        dataset_properties = None,
        include = None,
        exclude = None,
        random_state = None,
        init_params = None,
        feat_type = None,
        force_sparse_output = False,
        column_transformer = None,
    ):
        super().__init__(
            config=config,
            pipeline=pipeline,
            dataset_properties=dataset_properties,
            include=include,
            exclude=exclude,
            random_state=random_state,
            init_params=init_params,
            feat_type=feat_type,
            force_sparse_output=force_sparse_output,
            column_transformer=column_transformer
        )
        # Set the categorical pipeline part to use your custom one
        self.categ_ppl = CategoricalPreprocessingPipelineCustom(
            config=None, steps=pipeline, dataset_properties=dataset_properties,
            include=include, exclude=exclude, random_state=random_state,
            init_params=init_params)
        self._transformers = [
            ("categorical_transformer", self.categ_ppl),
            ("numerical_transformer", self.numer_ppl),
        ]

from autosklearn.pipeline.components.data_preprocessing import add_preprocessor, _addons
add_preprocessor(FeatTypeSplitCustom)
print(_addons)

# OrderedDict([('custom_feature_type', FeatTypeSplitCustom)])

You could then use this with include={ 'data_preprocessor': ['custom_feature_type'] }.
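
For example (an untested sketch, using the registered name printed above; swap the toy dataset for your own data with categorical features):

import autosklearn.classification
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    include={'data_preprocessor': ['custom_feature_type']},
)
automl.fit(X, y)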

Note, I haven't tested this code, but it should be enough to get you going with what you need. You may have to implement a wrapper around any custom steps, as we have done for one-hot encoding, as an example.
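
As a rough illustration of such a wrapper (untested; modelled on our one-hot encoding component, with category_encoders.CountEncoder swapped in):

from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import DENSE, SIGNED_DATA, UNSIGNED_DATA, INPUT

class CountEncoding(AutoSklearnPreprocessingAlgorithm):
    # Untested sketch wrapping category_encoders.CountEncoder as an
    # auto-sklearn preprocessing component.

    def __init__(self, random_state=None):
        self.preprocessor = None
        self.random_state = random_state

    def fit(self, X, y=None):
        import category_encoders as ce
        self.preprocessor = ce.CountEncoder()
        self.preprocessor.fit(X, y)
        return self

    def transform(self, X):
        if self.preprocessor is None:
            raise NotImplementedError()
        return self.preprocessor.transform(X)

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            'shortname': 'CountEnc',
            'name': 'Count Encoding',
            'handles_regression': True,
            'handles_classification': True,
            'handles_multiclass': True,
            'handles_multilabel': True,
            'handles_multioutput': True,
            'is_deterministic': True,
            'input': (DENSE, SIGNED_DATA, UNSIGNED_DATA),
            'output': (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        # Nothing to tune for a plain count encoder
        return ConfigurationSpace()

A step like ('count_encoding', CountEncoding()) could then go into the _get_pipeline_steps of your CategoricalPreprocessingPipelineCustom above.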

@SaffronWolf
Author

I will try this out. Thanks a lot for the detailed answer.

@eddiebergman
Contributor

@SaffronWolf, no problem. Let me know if there are any issues with it. We'll address this in the future, as it seems others have issues with data preprocessing as well.

I also updated the comment, as I forgot to subclass the CategoricalPreprocessingPipeline.

@SaffronWolf
Author

Hi @eddiebergman, the above code throws this error:
TypeError: add_component works only with a subclass of <class 'autosklearn.pipeline.components.base.AutoSklearnPreprocessingAlgorithm'>. So I subclassed AutoSklearnPreprocessingAlgorithm.

This leads to the following error: ValueError: Property handles_dense must not be specified for algorithm FeatTypeSplitCustom.
