[Question] Are there any alternatives to One-hot encoding? #1268
Comments
Hi @SaffronWolf, apologies for the delayed response. Currently we have ordinal encoding and one-hot encoding (here). What other kinds of encoding would you be considering for categorical data? In theory it should be possible to add your own encoder. You would have to check the example or, more reliably, the source code, and create your own.
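For context, the built-in encoders are only applied to the columns auto-sklearn treats as categorical. A minimal sketch of flagging those columns via the `feat_type` argument to `fit` (the data and column layout here are made up for illustration):

```python
import numpy as np
from autosklearn.classification import AutoSklearnClassifier

# Hypothetical data: columns 0 and 1 hold categorical codes, columns 2 and 3 are numerical.
X = np.column_stack([
    np.random.randint(0, 3, size=100),
    np.random.randint(0, 5, size=100),
    np.random.rand(100),
    np.random.rand(100),
])
y = np.random.randint(0, 2, size=100)

automl = AutoSklearnClassifier(time_left_for_this_task=60)
# feat_type tells auto-sklearn which columns go through the categorical pipeline
# (ordinal/one-hot encoding); the remaining columns go through the numerical pipeline.
automl.fit(X, y, feat_type=["Categorical", "Categorical", "Numerical", "Numerical"])
```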
I would like to try out encoders from category-encoders, such as CountEncoding and CatBoostEncoding. These are supposed to be applied only to categorical features, but I could not find anything in the API that would allow me to do that.
This line will apply CountEncoding to all features, whereas it should be applied to categorical features only.
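For comparison, outside auto-sklearn this restriction is usually expressed with scikit-learn's `ColumnTransformer`. A minimal sketch of the behaviour being asked for here (the column indices are hypothetical):

```python
from category_encoders import CountEncoder
from sklearn.compose import ColumnTransformer

# Hypothetical layout: columns 0 and 1 are categorical, the rest numerical.
categorical_cols = [0, 1]

preprocessor = ColumnTransformer(
    transformers=[
        # CountEncoder is fitted on the categorical columns only ...
        ("count_encoding", CountEncoder(), categorical_cols),
    ],
    # ... while the remaining (numerical) columns are passed through untouched.
    remainder="passthrough",
)

# X_encoded = preprocessor.fit_transform(X)
```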
Unfortunately this doesn't seem possible at the moment; the relevant steps are hard-coded. The only implemented data preprocessor is `FeatTypeSplit`:

```python
from autosklearn.pipeline.components.data_preprocessing import _preprocessors

print(_preprocessors)
# OrderedDict([('feature_type', autosklearn.pipeline.components.data_preprocessing.feature_type.FeatTypeSplit)])
```

My suggestion in the meantime, if you just need a quick-and-dirty solution, is to modify the source code to include what you need in the lines mentioned. The more flexible solution is to create your own two classes, modelled on `CategoricalPreprocessingPipeline` and `FeatTypeSplit`, and register the custom one:

```python
from autosklearn.pipeline.components.data_preprocessing.feature_type import FeatTypeSplit
from autosklearn.pipeline.components.data_preprocessing.feature_type_categorical import CategoricalPreprocessingPipeline


class CategoricalPreprocessingPipelineCustom(CategoricalPreprocessingPipeline):
    def _get_pipeline_steps(self, dataset_properties):
        ...  # overwrite with what you want your pipeline to look like


class FeatTypeSplitCustom(FeatTypeSplit):
    def __init__(
        self,
        config=None,
        pipeline=None,
        dataset_properties=None,
        include=None,
        exclude=None,
        random_state=None,
        init_params=None,
        feat_type=None,
        force_sparse_output=False,
        column_transformer=None,
    ):
        super().__init__(
            config=config,
            pipeline=pipeline,
            dataset_properties=dataset_properties,
            include=include,
            exclude=exclude,
            random_state=random_state,
            init_params=init_params,
            feat_type=feat_type,
            force_sparse_output=force_sparse_output,
            column_transformer=column_transformer,
        )
        # Set the categorical pipeline part to use your custom one
        self.categ_ppl = CategoricalPreprocessingPipelineCustom(
            config=None, steps=pipeline, dataset_properties=dataset_properties,
            include=include, exclude=exclude, random_state=random_state,
            init_params=init_params)
        self._transformers = [
            ("categorical_transformer", self.categ_ppl),
            ("numerical_transformer", self.numer_ppl),
        ]
```
```python
from autosklearn.pipeline.components.data_preprocessing import add_preprocessor, _addons

add_preprocessor(FeatTypeSplitCustom)
print(_addons)
# OrderedDict([('custom_feature_type', FeatTypeSplitCustom)])
```

You could then use this with auto-sklearn. Note, I haven't tested this code, but it should be enough to get you going with what you need. You may have to implement a wrapper around any custom steps, as we have done for One Hot Encoding, for example.
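As an illustration of such a wrapper, here is a rough, untested sketch of how `category_encoders.CountEncoder` might be wrapped as an auto-sklearn preprocessing component. The class name `CountEncoderComponent` is made up, and the base-class interface details and property keys are assumptions modelled on auto-sklearn's existing encoder components and extension examples, not a verified implementation:

```python
from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import DENSE, UNSIGNED_DATA, SIGNED_DATA, INPUT


class CountEncoderComponent(AutoSklearnPreprocessingAlgorithm):
    """Hypothetical wrapper exposing category_encoders.CountEncoder to auto-sklearn."""

    def __init__(self, random_state=None):
        self.random_state = random_state
        self.preprocessor = None

    def fit(self, X, y=None):
        from category_encoders import CountEncoder
        self.preprocessor = CountEncoder()
        self.preprocessor.fit(X, y)
        return self

    def transform(self, X):
        if self.preprocessor is None:
            raise NotImplementedError()
        return self.preprocessor.transform(X)

    @staticmethod
    def get_properties(dataset_properties=None):
        # Property values are assumptions; adjust to your task.
        return {
            "shortname": "CountEnc",
            "name": "Count Encoder",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (DENSE, UNSIGNED_DATA, SIGNED_DATA),
            "output": (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        # CountEncoder is used with its defaults here, so the search space is empty.
        return ConfigurationSpace()
```

A wrapper along these lines would then be returned from `CategoricalPreprocessingPipelineCustom._get_pipeline_steps`, so it only ever sees the categorical columns.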
I will try this out. Thanks a lot for the detailed answer.
@SaffronWolf, no problem. Let me know if there are any issues with it. We'll address this in the future, as it seems others have issues with data preprocessing. I also updated the comment above, as I had initially forgotten to subclass one of the two classes.
Hi @eddiebergman, the above code throws an error for me, which then leads to a second error.
Are there any alternatives to one-hot encoding for categorical features? I may be wrong, but I think one-hot encoding is the only choice available for encoding categorical features, and it doesn't make sense every time. Extending with a custom component would also not work, because there is no way to specify a component specifically for categorical features.