
[Question] How can I make sure AutoSklearn is always using StandardScaler for feature preprocessing? #1548

Open
LeSasse opened this issue Jul 26, 2022 · 7 comments


LeSasse commented Jul 26, 2022

Hi!

First of all, thanks for this nice tool for the community. It is very useful in finding good models quickly without too much effort.

Short Question Description

My question is this: I would like to make sure that the auto-sklearn model only evaluates pipelines which scale features using sklearn's StandardScaler. It is not entirely clear to me how this can be done. I have tried different argument configurations using "include" and "exclude", but all of my inputs seem to be invalid.

Some useful information to help us with your question:

  • How did this question come about?

I am just trying to make sure that autosklearn is always using the StandardScaler on the input features.

  • Would a small code snippet help?

Yes, very much.

  • What have you already looked at?

I have been through the API documentation, the examples, and the code on GitHub. I am not sure whether what I want is even possible. One particular thing I don't really understand about the API is the distinction between the "data_preprocessor" and the "feature_preprocessor".


System Details (if relevant)

  • Which version of auto-sklearn are you using?
    auto-sklearn==0.14.7

  • Are you running this on Linux / Mac / ... ?
    Debian 10.11

@LeSasse LeSasse changed the title [Question] My Question? [Question] How can I make sure AutoSklearn is always using StandardScaler for feature preprocessing? Jul 26, 2022

verakye commented Jul 26, 2022

Thank you very much for this really nice library!

To add to the question: we are aware that the data_preprocessor offers an option for standard scaling (https://github.com/automl/auto-sklearn/blob/development/autosklearn/pipeline/components/data_preprocessing/rescaling/standardize.py). However, from the documentation, the examples, and the repo it doesn't become clear how one would pass this to "include" or "exclude": variants such as include={"data_preprocessor": ["standardize"]}, include={"data_preprocessor": ["rescaling"]}, and include={"data_preprocessor": {"rescaling": ["standardize"]}} all fail.

We are also not sure how to derive the correct usage from the error message (produced by option 1 above): "The provided component 'standardize' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']".
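For reference, a minimal sketch of the three attempted variants (names exactly as above; each raises the ValueError quoted above under auto-sklearn 0.14.7, since "feature_type" is the only supported component for the "data_preprocessor" step):

```python
# The three `include` variants tried above. Each raises a ValueError in
# auto-sklearn 0.14.7, because the only supported component for the
# "data_preprocessor" step is "feature_type".
attempts = [
    {"data_preprocessor": ["standardize"]},
    {"data_preprocessor": ["rescaling"]},
    {"data_preprocessor": {"rescaling": ["standardize"]}},
]

# e.g. autosklearn.classification.AutoSklearnClassifier(include=attempts[0])
# -> ValueError: The provided component 'standardize' ... is not valid.
```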

Thank you very much for your help!

@eddiebergman
Contributor

Hi @LeSasse & @verakye,

So the short answer is that if you do the pre-processing yourself, it will always be part of the dataset.

The long answer is that the data_preprocessing part of the pipeline really doesn't allow much customization at the moment, largely due to the fact that it is the only step that is done column-wise and is conditional upon the columns included.

We do hope to fix this soon though, as it has come up quite a few times.

With regards to data_preprocessing vs. feature_preprocessing: the distinction is purely that data_preprocessing is done column-wise, while all components listed as feature_preprocessing operate on the entire X data.

Best,
Eddie

@eddiebergman eddiebergman added this to the v0.16 milestone Aug 3, 2022

verakye commented Aug 3, 2022

Hi @eddiebergman,

For data_preprocessing: 1) Does that mean one would need to specify the columns, and if so, how would one do so (or is this the part you are referring to that is currently not really doable)? 2) What would be the correct syntax to specify the available "rescaling": "standardize"?
Regarding feature_preprocessing: are the features standardised by default? I assumed so based on some examples, but didn't find it in the documentation. If not, is there an option to standardise the features (I only found a standardisation option for data_preprocessing)?

Thank you very much!

@eddiebergman
Contributor

Hi @verakye,

  1. You can use the "feat_type" parameter to specify the column types of your data.
  2. There's no way to specify that only "standardize" should apply to "numerical" columns. Currently data preprocessing is fixed and very inflexible. To ensure the data is standardized, I would recommend doing so yourself before passing it to auto-sklearn.
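A hedged sketch of point 1, assuming auto-sklearn's fit accepts a per-column feat_type list of "Numerical"/"Categorical" labels (the column names and the commented automl.fit call here are made-up, for illustration only):

```python
# Build a per-column type list to pass to auto-sklearn's fit(..., feat_type=...).
# "age", "income", "city" are hypothetical column names for this sketch.
columns = ["age", "income", "city"]
categorical = {"city"}

feat_type = ["Categorical" if c in categorical else "Numerical" for c in columns]

# automl.fit(X, y, feat_type=feat_type)   # illustrative; automl not defined here
```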

If your overarching question is whether you can ensure "standardize" is applied to each feature through auto-sklearn, then no, it currently can't be done via data_preprocessing or feature_preprocessing.

You can find here the components for data_preprocessing and feature_preprocessing. I don't believe any features output by feature_preprocessing will have been standardized; they are mostly just wrappers around whatever the default sklearn feature preprocessors do.

Best,
Eddie

Contributor

mfeurer commented Aug 3, 2022

As a hack, you can remove the other rescaling components by deleting all files in that directory besides __init__.py and standardize.py. However, this will most likely result in worse performance, as meta-learning will no longer be able to suggest the best possible configurations. Moreover, as @eddiebergman explained, this will not apply to categorical data.
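A hedged sketch of that hack, run here against a scratch directory with made-up component file names; to apply it for real, point RESCALING_DIR at your installed autosklearn/pipeline/components/data_preprocessing/rescaling directory (and note this mutates the installation):

```shell
# Demo on a scratch directory mimicking the rescaling components folder.
RESCALING_DIR="${RESCALING_DIR:-rescaling_demo}"
mkdir -p "$RESCALING_DIR"
touch "$RESCALING_DIR"/__init__.py "$RESCALING_DIR"/standardize.py \
      "$RESCALING_DIR"/minmax.py "$RESCALING_DIR"/normalize.py

# Delete every component file except __init__.py and standardize.py.
for f in "$RESCALING_DIR"/*.py; do
  case "$(basename "$f")" in
    __init__.py|standardize.py) ;;   # keep these two
    *) rm -- "$f" ;;                 # drop the other rescaling components
  esac
done
ls "$RESCALING_DIR"
```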

But in any case, could you maybe describe your use case? I'm wondering what you are trying to achieve and whether there's a better solution for this.

Author

LeSasse commented Aug 4, 2022

Thanks a lot for the quick reply!

"If your overarching question is can you ensure "standardize" is applied to each feature through autosklearn, then no it can't currently be done by data_preprocessing or feature_preprocessing."

This pretty much answers it, I would say. We wanted to ensure that "standardize" is applied to all (numerical) features, while avoiding preprocessing before handing the data to auto-sklearn, to prevent data leakage in its internal evaluations. Standardisation likely won't have a huge leakage effect, but we thought it best to ensure standardisation in a CV-consistent way within auto-sklearn.


verakye commented Aug 4, 2022

Thank you very much @eddiebergman and @mfeurer for clarification!
To add to @LeSasse's response: We are using resampling_strategy="cv", that's why we are concerned about data leakage if we did the preprocessing outside of AutoSklearn. AutoSklearn is used as the inner CV of a nested CV. We could totally do the standardisation in the outer CV (but this would technically be some form of data leakage), so we searched for an option to do the standardisation in the most inner level, which in our case would then be within AutoSklearn.
