
[Question] How can I make sure AutoSklearn is always using StandardScaler for feature preprocessing? #1548

Open
LeSasse opened this issue Jul 26, 2022 · 7 comments


LeSasse commented Jul 26, 2022

Hi!

First of all, thanks for this nice tool for the community. It is very useful in finding good models quickly without too much effort.

Short Question Description

My question is this: I would like to make sure that the auto-sklearn model only evaluates pipelines which scale features using sklearn's StandardScaler. It is not entirely clear to me how this can be done. I have tried different argument configurations using "include" and "exclude", but all of my inputs seem to be invalid.

Some useful information to help us with your question:

  • How did this question come about?

I am just trying to make sure that autosklearn is always using the StandardScaler on the input features.

  • Would a small code snippet help?

Yes, very much.

  • What have you already looked at?

I have been through the API documentation, the examples, and the code on GitHub. I am not sure whether what I want is even possible. One particular thing I don't really understand about the API is the distinction between the "data_preprocessor" and the "feature_preprocessor".


System Details (if relevant)

  • Which version of auto-sklearn are you using?
    auto-sklearn==0.14.7

  • Are you running this on Linux / Mac / ... ?
    Debian 10.11

@LeSasse LeSasse changed the title [Question] My Question? [Question] How can I make sure AutoSklearn is always using StandardScaler for feature preprocessing? Jul 26, 2022

verakye commented Jul 26, 2022

Thank you very much for this really nice library!

To add to the question: we are aware that the data_preprocessor offers an option for standard scaling (https://github.com/automl/auto-sklearn/blob/development/autosklearn/pipeline/components/data_preprocessing/rescaling/standardize.py). However, from the documentation, the examples, and the repo it doesn't become clear how one would pass this to "include" or "exclude": variants such as include={"data_preprocessor": ["standardize"]}, include={"data_preprocessor": ["rescaling"]}, and include={"data_preprocessor": {"rescaling": ["standardize"]}} all fail.

We are also not sure how to derive the correct usage from the error message (produced by option 1 above): "The provided component 'standardize' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']".
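For reference, a minimal sketch of the three attempted variants (names exactly as above; each raises the ValueError quoted above under auto-sklearn 0.14.7, since "feature_type" is the only supported component for the "data_preprocessor" step):

```python
# The three `include` variants tried above. Each raises a ValueError in
# auto-sklearn 0.14.7, because the only supported component for the
# "data_preprocessor" step is "feature_type".
attempts = [
    {"data_preprocessor": ["standardize"]},
    {"data_preprocessor": ["rescaling"]},
    {"data_preprocessor": {"rescaling": ["standardize"]}},
]

# e.g. autosklearn.classification.AutoSklearnClassifier(include=attempts[0])
# -> ValueError: The provided component 'standardize' ... is not valid.
```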

Thank you very much for your help!

@eddiebergman
Contributor

Hi @LeSasse & @verakye,

So the short answer is that if you do the pre-processing yourself, it will always be part of the dataset.

The long answer is that the data_preprocessing part of the pipeline really doesn't allow much customization at the moment, largely due to the fact that it is the only step that is done column-wise and is conditional upon the columns included.

We do hope to fix this soon though, as it has come up quite a few times.

With regards to data_preprocessing vs. feature_preprocessing: the distinction is purely that data_preprocessing is done column-wise, while all components listed as feature_preprocessing operate on the entire X data.

Best,
Eddie

@eddiebergman eddiebergman added this to the v0.16 milestone Aug 3, 2022

verakye commented Aug 3, 2022

Hi @eddiebergman,

For data_preprocessing: 1) Does that mean one would need to specify the columns, and if so, how would one do so (or is this the part you are referring to that is currently not really doable)? 2) What would be the correct syntax to specify the available "rescaling": "standardize"?
Regarding feature_preprocessing: are the features standardised by default? I assumed so based on some examples, but didn't find it in the documentation. If not, is there an option to standardise the features (I only found a standardisation option for data_preprocessing)?

Thank you very much!

@eddiebergman
Contributor

Hi @verakye,

  1. You can use the "feat_type" parameter to specify the column types of your data.
  2. There's no way to specify that only "standardize" should apply to "numerical" columns. Currently data preprocessing is fixed and very inflexible. To ensure the data is standardized, I would recommend doing so yourself before passing it to auto-sklearn.
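A hedged sketch of point 1, assuming auto-sklearn's fit accepts a per-column feat_type list of "Numerical"/"Categorical" labels (the column names and the commented automl.fit call here are made-up, for illustration only):

```python
# Build a per-column type list to pass to auto-sklearn's fit(..., feat_type=...).
# "age", "income", "city" are hypothetical column names for this sketch.
columns = ["age", "income", "city"]
categorical = {"city"}

feat_type = ["Categorical" if c in categorical else "Numerical" for c in columns]

# automl.fit(X, y, feat_type=feat_type)   # illustrative; automl not defined here
```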

If your overarching question is whether you can ensure "standardize" is applied to each feature through auto-sklearn, then no, it currently can't be done via data_preprocessing or feature_preprocessing.

You can find here the components for data_preprocessing and feature_preprocessing. I don't believe any features output by feature_preprocessing will have been standardized; they are mostly just wrappers around whatever the default sklearn feature preprocessors do.

Best,
Eddie

Contributor

mfeurer commented Aug 3, 2022

As a hack, you can remove the other rescaling components by deleting all files in that directory besides __init__.py and standardize.py. However, this will most likely result in worse performance, as meta-learning will no longer be able to suggest the best possible configurations. Moreover, as @eddiebergman explained, this will not apply to categorical data.
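A hedged sketch of that hack, run here against a scratch directory with made-up component file names; to apply it for real, point RESCALING_DIR at your installed autosklearn/pipeline/components/data_preprocessing/rescaling directory (and note this mutates the installation):

```shell
# Demo on a scratch directory mimicking the rescaling components folder.
RESCALING_DIR="${RESCALING_DIR:-rescaling_demo}"
mkdir -p "$RESCALING_DIR"
touch "$RESCALING_DIR"/__init__.py "$RESCALING_DIR"/standardize.py \
      "$RESCALING_DIR"/minmax.py "$RESCALING_DIR"/normalize.py

# Delete every component file except __init__.py and standardize.py.
for f in "$RESCALING_DIR"/*.py; do
  case "$(basename "$f")" in
    __init__.py|standardize.py) ;;   # keep these two
    *) rm -- "$f" ;;                 # drop the other rescaling components
  esac
done
ls "$RESCALING_DIR"
```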

But in any case, could you maybe describe your use case? I'm wondering what you are trying to achieve and whether there's a better solution for this.

Author

LeSasse commented Aug 4, 2022

Thanks a lot for the quick reply!

"If your overarching question is can you ensure "standardize" is applied to each feature through autosklearn, then no it can't currently be done by data_preprocessing or feature_preprocessing."

This pretty much answers it, I would say. We wanted to ensure that "standardize" is applied to all (numerical) features, while avoiding preprocessing before handing the data to auto-sklearn, to prevent data leakage in its internal evaluations. Standardisation likely won't have a huge leakage effect, but we thought it best to ensure standardisation in a CV-consistent way within auto-sklearn.


verakye commented Aug 4, 2022

Thank you very much @eddiebergman and @mfeurer for clarification!
To add to @LeSasse's response: We are using resampling_strategy="cv", that's why we are concerned about data leakage if we did the preprocessing outside of AutoSklearn. AutoSklearn is used as the inner CV of a nested CV. We could totally do the standardisation in the outer CV (but this would technically be some form of data leakage), so we searched for an option to do the standardisation in the most inner level, which in our case would then be within AutoSklearn.
