
[Question] Is this the correct way of extending AutoSklearn with a Standard Scaler feature preprocessor? #1651

Closed
verakye opened this issue Mar 23, 2023 · 3 comments


verakye commented Mar 23, 2023

Short Question Description

  1. Is the code below the correct approach to extend AutoSklearn with a wrapper class for the scikit-learn StandardScaler as a feature preprocessor (particularly the settings in get_properties and get_hyperparameter_search_space)?
  2. Would that approach (in your opinion) solve the problem mentioned in [Question] How can I make sure AutoSklearn is always using StandardScaler for feature preprocessing? #1548?
  3. How can I best double-check in the final output that the features were indeed standard-scaled within AutoSklearn, i.e. whether the StandardScaler works as it is supposed to (better than just comparing two runs with and without standard scaling)?

Thank you very much!

The code I wrote in an attempt to create the wrapper class and register it with auto-sklearn (it runs without errors):

from typing import Optional
from pprint import pprint

from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter

from autosklearn.askl_typing import FEAT_TYPE_TYPE
import autosklearn.pipeline.components.feature_preprocessing
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import DENSE, SIGNED_DATA, UNSIGNED_DATA, INPUT

class StandardScaler(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, copy=True, with_mean=True, with_std=True, random_state=None):
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        self.random_state = random_state
        self.preprocessor_ = None

    def fit(self, X, y=None, sample_weight=None):
        import sklearn.preprocessing

        # ConfigSpace delivers the categorical choices as the strings "True"/"False",
        # so convert them to booleans before passing them on to scikit-learn.
        if isinstance(self.copy, str):
            self.copy = self.copy == "True"
        if isinstance(self.with_mean, str):
            self.with_mean = self.with_mean == "True"
        if isinstance(self.with_std, str):
            self.with_std = self.with_std == "True"

        self.preprocessor_ = sklearn.preprocessing.StandardScaler(
            copy=self.copy,
            with_mean=self.with_mean,
            with_std=self.with_std,
        )
        self.preprocessor_.fit(X, y, sample_weight)
        return self

    
    def transform(self, X):
        if self.preprocessor_ is None:
            raise NotImplementedError()
        return self.preprocessor_.transform(X)

    # method to query properties of the component
    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "StandardScaler",
            "name": "Standard Scaler Preprocessor",
            "handles_regression": True,
            "handles_classification": False,
            "handles_multiclass": False,
            "handles_multilabel": False,
            "handles_multioutput": False,
            "is_deterministic": True,
            "input": (DENSE, UNSIGNED_DATA, SIGNED_DATA),
            "output": (INPUT, DENSE, UNSIGNED_DATA, SIGNED_DATA),
        }

    # method to return the configuration space
    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        cs = ConfigurationSpace()
        copy = CategoricalHyperparameter(
            name="copy", choices=["True", "False"], default_value="True"
        )
        with_mean = CategoricalHyperparameter(
            name="with_mean", choices=["True", "False"], default_value="True"
        )
        with_std = CategoricalHyperparameter(
            name="with_std", choices=["True", "False"], default_value="True"
        )
        cs.add_hyperparameters([copy, with_mean, with_std])
        return cs

# Add StandardScaler component to auto-sklearn.
autosklearn.pipeline.components.feature_preprocessing.add_preprocessor(StandardScaler)

Additional example code to test-run the wrapper

import autosklearn.regression
import sklearn.metrics
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# load data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# check the configuration space
cs = StandardScaler.get_hyperparameter_search_space()
print(cs)

# run AutoSklearn regression with CV and standard scaler
aml = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=60,
    include={"feature_preprocessor": ["StandardScaler"]},
    memory_limit=6000,
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 3},
    # speed up calculations
    initial_configurations_via_metalearning=0,
    smac_scenario_args={"runcount_limit": 5},
)
aml.fit(X_train, y_train)

# inspect leaderboard -> the standard scaler is listed
aml.leaderboard(detailed=True)

# check scores before re-fit
predictions_before_refit = aml.predict(X_test)
print("Test R2 score:", sklearn.metrics.r2_score(y_test, predictions_before_refit))

# check scores after refit
aml.refit(X_train.copy(), y_train.copy())
predictions_after_refit = aml.predict(X_test)
print("Test R2 score:", sklearn.metrics.r2_score(y_test, predictions_after_refit))
@eddiebergman
Contributor

Hi @verakye,

  1. get_properties looks correct at the very least. However, the copy hyperparameter looks dubious; I'm not sure why you would want to search over this parameter. It's probably best to just leave it at True, i.e. don't modify any data in place.
  2. The issue mentioned seems to suggest it should always be applied; I can't really verify this properly from here. One possible way to verify this is to use automl.leaderboard(ensemble_only=False, detailed=True). ensemble_only=False indicates that every run should be included, while detailed=True indicates to show more columns, one of which is the feature_preprocessor.
  3. I don't really have a good suggestion for you on this one. I imagine you could use the answer here to get the actual model out and pass data through it. However, there are limitations to it. I'm sorry, it's the best I can do. A rough, untested sketch of both checks follows below.
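
Roughly what I mean, as an untested sketch that reuses the names from your example above (aml, X_train, and your StandardScaler wrapper); the exact leaderboard column names may differ between auto-sklearn versions, so the snippet filters for them instead of hard-coding one:

board = aml.leaderboard(ensemble_only=False, detailed=True)
print(board.columns.tolist())             # locate the feature-preprocessor column
print(board.filter(like="preprocessor"))  # show only the preprocessor-related columns

And one direct (if partial) check for point 3 that exercises the wrapper on its own rather than the full AutoSklearn pipeline: fit it on the training data and confirm the transformed features have roughly zero mean and unit variance.

import numpy as np

scaler = StandardScaler()                    # your wrapper class from above
Xt = scaler.fit(X_train).transform(X_train)
print(np.allclose(Xt.mean(axis=0), 0.0))     # expect True
print(np.allclose(Xt.std(axis=0), 1.0))      # expect True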

Best,
Eddie

@verakye
Copy link
Author

verakye commented Mar 23, 2023

Hi @eddiebergman,

thanks a lot for the quick reply!

  1. That sounds like a very logical suggestion! I will do so.
  2. It actually should always be applied, but in the sense of a pipeline within auto-sklearn: not as a replacement for the default preprocessing chosen for a given algorithm, but as a first step, after which other (and possibly differing) preprocessing steps could still follow. I checked the leaderboard as you suggested and compared it with a vanilla run with no specification of preprocessing. Unfortunately, adding the preprocessor replaces ALL default preprocessing. Is there an option to set one specific preprocessing step for all algorithms but then still keep the defaults chosen for each algorithm applied afterwards (just like in a pipeline)?
  3. Thanks for the informative and useful linked issue!
  3. Thanks for the informative and useful linked issue!

Best,
Vera

@eddiebergman
Contributor

  1. Unfortunately, that's a limitation of auto-sklearn. My only suggestion, if you want full control, is to fork and modify it accordingly.

Best,
Eddie

verakye closed this as completed Mar 23, 2023