
[Question] Is this the correct way of extending AutoSklearn with a Standard Scaler feature preprocessor? #1651

Closed
verakye opened this issue Mar 23, 2023 · 3 comments


verakye commented Mar 23, 2023

Short Question Description

  1. Is the code below the correct approach to extend AutoSklearn with a wrapper class for the scikit-learn StandardScaler as a feature preprocessor (particularly the settings in get_properties and get_hyperparameter_search_space)?
  2. Would that approach (in your opinion) solve the problem mentioned in [Question] How can I make sure AutoSklearn is always using StandardScaler for feature preprocessing? #1548?
  3. How can I best double-check in the final output that the features were indeed standard-scaled within AutoSklearn, i.e. whether the StandardScaler works as it is supposed to (better than just comparing two runs with and without standard scaling)?

Thank you very much!

The code I wrote in an attempt to create the wrapper class and register it with auto-sklearn (it runs without errors):

from typing import Optional
from pprint import pprint

from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter

from autosklearn.askl_typing import FEAT_TYPE_TYPE
import autosklearn.pipeline.components.feature_preprocessing
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import DENSE, SIGNED_DATA, UNSIGNED_DATA, INPUT

class StandardScaler(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, copy=True, with_mean=True, with_std=True, random_state=None):
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        self.random_state = random_state
        self.preprocessor_ = None

    def fit(self, X, y=None, sample_weight=None):
        import sklearn.preprocessing

        # ConfigSpace delivers the categorical choices as the strings "True"/"False",
        # so convert them to booleans before passing them on to scikit-learn.
        if isinstance(self.copy, str):
            self.copy = self.copy == "True"
        if isinstance(self.with_mean, str):
            self.with_mean = self.with_mean == "True"
        if isinstance(self.with_std, str):
            self.with_std = self.with_std == "True"

        self.preprocessor_ = sklearn.preprocessing.StandardScaler(
            copy=self.copy,
            with_mean=self.with_mean,
            with_std=self.with_std,
        )
        self.preprocessor_.fit(X, y, sample_weight)
        return self

    
    def transform(self, X):
        if self.preprocessor_ is None:
            raise NotImplementedError()
        return self.preprocessor_.transform(X)

    # method to query properties of the component
    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "StandardScaler",
            "name": "Standard Scaler Preprocessor",
            "handles_regression": True,
            "handles_classification": False,
            "handles_multiclass": False,
            "handles_multilabel": False,
            "handles_multioutput": False,
            "is_deterministic": True,
            "input": (DENSE, UNSIGNED_DATA, SIGNED_DATA),
            "output": (INPUT, DENSE, UNSIGNED_DATA, SIGNED_DATA),
        }

    # method to return the configuration space
    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        cs = ConfigurationSpace()
        copy = CategoricalHyperparameter(
            name="copy", choices=["True", "False"], default_value="True"
        )
        with_mean = CategoricalHyperparameter(
            name="with_mean", choices=["True", "False"], default_value="True"
        )
        with_std = CategoricalHyperparameter(
            name="with_std", choices=["True", "False"], default_value="True"
        )
        cs.add_hyperparameters([copy, with_mean, with_std])
        return cs

# Add StandardScaler component to auto-sklearn.
autosklearn.pipeline.components.feature_preprocessing.add_preprocessor(StandardScaler)

Additional example code to test-run the wrapper

import autosklearn.regression
import sklearn.metrics
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# load data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# check the configuration space
cs = StandardScaler.get_hyperparameter_search_space()
print(cs)

# run AutoSklearn regression with CV and standard scaler
aml = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=60,
    include={"feature_preprocessor": ["StandardScaler"]},
    memory_limit=6000,
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 3},
    # speed up calculations
    initial_configurations_via_metalearning=0,
    smac_scenario_args={"runcount_limit": 5},
)
aml.fit(X_train, y_train)

# inspect leaderboard -> the standard scaler is listed
aml.leaderboard(detailed=True)

# check scores before re-fit
predictions_before_refit = aml.predict(X_test)
print("Test R2 score:", sklearn.metrics.r2_score(y_test, predictions_before_refit))

# check scores after refit
aml.refit(X_train.copy(), y_train.copy())
predictions_after_refit = aml.predict(X_test)
print("Test R2 score:", sklearn.metrics.r2_score(y_test, predictions_after_refit))
@eddiebergman
Contributor

Hi @verakye,

  1. get_properties looks correct at the very least. However, the copy hyperparameter looks dubious; I'm not sure why you would want to search over this parameter. It's probably best to just leave it at True, i.e. don't modify any data in place.
  2. The issue mentioned seems to suggest it should always be applied; I can't really verify this properly from here. One possible way to verify this is to use automl.leaderboard(ensemble_only=False, detailed=True). ensemble_only=False indicates that every run should be included, while detailed=True indicates to show more columns, one of which is the feature_preprocessor.
  3. I don't really have a good suggestion for you on this one. I imagine you could use the answer here to get the actual model out and pass data through it. However, there are limitations to it. I'm sorry, it's the best I can do. A rough, untested sketch of both checks follows below.
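
Roughly what I mean, as an untested sketch that reuses the names from your example above (aml, X_train, and your StandardScaler wrapper); the exact leaderboard column names may differ between auto-sklearn versions, so the snippet filters for them instead of hard-coding one:

board = aml.leaderboard(ensemble_only=False, detailed=True)
print(board.columns.tolist())             # locate the feature-preprocessor column
print(board.filter(like="preprocessor"))  # show only the preprocessor-related columns

And one direct (if partial) check for point 3 that exercises the wrapper on its own rather than the full AutoSklearn pipeline: fit it on the training data and confirm the transformed features have roughly zero mean and unit variance.

import numpy as np

scaler = StandardScaler()                    # your wrapper class from above
Xt = scaler.fit(X_train).transform(X_train)
print(np.allclose(Xt.mean(axis=0), 0.0))     # expect True
print(np.allclose(Xt.std(axis=0), 1.0))      # expect True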

Best,
Eddie

@verakye
Copy link
Author

verakye commented Mar 23, 2023

Hi @eddiebergman,

thanks a lot for the quick reply!

  1. That sounds like a very logical suggestion! I will do so.
  2. It actually should always be applied, but in the sense of a pipeline within auto-sklearn: not as a replacement for the default preprocessing chosen for a given algorithm, but as a first step, after which other (and possibly differing) preprocessing steps could still follow. I checked the leaderboard as you suggested and compared it with a vanilla run with no specification of preprocessing. Unfortunately, adding the preprocessor replaces ALL default preprocessing. Is there an option to set one specific preprocessing step for all algorithms but then still keep the defaults chosen for each algorithm applied afterwards (just like in a pipeline)?
  3. Thanks for the informative and useful linked issue!
  3. Thanks for the informative and useful linked issue!

Best,
Vera

@eddiebergman
Contributor

  1. Unfortunately, that's a limitation of auto-sklearn. My only suggestion, if you want full control, is to fork and modify it accordingly.

Best,
Eddie

verakye closed this as completed Mar 23, 2023