
Run custom pipeline / feature extraction on each cv fold #1182

Open
MaxBenChrist opened this issue Mar 4, 2021 · 2 comments

Comments

@MaxBenChrist

Hi tpot team,

Thank you for this great library. I am currently using it on a few datasets and the results are great (especially if you tune the config to your problem).

Now, for one dataset, I need to perform a feature extraction that I consider sensitive to the samples in the train set. I have a custom pipeline to extract those features; in the following example it is called pipe. The individual steps of that pipeline are not important, they are custom transformers.

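from sklearn.pipeline import Pipeline
from tpot import TPOTRegressor

# Feature extraction pipeline; the steps are custom transformers.
pipe = Pipeline([
    ('last_foo', AddLastFromGroup()),
    ('last_bar', AddLastFromGroup()),
    ('missing_indicator', AddMissingIndicator()),
    ('imputer_groups', GroupedImputer()),
    ('imputer_median', MeanMedianImputer()),
    ('imputer_categories', CategoricalImputer()),
    ('foo', DropFeatures()),
    ('baz', RareLabelEncoder()),
    ('bu', OneHotEncoder()),
])

# Features are fitted once on the whole train set ...
X_train = pipe.fit_transform(df_train)
X_test = pipe.transform(df_test)

# ... and only afterwards split into cv folds by tpot.
tpot = TPOTRegressor(cv=10)
tpot.fit(X_train, y_train)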

I then feed the features calculated by pipe into tpot. However, when tpot runs its cross-validation inside tpot.fit(X_train, y_train), this creates data leakage: the feature values were calculated on the whole train set df_train, so they already contain information from the samples in each cv test fold. This is a problem for me because it overestimates the importance of certain features.
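To make the leakage concrete, here is a minimal sketch using only scikit-learn. It does not use tpot or the custom transformers from pipe; SelectKBest is a stand-in for any feature step that depends on the samples it is fitted on, and the data is random noise chosen purely for illustration. The first score fits the feature step once on all rows before cross-validating, the second refits it on the training part of each fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features (illustrative data)
y = rng.normal(size=100)          # target unrelated to X

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Leaky: the feature step is fitted on every row, including the rows
# that later serve as the cv test folds.
X_leaky = SelectKBest(f_regression, k=20).fit_transform(X, y)
leaky_r2 = cross_val_score(
    LinearRegression(), X_leaky, y, cv=cv, scoring="r2"
).mean()

# Fold-wise: the feature step is part of the estimator, so it is refitted
# on the training part of each fold only.
fold_wise = make_pipeline(SelectKBest(f_regression, k=20), LinearRegression())
fold_wise_r2 = cross_val_score(fold_wise, X, y, cv=cv, scoring="r2").mean()

print(f"leaky R^2: {leaky_r2:.2f}, fold-wise R^2: {fold_wise_r2:.2f}")
```

Since the target is unrelated to the features, any apparent skill in the leaky score comes from the feature step having seen the test folds. The stand-in uses the target, which exaggerates the effect compared to an unsupervised step such as an imputer.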

So, how can I run the pipe to create the features in each of the 10 cross-validation train and test folds inside tpot? Essentially, I want every tpot pipeline to start with pipe and then extend it by estimators, selectors and regressors. In that case, I would call tpot.fit(df_train, y_train) instead of tpot.fit(X_train, y_train).
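The behaviour I am after is what plain scikit-learn does when the feature extraction is the first step of the cross-validated estimator. The sketch below shows that outside of tpot; it is not tpot's API. RecordingCenterer is a hypothetical stand-in for pipe that records how many rows it was fitted on, and Ridge stands in for whatever tpot would append.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline


class RecordingCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for `pipe`: centers columns, records fit size."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.n_rows_seen_ = X.shape[0]   # rows available when fitting
        self.means_ = X.mean(axis=0)     # statistic learned from those rows
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.means_


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.arange(1.0, 6.0) + rng.normal(scale=0.1, size=100)

# Feature extraction first, model second; the whole thing is cross-validated.
full = Pipeline([("features", RecordingCenterer()), ("model", Ridge())])
result = cross_validate(
    full, X, y, cv=KFold(n_splits=10), return_estimator=True
)

# One fitted pipeline per fold; each feature step saw only its train fold.
rows_seen = [
    est.named_steps["features"].n_rows_seen_ for est in result["estimator"]
]
print(rows_seen)  # ten entries of 90: 100 rows minus the 10 held-out rows
```

With 100 rows and 10 folds, every fitted feature step reports 90 rows, so none of them saw its own test fold.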

I was thinking about using the template argument and looked into the tpot source code, but I am a little bit lost. Unfortunately, this part is not well documented. I guess you would have to somehow fix my pipe as the root of the tree in https://github.com/EpistasisLab/tpot/blob/master/tpot/base.py#L444:L508?

Finally, where can I find a description of the genetic algorithm that is used in tpot?

@MaxBenChrist
Author

Any update on this? Is the description clear?

@ianbenlolo

I'd like an update on this as well. Some kind of data preprocessing pipeline on each fold would be great.


2 participants