Run custom pipeline / feature extraction on each cv fold #1182

MaxBenChrist · 2021-03-04T21:19:00Z

Hi tpot team,

Thank you for this great library. I am currently using it on a few datasets and the results are great (especially if you tune the config to your problem).

Now, for one dataset, I need to perform a feature extraction that I would consider sensitive to which samples are in the train set. I have a custom pipeline to extract those features; in the following example it is called pipe. The individual steps of that pipeline are not important, they are custom transformers.

pipe = Pipeline([
    ('last_foo', AddLastFromGroup()),
    ('last_bar', AddLastFromGroup()),
    ('missing_indicator', AddMissingIndicator()),
    ('imputer_groups', GroupedImputer()),
    ('imputer_median', MeanMedianImputer()),
    ('imputer_categories', CategoricalImputer()),
    ('foo', DropFeatures()),
    ('baz', RareLabelEncoder()),
    ('bu', OneHotEncoder()),
])

X_train = pipe.fit_transform(df_train)
X_test = pipe.transform(df_test)

tpot = TPOTRegressor(cv=10)
tpot.fit(X_train, y_train)

I pass the features calculated by pipe into tpot. However, when tpot runs cross-validation inside tpot.fit(X_train, y_train), there is data leakage: the feature values were calculated on the whole train set df_train, so they already contain information from the samples in each cv test fold. This is a problem for me because it overestimates the importance of certain features.

So, how can I run pipe inside tpot so that the features are created separately for each of the 10 cross-validation splits, fitted on the train fold and applied to the test fold? Essentially, I want every tpot pipeline to start with pipe and then extend it with estimators, selectors and regressors. In that case, I would call tpot.fit(df_train, y_train) instead of tpot.fit(X_train, y_train).
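To illustrate the fold-wise behaviour being asked for, here is a minimal sketch in plain scikit-learn, not tpot. It does not answer how to do this inside tpot. `RecordingImputer` is a hypothetical stand-in for a sample-sensitive step such as `GroupedImputer`, and the data is synthetic. When the feature step is part of the pipeline handed to `cross_val_score`, it is refitted on each train fold only, so no statistics from the test fold reach the features.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline


class RecordingImputer(TransformerMixin, BaseEstimator):
    """Hypothetical sample-sensitive step (assumption, not from the issue).

    Fills NaNs with the column means seen in fit() and records how many
    rows each fit() call received.
    """

    fit_sizes = []  # class-level log, shared by the clones made during cv

    def fit(self, X, y=None):
        self.means_ = np.nanmean(X, axis=0)
        RecordingImputer.fit_sizes.append(X.shape[0])
        return self

    def transform(self, X):
        return np.where(np.isnan(X), self.means_, X)


# Synthetic regression data with some missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# Feature step and model live in ONE pipeline, so cv refits both per fold.
full_pipe = Pipeline([
    ("features", RecordingImputer()),
    ("model", LinearRegression()),
])

scores = cross_val_score(full_pipe, X, y, cv=KFold(n_splits=10), scoring="r2")

# Every fit() saw only the 90 training rows of its fold, never all 100.
print(RecordingImputer.fit_sizes)
```

By contrast, calling `fit_transform` on all 100 rows first and cross-validating only the model would compute the imputation means from rows that later end up in the test folds, which is the leakage described above.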

I was thinking about using the template argument. I looked into the tpot source code, but I am a little bit lost; unfortunately, it is not well documented. I guess you would have to somehow fix my pipe as the root of the tree in https://github.com/EpistasisLab/tpot/blob/master/tpot/base.py#L444:L508?

Finally, where can I find a description of the genetic algorithm that is used in tpot?

MaxBenChrist · 2021-04-13T23:00:55Z

Any update on this? Is the description clear?

ianbenlolo · 2021-04-28T14:49:46Z

I'd like an update on this as well. Some kind of data preprocessing pipeline on each fold would be great.

perib mentioned this issue Sep 21, 2023

TPOT2 and the future of TPOT development -- From the Devs #1322

Open
