Hi tpot team,

Thank you for this great library. I am currently using it on a few datasets and the results are great (especially if you tune the config to your problem).

Now, for one data set, I need to perform a feature extraction that I would consider sensitive to the samples in the train set. I have a custom pipeline to extract those features. In the following example that feature extraction pipeline is `pipe`. The steps of that pipeline are not important; they are custom transformers.

I put the features calculated by `pipe` into tpot. However, when tpot runs a cross-validation as in `tpot.fit(X_train, y_train)`, it creates data leakage, because it uses feature values calculated on the whole train set `df_train`, so it uses samples from the cv test fold. This is a problem for me as it overestimates the importance of certain features.

So, how can I run `pipe` to create the features in each of the 10 cross-validation train and test folds inside tpot? Essentially, I want every tpot pipeline to start with `pipe` and then extend it by estimators, selectors and regressors. In that case, I would call `tpot.fit(df_train, y_train)` instead of `tpot.fit(X_train, y_train)`.

I was thinking about using the template argument. I looked into the tpot source code but I am a little bit lost, and unfortunately it is not well documented. I guess you would have to somehow fix my `pipe` as the root of the tree in https://github.com/EpistasisLab/tpot/blob/master/tpot/base.py#L444:L508?

Finally, where can I find a description of the genetic algorithm that is used in tpot?
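
To make the leakage concrete, here is a minimal sketch in plain Python, without tpot. Everything in it is an assumption for illustration: `MeanTargetEncoder` is a hypothetical stand-in for my `pipe` (it is not one of my real transformers), the data is pure noise, and the "model" simply predicts the encoded feature. It compares fitting the feature extraction once on the whole train set with refitting it on the train part of each fold, which is the behaviour I would like inside tpot.

```python
import random
from statistics import mean


class MeanTargetEncoder:
    """Hypothetical stand-in for `pipe`: a feature that depends on the fitted samples.

    Each category is encoded as the mean target observed for it during fit.
    """

    def fit(self, categories, targets):
        grouped = {}
        for cat, target in zip(categories, targets):
            grouped.setdefault(cat, []).append(target)
        self.means_ = {cat: mean(values) for cat, values in grouped.items()}
        self.default_ = mean(targets)  # fallback for categories unseen during fit
        return self

    def transform(self, categories):
        return [self.means_.get(cat, self.default_) for cat in categories]


def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs; sample i goes to test fold i % n_splits."""
    for k in range(n_splits):
        test_idx = [i for i in range(n_samples) if i % n_splits == k]
        train_idx = [i for i in range(n_samples) if i % n_splits != k]
        yield train_idx, test_idx


def cv_mse(categories, targets, n_splits, fit_inside_folds):
    """Cross-validated mean squared error when predicting the encoded feature."""
    # Leaky variant: the encoder has already seen every sample, test folds included.
    encoder_on_all = MeanTargetEncoder().fit(categories, targets)
    squared_errors = []
    for train_idx, test_idx in k_fold_indices(len(targets), n_splits):
        if fit_inside_folds:
            # Leakage-free variant: refit the encoder on the train part of this fold.
            encoder = MeanTargetEncoder().fit(
                [categories[i] for i in train_idx],
                [targets[i] for i in train_idx],
            )
        else:
            encoder = encoder_on_all
        predictions = encoder.transform([categories[i] for i in test_idx])
        squared_errors.extend(
            (pred - targets[i]) ** 2 for pred, i in zip(predictions, test_idx)
        )
    return mean(squared_errors)


random.seed(0)
n_samples = 200
categories = [i // 2 for i in range(n_samples)]  # two samples per category
targets = [random.gauss(0.0, 1.0) for _ in range(n_samples)]  # pure noise target

leaky_mse = cv_mse(categories, targets, n_splits=10, fit_inside_folds=False)
clean_mse = cv_mse(categories, targets, n_splits=10, fit_inside_folds=True)
print(f"encoder fitted on whole train set: MSE = {leaky_mse:.3f}")
print(f"encoder refitted in each fold:     MSE = {clean_mse:.3f}")
```

Since the target is noise, the feature has no real predictive value, yet the variant fitted on the whole train set reports a lower error because each test sample contributed to its own feature value. That optimistic bias is what I want to avoid by having `pipe` fitted inside every fold.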