Optimize RelevantFeatureAugmenter to avoid re-extraction #669
I am working on some datasets where the feature extraction step is very time consuming, sometimes totaling multiple days when running a suite of experiments. I'm using `RelevantFeatureAugmenter` in a scikit-learn `Pipeline` object, and noticed that the training features are extracted twice, because `fit()` extracts them once and `transform()` re-extracts a subset of them (the relevant ones). I see why this is needed, since `fit()` and `transform()` need to be independent for the sake of re-applying the transformer to new data.
However, in a scikit-learn pipeline, `fit_transform()` is called during training, since the two steps are sequential. For these situations, I added a `fit_transform()` method which avoids re-extracting the relevant features (a minimal sketch of the idea is at the end of this description). I created a small simulated dataset to test the effect:
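Roughly, the setup looks like this (a minimal sketch following tsfresh's documented `Pipeline` usage; the dataset sizes and the synthetic signal here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from tsfresh.transformers import RelevantFeatureAugmenter

# Simulate 100 time series of length 50, with a binary label per series.
n_series, n_timesteps = 100, 50
rng = np.random.RandomState(42)
y = pd.Series(rng.randint(0, 2, n_series))

records = []
for series_id in range(n_series):
    # Inject a label-dependent trend so that some features are relevant.
    values = np.linspace(0, y[series_id], n_timesteps) + rng.normal(size=n_timesteps)
    records.extend(
        {"id": series_id, "time": t, "value": v} for t, v in enumerate(values)
    )
df_ts = pd.DataFrame(records)

# X only carries the series ids; the features are pulled from df_ts.
X = pd.DataFrame(index=y.index)

pipeline = Pipeline([
    ("augmenter", RelevantFeatureAugmenter(column_id="id", column_sort="time")),
    ("classifier", RandomForestClassifier()),
])
pipeline.set_params(augmenter__timeseries_container=df_ts)

# 4-fold cross-validation: each fold runs feature extraction during fit.
print(cross_val_score(pipeline, X, y, cv=4))
```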
Before this optimization, the above code took 5m36s to run on my computer. After, it took 4m24s. In detail, the output before this optimization:
It is apparent in the above output that, for each of the 4 cross-validation folds, the initial feature extraction takes the longest, followed by a shorter re-extraction of only the relevant features on the training data, followed by an even shorter extraction of the relevant features on the testing data. This second step is the one that can be avoided. The new output after this optimization looks like this:
The re-extraction step is now gone. This saves me a ton of time in practice, especially for datasets with many relevant features.
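To make the idea concrete, here is a minimal, self-contained sketch of the `fit()`/`transform()`/`fit_transform()` split, with cheap stand-ins for tsfresh's extraction and selection steps (the class name, `_extract()` helper, and relevance filter here are illustrative, not tsfresh's internals):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CachingAugmenter(BaseEstimator, TransformerMixin):
    """Illustrative stand-in for the augmenter's fit/transform split."""

    def _extract(self, X, columns=None):
        # Stand-in for the expensive feature-extraction step.
        features = pd.DataFrame(
            {"mean": X.mean(axis=1), "std": X.std(axis=1), "max": X.max(axis=1)},
            index=X.index,
        )
        return features if columns is None else features[columns]

    def fit(self, X, y=None):
        all_features = self._extract(X)
        # Stand-in for relevance filtering: keep the non-constant features.
        self.relevant_columns_ = [c for c in all_features if all_features[c].std() > 0]
        # Cache what fit() already computed, for fit_transform() to reuse.
        self._fitted_features = all_features[self.relevant_columns_]
        return self

    def transform(self, X):
        # Must stay independent of fit() so it also works on unseen data,
        # which is why it re-extracts (the relevant subset only).
        return self._extract(X, columns=self.relevant_columns_)

    def fit_transform(self, X, y=None):
        # Called by Pipeline during training: skip the re-extraction and
        # return the features that fit() just computed.
        self.fit(X, y)
        return self._fitted_features
```

Since `Pipeline.fit()` invokes `fit_transform()` on intermediate steps when it is available, training takes the cached path automatically, while `transform()` on held-out data behaves exactly as before.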