
Optimize RelevantFeatureAugmenter to avoid re-extraction #669

Merged · 2 commits · Apr 20, 2020
Conversation

@pnb (Contributor) commented Apr 17, 2020

I am working on some datasets where the feature extraction step is very time consuming, sometimes totaling multiple days when running a suite of experiments. I'm using RelevantFeatureAugmenter in a scikit-learn Pipeline object, and noticed that the training features are extracted twice, because fit() extracts them once and transform() re-extracts a subset of them (the relevant ones). I see why this is needed, since fit() and transform() need to be independent for the sake of re-applying the transformer to new data.

However, in a scikit-learn pipeline, fit_transform() is called during training since the two steps are sequential. For these situations, I added a fit_transform() function which avoids re-extracting the relevant features.
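The effect of adding fit_transform() can be sketched with a toy transformer (illustrative only; the class, helper names, and counting logic below are mine, not tsfresh's actual implementation):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CachingAugmenter(BaseEstimator, TransformerMixin):
    """Toy transformer illustrating the optimization: fit_transform()
    returns the features already computed during fitting instead of
    re-extracting them. (Sketch only; not the real tsfresh class.)"""

    def __init__(self):
        self.n_extractions = 0  # count expensive extraction passes

    def _extract(self, X):
        # Stand-in for the expensive tsfresh feature extraction.
        self.n_extractions += 1
        return X.assign(feat=X.iloc[:, 0] * 2)

    def fit(self, X, y=None):
        # Full extraction (plus relevance selection in the real class).
        self.augmented_ = self._extract(X)
        return self

    def transform(self, X):
        # Must re-extract, so the transformer works on new, unseen data.
        return self._extract(X)

    def fit_transform(self, X, y=None):
        # In a Pipeline, fit is immediately followed by transform on the
        # same data, so the result of fitting can be returned directly.
        return self.fit(X, y).augmented_


X = pd.DataFrame({'a': [1, 2, 3]})
t1 = CachingAugmenter()
t1.fit(X).transform(X)  # separate fit + transform: two extractions
t2 = CachingAugmenter()
t2.fit_transform(X)     # fit_transform: one extraction
print(t1.n_extractions, t2.n_extractions)  # prints: 2 1
```

scikit-learn's Pipeline prefers fit_transform() on intermediate steps when it exists, which is why overriding it is enough to remove the redundant pass during training.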

I created a small simulated dataset to test the effect:

# Randomly generate timeseries data for performance testing.
import pandas as pd
import numpy as np
import scipy as sp
import scipy.signal  # explicit submodule import; `import scipy` alone does not guarantee sp.signal is available
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from tsfresh.transformers import RelevantFeatureAugmenter


def gen_classification(n_instances, n_cols, seq_len, noise_prop, n_classes=2, random_state=1):
    np.random.seed(random_state)
    X = pd.DataFrame({'feature' + str(f): [0.0] * n_instances * seq_len for f in range(n_cols)})  # float init, since floats are assigned via .loc below
    y = pd.Series([np.random.choice(n_classes) for _ in range(n_instances)])
    X['instance_id'] = [i for i in range(n_instances) for _ in range(seq_len)]
    X['order'] = range(len(X))
    axis = np.linspace(0, 1, num=seq_len)
    for col_i in range(n_cols):
        form = np.random.choice(['square', 'sin', 'sawtooth', 'uniform', 'random'])
        hz = np.random.random() * seq_len
        for i in range(n_instances):
            if form == 'square':
                ts = (sp.signal.square(axis * 2 * np.pi * hz) + 1) / 2
            elif form == 'sin':
                ts = np.sin(axis * 2 * np.pi * hz)
            elif form == 'sawtooth':
                ts = (sp.signal.sawtooth(axis * 2 * np.pi * hz) + 1) / 2
            elif form == 'uniform':
                ts = axis * 0
            elif form == 'random':
                ts = np.random.random(seq_len)
            ts += y[i] * (1 - noise_prop)  # Distinguish classes
            ts += (np.random.random(seq_len) - .5) * noise_prop  # Add noise
            X.loc[i * seq_len:i * seq_len + seq_len - 1, 'feature' + str(col_i)] = ts
    return X, y


if __name__ == '__main__':
    ts, y = gen_classification(500, 10, 5, .9)
    augmenter = RelevantFeatureAugmenter(column_id='instance_id', column_sort='order',
                                         timeseries_container=ts, hypotheses_independent=True)
    pipeline = Pipeline([('augmenter', augmenter),
                         ('classifier', RandomForestClassifier(random_state=1))])
    X = pd.DataFrame(index=y.index)
    print('Cross-val accuracies:', cross_val_score(pipeline, X, y, cv=4))

Before this optimization, the above code took 5m36s to run on my computer. After, it took 4m24s. In detail, the output before this optimization:

$ time python perftest.py
Feature Extraction: 100%|████████████████████████████| 20/20 [00:46<00:00,  2.35s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:17<00:00,  1.12it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:06<00:00,  3.30it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:46<00:00,  2.31s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:14<00:00,  1.43it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:04<00:00,  4.30it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:46<00:00,  2.33s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:13<00:00,  1.44it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:04<00:00,  4.31it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:47<00:00,  2.36s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:17<00:00,  1.16it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:05<00:00,  3.53it/s]
Cross-val accuracies: [0.928 0.912 0.944 0.936]

real    5m36.578s
user    19m17.094s
sys     0m17.709s

It is apparent in the above output that, for each of the 4 cross-validation folds, the initial feature extraction takes longest, followed by a shorter re-extraction of only relevant features in training data, followed by an even shorter extraction of relevant features in testing data. This second step is the one that can be avoided. The new output after this optimization looks like this:

$ time python perftest.py
Feature Extraction: 100%|████████████████████████████| 20/20 [00:44<00:00,  2.20s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:05<00:00,  3.43it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:46<00:00,  2.34s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:04<00:00,  4.15it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:51<00:00,  2.59s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:04<00:00,  4.01it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:46<00:00,  2.33s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:05<00:00,  3.91it/s]
Cross-val accuracies: [0.928 0.912 0.944 0.936]

real    4m24.546s
user    15m7.825s
sys     0m15.585s

The re-extraction step is now gone. This saves me a ton of time in practice, especially for datasets with many relevant features.

@coveralls commented Apr 17, 2020

Coverage Status: coverage decreased (-0.05%) to 97.057% when pulling 8b953ef on pnb:perf_rfa into 892e1ba on blue-yonder:master.

@nils-braun (Collaborator)

Hi @pnb! Thank you very much for this well-documented and well-prepared PR! Nice that you solved that problem.
Your code is fine; however, I have a general comment: it involves a lot of code duplication. Therefore, my suggestion:

  • create a function, maybe named _fit_and_augment, which contains all the code of the fit function and returns the X_augmented
  • call this function in fit (which still returns self) and also in fit_transform, adding the few additional lines there.

How does this sound?
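The suggested refactor could look roughly like this (a simplified sketch; _fit_and_augment is the name proposed above, while _extract_and_select is a stand-in for the real extraction and selection code):

```python
class RelevantFeatureAugmenterSketch:
    """Sketch of the deduplication suggested above: fit and fit_transform
    share a _fit_and_augment helper, so the extraction logic lives in one
    place. (Not the real tsfresh class; bodies are simplified.)"""

    def _extract_and_select(self, X, y=None):
        # Stand-in for the expensive extraction + relevance selection.
        return [x * 2 for x in X]

    def _fit_and_augment(self, X, y=None):
        # All of the former fit() body lives here; it returns X_augmented.
        self.augmented_ = self._extract_and_select(X, y)
        return self.augmented_

    def fit(self, X, y=None):
        self._fit_and_augment(X, y)
        return self  # sklearn convention: fit returns the estimator itself

    def fit_transform(self, X, y=None):
        # The few fit_transform-specific lines go around this call.
        return self._fit_and_augment(X, y)


aug = RelevantFeatureAugmenterSketch()
print(aug.fit_transform([1, 2, 3]))  # → [2, 4, 6]
```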

@nils-braun nils-braun self-assigned this Apr 19, 2020
@pnb (Contributor, Author) commented Apr 19, 2020

@nils-braun Sounds good! I didn't like the amount of code duplication either, to be honest. I will make that change.

Moved duplicate code from fit() and fit_transform() to a helper function
@nils-braun (Collaborator)

Nice work!

@nils-braun nils-braun merged commit 88200cc into blue-yonder:master Apr 20, 2020