
Optimize RelevantFeatureAugmenter to avoid re-extraction #669

Merged · 2 commits · Apr 20, 2020
Conversation

@pnb (Contributor) commented Apr 17, 2020

I am working on some datasets where the feature extraction step is very time consuming, sometimes totaling multiple days when running a suite of experiments. I'm using RelevantFeatureAugmenter in a scikit-learn Pipeline object, and noticed that the training features are extracted twice, because fit() extracts them once and transform() re-extracts a subset of them (the relevant ones). I see why this is needed, since fit() and transform() need to be independent for the sake of re-applying the transformer to new data.

However, in a scikit-learn pipeline, fit_transform() is called during training since the two steps are sequential. For these situations, I added a fit_transform() function which avoids re-extracting the relevant features.
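The effect of adding fit_transform() can be sketched with a toy transformer (illustrative only; the class, helper names, and counting logic below are mine, not tsfresh's actual implementation):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CachingAugmenter(BaseEstimator, TransformerMixin):
    """Toy transformer illustrating the optimization: fit_transform()
    returns the features already computed during fitting instead of
    re-extracting them. (Sketch only; not the real tsfresh class.)"""

    def __init__(self):
        self.n_extractions = 0  # count expensive extraction passes

    def _extract(self, X):
        # Stand-in for the expensive tsfresh feature extraction.
        self.n_extractions += 1
        return X.assign(feat=X.iloc[:, 0] * 2)

    def fit(self, X, y=None):
        # Full extraction (plus relevance selection in the real class).
        self.augmented_ = self._extract(X)
        return self

    def transform(self, X):
        # Must re-extract, so the transformer works on new, unseen data.
        return self._extract(X)

    def fit_transform(self, X, y=None):
        # In a Pipeline, fit is immediately followed by transform on the
        # same data, so the result of fitting can be returned directly.
        return self.fit(X, y).augmented_


X = pd.DataFrame({'a': [1, 2, 3]})
t1 = CachingAugmenter()
t1.fit(X).transform(X)  # separate fit + transform: two extractions
t2 = CachingAugmenter()
t2.fit_transform(X)     # fit_transform: one extraction
print(t1.n_extractions, t2.n_extractions)  # prints: 2 1
```

scikit-learn's Pipeline prefers fit_transform() on intermediate steps when it exists, which is why overriding it is enough to remove the redundant pass during training.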

I created a small simulated dataset to test the effect:

# Randomly generate timeseries data for performance testing.
import pandas as pd
import numpy as np
import scipy as sp
import scipy.signal  # explicit submodule import; `import scipy` alone does not guarantee sp.signal is available
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from tsfresh.transformers import RelevantFeatureAugmenter


def gen_classification(n_instances, n_cols, seq_len, noise_prop, n_classes=2, random_state=1):
    np.random.seed(random_state)
    X = pd.DataFrame({'feature' + str(f): [0.0] * n_instances * seq_len for f in range(n_cols)})  # float init, since floats are assigned via .loc below
    y = pd.Series([np.random.choice(n_classes) for _ in range(n_instances)])
    X['instance_id'] = [i for i in range(n_instances) for _ in range(seq_len)]
    X['order'] = range(len(X))
    axis = np.linspace(0, 1, num=seq_len)
    for col_i in range(n_cols):
        form = np.random.choice(['square', 'sin', 'sawtooth', 'uniform', 'random'])
        hz = np.random.random() * seq_len
        for i in range(n_instances):
            if form == 'square':
                ts = (sp.signal.square(axis * 2 * np.pi * hz) + 1) / 2
            elif form == 'sin':
                ts = np.sin(axis * 2 * np.pi * hz)
            elif form == 'sawtooth':
                ts = (sp.signal.sawtooth(axis * 2 * np.pi * hz) + 1) / 2
            elif form == 'uniform':
                ts = axis * 0
            elif form == 'random':
                ts = np.random.random(seq_len)
            ts += y[i] * (1 - noise_prop)  # Distinguish classes
            ts += (np.random.random(seq_len) - .5) * noise_prop  # Add noise
            X.loc[i * seq_len:i * seq_len + seq_len - 1, 'feature' + str(col_i)] = ts
    return X, y


if __name__ == '__main__':
    ts, y = gen_classification(500, 10, 5, .9)
    augmenter = RelevantFeatureAugmenter(column_id='instance_id', column_sort='order',
                                         timeseries_container=ts, hypotheses_independent=True)
    pipeline = Pipeline([('augmenter', augmenter),
                         ('classifier', RandomForestClassifier(random_state=1))])
    X = pd.DataFrame(index=y.index)
    print('Cross-val accuracies:', cross_val_score(pipeline, X, y, cv=4))

Before this optimization, the above code took 5m36s to run on my computer. After, it took 4m24s. In detail, the output before this optimization:

$ time python perftest.py
Feature Extraction: 100%|████████████████████████████| 20/20 [00:46<00:00,  2.35s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:17<00:00,  1.12it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:06<00:00,  3.30it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:46<00:00,  2.31s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:14<00:00,  1.43it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:04<00:00,  4.30it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:46<00:00,  2.33s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:13<00:00,  1.44it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:04<00:00,  4.31it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:47<00:00,  2.36s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:17<00:00,  1.16it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:05<00:00,  3.53it/s]
Cross-val accuracies: [0.928 0.912 0.944 0.936]

real    5m36.578s
user    19m17.094s
sys     0m17.709s

It is apparent in the above output that, for each of the 4 cross-validation folds, the initial feature extraction takes longest, followed by a shorter re-extraction of only relevant features in training data, followed by an even shorter extraction of relevant features in testing data. This second step is the one that can be avoided. The new output after this optimization looks like this:

$ time python perftest.py
Feature Extraction: 100%|████████████████████████████| 20/20 [00:44<00:00,  2.20s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:05<00:00,  3.43it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:46<00:00,  2.34s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:04<00:00,  4.15it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:51<00:00,  2.59s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:04<00:00,  4.01it/s]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:46<00:00,  2.33s/it]
Feature Extraction: 100%|████████████████████████████| 20/20 [00:05<00:00,  3.91it/s]
Cross-val accuracies: [0.928 0.912 0.944 0.936]

real    4m24.546s
user    15m7.825s
sys     0m15.585s

The re-extraction step is now gone. This saves me a ton of time in practice, especially for datasets with many relevant features.

@coveralls commented Apr 17, 2020

Coverage Status: coverage decreased (-0.05%) to 97.057% when pulling 8b953ef on pnb:perf_rfa into 892e1ba on blue-yonder:master.

@nils-braun (Collaborator)

Hi @pnb! Thank you very much for this well-documented and well-prepared PR! Nice that you solved that problem.
Your code is fine; however, I have a general comment: it involves a lot of code duplication. Therefore, my suggestion:

  • create a function, maybe named _fit_and_augment, which contains all the code of the fit function and returns the X_augmented
  • call this function in fit (which still returns self) and also in fit_transform, adding the few additional lines there.

How does this sound?
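The suggested refactor could look roughly like this (a simplified sketch; _fit_and_augment is the name proposed above, while _extract_and_select is a stand-in for the real extraction and selection code):

```python
class RelevantFeatureAugmenterSketch:
    """Sketch of the deduplication suggested above: fit and fit_transform
    share a _fit_and_augment helper, so the extraction logic lives in one
    place. (Not the real tsfresh class; bodies are simplified.)"""

    def _extract_and_select(self, X, y=None):
        # Stand-in for the expensive extraction + relevance selection.
        return [x * 2 for x in X]

    def _fit_and_augment(self, X, y=None):
        # All of the former fit() body lives here; it returns X_augmented.
        self.augmented_ = self._extract_and_select(X, y)
        return self.augmented_

    def fit(self, X, y=None):
        self._fit_and_augment(X, y)
        return self  # sklearn convention: fit returns the estimator itself

    def fit_transform(self, X, y=None):
        # The few fit_transform-specific lines go around this call.
        return self._fit_and_augment(X, y)


aug = RelevantFeatureAugmenterSketch()
print(aug.fit_transform([1, 2, 3]))  # → [2, 4, 6]
```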

@nils-braun nils-braun self-assigned this Apr 19, 2020
@pnb (Contributor, Author) commented Apr 19, 2020

@nils-braun Sounds good! I didn't like the amount of code duplication either, to be honest. I will make that change.

Moved duplicate code from fit() and fit_transform() to a helper function
@nils-braun (Collaborator)

Nice work!

@nils-braun nils-braun merged commit 88200cc into blue-yonder:master Apr 20, 2020