Improved speed #681

Merged
merged 24 commits into from
May 11, 2020

Conversation

nils-braun
Collaborator

This PR introduces three things:

  • it uses the numpy.quantile function instead of the one from pandas because it is much faster. For this, we need numpy >= 1.15.0, which is fine as the current version is 1.18 anyway.
  • it replaces the sample_entropy function. Honestly, I did not really understand the old function, and when I compare it with Wikipedia, I think (think!) that the implementation was not 100% correct. Wikipedia actually has a sample implementation in Python, which uses numpy functions instead of for loops and looks much more like the formula. Unfortunately, it gives slightly different numbers (so this is a breaking change), but I trust the new numbers more. Maybe someone can comment on this?
  • it also adds a script for doing large-scale scaling and timing measurements. I did those measurements and will publish my results soon.

With the code of this PR, I get a speed-up between 1.17 (for short time series, e.g. length 100) and 2.4 and growing (for long time series, e.g. length 5000) when using all features. I did not test it, but I assume it will get even better with even longer time series.
The "sample_entropy" is not part of the efficient features, so people with runtime constraints are not using it anyhow, but the "quantile" fix will help, especially for short time series.
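
To illustrate the quantile swap from the first bullet, here is a minimal sketch. It is not the code from this PR: the sample data, the variable names and the repeat count are made up for illustration, and the timing ratio depends on the machine and library versions, so no numbers are claimed. Both calls use linear interpolation by default, which is why the two values should agree.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up sample data (assumption, not taken from the PR).
x = np.array([1.0, 4.0, 5.0, 1.0, 7.0, 3.0, 1.0, 2.0, 5.0, 8.0])
q = 0.25

# Same quantile, computed once via pandas and once via numpy (needs numpy >= 1.15).
pandas_value = pd.Series(x).quantile(q)
numpy_value = np.quantile(x, q)
assert np.isclose(pandas_value, numpy_value)

# Rough timing comparison; results vary between machines and versions.
pandas_time = timeit.timeit(lambda: pd.Series(x).quantile(q), number=200)
numpy_time = timeit.timeit(lambda: np.quantile(x, q), number=200)
print(f"pandas: {pandas_time:.4f}s, numpy: {numpy_time:.4f}s")
```

For a careful measurement one would repeat this over many series lengths, as the timing script added in this PR does for the full feature extraction.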

@MaxBenChrist, feel free to re-assign (e.g. to @kempa-liehr ) if you do not have time!

@nils-braun nils-braun requested a review from MaxBenChrist May 1, 2020 14:15
@github-actions

github-actions bot commented May 1, 2020

You have style errors. See them below.

./tsfresh/scripts/measure_execution_time.py:163:34: W292 no newline at end of file
./tsfresh/feature_extraction/feature_calculators.py:1542:12: E262 inline comment should start with '# '

@coveralls

coveralls commented May 1, 2020

Coverage Status

Coverage increased (+0.3%) to 96.733% when pulling 888fb98 on feature/improved-speed into 5bdbbcf on master.

@nils-braun
Collaborator Author

I posted my findings on the execution time studies here: https://nils-braun.github.io/execution-time/

@nils-braun nils-braun requested review from kempa-liehr and removed request for MaxBenChrist May 8, 2020 19:34
@nils-braun
Collaborator Author

@kempa-liehr Do you have some spare time to have a look into this PR?

# This script extracts the execution time for
# various different settings of tsfresh
# using different input data
# Attention: it will run for ~half a day
Collaborator

on how many cores?

Collaborator Author

On two. I have described the machine setup on my blog-post: https://nils-braun.github.io/execution-time/

Collaborator Author

Sorry, to be more clear:
it does not matter how many cores the machine has, as the number of cores is a parameter which is set to 0, 1 and 4 for the tests (to see the scaling).
But I did my studies on a Google Cloud 2-core (4 threads) virtual machine.

tsfresh/scripts/measure_execution_time.py (outdated)
@@ -794,7 +794,7 @@ def test_sample_entropy(self):
ts = [1, 4, 5, 1, 7, 3, 1, 2, 5, 8, 9, 7, 3, 7, 9, 5, 4, 3, 9, 1, 2, 3, 4, 2, 9, 6, 7, 4, 9, 2, 9, 9, 6, 5, 1,
Collaborator

Can you add some more unit tests for sample_entropy?
I am missing:

  • short time series
  • negative values in time series
  • only negative values in time series
  • np.NaN in time series
    etc.

Collaborator Author

I added more tests and some documentation on the tests in my last commit.

N = len(x)

# Split time series and save all templates of length m
xmi = np.array([x[i:i + m] for i in range(N - m)])
Collaborator

Collaborator Author

No. np.split splits without overlap.
We want to turn the array [1, 2, 3, 4] into [1, 2], [2, 3], [3, 4] but np.split would only give [1, 2], [3, 4].
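
The difference between the two approaches can be seen in a small sketch. The array and the window length are taken from the example above; the variable names are only illustrative. Note that this sketch enumerates all `len(x) - m + 1` windows, whereas the line under review uses `range(N - m)` and therefore stops one window earlier.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
m = 2  # window (template) length

# np.split cuts the array into consecutive, non-overlapping pieces.
chunks = np.split(x, len(x) // m)

# A list comprehension over start indices yields overlapping windows.
templates = np.array([x[i:i + m] for i in range(len(x) - m + 1)])

print([c.tolist() for c in chunks])  # [[1, 2], [3, 4]]
print(templates.tolist())            # [[1, 2], [2, 3], [3, 4]]
```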

@nils-braun
Collaborator Author

Thanks for the review @MaxBenChrist. Your comments were reasonable and I implemented your feedback.
While doing so, I had another look at the definition of the sample entropy, and it seems to me that the Python implementation given on Wikipedia is also wrong. I corrected this in the code and also posted a question on Wikipedia.
For long time series, the difference is not large, but for shorter ones it is. As I think this version is more correct, I will merge this tomorrow. Feel free to comment if you actually know how the sample entropy should be defined :-)
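
For readers who want to follow the discussion, here is a minimal sketch of one common reading of the definition, SampEn = -ln(A / B), where B counts pairs of length-m templates within a tolerance r and A counts the same for length m + 1. This is not the implementation merged in this PR. The function name, the tolerance of 0.2 times the standard deviation, the `<=` comparison and the use of the same N - m start positions for both lengths are assumptions made for illustration.

```python
import numpy as np


def sample_entropy_sketch(x, m=2, r_factor=0.2):
    """Illustrative SampEn = -ln(A / B); hypothetical helper, not tsfresh code."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = r_factor * np.std(x)  # tolerance (assumed convention)

    def count_matches(length):
        # Use the same n - m start positions for both template lengths.
        templates = np.array([x[i:i + length] for i in range(n - m)])
        # Chebyshev distance between every pair of templates.
        dist = np.abs(templates[:, None, :] - templates[None, :, :]).max(axis=2)
        # Count ordered pairs within the tolerance, excluding self-matches.
        return np.sum(dist <= r) - len(templates)

    b = count_matches(m)
    a = count_matches(m + 1)
    # If a or b is zero, the result is inf or nan; a real implementation
    # would need to decide how to handle that case.
    return -np.log(a / b)


# A perfectly periodic series is fully predictable: A equals B.
periodic = sample_entropy_sketch([1, 2] * 20)

# Seeded random noise is less regular, so fewer matches survive at m + 1.
rng = np.random.default_rng(0)
noisy = sample_entropy_sketch(rng.normal(size=200))
```

The pairwise distance matrix needs memory quadratic in the series length, so this form is only suitable for short series.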

@github-actions

You have style errors. See them below.

./tsfresh/examples/driftbif_simulation.py:135:22: E741 ambiguous variable name 'l'

@nils-braun nils-braun merged commit 8106334 into master May 11, 2020
@nils-braun nils-braun deleted the feature/improved-speed branch May 11, 2020 20:59
earthgecko added a commit to earthgecko/tsfresh that referenced this pull request Dec 31, 2020
IssueID #3924: v0.17.9

- Readded baseline unit tests
- Revert to the original sum_of_reoccurring_values v0.4.0 method which was
  changed and the new feature called sum_of_reoccurring_data_points was
  added which results in the same value as the original v0.4.0
  sum_of_reoccurring_values method. The new sum_of_reoccurring_values method
  introduced results in different results as per:
  NOT in baseline   :: [['value__sum_of_reoccurring_values', '49922.0']]
  NOT in calculated :: [['value__sum_of_reoccurring_values', '109822.0']]
- Disable estimate_friedrich_coefficients feature added in v0.6.0
- Disable friedrich_coefficients feature added in v0.6.0
- Disabled max_langevin_fixed_point added in v0.6.0
- Disabled friedrich_coefficients and max_langevin_fixed_point in settings added
  in v0.6.0
- Updated very minor precision changes in the following features which changed
  in v0.6.0
  value__autocorrelation__lag_6 old: 0.5124801685138611, new: 0.5124801685138614, diff: -0.00000000000000022204
  value__autocorrelation__lag_8 old: 0.3600822542968588, new: 0.3600822542968586, diff: 0.00000000000000022204
  value__autocorrelation__lag_5 old: 0.46463952576506423, new: 0.46463952576506445, diff: -0.00000000000000022204
  value__autocorrelation__lag_1 old: 0.5154799442499527, new: 0.5154799442499526, diff: 0.00000000000000011102
  value__autocorrelation__lag_7 old: 0.6538534951469427, new: 0.6538534951469428, diff: -0.00000000000000011102
  value__autocorrelation__lag_2 old: 0.36765813197781533, new: 0.36765813197781516, diff: 0.00000000000000016653
  value__autocorrelation__lag_9 old: 0.21748400096837436, new: 0.21748400096837414, diff: 0.00000000000000022204
  value__augmented_dickey_fuller old: -0.8041220342033505, new: -0.8041220342033477, diff: -0.00000000000000277556
  value__mean_autocorrelation old: 1.1720475293977406, new: 1.1720475293977404, diff: 0.00000000000000022204
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_0__w_2" old: -40.265846960764975, new: -40.26584696076512, diff: 0.00000000000014210855
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_1__w_2" old: 5485.741180131765, new: 5485.741180131762, diff: 0.00000000000272848411
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_2__w_2" old: 7535.022844459651, new: 7535.02284445965, diff: 0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_3__w_2" old: 6017.192007927548, new: 6017.192007927546, diff: 0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_4__w_2" old: 3308.4304014332156, new: 3308.4304014332133, diff: 0.00000000000227373675
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_5__w_2" old: 1295.7433671924819, new: 1295.7433671924832, diff: -0.00000000000136424205
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_7__w_2" old: 39.916767258584514, new: 39.91676725858371, diff: 0.00000000000080291329
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_8__w_2" old: 17.955485691823014, new: 17.95548569182395, diff: -0.00000000000093436370
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_9__w_2" old: 50.259030087877306, new: 50.25903008787768, diff: -0.00000000000037658765
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_10__w_2" old: 35.90470247450105, new: 35.90470247450137, diff: -0.00000000000031974423
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_11__w_2" old: -24.14602386100944, new: -24.14602386100941, diff: -0.00000000000002842171
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_12__w_2" old: -61.88712524130847, new: -61.88712524130824, diff: -0.00000000000022737368
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_13__w_2" old: -33.668504325219715, new: -33.66850432521918, diff: -0.00000000000053290705
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_14__w_2" old: 24.20883821024688, new: 24.2088382102474, diff: -0.00000000000051869620
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_0__w_5" old: -20.257597134272146, new: -20.25759713427192, diff: -0.00000000000022737368
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_1__w_5" old: 3771.325441515319, new: 3771.32544151532, diff: -0.00000000000090949470
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_2__w_5" old: 7120.960920890311, new: 7120.960920890312, diff: -0.00000000000090949470
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_4__w_5" old: 11207.92940647991, new: 11207.929406479912, diff: -0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_5__w_5" old: 11696.157551031656, new: 11696.157551031654, diff: 0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_6__w_5" old: 11253.943680982826, new: 11253.943680982822, diff: 0.00000000000363797881
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_7__w_5" old: 10110.89944351567, new: 10110.899443515671, diff: -0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_8__w_5" old: 8545.47382821769, new: 8545.473828217693, diff: -0.00000000000363797881
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_9__w_5" old: 6826.238621617836, new: 6826.238621617837, diff: -0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_10__w_5" old: 5169.353887616803, new: 5169.353887616802, diff: 0.00000000000090949470
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_11__w_5" old: 3717.969303101324, new: 3717.9693031013257, diff: -0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_12__w_5" old: 2542.0196875693546, new: 2542.019687569354, diff: 0.00000000000045474735
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_13__w_5" old: 1652.101855511854, new: 1652.1018555118546, diff: -0.00000000000068212103
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_14__w_5" old: 1019.5707851504084, new: 1019.5707851504081, diff: 0.00000000000022737368
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_0__w_10" old: 836.6419785398183, new: 836.6419785398173, diff: 0.00000000000102318154
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_1__w_10" old: 3543.0796763032777, new: 3543.079676303278, diff: -0.00000000000045474735
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_3__w_10" old: 8634.724847532967, new: 8634.724847532969, diff: -0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_4__w_10" old: 10876.523736377072, new: 10876.52373637707, diff: 0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_5__w_10" old: 12835.398940237148, new: 12835.39894023715, diff: -0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_6__w_10" old: 14466.10948981898, new: 14466.109489818979, diff: 0.00000000000181898940
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_7__w_10" old: 15737.72244365614, new: 15737.722443656134, diff: 0.00000000000545696821
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_9__w_10" old: 17169.076640994837, new: 17169.07664099483, diff: 0.00000000000727595761
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_11__w_10" old: 17183.302683017104, new: 17183.302683017107, diff: -0.00000000000363797881
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_14__w_10" old: 15154.905872253841, new: 15154.905872253847, diff: -0.00000000000545696821
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_0__w_20" old: 18718.957258866503, new: 18718.957258866507, diff: -0.00000000000363797881
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_1__w_20" old: 20645.63503140842, new: 20645.635031408423, diff: -0.00000000000363797881
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_5__w_20" old: 28065.04062099347, new: 28065.040620993466, diff: 0.00000000000363797881
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_7__w_20" old: 31428.519814904776, new: 31428.519814904783, diff: -0.00000000000727595761
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_8__w_20" old: 32985.81511950059, new: 32985.8151195006, diff: -0.00000000000727595761
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_9__w_20" old: 34437.5408408601, new: 34437.54084086011, diff: -0.00000000000727595761
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_10__w_20" old: 35770.92323199827, new: 35770.923231998284, diff: -0.00000000001455191523
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_11__w_20" old: 36992.814788488264, new: 36992.81478848827, diff: -0.00000000000727595761
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_12__w_20" old: 38098.193912726434, new: 38098.19391272645, diff: -0.00000000001455191523
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_13__w_20" old: 39076.9898057395, new: 39076.98980573952, diff: -0.00000000002182787284
  "value__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_14__w_20" old: 39919.05725014527, new: 39919.05725014526, diff: 0.00000000000727595761
  value__spkt_welch_density__coeff_2 old: 1843.821171807498, new: 1843.8211718074986, diff: -0.00000000000045474735
  value__spkt_welch_density__coeff_8 old: 2536.9954700088933, new: 2536.9954700088906, diff: 0.00000000000272848411
  value__ar_coefficient__k_10__coeff_0 old: 904.439185079118, new: 904.4391850794491, diff: -0.00000000033105607145
  value__ar_coefficient__k_10__coeff_1 old: 0.16357894811580564, new: 0.1635789481157781, diff: 0.00000000000002753353
  value__ar_coefficient__k_10__coeff_2 old: -0.04324700014744565, new: -0.0432470001474492, diff: 0.00000000000000355271
  value__ar_coefficient__k_10__coeff_3 old: -0.06654237068303814, new: -0.06654237068301239, diff: -0.00000000000002575717
  value__ar_coefficient__k_10__coeff_4 old: 0.2836853193919353, new: 0.2836853193919273, diff: 0.00000000000000799361
  value__fft_coefficient__coeff_1 old: -0.8045103874789135, new: -0.8045103874789561, diff: 0.00000000000004263256
  value__fft_coefficient__coeff_2 old: -53.13286168327596, new: -53.13286168327602, diff: 0.00000000000005684342
  value__fft_coefficient__coeff_3 old: -338.00000000000006, new: -338.0, diff: -0.00000000000005684342
  value__fft_coefficient__coeff_4 old: 122.44503935479224, new: 122.44503935479203, diff: 0.00000000000021316282
  value__fft_coefficient__coeff_5 old: -58.930796134231116, new: -58.930796134230846, diff: -0.00000000000027000624
  value__fft_coefficient__coeff_6 old: 13.000000000000057, new: 13.0, diff: 0.00000000000005684342
  value__fft_coefficient__coeff_7 old: 112.23530652170982, new: 112.23530652170984, diff: -0.00000000000002842171
  value__fft_coefficient__coeff_8 old: 118.18782232848393, new: 118.18782232848395, diff: -0.00000000000001421085
- Readded baseline unit tests removed in v0.7.0
- Readded large_number_of_peaks removed in v0.9.0
- Readded mean_autocorrelation removed in v0.9.0
- Reverted to original augmented_dickey_fuller that was changed in v0.9.0
- Reverted to original fft_coefficient that was changed in v0.9.0
- Readded mean_abs_change_quantiles that was removed in v0.9.0
- Readded the original time_reversal_asymmetry_statistic that was in use pre
  v0.9.0 - blue-yonder#198
- Readded original autocorrelation that was removed in v0.9.0
- Disabled partial_autocorrelation added in v0.10.0
- Disabled cid_ce added in v0.11.1
- Disabled fft_aggregated added in v0.11.0
- Disabled Fix agg change made to agg_autocorrelation added in v0.11.1
blue-yonder@a53fb6a
- Changed to new value_count and range_count method added in v0.11.1
- Hardcoded TSFRESH_BASELINE_VERSION = '0.9.1' in tests
- Disabled linear_trend_timewise added in v0.12.0
- Readded tsfresh/examples/test_tsfresh_baseline_dataset.py which was removed
  in v0.12.0
- Use v0.11.01 value_count and range_count method not as per v0.13.0
- Disabled count_above and count_below features that were added in v0.15.0
- Readded the original percentage_of_reoccurring_datapoints_to_all_datapoints
  before the feature name change to percentage_of_reoccurring_values_to_all_values
  implemented in v0.17.0 (feature names should be immutable)
  blue-yonder#725
  blue-yonder@6f9c795
  blue-yonder#724
- Rename the new feature percentage_of_reoccurring_values_to_all_values to
  v0170_percentage_of_reoccurring_values_to_all_values and disabled
- Readded the original percentage_of_reoccurring_values_to_all_values
  before the feature name change to percentage_of_reoccurring_datapoints_to_all_datapoints
  implemented in v0.17.0 (feature names should be immutable)
- Rename the new feature percentage_of_reoccurring_datapoints_to_all_datapoints
  to v0170_percentage_of_reoccurring_datapoints_to_all_datapoints and disabled
- Disabled lempel_ziv_complexity,fourier_entropy and permutation_entropy
  features that were added in v0.17.0
- Revert to the original cwt_coefficients feature names changed in v0.16.0
- Renamed the new sample_entropy introduced in v0.16.0 to v0160_sample_entropy
  and readded sample_entropy from v0.15.1 as this is a breaking change as per:
  blue-yonder#681 and
  blue-yonder@ce493e5
- Configured settings for pre v0.9.0 features
- Hardcoded TSFRESH_BASELINE_VERSION = '0.17.9' in tests

Added:
tests/baseline/tsfresh-0.1.2.py2.data.json.features.transposed.csv
tests/baseline/tsfresh-0.3.0.py2.data.json.features.transposed.csv
tests/baseline/tsfresh-0.3.0.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.3.1.py2.data.json.features.transposed.csv
tests/baseline/tsfresh-0.3.1.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.4.0.py2.data.json.features.transposed.csv
tests/baseline/tsfresh-0.4.0.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.5.0.py2.data.json.features.transposed.csv
tests/baseline/tsfresh-0.5.0.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.5.1.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.6.0.py2.data.json.features.transposed.csv
tests/baseline/tsfresh-0.6.0.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.6.1.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.7.2.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.8.2.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.9.1.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.10.2.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.11.3.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.12.1.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.13.1.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.14.1.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.15.2.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.16.1.py3.data.json.features.transposed.csv
tests/baseline/tsfresh-0.17.9.py3.data.json.features.transposed.csv
tests/baseline/tsfresh_features_test.py
Modified:
CHANGES.rst
README.md
tsfresh/feature_extraction/feature_calculators.py
tsfresh/feature_extraction/settings.py
3 participants