Value of `id` column returned by roll_time_series is set to `column_sort` #673

ironerumi · 2020-04-25T16:16:02Z

Hi, probably I do not fully understand the usage of tsfresh.utilities.dataframe_functions.roll_time_series.

Say if I have input data like below and apply roll_time_series to it:

import pandas as pd
from tsfresh.utilities.dataframe_functions import roll_time_series

df = pd.DataFrame(
    {
        "id": ["A", "A", "A", "B", "B", "B"],
        "time": [1, 2, 3, 1, 2, 3],
        "x": [11, 12, 13, 14, 15, 16],
        "y": [21, 22, 23, 24, 25, 26],
    }
)
df_roll = roll_time_series(
    df,
    column_id="id",
    column_sort="time",
    column_kind=None,
    rolling_direction=10,
)

The result is shown below. I expected the id column to remain the same, but it was set to the value of column_sort.

idx	time	x	y	id
6	1.0	11.0	21.0	1
7	1.0	14.0	24.0	1
2	1.0	11.0	21.0	2
3	1.0	14.0	24.0	2
8	2.0	12.0	22.0	2
9	2.0	15.0	25.0	2
0	1.0	11.0	21.0	3
1	1.0	14.0	24.0	3
4	2.0	12.0	22.0	3
5	2.0	15.0	25.0	3
10	3.0	13.0	23.0	3
11	3.0	16.0	26.0	3

Is it assumed that there are no duplicate times across ids?
Is there a way to distinguish data from different ids?

Many thanks!

nils-braun · 2020-04-25T17:34:17Z

Hi @ironerumi!
Actually, you are very right. The id gets lost in this process and gets replaced by the last timestamp in the sub time series.
There are two possibilities:

you duplicate the id column before rolling and then combine the old id with the new id, which gives you a unique identifier for each sub time series again
or you wait for Better/Correct/Faster rolling #668, because I noticed the same thing and fixed it there (together with some additional things) :-)
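
The first option can be sketched without tsfresh at all. The snippet below is only an illustration in plain Python: the rows mimic part of the rolled output from the question, and `orig_id` is a hypothetical name for the duplicated id column, which is assumed to be carried through the rolling like any other value column.

```python
# Rows as they come out of the old rolling: "id" holds the last timestamp
# of each window, so windows from series A and B share the same id.
# "orig_id" is a hypothetical copy of the id column made before rolling.
rolled = [
    {"id": 2, "orig_id": "A", "time": 1, "x": 11},
    {"id": 2, "orig_id": "A", "time": 2, "x": 12},
    {"id": 2, "orig_id": "B", "time": 1, "x": 14},
    {"id": 2, "orig_id": "B", "time": 2, "x": 15},
]


def combine_ids(rows):
    """Return copies of rows whose id is the pair (original id, rolled id)."""
    return [{**row, "id": (row["orig_id"], row["id"])} for row in rows]


combined = combine_ids(rolled)
unique_ids = sorted({row["id"] for row in combined})
print(unique_ids)  # [('A', 2), ('B', 2)]
```

With the combined key, the two windows that previously shared id 2 can be told apart again.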

ironerumi · 2020-04-26T10:49:27Z

Hi @nils-braun, thanks for the quick response!

I'm certainly looking forward to seeing what the function looks like after the fix!
Thanks for the advice. I suppose I could duplicate id into kind, since kind seems to be kept after the transformation.

TSFRESH has helped me a lot and I certainly hope it keeps getting better!
Many thanks

nils-braun · 2020-04-29T20:26:46Z

The PR is merged. The old id is now kept as part of the identifier, in addition to the time shift!

konradsemsch · 2020-05-01T10:46:08Z

Hi @nils-braun! Is that really the case? I installed the package this morning from pip and faced exactly the same confusion as @ironerumi.

nils-braun · 2020-05-01T16:36:46Z

@konradsemsch Thanks for testing it so quickly!
The version you install from pip is a stable release. Not every pull request we merge automatically ends up in a new stable release (first, because we would produce a lot of releases and a lot of build artifacts, and second, because we really only want "stable" things in a release).
Therefore, the pip-installable tsfresh version (which is 0.15.1 at the moment) is not equal to the current head here in git.

You might have seen other git projects that only use "master" for released code (and have a develop branch, the so-called "git flow model"). We have chosen a different branching scheme.

Two possibilities:

you wait for the next release of tsfresh. We do not have a fixed release schedule, but I plan to do one after Improved speed #681 is merged
you install the latest master version of tsfresh with pip install git+https://github.com/blue-yonder/tsfresh.git
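
To see which of the two situations applies, the installed version can be checked from Python. This is a minimal sketch, assuming the fix first ships in release 0.16.0 (the version named later in this thread); `parse_version` and `has_rolling_fix` are hypothetical helper names, and the parser only understands plain dotted numbers.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.15.1' into (0, 15, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_rolling_fix(installed, first_fixed="0.16.0"):
    """True if the installed version is at least the assumed first fixed release."""
    return parse_version(installed) >= parse_version(first_fixed)


try:
    print(has_rolling_fix(version("tsfresh")))
except PackageNotFoundError:
    print("tsfresh is not installed")
```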

konradsemsch · 2020-05-04T19:52:51Z

Ok, thanks for the answer! :) Could you perhaps make a small example of how extract_features should be applied after roll_time_series, so that the features are available per rolled date and per time series id? My goal is to have a suitable forecasting structure across many different time series ids.

Unfortunately I find the documentation a little bit cryptic:
https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html?highlight=extract_features#tsfresh.feature_extraction.extraction.extract_features

I am not 100% sure what each of those parameters is responsible for once rolling has been applied to get a structure suitable for forecasting.

And the tutorial doesn't discuss feature extraction here: https://tsfresh.readthedocs.io/en/latest/text/forecasting.html

Could you shed a bit more light on this?

ironerumi · 2020-05-13T07:59:35Z

@konradsemsch you have probably solved it already, but I am posting the approach that works for me here.
Continuing with the df from the question, extract_features can be used like this:

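from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import roll_time_series

df_rolling = roll_time_series(df, column_id='id', column_sort='time', max_timeshift=1, min_timeshift=1)
df_features = extract_features(df_rolling, column_id='id', column_sort='time')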

df_rolling

id	time	x	y
id=A,timeshift=2	1	11	21
id=A,timeshift=2	2	12	22
id=A,timeshift=3	2	12	22
id=A,timeshift=3	3	13	23
id=B,timeshift=2	1	14	24
id=B,timeshift=2	2	15	25
id=B,timeshift=3	2	15	25
id=B,timeshift=3	3	16	26

df_features.iloc[:,0:3]

variable	x__abs_energy	x__absolute_sum_of_changes	x__agg_autocorrelation__f_agg_"mean"__maxlag_40
id=A,timeshift=2	265.0	1.0	-1.0
id=A,timeshift=3	313.0	1.0	-1.0
id=B,timeshift=2	421.0	1.0	-1.0
id=B,timeshift=3	481.0	1.0	-1.0

Basically, my impression is that column_id and column_sort are the same for both the rolling and the extract function. All values other than id are kept during the process.
Also, it seems that in 0.16 we can set min_timeshift, which I find quite useful, as sometimes we only want to shift between certain values. Many thanks

nils-braun · 2020-05-13T20:26:34Z

Thanks to @ironerumi for your answer and sorry to @konradsemsch for falling silent :-/

Just so we know where we need to improve the documentation: did https://tsfresh.readthedocs.io/en/latest/text/data_formats.html#data-formats-label not help you either?

Now to your question: after rolling (in v0.16.0!) your rolled dataframe will contain a new column called "id", which will be different for each package of rolled date + time series id.
As you want to have features extracted for each of those packages, this column is your column_id :-)
The sort column is just copied, so you pass the same name as column_sort here.
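
To make the idea of one "id" per package concrete, here is a rough plain-Python sketch. It is an illustrative re-implementation, not tsfresh's code: `roll_windows` is a hypothetical helper, the window id is modelled as the pair (original id, last timestamp), and the parameter semantics (windows of at most max_timeshift + 1 rows, windows of min_timeshift rows or fewer dropped) are an assumption based on the output posted above.

```python
from collections import defaultdict


def roll_windows(rows, max_timeshift=None, min_timeshift=0):
    """For every series and time step, emit the window ending at that step.

    Each window gets its own id: (original id, last timestamp in the window).
    """
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["id"]].append(row)

    rolled = []
    for series_id, series in by_id.items():
        series = sorted(series, key=lambda r: r["time"])
        for end in range(len(series)):
            start = 0 if max_timeshift is None else max(0, end - max_timeshift)
            window = series[start:end + 1]
            if len(window) <= min_timeshift:
                continue  # window too short, drop it
            window_id = (series_id, series[end]["time"])
            rolled.extend({**r, "id": window_id} for r in window)
    return rolled


rows = [
    {"id": "A", "time": 1, "x": 11}, {"id": "A", "time": 2, "x": 12},
    {"id": "A", "time": 3, "x": 13}, {"id": "B", "time": 1, "x": 14},
    {"id": "B", "time": 2, "x": 15}, {"id": "B", "time": 3, "x": 16},
]
rolled = roll_windows(rows, max_timeshift=1, min_timeshift=1)
window_ids = sorted({r["id"] for r in rolled})
print(window_ids)  # [('A', 2), ('A', 3), ('B', 2), ('B', 3)]

# Grouping by the new id is what column_id does during feature extraction;
# here a single feature (sum of squares, i.e. abs_energy) is computed by hand.
energy = defaultdict(float)
for r in rolled:
    energy[r["id"]] += r["x"] ** 2
print(energy[("A", 2)])  # 265.0
```

The four window ids and the value 265.0 correspond to the rows and the x__abs_energy column in the example posted above.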

Whether and how you need a column_value and a column_kind depends on your data format. If it is flat (as in the example of @ironerumi), you do not need to supply them at all.
If it is stacked (see https://tsfresh.readthedocs.io/en/latest/text/data_formats.html#data-formats-label), you will need to pass them to the function.
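
The difference between the two layouts can be shown with a small plain-Python sketch. `stack_rows` is a hypothetical helper, and the column names "kind" and "value" are made up for illustration; with data shaped like its output, those two names are what would be passed as column_kind and column_value.

```python
def stack_rows(flat_rows, id_key="id", sort_key="time"):
    """Turn flat rows (one column per kind) into stacked rows (one value per row)."""
    stacked = []
    for row in flat_rows:
        for kind, value in row.items():
            if kind in (id_key, sort_key):
                continue  # id and sort columns are repeated, not stacked
            stacked.append(
                {id_key: row[id_key], sort_key: row[sort_key], "kind": kind, "value": value}
            )
    return stacked


flat = [
    {"id": "A", "time": 1, "x": 11, "y": 21},
    {"id": "A", "time": 2, "x": 12, "y": 22},
]
stacked = stack_rows(flat)
print(stacked[0])  # {'id': 'A', 'time': 1, 'kind': 'x', 'value': 11}
```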

I will try to improve the documentation - both for the data formats and an example for feature extraction.

nils-braun · 2020-05-13T20:32:03Z

Ah, I just realized why I am so puzzled. All the new documentation we wrote on the rolling was not visible :-) (because of a misconfiguration with readthedocs...)
Could one of you have a look at https://tsfresh.readthedocs.io/en/v0.16.0/text/forecasting.html? It already includes a code example on how to do feature extraction (and it should be easier to follow than the previous documentation). I will nevertheless work on the column definitions...

ironerumi · 2020-05-16T14:35:57Z

@nils-braun thanks for the info! When I last checked the module reference, I suppose it was not yet at 0.16. I think forecasting.html now explains the overall flow more clearly.

nils-braun · 2020-05-17T08:29:56Z

Thanks for checking! And sorry that all the documentation was outdated; readthedocs was not updated properly.

konradsemsch · 2020-05-18T19:44:37Z

I think it is indeed better right now. Something I would still consider, though, is adding a tangible example to the repo as well. Right now you have two examples, I believe, but they do not cover multiple time series at a time. I think illustrating that would clear things up for a lot of users.

nils-braun added the bug label Apr 25, 2020

nils-braun self-assigned this Apr 25, 2020

nils-braun mentioned this issue Apr 25, 2020

Better/Correct/Faster rolling #668

Merged

nils-braun closed this as completed Apr 29, 2020

This was referenced May 18, 2020

Add an example on time series rolling with multiple dimensions #697

Closed

Reworked the notebooks. #701

Merged
