Value of id column returned by roll_time_series is set to column_sort #673

Closed
ironerumi opened this issue Apr 25, 2020 · 12 comments

@ironerumi

Hi, I probably do not fully understand how to use tsfresh.utilities.dataframe_functions.roll_time_series.

Say I have input data like the following and apply roll_time_series to it:

import pandas as pd
from tsfresh.utilities.dataframe_functions import roll_time_series

df = pd.DataFrame(
    {
        "id": ["A", "A", "A", "B", "B", "B"],
        "time": [1, 2, 3, 1, 2, 3],
        "x": [11, 12, 13, 14, 15, 16],
        "y": [21, 22, 23, 24, 25, 26],
    }
)
df_roll = roll_time_series(
    df,
    column_id="id",
    column_sort="time",
    column_kind=None,
    rolling_direction=10,
)

The result is shown below. I was expecting the id column to remain the same, but it was set to the value of column_sort.

idx time x y id
6 1.0 11.0 21.0 1
7 1.0 14.0 24.0 1
2 1.0 11.0 21.0 2
3 1.0 14.0 24.0 2
8 2.0 12.0 22.0 2
9 2.0 15.0 25.0 2
0 1.0 11.0 21.0 3
1 1.0 14.0 24.0 3
4 2.0 12.0 22.0 3
5 2.0 15.0 25.0 3
10 3.0 13.0 23.0 3
11 3.0 16.0 26.0 3
  • Is it expected that there are no duplicate time values across ids?
  • Is there a way to distinguish data from different ids?

Many thanks!

@nils-braun
Collaborator

Hi @ironerumi!
Actually, you are very right. The id gets lost in this process and gets replaced by the last timestamp in the sub time series.
There are two possibilities:

  • you duplicate the id column before rolling and then combine the old id with the new id, which will give you a unique identifier for each sub time series again
  • or you wait for Better/Correct/Faster rolling #668 - because I noticed the same thing and fixed it there (together with some additional things) :-)
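
To illustrate why combining the old id with the new one restores uniqueness, here is a minimal sketch in plain Python. It does not call tsfresh and is not its implementation: the helper `expanding_windows` and the `"A_3"`-style key format are made up for this illustration. It rebuilds the sub time series from the question by hand and keys them in two ways.

```python
from collections import defaultdict

# Same toy data as in the question, reduced to (id, time, x) rows.
rows = [
    ("A", 1, 11), ("A", 2, 12), ("A", 3, 13),
    ("B", 1, 14), ("B", 2, 15), ("B", 3, 16),
]


def expanding_windows(rows):
    """Yield (series_id, last_time, window_rows) for every prefix of each series.

    Hypothetical helper for illustration only, not part of tsfresh.
    """
    by_id = defaultdict(list)
    for row in rows:
        by_id[row[0]].append(row)
    for series_id, series in by_id.items():
        series.sort(key=lambda r: r[1])
        for end in range(1, len(series) + 1):
            window = series[:end]
            yield series_id, window[-1][1], window


# Keyed by the last timestamp only (the behaviour reported in this issue):
# windows of series A and B end up under the same key.
lossy = defaultdict(list)
# Keyed by original id plus last timestamp: each sub time series stays separate.
combined = defaultdict(list)
for series_id, last_time, window in expanding_windows(rows):
    lossy[last_time].extend(window)
    combined[f"{series_id}_{last_time}"].extend(window)

print(len(lossy), len(combined))  # 3 6
```

With the timestamp-only key, the group for time 3 contains six rows from both series, which matches the table in the question. With the combined key there are six separate groups, one per sub time series.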

@ironerumi
Author

Hi @nils-braun, thanks for the quick response!

I'm certainly looking forward to seeing what the function looks like after the fix!
Thanks for the advice. I suppose I will duplicate id into kind, since kind seems to be kept after the transformation.

TSFRESH has helped me a lot and I certainly hope it evolves even further!
Many thanks

@nils-braun
Collaborator

The PR is merged. The old id is now kept in addition to the time shift as the identifier!

@konradsemsch

Hi @nils-braun! Is that really the case? I installed the package this morning from pip and faced exactly the same confusion as @ironerumi.

@nils-braun
Collaborator

@konradsemsch Thanks for testing it so quickly!
The version you install from pip is a stable release. Not every pull request we merge automatically ends up in a new stable release: first, because we would otherwise produce a lot of releases and build artifacts, and second, because we really only want "stable" things in a release.
Therefore, the pip-installable tsfresh version (0.15.1 at the moment) is not the same as the current head here in git.

You might have seen this handled differently in other git projects, which use "master" only for released code (and have a develop branch, the so-called "git flow model"). We have chosen a different branching scheme.

Two possibilities:

  • you wait for the next release of tsfresh. We do not have a fixed release schedule, but I plan to do one after Improved speed #681 is merged
  • you install the latest master version of tsfresh by pip install git+https://github.com/blue-yonder/tsfresh.git
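
To find out which of the two cases applies to an existing environment, one can compare the installed version with the release that contains the fix. The sketch below uses only the standard library. The threshold 0.16.0 is taken from later in this thread, and the helpers `release_tuple` and `has_rolling_fix` are hypothetical names, not part of tsfresh. Note that a development install from git may report a version string that does not compare cleanly, so treat the result as a hint.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Release named later in this thread as the first one containing the fix.
FIXED_IN = (0, 16, 0)


def release_tuple(version_string):
    """Return the leading numeric components, e.g. '0.15.1' -> (0, 15, 1).

    Suffixes such as '.post0.dev3+g123' are ignored; short versions are
    padded with zeros so that '0.16' compares equal to '0.16.0'.
    """
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"cannot parse version: {version_string!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    return parts + (0,) * max(0, 3 - len(parts))


def has_rolling_fix(version_string):
    """True if the version is at least the release assumed to contain the fix."""
    return release_tuple(version_string) >= FIXED_IN


try:
    installed = version("tsfresh")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("tsfresh is not installed")
else:
    print(installed, "contains the fix:", has_rolling_fix(installed))
```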

@konradsemsch

konradsemsch commented May 4, 2020

Ok, thanks for the answer! :) Could you perhaps provide a small example of how extract_features should be applied after roll_time_series, so that the features are available per rolled date and per time series id? My goal is to have a structure suitable for forecasting across many different time series ids.

Unfortunately I find the documentation a little bit cryptic:
https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html?highlight=extract_features#tsfresh.feature_extraction.extraction.extract_features

I am not 100% sure what each of those parameters is responsible for after applying rolling, in order to get a structure suitable for forecasting:
[image]

And the tutorial doesn't discuss feature extraction here: https://tsfresh.readthedocs.io/en/latest/text/forecasting.html

Could you shed a bit more light on this?

@ironerumi
Author

@konradsemsch you have probably solved it already, but I will post what works for me here.
Continuing with the df from the question, extract_features can be used like this:

from tsfresh import extract_features

df_rolling = roll_time_series(df, 'id', column_sort='time', max_timeshift=1, min_timeshift=1)
df_features = extract_features(df_rolling, column_id='id', column_sort='time')

df_rolling

id time x y
id=A,timeshift=2 1 11 21
id=A,timeshift=2 2 12 22
id=A,timeshift=3 2 12 22
id=A,timeshift=3 3 13 23
id=B,timeshift=2 1 14 24
id=B,timeshift=2 2 15 25
id=B,timeshift=3 2 15 25
id=B,timeshift=3 3 16 26

df_features.iloc[:,0:3]

variable x__abs_energy x__absolute_sum_of_changes x__agg_autocorrelation__f_agg_"mean"__maxlag_40
id=A,timeshift=2 265.0 1.0 -1.0
id=A,timeshift=3 313.0 1.0 -1.0
id=B,timeshift=2 421.0 1.0 -1.0
id=B,timeshift=3 481.0 1.0 -1.0

Basically, my impression is that column_id and column_sort are the same for both the rolling and the extraction function. All values other than id are kept during the process.
Also, it seems that in 0.16 we can set min_timeshift, which I find quite useful, as sometimes we only want windows of a certain minimum length. Many thanks
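
As a cross-check of the two tables above, the following plain-Python sketch rebuilds the windows by hand and recomputes x__abs_energy as the sum of squared values. It is a reading of the output shown in this comment, not tsfresh's implementation: the helper `rolled_windows` is hypothetical, and it assumes that max_timeshift=1 limits each window to at most two rows and that min_timeshift=1 drops windows with fewer than two rows.

```python
def rolled_windows(series, max_timeshift, min_timeshift):
    """Return {last_time: [values]} for a list of (time, value) sorted by time.

    Hypothetical helper: each window ends at one time step, looks back at
    most `max_timeshift` steps, and is kept only if it spans at least
    `min_timeshift` steps (i.e. has at least `min_timeshift + 1` rows).
    """
    windows = {}
    for end in range(len(series)):
        start = max(0, end - max_timeshift)
        window = series[start:end + 1]
        if len(window) >= min_timeshift + 1:
            windows[series[end][0]] = [value for _, value in window]
    return windows


def abs_energy(values):
    """Sum of squared values, which is how abs_energy is defined."""
    return float(sum(v * v for v in values))


# Column x of the example data, per original id.
x = {
    "A": [(1, 11), (2, 12), (3, 13)],
    "B": [(1, 14), (2, 15), (3, 16)],
}

features = {}
for series_id, series in x.items():
    for last_time, values in rolled_windows(series, max_timeshift=1, min_timeshift=1).items():
        # Same id format as in the df_rolling table above.
        features[f"id={series_id},timeshift={last_time}"] = abs_energy(values)

for key, value in features.items():
    print(key, value)
```

The four values (265.0, 313.0, 421.0, 481.0) agree with the x__abs_energy column above, and the one-row windows ending at time 1 are dropped, which is why no "timeshift=1" id appears.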

@nils-braun
Collaborator

Thanks to @ironerumi for your answer and sorry to @konradsemsch for falling silent :-/

Just so we know where we need to improve the documentation: did https://tsfresh.readthedocs.io/en/latest/text/data_formats.html#data-formats-label not help you either?

Now to your question: after rolling (in v0.16.0!), your rolled dataframe will contain a new column called "id", which is different for each package of rolled date + time series id.
As you want to have features extracted for each of those packages, this column is your column_id :-)
The sort column is just copied, so you would pass the same name as column_sort here.

Whether and how you need a column_value and a column_kind depends on your data format. If it is flat (as in the example of @ironerumi), you do not need to supply them at all.
If it is stacked (see https://tsfresh.readthedocs.io/en/latest/text/data_formats.html#data-formats-label), you will need to pass them to the function.
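
To make the difference between the two layouts concrete, here is a small plain-Python sketch that turns the flat rows from the question into stacked rows. The helper `to_stacked` and the column names "kind" and "value" are chosen for illustration only; any names work as long as they are passed to the function consistently.

```python
# Flat layout: one column per kind of time series (x and y).
flat_rows = [
    {"id": "A", "time": 1, "x": 11, "y": 21},
    {"id": "A", "time": 2, "x": 12, "y": 22},
    {"id": "B", "time": 1, "x": 14, "y": 24},
    {"id": "B", "time": 2, "x": 15, "y": 25},
]


def to_stacked(rows, id_col="id", sort_col="time"):
    """Turn flat rows into stacked rows with one (kind, value) pair per row.

    Hypothetical helper for illustration; "kind" and "value" are assumed
    column names, not names required by tsfresh.
    """
    stacked = []
    for row in rows:
        for kind, value in row.items():
            if kind in (id_col, sort_col):
                continue
            stacked.append(
                {id_col: row[id_col], sort_col: row[sort_col], "kind": kind, "value": value}
            )
    return stacked


stacked_rows = to_stacked(flat_rows)
for row in stacked_rows[:2]:
    print(row)
# {'id': 'A', 'time': 1, 'kind': 'x', 'value': 11}
# {'id': 'A', 'time': 1, 'kind': 'y', 'value': 21}
```

For the flat layout, column_id and column_sort are enough. For a stacked layout like the one produced here, one would additionally pass column_kind="kind" and column_value="value".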

I will try to improve the documentation - both for the data formats and an example for feature extraction.

@nils-braun
Collaborator

nils-braun commented May 13, 2020

Ah, I just realized why I am so puzzled. All the new documentation we wrote on the rolling was not visible :-) (because of a mis-configuration with readthedocs...)
Could one of you have a look at https://tsfresh.readthedocs.io/en/v0.16.0/text/forecasting.html? It already includes a code example of how to do feature extraction (and it should be easier to follow than the previous documentation). I will nevertheless work on the column definitions...

@ironerumi
Author

@nils-braun thanks for the info! When I last checked the module reference, I suppose it was not yet at 0.16. I think forecasting.html now explains the overall flow more clearly.

@nils-braun
Collaborator

Thanks for checking! And sorry all the documentation was outdated - readthedocs was not updated properly.

@konradsemsch

I think it is indeed better now. Something I would consider, though, is adding a tangible example to the repo as well. Right now you have two examples, I believe, but they do not cover multiple time series at once. I think illustrating that would clear things up for a lot of users.


3 participants