Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix indexing of melted normalized input #563

Merged
merged 3 commits into from
Aug 19, 2019

Conversation

g12-al
Copy link
Contributor

@g12-al g12-al commented Aug 2, 2019

When using tsfresh with batched df input without column_value or column_kind specified, tsfresh will automatically convert it to the required melted format in _normalize_input_to_internal_representation.

However, the pd.melt() function works differently than the code assumes. If the column_sort contains many various types of indices that do not overlap, this results in wrong input data, resulting in incorrect input data for feature generation!

Reasoning: pd.melt() does not interleave per-ID series, it stacks them. Therefore, when assigning back the column_sort, the column_sort needs to be tiled, not repeated.

When using tsfresh with batched df input without column_value or column_kind specified, tsfresh will automatically convert it to the required melted format in _normalize_input_to_internal_representation.

However, the pd.melt() function works differently than the code assumes. If the column_sort contains many various types of indices that do not overlap, this results in wrong input data, resulting in incorrect input data for feature generation!

Reasoning: pd.melt() does not interleave per-ID series, it stacks them. Therefore, when assigning back the column_sort, the column_sort needs to be tiled, not repeated.
@coveralls
Copy link

Coverage Status

Coverage remained the same at ?% when pulling f6b6587 on g12-al:master into f7b037e on blue-yonder:master.

@g12-al
Copy link
Contributor Author

g12-al commented Aug 8, 2019

Just some words summarizing this in case it's not clear.

If you perform batching with tsfresh on inputs with mulitple columns containing values, the output for individual inputs will be different depending on the inputs used in the batch. This means that if you generated features using batches of 10, each with 2+ columns, the output for feature generation on just an individual input will not match the output when feature generation was applied on the input in the batch.

Features used in production models will likely have different distributions than when the models are being created and validated. I fear that this has far-reaching consequences for model accuracy.

@MaxBenChrist
Copy link
Collaborator

MaxBenChrist commented Aug 16, 2019

I understand the grave implications if this bug holds true. First, we need unit tests that verify that this is an issue, so that the current implementation is faulty. Then we should push the fix on top of the failing unit tests.

I started to work on those tests already but I am quite busy lately. If you could go and add an unit test that demonstrates the error first, and then apply the patch on top of it, that would be great.

@g12-al
Copy link
Contributor Author

g12-al commented Aug 16, 2019

Thank you @MaxBenChrist. I have some time today, so I'll try to write a unit test that demonstrates the issue.

In my own dataset, I found the minimal set of IDs which exhibited reordering with ToT tsfresh. I wasn't able to find a repro by manually constructing inputs. Then, I discovered the repro would still occur even when downsampling extremely. I downsampled my 250k row dataset down to 30 rows, and then manually clipped trimmed it until it was as small as possible.

This test is the minimum base-case. With ToT, using np.repeat, test_df0 will have the order of column "b" flipped after the call to the normalization code.
@g12-al
Copy link
Contributor Author

g12-al commented Aug 16, 2019

I've managed to distill the issue down into a simple unit test and pushed it. Please take a look and verify.

@MaxBenChrist
Copy link
Collaborator

I verified that this bug can result in wrong ordered time series under the following conditions

  • column_kind is None
  • column_value is None
  • the time series have different sizes

This bug will not change the actual values of the time series (so, there can't occur any leaks from train to test set) but change the order of the time series.

Can you please replace your unit test by the following two:

    def test_wide_dataframe_order_preserved_with_sort_column(self):
        """ verifies that the order of the sort column from a wide time series container is preserved
        """

        test_df = pd.DataFrame({'id': ["a", "a", "b"],
                                'v1': [3, 2, 1],
                                'v2': [13, 12, 11],
                                'sort': [103, 102, 101]})

        melt_df, _, _, _ = \
            dataframe_functions._normalize_input_to_internal_representation(
                test_df, column_id="id", column_sort="sort", column_kind=None, column_value=None)

        assert (test_df.sort_values("sort").query("id=='a'")["v1"].values ==
                melt_df.query("id=='a'").query("_variables=='v1'")["_values"].values).all()
        assert (test_df.sort_values("sort").query("id=='a'")["v2"].values ==
                melt_df.query("id=='a'").query("_variables=='v2'")["_values"].values).all()


    def test_wide_dataframe_order_preserved(self):
        """ verifies that the order of the time series inside a wide time series container are preserved
        (columns_sort=None)
        """
        test_df = pd.DataFrame({'id': ["a", "a", "a", "b"],
                                'v1': [4, 3, 2, 1],
                                'v2': [14, 13, 12, 11]})

        melt_df, _, _, _ = \
            dataframe_functions._normalize_input_to_internal_representation(
                test_df, column_id="id", column_sort=None, column_kind=None, column_value=None)

        assert (test_df.query("id=='a'")["v1"].values ==
                melt_df.query("id=='a'").query("_variables=='v1'")["_values"].values).all()
        assert (test_df.query("id=='a'")["v2"].values ==
                melt_df.query("id=='a'").query("_variables=='v2'")["_values"].values).all()

Afterwards we can merge this. Thanks again for reporting this! 👍

@g12-al
Copy link
Contributor Author

g12-al commented Aug 19, 2019

Great, thank you for your help with this Max! I've replaced my unit test with the ones you provided.

Would you still like to do the order which you specified above? I can move the unit tests to a separate pull request.
"First, we need unit tests that verify that this is an issue, so that the current implementation is faulty. Then we should push the fix on top of the failing unit tests."

@MaxBenChrist
Copy link
Collaborator

"First, we need unit tests that verify that this is an issue, so that the current implementation is faulty. Then we should push the fix on top of the failing unit tests."

Yes, I verified the bug and that your fix solves the issue locally, so it's fine to merge this pr. Thanks for reporting the issue again!

@MaxBenChrist MaxBenChrist merged commit ee4d5bb into blue-yonder:master Aug 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants