Fix indexing of melted normalized input #563
Conversation
When using tsfresh with batched df input without column_value or column_kind specified, tsfresh will automatically convert it to the required melted format in _normalize_input_to_internal_representation. However, the pd.melt() function works differently than the code assumes. If the column_sort contains varied, non-overlapping index values across IDs, the sort values get misaligned with the melted rows, resulting in incorrect input data for feature generation! Reasoning: pd.melt() does not interleave per-ID series, it stacks them. Therefore, when assigning back the column_sort, the column_sort needs to be tiled, not repeated.
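To make the melt behaviour concrete, here is a minimal sketch (the data and column names are illustrative, not taken from any real dataset) of how pd.melt() stacks the value columns and why the sort column has to be tiled, not repeated, when it is assigned back:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"id": ["a", "a", "b", "b"],
                       "sort": [10, 11, 20, 21],
                       "v1": [1, 2, 3, 4],
                       "v2": [5, 6, 7, 8]})

    # pd.melt() stacks all rows of v1 first, then all rows of v2;
    # it does not interleave v1/v2 per original row.
    melted = pd.melt(df, id_vars=["id"], value_vars=["v1", "v2"])

    # Assigning the sort column back therefore needs one full copy of the
    # original sort values per value column:
    melted["sort"] = np.tile(df["sort"].values, 2)   # 10, 11, 20, 21, 10, 11, 20, 21

    # Repeating each entry instead would pair the wrong sort values with the
    # stacked rows:
    # np.repeat(df["sort"].values, 2)                # 10, 10, 11, 11, 20, 20, 21, 21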
Just some words summarizing this in case it's not clear. If you perform batching with tsfresh on inputs with multiple value columns, the output for an individual input will differ depending on the other inputs used in the batch. This means that if you generated features using batches of 10, each with 2+ columns, the output of feature generation on an individual input on its own will not match the output when that input was processed as part of the batch. Features used in production models will likely have different distributions than when the models were created and validated. I fear that this has far-reaching consequences for model accuracy.
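To illustrate the symptom, here is a hypothetical check (toy data, not taken from this PR): the features of a single id should be identical whether it is extracted on its own or as part of a batch that contains other ids.

    import pandas as pd
    from tsfresh import extract_features

    batch = pd.DataFrame({"id": ["a", "a", "b", "b"],
                          "time": [0, 1, 10, 11],
                          "x": [1.0, 2.0, 3.0, 4.0],
                          "y": [5.0, 6.0, 7.0, 8.0]})

    # Features for id "a" extracted together with id "b" ...
    in_batch = extract_features(batch, column_id="id", column_sort="time")

    # ... compared with features for id "a" extracted on its own.
    alone = extract_features(batch[batch["id"] == "a"],
                             column_id="id", column_sort="time")

    # With the indexing bug, order-sensitive features can differ between the
    # two calls; with the fix they should match.
    print(in_batch.loc["a"].equals(alone.loc["a"]))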
I understand the grave implications if this bug holds true. First, we need unit tests that verify that this is an issue, i.e. that the current implementation is faulty. Then we should push the fix on top of the failing unit tests. I started to work on those tests already, but I am quite busy lately. If you could add a unit test that demonstrates the error first, and then apply the patch on top of it, that would be great.
Thank you @MaxBenChrist. I have some time today, so I'll try to write a unit test that demonstrates the issue.
In my own dataset, I found the minimal set of IDs that exhibited reordering with ToT tsfresh. I wasn't able to find a repro by manually constructing inputs. Then I discovered the repro would still occur even when downsampling extremely: I downsampled my 250k-row dataset down to 30 rows and then manually trimmed it until it was as small as possible. This test is the minimal base case. With ToT, using np.repeat, test_df0 will have the order of column "b" flipped after the call to the normalization code.
I've managed to distill the issue down into a simple unit test and pushed it. Please take a look and verify.
I verified that this bug can result in wrongly ordered time series under the following conditions.
This bug will not change the actual values of the time series (so no leaks from the train to the test set can occur), but it can change the order of the time series. Can you please replace your unit test with the following two:

    def test_wide_dataframe_order_preserved_with_sort_column(self):
        """ verifies that the order of the sort column from a wide time series container is preserved
        """
        test_df = pd.DataFrame({'id': ["a", "a", "b"],
                                'v1': [3, 2, 1],
                                'v2': [13, 12, 11],
                                'sort': [103, 102, 101]})
        melt_df, _, _, _ = \
            dataframe_functions._normalize_input_to_internal_representation(
                test_df, column_id="id", column_sort="sort", column_kind=None, column_value=None)
        assert (test_df.sort_values("sort").query("id=='a'")["v1"].values ==
                melt_df.query("id=='a'").query("_variables=='v1'")["_values"].values).all()
        assert (test_df.sort_values("sort").query("id=='a'")["v2"].values ==
                melt_df.query("id=='a'").query("_variables=='v2'")["_values"].values).all()

    def test_wide_dataframe_order_preserved(self):
        """ verifies that the order of the time series inside a wide time series container is preserved
        (column_sort=None)
        """
        test_df = pd.DataFrame({'id': ["a", "a", "a", "b"],
                                'v1': [4, 3, 2, 1],
                                'v2': [14, 13, 12, 11]})
        melt_df, _, _, _ = \
            dataframe_functions._normalize_input_to_internal_representation(
                test_df, column_id="id", column_sort=None, column_kind=None, column_value=None)
        assert (test_df.query("id=='a'")["v1"].values ==
                melt_df.query("id=='a'").query("_variables=='v1'")["_values"].values).all()
        assert (test_df.query("id=='a'")["v2"].values ==
                melt_df.query("id=='a'").query("_variables=='v2'")["_values"].values).all()

Afterwards we can merge this. Thanks again for reporting this! 👍
This was requested by Max in Pull Request blue-yonder#563.
Great, thank you for your help with this, Max! I've replaced my unit test with the ones you provided. Would you still like me to follow the order you specified above (failing tests first, then the fix)? I can move the unit tests to a separate pull request.
Yes, I verified the bug and confirmed locally that your fix solves the issue, so it's fine to merge this PR. Thanks again for reporting the issue!