df2Xy: Format correctly without the need to specify sort_by #324

pmhinz · 2021-12-17T09:56:46Z

When using df2Xy with multivariate features, the function only works correctly if the passed dataframe adheres to a specific sorting (sample_col, feature_col). If the order is the other way around, the function will still run through but silently mix up data, i.e. instead of having a list
[[feature1_for_sample1, feature2_for_sample1, ...], [feature1_for_sample2, feature2_for_sample2, ...], ...]
one receives
[[feature1_for_sample1, feature1_for_sample2, ...], [feature1_for_sample<n>, feature1_for_sample<n+1>, ...], ...]
with the latter obviously leading to a nonsensical training.

However, when using multivariate data, one needs to pass sample_col and feature_col anyways so the sorting might also be done automatically using this information. I propose to do just that.

This then obviously raises the question on how to handle cases where the sort_by-argument is passed as well. Is there any reason this argument is needed at all other than to prevent the above issue? If not, the obvious solution would be to simply delete the argument.

The text was updated successfully, but these errors were encountered:

oguiza · 2021-12-17T15:34:56Z

Hi Paul (@pmhinz),
There's a reason that sort_by may be required in some cases. Let me try to explain it.
sample_col and feature_col will sort all features for all samples. Those are 2 of the 3 axis of the expected output with shape:
[n_samples x n_vars x steps]
However, you may need to sort the steps as they may be unsorted. There may be a time column you might need to sort. That's why sort_by is required.

pmhinz · 2021-12-17T15:56:00Z

Hi Ignacio,
I see where you are coming from. However, I thought that the steps are expected to be encapsulated in the columns of the incoming dataframe. If my assessment is correct and if the steps truly were unsorted, I would need to sort columns (and not rows), which sort_by does not do at the moment anyways.

oguiza · 2021-12-21T11:50:07Z

Hi @pmhinz,
If by encapsulated you mean that you have an array in each Pandas cell, no that's not the expected input. Please, take a look at the data preparation section in the documentation to see multiple examples of how input data may be prepared.

pmhinz · 2021-12-21T14:58:16Z

Sorry for being ambiguous. By "encapsulated" I mean that the information about which step of the time series is currently considered is contained in the name of the column of the cell.

Having a look at the link you provided, looking at the second dataframe that is being shown, I see e.g. that the value 0.444130 is in the first step of the first time series, -0.593309 is in the second step and so on. This information I gather from the names of their respective columns (1, 2, ...). If these steps were unsorted, that would mean that instead of the columns being in the order (1, 2, 3, 4, 5, 6) as they are there, they would be in any other order. Hence, to order the steps, should they be unordered, I would need to change the order in which the columns appear. Note that this is fundamentally different from ordering the dataframe according to the sample_col (sample_id in the example) and feature_col (feature_id), which would order the rows (and not the columns).

In essence, I see why sort_by could be useful since, theoretically, my steps could be messed up. But since the steps are encoded differently (in the columns) than the feature and sample identifiers (in the rows), this can easily be entangled, i.e. sort_by should only be able to sort the steps (and be named accordingly then) while the feature and sample columns should be ordered automatically.

oguiza · 2021-12-23T17:46:02Z

Ok, here's how I see it (I believe there's a bug in the current implementation).

df2Xy should always sort the dataframe.
sample_col (if not None) and feat_col (if not None) should be the first sorting criteria used.
In addition to those, and only if your steps are in rows, you may also want to add other columns using the sort_by argument.
The order of columns for sorting will then be:
- sample_col
- feat_col
- sort_by
sort_by doesn't need to include the sample_col nor the feat_col. They should be automatically added.

Is this aligned with your proposal @pmhinz?

pmhinz · 2021-12-23T17:57:06Z

Yes, this seems like a good plan of action to me. The only thing I would like to add is that since, according to this proposal, sort_by may now only be used for sorting the steps, its name should be changed to be more descriptive of what it does. I am thinking step_col but maybe that is too reductive of a choice. What do you think?

oguiza · 2021-12-23T20:16:25Z

I understand, but I'd rather keep it the way it is for backward compatibility. I'll update documentation though to explain the use of the sort_by argument (as column/s used to sort the time series steps). I'll try to make it clear that sort_by doesn't need to include the sample_col and feature_col which will be always applied by default.

pmhinz · 2021-12-23T20:20:17Z

Works for me. Thanks for taking care of this.

oguiza · 2022-01-11T15:23:31Z

Paul @pmhinz,
I've already modified df2Xy based on the agreed solution. It'd be good if you can verify it works as expected.

oguiza · 2022-01-19T08:04:16Z

I'll close this issue due to a lack of response. Please, reopen if necessary.

oguiza added the under review Waiting for clarification, confirmation, etc label Dec 23, 2021

oguiza added bug Something isn't working and removed under review Waiting for clarification, confirmation, etc labels Dec 25, 2021

oguiza closed this as completed Jan 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

df2Xy: Format correctly without the need to specify sort_by #324

df2Xy: Format correctly without the need to specify sort_by #324

pmhinz commented Dec 17, 2021

oguiza commented Dec 17, 2021 •

edited

Loading

pmhinz commented Dec 17, 2021

oguiza commented Dec 21, 2021

pmhinz commented Dec 21, 2021

oguiza commented Dec 23, 2021

pmhinz commented Dec 23, 2021

oguiza commented Dec 23, 2021

pmhinz commented Dec 23, 2021

oguiza commented Jan 11, 2022

oguiza commented Jan 19, 2022

df2Xy: Format correctly without the need to specify sort_by #324

df2Xy: Format correctly without the need to specify sort_by #324

Comments

pmhinz commented Dec 17, 2021

oguiza commented Dec 17, 2021 • edited Loading

pmhinz commented Dec 17, 2021

oguiza commented Dec 21, 2021

pmhinz commented Dec 21, 2021

oguiza commented Dec 23, 2021

pmhinz commented Dec 23, 2021

oguiza commented Dec 23, 2021

pmhinz commented Dec 23, 2021

oguiza commented Jan 11, 2022

oguiza commented Jan 19, 2022

oguiza commented Dec 17, 2021 •

edited

Loading