Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

df2Xy: Format correctly without the need to specify sort_by #324

Closed
pmhinz opened this issue Dec 17, 2021 · 10 comments
Closed

df2Xy: Format correctly without the need to specify sort_by #324

pmhinz opened this issue Dec 17, 2021 · 10 comments
Labels
bug Something isn't working

Comments

@pmhinz
Copy link

pmhinz commented Dec 17, 2021

When using df2Xy with multivariate features, the function only works correctly if the passed dataframe adheres to a specific sorting (sample_col, feature_col). If the order is the other way around, the function will still run through but silently mix up data, i.e. instead of having a list
[[feature1_for_sample1, feature2_for_sample1, ...], [feature1_for_sample2, feature2_for_sample2, ...], ...]
one receives
[[feature1_for_sample1, feature1_for_sample2, ...], [feature1_for_sample<n>, feature1_for_sample<n+1>, ...], ...]
with the latter obviously leading to a nonsensical training.

However, when using multivariate data, one needs to pass sample_col and feature_col anyways so the sorting might also be done automatically using this information. I propose to do just that.

This then obviously raises the question on how to handle cases where the sort_by-argument is passed as well. Is there any reason this argument is needed at all other than to prevent the above issue? If not, the obvious solution would be to simply delete the argument.

@oguiza
Copy link
Contributor

oguiza commented Dec 17, 2021

Hi Paul (@pmhinz),
There's a reason that sort_by may be required in some cases. Let me try to explain it.
sample_col and feature_col will sort all features for all samples. Those are 2 of the 3 axis of the expected output with shape:
[n_samples x n_vars x steps]
However, you may need to sort the steps as they may be unsorted. There may be a time column you might need to sort. That's why sort_by is required.

@pmhinz
Copy link
Author

pmhinz commented Dec 17, 2021

Hi Ignacio,
I see where you are coming from. However, I thought that the steps are expected to be encapsulated in the columns of the incoming dataframe. If my assessment is correct and if the steps truly were unsorted, I would need to sort columns (and not rows), which sort_by does not do at the moment anyways.

@oguiza
Copy link
Contributor

oguiza commented Dec 21, 2021

Hi @pmhinz,
If by encapsulated you mean that you have an array in each Pandas cell, no that's not the expected input. Please, take a look at the data preparation section in the documentation to see multiple examples of how input data may be prepared.

@pmhinz
Copy link
Author

pmhinz commented Dec 21, 2021

Sorry for being ambiguous. By "encapsulated" I mean that the information about which step of the time series is currently considered is contained in the name of the column of the cell.

Having a look at the link you provided, looking at the second dataframe that is being shown, I see e.g. that the value 0.444130 is in the first step of the first time series, -0.593309 is in the second step and so on. This information I gather from the names of their respective columns (1, 2, ...). If these steps were unsorted, that would mean that instead of the columns being in the order (1, 2, 3, 4, 5, 6) as they are there, they would be in any other order. Hence, to order the steps, should they be unordered, I would need to change the order in which the columns appear. Note that this is fundamentally different from ordering the dataframe according to the sample_col (sample_id in the example) and feature_col (feature_id), which would order the rows (and not the columns).

In essence, I see why sort_by could be useful since, theoretically, my steps could be messed up. But since the steps are encoded differently (in the columns) than the feature and sample identifiers (in the rows), this can easily be entangled, i.e. sort_by should only be able to sort the steps (and be named accordingly then) while the feature and sample columns should be ordered automatically.

@oguiza
Copy link
Contributor

oguiza commented Dec 23, 2021

Ok, here's how I see it (I believe there's a bug in the current implementation).

  1. df2Xy should always sort the dataframe.

  2. sample_col (if not None) and feat_col (if not None) should be the first sorting criteria used.

  3. In addition to those, and only if your steps are in rows, you may also want to add other columns using the sort_by argument.

  4. The order of columns for sorting will then be:

    • sample_col
    • feat_col
    • sort_by
  5. sort_by doesn't need to include the sample_col nor the feat_col. They should be automatically added.

Is this aligned with your proposal @pmhinz?

@oguiza oguiza added the under review Waiting for clarification, confirmation, etc label Dec 23, 2021
@pmhinz
Copy link
Author

pmhinz commented Dec 23, 2021

Yes, this seems like a good plan of action to me. The only thing I would like to add is that since, according to this proposal, sort_by may now only be used for sorting the steps, its name should be changed to be more descriptive of what it does. I am thinking step_col but maybe that is too reductive of a choice. What do you think?

@oguiza
Copy link
Contributor

oguiza commented Dec 23, 2021

I understand, but I'd rather keep it the way it is for backward compatibility. I'll update documentation though to explain the use of the sort_by argument (as column/s used to sort the time series steps). I'll try to make it clear that sort_by doesn't need to include the sample_col and feature_col which will be always applied by default.

@pmhinz
Copy link
Author

pmhinz commented Dec 23, 2021

Works for me. Thanks for taking care of this.

@oguiza oguiza added bug Something isn't working and removed under review Waiting for clarification, confirmation, etc labels Dec 25, 2021
@oguiza
Copy link
Contributor

oguiza commented Jan 11, 2022

Paul @pmhinz,
I've already modified df2Xy based on the agreed solution. It'd be good if you can verify it works as expected.

@oguiza
Copy link
Contributor

oguiza commented Jan 19, 2022

I'll close this issue due to a lack of response. Please, reopen if necessary.

@oguiza oguiza closed this as completed Jan 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants