df2Xy: Format correctly without the need to specify sort_by #324
Comments
Hi Paul (@pmhinz), |
Hi Ignacio, |
Hi @pmhinz, |
Sorry for being ambiguous. By "encapsulated" I mean that the information about which step of the time series a value belongs to is contained in the name of the column holding that value. Looking at the second dataframe shown in the link you provided, I see e.g. that the value 0.444130 is in the first step of the first time series, -0.593309 is in the second step, and so on. I gather this from the names of their respective columns (1, 2, ...). If these steps were unsorted, the columns would appear in some order other than (1, 2, 3, 4, 5, 6). Hence, to put unordered steps back in order, I would need to change the order in which the columns appear. Note that this is fundamentally different from ordering the dataframe by sample_col (sample_id in the example) and feature_col (feature_id), which orders the rows (and not the columns).
In essence, I see why sort_by could be useful since, theoretically, my steps could be messed up. But since the steps are encoded differently (in the columns) than the feature and sample identifiers (in the rows), the two can easily be disentangled, i.e. sort_by should only be able to sort the steps (and be named accordingly) while the feature and sample columns should be ordered automatically.
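To illustrate the distinction drawn above between ordering steps (columns) and ordering samples and features (rows), here is a minimal pandas sketch. The dataframe, its values, and the variable names are made up for illustration; it does not call df2Xy and is not how the library handles sorting internally.

```python
import pandas as pd

# Hypothetical frame: one row per (sample, feature), one column per step.
# The step columns (1..4) deliberately arrive out of order.
df = pd.DataFrame({
    "sample_id": [0, 0],
    "feature_id": [0, 1],
    3: [0.3, 1.3],
    1: [0.1, 1.1],
    4: [0.4, 1.4],
    2: [0.2, 1.2],
})

id_cols = ["sample_id", "feature_id"]
step_cols = sorted(c for c in df.columns if c not in id_cols)

# Ordering the steps means reordering columns; the rows are left as they were.
by_steps = df[id_cols + step_cols]

# Ordering samples and features means sorting rows; the columns are left as they were.
by_rows = df.sort_values(id_cols)

print(list(by_steps.columns))
print(list(by_rows.columns))
```

The two operations act on different axes, which is why they can be handled independently of each other.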
Ok, here's how I see it (I believe there's a bug in the current implementation).
Is this aligned with your proposal @pmhinz? |
Yes, this seems like a good plan of action to me. The only thing I would like to add is that since, according to this proposal, |
I understand, but I'd rather keep it the way it is for backward compatibility. I'll update the documentation, though, to explain the use of the sort_by argument (as the column/s used to sort the time series steps). I'll try to make it clear that sort_by doesn't need to include sample_col and feature_col, which will always be applied by default.
Works for me. Thanks for taking care of this. |
Paul @pmhinz, |
I'll close this issue due to a lack of response. Please, reopen if necessary. |
When using df2Xy with multivariate features, the function only works correctly if the passed dataframe is already sorted by (sample_col, feature_col). If the rows are sorted the other way around, the function still runs but silently mixes up the data: instead of a list of the form
[[feature1_for_sample1, feature2_for_sample1, ...], [feature1_for_sample2, feature2_for_sample2, ...], ...]
one receives
[[feature1_for_sample1, feature1_for_sample2, ...], [feature1_for_sample<n>, feature1_for_sample<n+1>, ...], ...]
which leads to nonsensical training.
However, with multivariate data one needs to pass sample_col and feature_col anyway, so the sorting could be done automatically using this information. I propose to do just that.
This raises the question of how to handle cases where the sort_by argument is passed as well. Is there any reason this argument is needed at all, other than to prevent the above issue? If not, the obvious solution would be to simply delete the argument.
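The mix-up described in this report can be reproduced with plain pandas and NumPy. The sketch below uses a hypothetical helper, to_3d, as a stand-in for the reshaping step; it is not df2Xy and makes no claim about the library's actual code. The column names and values are likewise invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row per (sample, feature), one column per step.
df = pd.DataFrame({
    "sample_id":  [0, 0, 1, 1],
    "feature_id": [0, 1, 0, 1],
    1: [0.0, 10.0, 100.0, 110.0],
    2: [1.0, 11.0, 101.0, 111.0],
    3: [2.0, 12.0, 102.0, 112.0],
})

def to_3d(frame, sample_col, feature_col, sort=True):
    """Reshape to (n_samples, n_features, n_steps).

    Hypothetical helper for illustration only, not the library function.
    """
    if sort:
        # The proposed fix: always order rows by sample, then feature.
        frame = frame.sort_values([sample_col, feature_col], kind="stable")
    n_samples = frame[sample_col].nunique()
    n_features = frame[feature_col].nunique()
    steps = frame.drop(columns=[sample_col, feature_col]).to_numpy()
    return steps.reshape(n_samples, n_features, -1)

# Same data, but with rows ordered by feature first and sample second.
shuffled = df.sort_values(["feature_id", "sample_id"]).reset_index(drop=True)

expected = to_3d(df, "sample_id", "feature_id")
mixed = to_3d(shuffled, "sample_id", "feature_id", sort=False)
fixed = to_3d(shuffled, "sample_id", "feature_id", sort=True)

print(np.array_equal(mixed, expected))  # reshaping without sorting
print(np.array_equal(fixed, expected))  # reshaping after sorting
```

Without the sort, the first "sample" in the result holds feature 0 of both samples, and no error is raised. Sorting by the sample and feature columns before reshaping restores the intended layout regardless of the incoming row order.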