-
-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cyclical features may not always get the right period #765
Comments
related to #578 |
@glevv any thoughts on this? and would you be interested in trying a fix? |
Hey, We had more or less the same issue. What we did is: you give an optional parameter In general, a periodic function doesn't have always a period. But in the case encountered in real life it has always (example of periodic function without period are non-real world friendly). |
The idea of adding |
We have that option already. All happens with the same parameter Irrespective of that, I'd like to improve the auto detection of the period, if possible at all. Not sure it is. Because one user just used the default parameters, got different results between feature-engine and sklearn's implementation, and got confused. So trying to cater for the less experienced or those who not always read the docs. Plus I am not sure if the docs are clear enough, maybe we should explain better how the auto period is calculated. Thinking out loud. |
I was reading the docstring of the Cyclical features transformer and I agree with you. The docs could be improved for sure :-) |
We can add the following logic: |
That's a good idea. It would also work for features with negative values. Would you like to give it a go? |
Well, I am not sure if it takes care of negative values as well on second thoughts. |
Thinking further out loud here, and based on what @VascoSch92 discussed in #578: if we have time from 0-23, 0 should map to 0, and 23 should map to 2*pi radians, because those are the limits of the sine and cosine function. if the variable has 1-24 instead, then 1 should map to 0 and 24 to 2*radian. So, just dividing by 24 won't do it. We'd need to rescale as well. Now, executing a few examples, there is not a huge difference between the output of the function when period is 23 or 24. But I think this needs to be clarified somewhere, particularly because sklearn docs just divides by 24, even when the variable is 0-23, and people will most likely trust sklearn over feature engine. What's your thoughts on this? maybe we can resolve something before making changes. They might not be necessary. |
Let take in consideration just the If you look you are doing two things here:
About the two examples:
Actually, you are representing hours using different values. In the first example, midnight is 0 but in the second is 24. Why I'm saying that they have the same period? Because if you are at midnight (0 or 24) and you add 24, then you are still at midnight.
The big questions here is: how to find the period of the columns in the best correct way if the user don't give this information? I hope my answer was clear :-) |
I am not sure I agree with that. |
I forgot to apply the sine to my calculation, maybe that's what you meant. |
Yes ;-) |
Actually to fix the transformer i believe you should take the number of unique values in the column (you can check with the above examples). With the supposition that every value in the range is at least one time present in the column. I think this is the best that we can make. |
I don't think that we need to account for negative values in our calculations. The only thing we can do is to raise an error if we encounter a negative value in Cyclic Transformer. Other than that heuristic with |
For the part of negative values, it is not a problem as sin and cos are defined on the real line. Just the period as to be positive. But i could not find periodic function with negative values in a real-life example (perhaps i lack of imaginations ;)) About your solution, i think it doesn't work. Take as an example the hours of the day represented in the two way
For 1 you will have period 23 but for 2 period 24, i.e, two different encoding for the same information. |
Hey :-) what is the status of that? I can open a PR to fix the issue if you are interested ;-) |
CyclicalFeatures obtains the period by looking for the maximum value. So if months are coded from 1-12, the period will be 12 and that'll be fine.
For hours for example, if the hrs are coded from 1-24, it will get the period as 24 and that will also be fine, but if hrs are coded as 0-23, then the period will be set to 23 automatically, but in reality it is still 24.
Maybe we could use n_unique instead of max value?
In both cases there is still an issue to resolve: if the training set does not have the maximum value, or does not contain all elements, then the period will still be wrongly estimated, but I guess we could make that clear in the docs, I am not sure we can sort that out from our end.
The text was updated successfully, but these errors were encountered: