Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cyclical features may not always get the right period #765

Closed
solegalli opened this issue May 18, 2024 · 18 comments · Fixed by #797
Closed

cyclical features may not always get the right period #765

solegalli opened this issue May 18, 2024 · 18 comments · Fixed by #797

Comments

@solegalli
Copy link
Collaborator

CyclicalFeatures obtains the period by looking for the maximum value. So if months are coded from 1-12, the period will be 12 and that'll be fine.

For hours for example, if the hrs are coded from 1-24, it will get the period as 24 and that will also be fine, but if hrs are coded as 0-23, then the period will be set to 23 automatically, but in reality it is still 24.

Maybe we could use n_unique instead of max value?

In both cases there is still an issue to resolve: if the training set does not have the maximum value, or does not contain all elements, then the period will still be wrongly estimated, but I guess we could make that clear in the docs, I am not sure we can sort that out from our end.

@solegalli
Copy link
Collaborator Author

related to #578

@solegalli
Copy link
Collaborator Author

@glevv any thoughts on this? and would you be interested in trying a fix?

@VascoSch92
Copy link
Contributor

Hey,

We had more or less the same issue.

What we did is: you give an optional parameter period where you specify the period you want to use. If the period=None, you use as period the max of the unique values.

In general, a periodic function doesn't have always a period. But in the case encountered in real life it has always (example of periodic function without period are non-real world friendly).

@glevv
Copy link
Contributor

glevv commented May 18, 2024

The idea of adding period parameter seems OK to me. Should give more control to the user. Implementation could be a bit messy, though

@solegalli
Copy link
Collaborator Author

We have that option already. All happens with the same parameter max_values. If None, we pick the period with max values, alternatively, the user can pass a dictionary with the periods per variable. Maybe we should have named it periods instead of max_values thinking retrospectively, but it's probably a bit late for that.

Irrespective of that, I'd like to improve the auto detection of the period, if possible at all. Not sure it is. Because one user just used the default parameters, got different results between feature-engine and sklearn's implementation, and got confused. So trying to cater for the less experienced or those who not always read the docs. Plus I am not sure if the docs are clear enough, maybe we should explain better how the auto period is calculated. Thinking out loud.

@VascoSch92
Copy link
Contributor

Irrespective of that, I'd like to improve the auto detection of the period, if possible at all. Not sure it is. Because one user just used the default parameters, got different results between feature-engine and sklearn's implementation, and got confused. So trying to cater for the less experienced or those who not always read the docs. Plus I am not sure if the docs are clear enough, maybe we should explain better how the auto period is calculated. Thinking out loud.

I was reading the docstring of the Cyclical features transformer and I agree with you. The docs could be improved for sure :-)

@glevv
Copy link
Contributor

glevv commented May 19, 2024

We can add the following logic:
When max_values is None set it to max(n_unique, max_val). If n_unique is lower than max_value - warn user that there is an inconsistency in the data and someone should look into it.

@solegalli
Copy link
Collaborator Author

That's a good idea. It would also work for features with negative values. Would you like to give it a go?

@solegalli
Copy link
Collaborator Author

Well, I am not sure if it takes care of negative values as well on second thoughts.

@solegalli
Copy link
Collaborator Author

Thinking further out loud here, and based on what @VascoSch92 discussed in #578:

if we have time from 0-23, 0 should map to 0, and 23 should map to 2*pi radians, because those are the limits of the sine and cosine function.

if the variable has 1-24 instead, then 1 should map to 0 and 24 to 2*radian. So, just dividing by 24 won't do it. We'd need to rescale as well.

Now, executing a few examples, there is not a huge difference between the output of the function when period is 23 or 24. But I think this needs to be clarified somewhere, particularly because sklearn docs just divides by 24, even when the variable is 0-23, and people will most likely trust sklearn over feature engine.

What's your thoughts on this? maybe we can resolve something before making changes. They might not be necessary.

@VascoSch92
Copy link
Contributor

Let take in consideration just the sin_transformer of sklearn given by the formula

$$\sin( \frac{x}{p}*2*\pi), \quad \text{where p is the period}$$

If you look you are doing two things here:

  • first, you map the columns value to points in the interval $[0, 2*\pi]$ using the expression $\frac{x}{p}*2*\pi$;
  • second, you take the sinus of these points.

About the two examples:

if we have time from 0-23, 0 should map to 0, and 23 should map to 2pi radians, because those are the limits of the sine and cosine function.
if the variable has 1-24 instead, then 1 should map to 0 and 24 to 2
radian. So, just dividing by 24 won't do it. We'd need to rescale as well.

Actually, you are representing hours using different values. In the first example, midnight is 0 but in the second is 24.
Now, they have the same period p=24, and you can compute that the sin_transformation give the same output for both (here sin_transformation(0) = sin_transformation(24)).

Why I'm saying that they have the same period? Because if you are at midnight (0 or 24) and you add 24, then you are still at midnight.

just divides by 24, even when the variable is 0-23, and people will most likely trust sklearn over feature engine.
Yes because the period is 24 for both actually.

The big questions here is: how to find the period of the columns in the best correct way if the user don't give this information?

I hope my answer was clear :-)

@solegalli
Copy link
Collaborator Author

I am not sure I agree with that. sin_transfomer(0) will be 0, because x/24 is 0. And sin_transformer(24) will be something else, because 24/24 is one. In fact, I don't think it is about finding the period any more. I think it is about finding the best way to rescale the variable between 0 and 1, so that when we multiply by 2*pi radian, the variable max and min coincide with the limits of the sine and cosine functions

@solegalli
Copy link
Collaborator Author

I forgot to apply the sine to my calculation, maybe that's what you meant.

@VascoSch92
Copy link
Contributor

I forgot to apply the sine to my calculation, maybe that's what you meant.

Yes ;-) $sin(0)=sin(2*\pi)$

@VascoSch92
Copy link
Contributor

Actually to fix the transformer i believe you should take the number of unique values in the column (you can check with the above examples). With the supposition that every value in the range is at least one time present in the column. I think this is the best that we can make.

@glevv
Copy link
Contributor

glevv commented May 20, 2024

I don't think that we need to account for negative values in our calculations. The only thing we can do is to raise an error if we encounter a negative value in Cyclic Transformer. Other than that heuristic with max(n_unique, max_val) should work.

@VascoSch92
Copy link
Contributor

I don't think that we need to account for negative values in our calculations. The only thing we can do is to raise an error if we encounter a negative value in Cyclic Transformer. Other than that heuristic with max(n_unique, max_val) should work.

For the part of negative values, it is not a problem as sin and cos are defined on the real line. Just the period as to be positive. But i could not find periodic function with negative values in a real-life example (perhaps i lack of imaginations ;))

About your solution, i think it doesn't work. Take as an example the hours of the day represented in the two way

  1. From 0 to 23
  2. From 1 to 24

For 1 you will have period 23 but for 2 period 24, i.e, two different encoding for the same information.

@VascoSch92
Copy link
Contributor

Hey :-)

what is the status of that?

I can open a PR to fix the issue if you are interested ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants