-
-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OHE: User can select which categories to encoded for selected variables #667
base: main
Are you sure you want to change the base?
Conversation
hi @solegalli, I think we're getting close on this one. Are there any other unit tests that we should implement? According to the coverage report, OHE has 100% coverage. I'll dig into the type error. |
Disregard my comment about code coverage. It seems that the transformer is very poorly "covered". I'll work on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Morgan-Sell
I had a quick look at the logic of the new functionality. There are some things that we need to give more thought to. Would be great it you could have a look and tell me what you think?
Thank you!
The dicitonary values are lists of the categories for each selected variable | ||
that are to be one-hot encoded. | ||
|
||
If `custom_categories` is being used, `top_categories` must equal None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things to think:
- I think custom_categories should override top_categories. So no matter what the user enters, custom_categories takes precedence. It is easier to use like this imo. So here we should say "when custom_categories is entered, top_categories will be ignored.
The other thing to consider is, how do we want custom_categories to interplay with variables. Are we going to allow the user to specify categories for 3 variables for example, while encoding 6? What did you have in mind for this?
For examples vars a,b,c,d,e are categorical, if variables=None, all will be encoded. If user passes a dictionary with keys, b and c, will variables a,d and e also be encoded? or are we making custom_categories take precedence over variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, Dr. Che, you never let me get away easy. No wonder I enjoy working w/ you ;)
I'm going to suggest that variables
is superior. I envision scenarios in which users would like to encode all variables but would like to select which categories are encoded in certain variables.
Consequently, this transformer will require a check when variables
does not equal None
and the user utilizes custom_categories
. The transformer should confirm that the variables included in custom_categories
are a subset of the variables inputted into the variables
param.
What do you think? ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
drop_last: bool = False, | ||
drop_last_binary: bool = False, | ||
variables: Union[None, int, str, List[Union[str, int]]] = None, | ||
variables: Union[ | ||
None, int, str, List[Union[str, int]], Iterable[Union[str, int]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you add iterable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before refactoring the code, MyPy raised the following errors:
feature_engine/encoding/one_hot.py:221: error: Argument 1 to "sorted" has incompatible type "Union[int, str, List[Union[str, int]], None]"; expected "Iterable[Union[str, int]]"
feature_engine/encoding/one_hot.py:223: error: Argument 1 to "sorted" has incompatible type "Union[int, str, List[Union[str, int]], None]"; expected "Iterable[Union[str, int]]"
f"its values. Got {custom_categories} instead." | ||
) | ||
|
||
# check that custom_categories variable match variables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if custom_categories should exactly match variables, I think we should just make this parameter take precendence over variables, and ignore variables altogether.
The alternative is, if we allow the user to specify None or a group of variables in variables, while custom_categories contains only a subset of these. Could be practical for the user, but a headache for us. Need to think / have a look at how much code we should change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm reading this comment after responding to your prior question.
I agree that it will be a headache for us and useful for the user. I'll think about this more too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright then, let me know if you find an elegant solution :)
""" | ||
for var, categories in self.custom_categories.items(): | ||
unique_values = set(X[var].unique()) | ||
if not set(categories).issubset(unique_values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if, the user wants a dummy for a category that is not present in the train set? should we contemplate this or is too unlikely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When someone is OHE-ing a category that is not present in the training set, are they putting it into a "catch-all" bucket, e.g. unlearned_categories
? Is this common?
I think it's unlikely and defeats the purpose of the custom_categories
param. I see custom_categories
being used to protect against unlearned categories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an example:
I work in a car insurance company. My historical data has 3 types of fuel: gas, petrol, diesel. But, I know that electric vehicles are coming. So I want an additional category: electric.
In the above scenario, it might help if we allow the additional bespoke category.
On the other hand, a model can't be train in a category that does not exist, so the results would be unreliable. Maybe we don't allow them for now?
Hi @solegalli, When you're back from vacay, lmk what you think. Also, I don't know why I said that OHE has a low code coverage score. I just ran the coverage report - the transformer received 100% coverage! |
Hey @Morgan-Sell I disappeared for a while, so I am a bit lost. I guess you were waiting for my input here, no? You let me know when I need to take another look? I am away from next Thursday, for a month, ehem, (red face) |
Closes #303.
Objective:
Create functionality so a user can encode certain categories that may not be the most frequent. The functionality is explained in this thread.
Will create a new init param called
custom_categories
that accepts a dictionary with selected features as keys and lists of the desired categories as values.Both
top_categories
anduser_categories
cannot be used at the same time.