Add OrdinalEncoder component #3736
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3736 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 339 341 +2
Lines 34845 35235 +390
=======================================
+ Hits 34714 35103 +389
- Misses 131 132 +1
Force-pushed from 9b726e5 to a6fdf4c
@@ -110,6 +110,7 @@ def fit(self, X, y=None):
    top_n = self.parameters["top_n"]
    X = infer_feature_types(X)
    if self.features_to_encode is None:
        # --> should update to not include ordinals once the ord encoder is integrated? Maybe that's configurable based on whether ordinal encoder is used?
A note to myself about integrating the new component. Do we need a separate issue for actually using the component? Or do we use the same issue?
We usually file another ticket but it's flexible!
Created a follow-up issue: #3744
"top_n": top_n, | ||
"features_to_encode": features_to_encode, | ||
"categories": categories, | ||
"handle_unknown": handle_unknown, |
I didn't include a `handle_missing` parameter, since "as_category" isn't an option here, but it means that we don't have the option to error when NaNs are seen. Do we want a `handle_missing` parameter that is either "use_encoded_value" or "error" and then pairs with the `encoded_missing_value` parameter?
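For reference, here is a minimal sketch of how those knobs pair up in scikit-learn's own encoder (assuming scikit-learn >= 1.1, where `encoded_missing_value` exists); note that NaN has to appear in an explicit `categories` list for it to be treated as missing rather than unknown:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder as SKOrdinalEncoder

X = pd.DataFrame({"size": ["S", "M", np.nan, "L"]})

# Unseen categories map to unknown_value; NaNs map to encoded_missing_value.
encoder = SKOrdinalEncoder(
    categories=[["S", "M", "L", np.nan]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
    encoded_missing_value=-2,
)
print(encoder.fit_transform(X))                           # S/M/L -> 0/1/2, NaN -> -2
print(encoder.transform(pd.DataFrame({"size": ["XL"]})))  # unknown "XL" -> -1
```

With the default `handle_unknown="error"` instead, unseen values raise at transform time, which is roughly the "error" mode being asked about.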
    return self._get_feature_names()

def _get_feature_provenance(self):
    return self._provenance
This isn't yet covered by tests, I assume, because it's not used in the EvalML pipeline. I didn't see tests in the other components for this method, so are we okay leaving this uncovered for now?
I'm certainly not fussed about it, but if anyone disagrees speak now
This is awesome work. There was a lot to think through here and I'm impressed you did it all. I left some cleanup/clarity comments, but the only blocking comment I have is to remove the `top_n` parameter. It isn't necessary and (as made clear by all the test cases you so painstakingly wrote!) it's overly complicated without adding any value.
# Put features_to_encode in the same relative order as the columns in the dataframe
self.features_to_encode = [
    col for col in X.columns if col in self.features_to_encode
]
Is this necessary? If so, I just learned something new.
This is in place because we say that `categories[i]` is a list of the categories for the column at index `i` in the dataframes passed in at fit and transform. That kind of breaks down once you're specifying `features_to_encode`, or if not all of the dataframe's columns are ordinal in nature, so I'm making the assumption that the order of the sublists in `categories` is just the relative order from the original dataframe.
I can make note of that assumption in the docstring. Alternatively, we could just make `categories` a dictionary mapping column name to the list of categories. Was there a specific reason it's a list of lists, other than that's what we pass into the SKLearn encoder? The SKLearn encoder requires that categories be the same length as the number of columns in the dataframe, and now that I really think about it, a dictionary probably fits our use case better.
I'll defer to you on this one! If a dictionary makes more sense I support switching over, but this is just fine as is.
Okay, I went through the refactor to use a dict, and I continue to be a fan. It wasn't a small number of changes, though: 54386f0
It lets users avoid having to worry about the relative order of their inputs to the encoder and lets us remove this line.
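As a sketch of the idea (the helper name here is hypothetical, not the PR's code): with a dict, ordering against sklearn's positional list-of-lists becomes a simple lookup keyed by the columns actually being encoded:

```python
# Hypothetical helper: order a {column: categories} dict to match the
# columns being encoded, since sklearn's OrdinalEncoder expects one
# list per encoded column, in column order.
def categories_for_sklearn(categories, features_to_encode):
    return [categories[col] for col in features_to_encode]

categories = {"size": ["S", "M", "L"], "grade": ["low", "high"]}
features_to_encode = ["grade", "size"]  # relative order from the dataframe
print(categories_for_sklearn(categories, features_to_encode))
# [['low', 'high'], ['S', 'M', 'L']]
```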
""" | ||
X = infer_feature_types(X) | ||
|
||
X_copy = X.ww.copy() |
Nitpick about variable names: I would replace the X with X_t and X_copy with X. We should also avoid the need to copy in that case, since calling `X.ww.drop` should return a new DataFrame.

X_t = X.ww.drop(columns=self.features_to_encode)
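A quick check of that assumption (as far as I know, Woodwork's accessor returns a fresh DataFrame from `drop`):

```python
import pandas as pd
import woodwork as ww  # noqa: F401 — importing registers the .ww accessor

X = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
X.ww.init()

# drop returns a new Woodwork-initialized DataFrame, so no explicit
# copy is needed before the result is modified downstream.
X_t = X.ww.drop(columns=["b"])
assert X_t is not X
```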
This got me thinking: what if we return a copy of `X` early in the case where there are no features to encode (having to actually copy in this case, because I assume we don't want users messing around with the same object they passed in), and otherwise proceed with an `X_orig` of non-ordinal columns that we concat with an `X_t` of the transformed ordinal columns? Would that be terribly out of line with how we normally handle `transform`?
X = infer_feature_types(X)
if not self.features_to_encode:
return X.ww.copy()
X_orig = X.ww.drop(columns=self.features_to_encode)
# Call sklearn's transform on only the ordinal columns
X_t = pd.DataFrame(
self._encoder.transform(X[self.features_to_encode]),
index=X.index,
)
X_t.columns = self._get_feature_names()
X_t.ww.init(logical_types={c: "Double" for c in X_t.columns})
self._feature_names = X_t.columns
X_t = ww.utils.concat_columns([X_orig, X_t])
return X_t
Implemented this in dd5c075 - happy to change if we should be doing this differently!
I'm not sure what you mean here by "we don't want users messing around with the same object they passed in", could you explain that a little bit? If we're trying not to modify the user's original data, this wouldn't be the place to enforce it. I'm also worried that copying the data (especially when this component should be a no-op) uses unnecessary time and storage, especially with larger datasets.
Yeah, I was assuming that this was the place to be worrying about modifying the user's original data. My thought was that if you pass a dataframe into `transform`, we shouldn't ever give you back the exact same object. But if that's not a contract we need to uphold here, I'm more than happy to avoid the extra computation!
Looks great! Just left some final cleanup comments but overall this is solid
    col for col in X.columns if col in self.features_to_encode
]

X_t = X
Why do this here? We shouldn't transform X at all during fit, and it seems like the rest of this just uses one or the other interchangeably.
I think that was a holdover from the onehot encoder, but I can definitely remove
categories.append(unique_values)

# Add any null values into the categories lists so that they aren't treated as unknown values
# This is needed because Ordinal.order won't indicate if nulls are present, and SKOrdinalEncoder
Gotcha, a documentation update would definitely be helpful!
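A minimal sketch of the idea being documented here (the helper name is hypothetical): append a NaN entry to each column's category list when the column actually contains nulls, so the sklearn encoder treats them as missing rather than unknown:

```python
import numpy as np
import pandas as pd

# Hypothetical helper mirroring the comment above: Ordinal.order (like
# pandas' categorical dtype) never lists null as a category, so NaN is
# appended here, or SKOrdinalEncoder would flag null rows as unknown.
def append_nulls_to_categories(X, categories):
    for col, cats in categories.items():
        if X[col].isnull().any() and not any(pd.isna(c) for c in cats):
            cats.append(np.nan)
    return categories

X = pd.DataFrame({"size": pd.Series(["S", None, "M"], dtype="category")})
print(append_nulls_to_categories(X, {"size": ["S", "M"]}))
# {'size': ['S', 'M', nan]}
```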
if encoded_missing_value is None:
    encoded_missing_value = np.nan

self._encoder = SKOrdinalEncoder(
This should be saved as the `_component_obj` to match the pattern of the rest of our components!
Changing!
Out of curiosity: is this out of convention, or is it needed to work properly in evalml? I ask because this was another holdover from the onehot encoder, and I do see `_encoder` used elsewhere across the repo (though not nearly as much as `_component_obj`).
It's mostly a convention thing, afaik
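To make the convention concrete, here is a stripped-down, hypothetical illustration (not the actual component) of storing the fitted sklearn estimator under `_component_obj`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder as SKOrdinalEncoder

class MinimalOrdinalEncoder:
    """Hypothetical sketch: only the naming convention matters here."""

    def fit(self, X, y=None):
        # Stored as _component_obj (rather than a bespoke name like
        # _encoder) so shared base-class plumbing that looks for
        # _component_obj applies uniformly across components.
        self._component_obj = SKOrdinalEncoder()
        self._component_obj.fit(X)
        return self

    def transform(self, X):
        return pd.DataFrame(
            self._component_obj.transform(X), index=X.index, columns=X.columns
        )

X = pd.DataFrame({"size": ["S", "M", "L", "M"]})
print(MinimalOrdinalEncoder().fit(X).transform(X))
```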
Force-pushed from 54386f0 to 214b910
LGTM! well done 😄
Force-pushed from 7429d2b to ee81b32
Adds the OrdinalEncoder component, implementing #1389.
The implementation is loosely based off of the OneHotEncoder, with a few key differences:

- No `top_n` parameter to limit the number of features created
- Missing values can be left as `np.nan` or converted to a separate encoded value that users can specify

(Note: this PR does not integrate the new ordinal encoder into the EvalML pipeline.)
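A usage sketch, assuming the import path matches the file touched in this PR and that the parameter names follow the thread above (`categories` as a dict after the refactor, plus `encoded_missing_value`); treat this as illustrative rather than the final API:

```python
import pandas as pd
from evalml.pipelines.components.transformers.encoders.ordinal_encoder import (
    OrdinalEncoder,
)

# An ordered pandas categorical, so Woodwork infers an Ordinal logical type.
X = pd.DataFrame(
    {"size": pd.Categorical(["S", "M", None, "L"],
                            categories=["S", "M", "L"], ordered=True)}
)

encoder = OrdinalEncoder(
    categories={"size": ["S", "M", "L"]},  # dict form per the refactor above
    encoded_missing_value=-1,              # assumed parameter, per the thread
)
print(encoder.fit_transform(X))
```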