Add OrdinalEncoder component #3736

Merged: 20 commits merged into main from add-ordinal-encoder on Oct 6, 2022

Conversation

tamargrey (Contributor) commented on Sep 28, 2022:

Adds the OrdinalEncoder component, implementing #1389.

The implementation is loosely based on the OneHotEncoder, with a few key differences:

  • We only need to make one encoded feature for each ordinal column, so the naming schema is simplified
  • Because all values in a column must be encoded to something, we do not have the option to ignore unknown values. We can either specify a value to encode them as or raise an error.
  • There's not the same need to drop categories, since we don't have the problem of duplicate features in binary columns that exists with one-hot encoding
  • There's not the same need to have a top_n parameter to limit the number of features created
  • Missing values cannot be treated as a category. This is because Woodwork's Ordinal logical type doesn't include nans, which means that we can't specify a null value's place in the order. Therefore, we either keep nulls as np.nan or convert them to a separate encoded value that users can specify.

(Note, this PR does not integrate the new ordinal encoder into the EvalML pipeline)
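
To make the new component concrete, here is a rough usage sketch; it assumes the parameters visible in this PR's diff (features_to_encode, categories, handle_unknown, encoded_missing_value), Woodwork's Ordinal logical type, and an illustrative output column naming:

import pandas as pd
from woodwork.logical_types import Ordinal

from evalml.pipelines.components import OrdinalEncoder

X = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
# The Ordinal logical type carries the category order the encoder reads at fit time
X.ww.init(logical_types={"size": Ordinal(order=["small", "medium", "large"])})

encoder = OrdinalEncoder()  # categories default to each column's Ordinal order
encoder.fit(X)
X_t = encoder.transform(X)  # one encoded Double column per ordinal input column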

codecov bot commented on Sep 28, 2022:

Codecov Report

Merging #3736 (ee81b32) into main (9331246) will increase coverage by 0.1%.
The diff coverage is 99.8%.

@@           Coverage Diff           @@
##            main   #3736     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        339     341      +2     
  Lines      34845   35235    +390     
=======================================
+ Hits       34714   35103    +389     
- Misses       131     132      +1     
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 99.1% <ø> (ø)
...omponents/transformers/encoders/ordinal_encoder.py 99.0% <99.0%> (ø)
...lines/components/transformers/encoders/__init__.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_ordinal_encoder.py 100.0% <100.0%> (ø)


CLAassistant commented on Sep 28, 2022:

CLA assistant check
All committers have signed the CLA.

@@ -110,6 +110,7 @@ def fit(self, X, y=None):
top_n = self.parameters["top_n"]
X = infer_feature_types(X)
if self.features_to_encode is None:
# --> should update to not include ordinals once the ord encoder is integrated? Maybe that's configurable based on whether ordinal encoder is used?
tamargrey (Contributor, Author):

A note to myself about integrating the new component. Do we need a separate issue for actually using the component? Or do we use the same issue?

Collaborator:

We usually file another ticket but it's flexible!

tamargrey (Contributor, Author):

Created a follow-up issue: #3744

"top_n": top_n,
"features_to_encode": features_to_encode,
"categories": categories,
"handle_unknown": handle_unknown,
tamargrey (Contributor, Author):

I didn't include a handle_missing parameter, since "as_category" isn't an option here, but that means we don't have the option to error when nans are seen.

Do we want a handle_missing parameter that is either "use_encoded_value" or "error" and then pairs with the encoded_missing_value parameter?
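
For reference, a minimal sketch of the underlying scikit-learn behavior being wrapped here (assumes scikit-learn >= 1.1; the -1 is only an example value). As the fit logic in this PR does, nan is appended to the categories list so it isn't treated as an unknown value:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder as SKOrdinalEncoder

X = pd.DataFrame({"size": ["small", "large", np.nan]})

enc = SKOrdinalEncoder(
    categories=[["small", "medium", "large", np.nan]],  # nan appended, mirroring this PR's fit
    encoded_missing_value=-1,  # with the default (np.nan), nulls would pass through unchanged
)
enc.fit_transform(X)  # array([[ 0.], [ 2.], [-1.]])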

return self._get_feature_names()

def _get_feature_provenance(self):
return self._provenance
tamargrey (Contributor, Author):

This isn't yet covered by tests, I assume, because it's not used in the EvalML pipeline. I didn't see tests in the other components for this method, so are we okay leaving this uncovered for now?

Contributor:

I'm certainly not fussed about it, but if anyone disagrees speak now

eccabay (Contributor) left a comment:

This is awesome work. There was a lot to think through here and I'm impressed you did it all. I left some cleanup/clarity comments, but the only blocking comment I have is to remove the top_n parameter. It isn't necessary and (as made clear by all the test cases you so painstakingly wrote!) it's overly complicated without adding any value.

Comment on lines 151 to 144
# Put features_to_encode in the same relative order as the columns in the dataframe
self.features_to_encode = [
col for col in X.columns if col in self.features_to_encode
]
Contributor:

Is this necessary? If so, I just learned something new.

tamargrey (Contributor, Author):

This is in place because we say that

categories[i] is a list of the categories for the column at index i in the dataframes passed in at fit and transform

That kind of breaks down once you're specifying features_to_encode or if not all of the dataframe's columns are ordinal in nature, so I'm making the assumption that the order of the sublists in categories is just the relative order from the original dataframe.

I can make note of that assumption in the docstring. Alternatively, we could just make categories a dictionary mapping column name to the list of categories. Was there a specific reason it's a list of lists, other than that's what we pass into the SKLearn encoder? The SKLearn encoder requires that categories be the same length as the number of columns in the dataframe, and now that I really think about it, a dictionary probably fits our use case better.
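
To make the two parameterizations concrete (column names here are hypothetical):

# List of lists: categories[i] must line up with the i-th encoded column,
# so callers need to know the columns' relative order in the dataframe
categories = [["low", "medium", "high"], ["S", "M", "L"]]

# Dict: keyed by column name, so relative order no longer matters
categories = {"priority": ["low", "medium", "high"], "size": ["S", "M", "L"]}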

Contributor:

I'll defer to you on this one! If a dictionary makes more sense I support switching over, but this is just fine as is.

tamargrey (Contributor, Author):

Okay, I went through the refactor to use a dict, and I continue to be a fan. It wasn't a small number of changes, though: 54386f0

It lets users avoid having to worry about the relative order of their inputs to the encoder and lets us remove this line.

"""
X = infer_feature_types(X)

X_copy = X.ww.copy()
Contributor:

Nitpick about variable names: I would replace the X with X_t and X_copy with X. We should also avoid the need to copy in that case, since calling X.ww.drop should return a new DataFrame.

X_t = X.ww.drop(columns=self.features_to_encode)

tamargrey (Contributor, Author):

This got me thinking: what if we return a copy of X early in the case where there are no features to encode (we'd have to actually copy in that case, because I assume we don't want users ending up with the same object they passed in), and otherwise proceed with an X_orig of the non-ordinal columns that we concat with the X_t of transformed ordinal columns? Would that be terribly out of line with how we normally handle transform?

        X = infer_feature_types(X)

        # No-op case: return a fresh copy rather than the caller's own frame
        if not self.features_to_encode:
            return X.ww.copy()

        # Keep the non-ordinal columns aside so they pass through untouched
        X_orig = X.ww.drop(columns=self.features_to_encode)

        # Call sklearn's transform on only the ordinal columns
        X_t = pd.DataFrame(
            self._encoder.transform(X[self.features_to_encode]),
            index=X.index,
        )
        X_t.columns = self._get_feature_names()
        X_t.ww.init(logical_types={c: "Double" for c in X_t.columns})
        self._feature_names = X_t.columns

        # Stitch the passthrough columns back together with the encoded ones
        X_t = ww.utils.concat_columns([X_orig, X_t])

        return X_t

tamargrey (Contributor, Author):

Implemented this in dd5c075 - happy to change if we should be doing this differently!

Contributor:

I'm not sure what you mean here by "we don't want users messing around with the same object they passed in", could you explain that a little bit? If we're trying not to modify the user's original data, this wouldn't be the place to enforce it. I'm also worried that copying the data (especially when this component should be a no-op) uses unnecessary time and storage, especially with larger datasets.

tamargrey (Contributor, Author):

Yeah, I was assuming that this was the place to be worrying about modifying the user's original data. My thought was that if you pass a dataframe into transform, we shouldn't ever give you back the exact same object. But if that's not a contract we need to uphold here, I'm more than happy to avoid the extra computation!
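
For what it's worth, the contract under discussion boils down to a check like this (hypothetical test code, assuming the early-return copy stays in place):

import pandas as pd
from evalml.pipelines.components import OrdinalEncoder

X = pd.DataFrame({"age": [25, 32, 47]})  # nothing ordinal to encode
encoder = OrdinalEncoder()
encoder.fit(X)
X_t = encoder.transform(X)

# The no-op path still returns a distinct object, not the caller's own frame
assert X_t is not X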

evalml/tests/component_tests/test_ordinal_encoder.py: outdated comment, resolved
tamargrey requested a review from eccabay on October 3, 2022 at 14:55
eccabay (Contributor) left a comment:

Looks great! Just left some final cleanup comments but overall this is solid


col for col in X.columns if col in self.features_to_encode
]

X_t = X
Contributor:

Why do this here? We shouldn't transform X at all during fit, and it seems like the rest of this just uses one or the other interchangeably.

tamargrey (Contributor, Author):

I think that was a holdover from the onehot encoder, but I can definitely remove

categories.append(unique_values)

# Add any null values into the categories lists so that they aren't treated as unknown values
# This is needed because Ordinal.order won't indicate if nulls are present, and SKOrdinalEncoder
Contributor:

Gotcha, a documentation update would definitely be helpful!

if encoded_missing_value is None:
encoded_missing_value = np.nan

self._encoder = SKOrdinalEncoder(
Contributor:

This should be saved as the _component_obj to match the pattern of the rest of our components!

tamargrey (Contributor, Author):

Changing!

Out of curiosity: is this just convention, or is it needed for things to work properly in evalml? I ask because this was another holdover from the onehot encoder, and I do see _encoder used elsewhere across the repo (though not nearly as much as _component_obj).

Contributor:

It's mostly a convention thing, afaik
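
For readers following along, the convention looks roughly like this; the base-class interaction is paraphrased here, not quoted from the repo:

from sklearn.preprocessing import OrdinalEncoder as SKOrdinalEncoder

from evalml.pipelines.components.transformers import Transformer


class MyEncoder(Transformer):
    """Illustrative component wrapping a sklearn estimator."""

    name = "My Encoder"

    def __init__(self, random_seed=0, **kwargs):
        parameters = {}
        parameters.update(kwargs)
        # Passing the sklearn object up as component_obj stores it on
        # self._component_obj, which the shared base class delegates to
        # for default fit/transform behavior.
        encoder = SKOrdinalEncoder()
        super().__init__(
            parameters=parameters,
            component_obj=encoder,
            random_seed=random_seed,
        )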

tamargrey force-pushed the add-ordinal-encoder branch 2 times, most recently from 54386f0 to 214b910, on October 4, 2022 at 15:56
jeremyliweishih (Collaborator) left a comment:

LGTM! well done 😄

tamargrey merged commit 1e9ce75 into main on Oct 6, 2022
tamargrey deleted the add-ordinal-encoder branch on October 6, 2022 at 16:01