Add OrdinalEncoder component #3736

Merged: 20 commits merged into main from add-ordinal-encoder on Oct 6, 2022

Conversation

tamargrey (Contributor) commented on Sep 28, 2022:

Adds the OrdinalEncoder component, implementing #1389.

The implementation is loosely based on the OneHotEncoder, with a few key differences:

  • We only need to make one encoded feature for each ordinal column, so the naming schema is simplified
  • Because all values in a column must be encoded to something, we do not have the option to ignore unknown values. We can either specify a value to encode them as or raise an error.
  • There's not the same need to drop categories, since we don't have the problem of duplicate features in binary columns that exists with one-hot encoding
  • There's not the same need to have a top_n parameter to limit the number of features created
  • Missing values cannot be treated as a category. This is because Woodwork's Ordinal logical type doesn't include nans, which means that we can't specify a null value's place in the order. Therefore, we either keep nulls as np.nan or convert them to a separate encoded value that users can specify.

(Note, this PR does not integrate the new ordinal encoder into the EvalML pipeline)
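
To make the new component concrete, here is a rough usage sketch; it assumes the parameters visible in this PR's diff (features_to_encode, categories, handle_unknown, encoded_missing_value), Woodwork's Ordinal logical type, and an illustrative output column naming:

import pandas as pd
from woodwork.logical_types import Ordinal

from evalml.pipelines.components import OrdinalEncoder

X = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
# The Ordinal logical type carries the category order the encoder reads at fit time
X.ww.init(logical_types={"size": Ordinal(order=["small", "medium", "large"])})

encoder = OrdinalEncoder()  # categories default to each column's Ordinal order
encoder.fit(X)
X_t = encoder.transform(X)  # one encoded Double column per ordinal input column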

codecov bot commented on Sep 28, 2022:

Codecov Report

Merging #3736 (ee81b32) into main (9331246) will increase coverage by 0.1%.
The diff coverage is 99.8%.

@@           Coverage Diff           @@
##            main   #3736     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        339     341      +2     
  Lines      34845   35235    +390     
=======================================
+ Hits       34714   35103    +389     
- Misses       131     132      +1     
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 99.1% <ø> (ø)
...omponents/transformers/encoders/ordinal_encoder.py 99.0% <99.0%> (ø)
...lines/components/transformers/encoders/__init__.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_ordinal_encoder.py 100.0% <100.0%> (ø)


CLAassistant commented on Sep 28, 2022:

CLA assistant check
All committers have signed the CLA.

@@ -110,6 +110,7 @@ def fit(self, X, y=None):
top_n = self.parameters["top_n"]
X = infer_feature_types(X)
if self.features_to_encode is None:
# --> should update to not include ordinals once the ord encoder is integrated? Maybe that's configurable based on whether ordinal encoder is used?
tamargrey (Contributor, Author):

A note to myself about integrating the new component. Do we need a separate issue for actually using the component? Or do we use the same issue?

Collaborator:

We usually file another ticket but it's flexible!

tamargrey (Contributor, Author):

Created a follow-up issue: #3744

"top_n": top_n,
"features_to_encode": features_to_encode,
"categories": categories,
"handle_unknown": handle_unknown,
tamargrey (Contributor, Author):

I didn't include a handle_missing parameter, since "as_category" isn't an option here, but that means we don't have the option to error when nans are seen.

Do we want a handle_missing parameter that is either "use_encoded_value" or "error" and then pairs with the encoded_missing_value parameter?
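
For reference, a minimal sketch of the underlying scikit-learn behavior being wrapped here (assumes scikit-learn >= 1.1; the -1 is only an example value). As the fit logic in this PR does, nan is appended to the categories list so it isn't treated as an unknown value:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder as SKOrdinalEncoder

X = pd.DataFrame({"size": ["small", "large", np.nan]})

enc = SKOrdinalEncoder(
    categories=[["small", "medium", "large", np.nan]],  # nan appended, mirroring this PR's fit
    encoded_missing_value=-1,  # with the default (np.nan), nulls would pass through unchanged
)
enc.fit_transform(X)  # array([[ 0.], [ 2.], [-1.]])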

return self._get_feature_names()

def _get_feature_provenance(self):
return self._provenance
tamargrey (Contributor, Author):

This isn't yet covered by tests, I assume, because it's not used in the EvalML pipeline. I didn't see tests in the other components for this method, so are we okay leaving this uncovered for now?

Contributor:

I'm certainly not fussed about it, but if anyone disagrees speak now

eccabay (Contributor) left a comment:

This is awesome work. There was a lot to think through here and I'm impressed you did it all. I left some cleanup/clarity comments, but the only blocking comment I have is to remove the top_n parameter. It isn't necessary and (as made clear by all the test cases you so painstakingly wrote!) it's overly complicated without adding any value.

Comment on lines 151 to 144
# Put features_to_encode in the same relative order as the columns in the dataframe
self.features_to_encode = [
col for col in X.columns if col in self.features_to_encode
]
Contributor:

Is this necessary? If so, I just learned something new.

tamargrey (Contributor, Author):

This is in place because we say that

categories[i] is a list of the categories for the column at index i in the dataframes passed in at fit and transform

That kind of breaks down once you're specifying features_to_encode or if not all of the dataframe's columns are ordinal in nature, so I'm making the assumption that the order of the sublists in categories is just the relative order from the original dataframe.

I can make note of that assumption in the docstring. Alternatively, we could just make categories a dictionary mapping column name to the list of categories. Was there a specific reason it's a list of lists, other than that's what we pass into the SKLearn encoder? The SKLearn encoder requires that categories be the same length as the number of columns in the dataframe, and now that I really think about it, a dictionary probably fits our use case better.
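
To make the two parameterizations concrete (column names here are hypothetical):

# List of lists: categories[i] must line up with the i-th encoded column,
# so callers need to know the columns' relative order in the dataframe
categories = [["low", "medium", "high"], ["S", "M", "L"]]

# Dict: keyed by column name, so relative order no longer matters
categories = {"priority": ["low", "medium", "high"], "size": ["S", "M", "L"]}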

Contributor:

I'll defer to you on this one! If a dictionary makes more sense I support switching over, but this is just fine as is.

tamargrey (Contributor, Author):

Okay, I went through the refactor to use a dict, and I continue to be a fan. It wasn't a small number of changes, though: 54386f0

It lets users avoid having to worry about the relative order of their inputs to the encoder and lets us remove this line.

"""
X = infer_feature_types(X)

X_copy = X.ww.copy()
Contributor:

Nitpick about variable names: I would replace the X with X_t and X_copy with X. We should also avoid the need to copy in that case, since calling X.ww.drop should return a new DataFrame.

X_t = X.ww.drop(columns=self.features_to_encode)

tamargrey (Contributor, Author):

This got me thinking: what if we return a copy of X early in the case where there are no features to encode (we'd have to actually copy in that case, because I assume we don't want users ending up with the same object they passed in), and otherwise proceed with an X_orig of the non-ordinal columns that we concat with the X_t of transformed ordinal columns? Would that be terribly out of line with how we normally handle transform?

        X = infer_feature_types(X)

        # No-op case: return a fresh copy rather than the caller's own frame
        if not self.features_to_encode:
            return X.ww.copy()

        # Keep the non-ordinal columns aside so they pass through untouched
        X_orig = X.ww.drop(columns=self.features_to_encode)

        # Call sklearn's transform on only the ordinal columns
        X_t = pd.DataFrame(
            self._encoder.transform(X[self.features_to_encode]),
            index=X.index,
        )
        X_t.columns = self._get_feature_names()
        X_t.ww.init(logical_types={c: "Double" for c in X_t.columns})
        self._feature_names = X_t.columns

        # Stitch the passthrough columns back together with the encoded ones
        X_t = ww.utils.concat_columns([X_orig, X_t])

        return X_t

tamargrey (Contributor, Author):

Implemented this in dd5c075 - happy to change if we should be doing this differently!

Contributor:

I'm not sure what you mean here by "we don't want users messing around with the same object they passed in", could you explain that a little bit? If we're trying not to modify the user's original data, this wouldn't be the place to enforce it. I'm also worried that copying the data (especially when this component should be a no-op) uses unnecessary time and storage, especially with larger datasets.

tamargrey (Contributor, Author):

Yeah, I was assuming that this was the place to be worrying about modifying the user's original data. My thought was that if you pass a dataframe into transform, we shouldn't ever give you back the exact same object. But if that's not a contract we need to uphold here, I'm more than happy to avoid the extra computation!
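
For what it's worth, the contract under discussion boils down to a check like this (hypothetical test code, assuming the early-return copy stays in place):

import pandas as pd
from evalml.pipelines.components import OrdinalEncoder

X = pd.DataFrame({"age": [25, 32, 47]})  # nothing ordinal to encode
encoder = OrdinalEncoder()
encoder.fit(X)
X_t = encoder.transform(X)

# The no-op path still returns a distinct object, not the caller's own frame
assert X_t is not X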

evalml/tests/component_tests/test_ordinal_encoder.py: outdated comment, resolved
tamargrey requested a review from eccabay on October 3, 2022 at 14:55
eccabay (Contributor) left a comment:

Looks great! Just left some final cleanup comments but overall this is solid


col for col in X.columns if col in self.features_to_encode
]

X_t = X
Contributor:

Why do this here? We shouldn't transform X at all during fit, and it seems like the rest of this just uses one or the other interchangeably.

tamargrey (Contributor, Author):

I think that was a holdover from the onehot encoder, but I can definitely remove

categories.append(unique_values)

# Add any null values into the categories lists so that they aren't treated as unknown values
# This is needed because Ordinal.order won't indicate if nulls are present, and SKOrdinalEncoder
Contributor:

Gotcha, a documentation update would definitely be helpful!

if encoded_missing_value is None:
encoded_missing_value = np.nan

self._encoder = SKOrdinalEncoder(
Contributor:

This should be saved as the _component_obj to match the pattern of the rest of our components!

tamargrey (Contributor, Author):

Changing!

Out of curiosity: is this just convention, or is it needed for things to work properly in evalml? I ask because this was another holdover from the onehot encoder, and I do see _encoder used elsewhere across the repo (though not nearly as much as _component_obj).

Contributor:

It's mostly a convention thing, afaik
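
For readers following along, the convention looks roughly like this; the base-class interaction is paraphrased here, not quoted from the repo:

from sklearn.preprocessing import OrdinalEncoder as SKOrdinalEncoder

from evalml.pipelines.components.transformers import Transformer


class MyEncoder(Transformer):
    """Illustrative component wrapping a sklearn estimator."""

    name = "My Encoder"

    def __init__(self, random_seed=0, **kwargs):
        parameters = {}
        parameters.update(kwargs)
        # Passing the sklearn object up as component_obj stores it on
        # self._component_obj, which the shared base class delegates to
        # for default fit/transform behavior.
        encoder = SKOrdinalEncoder()
        super().__init__(
            parameters=parameters,
            component_obj=encoder,
            random_seed=random_seed,
        )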

tamargrey force-pushed the add-ordinal-encoder branch 2 times, most recently from 54386f0 to 214b910, on October 4, 2022 at 15:56
jeremyliweishih (Collaborator) left a comment:

LGTM! well done 😄

tamargrey merged commit 1e9ce75 into main on Oct 6, 2022
tamargrey deleted the add-ordinal-encoder branch on October 6, 2022 at 16:01