WIP: Multi_hot encoder for ambiguous inputs #162
base: master
Conversation
Nice tests. Just nitpicking:

```python
self.assertEqual(enc.transform(X_t).shape[1],
                 enc.transform(X_t[X_t['extra'] != 'A']).shape[1],
                 'We have to get the same count of columns')
```

should perhaps become:

```python
self.assertEqual(enc.transform(X_t).shape[1],
                 enc.transform(X_t[X_t['extra'] != 'D']).shape[1],  # Without the new value. Alternatively we can compare to: enc.transform(X).shape[1]
                 'We have to get the same count of columns')
```

Also:

```python
first_extract_column = out.columns.str.extract("(.*)_[1-9]+").dropna()[0].unique()[0]
```

should possibly be:

```python
first_extract_column = out.columns.str.extract("(.*)_[0-9]+").dropna()[0].unique()[0]
```
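To sanity-check the regex question in isolation (a standalone sketch with invented column names, not part of the PR's test suite): both character classes recover the base name even from a two-digit suffix such as "extra_10", because the pattern is not anchored to the end of the string.

```python
import pandas as pd

# Invented column names mimicking the encoder's output.
cols = pd.Index(["extra_1", "extra_2", "extra_10"])

# Neither pattern is anchored at the end, so for "extra_10" the class
# "[1-9]+" still matches the leading "1" and the captured base is "extra".
base_19 = cols.str.extract(r"(.*)_[1-9]+").dropna()[0].unique()
base_09 = cols.str.extract(r"(.*)_[0-9]+").dropna()[0].unique()
print(base_19.tolist(), base_09.tolist())  # ['extra'] ['extra']
```

The two classes only diverge for suffixes like "extra_0", where "[1-9]+" finds no match at all, so "[0-9]+" is the safer choice.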
Thank you for your reviews.
Good work.
That's actually a mistake of mine in test_one_hot.py. I will fix it.
I see. I was concerned about strings like "extra_10", which would not get captured. But it is merely a hypothetical concern. I am getting a warning in both the Python 2 and Python 3 Travis CI reports:
Maybe it could be silenced by setting the
It looks like there are typos in the documentation of the examples. I would consider moving
@fullflu Please check that MultiHotEncoder conforms to the changes in master. All these changes were about
LGTM.
I tested it in my local environment and in a browser (https://rubular.com/). I confirmed that the string 'extra_10' is extracted as 'extra', so it should be no problem.
I will check this if necessary. How serious is this warning?
I removed them.
They were my typos. I fixed them.
It would be nice to put them in the examples directory.
Oh, I had forked an old version of the code where
Good.
I just try to keep the test results free of errors and warnings - once I allow one warning, additional warnings tend to sneak in.
Without looking at the code: isn't it enough to rename
I got it. I added
In my code, that renaming is not enough.
Awesome. Move the example, and I will merge it. Note: Just write somewhere that
Can you also write somewhere a real-world application of this encoder? When does it happen that we know, e.g.:
Exactly, that is a very important point.
I also added a default_prior option, which can be used together with the prior option.
The problem of the probability distribution should be solved by these options. (Since the name 'prior' may be confusing, it can be renamed if necessary.)
I'm trying to compare encoder performance using the Boston and Titanic datasets, but the comparison is currently based on artificial preprocessing that masks several rows. [FYI]: In my experience, the situation where ambiguous features arise was caused by a change in the data-acquisition process: after a certain day, the granularity of a feature actually changed. I believe such dirty features are generated in various business fields.
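The granularity-change scenario can be illustrated with a toy frame (column name and values are invented; `str.get_dummies` is a stand-in for the idea of delimiter-based multi-hot encoding, not the PR's implementation):

```python
import pandas as pd

# Invented example: after a change in the data-acquisition process, some
# rows carry ambiguous labels joined by the "|" delimiter.
X = pd.DataFrame({"region": ["north", "south", "north|east", "south|west"]})

# A delimiter-based multi-hot encoding sets one indicator per member
# category of each (possibly ambiguous) value.
dummies = X["region"].str.get_dummies(sep="|")
print(dummies)

# Normalizing each row spreads one unit of probability mass uniformly
# across the member categories (0.5/0.5 for the ambiguous rows).
probs = dummies.div(dummies.sum(axis=1), axis=0)
```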
Nice. I like the use of
I propose to rename the optional argument names to something like:
But I am leaving the final pick up to you.
That's a nice example.
Thank you for your nice suggestion. This is the next plan:
Details of future enhancement: The ambiguity problem is arguably a special case of the missing-value imputation problem. The encoding method that I have implemented is based on the empirical distribution; however, other machine-learning-based imputation methods could be integrated with this delimiter-based multi-hot encoding.
How is it going to work? Is it similar to
It's up to you. The canonical solution is to propagate NaN to the output and then use some canned solution for missing-value imputation. But I can imagine missing-value treatments that would not work without seeing the raw data.
The canonical solution for a change of granularity in the data would be to use a hierarchical model (a.k.a. a mixed model). But there are many alternatives.
Although your suggestion is nice, I found that your prior-related options would sacrifice flexibility. That is why I want to retain the prior options. The details are described below; feel free to correct me if the description is wrong. The option that I implemented can handle 5 cases:
Your suggestion would not be able to handle the 5th case above. This is a tradeoff between flexibility and simplicity.
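A minimal sketch of the kind of flexibility being discussed (the function name, the prior dict, and the uniform fallback are invented for illustration; this is not the PR's actual API): an ambiguous value's unit of probability mass can be split uniformly across its member categories, or weighted by a user-supplied prior.

```python
def split_mass(value, prior=None, delimiter="|"):
    """Split one unit of probability mass across the member categories
    of a possibly ambiguous value like "A|B". Illustration only."""
    members = value.split(delimiter)
    # Uniform weights by default; otherwise weight by the given prior.
    weights = [1.0 if prior is None else float(prior[m]) for m in members]
    total = sum(weights)
    return {m: w / total for m, w in zip(members, weights)}

print(split_mass("A|B"))                          # {'A': 0.5, 'B': 0.5}
print(split_mass("A|B", prior={"A": 3, "B": 1}))  # {'A': 0.75, 'B': 0.25}
```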
What I had imagined was simpler than that.
The canonical solution you described would be nice.
Nice touch with the hyperprior.
Hi, @fullflu. Is there something I can help you with?
def __init__(self, verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', or_delimiter="|"):
Sometimes you use single quotes, like 'value', and sometimes double ones, like "|".
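For readers arriving at this thread now, here is a minimal sketch of the delimiter-based idea under discussion (the class name, column naming, and behavior are assumptions for illustration; this is not the PR's MultiHotEncoder, which has many more options):

```python
import pandas as pd

class TinyMultiHot:
    """Toy delimiter-based multi-hot encoder for '|'-delimited
    ambiguous categories. Illustration only."""

    def __init__(self, col, or_delimiter="|"):
        self.col = col
        self.or_delimiter = or_delimiter
        self.categories_ = None

    def fit(self, X):
        # Collect every member category seen in the training data.
        exploded = X[self.col].str.split(self.or_delimiter).explode()
        self.categories_ = sorted(exploded.unique())
        return self

    def transform(self, X):
        # One indicator column per known category; members not seen
        # during fit are silently ignored in this toy version.
        out = pd.DataFrame(0, index=X.index,
                           columns=[f"{self.col}_{c}" for c in self.categories_])
        for i, val in X[self.col].items():
            for m in val.split(self.or_delimiter):
                name = f"{self.col}_{m}"
                if name in out.columns:
                    out.at[i, name] = 1
        return out

X = pd.DataFrame({"extra": ["A", "B", "A|B"]})
enc = TinyMultiHot("extra").fit(X)
print(enc.transform(X))
```

An ambiguous row like "A|B" sets both `extra_A` and `extra_B` to 1, which is the behavior the tests quoted earlier in this thread exercise.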
This PR has been open since 2019? We are in 2024. What happened to this PR?
ping @VascoSch92 @janmotl @fullflu
Summary
Implement the fit and transform functions of a multi-hot encoding for ambiguous/dirty categorical features.
#161
I hope you will check its usefulness.