WIP: Multi_hot encoder for ambiguous inputs #162
base: master
Conversation
Nice tests. Just nitpicking:

```python
self.assertEqual(enc.transform(X_t).shape[1],
                 enc.transform(X_t[X_t['extra'] != 'A']).shape[1],
                 'We have to get the same count of columns')
```

should perhaps become:

```python
self.assertEqual(enc.transform(X_t).shape[1],
                 enc.transform(X_t[X_t['extra'] != 'D']).shape[1],  # Without the new value. Alternatively we can compare to: enc.transform(X).shape[1]
                 'We have to get the same count of columns')
```

Also:

```python
first_extract_column = out.columns.str.extract("(.*)_[1-9]+").dropna()[0].unique()[0]
```

should possibly be:

```python
first_extract_column = out.columns.str.extract("(.*)_[0-9]+").dropna()[0].unique()[0]
```
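To sanity-check the regex question in isolation (a standalone sketch with invented column names, not part of the PR's test suite): both character classes recover the base name even from a two-digit suffix such as "extra_10", because the pattern is not anchored to the end of the string.

```python
import pandas as pd

# Invented column names mimicking the encoder's output.
cols = pd.Index(["extra_1", "extra_2", "extra_10"])

# Neither pattern is anchored at the end, so for "extra_10" the class
# "[1-9]+" still matches the leading "1" and the captured base is "extra".
base_19 = cols.str.extract(r"(.*)_[1-9]+").dropna()[0].unique()
base_09 = cols.str.extract(r"(.*)_[0-9]+").dropna()[0].unique()
print(base_19.tolist(), base_09.tolist())  # ['extra'] ['extra']
```

The two classes only diverge for suffixes like "extra_0", where "[1-9]+" finds no match at all, so "[0-9]+" is the safer choice.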
Thank you for your reviews.
Good work.
That's actually a mistake of mine in test_one_hot.py. I will fix it.
I see. I was concerned about strings like "extra_10", which would not get captured. But it is merely a hypothetical concern. I am getting a warning in both the Python 2 and Python 3 Travis CI reports:
Maybe it could be silenced by setting the
It looks like there are typos in the documentation of the examples. I would consider moving
@fullflu Please check that MultiHotEncoder conforms to the changes in master. All these changes were about
LGTM.
I tested it in my local environment and in a browser (https://rubular.com/). I confirmed that the string 'extra_10' is extracted as 'extra', so it should be no problem.
I will check this if necessary. How serious is this warning?
I removed them.
They were my typos. I fixed them.
It would be nice to put them in the examples directory.
Oh, I had forked an old version of the code where
Good.
I just try to keep the test results free of errors and warnings - once I allow one warning, additional warnings tend to sneak in.
Without looking at the code: isn't it enough to rename
I got it. I added
In my code, that renaming is not enough.
Awesome. Move the example, and I will merge it. Note: Just write somewhere that
Can you also write somewhere a real-world application of this encoder? When does it happen that we know, e.g.:
Exactly, that is a very important point.
I also added a default_prior option, which can be used together with the prior option.
The problem of the probability distribution should be solved by these options. (Since the name 'prior' may be confusing, it can be renamed if necessary.)
I'm trying to compare encoder performance using the Boston and Titanic datasets, but the comparison is currently based on artificial preprocessing that masks several rows. [FYI]: In my experience, the situation where ambiguous features arise was caused by a change in the data-acquisition process: after a certain day, the granularity of a feature actually changed. I believe such dirty features are generated in various business fields.
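The granularity-change scenario can be illustrated with a toy frame (column name and values are invented; `str.get_dummies` is a stand-in for the idea of delimiter-based multi-hot encoding, not the PR's implementation):

```python
import pandas as pd

# Invented example: after a change in the data-acquisition process, some
# rows carry ambiguous labels joined by the "|" delimiter.
X = pd.DataFrame({"region": ["north", "south", "north|east", "south|west"]})

# A delimiter-based multi-hot encoding sets one indicator per member
# category of each (possibly ambiguous) value.
dummies = X["region"].str.get_dummies(sep="|")
print(dummies)

# Normalizing each row spreads one unit of probability mass uniformly
# across the member categories (0.5/0.5 for the ambiguous rows).
probs = dummies.div(dummies.sum(axis=1), axis=0)
```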
Nice. I like the use of
I propose to rename the optional argument names to something like:
But I am leaving the final pick up to you.
That's a nice example.
Thank you for your nice suggestion. This is the next plan:
Details of future enhancement: The ambiguity problem is arguably a special case of the missing-value imputation problem. The encoding method that I have implemented is based on the empirical distribution; however, other machine-learning-based imputation methods could be integrated with this delimiter-based multi-hot encoding.
How is it going to work? Is it similar to
It's up to you. The canonical solution is to propagate NaN to the output and then use some canned solution for missing-value imputation. But I can imagine missing-value treatments that would not work without seeing the raw data.
The canonical solution for a change of granularity in the data would be to use a hierarchical model (a.k.a. a mixed model). But there are many alternatives.
Although your suggestion is nice, I found that your prior-related options would sacrifice flexibility. That is why I want to retain the prior options. The details are described below; feel free to correct me if the description is wrong. The option that I implemented can handle 5 cases:
Your suggestion would not be able to handle the 5th case above. This is a tradeoff between flexibility and simplicity.
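A minimal sketch of the kind of flexibility being discussed (the function name, the prior dict, and the uniform fallback are invented for illustration; this is not the PR's actual API): an ambiguous value's unit of probability mass can be split uniformly across its member categories, or weighted by a user-supplied prior.

```python
def split_mass(value, prior=None, delimiter="|"):
    """Split one unit of probability mass across the member categories
    of a possibly ambiguous value like "A|B". Illustration only."""
    members = value.split(delimiter)
    # Uniform weights by default; otherwise weight by the given prior.
    weights = [1.0 if prior is None else float(prior[m]) for m in members]
    total = sum(weights)
    return {m: w / total for m, w in zip(members, weights)}

print(split_mass("A|B"))                          # {'A': 0.5, 'B': 0.5}
print(split_mass("A|B", prior={"A": 3, "B": 1}))  # {'A': 0.75, 'B': 0.25}
```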
What I had imagined was simpler than that.
The canonical solution you described would be nice.
Nice touch with the hyperprior.
Hi, @fullflu. Is there something I can help you with?
def __init__(self, verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', or_delimiter="|"):
Sometimes you use single quotes, like 'value', and sometimes double ones, like "|".
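For readers arriving at this thread now, here is a minimal sketch of the delimiter-based idea under discussion (the class name, column naming, and behavior are assumptions for illustration; this is not the PR's MultiHotEncoder, which has many more options):

```python
import pandas as pd

class TinyMultiHot:
    """Toy delimiter-based multi-hot encoder for '|'-delimited
    ambiguous categories. Illustration only."""

    def __init__(self, col, or_delimiter="|"):
        self.col = col
        self.or_delimiter = or_delimiter
        self.categories_ = None

    def fit(self, X):
        # Collect every member category seen in the training data.
        exploded = X[self.col].str.split(self.or_delimiter).explode()
        self.categories_ = sorted(exploded.unique())
        return self

    def transform(self, X):
        # One indicator column per known category; members not seen
        # during fit are silently ignored in this toy version.
        out = pd.DataFrame(0, index=X.index,
                           columns=[f"{self.col}_{c}" for c in self.categories_])
        for i, val in X[self.col].items():
            for m in val.split(self.or_delimiter):
                name = f"{self.col}_{m}"
                if name in out.columns:
                    out.at[i, name] = 1
        return out

X = pd.DataFrame({"extra": ["A", "B", "A|B"]})
enc = TinyMultiHot("extra").fit(X)
print(enc.transform(X))
```

An ambiguous row like "A|B" sets both `extra_A` and `extra_B` to 1, which is the behavior the tests quoted earlier in this thread exercise.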
This PR has been open since 2019? We are in 2024. What happened to this PR?
ping @VascoSch92 @janmotl @fullflu
Summary
Implement the fit and transform functions of a multi-hot encoding for ambiguous/dirty categorical features.
#161
I hope you will check its usefulness.