Adding OneHotEncoder HypergridAdapter #130

edcthayer · 2020-10-07T20:39:45Z

The OneHotEncoder Hypergrid Adapter transforms categorical dimensions into dummy variables using the sklearn OneHotEncoder. By default, each categorical dimension produces its own collection of one-hot-encoded dummy variables, but one may also generate a one-hot encoding of the cross product of all categorical dimensions' categories using the merge_all_categorical_dimensions argument (as used in the RERF surrogate model).
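For readers unfamiliar with the two modes, here is a minimal sketch using sklearn's OneHotEncoder directly. It does not use the adapter from this PR; the column names, category values, and the "|" separator used to build the cross product are assumptions made purely for illustration.

```python
import itertools

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical dimensions and their allowed values.
categories = {
    "color": ["blue", "red"],
    "shape": ["circle", "square", "triangle"],
}
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "shape": ["square", "circle", "triangle"],
})

# Per-dimension mode: each dimension gets its own block of dummy columns.
per_dimension_encoder = OneHotEncoder(
    categories=[categories["color"], categories["shape"]]
)
per_dimension = per_dimension_encoder.fit_transform(
    df[["color", "shape"]].to_numpy(dtype=object)
).toarray()

# Merged mode: a single dimension whose categories are the cross product
# of all the original categories (2 * 3 = 6 combinations here).
cross_product = [
    "|".join(combo)
    for combo in itertools.product(categories["color"], categories["shape"])
]
merged_values = (df["color"] + "|" + df["shape"]).to_numpy(dtype=object)
merged_encoder = OneHotEncoder(categories=[cross_product])
merged = merged_encoder.fit_transform(merged_values.reshape(-1, 1)).toarray()

print(per_dimension.shape)  # 2 + 3 = 5 dummy columns
print(merged.shape)         # 2 * 3 = 6 dummy columns
```

In the per-dimension case every row has one active column per dimension, whereas in the merged case every row has exactly one active column overall.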

Once this adapter is approved, it will be integrated into the regression-enhanced random forest regression (RERF) model.

…; corrected some defects in Point, HypergridAdapter; Added more unit tests

source/Mlos.Python/mlos/Spaces/Point.py

source/Mlos.Python/mlos/Spaces/HypergridAdapters/HypergridAdapter.py

byte-sculptor · 2020-10-20T20:33:35Z

...n/mlos/Spaces/HypergridAdapters/unit_tests/TestCategoricalToOneHotEncodedHypergridAdapter.py

+ projected_df = adapter.project_dataframe(df=original_df, in_place=False)
+ self.assertTrue(id(original_df) != id(projected_df))
+ for column in adapter.get_one_hot_encoded_column_names():
+ self.assertTrue(projected_df[column].between(0, 1).all())


Would it be good to check if projected_df[column].isin([0,1]).all() instead of between(0, 1)?
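To illustrate why the stricter check matters, here is a small sketch with a made-up column (not data from the PR's tests). `between(0, 1)` accepts any value in the closed interval, so a fractional value slips through, while `isin([0, 1])` only accepts the two values a one-hot-encoded column should contain.

```python
import pandas as pd

# Hypothetical column that should be one hot encoded but contains a stray 0.5.
column = pd.Series([0.0, 1.0, 0.5, 1.0])

# between() is inclusive on both ends by default, so 0.5 passes.
passes_between = bool(column.between(0, 1).all())

# isin() requires each value to be exactly 0 or 1, so 0.5 fails.
passes_isin = bool(column.isin([0, 1]).all())

print(passes_between, passes_isin)
```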

...n/mlos/Spaces/HypergridAdapters/unit_tests/TestCategoricalToOneHotEncodedHypergridAdapter.py

source/Mlos.Python/mlos/Spaces/HypergridAdapters/CategoricalToOneHotEncodedHypergridAdapter.py

…adapter

source/.pylintrc

bpkroth · 2020-10-22T15:13:08Z

source/Mlos.Python/mlos/Spaces/HypergridAdapters/CategoricalToOneHotEncodedHypergridAdapter.py

+
+
+class CategoricalToOneHotEncodingAdapteeTargetMapping:
+ """ Retains the list of target Hypergrid's (one hot encoded) dimensions


@amueller can comment more, but I think these docstrings are not supposed to have the space separation if they are to be processed by sphinx for API publication.

bpkroth · 2020-10-22T15:13:18Z

source/Mlos.Python/mlos/Spaces/HypergridAdapters/CategoricalToOneHotEncodedHypergridAdapter.py

+ return self._target
+
+ def get_original_categorical_column_names(self):
+ return self._adaptee_to_target_data_dict.keys()


I could also see providing access to the map itself for lookup, instead of just the collection of names, but that could always be added later when the need arises.
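One way such an accessor might look is sketched below. The class, the `get_column_name_mapping` method, and the dictionary contents are all hypothetical and are not part of this PR; only `get_original_categorical_column_names` and `_adaptee_to_target_data_dict` mirror names from the diff above.

```python
from types import MappingProxyType


class HypotheticalAdapter:
    """Illustrative stand-in only; not the adapter from this PR."""

    def __init__(self):
        # Assumed shape: original categorical column -> one hot encoded columns.
        self._adaptee_to_target_data_dict = {
            "color": ["color_blue", "color_red"],
            "shape": ["shape_circle", "shape_square"],
        }

    def get_original_categorical_column_names(self):
        # Same idea as the method in the diff: only the names are exposed.
        return self._adaptee_to_target_data_dict.keys()

    def get_column_name_mapping(self):
        # Hypothetical accessor: a read-only view of the whole map, so callers
        # can look up target columns without being able to modify the dict.
        return MappingProxyType(self._adaptee_to_target_data_dict)


adapter = HypotheticalAdapter()
names = list(adapter.get_original_categorical_column_names())
mapping = adapter.get_column_name_mapping()
print(names)
print(mapping["color"])
```

Returning a `MappingProxyType` rather than the dict itself keeps the internal state from being mutated by callers, which is one possible design choice among several.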

bpkroth · 2020-10-22T15:14:26Z

source/Mlos.Python/mlos/Spaces/HypergridAdapters/CategoricalToOneHotEncodedHypergridAdapter.py

+ 2) Since sklearn's OHE will handle both project and unproject dataframe transforms, prepare the OHE class.
+ This requires constructing the 'categories' argument for OHE (all categorical dims or 1 cross product dim).
+ The dimension's .linspace() method provides the order list of values but doesn't include possible np.NaN values,
+ hence that list is augmented to include the string 'nan' which pandas.DataFrame.apply(map(str)) will produce from a np.NaN value.


Suggested change

hence that list is augmented to include the string 'nan' which pandas.DataFrame.apply(map(str)) will produce from a np.NaN value.

hence that list is augmented to include the string 'nan' which pandas.DataFrame.apply(map(str)) will produce from a np.NaN value.

?

Similar below.
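As a rough illustration of the 'nan' handling that the docstring under review describes, the sketch below stringifies a column containing np.nan and encodes it with an explicit category list that includes the string 'nan'. The `linspace_values` list is a hypothetical stand-in for what a dimension's .linspace() would provide, and the per-column `map(str)` call is one assumed way of writing the stringification.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the ordered values a dimension's linspace() yields.
linspace_values = ["a", "b", "c"]

# Augment the category list with the string that str(np.nan) produces.
categories = [str(value) for value in linspace_values] + ["nan"]

df = pd.DataFrame({"x": ["a", np.nan, "c"]})

# Mapping str over each column turns the missing value into the string 'nan'.
stringified = df.apply(lambda column: column.map(str))

encoder = OneHotEncoder(categories=[categories])
encoded = encoder.fit_transform(stringified.to_numpy(dtype=object)).toarray()

print(stringified["x"].tolist())
print(encoded)
```

Without 'nan' in the category list, the encoder's default `handle_unknown="error"` setting would reject the stringified missing value at fit time.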

bpkroth

I don't have any substantial feedback.
I think there are some comment styling things that will show up when we try to publish the API docs here: https://microsoft.github.io/MLOS/python_api/
I don't think we have instructions written up on testing that yourself yet.
The rough outline goes something like this:

cd website
# generate the Python APIs to be served at http://localhost:8080/MLOS/python_api/
make sphinx-site # repeat this one as necessary while the docker container is running
# generate the rest of the http://localhost:8080/MLOS site content
make hugo-site # optional
# start a local webserver
# borrowed from https://github.com/microsoft/MLOS/blob/main/website/test_site_links.sh#L48
docker run -d --name mlos-website-link-checker -v $PWD:/src/MLOS/website -v $PWD/nginx.conf:/etc/nginx/conf.d/mlos.conf -p 8080:8080 nginx:latest
# now point your browser at http://localhost:8080/MLOS/python_api/ to double check things

Note: the instructions assume a Linux environment atm (e.g. via WSL2)

bpkroth · 2020-10-22T16:46:24Z

Actually, now that I look at it, we seem to have neglected to transfer some API doc comments to some of the public calls after we added the _ underscore prefix.

For example:
https://microsoft.github.io/MLOS/python_api/api/mlos.Spaces.HypergridAdapters.HypergridAdapter.html
https://github.com/Microsoft/MLOS/blob/732b9d7/source/Mlos.Python/mlos/Spaces/HypergridAdapters/HypergridAdapter.py#L98

We should fix that, though I don't think it needs to be done in this PR.

initial commit for OneHotEncoder HypergridAdapter

ca446e1

edcthayer requested review from amueller, bpkroth and byte-sculptor October 7, 2020 20:39

Cleaned up project/unproject; dropped point project/unproject methods…

692b74e

…; corrected some defects in Point, HypergridAdapter; Added more unit tests

edcthayer commented Oct 9, 2020

source/Mlos.Python/mlos/Spaces/Point.py

byte-sculptor reviewed Oct 20, 2020

source/Mlos.Python/mlos/Spaces/HypergridAdapters/HypergridAdapter.py Outdated

byte-sculptor reviewed Oct 20, 2020

...n/mlos/Spaces/HypergridAdapters/unit_tests/TestCategoricalToOneHotEncodedHypergridAdapter.py

byte-sculptor reviewed Oct 20, 2020

...n/mlos/Spaces/HypergridAdapters/unit_tests/TestCategoricalToOneHotEncodedHypergridAdapter.py

byte-sculptor reviewed Oct 20, 2020

source/Mlos.Python/mlos/Spaces/HypergridAdapters/CategoricalToOneHotEncodedHypergridAdapter.py Outdated

Addressing PR feedback

c2901da

byte-sculptor approved these changes Oct 21, 2020

Ed Thayer and others added 2 commits October 21, 2020 21:00

Addressing pylint

4493746

Merge branch 'main' into personal/edthaye/2020/sept/one_hot_encoding_…

6a8fd1b

…adapter

bpkroth reviewed Oct 22, 2020

bpkroth approved these changes Oct 22, 2020

Ed Thayer added 2 commits October 22, 2020 11:07

Addressing PR feedback

1ad784c

Merge branch 'personal/edthaye/2020/sept/one_hot_encoding_adapter' of h…

6fe4ccb

…ttps://github.com/microsoft/MLOS into personal/edthaye/2020/sept/one_hot_encoding_adapter

edcthayer merged commit 7c0e1fc into main Oct 22, 2020

edcthayer deleted the personal/edthaye/2020/sept/one_hot_encoding_adapter branch October 22, 2020 21:09
