Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding OneHotEncoder HypergridAdapter #130

Merged
merged 7 commits into from
Oct 22, 2020

Conversation

edcthayer
Copy link
Contributor

@edcthayer edcthayer commented Oct 7, 2020

OneHotEncoder Hypergrid Adapter allows categorical dimensions to be transformed to dummy variables using the sklearn OneHotEncoder. By default, each categorical dimension will produce its own collection of OneHotEncoded dummy variables, but one may also generate a OneHotEncoding of the cross product of all categorical dimension categories using the merge_all_categorical_dimensions argument (as used in the RERF surrogate model).

Once this adapter is approved, it will be integrated into the regression-enhanced random forest regression (RERF) model.

…; corrected some defects in Point, HypergridAdapter; Added more unit tests
projected_df = adapter.project_dataframe(df=original_df, in_place=False)
self.assertTrue(id(original_df) != id(projected_df))
for column in adapter.get_one_hot_encoded_column_names():
self.assertTrue(projected_df[column].between(0, 1).all())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be good to check if projected_df[column].isin([0,1]).all() instead of between(0, 1)?

source/.pylintrc Show resolved Hide resolved


class CategoricalToOneHotEncodingAdapteeTargetMapping:
""" Retains the list of target Hypergrid's (one hot encoded) dimensions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amueller to comment more, but I think these are supposed to not have the space separation in order to be processed by sphinx for API publication.

return self._target

def get_original_categorical_column_names(self):
return self._adaptee_to_target_data_dict.keys()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could see also providing access to the the map for lookup instead of just the array of names, but that could always be added later when the need arises.

2) Since sklearn's OHE will handle both project and unproject dataframe transforms, prepare the OHE class.
This requires constructing the 'categories' argument for OHE (all categorical dims or 1 cross product dim).
The dimension's .linspace() method provides the order list of values but doesn't include possible np.NaN values,
hence that list is augmented to include the string 'nan' which pandas.DataFrame.apply(map(str)) will produce from a np.NaN value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
hence that list is augmented to include the string 'nan' which pandas.DataFrame.apply(map(str)) will produce from a np.NaN value.
hence that list is augmented to include the string 'nan' which pandas.DataFrame.apply(map(str)) will produce from a np.NaN value.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar below.

Copy link
Contributor

@bpkroth bpkroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any substantial feedback.
I think there are some comment styling things that will show up when we try to publish the API docs here: https://microsoft.github.io/MLOS/python_api/
I don't think we have instructions written up on testing that yourself yet.
The rough outline goes something like this:

cd website
# generate the Python APIs to be served at http://localhost:8080/MLOS/python_api/
make sphinx-site # repeat this one as necessary while the docker container is running
# generate the rest of the http://localhost:8080/MLOS site content
make hugo-site # optional
# start a local webserver
# borrowed from https://github.com/microsoft/MLOS/blob/main/website/test_site_links.sh#L48
docker run -d --name mlos-website-link-checker -v $PWD:/src/MLOS/website -v $PWD/nginx.conf:/etc/nginx/conf.d/mlos.conf -p 8080:8080 nginx:latest
# now point your browser at http://localhost:8080/MLOS/python_api/ to double check things

Note: the instructions assume a Linux environment atm (e.g. via WSL2)

@bpkroth
Copy link
Contributor

bpkroth commented Oct 22, 2020

I don't have any substantial feedback.
I think there are some comment styling things that will show up when we try to publish the API docs here: https://microsoft.github.io/MLOS/python_api/
I don't think we have instructions written up on testing that yourself yet.
The rough outline goes something like this:

cd website
# generate the Python APIs to be served at http://localhost:8080/MLOS/python_api/
make sphinx-site # repeat this one as necessary while the docker container is running
# generate the rest of the http://localhost:8080/MLOS site content
make hugo-site # optional
# start a local webserver
# borrowed from https://github.com/microsoft/MLOS/blob/main/website/test_site_links.sh#L48
docker run -d --name mlos-website-link-checker -v $PWD:/src/MLOS/website -v $PWD/nginx.conf:/etc/nginx/conf.d/mlos.conf -p 8080:8080 nginx:latest
# now point your browser at http://localhost:8080/MLOS/python_api/ to double check things

Note: the instructions assume a Linux environment atm (e.g. via WSL2)

Actually, now that I look at it, we seem to have neglected to transfer some API doc comments to some of the public calls after we added the _ underscore prefix.

For example:
https://microsoft.github.io/MLOS/python_api/api/mlos.Spaces.HypergridAdapters.HypergridAdapter.html
https://github.com/Microsoft/MLOS/blob/732b9d7/source/Mlos.Python/mlos/Spaces/HypergridAdapters/HypergridAdapter.py#L98

We should fix that, though I don't think it needs to be done in this PR.

@edcthayer edcthayer merged commit 7c0e1fc into main Oct 22, 2020
@edcthayer edcthayer deleted the personal/edthaye/2020/sept/one_hot_encoding_adapter branch October 22, 2020 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants