Adding OneHotEncoder HypergridAdapter #130
Conversation
…; corrected some defects in Point, HypergridAdapter; Added more unit tests
source/Mlos.Python/mlos/Spaces/HypergridAdapters/HypergridAdapter.py
projected_df = adapter.project_dataframe(df=original_df, in_place=False)
self.assertTrue(id(original_df) != id(projected_df))
for column in adapter.get_one_hot_encoded_column_names():
    self.assertTrue(projected_df[column].between(0, 1).all())
Would it be good to check projected_df[column].isin([0, 1]).all() instead of between(0, 1)?
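To illustrate the distinction the reviewer is pointing at: between(0, 1) is an inclusive range check that accepts any value in the interval, while isin([0, 1]) requires exact membership. A one-hot encoded column should only ever contain 0s and 1s, so isin is the stricter assertion. The toy Series below is illustrative, not taken from the PR:

```python
import pandas as pd

# A one-hot encoded column should contain only 0s and 1s;
# 0.5 is included here to show where the two checks disagree.
col = pd.Series([0.0, 1.0, 0.5])

# between(0, 1) is an inclusive range check, so 0.5 passes.
range_ok = col.between(0, 1).all()

# isin([0, 1]) tests exact membership, so 0.5 fails.
exact_ok = col.isin([0, 1]).all()

print(range_ok, exact_ok)  # True False
```

So with between(0, 1), a buggy fractional value would slip through the unit test undetected.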
...n/mlos/Spaces/HypergridAdapters/unit_tests/TestCategoricalToOneHotEncodedHypergridAdapter.py
...n/mlos/Spaces/HypergridAdapters/unit_tests/TestCategoricalToOneHotEncodedHypergridAdapter.py
source/Mlos.Python/mlos/Spaces/HypergridAdapters/CategoricalToOneHotEncodedHypergridAdapter.py
class CategoricalToOneHotEncodingAdapteeTargetMapping:
    """ Retains the list of target Hypergrid's (one hot encoded) dimensions
@amueller to comment more, but I think these are supposed to not have the space separation in order to be processed by sphinx for API publication.
        return self._target
    def get_original_categorical_column_names(self):
        return self._adaptee_to_target_data_dict.keys()
Could see also providing access to the map for lookup instead of just the array of names, but that could always be added later when the need arises.
2) Since sklearn's OHE will handle both project and unproject dataframe transforms, prepare the OHE class.
   This requires constructing the 'categories' argument for OHE (all categorical dims or 1 cross product dim).
   The dimension's .linspace() method provides the ordered list of values but doesn't include possible np.NaN values,
   hence that list is augmented to include the string 'nan', which pandas.DataFrame.apply(map(str)) will produce from a np.NaN value.
hence that list is augmented to include the string 'nan' which pandas.DataFrame.apply(map(str)) will produce from a np.NaN value.
?
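As a sketch of why the 'nan' augmentation described above is needed (toy data, not from the PR): stringifying a column turns np.NaN into the literal string 'nan', so 'nan' must appear in the explicit categories list handed to sklearn's OneHotEncoder, or encoding a row with a missing value would fail.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column containing a missing value.
df = pd.DataFrame({"color": ["red", "green", np.nan]})

# Stringify values; np.nan becomes the literal string 'nan'.
stringified = df["color"].apply(str)
print(stringified.tolist())  # ['red', 'green', 'nan']

# The per-dimension value list must therefore be augmented with 'nan',
# otherwise the encoder (default handle_unknown='error') rejects that row.
categories = [["green", "nan", "red"]]
ohe = OneHotEncoder(categories=categories)
encoded = ohe.fit_transform(stringified.to_frame()).toarray()
print(encoded.shape)  # (3, 3)
```

Without 'nan' in the categories list, fit_transform would raise on the third row instead of producing a dummy column for it.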
Similar below.
I don't have any substantial feedback.
I think there are some comment styling things that will show up when we try to publish the API docs here: https://microsoft.github.io/MLOS/python_api/
I don't think we have instructions written up on testing that yourself yet.
The rough outline goes something like this:
cd website
# generate the Python APIs to be served at http://localhost:8080/MLOS/python_api/
make sphinx-site # repeat this one as necessary while the docker container is running
# generate the rest of the http://localhost:8080/MLOS site content
make hugo-site # optional
# start a local webserver
# borrowed from https://github.com/microsoft/MLOS/blob/main/website/test_site_links.sh#L48
docker run -d --name mlos-website-link-checker -v $PWD:/src/MLOS/website -v $PWD/nginx.conf:/etc/nginx/conf.d/mlos.conf -p 8080:8080 nginx:latest
# now point your browser at http://localhost:8080/MLOS/python_api/ to double check things
Note: the instructions assume a Linux environment atm (e.g. via WSL2)
Actually, now that I look at it, we seem to have neglected to transfer some API doc comments to some of the public calls. We should fix that, though I don't think it needs to be done in this PR.
…ttps://github.com/microsoft/MLOS into personal/edthaye/2020/sept/one_hot_encoding_adapter
OneHotEncoder Hypergrid Adapter allows categorical dimensions to be transformed to dummy variables using the sklearn OneHotEncoder. By default, each categorical dimension produces its own collection of one-hot-encoded dummy variables, but one may also generate a one-hot encoding of the cross product of all categorical dimension categories using the merge_all_categorical_dimensions argument (as used in the RERF surrogate model). Once this adapter is approved, it will be integrated into the regression-enhanced random forest (RERF) model.