Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH duck-typing scikit-learn estimator instead of inheritance #858

Merged
merged 53 commits into from
Jan 16, 2022

Conversation

NV-jpt
Copy link
Contributor

@NV-jpt NV-jpt commented Sep 2, 2021

Reference Issue

Fixes #856

What does this implement/fix? Explain your changes.

This PR has two parts:

First, it switches the logical check in check_neighbors_object() from an inheritance based check to be duck typing based (verifying based on the interface). More specifically, a logical check for whether a neighbors object is appropriate is done by checking if it inherits from the scikit-learn Mixin class KNeighborsMixin. This PR switches this check to evaluate whether the interface of the operator is consistent with KNeighborsMixin, rather than an explicit subclassing.

Second, it modifies several resamplers' _validate_estimator() methods to model the strategy adopted in KMeansSMOTE; that of verifying estimators by way of sklearn.base.clone(), instead of an isinstance check.

These changes allow for greater flexibility, as users will be able to cleanly integrate estimators from libraries that do not directly depend on scikit-learn, but enforce the same API contract.

Any other comments?

@NV-jpt NV-jpt changed the title Switch check_neighbors_object()'s KNeighborsMixin type checks to be interface based rather than inheritance based [ENH] Switch estimator operator logical check in check_neighbors_object() to be interface based rather than inheritance based Sep 2, 2021
@pep8speaks
Copy link

pep8speaks commented Sep 9, 2021

Hello @NV-jpt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-24 19:07:23 UTC

@codecov
Copy link

codecov bot commented Sep 9, 2021

Codecov Report

Merging #858 (b75b77d) into master (b3a3dce) will increase coverage by 0.06%.
The diff coverage is 97.95%.

❗ Current head b75b77d differs from pull request most recent head b627cf1. Consider uploading reports for the commit b627cf1 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #858      +/-   ##
==========================================
+ Coverage   97.20%   97.26%   +0.06%     
==========================================
  Files          96       97       +1     
  Lines        6253     6373     +120     
  Branches      712      721       +9     
==========================================
+ Hits         6078     6199     +121     
- Misses        104      105       +1     
+ Partials       71       69       -2     
Impacted Files Coverage Δ
imblearn/over_sampling/_adasyn.py 87.50% <ø> (-0.20%) ⬇️
imblearn/over_sampling/_smote/base.py 98.97% <ø> (ø)
imblearn/over_sampling/_smote/cluster.py 100.00% <ø> (ø)
imblearn/over_sampling/_smote/tests/test_smote.py 100.00% <ø> (ø)
imblearn/over_sampling/tests/test_adasyn.py 100.00% <ø> (ø)
...mpling/_prototype_selection/tests/test_nearmiss.py 100.00% <ø> (ø)
...election/tests/test_neighbourhood_cleaning_rule.py 100.00% <ø> (ø)
imblearn/utils/tests/test_validation.py 98.91% <81.81%> (-1.09%) ⬇️
imblearn/utils/testing.py 98.82% <96.15%> (-1.18%) ⬇️
imblearn/over_sampling/_smote/filter.py 87.82% <100.00%> (+1.38%) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b3a3dce...b627cf1. Read the comment docs.

@NV-jpt
Copy link
Contributor Author

NV-jpt commented Sep 9, 2021

@glemaitre , please let me know what you think about these proposed changes when you have a moment - If these changes are agreeable, I would be happy to update the docstring and related documentation to align them with these logical changes.

Additionally, I would also be happy to craft up a jupyter notebook demo tutorial to show how the integration of cuML into imbalanced-learn can provide significant speed-ups when re-sampling large datasets.

…or - similar to _validate_estimator() in KMeansSMOTE
@glemaitre
Copy link
Member

I will have a look soon. We are in the release process of scikit-learn. So I will probably make a small sprint right afterwards to make a sync release and thus look at all the issues/prs that are available.

@lgtm-com
Copy link

lgtm-com bot commented Sep 13, 2021

This pull request introduces 1 alert when merging 94b0725 into edf6eae - view on LGTM.com

new alerts:

  • 1 for Unused import

@NV-jpt NV-jpt changed the title [ENH] Switch estimator operator logical check in check_neighbors_object() to be interface based rather than inheritance based [ENH] Switch estimator operator logical checks to be interface based rather than inheritance based Sep 13, 2021
else:
raise_isinstance_error("svm_estimator", [SVC], self.svm_estimator)
self.svm_estimator_ = clone(self.svm_estimator)
Copy link
Contributor Author

@NV-jpt NV-jpt Sep 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change removes the explicit isinstance check for validating the SVC estimator in SVMSMOTE's _validate_estimator method; the estimator is instead validated by way of sklearn.base.clone(), similar to that of KMeansSMOTE.

This will enable the integration of SVM estimators that enforce the same API contract as sklearn instead of requiring the explicit class check (isinstance(svm_estimator, sklearn.svm.SVC))

As a motivating example, the integration of a GPU-accelerated SVC from cuML can offer significant performance gains when working with large datasets.

image

Hardware Specs for the Loose Benchmark:
Intel Xeon E5-2698, 2.2 GHz, 16-cores & NVIDIA V100 32 GB GPU

Benchmarking gist:
https://gist.github.com/NV-jpt/039a8d9c7d37365379faa1d7c7aafc5e

f"`estimator` has to be a KMeans clustering."
f" Got {type(self.estimator)} instead."
)
self.estimator_ = clone(self.estimator)
Copy link
Contributor Author

@NV-jpt NV-jpt Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change removes the explicit isinstance check for validating the KMeans estimator in ClusterCentroid's _validate_estimator method; the estimator is instead validated by way of sklearn.base.clone(), similar to that of KMeansSMOTE.

This will enable KMeans estimators that enforce the same API contract as sklearn to be integrated instead of requiring the explicit class check (isinstance(estimator, sklearn.cluster.KMeans))

As a motivating example, the integration of a GPU-accelerated KMeans estimator from cuML can offer significant performance gains when working with large datasets.

image

Hardware Specs for the Loose Benchmark:
Intel Xeon E5-2698, 2.2 GHz, 16-cores & NVIDIA V100 32 GB GPU

Benchmarking gist:
https://gist.github.com/NV-jpt/d176d4c1aa3b184103efe605ee96b0e6

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we will need tests to ensure that the functionality would work.
We would need as well some documentation.

@@ -93,7 +101,7 @@ def check_neighbors_object(nn_name, nn_object, additional_neighbor=0):
"""
if isinstance(nn_object, Integral):
return NearestNeighbors(n_neighbors=nn_object + additional_neighbor)
elif isinstance(nn_object, KNeighborsMixin):
elif _is_neighbors_object(nn_object):
return clone(nn_object)
else:
raise_isinstance_error(nn_name, [int, KNeighborsMixin], nn_object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should as well change the error message since we don't strictly require to be a KNeighborsMixin but instead to expose both kneighbors and kneighbors_graph.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been debating between two implementations here (my two latest commits [1 2]).

[1] uses sklearn.base.clone to verify that the nn_object is an sklearn-like estimator that can be cloned. This implementation is more consistent with how the library checks the integrity of other estimators - such as the KMeans Estimator check in KMeansSmote, but it does not protect users from the mistake of inputting other types of estimators.

[2] raises a TypeError if the nn_object is neither an integer, nor exposes both kneighbors and kneighbors_graph; thus, it protects users from this potential mistake.

Do you prefer one over the other?

@NV-jpt
Copy link
Contributor Author

NV-jpt commented Sep 28, 2021

Great! Thank you; I will go ahead and update the documentation to accommodate these changes. As for the testing, @glemaitre, do you have any specific tests in mind?

@glemaitre
Copy link
Member

@glemaitre, do you have any specific tests in mind?

I think that the idea is to make a smoke test plugging the other NeareastNeighbors. We could add a CI build with some additional dependency just for testing such behaviour (like what we do with keras or tensorflow).

@@ -18,6 +18,9 @@
JOBLIB_MIN_VERSION = "0.11"
THREADPOOLCTL_MIN_VERSION = "2.0.0"
PYTEST_MIN_VERSION = "5.0.1"
CUML_MIN_VERSION = "21.10"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to highlight this comment #858 (comment)

I believe using cuML here requires 21.12. 21.12 is the current nightly, which is on track for release this week https://docs.rapids.ai/maintainers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but I am kind of struggling to install cudatoolkit with conda. Locally, the package is available but it seems that this is not the case on Azure Pipeline. I am currently making sure that I have the latest version of conda.

This said, since there is no GPU on Azure, is cuML falling back on some sort of CPU computing or it will just fail?
In which case, I should investigate to find a free CI service for open source where I can get a GPU.

@glemaitre
Copy link
Member

On another note, I think that I will make some subsequent PR, to have either cuNumeric/Dask/Numpy as different backends. I already looked a Dask and it seems feasible. It would require to bypass the input validation thought.

@glemaitre
Copy link
Member

Thank you for the response, @glemaitre !! Does this test I have just now committed look appropriately placed? I can add additional tests for the other resamplers too.

@NV-jpt regarding the test, I would like to make one CI build work. Then, I would rewrite a common test for all samplers that takes a NN, KMeans and check that I can pass a cuML estimator. In a next PR, I will make that any estimator would take cuML estimator (SVM, etc.)

@glemaitre
Copy link
Member

@NV-jpt @beckernick Do you know if it would be possible to add imbalanced-learn in the gpuCI (https://gpuci.gpuopenanalytics.com/) to check the potential regression?

@glemaitre
Copy link
Member

So I will take an alternative here by using our own KNN since we only want to check that ducktyping is working properly. It will be enough for the purpose of the PR.

@glemaitre glemaitre merged commit afcc8e6 into scikit-learn-contrib:master Jan 16, 2022
@glemaitre
Copy link
Member

Thanks @NV-jpt I will make a subsequent PR for the other estimator, namely KNearestNeighbors that could benefit from the same duck-typing check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] Switch estimator operator logical checks to be interface based rather than inheritance based
4 participants