ENH duck-typing scikit-learn estimator instead of inheritance #858
Conversation
Codecov Report
@@ Coverage Diff @@
## master #858 +/- ##
==========================================
+ Coverage 97.20% 97.26% +0.06%
==========================================
Files 96 97 +1
Lines 6253 6373 +120
Branches 712 721 +9
==========================================
+ Hits 6078 6199 +121
- Misses 104 105 +1
+ Partials 71 69 -2
@glemaitre, please let me know what you think about these proposed changes when you have a moment. If these changes are agreeable, I would be happy to update the docstrings and related documentation to align them with these logical changes. Additionally, I would be happy to craft a Jupyter notebook demo/tutorial to show how integrating cuML into imbalanced-learn can provide significant speed-ups when re-sampling large datasets.
…or - similar to _validate_estimator() in KMeansSMOTE
I will have a look soon. We are in the release process of scikit-learn, so I will probably make a small sprint right afterwards to make a sync release and thus look at all the issues/PRs that are available.
This pull request introduces 1 alert when merging 94b0725 into edf6eae (view on LGTM.com).
-        else:
-            raise_isinstance_error("svm_estimator", [SVC], self.svm_estimator)
+        self.svm_estimator_ = clone(self.svm_estimator)
This change removes the explicit `isinstance` check for validating the SVC estimator in SVMSMOTE's `_validate_estimator` method; the estimator is instead validated by way of `sklearn.base.clone()`, similar to `KMeansSMOTE`.
This will enable the integration of SVM estimators that enforce the same API contract as sklearn, instead of requiring the explicit class check (`isinstance(svm_estimator, sklearn.svm.SVC)`).
As a motivating example, the integration of a GPU-accelerated SVC from cuML can offer significant performance gains when working with large datasets.
Hardware Specs for the Loose Benchmark:
Intel Xeon E5-2698 (2.2 GHz, 16 cores) & NVIDIA V100 32 GB GPU
Benchmarking gist:
https://gist.github.com/NV-jpt/039a8d9c7d37365379faa1d7c7aafc5e
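To illustrate why `clone()`-based validation accepts any estimator following the scikit-learn API regardless of its class hierarchy, here is a minimal sketch. The class `GPUAcceleratedSVC` and the helper `clone_sketch` are hypothetical stand-ins: `sklearn.base.clone` itself does more work, but the relevant contract it enforces is that the object exposes `get_params()` and can be reconstructed from those parameters.

```python
class GPUAcceleratedSVC:
    """Stand-in for an SVC-like estimator that does NOT subclass sklearn.svm.SVC,
    but follows the scikit-learn parameter contract."""

    def __init__(self, C=1.0, kernel="rbf"):
        self.C = C
        self.kernel = kernel

    def get_params(self, deep=True):
        return {"C": self.C, "kernel": self.kernel}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self


def clone_sketch(estimator):
    """Simplified stand-in for sklearn.base.clone: rebuild the estimator
    from its constructor parameters, rejecting anything non-sklearn-like."""
    if not hasattr(estimator, "get_params"):
        raise TypeError(
            f"Cannot clone {estimator!r}: it does not expose get_params()."
        )
    return type(estimator)(**estimator.get_params())


svm = GPUAcceleratedSVC(C=10.0)
fresh = clone_sketch(svm)  # accepted despite not inheriting from sklearn.svm.SVC
```

An `isinstance(svm, sklearn.svm.SVC)` check would have rejected this estimator; the clone-based check admits it because it honours the same API contract.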
-                f"`estimator` has to be a KMeans clustering."
-                f" Got {type(self.estimator)} instead."
-            )
+        self.estimator_ = clone(self.estimator)
This change removes the explicit `isinstance` check for validating the KMeans estimator in `ClusterCentroids`' `_validate_estimator` method; the estimator is instead validated by way of `sklearn.base.clone()`, similar to `KMeansSMOTE`.
This will enable KMeans estimators that enforce the same API contract as sklearn to be integrated, instead of requiring the explicit class check (`isinstance(estimator, sklearn.cluster.KMeans)`).
As a motivating example, the integration of a GPU-accelerated KMeans estimator from cuML can offer significant performance gains when working with large datasets.
Hardware Specs for the Loose Benchmark:
Intel Xeon E5-2698 (2.2 GHz, 16 cores) & NVIDIA V100 32 GB GPU
Benchmarking gist:
https://gist.github.com/NV-jpt/d176d4c1aa3b184103efe605ee96b0e6
I think that we will need tests to ensure that the functionality works.
We will also need some documentation.
imblearn/utils/_validation.py
@@ -93,7 +101,7 @@ def check_neighbors_object(nn_name, nn_object, additional_neighbor=0):
     """
     if isinstance(nn_object, Integral):
         return NearestNeighbors(n_neighbors=nn_object + additional_neighbor)
-    elif isinstance(nn_object, KNeighborsMixin):
+    elif _is_neighbors_object(nn_object):
         return clone(nn_object)
     else:
         raise_isinstance_error(nn_name, [int, KNeighborsMixin], nn_object)
I think that we should also change the error message, since we don't strictly require a `KNeighborsMixin` but instead require the object to expose both `kneighbors` and `kneighbors_graph`.
I have been debating between two implementations here (my two latest commits [1, 2]).
[1] uses `sklearn.base.clone` to verify that the nn_object is an sklearn-like estimator that can be cloned. This implementation is more consistent with how the library checks the integrity of other estimators, such as the KMeans estimator check in `KMeansSMOTE`, but it does not protect users from the mistake of passing in other types of estimators.
[2] raises a TypeError if the nn_object is neither an integer nor exposes both `kneighbors` and `kneighbors_graph`; thus, it protects users from this potential mistake.
Do you prefer one over the other?
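A minimal sketch of the trade-off between the two options. The function names (`validate_by_clone`, `validate_by_interface`) and the toy classes are illustrative only, not the actual implementation: [1] admits anything that follows the scikit-learn parameter contract, while [2] additionally requires the neighbors interface.

```python
def validate_by_clone(nn_object):
    """[1] Accept anything clonable, i.e. exposing get_params (sklearn contract)."""
    if not hasattr(nn_object, "get_params"):
        raise TypeError(f"{nn_object!r} is not a scikit-learn compatible estimator.")
    # Simplified stand-in for sklearn.base.clone:
    return type(nn_object)(**nn_object.get_params())


def validate_by_interface(nn_object):
    """[2] Stricter: require the KNeighborsMixin-like interface explicitly."""
    if not all(hasattr(nn_object, m) for m in ("kneighbors", "kneighbors_graph")):
        raise TypeError(
            f"{type(nn_object).__name__} must expose both `kneighbors` "
            "and `kneighbors_graph`."
        )
    return nn_object


class FakeNN:
    """Toy nearest-neighbors estimator satisfying both checks."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def get_params(self, deep=True):
        return {"n_neighbors": self.n_neighbors}

    def kneighbors(self, X=None):
        pass

    def kneighbors_graph(self, X=None):
        pass


class NotANN:
    """Clonable estimator with no neighbors interface: [1] accepts, [2] rejects."""

    def get_params(self, deep=True):
        return {}


nn = FakeNN(n_neighbors=3)
clone_ok = validate_by_clone(nn).n_neighbors        # both checks pass for FakeNN
iface_ok = validate_by_interface(nn) is nn
loose_ok = validate_by_clone(NotANN())              # [1] does not catch the mistake
try:
    validate_by_interface(NotANN())                 # [2] does
    strict_rejects = False
except TypeError:
    strict_rejects = True
```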
Great! Thank you; I will go ahead and update the documentation to accommodate these changes. As for the testing, @glemaitre, do you have any specific tests in mind?
> @glemaitre, do you have any specific tests in mind?

I think that the idea is to make a smoke test plugging in the other NearestNeighbors. We could add a CI build with some additional dependency just for testing such behaviour (like what we do with keras or tensorflow).
imblearn/_min_dependencies.py
@@ -18,6 +18,9 @@
 JOBLIB_MIN_VERSION = "0.11"
 THREADPOOLCTL_MIN_VERSION = "2.0.0"
 PYTEST_MIN_VERSION = "5.0.1"
+CUML_MIN_VERSION = "21.10"
Just wanted to highlight this comment #858 (comment)
I believe using cuML here requires 21.12. 21.12 is the current nightly, which is on track for release this week https://docs.rapids.ai/maintainers
Yep, but I am kind of struggling to install `cudatoolkit` with `conda`. Locally, the package is available, but it seems that this is not the case on Azure Pipelines. I am currently making sure that I have the latest version of `conda`.
That said, since there is no GPU on Azure, does cuML fall back on some sort of CPU computation, or will it just fail?
In that case, I should investigate finding a free CI service for open source where I can get a GPU.
On another note, I think that I will make some subsequent PRs to have either cuNumeric/Dask/NumPy as different backends. I already looked at Dask and it seems feasible. It would require bypassing the input validation, though.
@NV-jpt regarding the test, I would like to make one CI build work. Then, I would write a common test for all samplers that take a NN or KMeans and check that I can pass a cuML estimator. In a later PR, I will make it so that any estimator can take a cuML estimator (SVM, etc.)
@NV-jpt @beckernick Do you know if it would be possible to add
So I will take an alternative here by using our own KNN, since we only want to check that duck typing is working properly. It will be enough for the purpose of the PR.
Thanks @NV-jpt I will make a subsequent PR for the other estimator, namely
Reference Issue
Fixes #856
What does this implement/fix? Explain your changes.
This PR has two parts:
First, it switches the logical check in `check_neighbors_object()` from an inheritance-based check to a duck-typing-based one (verifying based on the interface). Currently, a neighbors object is deemed appropriate by checking whether it inherits from the scikit-learn mixin class `KNeighborsMixin`. This PR switches this check to evaluate whether the interface of the object is consistent with `KNeighborsMixin`, rather than requiring explicit subclassing.
Second, it modifies several resamplers' `_validate_estimator()` methods to model the strategy adopted in `KMeansSMOTE`: verifying estimators by way of `sklearn.base.clone()` instead of an `isinstance` check.
These changes allow for greater flexibility, as users will be able to cleanly integrate estimators from libraries that do not directly depend on scikit-learn but enforce the same API contract.
Any other comments?