Describe the bug
The estimator_ object fit by CondensedNearestNeighbour() (and probably other sampling strategies) is incorrect when y has multiple classes (and possibly also for binary classes). In particular, the estimator is only fit on a subset covering 2 of the classes.
Steps/Code to Reproduce
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour

n_clusters = 10
X, y = make_blobs(n_samples=2000, centers=n_clusters, n_features=2, cluster_std=.5, random_state=0)

n_neighbors = 1
condenser = CondensedNearestNeighbour(sampling_strategy='all', n_neighbors=n_neighbors)
X_cond, y_cond = condenser.fit_resample(X, y)

print('condenser.estimator_.classes_', condenser.estimator_.classes_)  # this should have 10 classes, which it does!
print("condenser.estimator_ accuracy", condenser.estimator_.score(X, y))

# I think the estimator we want should look like this
knn_cond_manual = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_cond, y_cond)
print('knn_cond_manual.classes_', knn_cond_manual.classes_)  # yes 10 classes!
print("Manual KNN on condensed data accuracy", knn_cond_manual.score(X, y))  # good accuracy!
knn_cond_manual.classes_ [0 1 2 3 4 5 6 7 8 9]
Manual KNN on condensed data accuracy 0.996
The issue
The issue is that we set estimator_ on each iteration of the loop in _fit_resample (e.g. this line). We should instead set estimator_ after the loop ends, on the condensed dataset.
This looks like it's also an issue with OneSidedSelection and possibly other samplers.
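One way to see the symptom, continuing from the reproduction script above (a rough check; the exact counts depend on the imbalanced-learn version):

# If estimator_ had been (re)fit on the full condensed output, these two
# sample counts would be expected to match; n_samples_fit_ is a standard
# attribute of a fitted KNeighborsClassifier.
print("samples the stored estimator was fit on:", condenser.estimator_.n_samples_fit_)
print("samples in the condensed output:", len(X_cond))
print("classes seen by the stored estimator:", condenser.estimator_.classes_)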
Fix
I think we should just add the following directly before the return statement in _fit_resample:
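A minimal sketch of the kind of change being described, written here as a subclass so it can be tried without patching the library; it assumes imbalanced-learn's internal _fit_resample(X, y) hook and simply refits a clone of the stored estimator on the condensed output:

from sklearn.base import clone
from imblearn.under_sampling import CondensedNearestNeighbour

class RefitCondensedNearestNeighbour(CondensedNearestNeighbour):
    # Illustrative only: after the usual condensation, refit the stored
    # estimator on the full condensed dataset so it sees every class.
    def _fit_resample(self, X, y):
        X_res, y_res = super()._fit_resample(X, y)
        self.estimator_ = clone(self.estimator_).fit(X_res, y_res)
        return X_res, y_res

With a change along these lines, condenser.estimator_.score(X, y) in the reproduction above would be expected to match the manually refit KNN.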
The strategy used here is minority-vs-each-class (indeed, the documentation is wrong), so it makes sense that self.estimator_ only contains 2 classes. However, we should store all combinations and not just the last combination of 2 classes.
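A rough illustration of that idea, reusing X and y from the script above: keep one fitted estimator per minority-vs-class pass instead of overwriting a single attribute. The choice of class 0 as the minority class and the raw class subsets are placeholders, not the actual imbalanced-learn internals (the real algorithm fits each pass on its condensed subset).

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

minority_class = 0  # placeholder; CNN determines the true minority class itself
estimators_ = {}
for target_class in np.unique(y):
    if target_class == minority_class:
        continue
    # One estimator per (minority, target_class) combination, all of them kept.
    mask = (y == minority_class) | (y == target_class)
    estimators_[target_class] = KNeighborsClassifier(n_neighbors=1).fit(X[mask], y[mask])

print({cls: est.classes_.tolist() for cls, est in estimators_.items()})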
What we expect is that a 1-NN trained on the full set should have similar performance to a 1-NN trained on the condensed set:
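For reference, that comparison can be made directly as a continuation of the reproduction script (scores will vary with the random seed and data):

knn_full = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("1-NN fit on the full set:", knn_full.score(X, y))
print("1-NN fit on the condensed set:", knn_cond_manual.score(X, y))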