[BUG] The estimator_ in CondensedNearestNeighbour() is incorrect for multiple classes #908

Closed
idc9 opened this issue Jun 9, 2022 · 1 comment · Fixed by #1011
Labels
Package: under_sampling Type: Bug Indicates an unexpected problem or unintended behavior

Comments


idc9 commented Jun 9, 2022

Describe the bug

The estimator_ object fit by CondensedNearestNeighbour() (and probably by other samplers) is incorrect when y has more than two classes (and possibly also in the binary case). In particular, the estimator is only fit on a subset containing 2 of the classes.

Steps/Code to Reproduce

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour

n_clusters = 10
X, y = make_blobs(n_samples=2000, centers=n_clusters, n_features=2, cluster_std=.5, random_state=0)

n_neighbors = 1
condenser = CondensedNearestNeighbour(sampling_strategy='all', n_neighbors=n_neighbors)
X_cond, y_cond = condenser.fit_resample(X, y)
print('condenser.estimator_.classes_', condenser.estimator_.classes_)  # this should have 10 classes, but it only has 2!
print("condenser.estimator_ accuracy", condenser.estimator_.score(X, y))  # poor accuracy

# I think the estimator we want should look like this
knn_cond_manual = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_cond, y_cond)
print('knn_cond_manual.classes_', knn_cond_manual.classes_)  # yes, all 10 classes!
print("Manual KNN on condensed data accuracy", knn_cond_manual.score(X, y))  # good accuracy!

Output:

condenser.estimator_.classes_ [5 9]
condenser.estimator_ accuracy 0.2
knn_cond_manual.classes_ [0 1 2 3 4 5 6 7 8 9]
Manual KNN on condensed data accuracy 0.996

The issue

The issue is that we set estimator_ on every iteration of the loop in _fit_resample (e.g. this line). We should instead set estimator_ after the loop ends, fit on the condensed dataset.

This looks like it's also an issue with OneSidedSelection and possibly other samplers.
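For illustration, the pattern described above can be sketched as follows (hypothetical function and variable names, not the actual imbalanced-learn source): the classifier is refit inside the per-class loop, so the estimator that survives the loop only reflects the final minority-vs-class pair.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def buggy_fit_resample_sketch(X, y, minority_class, n_neighbors=1):
    """Sketch of the buggy pattern: the estimator is overwritten per pair."""
    minority_idx = np.flatnonzero(y == minority_class)
    estimator = None
    for target_class in np.unique(y):
        if target_class == minority_class:
            continue
        pair_idx = np.concatenate(
            [minority_idx, np.flatnonzero(y == target_class)]
        )
        # Refit inside the loop on a 2-class subset: after the loop,
        # the surviving estimator only knows the last class pair.
        estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
        estimator.fit(X[pair_idx], y[pair_idx])
        # ... per-pair condensation logic would go here ...
    return estimator
```

With 10 classes, the estimator returned by this sketch reports only 2 values in classes_, matching the behavior shown above.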

Fix

I think we should just add the following directly before the return statement in _fit_resample:

X_condensed, y_condensed = _safe_indexing(X, idx_under), _safe_indexing(y, idx_under)
self.estimator_.fit(X_condensed, y_condensed)
return X_condensed, y_condensed

Versions


System:
    python: 3.8.12 (default, Oct 12 2021, 06:23:56)  [Clang 10.0.0 ]
executable: /Users/iaincarmichael/anaconda3/envs/comp_onc/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.1.1
          pip: 21.2.4
   setuptools: 58.0.4
        numpy: 1.21.4
        scipy: 1.7.3
       Cython: 0.29.25
       pandas: 1.3.5
   matplotlib: 3.5.0
       joblib: 1.1.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/python3.8/site-packages/sklearn/.dylibs/libomp.dylib
         prefix: libomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8

       filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/python3.8/site-packages/numpy/.dylibs/libopenblas.0.dylib
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.17
    num_threads: 4
threading_layer: pthreads
   architecture: Haswell

       filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/libmkl_rt.1.dylib
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 4
threading_layer: intel

       filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/libomp.dylib
         prefix: libomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8
@hayesall hayesall added Package: under_sampling Type: Bug Indicates an unexpected problem or unintended behavior labels Jul 17, 2022

glemaitre commented Jul 9, 2023

The strategy used here is minority-vs-each-class (indeed, the documentation is wrong), so it makes sense that self.estimator_ contains only 2 classes. However, we should store every pairwise combination, not only the last combination of 2 classes.
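A hedged sketch of what "store all combinations" could look like (hypothetical estimators mapping, not the merged fix): keep one fitted KNN per minority-vs-class pair instead of overwriting a single estimator_.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_pairwise_estimators(X, y, minority_class, n_neighbors=1):
    """Fit and keep one KNN per (minority, other) class pair."""
    estimators = {}
    minority_idx = np.flatnonzero(y == minority_class)
    for target_class in np.unique(y):
        if target_class == minority_class:
            continue
        pair_idx = np.concatenate(
            [minority_idx, np.flatnonzero(y == target_class)]
        )
        # Store the estimator under its class pair instead of
        # overwriting a single attribute on each iteration.
        estimators[(minority_class, target_class)] = KNeighborsClassifier(
            n_neighbors=n_neighbors
        ).fit(X[pair_idx], y[pair_idx])
    return estimators
```

With classes {0, 1, 2} and minority class 0, this keeps both the (0, 1) and (0, 2) estimators rather than only the last one.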

What we expect is that a 1-NN trained on the full set should perform similarly to a 1-NN trained on the condensed set:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0
)

from sklearn.neighbors import KNeighborsClassifier
from imblearn.metrics import classification_report_imbalanced

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report_imbalanced(y_test, y_pred))

from imblearn.under_sampling import CondensedNearestNeighbour
from imblearn.pipeline import make_pipeline

model = make_pipeline(
    CondensedNearestNeighbour(),
    KNeighborsClassifier(n_neighbors=1),
).fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report_imbalanced(y_test, y_pred))
                   pre       rec       spe        f1       geo       iba       sup

          0       0.93      0.94      0.41      0.94      0.62      0.41      2236
          1       0.46      0.41      0.94      0.43      0.62      0.36       264

avg / total       0.88      0.89      0.47      0.88      0.62      0.40      2500

                   pre       rec       spe        f1       geo       iba       sup

          0       0.94      0.84      0.58      0.89      0.70      0.50      2236
          1       0.30      0.58      0.84      0.40      0.70      0.47       264

avg / total       0.88      0.81      0.61      0.84      0.70      0.50      2500
