[BUG] ValueError: Found array with 0 sample(s) #742

Closed
allenyllee opened this issue Aug 11, 2020 · 8 comments · Fixed by #1016

Comments

@allenyllee

Describe the bug

When using SVMSMOTE on a dataset that contains a minority class with very few samples (e.g. fewer than 10), it raises ValueError: Found array with 0 sample(s) (shape=(0, 600)) while a minimum of 1 is required.

Steps/Code to Reproduce

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE

X, y = make_classification(n_classes=3, class_sep=0,
                           weights=[0.004, 0.451, 0.545], n_informative=3,
                           n_redundant=0, flip_y=0, n_features=3,
                           n_clusters_per_class=2, n_samples=1000,
                           random_state=10)
print('Original dataset shape %s' % Counter(y))

sm = SVMSMOTE(random_state=42, k_neighbors=4)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

Expected Results

Running without error

Actual Results

Original dataset shape Counter({2: 544, 1: 451, 0: 5})

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-78-8f5d2308c2bd> in <module>()
     10 
     11 sm = SVMSMOTE(random_state=42, k_neighbors=4)
---> 12 X_res, y_res = sm.fit_resample(X, y)
     13 print('Resampled dataset shape %s' % Counter(y_res))

~/anaconda3/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
     82             self.sampling_strategy, y, self._sampling_type)
     83 
---> 84         output = self._fit_resample(X, y)
     85 
     86         if binarize_y:

~/anaconda3/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
    530     def _fit_resample(self, X, y):
    531         # print("_fit_resample X shape", X.shape)
--> 532         return self._sample(X, y)
    533 
    534     def _sample(self, X, y):

~/anaconda3/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _sample(self, X, y)
    569 
    570             danger_bool = self._in_danger_noise(
--> 571                 self.nn_m_, support_vector, class_sample, y, kind='danger')
    572             safety_bool = np.logical_not(danger_bool)
    573 

~/anaconda3/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _in_danger_noise(self, nn_estimator, samples, target_class, y, kind)
    213         # print("kind", kind)
    214         # print("_in_danger_noise samples shape", samples.shape)
--> 215         x = nn_estimator.kneighbors(samples, return_distance=False)[:, 1:]
    216         # print("x", x)
    217         nn_label = (y[x] != target_class).astype(int)

~/anaconda3/lib/python3.6/site-packages/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
    400         if X is not None:
    401             query_is_train = False
--> 402             X = check_array(X, accept_sparse='csr')
    403         else:
    404             query_is_train = True

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    548                              " minimum of %d is required%s."
    549                              % (n_samples, array.shape, ensure_min_samples,
--> 550                                 context))
    551 
    552     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.

Versions

System:
python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
executable: /home/allenyl/anaconda3/bin/python
machine: Linux-4.15.0-112-generic-x86_64-with-debian-buster-sid

Python deps:
pip: 19.2.2
setuptools: 41.0.1
sklearn: 0.21.3
numpy: 1.15.1
scipy: 1.4.1
Cython: 0.28.2
pandas: 0.24.1

@hiyamgh

hiyamgh commented Jan 31, 2021

Did you find a fix for this? I'm having the same issue here.

@allenyllee
Author

@hiyamgh I've pushed a fix, but as @glemaitre commented on #743, I need to add something before it can be merged. Unfortunately, I currently have no time to do it.

@hiyamgh

hiyamgh commented Feb 6, 2021

Thank you @allenyllee for notifying me. On my side, the error turned out to be that I was using SMOTENC and passing an empty list for the categorical_features parameter (I did not know that the dataset must contain a mix of numerical and categorical features); see the sketch below.

Here is the documentation
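For anyone hitting the same message through SMOTENC, here is a minimal sketch (toy data, not from this thread) of what the sampler expects: categorical_features must list the indices of the actual categorical columns, and with purely numerical data plain SMOTE is the right tool.

import numpy as np
from imblearn.over_sampling import SMOTE, SMOTENC

# Toy data: column 0 is numerical, column 1 is categorical with 3 levels.
rng = np.random.RandomState(0)
X = np.hstack([rng.randn(60, 1), rng.randint(0, 3, size=(60, 1))])
y = np.array([0] * 10 + [1] * 50)  # imbalanced target

# Pass the index of the categorical column, never an empty list.
sampler = SMOTENC(categorical_features=[1], random_state=0)
X_res, y_res = sampler.fit_resample(X, y)

# With no categorical columns at all, use SMOTE instead of SMOTENC:
# sampler = SMOTE(random_state=0)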

@MontaseerAlam

Hi @hiyamgh, I am having the same issue. Did you fix the problem? I am very new to the field and can hardly follow #743.

@szperajacyzolw

Hi All!
I found this thread while searching for a solution to an identical problem.
It seems that SMOTE-based algorithms in general can struggle to oversample an extremely scarce class.
ADASYN solved my problem; a sketch is below.
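A sketch of that workaround applied to the original reproduction, assuming ADASYN's n_neighbors plays the role of SVMSMOTE's k_neighbors here; whether it succeeds still depends on the data.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_classes=3, class_sep=0,
                           weights=[0.004, 0.451, 0.545], n_informative=3,
                           n_redundant=0, flip_y=0, n_features=3,
                           n_clusters_per_class=2, n_samples=1000,
                           random_state=10)

# ADASYN exposes n_neighbors rather than k_neighbors.
ada = ADASYN(random_state=42, n_neighbors=4)
X_res, y_res = ada.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))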

@nmshafie1993

Is this fixed? I am having the same issue

@4d30

4d30 commented Feb 18, 2022

This is still present in:
Python 3.9.9
imbalanced-learn 0.9.0

@glemaitre
Member

Regarding the original example, class_sep=0 means that the data points of all classes are completely mixed, so the minority-class support vectors are all categorized as noise. In this case, there is no solution other than using another variant. In real life, there is actually no point in doing machine learning on such data, because the resulting classifier would be useless.
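As a rough illustration of that point (parameters chosen only for this sketch, not a recommendation): generating the same data with a non-zero class_sep keeps the minority support vectors from all being flagged as noise, and the original SVMSMOTE call should then run.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE

# Same setup as the bug report, but with separable classes (class_sep > 0).
X, y = make_classification(n_classes=3, class_sep=2.0,
                           weights=[0.004, 0.451, 0.545], n_informative=3,
                           n_redundant=0, flip_y=0, n_features=3,
                           n_clusters_per_class=2, n_samples=1000,
                           random_state=10)

sm = SVMSMOTE(random_state=42, k_neighbors=4)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))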
