diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 13798ad78..6f341f712 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -237,14 +237,18 @@ figure illustrates this behaviour.

 .. _edited_nearest_neighbors:

-Edited data set using nearest neighbours
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-:class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
-"edit" the dataset by removing samples which do not agree "enough" with their
-neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
-under-sampled, the nearest-neighbours are computed and if the selection
-criterion is not fulfilled, the sample is removed::
+Edited data set using nearest neighbors
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`EditedNearestNeighbours` trains a nearest neighbors algorithm and
+then looks at the closest neighbors of each data point of the class to be
+under-sampled, and "edits" the dataset by removing samples which do not agree
+"enough" with their neighborhood :cite:`wilson1972asymptotic`. In short, a
+nearest neighbors algorithm is trained on the data. Then, for each sample in
+the class to be under-sampled, the nearest neighbors are identified. If all
+(or most of) these neighbors agree with the class of the sample being
+inspected, the sample is kept; otherwise, it is removed::

    >>> sorted(Counter(y).items())
    [(0, 64), (1, 262), (2, 4674)]
@@ -255,11 +259,10 @@ criterion is not fulfilled, the sample is removed::
    [(0, 64), (1, 213), (2, 4568)]

 Two selection criteria are currently available: (i) the majority (i.e.,
-``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
-nearest-neighbors have to belong to the same class than the sample inspected to
-keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
-conservative than `kind_sel='mode'`, and more samples will be excluded in
-the former strategy than the latest::
+``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) of the
+nearest neighbors must belong to the same class as the sample inspected to
+keep it in the dataset. This means that `kind_sel='all'` will be less
+conservative than `kind_sel='mode'`, and more samples will be excluded::

    >>> enn = EditedNearestNeighbours(kind_sel="all")
    >>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -270,14 +273,20 @@ the former strategy than the latest::
    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 64), (1, 234), (2, 4666)]

-The parameter ``n_neighbors`` allows to give a classifier subclassed from
-``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
-the decision to keep a given sample or not.
+The parameter ``n_neighbors`` can take a classifier subclassed from
+``KNeighborsMixin`` from scikit-learn to find the nearest neighbors.
+Note that if a 4-KNN classifier is passed, 3 neighbors will be
+examined for the selection criteria, because the sample being inspected
+is the fourth neighbor returned by the algorithm. Alternatively, an integer
+can be passed to ``n_neighbors`` to indicate the size of the neighborhood
+to examine to make a decision. Thus, if ``n_neighbors=3``, the edited nearest
+neighbors will look at the 3 closest neighbors of each sample.
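+
+For instance, a :class:`~sklearn.neighbors.NearestNeighbors` instance can be
+passed instead of an integer (a minimal sketch, reusing the ``X`` and ``y``
+arrays from the examples above; any estimator inheriting from
+``KNeighborsMixin`` would work the same way)::
+
+   >>> from sklearn.neighbors import NearestNeighbors
+   >>> nn = NearestNeighbors(n_neighbors=4)
+   >>> enn = EditedNearestNeighbours(n_neighbors=nn)
+   >>> X_resampled, y_resampled = enn.fit_resample(X, y)
+
+As noted above, such a 4-NN estimator leads to 3 neighbors being examined
+for the selection criterion, since the sample being inspected is returned
+as its own first neighbor.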

 :class:`RepeatedEditedNearestNeighbours` extends
 :class:`EditedNearestNeighbours` by repeating the algorithm multiple times
 :cite:`tomek1976experiment`. Generally, repeating the algorithm will delete
-more data::
+more data. The user indicates how many times to repeat the algorithm
+through the parameter ``max_iter``::

    >>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours
    >>> renn = RepeatedEditedNearestNeighbours()
    >>> X_resampled, y_resampled = renn.fit_resample(X, y)
    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 64), (1, 208), (2, 4551)]

-:class:`AllKNN` differs from the previous
-:class:`RepeatedEditedNearestNeighbours` since the number of neighbors of the
-internal nearest neighbors algorithm is increased at each iteration
-:cite:`tomek1976experiment`::
+Note that :class:`RepeatedEditedNearestNeighbours` will end before reaching
+``max_iter`` if no more samples are removed from the data, or if one of the
+majority classes disappears or ends up with fewer samples than the minority
+class after being "edited".
+
+:class:`AllKNN` extends :class:`EditedNearestNeighbours` by repeating
+the algorithm multiple times, each time with an additional neighbor
+:cite:`tomek1976experiment`. In other words, :class:`AllKNN` differs
+from :class:`RepeatedEditedNearestNeighbours` in that the number of
+neighbors of the internal nearest neighbors algorithm increases at
+each iteration. In short, in the first iteration, a 2-KNN algorithm
+is trained on the data to examine the 1 closest neighbor of each
+sample from the class to be under-sampled. In each subsequent
+iteration, the neighborhood examined is increased by 1, until reaching
+the number of neighbors indicated in the parameter ``n_neighbors``::

    >>> from imblearn.under_sampling import AllKNN
    >>> allknn = AllKNN()
    >>> X_resampled, y_resampled = allknn.fit_resample(X, y)
    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 64), (1, 220), (2, 4601)]

-In the example below, it can be seen that the three algorithms have similar
-impact by cleaning noisy samples next to the boundaries of the classes.
+
+The parameter ``n_neighbors`` can take an integer to indicate the size
+of the neighborhood to examine in the last iteration. Thus, if
+``n_neighbors=3``, AllKNN will examine the 1 closest neighbor in the
+first iteration, the 2 closest neighbors in the second iteration
+and the 3 closest neighbors in the third iteration. The parameter
+``n_neighbors`` can also take a classifier subclassed from
+``KNeighborsMixin`` from scikit-learn to find the nearest neighbors.
+Again, this will be the KNN used in the last iteration.
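+
+For instance, reusing the ``X`` and ``y`` arrays from the examples above, a
+minimal sketch (the resulting class counts are omitted here)::
+
+   >>> allknn = AllKNN(n_neighbors=5)
+   >>> X_resampled, y_resampled = allknn.fit_resample(X, y)
+
+With ``n_neighbors=5``, AllKNN examines 1, 2, 3, 4 and finally 5 neighbors
+of each sample of the class to be under-sampled across its five iterations.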
+
+In the example below, we can see that the three algorithms have a similar
+impact on cleaning noisy samples at the boundaries of the classes.

 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_004.png
    :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
diff --git a/imblearn/under_sampling/_prototype_selection/_edited_nearest_neighbours.py b/imblearn/under_sampling/_prototype_selection/_edited_nearest_neighbours.py
index e0eb866a7..0fdfdc925 100644
--- a/imblearn/under_sampling/_prototype_selection/_edited_nearest_neighbours.py
+++ b/imblearn/under_sampling/_prototype_selection/_edited_nearest_neighbours.py
@@ -1,4 +1,4 @@
-"""Class to perform under-sampling based on the edited nearest neighbour
+"""Classes to perform under-sampling based on the edited nearest neighbor
 method."""

 # Authors: Guillaume Lemaitre
@@ -27,9 +27,9 @@
     n_jobs=_n_jobs_docstring,
 )
 class EditedNearestNeighbours(BaseCleaningSampler):
-    """Undersample based on the edited nearest neighbour method.
+    """Undersample based on the edited nearest neighbor method.

-    This method will clean the database by removing samples close to the
+    This method will clean the data set by removing samples close to the
     decision boundary.

     Read more in the :ref:`User Guide <edited_nearest_neighbors>`.

@@ -39,7 +39,7 @@ class EditedNearestNeighbours(BaseCleaningSampler):
     {sampling_strategy}

     n_neighbors : int or object, default=3
-        If ``int``, size of the neighbourhood to consider to compute the
+        If ``int``, size of the neighborhood to consider to compute the
         nearest neighbors. If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
         find the nearest-neighbors.

@@ -47,13 +47,13 @@ class EditedNearestNeighbours(BaseCleaningSampler):
     kind_sel : {{'all', 'mode'}}, default='all'
         Strategy to use in order to exclude samples.

-        - If ``'all'``, all neighbours will have to agree with the samples of
-          interest to not be excluded.
-        - If ``'mode'``, the majority vote of the neighbours will be used in
-          order to exclude a sample.
+        - If ``'all'``, all neighbors will have to agree with a sample in
+          order for it not to be excluded.
+        - If ``'mode'``, the majority of the neighbors will have to agree with
+          a sample in order for it not to be excluded.

         The strategy `"all"` will be less conservative than `'mode'`. Thus,
-        more samples will be removed when `kind_sel="all"` generally.
+        more samples will generally be removed when `kind_sel="all"`.

     {n_jobs}

@@ -70,7 +70,7 @@ class EditedNearestNeighbours(BaseCleaningSampler):

     RepeatedEditedNearestNeighbours : Undersample by repeating ENN algorithm.

-    AllKNN : Undersample using ENN and various number of neighbours.
+    AllKNN : Undersample using ENN with various numbers of neighbors.

     Notes
     -----
@@ -81,8 +81,8 @@
     References
     ----------
-    .. [1] D. Wilson, Asymptotic" Properties of Nearest Neighbor Rules Using
-       Edited Data," In IEEE Transactions on Systems, Man, and Cybernetrics,
+    .. [1] D. Wilson, "Asymptotic Properties of Nearest Neighbor Rules Using
+       Edited Data", in IEEE Transactions on Systems, Man, and Cybernetics,
        vol. 2 (3), pp. 408-421, 1972.

     Examples
     --------
@@ -172,9 +172,13 @@ def _more_tags(self):
     n_jobs=_n_jobs_docstring,
 )
 class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
-    """Undersample based on the repeated edited nearest neighbour method.
+    """Undersample based on the repeated edited nearest neighbor method.

-    This method will repeat several time the ENN algorithm.
+    This method will repeat the ENN algorithm several times. The repetitions
+    will stop when i) the maximum number of iterations is reached, or ii) no
+    more observations are being removed, or iii) one of the majority classes
+    becomes a minority class, or iv) one of the majority classes disappears
+    from the target after undersampling.

     Read more in the :ref:`User Guide <edited_nearest_neighbors>`.

@@ -183,25 +187,24 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
     {sampling_strategy}

     n_neighbors : int or object, default=3
-        If ``int``, size of the neighbourhood to consider to compute the
+        If ``int``, size of the neighborhood to consider to compute the
         nearest neighbors. If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
         find the nearest-neighbors.

     max_iter : int, default=100
-        Maximum number of iterations of the edited nearest neighbours
-        algorithm for a single run.
+        Maximum number of repetitions of the edited nearest neighbors algorithm.

     kind_sel : {{'all', 'mode'}}, default='all'
         Strategy to use in order to exclude samples.

-        - If ``'all'``, all neighbours will have to agree with the samples of
-          interest to not be excluded.
-        - If ``'mode'``, the majority vote of the neighbours will be used in
-          order to exclude a sample.
+        - If ``'all'``, all neighbors will have to agree with a sample in
+          order for it not to be excluded.
+        - If ``'mode'``, the majority of the neighbors will have to agree with
+          a sample in order for it not to be excluded.

         The strategy `"all"` will be less conservative than `'mode'`. Thus,
-        more samples will be removed when `kind_sel="all"` generally.
+        more samples will generally be removed when `kind_sel="all"`.

     {n_jobs}

@@ -213,7 +216,7 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
         .. versionadded:: 0.4

     n_iter_ : int
-        Number of iterations run.
+        Number of iterations that were actually run.

         .. versionadded:: 0.6

@@ -223,14 +226,14 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):

     EditedNearestNeighbours : Undersample by editing samples.

-    AllKNN : Undersample using ENN and various number of neighbours.
+    AllKNN : Undersample using ENN with various numbers of neighbors.

     Notes
     -----
-    The method is based on [1]_. A one-vs.-rest scheme is used when
-    sampling a class as proposed in [1]_.
+    The method is based on [1]_.

-    Supports multi-class resampling.
+    Supports multi-class resampling. A one-vs.-rest scheme is used when
+    sampling a class as proposed in [1]_.

     References
     ----------
@@ -303,11 +306,12 @@ def _fit_resample(self, X, y):
             prev_len = y_.shape[0]
             X_enn, y_enn = self.enn_.fit_resample(X_, y_)

-            # Check the stopping criterion
-            # 1. If there is no changes for the vector y
-            # 2. If the number of samples in the other class become inferior to
-            #    the number of samples in the majority class
-            # 3. If one of the class is disappearing
+            # Check the stopping criterion:
+            # 1. If there are no changes in the vector y
+            #    (that is, if no further observations are removed)
+            # 2. If the number of samples in any of the other (majority) classes
+            #    becomes smaller than the number of samples in the minority class
+            # 3. If one of the classes disappears

             # Case 1
             b_conv = prev_len == y_enn.shape[0]
@@ -359,8 +363,14 @@ def _more_tags(self):
 class AllKNN(BaseCleaningSampler):
     """Undersample based on the AllKNN method.

-    This method will apply ENN several time and will vary the number of nearest
-    neighbours.
+    This method will apply ENN several times, starting by looking at the
+    1 closest neighbor, and increasing the number of nearest neighbors
+    by 1 at each round, up to the number of neighbors specified in
+    `n_neighbors`.
+
+    The repetitions will stop when i) one of the majority classes
+    becomes a minority class, or ii) one of the majority classes
+    disappears from the target after undersampling.

     Read more in the :ref:`User Guide <edited_nearest_neighbors>`.

@@ -369,21 +379,28 @@ class AllKNN(BaseCleaningSampler):
     {sampling_strategy}

     n_neighbors : int or estimator object, default=3
-        If ``int``, size of the neighbourhood to consider to compute the
-        nearest neighbors. If object, an estimator that inherits from
+        If ``int``, the maximum size of the neighborhood to evaluate.
+        The method will start by looking at the 1 closest neighbor, and
+        then repeat the edited nearest neighbors method, increasing
+        the neighborhood by 1, until examining a neighborhood of
+        `n_neighbors` in the final iteration.
+
+        If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the nearest-neighbors. By default, it will be a 3-NN.
+        find the nearest-neighbors in the final round. In this case,
+        AllKNN will repeat edited nearest neighbors starting from a 2-KNN
+        up to the specified KNN in the object.

     kind_sel : {{'all', 'mode'}}, default='all'
         Strategy to use in order to exclude samples.

-        - If ``'all'``, all neighbours will have to agree with the samples of
-          interest to not be excluded.
-        - If ``'mode'``, the majority vote of the neighbours will be used in
-          order to exclude a sample.
+        - If ``'all'``, all neighbors will have to agree with a sample in
+          order for it not to be excluded.
+        - If ``'mode'``, the majority of the neighbors will have to agree with
+          a sample in order for it not to be excluded.

         The strategy `"all"` will be less conservative than `'mode'`. Thus,
-        more samples will be removed when `kind_sel="all"` generally.
+        more samples will generally be removed when `kind_sel="all"`.

     allow_minority : bool, default=False
         If ``True``, it allows the majority classes to become the minority
@@ -418,7 +435,7 @@ class without early stopping.
     References
     ----------
     .. [1] I. Tomek, "An Experiment with the Edited Nearest-Neighbor
-       Rule," IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6),
+       Rule", IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6),
        pp. 448-452, June 1976.

     Examples
     --------
@@ -484,10 +501,10 @@ def _fit_resample(self, X, y):

             X_enn, y_enn = self.enn_.fit_resample(X_, y_)

-            # Check the stopping criterion
-            # 1. If the number of samples in the other class become inferior to
-            #    the number of samples in the majority class
-            # 2. If one of the class is disappearing
+            # Stopping criterion:
+            # 1. If the number of samples in any of the majority classes ends up
+            #    smaller than the number of samples in the minority class
+            # 2. If one of the classes disappears

             # Case 1
             stats_enn = Counter(y_enn)
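
To make the stopping behaviour documented above concrete, here is a minimal,
hypothetical sketch; the toy imbalanced dataset built with scikit-learn's
``make_classification`` is an assumption for illustration only::

    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.under_sampling import RepeatedEditedNearestNeighbours

    # Toy three-class dataset with a strong class imbalance (an assumption).
    X, y = make_classification(
        n_classes=3, weights=[0.01, 0.05, 0.94],
        n_informative=3, n_samples=5000, random_state=0,
    )

    renn = RepeatedEditedNearestNeighbours(max_iter=100)
    X_res, y_res = renn.fit_resample(X, y)

    # ``n_iter_`` reports how many ENN passes actually ran; it is usually far
    # smaller than ``max_iter`` because the loop stops as soon as no further
    # samples are removed, a majority class would disappear, or a majority
    # class would end up smaller than the minority class.
    print(renn.n_iter_, sorted(Counter(y_res).items()))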
diff --git a/imblearn/utils/_validation.py b/imblearn/utils/_validation.py
index 7eb4099ea..5d2d475d9 100644
--- a/imblearn/utils/_validation.py
+++ b/imblearn/utils/_validation.py
@@ -68,12 +68,13 @@ def _transfrom_one(self, array, props):


 def check_neighbors_object(nn_name, nn_object, additional_neighbor=0):
-    """Check the objects is consistent to be a NN.
+    """Check that the object is consistent with a NN.

-    Several methods in imblearn relies on NN. Until version 0.4, these
+    Several methods in imblearn rely on NN. Until version 0.4, these
     objects can be passed at initialisation as an integer or a
-    KNeighborsMixin. After only KNeighborsMixin will be accepted. This
-    utility allows for type checking and raise if the type is wrong.
+    KNeighborsMixin. In later versions, only KNeighborsMixin will be
+    accepted. This utility allows for type checking and raises an error
+    if the type is wrong.

     Parameters
     ----------
@@ -84,7 +85,9 @@ def check_neighbors_object(nn_name, nn_object, additional_neighbor=0):
         The object to be checked.

     additional_neighbor : int, default=0
-        Sometimes, some algorithm need an additional neighbors.
+        Some algorithms need an additional neighbor. This is because, to
+        explore a neighborhood of 3, we need to train a 4-KNN algorithm,
+        as the sample to examine is a neighbor itself.

     Returns
     -------
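
As a quick sketch of that extra neighbor (``check_neighbors_object`` is
importable from ``imblearn.utils``)::

    from imblearn.utils import check_neighbors_object

    # Requesting a neighborhood of 3 plus one additional neighbor yields a
    # 4-NN estimator, since the sample under inspection is returned as its
    # own first neighbor.
    nn = check_neighbors_object("n_neighbors", 3, additional_neighbor=1)
    print(nn.n_neighbors)  # 4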
@@ -105,7 +108,7 @@ def _count_class_sample(y):


 def check_target_type(y, indicate_one_vs_all=False):
-    """Check the target types to be conform to the current samplers.
+    """Check that the target types conform to the current samplers.

     The current samplers should be compatible with ``'binary'``,
     ``'multilabel-indicator'`` and ``'multiclass'`` targets only.

@@ -116,7 +119,7 @@ def check_target_type(y, indicate_one_vs_all=False):
         The array containing the target.

     indicate_one_vs_all : bool, default=False
-        Either to indicate if the targets are encoded in a one-vs-all fashion.
+        Whether to indicate if the targets are encoded in a one-vs-all fashion.

     Returns
     -------
@@ -407,7 +410,7 @@ def check_sampling_strategy(sampling_strategy, y, sampling_type, **kwargs):
     Checks that ``sampling_strategy`` is of consistent type and return a
     dictionary containing each targeted class with its corresponding
-    number of sample. It is used in :class:`~imblearn.base.BaseSampler`.
+    number of samples. It is used in :class:`~imblearn.base.BaseSampler`.

     Parameters
     ----------
@@ -435,7 +438,7 @@ def check_sampling_strategy(sampling_strategy, y, sampling_type, **kwargs):

         - When ``str``, specify the class targeted by the resampling. For
           **under- and over-sampling methods**, the number of samples in the
-          different classes will be equalized. For **cleaning methods**, the
+          different classes will be equal. For **cleaning methods**, the
           number of samples will not be equal. Possible choices are:

             ``'minority'``: resample only the minority class;
@@ -461,8 +464,8 @@ def check_sampling_strategy(sampling_strategy, y, sampling_type, **kwargs):
           methods**. An error is raised with **cleaning methods**. Use a
           ``list`` instead.

-        - When ``list``, the list contains the targeted classes. It used only
-          for **cleaning methods**.
+        - When ``list``, the list contains the targeted classes. It is used
+          only in **cleaning methods**.

         .. warning::
            ``list`` is available for **cleaning methods**. An error is raised