
[MRG] updates docstrings and user guide for ENN, RENN and AllKNN #850


Closed · wants to merge 21 commits
76 changes: 53 additions & 23 deletions doc/under_sampling.rst
@@ -237,14 +237,18 @@ figure illustrates this behaviour.

.. _edited_nearest_neighbors:

Edited data set using nearest neighbours
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
"edit" the dataset by removing samples which do not agree "enough" with their
neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
under-sampled, the nearest-neighbours are computed and if the selection
criterion is not fulfilled, the sample is removed::
Edited data set using nearest neighbors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`EditedNearestNeighbours` trains a nearest neighbors algorithm and
then looks at the closest neighbors of each data point of the class to be
under-sampled, and "edits" the dataset by removing samples which do not agree
"enough" with their neighborhood :cite:`wilson1972asymptotic`. In short,
a nearest neighbors algorithm algorithm is trained on the data. Then, for each
sample in the class to be under-sampled, the nearest neighbors are identified.
Once the neighbors are identified, if all the neighbors or most of the neighbors
agree with the class of the sample being inspected, the sample is kept, otherwise
removed::

>>> sorted(Counter(y).items())
[(0, 64), (1, 262), (2, 4674)]
@@ -255,11 +259,10 @@
[(0, 64), (1, 213), (2, 4568)]
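
To make the editing rule concrete, below is a minimal sketch (ours, not
imbalanced-learn's actual implementation) that applies the default
selection criterion, where all neighbors must agree, to a toy 1-D dataset
using scikit-learn's :class:`~sklearn.neighbors.NearestNeighbors`::

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [5.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # Ask for 4 neighbors: the first neighbor returned is the sample itself.
    nn = NearestNeighbors(n_neighbors=4).fit(X)
    _, idx = nn.kneighbors(X)

    # Keep a sample only if every neighbor shares its class ('all' criterion).
    keep = np.array([
        (y[neighbors[1:]] == y[i]).all()
        for i, neighbors in enumerate(idx)
    ])
    # The real sampler only edits the class being under-sampled; for
    # simplicity this sketch applies the rule to every sample.
    X_edited, y_edited = X[keep], y[keep]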

Two selection criteria are currently available: (i) the majority (i.e.,
``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
nearest-neighbors have to belong to the same class than the sample inspected to
keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
conservative than `kind_sel='mode'`, and more samples will be excluded in
the former strategy than the latest::
``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) of the
nearest neighbors must belong to the same class as the sample inspected to
keep it in the dataset. This means that `kind_sel='all'` will be less
conservative than `kind_sel='mode'`, and more samples will be excluded::

>>> enn = EditedNearestNeighbours(kind_sel="all")
>>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -270,34 +273,61 @@
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 234), (2, 4666)]

The parameter ``n_neighbors`` allows to give a classifier subclassed from
``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
the decision to keep a given sample or not.
The parameter ``n_neighbors`` can take a classifier subclassed from
``KNeighborsMixin`` from scikit-learn to find the nearest neighbors.
Note that if a 4-KNN classifier is passed, 3 neighbors will be
examined for the selection criterion, because the sample being inspected
is itself among the neighbors returned by the algorithm. Alternatively,
an integer can be passed to ``n_neighbors`` to indicate the size of the
neighborhood to examine when making a decision. Thus, if ``n_neighbors=3``,
the edited nearest neighbors will look at the 3 closest neighbors of each
sample.
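
As an illustration, the sketch below (ours; it assumes the ``X``, ``y``
from the examples above) shows both ways of setting ``n_neighbors``::

    from sklearn.neighbors import NearestNeighbors
    from imblearn.under_sampling import EditedNearestNeighbours

    # integer: examine the 3 closest neighbors of each sample
    enn_int = EditedNearestNeighbours(n_neighbors=3)
    X_res, y_res = enn_int.fit_resample(X, y)

    # estimator: a 4-NN finder; 3 neighbors are examined because the
    # sample under inspection is itself among the neighbors returned
    enn_obj = EditedNearestNeighbours(n_neighbors=NearestNeighbors(n_neighbors=4))
    X_res, y_res = enn_obj.fit_resample(X, y)

If our reading is correct, the two samplers above examine the same
3-sample neighborhoods, since an extra neighbor is fetched internally
to account for the sample itself.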

:class:`RepeatedEditedNearestNeighbours` extends
:class:`EditedNearestNeighbours` by repeating the algorithm multiple times
:cite:`tomek1976experiment`. Generally, repeating the algorithm will delete
more data::
more data. The user indicates how many times to repeat the algorithm
through the parameter ``max_iter``::

>>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours
>>> renn = RepeatedEditedNearestNeighbours()
>>> X_resampled, y_resampled = renn.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 208), (2, 4551)]

:class:`AllKNN` differs from the previous
:class:`RepeatedEditedNearestNeighbours` since the number of neighbors of the
internal nearest neighbors algorithm is increased at each iteration
:cite:`tomek1976experiment`::
Note that :class:`RepeatedEditedNearestNeighbours` will stop before reaching
``max_iter`` if no more samples are removed from the data, or if one of the
majority classes disappears or ends up with fewer samples than the minority
class after being "edited".
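
As a small sketch (again assuming the ``X``, ``y`` from above), the
fitted attribute ``n_iter_`` reports how many ENN passes actually ran,
which can be fewer than ``max_iter`` when one of the stopping
conditions above is met early::

    from imblearn.under_sampling import RepeatedEditedNearestNeighbours

    renn = RepeatedEditedNearestNeighbours(max_iter=5)
    X_res, y_res = renn.fit_resample(X, y)

    # may print less than 5 if ENN stopped removing samples, or if a
    # majority class would vanish or fall below the minority class
    print(renn.n_iter_)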

:class:`AllKNN` extends :class:`EditedNearestNeighbours` by repeating
the algorithm multiple times, each time with an additional neighbor
:cite:`tomek1976experiment`. In other words, :class:`AllKNN` differs
from :class:`RepeatedEditedNearestNeighbours` in that the number of
neighbors of the internal nearest neighbors algorithm increases at
each iteration. In short, in the first iteration, a 2-KNN algorithm
is trained on the data to examine the closest neighbor of each
sample from the class to be under-sampled. In each subsequent
iteration, the neighborhood examined grows by 1, until it reaches the
number of neighbors indicated in the parameter ``n_neighbors``::

>>> from imblearn.under_sampling import AllKNN
>>> allknn = AllKNN()
>>> X_resampled, y_resampled = allknn.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 220), (2, 4601)]

In the example below, it can be seen that the three algorithms have similar
impact by cleaning noisy samples next to the boundaries of the classes.

The parameter ``n_neighbors`` can take an integer to indicate the size
of the neighborhood to examine in the last iteration. Thus, if
``n_neighbors=3``, AllKNN will examine the 1 closest neighbor in the
first iteration, the 2 closest neighbors in the second iteration
and the 3 closest neighbors in the third iteration. The parameter
``n_neighbors`` can also take a classifier subclassed from
``KNeighborsMixin`` from scikit-learn to find the nearest neighbors.
Again, this will be the KNN used in the last iteration.
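
The sketch below (ours; it assumes the ``X``, ``y`` from the examples
above) shows both forms of the parameter::

    from sklearn.neighbors import NearestNeighbors
    from imblearn.under_sampling import AllKNN

    # integer: neighborhoods of size 1, 2 and 3 across three iterations
    allknn_int = AllKNN(n_neighbors=3)
    X_res, y_res = allknn_int.fit_resample(X, y)

    # estimator: the KNN object that will be used in the last iteration
    allknn_obj = AllKNN(n_neighbors=NearestNeighbors(n_neighbors=4))
    X_res, y_res = allknn_obj.fit_resample(X, y)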

In the example below, we can see that the three algorithms have a similar
impact on cleaning noisy samples at the boundaries of the classes.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_004.png
:target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
@@ -1,4 +1,4 @@
"""Class to perform under-sampling based on the edited nearest neighbour
"""Classes to perform under-sampling based on the edited nearest neighbor
method."""

# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@@ -27,9 +27,9 @@
n_jobs=_n_jobs_docstring,
)
class EditedNearestNeighbours(BaseCleaningSampler):
"""Undersample based on the edited nearest neighbour method.
"""Undersample based on the edited nearest neighbor method.

This method will clean the database by removing samples close to the
This method will clean the data set by removing samples close to the
decision boundary.

Read more in the :ref:`User Guide <edited_nearest_neighbors>`.
@@ -39,21 +39,21 @@ class EditedNearestNeighbours(BaseCleaningSampler):
{sampling_strategy}

n_neighbors : int or object, default=3
If ``int``, size of the neighbourhood to consider to compute the
If ``int``, size of the neighborhood to consider to compute the
nearest neighbors. If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors.

kind_sel : {{'all', 'mode'}}, default='all'
Strategy to use in order to exclude samples.

- If ``'all'``, all neighbours will have to agree with the samples of
interest to not be excluded.
- If ``'mode'``, the majority vote of the neighbours will be used in
order to exclude a sample.
- If ``'all'``, all neighbors will have to agree with a sample in order
for it not to be excluded.
- If ``'mode'``, the majority of the neighbors will have to agree with
a sample in order for it not to be excluded.

The strategy `"all"` will be less conservative than `'mode'`. Thus,
more samples will be removed when `kind_sel="all"` generally.
more samples will be removed when `kind_sel="all"`, generally.

{n_jobs}

@@ -70,7 +70,7 @@ class EditedNearestNeighbours(BaseCleaningSampler):

RepeatedEditedNearestNeighbours : Undersample by repeating ENN algorithm.

AllKNN : Undersample using ENN and various number of neighbours.
AllKNN : Undersample using ENN and various number of neighbors.

Notes
-----
@@ -81,8 +81,8 @@ class EditedNearestNeighbours(BaseCleaningSampler):

References
----------
.. [1] D. Wilson, Asymptotic" Properties of Nearest Neighbor Rules Using
Edited Data," In IEEE Transactions on Systems, Man, and Cybernetrics,
.. [1] D. Wilson, "Asymptotic Properties of Nearest Neighbor Rules Using
Edited Data", in IEEE Transactions on Systems, Man, and Cybernetics,
vol. 2 (3), pp. 408-421, 1972.

Examples
@@ -172,9 +172,13 @@ def _more_tags(self):
n_jobs=_n_jobs_docstring,
)
class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
"""Undersample based on the repeated edited nearest neighbour method.
"""Undersample based on the repeated edited nearest neighbor method.

This method will repeat several time the ENN algorithm.
This method will repeat the ENN algorithm several times. The repetitions
will stop when i) the maximum number of iterations is reached, or ii) no
more observations are being removed, or iii) one of the majority classes
becomes a minority class, or iv) one of the majority classes disappears
from the target after undersampling.

Comment from @solegalli (Contributor, Author), Aug 5, 2021:

I don't think there is a way to understand how the algorithm stops unless we
read the source code. So I added this bit.

Read more in the :ref:`User Guide <edited_nearest_neighbors>`.

@@ -183,25 +187,24 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
{sampling_strategy}

n_neighbors : int or object, default=3
If ``int``, size of the neighbourhood to consider to compute the
If ``int``, size of the neighborhood to consider to compute the
nearest neighbors. If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors.

max_iter : int, default=100
Maximum number of iterations of the edited nearest neighbours
algorithm for a single run.
Maximum number of repetitions of the edited nearest neighbors algorithm.

kind_sel : {{'all', 'mode'}}, default='all'
Strategy to use in order to exclude samples.

- If ``'all'``, all neighbours will have to agree with the samples of
interest to not be excluded.
- If ``'mode'``, the majority vote of the neighbours will be used in
order to exclude a sample.
- If ``'all'``, all neighbors will have to agree with a sample in order
for it not to be excluded.
- If ``'mode'``, the majority of the neighbors will have to agree with
a sample in order for it not to be excluded.

The strategy `"all"` will be less conservative than `'mode'`. Thus,
more samples will be removed when `kind_sel="all"` generally.
more samples will be removed when `kind_sel="all"`, generally.

{n_jobs}

@@ -213,7 +216,7 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
.. versionadded:: 0.4

n_iter_ : int
Number of iterations run.
Number of iterations that were actually run.

.. versionadded:: 0.6

@@ -223,14 +226,14 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):

EditedNearestNeighbours : Undersample by editing samples.

AllKNN : Undersample using ENN and various number of neighbours.
AllKNN : Undersample using ENN and various number of neighbors.

Notes
-----
The method is based on [1]_. A one-vs.-rest scheme is used when
sampling a class as proposed in [1]_.
The method is based on [1]_.

Supports multi-class resampling.
Supports multi-class resampling. A one-vs.-rest scheme is used when
sampling a class as proposed in [1]_.

References
----------
@@ -303,11 +306,12 @@ def _fit_resample(self, X, y):
prev_len = y_.shape[0]
X_enn, y_enn = self.enn_.fit_resample(X_, y_)

# Check the stopping criterion
# 1. If there is no changes for the vector y
# 2. If the number of samples in the other class become inferior to
# the number of samples in the majority class
# 3. If one of the class is disappearing
# Check the stopping criterion:
# 1. If there are no changes in the vector y
# (that is, if no further observations are removed)
# 2. If the number of samples in any of the other (majority) classes becomes
# smaller than the number of samples in the minority class
# 3. If one of the classes disappears
Comment from the PR author:

I had trouble understanding the comments and the logic, so I rephrased a bit.


# Case 1
b_conv = prev_len == y_enn.shape[0]
@@ -359,8 +363,14 @@ def _more_tags(self):
class AllKNN(BaseCleaningSampler):
"""Undersample based on the AllKNN method.

This method will apply ENN several time and will vary the number of nearest
neighbours.
This method will apply ENN several times, starting by looking at the
1 closest neighbor, and increasing the number of nearest neighbors
by 1 at each round, up to the number of neighbors specified in
`n_neighbors`.

The repetitions will stop when i) one of the majority classes
becomes a minority class or ii) one of the majority classes
disappears from the target after undersampling.

Read more in the :ref:`User Guide <edited_nearest_neighbors>`.

@@ -369,21 +379,28 @@ class AllKNN(BaseCleaningSampler):
{sampling_strategy}

n_neighbors : int or estimator object, default=3
If ``int``, size of the neighbourhood to consider to compute the
nearest neighbors. If object, an estimator that inherits from
If ``int``, the maximum size of the neighborhood to evaluate.
The method will start by looking at the 1 closest neighbor, and
then repeat the edited nearest neighbors algorithm, increasing
the neighborhood by 1, until it examines a neighborhood of
`n_neighbors` in the final iteration.

If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors. By default, it will be a 3-NN.
find the nearest-neighbors in the final round. In this case,
AllKNN will repeat edited nearest neighbors starting from a 2-KNN
up to the specified KNN in the object.

kind_sel : {{'all', 'mode'}}, default='all'
Strategy to use in order to exclude samples.

- If ``'all'``, all neighbours will have to agree with the samples of
interest to not be excluded.
- If ``'mode'``, the majority vote of the neighbours will be used in
order to exclude a sample.
- If ``'all'``, all neighbors will have to agree with a sample in order
for it not to be excluded.
- If ``'mode'``, the majority of the neighbors will have to agree with
a sample in order for it not to be excluded.

The strategy `"all"` will be less conservative than `'mode'`. Thus,
more samples will be removed when `kind_sel="all"` generally.
more samples will be removed when `kind_sel="all"`, generally.

allow_minority : bool, default=False
If ``True``, it allows the majority classes to become the minority
@@ -418,7 +435,7 @@ class without early stopping.
References
----------
.. [1] I. Tomek, "An Experiment with the Edited Nearest-Neighbor
Rule," IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6),
Rule", IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6),
pp. 448-452, June 1976.

Examples
@@ -484,10 +501,10 @@ def _fit_resample(self, X, y):

X_enn, y_enn = self.enn_.fit_resample(X_, y_)

# Check the stopping criterion
# 1. If the number of samples in the other class become inferior to
# the number of samples in the majority class
# 2. If one of the class is disappearing
# Stopping criterion:
# 1. If the number of samples in any of the majority classes ends up
# smaller than the number of samples in the minority class
# 2. If one of the classes disappears
# Case 1

stats_enn = Counter(y_enn)