
[MRG] updates docstrings and user guide for ENN, RENN and AllKNN #850


Closed · wants to merge 21 commits
76 changes: 53 additions & 23 deletions doc/under_sampling.rst
@@ -237,14 +237,18 @@ figure illustrates this behaviour.

.. _edited_nearest_neighbors:

Edited data set using nearest neighbours
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
"edit" the dataset by removing samples which do not agree "enough" with their
neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
under-sampled, the nearest-neighbours are computed and if the selection
criterion is not fulfilled, the sample is removed::
Edited data set using nearest neighbors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`EditedNearestNeighbours` trains a nearest neighbors algorithm and
then looks at the closest neighbors of each data point of the class to be
under-sampled, and "edits" the dataset by removing samples which do not agree
"enough" with their neighborhood :cite:`wilson1972asymptotic`. In short,
a nearest neighbors algorithm algorithm is trained on the data. Then, for each
sample in the class to be under-sampled, the nearest neighbors are identified.
Once the neighbors are identified, if all the neighbors or most of the neighbors
agree with the class of the sample being inspected, the sample is kept, otherwise
removed::

>>> sorted(Counter(y).items())
[(0, 64), (1, 262), (2, 4674)]
@@ -255,11 +259,10 @@
[(0, 64), (1, 213), (2, 4568)]
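
To make the editing rule concrete, below is a minimal sketch (ours, not
imbalanced-learn's actual implementation) that applies the default
selection criterion, where all neighbors must agree, to a toy 1-D dataset
using scikit-learn's :class:`~sklearn.neighbors.NearestNeighbors`::

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [5.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # Ask for 4 neighbors: the first neighbor returned is the sample itself.
    nn = NearestNeighbors(n_neighbors=4).fit(X)
    _, idx = nn.kneighbors(X)

    # Keep a sample only if every neighbor shares its class ('all' criterion).
    keep = np.array([
        (y[neighbors[1:]] == y[i]).all()
        for i, neighbors in enumerate(idx)
    ])
    # The real sampler only edits the class being under-sampled; for
    # simplicity this sketch applies the rule to every sample.
    X_edited, y_edited = X[keep], y[keep]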

Two selection criteria are currently available: (i) the majority (i.e.,
``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
nearest-neighbors have to belong to the same class than the sample inspected to
keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
conservative than `kind_sel='mode'`, and more samples will be excluded in
the former strategy than the latest::
``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) of the
nearest neighbors must belong to the same class as the sample inspected to
keep it in the dataset. This means that `kind_sel='all'` will be less
conservative than `kind_sel='mode'`, and more samples will be excluded::

>>> enn = EditedNearestNeighbours(kind_sel="all")
>>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -270,34 +273,61 @@
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 234), (2, 4666)]

The parameter ``n_neighbors`` allows to give a classifier subclassed from
``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
the decision to keep a given sample or not.
The parameter ``n_neighbors`` can take a classifier subclassed from
``KNeighborsMixin`` from scikit-learn to find the nearest neighbors.
Note that if a 4-KNN classifier is passed, 3 neighbors will be
examined for the selection criterion, because the sample being inspected
is itself among the neighbors returned by the algorithm. Alternatively,
an integer can be passed to ``n_neighbors`` to indicate the size of the
neighborhood to examine when making a decision. Thus, if ``n_neighbors=3``,
the edited nearest neighbors will look at the 3 closest neighbors of each
sample.
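
As an illustration, the sketch below (ours; it assumes the ``X``, ``y``
from the examples above) shows both ways of setting ``n_neighbors``::

    from sklearn.neighbors import NearestNeighbors
    from imblearn.under_sampling import EditedNearestNeighbours

    # integer: examine the 3 closest neighbors of each sample
    enn_int = EditedNearestNeighbours(n_neighbors=3)
    X_res, y_res = enn_int.fit_resample(X, y)

    # estimator: a 4-NN finder; 3 neighbors are examined because the
    # sample under inspection is itself among the neighbors returned
    enn_obj = EditedNearestNeighbours(n_neighbors=NearestNeighbors(n_neighbors=4))
    X_res, y_res = enn_obj.fit_resample(X, y)

If our reading is correct, the two samplers above examine the same
3-sample neighborhoods, since an extra neighbor is fetched internally
to account for the sample itself.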

:class:`RepeatedEditedNearestNeighbours` extends
:class:`EditedNearestNeighbours` by repeating the algorithm multiple times
:cite:`tomek1976experiment`. Generally, repeating the algorithm will delete
more data::
more data. The user indicates how many times to repeat the algorithm
through the parameter ``max_iter``::

>>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours
>>> renn = RepeatedEditedNearestNeighbours()
>>> X_resampled, y_resampled = renn.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 208), (2, 4551)]

:class:`AllKNN` differs from the previous
:class:`RepeatedEditedNearestNeighbours` since the number of neighbors of the
internal nearest neighbors algorithm is increased at each iteration
:cite:`tomek1976experiment`::
Note that :class:`RepeatedEditedNearestNeighbours` will stop before reaching
``max_iter`` if no more samples are removed from the data, or if one of the
majority classes disappears or ends up with fewer samples than the minority
class after being "edited".
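
As a small sketch (again assuming the ``X``, ``y`` from above), the
fitted attribute ``n_iter_`` reports how many ENN passes actually ran,
which can be fewer than ``max_iter`` when one of the stopping
conditions above is met early::

    from imblearn.under_sampling import RepeatedEditedNearestNeighbours

    renn = RepeatedEditedNearestNeighbours(max_iter=5)
    X_res, y_res = renn.fit_resample(X, y)

    # may print less than 5 if ENN stopped removing samples, or if a
    # majority class would vanish or fall below the minority class
    print(renn.n_iter_)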

:class:`AllKNN` extends :class:`EditedNearestNeighbours` by repeating
the algorithm multiple times, each time with an additional neighbor
:cite:`tomek1976experiment`. In other words, :class:`AllKNN` differs
from :class:`RepeatedEditedNearestNeighbours` in that the number of
neighbors of the internal nearest neighbors algorithm increases at
each iteration. In short, in the first iteration, a 2-KNN algorithm
is trained on the data to examine the closest neighbor of each
sample from the class to be under-sampled. In each subsequent
iteration, the neighborhood examined grows by 1, until it reaches the
number of neighbors indicated in the parameter ``n_neighbors``::

>>> from imblearn.under_sampling import AllKNN
>>> allknn = AllKNN()
>>> X_resampled, y_resampled = allknn.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 220), (2, 4601)]

In the example below, it can be seen that the three algorithms have similar
impact by cleaning noisy samples next to the boundaries of the classes.

The parameter ``n_neighbors`` can take an integer to indicate the size
of the neighborhood to examine in the last iteration. Thus, if
``n_neighbors=3``, AllKNN will examine the 1 closest neighbor in the
first iteration, the 2 closest neighbors in the second iteration
and the 3 closest neighbors in the third iteration. The parameter
``n_neighbors`` can also take a classifier subclassed from
``KNeighborsMixin`` from scikit-learn to find the nearest neighbors.
Again, this will be the KNN used in the last iteration.
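
The sketch below (ours; it assumes the ``X``, ``y`` from the examples
above) shows both forms of the parameter::

    from sklearn.neighbors import NearestNeighbors
    from imblearn.under_sampling import AllKNN

    # integer: neighborhoods of size 1, 2 and 3 across three iterations
    allknn_int = AllKNN(n_neighbors=3)
    X_res, y_res = allknn_int.fit_resample(X, y)

    # estimator: the KNN object that will be used in the last iteration
    allknn_obj = AllKNN(n_neighbors=NearestNeighbors(n_neighbors=4))
    X_res, y_res = allknn_obj.fit_resample(X, y)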

In the example below, we can see that the three algorithms have a similar
impact on cleaning noisy samples at the boundaries of the classes.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_004.png
:target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html
@@ -1,4 +1,4 @@
"""Class to perform under-sampling based on the edited nearest neighbour
"""Classes to perform under-sampling based on the edited nearest neighbor
method."""

# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@@ -27,9 +27,9 @@
n_jobs=_n_jobs_docstring,
)
class EditedNearestNeighbours(BaseCleaningSampler):
"""Undersample based on the edited nearest neighbour method.
"""Undersample based on the edited nearest neighbor method.

This method will clean the database by removing samples close to the
This method will clean the data set by removing samples close to the
decision boundary.

Read more in the :ref:`User Guide <edited_nearest_neighbors>`.
@@ -39,21 +39,21 @@ class EditedNearestNeighbours(BaseCleaningSampler):
{sampling_strategy}

n_neighbors : int or object, default=3
If ``int``, size of the neighbourhood to consider to compute the
If ``int``, size of the neighborhood to consider to compute the
nearest neighbors. If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors.

kind_sel : {{'all', 'mode'}}, default='all'
Strategy to use in order to exclude samples.

- If ``'all'``, all neighbours will have to agree with the samples of
interest to not be excluded.
- If ``'mode'``, the majority vote of the neighbours will be used in
order to exclude a sample.
- If ``'all'``, all neighbors will have to agree with a sample in order
for it not to be excluded.
- If ``'mode'``, the majority of the neighbors will have to agree with
a sample in order for it not to be excluded.

The strategy `"all"` will be less conservative than `'mode'`. Thus,
more samples will be removed when `kind_sel="all"` generally.
more samples will be removed when `kind_sel="all"`, generally.

{n_jobs}

@@ -70,7 +70,7 @@ class EditedNearestNeighbours(BaseCleaningSampler):

RepeatedEditedNearestNeighbours : Undersample by repeating ENN algorithm.

AllKNN : Undersample using ENN and various number of neighbours.
AllKNN : Undersample using ENN and various number of neighbors.

Notes
-----
@@ -81,8 +81,8 @@ class EditedNearestNeighbours(BaseCleaningSampler):

References
----------
.. [1] D. Wilson, Asymptotic" Properties of Nearest Neighbor Rules Using
Edited Data," In IEEE Transactions on Systems, Man, and Cybernetrics,
.. [1] D. Wilson, "Asymptotic Properties of Nearest Neighbor Rules Using
Edited Data", in IEEE Transactions on Systems, Man, and Cybernetics,
vol. 2 (3), pp. 408-421, 1972.

Examples
@@ -172,9 +172,13 @@ def _more_tags(self):
n_jobs=_n_jobs_docstring,
)
class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
"""Undersample based on the repeated edited nearest neighbour method.
"""Undersample based on the repeated edited nearest neighbor method.

This method will repeat several time the ENN algorithm.
This method will repeat the ENN algorithm several times. The repetitions
will stop when i) the maximum number of iterations is reached, or ii) no
more observations are being removed, or iii) one of the majority classes
becomes a minority class, or iv) one of the majority classes disappears
from the target after undersampling.

Comment from @solegalli (Contributor, Author), Aug 5, 2021:

I don't think there is a way to understand how the algorithm stops unless we
read the source code. So I added this bit.

Read more in the :ref:`User Guide <edited_nearest_neighbors>`.

@@ -183,25 +187,24 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
{sampling_strategy}

n_neighbors : int or object, default=3
If ``int``, size of the neighbourhood to consider to compute the
If ``int``, size of the neighborhood to consider to compute the
nearest neighbors. If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors.

max_iter : int, default=100
Maximum number of iterations of the edited nearest neighbours
algorithm for a single run.
Maximum number of repetitions of the edited nearest neighbors algorithm.

kind_sel : {{'all', 'mode'}}, default='all'
Strategy to use in order to exclude samples.

- If ``'all'``, all neighbours will have to agree with the samples of
interest to not be excluded.
- If ``'mode'``, the majority vote of the neighbours will be used in
order to exclude a sample.
- If ``'all'``, all neighbors will have to agree with a sample in order
for it not to be excluded.
- If ``'mode'``, the majority of the neighbors will have to agree with
a sample in order for it not to be excluded.

The strategy `"all"` will be less conservative than `'mode'`. Thus,
more samples will be removed when `kind_sel="all"` generally.
more samples will be removed when `kind_sel="all"`, generally.

{n_jobs}

@@ -213,7 +216,7 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):
.. versionadded:: 0.4

n_iter_ : int
Number of iterations run.
Number of iterations that were actually run.

.. versionadded:: 0.6

@@ -223,14 +226,14 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler):

EditedNearestNeighbours : Undersample by editing samples.

AllKNN : Undersample using ENN and various number of neighbours.
AllKNN : Undersample using ENN and various number of neighbors.

Notes
-----
The method is based on [1]_. A one-vs.-rest scheme is used when
sampling a class as proposed in [1]_.
The method is based on [1]_.

Supports multi-class resampling.
Supports multi-class resampling. A one-vs.-rest scheme is used when
sampling a class as proposed in [1]_.

References
----------
@@ -303,11 +306,12 @@ def _fit_resample(self, X, y):
prev_len = y_.shape[0]
X_enn, y_enn = self.enn_.fit_resample(X_, y_)

# Check the stopping criterion
# 1. If there is no changes for the vector y
# 2. If the number of samples in the other class become inferior to
# the number of samples in the majority class
# 3. If one of the class is disappearing
# Check the stopping criterion:
# 1. If there are no changes in the vector y
# (that is, if no further observations are removed)
# 2. If the number of samples in any of the other (majority) classes becomes
# smaller than the number of samples in the minority class
# 3. If one of the classes disappears
Comment from the PR author:

I had trouble understanding the comments and the logic, so I rephrased a bit.


# Case 1
b_conv = prev_len == y_enn.shape[0]
@@ -359,8 +363,14 @@ def _more_tags(self):
class AllKNN(BaseCleaningSampler):
"""Undersample based on the AllKNN method.

This method will apply ENN several time and will vary the number of nearest
neighbours.
This method will apply ENN several times, starting by looking at the
1 closest neighbor, and increasing the number of nearest neighbors
by 1 at each round, up to the number of neighbors specified in
`n_neighbors`.

The repetitions will stop when i) one of the majority classes
becomes a minority class or ii) one of the majority classes
disappears from the target after undersampling.

Read more in the :ref:`User Guide <edited_nearest_neighbors>`.

@@ -369,21 +379,28 @@ class AllKNN(BaseCleaningSampler):
{sampling_strategy}

n_neighbors : int or estimator object, default=3
If ``int``, size of the neighbourhood to consider to compute the
nearest neighbors. If object, an estimator that inherits from
If ``int``, the maximum size of the neighborhood to evaluate.
The method will start by looking at the 1 closest neighbor, and
then repeat the edited nearest neighbors algorithm, increasing
the neighborhood by 1, until it examines a neighborhood of
`n_neighbors` in the final iteration.

If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors. By default, it will be a 3-NN.
find the nearest-neighbors in the final round. In this case,
AllKNN will repeat edited nearest neighbors starting from a 2-KNN
up to the specified KNN in the object.

kind_sel : {{'all', 'mode'}}, default='all'
Strategy to use in order to exclude samples.

- If ``'all'``, all neighbours will have to agree with the samples of
interest to not be excluded.
- If ``'mode'``, the majority vote of the neighbours will be used in
order to exclude a sample.
- If ``'all'``, all neighbors will have to agree with a sample in order
for it not to be excluded.
- If ``'mode'``, the majority of the neighbors will have to agree with
a sample in order for it not to be excluded.

The strategy `"all"` will be less conservative than `'mode'`. Thus,
more samples will be removed when `kind_sel="all"` generally.
more samples will be removed when `kind_sel="all"`, generally.

allow_minority : bool, default=False
If ``True``, it allows the majority classes to become the minority
@@ -418,7 +435,7 @@ class without early stopping.
References
----------
.. [1] I. Tomek, "An Experiment with the Edited Nearest-Neighbor
Rule," IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6),
Rule", IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6),
pp. 448-452, June 1976.

Examples
@@ -484,10 +501,10 @@ def _fit_resample(self, X, y):

X_enn, y_enn = self.enn_.fit_resample(X_, y_)

# Check the stopping criterion
# 1. If the number of samples in the other class become inferior to
# the number of samples in the majority class
# 2. If one of the class is disappearing
# Stopping criterion:
# 1. If the number of samples in any of the majority classes ends up
# smaller than the number of samples in the minority class
# 2. If one of the classes disappears
# Case 1

stats_enn = Counter(y_enn)