[MRG] updates docstrings and user guide for ENN, RENN and AllKNN #850
Closed
21 commits
aa2bf56 fixes typos and wording in ENN script (solegalli)
651a360 fixes tests to match bug fix (solegalli)
248f11f fixes test RENN (solegalli)
5c0cb4a [wip] updating allknn and tests (solegalli)
fc73cb3 final edits to old version of docstrings (solegalli)
b8d2292 adds stopping criteria of RENN and AllKNN to docstrings (solegalli)
cc24e41 tidies tests ENN (solegalli)
749566f tidies up tests RENN (solegalli)
4d441f9 add max_iter to test_init_params (solegalli)
7cc977d final update allknn sampler and its tests (solegalli)
7eb85e8 fixes test smote_enn (solegalli)
a631055 reverts tests to original format (solegalli)
7bde866 reverts back to original form of enn and renn (solegalli)
cea8d7b intermediate changes (solegalli)
4fb3339 revert back to original allknn (solegalli)
05a253e removes max iter from allknn (solegalli)
4e30c9b modifies docstring for n_neighbours in allknn (solegalli)
7507331 add more detail in param n_neighbor from AllKNN (solegalli)
3f0e265 update dosctrings in validation (solegalli)
dae3e2e updates user guide for enn, renn and allknn (solegalli)
1be039e final edits (solegalli)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
"""Class to perform under-sampling based on the edited nearest neighbour | ||
"""Classes to perform under-sampling based on the edited nearest neighbor | ||
method.""" | ||
|
||
# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com> | ||
|
@@ -27,9 +27,9 @@ | |
n_jobs=_n_jobs_docstring, | ||
) | ||
class EditedNearestNeighbours(BaseCleaningSampler): | ||
"""Undersample based on the edited nearest neighbour method. | ||
"""Undersample based on the edited nearest neighbor method. | ||
|
||
This method will clean the database by removing samples close to the | ||
This method will clean the data set by removing samples close to the | ||
decision boundary. | ||
|
||
Read more in the :ref:`User Guide <edited_nearest_neighbors>`. | ||
|
@@ -39,21 +39,21 @@ class EditedNearestNeighbours(BaseCleaningSampler): | |
{sampling_strategy} | ||
|
||
n_neighbors : int or object, default=3 | ||
If ``int``, size of the neighbourhood to consider to compute the | ||
If ``int``, size of the neighborhood to consider to compute the | ||
nearest neighbors. If object, an estimator that inherits from | ||
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to | ||
find the nearest-neighbors. | ||
|
||
kind_sel : {{'all', 'mode'}}, default='all' | ||
Strategy to use in order to exclude samples. | ||
|
||
- If ``'all'``, all neighbours will have to agree with the samples of | ||
interest to not be excluded. | ||
- If ``'mode'``, the majority vote of the neighbours will be used in | ||
order to exclude a sample. | ||
- If ``'all'``, all neighbors will have to agree with a sample in order | ||
not to be excluded. | ||
- If ``'mode'``, the majority of the neighbors will have to agree with | ||
a sample in order not to be excluded. | ||
|
||
The strategy `"all"` will be less conservative than `'mode'`. Thus, | ||
more samples will be removed when `kind_sel="all"` generally. | ||
more samples will be removed when `kind_sel="all"`, generally. | ||
|
||
{n_jobs} | ||
|
||
|
@@ -70,7 +70,7 @@ class EditedNearestNeighbours(BaseCleaningSampler): | |
|
||
RepeatedEditedNearestNeighbours : Undersample by repeating ENN algorithm. | ||
|
||
AllKNN : Undersample using ENN and various number of neighbours. | ||
AllKNN : Undersample using ENN and various number of neighbors. | ||
|
||
Notes | ||
----- | ||
|
@@ -81,8 +81,8 @@ class EditedNearestNeighbours(BaseCleaningSampler): | |
|
||
References | ||
---------- | ||
.. [1] D. Wilson, Asymptotic" Properties of Nearest Neighbor Rules Using | ||
Edited Data," In IEEE Transactions on Systems, Man, and Cybernetrics, | ||
.. [1] D. Wilson, "Asymptotic Properties of Nearest Neighbor Rules Using | ||
Edited Data", in IEEE Transactions on Systems, Man, and Cybernetics, | ||
vol. 2 (3), pp. 408-421, 1972. | ||
|
||
Examples | ||
|
@@ -172,9 +172,13 @@ def _more_tags(self): | |
n_jobs=_n_jobs_docstring, | ||
) | ||
class RepeatedEditedNearestNeighbours(BaseCleaningSampler): | ||
"""Undersample based on the repeated edited nearest neighbour method. | ||
"""Undersample based on the repeated edited nearest neighbor method. | ||
|
||
This method will repeat several time the ENN algorithm. | ||
This method will repeat the ENN algorithm several times. The repetitions | ||
will stop when i) the maximum number of iterations is reached, or ii) no | ||
more observations are being removed, or iii) one of the majority classes | ||
becomes a minority class or iv) one of the majority classes disappears | ||
from the target after undersampling. | ||
|
||
Read more in the :ref:`User Guide <edited_nearest_neighbors>`. | ||
|
||
|
@@ -183,25 +187,24 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler): | |
{sampling_strategy} | ||
|
||
n_neighbors : int or object, default=3 | ||
If ``int``, size of the neighbourhood to consider to compute the | ||
If ``int``, size of the neighborhood to consider to compute the | ||
nearest neighbors. If object, an estimator that inherits from | ||
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to | ||
find the nearest-neighbors. | ||
|
||
max_iter : int, default=100 | ||
Maximum number of iterations of the edited nearest neighbours | ||
algorithm for a single run. | ||
Maximum number of repetitions of the edited nearest neighbors algorithm. | ||
|
||
kind_sel : {{'all', 'mode'}}, default='all' | ||
Strategy to use in order to exclude samples. | ||
|
||
- If ``'all'``, all neighbours will have to agree with the samples of | ||
interest to not be excluded. | ||
- If ``'mode'``, the majority vote of the neighbours will be used in | ||
order to exclude a sample. | ||
- If ``'all'``, all neighbors will have to agree with a sample in order | ||
not to be excluded. | ||
- If ``'mode'``, the majority of the neighbors will have to agree with | ||
a sample in order not to be excluded. | ||
|
||
The strategy `"all"` will be less conservative than `'mode'`. Thus, | ||
more samples will be removed when `kind_sel="all"` generally. | ||
more samples will be removed when `kind_sel="all"`, generally. | ||
|
||
{n_jobs} | ||
|
||
|
@@ -213,7 +216,7 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler): | |
.. versionadded:: 0.4 | ||
|
||
n_iter_ : int | ||
Number of iterations run. | ||
Number of iterations that were actually run. | ||
|
||
.. versionadded:: 0.6 | ||
|
||
|
@@ -223,14 +226,14 @@ class RepeatedEditedNearestNeighbours(BaseCleaningSampler): | |
|
||
EditedNearestNeighbours : Undersample by editing samples. | ||
|
||
AllKNN : Undersample using ENN and various number of neighbours. | ||
AllKNN : Undersample using ENN and various number of neighbors. | ||
|
||
Notes | ||
----- | ||
The method is based on [1]_. A one-vs.-rest scheme is used when | ||
sampling a class as proposed in [1]_. | ||
The method is based on [1]_. | ||
|
||
Supports multi-class resampling. | ||
Supports multi-class resampling. A one-vs.-rest scheme is used when | ||
sampling a class as proposed in [1]_. | ||
|
||
References | ||
---------- | ||
|
@@ -303,11 +306,12 @@ def _fit_resample(self, X, y): | |
prev_len = y_.shape[0] | ||
X_enn, y_enn = self.enn_.fit_resample(X_, y_) | ||
|
||
# Check the stopping criterion | ||
# 1. If there is no changes for the vector y | ||
# 2. If the number of samples in the other class become inferior to | ||
# the number of samples in the majority class | ||
# 3. If one of the class is disappearing | ||
# Check the stopping criterion: | ||
# 1. If there are no changes in the vector y | ||
# (that is, if no further observations are removed) | ||
# 2. If the number of samples in any of the other (majority) classes becomes | ||
# smaller than the number of samples in the minority class | ||
# 3. If one of the classes disappears | ||
I had trouble understanding the comments and the logic, so I rephrased them a bit. |
||
|
||
# Case 1 | ||
b_conv = prev_len == y_enn.shape[0] | ||
|
@@ -359,8 +363,14 @@ def _more_tags(self): | |
class AllKNN(BaseCleaningSampler): | ||
"""Undersample based on the AllKNN method. | ||
|
||
This method will apply ENN several time and will vary the number of nearest | ||
neighbours. | ||
This method will apply ENN several times, starting by looking at the | ||
1 closest neighbor, and increasing the number of nearest neighbors | ||
by 1 at each round, up to the number of neighbors specified in | ||
`n_neighbors`. | ||
|
||
The repetitions will stop when i) one of the majority classes | ||
becomes a minority class or ii) one of the majority classes | ||
disappears from the target after undersampling. | ||
|
||
Read more in the :ref:`User Guide <edited_nearest_neighbors>`. | ||
|
||
|
@@ -369,21 +379,28 @@ class AllKNN(BaseCleaningSampler): | |
{sampling_strategy} | ||
|
||
n_neighbors : int or estimator object, default=3 | ||
If ``int``, size of the neighbourhood to consider to compute the | ||
nearest neighbors. If object, an estimator that inherits from | ||
If ``int``, the maximum size of the neighborhood to evaluate. | ||
The method will start by looking at the 1 closest neighbor, and | ||
then repeat the edited nearest neighbors increasing | ||
the neighborhood by 1, until examining a neighborhood of | ||
`n_neighbors` in the final iteration. | ||
|
||
If object, an estimator that inherits from | ||
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to | ||
find the nearest-neighbors. By default, it will be a 3-NN. | ||
find the nearest-neighbors in the final round. In this case, | ||
AllKNN will repeat edited nearest neighbors starting from a 2-KNN | ||
up to the specified KNN in the object. | ||
|
||
kind_sel : {{'all', 'mode'}}, default='all' | ||
Strategy to use in order to exclude samples. | ||
|
||
- If ``'all'``, all neighbours will have to agree with the samples of | ||
interest to not be excluded. | ||
- If ``'mode'``, the majority vote of the neighbours will be used in | ||
order to exclude a sample. | ||
- If ``'all'``, all neighbors will have to agree with a sample in order | ||
not to be excluded. | ||
- If ``'mode'``, the majority of the neighbors will have to agree with | ||
a sample in order not to be excluded. | ||
|
||
The strategy `"all"` will be less conservative than `'mode'`. Thus, | ||
more samples will be removed when `kind_sel="all"` generally. | ||
more samples will be removed when `kind_sel="all"`, generally. | ||
|
||
allow_minority : bool, default=False | ||
If ``True``, it allows the majority classes to become the minority | ||
|
@@ -418,7 +435,7 @@ class without early stopping. | |
References | ||
---------- | ||
.. [1] I. Tomek, "An Experiment with the Edited Nearest-Neighbor | ||
Rule," IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6), | ||
Rule", IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6), | ||
pp. 448-452, June 1976. | ||
|
||
Examples | ||
|
@@ -484,10 +501,10 @@ def _fit_resample(self, X, y): | |
|
||
X_enn, y_enn = self.enn_.fit_resample(X_, y_) | ||
|
||
# Check the stopping criterion | ||
# 1. If the number of samples in the other class become inferior to | ||
# the number of samples in the majority class | ||
# 2. If one of the class is disappearing | ||
# Stopping criterion: | ||
# 1. If the number of samples in any of the majority classes ends up | ||
# smaller than the number of samples in the minority class | ||
# 2. If one of the classes disappears | ||
# Case 1else: | ||
|
||
stats_enn = Counter(y_enn) | ||
|
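The docstrings in this diff describe two behaviors that are easier to follow with a toy example: how `kind_sel` decides whether a sample is kept, and how AllKNN grows the neighborhood by 1 at each round until a stopping criterion fires. The block below is a minimal pure-Python sketch of those descriptions, not the imbalanced-learn implementation. The names `sq_dist`, `enn_keep_mask` and `all_knn_sketch` are hypothetical, the neighbor search is brute force, ties in the `'mode'` vote go to the nearest neighbor's label, and keeping the result of the round that triggers the stop is a choice made for this sketch.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def enn_keep_mask(X, y, k, kind_sel="all", minority=None):
    """Return one boolean per sample: True if the sample is kept.

    Samples of the `minority` class are never removed (assumption: only
    the other classes are cleaned).
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == minority:
            keep.append(True)
            continue
        # Brute-force k nearest neighbors, excluding the sample itself.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sq_dist(xi, X[j]),
        )
        labels = [y[j] for j in others[:k]]
        if kind_sel == "all":
            # Every neighbor must share the sample's label.
            agrees = all(label == yi for label in labels)
        else:  # "mode"
            # The most common neighbor label must match the sample's label.
            agrees = Counter(labels).most_common(1)[0][0] == yi
        keep.append(agrees)
    return keep


def all_knn_sketch(X, y, n_neighbors=3, kind_sel="all"):
    """Apply the ENN rule with k = 1, 2, ..., n_neighbors, stopping early."""
    original = Counter(y)
    minority = min(original, key=original.get)
    X_, y_ = list(X), list(y)
    rounds = 0
    for k in range(1, n_neighbors + 1):
        keep = enn_keep_mask(X_, y_, k, kind_sel, minority)
        X_ = [x for x, kept in zip(X_, keep) if kept]
        y_ = [label for label, kept in zip(y_, keep) if kept]
        rounds += 1
        stats = Counter(y_)
        # i) a majority class now has fewer samples than the minority class
        became_minority = any(
            n < original[minority] for c, n in stats.items() if c != minority
        )
        # ii) a class disappeared from the target
        disappeared = len(stats) < len(original)
        if became_minority or disappeared:
            break
    return X_, y_, rounds


# Toy 1-D data: class 0 is the majority, class 1 the minority.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (3.5,), (4.0,), (5.0,), (6.0,)]
y = [0, 0, 0, 0, 1, 0, 1, 1]

mask_all = enn_keep_mask(X, y, k=3, kind_sel="all", minority=1)
mask_mode = enn_keep_mask(X, y, k=3, kind_sel="mode", minority=1)
X_res, y_res, n_rounds = all_knn_sketch(X, y, n_neighbors=3, kind_sel="all")
```

On this toy data `'all'` removes three class-0 samples while `'mode'` removes one, which matches the docstring's statement that `'all'` generally removes more samples. The AllKNN sketch stops after its second round, because class 0 ends up with fewer samples than the minority class.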
I don't think there is a way to understand how the algorithm stops without reading the source code, so I added this bit.
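As a rough illustration of the stopping criteria now listed in the RENN docstring, the check performed after each repetition could be sketched as below. This is a simplified stand-in written for this discussion, not the code in `_fit_resample`. The helper name `renn_should_stop` and its return format are hypothetical, and the `max_iter` criterion is assumed to be enforced by the surrounding loop rather than by this helper.

```python
from collections import Counter


def renn_should_stop(y_prev, y_new, y_original):
    """Return the names of the stopping criteria that fired.

    An empty list means the repetitions would continue.
    """
    original = Counter(y_original)
    minority = min(original, key=original.get)
    stats = Counter(y_new)

    reasons = []
    # No more observations are being removed.
    if len(y_new) == len(y_prev):
        reasons.append("no_change")
    # One of the majority classes now has fewer samples than the
    # minority class had originally.
    if any(n < original[minority] for c, n in stats.items() if c != minority):
        reasons.append("majority_became_minority")
    # One of the classes disappeared from the target.
    if len(stats) < len(original):
        reasons.append("class_disappeared")
    return reasons


y_original = [0, 0, 0, 0, 1, 1]
still_cleaning = renn_should_stop([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1], y_original)
converged = renn_should_stop([0, 0, 0, 1, 1], [0, 0, 0, 1, 1], y_original)
flipped = renn_should_stop([0, 0, 0, 1, 1], [0, 1, 1], y_original)
```

Returning the list of reasons rather than a single boolean is only a convenience for inspecting which criterion ended the run.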