Skip to content

DOC improve introduction to undersampling methods #1018

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 35 additions & 11 deletions doc/under_sampling.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,25 @@ Under-sampling

.. currentmodule:: imblearn.under_sampling

You can refer to
One way of handling imbalanced datasets is to reduce the number of observations from
all classes but the minority class. The minority class is that with the least number
of observations. The most well known algorithm in this group is random
undersampling, where samples from the targeted classes are removed at random.

But there are many other algorithms to help us reduce the number of observations in the
dataset. These algorithms can be grouped based on their undersampling strategy into:

- Prototype generation methods.
- Prototype selection methods.

And within the latter, we find:

- Controlled undersampling
- Cleaning methods

We will discuss the different algorithms throughout this document.

Check also
:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.

.. _cluster_centroids:
Expand All @@ -16,7 +34,7 @@ Prototype generation

Given an original data set :math:`S`, prototype generation algorithms will
generate a new set :math:`S'` where :math:`|S'| < |S|` and :math:`S' \not\subset
S`. In other words, prototype generation technique will reduce the number of
S`. In other words, prototype generation techniques will reduce the number of
samples in the targeted classes but the remaining samples are generated --- and
not selected --- from the original set.

Expand Down Expand Up @@ -61,16 +79,22 @@ original one.
Prototype selection
===================

On the contrary to prototype generation algorithms, prototype selection
algorithms will select samples from the original set :math:`S`. Therefore,
:math:`S'` is defined such as :math:`|S'| < |S|` and :math:`S' \subset S`.
Prototype selection algorithms will select samples from the original set :math:`S`,
generating a dataset :math:`S'`, where :math:`|S'| < |S|` and :math:`S' \subset S`. In
other words, :math:`S'` is a subset of :math:`S`.

Prototype selection algorithms can be divided into two groups: (i) controlled
under-sampling techniques and (ii) cleaning under-sampling techniques.

Controlled under-sampling methods reduce the number of observations in the majority
class or classes to an arbitrary number of samples specified by the user. Typically,
they reduce the number of observations to the number of samples observed in the
minority class.

In addition, these algorithms can be divided into two groups: (i) the
controlled under-sampling techniques and (ii) the cleaning under-sampling
techniques. The first group of methods allows for an under-sampling strategy in
which the number of samples in :math:`S'` is specified by the user. By
contrast, cleaning under-sampling techniques do not allow this specification
and are meant for cleaning the feature space.
In contrast, cleaning under-sampling techniques "clean" the feature space by removing
either "noisy" or "too easy to classify" observations, depending on the method. The
final number of observations in each class varies with the cleaning method and can't be
specified by the user.

.. _controlled_under_sampling:

Expand Down