DOC improve TomekLinks documentation #1020

Merged
merged 6 commits into from
Jul 11, 2023
41 changes: 26 additions & 15 deletions doc/under_sampling.rst
@@ -197,38 +197,49 @@ affected by noise due to the first step sample selection.
Cleaning under-sampling techniques
----------------------------------

Cleaning under-sampling techniques do not allow to specify the number of
samples to have in each class. In fact, each algorithm implement an heuristic
which will clean the dataset.
Cleaning under-sampling methods "clean" the feature space by removing
either "noisy" observations or observations that are "too easy to classify", depending
on the method. The final number of observations in each targeted class varies with the
cleaning method and cannot be specified by the user.

.. _tomek_links:

Tomek's links
^^^^^^^^^^^^^

:class:`TomekLinks` detects the so-called Tomek's links :cite:`tomek1976two`. A
Tomek's link between two samples of different class :math:`x` and :math:`y` is
defined such that for any sample :math:`z`:
A Tomek's link exists when two samples from different classes are each other's nearest
neighbors.

Mathematically, a Tomek's link between two samples from different classes :math:`x`
and :math:`y` is defined such that for any sample :math:`z`:

.. math::

   d(x, y) < d(x, z) \text{ and } d(x, y) < d(y, z)

where :math:`d(.)` is the distance between the two samples. In some other
words, a Tomek's link exist if the two samples are the nearest neighbors of
each other. In the figure below, a Tomek's link is illustrated by highlighting
the samples of interest in green.
where :math:`d(.)` is the distance between the two samples.
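
The definition above can be sketched in a few lines of plain Python. This is only an
illustration of the mutual-nearest-neighbor condition, not the implementation used by
:class:`TomekLinks`; the helper name ``find_tomek_links``, the toy data, and the choice
of the Euclidean distance are assumptions made for this example.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def find_tomek_links(X, y):
    """Return index pairs (i, j), with i < j, that form a Tomek's link.

    Hypothetical helper for illustration: brute-force search, Euclidean distance.
    """
    def nearest(i):
        # Index of the sample closest to X[i], excluding X[i] itself.
        others = (j for j in range(len(X)) if j != i)
        return min(others, key=lambda j: dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    # A link needs mutual nearest neighbors that belong to different classes.
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


# Toy data (made up for this sketch): samples 2 and 3 lie close together
# but carry different labels.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (3.0, 3.0)]
y = [0, 0, 0, 1, 1]
links = find_tomek_links(X, y)
print(links)
```

Samples 0 and 1 are also mutual nearest neighbors, but they share a class, so they do
not form a Tomek's link.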

:class:`TomekLinks` detects and removes Tomek's links :cite:`tomek1976two`. The
underlying idea is that Tomek's links are noisy or hard-to-classify observations that
would not help a classifier find a suitable decision boundary.

In the following figure, a Tomek's link between an observation of class :math:`+` and
class :math:`-` is highlighted in green:

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
   :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
   :scale: 60
   :align: center

The parameter ``sampling_strategy`` control which sample of the link will be
removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
remove the sample from the majority class. Both samples from the majority and
minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
figure illustrates this behaviour.
When :class:`TomekLinks` finds a Tomek's link, it can remove either the majority-class
sample alone or both samples. The parameter ``sampling_strategy`` controls which samples
of the link are removed. By default (i.e., ``sampling_strategy='auto'``), only the
sample from the majority class is removed. Both samples, that is, the one from the
majority class and the one from the minority class, can be removed by setting
``sampling_strategy`` to ``'all'``.

The following figure illustrates this behaviour: on the left, only the sample from the
majority class is removed, whereas on the right, the entire Tomek's link is removed.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
   :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html