From 8fb6c03c436210dde28c935bc0a8828be7837d0a Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 19:53:35 +0200
Subject: [PATCH 1/3] reword introduction to undersampling methods

---
 doc/under_sampling.rst | 47 +++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 12 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 9f2795430..ab496d959 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -6,8 +6,25 @@ Under-sampling
 
 .. currentmodule:: imblearn.under_sampling
 
-You can refer to
-:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.
+One way of handling imbalanced datasets is to reduce the number of observations from
+the majority class or classes. The most well known algorithm in this group is random
+undersampling, where samples from the majority classes are removed at random.
+
+But there are many other algorithms to help us reduce the number of observations in the
+dataset. These algorithms can be grouped based on their undersampling strategy into:
+
+- Prototype generation methods
+- Prototype selection methods.
+
+And within the latter, we find:
+
+- Controlled undersampling
+- Cleaning methods
+
+We will discuss the different algorithms throughout this document.
+
+Refer to :ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`
+for a comparison of the different methods.
 
 .. _cluster_centroids:
 
@@ -16,7 +33,7 @@ Prototype generation
 
 Given an original data set :math:`S`, prototype generation algorithms will
 generate a new set :math:`S'` where :math:`|S'| < |S|` and :math:`S' \not\subset
-S`. In other words, prototype generation technique will reduce the number of
+S`. In other words, prototype generation techniques will reduce the number of
 samples in the targeted classes but the remaining samples are generated --- and
 not selected --- from the original set.
 
@@ -61,16 +78,22 @@ original one.
 Prototype selection
 ===================
 
-On the contrary to prototype generation algorithms, prototype selection
-algorithms will select samples from the original set :math:`S`. Therefore,
-:math:`S'` is defined such as :math:`|S'| < |S|` and :math:`S' \subset S`.
+Prototype selection algorithms will select samples from the original set :math:`S`,
+generating a dataset :math:`S'`, where :math:`|S'| < |S|` and :math:`S' \subset S`. In
+other words, :math:`S'` is a subset of :math:`S`.
+
+Prototype selection algorithms can be divided into two groups: (i) controlled
+under-sampling techniques and (ii) cleaning under-sampling techniques.
+
+Controlled under-sampling methods reduce the number of observations in the majority
+class or classes to an arbitrary number of samples specified by the user. Typically,
+they reduce the number of observations to the number of samples observed in the
+minority class.
 
-In addition, these algorithms can be divided into two groups: (i) the
-controlled under-sampling techniques and (ii) the cleaning under-sampling
-techniques. The first group of methods allows for an under-sampling strategy in
-which the number of samples in :math:`S'` is specified by the user. By
-contrast, cleaning under-sampling techniques do not allow this specification
-and are meant for cleaning the feature space.
+In contrast, cleaning under-sampling techniques "clean" the feature space by removing
+either "noisy" or "too easy to classify" observations, depending on the method. The
+final number of observations in each class varies with the cleaning method and can't be
+specified by the user.
 
 .. _controlled_under_sampling:
 

From f1901db884c7086d44b42fe57dfb798fba7c5fba Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 19:58:06 +0200
Subject: [PATCH 2/3] final touches

---
 doc/under_sampling.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index ab496d959..d886649b1 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -13,7 +13,7 @@ undersampling, where samples from the majority classes are removed at random.
 But there are many other algorithms to help us reduce the number of observations in the
 dataset. These algorithms can be grouped based on their undersampling strategy into:
 
-- Prototype generation methods
+- Prototype generation methods.
 - Prototype selection methods.
 
 And within the latter, we find:
@@ -24,7 +24,7 @@ And within the latter, we find:
 We will discuss the different algorithms throughout this document.
 
 Refer to :ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`
-for a comparison of the different methods.
+for a comparison of the different undersampling methodologies.
 
 .. _cluster_centroids:
 

From 0e607c79d7499ca7e385cd13efac376c7c6d562a Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 12:19:15 +0200
Subject: [PATCH 3/3] reword

---
 doc/under_sampling.rst | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index d886649b1..a7c195133 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -7,8 +7,9 @@ Under-sampling
 .. currentmodule:: imblearn.under_sampling
 
 One way of handling imbalanced datasets is to reduce the number of observations from
-the majority class or classes. The most well known algorithm in this group is random
-undersampling, where samples from the majority classes are removed at random.
+all classes but the minority class. The minority class is that with the least number
+of observations. The most well known algorithm in this group is random
+undersampling, where samples from the targeted classes are removed at random.
 
 But there are many other algorithms to help us reduce the number of observations in the
 dataset. These algorithms can be grouped based on their undersampling strategy into:
@@ -23,8 +24,8 @@ And within the latter, we find:
 
 We will discuss the different algorithms throughout this document.
 
-Refer to :ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`
-for a comparison of the different undersampling methodologies.
+Check also
+:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.
 
 .. _cluster_centroids:
 
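
Illustrative note on the reworded text: the introduction describes random undersampling as removing samples at random from every class except the minority class, and the prototype selection section requires the result to be a strict subset of the original data. The sketch below shows that idea using only the Python standard library. It is an assumption-laden illustration, not the imbalanced-learn implementation: the function name `random_undersample`, its `random_state` parameter, and the toy data are hypothetical, and it assumes the common default of shrinking each targeted class to the size of the minority class.

```python
import random
from collections import Counter, defaultdict


def random_undersample(X, y, random_state=0):
    """Hypothetical helper: keep a random subset of each class,
    sized to match the minority class (the class with the fewest samples)."""
    rng = random.Random(random_state)

    # Group sample indices by class label.
    indices_by_class = defaultdict(list)
    for index, label in enumerate(y):
        indices_by_class[label].append(index)

    # The minority class is the one with the fewest observations.
    n_minority = min(len(indices) for indices in indices_by_class.values())

    kept = []
    for indices in indices_by_class.values():
        # Sampling without replacement only ever selects existing samples,
        # so the result is a subset of the original data (prototype selection).
        kept.extend(rng.sample(indices, n_minority))
    kept.sort()  # keep the retained samples in their original order

    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data (made up for illustration): 9 samples of class 0, 3 of class 1.
X = [[float(i)] for i in range(12)]
y = [0] * 9 + [1] * 3

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

Because the user fixes the target size (here, the minority class count), this falls under the controlled undersampling group described in the patch. A cleaning method would instead decide which samples to drop from the structure of the feature space, so the final class counts could not be set in advance.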