A curated list of awesome imbalanced learning papers, codes, frameworks and libraries.

Class imbalance (also known as the long-tail problem) occurs when the classes in a classification problem are not represented equally, which is quite common in practice, e.g., in fraud detection, prediction of rare adverse drug reactions, and prediction of gene families. Failure to account for class imbalance often degrades the predictive performance of many classification algorithms. Imbalanced learning aims to tackle the class imbalance problem in order to learn an unbiased model from imbalanced data.

Inspired by awesome-machine-learning. Contributions are welcome!

  • Frameworks and libraries are grouped by programming language.
  • Research papers are grouped by research field.
  • There are tons of papers in this research area; we only keep the "awesome" ones that either have had a strong influence or were published in reputable top conferences/journals.

Table of Contents

Frameworks and Libraries

Python

  • imbalanced-ensemble [Github][Documentation] - imbalanced-ensemble (imported as imbalanced_ensemble) is a Python toolbox for quickly implementing and deploying ensemble imbalanced learning algorithms. It aims to provide easy-to-use ensemble imbalanced learning (EIL) methods and related utilities, so that everyone can quickly deploy EIL algorithms to their tasks.

    NOTE: written in Python, easy to use.

  • imbalanced-learn [Github][Documentation][Paper] - imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects (see the usage sketch after this list).

    NOTE: written in Python, easy to use.

  • smote_variants [Documentation][Github] - A collection of 85 minority over-sampling techniques for imbalanced learning with multi-class oversampling and model selection features (all written in Python, with R and Julia support).
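To give a sense of how these libraries are used in practice, below is a minimal sketch with imbalanced-learn's SMOTE over-sampler (assuming imbalanced-learn and scikit-learn are installed; the toy dataset is illustrative only):

```python
# A minimal imbalanced-learn example: over-sample the minority class with SMOTE.
# Assumes `pip install imbalanced-learn scikit-learn`; the toy data is illustrative only.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a toy binary dataset with a roughly 9:1 class ratio.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
print("Before resampling:", Counter(y))

# fit_resample returns a rebalanced copy of the data; the original arrays are untouched.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_res))
```

The resampled data can then be fed to any scikit-learn classifier; imbalanced-learn also ships imblearn.pipeline.Pipeline so that resampling is applied only to the training folds during cross-validation.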

R

  • smote_variants [Documentation][Github] - A collection of 85 minority over-sampling techniques for imbalanced learning with multi-class oversampling and model selection features (all written in Python, with R and Julia support).
  • caret [Documentation][Github] - Contains implementations of random under-/over-sampling.
  • ROSE [Documentation] - Contains the implementation of ROSE (Random Over-Sampling Examples).
  • DMwR [Documentation] - Contains the implementation of SMOTE (Synthetic Minority Over-sampling TEchnique).

Java

  • KEEL [Github][Paper] - KEEL provides a simple GUI based on data flow to design experiments with different datasets and computational intelligence algorithms (paying special attention to evolutionary algorithms) in order to assess the behavior of the algorithms. This tool includes many widely used imbalanced learning techniques, such as (evolutionary) over-/under-sampling, cost-sensitive learning, algorithm modification, and ensemble learning methods.

    NOTE: wide variety of classical classification, regression, preprocessing algorithms included.

Scala

Julia

  • smote_variants [Documentation][Github] - A collection of 85 minority over-sampling techniques for imbalanced learning with multi-class oversampling and model selection features (all written in Python, with R and Julia support).

Research Papers

Surveys

  • Learning from imbalanced data (2009, 4700+ citations) - Highly cited, classic survey paper. It systematically reviews popular solutions, evaluation metrics, and open challenges for future research in this area (as of 2009).

    NOTE: classic work.

  • Learning from imbalanced data: open challenges and future directions (2016, 400+ citations) - This paper concentrates on the open issues and challenges in imbalanced learning, such as extreme class imbalance, dealing with imbalance in online/stream learning, multi-class imbalanced learning, and semi-supervised or unsupervised imbalanced learning.

  • Learning from class-imbalanced data: Review of methods and applications (2017, 400+ citations) - A recent exhaustive survey of imbalanced learning methods and applications; a total of 527 papers were included in this study. It provides several detailed taxonomies of existing methods and also covers the recent trends in this research area.

    NOTE: a systematic survey with detailed taxonomies of existing methods.

Ensemble Learning

Data resampling

  • Over-sampling

    • ROS [Code] - Random Over-sampling

    • SMOTE [Code] (2002, 9800+ citations) - Synthetic Minority Over-sampling TEchnique (see the imbalanced-learn sketch after this list)

      NOTE: classic work.

    • Borderline-SMOTE [Code] (2005, 1400+ citations) - Borderline-Synthetic Minority Over-sampling TEchnique

    • ADASYN [Code] (2008, 1100+ citations) - ADAptive SYNthetic Sampling

    • SPIDER [Code (Java)] (2008, 150+ citations) - Selective Preprocessing of Imbalanced Data

    • Safe-Level-SMOTE [Code (Java)] (2009, 370+ citations) - Safe Level Synthetic Minority Over-sampling TEchnique

    • SVM-SMOTE [Code] (2009, 120+ citations) - SMOTE based on Support Vectors of SVM

    • MDO (2015, 150+ citations) - Mahalanobis Distance-based Over-sampling for Multi-Class imbalanced problems.

    • 85 variants of SMOTE [Code]

  • Under-sampling

    • RUS [Code] - Random Under-sampling
    • CNN [Code] (1968, 2100+ citations) - Condensed Nearest Neighbor
    • ENN [Code] (1972, 1500+ citations) - Edited Nearest Neighbor
    • TomekLink [Code] (1976, 870+ citations) - Tomek's modification of Condensed Nearest Neighbor
    • NCR [Code] (2001, 500+ citations) - Neighborhood Cleaning Rule
    • NearMiss-1 & 2 & 3 [Code] (2003, 420+ citations) - Several kNN approaches to unbalanced data distributions.
    • CNN with TomekLink [Code (Java)] (2004, 2000+ citations) - Condensed Nearest Neighbor + TomekLink
    • OSS [Code] (1997, 2100+ citations) - One-Sided Selection
    • EUS (2009, 290+ citations) - Evolutionary Under-sampling
    • IHT [Code] (2014, 130+ citations) - Instance Hardness Threshold
  • Hybrid-sampling

    • SMOTE-Tomek & SMOTE-ENN (2004, 2000+ citations) [Code (SMOTE-Tomek)] [Code (SMOTE-ENN)] - Synthetic Minority Over-sampling TEchnique + Tomek's modification of Condensed Nearest Neighbor/Edited Nearest Neighbor

      NOTE: extensive experimental evaluation involving 10 different over/under-sampling methods.

    • SMOTE-RSB (2012, 210+ citations) - Hybrid Preprocessing using SMOTE and Rough Sets Theory

    • SMOTE-IPF (2015, 180+ citations) - SMOTE with Iterative-Partitioning Filter
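Many of the resampling methods above have off-the-shelf implementations in imbalanced-learn. The following is a minimal, non-authoritative sketch (assuming imbalanced-learn is installed, with a synthetic dataset standing in for real data) that applies over-, under-, and hybrid-sampling side by side:

```python
# Over-, under-, and hybrid-sampling with imbalanced-learn on a synthetic imbalanced dataset.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, NearMiss
from imblearn.combine import SMOTETomek, SMOTEENN

# flip_y adds label noise so the classes overlap, as they typically do in real data.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], flip_y=0.05, random_state=0)

samplers = {
    "SMOTE": SMOTE(random_state=0),                       # over-sampling
    "Borderline-SMOTE": BorderlineSMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "Random under-sampling": RandomUnderSampler(random_state=0),
    "Tomek links": TomekLinks(),                          # cleaning-based under-sampling
    "NearMiss-1": NearMiss(version=1),
    "SMOTE + Tomek": SMOTETomek(random_state=0),          # hybrid-sampling
    "SMOTE + ENN": SMOTEENN(random_state=0),
}

print("Original:", Counter(y))
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)             # all samplers share this API
    print(f"{name}: {Counter(y_res)}")
```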

Cost-sensitive Learning

  • CSC4.5 [Code (Java)] (2002, 420+ citations) - An instance-weighting method to induce cost-sensitive trees

  • CSSVM [Code (Java)] (2008, 710+ citations) - Cost-sensitive SVMs for highly imbalanced classification

  • CSNN [Code (Java)] (2005, 950+ citations) - Training cost-sensitive neural networks with methods addressing the class imbalance problem.
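The papers above modify specific learners internally; in day-to-day practice a similar cost-sensitive effect is often obtained by assigning per-class misclassification weights to an off-the-shelf classifier. Below is a minimal sketch using scikit-learn's class_weight parameter (an illustrative stand-in, not the CSC4.5/CSSVM/CSNN algorithms themselves):

```python
# Cost-sensitive learning via per-class weights in scikit-learn (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" weights classes inversely proportional to their frequencies;
# an explicit dict such as {0: 1, 1: 10} encodes a custom misclassification cost.
clf = SVC(class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```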

Deep Learning

Anomaly Detection

Others

1. Imbalanced Datasets

ID Name Repository & Target Ratio #S #F
1 ecoli UCI, target: imU 8.6:1 336 7
2 optical_digits UCI, target: 8 9.1:1 5,620 64
3 satimage UCI, target: 4 9.3:1 6,435 36
4 pen_digits UCI, target: 5 9.4:1 10,992 16
5 abalone UCI, target: 7 9.7:1 4,177 10
6 sick_euthyroid UCI, target: sick euthyroid 9.8:1 3,163 42
7 spectrometer UCI, target: >=44 11:1 531 93
8 car_eval_34 UCI, target: good, v good 12:1 1,728 21
9 isolet UCI, target: A, B 12:1 7,797 617
10 us_crime UCI, target: >0.65 12:1 1,994 100
11 yeast_ml8 LIBSVM, target: 8 13:1 2,417 103
12 scene LIBSVM, target: >one label 13:1 2,407 294
13 libras_move UCI, target: 1 14:1 360 90
14 thyroid_sick UCI, target: sick 15:1 3,772 52
15 coil_2000 KDD, CoIL, target: minority 16:1 9,822 85
16 arrhythmia UCI, target: 06 17:1 452 278
17 solar_flare_m0 UCI, target: M->0 19:1 1,389 32
18 oil UCI, target: minority 22:1 937 49
19 car_eval_4 UCI, target: vgood 26:1 1,728 21
20 wine_quality UCI, wine, target: <=4 26:1 4,898 11
21 letter_img UCI, target: Z 26:1 20,000 16
22 yeast_me2 UCI, target: ME2 28:1 1,484 8
23 webpage LIBSVM, w7a, target: minority 33:1 34,780 300
24 ozone_level UCI, ozone, data 34:1 2,536 72
25 mammography UCI, target: minority 42:1 11,183 6
26 protein_homo KDD CUP 2004, minority 111:1 145,751 74
27 abalone_19 UCI, target: 19 130:1 4,177 10

Note: This collection of datasets is from imblearn.datasets.fetch_datasets; in the table above, #S is the number of samples and #F the number of features.
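A minimal loading sketch (assuming imbalanced-learn is installed; fetch_datasets downloads and caches the data on first use):

```python
# Load the imbalanced benchmark datasets exposed by imbalanced-learn.
from collections import Counter
from imblearn.datasets import fetch_datasets

# filter_data restricts the returned dict to the named datasets; omit it to get all 27.
datasets = fetch_datasets(filter_data=("ecoli", "abalone_19"))
for name, bunch in datasets.items():
    print(name, bunch.data.shape, Counter(bunch.target))
```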

2. Imbalanced Databases

https://github.com/gykovacs/mldb

This repository contains 140+ KEEL classification datasets:

https://github.com/gykovacs/mldb/tree/master/mldb/data/classification

Other Resources
