SMOTE: Synthetic Minority Over-sampling Technique

Nitesh V. Chawla (Department of Computer Science and Engineering, ENB 118, University of South Florida, 4202 E. Fowler Ave., Tampa, FL 33620-5399, USA), Kevin W. Bowyer (Department of Computer Science and Engineering, 384 Fitzpatrick Hall, University of Notre Dame, Notre Dame, IN 46556, USA), Lawrence O. Hall (Department of Computer Science and Engineering, ENB 118, University of South Florida, Tampa, FL 33620-5399, USA), W. Philip Kegelmeyer (Biosystems Research Department, Sandia National Laboratories, P.O. Box 969, MS 9951, Livermore, CA 94551-0969, USA) (2002)

Paper Information
  • arXiv ID: 1106.1813
  • Venue: Journal of Artificial Intelligence Research
  • Domain: Machine Learning
  • SOTA Claim: Yes

Abstract

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world datasets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class, and better performance than varying the loss ratios in Ripper or the class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.

(The paper's confusion-matrix convention: rows are the actual class, columns the predicted class, giving the cells TN, FP, FN and TP. The data is available from the USF Intelligent Systems Lab.)
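The core of SMOTE is an interpolation step: for each chosen minority-class example, pick one of its k nearest minority-class neighbours and create a new point somewhere along the line segment joining them. Below is a minimal NumPy sketch of that step, assuming continuous features only (the paper's SMOTE-NC variant for nominal attributes is omitted); the function name and parameters are illustrative, not the authors' reference implementation.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Create synthetic minority samples by interpolating between a sample
    and one of its k nearest minority-class neighbours (continuous features
    only; simplified sketch of the SMOTE idea)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)  # assumes n > k

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                # exclude each point itself
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest-neighbour indices

    synthetic = np.empty((n_synthetic, minority.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(n)                   # pick a minority sample
        nn = neighbours[base, rng.integers(k)]   # pick one of its neighbours
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[nn] - minority[base])
    return synthetic
```

Calling, say, smote(X_minority, n_synthetic=2 * len(X_minority)) corresponds roughly to the paper's 200% SMOTE setting, in which the minority class is tripled in size.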

Summary

This paper presents a novel approach called SMOTE (Synthetic Minority Over-sampling Technique) aimed at addressing the issue of imbalanced datasets in machine learning. Imbalanced datasets occur when the instances of one class are significantly underrepresented compared to others, leading to classifiers that perform poorly on the minority class. The authors propose a method that combines over-sampling the minority class by generating synthetic examples and under-sampling the majority class. They conduct experiments using several classifiers including C4.5, Ripper, and Naive Bayes, and evaluate their method using metrics like Area Under the Receiver Operating Characteristic curve (AUC) and ROC convex hulls. Results demonstrate that this hybrid approach of SMOTE combined with under-sampling outperforms traditional under-sampling alone, as well as adjusting loss ratios in Ripper and class priors in Naive Bayes. The paper also discusses the limitations of existing methods and suggests that SMOTE can lead to better generalization performance for classifiers, particularly in diverse application domains such as fraud detection, medical diagnosis, and text classification. Recommendations for future work include refining the selection of nearest neighbors and exploring adaptive methods for creating synthetic samples.
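The hybrid strategy described above (SMOTE on the minority class plus random under-sampling of the majority class) amounts to a resampling step applied to the training data before fitting any classifier. A minimal sketch, reusing the smote() helper above and assuming binary 0/1 labels with illustrative rather than paper-tuned rates:

```python
import numpy as np

def smote_plus_undersample(X, y, minority_label=1, smote_pct=200,
                           majority_keep=0.5, rng=None):
    """Over-sample the minority class with SMOTE and randomly under-sample
    the majority class (binary 0/1 labels; rates are illustrative)."""
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X, dtype=float), np.asarray(y)

    X_min = X[y == minority_label]
    X_maj = X[y != minority_label]

    # SMOTE: add smote_pct% synthetic minority examples (smote() defined above).
    n_synth = int(len(X_min) * smote_pct / 100)
    X_min_new = np.vstack([X_min, smote(X_min, n_synth, rng=rng)])

    # Random under-sampling: keep only a fraction of the majority class.
    keep = rng.choice(len(X_maj), int(len(X_maj) * majority_keep), replace=False)
    X_maj_new = X_maj[keep]

    X_out = np.vstack([X_min_new, X_maj_new])
    y_out = np.concatenate([np.full(len(X_min_new), minority_label),
                            np.full(len(X_maj_new), 1 - minority_label)])
    return X_out, y_out
```

Only the training split should be resampled this way; the test set keeps its original, imbalanced distribution so that ROC/AUC estimates remain honest.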

Methods

This paper employs the following methods:

  • SMOTE
  • Under-sampling

Models Used

  • C4.5
  • Ripper
  • Naive Bayes
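C4.5, Ripper and Naive Bayes predate today's common toolkits; as a rough modern analogue, one might train a decision tree (an approximate stand-in for C4.5) and Gaussian Naive Bayes on the resampled data from the sketches above (scikit-learn has no direct Ripper equivalent). The dataset here is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB         # Naive Bayes, as in the paper
from sklearn.tree import DecisionTreeClassifier    # rough stand-in for C4.5

# Toy imbalanced problem standing in for the paper's datasets.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resample only the training split (smote_plus_undersample() defined above).
X_res, y_res = smote_plus_undersample(X_tr, y_tr, minority_label=1)

scores = {}
for name, model in [("decision_tree", DecisionTreeClassifier(criterion="entropy")),
                    ("naive_bayes", GaussianNB())]:
    model.fit(X_res, y_res)
    scores[name] = model.predict_proba(X_te)[:, 1]  # minority-class probabilities
```

The per-classifier probability scores feed the ROC/AUC evaluation sketched under "Evaluation Metrics" below.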

Datasets

The following datasets were used in this research:

  • Pima Indian Diabetes
  • Phoneme
  • Adult
  • Satimage
  • Forest Cover
  • Oil
  • Mammography
  • Can

Evaluation Metrics

  • AUC
  • ROC convex hull
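A minimal sketch of both metrics: AUC via scikit-learn, plus a simple upper-convex-hull pass over ROC points. The paper's ROC convex hull strategy is applied across the operating points of several classifiers; the helper below is my own simplified, single-curve illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def roc_upper_hull(fpr, tpr):
    """Upper convex hull of ROC points: drop any point that lies on or below
    a segment joining two other points (monotone-chain sketch)."""
    hull = []
    for p in sorted(zip(fpr, tpr)):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Pop the middle point if it does not make a strict right turn.
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# Example, using the scores dictionary from the classifier sketch above:
# auc = roc_auc_score(y_te, scores["naive_bayes"])
# fpr, tpr, _ = roc_curve(y_te, scores["naive_bayes"])
# hull_points = roc_upper_hull(fpr, tpr)
```

Each point retained on the hull corresponds to some operating condition (ratio of misclassification costs or class priors) under which that classifier setting is potentially optimal.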

Results

  • SMOTE combined with under-sampling yields higher AUC than under-sampling the majority class alone
  • The SMOTE-plus-under-sampling combination also outperforms varying the loss ratio in Ripper or the class priors in Naive Bayes
  • ROC convex hulls show that the SMOTE-based classifiers generally dominate over most of ROC space on the evaluated datasets

Limitations

The authors identified the following limitations:

  • The choice of nearest neighbours used for interpolation is fixed; varied or more principled selection methods are left for future work
  • Synthetic samples are generated without reference to the majority class, so they may fall inside majority-class regions and increase class overlap

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

SMOTE, Class imbalance, Over-sampling, Under-sampling, ROC curve, Imbalanced datasets
