Nitesh V. Chawla ([email protected]), Department of Computer Science and Engineering, ENB 118, University of South Florida, 4202 E. Fowler Ave., Tampa, FL 33620-5399, USA
Kevin W. Bowyer, Department of Computer Science and Engineering, 384 Fitzpatrick Hall, University of Notre Dame, Notre Dame, IN 46556, USA
Lawrence O. Hall ([email protected]), Department of Computer Science and Engineering, ENB 118, University of South Florida, 4202 E. Fowler Ave., Tampa, FL 33620-5399, USA
W. Philip Kegelmeyer, Sandia National Laboratories, Biosystems Research Department, P.O. Box 969, MS 9951, Livermore, CA 94551-0969, USA
(2002)
This paper presents SMOTE (Synthetic Minority Over-sampling Technique), a novel approach to learning from imbalanced datasets. A dataset is imbalanced when one class is heavily underrepresented relative to the others, which tends to produce classifiers that perform poorly on the minority class. The authors propose combining over-sampling of the minority class, done by generating synthetic examples, with under-sampling of the majority class. They run experiments with several classifiers, including C4.5, Ripper, and Naive Bayes, and evaluate the method using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull. The results show that combining SMOTE with under-sampling outperforms under-sampling alone, and also outperforms varying the loss ratios in Ripper and the class priors in Naive Bayes. The paper also discusses the limitations of existing methods and suggests that SMOTE can lead to better generalization performance for classifiers across application domains such as fraud detection, medical diagnosis, and text classification. Suggested future work includes refining the selection of nearest neighbors and exploring adaptive methods for creating synthetic samples.
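The synthetic over-sampling step described above can be illustrated with a minimal sketch. It assumes the usual formulation in which a new point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbors. The function name `smote`, its parameters, the default `k=5`, and the toy data are illustrative assumptions, not the authors' code. As a simplification, base points are drawn at random rather than by looping over every minority sample.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (minority class only).
    k: number of nearest minority-class neighbors to consider (assumed default).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbors than other samples

    synthetic = []
    for _ in range(n_synthetic):
        # Pick a minority sample as the base point (simplification: at random).
        i = rng.randrange(len(minority))
        base = minority[i]

        # Find its k nearest neighbors among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])

        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy data: four minority-class points in two dimensions.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because each synthetic point is interpolated rather than copied, the minority region is broadened instead of simply re-weighted, which is the intuition behind preferring this over replication. In the hybrid setup summarized above, the majority class would additionally be under-sampled before training.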
This paper employs the following methods:
The following datasets were used in this research:
The authors identified the following limitations: