TY - JOUR

T1 - Determining the optimal re-sampling strategy for a classification model with imbalanced data using design of experiments and response surface methodologies

AU - Tong, Lee-Ing

AU - Chang, Yung-Chia

AU - Lin, Shan Hui

PY - 2011/4

Y1 - 2011/4

N2 - Imbalanced data are common in many machine learning applications. In an imbalanced data set, the number of instances in at least one class is significantly higher or lower than that in other classes. Consequently, most classifiers trained on imbalanced data are exposed to an unequal number of instances in each class and thus fail to construct an effective model. Balancing sample sizes across classes using a re-sampling strategy is a conventional means of enhancing the effectiveness of a classification model for imbalanced data. Despite numerous attempts to determine the appropriate re-sampling proportion for each class by trial and error (Barandela, Valdovinos, Sánchez, & Ferri, 2004; He, Han, & Wang, 2005; Japkowicz, 2000; McCarthy, Zabar, & Weiss, 2005), finding the optimal strategy for each class may be infeasible with such a method. Therefore, this work proposes a novel analytical procedure to determine the optimal re-sampling strategy based on design of experiments (DOE) and response surface methodologies (RSM). The proposed procedure, S-RSM, can be used with any classifier; the C4.5 algorithm is adopted for illustration. The classification results are evaluated using the area under the receiver operating characteristic curve (AUC) as a performance measure. Desirable features of the AUC index include independence of the decision threshold and invariance to a priori class probabilities. Results on five real-world data sets demonstrate that the classification model built on training data obtained from S-RSM achieves a higher AUC score than models built using the oversampling or undersampling approach.

AB - Imbalanced data are common in many machine learning applications. In an imbalanced data set, the number of instances in at least one class is significantly higher or lower than that in other classes. Consequently, most classifiers trained on imbalanced data are exposed to an unequal number of instances in each class and thus fail to construct an effective model. Balancing sample sizes across classes using a re-sampling strategy is a conventional means of enhancing the effectiveness of a classification model for imbalanced data. Despite numerous attempts to determine the appropriate re-sampling proportion for each class by trial and error (Barandela, Valdovinos, Sánchez, & Ferri, 2004; He, Han, & Wang, 2005; Japkowicz, 2000; McCarthy, Zabar, & Weiss, 2005), finding the optimal strategy for each class may be infeasible with such a method. Therefore, this work proposes a novel analytical procedure to determine the optimal re-sampling strategy based on design of experiments (DOE) and response surface methodologies (RSM). The proposed procedure, S-RSM, can be used with any classifier; the C4.5 algorithm is adopted for illustration. The classification results are evaluated using the area under the receiver operating characteristic curve (AUC) as a performance measure. Desirable features of the AUC index include independence of the decision threshold and invariance to a priori class probabilities. Results on five real-world data sets demonstrate that the classification model built on training data obtained from S-RSM achieves a higher AUC score than models built using the oversampling or undersampling approach.

KW - Classifier

KW - Design of experiments

KW - Imbalanced data

KW - Machine learning

KW - Re-sampling strategy

KW - Response surface methodologies

KW - The area under ROC curve

UR - http://www.scopus.com/inward/record.url?scp=78650689032&partnerID=8YFLogxK

U2 - 10.1016/j.eswa.2010.09.087

DO - 10.1016/j.eswa.2010.09.087

M3 - Article

AN - SCOPUS:78650689032

SN - 0957-4174

VL - 38

SP - 4222

EP - 4227

JO - Expert Systems with Applications

JF - Expert Systems with Applications

IS - 4

ER -