Objective: Atrial Fibrillation (AF) is one of the important public health problems with elevated comorbidity, advanced mortality risk, and increasing healthcare costs. In this study, the objective is to explore and resolve the imbalanced class problem for the prediction of AF in obese individuals and to compare the predictive results of balanced and imbalanced datasets by several data mining approaches.
Materials and Methods: The retrospective study contained 362 successive obese individuals undergoing Coronary Artery Bypass Grafting (CABG) operation at the cardiovascular surgery clinic. AF developed postoperatively (AF Group) in 42 of the patients, whereas AF did not develop (non-AF Group) in 320 individuals. The Synthetic Minority Over-sampling Technique (SMOTE) was performed to balance the distribution of the target variable (AF/non-AF groups). The LogitBoost and GLMBoost ensemble approaches were constructed with 10-fold cross validation.
Results: After applying SMOTE algorithm, the number of subjects in AF and non-AF was almost balanced (336 in AF and 320 in non-AF groups). The values of accuracy were 0.8812 (0.8433-0.9127) for GLMBoost and 0.9144 (0.8806-0.9411) for LogitBoost on the imbalanced dataset, and 0.8247 (0.7934-0.853) for GLMBoost and 0.9695 (0.9533-0.9813) for LogitBoost on the balanced dataset by SMOTE. The values of the area under the receiver operating curve for GLMBoost and LogitBoost were 0.5088 (0.485-0.5325) and 0.6827 (0.608-0.7573) on imbalanced dataset, and were 0.8259 (0.7971-0.8546) and 0.9696 (0.9564-0.9827) on balanced dataset, respectively.
Conclusions: The predicted results indicated that LogitBoost on the balanced dataset by SMOTE had the highest and most accurate values of performance metrics. Hence, SMOTE and other oversampling approaches may be beneficial to overcome class imbalance issues emerging in biomedical studies.