Biomedical Research


Handling imbalanced class problem for the prediction of atrial fibrillation in obese patient

Cengiz Colak M1, Erol Karaaslan2, Cemil Colak3, Ahmet Kadir Arslan3* and Nevzat Erdil1

1Department of Cardiovascular Surgery, Faculty of Medicine, Inonu University, Malatya, Turkey

2Department of Anaesthesiology and Reanimation, Malatya State Hospital, Malatya, Turkey

3Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, Malatya, Turkey

*Corresponding Author:
Ahmet Kadir Arslan
Faculty of Medicine
Department of Biostatistics and Medical Informatics, Malatya, Turkey

Accepted on January 5, 2017


Abstract

Objective: Atrial Fibrillation (AF) is one of the important public health problems with elevated comorbidity, advanced mortality risk, and increasing healthcare costs. In this study, the objective is to explore and resolve the imbalanced class problem for the prediction of AF in obese individuals and to compare the predictive results of balanced and imbalanced datasets by several data mining approaches.

Materials and Methods: This retrospective study included 362 consecutive obese individuals who underwent Coronary Artery Bypass Grafting (CABG) at the cardiovascular surgery clinic. AF developed postoperatively in 42 patients (AF group) and did not develop in 320 patients (non-AF group). The Synthetic Minority Over-sampling Technique (SMOTE) was applied to balance the distribution of the target variable (AF/non-AF groups). The LogitBoost and GLMBoost ensemble approaches were evaluated with 10-fold cross validation.

Results: After applying the SMOTE algorithm, the numbers of subjects in the AF and non-AF groups were almost balanced (336 and 320, respectively). Accuracy was 0.8812 (0.8433-0.9127) for GLMBoost and 0.9144 (0.8806-0.9411) for LogitBoost on the imbalanced dataset, and 0.8247 (0.7934-0.853) for GLMBoost and 0.9695 (0.9533-0.9813) for LogitBoost on the dataset balanced by SMOTE. The area under the receiver operating characteristic curve for GLMBoost and LogitBoost was 0.5088 (0.485-0.5325) and 0.6827 (0.608-0.7573) on the imbalanced dataset, and 0.8259 (0.7971-0.8546) and 0.9696 (0.9564-0.9827) on the balanced dataset, respectively.

Conclusions: The results indicated that LogitBoost on the dataset balanced by SMOTE achieved the highest values of the performance metrics. Hence, SMOTE and other oversampling approaches may be beneficial for overcoming class imbalance issues in biomedical studies.

Keywords

Imbalanced dataset classification, Atrial fibrillation, GLMBoost, LogitBoost, Synthetic minority over-sampling technique.

Introduction

Atrial Fibrillation (AF) is an important public health problem associated with elevated comorbidity, increased mortality risk, and rising healthcare costs. It is one of the most common and clinically important cardiac arrhythmias. The reasons behind the increasing prevalence of AF have not been fully determined; however, they may be related to improved detection, increasing incidence, and enhanced survival in cardiovascular patients. AF is strongly related to advancing age, affecting about 1 in 25 people above 60 years and 1 in 10 above 80 years. AF increases morbidity and mortality and raises the risk of death, Congestive Heart Failure (CHF), stroke, and other related diseases. Significant risk factors for the development of AF include advanced age, hypertension, smoking, alcohol consumption, obesity, prevalent myocardial infarction, CHF, and diabetes mellitus, among others, as described in detail in [1-3].

Obesity is an important public health problem affecting both children and adults worldwide, and it carries substantial health risks. Obesity is associated with the development and progression of AF, and it can be reduced and prevented through effective public health programs [4].

The Knowledge Discovery Process (KDP) concerns extracting higher-level understanding and knowledge from databases. The methods used in KDP rely on knowledge-intensive stages and can often benefit from supplementary knowledge drawn from different data sources [5]. In biomedical research, KDP has been applied in various areas of medicine and has attracted considerable interest [6,7].

Imbalanced classification is an important topic in the knowledge discovery process and data mining. In a two-class problem, the data are called imbalanced when one class (the minority class) has far fewer instances than the other (the majority class). Class imbalance causes difficulties for data mining algorithms that assume an almost equal class distribution; consequently, minority class instances are frequently misclassified [8,9].

The primary aim of this study is to explore and resolve the imbalanced class problem for the prediction of AF in obese individuals. The secondary goal of the study is to compare the predictive results of balanced and imbalanced datasets by implementing several data mining approaches.

Materials and Methods

Study design and data

The current retrospective study included 362 consecutive obese individuals who underwent Coronary Artery Bypass Grafting (CABG) at the cardiovascular surgery clinic of Turgut Ozal Medical Center, Inonu University, Malatya, Turkey between January 2012 and December 2015. AF developed postoperatively in 42 patients (AF group) and did not develop in 320 patients (non-AF group). The exclusion criteria for this study were past atrial arrhythmia, requirement for extra procedures, left ventricle and renal dysfunctions, and chronic obstructive pulmonary disease. Table 1 lists the target and predictor variables.

| Attribute name | Abbreviation | Attribute type | Explanations | Role |
|---|---|---|---|---|
| Atrial Fibrillation | AF | Categorical | Present/absent | Target |
| Age | - | Numerical | Natural number | Input |
| Gender | - | Categorical | Female/male | Input |
| Smoking | - | Categorical | Present/absent | Input |
| Alcohol Consumption | AC | Categorical | Present/absent | Input |
| Diabetes Mellitus | DM | Categorical | Present/absent | Input |
| Hypertension | HT | Categorical | Present/absent | Input |
| Chronic Obstructive Pulmonary Disease | COPD | Categorical | Present/absent | Input |
| History of Myocardial Infarction | HMI | Categorical | Present/absent | Input |
| Rhythm | - | Categorical | Present/absent | Input |
| Emergency Operation | EO | Categorical | Present/absent | Input |
| Heart Palpitations | HP | Categorical | Present/absent | Input |
| Early Mortality | EM | Categorical | Present/absent | Input |
| Body Surface Area | BSA | Numerical | Positive integer | Input |
| Blood Urea Nitrogen | BUN | Numerical | Positive integer | Input |
| Creatinine | CR | Numerical | Positive real number | Input |
| Haemoglobin | HB | Numerical | Positive real number | Input |
| Haematocrit | HCT | Numerical | Positive real number | Input |
| Platelets | PLT | Numerical | Positive integer | Input |
| Blood Sugar Concentration | BSC | Numerical | Positive integer | Input |
| Cholesterol | CHL | Numerical | Positive integer | Input |
| Low Density Lipoprotein | LDL | Numerical | Positive integer | Input |
| High Density Lipoprotein | HDL | Numerical | Positive integer | Input |
| Triglyceride | TG | Numerical | Positive integer | Input |
| Ventilation Time | VT | Numerical | Positive integer | Input |
| Intensive Care Unit Hospitalization Time | ICUHT | Numerical | Positive integer | Input |

Table 1. The definition of the variables employed in the current study.

Outlier detection

Local Density Cluster-Based Outlier Factor (LDCOF) was used to explore outlier observations. This approach utilizes X-means clustering algorithm which determines heuristically the number of clusters [10,11]. In this study, no outlier observations were detected according to the results of LDCOF analysis.

Synthetic minority over-sampling technique (SMOTE)

The Synthetic Minority Over-sampling Technique (SMOTE) [12] is an oversampling method that generates synthetic minority class instances. It is commonly employed for class imbalance problems and can produce better results than simple oversampling techniques. SMOTE is a useful and powerful technique that has been used successfully in many medical applications. In this algorithm, the artificial data are created in the attribute (feature) space [13,14].

SMOTE is employed to obtain an artificial class-balanced or almost class-balanced dataset. To generate the new synthetic samples, the over-sampling and under-sampling percentages were set to 700% and 109%, respectively, and the number of nearest neighbours was set to k=5. The DMwR package [15] of R was employed to implement SMOTE.
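The interpolation step behind SMOTE can be illustrated with a minimal Python sketch. This is not the DMwR implementation used in the study: the function name `smote`, the toy feature values, and the restriction to numeric attributes are all assumptions made for illustration. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours.

```python
import math
import random

def smote(minority, perc_over=700, k=5, seed=0):
    """Return synthetic points interpolated between minority samples and their neighbours."""
    rng = random.Random(seed)
    n_new_per_sample = perc_over // 100  # 700% -> 7 synthetic points per original sample
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority-class neighbours of x (Euclidean distance, excluding x itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_new_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # random position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Toy minority class with two numeric features (hypothetical values, not study data).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 2.4], [2.2, 1.9]]
synthetic = smote(minority, perc_over=700, k=5)
print(len(synthetic))  # 6 samples x 7 new points each = 42
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing records.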

LogitBoost

Boosting was originally proposed to combine many weak learners in order to increase prediction performance. LogitBoost, first presented in [16], is a boosting algorithm commonly employed in many areas [17]. LogitBoost uses a log-likelihood loss function, which diminishes the sensitivity to extreme and outlier observations. It can also perform classification and regression by combining a number of weak learners into a powerful and robust learner [18]. Details of this technique can be found in [19].
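As a rough illustration of how weak learners are combined, the following Python sketch follows the standard two-class LogitBoost scheme with regression stumps as weak learners. It is a simplified sketch, not the implementation used in the study: the function names, the toy data, the number of rounds, and the clipping of the working response to [-4, 4] (a common numerical safeguard) are assumptions.

```python
import math

def fit_stump(X, z, w):
    """Weighted least-squares regression stump: (feature, threshold, left value, right value)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            cl = sum(w[i] * z[i] for i in left) / sum(w[i] for i in left)
            cr = sum(w[i] * z[i] for i in right) / sum(w[i] for i in right)
            err = sum(w[i] * (z[i] - (cl if X[i][j] <= t else cr)) ** 2 for i in range(len(X)))
            if best is None or err < best[0]:
                best = (err, j, t, cl, cr)
    return best[1:]

def logitboost(X, y, rounds=20):
    """y in {0, 1}; returns the list of fitted stumps."""
    F = [0.0] * len(X)
    stumps = []
    for _ in range(rounds):
        p = [1 / (1 + math.exp(-2 * f)) for f in F]        # current class-1 probability
        w = [max(pi * (1 - pi), 1e-10) for pi in p]        # working weights
        z = [max(-4.0, min(4.0, (yi - pi) / wi))           # clipped working response
             for yi, pi, wi in zip(y, p, w)]
        j, t, cl, cr = fit_stump(X, z, w)
        stumps.append((j, t, cl, cr))
        F = [f + 0.5 * (cl if row[j] <= t else cr) for f, row in zip(F, X)]
    return stumps

def predict(stumps, row):
    F = sum(0.5 * (cl if row[j] <= t else cr) for j, t, cl, cr in stumps)
    return 1 if F > 0 else 0

# Toy data (hypothetical): the first feature is noise, the second separates the classes.
X = [[0.2, 1.0], [0.9, 2.0], [0.4, 3.0], [0.7, 6.0], [0.1, 7.0], [0.8, 8.0]]
y = [0, 0, 0, 1, 1, 1]
stumps = logitboost(X, y)
print([predict(stumps, row) for row in X])
```

Each round refits a stump to a weighted working response, so observations that are still poorly fitted receive more influence in the next round.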

GLMBoost

GLMBoost is a boosting approach based on the penalized log-likelihood. As an ensemble learning algorithm, it is widely applicable to regression and classification problems. Its practical advantages include ease of calculation, high computational capacity, and no need for complex tuning procedures [20]. More detailed information on this ensemble learning algorithm can be found in [21].
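One common way to realise this kind of model is component-wise gradient boosting with linear base learners: in each round every predictor is fitted separately to the negative gradient of the loss, and only the best-fitting one has its coefficient updated by a small step. The Python sketch below illustrates that idea for a binomial loss. It assumes, rather than reproduces, the behaviour of the software used in the study; the function names, the step size `nu`, the number of rounds, and the toy data are hypothetical.

```python
import math

def glmboost(X, y, rounds=200, nu=0.1):
    """Component-wise boosting sketch; y in {-1, +1}. Returns (offset, means, coefficients)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # centred predictors
    pos = sum(1 for yi in y if yi == 1) / n
    offset = 0.5 * math.log(pos / (1 - pos))                   # half the log-odds of the class prior
    f = [offset] * n
    coef = [0.0] * p
    for _ in range(rounds):
        # Negative gradient of the loss log(1 + exp(-2 * y * f)) with respect to f
        u = [2 * yi / (1 + math.exp(2 * yi * fi)) for yi, fi in zip(y, f)]
        best = None
        for j in range(p):
            sxx = sum(row[j] ** 2 for row in Xc)
            beta = sum(row[j] * ui for row, ui in zip(Xc, u)) / sxx
            rss = sum((ui - beta * row[j]) ** 2 for row, ui in zip(Xc, u))
            if best is None or rss < best[0]:
                best = (rss, j, beta)
        _, j, beta = best
        coef[j] += nu * beta                                    # update only the selected predictor
        f = [fi + nu * beta * row[j] for fi, row in zip(f, Xc)]
    return offset, means, coef

def predict(model, row):
    offset, means, coef = model
    f = offset + sum(c * (x - m) for c, x, m in zip(coef, row, means))
    return 1 if f > 0 else -1

# Toy data (hypothetical): the first feature is noise, the second carries the signal.
X = [[0.2, 1.0], [0.9, 2.0], [0.4, 3.0], [0.7, 6.0], [0.1, 7.0], [0.8, 8.0]]
y = [-1, -1, -1, 1, 1, 1]
model = glmboost(X, y)
print([predict(model, row) for row in X])
```

Because only one coefficient moves per round and each move is shrunk by `nu`, stopping after a limited number of rounds acts as a form of regularisation and leaves uninformative predictors with small or zero coefficients.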

Modelling and performance metrics

In this study, 10-fold cross validation was employed to evaluate the models and to reduce bias in the performance estimates. LogitBoost and GLMBoost were used to compare classification performance on the balanced and imbalanced datasets. All modelling and validation processes were carried out with the caret package of R [22].
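The partitioning behind 10-fold cross validation can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `k_fold_indices` and the seed are hypothetical, and the fold construction in caret may differ (for example, by stratifying folds on the outcome). Each observation is used for testing exactly once and for training in the other nine rounds.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# 362 observations, as in the imbalanced dataset of this study.
splits = list(k_fold_indices(362, k=10))
print([len(test) for _, test in splits])
```

The reported metric is then typically the average (or pooled) result over the ten held-out folds.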

In this study, the assessment of the models was performed using accuracy, kappa (κ), Area Under Curve (AUC) of Receiver Operating Characteristic (ROC), recall and precision. These metrics are explained below [23]:

Accuracy=(TP+TN)/(TP+TN+FP+FN)

κ=Kappa=(p0-pe)/(1-pe)

Recall=TP/(FN+TP)

Precision=TP/(FP+TP)

where TP is the number of true positives, TN the number of true negatives, FN the number of false negatives, FP the number of false positives, p0 the relative observed agreement, and pe the hypothetical probability of chance agreement.
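The formulas above can be checked numerically with a short Python sketch. The confusion-matrix counts used here are not reported in the source; they are an assumption, back-calculated so that accuracy and kappa agree with the values reported for LogitBoost on the imbalanced dataset in Table 3 (0.9144 and 0.4667), with AF taken as the positive class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, kappa, recall and precision from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n                      # also the observed agreement p0
    recall = tp / (fn + tp)
    precision = tp / (fp + tp)
    # Chance agreement pe: product of the marginal proportions, summed over both classes
    pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (accuracy - pe) / (1 - pe)
    return {"accuracy": accuracy, "kappa": kappa, "recall": recall, "precision": precision}

# Assumed counts (42 AF and 320 non-AF patients in total); not taken from the source.
m = classification_metrics(tp=16, fp=5, fn=26, tn=315)
print({name: round(value, 4) for name, value in m.items()})
# {'accuracy': 0.9144, 'kappa': 0.4667, 'recall': 0.381, 'precision': 0.7619}
```

Under these assumed counts, the definitions above give a recall of about 0.381 and a precision of about 0.762, which is the reverse of the column labels in Table 3; this suggests that the Recall and Precision columns there may have been interchanged.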

Results

In the current study, there were 320 (88%) subjects in the non-AF group and 42 (12%) subjects in the AF group. When the gender distribution of the study was considered, 170 (47%) subjects were females, and 192 (53%) subjects were males. The mean ages with standard deviations were 59.7 ± 9.4 years and 64 ± 6.4 years for non-AF and AF groups, respectively. Table 2 summarizes the distribution of the target variable before and after SMOTE.

| The target variable distribution | AF | Non-AF | Total |
|---|---|---|---|
| Before performing SMOTE | 42 | 320 | 362 |
| After performing SMOTE | 336 | 320 | 656 |

Table 2. Summary of the target distribution before and after SMOTE.
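The class sizes after SMOTE follow from the 700% and 109% settings given in the Methods. The Python sketch below reproduces the arithmetic under an assumption about how those percentages are interpreted (as we understand the DMwR convention): each minority case yields over/100 synthetic cases, and the number of majority cases retained is under/100 times the number of synthetic cases, truncated to an integer. The function name is hypothetical.

```python
def smote_class_counts(n_minority, perc_over, perc_under):
    """Return (minority size, majority size) after over- and under-sampling."""
    n_synthetic = n_minority * perc_over // 100        # 42 * 7 = 294 synthetic AF cases
    n_majority_kept = perc_under * n_synthetic // 100  # 109% of 294, truncated
    return n_minority + n_synthetic, n_majority_kept

af, non_af = smote_class_counts(42, 700, 109)
print(af, non_af, af + non_af)  # 336 320 656
```

With these settings the 42 original AF cases plus 294 synthetic ones give 336, and 109% of 294 truncates to 320, matching Table 2.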

Table 3 indicates the performance metric results of the imbalanced and balanced datasets according to the classification models of LogitBoost and GLMBoost. The performance metrics with 95% Confidence Interval (CI) values are also calculated for accuracy, kappa, AUC of ROC, recall and precision (Table 3). According to Table 3, all of the performance metrics for LogitBoost model were higher than those for GLMBoost model on the imbalanced dataset. Similarly, for the balanced dataset by SMOTE, all of the performance metrics results of LogitBoost model outperformed the results of GLMBoost model. These findings clearly demonstrate that over-sampling with SMOTE ameliorates the deficits raised by the imbalanced structure of the dataset.

Performance metric (95% CI)

| Dataset | Model | Accuracy | AUC | Kappa | Recall | Precision |
|---|---|---|---|---|---|---|
| Imbalanced | GLMBoost | 0.8812 (0.8433-0.9127) | 0.5088 (0.485-0.5325) | 0.0294 (-0.048-0.107) | 0.3333 (0.0084-0.9057) | 0.0238 (0.0006-0.1257) |
| Imbalanced | LogitBoost | 0.9144 (0.8806-0.9411) | 0.6827 (0.608-0.7573) | 0.4667 (0.311-0.623) | 0.7619 (0.5283-0.9178) | 0.3810 (0.2357-0.5436) |
| Balanced by SMOTE | GLMBoost | 0.8247 (0.7934-0.853) | 0.8259 (0.7971-0.8546) | 0.6501 (0.592-0.708) | 0.8671 (0.8235-0.9033) | 0.7768 (0.7284-0.8202) |
| Balanced by SMOTE | LogitBoost | 0.9695 (0.9533-0.9813) | 0.9696 (0.9564-0.9827) | 0.939 (0.913-0.965) | 0.9731 (0.9495-0.9876) | 0.9673 (0.9422-0.9835) |

Table 3. The performance metric results of the imbalanced and balanced datasets according to the classification models.

Figure 1 shows the performance metric results of the imbalanced and balanced datasets according to the classification models of LogitBoost and GLMBoost. Similarly, Figure 2 displays the comparative results of performance metrics on the balanced datasets according to LogitBoost and GLMBoost. As Figure 1 shows, almost all performance metrics for the LogitBoost and GLMBoost models are higher on the balanced dataset than on the imbalanced dataset. As Figure 2 shows, the LogitBoost model outperforms the GLMBoost model on all performance metrics.


Figure 1. The performance metric results of the imbalanced and balanced datasets according to the classification models of LogitBoost and GLMBoost.


Figure 2. The comparative results of performance metrics on the balanced datasets according to LogitBoost and GLMBoost.

ROC curves of the LogitBoost and GLMBoost models for the imbalanced and balanced datasets are depicted in Figure 3. According to this figure, each model has more classification power on the balanced dataset than on the imbalanced dataset.


Figure 3. ROC curves of GLMBoost and LogitBoost models for the imbalanced/balanced datasets.

Table 4 presents the predictor importance of the LogitBoost model on the dataset balanced by SMOTE, in descending order. According to Table 4, the highest predictor importance belongs to hospitalization time in the intensive care unit.

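| Predictor | Predictor importance |
|---|---|
| ICUHT | 0.9011 |
| VT | 0.6455 |
| Age | 0.6448 |
| PLT | 0.6337 |
| TG | 0.6279 |
| CHL | 0.6076 |
| BSA | 0.5659 |
| LDL | 0.5539 |
| Gender | 0.5443 |
| BUN | 0.5302 |
| HCT | 0.5251 |
| Rhythm | 0.5222 |
| EO | 0.5196 |
| COPD | 0.5181 |
| BSC | 0.511 |
| EM | 0.5097 |
| HB | 0.5052 |
| AC | 0.502 |
| HMI | 0.5012 |
| HDL | 0.4998 |
| HT | 0.4851 |
| Smoking | 0.4835 |
| DM | 0.4767 |
| CR | 0.4596 |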

Table 4. Predictor importance of the most accurate model.

Discussion

This research investigates and addresses the imbalanced class problem with the well-known SMOTE approach for the prediction of AF in obese individuals. Since postoperative AF after cardiac surgery is rarely observed, the imbalanced class problem can arise in biomedical research. As expected, the dataset used in the analysis was highly imbalanced, and the minority class was the AF group (11.6% for the AF group versus 88.4% for the non-AF group). To cope with this problem, we applied the SMOTE oversampling method, and afterwards GLMBoost and LogitBoost models were built to compare the predictive results before and after SMOTE. In the first stage, when we applied the classification models to the imbalanced dataset, the accuracy of each model was high. However, the values of the other performance metrics were considerably lower than expected. For instance, the AUC value (0.5088) of the GLMBoost model shows that it was not capable of discriminating between the AF and non-AF groups, owing to the imbalanced class problem encountered in this study. Similarly, although the AUC value (0.6827) of LogitBoost was somewhat higher than that of GLMBoost, it was quite low relative to the accuracy of LogitBoost. When only the accuracies of the models are evaluated on imbalanced data, the results can be very misleading. For this reason, performance evaluation of the constructed models should be based on several performance metrics (e.g. AUC, recall and precision), as suggested and implemented in the present study [24].
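A short calculation makes the point about misleading accuracy concrete. The Python sketch below evaluates a trivial rule that labels every patient as non-AF, using the class sizes of the imbalanced dataset; the function name is hypothetical, and the rule is an illustration rather than a model fitted in the study.

```python
def majority_baseline(n_minority, n_majority):
    """Metrics of a rule that always predicts the majority class."""
    n = n_minority + n_majority
    return {
        "accuracy": n_majority / n,  # every majority case is classified correctly
        "recall": 0 / n_minority,    # no minority (AF) case is ever detected
    }

baseline = majority_baseline(n_minority=42, n_majority=320)
print(round(baseline["accuracy"], 4), baseline["recall"])  # 0.884 0.0
```

The rule reaches an accuracy of about 0.884 without identifying a single AF case, which is slightly above the accuracy of 0.8812 reported for GLMBoost on the imbalanced dataset.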

Postoperative AF is one of the most prevalent complications following cardiac surgery, and it leads to higher rates of morbidity and mortality in patients undergoing cardiac surgery [25]. Important risk factors for postoperative AF can be estimated by predictive models in order to prevent the development or progression of this complication. In the present research, the most important predictor of postoperative AF was hospitalization time in the intensive care unit (ICUHT), which is associated with this complication [26]. Previous work [27] has reported Ventilation Time (VT) as another factor for postoperative AF; accordingly, VT was the second most important factor in our model. The third most important factor was age, which is an independent factor of the arrhythmia [28]. The other predictors associated with postoperative AF, as determined by the selected model, are listed in Table 4. The predictor variables determined in this work were largely analogous to the risk factors reported by other studies examining the prediction of AF after coronary artery bypass surgery [29-33].

Conclusion

In conclusion, the results of this research indicated that LogitBoost on the dataset balanced by SMOTE achieved the highest values of the performance metrics. Our results suggest that SMOTE and other oversampling approaches can be beneficial for overcoming class imbalance issues in biomedical studies. In future research, other sampling techniques combined with different ensemble and meta-learning algorithms are planned to handle imbalanced class problems in multi-category classification.

Acknowledgement

We would like to thank the Inonu University Scientific Research Coordination Unit for supporting this study with a grant (Project number: 2016/61).

References