Date Published: July 28, 2017
Publisher: Public Library of Science
Author(s): Jinyan Li, Lian-sheng Liu, Simon Fong, Raymond K. Wong, Sabah Mohammed, Jinan Fiaidhi, Yunsick Sung, Kelvin K. L. Wong, Quan Zou.
Clinical data analysis and forecasting have made substantial contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced class distributions. In this paper, we aim to formulate effective methods to rebalance imbalanced binary datasets in which the positive samples form the minority. We investigate two meta-heuristic algorithms, particle swarm optimization (PSO) and the bat algorithm (BA), and apply them to enhance the synthetic minority over-sampling technique (SMOTE) used to pre-process the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former approach do not scale to larger datasets. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets, where the first approach is no longer viable. We also find the segment-wise approach more consistent with how typical large imbalanced medical datasets arise in practice. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible classifier performance and shorter run times than the brute-force method.
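To make the pre-processing step concrete, the following is a minimal sketch of plain SMOTE, not the authors' implementation. The text above does not name the two SMOTE parameters being optimized; the sketch assumes they are the oversampling percentage (`n_percent`) and the number of nearest neighbours (`k`) from the standard SMOTE formulation. The function name `smote` and its signature are hypothetical.

```python
import math
import random


def smote(minority, n_percent=200, k=5, seed=0):
    """Return synthetic minority samples (simplified standard SMOTE).

    minority  : list of feature vectors (lists of floats), minority class only
    n_percent : oversampling amount in percent; 200 means two synthetic
                samples per original sample (assumed to be a multiple of 100)
    k         : number of nearest minority neighbours to interpolate towards
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    per_sample = n_percent // 100
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for i, x in enumerate(minority):
        # Rank the other minority samples by Euclidean distance to x.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position on the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Usage: four minority points, 200% oversampling, 2 neighbours each.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(points, n_percent=200, k=2)
```

Under these assumptions, a meta-heuristic such as PSO or BA would search over the pair (`n_percent`, `k`) rather than trying every combination by brute force.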
Developments in the medical field, such as hospital informatization, advances in treatment, and the extensive use of high-throughput equipment, have driven geometric growth in the attention paid to Big Data. It has become desirable to improve the efficiency, accuracy and quality of medical data processing. The sources of health data include clinical medical treatments, pharmaceutical companies, medical research, medical assistance applications, and more. Existing datasets provide important medical and health information for research topics such as understanding human genetic and disease systems, medical and biological imaging, and classification and prediction in medical engineering.
In recent years, more and more researchers from different fields have begun to focus on imbalanced dataset research. This research can be considered as operating at two levels: the first concerns methods of data modification and optimization, and the second concerns improvement of the algorithms.
Differences in the sources and formats of datasets add complexity. In this paper, the health and medical datasets are divided into two groups according to their size, and each group is processed by one of two methods: the Swarm Balancing Algorithms or the Adaptive Swarm Balancing Algorithms. Two experiments are performed accordingly. The experimental results indicate that the first method is better suited to relatively small datasets but is no longer viable when the dataset is relatively large. As mentioned above, big data is common in the health care field, as is the imbalanced classification problem. The latter method was therefore proposed to handle large, highly imbalanced datasets.
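The idea of processing a large dataset one segment at a time can be sketched roughly as follows. This only illustrates the control flow and is not the Adaptive Swarm Balancing Algorithms themselves: `segments`, `duplicate_minority` and `process_stream` are hypothetical names, and the per-segment rebalancing step is plain random duplication of minority rows, standing in for the swarm-optimized SMOTE described in the paper.

```python
import random


def segments(rows, size):
    """Yield consecutive chunks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def duplicate_minority(chunk, rng):
    """Stand-in rebalancer (assumption): duplicate positive rows at random
    until both classes in the chunk have the same count.
    Each row is a (features, label) pair, with label 1 = minority class."""
    pos = [r for r in chunk if r[1] == 1]
    neg = [r for r in chunk if r[1] == 0]
    if not pos or len(pos) >= len(neg):
        return list(chunk)  # nothing to balance in this chunk
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return list(chunk) + extra


def process_stream(rows, size, rebalance=duplicate_minority, seed=0):
    """Rebalance each segment independently and concatenate the results."""
    rng = random.Random(seed)
    balanced = []
    for chunk in segments(rows, size):
        balanced.extend(rebalance(chunk, rng))
    return balanced


# Usage: 20 rows, every fifth row positive, processed 10 rows at a time.
data = [([float(i)], 1 if i % 5 == 0 else 0) for i in range(20)]
result = process_stream(data, size=10)
```

Because each segment is handled on its own, the cost of any per-segment parameter search depends on the segment size rather than on the size of the full dataset, which is the property the segment-wise approach relies on.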
Our methods are effective on the imbalanced dataset classification problem across different dataset sizes. Meta-heuristic algorithms can blindly select the parameters of SMOTE to obtain a relatively high accuracy with a Kappa value that falls within the credible range. To account for differences in dataset size, we used two methods: one to improve the processing of normal-size imbalanced datasets and one for large imbalanced datasets. The experiments indicate that the Swarm Balancing Algorithms are more suitable for small datasets, and that if a big dataset is treated as a data feed, the Adaptive Swarm Balancing Algorithms solve the imbalance problem faster and better. On the small- and normal-size datasets, PSO was better than BA on every aspect assessed when evaluated with the neural network classification algorithm. On large datasets, however, PSO is still faster than BA in search time, but the other important performance measures are better with BA than with PSO. The Adaptive Swarm Balancing Algorithms operate more like a process of constant iteration and learning, which better matches the actual situation of health and medical datasets. Because the number of diagnosed cases increases daily, the gradual accumulation of cases causes the dataset to grow into a large dataset that needs to be processed as a data feed. The Adaptive Swarm Balancing Algorithms can therefore effectively solve the imbalanced data classification problem in the large datasets typically found in the health and medical field. These methods will help the classifier to accurately classify and identify patient data.
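Since the conclusions rely on the Kappa value alongside accuracy, a short sketch of Cohen's Kappa for a binary confusion matrix may help. The function name `cohen_kappa` and its argument order are illustrative choices, and the threshold that defines the "credible range" is not stated in this text, so none is assumed here.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa for a 2x2 confusion matrix.

    tp, fp, fn, tn: counts of true positives, false positives,
    false negatives and true negatives.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Observed agreement is the ordinary accuracy.
    observed = (tp + tn) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    if expected == 1.0:
        return 0.0  # degenerate case: only one class is present and predicted
    return (observed - expected) / (1.0 - expected)


# Usage: 100 samples, 5 positive, and a classifier that predicts "negative"
# for every sample.
accuracy = (0 + 95) / 100
kappa = cohen_kappa(tp=0, fp=0, fn=5, tn=95)
```

In this usage example the accuracy is 0.95 while Kappa is 0, which shows why accuracy alone can be misleading on a highly imbalanced dataset and why a chance-corrected measure is reported with it.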