Date Published: December 11, 2009
Publisher: Public Library of Science
Author(s): Qingzhong Liu, Andrew H. Sung, Zhongxue Chen, Jianzhong Liu, Xudong Huang, Youping Deng, Raya Khanin
DOI: http://doi.org/10.1371/journal.pone.0008250
Abstract: Microarray datasets have a very large number of variables (genes) but typically only a small number of samples, which makes their analysis both interesting and challenging. In tasks such as predicting gene-disease associations, feature selection is essential because it handles the high dimensionality by exploiting the information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can significantly reduce cost while maintaining or improving the classification or prediction accuracy of the learning machines employed to analyze the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning with statistical similarity measures. We compare our method with the following gene selection methods:

- Support Vector Machine Recursive Feature Elimination (SVMRFE)
- Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS)
- Gradient-based Leave-one-out Gene Selection (GLGS)

To evaluate the performance of these gene selection methods, we apply several popular learning classifiers to the MicroArray Quality Control phase II predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that the choice of gene selection method is tightly paired with the learning classifier. Overall, our approach outperforms the other methods compared. Biological functional analysis based on the MAQC-II breast cancer dataset supports applying our method to phenotype prediction. Additionally, learning classifiers play an important role in the classification of microarray data, and our experimental results indicate that the Nearest Mean Scale Classifier (NMSC) is a good choice owing to its prediction reliability and its stability across three performance measurements: testing accuracy, MCC values, and AUC errors.
Partial Text: Using microarray techniques, researchers can measure the expression levels of tens of thousands of genes in a single experiment. This ability allows scientists to investigate the functional relationships between genes and the cellular and physiological processes of biological organisms at a genome-wide level. Preprocessing of raw microarray data consists of background correction, normalization, and summarization. After preprocessing, a high-level analysis, such as gene selection, classification, or clustering, is applied to profile the gene expression patterns. Within high-level analysis, partitioning genes into closely related groups across time and classifying patients into different health statuses based on selected gene signatures have become the two main tracks of microarray data analysis over the past decade. Various standards related to systems biology are discussed by Brazma et al. When sample sizes are substantially smaller than the number of features or genes, statistical modeling and inference become challenging, as the familiar "large p, small n" problem arises. Designing feature selection methods that lead to reliable and accurate predictions by learning classifiers is therefore an issue of great theoretical as well as practical importance in high-dimensional data analysis.
Due to the huge number of variables and the small sample size, microarray data exhibit complicated interactions and relations among genes as well as highly redundant information. Selecting predictive models, which depend on both the chosen features and the employed classifiers, is extremely important for the classification of microarray data and for further biological function analysis and validation. Machine learning and data mining techniques provide a powerful approach to studying the relationships among genes. Based on supervised learning and similarity measurements, we propose Recursive Feature Addition (RFA): we recursively employ supervised learning to obtain the highest training accuracy, then add a subsequent gene chosen by its similarity to the already-selected features so as to minimize redundancy within the feature set. We believe this RFA method captures more informative and differentially expressed genes than other methods. Experimental comparisons are performed on two MAQC-II microarray datasets, breast cancer and multiple myeloma. Our studies show that the method of gene selection is tightly paired with the learning classifier, which determines the final predictive model from the training data; in other words, the best classification models under different learning classifiers are associated with different gene selection methods. Averaged over several popular learning classifiers, including NMSC, NBC, SVM, and UDC, the best gene selection method is RFA, followed by GLGS, LOOCSFS, and SVMRFE. Among the compared learning classifiers, NMSC outperforms the others with respect to testing performance, stability, and consistency.
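The RFA procedure described above — greedily adding the gene that maximizes training accuracy, then breaking ties by picking the candidate least similar to the genes already chosen — can be sketched as follows. This is a minimal illustration, not the authors' implementation: it stands in a plain nearest-mean classifier for the paper's supervised learner and absolute Pearson correlation for the statistical similarity measure, and the function names (`recursive_feature_addition`, `nmsc_train_accuracy`) and the parameter `n_select` are illustrative assumptions.

```python
import numpy as np

def nmsc_train_accuracy(X, y):
    """Training accuracy of a simple nearest-mean classifier
    (an illustrative stand-in for the paper's supervised learner)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance of every sample to every class mean.
    d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    pred = classes[d.argmin(axis=1)]
    return (pred == y).mean()

def recursive_feature_addition(X, y, n_select=10):
    """Greedy RFA sketch: X is samples x genes, y holds class labels."""
    remaining = list(range(X.shape[1]))
    # Seed with the single gene yielding the highest training accuracy.
    best = max(remaining, key=lambda g: nmsc_train_accuracy(X[:, [g]], y))
    selected = [best]
    remaining.remove(best)
    while len(selected) < n_select and remaining:
        scores = {g: nmsc_train_accuracy(X[:, selected + [g]], y)
                  for g in remaining}
        top = max(scores.values())
        ties = [g for g, s in scores.items() if s == top]
        # Break accuracy ties by minimal redundancy: prefer the candidate
        # least correlated with the genes already selected.
        def redundancy(g):
            return max(abs(np.corrcoef(X[:, g], X[:, s])[0, 1])
                       for s in selected)
        pick = min(ties, key=redundancy)
        selected.append(pick)
        remaining.remove(pick)
    return selected
```

The similarity-based tie-breaking is what distinguishes this scheme from a purely accuracy-driven forward selection: among equally accurate candidates, the least redundant gene is retained, which is how the method aims to keep the feature set informative rather than repetitive.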