Date Published: May 31, 2019
Publisher: Public Library of Science
Author(s): Qingyang Zhang, Yuchun Du, Fabio Rapallo.
Feature screening has become a prerequisite for the analysis of high-dimensional genomic data, as it is effective in reducing dimensionality and removing redundant features. However, existing methods for feature screening mostly rely on the assumptions of linear effects and independence (or weak dependence) between features, which may not hold in practice. In this paper, we consider the problem of selecting continuous features for a categorical outcome from high-dimensional data. We propose a powerful statistical procedure that consists of two steps: a nonparametric significance test based on edge counts, and a multiple testing procedure with dependence adjustment for false discovery rate control. The new method offers two novelties. First, the edge-count test directly targets distributional differences between groups, so it is sensitive to nonlinear effects. Second, we relax the independence assumption and adapt Efron’s procedure to adjust for the dependence between features. The performance of the proposed procedure, in terms of statistical power and false discovery rate, is illustrated using simulated data. We apply the new method to three genomic datasets to identify genes associated with colon, cervical and prostate cancers.
Feature screening, a key and unavoidable step in many bioinformatics applications, is effective in reducing dimensionality and removing redundant features. Because the quality of the selected features may greatly affect the subsequent analysis and conclusions, a reliable screening procedure is essential in practice. In general, an ideal feature screening procedure should have high sensitivity and high specificity simultaneously: too many false positives can result in poor model interpretability, while too many false negatives may cause lack of fit and inaccurate prediction. The statistics and bioinformatics literature contains a wealth of feature screening techniques that can be roughly classified into two categories, namely model-based screening and model-free screening. Model-based methods often rely on a specific class of models, such as generalized linear models and nonparametric regression models [1–4]. However, with a large number of predictors, it can be very challenging to specify the model structure without prior information. Model-free methods do not require any parametric assumption or model structure, so they are more flexible and more efficient than model-based methods for high-dimensional data [5–7].
Genomic studies with high-dimensional data often rely on feature screening. In this work, we developed and validated a model-free feature screening method that reliably selects continuous features associated with a categorical outcome in high-dimensional settings. The new method tackles two major challenges in feature screening and feature selection, namely nonlinear effect detection and false discovery rate control under feature dependencies. The edge-count test is based on simple calculations, namely minimum spanning tree (MST) construction and a chi-square test, so it is easy to implement and feasible for large-scale data sets such as cancer genomic data and brain mapping data. For instance, in the colon cancer example with 2,000 genes, the computation took less than 10 seconds with an R implementation on a single CPU (2.5 GHz Intel Core i7).
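To make the edge-count idea concrete, the following is a minimal sketch for a single feature and a two-group outcome, not the authors' R implementation. It relies on the fact that, for one continuous feature, the chain linking consecutive sorted values is a minimum spanning tree. The function names (`within_group_edges`, `edge_count_pvalue`) are hypothetical, and the sketch makes one deliberate substitution: it calibrates the quadratic statistic on the two within-group edge counts by label permutation rather than by the chi-square approximation mentioned above.

```python
import random

def within_group_edges(sorted_labels):
    """Return (R0, R1): chain edges joining two group-0 / two group-1 samples."""
    r0 = r1 = 0
    for a, b in zip(sorted_labels, sorted_labels[1:]):
        if a == b:
            if a == 0:
                r0 += 1
            else:
                r1 += 1
    return r0, r1

def edge_count_pvalue(x, y, n_perm=2000, seed=0):
    """Permutation p-value for a two-group edge-count statistic (labels 0/1).

    Assumption: permutation calibration stands in for the chi-square
    approximation; this is an illustration, not the published procedure.
    """
    rng = random.Random(seed)
    # 1-D MST: order the samples by feature value and link neighbours.
    order = sorted(range(len(x)), key=lambda i: x[i])
    labels = [y[i] for i in order]
    obs = within_group_edges(labels)

    # Null distribution of (R0, R1): shuffle labels over the fixed tree.
    shuffled = labels[:]
    perms = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        perms.append(within_group_edges(shuffled))

    # Permutation estimates of the null mean and 2x2 covariance.
    m0 = sum(p[0] for p in perms) / n_perm
    m1 = sum(p[1] for p in perms) / n_perm
    v00 = sum((p[0] - m0) ** 2 for p in perms) / n_perm
    v11 = sum((p[1] - m1) ** 2 for p in perms) / n_perm
    v01 = sum((p[0] - m0) * (p[1] - m1) for p in perms) / n_perm
    det = v00 * v11 - v01 ** 2
    if det <= 0:
        raise ValueError("degenerate null covariance; sample may be too small")

    def stat(r):
        # Quadratic form d' * inverse(covariance) * d for the 2x2 case.
        d0, d1 = r[0] - m0, r[1] - m1
        return (v11 * d0 * d0 - 2 * v01 * d0 * d1 + v00 * d1 * d1) / det

    s_obs = stat(obs)
    exceed = sum(stat(p) >= s_obs for p in perms)
    return (1 + exceed) / (n_perm + 1)

# Usage: two groups with equal means but different spread (a nonlinear effect).
gen = random.Random(1)
x = [gen.gauss(0, 1) for _ in range(100)] + [gen.gauss(0, 5) for _ in range(100)]
y = [0] * 100 + [1] * 100
print(edge_count_pvalue(x, y))
```

Because the statistic uses both within-group counts, it can respond to a difference in spread as well as a shift in location, which a mean-based test would miss. Extending the sketch to more than two outcome categories would require one count per group and is left out here.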
Identification of disease-related biomarkers from large-scale data is essential in many genomic studies. However, the existence of nonlinear effects and strong feature dependencies makes existing methods inappropriate and unreliable. In this work, we presented a model-free feature screening method that is sensitive to both linear and nonlinear effects. In addition, the dependence-adjusted multiple testing procedure controls the false discovery rate well under feature dependencies. On the whole, we put forward a simple yet effective testing procedure that reliably captures different types of effects. Although we used gene expression data for illustration in the paper, the proposed test can be readily applied to other data types and problems, such as DNA methylation data, protein expression data, and pathway selection.
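The dependence adjustment can be illustrated with a simplified empirical-null sketch. The idea behind this kind of adjustment is that dependence between features can make the null distribution of the test z-scores wider or narrower than the theoretical N(0, 1), so the null is estimated from the data before multiple testing is applied. The code below is an assumption-laden stand-in, not Efron's estimator and not the authors' adaptation of it: it estimates the null centre and scale with the median and the scaled median absolute deviation, re-calibrates the p-values, and then applies the standard Benjamini-Hochberg step-up rule. The function names are hypothetical.

```python
from statistics import NormalDist, median

def empirical_null_pvalues(pvals):
    """Re-calibrate one-sided p-values against a robustly estimated null.

    Assumption: median / MAD replace a full empirical-null fit, and most
    features are taken to be null so these estimates reflect the null.
    """
    std = NormalDist()
    eps = 1e-12  # keep p strictly inside (0, 1) for the quantile function
    z = [std.inv_cdf(1 - min(max(p, eps), 1 - eps)) for p in pvals]
    mu0 = median(z)
    sigma0 = 1.4826 * median([abs(v - mu0) for v in z])  # MAD scaled to sd
    adjusted = [1 - std.cdf((v - mu0) / sigma0) for v in z]
    return adjusted, mu0, sigma0

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule; returns a reject flag per p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value falls under its threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

# Usage: null z-scores with inflated spread (sd 2 instead of 1).
std = NormalDist()
z_null = [2 * std.inv_cdf(i / 100) for i in range(1, 100)]
naive_p = [1 - std.cdf(v) for v in z_null]
adj_p, mu0, sigma0 = empirical_null_pvalues(naive_p)
print(sum(benjamini_hochberg(naive_p)), sum(benjamini_hochberg(adj_p)))
```

The usage lines compare how many features each set of p-values would flag when every feature is in fact null and only the spread of the z-scores has been inflated.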