Research Article: Nearest shrunken centroids via alternative genewise shrinkages

Date Published: February 15, 2017

Publisher: Public Library of Science

Author(s): Byeong Yeob Choi, Eric Bair, Jae Won Lee, Lars Kaderali.


Nearest shrunken centroids (NSC) is a popular classification method for microarray data. NSC calculates centroids for each class and “shrinks” the centroids toward 0 using soft thresholding. Future observations are then assigned to the class with the minimum distance between the observation and the (shrunken) centroid. Under certain conditions, the soft shrinkage used by NSC is equivalent to a LASSO penalty. However, this penalty can produce biased estimates when the true coefficients are large. In addition, NSC ignores the fact that multiple measures of the same gene are likely to be related to one another. We considered several alternative genewise shrinkage methods to address these shortcomings of NSC. Three alternative penalties were considered: the smoothly clipped absolute deviation (SCAD), the adaptive LASSO (ADA), and the minimax concave penalty (MCP). We also showed that NSC can be performed in a genewise manner. Classification methods were derived for each alternative shrinkage method or alternative genewise penalty, and the performance of each new classification method was compared with that of conventional NSC on several simulated and real microarray data sets. Moreover, we applied the geometric mean approach for the alternative penalty functions. In general, the alternative (genewise) penalties required fewer genes than NSC. The geometric mean of the class-specific prediction accuracies was improved, as was the overall predictive accuracy in some cases. These results indicate that the alternative penalties should be considered when using NSC.

Partial Text

Nearest shrunken centroids (NSC) is one of the most frequently used classification methods for high-dimensional data such as microarray data [1, 2]. NSC shrinks the average expression (i.e., centroid) of each gene within each class toward the overall centroid via soft thresholding. Genes whose expression levels do not significantly differ between the classes will have their centroids reduced to the overall centroids, effectively removing them from the classification procedure. The amount of shrinkage is determined by cross validation. Then class prediction is performed using the shrunken centroids, which allows one to identify important genes and predict the class of unlabeled observations.
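The shrinkage and prediction steps described above can be illustrated with a minimal Python sketch. It is a simplification, not the authors' implementation: it shrinks the raw difference between each class centroid and the overall centroid, omits the per-gene variance standardization that NSC implementations typically apply, and takes the threshold `delta` as given rather than choosing it by cross validation. The function names (`soft_threshold`, `shrunken_centroids`, `predict`) are hypothetical.

```python
def soft_threshold(d, delta):
    """Shrink d toward 0 by delta; values within delta of 0 become 0."""
    magnitude = max(abs(d) - delta, 0.0)
    return magnitude if d >= 0 else -magnitude


def shrunken_centroids(X, y, delta):
    """X: list of samples (each a list of gene values); y: class labels.

    Returns (overall, shrunken), where shrunken[k][i] is the centroid of
    gene i in class k after shrinking toward the overall centroid.
    Assumption: no variance standardization (simplified illustration).
    """
    n_genes = len(X[0])
    overall = [sum(row[i] for row in X) / len(X) for i in range(n_genes)]
    shrunken = {}
    for k in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == k]
        centroid = [sum(r[i] for r in rows) / len(rows) for i in range(n_genes)]
        shrunken[k] = [
            overall[i] + soft_threshold(centroid[i] - overall[i], delta)
            for i in range(n_genes)
        ]
    return overall, shrunken


def predict(x, shrunken):
    """Assign x to the class whose shrunken centroid is nearest
    (squared Euclidean distance; class priors are ignored here)."""
    def dist(k):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, shrunken[k]))
    return min(shrunken, key=dist)


# Toy data: gene 0 separates the classes, gene 1 does not.
X = [[1.0, 5.0], [1.2, 5.0], [3.0, 5.1], [3.2, 4.9]]
y = [0, 0, 1, 1]
overall, shrunken = shrunken_centroids(X, y, delta=0.5)
label = predict([1.0, 5.0], shrunken)
```

In this toy example the second gene has the same shrunken centroid in both classes, so it contributes equally to every class distance and no longer influences the prediction.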

In this section, we conducted simulation studies to compare ALT-NSC, GEN-NSC, and the GM versions of ALT-NSC and GEN-NSC to conventional NSC. We examined the overall prediction accuracy (PA), geometric mean (g-mean), area under the curve (AUC, only for a two-class classification scenario), sensitivity (SEN), and positive predictive value (PPV). SEN is the number of detected important genes divided by the total number of important genes. PPV is the number of detected important genes divided by the total number of genes that the method selects. As in Dudoit et al. [19], we presented the median and upper quartiles of the evaluation measures.
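The gene-selection measures and the g-mean can be computed as in the following sketch. SEN and PPV follow the definitions given above; the g-mean is taken to be the geometric mean of the class-specific prediction accuracies. The function names and the toy inputs are hypothetical, and returning NaN for an empty gene set is an assumption made here for illustration.

```python
import math


def gene_selection_metrics(selected, important):
    """Return (SEN, PPV) for a set of selected genes.

    SEN = detected important genes / all important genes
    PPV = detected important genes / all selected genes
    Assumption: an empty denominator yields NaN.
    """
    selected, important = set(selected), set(important)
    detected = len(selected & important)
    sen = detected / len(important) if important else float("nan")
    ppv = detected / len(selected) if selected else float("nan")
    return sen, ppv


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class prediction accuracies."""
    accuracies = []
    for k in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == k]
        accuracies.append(sum(y_pred[i] == k for i in idx) / len(idx))
    return math.prod(accuracies) ** (1.0 / len(accuracies))


# Toy example: 4 genes selected, 3 truly important, 2 of them detected.
sen, ppv = gene_selection_metrics(selected=[1, 2, 3, 4], important=[1, 2, 5])

# Imbalanced toy labels: class 0 accuracy 3/4, class 1 accuracy 1/2.
gm = g_mean([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
```

Because the g-mean multiplies the per-class accuracies, a classifier that ignores a minority class scores 0 regardless of its overall accuracy, which is why the measure is useful for class-imbalanced data.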

In this section, we applied conventional NSC and the proposed methods (ALT-NSC and GEN-NSC) to four real microarray data sets. The main characteristics of the four microarray data sets are presented in Table 9.

In this article, we proposed several variations of NSC that use alternative genewise shrinkages. We derived these methods using three penalized regression models that enjoy oracle properties and have closed-form solutions under an orthonormal design. We further modified these variants of NSC by adapting genewise penalty functions that use the correlations between the parameters belonging to the same gene, and by applying the geometric mean approach for class-imbalanced data. Through simulation and real data studies, we showed that these methods perform better than conventional NSC in terms of prediction accuracy, g-mean, and gene selection.
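For reference, the closed-form solutions mentioned above can be written as univariate thresholding rules. The sketch below uses the forms commonly given for an orthonormal design; the parametrization (`lam`, `a`, `gamma`), the default values, and the function names are assumptions, and the article's own derivations may scale or tune these differently.

```python
def sign(z):
    return (z > 0) - (z < 0)


def soft(z, lam):
    """LASSO (soft thresholding): every surviving value is shifted by lam."""
    return sign(z) * max(abs(z) - lam, 0.0)


def scad(z, lam, a=3.7):
    """SCAD thresholding (requires a > 2); large values are left unshrunk."""
    if abs(z) <= 2 * lam:
        return soft(z, lam)
    if abs(z) <= a * lam:
        return ((a - 1) * z - sign(z) * a * lam) / (a - 2)
    return z


def mcp(z, lam, gamma=3.0):
    """MCP thresholding (requires gamma > 1); large values are left unshrunk."""
    if abs(z) <= gamma * lam:
        return soft(z, lam) / (1 - 1 / gamma)
    return z


def adaptive_lasso(z, lam, gamma=1.0):
    """Adaptive LASSO with weight 1/|z|**gamma: shrinkage decays as |z| grows."""
    if z == 0:
        return 0.0
    return sign(z) * max(abs(z) - lam / abs(z) ** gamma, 0.0)


# With lam = 1, a small value is zeroed by every rule, while a large value
# keeps a constant bias only under soft thresholding.
small = [f(0.5, 1.0) for f in (soft, scad, mcp, adaptive_lasso)]
large = [f(5.0, 1.0) for f in (soft, scad, mcp, adaptive_lasso)]
```

The comparison illustrates the motivation for the alternative penalties: soft thresholding subtracts the full threshold from large values, whereas SCAD and MCP return them unchanged and the adaptive LASSO shrinks them only slightly.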