Date Published: October 1, 2018
Publisher: Public Library of Science
Author(s): Olivier Collignon, Jeongseop Han, Hyungmi An, Seungyoung Oh, Youngjo Lee, Giandomenico Roviello.
Covariate selection is a fundamental step when building sparse prediction models in order to avoid overfitting and to gain a better interpretation of the classifier without losing its predictive accuracy. In practice the LASSO regression of Tibshirani, which penalizes the likelihood of the model by the L1 norm of the regression coefficients, has become the gold standard to reach these objectives. Recently Lee and Oh developed a novel random-effect covariate selection method called the modified unbounded penalty (MUB) regression, whose penalization function can equal minus infinity at 0 in order to produce very sparse models. We sought to compare the predictive accuracy and the number of covariates selected by these two methods in several high-dimensional datasets, consisting of gene expression levels measured to predict response to chemotherapy in breast cancer patients. These comparisons were performed by building the Receiver Operating Characteristic (ROC) curves of the classifiers obtained with the selected genes and by comparing their area under the ROC curve (AUC) corrected for optimism using several variants of bootstrap internal validation and cross-validation. We found consistently across all datasets that the MUB penalization selected a remarkably smaller number of covariates than the LASSO while offering a similar—and encouraging—predictive accuracy. The models selected by the MUB were actually nested in the ones obtained with the LASSO. Similar findings were observed when comparing these results to those obtained by other authors in the original publication of the data, or when using the area under the Precision-Recall curve (AUCPR) as another measure of predictive performance. In conclusion, the MUB penalization therefore seems to be one of the best options when sparsity is required in high dimensions. Further investigation in other datasets is however required to validate these findings.
When building prediction models, covariate selection is a fundamental step in order to maximize the interpretability of the classifier and to avoid overfitting [1–5] while maintaining predictive accuracy. In particular, sparse models, i.e., models including a very limited number of covariates, are very attractive because their fitting depends only on the estimation of a few parameters, offering an easier interpretation of the model. In practice, the financial cost of these models is also potentially lower, since only a few covariates need to be measured to accurately classify a new individual. Indeed, in medicine and biology for example, predictive biomarkers can be very costly to measure, and therefore the larger the number of covariates needed in a predictive model, the higher its effective cost. In this respect, the LASSO regression has become the gold standard for covariate selection [5, 6]. This method is based on a penalization of the likelihood of the model by the L1 norm of the vector of the regression coefficients of the covariates. Indeed, the non-differentiability of the penalization function at 0 enables sparse selection. Variable selection methods based on likelihood penalization also encompass the Elastic Net penalty and the Smoothly Clipped Absolute Deviation (SCAD) penalty. Bayesian alternatives such as spike and slab, and the Bayesian LASSO are also available [9–11]. More recently Lee and Oh developed a novel random-effect covariate selection method called the MUB regression, whose penalization function can equal minus infinity at 0. This method offered promising results in simulations and in toy datasets [13, 14] and therefore deserves further practical investigation.
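The sparsity mechanism described above can be illustrated with a minimal sketch, not taken from the study's own pipeline: an L1-penalized (LASSO) logistic regression fitted to synthetic high-dimensional data, where the non-differentiability of the penalty at 0 drives most coefficients exactly to zero. The data, penalty strength, and scikit-learn estimator are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): L1-penalized logistic
# regression on synthetic data with far more covariates than samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 100, 500                       # many more covariates than samples
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0                        # only 5 truly informative covariates
y = (X @ beta + rng.standard_normal(n) > 0).astype(int)

# penalty="l1" gives the LASSO; C is the inverse penalty strength,
# so smaller C means stronger penalization and sparser selection.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)
print(f"{selected.size} of {p} covariates selected")
```

Varying `C` traces out the usual trade-off: a stronger penalty yields a sparser, cheaper-to-measure model at some potential cost in fit.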
In order to illustrate our analysis with the largest sample size possible, the results obtained with the pooled database are described in detail before summarizing those achieved in each of the four clinical subtypes.
In this study, we found consistently across several different datasets that the MUB penalization tended to select a remarkably smaller number of covariates than the LASSO while offering similar predictive accuracy (the models obtained were actually nested). Indeed, the difference between the performances of the classifiers built with the covariates selected by each method was relatively slight and varied only moderately by dataset and validation technique. When comparing our results to those published in the study where the data were originally described and analysed, the predictive accuracy obtained with the LASSO and the MUB varied only moderately from the performance reported by de Ronde et al. However, we found again that the number of covariates selected was much smaller. In these high-dimensional datasets, the MUB therefore appeared to be an efficient method to select a small number of important genes predictive of resistance to chemotherapy while offering encouraging predictive performance. Although Bayesian alternatives could be considered, we think that in a high-dimensional setting the h-likelihood based method is easier to estimate. When sparsity is required, the MUB therefore seems to be the best option in high dimensions. Further investigation is planned to confirm these findings in other high-dimensional datasets.
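The optimism-corrected AUC used to compare the classifiers can be sketched as follows. This is a hedged illustration of a Harrell-style bootstrap optimism correction, one common variant of bootstrap internal validation, not the study's actual validation code; the model, data, and number of replicates are assumptions.

```python
# Sketch of bootstrap optimism correction for the AUC:
# corrected AUC = apparent AUC - mean bootstrap optimism, where
# optimism = (AUC of a model refitted on a bootstrap sample, scored
# on that sample) - (same model scored on the original sample).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, p = 120, 30
X = rng.standard_normal((n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n) > 0).astype(int)

def fit_auc(X_tr, y_tr, X_te, y_te):
    # Illustrative classifier; the study compared LASSO- and
    # MUB-selected models instead.
    m = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    m.fit(X_tr, y_tr)
    return roc_auc_score(y_te, m.decision_function(X_te))

apparent = fit_auc(X, y, X, y)        # AUC on the data used for fitting

optimisms = []
for _ in range(50):                   # bootstrap replicates (assumed count)
    idx = rng.integers(0, n, n)       # resample individuals with replacement
    Xb, yb = X[idx], y[idx]
    if yb.min() == yb.max():          # skip single-class resamples
        continue
    optimisms.append(fit_auc(Xb, yb, Xb, yb) - fit_auc(Xb, yb, X, y))

corrected_auc = apparent - np.mean(optimisms)
print(round(corrected_auc, 3))
```

The correction penalizes the apparent AUC by the average amount a refitted model overrates itself, giving a less optimistic estimate of out-of-sample discrimination.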