Date Published: January 18, 2010
Publisher: Public Library of Science
Author(s): Nicolas Guex, Eugenia Migliavacca, Ioannis Xenarios, Mark Isalan. http://doi.org/10.1371/journal.pone.0008012
Abstract: DREAM is an initiative that allows researchers to assess how well their methods or approaches can describe and predict networks of interacting molecules. Each year, recently acquired datasets are released to predictors ahead of publication. Researchers typically have about three months to predict the masked data or network of interactions, using any predictive method. Predictions are assessed prior to an annual conference where the best predictions are unveiled and discussed. Here we present the strategy we used to make a winning prediction for the DREAM3 phosphoproteomics challenge. To predict the 476 out of 4624 measurements that had been masked for the challenge, we used Amelia II, a multiple imputation software package developed by Gary King, James Honaker and Matthew Blackwell in the context of the social sciences. To choose the best possible multiple imputation parameters for the challenge, we evaluated how transforming the data and varying the imputation parameters affected the ability to predict additionally masked data. We discuss the accuracy of our findings and show that multiple imputation applied to this dataset is a powerful method for accurately estimating the missing data. We postulate that multiple imputation methods might become an integral part of experimental design, as a means to achieve cost savings or to increase the number of samples that can be handled for a given cost.
Partial Text: DREAM is an initiative that plays an essential role in methods development by critically evaluating current computational methodologies (http://wiki.c2b2.columbia.edu/dream/index.php/The_DREAM_Project). In this respect, it follows the well-established Critical Assessment of methods of protein Structure Prediction (CASP), which has spurred innovation in that field. DREAM is now in its fourth edition, and there is no doubt that it will become as beneficial for the systems biology world as CASP already is for the structural biology domain. We participated in the third edition of the DREAM challenge, in the phosphoproteomics section. Briefly, this challenge is based on a data set provided by Peter Sorger et al., in which the authors measured the difference in signaling between normal and cancerous cells using phosphoproteomics assays. Predictors were given only 90% of the data and had to predict the values of the remaining measurements, which had been masked by the authors. The task was to predict the concentration of 17 phosphoproteins at two time points for 7 combinations of stimuli and inhibitors applied to normal and cancer hepatocytes (Figure 1). For each of the 17 phosphoproteins, measurements for 42 distinct combinations of stimuli and inhibitors were provided, in addition to unstimulated and uninhibited controls.
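As a quick consistency check, the size of the prediction task described above can be multiplied out and compared with the 476 masked measurements quoted in the abstract. The short Python sketch below assumes that the 7 masked combinations apply to each of the two cell types (normal and cancer hepatocytes); the variable names are illustrative only.

```python
# Dimensions of the masked part of the dataset, as described in the text.
phosphoproteins = 17
time_points = 2
masked_conditions = 7   # combinations of stimuli and inhibitors
cell_types = 2          # assumption: normal and cancer hepatocytes counted separately

masked = phosphoproteins * time_points * masked_conditions * cell_types
total = 4624            # total number of measurements, as stated in the abstract
masked_fraction = masked / total

print(masked)                            # 476
print(round(100 * masked_fraction, 1))   # 10.3 (percent of all measurements)
```

Under this reading the product matches the abstract, and the masked share is close to the 10% implied by predictors receiving 90% of the data.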
One interesting aspect of the DREAM challenge is that only about three months separate the release of the data from the due date for the analysis. This does not leave much time to develop and validate novel methods, and predictors typically apply methods they have been developing in their laboratory over time. We took a slightly different approach, which consisted of analyzing the problem, identifying a suitable tool to perform the analysis, tuning the parameters during the time allowed and performing our final prediction. The analysis workflow is summarized in Figure 2. Each step is described in more depth in the following sections.
When the number of imputations was large, we did not observe a statistically significant difference between imputing the missing data from untransformed or square-root-transformed measurements, although the variance was slightly tighter when untransformed data were used. Log-transforming the data consistently gave inferior results (data not shown). We had nevertheless anticipated a beneficial effect of transforming the data, because during our initial data exploration we observed that the measurements acquired for several of the 17 phosphoproteins were not normally distributed (data not shown). This violated the assumption of the imputation model implemented in Amelia II, which works best with multivariate normally distributed data. During our search for optimal parameters, we either used the data as-is or applied a square root transformation to all measurements. Because the various phosphoprotein measurements follow distinct distributions, we reasoned that the putative improvement obtained by transforming some measurements was offset by the detrimental effect of transforming measurements that should have been left untransformed. We therefore kept the multiple imputation parameters that gave the best correlation with our own masked data and further evaluated the effect of transforming the measurements of only some of the 17 phosphoproteins. A square root transformation of the Akt, IkBα, p38, p70S6 and HSP27 measurements modestly but significantly improved the overall correlation from 0.94 to 0.95 (unpaired t-test, P = 0.02). This is what we used for our final prediction.
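The evaluation loop described above (mask additional known values, optionally square-root transform selected columns, impute, back-transform, and correlate the imputed values with the hidden ones) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: Amelia II is an R package and is not called here, so `placeholder_impute` is a deliberately crude, hypothetical stand-in, and all function names, the synthetic data and the parameter values are invented for the example.

```python
import numpy as np


def mask_extra(data, fraction, rng):
    """Hide a random fraction of the observed entries; return the copy and the mask."""
    observed = np.flatnonzero(~np.isnan(data))
    n_hide = max(1, int(round(fraction * observed.size)))
    hide = rng.choice(observed, size=n_hide, replace=False)
    held_out = np.zeros(data.shape, dtype=bool)
    held_out[np.unravel_index(hide, data.shape)] = True
    masked = data.copy()
    masked[held_out] = np.nan
    return masked, held_out


def transform_columns(data, columns, forward=True):
    """Square-root (forward) or square (inverse) only the selected columns."""
    cols = np.asarray(columns, dtype=int)
    out = data.copy()
    out[:, cols] = np.sqrt(out[:, cols]) if forward else out[:, cols] ** 2
    return out


def placeholder_impute(data, m, rng):
    """Stand-in imputer (NOT Amelia II): column mean plus noise, averaged over m draws."""
    mean = np.nanmean(data, axis=0)
    std = np.nanstd(data, axis=0)
    missing = np.isnan(data)
    draws = []
    for _ in range(m):
        filled = data.copy()
        noise = rng.normal(0.0, 1.0, size=data.shape) * std
        filled[missing] = (mean + noise)[missing]
        draws.append(filled)
    return np.mean(draws, axis=0)


def score(data, sqrt_columns, fraction=0.1, m=20, seed=0):
    """Pearson correlation between imputed and true values at the extra-masked positions."""
    rng = np.random.default_rng(seed)
    masked, held_out = mask_extra(data, fraction, rng)
    work = transform_columns(masked, sqrt_columns, forward=True)
    imputed = transform_columns(placeholder_impute(work, m, rng), sqrt_columns, forward=False)
    return np.corrcoef(imputed[held_out], data[held_out])[0, 1]


# Synthetic, non-negative, skewed "measurements": 60 conditions x 5 hypothetical analytes.
demo_rng = np.random.default_rng(1)
demo = demo_rng.gamma(shape=2.0, scale=np.array([1.0, 5.0, 10.0, 20.0, 40.0]), size=(60, 5))

r_plain = score(demo, sqrt_columns=[])      # no transformation
r_sqrt = score(demo, sqrt_columns=[0, 2])   # transform only some columns
print(r_plain, r_sqrt)
```

Repeating `score` over many seeds yields a distribution of correlations for each transformation choice, which is the kind of sample that could then be compared with an unpaired t-test as reported in the text. Swapping `placeholder_impute` for a real multiple imputation routine is the step that matters in practice.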