Research Article: Training set optimization of genomic prediction by means of EthAcc

Date Published: February 19, 2019

Publisher: Public Library of Science

Author(s): Brigitte Mangin, Renaud Rincent, Charles-Elie Rabier, Laurence Moreau, Ellen Goudemand-Dugue, Momiao Xiong.


Genomic prediction is a useful tool for plant and animal breeding programs and is starting to be used to predict human diseases as well. One shortcoming that slows the deployment of genomic selection is that the accuracy of the prediction is not known a priori. We propose EthAcc (Estimated THeoretical ACCuracy) as a method for estimating the accuracy given a training set that is genotyped and phenotyped. EthAcc is based on a causal quantitative trait loci model estimated by a genome-wide association study. Because this estimated causal model is crucial, we compared different methods to find the one yielding the best EthAcc. The multilocus mixed model was found to perform the best. We compared EthAcc to accuracy estimators that can be derived via a mixed marker model and showed that EthAcc is the only approach that estimates the accuracy correctly. Moreover, in the case of a structured population, and in accordance with the achieved accuracy, EthAcc showed that the largest training set is not always better than a smaller, more closely related training set. We then performed training set optimization with EthAcc and compared it to CDmean. EthAcc outperformed CDmean on real datasets from sugar beet, maize, and wheat. Nonetheless, its performance was mainly due to the use of an optimal but inaccessible set as the starting point of the optimization algorithm. EthAcc’s precision and algorithm issues prevent it from reaching a good training set from a random start. Despite this drawback, we demonstrated that a substantial gain in accuracy can be obtained by performing training set optimization.

Partial Text

Prediction of unobserved individuals using genomic information has gained increasing importance in plant and animal breeding [1, 2]. Moreover, it is an accurate tool for prediction of complex diseases in humans [3, 4] and is included in the precision medicine initiative [5].

Genomic prediction of the genetic values of test individuals was based on GBLUP [7]. We define the accuracy of genomic prediction as Pearson’s correlation between the phenotype of a random test individual and its BLUP value, given a training set. We refer to this correlation simply as the accuracy in the text below.
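The definition above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the kinship formula (centered markers scaled by the sum of 2p(1-p)), the use of a known heritability `h2`, the plain training mean as the only fixed effect, and the simulated genotypes and QTL are all assumptions made for illustration, and the function names `marker_kinship` and `gblup_predict` are hypothetical.

```python
import numpy as np

def marker_kinship(M):
    """Marker-based kinship from a 0/1/2 genotype matrix (individuals x markers).
    Assumed form: centered genotypes scaled by 2 * sum(p * (1 - p))."""
    p = M.mean(axis=0) / 2.0                 # allele frequency of each marker
    Z = M - 2.0 * p                          # centered genotypes
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_predict(K, y_train, train, test, h2):
    """GBLUP prediction for test individuals, treating h2 as known (assumption)."""
    lam = (1.0 - h2) / h2                    # residual-to-genetic variance ratio
    mu = y_train.mean()                      # simple mean, not a GLS estimate
    K_tt = K[np.ix_(train, train)]           # kinship among training individuals
    K_st = K[np.ix_(test, train)]            # kinship between test and training
    alpha = np.linalg.solve(K_tt + lam * np.eye(len(train)), y_train - mu)
    return mu + K_st @ alpha

# Toy simulated data (hypothetical sizes, no population structure).
rng = np.random.default_rng(0)
n, m, n_qtl, h2 = 300, 500, 20, 0.5
M = rng.binomial(2, 0.3, size=(n, m)).astype(float)
qtl = rng.choice(m, size=n_qtl, replace=False)      # causal markers
g = M[:, qtl] @ rng.normal(size=n_qtl)              # true genetic values
e = rng.normal(scale=np.sqrt(g.var() * (1.0 - h2) / h2), size=n)
y = g + e                                           # phenotypes

train, test = np.arange(200), np.arange(200, n)
K = marker_kinship(M)
pred = gblup_predict(K, y[train], train, test, h2)

# Accuracy as defined in the text: correlation between the test phenotypes
# and their predicted genetic values.
accuracy = np.corrcoef(y[test], pred)[0, 1]
print(f"empirical accuracy: {accuracy:.3f}")
```

This empirical correlation can only be computed once the test individuals have been phenotyped, which is why an a priori estimate of it is of interest.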

We compared several estimators of the accuracy of GBLUP for a given training set (i.e., Pearson’s correlation between the observed phenotype and the predicted genetic value of test individuals, given the training set genotypes). We demonstrated that neither CD- nor PEV-based accuracy estimators are accurate. The reason is that both assume that the causal-QTL model, which emulates the genetic value, is identical to the linear mixed marker model used to make the prediction. This model equality implies that each marker is a QTL and that the QTL effects are independent and identically distributed according to a Gaussian distribution. These assumptions on QTL effects (and thus on the genetic value) are asymptotically correct in the pedigree mixed model, because they follow from random draws of individuals in a lineage and an infinite number of equal and additive loci [43]. On the other hand, the essential information missing for a precise estimate of the accuracy is the identical-by-descent status of alleles at the causal loci between the test and the training individuals. This missing information has to be reflected by the marker-based kinship matrix, and much research has been published on improving this kinship estimation [14, 44, 45]. In contrast to this line of thinking, which stays within the mixed model framework, Rabier et al. [20] proposed an estimate of the accuracy that uses an instrumental mixed marker model to predict the genetic value and a causal fixed linear model to emulate the genetic values. Into their theoretical accuracy formula, we plugged the locations and SNP effects estimated by the forward MLMM approach [28]. In this way, we showed on real data that we obtained an accurate estimate of the accuracy (MSE of 10^-3 on average among sugar beet traits).
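To make the CD- and PEV-based estimators discussed above more concrete, the following Python sketch computes the textbook coefficient of determination (CD) and prediction error variance (PEV) of each test individual under the mixed marker model. It is a simplified illustration, not the estimators evaluated in the article: fixed effects are ignored, the heritability `h2` and genetic variance `sigma_g2` are treated as known, the genotypes are simulated, and the function name `cd_and_pev` is hypothetical.

```python
import numpy as np

def cd_and_pev(K, train, test, h2, sigma_g2=1.0):
    """Per-individual CD and PEV under the mixed marker model.
    Simplification (assumption): no fixed effects, variances known."""
    lam = (1.0 - h2) / h2                          # sigma_e^2 / sigma_g^2
    K_tt = K[np.ix_(train, train)]
    K_st = K[np.ix_(test, train)]
    k_ss = np.diag(K)[test]                        # self-kinship of test individuals
    solved = np.linalg.solve(K_tt + lam * np.eye(len(train)), K_st.T)
    explained = np.einsum("ij,ji->i", K_st, solved)  # k_i' (K_tt + lam I)^-1 k_i
    pev = sigma_g2 * (k_ss - explained)            # prediction error variance
    cd = explained / k_ss                          # equals 1 - PEV / (sigma_g2 * K_ii)
    return cd, pev

# Toy genotypes (hypothetical) and a marker-based kinship matrix.
rng = np.random.default_rng(1)
M = rng.binomial(2, 0.3, size=(150, 400)).astype(float)
p = M.mean(axis=0) / 2.0
Z = M - 2.0 * p
K = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

test = np.arange(100, 150)
cd_large, pev_large = cd_and_pev(K, np.arange(100), test, h2=0.5)
cd_small, pev_small = cd_and_pev(K, np.arange(50), test, h2=0.5)

# Under the mixed model, sqrt(CD) is the expected correlation between the
# predicted and the true genetic value of a test individual.
print(f"mean CD, 100 training individuals: {cd_large.mean():.3f}")
print(f"mean CD,  50 training individuals: {cd_small.mean():.3f}")
```

Note that phenotypes never enter this calculation: the value depends only on the kinship matrix and the variance ratio, so it is only as reliable as the assumption that the mixed marker model is also the causal model. Under that assumption, adding training individuals can never lower the CD, whereas the abstract reports that with a structured population the largest training set is not always the best.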



