Date Published: June 27, 2019
Publisher: Public Library of Science
Author(s): Stijn Vanderzande, Nicholas P. Howard, Lichun Cai, Cassia Da Silva Linge, Laima Antanaviciute, Marco C. A. M. Bink, Johannes W. Kruisselbrink, Nahla Bassil, Ksenija Gasic, Amy Iezzoni, Eric Van de Weg, Cameron Peace, David A. Lightfoot.
High-quality genotypic data are a requirement for many genetic analyses. For any crop, errors in genotype calls, phasing of markers, linkage maps, pedigree records, and unnoticed variation in ploidy levels can lead to spurious marker-locus-trait associations and incorrect origin assignment of alleles to individuals. High-throughput genotyping requires automated scoring, as manual inspection of thousands of scored loci is too time-consuming. However, automated SNP scoring can result in errors that should be corrected to ensure recorded genotypic data are accurate and thereby ensure confidence in downstream genetic analyses. To enable quick identification of errors in a large genotypic data set, we have developed a comprehensive workflow. This multiple-step workflow is based on inheritance principles and on removal of markers and individuals that do not follow these principles, as demonstrated here for apple, peach, and sweet cherry. Genotypic data were obtained on pedigreed germplasm using 6-9K SNP arrays for each crop, and a subset of well-performing SNPs was created using ASSIsT. Use of correct (and corrected) pedigree records readily identified violations of simple inheritance principles in the genotypic data, a process streamlined with FlexQTL software. Retained SNPs were grouped into haploblocks to increase the information content of single alleles and reduce the computational power needed in downstream genetic analyses. Haploblock borders were defined by recombination locations detected in ancestral generations of cultivars and selections. A second round of inheritance checking was then conducted on haploblock alleles (i.e., haplotypes). High-quality genotypic data sets were created using this workflow for pedigreed collections representing the U.S. breeding germplasm of apple, peach, and sweet cherry evaluated within the RosBREED project. These data sets contain 3855, 4005, and 1617 SNPs spread over 932, 103, and 196 haploblocks in apple, peach, and sweet cherry, respectively.
The highly curated phased SNP and haplotype data sets, as well as the raw iScan data, of germplasm in the apple, peach, and sweet cherry Crop Reference Sets are available through the Genome Database for Rosaceae.
A high-quality, mostly error-free genotypic data set is imperative to obtain reliable results in many downstream genetic analyses. The results of genetic analyses can be influenced by even low rates of genotyping errors. For example, the size of genetic maps and order of markers therein are affected by errors in genotypic data [2–4]. Inaccurate genotypic data will also lower the power, accuracy, and resolution of linkage studies and increase the number of false marker-locus-trait associations [5–7]. The number of observed (double) recombinants is inflated by errors in genotypic data. Incorrect calling of recombinations in turn leads to incorrect determination of haploblock limits and assignment of haplotypes. Finally, incorrect genotype calls can lead to incorrect imputations of missing data or even the improper adjustment of correct data to ensure the data are consistent with Mendelian inheritance.
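The inflation of double recombinants by genotyping errors can be illustrated with a minimal sketch. It assumes a phased gamete is encoded as a string of grandparental origins ("1" or "2") along markers in map order; the function names and the `max_span` threshold are hypothetical and are not taken from the workflow or its software. A single miscalled marker inside an otherwise intact segment shows up as two adjacent origin switches, that is, an apparent double recombination over a very short interval.

```python
from typing import List, Tuple


def origin_switches(origins: str) -> List[int]:
    """Return indices i where the origin differs between marker i-1 and marker i."""
    return [i for i in range(1, len(origins)) if origins[i] != origins[i - 1]]


def suspicious_double_recombinants(origins: str, max_span: int = 1) -> List[Tuple[int, int]]:
    """Return (first, last) marker indices of segments no longer than
    max_span markers that are flanked by an origin switch on both sides.

    Assumption: markers are in map order; genetic distance is ignored.
    """
    switches = origin_switches(origins)
    return [
        (start, end - 1)
        for start, end in zip(switches, switches[1:])
        if end - start <= max_span
    ]


clean = "1111122222"      # one genuine recombination
with_error = "1111211111"  # one miscalled marker at index 4

print(origin_switches(clean))                      # one switch
print(suspicious_double_recombinants(clean))       # nothing flagged
print(suspicious_double_recombinants(with_error))  # the single-marker segment is flagged
```

In this toy encoding, one erroneous call adds two switches to the count, which is why unresolved errors distort recombination-based inferences such as haploblock borders.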
We established a workflow to efficiently and confidently identify and remove genotyping errors from genotyped and pedigreed germplasm sets for apple, peach, and sweet cherry. The proposed workflow (Fig 1, S4 File) enables directed identification of markers and individuals with genotyping errors. It uses simple genetic principles such as inheritance of parental alleles, the co-segregation of linked markers, and the likelihood of double recombinations to find these errors. The order of steps was determined to efficiently minimize errors found in later steps and thereby minimize overall time needed to find errors in the data set. For example, in apple, any incorrect parent-child (PC) relationship would lead to an average of 196 reported Mendelian-inconsistent errors, and any unresolved Mendelian-inconsistent errors led to an average of 30 more reported Mendelian-consistent errors. The developed workflow was demonstrated on Illumina SNP array data and some software is specific to this platform (Table 4), but the same workflow order and genetic principles are appropriate for other marker types and genotyping platforms. The workflow is especially useful when medium- and high-throughput genotyping tools are used, for which checking each individual marker would be too time-consuming.
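The first principle, inheritance of parental alleles, can be sketched as a simple Mendelian-consistency check. This is an illustrative example only, not the implementation used by the software named in the workflow: it assumes diploid, biallelic SNP calls stored as two-character strings ("AA", "AB", "BB", with "--" for missing), and the function names and data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

MISSING = "--"  # assumed code for a missing genotype call


def child_consistent(parent1: Optional[str], parent2: Optional[str], child: str) -> bool:
    """Return True if the child's genotype could be inherited from the parent(s).

    Missing or unknown genotypes are treated as uninformative.
    """
    if child == MISSING:
        return True
    known = [p for p in (parent1, parent2) if p is not None and p != MISSING]
    if not known:
        return True
    if len(known) == 1:
        # Parent-child pair: the two must share at least one allele.
        return bool(set(known[0]) & set(child))
    # Trio: one child allele must come from each parent, in either order.
    p1, p2 = known
    a, b = child
    return (a in p1 and b in p2) or (a in p2 and b in p1)


def count_inconsistencies(
    genotypes: Dict[str, Dict[str, str]],
    pedigree: Dict[str, Tuple[Optional[str], Optional[str]]],
) -> Tuple[Counter, Counter]:
    """Count Mendelian-inconsistent calls per marker and per individual.

    genotypes: {individual: {marker: genotype}}
    pedigree:  {child: (parent1, parent2)}, with None for an unknown parent
    """
    per_marker: Counter = Counter()
    per_individual: Counter = Counter()
    for child, (p1, p2) in pedigree.items():
        for marker, call in genotypes[child].items():
            g1 = genotypes.get(p1, {}).get(marker)
            g2 = genotypes.get(p2, {}).get(marker)
            if not child_consistent(g1, g2, call):
                per_marker[marker] += 1
                per_individual[child] += 1
    return per_marker, per_individual


genotypes = {
    "P1": {"snp1": "AA", "snp2": "AB"},
    "P2": {"snp1": "AA", "snp2": "BB"},
    "C": {"snp1": "AB", "snp2": "AB"},
}
pedigree = {"C": ("P1", "P2")}
per_marker, per_individual = count_inconsistencies(genotypes, pedigree)
print(per_marker, per_individual)
```

Under these assumptions, tallying inconsistencies both ways helps separate the two kinds of problem: an individual with inconsistencies spread across many markers suggests an incorrect pedigree record, whereas a marker that is inconsistent across many otherwise well-behaved individuals suggests a poorly performing marker.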
A curation workflow for genotypic data of pedigreed germplasm was developed by determining the optimal order in which to resolve issues and by providing a step-by-step guideline. Using simple genetic principles, errors can be found and curated in a directed and efficient way, reducing the time needed to obtain a high-quality genotypic data set. The workflow was used to obtain SNP data sets for large germplasm sets representing U.S. breeding programs of apple, peach, and sweet cherry, based on the apple 8K, peach 9K, and cherry 6K SNP arrays, respectively; these SNP data are available through the Genome Database for Rosaceae (www.rosaceae.org). It was also applied to apple and peach germplasm sets representing European breeding programs, based on the apple 20K and peach 9K arrays, whose SNP data are still private. These high-quality data sets contain the largest sets of SNPs obtained through their respective SNP arrays and will provide the foundation for confident subsequent analyses in genetic research.