Research Article: Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure

Date Published: June 18, 2019

Publisher: Public Library of Science

Author(s): Hugh G. Gauch, Sheng Qian, Hans-Peter Piepho, Linda Zhou, Rui Chen, Francesc Calafell.


SNP datasets are high-dimensional, often with thousands to millions of SNPs and hundreds to thousands of samples or individuals. Accordingly, PCA graphs are frequently used to provide a low-dimensional visualization in order to display and discover patterns in SNP data from humans, animals, plants, and microbes, especially to elucidate population structure. PCA is not a single method that is always done the same way, but rather requires three choices, which we explore as a three-way factorial: two kinds of PCA graphs by three SNP codings by six PCA variants. Our three main recommendations are simple and easily implemented: use PCA biplots, SNP coding 1 for the rare allele and 0 for the common allele, and double-centered PCA (or AMMI1 if main effects are also of interest). We also document contemporary practices by a literature survey of 125 representative articles that apply PCA to SNP data, and find that virtually none implement our recommendations. The ultimate benefit from informed and optimal choices of PCA graph, SNP coding, and PCA variant is expected to be discovery of more biology, and thereby acceleration of medical, agricultural, and other vital applications.
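As a rough illustration of two of these recommendations, the following minimal Python sketch recodes a binary genotype matrix so that 1 marks the rarer allele of each SNP, then applies double-centered PCA through a singular value decomposition. Several details are assumptions made only for this sketch and are not taken from the article: the toy data, the function names `code_rare_allele_as_one` and `double_centered_pca`, the orientation (rows are Individuals, columns are SNPs), the restriction to 0/1 values, and the choice to split each singular value evenly between the two sets of scores.

```python
import numpy as np


def code_rare_allele_as_one(genotypes):
    """Recode a 0/1 matrix (rows = Individuals, columns = SNPs)
    so that 1 marks the rarer allele within each SNP column."""
    g = np.asarray(genotypes, dtype=float).copy()
    flip = g.mean(axis=0) > 0.5      # columns where 1 is currently the common allele
    g[:, flip] = 1.0 - g[:, flip]    # swap 0 and 1 in those columns only
    return g


def double_centered_pca(x, n_components=2):
    """Remove row means and column means (adding back the grand mean),
    then decompose the remaining interaction matrix with an SVD."""
    x = np.asarray(x, dtype=float)
    row_means = x.mean(axis=1, keepdims=True)
    col_means = x.mean(axis=0, keepdims=True)
    resid = x - row_means - col_means + x.mean()
    u, s, vt = np.linalg.svd(resid, full_matrices=False)
    k = n_components
    # Assumed scaling: split each singular value evenly between both sets of scores.
    ind_scores = u[:, :k] * np.sqrt(s[:k])
    snp_scores = vt[:k].T * np.sqrt(s[:k])
    return ind_scores, snp_scores, s


# Hypothetical toy data: 5 Individuals by 4 SNPs.
toy = [[1, 1, 0, 0],
       [1, 0, 0, 1],
       [1, 1, 1, 0],
       [0, 1, 0, 0],
       [1, 1, 0, 0]]

coded = code_rare_allele_as_one(toy)
ind_scores, snp_scores, singular_values = double_centered_pca(coded)
```

Plotting `ind_scores` and `snp_scores` on the same pair of axes would give a biplot, whereas plotting `ind_scores` alone would give a monoplot of Individuals.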

Partial Text

Single nucleotide polymorphism (SNP) data is common in the genetics and genomics literature, and principal components analysis (PCA) is one of the statistical analyses applied most frequently to SNP data. These PCA analyses serve a multitude of research purposes, including increasing biological understanding, accelerating crop breeding, and improving human medicine. This article focuses on the one research purpose identified in its title, elucidating population structure—although its discussion and citations make evident the broader relevance of the results and principles presented here.

Because PCA monoplots of only Individuals provide some insight into population structure, they are deservedly popular in the literature. However, a monoplot cannot show interaction structure, which is often the dominant source of variation in a dataset and is usually the variation of principal interest. Producing a useful biplot, in turn, is unlikely without first understanding the consequences of SNP codings and PCA variants.

This appendix concerns which variants of PCA are, and which are not, immune to changes in SNP coding with regard to PCA monoplots of Individuals, where “Individuals” is a generic term for samples such as persons or cultivars. The main text already showed in Table 1 that SNP coding affects the sums of squares (SS) for SNP main effects and S×I interaction effects. Therefore, Individual-Centered PCA is not immune, because different proportions of main and interaction effects can change which PC is dominated by the SNP main effects, thereby dramatically altering a PCA monoplot of Individuals. The same verdict of not being immune applies to Individual-Standardized PCA for the same reason. Likewise, Grand-Mean-Centered PCA is not immune because it also retains SNP main effects (and Individual main effects), and again SNP coding affects the SSs for main and interaction effects. The remainder of this appendix addresses the remaining three variants in the order SNP-Centered, SNP-Standardized, and Double-Centered PCA.
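The dependence of the sums of squares on SNP coding can be checked numerically. The minimal Python sketch below partitions the total SS around the grand mean into SNP main effects, Individual main effects, and S×I interaction, first for an arbitrary coding and then after recoding so that 1 marks the rarer allele. The toy matrix, the function name `ss_partition`, and the orientation (rows are Individuals, columns are SNPs) are assumptions made for illustration only; this does not reproduce Table 1 of the article.

```python
import numpy as np


def ss_partition(x):
    """Two-way additive partition of the total sum of squares around the
    grand mean for a matrix with rows = Individuals and columns = SNPs."""
    x = np.asarray(x, dtype=float)
    n_ind, n_snp = x.shape
    grand = x.mean()
    ind_eff = x.mean(axis=1, keepdims=True) - grand   # Individual main effects
    snp_eff = x.mean(axis=0, keepdims=True) - grand   # SNP main effects
    inter = x - grand - ind_eff - snp_eff             # S×I interaction residuals
    return {
        "snp_main": n_ind * float((snp_eff ** 2).sum()),
        "ind_main": n_snp * float((ind_eff ** 2).sum()),
        "interaction": float((inter ** 2).sum()),
        "total": float(((x - grand) ** 2).sum()),
    }


# Hypothetical toy data: 5 Individuals by 4 SNPs in an arbitrary 0/1 coding.
arbitrary = np.array([[1, 1, 0, 0],
                      [1, 0, 0, 1],
                      [1, 1, 1, 0],
                      [0, 1, 0, 0],
                      [1, 1, 0, 0]], dtype=float)

# Recode so that 1 marks the rarer allele in every SNP column.
rare_is_one = arbitrary.copy()
flip = arbitrary.mean(axis=0) > 0.5
rare_is_one[:, flip] = 1.0 - rare_is_one[:, flip]

ss_arbitrary = ss_partition(arbitrary)
ss_rare = ss_partition(rare_is_one)

for label, ss in (("arbitrary coding", ss_arbitrary), ("rare allele = 1", ss_rare)):
    share = ss["interaction"] / ss["total"]
    print(f"{label}: SNP SS = {ss['snp_main']:.3f}, "
          f"interaction SS = {ss['interaction']:.3f}, "
          f"interaction share = {share:.3f}")
```

In this toy case the two codings yield different SS values for both the SNP main effects and the interaction, so any PCA variant that retains SNP main effects is working with a different mixture of main and interaction effects under each coding.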