Research Article: Population Structure and Eigenanalysis

Date Published: December 22, 2006

Publisher: Public Library of Science

Author(s): Nick Patterson, Alkes L Price, David Reich, David B Allison

Abstract: Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general “phase change” phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.

Partial Text: A central challenge in analyzing any genetic dataset is to explore whether there is any evidence that the samples in the data are from a population that is structured. Are the individuals from a homogeneous population or from a population containing subgroups that are genetically distinct? Can we find evidence for substructure in the data, and quantify it?

The basic technique is simple. We assume our markers are biallelic, for example, biallelic single nucleotide polymorphisms (SNPs). Regard the data as a large rectangular matrix C, with rows indexed by individuals, and columns indexed by polymorphic markers. For each marker choose a reference and variant allele. We suppose we have n such markers and m individuals. Let C(i,j) be the number of variant alleles for marker j, individual i. (Thus for autosomal data we have C(i,j) is 0,1 or 2.) For now suppose that there is no missing data. From each column we subtract the column means. So set for column j:
and then the corrected entries are:

For many genetic datasets, it is important to try to understand the population structure implied by the data. STRUCTURE [9], since its introduction, has been the tool of choice, especially for small datasets. We think we have provided some evidence that PCA has advantages also, as it is fast, easily implemented, and allows accurate testing of significance of a natural null model.

SMARTPCA, a software package for running eigenanalysis in a LINUX environment, is available at our laboratory: