Date Published: August 9, 2016
Publisher: Public Library of Science
Author(s): Yaron Granot, Omri Tal, Saharon Rosset, Karl Skorecki, Francesc Calafell.
Measures of population differentiation, such as FST, are traditionally derived from the partition of diversity within and between populations. However, the emergence of population clusters from multilocus analysis is a function of genetic structure (departures from panmixia) rather than of diversity. If the populations are close to panmixia, slight differences between the mean pairwise distance within and between populations (low FST) can manifest as strong separation between the populations. Population clusters are therefore often evident even when the vast majority of diversity is partitioned within populations rather than between them. For any given FST value, clusters can be tighter (more panmictic) or looser (more stratified), and in this respect higher FST does not always imply stronger differentiation. In this study we propose a measure for the partition of structure, denoted EST, which is more consistent with results from clustering schemes. Crucially, our measure is based on a statistic of the data that is a good measure of internal structure, mimicking the information extracted by unsupervised clustering or dimensionality reduction schemes. To assess the utility of our metric, we ranked various human (HGDP) population pairs based on FST and EST and found substantial differences in ranking order. EST ranking seems more consistent with population clustering and classification and possibly with geographic distance between populations. Thus, EST may at times outperform FST in identifying evolutionarily significant differentiation.
Genetic differentiation among populations is typically derived from the ratio of within- to between-population diversity. The most commonly used metric, FST, was originally introduced as a fixation index at a single biallelic locus, and subsequently adapted as a measure of population subdivision by averaging over multiple loci [2–3]. FST can be expressed mathematically in terms of population diversities as FST = 1−S/T, where S and T represent the heterozygosity in subpopulations and in the total population, respectively [4–5]. The validity of FST as a measure of differentiation has been brought into question, especially when gene diversity is high (e.g., in microsatellites), and various metrics, including G’ST and Jost’s D, have been proposed to address this inadequacy (though see  for a counter-perspective).
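As an illustration of the expression FST = 1−S/T, the following minimal sketch computes FST from allele frequencies at biallelic loci. Several details are assumptions made for the example rather than part of the definition above: heterozygosity is taken as the expected value 2p(1−p), subpopulations are weighted equally, and loci are combined as a ratio of summed heterozygosities. The function names are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus (assumed form)."""
    return 2.0 * p * (1.0 - p)


def fst(freqs_by_pop):
    """Sketch of FST = 1 - S/T.

    freqs_by_pop: one list per subpopulation, each holding the frequency
    of the same reference allele at every locus.
    Assumptions: equal subpopulation weights; loci combined by summing
    S and T over loci before taking the ratio.
    """
    n_pops = len(freqs_by_pop)
    n_loci = len(freqs_by_pop[0])
    s_total = 0.0  # summed within-subpopulation heterozygosity (S)
    t_total = 0.0  # summed total-population heterozygosity (T)
    for locus in range(n_loci):
        ps = [pop[locus] for pop in freqs_by_pop]
        s_total += sum(expected_heterozygosity(p) for p in ps) / n_pops
        p_bar = sum(ps) / n_pops  # pooled allele frequency
        t_total += expected_heterozygosity(p_bar)
    if t_total == 0.0:
        raise ValueError("no variation in the total population; FST undefined")
    return 1.0 - s_total / t_total


if __name__ == "__main__":
    # Two subpopulations, three loci, toy frequencies.
    pop_a = [0.2, 0.5, 0.9]
    pop_b = [0.8, 0.5, 0.7]
    print(fst([pop_a, pop_b]))
```

Under these assumptions, identical allele frequencies in all subpopulations give S = T and hence FST = 0, whereas fixation of alternative alleles gives S = 0 and FST = 1.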
The core distinction between FST and EST is that FST partitions genetic diversity, whereas EST partitions genetic structure within and between populations. While FST is more sensitive to differences in within-population diversity, EST is more sensitive to outliers (though this is largely mitigated by using ESTmedian rather than ESTmean; see Materials and Methods). Since FST is weighed down by high levels of intrapopulation diversity, it can be close to zero even when population clusters are highly separated. Conversely, because FST does not account for intrapopulation structure, a high FST does not necessarily reflect highly separated population clusters. This is not necessarily a flaw in FST, but it does demonstrate a conceptual discrepancy between FST and strength of clustering.
The HGDP data used in our analysis were accessed at: http://www.hagsc.org/hgdp/files.html. After removing the 163 mitochondrial SNPs and 105 samples previously inferred to be close relatives, the final file included 660,755 SNPs from 938 samples in 53 populations. Strings of SNPs were treated as sequences, with mismatches summed and divided by the sequence length. Pairwise distances, based on Allele Sharing Distance (ASD), were calculated as one minus half the average number of shared alleles per locus. The theoretical model, mathematical proofs and numerical simulations (using Mathematica v.8.0) of SDT and SDS appear in Appendix A.
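The pairwise distance described above (one minus half the average number of shared alleles per locus) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: genotypes are assumed to be coded as 0, 1 or 2 copies of a reference allele at biallelic SNPs, shared alleles are counted by identity-by-state as 2 − |g1 − g2|, and loci with a missing call (`None`) in either individual are skipped. The function names are hypothetical.

```python
from itertools import combinations


def shared_alleles(g1, g2):
    """Alleles shared identical-by-state at one biallelic locus.

    Assumes genotypes coded as 0, 1 or 2 copies of a reference allele.
    """
    return 2 - abs(g1 - g2)


def asd(ind_a, ind_b):
    """Allele Sharing Distance: 1 - (mean shared alleles per locus) / 2.

    Loci where either genotype is None (missing) are skipped; this
    handling of missing data is an assumption of the sketch.
    """
    pairs = [(a, b) for a, b in zip(ind_a, ind_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no loci with calls in both individuals")
    mean_shared = sum(shared_alleles(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - 0.5 * mean_shared


def pairwise_asd(genotypes):
    """ASD for every unordered pair of individuals, keyed by index pair."""
    return {(i, j): asd(genotypes[i], genotypes[j])
            for i, j in combinations(range(len(genotypes)), 2)}


if __name__ == "__main__":
    # Three toy individuals typed at five SNPs.
    samples = [
        [0, 1, 2, 2, 0],
        [0, 1, 2, 1, 0],
        [2, 1, 0, None, 2],
    ]
    for pair, distance in pairwise_asd(samples).items():
        print(pair, distance)
```

With this coding, identical genotype strings have a distance of 0 and strings that are opposite homozygotes at every locus have a distance of 1, which matches the description of mismatches summed and divided by the sequence length.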
The standard deviation of pairwise distances as a measure of population structure