Date Published: February 7, 2017
Publisher: Public Library of Science
Author(s): Andrzej Polanski, Agnieszka Szczesna, Mateusz Garbulowski, Marek Kimmel, Zaid Abdo.
We present new results concerning probability distributions of times in the coalescence tree and expected allele frequencies for the coalescent with large sample size. The results are based on computational methodologies that combine coalescence time scale changes with techniques of integral transformations and analytical formulae for infinite products. We show applications of the proposed methodologies to computing probability distributions of times in the coalescence tree and their limits, to evaluating the accuracy of approximate expressions for times in the coalescence tree and expected allele frequencies, and to the analysis of a large human mitochondrial DNA dataset.
Coalescent theory [1, 2], widely used for statistical inference on genetic parameters and structures of evolving populations, is a thoroughly studied area with many results published over decades. The classical coalescent model concerns a sample drawn from a population that has evolved with constant size over many generations in the past. For such a model, many results were developed concerning, e.g., probability distributions of times in the coalescence tree [3, 4], expected ages [5, 6], and frequencies of mutations and recombinations [3, 4]. Since the majority of populations undergo changes in size in the course of their evolution, several authors developed coalescence computations for the case of time-dependent population sizes, either by deriving analytical approaches [5, 7–9] or by using stochastic coalescence simulations [5, 10]. Other directions in coalescent modeling involve different scenarios of population evolution (constant size, expansions, or bottlenecks), combined with possible inhomogeneity of population structure [11, 12], as well as different models of mutation: infinite sites, infinite alleles, recurrent, and stepwise. There are also several studies concerning coalescence modeling for populations under selection [13–15].
The results presented in this paper concern the past history of an $n$-sample (of DNA sequences) taken at present, as illustrated in Fig 1, where samples are numbered from 1 to $n = 5$. Time $t$ is measured from the present to the past, in units of generations. We assume validity of the diffusion approximation, so $t$ is a continuous variable. Coalescences are events of merging (joining) branches of the phylogenetic tree of samples. Random coalescence times, from a sample of size $n$ to a sample of size $k - 1$, are denoted by $T_k$, $k = 2, 3, \ldots, n$, and their realizations by the corresponding lower case letters $t_n, t_{n-1}, \ldots, t_2$, with $0 < t_n < t_{n-1} < \cdots < t_2$. Times between coalescence events are denoted by the capital and lower case letters $S$ and $s$; in Fig 1 these times are denoted by $S_5, S_4, \ldots, S_2$. Apart from the coalescence times $T_2, \ldots, T_{n-1}, T_n$ and the times between coalescence events $S_2, \ldots, S_{n-1}, S_n$, the time to the most recent common ancestor (TMRCA) and the total length of branches in the coalescence tree (TLBT) are also of special interest (e.g., [5, 7, 8, 16]). They are defined as follows:

$$\mathrm{TMRCA} = T_2, \tag{1}$$

and

$$\mathrm{TLBT} = \sum_{k=2}^{n} k S_k = T_2 + \sum_{k=2}^{n} T_k. \tag{2}$$

In this paper we evaluate the accuracy of previously proposed approximations for times in the coalescence tree and expected allele frequencies, and we compute the probability distributions of times in the coalescence tree and their limits. We also use the Human Mitochondrial Genome mtDB database to compare the exact and approximate log-likelihood functions for solving the inverse problem of estimating population size history from observed allele frequencies [18, 20, 24].
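The two forms of TLBT in Eq (2) can be checked numerically. The following is a minimal sketch, not code from the paper: it assumes the convention $S_k = T_k - T_{k+1}$ with $T_{n+1} = 0$ (the present), which is how the times are laid out in the description of Fig 1. The function names are hypothetical, and the inter-coalescence times are arbitrary positive numbers, since the identity does not depend on their distribution.

```python
import random


def coalescence_times(s):
    """Map inter-coalescence times {k: S_k} to coalescence times {k: T_k}.

    Assumes T_k = S_n + S_{n-1} + ... + S_k, i.e. time is measured from
    the present back to the event that leaves k - 1 lineages.
    """
    n = max(s)
    t = {}
    running = 0.0
    for k in range(n, 1, -1):  # k = n, n-1, ..., 2
        running += s[k]
        t[k] = running
    return t


def tlbt_from_s(s):
    """Left-hand form of Eq (2): sum over k of k * S_k."""
    return sum(k * s_k for k, s_k in s.items())


def tlbt_from_t(t):
    """Right-hand form of Eq (2): T_2 plus the sum of all T_k."""
    return t[2] + sum(t.values())


# Arbitrary positive inter-coalescence times for a sample of size n = 5.
n = 5
rng = random.Random(0)
s = {k: rng.uniform(0.1, 1.0) for k in range(2, n + 1)}
t = coalescence_times(s)

tmrca = t[2]  # Eq (1)
assert abs(tlbt_from_s(s) - tlbt_from_t(t)) < 1e-12  # Eq (2)
```

The agreement follows from a telescoping sum: substituting $S_k = T_k - T_{k+1}$ into $\sum_k k S_k$ leaves $2T_2 + T_3 + \cdots + T_n$.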
In the evolutionary scenario with constant population size, the times between coalescence events, $S_n^C, S_{n-1}^C, \ldots, S_2^C$, are mutually independent random variables, each distributed exponentially, with expectations

$$E(S_k^C) = \frac{N_0}{\binom{k}{2}}, \quad k = 2, 3, \ldots, n. \tag{9}$$

For the case of constant population size one can obtain analytical expressions for the expected allele frequencies $f_{nb}^C$ and probabilities $p_{nb}^C$ [5, 34].

Source: http://doi.org/10.1371/journal.pone.0170701
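As an illustration of Eq (9), the expected TMRCA and TLBT under constant population size follow from linearity of expectation applied to Eqs (1) and (2). The sketch below is an assumption-labeled example rather than the authors' code: the function names are hypothetical, `n0` stands for $N_0$ in Eq (9), and exact rational arithmetic is used so that the sums can be compared with the standard closed forms $2N_0(1 - 1/n)$ and $2N_0\sum_{j=1}^{n-1} 1/j$.

```python
from fractions import Fraction
from math import comb  # available in Python 3.8+


def expected_s(k, n0):
    """E(S_k^C) = N_0 / C(k, 2), as in Eq (9)."""
    return Fraction(n0) / comb(k, 2)


def expected_tmrca(n, n0):
    """E(TMRCA) = E(T_2) = sum of E(S_k^C) for k = 2, ..., n."""
    return sum(expected_s(k, n0) for k in range(2, n + 1))


def expected_tlbt(n, n0):
    """E(TLBT) = sum of k * E(S_k^C) for k = 2, ..., n."""
    return sum(k * expected_s(k, n0) for k in range(2, n + 1))


# Sample of size n = 5 with N_0 = 1 (time in units of N_0 generations).
n, n0 = 5, 1
harmonic = sum(Fraction(1, j) for j in range(1, n))  # H_{n-1}

assert expected_tmrca(n, n0) == 2 * n0 * (1 - Fraction(1, n))
assert expected_tlbt(n, n0) == 2 * n0 * harmonic
```

Only expectations are computed here, so the independence of the $S_k^C$ is not used; it would matter for variances or full distributions.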