**Date Published:** August 31, 2012

**Publisher:** BioMed Central

**Author(s):** Daniel Doerr, Ilan Gronau, Shlomo Moran, Irad Yavneh.

http://doi.org/10.1186/1748-7188-7-22

**Abstract**

**Distance-based phylogenetic reconstruction methods use evolutionary distances between species in order to reconstruct the phylogenetic tree spanning them. There are many different methods for estimating distances from sequence data. These methods assume different substitution models and have different statistical properties. Since the true substitution model is typically unknown, it is important to consider the effect of model misspecification on the performance of a distance estimation method.**

This paper continues a line of research that seeks, for each given set of input sequences, a distance function which maximizes the expected topological accuracy of the reconstructed tree. We focus here on the effect of systematic error caused by assuming an inadequate model, but also consider the stochastic error caused by using short sequences. We introduce a theoretical framework for analyzing both sources of error based on the notion of deviation from additivity, which quantifies the contribution of model misspecification to the estimation error. We demonstrate this framework by studying the behavior of the Jukes-Cantor distance function when applied to data generated according to Kimura’s two-parameter model with a transition-transversion bias. We provide both a theoretical derivation for this case and a detailed simulation study on quartet trees.

**We demonstrate both analytically and experimentally that by deliberately assuming an oversimplified evolutionary model, it is possible to increase the topological accuracy of reconstruction. Our theoretical framework provides new insights into the mechanisms that enable statistically inconsistent reconstruction methods to outperform consistent methods.**

**Partial Text**

Phylogenetic reconstruction is the task of determining the topology of an evolutionary tree underlying a given set of samples (species) using sequence data extracted from them. This is typically done by assuming some simplified model for DNA sequence evolution, in most cases modeling it as a homogeneous continuous-time Markov process [1-3]. Distance-based reconstruction algorithms tackle this task by first computing a set of $\binom{n}{2}$ pairwise distances between the $n$ input samples and then finding a tree which fits these distances. The distance measures used for this purpose typically reflect the rates of certain substitution events along the evolutionary paths in question. We thus refer to these distance measures as substitution rate (SR) functions. The distance-based approach is based on the fact that if the SR function used is additive for the underlying substitution model, and the input sequences are sufficiently long, then the topology of the true tree can be efficiently recovered with high probability. However, since the underlying evolutionary model is usually unknown, this assumption is rarely satisfied in practice.
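The first step of the distance-based approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes pre-aligned, gap-free sequences of equal length, uses the standard Jukes-Cantor correction as the SR function, and the function names (`p_distance`, `jc_distance`, `pairwise_distances`) are hypothetical.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned sites at which sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc_distance(a, b):
    """Jukes-Cantor distance: -3/4 * ln(1 - 4p/3), with p the observed p-distance.

    The formula is undefined once p reaches 3/4 (saturation); this sketch
    returns infinity in that case, which is an illustrative choice.
    """
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def pairwise_distances(seqs):
    """Return the n-choose-2 distances for a dict {taxon name: sequence}."""
    return {
        (i, j): jc_distance(seqs[i], seqs[j])
        for i, j in combinations(sorted(seqs), 2)
    }
```

For four taxa, `pairwise_distances` returns six entries keyed by sorted name pairs; a tree-fitting step would then take this dictionary as input.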

In this section we provide a brief exposition of DNA substitution models and substitution rate functions used for distance estimation. We concentrate on details essential to this study and refer the reader to a previous paper [9] and standard textbooks [1,2] for a more complete survey.

In order to assess whether a given SR function $\Delta$ is consistent w.r.t. a given model tree $T$, one has to find an affine-additive mapping $D^\star$ which minimizes the ratio $\frac{\|D_\Delta^T, D^\star\|_\infty}{w_{\min}(D^\star)}$ (see Definition 2.3). This task seems hard in a general setting, but in the special case of homogeneous substitution models it is tractable. Consider a homogeneous substitution model $\mathcal{M}_R$. The unit rate matrix $R$ implies a 1-1 mapping between evolutionary time $t$ and rate matrices in $\mathcal{M}_R$. It is thus useful to view an SR function for $\mathcal{M}_R$ as a function $\Delta:\mathbb{R}^+\to\mathbb{R}^+$ which maps the evolutionary time $t$ to a dissimilarity measure $\Delta(t)$.
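To make the view of an SR function as a map from evolutionary time to dissimilarity concrete, the sketch below evaluates the Jukes-Cantor and Kimura two-parameter formulas on the expected transition and transversion frequencies of a homogeneous K2P model. The parameterization is an assumption of this sketch, not taken from the text: transition rate `alpha`, rate `beta` for each of the two transversions, ti-tv ratio `R = alpha / (2 * beta)`, and a unit rate matrix normalized to one substitution per site (`alpha + 2 * beta = 1`). All function names are hypothetical.

```python
import math


def k2p_rates(R):
    """Rates (alpha, beta) with alpha / (2 * beta) = R and alpha + 2 * beta = 1."""
    beta = 1.0 / (2.0 * (R + 1.0))
    alpha = R / (R + 1.0)
    return alpha, beta


def expected_ti_tv(t, R):
    """Expected fractions of transitions (P) and transversions (Q) after time t."""
    alpha, beta = k2p_rates(R)
    Q = 0.5 - 0.5 * math.exp(-4.0 * beta * t)
    P = (0.25 + 0.25 * math.exp(-4.0 * beta * t)
         - 0.5 * math.exp(-2.0 * (alpha + beta) * t))
    return P, Q


def delta_k2p(t, R):
    """Kimura's formula applied to the expected frequencies (infinite-length limit)."""
    P, Q = expected_ti_tv(t, R)
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


def delta_jc(t, R):
    """Jukes-Cantor formula applied to the same K2P-generated frequencies."""
    P, Q = expected_ti_tv(t, R)
    return -0.75 * math.log(1.0 - 4.0 * (P + Q) / 3.0)
```

Under these assumptions `delta_k2p(t, R)` returns `t` up to rounding for any `R`, whereas `delta_jc(t, 2.0)` falls below `t` and grows sublinearly, which is the kind of systematic deviation the framework is designed to quantify.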

The quartet tree is the smallest phylogenetic tree with non-trivial topology. Focusing on quartets enables a close study of the effects of deviation from additivity and stochastic noise on reconstruction accuracy. The topology of a quartet spanning four taxa $\{1,2,3,4\}$ can be represented by the split notation $(ij|kl)$ (where $\{i,j,k,l\}=\{1,2,3,4\}$), indicating that the internal edge of the quartet separates $i,j$ from $k,l$. All distance-based quartet resolution algorithms essentially reduce to the four-point method (FPM) [26,30], which resolves this split using the six observed pairwise distances $\{\hat{d}_{ij} : \{i,j\}\subset\{1,2,3,4\}\}$: it first partitions the six observed distances into three sums $\hat{d}_{12}+\hat{d}_{34}$, $\hat{d}_{13}+\hat{d}_{24}$, and $\hat{d}_{14}+\hat{d}_{23}$, and then determines the quartet split according to the minimal sum (the sum $\hat{d}_{ij}+\hat{d}_{kl}$ corresponds to the split $(ij|kl)$). We will focus on the task of reconstructing homogeneous K2P quartets using FPM with distances $\{\hat{d}_{ij}\}$ estimated using either $\Delta_{\mathrm{JC}}$ or $\Delta_{\mathrm{K2P}}$. We note that most of our findings easily generalize to more sophisticated homogeneous substitution models, replacing $\Delta_{\mathrm{JC}}$ by any concave distance function and $\Delta_{\mathrm{K2P}}$ by some SR function corresponding to the evolutionary time $t$.
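The four-point method as described above is short enough to write out directly. The following is a minimal sketch: the input format (a dict keyed by `frozenset` pairs of taxon labels 1 to 4) and the function name are illustrative assumptions, and ties between sums are broken arbitrarily in favour of the first split listed.

```python
def four_point_method(d):
    """Resolve a quartet on taxa 1..4 from its six pairwise distances.

    d maps frozenset({i, j}) to the estimated distance between taxa i and j.
    Returns the split ((i, j), (k, l)) whose sum d_ij + d_kl is minimal.
    """
    def dist(i, j):
        return d[frozenset((i, j))]

    # The three possible splits and their associated distance sums.
    sums = {
        ((1, 2), (3, 4)): dist(1, 2) + dist(3, 4),
        ((1, 3), (2, 4)): dist(1, 3) + dist(2, 4),
        ((1, 4), (2, 3)): dist(1, 4) + dist(2, 3),
    }
    # min() returns the first key with the smallest sum, so ties go to the
    # earlier split in the dict above (an arbitrary choice of this sketch).
    return min(sums, key=sums.get)
```

For example, with an additive metric on the quartet (12|34) where every external edge has length 1 and the internal edge has length 0.5, the sums are 4, 5, and 5, and the function returns `((1, 2), (3, 4))`.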

In this section we describe experiments done on simulated data sets generated along the seven-taxon tree assembled by Hasegawa, Kishino, and Yano in 1985 [1,6]. This tree, spanning seven eutherian mammals (Figure 5a), was reconstructed originally using mitochondrial DNA sequences. It has a caterpillar topology (meaning that every internal node is incident to an external edge), and it has long external edges and short internal edges, making it a suitable representative of small phylogenetic trees spanning moderately distant species. These features also make it particularly challenging for distance-based reconstruction.

In this section we describe our study comparing various SR functions on genomic DNA sequences. Next to $\Delta_{\mathrm{JC}}$ and $\Delta_{\mathrm{K2P}}$ we also considered the well-known LogDet SR function [36,37], denoted here as $\Delta_{\mathrm{LogDet}}$. Extending our study to this setting is challenging in two respects. First, unlike the simulated case, the true tree is not known with complete confidence, and accuracy of reconstruction can only be determined by using a well-accepted reference tree that may contain some errors. Second, the true substitution model is also unknown and is likely to violate the assumptions of both the JC and K2P models and even the relaxed assumptions of the general time-reversible model (in which $\Delta_{\mathrm{LogDet}}$ is additive). Hence, we have to assume in this case that $\Delta_{\mathrm{JC}}$, $\Delta_{\mathrm{K2P}}$, and $\Delta_{\mathrm{LogDet}}$ are all non-affine-additive, where $\Delta_{\mathrm{JC}}$ and $\Delta_{\mathrm{K2P}}$ are still likely to exhibit higher deviation from additivity than $\Delta_{\mathrm{LogDet}}$, since they make stronger assumptions on the substitution model.

In this paper we explored the basic properties of methods for estimating evolutionary distances, and studied how these properties affect the accuracy of distance-based phylogenetic reconstruction. We considered both the systematic bias and the stochastic noise (variance) of the distance estimators, and examined the tradeoff between these two factors. We focused on the common task of phylogenetic reconstruction under homogeneous substitution models. Assuming homogeneous models simplifies the analytical framework, since in such models each SR function reduces to a univariate function of the evolutionary time $t$. However, obtaining accurate estimates of $t$ is still a hard task in this setting, since the unit rate matrix is unknown. An SR function $\Delta$ is guaranteed to yield consistent reconstruction across all trees in a homogeneous model only if it is additive, meaning that it is a linear function of $t$. When $\Delta$ is not additive, it introduces a systematic bias in distance estimates, which we denoted here as deviation from additivity. Some SR functions are only additive in one homogeneous model, whereas others are additive across a wider collection of homogeneous models. This less constrained additivity is typically achieved at a price of increased estimation noise. We studied the tradeoff between “deviation from additivity” and “estimation noise” via a case study where the model tree is a homogeneous K2P-tree with an unknown ti-tv ratio $R$. In this case, Kimura’s distance formula $\Delta_{\mathrm{K2P}}$ is always additive, while the less noisy Jukes-Cantor formula, $\Delta_{\mathrm{JC}}$, is additive only when $R=\tfrac{1}{2}$.
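The special role of the ti-tv ratio one half can be checked with a short calculation. The derivation below is a sketch under a parameterization that is assumed here rather than taken from the text: transition rate $\alpha$, rate $\beta$ for each of the two transversions, $R=\alpha/(2\beta)$, and a unit rate matrix normalized so that $\alpha+2\beta=1$. Under K2P, the expected fraction of differing sites between two sequences separated by time $t$ is

$$
p(t) = \tfrac{3}{4} - \tfrac{1}{4}e^{-4\beta t} - \tfrac{1}{2}e^{-2(\alpha+\beta)t},
$$

so applying the Jukes-Cantor formula gives

$$
\Delta_{\mathrm{JC}}(t) = -\tfrac{3}{4}\ln\!\left(1-\tfrac{4}{3}p(t)\right)
= -\tfrac{3}{4}\ln\!\left(\tfrac{1}{3}e^{-4\beta t} + \tfrac{2}{3}e^{-2(\alpha+\beta)t}\right).
$$

When the two exponents coincide, $4\beta = 2(\alpha+\beta)$, that is $\alpha=\beta$ and hence $R=\tfrac{1}{2}$, the normalization gives $\beta=\tfrac{1}{3}$ and

$$
\Delta_{\mathrm{JC}}(t) = -\tfrac{3}{4}\ln\!\left(e^{-4t/3}\right) = t,
$$

which is linear in $t$. When $\alpha\neq\beta$, the argument of the logarithm is a mixture of two distinct exponentials; the logarithm of such a mixture is strictly convex in $t$, so $\Delta_{\mathrm{JC}}$ is strictly concave and therefore not linear.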

a. This is a WABI 2011 special issue invited paper. An extended abstract of this paper appeared in [47].

b. Typically, the unit rate matrix is assumed to be the one corresponding to one substitution per site.

c. Many common distance-based algorithms, such as the Neighbor Joining (NJ) algorithm [31,32], are known to be robust in this sense.

d. In a tree, edges which touch leaves are external, and all other edges are internal.

e. Type A and Type B quartets represent the Farris zone and the Felsenstein zone, respectively (see, e.g., [1], Chapter 9).

f. We use here the square root of the criterion commonly used in the literature, because we prefer to think in terms of distances rather than squares of distances. This has no practical influence, since we use FC only for comparing different choices, not for assessing the quality of a given choice.

g. This ML estimate is obtained by a simple numerical method for maximizing the likelihood function (see, e.g., [1]).

Let $f(t)$ be a (continuous) function on some interval $[t_0,t_1]$. We prove below that if $f$ does not intersect its linear interpolation $At+B$ in that interval, then $\mathrm{dev}(f,[t_0,t_1])=\frac{1}{A}\max_{t\in[t_0,t_1]}\left|f(t)-At-b^{*}\right|$. We use the following notations, conforming to the notations in the proof of Lemma 2.8:

The authors declare that they have no competing interests.

All authors participated in discussing, formulating, and modulating the research. DD performed the simulations and experiments of the sections “Simulations on Hasegawa’s Tree” and “Inferring trees from genomic sequences”. IG and SM initiated and directed the research and drafted the manuscript. IY performed the analysis in the sections “Deviation from additivity in homogeneous substitution models” and “Performance of Non affine-additive SR functions in quartet resolution” and contributed to the ideas of the project. All authors contributed to the writing and editing of the manuscript, and all authors read and approved the final manuscript.
