Research Article: Polynomial Supertree Methods Revisited

Date Published: December 21, 2011

Publisher: Hindawi Publishing Corporation

Author(s): Malte Brinkmeyer, Thasso Griebel, Sebastian Böcker.


Supertree methods allow to reconstruct large phylogenetic trees by combining smaller trees with overlapping leaf sets into one, more comprehensive supertree. The most commonly used supertree method, matrix representation with parsimony (MRP), produces accurate supertrees but is rather slow due to the underlying hard optimization problem. In this paper, we present an extensive simulation study comparing the performance of MRP and the polynomial supertree methods MinCut Supertree, Modified MinCut Supertree, Build-with-distances, PhySIC, PhySIC_IST, and super distance matrix. We consider both quality and resolution of the reconstructed supertrees. Our findings illustrate the tradeoff between accuracy and running time in supertree construction, as well as the pros and cons of voting- and veto-based supertree approaches. Based on our results, we make some general suggestions for supertree methods yet to come.

Partial Text

In recent years, supertree methods have become a familiar tool for building large phylogenetic trees. Supertree approaches combine input trees with overlapping taxa sets into one large and more comprehensive tree [1]. Systematists have probably used informal supertree methods since the beginning of systematics itself, pasting together hierarchically nested taxa. Since the introduction of the term supertree and the first formal supertree method [2], there has been a continuous development of such methods, see for example Bininda-Emonds [3]. In contrast to the combination of input trees, the combination of input datasets into a matrix of characters, and subsequent analysis of this matrix under standard reconstruction criteria is called the supermatrix approach, see for example de Queiroz and Gatesy [4]. Both approaches hold the promise to reconstruct phylogenies for large clades of the tree of life, for example, Bininda-Emonds et al. [5] reconstructed a supertree on 4510 taxa for the entire group of mammals.

Our notations and definitions mainly follow Semple and Steel [34]. A graph, denoted G = (V, E), is a structure consisting of a set V of vertices, and a set E⊆{{x, y} : x, y ∈ V} of connections called edges. A graph is simple if it has no loops or parallel edges, and it is called weighted if each edge e ∈ E has a weight w(e) assigned to it. Given E′⊆E, we let G∖E′ denote the graph obtained from G by deleting all edges in E′. If G∖E′ is disconnected, E′ is called a cut set of G. In a weighted graph, w(E′) : = ∑e∈E′w(e) is the cut weight of E′. A cut set of minimum weight is called a minimum cut. A connected component Ci of a graph is a maximal set of connected vertices, that is, for any pair of vertices x and y there is a path from x to y.

Branch length from the input trees in a simulation study is arguably a best-case scenario for the BWD method, as the trees are simulated using the same parameters (see Section 5 for details). Nevertheless, if two taxa x and y cooccur in two input tree Ti and Tj, typically it is not true that λi(x, y) = λj(x, y). Willson’s as well as our implementation of the BWD algorithm merely averages the different values λ(x, y) obtained from different input trees, which works well in our simulation. If the BWD method is going to be used in a real-world study, branch length from different input trees have to be comparable. Clearly, careful studies on how to reconcile different distances have to be performed before applying BWD. In the following we outline ideas about how to reconcile branch-length from real-world data. First, we can compute a pairwise distance matrix for all taxa in the supertree, using either multiple alignments or pairwise alignments. We have different length estimates from different multiple datasets that we can normalize by finding a multiplicative constant for each dataset such that the sum of squared differences is minimized. Alternatively, we can use a linear program for this purpose, minimizing the maximum absolute difference. Finally, we can find multiplicative constants by a robust estimator. This results in a pairwise distance matrix D(t, t′) for all taxa t, t′.

We now present the layout of our large-scale simulation study, conducted to evaluate the accuracy and resolution of the methods MRP, MC, MMC, PhySIC, PhySIC_IST, BWD (including our new support function), and SDM. An overview of the simulation layout can be found in Figure 1. Each step is described in detail below.

All computations were performed on a Linux cluster of AMD Opteron-2378, 2.4 GHz CPUs, with 16 GB of memory.

We have presented a large-scale simulation study to assess and compare the accuracy and the resolution of supertrees constructed by polynomial supertree methods and the de facto standard supertree method MRP. Our results show that recent polynomial supertree methods can sometimes compete with the classical MRP approach while guaranteeing a significantly better running time, which did not exceed a few seconds for all polynomial methods. As mentioned in the introduction, one future approach to build larger clades of the Tree of Life might be a divide-and-conquer framework coupled with supertree methods, and “particular focus needs to be placed on developing fast supertree methods that yield accurate and well resolved solutions” [53].