Date Published: May 2, 2012
Publisher: BioMed Central
Author(s): Frederick A Matsen, Aaron Gallagher.
Although taxonomy is often used informally to evaluate the results of phylogenetic inference and the root of phylogenetic trees, algorithmic methods to do so are lacking.
In this paper we formalize these procedures and develop algorithms to solve the relevant problems. In particular, we introduce a new algorithm that solves a “subcoloring” problem to express the difference between a taxonomy and a phylogeny at a given rank. This algorithm improves upon the current best algorithm in terms of asymptotic complexity for the parameter regime of interest; we also describe a branch-and-bound algorithm that saves orders of magnitude in computation on real data sets. We also develop a formalism and an algorithm for rooting phylogenetic trees according to a taxonomy.
The algorithms in this paper, and the associated freely-available software, will help biologists better use and understand taxonomically labeled phylogenetic trees.
Since the beginnings of phylogenetics, researchers have used a combination of phylogenetic inference and taxonomic knowledge to understand evolutionary relationships. Taxonomic classifications are often used to diagnose problems with phylogenetic inferences, and conversely, phylogeny is used to bring taxonomies up to date with recent inferences and to find misclassified sequences. Similarly, biologists often evaluate a putative “root” of a phylogenetic tree by looking at the taxonomic classifications of the subtrees branching off that node.
Researchers generally like to root phylogenetic trees in a way that the progression along edges from the root to the leaves is one of evolutionary descent. There are a number of ways of achieving this, from using outgroups to using non-reversible models of mutation . However, there has been surprisingly little work on one of the most commonly used informal means of rerooting, which is by using taxonomic classifications. By that we mean looking for a rooting such that the leaf sets of the descendant subtrees each have a single taxonomic classification at the highest taxonomic rank that contains multiple taxonomic identifiers. Here we formalize this process and describe algorithms for finding the taxonomic root or roots.
We have formalized the question of describing the discordance of a phylogenetic tree with its taxonomic classifications in terms of a convex subcoloring problem previously described in the literature. This coloring problem has some elegant solutions for the general case, but the parameter regime of interest here consists of trees of small degree and local nonconvexity. These considerations motivate a solution that solves a given recursion for as few “questions” as possible. The first component of this is to restrict attention to cut colors, resulting in a smaller base for the exponential complexity (Figure 4). The second is a branch and bound algorithm that gives a significant improvement in runtime compared to the algorithm in Theorem 1 (Figure 6). To enable this the φi are only built up “upon demand,” that is, when a given question is asked. The implementation described here is the first of which we are aware, and certainly the first that conveniently integrates with taxonomic annotation.
The authors declare that they have no competing interests.
FAM conceived of the project, designed the algorithms, and wrote the paper. AG designed the algorithms and implemented the algorithms and validations. Both authors read and approved the final manuscript.