Date Published: July 9, 2013
Publisher: BioMed Central
Author(s): Akshay Deepak, David Fernández-Baca, Michelle M McMahon.
A multi-labeled tree, or MUL-tree, is a phylogenetic tree where two or more leaves share a label, e.g., a species name. A MUL-tree can imply multiple conflicting phylogenetic relationships for the same set of taxa, but can also contain conflict-free information that is of interest and yet is not obvious.
We define the information content of a MUL-tree T as the set of all conflict-free quartet topologies implied by T, and define the maximal reduced form of T as the smallest tree that can be obtained from T by pruning leaves and contracting edges while retaining the same information content. We show that any two MUL-trees with the same information content exhibit the same reduced form. This introduces an equivalence relation among MUL-trees with potential applications to comparing MUL-trees. We present an efficient algorithm to reduce a MUL-tree to its maximally reduced form and evaluate its performance on empirical datasets in terms of both quality of the reduced tree and the degree of data reduction achieved.
Our measure of conflict-free information content based on quartets is simple and topologically appealing. In the experiments, the maximally reduced form is often much smaller than the original tree, yet retains most of the taxa. The reduction algorithm is quadratic in the number of leaves and its complexity is unaffected by the multiplicity of leaf labels or the degree of the nodes.
Multi-labeled trees, also known as MUL-trees, are phylogenetic trees that can have more than one leaf with the same label
1). MUL-trees arise naturally and frequently in data sets containing multiple gene sequences for the same species
, but they can also arise in biogeographical studies or co-speciation studies where leaves represent individual taxa yet are labeled with their areas
 or hosts
A MUL-tree is a triple (T,M,ψ), where (i) T is an unrooted treea with leaf set
L(T) all of whose internal nodes have degree at least three, (ii) M is a set of labels, and (iii)
ψT:L(T)→M is a surjective map that assigns each leaf of T a label from M. (Note that if ψ is a bijection, T is singly labeled; that is, singly-labeled trees are a special case of MUL-trees.) For brevity we often refer to a MUL-tree by its underlying tree T. In what follows, unless stated otherwise, by a tree we mean a MUL-tree.
Our goal is to provide a way to reduce a MUL-tree T as much as possible, while preserving its information content. Our reduction algorithm uses the following operations. Prune (v): Delete leaf v from T. If, as a result, v’s neighbor u becomes a degree-two node, connect the former two neighbors of u by an edge and delete u. Contract (e): Delete an internal edge e and identify its endpoints.
In preparation for the MUL-tree reduction algorithm of the next section, we give some results that help to identify contractible edges and prunable leaves.
We now describe a O(n2) algorithm to compute the MRF of an n-leaf MUL-tree T. In the previous section, the MRF was defined as the tree obtained by applying information-preserving pruning and contraction operations to T, in any order, until it is no longer possible. For efficiency, however, the sequence in which these steps are performed is important. Our algorithm has three distinct phases: a preprocessing step, redundant edge contraction, and pruning of redundant leaves. We describe these next and then give an example.
We implemented our MUL-tree reduction algorithm, as well as a second step that restricts the MRF to the set of labels that appear only once, which yields a singlylabeled tree. We tested our two-step program on a set of 110,842 MUL-trees obtained from the PhyLoTA database
http://phylota.net/; GenBank eukaryotic nucleotide sequences, release 184, June 2011), which included a broad range of label-set sizes, from 4 to 1500 taxa.
We introduced an efficient algorithm to reduce a multi-labeled MUL-tree to a maximally reduced form with the same information content, defined as the set of non-conflicting quartets it resolves. We showed that the information content of a MUL-tree uniquely identifies the MUL-tree’s maximally reduced form. This has potential application in comparing MUL-trees by significantly reducing the number of comparisons as well as in extracting species-level information efficiently and conservatively from large sets of trees, irrespective of the underlying cause of multiple labels. Our algorithm can easily be adapted to work for rooted trees.
aThe results presented here can be extended to rooted trees, using triplets instead of quartets, exploiting the well-known bijection between rooted and unrooted trees (
, p. 20).
The authors declare that they have no competing interests.
MMM and DFB conceived the problem. AD, DFB and MMM designed the experiments and drafted the manuscript. AD designed and implemented the algorithms, and implemented the experiments. DFB coordinated the project. All authors read and approved the final manuscript.