Research Article: Computing evolutionary distinctiveness indices in large scale analysis

Date Published: April 13, 2012

Publisher: BioMed Central

Author(s): Iain Martyn, Tyler S Kuhn, Arne O Mooers, Vincent Moulton, Andreas Spillner.


We present optimal linear time algorithms for computing the Shapley values and ‘heightened evolutionary distinctiveness’ (HED) scores for the set of taxa in a phylogenetic tree. We demonstrate the efficiency of these new algorithms by applying them to a set of 10,000 reasonable 5139-species mammal trees. This is the first time these indices have been computed on such a large taxon, and we contrast our findings with an ad-hoc index for mammals, fair proportion (FP), used by the Zoological Society of London’s EDGE programme. Our empirical results follow expectations. In particular, the Shapley values are very strongly correlated with the FP scores, but give more weight to the few monotremes that comprise the sister group to all other mammals. We also find that the HED score, which measures a species’ unique contribution to future subsets as a function of the probability that close relatives will go extinct, is very sensitive to the estimated probabilities. When these probabilities are low, HED scores are lower than FP scores and approach the simple measure of a species’ age. Deviations (like the Solenodon genus of the West Indies) occur when sister species are both at high risk of extinction and their clade roots deep in the tree. Conversely, when endangered species have higher probabilities of being lost, HED scores can be greater than FP scores, and species like the African elephant Loxodonta africana, the two solenodons and the thumbless bat Furipterus horrens can move up the rankings. We suggest that conservation attention be applied to such species that carry genetic responsibility for imperiled close relatives. We also briefly discuss extensions of Shapley values and HED scores that are possible with the algorithms presented here.

Partial Text

A phylogenetic tree is a directed graph that portrays the evolutionary relationships among its leaves. The shape of a phylogenetic tree of species can also be viewed as a measure of the redundant and unique evolutionary information embodied in the species: a species in a large and recently-diverged genus like Mus shares much of its evolutionary history with many other species, while the monotypic platypus (Ornithorhynchus anatinus) embodies a large amount of mammalian evolutionary information not found elsewhere (as expressed in its peculiar genome [1] and phenotype [2]).

Let T = (V, E, λ) be an unrooted, edge-weighted phylogenetic tree on a set X with n taxa. Here V and E denote the set of vertices and edges of the tree and λ is a map that assigns to every edge e ∈ E a non-negative real number, the length λ(e) of this edge. With every edge e of T is associated a split S_e of X. For any x ∈ X, we denote by S_e(x) the set in S_e that contains x and by S̄_e(x) the other set. In addition, for any subset Y ⊆ X, PD_T(Y) denotes the total length of the smallest subtree of T containing the taxa in Y, also known as the phylogenetic diversity of Y with respect to T (see e.g. [6]). In the following we first define the two indices we will focus on in this paper and then present optimal linear time algorithms for computing them.
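To make the notation above concrete, the following Python sketch computes the splits and the phylogenetic diversity PD_T(Y) on a small unrooted tree. It relies on the observation that an edge lies in the smallest subtree containing Y exactly when both sides of its split contain a taxon of Y. It then obtains Shapley values directly from the textbook definition, by averaging each taxon's marginal gain in PD over all orderings of the taxa. This brute-force approach takes factorial time and is not the linear time algorithm of the paper; it is only a reference for tiny trees. The edge-list representation, the function names and the four-taxon example tree are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations


def splits(edges, taxa):
    """For each edge, return (taxa on one side of its split, edge length)."""
    adj = defaultdict(list)
    for u, v, _ in edges:
        adj[u].append(v)
        adj[v].append(u)
    result = []
    for u, v, length in edges:
        # Collect every vertex reachable from u without crossing edge (u, v).
        seen, stack = {u}, [u]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen and {node, nxt} != {u, v}:
                    seen.add(nxt)
                    stack.append(nxt)
        result.append((seen & taxa, length))
    return result


def phylo_diversity(tree_splits, subset):
    """PD of `subset`: total length of edges separating two of its taxa."""
    subset = set(subset)
    total = 0.0
    for side, length in tree_splits:
        # The edge is in the smallest subtree iff both sides meet the subset.
        if subset & side and subset - side:
            total += length
    return total


def shapley_brute_force(edges, taxa):
    """Average marginal PD gain of each taxon over all orderings (tiny trees only)."""
    tree_splits = splits(edges, taxa)
    values = {x: 0.0 for x in taxa}
    n_orders = 0
    for order in permutations(sorted(taxa)):
        n_orders += 1
        present, previous = [], 0.0
        for x in order:
            present.append(x)
            current = phylo_diversity(tree_splits, present)
            values[x] += current - previous
            previous = current
    return {x: total / n_orders for x, total in values.items()}


# Hypothetical quartet tree ((a,b),(c,d)) with internal vertices u and v.
TAXA = {"a", "b", "c", "d"}
EDGES = [("a", "u", 1), ("b", "u", 2), ("u", "v", 3), ("c", "v", 4), ("d", "v", 5)]

if __name__ == "__main__":
    tree_splits = splits(EDGES, TAXA)
    print(phylo_diversity(tree_splits, {"a", "b"}))
    print(shapley_brute_force(EDGES, TAXA))
```

Because every ordering ends with the full taxon set, the Shapley values computed this way sum to the total length of the tree, which gives a quick sanity check on any faster implementation.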

We tested the utility of the new linear time algorithms for the Shapley values and HED scores by applying them to an updated version of the complete mammal tree the ZSL used to generate EDGE scores [11]. We outline the dataset and implementation below.

The fact that the Shapley and HED values are measures of evolutionary distinctiveness on unrooted trees suggests that the above approach to highlighting imperiled and evolutionarily isolated bits of biodiversity could be extended from species on a tree to populations connected via a network on the landscape. Importantly, the algorithms presented here for computing Shapley values and HED scores lend themselves naturally to split networks [23]. The motivation for such an extension comes from the observation that prioritizing populations within species may present policymakers with a useful tool after a species has been legally listed (e.g. through an Endangered Species Act) for conservation management. Once a species has been awarded protection, and funds are allocated for survival and recovery, an early step in any management plan is to assess how many populations there are, what state each is in, how they are demographically and genetically connected on the landscape, and where genetic diversity lies. As when arguing for a triage approach to species conservation, it may be useful and efficient to highlight those populations of an endangered species that are at once distinctive and that carry genetic responsibility for other populations. Costs and benefits may be easier to compare within than between species, such that objective decisions as to where to invest scarce conservation resources may be more palatable.

The authors declare that they have no competing interests.

AS and VM conceived of the algorithm, AS produced the equations, TK created the trees, and IM and AM implemented the algorithm, performed the study, and wrote the first draft of the paper. All authors read and approved the final manuscript.