Date Published: August 10, 2017
Publisher: Public Library of Science
Author(s): Luke Maurits, Robert Forkel, Gereon A. Kaiping, Quentin D. Atkinson, Gyaneshwer Chaubey.
We present a new open source software tool called BEASTling, designed to simplify the preparation of Bayesian phylogenetic analyses of linguistic data using the BEAST 2 platform. BEASTling transforms comparatively short and human-readable configuration files into the XML files used by BEAST to specify analyses. By taking advantage of Creative Commons-licensed data from the Glottolog language catalog, BEASTling allows the user to conveniently filter datasets using names for recognised language families, to impose monophyly constraints so that inferred language trees are backward compatible with Glottolog classifications, or to assign geographic location data to languages for phylogeographic analyses. Support for the emerging cross-linguistic linked data format (CLDF) permits easy incorporation of data published in cross-linguistic linked databases into analyses. BEASTling is intended to make the power of Bayesian analysis more accessible to historical linguists without strong programming backgrounds, in the hopes of encouraging communication and collaboration between those developing computational models of language evolution (who are typically not linguists) and relevant domain experts.
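To give a feel for what a "comparatively short and human-readable" configuration file might look like, the sketch below uses Python's standard configparser module to write a small INI-style file. The section and option names (admin/basename, MCMC/chainlength, languages/families, languages/monophyly, and a "model" section with model and data options) are recalled from the BEASTling documentation and should be treated as assumptions to check against it. The helper build_config, the file name cognates.csv and the chosen values are purely hypothetical.

```python
import configparser
import io


def build_config(basename, family, data_file, chain_length=10_000_000):
    """Return INI-style text for a hypothetical BEASTling-like analysis.

    All section and option names are assumptions; verify them against
    the BEASTling documentation before use.
    """
    config = configparser.ConfigParser()
    # Output file naming (assumed option name).
    config["admin"] = {"basename": basename}
    # Length of the MCMC chain (assumed option name).
    config["MCMC"] = {"chainlength": str(chain_length)}
    # Filter the data to one Glottolog family and constrain the tree
    # to respect the Glottolog classification (assumed option names).
    config["languages"] = {"families": family, "monophyly": "True"}
    # One substitution model applied to one data file (assumed names).
    config["model vocabulary"] = {"model": "covarion", "data": data_file}

    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


config_text = build_config("austronesian", "Austronesian", "cognates.csv")
print(config_text)
```

The printed text could be saved to a file and passed to the BEASTling command line tool, which would then generate the much longer BEAST XML file.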
Recent years have seen an increased interest in the use of computational and especially Bayesian methods for inferring phylogenetic trees of languages within an explicit, model-based framework [1–13]. The recency of this trend means there is currently a lack of software tailored to the needs of this sort of analysis of linguistic data. Thus, published analyses to date have all relied on software developed for biological phylogenetics, such as BayesPhylogenies, BEAST [15, 16] or MrBayes [17, 18].
The recent explosion of interest in linguistic phylogenetics has quickly settled on a standard framework of Bayesian inference using probabilistic models of language evolution. In this framework, languages are placed into a binary phylogenetic tree and linguistic data, such as cognacy judgements or structural/typological observations, are associated with the leaf nodes of the tree, representing extant or recently extinct languages. The data is assumed to have been generated via a probabilistic process defined on the tree, with features taking a particular value at the root and then potentially changing along each branch. The probability of a feature changing value depends upon the length of the branch. The model for calculating probabilities can be simple or quite complicated, with multiple parameters controlling the behaviour.
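The dependence of change probability on branch length can be illustrated with the simplest possible case. The sketch below assumes a symmetric two-state (e.g. cognate present/absent) continuous-time Markov model with a single rate parameter. This is an illustrative choice, not necessarily any of the models BEASTling provides. The function names are hypothetical. Under this assumption, the probability that a feature differs between the two ends of a branch of length t is (1 - exp(-2rt))/2, and the likelihood of the data at two sister leaves is obtained by summing over the unobserved root value.

```python
import math


def transition_probability(start, end, rate, branch_length):
    """Probability of observing `end` after `branch_length`, given `start`.

    Assumes a symmetric two-state model in which each state changes to
    the other at the same instantaneous `rate`.
    """
    p_change = 0.5 * (1.0 - math.exp(-2.0 * rate * branch_length))
    return p_change if start != end else 1.0 - p_change


def cherry_likelihood(left_state, right_state, rate, left_length, right_length):
    """Likelihood of two leaf values on a two-leaf tree.

    The root value is unobserved, so we sum over both possibilities,
    weighting each by a uniform prior of 0.5.
    """
    total = 0.0
    for root_state in (0, 1):
        total += (
            0.5
            * transition_probability(root_state, left_state, rate, left_length)
            * transition_probability(root_state, right_state, rate, right_length)
        )
    return total


# Longer branches make a change more likely, saturating at 0.5.
for length in (0.1, 1.0, 10.0):
    print(length, transition_probability(0, 1, rate=0.5, branch_length=length))
```

Richer models add parameters (unequal rates, rate variation across features, and so on), but the structure is the same: per-branch transition probabilities combined over the tree.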
BEASTling analyses are focussed on the inference of phylogenetic trees from linguistic data. It is also possible for users to provide a known and trusted phylogenetic tree which is held fixed during the analysis, so that model parameters may be estimated conditional on that tree.
BEASTling analyses use the Yule pure birth process to define a prior distribution over phylogenetic trees. The birth rate parameter is constant over all locations on the tree, but its value is inferred during the MCMC procedure. The Yule prior is one of two tree prior families supported by BEAST, and in biological applications is typically used to constrain trees over multiple species, i.e. the branching events are interpreted as speciation. The other supported family is the coalescent process, which is typically used for trees over populations of a single species, i.e. the branching events are interpreted as reproduction. Coalescent trees have a characteristic shape in which the oldest branching events are very much older than the most recent. There is no theoretical basis for expecting the language diversification process, which is more often analogised to speciation than to within-population variation, to yield trees with this shape, nor is there empirical evidence of it in any established reconstructions. BEASTling therefore prefers the Yule prior. One shortcoming of this approach is that the Yule model assumes that languages never go extinct, when in fact language extinction is believed to be a frequent occurrence. The development of new tree priors specifically designed for linguistic phylogenetics is a continuing area of research, and future releases of BEASTling will include support for any suitable new tree priors implemented for BEAST.
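The behaviour of the Yule process can be explored with a short simulation. In a pure birth process every lineage splits independently at the same rate, so while k lineages exist the waiting time to the next split is exponentially distributed with rate k times the birth rate. The sketch below holds the birth rate fixed for illustration, whereas in a BEASTling analysis it is inferred. Time is measured from the root split (two lineages) to the appearance of the final lineage, which is one of several possible conventions. The function names are hypothetical.

```python
import random


def expected_height(n_leaves, birth_rate):
    """Expected time from the root split until `n_leaves` lineages exist."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_leaves))


def simulate_height(n_leaves, birth_rate, rng):
    """Draw one tree height under the Yule process.

    With k lineages present, the next split arrives after an
    Exponential(k * birth_rate) waiting time.
    """
    return sum(rng.expovariate(k * birth_rate) for k in range(2, n_leaves))


rng = random.Random(0)
n_leaves, birth_rate = 10, 1.0
heights = [simulate_height(n_leaves, birth_rate, rng) for _ in range(20000)]
mean_height = sum(heights) / len(heights)

print("simulated mean height:", mean_height)
print("analytic expectation: ", expected_height(n_leaves, birth_rate))
```

Because no lineage is ever removed, the simulation also makes the shortcoming noted above concrete: the process has no mechanism for language extinction.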
To illustrate the sorts of analyses BEASTling is designed to facilitate, we present the results of two example analyses. Our intent is to concisely demonstrate the various abilities of the software, and these analyses should not be construed as serious attempts at historical linguistic scholarship. The BEASTling configuration files for both analyses are available as S1 and S2 Files in the Supporting Material. Further, the configuration files, data files and processing scripts required to replicate both of these example analyses are available in a GitHub repository at https://github.com/glottobank/BEASTling_paper/.
BEASTling is an open source project and full source code is available in a version control repository hosted by GitHub at https://github.com/lmaurits/BEASTling, under the terms of a 2-clause BSD license. BEASTling is also hosted at the Python Package Index (PyPI) and thus may be easily installed using standard Python packaging tools such as easy_install or pip. Searchable documentation, including a tutorial, is hosted by Read The Docs at https://beastling.readthedocs.org.