Research Article: Research applications of primary biodiversity databases in the digital age

Date Published: September 11, 2019

Publisher: Public Library of Science

Author(s): Joan E. Ball-Damerow, Laura Brenskelle, Narayani Barve, Pamela S. Soltis, Petra Sierwald, Rüdiger Bieler, Raphael LaFrance, Arturo H. Ariño, Robert P. Guralnick, Daniel de Paiva Silva.


Our world is in the midst of unprecedented change—climate shifts and sustained, widespread habitat degradation have led to dramatic declines in biodiversity rivaling historical extinction events. At the same time, new approaches to publishing and integrating previously disconnected data resources promise to help provide the evidence needed for more efficient and effective conservation and management. Stakeholders have invested considerable resources to contribute to online databases of species occurrences. However, estimates suggest that only 10% of biocollections are available in digital form. The biocollections community must therefore continue to promote digitization efforts, which in part requires demonstrating compelling applications of the data. Our overarching goal is therefore to determine trends in use of mobilized species occurrence data since 2010, as online systems have grown and now provide over one billion records. To do this, we characterized 501 papers that use openly accessible biodiversity databases. Our standardized tagging protocol was based on key topics of interest, including: database(s) used, taxa addressed, general uses of data, other data types linked to species occurrence data, and data quality issues addressed. We found that the most common uses of online biodiversity databases have been to estimate species distribution and richness, to outline data compilation and publication, and to assist in developing species checklists or describing new species. Only 69% of papers in our dataset addressed one or more aspects of data quality, which is low considering common errors and biases known to exist in opportunistic datasets. Globally, we find that biodiversity databases are still in the initial stages of data compilation. Novel and integrative applications are restricted to certain taxonomic groups and regions with higher numbers of quality records. Continued data digitization, publication, enhancement, and quality control efforts are necessary to make biodiversity science more efficient and relevant in our fast-changing environment.

Partial Text

Online databases with detailed information on organism occurrences collectively contain well over one billion records, and the numbers continue to grow. The digitization of natural history specimens [1,2] and development of online platforms for citizen science [3] have driven a steady accumulation of species occurrence records over the past decade. Each data point provides details on the taxonomic identification, date collected or observed, location, and name of the collector or observer for an organism. Applications of these primary biodiversity data are varied—such data have historically helped determine harmful effects of pesticides, document spread of infectious disease and invasive species, monitor environmental change, and much more [4–9]. The overall goal of this paper is to determine how researchers use open-access data in published work, focusing on the past decade, when growth of online biodiversity databases has been most rapid. As one illustration of that growth, the Global Biodiversity Information Facility (GBIF) has grown from provisioning just over 200 million records in 2010 to over 1.08 billion records today, a greater than fivefold increase [10].

We searched for papers that use online and openly accessible primary occurrence records or add data to an online database. Google Scholar (GS) provides full-text indexing, which was important for identifying data sources that often appear buried in the methods section of a paper. Our search was therefore restricted to GS and to the time period of 2010 through the date of the search (April 2017; note when looking at trends over time we remove 2017, as the year was not complete in our dataset). All authors discussed and agreed upon representative search terms, which were relatively broad to capture a variety of databases hosting primary occurrence records. The terms included: “species occurrence” database (8,800 results), “natural history collection” database (634 results), herbarium database (16,500 results), “biodiversity database” (3,350 results), “primary biodiversity data” database (483 results), “museum collection” database (4,480 results), “digital accessible information” database (10 results), and “digital accessible knowledge” database (52 results)–note that quotations are used as part of the search terms where specific phrases are needed in whole. We downloaded the first 500 records (or all if there were fewer than 500 results), which are presumably the most relevant search returns, for each search term into a Zotero reference management database [57]. We obtained citation numbers for each paper from the GS search results at the time of downloading records (April 2017) [58]. After removing duplicates across search terms, the final database included 2,460 papers. We then randomly sorted papers into four separate sets of around 500 to allow subsampling of the dataset.

We characterize a variety of ways in which researchers are using species occurrence records by assessing the prevalence of individual tags corresponding to topics of interest. We identify the most commonly cited databases and most-studied taxa, number of taxa addressed, most common research uses, the types of data most often linked to species occurrence records, and aspects of data quality addressed in these papers. In addition, we determine prevalence of these tags over time to assess positive or negative trends. Some expected trends include the following:




Leave a Reply

Your email address will not be published.