Research Article: Reproducible big data science: A case study in continuous FAIRness

Date Published: April 11, 2019

Publisher: Public Library of Science

Author(s): Ravi Madduri, Kyle Chard, Mike D’Arcy, Segun C. Jung, Alexis Rodriguez, Dinanath Sulakhe, Eric Deutsch, Cory Funk, Ben Heavner, Matthew Richards, Paul Shannon, Gustavo Glusman, Nathan Price, Carl Kesselman, Ian Foster, Rashid Mehmood.


Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility—thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.

Partial Text

Rapidly growing data collections create exciting opportunities for a new mode of scientific discovery in which alternative hypotheses are developed and tested against existing data, rather than by generating new data to validate a predetermined hypothesis [1, 2]. A key enabler of these data-driven discovery methods is the ability to easily access and analyze data of unprecedented size, complexity, and generation rate (i.e., volume, variety, and velocity), so-called big data. Equally important to the scientific method is that results be easily consumed by other scientists [3, 4]; that is, that results be findable, accessible, interoperable, and re-usable (FAIR) [5].

Large quantities of DNase I hypersensitive sites sequencing (DNase-seq) data are now available, for example from the Encyclopedia of DNA Elements (ENCODE) [12]. Funk et al. [13] show how such data can be used to construct genome-wide maps of candidate transcription factor binding sites (TFBSs) via the large-scale application of footprinting methods. As outlined in Fig 1, their method comprises five main steps, which are labeled in the figure and referenced throughout this paper by step number:
1. Retrieve tissue-specific DNase-seq data from ENCODE, for hundreds of biosample replicates and 27 tissue types.
2. Combine the DNase-seq replicates data for each aligned replicate in each tissue and merge the results. Alignments are computed for two seed sizes, yielding double the number of output files.
3. Apply two footprinting methods, Wellington [14] and HMM-based identification of TF footprints (HINT) [15], each of which has distinct strengths and limitations [16], to the DNase-seq data from step 2 to infer footprints. (On average, this process identifies a few million footprints for each tissue type, of which many but certainly not all are found by both approaches.)
4. Starting with a supplied set of non-redundant position weight matrices (PWMs) representing transcription-factor-DNA interactions, create a catalog of “hits” within the human genome, i.e., the genomic coordinates of occurrences of the supplied PWMs.
5. Intersect the footprints from step 3 and the hits from step 4 to identify candidate TFBSs in the DNase-seq data.
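
The final intersection step can be illustrated with a small sketch. This is not the authors' implementation: the `Interval` type, the `intersect` function, and all example names and coordinates are hypothetical, and the code assumes BED-style half-open coordinates. It only shows what it means for a PWM hit to overlap a footprint.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Interval(NamedTuple):
    """A genomic interval in BED-style coordinates (hypothetical helper type)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def intersect(footprints: List[Interval],
              hits: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return (hit, footprint) pairs that overlap by at least one base."""
    # Group footprints by chromosome so each hit is only compared
    # against footprints on the same chromosome.
    by_chrom: Dict[str, List[Interval]] = defaultdict(list)
    for fp in footprints:
        by_chrom[fp.chrom].append(fp)

    pairs = []
    for hit in hits:
        for fp in by_chrom.get(hit.chrom, []):
            # Two half-open intervals overlap when each starts before
            # the other ends.
            if hit.start < fp.end and fp.start < hit.end:
                pairs.append((hit, fp))
    return pairs


# Made-up example data: two footprints and three PWM hits.
footprints = [
    Interval("chr1", 100, 120, "wellington_fp1"),
    Interval("chr1", 300, 320, "hint_fp2"),
]
hits = [
    Interval("chr1", 110, 125, "motif_A"),  # overlaps wellington_fp1
    Interval("chr1", 200, 215, "motif_B"),  # overlaps nothing
    Interval("chr2", 100, 120, "motif_A"),  # no footprints on chr2
]
candidates = intersect(footprints, hits)
for hit, fp in candidates:
    print(hit.name, hit.chrom, hit.start, hit.end, "in", fp.name)
```

The nested loop is quadratic per chromosome, which is adequate for illustration only. With millions of footprints per tissue type, a sorted sweep or an interval index would be needed.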

Before describing our implementation of the TFBS workflow, we introduce tools that we leverage in its development. These tools, developed or enhanced within the NIH-funded Big Data for Discovery Science center (BDDS) [17], simplify the development of scalable and reusable software by providing robust solutions to a range of big data problems, from data exchange to scalable analysis.

Having described the major technologies on which we build, we now describe the end-to-end workflow of Fig 1. We cover each of the five steps in turn. Table 1 summarizes the biosamples, data, and computations involved in the workflow.

We review here the complete TFBS workflow, for which we specify the input datasets consumed by the workflow, the output datasets produced by the workflow, and the programs used to transform the inputs into the outputs. The inputs and programs are provided to enable readers to reproduce the results of the workflow; the outputs are provided for readers who want to use those results.

We take two approaches to evaluate the FAIRness and reproducibility of our approach. First, we conducted a user study asking participants to reproduce the analysis presented in this paper using the tools described above. Second, we evaluated FAIRness by determining whether or not each dataset and tool met a set of criteria specifically developed for this purpose [48].

The TFBS inference workflow implementation presented in Section 4 is structured so that others can easily re-run it, and organized so that it can readily use parallel cloud computing. These desirable properties are the result of a disciplined approach to application development that aims for compliance with the ten simple rules for reproducible computational research defined by Sandve et al. [51]:
1. For every result, keep track of how it was produced. We preserve workflows and assign Minids to workflow results.
2. Avoid manual data manipulation steps. We encode all data manipulation steps in either Galaxy workflows or R scripts.
3. Archive the exact versions of all external programs used. We create a Docker container with versions of the tools used in the analysis, and generate Minids for the Docker file and Docker image of the container.
4. Version control all custom scripts. We maintain our programs in GitHub, which supports versioning, and provide Minids for the versions used.
5. Record all intermediate results, when possible in standardized formats. We record the major intermediate results, in the same manner as inputs and output, using FASTQ, BAM, and BED formats. In the case of database files, we dump tables to a text file via SQL commands.
6. For analyses that include randomness, note underlying random seeds. F-Seq uses the Java random number generator, but does not set or record a seed. We would need to modify F-Seq to record that information.
7. Always store raw data behind plots. Minids provide concise references to the raw data used to create the plots in the paper, which are bundled in BDBags.
8. Generate hierarchical analysis output, allowing layers of increasing detail to be inspected. Because we record the complete provenance of each result, a reader can easily trace lineage from a fact, plot, or summarized result back through the processing steps and intermediate and raw data used to derive that result.
9. Connect textual statements to underlying results. Our use of Minids would make it easy for Funk et al. [13] to reference specific data in their text. They do not do this at present, but may in a future version of their paper.
10. Provide public access to scripts, runs, and results. Each is publicly available at a location accessible via a persistent identifier, as detailed in Tables 2 and 3.
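
As a rough illustration of what rules 1 and 6 ask for, the sketch below builds a provenance record that ties an output file to its inputs, the command that produced it, and the random seed used. It does not use the Minid or BDBag tooling. Here a SHA-256 checksum merely stands in for a persistent identifier, and the function names, record layout, file names, and command string are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(output: Path, inputs: List[Path],
                      command: str, seed: int) -> dict:
    """Describe how `output` was produced (hypothetical record layout)."""
    return {
        "output": {"name": output.name, "sha256": sha256_of(output)},
        "inputs": [{"name": p.name, "sha256": sha256_of(p)} for p in inputs],
        "command": command,
        "random_seed": seed,  # rule 6: note the seed behind any randomness
    }


# Demonstration with tiny stand-in files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "reads.fastq"
    out = Path(tmp) / "footprints.bed"
    src.write_bytes(b"@read1\nACGT\n+\nIIII\n")
    out.write_bytes(b"chr1\t100\t120\n")
    record = provenance_record(
        out, [src], "hypothetical-footprint-tool reads.fastq", seed=42
    )
    print(json.dumps(record, indent=2))
```

Because the checksums depend only on file contents, a reader who re-runs the analysis can compare their outputs against the recorded values to confirm that they reproduced the same result.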

We have presented tools designed to facilitate the implementation of complex, “big data” computations in ways that make the associated data and code findable, accessible, interoperable, and reusable (FAIR). To illustrate the use of these tools, we have described the implementation of a multi-stage DNase I hypersensitive sites sequencing data analysis that retrieves large datasets from a public repository and uses a mix of parallel cloud and workstation computation to identify candidate transcription factor binding sites. This pipeline can be rerun in its current form, for example as new DNase I hypersensitive sites sequencing data become available; extended with additional footprinting methods (for example, protein interaction quantification [66]) as new techniques become available; or modified to apply different integration and analysis methods. The case study thus demonstrates solutions to problems of scale and reproducibility in the heterogeneous, distributed world that characterizes much of modern biomedicine. We hope to see others experiment with these tools in other contexts and report their experiences.



