Date Published: July 9, 2019
Publisher: Public Library of Science
Author(s): Azza E. Ahmed, Jacob Heldenbrand, Yan Asmann, Faisal M. Fadlelmola, Daniel S. Katz, Katherine Kendig, Matthew C. Kendzior, Tiffany Li, Yingxue Ren, Elliott Rodriguez, Matthew R. Weber, Justin M. Wozniak, Jennie Zermeno, Liudmila S. Mainzer, Li Chen.
Bioinformatics research is frequently performed using complex workflows with multiple steps, fans, merges, and conditionals. This complexity makes management of the workflow difficult on a computer cluster, especially when running in parallel on large batches of data: hundreds or thousands of samples at a time. Scientific workflow management systems could help with that. Many are now being proposed, but is there yet the “best” workflow management system for bioinformatics? Such a system would need to satisfy numerous, sometimes conflicting requirements: from ease of use, to seamless deployment at peta- and exa-scale, and portability to the cloud. We evaluated Swift/T as a candidate for such role by implementing a primary genomic variant calling workflow in the Swift/T language, focusing on workflow management, performance and scalability issues that arise from production-grade big data genomic analyses. In the process we introduced novel features into the language, which are now part of its open repository. Additionally, we formalized a set of design criteria for quality, robust, maintainable workflows that must function at-scale in a production setting, such as a large genomic sequencing facility or a major hospital system. The use of Swift/T conveys two key advantages. (1) It operates transparently in multiple cluster scheduling environments (PBS Torque, SLURM, Cray aprun environment, etc.), thus a single workflow is trivially portable across numerous clusters. (2) The leaf functions of Swift/T permit developers to easily swap executables in and out of the workflow, which makes it easy to maintain and to request resources optimal for each stage of the pipeline. While Swift/T’s data-level parallelism eliminates the need to code parallel analysis of multiple samples, it does make debugging more difficult, as is common for implicitly parallel code. Nonetheless, the language gives users a powerful and portable way to scale up analyses in many computing architectures. The code for our implementation of a variant calling workflow using Swift/T can be found on GitHub at https://github.com/ncsa/Swift-T-Variant-Calling, with full documentation provided at http://swift-t-variant-calling.readthedocs.io/en/latest/.
Advancements in sequencing technology [1, 2] have paved the way for many applications of Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES) in genomic research and the clinic [3, 4]. Be it primary variant calling, RNASeq, genome assembly or annotation, a genomics analysis invariably involves constructing a complex workflow that could be hard to manage for large sample sizes (hundreds and beyond, [5–7]) that necessitate the use of large computer clusters. In such cases, features like resiliency and auto-restart in case of node failures, tracking of individual samples, efficient node utilization, and easy debugging of errors and failures are very important. Without a high-quality workflow manager, these requirements can be difficult to satisfy, resulting in error-prone workflow development, maintenance and execution. An additional challenge is porting the workflow among different computing environments, a common need in collaborative and consortium projects.
Our chosen use case is genomic variant calling, commonly performed in accordance with the Best Practices established by the GATK team (Genome Analysis Toolkit) [21–23]. It is likely that the GATK will continue to be the standard in research and medicine for those reasons, and also due to the need for HIPAA /CLIA  approval and compliance. The GATK is well trusted, validated by the community, and grandfathered in. Thus, the need for a generic, modular and flexible workflow built around the toolkit will persist for some time. We focused only on the primary analysis: the steps from aligning raw reads through variant calling, excluding any downstream steps, such as phasing and annotation. Additionally, we focused on small variant discovery, i.e. the detection of SNPs and InDels, not including structural variant calling. The implementation focused on WGS and WES data. The included functionality was sufficient to test the power and ability of Swift/T and evaluate its usefulness in creating extensible workflows that could be augmented with additional steps.
Complexity of problems in biology means that nearly every kind of analytics is a multi-step process, a pipeline of individual analyses that feed their outputs to each other (e.g. [59–61]). The algorithms and methods used for those processing steps are in continual development by scientists, as computational biology and specifically bioinformatics are still rapidly developing. Few studies can be accomplished via a single integrated executable. Instead, we deal with a heterogeneous medley of software of varied robustness and accuracy, frequently with multiple packages available to perform seemingly the same kind of analysis—yet subtly differing in applicability depending on the species or input data type. Thus bioinformatics today requires advanced, flexible automation via modular data-driven workflows. This is a tall order, considering the added requirements of scalability, portability and robustness. Genomics is a big data field: we no longer talk about sequencing individual organisms, but every baby being born (∼500 per day per state in the US) and every patient who comes in for a checkup (a million per year in a major hospital), not to mention the massive contemporary crop and livestock genotyping efforts. The workflows managing data analysis at that scale must take full advantage of parallelism on modern hardware, be portable among multiple HPC systems and the cloud, be robust against data corruption and hardware failure, and provide full logging and reporting to the analyst for monitoring and reproducibility.
Our experience implementing a genomic variant calling workflow in Swift/T suggests that it is a very powerful system for workflow management in supercomputing environments. The language is rich with features that give developers control over their workflow structure and execution while providing familiar syntax. The execution engine also has intelligent mechanisms for task placement and regulation, permitting efficient use of compute resources. This unfortunately comes at the cost of a relatively steep learning curve—a common trade-off for programming languages in general. Thus Swift/T can be an extremely useful—and possibly the best—tool for certain genomics analyses, though its complexity may pose an adoption barrier for biologists.