Date Published: September 29, 2017
Publisher: BioMed Central
Author(s): Shixiang Wan, Quan Zou.
Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The rapid growth of next-generation sequencing data has left a shortage of alignment approaches that are efficient enough for ultra-large sets of biological sequences and that cope with different sequence types.
Distributed and parallel computing is a crucial technique for accelerating ultra-large (e.g. files larger than 1 GB) sequence analyses. Building on HAlign and the Spark distributed computing system, we implement HAlign-II, a highly cost-efficient and time-efficient tool for ultra-large multiple biological sequence alignment and phylogenetic tree construction.
Experiments on large-scale DNA and protein data sets, each more than 1 GB in size, showed that HAlign-II saves both time and space and outperforms current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. It shows extremely high memory efficiency and scales well as computing resources increase.
HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II, with open-source code and datasets, is available at http://lab.malab.cn/soft/halign.
Multiple sequence alignment (MSA) is a necessary step for analyzing biological sequence structures and functions, for phylogenetic inference, and for other basic fields in bioinformatics. Given the rapid growth of biological sequence data from next-generation sequencing, difficulties arise because the available state-of-the-art methods are insufficient for ultra-large data sources.
Multiple biological sequence alignment and phylogenetic tree construction are closely interrelated, and both are necessary for sequence analysis. Over the last several decades, many state-of-the-art methods and algorithms have been created to make MSA and phylogenetic tree construction more time- and space-efficient. As next-generation sequence databases grow, handling ultra-large datasets has become an unprecedented challenge. Other methods were developed to improve time efficiency even at the cost of some precision; these include ClustalW-MPI, Hadoop-BAM, HAlign, and HPTree. Given the urgent need for greater time efficiency and computing power on ultra-large datasets, we conduct a series of experiments to assess the performance of our HAlign-II method.
This paper presents a distributed and parallel computing tool named HAlign-II to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. After comparing this tool with a series of state-of-the-art methods on ultra-large data, we conclude that HAlign-II features three advantages: (1) extremely high memory efficiency and good scaling with increases in computing resources; (2) efficient construction of phylogenetic trees from ultra-large numbers of biological sequences; (3) a user-friendly web server based on high-performance distributed computing infrastructure, available at http://lab.malab.cn/soft/halign. These improvements will be significant in coping with the rapid growth of next-generation sequencing data.
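To illustrate why MSA workloads lend themselves to distributed and parallel computing, the following is a minimal sketch in Python. It is not the HAlign-II implementation and does not use Spark. It assumes a simplified scheme in which every sequence is aligned independently against one fixed reference sequence, so the pairwise alignments can be handed to any map-style executor. The function names (`global_align`, `align_all`), the `map_fn` parameter, and the scoring values are hypothetical choices made for illustration only.

```python
from functools import partial

# Illustrative scoring values (an assumption, not taken from the paper).
MATCH, MISMATCH, GAP = 1, -1, -2


def global_align(a, b):
    """Needleman-Wunsch global alignment; returns (aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + GAP,      # gap in b
                score[i][j - 1] + GAP,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = MATCH if i > 0 and j > 0 and a[i - 1] == b[j - 1] else MISMATCH
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def align_all(reference, sequences, map_fn=map):
    """Align each sequence to the reference; each task is independent.

    map_fn can be the built-in map (serial) or the map method of a
    concurrent.futures executor, which has the same call shape.
    """
    return list(map_fn(partial(global_align, reference), sequences))
```

Because each pairwise alignment depends only on the reference and one input sequence, the work can be partitioned across workers without coordination. For example, passing `map_fn=pool.map` with `pool` a `concurrent.futures.ProcessPoolExecutor` would spread the same calls over several processes. Merging the pairwise results into a single multiple alignment is a separate step that this sketch omits.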