Date Published: December 14, 2009
Publisher: Public Library of Science
Author(s): Patrick D. Schloss, John Quackenbush. http://doi.org/10.1371/journal.pone.0008230
Abstract: As the scope of microbial surveys expands with the parallel growth in sequencing capacity, a significant bottleneck in data analysis is the ability to generate a biologically meaningful multiple sequence alignment. The most commonly used aligners have varying alignment quality and speed, tend to depend on a specific reference alignment, or lack a complete description of the underlying algorithm. The purpose of this study was to create and validate an aligner with the goal of quickly generating a high-quality alignment and having the flexibility to use any reference alignment. Using the simple nearest alignment space termination algorithm, the resulting aligner operates in linear time, requires a small memory footprint, and generates a high-quality alignment. In addition, the alignments generated for variable regions were of as high a quality as the alignment of full-length sequences. As implemented, the method was able to align 18 full-length 16S rRNA gene sequences and 58 V2 region sequences per second to the 50,000-column SILVA reference alignment. Most importantly, the resulting alignments were of a quality equal to SILVA-generated alignments. The aligner described in this study will enable scientists to rapidly generate robust multiple sequence alignments that are implicitly based upon the predicted secondary structure of the 16S rRNA molecule. Furthermore, because the implementation is not connected to a specific database, it is easy to generalize the method to reference alignments for any DNA sequence.
Partial Text: Recent advances in traditional Sanger sequencing and pyrosequencing technologies have facilitated the ability to design studies where 10^2 to 10^7 16S rRNA gene sequences ranging in length between 60 and 1500 bp are generated to address interesting ecological questions. This flood of data has forced computational microbial ecologists to refactor software tools to make the analysis of these datasets feasible. A significant bottleneck in the analysis of these sequences is the generation of a robust multiple sequence alignment (MSA). An MSA is critical to generating phylogenies and calculating meaningful pairwise genetic distances that can be used to assign sequences to operationally-defined taxonomic units [OTUs, 5]. Because of the difficulty inherent in MSA calculations, investigators have bypassed OTU-based approaches in preference for phylotype-based approaches. In such approaches, sequences are assigned to bins based on similarity to a curated database. This has the limitation that sequences in the same phylotype may be only marginally similar to each other, or unknown sequences may not affiliate to a pre-existing taxonomy. Therefore, there is a significant need to reassess alignment techniques with regard to their speed, memory requirements, and accuracy.
A critical step in analyzing DNA sequences generated from community surveys is generating an MSA. Here I described and validated a variation of the greengenes and SILVA aligners and showed that this aligner quickly generates a high-quality alignment. Also, although investigators are encouraged to perform similar types of experiments to optimize the alignment conditions for their region of interest, the kmer search and Needleman-Wunsch alignment approach was robust to perturbations in its settings. Interestingly, whereas other aligners appear to use multiple template sequences to align one candidate sequence, this alignment algorithm only requires one reference sequence per candidate sequence and does not require explicit knowledge of the 16S rRNA secondary structure.
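The two steps named above, a kmer search to choose a single reference sequence followed by a Needleman-Wunsch pairwise alignment, can be illustrated with a minimal Python sketch. This is not the published implementation: the function names, the kmer size, and the match, mismatch, and gap scores are all hypothetical choices made for illustration, and the final step of mapping the pairwise alignment back into the columns of the reference alignment is omitted.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def closest_template(candidate, templates, k=8):
    """Return the name of the template sharing the most kmers with candidate.

    templates maps a name to an ungapped sequence; k=8 is an assumed default.
    """
    cand = kmers(candidate, k)
    return max(templates, key=lambda name: len(cand & kmers(templates[name], k)))


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b with a linear gap penalty (assumed scores).

    Returns (aligned_a, aligned_b, score).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]
```

In this sketch a candidate would first be passed to `closest_template` and then aligned against the chosen sequence with `needleman_wunsch`, so only one reference sequence is consulted per candidate, as the text describes. A real aligner would then transfer the gaps of the chosen reference within the reference alignment onto the aligned candidate.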