Date Published: October 5, 2016
Publisher: Public Library of Science
Author(s): Qi Zheng, Elizabeth A. Grice, Timothée Poisot
Abstract: Accurate mapping of next-generation sequencing (NGS) reads to reference genomes is crucial for almost all NGS applications and downstream analyses. Various repetitive elements in human and other higher eukaryotic genomes contribute in large part to ambiguously (non-uniquely) mapped reads. Most available NGS aligners attempt to address this by either removing all non-uniquely mapping reads, or reporting one random or “best” hit based on simple heuristics. Accurate estimation of the mapping quality of NGS reads is therefore critical albeit completely lacking at present. Here we developed a generalized software toolkit “AlignerBoost”, which utilizes a Bayesian-based framework to accurately estimate mapping quality of ambiguously mapped NGS reads. We tested AlignerBoost with both simulated and real DNA-seq and RNA-seq datasets at various thresholds. In most cases, but especially for reads falling within repetitive regions, AlignerBoost dramatically increases the mapping precision of modern NGS aligners without significantly compromising the sensitivity even without mapping quality filters. When using higher mapping quality cutoffs, AlignerBoost achieves a much lower false mapping rate while exhibiting comparable or higher sensitivity compared to the aligner default modes, therefore significantly boosting the detection power of NGS aligners even using extreme thresholds. AlignerBoost is also SNP-aware, and higher quality alignments can be achieved if provided with known SNPs. AlignerBoost’s algorithm is computationally efficient, and can process one million alignments within 30 seconds on a typical desktop computer. AlignerBoost is implemented as a uniform Java application and is freely available at https://github.com/Grice-Lab/AlignerBoost.
Partial Text: Numerous genome-scale experimental applications are now possible due to the advent of high throughput, low cost next-generation sequencing (NGS) platforms, including genome sequencing/re-sequencing, gene expression profiling, mRNA splicing prediction/characterization, SNP identification and genotyping, and disease-associated variant identification. Accurate mapping of NGS reads to reference genomes is critical to all of these applications. Many public or commercial NGS read mapping programs (“aligners”) are available, most of which utilize a “seed-search” first strategy to allow ultra-fast processing. The most commonly used algorithms for seed-search are Hash-index (e.g. MAQ, GSNAP, SRMapper, mrsFAST-Ultra, SeqAlto [1–5]), “Burrows-Wheeler Transform” (e.g. Bowtie/Bowtie2, BWA, SOAP2 [6–10]), un-compressed tries (e.g. STAR ), or a mixture of the above (e.g. YOABS ). These seed-search algorithms usually use relatively small segments of the reads (“seeds”) to initiate mapping, due to large RAM requirements to build the index. They then attempt to extend the mapping either by naive comparison or local Smith-Waterman alignment algorithm [1–4, 6–12]. Theoretically, the use of relatively small “seeds” should provide enough uniqueness (or mapping precision) even for very large genomes. In reality however, most genomes, especially those of higher eukaryotes, are enriched with various large and highly similar repetitive elements, such as pseudogenes, paralog gene families, transposable elements, tandem repeats and sequences encoding repeat RNAs. This often leads to multiple hits in the “seed-search” stage and subsequent ambiguously mapped reads for most real NGS datasets, even those designed to target exome regions. In fact, certain NGS aligners treat seeds in highly repetitive regions as “high complexity” and ignore them by default. However, some repetitive elements in the human genome can have an overwhelming high copy number. For example, some human pseudogene classes may have more than 500 copies of over 3,000 bp; a few human SINE retrotransposon families may have over 100,000 copies of about 300 bp. In these repetitive regions, a “low complexity” seed might not even exist, leading to biased mapping in favor of these regions and subsequent false mapping.
We tested AlignerBoost with both simulated and real datasets under various combinations of experimental strategies. Since it is very difficult to determine whether a read from a real dataset is mapped correctly, we first generated synthetic NGS datasets under complex sequencing error models (see below). It is noteworthy that we didn’t choose published software for this purpose, such as SlnC , XS , GemSIM , or ART , because to our knowledge, they do not support generating simulated reads in designated regions of the genome as our procedures do. For all datasets, we tested the overall mapping precision (positive prediction value, or PPV = TP/(TP + FP) and sensitivity (true positive rate, or TPR = TP/(TP + FN)) for AlignerBoost with various NGS aligners, and compared the AlignerBoost filtered results to the default “best” outputs (S1 Table). To calculate the mapping precision and sensitivity, a “correct-mapping” is defined as aligned boundaries that are within +/- 20% of the true locus relative to the alignment length. The exact definition of a true locus is explained for simulated and real datasets separately below.
Increasing throughput and decreasing costs of employing NGS platforms for various genome-wide experimental applications have made fast and accurate mapping of NGS reads to reference genomes an imperative need. Though ultra-fast speed has been achieved in many state-of-art NGS aligners, rarely have there been attempts to improve the mapping quality in terms of precision and sensitivity. Here, we developed a generalized software toolkit, AlignerBoost, which we show dramatically boosts the mapping precision for most modern NGS aligners while maintaining a similar level of sensitivity. AlignerBoost works for almost any experimental design requiring alignment to reference genomes, but has the greatest advantage for NGS libraries with a considerable proportion of repetitive reads, such as pseudogenes, transposons and paralog gene families that are usually contributing more than half of higher eukaryotic genomes. AlignerBoost supports numerous customizable mapping parameters and users can expect up to 100% mapping precision in most cases if parameters are correctly chosen. The fact that pure pseudogene or RMSK datasets can achieve up to 98% mapping precision makes it practical to interrogate these “dark matter” genomic regions with good confidence when using AlignerBoost with relatively short NGS reads. AlignerBoost is also able to greatly increase the mapping precision and sensitivity simultaneously for RNA-seq datasets regardless of whether a dedicated RNA-seq aligner is used or not, making it especially useful when mapping RNA reads to a poorly annotated genome. Furthermore, the ability to estimate the true “inserts” by “1DP” function of AlignerBoost makes it particularly promising for mapping NGS reads with non-genome fragments, which could result from untrimmed adapters/barcodes, RNA modification, exon/intron boundaries or chimeric reads. At last, we speculate that AlignerBoost will become a powerful tool for identification of disease-associated mutations and variations in the near future, when personalized SNP and variation data will be available that can be utilized by AlignerBoost to generate extremely high quality alignments.
AlignerBoost first generates executable scripts that call external NGS aligners in multiple-mapping enabled mode. To achieve optimal sensitivity, AlignerBoost also performs optional pre-processing procedures such as quality-control (QC), adapter trimming, non-redundant tag reduction and provides sequence statistic summaries. All of these pre-processing and mapping steps are governed by tunable options, which are specified by a single user-provided configuration file, and support many major NGS aligners (S8 Table). The executable scripts generate standard SAM/BAM alignment files that contain all potential alignments (multiple-mapping enabled) for every read.