Date Published: July 11, 2017
Publisher: Public Library of Science
Author(s): Adam Price, Cynthia Gibas, Peng Xu.
Sequence read alignment to a reference genome is a fundamental step in many genomics studies. Accuracy in this fundamental step is crucial for correct interpretation of biological data. In cases where two or more closely related bacterial strains are being studied, a common approach is to simply map reads from all strains to a common reference genome, whether because there is no closed reference for some strains or for ease of comparison. The assumption is that the differences between bacterial strains are insignificant enough that the results of differential expression analysis will not be influenced by choice of reference. Genes that are common among the strains under study are used for differential expression analysis, while the remaining genes, which may fail to express in one sample or the other because they are simply absent, are analyzed separately. In this study, we investigate the practice of using a common reference in transcriptomic analysis. We analyze two multi-strain transcriptomic data sets that were initially presented in the literature as comparisons based on a common reference, but which have available closed genomic sequence for all strains, allowing a detailed examination of the impact of reference choice. We provide a method for identifying regions that are most affected by non-native alignments, leading to false positives in differential expression analysis, and perform an in depth analysis identifying the extent of expression loss. We also simulate several data sets to identify best practices for non-native reference use.
Sequence read alignment to a reference genome is currently a key step in many common bioinformatics workflows. Researchers frequently encounter situations in which the most appropriate reference genome for a reference-based analysis is not available, and a homologous alternative must be used. This can lead to inaccuracies in mapping and subsequently in quantitation and interpretation. These inaccuracies skew the results of otherwise sound analysis workflows. This study approaches the problem of non-native reference alignment by comparing the effects of read alignment to native and heterologous reference genomes. We describe a method to identify false positives caused by improper alignments to the heterologous reference, and examine the underlying causes, to provide a set of best practices for research that makes use of non-native reference genomes.
The use of non-native references genomes relies on the distance between the native and heterologous genomes and the development of high-integrity data that can overcome the naturally occurring differences between those genomes. Several factors should be taken into account by researchers intending to use non-native references for alignment of read sets. The first step that must always be taken is to identify correspondence between orthologous regions. For example, a researcher with a read set with no complete native reference genome available that has a potential heterologous genome available for read alignment should first investigate if the heterologous genome is viable for alignment. To do this, de novo assembly of reads into contigs, followed by ortholog identification using BLAST should be performed. Then, by extracting read counts for orthologous regions, correspondence can be examined as shown in Fig 3. It is important to mention that the parameters of ortholog identification here are highly relevant. In this study, we performed ortholog identification using very strict parameters (95% identity, length > 200bp, max mismatch = 5) and were able to identify a large subset of orthologous regions with high confidence. By relaxing these ortholog identification criteria, undoubtedly a larger subset of orthologous genes could be identified, however the false positive rate would also correspondingly increase with increased numbers of mismatching regions existing. A researcher intent on using a non-native reference genome for alignment should then properly tune their BLAST ortholog identification parameters to maximize the number of orthologs they can identify between their read set and non-native reference genome, while confirming viability by monitoring the correspondence of a single read set as aligned to both the non-native reference and their de novo assembled contig sets. That is, as long as an identical read set produces strong correspondence when aligned to the native and non-native alignment target, such as that seen in Fig 3 of this study, comparison between those orthologous regions can be considered viable. If that alignment instead becomes increasingly dispersed, the strictness of BLAST parameters for ortholog mapping should be increased. Once the researcher has determined an appropriate level of ortholog identification, additional investigation can be performed, if desired, to further eliminate false positive outliers.