Research Article: When old metagenomic data meet newly sequenced genomes, a case study

Date Published: June 14, 2018

Publisher: Public Library of Science

Author(s): Xin Li, Saleh A. Naser, Annette Khaled, Haiyan Hu, Xiaoman Li, Ulrich Melcher.


Dozens of computational methods have been developed to identify the species present in a metagenomic dataset. Many of these methods depend on the available sequenced microbial species, which are still far from representative. To see how newly sequenced genomes affect the analysis results, we re-analyzed a shotgun metagenomic dataset composed of twelve colitis-free metagenomic samples and ten colitis-related metagenomic samples. Unexpectedly, we identified at least two new phyla that may relate to colitis development in patients, in addition to the phylum identified previously. The differences between the two types of samples associated with the two new phyla are statistically more significant than those associated with the previously identified phylum. Moreover, the abundance of the two new phyla correlates more strongly with the severity of colitis. Surprisingly, even when repeating the analyses implemented in the previous study, we found that at least one main conclusion of the previous study is not supported. Our study indicates the importance of re-analyzing existing metagenomic datasets and the necessity of considering multiple updated tools in metagenomic studies. It also sheds light on the limitations of the popular tools currently in use and the importance of inferring the presence of taxa without relying upon available sequenced genomes.

Partial Text

A plethora of metagenomic datasets have been generated in the past fifteen years [1–4]. Early datasets were often based on 16S rRNA profiling and Sanger sequencing [5–7]. Later datasets are usually sequenced with next-generation sequencing technologies [8, 9]. The generated datasets range from early ones, such as those from seawater [2], acid mine drainage [10], and the deep sea [11, 12], to current ones, such as those from gut [8, 13], skin [14], and soil [15]. These metagenomic datasets have enabled an unprecedented exploration of microbes, which has significantly advanced our understanding of microbes in the living world [3, 4, 8].

By mapping metagenomic reads to all available microbial genomes, we identified at least 3 phyla, 2 classes, 9 orders, 22 families, 70 genera and 162 species that are potentially colitis-related (last column of Table 1 and of S1 and S2 Tables). We consider these taxa potentially colitis-related because the abundance of each of them is significantly different between CF and PtC samples and correlates with colitis severity in patients better than the abundance of Bacteroidetes does. Moreover, these taxa are identified both by unique reads and by all mapped reads. In addition, 2 phyla, 1 order, 4 families, 18 genera and 71 species are colitis-related based on a literature search (S1 and S2 Tables). Compared with the colitis-related taxa previously identified from the same data, we identified many more taxa supported by the literature.
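The two criteria described above (a significant abundance difference between CF and PtC samples, and a correlation between abundance and colitis severity) can be illustrated with a minimal sketch. The excerpt does not state which statistical tests were used, so the exact permutation test and the Spearman rank correlation below are assumptions chosen for illustration, and all function names and numbers are hypothetical toy values, not data from the study.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means.

    Assumption: this test stands in for whatever significance test a real
    analysis would use; it is only practical for small sample sizes.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    n_splits = 0
    # Enumerate every way of relabelling n_a of the pooled samples as group A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        n_splits += 1
    return extreme / n_splits


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical relative abundances of one taxon (toy values, not study data).
cf_abundance = [0.10, 0.12, 0.11, 0.09]
ptc_abundance = [0.20, 0.25, 0.22, 0.30]
ptc_severity = [1, 2, 3, 4]  # hypothetical severity grades for the PtC samples

p_value = exact_permutation_pvalue(cf_abundance, ptc_abundance)
rho = spearman(ptc_abundance, ptc_severity)
```

In a sketch like this, a taxon would be flagged when `p_value` falls below a chosen threshold and `rho` exceeds the corresponding value obtained for the reference taxon (Bacteroidetes in the text). Both the threshold and the need for multiple-testing correction across many taxa are left open here.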

Our study sheds new light on metagenomic studies. It shows the necessity of considering every region of sequenced genomes rather than marker genes only. It also suggests caution when handling duplicated reads and multi-reads during analysis. Moreover, when methods based on sequenced genomes are used, it is essential to take into account how newly sequenced genomes affect the results. We hope that, in the near future, better tools for handling multi-read mapping and novel methods that do not rely on sequenced genomes will be developed, so that the issues raised here can be addressed or at least minimized.
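To make the caution about duplicated reads and multi-reads concrete, the sketch below tallies reads per taxon in two ways: from uniquely mapped reads only, and from all mapped reads. It is an illustration under stated assumptions, not the pipeline used in the study. The input format, the function name `count_reads_per_taxon`, the definition of a duplicate as an identical read sequence, and the choice to count a multi-read once for every taxon it hits are all hypothetical.

```python
from collections import Counter


def count_reads_per_taxon(alignments, drop_duplicates=True):
    """Tally reads per taxon from (read_sequence, taxa_hit) pairs.

    Assumptions (hypothetical, for illustration only):
      * a duplicated read is one whose sequence was already seen;
      * a multi-read (hitting several taxa) adds one count to each of them
        in the "all mapped reads" tally and nothing to the unique tally.
    Returns (unique_counts, all_counts) as Counter objects.
    """
    seen_sequences = set()
    unique_counts = Counter()
    all_counts = Counter()
    for sequence, taxa_hit in alignments:
        if drop_duplicates:
            if sequence in seen_sequences:
                continue  # skip exact duplicates of an earlier read
            seen_sequences.add(sequence)
        taxa = set(taxa_hit)
        if not taxa:
            continue  # unmapped read
        for taxon in taxa:
            all_counts[taxon] += 1
        if len(taxa) == 1:
            unique_counts[next(iter(taxa))] += 1
    return unique_counts, all_counts


# Toy input: two identical reads, one multi-read, one uniquely mapped read.
toy_alignments = [
    ("ACGT", {"taxon_A"}),
    ("ACGT", {"taxon_A"}),             # duplicate of the first read
    ("GGCC", {"taxon_A", "taxon_B"}),  # multi-read
    ("TTAA", {"taxon_B"}),
]

unique_dedup, all_dedup = count_reads_per_taxon(toy_alignments)
unique_raw, all_raw = count_reads_per_taxon(toy_alignments, drop_duplicates=False)
```

Comparing the four tallies shows how both choices shift the apparent abundance of `taxon_A` relative to `taxon_B`, which is the kind of sensitivity the text warns about. Other conventions, such as splitting a multi-read fractionally among its hits, would give different numbers again.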



