Date Published: November 24, 2009
Publisher: Public Library of Science
Author(s): Konstantinos Mavromatis, Ken Chu, Natalia Ivanova, Sean D. Hooper, Victor M. Markowitz, Nikos C. Kyrpides, Mikael Rørdam Andersen. http://doi.org/10.1371/journal.pone.0007979
Abstract: Computational methods for determining the function of genes in newly sequenced genomes have traditionally been based on sequence similarity to genes whose function has been identified experimentally. Function prediction methods can be extended using gene context analysis approaches such as examining the conservation of chromosomal gene clusters, gene fusion events and co-occurrence profiles across genomes. Context analysis is based on the observation that functionally related genes often have similar gene context, and it relies on the identification of such events across a phylogenetically diverse collection of genomes. We have used the data management system of the Integrated Microbial Genomes (IMG) as the framework to implement and explore the power of gene context analysis methods because it provides one of the largest available genome integrations. Visualization and search tools to facilitate gene context analysis have been developed and applied across all publicly available archaeal and bacterial genomes in IMG. These computations are now maintained as part of IMG’s regular genome content update cycle. IMG is available at: http://img.jgi.doe.gov.
Partial Text: Gene context analysis methods have proved to be valuable for genome structure and evolution studies as well as for protein function prediction.
We have extended the Integrated Microbial Genomes (IMG) system with gene context analysis, visualization and search tools.
We have developed computational methods, together with visualization and search tools, that explore the power of gene context analysis within the comparative analysis framework of the Integrated Microbial Genomes (IMG) data management system. Although similar methods and approaches have been reported by other groups in the past, this is the first time that gene context analysis has been based on multiple protein clustering methods and applied to such a large number of genomes. Each of the three clustering methods has a different scope and allows different applications. Pfam is a clustering method based on local similarity, and can be used to explore domain order conservation and shuffling across the phylogenetic space. COGs, which group proteins with sequence similarity over their entire length, are more sensitive in detecting overall protein relationships. IMG orthologs, in contrast, are computationally determined orthologs based on bidirectional best hits (BBH); they are limited to closely related organisms, and paralogs are excluded from the same cluster. These differences are reflected in Table 1, where the number of Pfam-based conserved cassettes is significantly higher than the counts produced by the other two clustering methods, owing to the highly combinatorial nature of protein domains.
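The effect of the clustering method on cassette counts can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the IMG implementation: each genome is assumed to be an ordered list of genes, each gene carries a set of cluster identifiers, and a "conserved cassette" is simplified to a pair of clusters found on adjacent genes in at least two genomes. The function name `conserved_pairs`, the genome names, and all cluster identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import product


def conserved_pairs(genomes, min_genomes=2):
    """Return unordered pairs of cluster IDs that occur on adjacent genes
    in at least `min_genomes` genomes (a simplified stand-in for a cassette)."""
    seen = defaultdict(set)  # pair of cluster IDs -> names of genomes containing it
    for name, genes in genomes.items():
        # Walk over neighbouring genes in chromosomal order.
        for left, right in zip(genes, genes[1:]):
            # A gene may carry several cluster IDs (e.g. several domains),
            # so every combination across the two neighbours is recorded.
            for a, b in product(left, right):
                if a != b:
                    seen[frozenset((a, b))].add(name)
    return {pair for pair, names in seen.items() if len(names) >= min_genomes}


# Hypothetical data: one whole-protein cluster per gene (COG-like assignment).
cog_genomes = {
    "genome_A": [{"COG1"}, {"COG2"}, {"COG3"}],
    "genome_B": [{"COG2"}, {"COG1"}, {"COG4"}],
}

# The same neighbourhoods with domain-level assignment (Pfam-like):
# the first protein of genome_A carries two domains instead of one cluster.
pfam_genomes = {
    "genome_A": [{"PF1", "PF2"}, {"PF3"}, {"PF5"}],
    "genome_B": [{"PF3"}, {"PF1", "PF2"}, {"PF6"}],
}

cog_result = conserved_pairs(cog_genomes)
pfam_result = conserved_pairs(pfam_genomes)
print(len(cog_result), len(pfam_result))
```

In this toy data a single conserved gene pair yields one cluster pair under the whole-protein assignment but two under the domain-level assignment, because the two-domain protein contributes one pair per domain. This is the kind of combinatorial growth the text attributes to protein domains.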