Research Article: Using Protein Clusters from Whole Proteomes to Construct and Augment a Dendrogram

Date Published: February 20, 2013

Publisher: Hindawi Publishing Corporation

Author(s): Yunyun Zhou, Douglas R. Call, Shira L. Broschat.


In this paper we present a new ab initio approach for constructing an unrooted dendrogram using protein clusters, an approach that has the potential for estimating relationships among several thousand species based on their putative proteomes. We employ an open-source software program called pClust that was developed for use in metagenomic studies. Sequence alignment is performed by pClust using the Smith-Waterman algorithm, which is known to give an optimal alignment and, hence, greater accuracy than BLAST-based methods. Protein clusters generated by pClust are used to create protein profiles for each species in the dendrogram, and these profiles form a correlation filter library for use with a new taxon. To augment the dendrogram with a new taxon, a protein profile for the taxon is created using BLASTp, and the new taxon is placed at the position within the dendrogram corresponding to the highest correlation with profiles in the correlation filter library. This work was initiated because of our interest in plasmids, and each step is illustrated using proteomes from Gram-negative bacterial plasmids. Proteomes for 527 plasmids were used to generate the dendrogram. To demonstrate the utility of the insertion algorithm, twelve recently sequenced pAKD plasmids were used to augment the dendrogram.

Partial Text

The availability of complete proteomes for hundreds of thousands of species provides an unprecedented opportunity to study genetic relationships among a large number of species. However, software tools capable of handling massive amounts of data must be developed before we can exploit the availability of these proteomes. Currently the tools used for clustering are restricted either in the number of proteomes that can be examined, because of the time required to obtain results, or in their sensitivity. For example, clustering by means of hidden Markov models (HMMs), multiple sequence alignment, and pairwise sequence alignment by means of the Smith-Waterman alignment algorithm are all limited by their time complexity. The Smith-Waterman algorithm, a dynamic programming algorithm, is known to give an optimal alignment between two protein sequences for a given similarity matrix [1], but alignment of two sequences of lengths m and n requires O(mn) time. On the other hand, heuristic approximate alignment methods, frequently based on BLAST and its variants [2], reduce the computational time required; for example, in practice BLAST effectively reduces the time to O(n), but this comes at the risk of losing sensitivity in homology detection. In fact, numerous articles (for example, see [3, 4]) have discussed this loss of sensitivity in BLAST-based results compared to those of the Smith-Waterman algorithm. To ensure that a maximum number of homologous sequences are identified, highly sensitive pairwise homology detection is required. Otherwise, the clusters of homologous sequences obtained by means of a given clustering method will not include all possible members and, ultimately, the final results will be less accurate.
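To make the O(mn) cost of Smith-Waterman concrete, the following is a minimal sketch of the local-alignment scoring recurrence. It is not the pClust implementation: the function name smith_waterman_score and the simple match, mismatch, and linear gap scores are assumptions chosen for illustration, whereas a real protein aligner would use a substitution matrix such as the similarity matrix mentioned above.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local-alignment score of sequences a and b.

    Illustrative scoring only (assumed values): a fixed match/mismatch
    score and a linear gap penalty stand in for a substitution matrix.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):          # m rows ...
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):      # ... times n columns: O(mn)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                     # start a new local alignment
                prev[j - 1] + sub,     # align a[i-1] with b[j-1]
                prev[j] + gap,         # gap in b
                curr[j - 1] + gap,     # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # A short motif embedded in a longer sequence is still found in full.
    print(smith_waterman_score("GGMKVLAGG", "MKVLA"))
```

Every cell of the m-by-n table is visited once, which is where the quadratic running time comes from; the sketch keeps only two rows in memory because it reports the score and not the alignment itself.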

In this work we present a new ab initio method for constructing a dendrogram from whole proteomes that begins with output from pClust, a software program developed for homology detection in large-scale protein sequence analyses. We develop an efficient approach for inserting a new species into the dendrogram based on the use of a correlation filter library. This is much more efficient than constructing an entirely new tree, which is computationally costly. We illustrate our method by creating a dendrogram for 527 Gram-negative bacterial plasmids and augmenting this dendrogram with twelve pAKD plasmids isolated from Norwegian soil. For purposes of comparison, we also construct a smaller dendrogram consisting of 50 species and use two different distance metrics. The two resulting trees agree well with results shown in [21]. The classification results for the twelve plasmids agree with a phylogenetic tree constructed using multiple sequence alignment of the relaxase gene traI presented in [20].
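The insertion step can be sketched as a nearest-profile lookup. The excerpt does not specify how protein profiles are encoded or which correlation measure is used, so the sketch below assumes that each profile is a presence/absence vector over protein clusters and that correlation is the Pearson coefficient. The names pearson and best_match and the toy plasmid labels are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0   # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def best_match(new_profile, library):
    """Return the library taxon whose profile correlates best with new_profile."""
    return max(library, key=lambda name: pearson(new_profile, library[name]))


if __name__ == "__main__":
    # Toy library: 1 means the taxon has a member in that protein cluster.
    library = {
        "plasmid_A": [1, 1, 0, 0, 1],
        "plasmid_B": [0, 0, 1, 1, 0],
        "plasmid_C": [1, 0, 1, 0, 1],
    }
    new_taxon = [1, 1, 0, 0, 0]
    print(best_match(new_taxon, library))
```

The lookup costs one correlation per library profile, so placing a new taxon scales linearly with the number of taxa already in the dendrogram and avoids rebuilding the whole tree.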



