Date Published: November 20, 2009
Publisher: Public Library of Science
Author(s): Sushmita Roy, Diego Martinez, Harriett Platero, Terran Lane, Margaret Werner-Washburne, Dafydd Jones. http://doi.org/10.1371/journal.pone.0007813
Abstract: Computational prediction of protein interactions typically uses protein domains as classifier features because they capture conserved information of interaction surfaces. However, approaches relying on domains as features cannot be applied to proteins without any domain information. In this paper, we explore the contribution of pure amino acid composition (AAC) for protein interaction prediction. This simple feature, which is based on normalized counts of single amino acids or pairs of amino acids, is applicable to proteins from any sequenced organism and can be used to compensate for the lack of domain information.
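To make the feature concrete, the following is a minimal Python sketch of normalized single and pair amino acid counts. It is an illustration under stated assumptions, not the authors' implementation: the function names `aac_features` and `pair_features` are hypothetical, "pairs" is assumed here to mean adjacent residues, and representing a protein pair by concatenating the two vectors is also an assumption.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_features(sequence, k=1):
    """Return normalized counts of length-k amino acid strings.

    k=1 gives single-residue composition (20 values); k=2 gives
    adjacent-pair composition (400 values). Treating "pairs" as
    adjacent residues is an assumption made for this sketch.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only k-mers built from standard residues contribute to the total,
    # so ambiguous codes such as 'X' are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def pair_features(seq_a, seq_b, k=1):
    """Hypothetical pair representation: concatenate both AAC vectors."""
    return aac_features(seq_a, k) + aac_features(seq_b, k)


if __name__ == "__main__":
    vec = pair_features("MKTAYIAKQR", "MSDNGELEDK")
    print(len(vec))  # 20 values per protein, 40 for the pair
```

Because the vector depends only on the sequence, it can be computed for any protein, including those with no annotated domains. The resulting vectors could then be passed to any standard classifier.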
Partial Text: Protein interaction networks are networks of physical interactions among proteins and constitute an important component of the bio-molecular network in cells. Capturing the complete set of protein interactions is crucial for understanding the programs for cellular response to different environmental stresses. Although high-throughput technology has advanced our knowledge of the proteomes of many organisms, the estimated false negative rates of these datasets suggest that a non-trivial fraction of interactions remains undetected.
We first compared AAC against the evolutionarily-rich protein domain features for predicting interactions in the three yeast interaction datasets. We then compared AAC against the tuples and signature product features, which, like AAC, do not require protein domain information, on yeast, worm and fly datasets. Next, we performed a post-hoc feature analysis to identify the AAC features that were most beneficial for predicting interactions. Finally, we used classifiers combining AAC and domains to predict the complete yeast interactome and validated novel interactions using Gene Ontology.
We have described a novel sequence-based feature, amino acid composition (AAC), that can be used to predict protein interactions in different organisms. Compared to other sequence-based features, AAC is much simpler because it models very few sequential dependencies (unlike domains and tuples) and no explicit pairwise information (unlike Sigprod). Surprisingly, despite its simplicity, AAC performs on par with domains on protein pairs for which domain information is available. The good performance of AAC, in spite of its strong independence assumptions, may be due to its similarity to the bag-of-words model, which often performs at least as well as models that do not make independence assumptions.