Date Published: September 25, 2017
Publisher: BioMed Central
Author(s): William R. Taylor.
In order to find correlated pairs of positions between proteins, which are useful in predicting interactions, it is necessary to concatenate two large multiple sequence alignments such that the sequences that are joined together belong to those that interact in their species of origin. When each protein is unique then the species name is sufficient to guide this match, however, when there are multiple related sequences (paralogs) in each species then the pairing is more difficult. In bacteria a good guide can be gained from genome co-location as interacting proteins tend to be in a common operon but in eukaryotes this simple principle is not sufficient.
The methods developed in this paper take sets of paralogs for different proteins found in the same species and make a pairing based on their evolutionary distance relative to a set of other proteins that are unique and so have a known relationship (singletons). The former constitute a set of unlabelled nodes in a graph while the latter are labelled. Two variants were tested, one based on a phylogenetic tree of the sequences (the topology-based method) and a simpler, faster variant based only on the inter-sequence distances (the distance-based method). Over a set of test proteins, both gave good results, with the topology method performing slightly better.
The methods develop here still need refinement and augmentation from constraints other than the sequence data alone, such as known interactions from annotation and databases, or non-trivial relationships in genome location. With the ever growing numbers of eukaryotic genomes, it is hoped that the methods described here will open a route to the use of these data equal to the current success attained with bacterial sequences.
The online version of this article (doi:10.1186/s13015-017-0115-y) contains supplementary material, which is available to authorized users.
Test data requires protein complexes of known structure, each component of which has sufficient sequences that can be aligned and equated with their partners across the protein families to generate a concatenated alignment. In bacterial systems, many such examples can be found and the alignments automatically generated and paired, for example by the GREMLIN method . However, the automatic pairing across families retains an element of error which can be simply avoided by using domains in large proteins instead of subunits in a complex, which because of their covalent link through the protein chain, are known without doubt to coexist and interact in the same organism.