Date Published: June 7, 2019
Publisher: Public Library of Science
Author(s): Xue Wang, Yuejin Wu, Rujing Wang, Yuanyuan Wei, Yuanmiao Gui, Jie Zhang.
Protein-protein interactions (PPIs) play an important role in the life activities of organisms. With the availability of large amounts of protein sequence data, PPI prediction methods have attracted increasing attention. A variety of protein sequence coding methods have emerged, but training models on these encodings is particularly time consuming. To address this issue, we propose a novel matrix sequence coding method. Using a deep neural network (DNN) and this novel matrix protein sequence descriptor, we constructed a model for predicting PPIs. When applied to human PPI data, the method achieved an accuracy of 94.34%, a recall of 98.28%, an area under the curve (AUC) of 97.79% and a loss of 23.25%. The model was also evaluated on a non-redundant dataset, where the prediction accuracy was 88.29%. These results indicate that the matrix of sequence (MOS) descriptor can improve the prediction of PPIs and reduce training time, making it a useful complement for future proteomics research. The experimental code and experimental results can be found at https://github.com/smalltalkman/hppi-tensorflow.
Protein-protein interactions (PPIs) are useful for elucidating how organisms change under physiological or pathological conditions and are important for disease prevention and drug development. In the last decade, numerous experimental methods for studying protein-protein interactions have emerged, such as yeast two-hybrid screens, hybrid approaches and protein chips. However, all of these experimental methods are time-consuming and costly. Therefore, using computational approaches to predict unknown PPIs has become an important research topic in bioinformatics. In recent years, many computational methods have been proposed to predict PPIs based on phylogenetic profiles, amino acid index distribution and gene fusion events [6, 7]. However, these methods are not universal, because their reliability depends on a priori information about the protein pairs.
We have presented a novel protein sequence coding approach for PPI prediction. Specifically, we propose a strategy for projecting protein sequences into a vector space that represents the matrix space of PPI information. We first classify the 20 amino acids into 7 groups according to their physicochemical properties (Table 1). This significantly reduces the dimensions of the matrix space, from 20×20 to 7×7. Next, we combine the elements on the diagonal of the 7×7 matrix and the elements above the diagonal into a 28-dimensional vector. To distinguish sequences of different lengths, a sequence label is added, so that a 29-dimensional vector represents a protein sequence. We combined MOS with DT, KN and RF and achieved good results. The experimental results show that the proposed MOS feature extraction method is effective. However, the disadvantage of the novel matrix sequence descriptor is that the sequence matrix is not in one-to-one correspondence with the protein sequence. For any two given sequences, the corresponding sequence matrices differ only when the sequence lengths differ, or when the lengths are the same but at least one matrix element holds a different value. Therefore, the data must be pre-processed to remove protein pairs whose sequences have the same length and the same matrix element values.
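The encoding steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the seven-group table is the commonly used conjoint-triad grouping (Table 1 is not reproduced in this text), the rule for filling the matrix is assumed to be a count of adjacent residue pairs by group, and the sequence label is assumed to be the sequence length. The names `GROUPS`, `CLASS_OF`, `UPPER` and `mos_vector` are hypothetical.

```python
from itertools import combinations_with_replacement

# Assumed 7-group table (the common conjoint-triad grouping); the paper's
# Table 1 is not shown in this text, so treat this as a placeholder.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: idx for idx, group in enumerate(GROUPS) for aa in group}

# The 28 (row, col) positions on or above the diagonal of a 7x7 matrix.
UPPER = list(combinations_with_replacement(range(7), 2))


def mos_vector(sequence):
    """Return a 29-dimensional descriptor: 28 matrix entries plus a label.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    classes = [CLASS_OF[aa] for aa in sequence.upper()]
    matrix = [[0] * 7 for _ in range(7)]
    # Assumption: each entry counts adjacent residue pairs by group,
    # stored symmetrically so only the upper triangle is needed.
    for a, b in zip(classes, classes[1:]):
        i, j = min(a, b), max(a, b)
        matrix[i][j] += 1
    # Assumption: the sequence label is simply the sequence length.
    return [matrix[i][j] for i, j in UPPER] + [len(sequence)]
```

Under these assumptions, `mos_vector("AG")` and `mos_vector("VA")` return the same vector, because A, G and V fall into the same group. This illustrates why the descriptor is not in one-to-one correspondence with the sequence and why duplicate encodings need to be filtered out during pre-processing.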
With the increasing number of computational PPI prediction methods, coding methods for amino acid feature vectors are also emerging. Although protein encoding methods such as AC, CT and LD are useful, one of their disadvantages is that the order relationship of the entire amino acid sequence is not considered. CT considers the order relationship of three amino acids. AC considers the order relationship of 30 amino acids. LD only considers the neighbouring effect of two adjacent types of amino acids. To overcome this problem, we propose an efficient method for predicting PPIs from amino acid sequences that combines a novel matrix sequence descriptor with a deep neural network. The proposed protein feature extraction method considers the order relationship of the entire amino acid sequence. When applied to human PPI data, DNN-MOS, DT-MOS, KN-MOS and RF-MOS achieved good results. Additionally, the model was evaluated on a non-redundant dataset, where the prediction accuracy was 88.29%. The experimental results show that the matrix sequence descriptor is promising for predicting PPIs and can complement other methods.