Date Published: April 10, 2015
Publisher: Public Library of Science
Author(s): Khaled Daqrouq, Rami Alhmouz, Ahmed Balamesh, Adnan Memic, Manuela Helmer-Citterich.
PDZ domains have been identified in an array of signaling proteins that are often unrelated except for the well-conserved structural PDZ domain they contain. These domains have been linked to many disease processes, including the common avian influenza as well as very rare conditions such as Fraser and Usher syndromes. Historically, based on their interactions and the nature of the bonds they form, PDZ domains have most often been assigned to one of three classes (class I, class II, and others grouped as class III), a classification that depends directly on their binding partner. In this study, we report on three unique feature extraction approaches, based on bigram and trigram occurrence and existence rearrangements within the domain's primary amino acid sequence, to assist PDZ domain classification. Feature extraction methods based on the wavelet packet transform (WPT) and Shannon entropy, together denoted wavelet entropy (WE), were proposed. Using 115 unique human and mouse PDZ domains, the existence rearrangement approach yielded a high recognition rate (78.34%), which outperformed our occurrence rearrangement based method. With a validation technique, the recognition rate was 81.41%. The method reported for PDZ domain classification from primary sequences proved to be an encouraging approach for obtaining consistent classification results. We anticipate that by increasing the database size, we can further improve feature extraction and correct classification.
PDZ domains are among the most common and important protein domains, and the proteins that contain them play an essential role in cell signaling and in organizing the post-synaptic density region [1–3]. Specifically, PDZ domain proteins have been implicated in maintaining cell polarity, in regulating the post-synaptic density by mediating protein-protein interactions, and in directing protein trafficking, among other functions [4–6]. Furthermore, their function, or rather malfunction, has been characterized in several disease states ranging from cystic fibrosis to cancer [7–9]. The primary sequence of a PDZ domain is usually 80 to 90 amino acids long. In addition, most PDZ domains have a conserved 3D fold made of six β strands and two α helices. Almost exclusively, PDZ domains bind the C-terminal motifs of their ligand and target proteins, causing them to cluster. However, PDZ domains that bind internal motifs have occasionally been observed [11, 12]. The binding pocket of PDZ domains is formed by the conserved GLGF motif, present in most PDZ domains, and usually recognizes the last four C-terminal amino acids of the target. Historically, a classification of PDZ domains was proposed based on the C-terminal motifs they bind. The two most prominent PDZ domain classes are Class I and Class II. For Class I PDZ domains, a typical motif is S/T-X-Φ, where a hydrophobic amino acid (Φ) is at the C-terminus, or position P0, preceded by any amino acid (X) at P-1 and by Serine or Threonine at P-2. Class II PDZ domains, in contrast, recognize ligand sequences with the Φ-X-Φ motif at the C-terminus.
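The two motif definitions above can be written as patterns on the last three residues of a ligand. The following is a minimal sketch, not the authors' method: the function name `ligand_class` and the residue set used for Φ are assumptions made for illustration, and the example peptides are hypothetical.

```python
import re

# Residues accepted at the hydrophobic (Φ) positions. This particular set is
# an assumption for illustration; the article does not list one.
HYDROPHOBIC = "AVLIMFWC"

# Class I: S/T at P-2, any residue at P-1, Φ at P0 (the C-terminus).
CLASS_I = re.compile(rf"[ST].[{HYDROPHOBIC}]$")
# Class II: Φ at P-2, any residue at P-1, Φ at P0.
CLASS_II = re.compile(rf"[{HYDROPHOBIC}].[{HYDROPHOBIC}]$")


def ligand_class(peptide: str) -> str:
    """Label a ligand peptide by its last three residues (P-2, P-1, P0)."""
    seq = peptide.upper()
    if CLASS_I.search(seq):
        return "class I"
    if CLASS_II.search(seq):
        return "class II"
    return "other"


# Hypothetical peptides, chosen only to exercise each branch.
for peptide in ("KQTSV", "EIVFA", "GGDEK"):
    print(peptide, ligand_class(peptide))
```

Note that this labels the ligand motif, whereas the study classifies the PDZ domain itself from its own primary sequence.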
To find the optimal PDZ domain classification approach, several objective evaluations were performed. The three suggested feature extraction methods, ORM, ERM1, and ERM2 in conjunction with WE (ERWE2), are examined. For PDZ domain classification, support vector machine (SVM), probabilistic neural network (PNN), and K-nearest neighbors (KNN) classifiers are utilized. Bigram, trigram, and fusion (bigram and trigram together in the same feature vector) features of each PDZ domain primary sequence are investigated. Table 2 shows the results of PDZ domain classification in terms of recognition rate (RR), which is the number of properly classified test samples divided by the total number of test samples. The whole database, comprising 115 unique PDZ domains, is utilized. We select 57% of the PDZ domains from each class for training. Our RR results are calculated as an average over 1000 different combinations of training and testing sets. In addition, we compute confidence intervals for the recognition rates at each step and iteration, as depicted in Fig 4. A confidence interval gives an estimated range of values, computed from the sample data, that is likely to include an unknown population parameter. The confidence interval for the recognition rates over the set combinations, with mean value μ and standard deviation σ, is based on the sample size n; therefore,

$$\mu \pm C\,\frac{\sigma}{\sqrt{n}}$$
where C is the critical value for a 95% confidence interval (1.96) [39, 40]. Confidence interval results were calculated and reported; specifically, the confidence interval states that 95% of the calculated recognition rates for each combination should be contained in this interval. A wider confidence interval would indicate an improper dataset or a database that is unsuitable for performing feature extraction. In our study, all intervals calculated for each method are within a reasonable range.
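As a rough illustration of the evaluation described above, the sketch below computes a recognition rate and a 95% confidence interval over repeated runs. It assumes the interval takes the standard form mean ± C·σ/√n with C = 1.96 and uses the sample standard deviation; the function names and the example rates are hypothetical and do not come from the study.

```python
import math
import statistics


def recognition_rate(predicted, actual):
    """Fraction of test samples whose predicted class matches the true class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def confidence_interval(rates, critical=1.96):
    """Return (low, high) for mean +/- critical * sd / sqrt(n).

    Assumption: sd is the sample standard deviation of the per-run rates.
    """
    n = len(rates)
    mean = statistics.fmean(rates)
    sd = statistics.stdev(rates)
    half_width = critical * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical recognition rates from three train/test splits.
rates = [0.75, 0.80, 0.85]
low, high = confidence_interval(rates)
print(f"mean={statistics.fmean(rates):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

In the study, the list of rates would hold one value per training and testing combination (1000 in total), so the interval narrows as the number of combinations grows.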
This work studied methods for classifying PDZ domains by examining their primary sequences. Three feature extraction methods were proposed. For classification, three known methods were utilized. For a better representation of the extracted features, we used WPT to decompose the signal into sub-signals covering different frequency bands. Shannon entropy was calculated for each WPT sub-signal. Thus, we could decrease the number of features obtained with BERM1 down to 128. Three methods for feature extraction were tested, one based on the occurrence rearrangement method and the other two based on the existence rearrangement method. Existence rearrangement was better than occurrence rearrangement in terms of recognition rate. We found that the classification performance on the 115 PDZ domains with ERM2 is better, though not significantly, than that of the two other feature extraction methods. The ERWE2 method with the KNN classifier achieved a recognition rate of 78.34% and an ACC value of 81.92%. For comparison purposes, our proposed method outperformed the three other published methods under similar test parameters. Overall, these types of sequence space analyses can be done relatively quickly and cheaply in comparison to structure-based methods. Specifically, the time and experimental cost of structure-based methods mean that fewer perturbations can be tested over a similar time period. Although PDZ domains show highly selective interaction patterns, our results indicate, with high accuracy, that our current classification approach agrees closely with previously published reports and the known classification pattern of PDZ domains. Specifically, our wavelet-based approach was able to extract important sequence characteristics and features of PDZ domains.
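The wavelet entropy step summarized above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the excerpt names neither the wavelet nor the decomposition depth, so the Haar wavelet, the zero padding of odd-length nodes, the entropy taken over normalized coefficient energies, and all function names are choices made here.

```python
import math


def haar_split(signal):
    """One Haar step on an even-length signal: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_packet_nodes(signal, level):
    """Full wavelet packet tree: every node is split again at every level.

    Returns the 2**level terminal nodes in natural (not frequency) order.
    """
    nodes = [list(signal)]
    for _ in range(level):
        next_nodes = []
        for node in nodes:
            if len(node) % 2:  # assumption: pad odd-length nodes with a zero
                node = node + [0.0]
            next_nodes.extend(haar_split(node))
        nodes = next_nodes
    return nodes


def shannon_entropy(coeffs):
    """Shannon entropy of the normalized energy distribution of a sub-signal."""
    energy = [c * c for c in coeffs]
    total = sum(energy)
    if total == 0:
        return 0.0
    return -sum(e / total * math.log(e / total) for e in energy if e > 0)


def wavelet_entropy_features(signal, level):
    """One entropy value per terminal wavelet packet node."""
    return [shannon_entropy(node) for node in wavelet_packet_nodes(signal, level)]


# Hypothetical 8-element feature vector, decomposed to level 2 (4 sub-signals).
print(wavelet_entropy_features([3, 0, 1, 2, 0, 0, 4, 1], level=2))
```

A full tree of depth k has 2^k terminal nodes, so a depth of 7 would yield 128 entropy values; whether that is how the 128 features in the study arise is not stated in this excerpt.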
By testing several different methods, we show that the bigram-based technique performs best for classification; from there on, we focused on assessing how these features can be used to obtain critical motifs. These critical motifs were indeed important and present in the conserved GLGF repeat. We also identified the 'TH' bigram, and specifically Histidine, as crucial; Histidine has been shown to make hydrogen bonds with the Serine/Threonine of the ligands for Class I domains. For Class II, we identified the bigrams LG, LQ, and LK, which are often located on αB and form part of the binding pocket. In future work, we will concentrate more directly on the role of position.
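To make the bigram features concrete, the sketch below counts how often each of the 400 possible amino acid bigrams occurs in a sequence (occurrence) and whether it occurs at all (existence). It is only an approximation of the idea: the rearrangement step of the ORM and ERM methods is not described in this excerpt and is not reproduced, and the function name and example sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
BIGRAMS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def bigram_features(sequence):
    """Return (occurrence, existence) vectors over all 400 bigrams.

    occurrence[i] is how many times BIGRAMS[i] appears in the sequence;
    existence[i] is 1 if it appears at least once, else 0.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(BIGRAMS, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard letters
            counts[pair] += 1
    occurrence = [counts[b] for b in BIGRAMS]
    existence = [1 if c else 0 for c in occurrence]
    return occurrence, existence


# Hypothetical fragment built from the GLGF motif mentioned in the text.
occurrence, existence = bigram_features("GLGFGLGF")
for bigram in ("GL", "LG", "GF", "FG", "TH"):
    idx = BIGRAMS.index(bigram)
    print(bigram, occurrence[idx], existence[idx])
```

Trigram features would follow the same pattern with `repeat=3`, giving 8000 possible entries, and a fusion vector would concatenate the two.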