Date Published: October 9, 2011
Publisher: Hindawi Publishing Corporation
Author(s): Nada Basit, Harry Wechsler.
Wet laboratory mutagenesis to determine enzyme activity changes is expensive and time consuming. This paper expands on standard one-shot learning by proposing an incremental transductive method (T2bRF) for the prediction of enzyme mutant activity during mutagenesis using Delaunay tessellation and 4-body statistical potentials for representation. Incremental learning is in tune with both eScience and actual experimentation, as it accounts for cumulative annotation effects of enzyme mutant activity over time. The experimental results reported, using cross-validation, show that overall the incremental transductive method proposed, using random forest as base classifier, yields better results compared to one-shot learning methods. T2bRF is shown to yield 90% on T4 and LAC (and 86% on HIV-1). This is significantly better than state-of-the-art competing methods, whose performance yield is at 80% or less using the same datasets.
A chain of amino acids in a given sequence forms the primary structure that makes up a protein and determines its functions. Proteins are necessary for virtually every activity in the human body. There are twenty distinct amino acids that make up the polypeptides. They are known as proteinogenic or standard amino acids [1, 2]. The order of these amino acids in the chain, known as the primary sequence, is very important. Changes in even one amino acid (e.g., substituting one kind of amino acid, at a given location, with a different one) can affect the way the protein functions, that is, its activity. Such a substitution is an example of a mutation in the protein’s amino acid sequence and is characteristic of a single-site mutation.
The relevance of mutagenesis is straightforward. As an example, let us consider sickle-cell anemia. It is an autosomal recessive genetic blood disorder affecting red blood cells, which is caused by a single error in the gene for hemoglobin. The incorrect amino acid at one position in the molecule causes the normally lozenge-shaped red blood cells to become rigid and take the form of a sickle. This leads to a number of complications and shortens life expectancy to 42 years in males and 48 years in females. We note for completeness that a mutation, by definition, is not limited to a change in only a single amino acid. Multiple-site mutations, also known as multiple-point mutations, occur when more than one amino acid mutates. This paper considers only single-site mutations.
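A single-site mutation can be illustrated as a one-residue substitution in a sequence string. The sketch below is only an illustration and not part of the method described in this paper: the function name `apply_single_site_mutation` is hypothetical, positions are assumed to be 1-based, and the example fragment (the first eight residues of mature human beta-globin, with the sickle-cell substitution of glutamic acid by valine at position 6) is supplied here as an assumption and does not come from the datasets used in the paper.

```python
# One-letter codes of the twenty standard (proteinogenic) amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def apply_single_site_mutation(sequence, position, new_residue):
    """Return a copy of `sequence` with the residue at the 1-based
    `position` replaced by `new_residue` (hypothetical helper)."""
    if new_residue not in STANDARD_AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code: {new_residue!r}")
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    index = position - 1  # convert to Python's 0-based indexing
    if sequence[index] == new_residue:
        raise ValueError("the substitution does not change the residue")
    return sequence[:index] + new_residue + sequence[index + 1:]


# Assumed example: start of mature human beta-globin, E -> V at position 6.
wild_type = "VHLTPEEK"
mutant = apply_single_site_mutation(wild_type, 6, "V")
print(wild_type, "->", mutant)
```

The wild type and the mutant differ at exactly one position, which is what distinguishes a single-site mutation from a multiple-site one.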
Transduction is different from inductive inference. It is local inference (“estimation”) that moves from particular(s) to particular(s) [19, 20]. Inductive inference uses empirical data to approximate a functional dependency (the inductive step, which moves from particular to general) and then uses the learned dependency to evaluate the function at the points of interest (the deductive step, which moves from general to particular). Transduction instead infers the values of the function directly from the training data, and only at the points of interest [21, 22]. Inference takes place using both labeled and unlabeled data, which are complementary to each other. Transduction incorporates the unlabeled data, that is, the test (“query”) samples, into the classification process responsible for labeling them for the purpose of prediction. It further seeks a consistent and stable labeling across both (nearby) training (“labeled”) and test data. Here, transduction seeks to authenticate mutations whose function, for example, activity, is unknown, in a fashion that is most consistent with the given activities of known but similar proteins and/or their mutations from the PDB. The search for putative labels (for unlabeled samples) seeks to make the labels for both training and test data compatible or, equivalently, to make the training and test errors consistent.
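The idea of choosing the putative label that is most consistent with nearby labeled samples can be sketched with a simple nearest-neighbour “strangeness” rule. This is a minimal illustration under stated assumptions and is not the T2bRF method: the functions `strangeness` and `transduce`, the Euclidean distance, the choice k = 1, and the two-dimensional toy vectors are all hypothetical and introduced here only for illustration.

```python
import math


def strangeness(x, label, labeled, k=1):
    """Summed distance from x to its k nearest samples carrying `label`,
    divided by the summed distance to its k nearest samples carrying any
    other label. Lower values mean the putative label fits the labeled
    neighbourhood better. Assumes both classes occur in `labeled`."""
    same = sorted(math.dist(x, p) for p, y in labeled if y == label)[:k]
    other = sorted(math.dist(x, p) for p, y in labeled if y != label)[:k]
    return sum(same) / (sum(other) + 1e-12)  # epsilon avoids division by zero


def transduce(x, labeled, labels=(+1, -1), k=1):
    """Try each putative label for the query x and keep the least strange one."""
    return min(labels, key=lambda y: strangeness(x, y, labeled, k))


# Toy labeled samples: (feature vector, activity label), values made up.
labeled = [
    ((0.0, 0.0), +1),
    ((0.1, 0.0), +1),
    ((1.0, 1.0), -1),
    ((0.9, 1.0), -1),
]

print(transduce((0.2, 0.1), labeled))  # query close to the active samples
print(transduce((0.8, 0.9), labeled))  # query close to the inactive samples
```

No general model is fitted here: each query is labeled directly from the labeled samples around it, which is the particular-to-particular character of transduction described above.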
Much of the research on learning in general, and on modeling for protein function prediction in particular, has been done using one-shot learning, where all the data is available at once for both training and cross-validation (for the purpose of performance evaluation) using randomized partitions [5, 16, 26]. This section first provides details on the best learning methods used for training and validation in one-shot learning, and then introduces alternative methods for incremental learning. All but one of the methods considered (decision trees [27, 28]) are characteristic of voting or ensemble methods. The random forest classifier, characteristic of voting methods, consists of a collection of decision trees. It combines the predictions made by multiple decision trees (e.g., taking the mode of the results of the individual trees) to obtain the final label (“class”).
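The voting step of a random forest, taking the mode of the individual tree predictions, can be sketched as follows. This is a minimal illustration only: the function `forest_predict` is a hypothetical name, and the three “trees” are reduced to single made-up threshold tests rather than trees trained on residue profile vectors.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the mode of the labels predicted by the individual trees."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical "trees", each reduced to one threshold test that
# returns +1 ("active") or -1 ("inactive"). The thresholds are made up.
trees = [
    lambda x: +1 if x[0] > 0.5 else -1,
    lambda x: +1 if x[1] > 0.2 else -1,
    lambda x: +1 if x[0] + x[1] > 1.5 else -1,
]

label = forest_predict(trees, (0.7, 0.4))
print(label)  # two of the three trees vote +1
```

With binary labels, an odd number of trees guarantees that the vote cannot end in a tie.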
The mutations under consideration for the purpose of enzyme mutant activity prediction are those of HIV-1 protease, bacteriophage T4 lysozyme, and Lac repressor. Data comes from the RCSB Protein Data Bank (PDB) (http://www.pdb.org), an international repository for the processing and distribution of 3D macromolecular structure data that is primarily determined experimentally. The Delaunay tessellations of HIV-1 protease, bacteriophage T4 lysozyme, and Lac repressor are based on the structural coordinates obtained from PDB accession files 3phv, 3lzm, and 1efa, respectively. The data is fed into the learning algorithms (see Section 4) in the form of residue profile vectors (see Section 2.3) courtesy of Masso and Vaisman [5, 26]. Prediction concerns activity, which is related to some particular function and is characterized using binary labels. If a protein (wt or mutant) carries out some particular function at an acceptable level, compared to some predefined threshold, then the protein’s activity is considered “active” (+1). If a protein, due to a mutation or otherwise, does not perform the same function at an acceptable level (with respect to the wt protein or otherwise) or ceases to function at all, then the protein’s activity is considered “inactive” (−1). Using the protein hemoglobin again as an example, its activity is considered “active” if it successfully transports oxygen from the lungs to the rest of the body (tissues), where it releases the oxygen for cell use and collects carbon dioxide to return to the lungs. The hemoglobin’s activity is considered “inactive” if it is not able to perform this function, or not able to perform it at the expected level of efficiency, for example, due to sickle-cell disease.
For all the mutation examples in our datasets, the “ground truth” protein activity in each case has been experimentally determined in the lab and provides the binary class labels used for training and validation. The characteristics of the datasets are briefly described next.
All the experiments were run using the three datasets described in the previous section, that is, HIV-1, T4, and LAC. Both 4-fold and 10-fold cross-validation are employed using smart (balanced) partitioning. The results reported are based upon an average of 10 runs using four performance evaluation indexes: average accuracy, standard deviation, sensitivity, and specificity. Confusion (“contingency”) matrices are derived for protein “binary” function (“activity”) prediction for each experiment using different learning algorithms. Using TP, TN, FP, and FN to indicate the numbers of true positives, true negatives, false positives, and false negatives, respectively, the performance indexes are defined as follows. Accuracy is defined as (TP + TN)/(TP + TN + FP + FN). Sensitivity (the true positive rate), defined as TP/(TP + FN), is a measure of how well the positive class is predicted. A test with high sensitivity has a low Type II error rate. While a good and useful performance indicator, sensitivity does not describe how well predictions are made for the other class, in this case the negative class. Towards that end, specificity (the true negative rate) is defined as TN/(TN + FP). A test with high specificity has a low Type I error rate.
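The three indexes defined above follow directly from the four entries of a binary confusion matrix. The sketch below uses a hypothetical function name, `binary_metrics`, and made-up counts chosen only to make the arithmetic easy to follow. They are not results from this paper.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, and specificity from the counts of
    true positives, true negatives, false positives, and false negatives."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


# Made-up counts for illustration: 50 active and 50 inactive mutants.
accuracy, sensitivity, specificity = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy={accuracy:.2f} "
      f"sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f}")
```

With these counts, accuracy is 85/100, sensitivity is 40/50, and specificity is 45/50, which shows how a classifier can predict one class noticeably better than the other while accuracy sits in between.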
This paper expands on standard one-shot learning for enzyme mutant activity prediction using incremental learning. The computational approach proposed is driven by existing methods for protein sequence representation using Delaunay tessellation and 4-body statistical potentials. The novelty of the paper comes from the use of transduction strategies for incremental learning. Random forest has been empirically found to perform best as base classifier for both one-shot learning and incremental learning. The novel enzyme mutant activity prediction method, T2bRF, driven by incremental transduction using random forests as base classifier, has been found empirically, under cross-validation, to compare favorably against current state-of-the-art contending methods.