Date Published: July 21, 2017
Publisher: Public Library of Science
Author(s): Lei Jia, Yaxiong Sun, Alexandre G. de Brevern.
Chemical stability is a major concern in the development of protein therapeutics because it affects both efficacy and safety. Protein “hotspots” are amino acid residues that are subject to chemical modifications such as deamidation, isomerization, glycosylation, and oxidation. A more accurate method for predicting potential hotspot residues would allow them to be eliminated or reduced as early as possible in the drug discovery process. In this work, we focus on prediction models for asparagine (Asn) deamidation. The sequence-based prediction method simply flags the NG motif (an asparagine followed by a glycine) as liable to deamidation. Because of its convenience, it still dominates deamidation evaluation in most pharmaceutical settings. However, this simple sequence-based method is less accurate and often leads to over-engineering of a protein. We introduce structure-based prediction models built by mining available experimental and structural data on deamidated proteins. Our training set contains 194 Asn residues from 25 proteins, all of which have high-resolution crystal structures available. Experimentally measured deamidation half-lives of Asn in pentapeptides, together with 3D structure-based properties such as solvent exposure, crystallographic B-factors, local secondary structure, and dihedral angles, were used to train prediction models with several machine learning algorithms. The prediction tools were cross-validated and also tested on an external test data set. The random forest model showed high enrichment, ranking deamidated residues above non-deamidated residues while effectively eliminating false positive predictions. Such quantitative protein structure–function relationship tools may also be applicable to the prediction of other protein hotspots. In addition, we discuss extensively the metrics used to evaluate prediction performance on unbalanced data sets such as the deamidation case.
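For reference, the sequence-based rule mentioned above amounts to a simple motif scan. The following is a minimal sketch of that rule, not the code used in this work; the function name `find_ng_motifs` and the example sequence are hypothetical.

```python
import re


def find_ng_motifs(sequence: str) -> list[int]:
    """Return 1-based positions of Asn (N) residues immediately followed by Gly (G)."""
    seq = sequence.upper()
    # The lookahead (?=G) checks the next residue without consuming it,
    # so each reported position is that of the Asn itself.
    return [match.start() + 1 for match in re.finditer(r"N(?=G)", seq)]


# Hypothetical one-letter sequence used only for illustration.
example = "MKTNGSLNAQNGK"
print(find_ng_motifs(example))  # [4, 11]
```

The Asn at position 8 is followed by Ala and is therefore not flagged. Because the rule looks only at the neighboring residue, it flags every NG motif regardless of its structural environment.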
Deamidation occurs primarily at asparagine (Asn) residues. The backbone nitrogen atom of the residue on the C-terminal side of the Asn carries out a nucleophilic attack on the carbon atom of the Asn side-chain amide group. A ring-closed succinimide intermediate is proposed to form. The succinimide then undergoes rapid hydrolysis to give the final product, aspartic acid (Asp) or isoaspartic acid (IsoAsp) (Fig 1). The deamidation process therefore results in an Asn to Asp/IsoAsp mutation. Glutamine (Gln) residues can also undergo deamidation. However, Gln deamidation is much slower than Asn deamidation [2, 3], so it is of less concern. Deamidation of asparagine residues in biological pharmaceuticals is a major cause of degradation if the therapeutic proteins are not formulated and stored appropriately. If deamidation occurs in a complementarity determining region (CDR) of a monoclonal antibody, the antibody’s binding potency can be affected. Evaluating Asn deamidation liability is therefore a very important step in the engineering of therapeutic proteins.
The machine learning models described in this work, which use structure-based descriptors for deamidation prediction, achieved improved accuracy compared with existing methods. The application can make deamidation predictions for proteins in general and is not limited to antibodies. Compared with sequence-based methods, structure-based methods provide insight into the molecular basis of the deamidation event. Enriching the training data set with more diverse protein deamidation data would further increase prediction accuracy. Using machine learning methods to predict structure-function relationships in protein engineering faces many challenges. Proteins have highly diverse structures encoded by their primary sequences, and they undergo concerted dynamics in order to carry out their functions. In this work, the descriptor set for deamidation prediction was developed with close reference to the chemistry of the reaction. The set includes an experimental measurement as well as structural features. However, further improvement may be possible if protein dynamics, which can be investigated by molecular dynamics simulations, are taken into account. When the training data set and descriptor set are imperfect, statistical machine learning algorithms play an important role in removing noise and amplifying the signal in the data. In theory, if perfect descriptors could be obtained, a very simple statistical algorithm should be sufficient to construct a predictive model without overfitting the data. In practice, we always balance three key components in a prediction modeling process to optimize performance: the training data set, the descriptor set, and the statistical algorithm. Finally, since protein engineering data sets are often noisy and unbalanced, it is critical to select proper statistical metrics for performance evaluation.
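To illustrate why metric choice matters for unbalanced data, the sketch below compares accuracy with precision, recall, and the Matthews correlation coefficient (MCC) on made-up labels. The choice of these four metrics, the function name `imbalance_metrics`, and the label vectors are assumptions for illustration only; they are not the metrics or results reported in this work.

```python
import math


def imbalance_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and MCC for binary labels (1 = deamidated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + fp + tn + fn
    # Ratios with an empty denominator are reported as 0.0 by convention here.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mcc": mcc}


# Hypothetical unbalanced set: 10 deamidated residues out of 100.
y_true = [1] * 10 + [0] * 90

# A trivial model that never predicts deamidation.
all_negative = [0] * 100
print(imbalance_metrics(y_true, all_negative))  # accuracy 0.9, but recall and MCC are 0.0

# A model that finds 6 of the 10 positives at the cost of 2 false positives.
selective = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 88
print(imbalance_metrics(y_true, selective))
```

The trivial model reaches 90% accuracy without identifying a single deamidated residue, which is why accuracy alone can be misleading when positives are rare.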