Research Article: Predicting protein model correctness in Coot using machine learning

Date Published: August 01, 2020

Publisher: International Union of Crystallography

Author(s): Paul S. Bond, Keith S. Wilson, Kevin D. Cowtan.

http://doi.org/10.1107/S2059798320009080

Abstract

Two neural networks were trained to predict the correctness of protein residues by combining multiple validation metrics in Coot. Using the predicted correctness to automatically prune models led to significant improvements in the Buccaneer pipeline.

Partial Text

Manual completion of a model is a very time-consuming step in macromolecular structure solution. Initial models from homologues or from automated model-building programs will contain errors that must be identified and corrected. The primary method for identifying errors is visual examination of the model, the 2mFo − DFc map and the mFo − DFc map by the crystallographer, using a model-building program such as Coot (Emsley & Cowtan, 2004; Emsley et al., 2010). Errors can often be identified by visual examination alone. However, other validation metrics become more important in guiding decisions when the density is less obvious, for example in less ordered regions or lower resolution structures. Coot provides validation tools to identify Ramachandran outliers, unusual rotamers and other potential errors, as well as an interface to some tools from MolProbity (Williams et al., 2018). The job of the crystallographer is to combine all of these sources of information and decide whether the model is acceptable or whether it needs to be changed. The work presented here aims to emulate this decision-making process by using machine learning to predict the correctness of protein residues. Machine learning is well suited for this problem as expected patterns in the data are not written into the model in advance but can be found through analysis of the training data. A recent example from the field of crystallography is the use of initial data-processing statistics to predict whether the data are suitable for successful structure determination through SAD/MAD phasing (Vollmar et al., 2020).

Calculations were performed on a Scientific Linux 7.7 server with two AMD EPYC 7451 CPUs and 256 GB RAM. Programs were sourced from CCP4 7.0.076 (Winn et al., 2011).

The correctness of 382 485 residues in 639 Buccaneer models was assigned by automatic comparison with the models deposited in the PDB for these structures. Each residue was given a correctness value of either 0 or 1, assigned separately for the main chain and the side chain. This method of producing target correctness values is not perfect, but the vast majority of residues will be labelled correctly. Manual labelling of each residue would be too slow, and a large number of data points is important for the machine learning to work well.
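The labelling step can be illustrated with a minimal sketch. The text here does not state the criterion used in the comparison, so the distance cutoff, the function names and the rule "every atom must lie within the cutoff of the same-named atom in the deposited residue" are all assumptions made for illustration, not the authors' method.

```python
import math

# Hypothetical cutoff in Å: an assumption for illustration only.
CUTOFF = 1.0

# Atom names treated as main chain in this sketch (also an assumption).
MAIN_CHAIN = {"N", "CA", "C", "O"}


def atoms_match(built, reference, cutoff=CUTOFF):
    """Return 1 if every built atom is within `cutoff` of the same-named
    reference atom, otherwise 0. An empty atom set counts as matching."""
    for name, xyz in built.items():
        ref_xyz = reference.get(name)
        if ref_xyz is None or math.dist(xyz, ref_xyz) > cutoff:
            return 0
    return 1


def split_atoms(atoms):
    """Split a {name: (x, y, z)} mapping into main-chain and side-chain parts."""
    main = {n: p for n, p in atoms.items() if n in MAIN_CHAIN}
    side = {n: p for n, p in atoms.items() if n not in MAIN_CHAIN}
    return main, side


def label_residue(built, reference, cutoff=CUTOFF):
    """Return (main_chain_label, side_chain_label), each 0 or 1."""
    built_main, built_side = split_atoms(built)
    ref_main, ref_side = split_atoms(reference)
    return (
        atoms_match(built_main, ref_main, cutoff),
        atoms_match(built_side, ref_side, cutoff),
    )


# Example: main chain placed well, side chain built in the wrong place.
built = {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0),
         "C": (2.0, 1.4, 0.0), "O": (2.0, 2.6, 0.0), "CB": (5.0, 5.0, 5.0)}
deposited = {"N": (0.1, 0.0, 0.0), "CA": (1.5, 0.1, 0.0),
             "C": (2.0, 1.4, 0.1), "O": (2.1, 2.6, 0.0), "CB": (2.0, -1.0, 1.0)}
labels = label_residue(built, deposited)
```

Keeping the main-chain and side-chain labels separate means a residue with a well-placed backbone but a wrong side chain still contributes a positive main-chain example.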

Although the addition of the pruning step leads to improvements in the Buccaneer pipeline, the correctness score is far from optimal. One of the main problems is that machine learning was carried out as a mixture of classification and regression. Regression was used in order to obtain a continuous correctness score instead of a binary classification. However, as the target data were categorical, i.e. all samples had a target correctness of 1 or 0, it would have been better to use a classifier and take the predicted probability of each class as the continuous value. Another option would be to perform regression against a different, continuous target, for example the r.m.s.d. between the atoms of the query structure and the reference structure. This has the advantage that no cutoff has to be chosen, but it may cause difficulties because a residue built into the solvent 5 Å away from the structure is in practice no different to one 10 Å away. Classification using the r.m.s.d. could be a solution to this, and it does not have to be binary: for example, the classes could be an r.m.s.d. of <0.5 Å, <1 Å, <2 Å and ≥2 Å.

The Coot ML Correctness script and the scripts used for training the neural networks are available at https://doi.org/10.15124/44145f0a-5d82-4604-9494-7cf71190bd82. Coot version 0.8.9.2 or later is required for the script to work. The new pruning steps added to the Buccaneer pipeline in CCP4i2 will be available in CCP4 version 7.1. They can be turned on and off from the Options tab on the Input page of the task.
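The multi-class r.m.s.d. target suggested in the discussion can be sketched as follows. This is a minimal illustration, not code from the published scripts: the function names are hypothetical, and the sketch assumes that the query and reference atoms have already been paired in the same order.

```python
import math
from bisect import bisect_right

# Class boundaries in Å, taken from the example in the text:
# <0.5 Å, <1 Å, <2 Å and ≥2 Å.
CLASS_EDGES = [0.5, 1.0, 2.0]
CLASS_NAMES = ["<0.5 Å", "<1 Å", "<2 Å", "≥2 Å"]


def rmsd(query_coords, reference_coords):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) coordinates, assumed to be paired atom by atom."""
    if not query_coords or len(query_coords) != len(reference_coords):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(q, r) ** 2 for q, r in zip(query_coords, reference_coords))
    return math.sqrt(total / len(query_coords))


def rmsd_class(value):
    """Map an r.m.s.d. in Å to a class index 0..3 using the edges above.
    A value exactly on an edge falls into the higher class."""
    return bisect_right(CLASS_EDGES, value)


# A residue 5 Å away and one 10 Å away end up in the same class.
far = rmsd_class(rmsd([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)]))
farther = rmsd_class(10.0)
```

With this binning, residues built far out in the solvent share the last class regardless of how far away they are, which is the behaviour the discussion asks for and which a plain r.m.s.d. regression target would not give.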

 
