Date Published: August 01, 2017
Publisher: International Union of Crystallography
Author(s): Su Datt Lam, Sayoni Das, Ian Sillitoe, Christine Orengo.
This paper reviews the recent advances in computational template-based structural modelling and proposes the subclustering of protein domain superfamilies to guide the template-selection process.
In May 2017, the Protein Data Bank (PDB; Berman et al., 2000) celebrated a milestone release of 130 000 entries. There is still a steady flow of new structures, with more than 100 added each week. However, there remains an ever-widening gap between sequence and structure space, with more than 85 million protein sequences currently deposited in the UniProtKB/TrEMBL database (The UniProt Consortium, 2017). Thanks to structural genomics initiatives (Nair et al., 2009; Terwilliger, 2011; Schwede, 2013), which have deliberately targeted structurally uncharacterized families, an increasing number of sequences now have homologues of known structure. Various protein structure modelling approaches have been developed. In this review, we focus on comparative modelling.
The most commonly used and most accurate protein structure modelling method is comparative modelling, which predicts the structure of a protein of unknown structure using information from one or more homologues of known structure. Comparative modelling usually involves three steps: (i) the identification of template structures for modelling the query protein, (ii) sequence alignment between the template and the query, and (iii) modelling the structure of the query.
Whilst it is outside the scope of this article to provide a historical review of developments in comparative modelling, we highlight some recent breakthroughs which have improved performance. An exciting recent development relates to more accurate predictions of residue–residue contacts. Residue-contact information has been used in the past, albeit not very successfully (with >80% false positives; Monastyrskyy et al., 2014). Although these earlier approaches included co-evolution methods, performance was poor because it was difficult to separate indirect couplings from direct couplings. In addition, very sequence-diverse multiple sequence alignments were typically required. Recently, methods based on direct coupling analysis have been able to disentangle direct couplings from indirect couplings (Marks et al., 2011; Jones et al., 2012; Nugent & Jones, 2012; Kamisetty et al., 2013). Furthermore, in some cases the problem of obtaining a sufficient number of diverse sequences can be solved by using metagenome data (Ovchinnikov et al., 2017).
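The distinction between direct and indirect couplings can be illustrated with a deliberately simplified toy example. The sketch below is not an implementation of any of the methods cited above: it assumes three continuous Gaussian variables rather than categorical alignment columns, and the covariance values and function names are illustrative assumptions. Positions A and C are correlated only because both are coupled to B. The raw correlation between A and C is therefore non-zero, whereas the partial correlation obtained from the inverse covariance matrix, which conditions on B, vanishes.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right-hand half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]


def partial_correlations(cov):
    """Partial correlation of each pair, conditioning on all other variables."""
    prec = invert(cov)  # precision (inverse covariance) matrix
    n = len(cov)
    return [[1.0 if i == j
             else -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
             for j in range(n)]
            for i in range(n)]


# Toy chain A - B - C: A and C are coupled only through B (assumed values).
rho = 0.8
cov = [[1.0,       rho, rho * rho],
       [rho,       1.0, rho],
       [rho * rho, rho, 1.0]]

partial = partial_correlations(cov)
print("raw correlation A-C:    ", cov[0][2])
print("partial correlation A-C:", round(partial[0][2], 6))
print("partial correlation A-B:", round(partial[0][1], 6))
```

Real alignments are categorical, heavily undersampled and phylogenetically biased, so practical methods add sequence reweighting, regularization and more elaborate statistical models on top of this basic idea.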
A good-quality protein model should resemble a native protein. Native proteins usually have compact, well packed three-dimensional structures. The spatial features of the residues should comply with empirically characterized constraints on torsional angles captured in Ramachandran plots (Ramachandran et al., 1963). Hydrophobic side chains of the protein should be buried to reduce unfavourable contacts with water molecules. Hydrogen bonds, disulfide bridges, salt bridges and covalent bonds should be present, as these facilitate the folding and packing of the polypeptide chain.
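The torsional angles plotted in a Ramachandran diagram are dihedral angles defined by four consecutive backbone atoms: φ by C(i−1), N(i), Cα(i), C(i) and ψ by N(i), Cα(i), C(i), N(i+1). As a minimal sketch of how such an angle can be computed from atomic coordinates, the following uses only the standard library. The function names and the toy coordinates are assumptions made for illustration, and no allowed-region boundaries are encoded here.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """phi(i) from C(i-1), N(i), CA(i), C(i)."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """psi(i) from N(i), CA(i), C(i), N(i+1)."""
    return dihedral(n, ca, c, n_next)


# Toy coordinates (not a real residue): central bond along the z axis.
cis_angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter_turn = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans_angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis_angle, quarter_turn, trans_angle)
```

In a model-validation setting, the (φ, ψ) pair of each residue would then be compared with empirically derived allowed regions, which are best taken from an established validation tool rather than hard-coded.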
As mentioned above, there have been several recent developments in comparative modelling, and many excellent servers are now available for biologists wishing to model the structure of a query protein [for more information on the servers that are currently highly ranked, see Modi et al. (2016) or http://predictioncenter.org/casp12/zscores_final.cgi]. Since the focus of our group is on providing libraries of structural templates and a library of structural models, for the remainder of this article we consider resources that provide large repositories of pre-calculated three-dimensional models. The methods used to generate these repositories have either not been regularly assessed by CASP or do not currently rank top in CASP [although some, for example Phyre2 (Kelley et al., 2015) and pGenThreader (Lobley et al., 2009), have had good overall rankings for a number of years]. However, they have been applied to generate large or very large libraries of models and can therefore be useful for larger-scale requests from biologists.
As mentioned in §2.1, several approaches are used to identify a close relative with known structure for use as a template for comparative modelling. Where very close homologues are available (≥40% sequence identity), it is possible to detect the closest template using the results returned by BLAST. However, when only remote homologues are available it is best to scan against sequence profiles or HMMs constructed from closely related sets of homologues, for example within a SCOP or CATH superfamily. The Orengo group recently developed a subclassification of CATH protein domain superfamilies that clusters relatives that are likely to have very similar structures and functions.
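The decision described above can be sketched as a simple rule applied to the best hit from an initial pairwise search. The snippet below is only an illustration under stated assumptions: the function names and route labels are hypothetical, sequence identity is computed over aligned non-gap columns (other definitions use a different denominator), and the 40% default is the threshold quoted in the text.

```python
def percent_identity(aligned_query, aligned_template):
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    columns = [(q, t) for q, t in zip(aligned_query, aligned_template)
               if q != "-" and t != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for q, t in columns if q == t)
    return 100.0 * matches / len(columns)


def select_template_route(best_hit_identity, close_threshold=40.0):
    """Accept the pairwise hit, or fall back to profile/HMM scanning.

    The route labels are hypothetical names used only in this sketch.
    """
    if best_hit_identity >= close_threshold:
        return "use_pairwise_hit"
    return "scan_profiles_or_hmms"


# Toy alignments (invented sequences, for illustration only).
close = percent_identity("ACDE-", "ACDFG")             # 3 of 4 aligned columns match
remote = percent_identity("ACDEFGHIKL", "ACDQQQQQQQ")  # 3 of 10 aligned columns match
print(close, select_template_route(close))
print(remote, select_template_route(remote))
```

In practice the identity and coverage values would be read from the search output rather than recomputed, and coverage of the query by the hit would normally be checked alongside identity.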
Below, we highlight a few selected examples of recent developments in techniques that exploit comparative models to improve the structural determination or structural coverage of large-scale macromolecular assemblies.
The last few years have been an exciting era for the protein structural modelling community. There have been substantial improvements in residue-contact prediction thanks to the use of direct coupling analysis, better statistical machine learning and the huge amount of new sequence data provided by metagenome analyses. Many groups are now employing residue-contact prediction to enhance the performance of their methods. Better profile methods, such as those based on conditional random fields and Markov random fields, have improved the accuracy of the template-selection process. In addition, we have demonstrated the value of organizing domain superfamilies into functional families (CATH FunFams) for template selection. CATH FunFams group relatives that are highly likely to be of similar structure and function. They are generated using a new functional subclassification in CATH-Gene3D, which constrains the clustering of relatives by ensuring that any new relative joining a particular cluster matches the highly conserved functional determinants for that cluster (for example likely specificity-determining residues that influence the type of compounds bound or protein interactions). The improvement in template-selection accuracy relative to the HMM-based strategy used by HHsearch is therefore probably because the FunFam template-selection process only allows very remote relatives to be selected if they share the same or highly similar residues at key functional sites. Although HHsearch uses a powerful search strategy for remote homologues, it applies no explicit constraint to ensure that equivalent functional residues are matched.
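The functional-site constraint described here can be illustrated with a toy filter. This is not the FunFam algorithm: the residue similarity groups, the function names, the candidate names and the example sequences are all assumptions made for illustration. The sketch only shows the general idea that a remote candidate template is retained when its residues at designated functional columns of a shared alignment are identical or similar to those of the query family.

```python
# Hypothetical similarity groups; real methods derive conservation from the data.
SIMILAR_GROUPS = ("AILMV", "FWY", "DE", "KR", "NQ", "ST")


def residues_compatible(a, b):
    """True if two residues are identical or fall in the same assumed group."""
    if a == "-" or b == "-":
        return False  # a gap at a functional site never counts as a match
    if a == b:
        return True
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def matches_functional_sites(candidate, reference, site_columns):
    """Check every functional column of two sequences from the same alignment."""
    if len(candidate) != len(reference):
        raise ValueError("sequences must come from the same alignment")
    return all(residues_compatible(candidate[c], reference[c])
               for c in site_columns)


def filter_templates(candidates, reference, site_columns):
    """Keep the names of candidate templates compatible at all functional sites."""
    return [name for name, seq in candidates.items()
            if matches_functional_sites(seq, reference, site_columns)]


# Toy aligned sequences; columns 1 and 2 are the assumed functional sites.
reference = "MKDSH"
candidates = {
    "template_a": "LRESH",  # K->R and D->E: similar at both sites
    "template_b": "MKASH",  # D->A at column 2: incompatible
    "template_c": "MK-SH",  # gap at column 2: incompatible
}
kept = filter_templates(candidates, reference, [1, 2])
print(kept)
```

A filter of this kind would sit after the homology search, so that a high-scoring remote hit is still rejected when its key functional residues differ.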