Date Published: February 01, 2018
Publisher: International Union of Crystallography
Author(s): Jane S. Richardson, Christopher J. Williams, Bradley J. Hintze, Vincent B. Chen, Michael G. Prisant, Lizbeth L. Videau, David C. Richardson.
An overview is provided of current crystallographic model validation of proteins and RNA, both foundations and criteria, at all resolution ranges, together with advice on how to correct specific types of problems and on when not to push corrections so hard that the model becomes overfitted.
Structure validation highlights the good parts and identifies possible problems in macromolecular structures (initially for X-ray crystallography, but also for neutron, NMR and cryo-EM methods). It has aspects that address (i) the experimental data, (ii) the modeled coordinates and (iii) the model-to-data fit. It was spurred into existence around 1990 after two high-profile chain mis-tracings, and started with Rfree (Brünger, 1992 ▸), bond lengths and angles (Engh & Huber, 1991 ▸), twinning (Yeates, 1997 ▸), and the multi-criterion systems of PROCHECK (Laskowski et al., 1993 ▸), OOPS (Jones et al., 1991 ▸) and WHATCHECK (Hooft et al., 1996 ▸). Here, we will emphasize validation of the model, from physical principles and prior experience, always within the requirement for a good model-to-data fit.
The starting point for our MolProbity validation is the H atom. H atoms comprise about half of the atoms in biological macromolecules, as the ‘twigs’ at the outer edges of the covalent tree structure. About three quarters of all contacts within or between molecules have an H atom on one or both sides. Historically, H atoms were seldom included because they make calculations expensive and visualizations more cluttered, and because they are not directly observable in crystallography except at extremely high resolution, since they have only a single electron, and even this electron does not diffract well. However, we know that they are really there, both from chemistry and from our very best structures (Fig. 1 ▸). All systems include some sort of ‘bump’ check, but most are simple center-to-center distance checks between the heavier atoms (C, N, O…), using the poor approximation of united-atom radii that account for hydrogen volume but not directionality. We aim to convince the reader that explicit H atoms can be placed quite accurately from good heavier atom positions, and that considering the detailed geometry of their hydrogen bonds and steric contacts revolutionizes the ability to model and understand local structure.
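The idea behind a center-to-center bump check can be sketched as follows: the gap between two van der Waals surfaces is the interatomic distance minus the sum of the radii, and a negative gap flags a potential clash. The radii below are illustrative round numbers, not MolProbity's actual Probe radius set:

```python
import math

# Illustrative van der Waals radii in angstroms (approximate values;
# MolProbity/Probe uses its own refined radius set).
VDW_RADIUS = {"C": 1.70, "N": 1.55, "O": 1.52, "H": 1.22}

def contact_gap(atom1, atom2):
    """Gap between two atoms' van der Waals surfaces in angstroms.

    Each atom is (element, (x, y, z)).  A negative gap means the
    surfaces interpenetrate, i.e. a potential steric clash.
    """
    (el1, p1), (el2, p2) = atom1, atom2
    distance = math.dist(p1, p2)
    return distance - (VDW_RADIUS[el1] + VDW_RADIUS[el2])

# With explicit H atoms the check becomes directional: a C...C pair
# can pass a united-atom test while the specific H atoms that actually
# point at each other still overlap badly.
```

The point of the explicit-H approach is that the clash is attributed to the particular H atom and its direction, rather than smeared into an inflated heavy-atom radius.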
Since serious problems usually show up in multiple criteria, and since it is possible to ‘game’ any one measure at the expense of the others, model validation should be as comprehensive as feasible. Fig. 4 ▸ shows a key to all of the three-dimensional graphical outlier markup in MolProbity, as seen on the website and in the subsequent figures. Outliers are also listed in chart or table form, with their parameters and scores. All-atom contact dots and spikes have already been explained in Fig. 2 ▸(a). Bond-length and angle parameters were pioneered by Engh & Huber (1991 ▸) for proteins and by Berman and Olson (Gelbin et al., 1996 ▸) for nucleic acids; they are shown here as springs or fans, in red if too large and in blue if too small. They have been updated primarily by making them context-dependent: on φ, ψ by Karplus and Dunbrack (Berkholz et al., 2009 ▸), on ribose pucker by us (Jain et al., 2015 ▸), or by combining angles into the Cβ deviation, which flags a Cβ position forced to be nontetrahedral by an incorrect local fit of either the side chain or the backbone (Lovell et al., 2003 ▸). Outliers in conformation (combinations of rotatable dihedrals) are shown in gold for side-chain rotamers, in green for Ramachandran and as magenta crosses for ribose pucker. Newer validation types are designed to be more robust at low resolution (CaBLAM) or to flag systematic errors such as too many cis-non-Pro peptides. Each of the last five will be discussed further below.
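The Cβ-deviation idea can be sketched concretely: rebuild an ideal Cβ from the backbone N, Cα and C atoms using fixed internal coordinates, then measure how far the modeled Cβ falls from that ideal position. The internal-coordinate values below are typical textbook numbers, not necessarily the exact parameters of Lovell et al. (2003):

```python
import numpy as np

# Assumed ideal internal coordinates for Cbeta (illustrative values).
CB_LENGTH = 1.522     # Calpha-Cbeta bond length (angstroms)
CB_ANGLE = 1.927      # N-Calpha-Cbeta angle (radians, ~110.4 deg)
CB_DIHEDRAL = -2.143  # C-N-Calpha-Cbeta dihedral (radians, ~-122.8 deg)

def place_atom(a, b, c, length, angle, dihedral):
    """Place a fourth atom d from reference atoms a, b, c and the
    internal coordinates bond(c-d), angle(b-c-d), dihedral(a-b-c-d)."""
    bc = (b - c) / np.linalg.norm(b - c)
    n = np.cross(b - a, bc)
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)
    # Components of d along the orthonormal frame (bc, m, n).
    d = np.array([length * np.cos(angle),
                  length * np.sin(angle) * np.cos(dihedral),
                  -length * np.sin(angle) * np.sin(dihedral)])
    return c + d[0] * bc + d[1] * m + d[2] * n

def cbeta_deviation(n, ca, c, cb):
    """Distance (angstroms) between the modeled Cbeta and an ideal
    Cbeta rebuilt from the backbone N, Calpha and C positions."""
    ideal_cb = place_atom(c, n, ca, CB_LENGTH, CB_ANGLE, CB_DIHEDRAL)
    return np.linalg.norm(cb - ideal_cb)
```

Because the ideal position is determined entirely by the backbone, a large deviation (MolProbity flags roughly ≥0.25 Å) means the backbone and side chain disagree about where Cβ must sit, i.e. one of them is misfitted.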
A misfit region inside a molecule usually produces outliers in more than one criterion. A bad clash means that one or both of the clashing atoms is in the wrong position, but it is seldom the case that just moving the two atoms apart is the right answer. For instance, for the selenomethionine (Mse351) in Fig. 7 ▸(a) all of the clashes have the methyl group in common, and the side chain is a rotamer outlier. Among rotamers that keep the Se atom centered in its clear density, the mmm rotamer (−60, −60, −60°) fits with no clashes and even places the methyl in a tiny bit of electron density, providing a win–win answer.
High resolution is wonderful, but complex, and requires much more work than one might expect. The hardest part is correctly sorting out the many alternate conformations that are visible. Even when each atom position shows a distinct peak, they often cross back and forth confusingly, making it hard to trace more than one valid, self-consistent model through them. Historically, the problems are worst in alternates B and higher, since most validation systems have not evaluated these at all. MolProbity has always given some assessment of multiple alternates, and we are now in the process of making this functionality more complete and easy to use. Some advice in the meantime is that if the occupancies are discernibly different, use the relative peak heights to join up atoms. Clashes with neighboring groups often mean that they also need modeled alternates, perhaps not differing enough to have been obvious. Nearby waters are the most frequent offenders, as in Fig. 9 ▸(a), since they are really part of the alternate conformation network but are typically only given high B factors rather than the partial occupancies and alternate flags that they really deserve. Watch out for deviant bond lengths, angles and Cβ deviations, which signal either that the alternates are incorrectly mixed or that separate alternates were not defined widely enough along the covalent structure. If alternates for a side chain are fitted with Cβ atoms >0.2 Å apart then alternates should also be defined in the backbone, and if any alternate backbone atoms are widely separated then the alternates should not rejoin at the peptide bonds (as is typically performed), but only at the flanking Cα atoms. PHENIX can now extend alternates in this manner (Deis et al., 2013 ▸) far enough out to avoid the sort of dire geometry seen in Fig. 9 ▸(b). In other programs, try duplicating, and flagging as alternates, the few extra atoms that need to separate slightly.
There is no question that low resolution (≥3 Å) is a truly difficult challenge. Discernible bumps for carbonyl O atoms mostly disappear, giving ill-defined backbone conformations. Some side-chain atoms should genuinely lie outside the density, not only confusing rotamer choice but also tempting both people and software to scrunch them back in. Electron-density connectivity is part way along in its change from following the atomic connectivity at 2 Å to being a slab for β-sheet and a solid tube for α-helix by 5–6 Å, and it makes this change through inconsistent, misleading intermediate forms. Crystallographic methods were developed at resolutions where one can first trace the chain and then deal with side chains, but at low resolution these tasks mix together, with the size and position of local side chains causing backbone density to break or coalesce in the wrong patterns. The best overall advice is to fit structure that is much more regular and ideal than the density suggests, which essentially always turns out to be the right answer if a higher resolution structure is solved later. For instance, Fig. 10 ▸(a) shows that the blobby bit of structure in Fig. 10 ▸(b) is actually a very regular β-strand with full hydrogen bonding, good rotamers and no Ramachandran or clash outliers. Once regular stretches of secondary structure have been fitted, they can be bent or shifted somewhat better into the overall density by refinement tools such as DEN (Schröder et al., 2010 ▸), jelly body (Murshudov et al., 2011 ▸) or morphing (Terwilliger et al., 2012 ▸), but preferably with the help of judicious hydrogen-bond restraints to minimize distortion.
RNA structure has quite different properties from either DNA or protein, and its complex tertiary structures, catalytic and binding functions, and roles in large dynamic molecular machines make it highly important. RNA backbone conformation is crucial to all of these functions, but has too many variables to model straightforwardly at the usually attainable resolutions, where the backbone between ribose and phosphate is a featureless tube. Fortunately, there are some tools to help with this problem (Jain et al., 2015 ▸).
Most of this validation has been around for quite a long time, and one might think that all one needs to do is check out the sliders and worst specifics on the PDB report. However, new issues keep arising from new methodologies or from unusual things in new macromolecules. Much of validation is now built into automated procedures, which is great, but occasionally this makes something new go wrong. The most notable current example is with cis-non-Pro peptides. As seen in Fig. 13 ▸(a) there are indeed genuine examples, which are almost always functionally important, but they are very rare: only about one in 3000 nonproline peptides is cis. However, as first pointed out by Croll (2015 ▸), in the last ten years there has been an epidemic of their overuse, with >10% of structures containing ≥30 times too many cis-non-Pro peptides (Fig. 13 ▸b), probably without the depositors realising. Once a conformation is in lists or fragment libraries, it becomes used whenever it happens to fit just a little better, and overall usage is not tracked. The cis-non-Pro case is even worse than random, since a cis peptide is more compact than a trans peptide and fits better into patchy or contracted density in a low-resolution loop (Fig. 13 ▸c). The first, easy stopgap is that cis-non-Pro peptides are now prominently flagged (see Fig. 4 ▸; Williams & Richardson, 2015 ▸) in MolProbity, PHENIX and Coot so that people will be aware of them. Highly twisted peptides (>30° from planar) are also now flagged; they are essentially never correct, since only two clear examples of >30° have been found in good reference data (Berkholz et al., 2012 ▸).
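Both flags reduce to the peptide ω dihedral (Cαi–Ci–Ni+1–Cαi+1): near 0° is cis, near ±180° is trans, and anything more than 30° from either planar value is twisted. A minimal sketch, using a standard signed-dihedral formula and the 30° cutoff described above:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in degrees."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

def classify_omega(ca_i, c_i, n_next, ca_next):
    """Classify a peptide bond from its omega dihedral
    (CAi - Ci - Ni+1 - CAi+1)."""
    omega = dihedral(ca_i, c_i, n_next, ca_next)
    if abs(omega) <= 30.0:
        return "cis"
    if abs(omega) >= 150.0:
        return "trans"
    return "twisted"  # >30 deg from planar: essentially never correct
```

Run over every peptide in a model, this immediately surfaces the epidemic described above: each cis or twisted assignment should be checked against the density rather than accepted silently.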
Here are the main take-home points for your own work.
(i) At least one person should look at the map and at each of the worst outliers (Fig. 15 ▸).
(ii) The goal is few outliers, not zero.
(iii) Follow the Zen of model anomalies: (1) correct most of them; (2) treasure the genuine few; (3) then rest serenely.