Date Published: May 01, 2019
Publisher: International Union of Crystallography
Author(s): Rafiga C. Masmaliyeva, Garib N. Murshudov.
This paper describes some statistical tools for analyses of macromolecular B values.
Refinement and validation of atomic models elucidated using crystallographic and, increasingly, single-particle cryo-EM methods (Frank, 2006) are essential steps in the derivation of reliable three-dimensional structures of macromolecules. Atomic refinement procedures based on Bayesian statistics are now routine (Bricogne, 1997; Murshudov et al., 2011; Pannu & Read, 1996). Prior structural and chemical information pertaining to the building blocks of macromolecules is used during refinement (Vagin et al., 2004; Long et al., 2017; Moriarty et al., 2009; Nicholls et al., 2012; Smart et al., 2011, 2012) as well as for validation (Davis et al., 2007; Read et al., 2011). This aids the derivation of chemically and structurally sensible atomic models that are consistent with prior knowledge, whilst transferring as much information as possible from the experimental data to the model via the likelihood function. A number of research papers and software tools are dedicated to the validation of positional parameters of atomic models (see Chen et al., 2010; Hooft et al., 1996; Read et al., 2011, and references therein). These papers and the corresponding software tools have been instrumental in improving the quality of the models deposited in the PDB (Read et al., 2011; Berman et al., 2002). As highlighted by Pozharski et al. (2013) and Weichenberger et al. (2013), there is still a long way to go before we can claim that the quality of the models in the PDB agrees well with prior knowledge and optimally reflects experimental data.
Individual atomic displacement parameters (ADPs) are proportional to the variances of positional parameters; ADPs represent the mobility of atoms as well as the accuracy of the positional parameters. In Bayesian statistics, the inverse-gamma distribution (IGD) is often used as a prior probability distribution to model the variance of a normal distribution (see, for example, O’Hagan, 1994). Since ADPs are proportional to the variances, we can postulate that the distribution of ADPs is likely to resemble the IGD. However, since average B values depend on the sharpening level of the data, we add a shift parameter to the IGD. Sharpening/blurring should change the average B value without affecting the shape of the distribution, except in cases where over-sharpening produces negative or many small B values. Therefore, we assume that the distribution of isotropic ADPs can be modelled using the shifted inverse-gamma distribution (SIGD),

$$P(B; \alpha, \beta, B_0) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} (B - B_0)^{-\alpha - 1} \exp\left(-\frac{\beta}{B - B_0}\right), \quad B > B_0.$$

This distribution has three parameters: shape (α), scale (β) and shift (B0). If there is no over/under-sharpening of Fourier coefficients then B0 = 0, although this is rarely the case. Changing ADPs from B to u = B/8π² only affects the scale and shift parameters. Because the shape parameter is also known as the degrees of freedom, and the positional parameters of atoms reside in a three-dimensional space, it is tempting to assume that the shape parameter of the SIGD would be around 3. However, we refine this parameter using soft harmonic restraints to ensure that the estimation of SIGD parameters is stable while allowing some variability.
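As a minimal sketch of the distribution described above, the following Python code evaluates an inverse-gamma density shifted by B0 and adds a harmonic restraint on the shape parameter to a negative log-likelihood. The function names, the restraint target of 3 and the restraint weight are illustrative assumptions, not the parameterization used in any refinement program.

```python
import math


def sigd_logpdf(b, alpha, beta, b0):
    """Log-density of an inverse-gamma distribution shifted by b0, evaluated at b."""
    x = b - b0
    if x <= 0.0:
        return float("-inf")  # the density has no support at or below the shift
    return (alpha * math.log(beta) - math.lgamma(alpha)
            - (alpha + 1.0) * math.log(x) - beta / x)


def sigd_pdf(b, alpha, beta, b0):
    """Density of the shifted inverse-gamma distribution at b."""
    return math.exp(sigd_logpdf(b, alpha, beta, b0))


def restrained_nll(b_values, alpha, beta, b0, alpha_target=3.0, weight=1.0):
    """Negative log-likelihood of the B values plus a soft harmonic restraint on alpha.

    alpha_target and weight are assumed values chosen for illustration only.
    """
    nll = -sum(sigd_logpdf(b, alpha, beta, b0) for b in b_values)
    return nll + 0.5 * weight * (alpha - alpha_target) ** 2


# Converting B to u = B / (8 pi^2) rescales beta and b0 but leaves alpha unchanged.
c = 8.0 * math.pi ** 2
density_in_b = sigd_pdf(30.0, 3.0, 60.0, 5.0)
density_in_u = sigd_pdf(30.0 / c, 3.0, 60.0 / c, 5.0 / c)  # equals c * density_in_b
```

A function such as `restrained_nll` could be passed to any general-purpose minimizer. The restraint term keeps the shape parameter near its target when the data alone determine it poorly.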
Since the 3D maps used for model building in crystallographic and cryo-EM experiments correspond to densities of electrons and electrostatic potentials, respectively, it is interesting to analyse the effect of B values on the peak height at the centre of atoms for a given resolution. It is clear that the peak heights are dependent on atom types, occupancies, resolution and B values. In order to allow comparison of atomic peak heights, we ignore the effect of different atom types and occupancies; we essentially treat all atoms as point atoms. As a result of resolution cutoff and B values (atomic mobility), the density becomes smeared out; this affects the values of the density map at the centre of atoms. The density corresponding to the point atom with B value equal to Bmod and resolution smax = 1/dmax can be calculated using (see, for example, Chapman, 1995)

$$\rho(x; B_{\mathrm{mod}}, s_{\max}) = 4\pi \int_0^{s_{\max}} s^2 \exp\left(-\frac{B_{\mathrm{mod}} s^2}{4}\right) \frac{\sin(2\pi s |x|)}{2\pi s |x|}\, ds.$$

This is the shape of the point-atom density corresponding to a given resolution and B value. The density at the centre can be calculated by letting x = 0:

$$\rho(0; B_{\mathrm{mod}}, s_{\max}) = 4\pi \int_0^{s_{\max}} s^2 \exp\left(-\frac{B_{\mathrm{mod}} s^2}{4}\right) ds.$$

As can be seen, the density at the centre of the atoms depends on the resolution as well as on the B value. The real observed density will also depend on the noise level, the weights used in map calculations, the occupancies of the atoms, the quality of the amplitudes and phases, the number, types and proximity of neighbouring atoms, the overall anisotropy of the data and many other factors. However, a very simple analysis of peak heights should shed some light onto what can be expected at a given resolution. Even if the distribution of the B values is known, it is tricky to derive a closed-form expression for the distribution of peak heights at the atomic centre; therefore, in the following analysis we will use only empirical and simulated distribution histograms for peak-height analysis.
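The simulated peak-height histograms mentioned above can be sketched as follows. This assumes that the central density of a point atom is 4π times the integral of s² exp(−B s²/4) from 0 to smax, and that B values are drawn from a shifted inverse-gamma distribution. All function names, parameter values and the binning scheme are illustrative assumptions.

```python
import math
import random


def peak_height(b, s_max, n=200):
    """Central point-atom density, 4*pi * integral_0^s_max s^2 exp(-b s^2 / 4) ds.

    Evaluated with Simpson's rule; n must be even.
    """
    h = s_max / n

    def f(s):
        return s * s * math.exp(-b * s * s / 4.0)

    total = f(0.0) + f(s_max)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(i * h)
    return 4.0 * math.pi * total * h / 3.0


def peak_height_closed_form(b, s_max):
    """Same integral via the error function; valid only for b > 0."""
    a = b / 4.0
    integral = (math.sqrt(math.pi) / (4.0 * a ** 1.5) * math.erf(math.sqrt(a) * s_max)
                - s_max / (2.0 * a) * math.exp(-a * s_max ** 2))
    return 4.0 * math.pi * integral


def sample_sigd(alpha, beta, b0, rng):
    """Draw one B value: the shift plus the reciprocal of a gamma variate with rate beta."""
    return b0 + 1.0 / rng.gammavariate(alpha, 1.0 / beta)


def simulated_peak_histogram(alpha, beta, b0, s_max, n_atoms=2000, n_bins=20, seed=0):
    """Histogram of peak heights, binned relative to the B = 0 peak height."""
    rng = random.Random(seed)
    top = 4.0 * math.pi * s_max ** 3 / 3.0  # peak height of an atom with B = 0
    counts = [0] * n_bins
    for _ in range(n_atoms):
        p = peak_height(sample_sigd(alpha, beta, b0, rng), s_max)
        # Peaks above `top` (possible only for negative B) fall into the last bin.
        counts[min(int(p / top * n_bins), n_bins - 1)] += 1
    return counts


# Example: assumed SIGD parameters at 2 A resolution (s_max = 0.5 per A).
histogram = simulated_peak_histogram(alpha=3.0, beta=60.0, b0=5.0, s_max=0.5)
```

Binning relative to the B = 0 peak makes histograms at different resolutions comparable on a common 0 to 1 scale. This normalisation is a design choice of the sketch and is not taken from the text.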
We considered the 89 862 entries from the PDB, as of December 2016, for which the experimental method was X-ray crystallography. For further analysis, we used only the models for which the high-resolution diffraction limit is between 1.5 and 3 Å. To avoid dealing with noncrystallographic symmetry constraints, the use of which is not always clear from the PDB entry, we removed virus structures. Of the remaining models, we were able to refine 46 952 automatically using the refinement program REFMAC5 (Kovalevskiy et al., 2018) available from CCP4 (Winn et al., 2011). Reasons for refinement failure include (i) the ligand present in the PDB entry was not in the CCP4 monomer library at the time of refinement (this was the most common case), (ii) no structure-factor amplitudes were available and (iii) space-group inconsistencies between the PDB file and the reflection-data file. The remaining crystal structures contained roughly 160 000 chains, among which there were 145 800 protein chains. We also excluded cases with R factors higher than 35%. We used these crystal structures and corresponding chains for further analysis. Table 1 gives a short summary of the selection of PDB entries.
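The selection criteria above can be summarised as a simple filter. The sketch below is illustrative only: the `Entry` record, its field names and the example identifiers are hypothetical. It also assumes that the resolution limits are inclusive and that the R factor is the value obtained after automatic refinement.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical summary record for one PDB entry."""
    entry_id: str
    method: str
    resolution: float  # high-resolution limit in angstroms
    is_virus: bool
    r_factor: float    # fraction, assumed to come from re-refinement


def keep_entry(entry):
    """Apply the selection criteria described in the text to one entry."""
    return (entry.method == "X-RAY DIFFRACTION"
            and 1.5 <= entry.resolution <= 3.0   # inclusive limits are an assumption
            and not entry.is_virus
            and entry.r_factor <= 0.35)


def select_entries(entries):
    """Return only the entries that pass every criterion."""
    return [e for e in entries if keep_entry(e)]


# Placeholder identifiers, not real PDB codes.
example = [
    Entry("E001", "X-RAY DIFFRACTION", 2.0, False, 0.21),
    Entry("E002", "X-RAY DIFFRACTION", 1.2, False, 0.15),   # resolution too high
    Entry("E003", "ELECTRON MICROSCOPY", 2.8, False, 0.25), # not a crystal structure
    Entry("E004", "X-RAY DIFFRACTION", 2.5, True, 0.24),    # virus structure
    Entry("E005", "X-RAY DIFFRACTION", 2.9, False, 0.40),   # R factor above 35%
]
selected_ids = [e.entry_id for e in select_entries(example)]
```

In the workflow described in the text, the R-factor cutoff is applied only after refinement succeeds, so a real pipeline would run this check as a second pass.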
We have demonstrated that there is a need to model as well as to validate atomic ADPs, and that for many macromolecular structures the SIGD can be used to model the distribution of ADPs. Even if the B-value distribution over the whole structure does not obey the SIGD, the individual chains/domains will obey this distribution. The distributions of B values for different chains/domains can differ for at least two reasons: (i) different domains/subunits have different contacts depending on the environment and (ii) there are disordered and/or mismodelled regions that have naturally higher B values, reflecting errors in the positional parameters. Such multimodality affects the density and therefore the interpretability of the maps. Future work will include the refinement of multimodality parameters (the number of classes and the parameters of the SIGD for each class) using techniques such as the expectation-maximization algorithm.