Research Article: On the application of the expected log-likelihood gain to decision making in molecular replacement

Date Published: April 01, 2018

Publisher: International Union of Crystallography

Author(s): Robert D. Oeffner, Pavel V. Afonine, Claudia Millán, Massimo Sammito, Isabel Usón, Randy J. Read, Airlie J. McCoy.


The expected log-likelihood gain can be used to predict the outcome of molecular replacement and optimize molecular-replacement strategies.

Partial Text

Solving the phase problem by molecular replacement is a problem of signal to noise: the signal for the correct placement of the model must be found amongst the noise of incorrect placements. The signal of a placement is indicated by its translation-function Z-score (TFZ), which is the number of standard deviations over the mean (Z-score) for the log-likelihood gain on intensity (LLGI) in the translation function (TF). The most sensitive function for scoring the placements is a maximum-likelihood function based on the Rice distribution (the LLGI). For a single acentric reflection, the LLGI depends on EC, the normalized structure-factor amplitude calculated from the placed model; on σA, the fraction of the calculated structure factor that is correlated with the observed structure factor; and on Ee (the ‘effective E’) and Dobs, which are derived nontrivially from the observed intensity and its standard deviation (Iobs and σ(Iobs), respectively) as described in detail in Read & McCoy (2016).
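As a minimal sketch of the two quantities defined above, the Python below evaluates an LLGI for one acentric reflection and a Z-score for the best translation-function score. The LLGI expression is an assumption: it is the logarithm of a Rice-type probability of Ee given EC, in the style of Read & McCoy (2016), divided by the Wilson probability of the same Ee. The function names, the power series used for the Bessel function I0 and the use of a population standard deviation over all sampled placements are illustrative choices and should not be read as Phaser's implementation.

```python
import math
import statistics


def bessel_i0(x, tol=1e-16, max_terms=500):
    """Modified Bessel function I0(x) from its power series (adequate for moderate x)."""
    term = 1.0
    total = 1.0
    q = (x * x) / 4.0
    for k in range(1, max_terms):
        term *= q / (k * k)          # builds (x^2/4)^k / (k!)^2 incrementally
        total += term
        if term < tol * total:
            break
    return total


def llgi_acentric(e_e, e_c, sigma_a, d_obs=1.0):
    """LLGI for one acentric reflection (assumed Rice/Wilson likelihood-ratio form)."""
    w = (d_obs * sigma_a) ** 2       # (Dobs * sigmaA)^2
    v = 1.0 - w                      # variance term of the Rice distribution
    arg = 2.0 * e_e * d_obs * sigma_a * e_c / v
    return -math.log(v) - w * (e_e ** 2 + e_c ** 2) / v + math.log(bessel_i0(arg))


def tfz(scores):
    """Z-score of the best score relative to the mean and spread of all scores."""
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)
    return (max(scores) - mean) / spread
```

With σA = 0 the model carries no information and the sketch returns an LLGI of zero, whereas a strong reflection that the model predicts well (for example Ee = EC = 2 with σA = 0.9) gives a positive gain.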

The applications discussed below are implemented in Phaser from version 2.8 onwards. Phaser is distributed through the CCP4 (Winn et al., 2011) and PHENIX (Adams et al., 2010) software suites. The functionality associated with the eLLG is available from the MR_AUTO, MR_ELLG and PRUNE modes of Phaser, either from the command line or from the Python interface (see the Phaser documentation; McCoy et al., 2009). All functionality can be imported into Python via Boost.Python (Abrahams & Grosse-Kunstleve, 2003). Details of the implementation of each eLLG-based functionality are given in the relevant section below.

If the eLLG for placing a model in the asymmetric unit is well over the solved-LLG, then structure solution is likely to be straightforward, with high signal to noise and an unambiguous solution.

The eLLG calculation accounts for the trade-off between fm (the fraction of the total scattering accounted for by the model) and Δm (the r.m.s. coordinate error of the model), in which small, accurate models may give a higher eLLG than larger, less accurate models. Searching for models in order of decreasing eLLG should optimize the path to structure solution.
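The trade-off can be illustrated with a rough numerical sketch. It assumes a Gaussian coordinate-error model, σA² = fm · exp[−(4π²/3)(Δm/d)²], with no solvent term and Dobs = 1, and it assumes that each acentric reflection contributes about σA⁴/2 to the eLLG. Both forms are simplifications stated here for illustration. The reflection list, the two candidate models and all function names are hypothetical.

```python
import math


def sigma_a_squared(fm, delta_m, d):
    """Assumed form: sigmaA^2 = fm * exp(-(4*pi^2/3) * (delta_m/d)^2)."""
    return fm * math.exp(-(4.0 * math.pi ** 2 / 3.0) * (delta_m / d) ** 2)


def ellg(fm, delta_m, d_spacings):
    """Assumed approximation: each acentric reflection adds sigmaA^4 / 2."""
    return sum(sigma_a_squared(fm, delta_m, d) ** 2 / 2.0 for d in d_spacings)


def synthetic_d_spacings(n_reflections, d_min):
    """Hypothetical data set: reflections spread evenly through reciprocal space."""
    s_max = 1.0 / d_min
    return [1.0 / (s_max * ((i + 0.5) / n_reflections) ** (1.0 / 3.0))
            for i in range(n_reflections)]


def rank_models(models, d_spacings):
    """Return (name, eLLG) pairs sorted by decreasing eLLG."""
    scored = [(name, ellg(fm, dm, d_spacings)) for name, (fm, dm) in models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


d_spacings = synthetic_d_spacings(n_reflections=20000, d_min=2.0)
models = {
    "small_accurate": (0.20, 0.5),    # (fm, delta_m in Angstrom), made-up values
    "large_inaccurate": (0.50, 1.5),
}
ranked = rank_models(models, d_spacings)
for name, value in ranked:
    print(f"{name}: eLLG ~ {value:.1f}")
```

In this made-up case the model covering 20% of the scattering with a 0.5 Å error scores higher than the model covering 50% with a 1.5 Å error, so it would be searched for first.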

At low resolution, where σA is low owing to errors in modelling solvent and there are fewer reflections in each resolution shell, the eLLG rises slowly as the resolution of the data increases (Fig. 2). At resolutions where d ≫ Δm each reflection contributes a similar amount to the eLLG, which therefore rises more rapidly with increasing d* (Fig. 2). At higher resolutions, the contribution to the eLLG from each reflection again drops, and reflections added at resolutions d < 1.8 × Δm do not increase the eLLG significantly (Fig. 2). An effective eLLG limit is reached asymptotically, with the limit reached in any given case determined by the estimated Δm. This is as expected: the structure-factor contributions from the model are almost uncorrelated with those from the true structure when the Bragg spacing is much less than Δm. For reference, 1.8 × Δm is called the Δm-limited resolution.

Fragment-based molecular replacement for proteins has its origins in the solution of helical proteins by placing short polyalanine helices (Glykos & Kokkinidis, 2003; Rodríguez et al., 2009). A similar method was developed for RNA, using canonical RNA structure motifs to build full solutions (Robertson et al., 2010). Much recent work has focused on the generation of more general structural fragments, including those from distant homologues (ARCIMBOLDO_SHREDDER; Sammito et al., 2014; Millán et al., 2018), libraries of structural motifs (ARCIMBOLDO_BORGES; Sammito et al., 2013) or molecular modelling (AMPLE; Bibby et al., 2012). These methods rely on the generation of small but extremely accurate (low coordinate error) fragments, followed by expansion of the placed fragments using aggressive density-modification and model-building methods, such as those implemented in SHELXE (Sheldrick, 2010).

A single atom is a perfect partial model (Δm = 0). For such a model, σA² ∝ fm and hence LLGI ∝ fm².
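Two of the statements above, the plateau at the Δm-limited resolution and the fm² scaling for a single atom, can be checked with a short sketch. The assumptions are loud ones: an idealized P1 cell with about 2πV s² ds unique reflections per thin shell of reciprocal space, σA² = fm · exp[−(4π²/3)(Δm s)²] with s = 1/d, no solvent term, Dobs = 1 and a contribution of σA⁴/2 per reflection. Because the solvent term is omitted, the sketch reproduces only the high-resolution plateau and not the slow rise at low resolution. All names and numerical values are hypothetical.

```python
import math


def ellg_to_resolution(fm, delta_m, d_min, cell_volume, n_shells=2000):
    """Cumulative eLLG to d_min for an idealized P1 cell (no solvent term, Dobs = 1)."""
    s_max = 1.0 / d_min
    ds = s_max / n_shells
    total = 0.0
    for i in range(n_shells):
        s = (i + 0.5) * ds                       # shell midpoint in 1/Angstrom
        n_refl = 2.0 * math.pi * cell_volume * s * s * ds
        sigma_a_sq = fm * math.exp(-(4.0 * math.pi ** 2 / 3.0) * (delta_m * s) ** 2)
        total += n_refl * sigma_a_sq ** 2 / 2.0  # sigmaA^4 / 2 per reflection
    return total


def single_atom_ellg(n_atoms, reflections_per_atom):
    """eLLG for one atom among n_atoms equal atoms (delta_m = 0, so sigmaA^2 = fm)."""
    fm = 1.0 / n_atoms
    n_refl = reflections_per_atom * n_atoms      # reflection count scales with cell volume
    return n_refl * fm ** 2 / 2.0


delta_m = 1.0                                    # hypothetical r.m.s. error in Angstrom
at_limit = ellg_to_resolution(0.5, delta_m, 1.8 * delta_m, cell_volume=1.0e5)
beyond = ellg_to_resolution(0.5, delta_m, 0.9 * delta_m, cell_volume=1.0e5)
fraction_at_limit = at_limit / beyond
print(f"fraction of eLLG reached at 1.8 x delta_m: {fraction_at_limit:.4f}")
```

Under these assumptions more than 99% of the eLLG available at twice the resolution is already accumulated at d = 1.8 × Δm, and doubling the number of equal atoms at fixed resolution halves the single-atom eLLG.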
Molecular replacement with a single atom, when the structure is large and fm is small, requires many reflections because, as the number of ordered atoms in the asymmetric unit increases, the LLGI per reflection decreases (∝ fm²) faster than the number of reflections increases with the proportionally larger unit-cell volume (∝ 1/fm). More reflections may come from higher resolution data or from a larger unit cell with the same number of scattering centres (higher solvent content). Since fm also depends on the scattering curve, atoms of the same element type but with lower B factors will be found with a higher LLGI than those with high B factors. The element-specific form factors also matter: in proteins, S atoms scatter proportionately more at higher resolution than C, N and O atoms. This effect, however, can be negated by a B factor raised by as little as 2 Å² above the Wilson B factor (Wilson, 1942). Se atoms in selenomethionine-incorporated proteins are poorer targets for single-atom molecular replacement than their atomic number (Z = 34) suggests, since selenomethionine residues often display high mobility or disorder (Dauter & Dauter, 1999).

Editing of structures from the Protein Data Bank prior to molecular replacement is a well established method for improving the signal, and often makes the difference between success and failure (Schwarzenbacher et al., 2004; Bunkóczi & Read, 2011; Bunkóczi et al., 2015). Editing methods range from simple truncation of side chains in the model (polyalanine or polyserine), through the selective removal of atoms based on side-chain substitution, removal of loops and altering B factors, to full molecular modelling. At the end of molecular replacement, model editing usually occurs as one of the first steps in structure refinement.

Twinning reduces the LLGI, and so a correction term should, in principle, be applied to the eLLG.
The reduction in the eLLG was studied for hemihedral and tetartohedral crystal twinning, which are particular cases of (pseudo)merohedral twinning in which the number of twinned domains is two and four, respectively. The BETA–BLIP structure (Strynadka et al., 1996), which has previously been used as a test case for Phaser (Storoni et al., 2004; McCoy et al., 2005; McCoy, 2007), was used to generate simulated data with different hemihedral twin fractions, and the LLGI was calculated for the structure given the simulated data (Fig. 8a). The relationship between the LLGI and the twin fraction is approximately linear for hemihedral twinning, so that a twin fraction of one half gives an LLGI half that obtained with untwinned data. A higher order twinning test was performed with the structure of human complement factor I (PDB entry 2xrc), which has P1 symmetry and tetartohedral twinning. For perfect tetartohedral twinning the LLGI was reduced by a factor of four (Fig. 8b).

Experienced users of Phaser may wish to see a solution with LLGI ≫ 64 and TFZ ≫ 8 to increase the certainty that the solution is correct. While an LLGI > 64 and a TFZ > 8 have been proven to be significant, a target-eLLG of 225, equivalent to TFZ = 15, was found to optimize the time to structure solution. It is likely that the preference of the experienced user for LLGI ≫ 64 and TFZ ≫ 8 is partly informed by their experience of the time taken for structure solution, rather than the outcome.

To give the user additional information about the certainty of a solution after automated molecular replacement with Phaser, a ‘TFZ-equivalent’ is calculated. This is the TFZ that would have been obtained if the refined position had been found (i.e. located exactly on the search grid) in a translation function performed with the model in the refined orientation, using all data.
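The twinning results and the paired thresholds quoted above lend themselves to a small arithmetic sketch. It assumes that the approximately linear hemihedral relationship can be written as a factor of (1 − twin fraction), which reproduces the halving at a twin fraction of one half, and that perfect twinning with n equal domains divides the eLLG by n, which is supported in the text only for n = 2 and n = 4. The rule TFZ ≈ √LLG is an inference from the pairs 64/8 and 225/15. None of these functions is Phaser's correction term, and all names are hypothetical.

```python
import math


def hemihedral_ellg(ellg_untwinned, twin_fraction):
    """Linear reduction: factor 1 with no twinning, 1/2 at a twin fraction of 0.5."""
    if not 0.0 <= twin_fraction <= 0.5:
        raise ValueError("hemihedral twin fraction must lie in [0, 0.5]")
    return ellg_untwinned * (1.0 - twin_fraction)


def perfect_twin_ellg(ellg_untwinned, n_domains):
    """Perfect twinning with n equal domains: 2 (hemihedral) or 4 (tetartohedral)."""
    return ellg_untwinned / n_domains


def approx_tfz(llg):
    """Assumed rule of thumb TFZ ~ sqrt(LLG), matching the pairs 64/8 and 225/15."""
    return math.sqrt(llg)


# A hypothetical model with an untwinned eLLG of 225 (the target-eLLG above).
untwinned = 225.0
print(approx_tfz(untwinned))                               # untwinned data
print(approx_tfz(hemihedral_ellg(untwinned, 0.5)))         # perfect hemihedral twin
print(approx_tfz(perfect_twin_ellg(untwinned, 4)))         # perfect tetartohedral twin
```

Read this way, a model that just reaches the target-eLLG with untwinned data would fall well short of it with perfectly twinned data, which is why a correction to the eLLG matters when planning the search.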



