Date Published: March 01, 2016
Publisher: International Union of Crystallography
Author(s): Randy J. Read, Airlie J. McCoy.
A new Rice-function approximation for the effect of intensity-measurement errors improves the treatment of weak intensity data in calculating log-likelihood-gain scores in crystallographic applications including experimental phasing, molecular replacement and refinement.
For macromolecular crystallography, maximum-likelihood functions are required in order to account for the large model errors that are present during phasing. In this way, macromolecular crystallography differs from small-molecule crystallography, where the model errors are small and the most widely used and successful program for refinement, SHELXL (Sheldrick, 2015), uses a least-squares (intensity) target. Because data errors are generally smaller than model errors, they have not been the focus of the development of macromolecular likelihood functions, but recent advances have raised the importance of dealing properly with both large model and large data errors. Most prominently, it has been demonstrated that useful information can be extracted from very weak diffraction data (Ling et al., 1998; Karplus & Diederichs, 2012). This has coincided with the uptake of photon-counting area detectors for macromolecular crystallography, on which data are frequently integrated beyond traditional resolution limits [for example, beyond the resolution at which the merged I/σ(I) falls below 2]. Lastly, structure solution is increasingly being attempted with pathologies such as twinning, high anisotropy and translational NCS (Read et al., 2013). In the last two of these cases, weak data with high error cannot be excluded because they form an essential part of the analysis.
As described above, the deficiencies in the current treatments of experimental errors are numerous and varied. However, it is clear that working directly with intensities avoids the problems associated with conversion to amplitudes and has the advantage of keeping the target function closer to the actual observations. This is the strength of the MLI target. On the other hand, given the utility of the multivariate complex normal distribution (relating phased structure factors) in deriving crystallographic likelihood targets (Read, 2001, 2003; McCoy et al., 2004), there are significant advantages in an approach that approximates intensity errors in some way as complex structure-factor errors, thus yielding targets based on Rice functions. Combining the strengths of the MLI target with the strengths of a target based on the Rice function would be ideal.
To obtain a Rice-function-based LLG target that uses Ee and Dobs to represent the intensity measurement and its experimental error, what is needed is the probability of Ee given the calculated structure factor EC. We can obtain this by first constructing a joint probability distribution, in the form of a multivariate complex normal distribution, involving the phased structure factors Ee and EC, as well as the unknown true structure factor E as a dummy variable. For normalized structure factors, the covariance matrix is a correlation matrix with ones along the diagonal. The off-diagonal elements involving the true E are σA (for EC) and Dobs (for Ee). For two random variables that differ in independent ways from a common variable, the correlation term is the product of their individual correlations to the common variable, so the remaining off-diagonal element, relating Ee and EC, is the product Dobs σA. This can be seen in the correlation matrix presented in (15), in which a superscript asterisk indicates the complex conjugate.
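The construction described in this paragraph can be illustrated with a short Python sketch. It is not taken from Phaser: the function names and the ordering (E, Ee, EC) of the matrix rows are assumptions made here for illustration. It builds the correlation matrix from σA and Dobs, then uses the standard result for a pair of unit-variance complex normal variables with correlation ρ: the conditional distribution of one given the other has mean ρ times the given value and variance 1 − ρ².

```python
def correlation_matrix(d_obs, sigma_a):
    """Correlation matrix for the normalized structure factors.

    Row/column order (E, Ee, EC) is an assumption of this sketch.
    Diagonal terms are 1; E-Ee is Dobs; E-EC is sigma_A; and the
    Ee-EC term is the product of the two, as stated in the text.
    """
    rho = d_obs * sigma_a
    return [
        [1.0, d_obs, sigma_a],
        [d_obs, 1.0, rho],
        [sigma_a, rho, 1.0],
    ]


def conditional_ee_given_ec(d_obs, sigma_a, e_calc):
    """Mean and variance of Ee given EC after integrating out the true E.

    With E removed, only the Ee-EC correlation rho = Dobs * sigma_A
    matters, so Ee | EC is complex normal with mean rho * EC and
    variance 1 - rho**2.
    """
    rho = d_obs * sigma_a
    return rho * e_calc, 1.0 - rho * rho


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the functions.
    for row in correlation_matrix(0.8, 0.5):
        print(row)
    print(conditional_ee_given_ec(0.8, 0.5, 2.0 + 0.0j))
```

Under these assumptions, Dobs simply scales σA: a poorly measured reflection (Dobs near 0) leaves Ee almost uncorrelated with EC, whereas a well measured one (Dobs near 1) recovers the usual σA-based treatment.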
Starting from observed diffraction data, there are a number of steps that must be carried out to use the new log-likelihood-gain intensity targets. When adapting programs that already use Rice-function likelihood targets, much of the underlying machinery can be preserved. The following discusses the changes that have been introduced in Phaser (McCoy et al., 2007) to use intensity data for molecular-replacement calculations.
Separate tests have been carried out to determine how well the Rice-function approximations for measurement error alone represent the exact probability distributions, and how well the LLGI target approximates the exact LLG.
In essence, the LLGI function for accounting for experimental errors in log-likelihood-gain target functions starts by finding values for two parameters, the effective E value (Ee) and Dobs, which can stay constant throughout a phasing or refinement calculation. Ee serves the role of the observed normalized amplitude and, when the σA values characterizing the effects of model error are multiplied by Dobs, the resulting Rice LLGI function provides an excellent approximation to a true LLG that could only be evaluated by numerical integration. Even though LLGI is cast in terms of a function that (for the acentric case) implies complex errors, it is developed as an approximation to a log-likelihood gain based on the MLI target. As a result, the underlying statistical model is shared with the MLI target.
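As an illustration of how Ee and Dobs enter the target, the following Python sketch evaluates an acentric Rice-function log-likelihood gain in which σA is replaced by Dobs σA. It is a minimal sketch rather than the Phaser implementation: the function names are hypothetical, the reference (null) distribution is assumed to be the acentric Wilson distribution of Ee, and the Bessel function is evaluated with a simple power series that is adequate only for moderate arguments.

```python
import math


def log_bessel_i0(x):
    """log of the modified Bessel function I0(x), by power series.

    I0(x) = sum_k (x^2 / 4)^k / (k!)^2. This is a simple stand-in for
    illustration; it overflows for very large x, where a scaled Bessel
    routine would be needed instead.
    """
    q = 0.25 * x * x
    term = 1.0
    total = 1.0
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= q / (k * k)
        total += term
    return math.log(total)


def llgi_acentric(e_eff, d_obs, sigma_a, e_calc):
    """Acentric Rice log-likelihood gain with sigma_A scaled by Dobs.

    log[ Rice(Ee; EC) / Wilson(Ee) ], where
      Rice   = 2 Ee / v * exp(-(Ee^2 + rho^2 EC^2) / v) * I0(2 Ee rho EC / v)
      Wilson = 2 Ee * exp(-Ee^2)
    with rho = Dobs * sigma_A and v = 1 - rho^2.
    """
    rho = d_obs * sigma_a
    v = 1.0 - rho * rho
    x = 2.0 * e_eff * rho * e_calc / v
    return (-math.log(v)
            - rho * rho * (e_eff ** 2 + e_calc ** 2) / v
            + log_bessel_i0(x))


if __name__ == "__main__":
    # Hypothetical reflection: the same Ee and EC, with well and poorly
    # measured intensities (Dobs = 1.0 and Dobs = 0.5).
    print(llgi_acentric(2.0, 1.0, 0.9, 2.0))
    print(llgi_acentric(2.0, 0.5, 0.9, 2.0))
```

In this sketch, a reflection with Dobs = 0 or a model with σA = 0 contributes exactly zero to the gain, and lowering Dobs reduces the contribution of a reflection whose Ee agrees with EC. This matches the description above, in which Ee and Dobs are fixed once per reflection and only σA and EC change during the calculation.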
LLGI has been implemented and tested in Phaser. Releases from v.2.5.7 will accept intensities in preference to amplitudes for molecular replacement, and a future version will accept intensities in preference to amplitudes for SAD phasing. Please refer to the documentation (http://www.phaser.cimr.cam.ac.uk) for details.