Research Article: Using more than 801 296 small-molecule crystal structures to aid in protein structure refinement and analysis

Date Published: March 01, 2017

Publisher: International Union of Crystallography

Author(s): Jason C. Cole, Ilenia Giangreco, Colin R. Groom.


A guide to how the Cambridge Structural Database can be used to aid macromolecular crystallography.

Partial Text

The Cambridge Structural Database (CSD; Groom et al., 2016) is a carefully curated collection of more than 800 000 structures of organic and metal–organic compounds provided by the Cambridge Crystallographic Data Centre (CCDC). The CSD has proven to be an invaluable resource for chemistry since its creation in 1965, and is heavily used in pharmaceutical research and development as well as in academic research. It is to the physical sciences what the Protein Data Bank (PDB; Berman et al., 2003) is to the life sciences, but the CSD is also well used to further protein crystallographic methods. Indeed, the paper by Engh and Huber describing parametrizations for macromolecular refinement (Engh & Huber, 1991) opens with the sentence ‘Bond-length and bond-angle parameters are derived from a statistical survey of X-ray structures of small compounds from the Cambridge Structural Database’.

The CCDC provides a web service for researchers to access any individual structure, enabling them to view and download the enhanced data sets deposited with the CCDC. This service is freely available to anyone in the world. In addition, the CSD-System software contains a number of programs that are relevant to protein crystallography. This software is used daily by structural chemists in well over a thousand institutions, but is also of tremendous value to structural biologists.

The individual molecules in a crystal lattice interact with their nearest neighbours. This allows one to generate statistical propensity values for the likelihood of any particular atom being in a certain position with respect to any other atom. Such propensities can be displayed graphically and compared with the observed interactions in a protein–ligand complex, either to validate the fit of a ligand or to guide chemical synthesis attempts to improve affinity. The range of questions that such tools can answer is very wide: for example, ‘I propose a fit whereby an F atom appears to act as a hydrogen-bond acceptor, is this likely?’, ‘my molecule contains a thiazole ring, would the nitrogen of an oxazole ring form similar interactions with the sulfur?’ and ‘my binding site contains a tryptophan residue, which functional groups are commonly seen to interact with these?’. To answer all such questions is far beyond the remit of this article; we will therefore restrict ourselves to one fairly typical question: ‘my ligand contains an oxazole ring close to the hydroxyl group of a serine side chain. It is possible for me to fit this ligand such that either the nitrogen or the oxygen of the oxazole ring is closest to this serine, as I cannot distinguish between an N and an O atom in my electron-density maps. Which orientation is most likely?’ This question is trivial to answer and requires little computational or chemical expertise. A simple search of the CSD identifies over 8000 oxazole fragments, 52 of which hydrogen-bond to a hydroxyl group, all via the nitrogen (Fig. 1). One would therefore need extremely strong evidence to position an oxazole ring in the alternative orientation.
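The weight of this evidence can be put into rough numbers. The following is a minimal sketch, assuming the 52 hydrogen-bonded oxazole fragments can be treated as independent observations (a simplification, since CSD entries are not a random sample). It computes an exact one-sided upper confidence bound on the fraction of contacts that could go via the ring oxygen, given that none were observed. The function name and the 95% confidence level are illustrative choices and are not part of any CSD software.

```python
def upper_bound_zero_events(n_observations, confidence=0.95):
    """Upper bound on an event's true frequency when it was seen 0 times in n trials.

    Solves (1 - p)**n = 1 - confidence for p (exact binomial bound).
    """
    if n_observations <= 0:
        raise ValueError("n_observations must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_observations)


# 52 oxazole-to-hydroxyl hydrogen bonds found in the search, none via the ring oxygen.
bound = upper_bound_zero_events(52)
print(f"O-acceptor fraction is below {bound:.1%} at 95% confidence")
```

Under these assumptions the bound comes out at a little under 6%, which is one way of expressing why strong experimental evidence would be needed to model the oxygen-facing orientation.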

The CSD is also a library of molecular geometries and conformations. These are captured in the knowledge base Mogul. The most common application of Mogul in macromolecular crystallography is to check whether the geometry of a ligand modelled into electron density is plausible. Our indication of ‘plausibility’ is whether such geometry is common amongst the structures in the CSD. The coordinates of a modelled ligand can be loaded into the program Mercury and a geometry check performed. The ligand is automatically fragmented to match fragments for which distributions have been pre-calculated from relevant structures in the CSD. The approach taken considers both the chemical identity of a specific fragment and any nearby atoms in order to identify the most relevant distributions. A Mogul analysis for KIT kinase (PDB entry 4hvs; Zhang et al., 2013) is shown in Fig. 3.
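The idea of judging plausibility against a distribution can be illustrated with a short sketch. This is neither Mogul's algorithm nor the CSD Python API: it assumes a hypothetical list of reference bond lengths for a single fragment type (the values are invented for illustration) and flags a modelled value whose z-score exceeds an arbitrary threshold of 2.

```python
from statistics import mean, stdev


def z_score(observed, reference_values):
    """Number of sample standard deviations between `observed` and the reference mean."""
    if len(reference_values) < 2:
        raise ValueError("need at least two reference values")
    spread = stdev(reference_values)
    if spread == 0:
        raise ValueError("reference values have zero spread")
    return (observed - mean(reference_values)) / spread


def is_unusual(observed, reference_values, threshold=2.0):
    """True when the observed value lies more than `threshold` deviations from the mean."""
    return abs(z_score(observed, reference_values)) > threshold


# Hypothetical reference distribution for one bond type, in angstroms (illustrative only).
reference_lengths = [1.52, 1.53, 1.54, 1.53, 1.52, 1.54, 1.53]

for modelled in (1.53, 1.60):
    label = "unusual" if is_unusual(modelled, reference_lengths) else "typical"
    print(f"{modelled:.2f} A: {label}")
```

A single z-score is only meaningful for roughly unimodal distributions; torsion angles, which often have several preferred values, would need a different comparison.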

The sections above describe how one can use specific information derived from the CSD to understand and perhaps optimize the geometry and interactions of a bound ligand. This can be of enormous help in the interpretation of electron density; however, more automated approaches are possible. The protein–ligand docking program GOLD, which is part of the CSD-Enterprise system available to all academic researchers, combines an understanding of both molecular geometry and interactions to generate plausible docking modes for ligands in protein structures (Fig. 4). These docking modes provide ‘electron-density naïve’ views of how a ligand might bind to a protein, giving the crystallographer unbiased alternatives for consideration when interpreting electron density. Docking methods such as this rely on relatively simple atom–atom scoring functions, which balance summed interaction scores against considerations of ligand geometry. Of particular help is the ability to visualize the individual atomic contributions to these scores. Atoms scoring poorly may be contributing little to the binding of a ligand, or may be indicative of a misinterpretation of the electron density. Again, one would need compelling electron density to propose a fit of a ligand where its functional groups did not coincide with where docking software suggested they were most likely to be. Of course, where such data exist, differences between docking predictions and experimental observations provide a wealth of ideas for the synthetic chemist.
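To make the notion of per-atom contributions concrete, here is a toy sketch of a summed atom–atom score broken down by ligand atom. It is not GOLD's scoring function: the piecewise term, its distance cut-offs, the atom names and the coordinates are all hypothetical, and the ligand-geometry part of a real scoring function is left out entirely.

```python
from math import dist


def pair_score(distance, clash_below=3.0, contact_up_to=4.5):
    """Toy atom-atom term: penalise clashes, reward close contacts, ignore distant pairs."""
    if distance < clash_below:
        return -(clash_below - distance)
    if distance <= contact_up_to:
        return 1.0
    return 0.0


def per_atom_scores(ligand_atoms, protein_atoms):
    """Map each ligand atom name to its summed score against all protein atoms."""
    return {
        name: sum(pair_score(dist(xyz, p)) for p in protein_atoms)
        for name, xyz in ligand_atoms.items()
    }


# Hypothetical coordinates in angstroms, for illustration only.
ligand = {"N1": (0.0, 0.0, 0.0), "C2": (10.0, 0.0, 0.0)}
protein = [(3.5, 0.0, 0.0), (0.0, 4.0, 0.0)]

scores = per_atom_scores(ligand, protein)
for name, score in scores.items():
    print(name, score)
print("total", sum(scores.values()))
```

In this toy case the atom far from the protein contributes nothing to the total, which is the kind of poorly scoring atom that would prompt a second look at the fit.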

As we have mentioned, Engh & Huber (1991) generated bond-length and bond-angle parameters for use in refinement of macromolecular crystal structures using information from the CSD. At the time of their work, the CSD contained around 100 000 entries; we now have over 800 000 structures available, allowing us to improve our restraints. As noted by Touw & Vriend (2010), the original Engh & Huber parameters overgeneralize, a known and inevitable consequence of the data available at the time. They illustrate this by showing how the τ angle at the α-carbon in a peptide chain is dependent on the broader environment of the peptide.
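For reference, τ is the N–Cα–C bond angle, and it can be measured directly from atomic coordinates. The following is a minimal sketch using only the Python standard library; the function name and the example coordinates are hypothetical and chosen purely for illustration.

```python
from math import acos, degrees, sqrt


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with b at the vertex."""
    v1 = [ai - bi for ai, bi in zip(a, b)]
    v2 = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = sqrt(sum(x * x for x in v1))
    norm2 = sqrt(sum(x * x for x in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("coincident atoms: angle is undefined")
    # Clamp to guard against rounding slightly outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return degrees(acos(cosine))


# Hypothetical backbone coordinates (angstroms) for N, CA and C of one residue.
n_atom = (1.458, 0.0, 0.0)
ca_atom = (0.0, 0.0, 0.0)
c_atom = (-0.551, 1.420, 0.0)

tau = bond_angle(n_atom, ca_atom, c_atom)
print(f"tau = {tau:.1f} degrees")
```

Collecting τ in this way over many structures, grouped by residue type or local conformation, is how an environment dependence of the kind described above would show up.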

We generated our examples using both the graphical user interfaces provided within CSD-Enterprise and the CSD-System Python application programming interface (API; CCDC, 2015). The API both allows script-based access to the functionality and provides a way to integrate this functionality into software commonly used by macromolecular crystallographers.



