Research Article: Validation and extraction of molecular-geometry information from small-molecule databases

Date Published: February 01, 2017

Publisher: International Union of Crystallography

Author(s): Fei Long, Robert A. Nicholls, Paul Emsley, Saulius Gražulis, Andrius Merkys, Antanas Vaitkus, Garib N. Murshudov.


The entries from a freely available small-molecule database, the Crystallography Open Database, have been validated and a reliable subset of molecules has been selected for the extraction of molecular-geometry information. The atom types and corresponding bond and angle classes derived from this database have been subjected to validation, the results of which are used by AceDRG in the derivation of new ligand descriptions.

Partial Text

Small-molecule databases such as the Cambridge Structural Database (CSD; Groom et al., 2016) and the Crystallography Open Database (COD; Gražulis et al., 2012) are a rich source of information that can be used for various purposes, including the extraction of molecular-geometry information and its use for the generation of new ligand descriptions (Engh & Huber, 1991; Parkinson et al., 1996; Bruno et al., 2004; Long et al., 2017; Moriarty et al., 2009; Emsley et al., 2010). However, the entries in these databases have been generated by experimental techniques and the coordinates are models describing these experiments. The influence of human factors affecting the reliability of derived atomic models should not be ignored. As is the case for any models derived from experimental observations, these models are prone to errors and include misinterpretations. There might be a multitude of reasons for errors or misinterpretations in the entries of these databases, including deficiencies in the experimental data (e.g. systematic and random errors during data acquisition), the software used, mismodelling of the experiment and, finally, plain cheating. The entries from these databases must be validated using criteria that are as strict as possible and selected with extreme care, as the derived data are likely to be used by many structural biologists for the refinement of macromolecular structures. The resulting macromolecular coordinates are deposited in the PDB (Berman et al., 2002) and are further used by the wider community of biologists. In many cases, the coordinates from the PDB are considered to be accurate and serve as observations for further classifications and analyses. If extreme care is not exercised and inaccurate molecular-geometry information is used in deriving coordinates, then the errors can persist and may affect future results and conclusions. Subsequently, the validation and cleaning up of such errors might become even more challenging than it is now. This puts an additional responsibility on software developers and, in particular, designers of molecular-geometry databases.

Entries from the COD and derived data are subjected to four main stages of validation before they are accepted for further use. These are as follows.
(i) Validation of the database of small molecules. This is performed when the data are deposited in the COD. Gražulis et al. (2009) have described this step in detail, and thus we mention this stage here only briefly.
(ii) AceDRG validation of coordinate CIF files and generated molecules.
(iii) AceDRG validation of derived atom types, bonds and angles.
(iv) Statistical validation of the data produced by AceDRG. This step is performed iteratively together with revisions of the atom-typing protocol. The results of validation are fed back to AceDRG to help with the fine-tuning of atom types.

As described by Long et al. (2017), the atom-type classes extracted by AceDRG from a small-molecule database (e.g. the COD) are used to classify bonds and angles. These classes are used to generate molecular-geometry information for new ligands. The validity of the tables, and thus the molecular-geometry information generated for a particular ligand, depends on the reliability of the crystal structures used to originally derive these tables, as well as on the suitability of the atom types. Although the number of atom-type classes derived by AceDRG, which encapsulate the local chemical environments of atoms, is large (more than 260 000, see Table 1), not all of chemical structure space will be covered.
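The idea of classifying bonds by the atom types at each end can be illustrated with a minimal Python sketch. It is not AceDRG's implementation: the atom-type labels ("C_a", "O_b"), the function names and the bond lengths are hypothetical placeholders chosen only to show how a table of bond classes might be built and queried, including the case in which a class is absent because that region of chemical space is not covered.

```python
from collections import defaultdict
from statistics import mean, stdev


def bond_key(type_a, type_b):
    """Order-independent key for a bond between two atom-type labels."""
    return tuple(sorted((type_a, type_b)))


def build_bond_table(observations):
    """Group observed bond lengths (angstroms) by atom-type pair.

    `observations` is an iterable of (type_a, type_b, length) tuples.
    The labels are illustrative placeholders, not real AceDRG atom types.
    """
    grouped = defaultdict(list)
    for type_a, type_b, length in observations:
        grouped[bond_key(type_a, type_b)].append(length)

    table = {}
    for key, lengths in grouped.items():
        # A standard deviation needs at least two observations.
        sd = stdev(lengths) if len(lengths) > 1 else None
        table[key] = {"n": len(lengths), "mean": mean(lengths), "sd": sd}
    return table


def lookup(table, type_a, type_b):
    """Return the bond class entry, or None if the class is not covered."""
    return table.get(bond_key(type_a, type_b))


# Made-up observations, for illustration only.
observations = [
    ("C_a", "O_b", 1.43),
    ("O_b", "C_a", 1.41),
    ("C_a", "C_a", 1.53),
]
table = build_bond_table(observations)
```

A lookup that returns None corresponds to a ligand atom pair with no matching class, in which case a real tool would need some fallback to a less specific class.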

The statistical methods described in the previous section were applied to the AceDRG-derived bond observation and bond class tables. In addition to cleaning up the bond classes, the results were fed back to fine-tune the atom-type classes. There were two types of feedback.
(i) The selection of bonds and angles used for the bond table classes was modified, i.e. stricter criteria were used by AceDRG to select the bonds to be used for mean and standard deviation calculations.
(ii) The findings were used to fine-tune atom classes and bond classes.
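The effect of applying stricter selection criteria before computing a mean and standard deviation can be sketched as follows. The statistical methods actually used are not reproduced in this excerpt, so this example substitutes a generic robust filter (a modified z-score based on the median and the median absolute deviation); the cutoff of 3.5, the function names and the bond lengths are assumptions made for illustration only.

```python
from statistics import mean, median, stdev


def robust_filter(values, cutoff=3.5):
    """Keep values whose modified z-score is within `cutoff`.

    Uses the median and the median absolute deviation (MAD), which are
    far less sensitive to outliers than the mean and standard deviation.
    This is a generic technique, assumed here for illustration.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No usable spread estimate; keep everything rather than guess.
        return list(values)
    return [v for v in values if abs(0.6745 * (v - med) / mad) <= cutoff]


def class_stats(values, cutoff=3.5):
    """Mean and standard deviation of a bond class after outlier removal."""
    kept = robust_filter(values, cutoff)
    sd = stdev(kept) if len(kept) > 1 else None
    return mean(kept), sd, len(kept)


# Made-up bond lengths (angstroms) with one implausible observation.
lengths = [1.52, 1.53, 1.53, 1.54, 1.53, 1.52, 1.54, 1.90]
filtered_mean, filtered_sd, n_kept = class_stats(lengths)
```

With these made-up values, the single 1.90 observation is rejected, so it no longer inflates the standard deviation of the class.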

Further validation was performed on the AceDRG output dictionaries. They underwent wholesale testing in three ways. The first was to check that the atom types obeyed the internal rules. An independent RDKit-based program was written using the Coot libraries to generate atom types and compare them with the canonical AceDRG implementation. Where differences were found, the AceDRG atom typing was updated so that full consistency was achieved for the 3000+ non-hydrogen extended organic set atom types.
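The consistency check between two atom-typing implementations amounts to comparing two mappings from atoms to type labels. The sketch below shows only that comparison step, under the assumption that each implementation has already produced a dictionary keyed by atom identifier. It does not call RDKit, Coot or AceDRG, and the atom identifiers and type labels are hypothetical.

```python
def compare_atom_types(reference, candidate):
    """Return atoms whose type labels disagree between two typings.

    Both arguments map an atom identifier to a type label. An atom
    missing from one side is reported with None for that side.
    """
    mismatches = {}
    for atom_id in sorted(set(reference) | set(candidate)):
        ref_type = reference.get(atom_id)
        cand_type = candidate.get(atom_id)
        if ref_type != cand_type:
            mismatches[atom_id] = (ref_type, cand_type)
    return mismatches


# Hypothetical outputs of two independent typing programs.
reference_types = {"C1": "type_A", "C2": "type_B", "O1": "type_C"}
candidate_types = {"C1": "type_A", "C2": "type_X", "N1": "type_D"}

differences = compare_atom_types(reference_types, candidate_types)
```

An empty result would indicate full agreement for that molecule; any entry in the result points to an atom whose typing rule needs to be examined in one implementation or the other.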

Repositories of experimentally derived crystal structures such as the COD and the CSD are valuable sources, allowing the general structural properties of small compounds, including molecular-geometry information, to be studied and organized. However, since researchers of varying experience derive these structures with the help of software, it can be expected that these sources contain some erroneous or unreliable data. Moreover, the purpose of small-molecule research in general is very different from that of structural biology, and thus we can expect that many ligands of interest in structural biology are poorly represented in these databases. For example, these databases contain many organometallic compounds of varying complexity, whereas only 7% of compounds from the PDB’s Chemical Component Dictionary contain metals. The selection of relevant crystal structures from these repositories is a challenging problem that this contribution attempts to deal with.



