Date Published: February 01, 2017
Publisher: International Union of Crystallography
Author(s): Fei Long, Robert A. Nicholls, Paul Emsley, Saulius Gražulis, Andrius Merkys, Antanas Vaitkus, Garib N. Murshudov.
The program AceDRG generates accurate stereochemical descriptions, and one or more conformations, of a given ligand. The program also analyses entries and extracts local environment-dependent atom types, bonds and angles from the Crystallography Open Database.
Macromolecular crystallography (MX) is the most widely used experimental technique in structural biology. It allows the study of three-dimensional structures of macromolecules in atomic, and sometimes electronic, detail, which is an essential step in understanding biological processes. In recent years, single-particle cryo-EM has made substantial advances (Kühlbrandt, 2014) and is now used routinely. Both techniques allow the derivation of snapshots of reactions or molecular binding processes. For this type of study, a structure of a single molecule is often not sufficient; it is more common to study structures of macromolecules in complex with small ligands mimicking intermediate states or close to a transition state. Moreover, the quality and quantity of the experimental data are often deficient (low resolution with small signal-to-noise ratio). This means that the data alone are not sufficient to derive chemically and structurally sensible atomic models; the data must be supplemented by prior knowledge pertaining to the chemistry and structure of the molecules under study in order to address the problem of missing high-resolution information (Murshudov et al., 2011; Nicholls et al., 2012; Schröder et al., 2010; Adams et al., 2010; Smart et al., 2012).
Experimental data produced by MX and cryo-EM usually contain long-range information. As the resolution of the data increases, shorter and shorter-range information becomes available. However, owing to the mobility of atoms and dynamic/static disorder, electronic details are not visible even at very high resolution: the signal is weakened and the local resolution is therefore reduced. Additional information is almost always needed. The most widely used information is that regarding the chemistry of bonds and angles (Vagin et al., 2004). This was recognized a long time ago, and has been used to stabilize atomic structure refinement when only limited and noisy data are available.
For amino acids and nucleic acids the ‘ideal’ values have been tabulated a number of times (Engh & Huber, 1991, 2001; Parkinson et al., 1996). There are several good software tools designed for the derivation of accurate values for the bonds and angles in small molecules (Moriarty et al., 2009; Smart et al., 2011; Schüttelkopf & van Aalten, 2004). These are based on molecular-mechanics force fields, on Mogul (Bruno et al., 2004), which draws on the Cambridge Structural Database (CSD), or on semi-empirical quantum-chemical (QM) calculations (Rocha et al., 2006). Programs such as LIBCHECK (Vagin et al., 2004) and JLigand (Lebedev et al., 2012), available from CCP4 (Winn et al., 2011), can also be used to generate ligand descriptions of sufficient quality.
AceDRG is a multifunctional software tool that analyses molecules in small-molecule databases (currently only the COD), extracts all atom types, bond lengths and angles from those databases, and organizes them in a hierarchical manner. It reads an input file containing basic chemical information about a ligand, such as a bonding graph and stereochemistry. It derives atom types from the bonding graph and maps them to those extracted from the small-molecule database. It can also generate one or more coordinate sets corresponding to energetically favourable conformation(s) of ligands.
The atom typing used in AceDRG encapsulates the local topological and chemical environments of atoms. This includes the atom’s number of bonds and those of its neighbours (up to the third neighbours) and, if they belong to ring(s), information regarding ring size and aromaticity. The current algorithm only considers the extended organic set of atoms: B, C, N, O, S, P, Se, F, Cl, Br, I and H. These atoms cover 93% of the chemical entities contained in the PDB. Dealing with metals requires a different approach; they will be dealt with in the future.
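The idea of an environment-dependent atom type can be illustrated with a minimal sketch. The label format below (element, number of bonds, an optional `r<size>` ring suffix, and a sorted list of neighbour labels) is a hypothetical one chosen for readability; it is not AceDRG's actual type string, and aromaticity is left out. The function names and the `depth` parameter are likewise assumptions made for this example.

```python
from collections import deque


def smallest_ring_size(adj, atom):
    """Size of the smallest ring containing `atom`, or 0 if it is in no ring."""
    best = 0
    for nbr in adj[atom]:
        # Breadth-first search from nbr back to atom, ignoring the direct bond.
        dist = {nbr: 1}
        queue = deque([nbr])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt == atom:
                    if cur == nbr:
                        continue  # this is the direct bond itself
                    size = dist[cur] + 1
                    best = size if best == 0 else min(best, size)
                    queue.clear()  # first hit is the shortest path for this nbr
                    break
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
    return best


def atom_type(elements, adj, atom, depth=2):
    """Build a hypothetical type label from the environment up to `depth` bonds away."""

    def describe(a):
        ring = smallest_ring_size(adj, a)
        return f"{elements[a]}{len(adj[a])}" + (f"r{ring}" if ring else "")

    def shell(a, parent, level):
        label = describe(a)
        if level == 0:
            return label
        # Sorting makes the label independent of atom ordering in the input.
        children = sorted(shell(n, a, level - 1) for n in adj[a] if n != parent)
        return label + ("(" + ",".join(children) + ")" if children else "")

    return shell(atom, None, depth)


# Ethanol with explicit hydrogens: CH3 (0), CH2 (1), O (2), H atoms (3-8).
elements = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
adj = {0: [1, 3, 4, 5], 1: [0, 2, 6, 7], 2: [1, 8],
       3: [0], 4: [0], 5: [0], 6: [1], 7: [1], 8: [2]}

print(atom_type(elements, adj, 2, depth=1))  # hydroxyl oxygen
print(atom_type(elements, adj, 0, depth=1))  # methyl carbon
print(atom_type(elements, adj, 1, depth=1))  # methylene carbon
```

The two carbon atoms have the same element and the same number of bonds, yet they receive different labels because their neighbours differ. This is the property that lets bond and angle statistics be collected separately for each local environment.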
Once all atom types have been identified and classified, AceDRG creates and organizes tables pertaining to bonds and angles. Since the number of potentially different atom types is infinite, a pair of atom types occurring in a given compound, as defined above, may be absent from the bond tables. Therefore, we need well organized tables of atom types, bonds and angles that allow fast and efficient searching for exact atom types as well as fast generalization, if and when needed.
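A lookup that first tries the exact atom types and then generalizes can be sketched as follows. This is an illustration under stated assumptions, not AceDRG's table layout: atom types are modelled as hypothetical `(element, connections, neighbours)` tuples, generalization simply drops trailing fields, the `min_count` threshold is invented for the example, and the bond lengths are placeholder numbers rather than values taken from the COD.

```python
import statistics
from collections import defaultdict
from typing import NamedTuple


class BondStat(NamedTuple):
    mean: float
    sigma: float
    count: int


def bond_key(type_a, type_b, level):
    """Order-independent key; level 0 is the full type, higher levels are coarser."""
    keep = 3 - level  # 3 fields, then (element, connections), then (element,)
    return tuple(sorted((type_a[:keep], type_b[:keep])))


def build_tables(observations, levels=3):
    """Aggregate observed bond lengths at every level of generality."""
    raw = [defaultdict(list) for _ in range(levels)]
    for type_a, type_b, length in observations:
        for level in range(levels):
            raw[level][bond_key(type_a, type_b, level)].append(length)
    return [
        {key: BondStat(statistics.fmean(v), statistics.pstdev(v), len(v))
         for key, v in level_data.items()}
        for level_data in raw
    ]


def lookup_bond(tables, type_a, type_b, min_count=2):
    """Return (level, BondStat) from the most specific table with enough data."""
    for level, table in enumerate(tables):
        stat = table.get(bond_key(type_a, type_b, level))
        if stat is not None and stat.count >= min_count:
            return level, stat
    return None


# Hypothetical atom types and placeholder bond lengths (in angstroms).
methyl_c = ("C", 4, "C,H,H,H")
hydroxyl_o = ("O", 2, "C,H")
ether_o = ("O", 2, "C,C")
unseen_o = ("O", 2, "C,P")  # environment absent from the observations

observations = [
    (methyl_c, hydroxyl_o, 1.42),
    (methyl_c, hydroxyl_o, 1.44),
    (methyl_c, ether_o, 1.41),
]
tables = build_tables(observations)

exact = lookup_bond(tables, methyl_c, hydroxyl_o)    # found at level 0
fallback = lookup_bond(tables, methyl_c, unseen_o)   # generalized to level 1
missing = lookup_bond(tables, methyl_c, ("N", 3, "C,H,H"))  # no data at any level
```

Keeping one dictionary per level makes both the exact search and the fallback a constant-time key lookup, at the cost of storing each observation several times.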
Fig. 5 shows a flow chart describing the derivation of stereochemical information and coordinate set(s) using basic chemistry as input. The workflow is relatively simple and comprises four steps.
(i) Read the input file, which contains bonding information. At this step, either AceDRG directly (mmCIF format) or RDKit is used to read the files and organize minimal information about atoms and bonds. If mmCIF is used as an input file then AceDRG checks whether the file contains a SMILES string. If it does, then RDKit is used for conformer generation. However, when a SMILES string is used the atom names are lost. AceDRG uses an exact graph isomorphism algorithm to match the atom names generated by RDKit to those in the input mmCIF file, ensuring that the atom names are retained. If the input mmCIF file does not include a SMILES string then AceDRG converts this file to an SDF MOL file (Dalby et al., 1992) and feeds it to RDKit to generate the initial conformation. The current version of AceDRG accepts CCD mmCIF, SMILES string, SDF MOL (Dalby et al., 1992) and SYBYL MOL2 (Clark et al., 1989) file formats. RDKit is used for the interpretation of SMILES, SDF MOL and MOL2 files.
(ii) In the second step, initial models are generated and the molecule is sanitized using both RDKit and AceDRG functionality. The chemistry of the molecule is verified, ensuring that it conforms to basic chemical rules. In addition, information regarding functional groups and pH is used to protonate or deprotonate functional groups such as carboxyl groups, phosphates and sulfate groups. If explicit H atoms are defined in the SMILES string, AceDRG retains these H atoms.
(iii) At this stage, atom types are generated for each atom in the initial model. The AceDRG tables are then consulted to find the corresponding ‘ideal’ bond and angle values. Plane groups and chiral centres are also added and an initial mmCIF dictionary file is created.
(iv) Finally, the coordinates corresponding to the initial conformations from step (i) are optimized using the idealization mode of REFMAC, together with the initial mmCIF dictionary file just generated. The optimized coordinates are then added to the output mmCIF dictionary file. In its default mode, AceDRG generates 20 different conformations and then idealizes them before selecting the best one according to REFMAC5 geometry information. The final output is an mmCIF dictionary file and a PDB file containing the coordinates.
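The atom-name recovery in step (i) can be illustrated with a small sketch. The text states only that an exact graph isomorphism algorithm is used, so the plain backtracking search below is a stand-in chosen for clarity, not necessarily the algorithm implemented in AceDRG. The function names, the atom names and the example molecule are all hypothetical, and bond orders are ignored.

```python
def adjacency(elements, bonds):
    """Neighbour sets for every atom; `bonds` is a list of unique atom pairs."""
    adj = {atom: set() for atom in elements}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def match_atoms(elems_a, bonds_a, elems_b, bonds_b):
    """Map each atom of graph B onto an atom of graph A, or return None.

    With equal atom and bond counts, a one-to-one mapping that sends every
    bond of B onto a bond of A is an isomorphism between the two graphs.
    """
    if len(elems_a) != len(elems_b) or len(bonds_a) != len(bonds_b):
        return None
    adj_a = adjacency(elems_a, bonds_a)
    adj_b = adjacency(elems_b, bonds_b)
    # Placing highly connected atoms first prunes the search earlier.
    order = sorted(elems_b, key=lambda b: -len(adj_b[b]))
    mapping, used = {}, set()

    def compatible(b, a):
        if elems_a[a] != elems_b[b] or len(adj_a[a]) != len(adj_b[b]):
            return False
        # Every already-mapped neighbour of b must land on a neighbour of a.
        return all(mapping[n] in adj_a[a] for n in adj_b[b] if n in mapping)

    def extend(i):
        if i == len(order):
            return True
        b = order[i]
        for a in elems_a:
            if a not in used and compatible(b, a):
                mapping[b] = a
                used.add(a)
                if extend(i + 1):
                    return True
                del mapping[b]
                used.discard(a)
        return False

    return dict(mapping) if extend(0) else None


# Graph A: named heavy atoms of ethanolamine, as they might appear in an input file.
named_elems = {"N1": "N", "C1": "C", "C2": "C", "O1": "O"}
named_bonds = [("N1", "C1"), ("C1", "C2"), ("C2", "O1")]

# Graph B: the same molecule with index-only atoms, in the order of SMILES "OCCN".
index_elems = {0: "O", 1: "C", 2: "C", 3: "N"}
index_bonds = [(0, 1), (1, 2), (2, 3)]

names = match_atoms(named_elems, named_bonds, index_elems, index_bonds)
print(names)
```

When a molecule contains topologically equivalent atoms, such as the H atoms of a methyl group, more than one valid mapping exists and this sketch returns the first one it finds.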
Here, we use two examples from the PDB to demonstrate AceDRG-generated dictionary values in practice. In general, the bond lengths and angles generated by AceDRG seem to be reasonably accurate (Tucker & Steiner, 2017). The first example concerns the effect of dictionary accuracy on refinement. The bond values generated by AceDRG are more accurate, so the refined structure should in principle be better in terms of chemical structure. However, the differences between structures refined using different dictionary values are so small that they are barely visible by eye and are unlikely to lead to incorrect biological conclusions. The second example demonstrates the importance of aromaticity perception, and how it may affect inferred biological conclusions.
The program AceDRG has been designed to extract and organize atom types from small-molecule databases. The current version uses the freely available COD, although the algorithms and implementations are flexible, and any source of reliable small-molecule coordinate sets can be used to supplement/update/replace the relevant tables.