Date Published: August 01, 2011
Publisher: International Union of Crystallography
Author(s): Ian J. Bruno, Gregory P. Shields, Robin Taylor.
An improved algorithm has been written for assigning chemical structures to incoming entries to the Cambridge Structural Database.
For 45 years the Cambridge Structural Database (CSD; Allen, 2002) has been maintained by the Cambridge Crystallographic Data Centre (CCDC) as the definitive collection of small-molecule organic and metallo-organic crystal structures. Throughout this time, the CCDC has had the following core aspirations: (i) that the database should afford comprehensive coverage of published crystal structures in its area of remit; (ii) that it should achieve high standards of accuracy; (iii) that it should be accompanied by effective search software.
The focus of this paper is on the important issue of assigning the correct chemical structure (bond types, formal charges etc.) to each incoming entry. This task is obviously germane to the second aspiration but also to the third, since most searches of the CSD are substructure searches which cannot give accurate results unless chemical structures are assigned reliably. If every incoming structure to the CSD were accompanied by an accurate, machine-readable chemical diagram provided by the authors, the problem of structure assignment would be largely solved. However, sadly this is far from being the case, nor is there any indication that it will become so in the immediate future.
Apart from the obvious bond types (including aromatic), the CSD also makes use of quadruple bonds for some metal–metal linkages, pi bonds for polyhapto-bound metal ligands, and delocalized bonds. Delocalized bonds are used for systems such as bidentate acetylacetonato and have the advantage over representations using alternate single and double bonds that they correctly reflect local symmetry. Some metal–metal bonds have non-integral bond orders that cannot be represented in the CSD at present. Recently, quintuple bonds have been reported in some chromium dimers (e.g. Nguyen et al., 2005) and the possibility of even higher order bonds has been discussed (Radius & Breher, 2006). These bond types are not currently allowed in the CSD, although there should be little difficulty in adding them. There is no mechanism in the CSD for indicating a radical, which makes it impossible to show the bonding accurately in, for example, structures involving semiquinone anion radicals.
The detection of bonds and symmetry expansion is based on the Unique Molecule Program (Allen et al., 1974). However, instead of using elemental radii, an upper distance limit for each element–element pair was employed, allowing finer control of bond-distance limits. For many element pairs, the starting values were the sum of the CCDC covalent radius values (Cambridge Crystallographic Data Centre, 2011) plus a tolerance of 0.45 Å. Values for bonds between s-block and p-block elements (e.g. Na—O) were based on the s-block radii of Kerr (2002) plus a tolerance of 0.40 Å. A utility program was written for comparing the connectivity calculated with these distances with the connectivity in the CSD. For each element pair, the program produces a list of the lengths of bonds that are (a) present in the new connectivity but not in the CSD, and (b) present in the CSD but not in the new connectivity. This program was used to validate the distance limits for a subset of ca 32 000 entries. Where there were many discrepancies for an element pair, values were manually optimized by inspection of histograms of bonding and non-bonding distances in the CSD and the validation was repeated. In a number of cases (e.g. Ag—Ag bonds, Fig. 9) there is substantial overlap between bonding and non-bonding distance distributions, reflecting differing opinions of the authors of the original publications.
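The pairwise-limit idea and the connectivity comparison can be illustrated with a minimal Python sketch. It is not the CCDC implementation: the names `PAIR_LIMITS`, `RADII`, `detect_bonds` and `diff_connectivity` are hypothetical, the numerical limits are placeholders rather than the optimized CSD values, and the sketch assumes Cartesian coordinates in Å with symmetry expansion already done. For brevity the comparison returns atom pairs rather than the per-element-pair lists of bond lengths described above.

```python
from itertools import combinations
from math import dist

# Placeholder covalent radii (Å), used only when no pair-specific limit exists.
RADII = {"C": 0.68, "O": 0.68, "H": 0.23}
TOLERANCE = 0.45  # Å, the tolerance quoted in the text for most element pairs

# Hypothetical pair-specific upper distance limits (Å), keyed by sorted element pair.
PAIR_LIMITS = {("C", "O"): 1.70}


def pair_limit(el_a, el_b):
    """Upper bonding distance for an element pair; falls back to radii sum + tolerance."""
    key = tuple(sorted((el_a, el_b)))
    if key in PAIR_LIMITS:
        return PAIR_LIMITS[key]
    return RADII[el_a] + RADII[el_b] + TOLERANCE


def detect_bonds(atoms):
    """atoms: list of (label, element, (x, y, z)). Returns a set of frozenset label pairs."""
    bonds = set()
    for (label_a, el_a, xyz_a), (label_b, el_b, xyz_b) in combinations(atoms, 2):
        if dist(xyz_a, xyz_b) <= pair_limit(el_a, el_b):
            bonds.add(frozenset((label_a, label_b)))
    return bonds


def diff_connectivity(new_bonds, reference_bonds):
    """Return (bonds only in the new connectivity, bonds only in the reference)."""
    return new_bonds - reference_bonds, reference_bonds - new_bonds


# Usage: a formaldehyde-like fragment compared with a reference connectivity.
atoms = [
    ("C1", "C", (0.0, 0.0, 0.0)),
    ("O1", "O", (1.2, 0.0, 0.0)),
    ("H1", "H", (-0.55, 0.94, 0.0)),
    ("H2", "H", (-0.55, -0.94, 0.0)),
]
new_bonds = detect_bonds(atoms)
only_new, only_reference = diff_connectivity(
    new_bonds, {frozenset(("C1", "O1")), frozenset(("O1", "H1"))}
)
```

Keeping the limits in a table keyed by element pair means that a single problematic pair can be tuned without changing the radii that every other pair depends on.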
Disorder assembly and group information may be given explicitly in the CIF using the _atom_site_disorder_assembly and _atom_site_disorder_group data items. Alternatively, it can sometimes be deduced from site occupancies (_atom_site_occupancy). We have developed improved algorithms for resolving disorder, making use of all these data items.
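As a rough illustration of how these three data items might be combined, the sketch below separates atoms with explicit disorder flags from atoms whose partial occupancy merely hints at disorder. It is not the algorithm developed in this work. The function name `split_disorder` is hypothetical, and the sketch assumes the CIF has already been parsed into one dictionary per atom site, with occupancies converted to plain floats (no standard uncertainties in parentheses).

```python
from collections import defaultdict

# CIF placeholders for "inapplicable" and "unknown", plus a missing key.
UNSET = {".", "?", None}


def split_disorder(atom_sites):
    """Return (explicit, suspected).

    explicit:  {(assembly, group): [atom labels]} taken from the CIF disorder items.
    suspected: labels of atoms with no explicit group but occupancy below 1.
    """
    explicit = defaultdict(list)
    suspected = []
    for site in atom_sites:
        label = site["_atom_site_label"]
        group = site.get("_atom_site_disorder_group")
        if group not in UNSET:
            assembly = site.get("_atom_site_disorder_assembly")
            if assembly in UNSET:
                assembly = "."  # group given without an assembly
            explicit[(assembly, group)].append(label)
        elif site.get("_atom_site_occupancy", 1.0) < 1.0:
            # Only a hint: partial occupancy may also mean a partially occupied site.
            suspected.append(label)
    return dict(explicit), suspected


# Usage with hand-written atom-site records.
sites = [
    {"_atom_site_label": "C1", "_atom_site_occupancy": 1.0,
     "_atom_site_disorder_assembly": ".", "_atom_site_disorder_group": "."},
    {"_atom_site_label": "C2A", "_atom_site_occupancy": 0.6,
     "_atom_site_disorder_assembly": "A", "_atom_site_disorder_group": "1"},
    {"_atom_site_label": "C2B", "_atom_site_occupancy": 0.4,
     "_atom_site_disorder_assembly": "A", "_atom_site_disorder_group": "2"},
    {"_atom_site_label": "O1W", "_atom_site_occupancy": 0.5,
     "_atom_site_disorder_assembly": ".", "_atom_site_disorder_group": "."},
]
explicit, suspected = split_disorder(sites)
```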
Each assigned structure is given a reliability score which can take the values 0, 1, 2 or 3, larger values indicating greater reliability. The assessment procedure is rule-based and was developed empirically. Table 2 lists the dependence of the score on the various assessment criteria used. These fall into two categories. Some, such as the presence of a metal atom, are not in themselves indicative of error but are features known to make structure assignment difficult and therefore less reliable. Others, such as a non-planar double bond, are directly suggestive of possible error. In addition to the score, warning messages about suspect features are reported. Table 3 shows an example for a structure assignment of relatively low reliability. Reports such as this often indicate clearly the points of error in the assigned structure.
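The general shape of such a rule-based assessment might look like the sketch below. The two criteria are the examples named in the text, but the penalty values, the function name `assess` and the warning wording are purely hypothetical; the actual dependence of the score on each criterion is the one given in Table 2, which is not reproduced here.

```python
# Hypothetical penalties; the real dependence of the score is given in Table 2.
METAL_PENALTY = 1       # feature that makes assignment harder, not an error in itself
NONPLANAR_PENALTY = 2   # feature directly suggestive of a possible error


def assess(has_metal, nonplanar_double_bonds):
    """Return (score, warnings), with the score clamped to the range 0..3."""
    score = 3
    warnings = []
    if has_metal:
        score -= METAL_PENALTY
        warnings.append("structure contains a metal atom")
    if nonplanar_double_bonds > 0:
        score -= NONPLANAR_PENALTY
        warnings.append(f"{nonplanar_double_bonds} non-planar double bond(s)")
    return max(score, 0), warnings


# Usage: a metal complex with one suspicious double bond.
score, warnings = assess(has_metal=True, nonplanar_double_bonds=1)
```

Returning the warnings alongside the score mirrors the reporting described above, where the messages point an editor to the suspect features rather than just flagging the entry.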
This section describes illustrative results based on the CSD entries discussed in §1 (Figs. 1–7), starting with GEBXOA. This is assigned with a triple bond between the Ru atoms. The actual bond order from electron counting is 2.5 (Chakravarty et al., 1986). Metal–metal multiple bonds are often assigned correctly, although it is also common for the assigned bond order to be out by 1 in either direction. Missing H atoms in GEBXOA are inferred correctly.
The algorithm was validated on a random sample of 1777 structures with CSD accession dates falling in May 2009. None of the structures was used in developing the algorithm or contributed to its underlying data files. The CIFs received by the CCDC were used as input and the resulting structure assignments were compared with those in the CSD, all of which were created by the normal CCDC editing process. Each algorithmically produced assignment was categorized as identical, acceptable or incorrect. Identical assignments were those for which there was an exact match (bonds, bond types, atom charges, inferred missing H atoms, and, where relevant, polymer unit) with the corresponding CSD assignment for all molecules and ions in the asymmetric unit, including disordered moieties.
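The test for an identical assignment can be sketched as a field-by-field comparison. This is an illustration only: the `Assignment` class and the `differences` function are hypothetical, and the sketch assumes both assignments use the same atom labels, so no graph matching is needed. The distinction between acceptable and incorrect assignments requires chemical judgement and is not modelled here.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Assignment:
    """Hypothetical container for one structure assignment."""
    bonds: frozenset                          # {(atom_a, atom_b, bond_type)}, atom_a < atom_b
    charges: frozenset                        # {(atom, formal_charge)} for charged atoms
    inferred_h: frozenset                     # {(parent_atom, number_of_added_H)}
    polymer_unit: Optional[frozenset] = None  # atoms in the polymer unit, if relevant


def differences(algorithmic, reference):
    """Names of the fields in which two assignments differ; empty means identical."""
    return [
        f.name
        for f in fields(Assignment)
        if getattr(algorithmic, f.name) != getattr(reference, f.name)
    ]


# Usage: the algorithm assigns a single C-O bond where the reference has a double bond.
reference = Assignment(
    bonds=frozenset({("C1", "O1", "double"), ("C1", "N1", "single")}),
    charges=frozenset(),
    inferred_h=frozenset({("N1", 2)}),
)
algorithmic = Assignment(
    bonds=frozenset({("C1", "O1", "single"), ("C1", "N1", "single")}),
    charges=frozenset(),
    inferred_h=frozenset({("N1", 2)}),
)
mismatched = differences(algorithmic, reference)
```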
We have described the structure-assignment algorithm used to help CCDC editors add new entries to the CSD. Effectively, the algorithm exploits the chemical information in the CSD to interpret and add value to the atomic coordinates obtained from the diffraction experiment. The algorithm has the potential for wider use as a tool for adding chemical knowledge to newly determined crystal structures, thereby increasing the degree to which high-throughput crystallography can be automated. It has also facilitated the release of entries as part of the CSD X-Press system. Entries in CSD X-Press have had chemistry assigned by the new structure-assignment algorithm and are accompanied by an automatically generated two-dimensional diagram, together with data items that are available in the CIF. When no compound name is present in a CIF, an attempt is made to automatically generate one based on the assigned chemistry using ACD/Name (Advanced Chemistry Development, Inc., 2010). Importantly, entries in CSD X-Press are given a star rating based on the reliability score produced by the structure-assignment algorithm. This provides users with an indication of the confidence they can have in the chemical assignment when deciding how to handle structures as part of a scientific study. CSD X-Press entries are made available through WebCSD (Thomas et al., 2010) where they are clearly highlighted as pending enhancement (e.g. resolution of any structure-assignment problems) by editorial staff before inclusion in the main CSD. The introduction of CSD X-Press allows earlier public release of structures that have value added to the data present in the original CIFs, primarily as a result of the new structure-assignment algorithm.