Research Article: Strategies for carbohydrate model building, refinement and validation

Date Published: February 01, 2017

Publisher: International Union of Crystallography

Author(s): Jon Agirre.


This article addresses many of the typical difficulties that a structural biologist may face when dealing with carbohydrates, with an emphasis on problem solving in the resolution range where X-ray crystallography and cryo-electron microscopy are expected to overlap in the next decade.

Partial Text

The author does not intend to rewrite fairytale canon, but to bridge the 15-year gap between the biotechnological breakthroughs highlighted in the now classic Science editorial (Hurtley et al., 2001 ▸) that the title of this section alludes to and the current state of the art in structural glycobiology. For the past 35 years and apparently conforming to some kind of law, carbohydrate-containing structure depositions, signified by a red line in Fig. 1 ▸, have steadily matched 10% of the annual total. However, the balance within this seemingly fixed percentage has strikingly changed in the past decade: glycosylation, which groups a number of post-translational and co-translational covalent modifications of proteins with sugars, has become increasingly frequent. N-glycosylation alone (blue line in Fig. 1 ▸), the most frequently reported type, has increased from 2.9% in 2000 to 5.5% of the total in 2013. While ligand carbohydrates continue to be the focus of many biotechnological and biomedical studies, it would seem that the contribution of glycosylation to eukaryotic protein folding, stability and function is progressively taking the spotlight. This is already having implications: while the number of ligand sugars per structure will usually be within one to a couple of dozen at most, heavily glycosylated structures are becoming more frequent and can contain over 100 monosaccharides each (see, for example, Agirre et al., 2016 ▸; Gudmundsson et al., 2016 ▸; Stewart-Jones et al., 2016 ▸), increasing the number of deposited monosaccharide models per year. Cryo-electron microscopy (cryo-EM), a structural technique that does not depend on the ordered packing of particles into crystals, is not vulnerable to the deleterious effects that external glycans may have (Pallesen et al., 2016 ▸), and thus is expected to contribute strongly to this trend in forthcoming years.

Sugars come in many stereochemistries, configurations, forms and conformations (for a concise introduction, see Bertozzi & Rabuka, 2009 ▸). In an enzyme-free reaction (usually catalysed by a dilute base or acid), they may interconvert from an open-chain form to a furanose cyclic form (five-membered saturated ring) or a pyranose cyclic form (six-membered saturated ring). These transitions depend on the stability of each form, and all forms can co-exist in solution, although conversion from the cyclic form to the open chain requires a free hemiacetal (if the sugar is an aldose) or a hemiketal (if the sugar is a ketose) group, i.e. that the sugar is not linked to another through C1 (C2 if the sugar is a ketose). Stereochemistry defines the sugar, and particular attention must be paid to two key conventions: absolute configuration and anomeric configuration. The absolute configuration of a monosaccharide, identified by a small capital d or l, is denoted by the configuration of the stereocentre furthest away from the anomeric C atom (usually referred to as the configurational atom; see Fig. 2 ▸, substituent in magenta colour; in the open-chain form right indicates dextro and left indicates laevo; in the cyclic structures up indicates dextro and down indicates laevo). With every cyclization, a choice of anomeric configuration is made based on the stereochemical relationship of the resulting hydroxyl group with respect to the anomeric reference atom, which will be the configurational atom except in some special cases (e.g. sialic acids), where multiple configurational prefixes are indicated. These configurations, termed anomers, are denoted as α (different stereochemistry at both stereocentres) or β (the same stereochemistry), typically involving comparison of the position (up, down) of the C1 hydroxyl group (C2 in ketoses) with that of the C atom linked to C5 (C6 in ketoses) for the most common monosaccharides. 
The interconversion between two anomeric forms is called mutarotation and is illustrated in Fig. 2 ▸, which has been annotated with the proportions determined experimentally for d-fructose (a ketose) by Flood et al. (1996 ▸). These proportions can help us to understand how stable each form is. The different anomeric configurations affect this stability, as the torsional strain around the link from the anomeric centre to the adjacent C atom will differ. In order to minimize such strain, the conformation of the substituents when viewed across such a link should be staggered (i.e. the substituents of one C atom are interleaved with those of the other C atom) rather than eclipsed, which would lead to van der Waals (vdW) repulsion. As mutarotation requires the sugar to pass through the open-chain form, only those monosaccharides that are either free or at the reducing end (see below) of a polysaccharide will be able to interconvert between anomeric forms.
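The link between equilibrium proportions and stability can be made quantitative with the Boltzmann relation ΔG = RT ln(p_ref / p_i). The sketch below is a minimal illustration of that relation only: the function name and the population values are hypothetical placeholders, not the proportions reported by Flood et al. (1996 ▸), and room temperature is assumed.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal mol^-1 K^-1


def relative_free_energies(populations, temperature_k=298.15):
    """Free energy of each form (kcal/mol) relative to the most populated one.

    Applies dG_i = RT * ln(p_ref / p_i), where p_ref is the largest fraction.
    """
    if any(p <= 0 for p in populations.values()):
        raise ValueError("populations must be strictly positive")
    reference = max(populations.values())
    rt = GAS_CONSTANT_KCAL * temperature_k
    return {form: rt * math.log(reference / p) for form, p in populations.items()}


# Hypothetical equilibrium fractions, for illustration only
# (NOT the experimental values of Flood et al., 1996).
example_populations = {
    "beta-pyranose": 0.70,
    "beta-furanose": 0.22,
    "alpha-furanose": 0.05,
    "alpha-pyranose": 0.02,
    "open-chain": 0.01,
}

if __name__ == "__main__":
    energies = relative_free_energies(example_populations)
    for form, dg in sorted(energies.items(), key=lambda item: item[1]):
        print(f"{form:15s} {dg:5.2f} kcal/mol")
```

Under these assumptions, a tenfold difference in population corresponds to roughly 1.4 kcal/mol at room temperature, which gives a sense of how small the energy gaps between coexisting forms can be.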

A number of co-translational and post-translational covalent modifications of protein residues with carbohydrates are categorized according to the glycosylation type. These modifications are not per se encoded in genomes, although the modified amino acids may conform to a sequence motif, but instead are fully dependent on the available glycosyltransferases and glycan-processing enzymes (Rini et al., 2009 ▸). Hence, the structural possibilities are limited and usually particular to the expression system used. Based on a genomic analysis, it has been estimated that more than 50% of human proteins are glycosylated (Apweiler et al., 1999 ▸).

All major macromolecular crystallographic refinement packages use a monomer dictionary library to organize prior chemical knowledge in the form of geometric restraints. Many of them have extended or have used at some point, with manual curation, the initial CCP4 monomer library of Vagin et al. (2004 ▸). This initiative produced, using LIBCHECK (Vagin et al., 2004 ▸) with irregular results (see below), geometric targets consistent with Engh & Huber (1991 ▸) from all of the entities (henceforth monomers) in the PDB Chemical Component Dictionary (PDBCCD) at that point in time. The PDBCCD is the place of reference for obtaining codes, names and chemical descriptions of the very building blocks that structural biology relies upon: monomers. These are stored in files containing a topological description of the monomer along with example Cartesian coordinates, extracted from a deposited experimental structure, and/or computationally idealized coordinates (Westbrook et al., 2015 ▸). Both sets are available from the PDBCCD in SDF format (Molecular Design Ltd,), and can be inspected with either PyMOL (v,8; Schrödinger) or UCSF Chimera (Pettersen et al., 2004 ▸). While the two sets of coordinates should always be representative and almost identical for simple monomers, discrepancies do occur. Calculating the minimal energy conformation for larger structures, for example polysaccharides, with many degrees of freedom can be very expensive in computational terms, and can fail to replicate what is found in nature. Monosaccharides, like other saturated rings, pose particular problems for minimization, thus the results need experimental validation, ideally with a high-resolution small-molecule structure. One such example is the PDBCCD entry IDS (2-O-sulfo-α-l-iduronic acid), an l-sugar, which includes a 1C4 chair conformer in the idealized coordinates (Fig. 7 ▸, top panel) and a high-energy 2SO skew-boat conformer that was determined by solution NMR (Mulloy et al., 1993 ▸) in the example coordinates (Fig. 7 ▸, middle panel). Furthermore, a different answer, a lowest-energy 4C1 chair, is obtained when generating a conformer from its SMILES string by sampling the torsional space of the monomer randomly with RDKit (Landrum, 2016 ▸) followed by energy minimization (Fig. 7 ▸, bottom panel). So the question for the user is ‘what is the most probable conformation to be used as starting coordinates?’. The 1C4 conformer has the large sulfate group in the less-preferred, steric clash-prone axial location, whereas the 2SO skew-boat conformer shows clear angle strain; the computed 4C1 chair conformer shows little strain and has most substituents, including the sulfate, in the preferred equatorial location. However, we know that the cyclization reaction locks l-sugars, at least initially, in a 1C4 conformation (Fig. 3 ▸, southern hemisphere in the Cremer–Pople diagram), and the sugar is not going to traverse any south-to-north conformational itinerary without enzymatic intervention, as the energetic penalty would exceed the final benefit, which would be in the region of 2 kcal mol−1 as estimated by RDKit, by an order of magnitude (Davies et al., 2012 ▸). Thus, sampling conformations in torsional space can help to find a global energy minimum, but one that might not be attainable in nature. Similarly, using a solution NMR structure as a model might prove an even worse choice, as this technique is able to capture snapshots of dynamic transitions and these are unlikely to be representative of crystalline molecule populations. To date all occurrences of IDS within PDB entries solved crystallographically at atomic resolution (better than 1.5 Å) have the ring in the 1C4 chair conformation. 
May this cautionary tale serve to highlight why including experimentally determined and manually curated small-molecule structures in monomer dictionaries (as the PDB is currently doing in collaboration with the Cambridge Crystallographic Data Centre; CCDC) is essential in many of the most debatable cases.
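To make the chair, skew-boat and hemisphere language above more concrete, the following is a minimal sketch of the Cremer–Pople puckering calculation for a six-membered ring, written from the published definition using only the Python standard library. It is not the implementation used by any of the programs cited here; the function names and the idealized test geometry are assumptions made for illustration. With the usual pyranose atom order (O5, C1, C2, C3, C4, C5), θ close to 0° is normally read as a 4C1 chair and θ close to 180° as a 1C4 chair, with boats and skew-boats near the equator (θ close to 90°).

```python
import math


def cremer_pople_six(ring):
    """Cremer-Pople puckering (Q, theta, phi) for a six-membered ring.

    `ring` is a list of six (x, y, z) tuples in ring order; for pyranoses the
    usual choice is O5, C1, C2, C3, C4, C5. Angles are returned in degrees.
    """
    n = len(ring)
    if n != 6:
        raise ValueError("expected exactly six ring atoms")
    centre = [sum(atom[k] for atom in ring) / n for k in range(3)]
    pos = [[atom[k] - centre[k] for k in range(3)] for atom in ring]

    # Mean-plane normal from the sine- and cosine-weighted position sums.
    r1 = [sum(p[k] * math.sin(2 * math.pi * j / n) for j, p in enumerate(pos))
          for k in range(3)]
    r2 = [sum(p[k] * math.cos(2 * math.pi * j / n) for j, p in enumerate(pos))
          for k in range(3)]
    normal = [r1[1] * r2[2] - r1[2] * r2[1],
              r1[2] * r2[0] - r1[0] * r2[2],
              r1[0] * r2[1] - r1[1] * r2[0]]
    length = math.sqrt(sum(c * c for c in normal))
    normal = [c / length for c in normal]

    # Out-of-plane displacement of each ring atom.
    z = [sum(p[k] * normal[k] for k in range(3)) for p in pos]

    a = math.sqrt(2 / n) * sum(zj * math.cos(4 * math.pi * j / n)
                               for j, zj in enumerate(z))
    b = -math.sqrt(2 / n) * sum(zj * math.sin(4 * math.pi * j / n)
                                for j, zj in enumerate(z))
    q2 = math.hypot(a, b)
    q3 = math.sqrt(1 / n) * sum((-1) ** j * zj for j, zj in enumerate(z))

    total_q = math.hypot(q2, q3)
    theta = math.degrees(math.atan2(q2, q3))
    phi = math.degrees(math.atan2(b, a)) % 360.0
    return total_q, theta, phi


def idealised_ring(offsets, radius=1.45):
    """Hypothetical test geometry: six atoms on a circle with given z offsets."""
    return [(radius * math.cos(math.pi * j / 3),
             radius * math.sin(math.pi * j / 3),
             dz) for j, dz in enumerate(offsets)]


if __name__ == "__main__":
    chair = idealised_ring([0.25, -0.25, 0.25, -0.25, 0.25, -0.25])
    boat = idealised_ring([0.4, -0.2, -0.2, 0.4, -0.2, -0.2])
    for name, ring in (("chair", chair), ("boat", boat)):
        q, theta, phi = cremer_pople_six(ring)
        print(f"{name}: Q = {q:.3f}, theta = {theta:.1f}, phi = {phi:.1f}")
```

For a real monosaccharide the six ring-atom coordinates would be taken from the model in the order given above; θ then indicates which hemisphere of the Cremer–Pople sphere the ring occupies, while φ is only meaningful away from the poles.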

Initially, the PDB encoded both anomeric configurations into a single three-letter code. Consequently, refinement programs had to rely on MODRES records to rename each residue and point to the correct set of restraints. The PDB archive was then remediated (Henrick et al., 2008 ▸) and the PDBCCD now holds independent three-letter codes for each anomer (see Table 2 ▸ for the correspondence between IUPAC long and short names and the PDBCCD notation), making the renaming process unnecessary. While most of the sugars appear to be fine, β-d-xylose (XYP), a sugar that is central to plant biology, still does not follow the same standard atom-naming convention. This issue has caused problems downstream, as programs operating on the PDBCCD definition may not recognize this entry as a sugar. Such is the case with LIBCHECK (Vagin et al., 2004 ▸), which was used to generate the CCP4 monomer library: indeed, this entry is classified as a ‘non-polymer’ instead of ‘pyranose’, and therefore REFMAC5 (Murshudov et al., 2011 ▸) is unable to detect glycosidic type links between XYP and any other sugar, including XYP. Other, potentially unrelated issues that LIBCHECK has with sugars include the generation of one 0° endocyclic torsion which keeps four ring atoms coplanar and therefore imposes the wrong envelope or half-chair conformations. Privateer (Agirre, Iglesias-Fernández et al., 2015 ▸) will report any incorrect torsions found in the library if run from the command line. The problem is known to affect at least 60 sugar entries in the CCP4 monomer library, including NAG, BGC and BMA. These problems, along with the fact that the geometry target that LIBCHECK produces is consistent with Engh & Huber (1991 ▸), which is now inconsistent with the new context-dependent geometries, highlight the need for a regeneration of the whole library using a modern tool such as ACEDRG (Long et al., 2017 ▸).
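The flat-torsion problem described above can be illustrated with a short check on ring coordinates. The sketch below computes the six endocyclic torsion angles of a ring and lists those that are close to 0°. It is an illustration of the idea only, not Privateer's implementation; the function names, the idealized test geometry and the 5° tolerance are assumptions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle p0-p1-p2-p3 in degrees."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def endocyclic_torsions(ring):
    """Torsions over consecutive atom quadruplets, wrapping around the ring."""
    n = len(ring)
    return [dihedral(ring[i], ring[(i + 1) % n], ring[(i + 2) % n], ring[(i + 3) % n])
            for i in range(n)]


def flat_torsions(ring, tolerance_deg=5.0):
    """Indices of torsions whose magnitude is below a (hypothetical) cutoff."""
    return [i for i, angle in enumerate(endocyclic_torsions(ring))
            if abs(angle) < tolerance_deg]


def idealised_ring(offsets, radius=1.45):
    """Hypothetical test geometry: six atoms on a circle with given z offsets."""
    return [(radius * math.cos(math.pi * j / 3),
             radius * math.sin(math.pi * j / 3),
             dz) for j, dz in enumerate(offsets)]


if __name__ == "__main__":
    chair = idealised_ring([0.25, -0.25, 0.25, -0.25, 0.25, -0.25])
    planar = idealised_ring([0.0] * 6)
    print("chair torsions:", [round(t, 1) for t in endocyclic_torsions(chair)])
    print("flat torsions in chair:", flat_torsions(chair))
    print("flat torsions in planar ring:", flat_torsions(planar))
```

A chair has torsions of alternating sign and similar magnitude, so none is flagged, whereas a restraint target containing a single near-zero endocyclic torsion would be reported by its index and would be a candidate for an envelope or half-chair distortion.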

There are three pillars in carbohydrate model validation: nomenclature, structure and conformation. Any mistakes affecting nomenclature, structure or both can lead to a distorted ring, incorrect bond conformations or both. Higher-energy ring or bond conformations do not necessarily spawn from previous mistakes introduced during model building, but can result from refining a model against low-resolution data with fewer restraints than required. Such problems, which span across all refinement programs, were highlighted recently using N-glycan-forming d-pyranosides as a subject study (Agirre, Davies et al., 2015 ▸).

The computational side of structural glycobiology is slowly catching up with the rest of the field. For validation methods to succeed in preventing many of the mistakes mentioned above, they have to be integrated much more closely into the structure-determination process. Web services, while generally easy to use and requiring nothing more than a browser, represent an insurmountable barrier for confidential projects, and even in nonconfidential ones they tend to occupy a residual, often overlooked, step at the end of such a process.



