Date Published: March 31, 2017
Publisher: Public Library of Science
Author(s): Yana Valasatava, Anthony R. Bradley, Alexander S. Rose, Jose M. Duarte, Andreas Prlić, Peter W. Rose, Freddie Salsbury.
The size and complexity of 3D macromolecular structures available in the Protein Data Bank are constantly growing. Current tools and file formats have reached the limits of their scalability. New compression approaches are required to support the visualization of large molecular complexes and to enable new, scalable means of data analysis. We evaluated a series of compression techniques for the coordinates of 3D macromolecular structures and identified the best-performing approaches. By balancing compression efficiency (decompression speed and compression ratio) against code complexity, our results provide the foundation for a novel standard to represent macromolecular coordinates in a compact and useful file format.
The Protein Data Bank (PDB), the archive for 3D structures of biological macromolecules, has grown rapidly over the last few years. Developments in the major experimental techniques enable high-throughput structure determination, and the number of deposited structures now exceeds 124,000 entries, increasing by about 10,000 entries per year. The PDB is growing not only in the number of entries but also in their complexity. New integrative methods that combine multiple modelling and experimental techniques, most notably electron microscopy, now determine structures of up to the megadalton (MDa) range at atomic resolution [2–4].
In this article, we focus on the compression of 3D coordinates of macromolecules as they are challenging for general purpose compression techniques. General compression tools such as GZIP are efficient when the redundancy in data is high, like the redundancy of the language in a text (e.g. repetitive words). For example, GZIP locates repetitive strings within a text file and replaces those strings temporarily with shorter codes to make the overall file size smaller. However, the coordinates coming from experimental data generally do not exhibit such syntactic redundancy. The proposed approaches use the knowledge about structural features of biological macromolecules to create a compact representation of their atomic coordinates. Specifically, we developed two types of strategies: (i) intramolecular compression that operates on the sequence of atoms within a polymer chain; and (ii) intermolecular compression designed for the compression of special cases of multiple chains with identical atoms, such as NMR models and structures with repeated identical subunits.
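The contrast between redundant text and coordinate data can be illustrated with a small experiment. The sketch below uses Python's standard `gzip` module; the repeated sentence and the randomly generated coordinate columns are synthetic stand-ins chosen for illustration, not data from the PDB, and the exact ratios will vary with the input.

```python
import gzip
import random


def gzip_ratio(text: str) -> float:
    """Return uncompressed size divided by GZIP-compressed size."""
    raw = text.encode("ascii")
    return len(raw) / len(gzip.compress(raw))


# Highly redundant input: the same sentence repeated many times.
repetitive_text = "the quick brown fox jumps over the lazy dog\n" * 500

# Synthetic stand-in for coordinate records (assumption: NOT real PDB data).
# Each line holds x, y, z with three decimals, similar to PDB-format columns.
rng = random.Random(0)
coordinate_text = "".join(
    f"{rng.uniform(-99, 99):8.3f}"
    f"{rng.uniform(-99, 99):8.3f}"
    f"{rng.uniform(-99, 99):8.3f}\n"
    for _ in range(1000)
)

text_ratio = gzip_ratio(repetitive_text)
coord_ratio = gzip_ratio(coordinate_text)
print(f"repetitive text ratio: {text_ratio:.1f}")
print(f"coordinate text ratio: {coord_ratio:.1f}")
```

The repeated sentence compresses far better than the coordinate columns, because the digits of the coordinates rarely form repeated strings that GZIP can replace with shorter codes.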
In this paper, we explored various compression methods for macromolecular structures and describe the main ideas behind each technique. We analyzed intra- and intermolecular, lossy and lossless compression approaches based on different encoding algorithms. Lossy compression can be used in applications that tolerate data loss without a noticeable loss of performance, for example molecular visualization. On the other hand, methods such as structure refinement or molecular force field applications may be sensitive to small changes in coordinates. Lossless compression is therefore usually the preferred choice when compressing scientific data, and we centered our analysis on lossless compression algorithms. In the following, we compare the performance of the presented compression approaches and discuss the combination of methods that yields the best compression.
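The difference between lossless and lossy schemes can be made concrete by looking at coordinate precision. The following is a minimal sketch, assuming coordinates that carry three decimal places (as in PDB-format files); keeping all three decimals preserves every value, while keeping one decimal is a hypothetical lossy setting chosen here only for illustration. The helper names `quantize` and `dequantize` and the sample values are made up for this example.

```python
def quantize(coords, decimals):
    """Scale floats by 10**decimals and round to the nearest integer."""
    factor = 10 ** decimals
    return [round(c * factor) for c in coords]


def dequantize(ints, decimals):
    """Undo the scaling applied by quantize()."""
    factor = 10 ** decimals
    return [i / factor for i in ints]


# Hypothetical coordinates in angstroms, given with three decimals.
coords = [12.345, -7.081, 103.992, 0.004]

# Three decimals kept: every input value is recovered.
lossless = dequantize(quantize(coords, 3), 3)
# One decimal kept: values are rounded, so some information is lost.
lossy = dequantize(quantize(coords, 1), 1)

max_err_lossless = max(abs(a - b) for a, b in zip(coords, lossless))
max_err_lossy = max(abs(a - b) for a, b in zip(coords, lossy))
print("lossless max error:", max_err_lossless)
print("lossy max error:   ", max_err_lossy)
```

With one decimal kept, the rounding error is bounded by 0.05 Å per coordinate, which is the kind of deviation a visualization tool may tolerate but a refinement or force field calculation may not.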
We investigated compression approaches for the 3D coordinates of macromolecular structures. Coordinate data have high entropy and are therefore poorly compressed by general-purpose compression tools. To achieve better compression, we applied bespoke encoding methods to create a more compact representation of the atomic coordinates. The performance of the compression methods was evaluated against benchmark data from the PDB. We demonstrated that intramolecular compression based on the combination of integer and delta encoding, recursive indexing packing and GZIP entropy compression is very efficient for compressing atomic coordinates of macromolecules under both lossless and lossy schemes.
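The combination named above can be sketched as a small pipeline. This is an illustrative sketch only, not the authors' implementation: the text names the steps but not their details, so the scaling factor of 1000, the 16-bit value range, the function names and the sample coordinates are all assumptions. Recursive indexing is shown in one common formulation, where a value outside the 16-bit range is split into repeated range limits followed by a remainder.

```python
import gzip
import struct

INT16_MAX = 32767
INT16_MIN = -32768


def integer_encode(coords, factor=1000):
    """Scale floats to integers (assumption: three decimals are kept)."""
    return [round(c * factor) for c in coords]


def delta_encode(ints):
    """Store the first value, then differences between neighbours."""
    out, prev = [], 0
    for v in ints:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Rebuild the original integers as a running sum of the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def recursive_index_encode(ints):
    """Split out-of-range values into range limits plus a remainder."""
    out = []
    for v in ints:
        while v >= INT16_MAX:
            out.append(INT16_MAX)
            v -= INT16_MAX
        while v <= INT16_MIN:
            out.append(INT16_MIN)
            v -= INT16_MIN
        out.append(v)
    return out


def recursive_index_decode(packed):
    """Sum runs of range limits with the remainder that ends each run."""
    out, acc = [], 0
    for v in packed:
        acc += v
        if v != INT16_MAX and v != INT16_MIN:
            out.append(acc)
            acc = 0
    return out


def compress(coords, factor=1000):
    packed = recursive_index_encode(delta_encode(integer_encode(coords, factor)))
    raw = struct.pack(f"<{len(packed)}h", *packed)  # little-endian int16
    return gzip.compress(raw)


def decompress(blob, factor=1000):
    raw = gzip.decompress(blob)
    packed = struct.unpack(f"<{len(raw) // 2}h", raw)
    ints = delta_decode(recursive_index_decode(packed))
    return [i / factor for i in ints]


# Hypothetical x coordinates (angstroms) of consecutive atoms in a chain.
x_coords = [50.123, 51.587, 52.901, 54.230, 90.000, 53.500]
restored = decompress(compress(x_coords))
print(restored)
```

Because consecutive atoms in a chain lie close together in space, most deltas are small and fit into 16 bits; the occasional large jump costs one or more extra entries rather than forcing every value into a wider integer type.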