Research Article: ReCoil – an algorithm for compression of extremely large datasets of dna data

Date Published: October 11, 2011

Publisher: BioMed Central

Author(s): Vladimir Yanovsky.


The growing volume of generated DNA sequencing data makes the problem of its long term storage increasingly important. In this work we present ReCoil – an I/O efficient external memory algorithm designed for compression of very large collections of short reads DNA data. Typically each position of DNA sequence is covered by multiple reads of a short read dataset and our algorithm makes use of resulting redundancy to achieve high compression rate.

Partial Text

ReCoil uses the natural idea that if two strings s1 and s2 are sufficiently similar, then it is more space efficient to store one of the strings and the differences between it and the second string, then to store both strings explicitly.

ReCoil was implemented in C++ using the STXXL [17] library of external memory data structures. The k-mer size selected was 15 for the similarity graph construction and 10 for the encoding step. We tested ReCoil on both simulated and real datasets. The simulated datasets were generated by making random samples of given length from Human Chromosome 14, adding single-nucleotide errors (insertions, deletions or substitutions) with probability 0.02 and reverse complementing each read with probability 0.5. All our generated datasets were of the same size of 1.8 billion nucleotides.

Our goal in this work was to design an algorithm that maximizes compression ratio. Yet the ability to retrieve a sequence without full decompression will improve applicability of the algorithm. One simple way to accomplish this would be to store the sequences at some levels of the spanning tree explicitly. Then in order to decode a sequence of a node one must go up in the tree until reaching a level of not encoded nodes.

The author declares that he has no competing interests.

VY conceived of the study, implemented the algorithms and tested them.