Research Article: External memory BWT and LCP computation for sequence collections with applications

Date Published: March 8, 2019

Publisher: BioMed Central

Author(s): Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, Guilherme P. Telles.

http://doi.org/10.1186/s13015-019-0140-0

Abstract

Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM.

We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs.

We prove that our algorithm performs documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathcal {O}}(n, mathsf {maxlcp})$$end{document}O(nmaxlcp) sequential I/Os, where n is the total length of the collection and documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$mathsf {maxlcp}$$end{document}maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.

Partial Text

A fundamental problem in bioinformatics is the ability to efficiently search into the billions of DNA sequences produced by NGS studies. The Burrows Wheeler transform (BWT) [1] is a well known structure which is the starting point for the construction of compressed indices for collections of sequences [2]. The BWT is often complemented with the longest common prefix (LCP) array [3] since the latter makes it possible to efficiently emulate Suffix Tree algorithms [4, 5]. The construction of such data structures is a challenging problem. Although the final outcome is a compressed index, construction algorithms can be memory hungry and the necessity of developing lightweight algorithms was recognized since the very beginning of the field [6–8]. In lightweight algorithms it is assumed that the input and the output fit in main memory but the amount of additional working memory is sublinear with the size of the input.

Let documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{}[1,n]$$end{document}s[1,n] denote a string of length n over an alphabet documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$Sigma$$end{document}Σ of constant size documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$sigma$$end{document}σ. As usual, we assume documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{}[n]$$end{document}s[n] is a special symbol (end-marker) not appearing elsewhere in documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{}$$end{document}s and lexicographically smaller than any other symbol. We write documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{}[i,j]$$end{document}s[i,j] to denote the substring documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{}[i] {mathsf {s}}_{}[i+1] cdots {mathsf {s}}_{}[j]$$end{document}s[i]s[i+1]⋯s[j]. If documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$jge n$$end{document}j≥n we assume documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{}[i,j] = {mathsf {s}}_{}[i,n]$$end{document}s[i,j]=s[i,n]. If documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$i>j$$end{document}i>j or documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$i > n$$end{document}i>n then documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{}[i,j]$$end{document}s[i,j] is the empty string. Given two strings documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{1}$$end{document}s1 and documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{2}$$end{document}s2 we write documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{1} preceq {mathsf {s}}_{2}$$end{document}s1⪯s2 (documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{1} prec {mathsf {s}}_{2}$$end{document}s1≺s2) to denote that documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{1}$$end{document}s1 is lexicographically (strictly) smaller than documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{2}$$end{document}s2. We denote by documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$mathsf {LCP}({mathsf {s}}_{1},{mathsf {s}}_{2})$$end{document}LCP(s1,s2) the length of the longest common prefix between documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{1}$$end{document}s1 and documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${mathsf {s}}_{2}$$end{document}s2.

The eGap algorithm for computing the multi-string BWT and LCP array in external memory works in three phases. First it builds multi-string BWTs for sub-collections in internal memory, then it merges these BWTs in external memory and generates the LCP values. Finally, it sorts the LCP values in external memory.

In this section we report on an experimental study comparing between the eGap algorithm and the other known external memory tools computing the BWT and LCP arrays of sequence collections. We implemented eGap in ANSI C based on the code of Gap [34] and gSACA-K [32]. eGap source code is freely available at https://github.com/felipelouza/egap/. All tested algorithms were compiled with GNU GCC ver. 4.6.3, with optimizing option -O3. The experiments were conducted on a machine with GNU/Linux Debian 7.0/64 bits operating system using an Intel i7-3770 3.4 GHz processor with 8 MB cache, 32 GB of RAM and a 2.0 TB SATA hard disk with 7200 RPM and 64 MB cache. The complete set of experiments took about 70 days of computing time.Table 1Datasets used in our experimentsNameSize GBN. of stringsMax LenAve LenMax LCPAve LCPshort8.085,899,3451001009927.90long8.028,633,11530030029990.28pacbio.10008.08,589,9341000100087618.05pacbio8.0942,24871,5619116308418.32Columns 4 and 5 show the maximum and average lengths of the single strings. Columns 6 and 7 show the maximum and average LCPs of the collections

In this section we show that the eGap algorithm, in addition to the BWT and LCP arrays, can output additional information useful to design efficient external memory algorithms for three well known problems on sequence collections: (i) the computation of maximal repeats, (ii) the all pairs suffix–prefix overlaps, and (iii) the construction of succinct de Bruijn graphs. For these problems we describe algorithms which are derived from known (internal memory) algorithms suitably modified so that they process the input data in a single sequential scan.

In this paper we have described eGap, a new algorithm for the computation of the BWT and LCP arrays of large collection of sequences. Depending on the amount of available memory, eGap uses an external or semi-external strategy for computing the BWT and LCP values. An experimental comparison of the available tools for BWT and LCP arrays computation shows that eGap is the fastest tool in many scenarios and was the only tool capable of completing the computation within a reasonable time frame for all kind of input data.

 

Source:

http://doi.org/10.1186/s13015-019-0140-0

 

0 0 vote
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments