Research Article: Fractal MapReduce decomposition of sequence alignment

Date Published: May 2, 2012

Publisher: BioMed Central

Author(s): Jonas S Almeida, Alexander Grüneberg, Wolfgang Maass, Susana Vinga.

http://doi.org/10.1186/1748-7188-7-12

Abstract

The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required.

In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique, that has been found to provide a “alignment-free” solution to sequence analysis and comparison. That is, a solution that does not require dynamic programming, relying on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units: with no resort to dynamic programming.

The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser’s emergence as an environment for high performance distributed computing.

Partial Text

Since 2008 the decrease in sequencing costs is far steeper than of those of computing [1]. Projecting from these trends promises to deliver the $1000 genome by 2014, making it inescapable that the costs of analyzing the raw sequence data will exceed those of its generation. In contrast, the algorithms used to process and compare sequences largely rely on the dynamic programming solutions proposed by Smith-Waterman and Needleman-Wunsch in the 70’s and 80’s [2,3]. This is not to say that the implementation of alignment algorithms has not become more efficient, quite the opposite has taken place. For example, there are several capable algorithmic solutions [4] to align the vast number of short reads that next generation sequencing techniques produce a reference genome. However, better implementations of dynamic programming do not by themselves remove its limited scalability, which has motivated research into a variety of alignment-free methods in the last decade [5-9].

In [17] we first noted that by adding the distances of forwardly and backwardly encoded coordinates we could estimate the length of the full similar segment. This had the interesting property that the length of the similar sequence could be approached by comparing the forward and backward coordinates of any two homologous units. This is the defining feature explored by the map-reduce decomposition described here. This composite of forward and backward CGR coordinate encoding for alphabets of any length was designated as Universal Sequence Maps (USM). A compact library [usm.m] was then developed in Mathworks m-code to support motif density kernels [19]. The library provided here (usm.js) advances that work by producing exact measure of similar length (Equations 3-5), and weaving its use in Equation 6 with MapReduce parallelization of sequence comparison.

The Universal Sequence Map (USM) procedure expands the Chaos Game Representation (CGR) approach to “alignment-free” analysis of sequences of any alphabet. Not only is the sequence comparison procedure described here performed without recourse to dynamic programming alignment, but multiple layers of nested map-reduce distribution provide maximally parallelized workflows to find the length of the similar segment shared by any two sequence units. If this basic alignment operation can be streamlined by the USM procedure into the scalable and distributed processing form described here, the expectation is that other sequence analysis operations can be similarly decomposed, including more advanced types of alignment proceedures. This may be particularly significant given the large amount of sequence information now being generated by NextGen methodologies. The proposed MapReduce decomposition was implemented in “the language of the web”, JavaScript (ecmascript), both out of convenience and in arguable anticipation of the native use of web-browsers for distributed computing.

The authors declare that they have no competing interests.

JSA identified and implemented the USM decomposition procedure; AG assisted with the use of the Map Reduce programming pattern, with oversight by WM in framing it in the wider context of computational architectures; SV was integral part of the re-design of the USM procedure, with a key contribution in its mathematical representation. JSA wrote the report and all authors participated in its revision.

 

Source:

http://doi.org/10.1186/1748-7188-7-12

 

0 0 vote
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments