Date Published: May 2, 2012
Publisher: BioMed Central
Author(s): Jonas S Almeida, Alexander Grüneberg, Wolfgang Maass, Susana Vinga.
The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required.
In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique, that has been found to provide a “alignment-free” solution to sequence analysis and comparison. That is, a solution that does not require dynamic programming, relying on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units: with no resort to dynamic programming.
The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser’s emergence as an environment for high performance distributed computing.
Since 2008 the decrease in sequencing costs is far steeper than of those of computing . Projecting from these trends promises to deliver the $1000 genome by 2014, making it inescapable that the costs of analyzing the raw sequence data will exceed those of its generation. In contrast, the algorithms used to process and compare sequences largely rely on the dynamic programming solutions proposed by Smith-Waterman and Needleman-Wunsch in the 70’s and 80’s [2,3]. This is not to say that the implementation of alignment algorithms has not become more efficient, quite the opposite has taken place. For example, there are several capable algorithmic solutions  to align the vast number of short reads that next generation sequencing techniques produce a reference genome. However, better implementations of dynamic programming do not by themselves remove its limited scalability, which has motivated research into a variety of alignment-free methods in the last decade [5-9].
In  we first noted that by adding the distances of forwardly and backwardly encoded coordinates we could estimate the length of the full similar segment. This had the interesting property that the length of the similar sequence could be approached by comparing the forward and backward coordinates of any two homologous units. This is the defining feature explored by the map-reduce decomposition described here. This composite of forward and backward CGR coordinate encoding for alphabets of any length was designated as Universal Sequence Maps (USM). A compact library [usm.m] was then developed in Mathworks m-code to support motif density kernels . The library provided here (usm.js) advances that work by producing exact measure of similar length (Equations 3-5), and weaving its use in Equation 6 with MapReduce parallelization of sequence comparison.
The authors declare that they have no competing interests.
JSA identified and implemented the USM decomposition procedure; AG assisted with the use of the Map Reduce programming pattern, with oversight by WM in framing it in the wider context of computational architectures; SV was integral part of the re-design of the USM procedure, with a key contribution in its mathematical representation. JSA wrote the report and all authors participated in its revision.