Research Article: Framework for Parallel Preprocessing of Microarray Data Using Hadoop

Date Published: March 29, 2018

Publisher: Hindawi

Author(s): Amirhossein Sahlabadi, Ravie Chandren Muniyandi, Mahdi Sahlabadi, Hossein Golshanbafghy.


Microarray technology has become one of the most popular ways to study gene expression and diagnose disease. The National Center for Biotechnology Information (NCBI) hosts public databases containing large volumes of biological data that must be preprocessed because they carry high levels of noise and bias. Robust Multiarray Average (RMA) is a standard and widely used method for preprocessing the data and removing noise. Most preprocessing algorithms are time-consuming and cannot handle large numbers of datasets with thousands of experiments. Parallel processing can address these issues. Hadoop is a well-known framework for distributed storage and processing that provides a parallel environment in which to run the experiment. In this research, for the first time, the capability of Hadoop and the statistical power of R are leveraged to parallelize the RMA preprocessing algorithm so that microarray data can be processed efficiently. The experiment was run on a cluster of 5 nodes, each with 16 cores and 16 GB of memory. It compares the efficiency and performance of RMA parallelized with Hadoop against RMA parallelized with the affyPara package and against sequential RMA. The results show that the proposed approach achieves a higher speed-up than both the sequential approach and the affyPara approach.

Partial Text

Microarrays measure the expression of thousands of genes at once. The abundance of messenger RNA (mRNA) produced for the expressed genes can be studied using microarray-based methods, which allow large-scale analysis of gene expression simultaneously [1].

The proposed approach distributes the microarray data across nodes using the Hadoop Distributed File System (HDFS). It then runs the preprocessing method under the MapReduce programming model to spread the jobs and tasks across all nodes. Consequently, the preprocessing performance for microarray data improves.
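The distribution step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the round-robin assignment, the function names, and the `fake_preprocess` placeholder are all assumptions made for illustration, and no real Hadoop or R call is involved. It only shows the general shape of splitting files across nodes, applying a per-partition map step, and merging the results in a reduce step.

```python
from typing import Callable, Dict, List, Tuple


def partition_files(files: List[str], n_nodes: int) -> List[List[str]]:
    """Assign files to n_nodes partitions in round-robin order (assumed policy)."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    parts: List[List[str]] = [[] for _ in range(n_nodes)]
    for i, name in enumerate(files):
        parts[i % n_nodes].append(name)
    return parts


def map_phase(partition: List[str],
              preprocess: Callable[[str], int]) -> List[Tuple[str, int]]:
    """Apply the preprocessing function to every file in one partition."""
    return [(name, preprocess(name)) for name in partition]


def reduce_phase(mapped: List[List[Tuple[str, int]]]) -> Dict[str, int]:
    """Merge the per-partition outputs into a single result keyed by file name."""
    merged: Dict[str, int] = {}
    for pairs in mapped:
        for name, value in pairs:
            merged[name] = value
    return merged


def fake_preprocess(name: str) -> int:
    """Hypothetical stand-in for a real preprocessing step such as RMA."""
    return len(name)


# Hypothetical file names; the node count matches the 5-node cluster in the text.
files = [f"sample_{i}.CEL" for i in range(12)]
parts = partition_files(files, 5)
result = reduce_phase([map_phase(p, fake_preprocess) for p in parts])
```

In a real deployment the partitions would live in HDFS and the map step would run on the node holding each partition; here everything runs in one process to keep the sketch self-contained.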

The dataset used in this experiment is breast cancer data collected from the National Center for Biotechnology Information (NCBI). The microarray cancer data are found in the Gene Expression Omnibus (GEO) database [31]. This database contains gene and microarray data as well as datasets for various organisms. The GEO accession number for this dataset is GSE4922, which lists all GSM files from a single experiment. The files used in this experiment run from GSM110625.CEL to GSM111122.CEL. All tumor samples are evaluated on GPL96 and GPL97. GPL stands for GEO Platform and identifies a specific platform type. GPL96 is the Affymetrix Human Genome U133A Array GeneChip [HG-U133A] and GPL97 is the Affymetrix Human Genome U133B Array GeneChip [HG-U133B]. Both GeneChips are manufactured by Affymetrix [32].

Figure 6 compares the parallel RMA approach with the standard sequential RMA preprocessing method. The results show that, as the volume and number of files increase, parallel preprocessing takes less time than sequential preprocessing.
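A comparison of this kind is commonly summarized as a speed-up ratio (sequential time divided by parallel time) and a parallel efficiency (speed-up divided by the number of workers). The sketch below shows that arithmetic. The two timings are hypothetical placeholders, not measurements from the paper; only the 5-node, 16-core cluster size comes from the text.

```python
def speed_up(t_sequential: float, t_parallel: float) -> float:
    """Speed-up: how many times faster the parallel run is than the sequential run."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_sequential / t_parallel


def efficiency(s: float, n_workers: int) -> float:
    """Parallel efficiency: speed-up divided by the number of workers."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return s / n_workers


NODES = 5            # cluster size stated in the text
CORES_PER_NODE = 16  # cores per node stated in the text

# Hypothetical timings in seconds, for illustration only.
t_seq = 1200.0
t_par = 150.0

s = speed_up(t_seq, t_par)
e = efficiency(s, NODES * CORES_PER_NODE)
```

An efficiency well below 1 is expected in practice, since data transfer, job scheduling, and any non-parallel steps all add overhead.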

In this paper, the proposed approach exploits the integration of Hadoop and R to preprocess microarray data with the RMA algorithm in parallel, for the first time in bioinformatics. According to the experiment, performance improves: as the volume of files increases, the proposed approach needs less time to preprocess the data than the sequential one. Moreover, preprocessing hundreds of microarray datasets with sequential RMA is either not possible or, in some cases, takes days because of its heavy memory usage. The main memory limits are caused by the structure of the AffyBatch class. An AffyBatch object is created when .CEL files are imported into R and serves as a container for probe-level data. The number of arrays that can be imported depends strongly on the architecture of the computer system (e.g., a 32-bit Linux system with 4 GB of main memory can support 160 .CEL files). Partitioning the data and distributing it to several nodes solves the main memory problems and accelerates the methods [33]. Therefore, the MapReduce implementation of RMA allows files of any size to be processed conveniently and at higher speed. The proposed method can be deployed on large clusters with high computational power and memory to handle huge amounts of bioinformatics data.
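The memory argument above can be made concrete with a small sketch. Taking the figure of 160 .CEL files per machine from the text as a per-node cap, the number of partitions needed is the file count divided by the cap, rounded up. The function names and the total of 1000 files are assumptions for illustration and do not come from the paper.

```python
import math
from typing import List


def batches_needed(n_files: int, max_files_per_node: int) -> int:
    """Smallest number of batches so that no batch exceeds the per-node cap."""
    if max_files_per_node < 1:
        raise ValueError("max_files_per_node must be at least 1")
    return math.ceil(n_files / max_files_per_node)


def split_into_batches(files: List[str], max_files_per_node: int) -> List[List[str]]:
    """Cut the file list into contiguous batches of at most max_files_per_node."""
    if max_files_per_node < 1:
        raise ValueError("max_files_per_node must be at least 1")
    return [files[i:i + max_files_per_node]
            for i in range(0, len(files), max_files_per_node)]


MAX_FILES = 160  # per-machine limit quoted in the text for a 32-bit, 4 GB system

# Hypothetical collection of 1000 files, for illustration only.
all_files = [f"array_{i}.CEL" for i in range(1000)]
n_batches = batches_needed(len(all_files), MAX_FILES)
batches = split_into_batches(all_files, MAX_FILES)
```

Each batch then fits within the memory of a single node, which is the property the partitioning relies on.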



