Date Published: March 28, 2019
Publisher: Public Library of Science
Author(s): Manali Rupji, Bhakti Dwivedi, Jeanne Kowalski, Xia Li.
Since their inception, several tools have been developed for cluster analysis and heatmap construction. Applying such tools to the number and types of genome-wide data available from next generation sequencing (NGS) technologies requires the adaptation of statistical concepts, such as defining a most variable gene set, and more intricate cluster analysis methods to address multiple omic data types. Additionally, the growing number of publicly available datasets has created the desire to estimate the statistical significance of a gene signature derived from one dataset to similarly group samples based on another dataset. The number of currently available tools and their combined use for generating heatmaps, along with the several adaptations of statistical concepts for addressing the higher dimensionality of genome-wide NGS-derived data, has created a further challenge in the ability to replicate heatmap results. We introduce NOJAH (NOt Just Another Heatmap), an interactive tool that defines and implements a workflow for genome-wide cluster analysis and heatmap construction by creating and combining several tools into a single user interface. NOJAH includes several newly developed scripts for techniques that, though frequently applied, are not sufficiently documented to allow for replicability of results. These techniques include: defining a most variable gene set (a.k.a. ‘core genes’), estimating the statistical significance of a gene signature to separate samples into clusters, and performing a result-merging integrated cluster analysis. With only a user-uploaded dataset, NOJAH provides as output, among other things, the minimum documentation required for replicating heatmap results. Additionally, NOJAH contains five different existing R packages that are connected in the interface by their functionality as part of a defined workflow for genome-wide cluster analysis.
The NOJAH application tool is available at http://bbisr.shinyapps.winship.emory.edu/NOJAH/ with corresponding source code available at https://github.com/bbisr-shinyapps/NOJAH/.
Data from next generation sequencing (NGS) technologies have created a level of dimensionality that greatly exceeds that of prior, microarray-based genome-wide datasets, resulting in the need for innovative approaches to cluster analysis and heatmap construction. For this reason, several disparate methods have been developed to address such needs. For example, consensus clustering was introduced as a method for estimating the number of clusters [1–3]. The concept of defining a ‘most variable’ gene set was introduced to address the much higher dimension of NGS data by filtering out genes with little to no differences among samples with respect to some molecular data type and performing a cluster analysis on the remaining ‘core gene set.’ This approach has resulted in the use of several definitions of a core gene set, most of which are insufficiently documented to enable their replicability. Other concepts in cluster analysis, such as silhouette widths for examining the tightness of clusters, though around for some time, have gained renewed interest for their use in defining a ‘core sample set’ within the context of genomic data cluster analysis, an approach that has been particularly useful when clustering many samples. We have collectively placed these new approaches and new adaptations of existing methods for genome-wide cluster analysis and heatmap construction into the following general, genome-wide heatmap analysis workflow: 1) define a most variable gene set (a.k.a. ‘core genes’); 2) perform cluster analysis using core genes and construct a heatmap of results; 3) estimate the number of clusters; 4) define a core sample set and update the heatmap using both core genes and core samples.
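To make the first and last workflow steps concrete, the following is a minimal sketch in Python rather than NOJAH's own R code. It assumes that the median absolute deviation (MAD) is used to rank genes and that the standard silhouette width with Euclidean distance is used to flag core samples; the function names, the `top_fraction` parameter, and the 0.5 cutoff in the usage note are hypothetical and are not taken from NOJAH.

```python
from statistics import median


def mad(values):
    """Median absolute deviation of a list of numbers."""
    m = median(values)
    return median([abs(v - m) for v in values])


def most_variable_genes(expr, top_fraction=0.25):
    """Return the genes in the top `top_fraction` when ranked by MAD.

    `expr` maps a gene name to its list of per-sample values.
    MAD is an assumed variability measure, chosen for illustration.
    """
    ranked = sorted(expr, key=lambda g: mad(expr[g]), reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for each sample.

    a = mean distance to the other members of the sample's own cluster,
    b = smallest mean distance to the members of any other cluster.
    """
    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if j != i:
                dists.setdefault(other, []).append(euclidean(p, q))
        own = dists.pop(lab, [])
        if not own or not dists:
            widths.append(0.0)  # singleton cluster, or only one cluster
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in dists.values())
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return widths
```

In use, one would keep the genes returned by `most_variable_genes`, cluster the samples on those genes, and then retain as core samples those whose silhouette width exceeds a chosen cutoff, for example `[i for i, w in enumerate(widths) if w > 0.5]`. The cutoff and the variability measure are both analysis choices that need to be reported for the heatmap to be replicable.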
NOJAH is a web interface developed using the Shiny R package, hosted on a private CentOS server, and requires only a stable internet connection to run. The source code is written in the R programming language (https://www.r-project.org/) and is freely available to download from GitHub (https://github.com/bbisr-shinyapps/NOJAH/). The main R packages used in NOJAH include gplots (which provides the heatmap.2 function), ConsensusClusterPlus, and dendextend. NOJAH was tested using Google Chrome on a 64-bit, x64-based Windows 10 Enterprise machine with 32 GB of RAM and an Intel(R) Core(TM) i7-7820HQ CPU at 2.90 GHz, and using Firefox (Firefox Quantum 62.0.3, 64-bit) on a MacBook Pro running OS X 10.11.6 with a 2.8 GHz Intel Core i7 processor and 16 GB of 1600 MHz DDR3 RAM.
Our NOJAH application tool provides a comprehensive resource to users for conducting a genome-wide heatmap analysis. NOJAH is flexible in terms of data input, and can be applied to any data type and platform, such as mRNA expression, miRNA expression, methylation, copy number or variants. Additional features in NOJAH include interactive settings for defining core genes and core samples and combined results clustering, along with the flexibility to include phenotype information through use of a color bar. While we have demonstrated the utility of NOJAH using a TCGA BRCA data set of gene expression, any high-dimensional quantitative data may be used as input.
Identification of gene signatures is crucial in cancer genomics. Prognostic gene signatures within a cancer type constitute a set of genes whose expression changes reveal important information about tumor diagnosis, prognosis and even therapeutic response [6, 15]. The dependence on heatmaps to apply published gene signatures for tumor subtyping is increasing and, along with it, the challenges in obtaining results. With a comprehensive workflow in hand as a single application tool, there is little room for the computational errors that can arise from invoking several separate tools to accomplish the end task of applying a gene signature for tumor subtyping. Additionally, with a workflow that includes as output the parameters used to obtain results, replicating those results is more feasible than when documenting several steps from several programs and approaches.