Research Article: A High-Throughput Computational Framework for Identifying Significant Copy Number Aberrations from Array Comparative Genomic Hybridisation Data

Date Published: September 13, 2012

Publisher: Hindawi Publishing Corporation

Author(s): Ian Roberts, Stephanie A. Carter, Cinzia G. Scarpini, Konstantina Karagavriilidou, Jenny C. J. Barna, Mark Calleja, Nicholas Coleman.


Reliable identification of copy number aberrations (CNAs) from comparative genomic hybridisation data would be improved by the availability of a generalised method for processing large datasets. To this end, we developed swatCGH, a data analysis framework and region detection heuristic for computational grids. swatCGH analyses sequentially displaced (sliding) windows of neighbouring probes and applies adaptive thresholds of varying stringency to identify the 10% of each chromosome that contains the most frequently occurring CNAs. We used the method to analyse a published dataset, comparing data preprocessed using four different DNA segmentation algorithms and two methods for prioritising the detected CNAs. The consolidated list of the most commonly detected aberrations confirmed the value of swatCGH as a simplified high-throughput method for identifying biologically significant CNA regions of interest.

Partial Text

Correlating specific genomic copy number aberrations (CNAs) with disease is an important and challenging first step in biomarker discovery [1]. Detecting CNAs that define genomic regions of interest using array comparative genomic hybridisation (aCGH) requires precise integration of probe signal amplitude, the size (i.e., width) of the copy number imbalanced region, and the frequency of imbalance across a sample set, all referenced to relevant clinico-pathologic features.

swatCGH was designed as a simplified approach to selecting CNA regions of interest from aCGH data. This open-source method enables consolidation of data across a sample set and can accommodate the large information content of high-resolution analyses, where theoretical limits extend beyond millions of probes by thousands of samples, as defined by R data frame properties (see the R documentation). The method incorporates sliding windows because signal intensities estimated from groups of neighbouring probes are less likely to be perturbed by noise than those of individual probes. Adaptive thresholds applied on a per-chromosome basis increase the probability of identifying lower-prevalence abnormalities that may contribute to significant patterns of disease heterogeneity. This parallels an aim of GTS but contrasts with methods such as GISTIC, which are weighted towards oncogene detection. Selection of prioritized candidate targets is not computed by integration across probe window sizes. Instead, users can select a results panel based on the window size most appropriate for the array probe density used, and can review outcomes for a range of probe window sizes by navigating through the web-based CNA reports. In our approach, the process for ranking CNA regions of interest is driven by mean signal intensity, which prevents omission of significant nonannotated regions of the genome and supports inclusion of important lower-prevalence abnormalities. The overall method is robust, systematic, and customizable, with all parameters specified in a single text file. Because every analysis step is reported, all genomic loci can be readily evaluated, not just those in the ranked lists.
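The sliding-window and adaptive-threshold ideas described above can be illustrated with a short sketch. This is not the swatCGH implementation (which is written for R and computational grids); it is a minimal Python illustration under stated assumptions. The function names, the fixed `call_cutoff` used to call a window aberrant in a sample, and the rule of lowering the frequency threshold until roughly 10% of windows on a chromosome are selected are all hypothetical choices made here for clarity.

```python
import math
from statistics import mean


def window_means(log_ratios, window):
    """Mean signal of each run of `window` neighbouring probes on one chromosome."""
    if not 1 <= window <= len(log_ratios):
        raise ValueError("window must be between 1 and the number of probes")
    return [mean(log_ratios[i:i + window])
            for i in range(len(log_ratios) - window + 1)]


def aberration_frequency(samples, window, call_cutoff):
    """Per window, the fraction of samples whose window mean reaches
    `call_cutoff` in magnitude (assumed calling rule, gain or loss)."""
    per_sample = [window_means(s, window) for s in samples]
    n_windows = len(per_sample[0])
    return [sum(abs(ws[i]) >= call_cutoff for ws in per_sample) / len(samples)
            for i in range(n_windows)]


def adaptive_threshold(frequencies, target_fraction=0.10):
    """Pick the highest frequency threshold at which at least
    `target_fraction` of the chromosome's windows are selected.
    Ties at the threshold can select slightly more than the target."""
    n_target = max(1, math.ceil(len(frequencies) * target_fraction))
    threshold = sorted(frequencies, reverse=True)[n_target - 1]
    selected = [i for i, f in enumerate(frequencies) if f >= threshold]
    return threshold, selected
```

Because the threshold is derived separately from each chromosome's own frequency distribution, a chromosome with only low-prevalence aberrations still yields candidate windows, which is the behaviour the per-chromosome adaptive thresholds are intended to provide. Running the sketch with several values of `window` mimics reviewing results across a range of probe window sizes.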