Research Article: Getting DNA copy numbers without control samples

Date Published: August 16, 2012

Publisher: BioMed Central

Author(s): Maria Ortiz-Estevez, Ander Aramburu, Angel Rubio.


The choice of the reference used to scale the data in a copy number analysis is of paramount importance for obtaining accurate estimates. Usually this reference is generated from control samples included in the study. However, control samples are not always available, and in these cases an artificial reference must be created. Generating this signal properly is crucial in terms of both noise and bias.

Five human datasets (a subset of HapMap samples and Glioblastoma Multiforme (GBM), Ovarian, Prostate and Lung Cancer experiments) have been analyzed. The results show that, using only tumoral samples, NSA (Normality Search Algorithm) removes the bias in the copy number estimation and reduces the noise, and therefore increases the ability to detect copy number aberrations (CNAs). These improvements also allow NSA to detect recurrent aberrations more accurately than other state-of-the-art methods.

NSA provides a robust and accurate reference for scaling probe signal data to CN values without the need for control samples. It minimizes the problems of bias, noise and batch effects in the estimation of CNs. As a result, the NSA scaling approach detects recurrent CNAs better than current methods. The automatic selection of references makes it useful for bulk analysis of many GEO or ArrayExpress experiments without having to develop a parser to find the normal samples or possible batches within the data. The method is available in the open-source R package NSA, which is an add-on to the aroma.affymetrix framework.

Partial Text

A DNA copy number aberration (CNA) is a pathological amplification or deletion of a part of the genome (a chromosome, one of its arms or a segment) that has been related to cancer development. In CNAs, DNA copy numbers (CNs) may be larger (gains and amplifications) or smaller (deletions and homozygous deletions) than the normal state (CN = 2).
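As an illustration of the states named above, the following minimal sketch maps an estimated total copy number to a coarse label. The function name `classify_cn` and every numeric cut-off are assumptions chosen for the example; the source does not specify thresholds.

```python
def classify_cn(cn: float) -> str:
    """Map an estimated total copy number to a coarse state.

    All cut-offs below are illustrative assumptions, not values from the paper.
    """
    if cn < 0.5:
        return "homozygous deletion"  # close to 0 copies
    if cn < 1.5:
        return "deletion"             # close to 1 copy
    if cn <= 2.5:
        return "normal"               # close to the normal state, CN = 2
    if cn <= 4.5:
        return "gain"                 # a few extra copies
    return "amplification"            # many extra copies


for estimate in (0.1, 1.0, 2.0, 3.0, 6.0):
    print(estimate, classify_cn(estimate))
```

In practice the boundaries between states depend on noise level and tumor purity, so fixed thresholds like these are only a starting point.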

NSA is a population-based multi-array method for scaling any SNP & CN array technology, e.g. Affymetrix and Illumina. It identifies the normal regions within the samples, finds optimal weights to account for hybridization batches, calculates the corresponding references and, finally, performs a two-dimensional scaling.
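The core idea of scaling against a reference built from normal regions can be sketched as follows. This is a simplified, one-dimensional illustration and not the NSA implementation: it assumes the normal regions are already known (the `is_normal` flags), it ignores the batch weights and the two-dimensional scaling, and the function names and data are hypothetical.

```python
from statistics import median


def normal_region_reference(signals, is_normal):
    """Per-probe reference: median signal over samples flagged normal there.

    signals[s][p]   -- summarized signal of sample s at probe p
    is_normal[s][p] -- True if sample s is assumed to have CN = 2 at probe p
    Returns None for probes where no sample is flagged normal.
    """
    n_probes = len(signals[0])
    reference = []
    for p in range(n_probes):
        values = [row[p] for row, flags in zip(signals, is_normal) if flags[p]]
        reference.append(median(values) if values else None)
    return reference


def scale_to_cn(sample, reference, normal_cn=2.0):
    """Scale one sample so that the reference signal corresponds to CN = 2."""
    return [normal_cn * x / r if r else None for x, r in zip(sample, reference)]


# Hypothetical data: three tumoral samples, three probes.
signals = [
    [100, 200, 150],  # normal everywhere
    [100, 300, 150],  # gain at probe 1
    [50, 200, 150],   # deletion at probe 0
]
is_normal = [
    [True, True, True],
    [True, False, True],
    [False, True, True],
]

reference = normal_region_reference(signals, is_normal)
for sample in signals:
    print(scale_to_cn(sample, reference))
```

Because each probe's reference is computed only from samples that are normal at that probe, the aberrant samples do not pull the reference toward their own signal level.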

This part shows that NSA outperforms the two most commonly used alternatives: control samples from a different lab and a robust median of the tumoral samples. The improvement appears in both noise and bias. Since there is no ground truth to compare against, we used three indirect criteria to assess performance: the ability to find CNAs along the genome, the quality of the estimated CNs in regions that are known to be normal, and the ability to find recurrently aberrated regions.

This paper describes NSA, an algorithm to scale the summarized SNP signals to CN values by finding normal regions within tumoral samples. NSA is independent of the platform (Illumina or Affymetrix) and of the pre-processing method (dChip, CRMAv2, ACNE, CalMaTe…). The synthetic reference generated by NSA using only tumoral samples gives more accurate results than either using control samples from different labs or using all the tumoral samples. Indeed, NSA results are close to those obtained using control samples from the same lab within the dataset. In addition, NSA includes an algorithm to deal with batch effects. It automatically computes an optimal reference for each sample (which in our tests is strongly related to the hybridization batches). Batch information is not required to run NSA; the algorithm automatically identifies the proper samples to compute the reference using only the signals of the microarrays. NSA minimizes the problem of bias for samples with a large number of similar aberrations (i.e. most of them are gains or most of them are deletions). For these samples, the predicted CNs for normal regions tend to compensate for the aberrations, which introduces a bias. This is a potential problem that also occurs when using MCS (the median of control samples from the same lab). NSA effectively discovers the normal regions and uses them to scale the data, diminishing any bias that appears in the normalization step.
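A small worked example can make the bias argument concrete for the median of tumoral samples (MTS). The numbers are invented for illustration, and the `is_normal` flags are assumed to be known in advance, whereas NSA has to discover them from the data.

```python
from statistics import median

# Hypothetical signals at one locus for five tumoral samples.
# Four samples carry a gain (true CN = 3), one is normal (true CN = 2).
signals = [150.0, 150.0, 150.0, 150.0, 100.0]
is_normal = [False, False, False, False, True]


def cn_estimates(values, reference, normal_cn=2.0):
    """Scale signals so that the reference level maps to CN = 2."""
    return [normal_cn * v / reference for v in values]


# MTS reference: median over all tumoral samples, dominated by the gain.
mts_reference = median(signals)

# Reference restricted to samples that are normal at this locus.
normal_reference = median([v for v, ok in zip(signals, is_normal) if ok])

cn_mts = cn_estimates(signals, mts_reference)
cn_normal = cn_estimates(signals, normal_reference)

print("MTS reference:   ", [round(c, 2) for c in cn_mts])
print("normal reference:", [round(c, 2) for c in cn_normal])
```

With the MTS reference the four gained samples are reported at CN = 2 and the normal sample appears as a loss (about 1.33), whereas the reference built from the normal sample recovers CN = 3 and CN = 2 respectively.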

The proposed NSA method is available in the NSA package implemented in R (R Development Core Team, 2010). This package includes an add-on to the high-level aroma.affymetrix framework [33], which allows NSA to be applied to very large SNP data sets. It is publicly available from the CRAN repository in a package called “NSA”.

LH, Level of Heterozygosity; HNCNs, Heterozygous Neutral Copy Number SNPs; CNA, Copy Number Aberration; CNs, DNA Copy Numbers; SNP, Single Nucleotide Polymorphism; LOH, Loss of Heterozygosity; CNVs, Copy Number Variations; NSA, Normality Search Algorithm; BER, Batch Effect Removal; QP, Quadratic Programming; MCS, Median of Control samples; MHS, Median of HapMap samples; MTS, Median of tumoral samples.

The authors declare that they have no competing interests.

MO conceived the idea and, jointly with AA, developed the add-on to the aroma.affymetrix framework. AR developed the algorithm to account for batch effects. MO, AA and AR wrote the manuscript and developed the software to compare NSA against other algorithms. All authors read and approved the final manuscript.



