Date Published: January 16, 2012
Publisher: BioMed Central
Author(s): Stefan Enroth, Claes R Andersson, Robin Andersson, Claes Wadelius, Mats G Gustafsson, Jan Komorowski.
High-throughput sequencing is becoming the standard tool for investigating protein-DNA interactions or epigenetic modifications. However, the data generated will always contain noise due to e.g. repetitive regions or non-specific antibody interactions. The noise will appear in the form of a background distribution of reads that must be taken into account in the downstream analysis, for example when detecting enriched regions (peak-calling). Several reported peak-callers can take experimental measurements of background tag distribution into account when analysing a data set. Unfortunately, the background is only used to adjust peak calling and not as a pre-processing step that aims at discerning the signal from the background noise. A normalization procedure that extracts the signal of interest would be of universal use when investigating genomic patterns.
We formulated such a normalization method based on linear regression and made a proof-of-concept implementation in R and C++. It was tested on simulated data as well as on publicly available ChIP-seq data on binding sites for two transcription factors, MAX and FOXA1, and two control samples, Input and IgG. We applied three different peak-callers to (i) raw (un-normalized) data using statistical background models, (ii) raw data with control samples as background, and (iii) normalized data without additional control samples as background. The fraction of called regions containing the expected transcription factor binding motif was largest for the normalized data. Evaluation with qPCR data for FOXA1 suggested higher sensitivity and specificity using normalized data than using raw data with experimental background.
The proposed method can handle several control samples allowing for correction of multiple sources of bias simultaneously. Our evaluation on both synthetic and experimental data suggests that the method is successful in removing background noise.
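To illustrate the regression idea, the following is a minimal sketch and not the authors' R/C++ implementation, whose details are not given in this section. It assumes reads have already been counted in fixed-width genomic bins, fits ordinary least squares of the ChIP counts on one or more control tracks (for example Input and IgG), and reports the non-negative residual as the normalized signal. The function names (`solve`, `fit_background`, `normalize`), the binning, the intercept term and the clipping at zero are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("control tracks are constant or collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_background(chip, controls):
    """Least-squares fit of chip[i] ~ b0 + sum_j b_j * controls[j][i].

    chip: list of per-bin ChIP counts (hypothetical binning).
    controls: list of control tracks, each the same length as chip.
    Returns [b0, b1, ..., bk].
    """
    rows = [[1.0] + [float(track[i]) for track in controls]
            for i in range(len(chip))]
    p = len(controls) + 1
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * y for r, y in zip(rows, chip)) for j in range(p)]
    return solve(xtx, xty)


def normalize(chip, controls):
    """Subtract the fitted background from each bin, clipping at zero (an assumption)."""
    beta = fit_background(chip, controls)
    normalized = []
    for i, y in enumerate(chip):
        background = beta[0] + sum(b * track[i]
                                   for b, track in zip(beta[1:], controls))
        normalized.append(max(0.0, y - background))
    return normalized


# Toy usage with made-up counts: two control tracks and one enriched bin.
input_track = [1, 2, 3, 4, 5, 6, 7, 8]
igg_track = [2, 1, 4, 3, 6, 5, 8, 7]
chip_track = [1 + 2 * a + 0.5 * b for a, b in zip(input_track, igg_track)]
chip_track[3] += 20  # simulated binding site
signal = normalize(chip_track, [input_track, igg_track])
```

Because the fit in this sketch uses all bins, truly enriched bins pull the estimated background upwards slightly; a practical implementation would need to address this, and how the published method does so is not covered by the text above. The output has one value per bin, so it keeps the same layout as the binned raw data.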
High-throughput sequencing of chromatin immunoprecipitated DNA, or ChIP-seq, has replaced microarray-based techniques as the standard tool for investigating protein-DNA interactions in the cell. However, the data generated will always contain noise due to sequencing biases, PCR artefacts, low-complexity regions/mappability, chromatin structure or non-specific antibody interactions in the ChIP step. The noise appears as a background distribution of reads, or tags, which must be taken into account in downstream analyses such as peak-calling.
Normalization is a vital part of any next-generation sequencing study. For microarray-based techniques there exist many different types of normalization methods directed at different sources of bias (e.g. dye effects or background noise). To the best of our knowledge, there has so far been no normalization method for ChIP-seq data that globally addresses effects, such as non-specific antibody interactions or background noise, that can be suppressed using control experiments. Many of the existing peak-callers are tailor-made for ChIP-seq data and can use a background model based on experimental control data, rather than purely theoretical statistical assumptions, to filter out regions that are also enriched in the control data. However, these approaches are inherently designed for peak calling and are therefore not easily transformed into universal normalization methods. To be compatible with any type of analysis performed on ChIP-seq data, it is also imperative that the resulting normalized signal is reported in the same format as the raw ChIP-seq data.
The authors declare that they have no competing interests.
SE and CRA conceived of the study, its design and wrote the manuscript. MGG, CW and JK participated in the design of the study and manuscript writing. RA participated in the design of the study. All authors read and approved the final manuscript.