Research Article: Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays

Date Published: June 17, 2019

Publisher: Public Library of Science

Author(s): Rajiv Movva, Peyton Greenside, Georgi K. Marinov, Surag Nair, Avanti Shrikumar, Anshul Kundaje, Chun-Hsi Huang.


The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We also explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and to fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.

Partial Text

Changes in gene expression play a crucial role in a wide variety of cellular processes. Dissecting the precise mechanisms of gene regulation is therefore necessary to understand both the normal functioning of cells and the ways in which dysregulation of certain genes plays a role in disease states [1]. Gene expression in metazoans is regulated by several distinct classes of cis-regulatory elements (promoters, enhancers, insulators, and others), with the activity of multiple enhancers being integrated to determine the expression levels of the average mammalian gene [2]. The activity of each enhancer or promoter element itself is driven by the concerted action of multiple DNA binding proteins called transcription factors (TFs), which typically bind to combinatorial grammars of short sequence motifs embedded in regulatory DNA sequences.

Our overall workflow consists of three major components. First, we train and optimize CNNs that predict regulatory activity of noncoding DNA sequences as measured by MPRAs. Next, we estimate the predictive contributions (importance) of individual nucleotides in input DNA sequences and compare these to DNA sequence features with known biological function. Finally, we present case studies focused on discovering novel regulatory sequence grammars and identifying putative functional genetic variants associated with gene expression variation. In this section, we first provide an overview of MPRA experiments, and then discuss the details of these steps.
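The second step of the workflow, scoring the contribution of individual nucleotides, can be illustrated with a minimal Python sketch. It uses in silico mutagenesis, a simple perturbation-based technique chosen here for illustration; it is not necessarily the interpretation method used in the paper. The function `toy_activity` is a hypothetical stand-in for a trained CNN that simply counts occurrences of an assumed motif (`GATA`), and all names in the block are illustrative assumptions rather than part of MPRA-DragoNN.

```python
# Toy sketch: one-hot encoding plus in silico mutagenesis (ISM).
# toy_activity is a hypothetical stand-in for a trained CNN; the motif
# "GATA" is an arbitrary choice made for illustration only.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats in A, C, G, T order.

    Characters outside ACGT (for example N) become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def toy_activity(seq, motif="GATA"):
    """Hypothetical model: number of (overlapping) motif occurrences."""
    seq = seq.upper()
    return float(sum(seq[i:i + len(motif)] == motif
                     for i in range(len(seq) - len(motif) + 1)))


def ism_scores(seq, predict=toy_activity):
    """Per-position importance via in silico mutagenesis.

    The score at each position is the reference prediction minus the mean
    prediction over the three alternative bases at that position.
    """
    seq = seq.upper()
    ref = predict(seq)
    scores = []
    for i, base in enumerate(seq):
        alts = [predict(seq[:i] + b + seq[i + 1:]) for b in BASES if b != base]
        scores.append(ref - sum(alts) / len(alts))
    return scores


if __name__ == "__main__":
    sequence = "CCGATACC"
    for base, score in zip(sequence, ism_scores(sequence)):
        print(base, round(score, 2))
```

In this toy setting only the four positions covered by the motif receive nonzero scores, which mirrors the idea of comparing per-nucleotide importance with known TF binding motifs. With a real model, `predict` would wrap the network's forward pass on the one-hot encoded sequence.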

In recent years, functional genomic assays, including numerous methods for profiling chromatin features, MPRAs, and pooled CRISPR perturbation screens, have produced genomic data at unprecedented breadth, depth, and detail. MPRAs in particular present a highly scalable platform for finely dissecting the regulatory code of individual noncoding DNA elements, as they allow large numbers of short sequences to be tested in parallel and in diverse cellular contexts. The expression-based readout of MPRAs is complementary to other assays like ChIP-seq (protein binding) and DNase-seq (chromatin accessibility), which do not directly measure effects on gene expression. Predictive models trained on MPRA data are hence more likely to be sensitive to functional regulatory patterns that affect gene expression.