Research Article: Using graph models to find transcription factor modules: the hitting set problem and an exact algorithm

Date Published: January 16, 2013

Publisher: BioMed Central

Author(s): Songjian Lu, Xinghua Lu.


Systematically perturbing a cellular system and monitoring the effects of the perturbations on gene expression provide a powerful approach to study signal transduction in gene expression systems. A critical step of revealing a signal transduction pathway regulating gene expression is to identify transcription factors transmitting signals in the system. In this paper, we address the task of identifying modules of cooperative transcription factors based on results derived from systems-biology experiments at two levels: First, a graph algorithm is developed to identify a minimum set of co-operative TFs that covers the differentially expressed genes under each systematic perturbation. Second, using a clique-finding approach, modules of TFs that tend to consistently cooperate together under various perturbations are further identified. Our results indicate that this approach is capable of identifying many known TF modules based on the individual experiment; thus we provide a novel graph-based method of identifying context-specific and highly reused TF-modules.

Partial Text

In order to survive, a cell responds to a variety of environmental and internal perturbations, e.g., environmental stresses and gene mutations respectively. A common response to cellular perturbations is to activate gene expression programs that induce or repress expression of genes to cope with changed homeostatus. Signals which originate as a result of the perturbation are often propagated to transcription factors (TFs), which serve as bottlenecks of signal transduction pathways that regulate transcription programs. Often multiple TFs are involved in regulating one set of genes in a cooperative manner, hence they are referred to as a TF module, and their binding sites in the genome are often referred to as cis-regulatory modules.

Usually, finding an exact optimal solution for an NP-hard problem in practical time is difficult; it is not uncommon for such a program to run as long as months or years. Hence, heuristic or greedy algorithms are often designed to approximately solve NP-hard problems. However, heuristic or greedy algorithms cannot guarantee the performance. Therefore, in order to design an exact algorithm, we investigated the characteristic of our problem and found that the degrees of many genes were very small (i.e., they were only connected to a small number of TFs in the protein-DNA interaction graph [9,10]). More specifically, about 70% of genes had degrees less than or equal to 3, about 85% of genes had degrees which are at most 5, and about 96% of genes had degrees less than or equal to 10. This characteristic enabled us to design an efficient exact algorithm, which was presented in Figure 2, for the problem. In addition to its efficiency, our algorithm is the first exact algorithm that can solve a weighted t-cover hitting set problem.

When applied to the protein-DNA interaction graph induced from ChIP-chip experiments [9,10] and the results from microarray experiments from Hughes et al.,[8], our algorithms identify TF modules at two levels: First, given genes differentially expressed in each perturbation experiment, we find a set of cooperative TFs at the perturbation-instance level in a context-specific manner. Second, we further combine context-specific TFs to find TF modules that are repeatedly utilized at the system level. In the following subsections, we show that our approach produces biologically sensible results at both levels.

In this paper, we have developed graph-based approaches to address the problem of finding cooperative TF modules at two levels. First, given a set of co-regulated genes, we find a set of TFs that cooperatively regulate the genes in a context-specific manner. Second, given a collection of context-specific TF modules, we find the TFs that tend to function cooperatively in multiple instances at systems level, where the behind idea here is: if two TFs are working together in multiple time, then it is more possible that they are in the same TF module. For the first part, we cast the task as a weighted t-cover hitting set problem and developed an exact algorithm to solve the problem. The main contribution of this paper is that, by taking advantage of the knowledge of the limited gene degrees, we have developed a very efficient exact algorithm capable of solving the problem at hand in a practical time. For the second problem, we cast the task as a clique-finding problem, and our approach produced results that are biologically sensible and generate new biological hypotheses. Our graph-based approaches are significantly different from statistics-based approaches, hence providing a new perspective to study transcriptional regulation [2].

aIn applications, if there is any subset whose size is less than t, we add dummy element/elements to make its size to t.

The authors declare that they have no competing interests.

Both SL and XL analyzed biological data, established and modified the mathematical model, and drafted the manuscript. SL made the major contribution in designing the algorithm. Both authors read and approved the final manuscript.




Leave a Reply

Your email address will not be published.