Date Published: March 27, 2019
Publisher: Public Library of Science
Author(s): Lu Cheng, Mu Zhu, Li Chen.
We study computational approaches for detecting SNP-SNP interactions that are characterized by a set of “two-locus, two-allele, two-phenotype and complete-penetrance” disease models. We argue that existing methods, which use data to determine a best-fitting disease model for each pair of SNPs prior to screening, may be too greedy. We present a less greedy strategy which, for each given pair of SNPs, limits the number of candidate disease models to a set of prototypes determined a priori.
For many years, scientists have tried to identify single-nucleotide polymorphisms (SNPs) that are associated with various diseases, but it has become apparent that single genetic variations can explain only a small fraction of heritability. This has come to be known as the “missing heritability problem” [1–3], and has prompted many scientists to conjecture that SNP-SNP interactions may be more prevalent than previously thought.
Before describing our approach in more detail, we first motivate it by discussing some weaknesses of existing methods. We should emphasize that these are merely examples of scenarios in which PTY can be seen to have certain advantages over MDR and RS. They are by no means the only such scenarios, or even necessarily the main ones. We present them, rather than others, because they are relatively easy to describe with reasonable clarity, whether algebraically (Section 2.1), verbally (Section 2.2), or both (Section 2.3).
We now describe our approach in more detail. First, we derive a metric to measure the similarity (or equivalently, difference) between two disease models. Then, we cluster all disease models into a few groups and select a prototype model from each group. Finally, we screen each pair of SNPs against the set of prototype models. The set of prototype models is decided a priori, without considering the disease status of individuals in the data set. This is what makes our approach less greedy, and less data-adaptive, than existing methods such as MDR and RS.
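The three steps above can be illustrated with a minimal sketch. It is not the procedure used in this paper: the Hamming distance stands in for the similarity metric derived here, plain k-medoids stands in for the clustering step, and a 2x2 chi-square statistic stands in for the screening score. All function names (`all_models`, `hamming`, `k_medoids`, `screen_pair`) are hypothetical. The only fact taken from the setting is that a two-locus, two-allele, two-phenotype, complete-penetrance model assigns a binary phenotype to each of the 3 x 3 genotype combinations, giving 2^9 = 512 candidate models.

```python
def all_models():
    """Each model is an int in [0, 512); bit 3*a + b is the phenotype
    assigned to genotype a at SNP 1 and genotype b at SNP 2."""
    return list(range(2 ** 9))

def hamming(m1, m2):
    """Assumed stand-in metric: number of genotype cells where two models differ."""
    return bin(m1 ^ m2).count("1")

def assign(models, medoids):
    """Group every model with its nearest medoid."""
    clusters = {p: [] for p in medoids}
    for m in models:
        clusters[min(medoids, key=lambda p: hamming(m, p))].append(m)
    return clusters

def k_medoids(models, k, n_iter=20):
    """Plain k-medoids with deterministic farthest-point initialisation."""
    medoids = [models[0]]
    while len(medoids) < k:
        medoids.append(max(models, key=lambda m: min(hamming(m, p) for p in medoids)))
    for _ in range(n_iter):
        clusters = assign(models, medoids)
        # New medoid: the member with the smallest total distance to its cluster.
        updated = [min(members, key=lambda c: sum(hamming(c, m) for m in members))
                   for members in clusters.values()]
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(models, medoids)

def screen_pair(g1, g2, status, prototypes):
    """Score one SNP pair against the fixed prototype set only.
    g1, g2: genotypes coded 0/1/2; status: 0 = control, 1 = case.
    Returns (best chi-square, best prototype)."""
    best = (0.0, None)
    for p in prototypes:
        n = [[0, 0], [0, 0]]  # n[predicted][observed]
        for a, b, s in zip(g1, g2, status):
            n[(p >> (3 * a + b)) & 1][s] += 1
        rows = [sum(n[0]), sum(n[1])]
        cols = [n[0][0] + n[1][0], n[0][1] + n[1][1]]
        if 0 in rows or 0 in cols:
            continue  # degenerate table, no information
        det = n[0][0] * n[1][1] - n[0][1] * n[1][0]
        chi2 = sum(rows) * det ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])
        if chi2 > best[0]:
            best = (chi2, p)
    return best

# The prototypes are chosen before any phenotype data are seen.
prototypes, groups = k_medoids(all_models(), k=8)
```

Because `prototypes` is computed without reference to `status`, the model set searched in `screen_pair` is the same for every SNP pair, which is the sense in which the strategy is less data-adaptive. The choice of `k=8` is arbitrary and purely illustrative.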
To motivate our approach, we have already presented a few simulated examples in Section 2, where we concentrated on evidence that our approach appears to overcome various weaknesses of existing approaches. In this section, we assess our approach more generally with a number of simulated examples that are commonly examined in the literature.
In this section, we report our analysis of the phase I bipolar disorder data from the Wellcome Trust Case Control Consortium (WTCCC). Because our method is aimed at screening SNP pairs for different epistatic effects (rather than individual SNPs for main effects), we focus on the complementary value that our method offers, in particular its ability to find relevant SNPs that other methods may still miss.
This paper is concerned with screening pairs of SNPs, rather than just individual SNPs, for their association with various phenotypes. The complication is that there are many mechanisms—corresponding to different epistatic effects and described by different disease models—for a pair of SNPs to be associated with the outcome.