Research Article: Estimating population size via line graph reconstruction

Date Published: July 5, 2013

Publisher: BioMed Central

Author(s): Bjarni V Halldórsson, Dima Blokh, Roded Sharan.

http://doi.org/10.1186/1748-7188-8-17

Abstract

We propose a novel graph theoretic method to estimate haplotype population size from genotype data. The method considers only the potential sharing of haplotypes between individuals and is based on transforming the graph of potential haplotype sharing into a line graph using a minimum number of edge and vertex deletions.

We show that the resulting line graph deletion problems are NP complete and provide exact integer programming solutions for them. We test our approach using extensive simulations of multiple population evolution and genotypes sampling scenarios. Our results also indicate that the method may be useful in comparing populations and it may be used as a first step in a method for haplotype phasing.

Our computational experiments show that when most of the sharings are true sharings the problem can be solved very fast and the estimated size is very close to the true size; when many of the potential sharings do not stem from true haplotype sharing, our method gives reasonable lower bounds on the underlying number of haplotypes. In comparison, a naive approach of phasing the input genotypes provides trivial upper bounds of twice the number of genotypes.

Partial Text

A fundamental problem in population studies is the estimation of the size of the underlying haplotype pool. In these studies, such as genomewide association studies, a number of individuals are genotyped at some single nucleotide polymorphism (SNP) locations. Since we cannot observe the haplotypes of an individual, a common approach to the size estimation problem is to phase the genotype data, i.e., computationally predict the underlying haplotypes. However, haplotype phasing is a notoriously difficult problem [1] and can be optimally solved for small instances only [2].

Let G = (V,E) be a graph with a set V of n vertices and a set E of m edges. The line graph of G, denoted by L(G), is constructed by having a node in L(G) for each edge in G and an edge between two nodes of L(G) if the corresponding edges are adjacent in G.

Here we provide integer programming formulations for Problems 2.1, 2.2 and 2.3. The formulations rely on the characterization given in Lemma 2.1. We let [i,j] represent {i,i + 1,…,j}.

We comprehensively test our algorithm for deleting edges and vertices on data from the International HapMap Consortium [5] and under two simulated scenarios, corresponding to a bottleneck population isolate and a population that has continuously undergone recombination and mutation. For the population isolate we show that we can almost fully recover the haplotype structure from the sharing information alone. For the population that has undergone continuous recombination and mutation we get a bound on the number of haplotypes that is close to the true number of haplotypes in the population. Our experiments further reveal that the occurrence of genotypes showing a large degree of sharing does not materially affect our ability to estimate the number of haplotypes in the remaining population. The computational experiments were done using CPLEX 12, making use of a single CPU processor core with 4GB of memory.

We show that the problem of assigning alleles to individuals when only information about the sharing of alleles between individuals is known is equivalent to the problem of determining whether a graph is a line graph. When sharing information is not perfect we give polynomial size integer programming algorithms for determining allele sharing. We show how this method can be used to estimate the genetic diversity of different populations. We suspect that the method proposed here may be useful in computing important parameters in population genetics such as the effective population size, however a careful analysis of its statistical properties is necessary.

The authors declare that they have no competing interests.

Methodology was developed by BVH and RS. Experiments were run by DB. Manuscript was written by BVH, DB and RS. All authors read and approved the final manuscript.

 

Source:

http://doi.org/10.1186/1748-7188-8-17

 

Leave a Reply

Your email address will not be published.