Research Article: A fast and accurate enumeration-based algorithm for haplotyping a triploid individual

Date Published: June 1, 2018

Publisher: BioMed Central

Author(s): Jingli Wu, Qian Zhang.

http://doi.org/10.1186/s13015-018-0129-0

Abstract

Haplotype assembly, reconstructing haplotypes from sequence data, is one of the major computational problems in bioinformatics. Most of the current methodologies for haplotype assembly are designed for diploid individuals. In recent years, genomes having more than two sets of homologous chromosomes have attracted many research groups that are interested in the genomics of disease, phylogenetics, botany and evolution. However, there is still a lack of methods for reconstructing polyploid haplotypes.

In this work, the minimum error correction with genotype information (MEC/GI) model, an important combinatorial model for haplotyping a single individual, is used to study the triploid individual haplotype reconstruction problem. A fast and accurate enumeration-based algorithm enumeration haplotyping triploid with least difference (EHTLD) is proposed for solving the MEC/GI model. The EHTLD algorithm tries to reconstruct the three haplotypes according to the order of single nucleotide polymorphism (SNP) loci along them. When reconstructing a given SNP site, the EHTLD algorithm enumerates three kinds of SNP values in terms of the corresponding site’s genotype value, and chooses the one, which leads to the minimum difference between the reconstructed haplotypes and the sequenced fragments covering that SNP site, to fill the SNP loci being reconstructed.

Extensive experimental comparisons were performed between the EHTLD algorithm and the well known HapCompass and HapTree. Compared with algorithms HapCompass and HapTree, the EHTLD algorithm can reconstruct more accurate haplotypes, which were proven by a number of experiments.

Partial Text

As a large number of sequencing data are available, the investigation of genetic variations has become one of the main topics in bioinformatics. Single nucleotide polymorphism (SNP), the most widespread form of variation, is believed to be the major genetic cause to phenotypic variability. A sequence of SNPs along a chromosome is referred to as a haplotype, which is more important for complete comprehending the complex genetic polymorphisms than isolated SNPs. Increasing evidence shows that haplotypes play a crucial role in studying the variations relating to diseases prediction and gene expression [1]. Therefore, computational methods to infer haplotypes are needed, for determining haplotypes is both time consuming and expensive by direct using biological experiments. In recent decade, the presented computational haplotyping algorithms generally fall into three categories [2]: (1) population haplotyping with genotype data [3, 4]; (2) population haplotyping with fragment data [5]; (3) individual haplotyping with fragment data [6]. In this paper, individual haplotyping problem is studied for a triploid individual.

Triploid somatic cells contain three sets of chromosomes, i.e., a triploid organism has three copies of each chromosome. Since haplotype consists of the sequence of all SNPs along a chromosome, a triploid individual owns three haplotypes. It is commonly regarded that a SNP locus shows merely two possible alleles, hence the major allele can be represented as documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$text{`}0text{‘}$$end{document}`0’ and the minor one can be represented as documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$text{`}1text{‘}$$end{document}`1’. A haplotype can be encoded as a string over a 2-letter alphabet documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${0,1}$$end{document}{0,1} instead of four real bases {A,T,C,G}. A genotype is the conflation of three haplotypes on the homologous chromosomes. When three alleles at a SNP site have identical values, this SNP site is called a homozygous site, otherwise it is called a heterozygous site. For example, documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$(000)^{T}$$end{document}(000)T or documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$(111)^{T}$$end{document}(111)T represents the genotype value at a homozygous SNP site, while documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$(001)^{T}$$end{document}(001)T or documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$(011)^{T}$$end{document}(011)T represents the genotype value at a heterozygous SNP site. Suppose that m aligned SNP fragments, coming from three haplotypes of length n, are generated by DNA sequencing experiments. Let M denote an documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$mtimes n$$end{document}m×n SNP matrix over the alphabet documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$${0,1,-}$$end{document}{0,1,-} (− denotes the value is null). As shown in Fig. 1a, each row represents a SNP fragment, each column represents a SNP site, and each entry M[i, j] denotes the SNP allele of the ith fragment at the jth SNP site. Let documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$G=(g_{1},g_{2},ldots ,g_{n})$$end{document}G=(g1,g2,…,gn) denote the genotype matrix corresponding to M, where documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$g_{j}=(g_{j1},g_{j2},g_{j3})^{T} (g_{jk}in {0,1}, quad k=1,2,3, quad j=1,2,ldots ,n)$$end{document}gj=(gj1,gj2,gj3)T(gjk∈{0,1},k=1,2,3,j=1,2,…,n) denotes the genotype value at the jth SNP site. Figure 1b shows an example of the genotype matrix.Fig. 1An example of SNP fragment matrix and genotype matrix. a SNP fragment matrix M , b genotype matrix G

In this section, the EHTLD algorithm is described. The input consists of a SNP matrix M and a genotype matrix G. The output is three assembled haplotypes h = (documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$h_{1}$$end{document}h1, documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$h_{2}$$end{document}h2, documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$h_{3}$$end{document}h3) of length n. In the first step of this algorithm, the matrices M and G are preprocessed by removing the homozygous SNPs, which do not play a role in assembling haplotypes. Subsequently, enumerates three kinds of values for the jth (j = 2, 3,…,n) SNP site in terms of its genotype, and chooses the one leading to the minimum difference between the reconstructed haplotypes and the fragments covering the jth site. After this iteration process is completed, three haplotypes documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$h’=(h’_{1}, h’_{2}h’_{3}$$end{document}h′=(h1′,h2′h3′) having only heterozygous SNP sites are built, for only heterozygous SNPs are remained in the preprocessed matrices. Finally, documentclass[12pt]{minimal}
usepackage{amsmath}
usepackage{wasysym}
usepackage{amsfonts}
usepackage{amssymb}
usepackage{amsbsy}
usepackage{mathrsfs}
usepackage{upgreek}
setlength{oddsidemargin}{-69pt}
begin{document}$$h’$$end{document}h′ is augmented by inserting the SNPs discarded in preprocessing step and the final haplotypes h is obtained. Some key steps of the EHTLD algorithm will be introduced in detail as follows.

In this section, the EHTLD algorithm is compared with two state-of-the-art algorithms, i.e., the HapCompass [12] and the HapTree [13] algorithms. Algorithms EHTLD and HapCompass were implemented on a Windows 7 and the compiler was Microsoft Visual C# 2012. The Python program HapTree (v0.1), downloaded from http://groups.csail.mit.edu/cb/haptree/, was implemented on a Linux system. All the tests below are conducted on a 64 bit PC with Intel Core i5 2.50GHz CPU and 6GB RAM. One hundred data sets were generated for each parameter setting. The average over 100 runs at each parameter setting was calculated and presented.

The minimum error correction with genotype information (MEC/GI) model is one of the important computational models for solving single individual SNP haplotyping problem. In this paper, an enumeration-based algorithm EHTLD is presented for haplotyping a triploid single individual by using this model. Algorithm EHTLD reconstructs the three haplotypes according to the order of SNP loci along them. For a SNP site being reconstructed, the EHTLD algorithm enumerates three possible values in terms of the site’s genotype, and chooses the one leading to the minimum difference between the reconstructed haplotypes and the fragments covering that SNP site. The reconstructed alleles of a SNP site mainly depend on the fragments which cover the site, and are little affected by other former reconstructed alleles. Therefore, the former wrongly reconstructed SNP alleles would not affect the latter reconstructed SNP alleles, i.e., reconstructed errors on the former SNP alleles would not spread to the latter ones. The kind of enumeration strategy can also be apply to reconstruct haplotypes of other ploidy, which will be studied in the future.

 

Source:

http://doi.org/10.1186/s13015-018-0129-0

 

Leave a Reply

Your email address will not be published.