Research Article: A Novel Method of Characterizing Genetic Sequences: Genome Space with Biological Distance and Applications

Date Published: March 2, 2011

Publisher: Public Library of Science

Author(s): Mo Deng, Chenglong Yu, Qian Liang, Rong L. He, Stephen S.-T. Yau, Sudhindra Gadagkar.

Abstract: Most existing methods for phylogenetic analysis involve developing an evolutionary model and then using some type of computational algorithm to perform multiple sequence alignment. There are two problems with this approach: (1) different evolutionary models can lead to different results, and (2) the computation time required for multiple alignments makes it impossible to analyse the phylogeny of a whole genome. This motivates us to create a new approach to characterize genetic sequences.

To each DNA sequence, we associate a natural vector based on the distributions of nucleotides. This produces a one-to-one correspondence between the DNA sequence and its natural vector. We define the distance between two DNA sequences to be the distance between their associated natural vectors. This creates a genome space with a biological distance, which makes global comparison of genomes with the same topology possible. We use our proposed method to analyze the genomes of the new influenza A (H1N1) virus, human rhinoviruses (HRV) and mammalian mitochondria. The results show that a triple-reassortant swine virus circulating in North America and the Eurasian swine virus belong to the lineage of the influenza A (H1N1) virus. For the HRV and mammalian mitochondrial genomes, the results coincide with biologists’ analyses.

Our approach provides a powerful new tool for analyzing and annotating genomes and their phylogenetic relationships. Whole or partial genomes can be handled more easily and more quickly than with multiple alignment methods. Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for subsequent applications, whereas in multiple alignment methods, realignment is needed to add new sequences. Furthermore, one can make a global comparison of all genomes simultaneously, which no other existing method can achieve.

Partial Text: Computational and statistical methods have been successfully applied to clustering DNA sequences, protein sequences and microarray data [1]–[8]. Yau and his group showed that the genomic space method was an efficient way to cluster DNA or protein sequences [9]–[13]. In [9], [11] each nucleic base or amino acid was assigned a specific value. For example, the nucleic base adenine (A) was assigned a specific pair of values [9]. This method can be used successfully to represent a DNA sequence in the Cartesian coordinate plane; however, the nucleotides are artificially assigned specific values that are not inherently related to DNA or protein sequences. In contrast, the parameters used in this work are natural because they are based on the numbers and distributions of nucleotides in the sequence.

As an application, we first use our method to analyze the new influenza A (H1N1) virus. Recent reports of widespread transmission of swine-origin influenza A (H1N1) viruses in humans in Mexico, the United States, and elsewhere highlighted this ever-present threat to global public health [20]. Much experimental effort has been made, and many important results have been obtained [18], [21]. Pigs have been hypothesized to act as a mixing vessel for the reassortment of avian, swine, and human influenza viruses and might play an important role in the emergence of novel influenza viruses capable of causing a human pandemic [22]–[24]. There have been many reports of recent transmissions of swine influenza viruses in humans [25]. The new strain was initially described as a triple reassortant of viruses from pigs, humans, and birds, called triple-reassortant swine influenza A (H1) viruses, which have circulated in pigs for more than a decade [20]. Subsequent analysis suggested it was a reassortment of just two strains, both found in swine [21]. Although initial reports identified the new strain as swine influenza (i.e., a zoonosis originating in swine), its origin is unknown from the point of view of whole genomes.

Here we used our proposed method to verify the origin of the A (H1N1) genomes. To demonstrate that our natural vector can be truly useful for answering biological questions, we performed hierarchical clustering analysis on the natural vectors of the genes of the swine influenza A (H1N1) virus. The Euclidean distance was used to measure the distance between natural vectors. Genomes from the outbreak of swine influenza A (H1N1), North American and Eurasian swine influenza virus genomes, and avian and human seasonal influenza virus genomes were analyzed. Each complete genome contains 8 complete gene-coding segments. We therefore used a 96-dimensional natural vector to represent a whole genome, since each segment can be characterized very well by a 12-dimensional natural vector. Based on our novel mathematical method and result, we can predict that the genome of the new swine influenza A (H1N1) is similar to swine viruses rather than to human seasonal influenza and avian viruses. Using the natural vector method and clustering method, we have reconstructed the complex reassortment history of the outbreak of swine influenza A (H1N1), summarized in figure 1. Our analysis showed that the swine influenza A (H1N1) genome was nested within a well-established triple-reassortant swine influenza A and Eurasian swine influenza A lineage (that is, a lineage circulating primarily in swine before the current outbreak).

In addition, we analyzed each of the 8 segments of the A (H1N1) genome separately: polymerase PB2, PB1 and PA, hemagglutinin HA, neuraminidase NA, nucleocapsid NP, matrix protein MP and nonstructural gene NS. Our results showed that the HA, NP and NS genes resemble those of classical swine influenza A viruses, and the PB2, PB1 and PA genes resemble those of triple-reassortant swine influenza A viruses circulating in pigs in North America, while the NA and MP genes are most closely related to those in influenza A viruses circulating in swine populations in Eurasia. The clustering results for these 8 gene segments obtained by our method coincide with the phylogenetic analysis results of Garten et al. [18] and the Novel Swine-Origin Influenza A (H1N1) Virus Investigation Team [21]. These conclusions have been widely accepted in the scientific community [35]. This result therefore suggests that the conclusion of Kou et al. [26], namely that the PB2 and PA genes came from avian influenza virus and PB1 from human seasonal influenza virus, is not fully convincing. As an illustration, the phylogenetic analysis result for PB2 is shown in Supporting Information S1. The results for the remaining seven individual segments are available from the author upon request. In this experiment, 12-dimensional natural vectors were used for clustering the swine influenza A (H1N1) gene sequences, since the higher moments in the natural vector were too small to play a role when n is large. The gene and genome data are provided in the Supporting Information. They can be downloaded from Flu Database of GenBank (
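The construction described above (a 12-dimensional vector per segment, 8 segments concatenated into a 96-dimensional genome vector, compared by Euclidean distance) can be sketched in Python. This is a minimal illustration, not the authors' code. The text states only that the vector is built from the numbers and distributions of nucleotides, so the specific layout assumed here (four counts, four mean positions, and four second central moments normalized by n_k * n) is an assumption, as are the function names `natural_vector`, `genome_vector` and `euclidean` and the choice to use 0 for the mean and moment of an absent nucleotide.

```python
import math

NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Return an assumed 12-dimensional natural vector for one sequence.

    Layout (an assumption, see lead-in): counts of A, C, G, T, then their
    mean positions, then their normalized second central moments.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in NUCLEOTIDES:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Assumed convention for a nucleotide that never occurs
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def genome_vector(segments):
    """Concatenate per-segment natural vectors (8 segments give 96 values)."""
    vec = []
    for seg in segments:
        vec.extend(natural_vector(seg))
    return vec


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Given two genomes, each supplied as a list of 8 segment strings, `euclidean(genome_vector(g1), genome_vector(g2))` gives the pairwise distance that a hierarchical clustering routine could then take as input.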

In this paper, we report a new mathematical method to characterize a genetic sequence as a natural vector, so that we can perform clustering analysis and create a phylogenetic tree based on it. A natural vector system to represent a DNA sequence is introduced, and the correspondence between a DNA sequence and its natural vector is mathematically proved to be one-to-one. With this natural vector system, each genome sequence can be represented as a multidimensional vector. Genomes with a close evolutionary relationship and similar properties are plotted close to each other when we construct the phylogenetic tree. The method thus provides a powerful new tool for analyzing and annotating genomes and their phylogenetic relationships, and it handles whole or partial genomes more easily and more quickly than multiple alignment methods.

There are four major advantages to our method: (1) Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for any subsequent application, whereas in multiple alignment methods, realignment is needed for adding new sequences. (2) One can perform global comparison of all genomes simultaneously, which no other existing method can achieve. (3) Our method is quicker than alignment methods and easier to manipulate, because not all dimensions of the natural vectors are needed for computing. Instead, the first several dimensions of the natural vectors are good enough to cluster DNA sequences or genomes. Generally, we select the first N dimensions such that the clustering result remains stable even if we include higher moments. In our experiments, N = 12 is good enough to characterize all sequences. We can compare genes, DNA sequences and genomes of different lengths by truncating their different (n+4)-dimensional natural vectors to the same number of dimensions. The one-to-one correspondence between the truncated natural vectors (with 12 or more dimensions) and sequences is still valid. (4) The current standard methods involve evolutionary models, and different choices of these models can lead to inconsistent results (figure 4 (a, b, c)). There is no evidence to show which model can best fit all biological datasets without human intervention (likelihood ratio test used as a priority). This motivates us to create a new mathematical method without any model. Our method does not involve these models and depends entirely on the natural vectors constructed from the whole sequences. Therefore, this method is stable and natural, and it produces a unique clustering or phylogenetic result.
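Advantages (1) and (3) can be illustrated with a small sketch of a stored genome space. This is a hypothetical design, not the authors' database: the class name `GenomeSpace`, its methods, and the in-memory dictionary standing in for a database are all assumptions. It shows that adding a sequence only requires storing its truncated vector, so existing entries and their pairwise distances are untouched.

```python
import math


class GenomeSpace:
    """Hypothetical in-memory store of truncated natural vectors."""

    def __init__(self, n_dims=12):
        # Number of leading dimensions kept for every stored vector
        self.n_dims = n_dims
        self.vectors = {}

    def add(self, name, vec):
        """Store the first n_dims entries of a precomputed natural vector."""
        if len(vec) < self.n_dims:
            raise ValueError("vector has fewer dimensions than n_dims")
        self.vectors[name] = tuple(vec[: self.n_dims])

    def distance(self, a, b):
        """Euclidean distance between two stored vectors (Python 3.8+)."""
        return math.dist(self.vectors[a], self.vectors[b])

    def nearest(self, name):
        """Name of the stored vector closest to the given one."""
        others = [k for k in self.vectors if k != name]
        if not others:
            raise ValueError("no other vectors stored")
        return min(others, key=lambda k: self.distance(name, k))
```

A caller would compute each natural vector once, call `add`, and later query `nearest` or `distance` without recomputing anything for the sequences already stored. To check that a chosen `n_dims` is large enough, one could repeat the clustering with a larger value and confirm the result does not change.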