Research Article: Virtual Northern Analysis of the Human Genome

Date Published: May 23, 2007

Publisher: Public Library of Science

Author(s): Evan H. Hurowitz, Iddo Drori, Victoria C. Stodden, David L. Donoho, Patrick O. Brown, Juan Valcarcel.

Abstract: We applied the Virtual Northern technique to human brain mRNA to systematically measure human mRNA transcript lengths on a genome-wide scale.

Partial Text: Now that the human genome sequence is nearly complete [1]–[3], the next step is to characterize the organization, function, and diversity of the human genome. Reliable computational detection and analysis of genes in mammalian genomes remains a challenge due to the low percentage of coding sequence, the existence of many short exons and long introns, and the high diversity of alternate transcript forms [1]. Therefore, most efforts to annotate the human genome have relied heavily on the analysis of expressed sequences generated from human RNA. Recently, however, the focus has shifted from the generation of ESTs, which are generally short clones representing a fraction of their parent transcript, to the generation of full-length cDNAs. As a result of several large-scale full-length cDNA sequencing projects, over 20,000 human genes have been validated by at least one putative full-length cDNA [4].

We applied the Virtual Northern technique to the human genome. Using mRNA purified from human brain as our sample, we obtained provisional length measurements from 21,257 cDNA clones representing a total of 11,536 human genes. Thus, we were able to derive at least one measurement of transcript length for nearly half of the 25,000 genes the human genome is predicted to encode, and for 6,238 of those genes at high (≥90%) confidence. This is a reasonably high fraction considering that we analyzed mRNA from only a single organ, albeit the organ with the highest transcriptional diversity [12], [13]. Our transcript length dataset has a mean of 2,165 nucleotides and a median of 1,996 nucleotides. These numbers agree well with previous estimates for the human genome [1]. At high (≥90%) confidence, only about 1.3% of the clones in our dataset detected two transcript lengths. Current estimates for alternative splicing are that 74% of multi-exon genes have alternate splice forms, and that alternatively spliced genes have an average of 2.7 different splice forms [11], [4]. Our detection rate for alternative transcript variants is expected to fall short of those estimates for two reasons. First, we examined only a single tissue, so our analysis could detect only transcript variants expressed in the brain. Second, our length fractionation procedure had a theoretical maximum resolution of about 5–6% of total transcript length. Any transcript variants whose lengths differ by less than that would not be reliably resolved. That range is sufficient to exclude detection of alternative splices resulting from the use of alternate exons of similar length, the inclusion/exclusion of a single short exon, or the use of alternate nearby 5′ or 3′-splice sites.
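
The arithmetic behind the coverage and resolution statements above can be checked with a minimal Python sketch. The function names are hypothetical and are not part of the published analysis. The sketch assumes that the 5–6% resolution is measured relative to the longer of two transcripts (the text says only "total transcript length"), and it uses 6%, the upper end of the stated range, as a default.

```python
def gene_coverage(genes_measured: int, genes_predicted: int) -> float:
    """Fraction of predicted genes with at least one length measurement."""
    return genes_measured / genes_predicted


def min_resolvable_difference(length_nt: float, resolution: float = 0.06) -> float:
    """Smallest length difference (in nucleotides) expected to be resolved.

    Assumption: `resolution` is a fraction of the transcript length.
    """
    return length_nt * resolution


def is_resolvable(len_a: int, len_b: int, resolution: float = 0.06) -> bool:
    """True if two variants differ by at least `resolution` of the longer one.

    Assumption: the longer transcript is used as the reference length.
    """
    longer = max(len_a, len_b)
    return abs(len_a - len_b) >= min_resolvable_difference(longer, resolution)


if __name__ == "__main__":
    # Figures taken from the text: 11,536 genes measured out of ~25,000 predicted.
    print(f"coverage: {gene_coverage(11_536, 25_000):.1%}")
    # Resolution limit at the reported median transcript length (1,996 nt).
    print(f"limit at median: {min_resolvable_difference(1_996):.0f} nt")
    # Illustrative variants: a 50 nt difference versus a 500 nt difference.
    print(is_resolvable(2_000, 2_050), is_resolvable(2_000, 2_500))
```

Under these assumptions, a variant that differs from a median-length transcript by a single short exon of a few dozen nucleotides falls below the limit, which is consistent with the low rate of clones detecting two transcript lengths.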