Date Published: March 21, 2019
Publisher: Public Library of Science
Author(s): Xiaolei Zhang, Weijun Pan, Vincenzo De Luca.
Multiscale signal processing techniques such as wavelet filtering have proved particularly successful in predicting exon sequences. The traditional wavelet predictor is a domain filter that enforces exon features by weighting nucleotide values with coefficients. Because it performs linear filtering, it is not suitable for preserving short coding exons and exon-intron boundaries. This paper describes a prediction framework that processes DNA sequences non-linearly while achieving high prediction rates. There are two key contributions. The first is the introduction of a genomic-inspired multiscale bilateral filter (MSBF), which exploits both weighting coefficients in the spatial domain and nucleotide similarity in the range. Like the wavelet transform, the MSBF is defined as a weighted sum of nucleotides; the difference is that the MSBF takes into account the variation of nucleotides at a specific codon position. The second contribution is the exploitation of inter-scale correlation in the MSBF domain to capture how the differences between the exon signal and the background noise depend on scale. This favourable property is used to sharpen the important structures while weakening noise. Three benchmark data sets were used to evaluate the methods considered. Compared with four existing techniques, the proposed method shows improvements of at least 4.1%, 50.5%, 25.6%, 2.5%, 10.8%, 15.5%, 11.1%, 12.3%, 9.2% and 2.4% for exons of length 1–24, 25–49, 50–74, 75–99, 100–124, 125–149, 150–174, 175–199, 200–299 and 300+ bp, respectively. Owing to its nonlinear nature, the MSBF is good at energy compaction, which makes it capable of locating the sharp variations around short exons. Direct multiplication of coefficients at several adjacent scales clearly enhanced exon features while suppressing noise.
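The two ideas in the abstract, a filter whose weights combine spatial distance with value similarity and the multiplication of responses across adjacent scales, can be illustrated with a minimal Python sketch. It is not the authors' MSBF: it applies a generic 1-D bilateral filter to an already-numeric signal and ignores codon position. The function names, the Gaussian kernels, and the parameters `sigma_s` and `sigma_r` are assumptions made for illustration.

```python
import numpy as np


def bilateral_filter_1d(signal, sigma_s, sigma_r):
    """Generic 1-D bilateral filter (illustrative, not the paper's MSBF).

    sigma_s controls the spatial (domain) Gaussian and sets the scale;
    sigma_r controls the value-similarity (range) Gaussian.
    """
    signal = np.asarray(signal, dtype=float)
    radius = int(np.ceil(3 * sigma_s))
    offsets = np.arange(-radius, radius + 1)
    spatial_w = np.exp(-0.5 * (offsets / sigma_s) ** 2)
    # Repeat the edge values so every position has a full window.
    padded = np.pad(signal, radius, mode="edge")
    out = np.empty_like(signal)
    for i in range(signal.size):
        window = padded[i:i + 2 * radius + 1]
        # Samples whose values differ strongly from the centre get little weight,
        # which is what keeps sharp transitions from being smeared.
        range_w = np.exp(-0.5 * ((window - signal[i]) / sigma_r) ** 2)
        weights = spatial_w * range_w
        out[i] = np.sum(weights * window) / np.sum(weights)
    return out


def multiscale_product(signal, sigmas_s, sigma_r):
    """Pointwise product of the filter outputs at several scales.

    Assumes a non-negative input: structure that persists across scales
    is reinforced, while small values that vary between scales shrink.
    """
    product = np.ones(len(signal))
    for sigma_s in sigmas_s:
        product *= bilateral_filter_1d(signal, sigma_s, sigma_r)
    return product


# Hypothetical usage: a noisy step standing in for an exon boundary.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(20), np.ones(20)])
noisy = clean + 0.05 * rng.standard_normal(clean.size)
smoothed = bilateral_filter_1d(noisy, sigma_s=2.0, sigma_r=0.3)
enhanced = multiscale_product(noisy, sigmas_s=(1.0, 2.0, 4.0), sigma_r=0.3)
```

With a small `sigma_r`, samples on opposite sides of the step barely influence each other, so the boundary stays sharp. A linear filter of the same width would blur it.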
We show that the non-linear nature and correlation-based property of the proposed predictor give it an advantage over traditional filtering, which leads to better exon prediction performance. The predictor also has possible applications beyond exon prediction: its good localization and preservation of sharp variations should make it suitable for fault diagnosis of aero-engines.
Recent advances in high-throughput analysis, such as next-generation sequencing, have driven the development of computational techniques for the rapid prediction of exons in DNA sequences. Although great progress has been made in exon prediction algorithms, the challenge of determining the lengths and locations of short exons urgently needs to be solved [1–3]. The main difficulty in predicting short exons is that their intrinsic properties, such as codon biases, are harder to determine [3,4]. To date, there is no consensus on the definition and classification of short exons. Saeys et al. suggested that exons shorter than 200 base pairs (bp) might be considered small. Recently, two independent studies, by Irimia et al. in Cell and by Li et al. in Genome Research, defined one class of short exons called microexons and uncovered the features regulating their inclusion. Irimia et al. reveal that the regulation of microexons (defined as exons with lengths of 3–15 bp) is highly dynamic during neuronal differentiation and that the inclusion of these microexons can modulate the function of interaction domains of proteins involved in neurogenesis. In the other study, Li et al. demonstrate that microexons (defined as exons with lengths of ≤51 bp) exhibit a high level of sequence conservation and may possess brain-specific functions. Thus, knowledge of short exons in genomes is very important for understanding the functioning of proteins and life processes. In another work, we briefly outlined the intrinsic advantages and limitations of the existing methods for predicting exons. In this paper, we focus on the development of a spectral analysis technique for finding exons in eukaryotic DNA sequences, as described below.
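As a rough illustration of what a spectral measure for exon finding can look like, the sketch below computes a sliding-window power at frequency 1/3 from binary indicator sequences of the four bases. It assumes the characteristic frequency is the period-3 component commonly used in signal-processing approaches to exon prediction, which this section does not state explicitly. The function name `period3_power` and the window length are hypothetical, and this is a baseline measure, not the predictor proposed in the paper.

```python
import numpy as np


def period3_power(seq, window=351):
    """Sliding-window spectral power at frequency 1/3 (illustrative baseline).

    For each base, a 0/1 indicator sequence is multiplied by a complex
    exponential of period 3 and summed over a window centred on each position.
    The squared magnitudes are then added over the four bases.
    """
    seq = seq.upper()
    n = len(seq)
    half = window // 2
    positions = np.arange(n)
    carrier = np.exp(-2j * np.pi * positions / 3)
    # Window bounds, truncated at the sequence ends.
    lo = np.clip(positions - half, 0, n)
    hi = np.clip(positions + half + 1, 0, n)
    power = np.zeros(n)
    for base in "ACGT":
        indicator = np.array([c == base for c in seq], dtype=float)
        # A cumulative sum gives every windowed sum in one pass.
        csum = np.concatenate(([0], np.cumsum(indicator * carrier)))
        power += np.abs(csum[hi] - csum[lo]) ** 2
    return power


# Hypothetical usage: a strictly 3-periodic toy sequence versus a homopolymer.
periodic = period3_power("ATG" * 50, window=31)
flat = period3_power("A" * 150, window=31)
```

A sequence with strong three-base periodicity yields a large value, whereas a sequence without it yields a value near zero. Curves of this kind are the sort of noisy signal that a filtering stage would then have to clean up.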
Exons encode the biochemical processes and information involved in the pathway from DNA to proteins. In genomic sequence analysis, exon prediction based on the annotated sequences in online databases is an important problem. For exon prediction, extracting the relevant features of short coding sequences is a major task because the subtle features of short exons are obscured by strong background noise. In practice, spectral analysis is an important tool for discovering interesting patterns and structures in exon data. In this paper, we present a new exon-finding spectral analysis method that overcomes some of the shortcomings of current prediction techniques. The MP-MSBF predictor takes advantage of nonlinear filtering and of the dependency information between scales, which makes it capable of predicting short exons. We also see possible applications of this predictor beyond exon finding. The correlation-based property and nonlinear nature of this technique allow a characteristic frequency to be selected from surrounding noise and thereby make it possible to offer good localization and preservation of sharp variations for locating hot spots in proteins and performing fault diagnosis of aero-engines.
Source: http://doi.org/10.1371/journal.pone.0205050