Date Published: October 12, 2011
Publisher: BioMed Central
Author(s): Markus E Nebel, Anika Scheid, Frank Weinberg.
Random biological sequences are a topic of great interest in genome analysis since, according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Accordingly, the generation of random sequences has been investigated for a long time. Similarly, random object of a more complicated structure like RNA molecules or proteins are of interest.
In this article, we present a new general framework for deriving algorithms for the non-uniform random generation of combinatorial objects according to the encoding and probability distribution implied by a stochastic context-free grammar. Briefly, the framework extends on the well-known recursive method for (uniform) random generation and uses the popular framework of admissible specifications of combinatorial classes, introducing weighted combinatorial classes to allow for the non-uniform generation by means of unranking. This framework is used to derive an algorithm for the generation of RNA secondary structures of a given fixed size. We address the random generation of these structures according to a realistic distribution obtained from real-life data by using a very detailed context-free grammar (that models the class of RNA secondary structures by distinguishing between all known motifs in RNA structure). Compared to well-known sampling approaches used in several structure prediction tools (such as SFold) ours has two major advantages: Firstly, after a preprocessing step in time O(n2) for the computation of all weighted class sizes needed, with our approach a set of m random secondary structures of a given structure size n can be computed in worst-case time complexity Om⋅n⋅ log(n) while other algorithms typically have a runtime in O(m⋅n2). Secondly, our approach works with integer arithmetic only which is faster and saves us from all the discomforting details of using floating point arithmetic with logarithmized probabilities.
A number of experimental results shows that our random generation method produces realistic output, at least with respect to the appearance of the different structural motifs. The algorithm is available as a webservice at http://wwwagak.cs.uni-kl.de/NonUniRandGen and can be used for generating random secondary structures of any specified RNA type. A link to download an implementation of our method (in Wolfram Mathematica) can be found there, too.
The topic of random generation algorithms (also called samplers) has been widely studied by computer scientists. As stated in , it has been examined under different perspectives, including combinatorics, algorithmics (design and/or engineering), as well as probability theory, where two of the main motivations for random sampling are the testing of combinatorial properties of structures (e.g. conjectured structural properties, quantitative aspects), as well as the testing of properties of the corresponding algorithms (with respect to correctness and/or efficiency).
We will now consider the previously discussed approach to construct a weighted unranking algorithm that generates random RNA secondary structures of a given size according to a realistic probability distribution. As for this paper, the corresponding probability distribution will be induced by a set of sample (SSU and LSU r)RNA secondary structures from the databases [45,46], which will be referred to as biological database in the sequel. However, the presented algorithm can easily be used for any other distribution, which can be defined by a database of known RNA structures of a particular RNA type; our webservice implementation accessible at http://wwwagak.cs.uni-kl.de/NonUniRandGen is actually able to sample random secondary structures of any specified RNA type. A link to download an implementation of our algorithm (in Wolfram Mathematica) can be found there, too.
According to the common definition of RNA secondary structure, we decided to consider the combinatorial class of all RNA secondary structures without pseudoknots that meet the stereochemical constraint of hairpin loops consisting of at least 3 unpaired nucleotides, formally:
First, we have to find a suitable SCFG that generates and models the distribution of the sample data as closely as possible. To reach this goal, it is important to appropriately specify the set of production rules in order to guarantee that all substructures that have to be distinguished are generated by different rules. This is due to the fact that by using only one production rule f to generate different substructures (e.g. any unpaired nucleotides independent of the type of loop they belong to), there is only one weight (the probability pf of this production f) with which any of these substructures is generated, whereas the use of different rules f1,…, fk to distinguish between these substructures implies that they may be generated with different probabilities pf1,…pfk, where pf1+…+pfk=pf. This way, we ensure that more common substructures are generated with higher probabilities than less common ones.
The elaborate SCFG Ĝsto is appropriate for being used as the basis for the desired weighed unranking method: after having determined the RNF of this SCFG and the corresponding weighted combinatorial classes, we easily find a recursion for the size function (in the same ways as discussed in Example
App-.4). Then, we can use the resulting weighted class sizes for the straightforward construction of the desired unranking algorithm.
The purpose of this section is to analyze the quality of randomly generated structures by considering some experimental results.
Altogether, we can finally conclude that the non-uniform random generation method proposed in this article produces appropriate output and may thus be used (for research issues as well as for practical applications) to generate random RNA secondary structures. In fact, for any arbitrary type of (pseudoknot-free) RNA, a corresponding random sampler can be derived in the presented way. Actually, our webservice can be used for generating random secondary structures of any specified type of RNA. It just requires a database of known structures for the respective RNA type as input.
The authors declare that they have no competing interests.
MEN and FW have invented the general framework for the non-uniform random generation. AS and MEN designed the SCFG for RNA secondary structures; MEN proved its unambiguity. AS developed and implemented the algorithms for generating random RNA secondary structures. AS performed all experiments and evaluated the quality of our algorithms. MEN supervised the work and development of ideas. AS drafted the manuscript; a revision and its final version have been prepared by MEN. All authors have read and approved the final manuscript.